Question

edgeR cpm filter

0

Entering edit mode

teshome.dagnem • 0

@teshomedagnem-14199

Last seen 7.5 years ago

Hi,

We have a three factor RNA-Seq experiment result from 167 samples (6 replications per sample).

Factor 1 = salinity, level = 2 (fresh water, salt water)

SALINITY <- c(rep(rep(c("FW","SW"), each=6), 7), rep(c("FW","SW"), c(5,6)), rep(rep(c("FW","SW"), each=6), time=6))

Factor 2 = photoperiod, level = 3 (Long day (LD), short day (SP), and spring (SPLD))

DAYLENGTH <- c(rep(c("LD"),12), rep(rep(c("LD","SP"), each=12),2), rep(rep(c("LD","SP"), each=12),1), rep(c("SPLD"),5),rep(c("SPLD"),6), rep(c("LD","SP","SPLD"),each=12,2))

Factor 3 = Time course, level = 6 (T[1-6])

TIME <- c(rep(c("T1"), each=12), rep(c("T2","T3"), each=24), rep(c("T4"), each=24), rep(c("T4"),5), rep(c("T4"),6), rep(c("T5","T6"),each=36))

The group looks like

GROUP <- factor(paste(TIME, DAYLENGTH, SALINITY, sep="_"))

Design model

DESIGN <- model.matrix(~0 + GROUP)
colnames(DESIGN) <- gsub('GROUP', '', colnames(DESIGN))

We tried to do initial filtering based on CPM values as follows according to the edgeR Vignette.

keep<-rowSums(cpm(y)>1) >=5

y<-y[keep, ,keep.lib.sizes=FALSE]

Our intention was to filter CPM values per sample (6 replications per sample) but we discovered that this doesn't work and we are not sure that we are doing the right thing. What is the best way of filtering based on our design?

Thanks for the help.

Teshome

edger independent filtering multiple factor design cpm • 5.3k views

ADD COMMENT • link updated 7.5 years ago by Gordon Smyth 52k • written 7.5 years ago by teshome.dagnem • 0

0

Entering edit mode

What library sizes do you have? For example, what is the output to

summary(y$samples$lib.size)

?

ADD REPLY • link 7.5 years ago Gordon Smyth 52k

0

Entering edit mode

> summary(RG.1$samples$lib.size)
Min. 1st Qu. Median Mean 3rd Qu. Max.
11330000 21890000 23870000 23900000 26330000 37230000

ADD REPLY • link 7.5 years ago teshome.dagnem • 0

0

Entering edit mode

The code itself looks like it should work, so when you say "this doesn't work", what do you mean, exactly? Are you getting an error somewhere, or?

Also, if you have 6 replicates per group your keep rowSums threshold should technically be >= 6, but this probably won't change much in the downstream processing anyway.

ADD REPLY • link 7.5 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

When i say "this doesn't work", the rowSums >=5 is filtering across all columns (167 samples) rather than per sample for the 6 replications. I double checked the dataset after the filter and i confirm the results. We can also extend our discussion using the following reproducible example:

data = data.frame(v1 = c(1,4,3,3,3,3,1), v2 = c(0,56,0,88,4,4,0), v3 =c(1,6,2,9,0,6,0), x1=c(0,6,1,9,7,7,1), x2 =c(1,4,7,0,6,5,1), x3 = c(0,0,4,6,6,7,1))
data
rowSums(data) >=3

This example returns all TURE because rowSums checks across columns. Regarding the choice of rowSums threshold, we choose >=5 because we have one sample with 5 replications.

ADD REPLY • link 7.5 years ago teshome.dagnem • 0

0

Entering edit mode

Everything is working as it should and, further, your filtering strategy should be blind to the groups each sample belongs to.

You can either increase your filtering threshold (ie. keep <- rowSums(cpm(y) > 2) >=5) if you want to remove more (any) genes, or you can switch to filtering based on aveLogCPM, as outlined by Aaron in the post below (use increasing threshold as you like)

A: ABOUT CPM FILTERING IN EDGER

ADD REPLY • link 7.5 years ago Steve Lianoglou ★ 13k

score 4 · Accepted Answer · 2017-10-18

4

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 12 hours ago

WEHI, Melbourne, Australia

The filtering you are doing is correct, although I think you are being too conservative. Give that most of your library sizes are 24 million or more, I would suggest using a lower cpm cutoff such as

keep <- rowSums(cpm(y) > 0.4) >= 5

With your library sizes, a cutoff of 0.4 for the CPM will still require about 0.4 * 24 = 10 reads per sample.

Note that it is extremely important that you apply this cutoff to all the libraries at once, as recommended in the edgeR User's Guide, rather than applying a cutoff separately to each group of libraries. The filtering has to be independent of the treatment groups. By that I mean, the filtering rule must not use any information about which sample belongs to which group. Otherwise you will not be able to control the error rates correctly.

As you probably already know, the >= 5 part of the filtering rule is chosen because 5 is the smallest group size in your experiment. With the filter defined like this, any gene that is expressed in all the samples of any group in your experiment will pass the filter.

ADD COMMENT • link 7.5 years ago Gordon Smyth 52k

0

Entering edit mode

Do you think >=5 is a reasonable threshold to all 167 libraries?

ADD REPLY • link 7.5 years ago teshome.dagnem • 0

0

Entering edit mode

Yes. Why would it not be?

ADD REPLY • link 7.5 years ago Gordon Smyth 52k

0

Entering edit mode

Now, we agree that the filter shouldn't be per sample. Then, is it an arbitrary choice? If yes, how do we decide the optimal threshold for the rowSum?

ADD REPLY • link 7.5 years ago teshome.dagnem • 0

0

Entering edit mode

No, of course it's not arbitrary. I'm a bit puzzled by your question. I thought you had followed the reasoning of the edgeR User's Guide, so you must already know why >= 5 is right for your design. If you didn't follow the reasoning of the guide, how did you come up with >=5 yourself?

ADD REPLY • link 7.5 years ago Gordon Smyth 52k

0

Entering edit mode

The user manual is clear and tried to follow the reasoning, for example section 4.2.4. I was a bit confused with section 2.6 and that is why i say arbitrary. "As a rule of thumb, genes are dropped if they can’t possibly be expressed in all the samples for any of the conditions" and it provides how to filter by group. " A requirement for expression in two or more libraries is used as the minimum number of samples in each group is two. This ensures that a gene will be retained if it is only expressed in both samples in group 2." Thanks for the support.

ADD REPLY • link 7.5 years ago teshome.dagnem • 0