Hi,
We have a three factor RNA-Seq experiment result from 167 samples (6 replications per sample).
Factor 1 = salinity, level = 2 (fresh water, salt water)
SALINITY <- c(rep(rep(c("FW","SW"), each=6), 7), rep(c("FW","SW"), c(5,6)), rep(rep(c("FW","SW"), each=6), time=6))
Factor 2 = photoperiod, level = 3 (Long day (LD), short day (SP), and spring (SPLD))
DAYLENGTH <- c(rep(c("LD"),12), rep(rep(c("LD","SP"), each=12),2), rep(rep(c("LD","SP"), each=12),1), rep(c("SPLD"),5),rep(c("SPLD"),6), rep(c("LD","SP","SPLD"),each=12,2))
Factor 3 = Time course, level = 6 (T[1-6])
TIME <- c(rep(c("T1"), each=12), rep(c("T2","T3"), each=24), rep(c("T4"), each=24), rep(c("T4"),5), rep(c("T4"),6), rep(c("T5","T6"),each=36))
The group looks like
GROUP <- factor(paste(TIME, DAYLENGTH, SALINITY, sep="_"))
Design model
DESIGN <- model.matrix(~0 + GROUP) colnames(DESIGN) <- gsub('GROUP', '', colnames(DESIGN))
We tried to do initial filtering based on CPM values as follows according to the edgeR Vignette.
keep<-rowSums(cpm(y)>1) >=5
y<-y[keep, ,keep.lib.sizes=FALSE]
Our intention was to filter CPM values per sample (6 replications per sample) but we discovered that this doesn't work and we are not sure that we are doing the right thing. What is the best way of filtering based on our design?
Thanks for the help.
Teshome
What library sizes do you have? For example, what is the output to
?
> summary(RG.1$samples$lib.size)
Min. 1st Qu. Median Mean 3rd Qu. Max.
11330000 21890000 23870000 23900000 26330000 37230000
The code itself looks like it should work, so when you say "this doesn't work", what do you mean, exactly? Are you getting an error somewhere, or?
Also, if you have 6 replicates per group your keep rowSums threshold should technically be >= 6, but this probably won't change much in the downstream processing anyway.
When i say "this doesn't work", the rowSums >=5 is filtering across all columns (167 samples) rather than per sample for the 6 replications. I double checked the dataset after the filter and i confirm the results. We can also extend our discussion using the following reproducible example:
This example returns all TURE because rowSums checks across columns. Regarding the choice of rowSums threshold, we choose >=5 because we have one sample with 5 replications.
Everything is working as it should and, further, your filtering strategy should be blind to the groups each sample belongs to.
You can either increase your filtering threshold (ie.
keep <- rowSums(cpm(y) > 2) >=5
) if you want to remove more (any) genes, or you can switch to filtering based onaveLogCPM
, as outlined by Aaron in the post below (use increasing threshold as you like)A: ABOUT CPM FILTERING IN EDGER