Hi,
Like many others, I have struggled to choose appropriate filtering parameters for lowly expressed genes in a DE analysis. I would like to keep only genes that meet a minimum expression threshold in every sample of at least one condition. This idea was raised in the comments of this post: "edgeR cpm filter with >1 factor", but some users also mentioned that this is impossible without demanding expression in every sample, because filtering must be done unsupervised (i.e. without knowledge of which condition each sample belongs to) before feeding the data to edgeR.
I don't understand why this is the case. Codewise, I think it's relatively easy to filter lowly expressed genes that don't meet a fixed threshold within each group, and then keep all genes that passed the filter in at least one condition. I am thinking of something like:
    library(edgeR)  # provides cpm()

    # rnacounts: matrix of read counts for all samples, with the n samples from
    #   condition 1 in the first n columns and the m samples from condition 2
    #   in the last m columns
    # sample.sizes[1]: number of samples in condition 1 (n)
    # sample.sizes[2]: number of samples in condition 2 (m)
    cpm.rnacounts <- cpm(rnacounts)
    cpm.condition1 <- cpm.rnacounts[, 1:sample.sizes[1], drop = FALSE]
    cpm.condition2 <- cpm.rnacounts[, (sample.sizes[1] + 1):(sample.sizes[1] + sample.sizes[2]), drop = FALSE]

    # This filter demands cpm > 1 in all samples of a condition
    isexpr.1 <- rowSums(cpm.condition1 > 1) == sample.sizes[1]
    isexpr.2 <- rowSums(cpm.condition2 > 1) == sample.sizes[2]

    # Now keep genes that meet the filter in at least one of the two conditions
    isexpr_either_condition <- isexpr.1 | isexpr.2

    # Finally, go back to the original count matrix (before the cpm
    # transformation) and select only the genes flagged in isexpr_either_condition
    rnacounts_myfilter <- rnacounts[isexpr_either_condition, ]
Intuitively, I don't see any biases being introduced by this procedure, since all samples are subjected to the same filter, and genes are kept for all samples even if they meet the expression criteria in only one condition.
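For what it's worth, the same group-wise filter can be written against a group factor instead of column positions, so it no longer depends on how the columns are ordered. This is only a minimal sketch in base R: the function name `filter_by_group`, the toy counts and the `group` factor are made up for illustration, and CPM is computed by hand as counts divided by the column totals (which I believe is what `cpm()` does by default on a plain matrix, but treat that as an assumption).

```r
# Hypothetical helper: keep genes with CPM > min_cpm in every sample of at
# least one group. Assumes every sample has a non-zero library size.
filter_by_group <- function(counts, group, min_cpm = 1) {
  stopifnot(ncol(counts) == length(group))
  group <- droplevels(as.factor(group))

  # Counts per million, using the column totals as library sizes
  cpm_mat <- t(t(counts) / colSums(counts)) * 1e6
  above <- cpm_mat > min_cpm

  # One logical vector per group: TRUE where all samples of that group pass
  pass_per_group <- lapply(levels(group), function(g) {
    rowSums(above[, group == g, drop = FALSE]) == sum(group == g)
  })

  # A gene is kept if it passes in any group
  Reduce(`|`, pass_per_group)
}

# Toy data (made-up numbers; library sizes are unrealistically small)
counts <- matrix(
  c(500, 600, 550,   0,   0,   # geneA: expressed only in condition 1
      0,   0,   0, 400, 450,   # geneB: expressed only in condition 2
      0,   1,   0,   0,   1,   # geneC: sporadic single reads
    300, 350, 320, 310, 330),  # geneD: expressed in all samples
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", LETTERS[1:4]), paste0("s", 1:5))
)
group <- factor(c("cond1", "cond1", "cond1", "cond2", "cond2"))

keep <- filter_by_group(counts, group)
counts_filtered <- counts[keep, , drop = FALSE]
```

Here geneC is dropped because it has zero counts in at least one sample of each condition, while geneA and geneB are kept even though each is expressed in only one condition. Note that this sketch deliberately uses the group labels, so it is exactly the "supervised" kind of filter the question is about, not the unsupervised one.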
Thank you!
Thank you both for your replies.
I read the paper and I was able to grasp the basics. I wonder if it is possible to apply a desired filter and then somehow test if "the conditional and unconditional null distributions of Ui-II are the same." I was thinking that this could serve as a way to argue for or against a particular filter for a given data set.
Thanks!
I've moved your comment here. If you have follow-up questions, please post them using "ADD COMMENT" rather than as an "Answer"; otherwise it looks like you're answering your own question.
Anyway, the answer to your follow-up question is "no". Any analysis of that sort is inherently impossible because it would require you to know the true DE status of the genes.