Question

Why must the filtering of lowly expressed genes for analysis with edgeR be done unsupervised?

Mau ▴ 50

@mau-11194

Last seen 6.0 years ago

Hi,

Like many others, I have struggled to choose the most appropriate filtering parameters for lowly expressed genes in a DE analysis. I would like to filter out genes that don't meet a minimum expression threshold within all samples from each condition. This idea was expressed in the comments of this post: edgeR cpm filter with >1 factor. However, some users there also mentioned that this is impossible without demanding expression in every sample, because filtering must be done unsupervised (= without knowledge of which condition is applied to each sample) before feeding the data to edgeR.

I don't understand why this is the case. Codewise, I think it's relatively easy to flag the genes that meet a fixed threshold in every sample of a group, and then keep all genes that passed the filter in at least one condition. I am thinking of something like:

library(edgeR)  # provides cpm()

# rnacounts: matrix of read counts for all samples; the n samples from condition 1 occupy the first n columns and the m samples from condition 2 occupy the last m columns
# sample.sizes[1]: number of samples in condition 1 (n)
# sample.sizes[2]: number of samples in condition 2 (m)

# drop = FALSE keeps a matrix even if a condition has a single sample, so rowSums() still works
cpm.rnacounts <- cpm(rnacounts)
cpm.condition1 <- cpm.rnacounts[, 1:sample.sizes[1], drop = FALSE]
cpm.condition2 <- cpm.rnacounts[, (sample.sizes[1] + 1):(sample.sizes[1] + sample.sizes[2]), drop = FALSE]

# this filter demands cpm > 1 in all samples of a condition
isexpr.1 <- rowSums(cpm.condition1 > 1) == sample.sizes[1]
isexpr.2 <- rowSums(cpm.condition2 > 1) == sample.sizes[2]

# now keep genes that meet the filter in at least one of the two conditions
isexpr_either_condition <- isexpr.1 | isexpr.2

# finally, go back to the original counts matrix (before the cpm transformation) and keep only the rows flagged in isexpr_either_condition
rnacounts_myfilter <- rnacounts[isexpr_either_condition, , drop = FALSE]

Intuitively, I don't see any biases being introduced by this procedure, since all samples are subjected to the same filter, and genes are kept for all samples even if they meet the expression criteria in only one condition.

Thank you!

edgeR RNAseq DE analysis gene filtering • 2.5k views

ADD COMMENT • link 8.7 years ago Mau ▴ 50

Thank you both for your replies.

I read the paper and I was able to grasp the basics. I wonder if it is possible to apply a desired filter and then somehow test if "the conditional and unconditional null distributions of Ui-II are the same." I was thinking that this could serve as a way to argue for or against a particular filter for a given data set.

Thanks!

ADD REPLY • link updated 8.7 years ago by Gordon Smyth 52k • written 8.7 years ago by Mau ▴ 50

I've moved your comment here. If you have follow-up questions, please post them using "ADD COMMENT" rather than as an "Answer"; otherwise it looks as if you're answering your own question.

Anyway, the answer to your follow-up question is "no". Any analysis of that sort is inherently impossible because it would require you to know the true DE status of the genes.

ADD REPLY • link 8.7 years ago Gordon Smyth 52k

score 4 · Answer 1 · 2016-07-29

Gordon Smyth 52k

@gordon-smyth

Last seen 4 hours ago

WEHI, Melbourne, Australia

Your intuition is letting you down.

It is incorrect to select genes that are expressed in at least one experimental condition, because doing so preferentially selects genes that appear to be DE even when no DE actually exists. That in turn tends to inflate the FDR in downstream analyses.
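The effect can be seen in a small simulation where the truth is known. What follows is only a minimal sketch under stated assumptions, not an edgeR analysis: it uses base R only, every gene is simulated with no true difference between conditions, all libraries are assumed to contain exactly 10 million reads (so CPM = count / 10), the negative binomial dispersion of 0.1 is arbitrary, and a plain t-test on log2 CPM stands in for the edgeR test. The names `keep_aware`, `keep_blind` and `fp_rate` are made up for this illustration.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

n_genes   <- 20000
n_per_grp <- 3
group     <- rep(c(1, 2), each = n_per_grp)

# Null data: each gene has the SAME true mean in both conditions, so no gene is DE.
# Means are deliberately placed near the cutoff, where the choice of filter matters.
mu     <- exp(runif(n_genes, log(5), log(20)))
counts <- matrix(rnbinom(n_genes * 2 * n_per_grp, mu = mu, size = 10),
                 nrow = n_genes)

# Assumption: every library has exactly 10 million reads, so CPM = count / 10.
cpm   <- counts / 10
above <- cpm > 1

# Filter from the question: CPM > 1 in all samples of at least one condition.
keep_aware <- rowSums(above[, group == 1]) == n_per_grp |
              rowSums(above[, group == 2]) == n_per_grp

# Label-free filter: CPM > 1 in at least n_per_grp samples, whichever condition they belong to.
keep_blind <- rowSums(above) >= n_per_grp

# Plain two-sample t-test on log2 CPM (a stand-in for the edgeR test).
y  <- log2((counts + 0.5) / 10)
y1 <- y[, group == 1]
y2 <- y[, group == 2]
pooled_var <- (apply(y1, 1, var) + apply(y2, 1, var)) / 2
t_stat <- (rowMeans(y1) - rowMeans(y2)) / sqrt(pooled_var * 2 / n_per_grp)
p <- 2 * pt(-abs(t_stat), df = 2 * n_per_grp - 2)

# Fraction of retained null genes with p < 0.05 (NaN from constant rows is skipped).
fp_rate  <- function(keep) mean(p[keep] < 0.05, na.rm = TRUE)
fp_aware <- fp_rate(keep_aware)
fp_blind <- fp_rate(keep_blind)
print(c(group_aware = fp_aware, group_blind = fp_blind))
```

Since nothing is truly DE here, every gene with p < 0.05 is a false positive. The group-aware filter should show the higher rate, because a borderline gene survives it mainly when one condition happens to sit entirely above the cutoff, which is exactly the pattern the test then reads as DE. The label-free filter gives the same answer however the condition labels are shuffled, so it cannot favour genes that look different between conditions.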

ADD COMMENT • link 8.7 years ago Gordon Smyth 52k

score 1 · Answer 2 · 2016-07-29

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 10 days ago

United States

For a more formal treatment of your intuition, read this paper:

Independent filtering increases detection power for high-throughput experiments (Bourgon, Gentleman and Huber, PNAS, 2010)

Specifically, make sure you read the part that says "the authors discuss a filter which requires the fraction of present calls to exceed a threshold in at least one condition."

ADD COMMENT • link 8.7 years ago Steve Lianoglou ★ 13k