Question

Filtering step in Differential analysis with RSEM values

Biologist ▴ 120

Dear Aaron,

In this post (C: Possible ways of performing differential gene expression and analysis of RNA-Seq), Gordon gave code for differential analysis with RSEM values. In it he used a filtering step that keeps genes with about 10 counts or more in at least 14 samples, which means there are 14 samples in the smallest "experimental group".

In my case:

table(targets$Sample.type)

MB1 MB2
286 80

So, on what sample number should I filter now?

Do I need to filter like this?

keep <- rowSums(y > log2(11)) >= 80

rsem differential gene expression edger • 1.8k views

updated 7.7 years ago by Gordon Smyth • written 7.7 years ago by Biologist

score 3 · Answer 1 · 2017-08-16

The reasons why we don't give prescriptive rules on how to filter are:

1. A range of sensible filtering cutoffs will give good results. You don't need to worry about the exact threshold you use, as long as it's in the sensible range.
2. Good filtering depends on the nature of your data and what biological questions you're trying to answer.

In your case, you need to decide how many samples a gene would need to be expressed in before it became biologically interesting.

Suppose a gene was expressed in 79 of the MB2 samples but none of the MB1. Would you want to call that gene as DE? Probably yes.

Suppose a gene was expressed in 60 of the MB2 samples but none of the MB1. Would you want to call that gene as DE? Again, probably yes.

Suppose a gene was expressed in 20 of the MB2 samples but none of the MB1. Would you want to call that gene as DE? Probably not. It's only expressed in a minority of samples for either group.
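To see where your own genes fall relative to these scenarios, you could tabulate, for each gene, how many samples in each group express it. The following is only a sketch on simulated data: `y` is assumed to be a genes-by-samples matrix of log2(count + 1) values (which is what the `log2(11)` threshold in the question implies), and `group`, `expressed` and `n_expressed` are illustrative names, not objects from the original code.

```r
# Simulated stand-in data (assumption); replace with your own matrix and targets.
set.seed(1)
group <- factor(rep(c("MB1", "MB2"), times = c(286, 80)))
# Gene-specific means are recycled down the rows of the matrix
y <- matrix(log2(rnbinom(200 * 366, mu = 2^runif(200, 0, 6), size = 1) + 1),
            nrow = 200, dimnames = list(paste0("gene", 1:200), NULL))

# TRUE where the underlying count exceeds 10
expressed <- y > log2(11)

# rowsum() adds up rows by group, so transpose to put samples in rows,
# then transpose back to get a genes x groups table
n_expressed <- t(rowsum(t(expressed) + 0, group))
head(n_expressed)
```

Each row of `n_expressed` gives the number of MB1 and MB2 samples in which that gene passes the expression threshold.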

You need to decide the minimum number of samples that a gene would have to be expressed in for it to be biologically interesting to you. That shouldn't be higher than 80, but it might be as low as 50 or 60. You decide. This question gets back to why you're doing the DE analysis in the first place. If you want a suggestion, I'd probably go with around 60.
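As a rough sketch of that suggestion, the filter from the question can be rewritten with the sample count pulled out as a parameter. The data here are simulated, `y` is again assumed to hold log2(count + 1) values, and `min_samples`, `y_filtered` and `n_kept` are hypothetical names used only for illustration.

```r
# Simulated stand-in for a genes x samples matrix of log2(count + 1) values
set.seed(1)
y <- matrix(log2(rnbinom(200 * 366, mu = 2^runif(200, 0, 6), size = 1) + 1),
            nrow = 200)

# Minimum number of samples in which a gene must exceed about 10 counts
min_samples <- 60
keep <- rowSums(y > log2(11)) >= min_samples
y_filtered <- y[keep, , drop = FALSE]

# Number of genes retained under a few candidate cutoffs
n_kept <- sapply(c(50, 60, 80), function(n) sum(rowSums(y > log2(11)) >= n))
n_kept
```

Comparing `n_kept` across cutoffs on your own data is one way to check how sensitive the filter is to the exact threshold within the sensible range.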