I have an RNA-seq dataset and I am using DESeq2 to find differentially expressed genes between two groups. I also want to remove genes with low counts by using a baseMean threshold. I used pre-filtering to remove any genes that have zero or only one count across the samples, but I also want to remove those that have low counts compared to the rest of the genes. Is there a commonly used threshold for the baseMean, or a way to work out what this threshold should be?
I would not use the baseMean for any filtering as it is (at least to me) hard to deconvolute. You do not know why the baseMean is low: either there is no difference between groups and the gene is just lowly expressed (and/or short), or it is moderately expressed in one group but off in the other. The baseMean could be the same in these two scenarios. If you filter, I would do it on the counts. So you could say that all or a fraction of the samples of at least one group must have 10 or more counts. That ensures you remove genes that have low counts or zeros across all groups, but keep genes where the low counts are confined to one group. The latter would be a good DE candidate, so it should not be removed. Or you can automate this, e.g. using the edgeR function filterByExpr.
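A minimal sketch of such a group-aware count filter, in base R on a made-up toy matrix. The data, the names `counts`, `group`, `min_count` and `min_frac`, and the choice of "all samples of a group" are assumptions for illustration, not part of the answer above:

```r
# Hypothetical toy data: 3 genes x 6 samples, two groups of 3
counts <- matrix(
  c( 3,  2,  4,  3,  2,  4,   # geneA: low in every sample
     0,  0,  0, 40, 35, 45,   # geneB: off in ctrl, expressed in trt
    50, 60, 55, 52, 58, 61),  # geneC: expressed in every sample
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), paste0("s", 1:6))
)
group <- factor(c("ctrl", "ctrl", "ctrl", "trt", "trt", "trt"))

min_count <- 10  # assumed count threshold
min_frac  <- 1   # assumed fraction of a group's samples that must pass (1 = all)

# Per gene and per group: fraction of samples with at least min_count reads
frac_by_group <- sapply(levels(group), function(g) {
  rowMeans(counts[, group == g, drop = FALSE] >= min_count)
})

# Keep a gene if at least one group reaches the required fraction
keep <- apply(frac_by_group >= min_frac, 1, any)
filtered <- counts[keep, , drop = FALSE]
rownames(filtered)
```

Here geneB survives even though half of its samples are zero, which is the point of filtering per group. With a DESeqDataSet the matrix would come from `counts(dds)`; with edgeR installed, `filterByExpr(counts, group = group)` returns a comparable logical vector using its own defaults.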
Yes, agreed, you can use filterByExpr, or I commonly just use something like:
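A minimal sketch of a filter of this kind, in base R on a made-up toy matrix (the data and the names `counts` and `counts_filtered` are illustrative assumptions, not the poster's own code):

```r
# Hypothetical toy data: 3 genes x 6 samples, two groups of 3
counts <- matrix(
  c( 3,  2,  4,  3,  2,  4,   # geneA: low in every sample
     0,  0,  0, 40, 35, 45,   # geneB: expressed in one group only
    50, 60, 55, 52, 58, 61),  # geneC: expressed in every sample
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), paste0("s", 1:6))
)

x <- 3  # here: the smallest group size

# Keep genes with a count of 10 or more in at least x samples
keep <- rowSums(counts >= 10) >= x
counts_filtered <- counts[keep, , drop = FALSE]

# With a DESeqDataSet `dds` the same idea would read (assumes DESeq2 is loaded):
# keep <- rowSums(counts(dds) >= 10) >= x
# dds  <- dds[keep, ]
```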
where x is the minimal number of samples that should have a count of 10 or more. E.g. you can use the smallest group sample size.
Would you argue that this could/should also be applied to a DEXSeq dataset, where you want to identify differentially expressed exons?
Yes, it is even noted in the vignette that prefiltering might make sense.
Apart from speed, will it make any difference if I apply this filtering after selecting differential exons based on e.g. padj < 0.01 and a log2FC >= 2 or <= -2?
Filtering is not done for speed reasons; it is done both to increase the precision of the model parameter estimation and to reduce the multiple testing burden. It only makes sense to me if done before running the statistical testing.
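To illustrate the multiple testing part, here is a small sketch with made-up p-values (all numbers and variable names are hypothetical). It adjusts the same three "interesting" p-values once together with 97 uninformative low-count features and once together with only 7 of them, as if the rest had been removed before testing:

```r
p_signal     <- c(0.001, 0.004, 0.03)  # hypothetical features of interest
p_noise_all  <- rep(0.9, 97)           # uninformative features, nothing filtered
p_noise_kept <- rep(0.9, 7)            # what is left after pre-filtering

# Benjamini-Hochberg adjustment over 100 tests vs. over 10 tests
padj_unfiltered  <- p.adjust(c(p_signal, p_noise_all),  method = "BH")[1:3]
padj_prefiltered <- p.adjust(c(p_signal, p_noise_kept), method = "BH")[1:3]

padj_unfiltered   # 0.10 0.20 0.90
padj_prefiltered  # 0.01 0.02 0.10
```

Dropping rows from the results table after testing leaves the padj column as computed over all 100 tests, so it does not give this benefit. The sketch also assumes the filter does not look at the group labels or the test outcome; a filter chosen after seeing the results would not be valid.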