James Perkins
Hi all,
I understand it is normal to filter out lowly expressed genes before
performing differential expression analysis on RNA-seq data (e.g.,
edgeR, DESeq).
However, with methods such as edgeR, I find a number of genes where a
single outlying count is clearly causing the gene to be deemed
significantly DE (though the dispersion value is quite high), for
example:
       control1  control2  control3  case1  case2  case3
geneA         0         1         3      1      2     30
Note that case3 is not an outlier sample: MDS plots show it to be
similar to the other case samples, and the phenotype of the samples is
as we would expect. I would say the outlier is this gene's count in
case3 rather than the sample as a whole, if that makes sense.
Would it be fair to filter such examples out? I am thinking of a
filtering rule such that:
for each gene, if it has a number of counts below X for at least one
case sample AND at least one control sample, discard it.
This way I don't get rid of genes where the expression is high in case
and very low (or unexpressed) in control and vice versa.
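
To make the rule concrete, here is a minimal sketch in plain Python (for
illustration only, not edgeR code). The function name keep_gene, the
threshold X = 5, and the second gene (geneB) are assumptions made up for
this example; only the geneA counts come from the table above.

```python
def keep_gene(control_counts, case_counts, x):
    """Apply the proposed rule to one gene.

    The gene is discarded (False) when at least one control sample AND
    at least one case sample have a count below x; otherwise it is kept.
    """
    low_in_control = any(c < x for c in control_counts)
    low_in_case = any(c < x for c in case_counts)
    return not (low_in_control and low_in_case)


X = 5  # illustrative threshold; the rule leaves X unspecified

counts = {
    # gene: (control counts, case counts)
    "geneA": ([0, 1, 3], [1, 2, 30]),    # the example from the post
    "geneB": ([0, 1, 0], [40, 35, 50]),  # hypothetical: off in control, on in case
}

kept = [g for g, (ctrl, case) in counts.items() if keep_gene(ctrl, case, X)]
print(kept)  # ['geneB']
```

With this threshold geneA is discarded, because it has low counts in both
groups, while a gene that is low in only one group is kept. Note that the
function needs to know which samples are controls and which are cases,
which is exactly the use of class labels I am worried about.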
However, I understand that this means I will be using the class labels
for my filtering step, which I believe might lead to problems at the
multiple testing correction stage.
Thanks in advance for any help/ideas on this issue.
Jim
--
James Perkins
Institute of Structural and Molecular Biology
Division of Biosciences
University College London
Gower Street
London, WC1E 6BT
UK
email: j.perkins at ucl.ac.uk