RNA-seq, low count filtering and multiple testing

0

Entering edit mode

James Perkins ▴ 120

@james-perkins-4948

Last seen 10.6 years ago

Hi all, I understand it is normal to filter out lowly expressed genes before performing differential expression analysis on RNA-seq data (e.g., edgeR, DESeq). However I notice with such methods as edgeR, I find a number of genes where there is clearly one outlier that is causing the gene to be deemed significantly DE (thought the dispersion value is quite high): for example control1 control2 control3 case1 case2 case3 geneA 0 1 3 1 2 30 Note that case3 is not an outlier sample, MDS plots show it to be like the other case samples, and the phenotype of the samples is as we would expect. I would say this gene is an outlier rather than the sample being an outlier, if that makes sense. Would it be fair to filter such examples out? I am thinking of a filtering rule such that: for each gene, if it has a number of counts below X for at least one case sample AND at least one control sample, discard it. This way I don't get rid of genes where the expression is high in case and very low (or unexpressed) in control and vice versa. However, I understand that this means I will be using the class labels for my filtering step, which I believe might lead to problems at the multiple testing correction stage. Thanks in advance for any help/ideas on this issue. Jim -- James Perkins Institute of Structural and Molecular Biology Division of Biosciences University College London Gower Street London, WC1E 6BT UK email: j.perkins at ucl.ac.uk

edgeR edgeR • 3.6k views

ADD COMMENT • link updated 12.4 years ago by Gordon Smyth 52k • written 12.4 years ago by James Perkins ▴ 120

0

Entering edit mode

Ryan C. Thompson ★ 7.9k

@ryan-c-thompson-5618

Last seen 6 months ago

Icahn School of Medicine at Mount Sinai…

You are correct that using the class labels in filtering is "cheating" and could undermine the assumptions used in multiple testing correction. A common method is to filter rows where fewer than N samples meet a given threefold count, where N is the size of the smallest group (in your example N = 3). On Dec 10, 2012 11:53 PM, "James Perkins" <j.perkins@ucl.ac.uk> wrote: > Hi all, > > I understand it is normal to filter out lowly expressed genes before > performing differential expression analysis on RNA-seq data (e.g., > edgeR, DESeq). > > However I notice with such methods as edgeR, I find a number of genes > where there is clearly one outlier that is causing the gene to be > deemed significantly DE (thought the dispersion value is quite high): > > for example > > control1 control2 control3 case1 case2 case3 > geneA 0 1 3 1 2 30 > > Note that case3 is not an outlier sample, MDS plots show it to be like > the other case samples, and the phenotype of the samples is as we > would expect. I would say this gene is an outlier rather than the > sample being an outlier, if that makes sense. > > Would it be fair to filter such examples out? I am thinking of a > filtering rule such that: > > for each gene, if it has a number of counts below X for at least one > case sample AND at least one control sample, discard it. > > This way I don't get rid of genes where the expression is high in case > and very low (or unexpressed) in control and vice versa. > > However, I understand that this means I will be using the class labels > for my filtering step, which I believe might lead to problems at the > multiple testing correction stage. > > Thanks in advance for any help/ideas on this issue. > > Jim > > -- > James Perkins > Institute of Structural and Molecular Biology > Division of Biosciences > University College London > Gower Street > London, WC1E 6BT > UK > > email: j.perkins@ucl.ac.uk > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD COMMENT • link 12.4 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

Hi Ryan, Michael, Gordon Thanks Ryan, your less directed filter of lowly expressed genes makes sense to me, I guess there could be some genes taht get through such as for example: control1 control2 control3 case1 case2 case3 geneB 20 10 0 1 0 30 Which really we would probably want to remove, but these should (hopefully) be few and far between Thanks also Michael and Gordon, I like the sound of both of your suggestions, esp. the gof() function, I will investigate. Thanks, Jim On 11 December 2012 11:21, Ryan Thompson <rct@thompsonclan.org> wrote: > You are correct that using the class labels in filtering is "cheating" and > could undermine the assumptions used in multiple testing correction. A > common method is to filter rows where fewer than N samples meet a given > threefold count, where N is the size of the smallest group (in your example > N = 3). > On Dec 10, 2012 11:53 PM, "James Perkins" <j.perkins@ucl.ac.uk> wrote: > > > Hi all, > > > > I understand it is normal to filter out lowly expressed genes before > > performing differential expression analysis on RNA-seq data (e.g., > > edgeR, DESeq). > > > > However I notice with such methods as edgeR, I find a number of genes > > where there is clearly one outlier that is causing the gene to be > > deemed significantly DE (thought the dispersion value is quite high): > > > > for example > > > > control1 control2 control3 case1 case2 case3 > > geneA 0 1 3 1 2 30 > > > > Note that case3 is not an outlier sample, MDS plots show it to be like > > the other case samples, and the phenotype of the samples is as we > > would expect. I would say this gene is an outlier rather than the > > sample being an outlier, if that makes sense. > > > > Would it be fair to filter such examples out? I am thinking of a > > filtering rule such that: > > > > for each gene, if it has a number of counts below X for at least one > > case sample AND at least one control sample, discard it. > > > > This way I don't get rid of genes where the expression is high in case > > and very low (or unexpressed) in control and vice versa. > > > > However, I understand that this means I will be using the class labels > > for my filtering step, which I believe might lead to problems at the > > multiple testing correction stage. > > > > Thanks in advance for any help/ideas on this issue. > > > > Jim > > > > -- > > James Perkins > > Institute of Structural and Molecular Biology > > Division of Biosciences > > University College London > > Gower Street > > London, WC1E 6BT > > UK > > > > email: j.perkins@ucl.ac.uk > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > -- James Perkins Institute of Structural and Molecular Biology Division of Biosciences University College London Gower Street London, WC1E 6BT UK email: j.perkins@ucl.ac.uk [[alternative HTML version deleted]]

ADD REPLY • link 12.4 years ago James Perkins ▴ 120

0

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 10 hours ago

WEHI, Melbourne, Australia

Dear James, It does make sense to view a gene as an outlier (or a particular count as an outlier), although the data example you give you doesn't look all that bad, depending on the sequencing depths. In the future, edgeE will automatically detect outlier genes and downweight them in an appropriate way (our yet to be made public in- house version of edgeR already does that). Using the current official release of edgeR, you could simply reduce the prior.df when estimating the tagwise dispersions, say: y <- estimateGLMTagwise(y, design, prior.df=5) This will ensure that genes with outlier or very variable counts are down-weighted more than is the default. If you must filter genes on variability, then it would be better to do so based on goodness of fit statistics, rather than on the ad hoc filter you propose. You can explore outliers using the gof (goodness of fit) function in edgeR, for example fit <- glmFit(y,design) gof(fit, plot=TRUE) will make a Q-Q plot from which outliers can be identified. The gof() function will also compute p-values and flag outlier genes for you. Best wishes Gordon > Date: Tue, 11 Dec 2012 08:52:20 +0100 > From: James Perkins <j.perkins at="" ucl.ac.uk=""> > To: Bioconductor at r-project.org > Subject: [BioC] RNA-seq, low count filtering and multiple testing > > Hi all, > > I understand it is normal to filter out lowly expressed genes before > performing differential expression analysis on RNA-seq data (e.g., > edgeR, DESeq). > > However I notice with such methods as edgeR, I find a number of genes > where there is clearly one outlier that is causing the gene to be deemed > significantly DE (thought the dispersion value is quite high): > > for example > > control1 control2 control3 case1 case2 case3 > geneA 0 1 3 1 2 30 > > Note that case3 is not an outlier sample, MDS plots show it to be like > the other case samples, and the phenotype of the samples is as we would > expect. I would say this gene is an outlier rather than the sample being > an outlier, if that makes sense. > > Would it be fair to filter such examples out? I am thinking of a > filtering rule such that: > > for each gene, if it has a number of counts below X for at least one > case sample AND at least one control sample, discard it. > > This way I don't get rid of genes where the expression is high in case > and very low (or unexpressed) in control and vice versa. > > However, I understand that this means I will be using the class labels > for my filtering step, which I believe might lead to problems at the > multiple testing correction stage. > > Thanks in advance for any help/ideas on this issue. > > Jim > > -- > James Perkins > Institute of Structural and Molecular Biology > Division of Biosciences > University College London > Gower Street > London, WC1E 6BT > UK > > email: j.perkins at ucl.ac.uk > ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}

ADD COMMENT • link 12.4 years ago Gordon Smyth 52k

Login before adding your answer.