edgeR outlier question

0

Entering edit mode

Simon Melov ▴ 340

@simon-melov-266

Last seen 10.2 years ago

I have a reasonable RNASeq data set of 10 biological replicates of a control group versus 10 biological replicates experimental I've gone through the edgeR workflow, and get a nice list of about 1000 genes differentially expressed due to the experimental manipulation. I input the data based on total reads per gene (I'd like to get to exons too, but first things first). The data is obtained via a paired end strategy, so its pretty good quality. The number of reads per sample (library) is about 10 million reads each. My question is, as I go through list of significant genes which are differentially expressed between the two groups (normalized via the workflow), ranked by BH FDR down to 0.05, I see genes being judged as differentially expressed which have very low expression in most samples, yet are thrown off by 1 or 2 values, thereby achieving statistical significance. For example, a gene might have between 1 and 2 counts per million reads in one group, and be basically the same in the other group, but one of the values is perhaps at a 1000 or so counts, which seems to throw off the entire group, thereby becoming "significant". Shouldn't edgeR take into account this sort of biological variation within a group and account for it in assessing significance? Its clear that in the above example, that sample is an outlier, and therefore the variance is so high, so it shouldn't be ranked as being differentially expressed. I filtered the data by applying the criteria of at least 1 count per sample, and I have to have at least 8 samples per group which have this. Should there be an additional filtering criteria to exclude these outliers? or doesn't edgeR take into account this sort of situation (I thought it did). Am I doing something wrong here?

RNASeq edgeR RNASeq edgeR • 4.3k views

ADD COMMENT • link updated 12.5 years ago by Gordon Smyth 51k • written 12.6 years ago by Simon Melov ▴ 340

0

Entering edit mode

Ann Loraine ▴ 110

@ann-loraine-3863

Last seen 3.3 years ago

United States

Greetings, I am seeing the same kind of thing with a data set in my lab, as well. In fact, a few of the most significant genes (smallest p values) in my data set are in this group - huge numbers for one sample and small for all the rest. Best, Ann On Mon, May 7, 2012 at 3:19 PM, Simon Melov <smelov at="" buckinstitute.org=""> wrote: > I have a reasonable RNASeq data set of 10 biological replicates of a control group versus 10 biological replicates experimental I've gone through the edgeR workflow, and get a nice list of about 1000 genes differentially expressed due to the experimental manipulation. I input the data based on total reads per gene (I'd like to get to exons too, but first things first). The data is obtained via a paired end strategy, so its pretty good quality. The number of reads per sample (library) is about 10 million reads each. My question is, as I go through list of significant genes which are differentially expressed between the two groups ?(normalized via the workflow), ranked by BH FDR down to 0.05, I see genes being judged as differentially expressed which have very low expression in most samples, yet are thrown off by 1 or 2 values, thereby achieving statistical significance. For example, a gene might have between 1 and 2 counts per million reads in one group, and be basically the ! > ?same in the other group, but one of the values is perhaps at a 1000 or so counts, which seems to throw off the entire group, thereby becoming "significant". > > Shouldn't edgeR take into account this sort of biological variation within a group and account for it in assessing significance? Its clear that in the above example, that sample is an outlier, and therefore the variance is so high, so it shouldn't be ranked as being differentially expressed. I filtered the data by applying the criteria of at least 1 count per sample, and I have to have at least 8 samples per group which have this. Should there be an additional filtering criteria to exclude these outliers? or doesn't edgeR take into account this sort of situation (I thought it did). > > Am I doing something wrong here? > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 12.6 years ago Ann Loraine ▴ 110

0

Entering edit mode

alessandro.guffanti@genomnia.com ▴ 330

@alessandroguffantigenomniacom-4436

Last seen 10.2 years ago

That seems to be exactly my same problem, so I am including here Gordon's answer. Actually filtering at the cpm level worked quite nicely to ameliorate the situation - look at the latest update (May 2011) of the User Manual to see a neat example of the procedure. Alessandro -- Dear Alessandro, You seem to giving examples of miRs that are expressed at a high degree is just one sample. The easiest way to deal with such miRs, if you really don't want to detect them, is to filter out miRs that fail to be expressed to a reasonable degree in at least four samples (since your groups are of size four). See for example pages 24-25 of the edgeR user's guide, where this is done for the Dclk1 mouse case study. We often suggest cpm>1 for at least m samples, where m is the minimum group size. Another obvious thing to do is to examine an MDS plot to identify outlier samples. -- [BioC] edgeR: effect of 'outlier' tags on differential expression calls alessandro.guffanti at genomnia.com alessandro.guffanti at genomnia.com Tue Apr 24 12:48:22 CEST 2012 Dear colleagues: I am using edgeR to examine differential expression on small RNA data I noticed this problem also when working with SAGE datasets: when just one of the samples is clearly an outlier, like you can see below for sample 7 (the comparison is 1-4 versus 5-8), there is a call of significant differential expression which seems to be inappropriate, or at least it should be reexamined. How can we diagnose these situations before checking manually the tag counts for all the significant differential expression calls ? Please note that these are tumoral samples, so an high sample by sample variability is expected in principle.. Thanks a lot in advance, Alessandro miRNA_ID 1.mirna 2.mirna 3.mirna 4.mirna 5.mirna 6.mirna 7.mirna 8.mirna hsa-miR-515-3p 3 1 1 1 1 7 1601 3 hsa-miR-518e 4 0 1 0 1 2 1715 2 hsa-miR-520d-3p 0 0 0 0 0 1 243 0 hsa-miR-519c-3p 0 0 0 0 0 1 248 0 hsa-miR-520f 0 0 0 0 0 0 163 0 hsa-miR-519d 12 1 0 1 1 4 1754 1 hsa-miR-520h 0 0 0 0 0 0 189 2 hsa-miR-519c-5p 0 0 0 0 0 0 123 0 hsa-miR-520g 16 1 1 4 2 4 1917 2 hsa-miR-518b 5 0 0 1 1 3 686 1 hsa-miR-517a 100 5 4 2 6 45 10024 3 miRNA_ID logConc logFC P.Value adj.P.Val hsa-miR-515-3p -15.09154 -8.61753 0.00000 0.00082 hsa-miR-518e -15.30278 -9.22926 0.00000 0.00110 hsa-miR-520d-3p -18.23592 -9.46747 0.00001 0.00201 hsa-miR-519c-3p -17.98705 -9.01722 0.00002 0.00338 hsa-miR-520f -32.04992 -35.93228 0.00002 0.00338 hsa-miR-519d -14.46073 -7.61177 0.00003 0.00338 hsa-miR-520h -18.02925 -8.34496 0.00003 0.00338 hsa-miR-519c-5p -32.25620 -35.51970 0.00004 0.00382 hsa-miR-520g -14.16219 -7.27220 0.00005 0.00382 hsa-miR-518b -15.70611 -7.39997 0.00006 0.00382 hsa-miR-517a -11.74423 -7.21374 0.00006 0.00382 ----------------------------------------------------- Alessandro Guffanti - Bioinformatics, Genomnia srl Via Nerviano, 31 - 20020 Lainate, Milano, Italy Ph: +39-0293305.702 Fax: +39-0293305.777 http://www.genomnia.com "If you can dream it, you can do it" (Walt Disney) -----Original Message----- From: Simon Melov <smelov@buckinstitute.org> To: "bioconductor@r-project.org" <bioconductor@r-project.org> Date: Mon, 7 May 2012 12:19:19 -0700 Subject: [BioC] edgeR outlier question I have a reasonable RNASeq data set of 10 biological replicates of a control group versus 10 biological replicates experimental I've gone through the edgeR workflow, and get a nice list of about 1000 genes differentially expressed due to the experimental manipulation. I input the data based on total reads per gene (I'd like to get to exons too, but first things first). The data is obtained via a paired end strategy, so its pretty good quality. The number of reads per sample (library) is about 10 million reads each. My question is, as I go through list of significant genes which are differentially expressed between the two groups (normalized via the workflow), ranked by BH FDR down to 0.05, I see genes being judged as differentially expressed which have very low expression in most samples, yet are thrown off by 1 or 2 values, thereby achieving statistical significance. For example, a gene might have between 1 and 2 counts per million reads in one group, and be basically the ! same in the other group, but one of the values is perhaps at a 1000 or so counts, which seems to throw off the entire group, thereby becoming "significant". Shouldn't edgeR take into account this sort of biological variation within a group and account for it in assessing significance? Its clear that in the above example, that sample is an outlier, and therefore the variance is so high, so it shouldn't be ranked as being differentially expressed. I filtered the data by applying the criteria of at least 1 count per sample, and I have to have at least 8 samples per group which have this. Should there be an additional filtering criteria to exclude these outliers? or doesn't edgeR take into account this sort of situation (I thought it did). Am I doing something wrong here? _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor [https://stat.ethz.ch/mailman/listinfo/bioconductor] Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [http://news.gmane.org/gmane.science.biology.informatics.conductor] ----------------------------------------------------------- Il Contenuto del presente messaggio potrebbe contenere informazioni confidenziali a favore dei soli destinatari del messaggio stesso. Qualora riceviate per errore questo messaggio siete pregati di cancellarlo dalla memoria del computer e di contattare i numeri sopra indicati. Ogni utilizzo o ritrasmissione dei contenuti del messaggio da parte di soggetti diversi dai destinatari è da considerarsi vietato ed abusivo. The information transmitted is intended only for the per...{{dropped:10}}

ADD COMMENT • link 12.6 years ago alessandro.guffanti@genomnia.com ▴ 330

0

Entering edit mode

Hi Alessandro, I don't think this helps me, as I'm not looking to eliminate an entire gene based on a single replicate. I mentioned in my original post that I had applied the filtering discussed at length in the guide, (allowing genes with at least one read, in a minimum of 8 samples was my filtering criteria). But this doesn't address the problem of a very high level of reads in a single sample. This issue of variance should be incorporated into the analysis, and not result in genes being listed as significant due to a high levels in a single sample. This sort of problem is not unusual in the genomics world, and I think the microarray literature had numerous solutions to this sort of problem. I'm surprised it popped up so early in my analysis, as I thought this would have been "solved" by now. As a later poster alluded to, perhaps its due to a relatively "high" number of biological replicates (N=10 per group). This number of replicates going forward is going to be commonplace as sequencing costs tumble. So some guidance as to how to deal with this in edgeR would be very welcome. thanks Simon. On May 7, 2012, at 1:56 PM, Guffanti Alessandro wrote: That seems to be exactly my same problem, so I am including here Gordon's answer. Actually filtering at the cpm level worked quite nicely to ameliorate the situation - look at the latest update (May 2011) of the User Manual to see a neat example of the procedure. Alessandro -- Dear Alessandro, You seem to giving examples of miRs that are expressed at a high degree is just one sample. The easiest way to deal with such miRs, if you really don't want to detect them, is to filter out miRs that fail to be expressed to a reasonable degree in at least four samples (since your groups are of size four). See for example pages 24-25 of the edgeR user's guide, where this is done for the Dclk1 mouse case study. We often suggest cpm>1 for at least m samples, where m is the minimum group size. Another obvious thing to do is to examine an MDS plot to identify outlier samples. -- [BioC] edgeR: effect of 'outlier' tags on differential expression calls alessandro.guffanti at genomnia.com<http: genomnia.com=""> alessandro.guffanti at genomnia.com<http: genomnia.com=""> Tue Apr 24 12:48:22 CEST 2012 Dear colleagues: I am using edgeR to examine differential expression on small RNA data I noticed this problem also when working with SAGE datasets: when just one of the samples is clearly an outlier, like you can see below for sample 7 (the comparison is 1-4 versus 5-8), there is a call of significant differential expression which seems to be inappropriate, or at least it should be reexamined. How can we diagnose these situations before checking manually the tag counts for all the significant differential expression calls ? Please note that these are tumoral samples, so an high sample by sample variability is expected in principle.. Thanks a lot in advance, Alessandro miRNA_ID 1.mirna 2.mirna 3.mirna 4.mirna 5.mirna 6.mirna 7.mirna 8.mirna hsa-miR-515-3p 3 1 1 1 1 7 1601 3 hsa-miR-518e 4 0 1 0 1 2 1715 2 hsa-miR-520d-3p 0 0 0 0 0 1 243 0 hsa-miR-519c-3p 0 0 0 0 0 1 248 0 hsa-miR-520f 0 0 0 0 0 0 163 0 hsa-miR-519d 12 1 0 1 1 4 1754 1 hsa-miR-520h 0 0 0 0 0 0 189 2 hsa-miR-519c-5p 0 0 0 0 0 0 123 0 hsa-miR-520g 16 1 1 4 2 4 1917 2 hsa-miR-518b 5 0 0 1 1 3 686 1 hsa-miR-517a 100 5 4 2 6 45 10024 3 miRNA_ID logConc logFC P.Value adj.P.Val hsa-miR-515-3p -15.09154 -8.61753 0.00000 0.00082 hsa-miR-518e -15.30278 -9.22926 0.00000 0.00110 hsa-miR-520d-3p -18.23592 -9.46747 0.00001 0.00201 hsa-miR-519c-3p -17.98705 -9.01722 0.00002 0.00338 hsa-miR-520f -32.04992 -35.93228 0.00002 0.00338 hsa-miR-519d -14.46073 -7.61177 0.00003 0.00338 hsa-miR-520h -18.02925 -8.34496 0.00003 0.00338 hsa-miR-519c-5p -32.25620 -35.51970 0.00004 0.00382 hsa-miR-520g -14.16219 -7.27220 0.00005 0.00382 hsa-miR-518b -15.70611 -7.39997 0.00006 0.00382 hsa-miR-517a -11.74423 -7.21374 0.00006 0.00382 ----------------------------------------------------- Alessandro Guffanti - Bioinformatics, Genomnia srl Via Nerviano, 31 - 20020 Lainate, Milano, Italy Ph: +39-0293305.702 Fax: +39-0293305.777 http://www.genomnia.com "If you can dream it, you can do it" (Walt Disney) -----Original Message----- From: Simon Melov <smelov@buckinstitute.org<mailto:smelov@buckinstitute.org>> To: "bioconductor at r-project.org<mailto:bioconductor at="" r-project.org="">" <bioconductor at="" r-project.org<mailto:bioconductor="" at="" r-project.org="">> Date: Mon, 7 May 2012 12:19:19 -0700 Subject: [BioC] edgeR outlier question I have a reasonable RNASeq data set of 10 biological replicates of a control group versus 10 biological replicates experimental I've gone through the edgeR workflow, and get a nice list of about 1000 genes differentially expressed due to the experimental manipulation. I input the data based on total reads per gene (I'd like to get to exons too, but first things first). The data is obtained via a paired end strategy, so its pretty good quality. The number of reads per sample (library) is about 10 million reads each. My question is, as I go through list of significant genes which are differentially expressed between the two groups (normalized via the workflow), ranked by BH FDR down to 0.05, I see genes being judged as differentially expressed which have very low expression in most samples, yet are thrown off by 1 or 2 values, thereby achieving statistical significance. For example, a gene might have between 1 and 2 counts per million reads in one group, and be basically the ! same in the other group, but one of the values is perhaps at a 1000 or so counts, which seems to throw off the entire group, thereby becoming "significant". Shouldn't edgeR take into account this sort of biological variation within a group and account for it in assessing significance? Its clear that in the above example, that sample is an outlier, and therefore the variance is so high, so it shouldn't be ranked as being differentially expressed. I filtered the data by applying the criteria of at least 1 count per sample, and I have to have at least 8 samples per group which have this. Should there be an additional filtering criteria to exclude these outliers? or doesn't edgeR take into account this sort of situation (I thought it did). Am I doing something wrong here? _______________________________________________ Bioconductor mailing list Bioconductor at r-project.org<mailto:bioconductor at="" r-project.org=""> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor ----------------------------------------------------------- Il Contenuto del presente messaggio potrebbe contenere informazioni confidenziali a favore dei soli destinatari del messaggio stesso. Qualora riceviate per errore questo messaggio siete pregati di cancellarlo dalla memoria del computer e di contattare i numeri sopra indicati. Ogni utilizzo o ritrasmissione dei contenuti del messaggio da parte di soggetti diversi dai destinatari ? da considerarsi vietato ed abusivo. The information transmitted is intended only for the per...{{dropped:11}}

ADD REPLY • link 12.6 years ago Simon Melov ▴ 340

0

Entering edit mode

Simon, Ann, if you just need a simple two-group sample comparison you may want to try the tweeDEseq package which is based on a more flexible discrete count distribution (the Poisson-Tweedie) which includes the negative-binomial as a special case. the case you describe below could be described as a heavy-tail behavior, this is also commented in figure 2 of the tweeDEseq vignette where one single sample increases the fold change, the Poisson-Tweedie distribution provides additional flexibility to fit these cases. if you don't want to go through the vignette just do 'example(tweeDE)' to see how it works and if you have several CPU cores load the 'parallel' library first as computations with the Poisson-Tweedie model take longer than with the negative binomial. as mentioned in other posts, filtering is great and you can find some of the strategies mentioned in the function 'filterCounts()'. cheers, robert. On 05/08/2012 03:22 AM, Simon Melov wrote: > Hi Alessandro, > I don't think this helps me, as I'm not looking to eliminate an entire gene based on > a single replicate. I mentioned in my original post that I had applied the filtering discussed at length in the guide, (allowing genes with at least one read, in a minimum > of 8 samples was my filtering criteria). But this doesn't address the problem of a very > high level of reads in a single sample. This issue of variance should be incorporated > into the analysis, and not result in genes being listed as significant due to a high > levels in a single sample. This sort of problem is not unusual in the genomics world, > and I think the microarray literature had numerous solutions to this sort of problem. > I'm surprised it popped up so early in my analysis, as I thought this would have been > "solved" by now. As a later poster alluded to, perhaps its due to a relatively "high" > number of biological replicates (N=10 per group). This number of replicates going > forward is going to be commonplace as sequencing costs tumble. So some guidance as to > how to deal with this in edgeR would be very welcome. > > thanks > > Simon. > > On May 7, 2012, at 1:56 PM, Guffanti Alessandro wrote: > > That seems to be exactly my same problem, so I am including here Gordon's answer. > > Actually filtering at the cpm level worked quite nicely to ameliorate the situation - look at the > latest update (May 2011) of the User Manual to see a neat example of the procedure. > > > Alessandro > > -- > > Dear Alessandro, > > You seem to giving examples of miRs that are expressed at a high degree is > just one sample. The easiest way to deal with such miRs, if you really > don't want to detect them, is to filter out miRs that fail to be expressed > to a reasonable degree in at least four samples (since your groups are of > size four). See for example pages 24-25 of the edgeR user's guide, where > this is done for the Dclk1 mouse case study. We often suggest cpm>1 for > at least m samples, where m is the minimum group size. > > Another obvious thing to do is to examine an MDS plot to identify outlier > samples. > > -- > > [BioC] edgeR: effect of 'outlier' tags on differential expression calls > alessandro.guffanti at genomnia.com<http: genomnia.com=""> alessandro.guffanti at genomnia.com<http: genomnia.com=""> > Tue Apr 24 12:48:22 CEST 2012 > > Dear colleagues: I am using edgeR to examine differential expression on > small RNA data > > I noticed this problem also when working with SAGE datasets: when just one > of the samples is clearly an outlier, like you can see below for sample 7 > (the comparison is 1-4 versus 5-8), there is a call of significant > differential expression which seems to be inappropriate, or at least it > should be reexamined. > > How can we diagnose these situations before checking manually the tag > counts for all the significant differential expression calls ? Please note > that these are tumoral samples, so an high sample by sample variability is > expected in principle.. > > Thanks a lot in advance, > > Alessandro > > > miRNA_ID 1.mirna 2.mirna 3.mirna 4.mirna > 5.mirna 6.mirna > 7.mirna 8.mirna > hsa-miR-515-3p 3 1 1 1 1 7 1601 3 > hsa-miR-518e 4 0 1 0 1 2 1715 2 > hsa-miR-520d-3p 0 0 0 0 0 1 > 243 0 > hsa-miR-519c-3p 0 0 0 0 0 1 > 248 0 > hsa-miR-520f 0 0 0 0 0 0 163 0 > hsa-miR-519d 12 1 0 1 1 4 1754 1 > hsa-miR-520h 0 0 0 0 0 0 189 2 > hsa-miR-519c-5p 0 0 0 0 0 0 > 123 0 > hsa-miR-520g 16 1 1 4 2 4 1917 2 > hsa-miR-518b 5 0 0 1 1 3 686 1 > hsa-miR-517a 100 5 4 2 6 45 10024 3 > > > > miRNA_ID logConc logFC P.Value adj.P.Val > hsa-miR-515-3p -15.09154 -8.61753 0.00000 0.00082 > hsa-miR-518e -15.30278 -9.22926 0.00000 0.00110 > hsa-miR-520d-3p -18.23592 -9.46747 0.00001 > 0.00201 > hsa-miR-519c-3p -17.98705 -9.01722 0.00002 > 0.00338 > hsa-miR-520f -32.04992 -35.93228 0.00002 0.00338 > hsa-miR-519d -14.46073 -7.61177 0.00003 0.00338 > hsa-miR-520h -18.02925 -8.34496 0.00003 0.00338 > hsa-miR-519c-5p -32.25620 -35.51970 0.00004 > 0.00382 > hsa-miR-520g -14.16219 -7.27220 0.00005 0.00382 > hsa-miR-518b -15.70611 -7.39997 0.00006 0.00382 > hsa-miR-517a -11.74423 -7.21374 0.00006 0.00382 > > ----------------------------------------------------- > Alessandro Guffanti - Bioinformatics, Genomnia srl > Via Nerviano, 31 - 20020 Lainate, Milano, Italy > Ph: +39-0293305.702 Fax: +39-0293305.777 > http://www.genomnia.com > "If you can dream it, you can do it" (Walt Disney) > > -----Original Message----- > From: Simon Melov<smelov at="" buckinstitute.org<mailto:smelov="" at="" buckinstitute.org="">> > To: "bioconductor at r-project.org<mailto:bioconductor at="" r-project.org="">"<bioconductor at="" r-project.org<mailto:bioconductor="" at="" r-project.org="">> > Date: Mon, 7 May 2012 12:19:19 -0700 > Subject: [BioC] edgeR outlier question > > I have a reasonable RNASeq data set of 10 biological replicates of a control group versus 10 biological replicates experimental I've gone through the edgeR workflow, and get a nice list of about 1000 genes differentially expressed due to the experimental manipulation. I input the data based on total reads per gene (I'd like to get to exons too, but first things first). The data is obtained via a paired end strategy, so its pretty good quality. The number of reads per sample (library) is about 10 million reads each. My question is, as I go through list of significant genes which are differentially expressed between the two groups (normalized via the workflow), ranked by BH FDR down to 0.05, I see genes being judged as differentially expressed which have very low expression in most samples, yet are thrown off by 1 or 2 values, thereby achieving statistical significance. For example, a gene might have between 1 and 2 counts per million reads in one group, and be basically the ! > same in the other group, but one of the values is perhaps at a 1000 or so counts, which seems to throw off the entire group, thereby becoming "significant". > > Shouldn't edgeR take into account this sort of biological variation within a group and account for it in assessing significance? Its clear that in the above example, that sample is an outlier, and therefore the variance is so high, so it shouldn't be ranked as being differentially expressed. I filtered the data by applying the criteria of at least 1 count per sample, and I have to have at least 8 samples per group which have this. Should there be an additional filtering criteria to exclude these outliers? or doesn't edgeR take into account this sort of situation (I thought it did). > > Am I doing something wrong here? > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org<mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > ----------------------------------------------------------- > Il Contenuto del presente messaggio potrebbe contenere informazioni confidenziali a favore dei > soli destinatari del messaggio stesso. Qualora riceviate per errore questo messaggio siete pregati > di cancellarlo dalla memoria del computer e di contattare i numeri sopra indicati. Ogni utilizzo o > ritrasmissione dei contenuti del messaggio da parte di soggetti diversi dai destinatari ? da > considerarsi vietato ed abusivo. > > The information transmitted is intended only for the per...{{dropped:11}} > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 12.6 years ago Robert Castelo ★ 3.4k

0

Entering edit mode

Hi - actually this problem pops out even with as low as two replicates, and I would tend to attribute it to a technical feature of NGS (at least the ones in which there is an em-PCR step such as 454 and SOLiD) - which is over-amplification of a certain sequence set in a single sample. And this could be called shot noise I guess .. I sqw it both in SAGE and miRNA sequencing in multiple samples. I agree of course in principle on not throwing away genes for what happens sporadically in one sample. However, in my experience these 'read shots' always happens in the very grey area of few reads per samples, and if you reason in cpm this will be the area of less than 10 count per millions - I don't know it this is the same situation for you So, these are genes usually located in the area where biological variance is well hidden below technical variance. I guess that these will not be your most significant findings and the solution of reasoning with edgeR in terms of cmp for the threshold selection - rather than read counts even in normalized libraries - worked nicely for my miRNAs when I went back to MDS plots to explore the situation... This is only my experience, though, so I would be interested to know if this 'read shot noise' happens also in areas where there are large counts Regards, Alessandro On 5/8/2012 3:22 AM, Simon Melov wrote: > Hi Alessandro, > I don't think this helps me, as I'm not looking to eliminate an entire gene based on a single replicate. I mentioned in my original post that I had applied the filtering discussed at length in the guide, (allowing genes with at least one read, in a minimum of 8 samples was my filtering criteria). But this doesn't address the problem of a very high level of reads in a single sample. This issue of variance should be incorporated into the analysis, and not result in genes being listed as significant due to a high levels in a single sample. This sort of problem is not unusual in the genomics world, and I think the microarray literature had numerous solutions to this sort of problem. I'm surprised it popped up so early in my analysis, as I thought this would have been "solved" by now. As a later poster alluded to, perhaps its due to a relatively "high" number of biological replicates (N=10 per group). This number of replicates going forward is going to be commonplace as sequencing costs tumble. So some guidance as to how to deal with this in edgeR would be very welcome. > > thanks > > Simon. > -- Alessandro Guffanti - Head, Bioinformatics, Genomnia srl Via Nerviano, 31 - 20020 Lainate, Milano, Italy Ph: +39-0293305.702 Fax: +39-0293305.777 http://www.genomnia.com "When you're curious, you find lots of interesting things to do." (Walt Disney) ----------------------------------------------------------- Il Contenuto del presente messaggio potrebbe contenere informazioni confidenziali a favore dei soli destinatari del messaggio stesso. Qualora riceviate per errore questo messaggio siete pregati di cancellarlo dalla memoria del computer e di contattare i numeri sopra indicati. Ogni utilizzo o ritrasmissione dei contenuti del messaggio da parte di soggetti diversi dai destinatari ? da considerarsi vietato ed abusivo. The information transmitted is intended only for the per...{{dropped:12}}

ADD REPLY • link 12.6 years ago alessandro.guffanti@genomnia.com ▴ 330

0

Entering edit mode

Dear Simon, the problem that such outliers occur has nothing to do with the number of replicates (as Alessandro also points out below; and I did not imply that in my post). However, the potential solutions to the problem have a lot to do with it. 1. With very many replicates, in principle you don't need to make any parametric assumptions, and you could just use a permutation or rank based method. 2. With an intermediate number of replicates, you could use a highly flexible family of distributions, such as Poisson-Tweedie (PT, suggested by Robert Castelo). For this, you need to fit 3 parameters per gene. In the tweeDESeq vignette, they look at RNA-Seq data from 69 HapMap individuals. 3. The Negative Binomial (NB) distribution (used in edgeR and DESeq) is a special case of PT, with 2 parameters per gene. Less replicates are sufficient to determine these (e.g. sharingMode = "gene-est-only" in DESeq's estimateDispersions). 4. With few replicates (say, 2 vs 2), even two parameters per gene are too many to fit reliably. For these instances, 'information sharing' across genes is used, either assuming a common dispersion, or a mean-dependent one. In the extreme case, you end up with essentially 1 parameter per gene (e.g. sharingMode = "fit-only" in DESeq's estimateDispersions). 5. An intermediate solution, that can be thought of as using somewhere between 1 and 2 parameters per gene, is the shrinkage approach used by edgeR, or sharingMode = "maximum" in DESeq, which we find to work well. I hope this overview helps. The current instances of software require you to select your analysis strategy between these 5 options manually. It would be interesting to see if that can be (or should be?) automated. PS In addition to the above, one can go back to the robust statistics textbook, stay with simple parametric distributions, but remove or down-weigh outlier data points (not necessarily the whole gene). Best wishes Wolfgang alessandro.guffanti at genomnia.com scripsit 05/08/2012 02:35 PM: > Hi - actually this problem pops out even with as low as two replicates, > and I would > tend to attribute it to a technical feature of NGS (at least the ones in > which there is > an em-PCR step such as 454 and SOLiD) - which is over-amplification of a > certain > sequence set in a single sample. And this could be called shot noise I > guess .. I sqw > it both in SAGE and miRNA sequencing in multiple samples. > > I agree of course in principle on not throwing away genes for what > happens sporadically in > one sample. However, in my experience these 'read shots' always happens > in the very grey area > of few reads per samples, and if you reason in cpm this will be the area > of less than 10 count > per millions - I don't know it this is the same situation for you > > So, these are genes usually located in the area where biological > variance is well hidden below > technical variance. I guess that these will not be your most significant > findings and the solution of > reasoning with edgeR in terms of cmp for the threshold selection - > rather than read counts even > in normalized libraries - worked nicely for my miRNAs when I went back > to MDS plots to explore > the situation... > > This is only my experience, though, so I would be interested to know if > this 'read shot noise' happens also > in areas where there are large counts > > Regards, > > Alessandro > > On 5/8/2012 3:22 AM, Simon Melov wrote: >> Hi Alessandro, >> I don't think this helps me, as I'm not looking to eliminate an entire >> gene based on a single replicate. I mentioned in my original post that >> I had applied the filtering discussed at length in the guide, >> (allowing genes with at least one read, in a minimum of 8 samples was >> my filtering criteria). But this doesn't address the problem of a very >> high level of reads in a single sample. This issue of variance should >> be incorporated into the analysis, and not result in genes being >> listed as significant due to a high levels in a single sample. This >> sort of problem is not unusual in the genomics world, and I think the >> microarray literature had numerous solutions to this sort of problem. >> I'm surprised it popped up so early in my analysis, as I thought this >> would have been "solved" by now. As a later poster alluded to, perhaps >> its due to a relatively "high" number of biological replicates (N=10 >> per group). This number of replicates going forward is going to be >> commonplace as sequencing costs tumble. So some guidance as to how to >> deal with this in edgeR would be very welcome. >> >> thanks >> >> Simon. >> > > -- Best wishes Wolfgang Wolfgang Huber EMBL http://www.embl.de/research/units/genome_biology/huber

ADD REPLY • link 12.6 years ago Wolfgang Huber ★ 13k

0

Entering edit mode

Wolfgang Huber ★ 13k

@wolfgang-huber-3550

Last seen 3 months ago

EMBL European Molecular Biology Laborat…

Dear Simon, Alessandro I assume that the inference that you are refering to is based on shrunken (empirical Bayes) estimates of the dispersions. Perhaps what you are observing is that the shrinkage turns out to be too strong for your data - such that genes with large empirical dispersion (driven by the 'outliers') get their estimate shrunk too much, while their apparent fold change is not shrunk, which would make them appear significant. With 10+10 replicates you do not need (and probably don't want) to shrink your dispersion estimates, you can just use the empirical values. Others are better qualified to point how to best achieve this with edgeR. (In DESeq, this is controlled by the parameter 'sharingMode' of the 'estimateDispersions' function, which you could set to 'gene-est-only'. Its default is 'maximum', which we find useful for situations with fewer replicates.) Also, to improve power, it is always advisable to perform 'independent filtering' of genes before the testing, in order to weed out genes that anyway have negligible chance of being differentially expressed. This concept is explained in [1]. A suitable filter criterion would be e.g. the median of a gene's values across samples (irrespective of condition!). [1] Richard Bourgon et al., Independent filtering increases detection power for high-throughput experiments. PNAS 2010 http://www.pnas.org/content/107/21/9546.long Best wishes Wolfgang May/7/12 9:19 PM, Simon Melov scripsit:: > I have a reasonable RNASeq data set of 10 biological replicates of a > control group versus 10 biological replicates experimental I've gone > through the edgeR workflow, and get a nice list of about 1000 genes > differentially expressed due to the experimental manipulation. I > input the data based on total reads per gene (I'd like to get to > exons too, but first things first). The data is obtained via a paired > end strategy, so its pretty good quality. The number of reads per > sample (library) is about 10 million reads each. My question is, as I > go through list of significant genes which are differentially > expressed between the two groups (normalized via the workflow), > ranked by BH FDR down to 0.05, I see genes being judged as > differentially expressed which have very low expression in most > samples, yet are thrown off by 1 or 2 values, thereby achieving > statistical significance. For example, a gene might have between 1 > and 2 counts per million reads in one group, and be basically the ! > same in the other group, but one of the values is perhaps at a 1000 > or so counts, which seems to throw off the entire group, thereby > becoming "significant". > > Shouldn't edgeR take into account this sort of biological variation > within a group and account for it in assessing significance? Its > clear that in the above example, that sample is an outlier, and > therefore the variance is so high, so it shouldn't be ranked as being > differentially expressed. I filtered the data by applying the > criteria of at least 1 count per sample, and I have to have at least > 8 samples per group which have this. Should there be an additional > filtering criteria to exclude these outliers? or doesn't edgeR take > into account this sort of situation (I thought it did). > > Am I doing something wrong here? > > _______________________________________________ Bioconductor mailing > list Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor Search the > archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor -- Best wishes Wolfgang Wolfgang Huber EMBL http://www.embl.de/research/units/genome_biology/huber

ADD COMMENT • link 12.6 years ago Wolfgang Huber ★ 13k

0

Entering edit mode

Gordon Smyth 51k

@gordon-smyth

Last seen 41 minutes ago

WEHI, Melbourne, Australia

Hi Simon, edgeR does take into account the amount of biological variation within groups, and it does de-prioritize genes that are inconsistent within groups, although it seems not as strongly as you'd like in your case. Here are some quick solutions. First, the filtering you've done sounds good, but I would require a minimum cpm for at least ten samples for your experiment, rather than eight. That's because both of your groups are of size ten. From your description, if you filter genes that fail to achieve at least 2 cpm in >= 10 samples, that may take care of the one-offs. Second, edgeR (unlike limma) doesn't have to ability to automatically adapt the degree of empirical Bayes smoothing, but you can adjust it yourself. The default prior degrees of freedom for the edgeR empirical Bayes procedure is set at 20. You might need a smaller value, perhaps a lot smaller. Try prior df of 2, say, which you can achieve by setting prior.n=2/18 when you run estimateTagwiseDisp(). The smaller you make this value, the more strongly edgeR will down-weight genes that are inconsistent within replicates. A more radical solution would be to use edgeR's glm pipeline, and to use glmQLFTest() in place of the more usual glmLRT(). In this quasi glm pipeline, estimateGLMTagwiseDisp() is omitted, and instead edgeR calls limma functions to do the empirical Bayes shrinkage, meaning that the prior df is estimated rather than preset. This also provides a more conservative statistical test that fully takes into account the uncertainty with which the dispersion is estimated. This pipeline will strongly de-prioritize genes that are inconsistent within replicates. Finally, you could consider removing outlier genes manually. There are a few ways to do that. We always look at plotBCV() plots of the estimated dispersions, and sometimes if there are obvious outliers we will identify and filter them out. If you have a small percentage of extreme outliers, this is the way to go. Best wishes Gordon > Date: Mon, 7 May 2012 12:19:19 -0700 > From: Simon Melov <smelov at="" buckinstitute.org=""> > To: "bioconductor at r-project.org" <bioconductor at="" r-project.org=""> > Subject: [BioC] edgeR outlier question > > I have a reasonable RNASeq data set of 10 biological replicates of a > control group versus 10 biological replicates experimental I've gone > through the edgeR workflow, and get a nice list of about 1000 genes > differentially expressed due to the experimental manipulation. I input > the data based on total reads per gene (I'd like to get to exons too, > but first things first). The data is obtained via a paired end strategy, > so its pretty good quality. The number of reads per sample (library) is > about 10 million reads each. My question is, as I go through list of > significant genes which are differentially expressed between the two > groups (normalized via the workflow), ranked by BH FDR down to 0.05, I > see genes being judged as differentially expressed which have very low > expression in most samples, yet are thrown off by 1 or 2 values, thereby > achieving statistical significance. For example, a gene might have > between 1 and 2 counts per million reads in one group, and be basically > the ! same in the other group, but one of the values is perhaps at a > 1000 or so counts, which seems to throw off the entire group, thereby > becoming "significant". > > Shouldn't edgeR take into account this sort of biological variation > within a group and account for it in assessing significance? Its clear > that in the above example, that sample is an outlier, and therefore the > variance is so high, so it shouldn't be ranked as being differentially > expressed. I filtered the data by applying the criteria of at least 1 > count per sample, and I have to have at least 8 samples per group which > have this. Should there be an additional filtering criteria to exclude > these outliers? or doesn't edgeR take into account this sort of > situation (I thought it did). > > Am I doing something wrong here? ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}

ADD COMMENT • link 12.6 years ago Gordon Smyth 51k

0

Entering edit mode

Gordon Smyth 51k

@gordon-smyth

Last seen 41 minutes ago

WEHI, Melbourne, Australia

I don't agree that switching to purely genewise dispersion estimates is the best solution. One can of course achieve this in edgeR using a very small prior.n, but I think there is still plenty of benefit to be had from borrowing information between genes at sample sizes of 10 and above. The beauty of the empirical Bayes moderation approach of edgeR is that the dispersion estimators transition smoothly from nearly global estimators at small n to nearly purely genewise at large n. This is a natural consequence of the fact that the prior stays roughly constant while the amount of information in the data increases. This smooth transition seems preferable to me that than having to switch between very different estimation strategies from one sample size to another. Note that edgeR moderates genewise dispersions both up and down towards a global estimate, so it doesn't necessarily "shrink". Some genewise estimates are decreased while others are increased, and the latter is just as important as the former. I prefer to call it "moderation" or "smoothing" or "squeezing". Gordon > Date: Tue, 08 May 2012 00:40:49 +0200 > From: Wolfgang Huber <whuber at="" embl.de=""> > To: bioconductor at r-project.org > Subject: Re: [BioC] edgeR outlier question > Message-ID: <4FA84F71.8020806 at embl.de> > Content-Type: text/plain; charset=ISO-8859-1; format=flowed > > Dear Simon, Alessandro > > I assume that the inference that you are refering to is based on > shrunken (empirical Bayes) estimates of the dispersions. Perhaps what > you are observing is that the shrinkage turns out to be too strong for > your data - such that genes with large empirical dispersion (driven by > the 'outliers') get their estimate shrunk too much, while their apparent > fold change is not shrunk, which would make them appear significant. > > With 10+10 replicates you do not need (and probably don't want) to > shrink your dispersion estimates, you can just use the empirical values. > Others are better qualified to point how to best achieve this with > edgeR. (In DESeq, this is controlled by the parameter 'sharingMode' of > the 'estimateDispersions' function, which you could set to > 'gene-est-only'. Its default is 'maximum', which we find useful for > situations with fewer replicates.) > > Also, to improve power, it is always advisable to perform 'independent > filtering' of genes before the testing, in order to weed out genes that > anyway have negligible chance of being differentially expressed. This > concept is explained in [1]. A suitable filter criterion would be e.g. > the median of a gene's values across samples (irrespective of condition!). > > [1] Richard Bourgon et al., Independent filtering increases detection > power for high-throughput experiments. PNAS 2010 > http://www.pnas.org/content/107/21/9546.long > > Best wishes > Wolfgang > > > > > May/7/12 9:19 PM, Simon Melov scripsit:: >> I have a reasonable RNASeq data set of 10 biological replicates of a >> control group versus 10 biological replicates experimental I've gone >> through the edgeR workflow, and get a nice list of about 1000 genes >> differentially expressed due to the experimental manipulation. I >> input the data based on total reads per gene (I'd like to get to >> exons too, but first things first). The data is obtained via a paired >> end strategy, so its pretty good quality. The number of reads per >> sample (library) is about 10 million reads each. My question is, as I >> go through list of significant genes which are differentially >> expressed between the two groups (normalized via the workflow), >> ranked by BH FDR down to 0.05, I see genes being judged as >> differentially expressed which have very low expression in most >> samples, yet are thrown off by 1 or 2 values, thereby achieving >> statistical significance. For example, a gene might have between 1 >> and 2 counts per million reads in one group, and be basically the ! >> same in the other group, but one of the values is perhaps at a 1000 >> or so counts, which seems to throw off the entire group, thereby >> becoming "significant". >> >> Shouldn't edgeR take into account this sort of biological variation >> within a group and account for it in assessing significance? Its >> clear that in the above example, that sample is an outlier, and >> therefore the variance is so high, so it shouldn't be ranked as being >> differentially expressed. I filtered the data by applying the >> criteria of at least 1 count per sample, and I have to have at least >> 8 samples per group which have this. Should there be an additional >> filtering criteria to exclude these outliers? or doesn't edgeR take >> into account this sort of situation (I thought it did). >> >> Am I doing something wrong here? > ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}

ADD COMMENT • link 12.5 years ago Gordon Smyth 51k

0

Entering edit mode

Gordon Smyth 51k

@gordon-smyth

Last seen 41 minutes ago

WEHI, Melbourne, Australia

I'm commenting just on point 5 below. It is certainly true that edgeR is able to borrow information between genes, meaning that only part of a parameter is used up for estimating the dispersion for each gene, i.e., between 1 and 2 parameters total for gene. However DESeq, from what I understand, uses either a global dispersion estimate or a purely genewise estimate for each gene. Always one of the extremes, never anything in between. So it estimates either 1 or 2 parameters for each gene, not an intermediate quantity. Gordon > Date: Tue, 08 May 2012 16:36:45 +0200 > From: Wolfgang Huber <whuber at="" embl.de=""> > To: bioconductor at r-project.org > Subject: Re: [BioC] edgeR outlier question > > > Dear Simon, > > the problem that such outliers occur has nothing to do with the number > of replicates (as Alessandro also points out below; and I did not imply > that in my post). However, the potential solutions to the problem have a > lot to do with it. > > 1. With very many replicates, in principle you don't need to make any > parametric assumptions, and you could just use a permutation or rank > based method. > > 2. With an intermediate number of replicates, you could use a highly > flexible family of distributions, such as Poisson-Tweedie (PT, suggested > by Robert Castelo). For this, you need to fit 3 parameters per gene. In > the tweeDESeq vignette, they look at RNA-Seq data from 69 HapMap > individuals. > > 3. The Negative Binomial (NB) distribution (used in edgeR and DESeq) is > a special case of PT, with 2 parameters per gene. Less replicates are > sufficient to determine these (e.g. sharingMode = "gene-est-only" in > DESeq's estimateDispersions). > > 4. With few replicates (say, 2 vs 2), even two parameters per gene are > too many to fit reliably. For these instances, 'information sharing' > across genes is used, either assuming a common dispersion, or a > mean-dependent one. In the extreme case, you end up with essentially 1 > parameter per gene (e.g. sharingMode = "fit-only" in DESeq's > estimateDispersions). > > 5. An intermediate solution, that can be thought of as using somewhere > between 1 and 2 parameters per gene, is the shrinkage approach used by > edgeR, or sharingMode = "maximum" in DESeq, which we find to work well. > > I hope this overview helps. The current instances of software require > you to select your analysis strategy between these 5 options manually. > It would be interesting to see if that can be (or should be?) automated. > > PS In addition to the above, one can go back to the robust statistics > textbook, stay with simple parametric distributions, but remove or > down-weigh outlier data points (not necessarily the whole gene). > > Best wishes > Wolfgang > ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}

ADD COMMENT • link 12.5 years ago Gordon Smyth 51k

Login before adding your answer.