total count filter cutoff


Guest User ★ 13k

@guest-user-4897

Last seen 10.6 years ago

I'm using edgeR to analyse my data and I'm not sure what total count filter cutoff I should use. My reads are paired-end 50 bp reads, and the total number of reads per sample is about 80,000,000. I tried cutoff values of 5, 10, 15, 30, 50 and 100 and only saw differences between 50 and 100, but I'm still looking for a logical reason to choose the cutoff value.

Appreciate your help,
Mahnaz

-- output of sessionInfo():

R 3.0.2

edgeR edgeR • 2.9k views

ADD COMMENT • link updated 11.0 years ago by Wolfgang Huber ★ 13k • written 11.0 years ago by Guest User ★ 13k


Wolfgang Huber ★ 13k

@wolfgang-huber-3550

Last seen 7 weeks ago

EMBL European Molecular Biology Laborat…

Dear Mahnaz,

http://bioconductor.org/packages/release/bioc/html/genefilter.html -> Diagnostics for independent filtering -> Section 4 provides some options.

Wolfgang

ADD COMMENT • link 11.0 years ago Wolfgang Huber ★ 13k
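
For readers without the vignette to hand, the diagnostic Wolfgang points to comes down to counting how many genes are still rejected after multiple-testing adjustment as the filter threshold is varied. The following is only a minimal base-R sketch of that idea, not the vignette's own code: `rejections_at` is a hypothetical helper, the data are simulated, and Benjamini-Hochberg adjustment at `alpha = 0.1` is an assumed choice.

```r
# Simulated stand-in data (NOT real results): a per-gene filter statistic
# (e.g. mean count) and per-gene p-values from some test.
set.seed(42)
n_genes <- 5000
mean_count <- rexp(n_genes, rate = 1 / 50)
is_de <- seq_len(n_genes) <= 500            # first 500 genes truly differ
# Assumption for the toy data: power grows with the mean count, while null
# p-values are independent of the filter statistic.
z <- rnorm(n_genes, mean = ifelse(is_de, sqrt(mean_count) / 2, 0))
pvals <- 2 * pnorm(-abs(z))

# Hypothetical helper: number of BH rejections when genes whose filter
# statistic lies below the theta-quantile are dropped before adjustment.
rejections_at <- function(filter_stat, pvals, theta, alpha = 0.1) {
  keep <- filter_stat >= quantile(filter_stat, probs = theta)
  sum(p.adjust(pvals[keep], method = "BH") < alpha)
}

# Scan a range of filter quantiles and tabulate cutoff vs. rejections.
thetas <- seq(0, 0.8, by = 0.1)
n_rej <- sapply(thetas, function(th) rejections_at(mean_count, pvals, th))
data.frame(theta = thetas,
           cutoff = as.numeric(quantile(mean_count, probs = thetas)),
           rejections = n_rej)
```

With real data, the simulated vectors would be replaced by your own filter statistic (total or mean count) and the p-values from your test; a cutoff near where the rejection count peaks or levels off is then a defensible, data-driven choice.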


Thanks for the quick response. I did check that but didn't find any information about a total count filter cutoff. Would you please help me with that?

Thanks,
Mahnaz

ADD REPLY • link 11.0 years ago Mahnaz Kiani ▴ 20


Dear Mahnaz,

Total count filtering and mean count filtering are equivalent, since the only difference is a constant factor (dividing by the number of samples), so the mean count filter demonstrated in the genefilter vignette corresponds to your question.

If you are expecting the vignette to simply give you a specific number to use as a cutoff, that's not possible, because the threshold depends on the data. I suggest that you adapt the R code in this vignette to your data in order to choose an appropriate cutoff.

-Ryan

ADD REPLY • link 11.0 years ago Ryan C. Thompson ★ 7.9k
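
Ryan's point that total-count and mean-count filtering are equivalent can be checked in a few lines of base R. This is just an illustrative sketch: the count matrix is made up, and `total_cutoff` is a hypothetical value rather than a recommendation.

```r
# Toy count matrix (made-up numbers): 4 genes x 4 samples.
counts <- matrix(c( 0,  1,  2,  1,
                   10, 12,  9, 11,
                   40, 35, 50, 45,
                    3,  0,  4,  1),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:4), paste0("s", 1:4)))

n_samples <- ncol(counts)
total_cutoff <- 10                       # hypothetical total-count cutoff

# Filter on the row totals ...
keep_total <- rowSums(counts) >= total_cutoff
# ... and on the row means, with the cutoff divided by the number of samples.
keep_mean <- rowMeans(counts) >= total_cutoff / n_samples

# Both filters select exactly the same genes.
identical(keep_total, keep_mean)
```

So a cutoff chosen on the mean-count scale in the vignette translates to a total-count cutoff by multiplying by the number of samples.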


In my lab, we typically follow a "CPM of at least X in at least Y samples" rule, where X=1 (arbitrary but reasonable, can be changed) and Y=size of the smallest replicate group, according to one of the case studies in the user's guide, for example:

------
4.3.6 Filtering
We filter out very lowly expressed tags, keeping genes that are expressed at a reasonable level in at least one treatment condition. Since the smallest group size is three, we keep genes that achieve at least one count per million (cpm) in at least three samples:

> keep <- rowSums(cpm(y)>1) >= 3
> y <- y[keep,]
------

(http://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf)

Cheers,
Mark

----------
Prof. Dr. Mark Robinson
Statistical Bioinformatics, Institute of Molecular Life Sciences
University of Zurich
http://ow.ly/riRea

ADD REPLY • link 11.0 years ago Mark Robinson ▴ 880
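
To make Mark's rule concrete without needing edgeR installed, here is a small base-R sketch. The counts and the 20-million library size are made up, and `cpm_manual` is a hypothetical stand-in for edgeR's `cpm()`; unlike `cpm()` on a DGEList, it ignores normalization factors.

```r
# Hypothetical stand-in for edgeR::cpm(): counts per million given library sizes.
cpm_manual <- function(counts, lib_size) {
  t(t(counts) / lib_size) * 1e6
}

# Toy counts (made-up): 4 genes x 6 samples, two groups of three replicates.
counts <- matrix(c(  0,   2,   1,   0,   3,   1,    # below 1 CPM everywhere
                    30,  28,  35,   4,   6,   5,    # expressed in group A only
                    15,  25,  18,  12,  30,  19,    # above 1 CPM in only 2 samples
                   400, 380, 420, 390, 410, 405),   # clearly expressed
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:4),
                                 c("A1", "A2", "A3", "B1", "B2", "B3")))
lib_size <- rep(20e6, ncol(counts))   # hypothetical depth: 20 million reads each

min_cpm <- 1       # X in "CPM of at least X"
min_samples <- 3   # Y = size of the smallest replicate group

# Keep genes exceeding the CPM threshold in at least `min_samples` samples.
keep <- rowSums(cpm_manual(counts, lib_size) > min_cpm) >= min_samples
filtered <- counts[keep, , drop = FALSE]
rownames(filtered)
```

Note that `gene2` survives even though it is essentially silent in group B, which is the intent of using the smallest group size for Y: a gene expressed in only one condition is still kept.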


This is perhaps obvious to some, but I've seen colleagues surprised by it nonetheless: if each sample has been sequenced to a depth of ~20 million reads, then with cpm >= 1 you're effectively/approximately requiring raw counts >= 20; if your depth is 100 million reads, then you're requiring counts >= 100 (and presumably the whole reason you paid for 100 million reads was to get larger dynamic range at the low end, which you've just thrown away). That "1 cpm rule of thumb" seems to be pervasive, and is often used without thought to library size and dynamic range. We did want to try to be better than microarrays, right?

So, is there a disadvantage to filtering based on "raw count >= X (where X is 5, 10, etc.) in at least Y samples" rather than CPM? Or would you suggest in such cases still normalizing by read depth but lowering the threshold (e.g. cpm >= 1/(mean lib. size in millions))? I'm assuming non-pathological cases of fairly homogeneous library size per sample.

-Aaron

ADD REPLY • link 11.0 years ago Aaron Mackey ▴ 200
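
Aaron's arithmetic is easy to reproduce. The sketch below converts between a CPM threshold and the raw count it implies at a given depth; `cpm_to_count` and `count_to_cpm` are hypothetical helper names, and the 80-million depth is the one quoted in the original question.

```r
# Raw count implied by a CPM threshold at a given library size.
cpm_to_count <- function(cpm_threshold, lib_size) {
  cpm_threshold * lib_size / 1e6
}

# CPM threshold that corresponds to a given raw count at a given library size.
count_to_cpm <- function(count_threshold, lib_size) {
  count_threshold / (lib_size / 1e6)
}

depths <- c(20e6, 80e6, 100e6)
data.frame(lib_size = depths,
           count_at_1_cpm = cpm_to_count(1, depths))

# A raw-count cutoff of 10 at 80 million reads, expressed on the CPM scale.
count_to_cpm(10, 80e6)
```

At 80 million reads per sample, "cpm >= 1" therefore demands about 80 reads, whereas a raw-count cutoff of 10 corresponds to a CPM threshold of 0.125.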


Filtering on raw counts has a statistical motivation, i.e. something like "we can't do statistics with fewer than X reads". Filtering on CPM is sometimes just used as a proxy for count-based filtering, but sometimes it also has a biological motivation, i.e. "we believe that CPM < X represents biological noise transcription rather than genuine regulated transcription relevant to the biological system in question". So you have to consider what your goals are for filtering and choose an appropriate method.

-Ryan

ADD REPLY • link 11.0 years ago Ryan C. Thompson ★ 7.9k


Hi,

Even still, in the "biological motivation" case: if you want to use CPM, shouldn't you really prefer {R|F}PKM so you don't "enrich" for removal of lowly expressed short transcripts while letting lowly expressed long transcripts slip through?

-steve

--
Steve Lianoglou
Computational Biologist
Genentech

ADD REPLY • link 11.0 years ago Steve Lianoglou ★ 13k
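
Steve's length argument can be illustrated with two made-up genes that receive the same number of reads. The sketch below is base R only; `rpkm_manual` is a hypothetical helper (edgeR offers its own `rpkm()` function), and the gene lengths and library size are invented for the example.

```r
# RPKM = reads per kilobase of transcript per million mapped reads.
# Hypothetical helper: CPM divided by gene length in kilobases.
rpkm_manual <- function(counts, gene_length_bp, lib_size) {
  cpm_vals <- t(t(counts) / lib_size) * 1e6
  cpm_vals / (gene_length_bp / 1e3)
}

# Two made-up genes with identical read counts but very different lengths.
counts <- matrix(c(30, 30), nrow = 2,
                 dimnames = list(c("short_gene", "long_gene"), "s1"))
gene_length_bp <- c(short_gene = 500, long_gene = 10000)
lib_size <- 20e6                     # hypothetical depth: 20 million reads

cpm_vals <- t(t(counts) / lib_size) * 1e6
rpkm_vals <- rpkm_manual(counts, gene_length_bp, lib_size)

# Same CPM for both genes, but a 20-fold difference once length is accounted for.
cbind(cpm = cpm_vals[, "s1"], rpkm = rpkm_vals[, "s1"])
```

Both genes pass a "CPM > 1" filter here, yet per kilobase the long gene is expressed at a twentieth of the level of the short one, which is the imbalance a length-aware threshold would address.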


Yes, that is a good point that I forgot to mention. Thanks for correcting me.

-Ryan

ADD REPLY • link 11.0 years ago Ryan C. Thompson ★ 7.9k


Sorry, didn't mean to have that come across as a "correction" ... just wanted to add some more confusion (or clarity(?)) to the debate is all ;-)

-steve

--
Steve Lianoglou
Computational Biologist
Genentech

ADD REPLY • link 11.0 years ago Steve Lianoglou ★ 13k
