perspective on differential expression counting

perspective on differential expression counting - unmapped reads?

0

Entering edit mode

Jonathan ▴ 60

@jonathan-3868

Last seen 10.6 years ago

Hello, I'm interested in calculating differential expression from some paired RNAseq samples. I've used htseq-count after mapping; quite happy with how easy that was. My question is with regard to whether or not to trip the last five rows from htseq-count output. Those rows look like this: no_feature 152030 ambiguous 4876 too_low_aQual 0 not_aligned 0 alignment_not_unique 0 I can dream of reasons supporting either side of this question.. The number of unmapped or ambiguously-mapping reads do contribute to the total library size. However, I'm also interested in quantifying the difference between what's human in both samples, so intuition would tell me to remove those reads. Because the counts are big, this matters a great deal. I'm using EdgeR (again, very happy with that software), and the manual cites htseq- count as a viable methodology, but doesn't comment on their preferred treatment of the unmapped reads. My first (somewhat careless) utilization of EdgeR gave us results that appeared to make sense, but upon digging a little deeper, I noticed that this question affects the p-values quite a lot because the unmapped counts are so big. I would appreciate any comments/opinions! Thanks, Jonathan [[alternative HTML version deleted]]

edgeR edgeR • 1.8k views

ADD COMMENT • link updated 11.5 years ago by Devon Ryan ▴ 200 • written 11.5 years ago by Jonathan ▴ 60

0

Entering edit mode

Ryan C. Thompson ★ 7.9k

@ryan-c-thompson-5618

Last seen 6 months ago

Icahn School of Medicine at Mount Sinai…

I would not expect including including these counts to have a noticeable effect on the p-values of all the other genes, since edgeR does not normalize to the total counts, but rather uses TMM (unless you forgot to use calcNormFactors). On Thu Oct 31 10:30:54 2013, Jon BR wrote: > Hello, > I'm interested in calculating differential expression from some paired > RNAseq samples. > > I've used htseq-count after mapping; quite happy with how easy that was. > > My question is with regard to whether or not to trip the last five rows > from htseq-count output. > > Those rows look like this: > no_feature 152030 > ambiguous 4876 > too_low_aQual 0 > not_aligned 0 > alignment_not_unique 0 > > I can dream of reasons supporting either side of this question.. The number > of unmapped or ambiguously-mapping reads do contribute to the total library > size. However, I'm also interested in quantifying the difference between > what's human in both samples, so intuition would tell me to remove those > reads. > > Because the counts are big, this matters a great deal. I'm using EdgeR > (again, very happy with that software), and the manual cites htseq- count as > a viable methodology, but doesn't comment on their preferred treatment of > the unmapped reads. > > My first (somewhat careless) utilization of EdgeR gave us results that > appeared to make sense, but upon digging a little deeper, I noticed that > this question affects the p-values quite a lot because the unmapped counts > are so big. > > I would appreciate any comments/opinions! > > Thanks, > Jonathan > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 11.5 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

Devon Ryan ▴ 200

@devon-ryan-6054

Last seen 9.3 years ago

Germany

You can just remove those lines (in fact, that's what DESeq2 does internally), they'll just needlessly increase the number of tests performed. ____________________________________________ Devon Ryan, Ph.D. Email: dpryan at dpryan.com Tel: +49 (0)178 298-6067 Molecular and Cellular Cognition Lab German Centre for Neurodegenerative Diseases (DZNE) Ludwig-Erhard-Allee 2 53175 Bonn, Germany On Oct 31, 2013, at 6:30 PM, Jon BR wrote: > Hello, > I'm interested in calculating differential expression from some paired > RNAseq samples. > > I've used htseq-count after mapping; quite happy with how easy that was. > > My question is with regard to whether or not to trip the last five rows > from htseq-count output. > > Those rows look like this: > no_feature 152030 > ambiguous 4876 > too_low_aQual 0 > not_aligned 0 > alignment_not_unique 0 > > I can dream of reasons supporting either side of this question.. The number > of unmapped or ambiguously-mapping reads do contribute to the total library > size. However, I'm also interested in quantifying the difference between > what's human in both samples, so intuition would tell me to remove those > reads. > > Because the counts are big, this matters a great deal. I'm using EdgeR > (again, very happy with that software), and the manual cites htseq- count as > a viable methodology, but doesn't comment on their preferred treatment of > the unmapped reads. > > My first (somewhat careless) utilization of EdgeR gave us results that > appeared to make sense, but upon digging a little deeper, I noticed that > this question affects the p-values quite a lot because the unmapped counts > are so big. > > I would appreciate any comments/opinions! > > Thanks, > Jonathan > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 11.5 years ago Devon Ryan ▴ 200

0

Entering edit mode

Hi, Ryan, thanks for your thoughts! I noticed in the manual the calcNormFactors step, and I've used it when applying EdgeR to detect genes di erentially expressed between tumor and normal tissue, adjusting for any di fferences between the patients (as in the oral carcinoma case study). That analysis provides a single p-value, looking for the genes most up, or down as a result of the disease across patients. Now I'm trying to look for differences on a patient-by-patient basis. For example, we have a handful of genes that we think can help explain some of the variability between patients (they all have the same disease, but they are heterogeneous, and we have some gene-specific hypotheses). We'd like to query whether expression changes coincide with some other phenotypes, looking patient-by-patient. In other words, I want to answer: Is gene x differentially expressed in patient y? What about z? Is there a gene that's up in y but down in z? I realize that there are no replicates, as we have one tumor and one normal for each patient, so we'll need to be careful drawing any major conclusions... but I'd still like something, probably I can feel OK about the most dramatically different genes in a single patient. To this end, my reading of the manual pointed me towards simply using the binomTest function, which reads in a vector of counts for each of two samples, and produces a vector of p-values, one for each gene. Under that scenario, I notice big (order of magnitude) differences of p-values depending on whether I kept the unmapped reads or filtered them out of the counts vectors. If anyone has an alternative/better idea for a way to handle this, I would love to hear it! Presuming the binomial test is the best way to go, my current opinion is to remove the unmapped/ambiguously mapped before passing the counts vectors in. Any thoughts welcome! (does it make sense?) Thanks, Jonathan On Thu, Oct 31, 2013 at 1:55 PM, Devon Ryan <dpryan@dpryan.com> wrote: > You can just remove those lines (in fact, that's what DESeq2 does > internally), they'll just needlessly increase the number of tests performed. > > ____________________________________________ > Devon Ryan, Ph.D. > Email: dpryan@dpryan.com > Tel: +49 (0)178 298-6067 > Molecular and Cellular Cognition Lab > German Centre for Neurodegenerative Diseases (DZNE) > Ludwig-Erhard-Allee 2 > 53175 Bonn, Germany > > On Oct 31, 2013, at 6:30 PM, Jon BR wrote: > > > Hello, > > I'm interested in calculating differential expression from some paired > > RNAseq samples. > > > > I've used htseq-count after mapping; quite happy with how easy that was. > > > > My question is with regard to whether or not to trip the last five rows > > from htseq-count output. > > > > Those rows look like this: > > no_feature 152030 > > ambiguous 4876 > > too_low_aQual 0 > > not_aligned 0 > > alignment_not_unique 0 > > > > I can dream of reasons supporting either side of this question.. The > number > > of unmapped or ambiguously-mapping reads do contribute to the total > library > > size. However, I'm also interested in quantifying the difference between > > what's human in both samples, so intuition would tell me to remove those > > reads. > > > > Because the counts are big, this matters a great deal. I'm using EdgeR > > (again, very happy with that software), and the manual cites htseq-count > as > > a viable methodology, but doesn't comment on their preferred treatment > of > > the unmapped reads. > > > > My first (somewhat careless) utilization of EdgeR gave us results that > > appeared to make sense, but upon digging a little deeper, I noticed that > > this question affects the p-values quite a lot because the unmapped > counts > > are so big. > > > > I would appreciate any comments/opinions! > > > > Thanks, > > Jonathan > > > > [[alternative HTML version deleted]] > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > [[alternative HTML version deleted]]

ADD REPLY • link 11.5 years ago Jonathan ▴ 60

0

Entering edit mode

Hi Jonathan, I believe that even with the binomTest function, you could still use calcNormFactors and feed in the normalized library sizes (= library size * normfactor) for the n1 and n2 arguments to binomTest. However, it is certainly correct to exclude those special counts from binomTest in any case. Remember that ninomTest was written for SAGE data, where there is no such thing as "unmapped reads" or reads that don't map to a feature. For what it's worth, I always take the (un-normalized) library size to be the sum of reads that were counted toward a feature. -Ryan On Thu 31 Oct 2013 11:59:36 AM PDT, Jon BR wrote: > Hi, > Ryan, thanks for your thoughts! > > I noticed in the manual the calcNormFactors step, and I've used it when > applying EdgeR to detect genes di erentially expressed between tumor and > normal tissue, adjusting for any di fferences between the patients (as in > the oral carcinoma case study). > > That analysis provides a single p-value, looking for the genes most up, or > down as a result of the disease across patients. > > Now I'm trying to look for differences on a patient-by-patient basis. For > example, we have a handful of genes that we think can help explain some of > the variability between patients (they all have the same disease, but they > are heterogeneous, and we have some gene-specific hypotheses). We'd like > to query whether expression changes coincide with some other phenotypes, > looking patient-by-patient. > > In other words, I want to answer: Is gene x differentially expressed in > patient y? What about z? Is there a gene that's up in y but down in z? > > I realize that there are no replicates, as we have one tumor and one normal > for each patient, so we'll need to be careful drawing any major > conclusions... but I'd still like something, probably I can feel OK about > the most dramatically different genes in a single patient. > > To this end, my reading of the manual pointed me towards simply using > the binomTest function, which reads in a vector of counts for each of two > samples, and produces a vector of p-values, one for each gene. Under that > scenario, I notice big (order of magnitude) differences of p-values > depending on whether I kept the unmapped reads or filtered them out of the > counts vectors. > > > If anyone has an alternative/better idea for a way to handle this, I would > love to hear it! > > Presuming the binomial test is the best way to go, my current opinion is to > remove the unmapped/ambiguously mapped before passing the counts vectors in. > > Any thoughts welcome! (does it make sense?) > > Thanks, > Jonathan > > > > > On Thu, Oct 31, 2013 at 1:55 PM, Devon Ryan <dpryan at="" dpryan.com=""> wrote: > >> You can just remove those lines (in fact, that's what DESeq2 does >> internally), they'll just needlessly increase the number of tests performed. >> >> ____________________________________________ >> Devon Ryan, Ph.D. >> Email: dpryan at dpryan.com >> Tel: +49 (0)178 298-6067 >> Molecular and Cellular Cognition Lab >> German Centre for Neurodegenerative Diseases (DZNE) >> Ludwig-Erhard-Allee 2 >> 53175 Bonn, Germany >> >> On Oct 31, 2013, at 6:30 PM, Jon BR wrote: >> >>> Hello, >>> I'm interested in calculating differential expression from some paired >>> RNAseq samples. >>> >>> I've used htseq-count after mapping; quite happy with how easy that was. >>> >>> My question is with regard to whether or not to trip the last five rows >>> from htseq-count output. >>> >>> Those rows look like this: >>> no_feature 152030 >>> ambiguous 4876 >>> too_low_aQual 0 >>> not_aligned 0 >>> alignment_not_unique 0 >>> >>> I can dream of reasons supporting either side of this question.. The >> number >>> of unmapped or ambiguously-mapping reads do contribute to the total >> library >>> size. However, I'm also interested in quantifying the difference between >>> what's human in both samples, so intuition would tell me to remove those >>> reads. >>> >>> Because the counts are big, this matters a great deal. I'm using EdgeR >>> (again, very happy with that software), and the manual cites htseq-count >> as >>> a viable methodology, but doesn't comment on their preferred treatment >> of >>> the unmapped reads. >>> >>> My first (somewhat careless) utilization of EdgeR gave us results that >>> appeared to make sense, but upon digging a little deeper, I noticed that >>> this question affects the p-values quite a lot because the unmapped >> counts >>> are so big. >>> >>> I would appreciate any comments/opinions! >>> >>> Thanks, >>> Jonathan >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 11.5 years ago Ryan C. Thompson ★ 7.9k

Login before adding your answer.