Using GenomicAlignments to get Exon-centric Counts

0

Entering edit mode

Fong Chun Chan ▴ 320

@fong-chun-chan-5706

Last seen 10.6 years ago

Hi, Was wondering if anyone had any experience working with the GenomicAlignments R package when trying to retrieve per-exon count data? Specifically, whether these is rationale for using readGAlignmentsPairs() over readGAlignments() when using the summarizeOverlaps() function? From my understanding, the readGAlignmentsPairs() function will traverse the input bam file and return each read pair as one GAlignment. Looking at the summarizeOverlaps() function document, it states that: "paired-end reads : Paired-end reads are counted the same as single- end reads with a junction (N operation in the CIGAR); each pair registers as a single hit. Paired-end records can be counted in a GAlignmentPairs container or BAM file." If the paired-end reads are counted as single-end reads, does this not suggest that using the readGAlignments() and readGAlignmentsPairs() returns you somewhat similar results? Another question I have is if a single-read is actually split between two exons and the exons features are stored as a GRanges (and NOT a GRangesList) object then the default summarizeOverlaps() function would technically discard this read if I understand correctly? Since the default mode is Union and the read overlaps with both features. In fact, wouldn't all the modes actually discard the read? If so, is it correct to use inter.feature = FALSE, to include split-reads into the equationb? Thanks for any advice on this. sessionInfo() below: > sessionInfo() R version 3.1.0 (2014-04-10) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: [1] argparse_1.0.1 proto_0.3-10 GenomicAlignments_1.0.1 BSgenome_1.32.0 Rsamtools_1.16.1 Biostrings_2.32.0 XVector_0.4.0 GenomicFeatures_1.16.2 AnnotationDbi_1.26.0 Biobase_2.24.0 GenomicRanges_1.16.3 GenomeInfoDb_1.0.2 IRanges_1.22.9 BiocGenerics_0.10.0 vimcom.plus_0.9-93 [16] setwidth_1.0-3 colorout_1.0-2 loaded via a namespace (and not attached): [1] BatchJobs_1.2 BBmisc_1.7 BiocParallel_0.6.1 biomaRt_2.20.0 bitops_1.0-6 brew_1.0-6 checkmate_1.0 codetools_0.2-8 DBI_0.2-7 digest_0.6.4 fail_1.2 findpython_1.0.1 foreach_1.4.2 getopt_1.20.0 iterators_1.0.7 plyr_1.8.1 Rcpp_0.11.2 RCurl_1.95-4.1 [19] rjson_0.2.14 RSQLite_0.11.4 rtracklayer_1.24.2 sendmailR_1.1-2 stats4_3.1.0 stringr_0.6.2 tools_3.1.0 XML_3.98-1.1 zlibbioc_1.10.0 [[alternative HTML version deleted]]

• 1.4k views

ADD COMMENT • link updated 10.8 years ago by Valerie Obenchain ★ 6.8k • written 10.8 years ago by Fong Chun Chan ▴ 320

0

Entering edit mode

Valerie Obenchain ★ 6.8k

@valerie-obenchain-4275

Last seen 3.3 years ago

United States

Hi Fong, readGAlignments() is for single-end data, readGAlignmentPairs() and readGAlignmentsList() are for paired-end. If you read paired-end data with readGAlignments() and perform counting, a read that overlaps both pair parts will be counted twice (once for each part). This is probably not what you want. More below. On 06/30/2014 03:19 PM, Fong Chun Chan wrote: > Hi, > > Was wondering if anyone had any experience working with the > GenomicAlignments R package when trying to retrieve per-exon count data? > Specifically, whether these is rationale for using readGAlignmentsPairs() > over readGAlignments() when using the summarizeOverlaps() function? From my > understanding, the readGAlignmentsPairs() function will traverse the input > bam file and return each read pair as one GAlignment. Data read with readGAlignmentPairs() is put in a GAlignmentPairs class. Counting operations on a GAlignmentPairs class are aware that each pair should be counted as one record. If a range overlaps one or both portions of the pair it will be counted as one hit. The same paired-end data read in with readGAlignments() is put in a GAlignments class. Counting operations on this class treat each record/row individually so you will get more counts than with the GAlignmentPairs class (you would get exactly double the counts if each read overlapped both parts of the pair). Looking at the > summarizeOverlaps() function document, it states that: > > "paired-end reads : Paired-end reads are counted the same as single- end > reads with a junction (N operation in the CIGAR); each pair registers as a > single hit. Paired-end records can be counted in a GAlignmentPairs > container or BAM file." This statement is confusing and should be removed. Thanks for pointing it out. I'll update the docs. summarizeOverlaps() uses different read functions depending on what arguments are used to call it. Here is a summary. - readGAlignments() when 'sigleEnd=TRUE' - readGAlignmentPairs() when 'singleEnd=FALSE' AND 'fragments=FALSE' - readGAlignmentsList() when 'singleEnd=FALSE' AND 'fragments=TRUE' If you have paired-end data you should be setting 'singleEnd=FALSE' in summarizeOverlaps(). If you are interested in all reads, not just those that are paired according to the algorithem, use 'fragments=TRUE'. The ?readGAlignmentsList man page describes the pairing algorithm and how to use the 'mate_status' metadata column to investigate reads of different pairing status. > > If the paired-end reads are counted as single-end reads, does this not > suggest that using the readGAlignments() and readGAlignmentsPairs() returns > you somewhat similar results? From above: paired-end are not counted as single-end. > > Another question I have is if a single-read is actually split between two > exons and the exons features are stored as a GRanges (and NOT a > GRangesList) object then the default summarizeOverlaps() function would > technically discard this read if I understand correctly? Since the default > mode is Union and the read overlaps with both features. In fact, wouldn't > all the modes actually discard the read? If so, is it correct to use > inter.feature = FALSE, to include split-reads into the equationb? The default is 'Union' but you can change that to 'IntersectionStrict' or 'IntersectionNotEmpty'. It is possible that none of the modes will be able to 'make a decision' about what feature the read should be assigned to. There is a graphic in the vignette that shows what types of overlaps are handled. browseVignettes('GenomicAlignments') Using 'inter.feature=FALSE' will allow double counting; the spanning read will be counted once for each feature it overlaps. See the 'Counting multi-hit reads' example section at the bottom of the summarizeOverlaps man page. Valerie > > Thanks for any advice on this. sessionInfo() below: > >> sessionInfo() > R version 3.1.0 (2014-04-10) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 > LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C > LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods > base > > other attached packages: > [1] argparse_1.0.1 proto_0.3-10 > GenomicAlignments_1.0.1 BSgenome_1.32.0 Rsamtools_1.16.1 > Biostrings_2.32.0 XVector_0.4.0 GenomicFeatures_1.16.2 > AnnotationDbi_1.26.0 Biobase_2.24.0 GenomicRanges_1.16.3 > GenomeInfoDb_1.0.2 IRanges_1.22.9 BiocGenerics_0.10.0 > vimcom.plus_0.9-93 > [16] setwidth_1.0-3 colorout_1.0-2 > > loaded via a namespace (and not attached): > [1] BatchJobs_1.2 BBmisc_1.7 BiocParallel_0.6.1 > biomaRt_2.20.0 bitops_1.0-6 brew_1.0-6 checkmate_1.0 > codetools_0.2-8 DBI_0.2-7 digest_0.6.4 fail_1.2 > findpython_1.0.1 foreach_1.4.2 getopt_1.20.0 iterators_1.0.7 > plyr_1.8.1 Rcpp_0.11.2 RCurl_1.95-4.1 > [19] rjson_0.2.14 RSQLite_0.11.4 rtracklayer_1.24.2 > sendmailR_1.1-2 stats4_3.1.0 stringr_0.6.2 tools_3.1.0 > XML_3.98-1.1 zlibbioc_1.10.0 > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Valerie Obenchain Program in Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, Seattle, WA 98109 Email: vobencha at fhcrc.org Phone: (206) 667-3158

ADD COMMENT • link 10.8 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Hi Valerie, Thanks for your reply. After re-reading that paired-end statement, I understand what you meant now and was definitely confused because I thought it was being treated as two single-end reads which would have defeated the purpose of using reading it in as a GAlignmentPairs object. Thanks for the clarification on the inter.feature = FALSE parameter. That is exactly what I want actually. I wanted split reads across two exons should be counted as contributing to each exon. Even more importantly are paired-reads where the other pair is mapped to another distinct exon needs to be included also as contributing one to each exon. The default counting mode settings are likely more applicable to when working on gene-centric models where you have GRangeList objects with each exon as a GRange object (hopefully I am correct in my understanding of that) and in those situations you probably don't want to double count. On Mon, Jun 30, 2014 at 4:29 PM, Valerie Obenchain <vobencha@fhcrc.org> wrote: > Hi Fong, > > readGAlignments() is for single-end data, readGAlignmentPairs() and > readGAlignmentsList() are for paired-end. If you read paired-end data with > readGAlignments() and perform counting, a read that overlaps both pair > parts will be counted twice (once for each part). This is probably not what > you want. > > More below. > > > On 06/30/2014 03:19 PM, Fong Chun Chan wrote: > >> Hi, >> >> Was wondering if anyone had any experience working with the >> GenomicAlignments R package when trying to retrieve per-exon count data? >> Specifically, whether these is rationale for using readGAlignmentsPairs() >> over readGAlignments() when using the summarizeOverlaps() function? From >> my >> understanding, the readGAlignmentsPairs() function will traverse the input >> bam file and return each read pair as one GAlignment. >> > > Data read with readGAlignmentPairs() is put in a GAlignmentPairs class. > Counting operations on a GAlignmentPairs class are aware that each pair > should be counted as one record. If a range overlaps one or both portions > of the pair it will be counted as one hit. > > The same paired-end data read in with readGAlignments() is put in a > GAlignments class. Counting operations on this class treat each record/row > individually so you will get more counts than with the GAlignmentPairs > class (you would get exactly double the counts if each read overlapped both > parts of the pair). > > > Looking at the > >> summarizeOverlaps() function document, it states that: >> >> "paired-end reads : Paired-end reads are counted the same as single-end >> reads with a junction (N operation in the CIGAR); each pair registers as a >> single hit. Paired-end records can be counted in a GAlignmentPairs >> container or BAM file." >> > > This statement is confusing and should be removed. Thanks for pointing it > out. I'll update the docs. > > summarizeOverlaps() uses different read functions depending on what > arguments are used to call it. Here is a summary. > > - readGAlignments() when 'sigleEnd=TRUE' > - readGAlignmentPairs() when 'singleEnd=FALSE' AND 'fragments=FALSE' > - readGAlignmentsList() when 'singleEnd=FALSE' AND 'fragments=TRUE' > > If you have paired-end data you should be setting 'singleEnd=FALSE' in > summarizeOverlaps(). If you are interested in all reads, not just those > that are paired according to the algorithem, use 'fragments=TRUE'. The > ?readGAlignmentsList man page describes the pairing algorithm and how to > use the 'mate_status' metadata column to investigate reads of different > pairing status. > > > >> If the paired-end reads are counted as single-end reads, does this not >> suggest that using the readGAlignments() and readGAlignmentsPairs() >> returns >> you somewhat similar results? >> > > From above: paired-end are not counted as single-end. > > > >> Another question I have is if a single-read is actually split between two >> exons and the exons features are stored as a GRanges (and NOT a >> GRangesList) object then the default summarizeOverlaps() function would >> technically discard this read if I understand correctly? Since the default >> mode is Union and the read overlaps with both features. In fact, wouldn't >> all the modes actually discard the read? If so, is it correct to use >> inter.feature = FALSE, to include split-reads into the equationb? >> > > The default is 'Union' but you can change that to 'IntersectionStrict' or > 'IntersectionNotEmpty'. It is possible that none of the modes will be able > to 'make a decision' about what feature the read should be assigned to. > There is a graphic in the vignette that shows what types of overlaps are > handled. > > browseVignettes('GenomicAlignments') > > Using 'inter.feature=FALSE' will allow double counting; the spanning read > will be counted once for each feature it overlaps. See the 'Counting > multi-hit reads' example section at the bottom of the summarizeOverlaps man > page. > > > > Valerie > > >> Thanks for any advice on this. sessionInfo() below: >> >> sessionInfo() >>> >> R version 3.1.0 (2014-04-10) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 >> LC_NAME=C LC_ADDRESS=C >> LC_TELEPHONE=C >> LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] parallel stats graphics grDevices utils datasets methods >> base >> >> other attached packages: >> [1] argparse_1.0.1 proto_0.3-10 >> GenomicAlignments_1.0.1 BSgenome_1.32.0 Rsamtools_1.16.1 >> Biostrings_2.32.0 XVector_0.4.0 GenomicFeatures_1.16.2 >> AnnotationDbi_1.26.0 Biobase_2.24.0 GenomicRanges_1.16.3 >> GenomeInfoDb_1.0.2 IRanges_1.22.9 BiocGenerics_0.10.0 >> vimcom.plus_0.9-93 >> [16] setwidth_1.0-3 colorout_1.0-2 >> >> loaded via a namespace (and not attached): >> [1] BatchJobs_1.2 BBmisc_1.7 BiocParallel_0.6.1 >> biomaRt_2.20.0 bitops_1.0-6 brew_1.0-6 checkmate_1.0 >> codetools_0.2-8 DBI_0.2-7 digest_0.6.4 fail_1.2 >> findpython_1.0.1 foreach_1.4.2 getopt_1.20.0 >> iterators_1.0.7 >> plyr_1.8.1 Rcpp_0.11.2 RCurl_1.95-4.1 >> [19] rjson_0.2.14 RSQLite_0.11.4 rtracklayer_1.24.2 >> sendmailR_1.1-2 stats4_3.1.0 stringr_0.6.2 tools_3.1.0 >> XML_3.98-1.1 zlibbioc_1.10.0 >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane. >> science.biology.informatics.conductor >> >> > > -- > Valerie Obenchain > Program in Computational Biology > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, Seattle, WA 98109 > > Email: vobencha@fhcrc.org > Phone: (206) 667-3158 > [[alternative HTML version deleted]]

ADD REPLY • link 10.8 years ago Fong Chun Chan ▴ 320

0

Entering edit mode

Hi, On 06/30/2014 04:51 PM, Fong Chun Chan wrote: > Hi Valerie, > > Thanks for your reply. After re-reading that paired-end statement, I > understand what you meant now and was definitely confused because I > thought it was being treated as two single-end reads which would have > defeated the purpose of using reading it in as a GAlignmentPairs object. > > Thanks for the clarification on the inter.feature = FALSE parameter. > That is exactly what I want actually. I wanted split reads across two > exons should be counted as contributing to each exon. Even more > importantly are paired-reads where the other pair is mapped to another > distinct exon needs to be included also as contributing one to each exon. Great! I'm glad it fits that niche. > > The default counting mode settings are likely more applicable to when > working on gene-centric models where you have GRangeList objects with > each exon as a GRange object (hopefully I am correct in my understanding > of that) and in those situations you probably don't want to double count. Yes, the modes w/ default settings were intended for gene-based models, pattered after Simon Anders' HTSeq. 'inter.feature' was added later by request and has extended the functionality to situations like you describe. Valerie > > > > > On Mon, Jun 30, 2014 at 4:29 PM, Valerie Obenchain <vobencha at="" fhcrc.org=""> <mailto:vobencha at="" fhcrc.org="">> wrote: > > Hi Fong, > > readGAlignments() is for single-end data, readGAlignmentPairs() and > readGAlignmentsList() are for paired-end. If you read paired-end > data with readGAlignments() and perform counting, a read that > overlaps both pair parts will be counted twice (once for each part). > This is probably not what you want. > > More below. > > > On 06/30/2014 03:19 PM, Fong Chun Chan wrote: > > Hi, > > Was wondering if anyone had any experience working with the > GenomicAlignments R package when trying to retrieve per-exon > count data? > Specifically, whether these is rationale for using > readGAlignmentsPairs() > over readGAlignments() when using the summarizeOverlaps() > function? From my > understanding, the readGAlignmentsPairs() function will traverse > the input > bam file and return each read pair as one GAlignment. > > > Data read with readGAlignmentPairs() is put in a GAlignmentPairs > class. Counting operations on a GAlignmentPairs class are aware that > each pair should be counted as one record. If a range overlaps one > or both portions of the pair it will be counted as one hit. > > The same paired-end data read in with readGAlignments() is put in a > GAlignments class. Counting operations on this class treat each > record/row individually so you will get more counts than with the > GAlignmentPairs class (you would get exactly double the counts if > each read overlapped both parts of the pair). > > > Looking at the > > summarizeOverlaps() function document, it states that: > > "paired-end reads : Paired-end reads are counted the same as > single-end > reads with a junction (N operation in the CIGAR); each pair > registers as a > single hit. Paired-end records can be counted in a GAlignmentPairs > container or BAM file." > > > This statement is confusing and should be removed. Thanks for > pointing it out. I'll update the docs. > > summarizeOverlaps() uses different read functions depending on what > arguments are used to call it. Here is a summary. > > - readGAlignments() when 'sigleEnd=TRUE' > - readGAlignmentPairs() when 'singleEnd=FALSE' AND 'fragments=FALSE' > - readGAlignmentsList() when 'singleEnd=FALSE' AND 'fragments=TRUE' > > If you have paired-end data you should be setting 'singleEnd=FALSE' > in summarizeOverlaps(). If you are interested in all reads, not just > those that are paired according to the algorithem, use > 'fragments=TRUE'. The ?readGAlignmentsList man page describes the > pairing algorithm and how to use the 'mate_status' metadata column > to investigate reads of different pairing status. > > > > If the paired-end reads are counted as single-end reads, does > this not > suggest that using the readGAlignments() and > readGAlignmentsPairs() returns > you somewhat similar results? > > > From above: paired-end are not counted as single-end. > > > > Another question I have is if a single-read is actually split > between two > exons and the exons features are stored as a GRanges (and NOT a > GRangesList) object then the default summarizeOverlaps() > function would > technically discard this read if I understand correctly? Since > the default > mode is Union and the read overlaps with both features. In fact, > wouldn't > all the modes actually discard the read? If so, is it correct to use > inter.feature = FALSE, to include split-reads into the equationb? > > > The default is 'Union' but you can change that to > 'IntersectionStrict' or 'IntersectionNotEmpty'. It is possible that > none of the modes will be able to 'make a decision' about what > feature the read should be assigned to. There is a graphic in the > vignette that shows what types of overlaps are handled. > > browseVignettes('__GenomicAlignments') > > Using 'inter.feature=FALSE' will allow double counting; the spanning > read will be counted once for each feature it overlaps. See the > 'Counting multi-hit reads' example section at the bottom of the > summarizeOverlaps man page. > > > > Valerie > > > Thanks for any advice on this. sessionInfo() below: > > sessionInfo() > > R version 3.1.0 (2014-04-10) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > LC_PAPER=en_US.UTF-8 > LC_NAME=C LC_ADDRESS=C > LC_TELEPHONE=C > LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats graphics grDevices utils datasets > methods > base > > other attached packages: > [1] argparse_1.0.1 proto_0.3-10 > GenomicAlignments_1.0.1 BSgenome_1.32.0 Rsamtools_1.16.1 > Biostrings_2.32.0 XVector_0.4.0 > GenomicFeatures_1.16.2 > AnnotationDbi_1.26.0 Biobase_2.24.0 > GenomicRanges_1.16.3 > GenomeInfoDb_1.0.2 IRanges_1.22.9 > BiocGenerics_0.10.0 > vimcom.plus_0.9-93 > [16] setwidth_1.0-3 colorout_1.0-2 > > loaded via a namespace (and not attached): > [1] BatchJobs_1.2 BBmisc_1.7 BiocParallel_0.6.1 > biomaRt_2.20.0 bitops_1.0-6 brew_1.0-6 > checkmate_1.0 > codetools_0.2-8 DBI_0.2-7 digest_0.6.4 fail_1.2 > findpython_1.0.1 foreach_1.4.2 getopt_1.20.0 > iterators_1.0.7 > plyr_1.8.1 Rcpp_0.11.2 RCurl_1.95-4.1 > [19] rjson_0.2.14 RSQLite_0.11.4 rtracklayer_1.24.2 > sendmailR_1.1-2 stats4_3.1.0 stringr_0.6.2 tools_3.1.0 > XML_3.98-1.1 zlibbioc_1.10.0 > > [[alternative HTML version deleted]] > > _________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/__listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: > http://news.gmane.org/gmane.__science.biology.informatics.__conductor > <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> > > > > -- > Valerie Obenchain > Program in Computational Biology > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, Seattle, WA 98109 > > Email: vobencha at fhcrc.org <mailto:vobencha at="" fhcrc.org=""> > Phone: (206) 667-3158 <tel:%28206%29%20667-3158> > >

ADD REPLY • link 10.8 years ago Valerie Obenchain ★ 6.8k

Login before adding your answer.