Question

summarizeOverlaps question

0

Entering edit mode

James W. MacDonald 67k

@james-w-macdonald-5106

Last seen 1 day ago

United States

I am getting results from summarizeOverlaps(<grangeslist>, <bamfilelist>) that I don't understand. Here is the way I normally use the function: library(Rsamtools) library(TxDb.Mmusculus.UCSC.mm9.knownGene) ex <- exonsBy(TxDb.Mmusculus.UCSC.mm9.knownGene, use.names = TRUE) samps_casava <- BamFileList(<path to="" bamfiles="">) olaps <- summarizeOverlaps(ex, samps_casava, mode = "IntersectionNotEmpty") and then as an example, check the counts for uc007amp.2 > assays(olaps)$counts["uc007amp.2",, drop=F] GG1_ATCACG_L002.uniques.sorted.bam uc007amp.2 1 GG2_TTAGGC_L002.uniques.sorted.bam uc007amp.2 1 GG3_ACTTGA_L002.uniques.sorted.bam uc007amp.2 0 GG4_GATCAG_L002.uniques.sorted.bam uc007amp.2 0 GG5_TAGCTT_L003.uniques.sorted.bam uc007amp.2 0 GG6_GGCTAC_L003.uniques.sorted.bam uc007amp.2 1 GG7_GTGGCC_L003.uniques.sorted.bam uc007amp.2 0 GG8_GTTTCG_L003.uniques.sorted.bam uc007amp.2 0 GG9_CGTACG_L004.uniques.sorted.bam uc007amp.2 0 GG10_GAGTGG_L004.uniques.sorted.bam uc007amp.2 0 GG11_ACTGAT_L004.uniques.sorted.bam uc007amp.2 1 GG12_ATTCCT_L004.uniques.sorted.bam uc007amp.2 3 However, this doesn't make any sense to me, as there are actually tons of reads that overlap this transcript (I know this from looking at the transcript and these data using IGV). So let's just summarize the overlaps for that transcript alone. > ex2 <- ex["uc007amp.2"] > olaps2 <- summarizeOverlaps(ex2, samps_casava, mode = "IntersectionNotEmpty") > assays(olaps2)$counts GG1_ATCACG_L002.uniques.sorted.bam uc007amp.2 45 GG2_TTAGGC_L002.uniques.sorted.bam uc007amp.2 90 GG3_ACTTGA_L002.uniques.sorted.bam uc007amp.2 70 GG4_GATCAG_L002.uniques.sorted.bam uc007amp.2 64 GG5_TAGCTT_L003.uniques.sorted.bam uc007amp.2 111 GG6_GGCTAC_L003.uniques.sorted.bam uc007amp.2 26 GG7_GTGGCC_L003.uniques.sorted.bam uc007amp.2 34 GG8_GTTTCG_L003.uniques.sorted.bam uc007amp.2 36 GG9_CGTACG_L004.uniques.sorted.bam uc007amp.2 98 GG10_GAGTGG_L004.uniques.sorted.bam uc007amp.2 153 GG11_ACTGAT_L004.uniques.sorted.bam uc007amp.2 164 GG12_ATTCCT_L004.uniques.sorted.bam uc007amp.2 174 Any ideas where I am going wrong? Best, Jim -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099

• 747 views

ADD COMMENT • link updated 12.1 years ago by Valerie Obenchain ★ 6.8k • written 12.1 years ago by James W. MacDonald 67k

score 0 · Answer 1 · 2012-08-29

0

Entering edit mode

Valerie Obenchain ★ 6.8k

@valerie-obenchain-4275

Last seen 2.7 years ago

United States

Hi Jim, I think what you're seeing is due to how the annotation is provided. In the first example there are many transcripts the reads can hit and in the second, there is only one. When a GRangesList is given as 'subject' each outer list element is considered the feature. So in the first example we have 55419 features, > library(TxDb.Mmusculus.UCSC.mm9.knownGene) > ex <- exonsBy(TxDb.Mmusculus.UCSC.mm9.knownGene, use.names = TRUE) > class(ex) [1] "GRangesList" attr(,"package") [1] "GenomicRanges" > length(ex) [1] 55419 In the second we have one, > ex2 <- ex["uc007amp.2"] > class(ex2) [1] "GRangesList" attr(,"package") [1] "GenomicRanges" > length(ex2) [1] 1 summarizeOverlaps() will either discard reads that hit >1 feature ('Union') or it will make a decision about how to assign the read to a single feature if possible ('IntersectionStrict' or 'IntersectionNotEmpty'). I believe in the first case many of your reads hit > 1 transcript and cannot be resolved using 'IntersectionNotEmpty'. When a read can't be resolved it is dropped. In the second case no reads should be dropped due to multi-hits because there is only 1 possible feature to hit. Let me know if this does not clear things up and maybe I can take a closer look at your data. Valerie On 08/29/2012 09:39 AM, James W. MacDonald wrote: > I am getting results from summarizeOverlaps(<grangeslist>, > <bamfilelist>) that I don't understand. Here is the way I normally use > the function: > > library(Rsamtools) > library(TxDb.Mmusculus.UCSC.mm9.knownGene) > ex <- exonsBy(TxDb.Mmusculus.UCSC.mm9.knownGene, use.names = TRUE) > samps_casava <- BamFileList(<path to="" bamfiles="">) > olaps <- summarizeOverlaps(ex, samps_casava, mode = > "IntersectionNotEmpty") > > and then as an example, check the counts for uc007amp.2 > > > assays(olaps)$counts["uc007amp.2",, drop=F] > GG1_ATCACG_L002.uniques.sorted.bam > uc007amp.2 1 > GG2_TTAGGC_L002.uniques.sorted.bam > uc007amp.2 1 > GG3_ACTTGA_L002.uniques.sorted.bam > uc007amp.2 0 > GG4_GATCAG_L002.uniques.sorted.bam > uc007amp.2 0 > GG5_TAGCTT_L003.uniques.sorted.bam > uc007amp.2 0 > GG6_GGCTAC_L003.uniques.sorted.bam > uc007amp.2 1 > GG7_GTGGCC_L003.uniques.sorted.bam > uc007amp.2 0 > GG8_GTTTCG_L003.uniques.sorted.bam > uc007amp.2 0 > GG9_CGTACG_L004.uniques.sorted.bam > uc007amp.2 0 > GG10_GAGTGG_L004.uniques.sorted.bam > uc007amp.2 0 > GG11_ACTGAT_L004.uniques.sorted.bam > uc007amp.2 1 > GG12_ATTCCT_L004.uniques.sorted.bam > uc007amp.2 3 > > However, this doesn't make any sense to me, as there are actually tons > of reads that overlap this transcript (I know this from looking at the > transcript and these data using IGV). So let's just summarize the > overlaps for that transcript alone. > > > ex2 <- ex["uc007amp.2"] > > olaps2 <- summarizeOverlaps(ex2, samps_casava, mode = > "IntersectionNotEmpty") > > assays(olaps2)$counts > GG1_ATCACG_L002.uniques.sorted.bam > uc007amp.2 45 > GG2_TTAGGC_L002.uniques.sorted.bam > uc007amp.2 90 > GG3_ACTTGA_L002.uniques.sorted.bam > uc007amp.2 70 > GG4_GATCAG_L002.uniques.sorted.bam > uc007amp.2 64 > GG5_TAGCTT_L003.uniques.sorted.bam > uc007amp.2 111 > GG6_GGCTAC_L003.uniques.sorted.bam > uc007amp.2 26 > GG7_GTGGCC_L003.uniques.sorted.bam > uc007amp.2 34 > GG8_GTTTCG_L003.uniques.sorted.bam > uc007amp.2 36 > GG9_CGTACG_L004.uniques.sorted.bam > uc007amp.2 98 > GG10_GAGTGG_L004.uniques.sorted.bam > uc007amp.2 153 > GG11_ACTGAT_L004.uniques.sorted.bam > uc007amp.2 164 > GG12_ATTCCT_L004.uniques.sorted.bam > uc007amp.2 174 > > > Any ideas where I am going wrong? > > Best, > > Jim > >

ADD COMMENT • link 12.1 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Hi Valerie, On 8/29/2012 5:37 PM, Valerie Obenchain wrote: > Hi Jim, > > I think what you're seeing is due to how the annotation is provided. > In the first example there are many transcripts the reads can hit and > in the second, there is only one. When a GRangesList is given as > 'subject' each outer list element is considered the feature. So in the > first example we have 55419 features, > > > library(TxDb.Mmusculus.UCSC.mm9.knownGene) > > ex <- exonsBy(TxDb.Mmusculus.UCSC.mm9.knownGene, use.names = TRUE) > > class(ex) > [1] "GRangesList" > attr(,"package") > [1] "GenomicRanges" > > length(ex) > [1] 55419 > > In the second we have one, > > ex2 <- ex["uc007amp.2"] > > class(ex2) > [1] "GRangesList" > attr(,"package") > [1] "GenomicRanges" > > length(ex2) > [1] 1 > > summarizeOverlaps() will either discard reads that hit >1 feature > ('Union') or it will make a decision about how to assign the read to a > single feature if possible ('IntersectionStrict' or > 'IntersectionNotEmpty'). I believe in the first case many of your > reads hit > 1 transcript and cannot be resolved using > 'IntersectionNotEmpty'. When a read can't be resolved it is dropped. > In the second case no reads should be dropped due to multi-hits > because there is only 1 possible feature to hit. > > Let me know if this does not clear things up and maybe I can take a > closer look at your data. Ah. An unexpected result. I didn't think this through as well as I should have. The HTSeq diagram in the GenomicRanges package (as well as the original at http://www-huber.embl.de/users/anders/HTSeq/doc/count.html) imply overlapping exons from different genes, so I didn't think about the repercussions of overlapping exons from the *same* gene in different transcripts. Thanks, Jim > > Valerie > > > On 08/29/2012 09:39 AM, James W. MacDonald wrote: >> I am getting results from summarizeOverlaps(<grangeslist>, >> <bamfilelist>) that I don't understand. Here is the way I normally >> use the function: >> >> library(Rsamtools) >> library(TxDb.Mmusculus.UCSC.mm9.knownGene) >> ex <- exonsBy(TxDb.Mmusculus.UCSC.mm9.knownGene, use.names = TRUE) >> samps_casava <- BamFileList() >> olaps <- summarizeOverlaps(ex, samps_casava, mode = >> "IntersectionNotEmpty") >> >> and then as an example, check the counts for uc007amp.2 >> >> > assays(olaps)$counts["uc007amp.2",, drop=F] >> GG1_ATCACG_L002.uniques.sorted.bam >> uc007amp.2 1 >> GG2_TTAGGC_L002.uniques.sorted.bam >> uc007amp.2 1 >> GG3_ACTTGA_L002.uniques.sorted.bam >> uc007amp.2 0 >> GG4_GATCAG_L002.uniques.sorted.bam >> uc007amp.2 0 >> GG5_TAGCTT_L003.uniques.sorted.bam >> uc007amp.2 0 >> GG6_GGCTAC_L003.uniques.sorted.bam >> uc007amp.2 1 >> GG7_GTGGCC_L003.uniques.sorted.bam >> uc007amp.2 0 >> GG8_GTTTCG_L003.uniques.sorted.bam >> uc007amp.2 0 >> GG9_CGTACG_L004.uniques.sorted.bam >> uc007amp.2 0 >> GG10_GAGTGG_L004.uniques.sorted.bam >> uc007amp.2 0 >> GG11_ACTGAT_L004.uniques.sorted.bam >> uc007amp.2 1 >> GG12_ATTCCT_L004.uniques.sorted.bam >> uc007amp.2 3 >> >> However, this doesn't make any sense to me, as there are actually >> tons of reads that overlap this transcript (I know this from looking >> at the transcript and these data using IGV). So let's just summarize >> the overlaps for that transcript alone. >> >> > ex2 <- ex["uc007amp.2"] >> > olaps2 <- summarizeOverlaps(ex2, samps_casava, mode = >> "IntersectionNotEmpty") >> > assays(olaps2)$counts >> GG1_ATCACG_L002.uniques.sorted.bam >> uc007amp.2 45 >> GG2_TTAGGC_L002.uniques.sorted.bam >> uc007amp.2 90 >> GG3_ACTTGA_L002.uniques.sorted.bam >> uc007amp.2 70 >> GG4_GATCAG_L002.uniques.sorted.bam >> uc007amp.2 64 >> GG5_TAGCTT_L003.uniques.sorted.bam >> uc007amp.2 111 >> GG6_GGCTAC_L003.uniques.sorted.bam >> uc007amp.2 26 >> GG7_GTGGCC_L003.uniques.sorted.bam >> uc007amp.2 34 >> GG8_GTTTCG_L003.uniques.sorted.bam >> uc007amp.2 36 >> GG9_CGTACG_L004.uniques.sorted.bam >> uc007amp.2 98 >> GG10_GAGTGG_L004.uniques.sorted.bam >> uc007amp.2 153 >> GG11_ACTGAT_L004.uniques.sorted.bam >> uc007amp.2 164 >> GG12_ATTCCT_L004.uniques.sorted.bam >> uc007amp.2 174 >> >> >> Any ideas where I am going wrong? >> >> Best, >> >> Jim >> >> > -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099

ADD REPLY • link 12.1 years ago James W. MacDonald 67k

0

Entering edit mode

As another option, summarizeOverlaps() can take any counting function as the 'mode' and put the results in a SummarizedExperiment. If the built-in counting modes are too conservative or if you have your own method of counting you can create your own function and pass it to 'mode'. The restrictions are (1) your function must have the same signature as existing modes which is function(reads, features, ignore.strand = FALSE, ...) and (2) it should return a vector of counts the same length as 'features'. Here is an example using countOverlaps(), countOverlapsSO <- function(reads, features, ignore.strand = FALSE, ...) { countOverlaps(features, reads, ignore.strand=ignore.strand) } library("TxDb.Dmelanogaster.UCSC.dm3.ensGene") exbygene <- exonsBy(TxDb.Dmelanogaster.UCSC.dm3.ensGene, "gene") fl <- system.file("extdata", "untreated1_chr4.bam", package="pasillaBamSubset") bfl <- BamFileList(fl) > summarizeOverlaps(exbygene, bfl, mode=countOverlapsSO) class: SummarizedExperiment dim: 14869 1 exptData(0): assays(1): counts rownames(14869): FBgn0000003 FBgn0000008 ... FBgn0261574 FBgn0261575 rowData metadata column names(0): colnames(1): /home/vobencha/R/R-dev/R-2-15-branch/library/pasillaBamSubset/extdata/ untreated1_chr4.bam colData names(1): fileName Valerie > sessionInfo() R version 2.15.1 Patched (2012-07-19 r59904) Platform: x86_64-unknown-linux-gnu (64-bit) ... other attached packages: [1] TxDb.Dmelanogaster.UCSC.dm3.ensGene_2.7.1 [2] GenomicFeatures_1.9.36 [3] AnnotationDbi_1.19.30 [4] Biobase_2.17.6 [5] Rsamtools_1.9.28 [6] Biostrings_2.25.12 [7] GenomicRanges_1.9.57 [8] IRanges_1.15.42 [9] BiocGenerics_0.3.1 loaded via a namespace (and not attached): [1] biomaRt_2.13.2 bitops_1.0-4.1 BSgenome_1.25.6 [4] DBI_0.2-5 RCurl_1.91-1 RSQLite_0.11.1 [7] rtracklayer_1.17.19 stats4_2.15.1 tools_2.15.1 [10] XML_3.9-4 zlibbioc_1.3.0 On 08/30/2012 08:20 AM, James W. MacDonald wrote: > Hi Valerie, > > On 8/29/2012 5:37 PM, Valerie Obenchain wrote: >> Hi Jim, >> >> I think what you're seeing is due to how the annotation is provided. >> In the first example there are many transcripts the reads can hit and >> in the second, there is only one. When a GRangesList is given as >> 'subject' each outer list element is considered the feature. So in >> the first example we have 55419 features, >> >> > library(TxDb.Mmusculus.UCSC.mm9.knownGene) >> > ex <- exonsBy(TxDb.Mmusculus.UCSC.mm9.knownGene, use.names = TRUE) >> > class(ex) >> [1] "GRangesList" >> attr(,"package") >> [1] "GenomicRanges" >> > length(ex) >> [1] 55419 >> >> In the second we have one, >> > ex2 <- ex["uc007amp.2"] >> > class(ex2) >> [1] "GRangesList" >> attr(,"package") >> [1] "GenomicRanges" >> > length(ex2) >> [1] 1 >> >> summarizeOverlaps() will either discard reads that hit >1 feature >> ('Union') or it will make a decision about how to assign the read to >> a single feature if possible ('IntersectionStrict' or >> 'IntersectionNotEmpty'). I believe in the first case many of your >> reads hit > 1 transcript and cannot be resolved using >> 'IntersectionNotEmpty'. When a read can't be resolved it is dropped. >> In the second case no reads should be dropped due to multi-hits >> because there is only 1 possible feature to hit. >> >> Let me know if this does not clear things up and maybe I can take a >> closer look at your data. > > Ah. An unexpected result. I didn't think this through as well as I > should have. The HTSeq diagram in the GenomicRanges package (as well > as the original at > http://www-huber.embl.de/users/anders/HTSeq/doc/count.html) imply > overlapping exons from different genes, so I didn't think about the > repercussions of overlapping exons from the *same* gene in different > transcripts. > > Thanks, > > Jim > > > >> >> Valerie >> >> >> On 08/29/2012 09:39 AM, James W. MacDonald wrote: >>> I am getting results from summarizeOverlaps(<grangeslist>, >>> <bamfilelist>) that I don't understand. Here is the way I normally >>> use the function: >>> >>> library(Rsamtools) >>> library(TxDb.Mmusculus.UCSC.mm9.knownGene) >>> ex <- exonsBy(TxDb.Mmusculus.UCSC.mm9.knownGene, use.names = TRUE) >>> samps_casava <- BamFileList() >>> olaps <- summarizeOverlaps(ex, samps_casava, mode = >>> "IntersectionNotEmpty") >>> >>> and then as an example, check the counts for uc007amp.2 >>> >>> > assays(olaps)$counts["uc007amp.2",, drop=F] >>> GG1_ATCACG_L002.uniques.sorted.bam >>> uc007amp.2 1 >>> GG2_TTAGGC_L002.uniques.sorted.bam >>> uc007amp.2 1 >>> GG3_ACTTGA_L002.uniques.sorted.bam >>> uc007amp.2 0 >>> GG4_GATCAG_L002.uniques.sorted.bam >>> uc007amp.2 0 >>> GG5_TAGCTT_L003.uniques.sorted.bam >>> uc007amp.2 0 >>> GG6_GGCTAC_L003.uniques.sorted.bam >>> uc007amp.2 1 >>> GG7_GTGGCC_L003.uniques.sorted.bam >>> uc007amp.2 0 >>> GG8_GTTTCG_L003.uniques.sorted.bam >>> uc007amp.2 0 >>> GG9_CGTACG_L004.uniques.sorted.bam >>> uc007amp.2 0 >>> GG10_GAGTGG_L004.uniques.sorted.bam >>> uc007amp.2 0 >>> GG11_ACTGAT_L004.uniques.sorted.bam >>> uc007amp.2 1 >>> GG12_ATTCCT_L004.uniques.sorted.bam >>> uc007amp.2 3 >>> >>> However, this doesn't make any sense to me, as there are actually >>> tons of reads that overlap this transcript (I know this from looking >>> at the transcript and these data using IGV). So let's just summarize >>> the overlaps for that transcript alone. >>> >>> > ex2 <- ex["uc007amp.2"] >>> > olaps2 <- summarizeOverlaps(ex2, samps_casava, mode = >>> "IntersectionNotEmpty") >>> > assays(olaps2)$counts >>> GG1_ATCACG_L002.uniques.sorted.bam >>> uc007amp.2 45 >>> GG2_TTAGGC_L002.uniques.sorted.bam >>> uc007amp.2 90 >>> GG3_ACTTGA_L002.uniques.sorted.bam >>> uc007amp.2 70 >>> GG4_GATCAG_L002.uniques.sorted.bam >>> uc007amp.2 64 >>> GG5_TAGCTT_L003.uniques.sorted.bam >>> uc007amp.2 111 >>> GG6_GGCTAC_L003.uniques.sorted.bam >>> uc007amp.2 26 >>> GG7_GTGGCC_L003.uniques.sorted.bam >>> uc007amp.2 34 >>> GG8_GTTTCG_L003.uniques.sorted.bam >>> uc007amp.2 36 >>> GG9_CGTACG_L004.uniques.sorted.bam >>> uc007amp.2 98 >>> GG10_GAGTGG_L004.uniques.sorted.bam >>> uc007amp.2 153 >>> GG11_ACTGAT_L004.uniques.sorted.bam >>> uc007amp.2 164 >>> GG12_ATTCCT_L004.uniques.sorted.bam >>> uc007amp.2 174 >>> >>> >>> Any ideas where I am going wrong? >>> >>> Best, >>> >>> Jim >>> >>> >> >

ADD REPLY • link 12.1 years ago Valerie Obenchain ★ 6.8k