Coverage of reads per base position from BAM files

0

Entering edit mode

Guest User ★ 13k

@guest-user-4897

Last seen 10.2 years ago

I am trying to obtain per reference base position coverage for a BAM file. The BAM file has been generated by Samtools and contains alignment info for 10 million reads to a 7kb reference plasmid genome. What is the best way to go about this? I have explored the "readBamGappedAlignments" and "readAligned" functions but I am not sure of the way forward from there. Also should I be using a BAM file or the sorted BAM file with its corresponding index file? I am currently using the later. Thanks much. -- output of sessionInfo(): R version 2.14.2 (2012-02-29) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] ShortRead_1.12.4 latticeExtra_0.6-19 RColorBrewer_1.0-5 [4] lattice_0.20-6 Rsamtools_1.6.3 Biostrings_2.22.0 [7] GenomicRanges_1.6.7 IRanges_1.12.6 loaded via a namespace (and not attached): [1] Biobase_2.14.0 bitops_1.0-4.1 BSgenome_1.22.0 grid_2.14.2 [5] hwriter_1.3 RCurl_1.91-1 rtracklayer_1.14.4 tools_2.14.2 [9] XML_3.9-4 zlibbioc_1.0.1 -- Sent via the guest posting facility at bioconductor.org.

Coverage GO Coverage GO • 2.7k views

ADD COMMENT • link updated 12.3 years ago by Martin Morgan 25k • written 12.3 years ago by Guest User ★ 13k

0

Entering edit mode

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 2.9 years ago

United States

You can readGappedAlignments to load the file as a GappedAlignments. Then call coverage() on that. Michael On Tue, Jul 17, 2012 at 12:56 PM, Rithun Mukherjee [guest] < guest@bioconductor.org> wrote: > > I am trying to obtain per reference base position coverage for a BAM file. > The BAM file has been generated by Samtools and contains alignment info for > 10 million reads to a 7kb reference plasmid genome. What is the best way to > go about this? I have explored the "readBamGappedAlignments" and > "readAligned" functions but I am not sure of the way forward from there. > > Also should I be using a BAM file or the sorted BAM file with its > corresponding index file? I am currently using the later. > > Thanks much. > > > -- output of sessionInfo(): > > R version 2.14.2 (2012-02-29) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] ShortRead_1.12.4 latticeExtra_0.6-19 RColorBrewer_1.0-5 > [4] lattice_0.20-6 Rsamtools_1.6.3 Biostrings_2.22.0 > [7] GenomicRanges_1.6.7 IRanges_1.12.6 > > loaded via a namespace (and not attached): > [1] Biobase_2.14.0 bitops_1.0-4.1 BSgenome_1.22.0 grid_2.14.2 > [5] hwriter_1.3 RCurl_1.91-1 rtracklayer_1.14.4 tools_2.14.2 > [9] XML_3.9-4 zlibbioc_1.0.1 > > > -- > Sent via the guest posting facility at bioconductor.org. > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD COMMENT • link 12.3 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Thanks very much, it worked! Regards, Rithun ----- Original Message ----- From: "Michael Lawrence" <lawrence.michael@gene.com> To: "Rithun Mukherjee [guest]" <guest at="" bioconductor.org=""> Cc: bioconductor at r-project.org, rmukherj at fhcrc.org, "Rsamtools Maintainer" <maintainer at="" bioconductor.org=""> Sent: Tuesday, July 17, 2012 2:11:33 PM Subject: Re: [BioC] Coverage of reads per base position from BAM files You can readGappedAlignments to load the file as a GappedAlignments. Then call coverage() on that. Michael On Tue, Jul 17, 2012 at 12:56 PM, Rithun Mukherjee [guest] < guest at bioconductor.org > wrote: I am trying to obtain per reference base position coverage for a BAM file. The BAM file has been generated by Samtools and contains alignment info for 10 million reads to a 7kb reference plasmid genome. What is the best way to go about this? I have explored the "readBamGappedAlignments" and "readAligned" functions but I am not sure of the way forward from there. Also should I be using a BAM file or the sorted BAM file with its corresponding index file? I am currently using the later. Thanks much. -- output of sessionInfo(): R version 2.14.2 (2012-02-29) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] ShortRead_1.12.4 latticeExtra_0.6-19 RColorBrewer_1.0-5 [4] lattice_0.20-6 Rsamtools_1.6.3 Biostrings_2.22.0 [7] GenomicRanges_1.6.7 IRanges_1.12.6 loaded via a namespace (and not attached): [1] Biobase_2.14.0 bitops_1.0-4.1 BSgenome_1.22.0 grid_2.14.2 [5] hwriter_1.3 RCurl_1.91-1 rtracklayer_1.14.4 tools_2.14.2 [9] XML_3.9-4 zlibbioc_1.0.1 -- Sent via the guest posting facility at bioconductor.org . _______________________________________________ Bioconductor mailing list Bioconductor at r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Rithun Mukherjee Postdoctoral Research Associate Computational Biology Program Fred Hutchinson Cancer Research Center rmukherj at fhcrc.org 206.667.7661

ADD REPLY • link 12.3 years ago Mukherjee, Rithun ▴ 10

0

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 3 months ago

United States

On 07/17/2012 12:56 PM, Maintainer wrote: > > I am trying to obtain per reference base position coverage for a BAM file. The BAM file has been generated by Samtools and contains alignment info for 10 million reads to a 7kb reference plasmid genome. What is the best way to go about this? I have explored the "readBamGappedAlignments" and "readAligned" functions but I am not sure of the way forward from there. > > Also should I be using a BAM file or the sorted BAM file with its corresponding index file? I am currently using the later. For simple depth of coverage follow Michaels' advice wrt readGappedAlignments (where 10 million reads might fit into memory w/out trouble, so no need to sort / index / query the BAM file though that doesn't hurt). If by 'per reference base position coverage' you mean the number of A, C, G, T above each position, you'll want Rsamtools::applyPileups. See ?applyPileups for a worked example; this isn't as easy as it should be. Normally, you'll want to use a sorted, indexed BAM file (?sortBam, ?indexBam) so that you can query just the regions you're interested in. One oddity of the BAM 'index' is that it indexes intervals on the reference genome, rather than say dividing the reads into bins of equal numbers. The smallest interval for indexing is 2^14 = 16kb, so in your case all your 10 million reads fall in a single bin and any search is linear(!) -- the first read at the first position is found fast, the last read at the last position only after looking at 9,999,999 other reads. If you want to process the whole file then iterating through it (via PileupParam yieldSize) works; random access will be terrible. Very recent samtools can generate more flexible indexes, but I think the resolution (with respect to the reference genome) is still relatively course and I have not tested files with these indexes in Rsamtools. Martin > Thanks much. > > > -- output of sessionInfo(): > > R version 2.14.2 (2012-02-29) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] ShortRead_1.12.4 latticeExtra_0.6-19 RColorBrewer_1.0-5 > [4] lattice_0.20-6 Rsamtools_1.6.3 Biostrings_2.22.0 > [7] GenomicRanges_1.6.7 IRanges_1.12.6 > > loaded via a namespace (and not attached): > [1] Biobase_2.14.0 bitops_1.0-4.1 BSgenome_1.22.0 grid_2.14.2 > [5] hwriter_1.3 RCurl_1.91-1 rtracklayer_1.14.4 tools_2.14.2 > [9] XML_3.9-4 zlibbioc_1.0.1 > > > -- > Sent via the guest posting facility at bioconductor.org. > > ____________________________________________________________________ ____ > devteam-bioc mailing list > To unsubscribe from this mailing list send a blank email to > devteam-bioc-leave at lists.fhcrc.org > You can also unsubscribe or change your personal options at > https://lists.fhcrc.org/mailman/listinfo/devteam-bioc > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD COMMENT • link 12.3 years ago Martin Morgan 25k

Login before adding your answer.