readGappedAlignments and param argument
1
0
Entering edit mode
Elena Grassi ▴ 10
@elena-grassi-6138
Last seen 7.1 years ago
Italy, Rozzano, Humanitas University
Hello, I've stumbled across a strange behaviour when indexing bams and then reading them with ReadGappedAlignments with a param holding a which section. Given this bam: data at tungsteno:/rogue_bis/data/bioinfotree/prj/roar/dataset/0.2/roa r_test_2$ samtools view test.bam D1JP1ACXX130107:2:2201:7003:63880 16 chr1 243675625 255 92M * 0 0 * * > library(GenomicRanges) > bam <- "/scarlet/data/bioinfotree/prj/roar/dataset/0.2/roar_test_2/test.bam" > ga <- readGappedAlignments(bam) > ga GappedAlignments with 1 alignment and 0 metadata columns: seqnames strand cigar qwidth start end width ngap <rle> <rle> <character> <integer> <integer> <integer> <integer> <integer> [1] chr1 - 92M 92 243675625 243675716 92 0 --- seqlengths: chr1 chr2 ... chrUn_gl000226 chr18_gl000207_random 249250621 243199373 ... 15008 4262 Everything is fine here. But if index it and load it with a param: > gene_id <- c("A", "B") > features <- GRanges( + seqnames = Rle(c("chr1", "chr1")), + strand = strand(c("-", "-")), + ranges = IRanges( + start=c(243675625, 243675627), + width=c(2, 100)), + DataFrame(gene_id) + ) > indexBam(bam) /tmp/RtmpVbmBLu/filead32d37188.bam "/tmp/RtmpVbmBLu/filead32d37188.bam.bai" > ga_s_p <- readGappedAlignments(bam, param=ScanBamParam(which=features)) > ga_s_p GappedAlignments with 2 alignments and 0 metadata columns: seqnames strand cigar qwidth start end width ngap <rle> <rle> <character> <integer> <integer> <integer> <integer> <integer> [1] chr1 - 92M 92 243675625 243675716 92 0 [2] chr1 - 92M 92 243675625 243675716 92 0 --- seqlengths: chr1 chr2 ... chrUn_gl000226 chr18_gl000207_random 249250621 243199373 ... 15008 4262 > The single read is duplicated...I'm not an expert with these libraries so maybe this is my fault (if I remove the second entry in features everything works, but from the ScanBamParam man about which I was not expecting this behaviour). Sorry for the convoluted example but I've stumbled on this problem while comparing counts obtained with summarizeOverlaps in a script of mine. Here's my session info: > sessionInfo() R version 3.0.1 (2013-05-16) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] rtracklayer_1.20.2 Rsamtools_1.12.3 Biostrings_2.28.0 [4] GenomicRanges_1.12.4 IRanges_1.18.1 BiocGenerics_0.6.0 loaded via a namespace (and not attached): [1] bitops_1.0-5 BSgenome_1.28.0 RCurl_1.95-4.1 stats4_3.0.1 [5] tools_3.0.1 XML_3.98-1.1 zlibbioc_1.6.0 Thanks, E.G. -- $ pom
Alignment Alignment • 1.1k views
ADD COMMENT
0
Entering edit mode
@martin-morgan-1513
Last seen 10 weeks ago
United States
Hi Elena -- On 09/09/2013 03:57 AM, Elena Grassi wrote: > Hello, > > I've stumbled across a strange behaviour when indexing bams and then > reading them with ReadGappedAlignments with a param > holding a which section. > Given this bam: > data at tungsteno:/rogue_bis/data/bioinfotree/prj/roar/dataset/0.2/r oar_test_2$ > samtools view test.bam > D1JP1ACXX130107:2:2201:7003:63880 16 chr1 243675625 > 255 92M * 0 0 * * > >> library(GenomicRanges) >> bam <- "/scarlet/data/bioinfotree/prj/roar/dataset/0.2/roar_test_2/test.bam" >> ga <- readGappedAlignments(bam) >> ga > GappedAlignments with 1 alignment and 0 metadata columns: > seqnames strand cigar qwidth start end > width ngap > <rle> <rle> <character> <integer> <integer> <integer> > <integer> <integer> > [1] chr1 - 92M 92 243675625 243675716 > 92 0 > --- > seqlengths: > chr1 chr2 ... > chrUn_gl000226 chr18_gl000207_random > 249250621 243199373 ... > 15008 4262 > > Everything is fine here. But if index it and load it with a param: >> gene_id <- c("A", "B") >> features <- GRanges( > + seqnames = Rle(c("chr1", "chr1")), > + strand = strand(c("-", "-")), > + ranges = IRanges( > + start=c(243675625, 243675627), > + width=c(2, 100)), > + DataFrame(gene_id) > + ) >> indexBam(bam) > /tmp/RtmpVbmBLu/filead32d37188.bam > "/tmp/RtmpVbmBLu/filead32d37188.bam.bai" >> ga_s_p <- readGappedAlignments(bam, param=ScanBamParam(which=features)) >> ga_s_p > GappedAlignments with 2 alignments and 0 metadata columns: > seqnames strand cigar qwidth start end > width ngap > <rle> <rle> <character> <integer> <integer> <integer> > <integer> <integer> > [1] chr1 - 92M 92 243675625 243675716 > 92 0 > [2] chr1 - 92M 92 243675625 243675716 > 92 0 > --- > seqlengths: > chr1 chr2 ... > chrUn_gl000226 chr18_gl000207_random > 249250621 243199373 ... > 15008 4262 >> > > The single read is duplicated...I'm not an expert with these libraries Your GRanges query is analogous to asking for (something like, maybe I didn't get the indexing correct) samtools view <input.bam> chr1:243675625-243675622 chr1:243675627-243675727 which performs two independent queries -- a read overlapping both regions will be returned twice. You could reduce(features) to make ranges non- overlapping, although it's still possible to have a single read overlapping two distinct regions hence returned twice. Or you could use findOverlaps() and manipulate the returned object to implement your scheme for selecting reads, or identify the appropriate mode in summarizeOverlaps for doing the counting. Or perhaps you can provide more of an explanation for the problem that led you to this? Hope that helps. Martin > so maybe this is my fault (if I remove > the second entry in features everything works, but from the > ScanBamParam man about which I was not expecting > this behaviour). > Sorry for the convoluted example but I've stumbled on this problem > while comparing counts obtained > with summarizeOverlaps in a script of mine. > > Here's my session info: >> sessionInfo() > R version 3.0.1 (2013-05-16) > Platform: x86_64-pc-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods > [8] base > > other attached packages: > [1] rtracklayer_1.20.2 Rsamtools_1.12.3 Biostrings_2.28.0 > [4] GenomicRanges_1.12.4 IRanges_1.18.1 BiocGenerics_0.6.0 > > loaded via a namespace (and not attached): > [1] bitops_1.0-5 BSgenome_1.28.0 RCurl_1.95-4.1 stats4_3.0.1 > [5] tools_3.0.1 XML_3.98-1.1 zlibbioc_1.6.0 > > > Thanks, > E.G. > > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
ADD COMMENT

Login before adding your answer.

Traffic: 1060 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6