readGAlignmentPairs not working correctly when reading a bam file by chunks
1
2
Entering edit mode
jfiksel ▴ 30
@jfiksel-7391
Last seen 7.5 years ago
United States

Hello,

I have several large BAM files and would like to use readGAlignmentPairs when reading in the Bam files by chunks to reduce memory usage. However, I've noticed that readGAlignmentPairs does not produce the same output when reading in a Bam file by chunks as it does when it reads in the whole Bam file. Is there a work around for this? I have produced sample code below to show that the outputs are not equal.

library(GenomicAlignments)
library(Rsamtools)
library(GenomicFiles)
library(pasillaBamSubset)
param <- ScanBamParam(flag=scanBamFlag(isDuplicate = F,
                                       isSecondaryAlignment = F))
un3 <- untreated3_chr4()
bf1 <- BamFile(un3, yieldSize=100, asMates=T)
bf2 <- BamFile(un3, asMates=T)
yield <- function(x) readGAlignmentPairs(x, param=param)
map <- identity
reduce <- c
galp1 <- reduceByYield(bf1, yield, map, reduce)
galp2 <- readGAlignmentPairs(bf2, param=param)
identical(galp1, galp2)
galp1[100:110]
galp2[100:110]

SessionInfo:

R version 3.3.1 (2016-06-21)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)

locale:
[1] C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] pasillaBamSubset_0.10.0    GenomicFiles_1.8.0        
 [3] rtracklayer_1.32.1         BiocParallel_1.6.2        
 [5] GenomicAlignments_1.8.3    Rsamtools_1.24.0          
 [7] Biostrings_2.40.2          XVector_0.12.0            
 [9] SummarizedExperiment_1.2.3 Biobase_2.32.0            
[11] GenomicRanges_1.24.2       GenomeInfoDb_1.8.3        
[13] IRanges_2.6.1              S4Vectors_0.10.2          
[15] BiocGenerics_0.18.0       

loaded via a namespace (and not attached):
 [1] AnnotationDbi_1.34.3     zlibbioc_1.18.0          BSgenome_1.40.1         
 [4] tools_3.3.1              DBI_0.4-1                bitops_1.0-6            
 [7] RCurl_1.95-4.8           biomaRt_2.28.0           RSQLite_1.0.0           
[10] compiler_3.3.1           GenomicFeatures_1.24.3   XML_3.98-1.4            
[13] VariantAnnotation_1.18.5

Thanks,

Jacob

rsamtools genomicalignments genomicfiles • 1.4k views
ADD COMMENT
2
Entering edit mode
@herve-pages-1542
Last seen 2 days ago
Seattle, WA, United States

Hi Jacob,

It's only the order that differs between galp1 and galp2 but the content is the same:

galp1 <- galp1[do.call(order, as.data.frame(galp1))]
galp2 <- galp2[do.call(order, as.data.frame(galp2))]
identical(galp1, galp2)
# [1] TRUE

The pairing algorithm needs to do some arrangements when reading the file by chunk. This affects the order in which the pairs are loaded in memory. 

Cheers,

H.

ADD COMMENT
0
Entering edit mode

Perfect, thanks for the help! 

ADD REPLY

Login before adding your answer.

Traffic: 574 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6