I am calculating weighted coverage for two reads one of which has a gap. I find that the coverage weights are assigned wrong after the gapped region due to the GAlignment being converted to GRangesList with multiple GRanges. Is there a way to calculate the right weighted coverage?
I can correct this manually by expanding the assigned weights by hand as shown below, but the lapply step seems to be very slow when working with millions of reads.
> library(GenomicAlignments) > reads <- GAlignments( seqnames=Rle(rep("chr1",2)), pos=as.integer(c(2,40)), strand=strand(rep("+",2)), cigar=c("8M2N12M", "10M"), weight=c(0.1, 0.2)) > reads GAlignments object with 2 alignments and 1 metadata column: seqnames strand cigar qwidth start end width <Rle> <Rle> <character> <integer> <integer> <integer> <integer> [1] chr1 + 8M2N12M 20 2 23 22 [2] chr1 + 10M 10 40 49 10 njunc | weight <integer> | <numeric> [1] 1 | 0.1 [2] 0 | 0.2 ------- seqinfo: 1 sequence from an unspecified genome; no seqlengths > coverage(reads) RleList of length 1 $chr1 integer-Rle of length 49 with 6 runs Lengths: 1 8 2 12 16 10 Values : 0 1 0 1 0 1 > coverage(reads, weight=mcols(reads)$weight) RleList of length 1 $chr1 numeric-Rle of length 49 with 6 runs Lengths: 1 8 2 12 16 10 Values : 0 0.1 0 0.2 0 0.1 Warning message: In .recycle(arg, length(x), arg.label, "x") : 'weight' length is not a divisor of 'x' length #### correct above problem by manually assigning weights > shape <- lapply(grlist(reads), function(x) rep(1, length(x))) > weights <- unlist(mapply("*", shape, mcols(reads)$weight) > coverage(reads, weight=weights) RleList of length 1 $chr1 numeric-Rle of length 49 with 6 runs Lengths: 1 8 2 12 16 10 Values : 0 0.1 0 0.1 0 0.2 > sessionInfo() R version 3.3.2 (2016-10-31) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 14.04.3 LTS locale: [1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C [3] LC_TIME=en_US.utf8 LC_COLLATE=en_US.utf8 [5] LC_MONETARY=en_US.utf8 LC_MESSAGES=en_US.utf8 [7] LC_PAPER=en_US.utf8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C attached base packages: [1] stats4 parallel stats graphics grDevices datasets utils [8] methods base other attached packages: [1] GenomicAlignments_1.10.0 Rsamtools_1.26.1 [3] Biostrings_2.42.1 XVector_0.14.0 [5] SummarizedExperiment_1.4.0 Biobase_2.34.0 [7] GenomicRanges_1.26.3 GenomeInfoDb_1.10.3 [9] IRanges_2.8.1 S4Vectors_0.12.1 [11] BiocGenerics_0.20.0 loaded via a namespace (and not attached): [1] lattice_0.20-34 bitops_1.0-6 grid_3.3.2 zlibbioc_1.20.0 [5] Matrix_1.2-8 BiocParallel_1.8.1 tools_3.3.2 RCurl_1.95-4.8 [9] compiler_3.3.2