Dear BioC experts,
I'm trying to do an efficient counting of all possible nucleotide changes between aligned nucleotides from short-read sequencing and corresponding reference nucleotides, i.e. how many A>C, A>G, A>T, C>A, C>G, C>T, etc. I have crafted a first approach using the Biostrings and GenomicAlignment packages and the examples I found in the documentation, but I end up doing a nested for loop which obviously is right now the performance bottleneck. I would like to ask if anyone has a suggestion on how could I do this job more efficiently, prior to start parallelizing the code. Please find below the code using example data from the GenomicAlignments package.
thanks!!
library(GenomicAlignments) library(RNAseqData.HNRNPC.bam.chr14) library(BSgenome.Hsapiens.UCSC.hg19) bamfiles <- RNAseqData.HNRNPC.bam.chr14_BAMFILES sbparam <- ScanBamParam(what="seq", flag=scanBamFlag(isProperPair=TRUE, isDuplicate=FALSE, isSecondaryAlignment=FALSE)) gal <- readGAlignments(bamfiles[1], use.names=TRUE, param=sbparam) maxqwidth <- max(qwidth(gal)) ## fetch query and reference sequences qseq <- mcols(gal)$seq rseq <- getSeq(Hsapiens, as(gal, "GRanges")) ## discard indels and skipped regions qseq2 <- sequenceLayer(qseq, cigar(gal), from="query", to="pairwise-dense") rseq2 <- sequenceLayer(rseq, cigar(gal), from="reference", to="pairwise-dense") stopifnot(identical(elementNROWS(qseq2), elementNROWS(rseq2))) ## QC ## reverse complement query sequences aligning to the reverse strand mask <- strand(gal) == "-" qseq2[mask] <- reverseComplement(qseq2[mask]) ## select only short-reads aligning to the maximum length masklen <- width(qseq2) == maxqwidth & width(rseq2) == maxqwidth qseq2 <- qseq2[masklen] rseq2 <- rseq2[masklen] ## tally nucleotide changes nt2 <- combn(DNA_BASES, 2) ntPairs <- apply(nt2, 2, paste, collapse=">") ntPairs <- c(ntPairs, apply(rbind(nt2[2, ], nt2[1, ]), 2, paste, collapse=">")) ntChanges <- matrix(0, nrow=length(ntPairs), ncol=maxqwidth, dimnames=list(ntPairs, 1:maxqwidth)) ## for every pair of possible nucleotide changes for (nucQ in DNA_BASES) { for (nucR in setdiff(DNA_BASES, nucQ)) { ## build a logical mask of the query nucleotide qNs <- hasLetterAt(qseq2, paste(rep(nucQ, maxqwidth), collapse=""), at=1:maxqwidth) ## build a logical mask of the reference nucleotide rNs <- hasLetterAt(rseq2, paste(rep(nucR, maxqwidth), collapse=""), at=1:maxqwidth) ## sum the number of occurrences of both nucleotides at each position ntChanges[paste(nucR, nucQ, sep=">"), ] <- colSums(qNs & rNs) } } dim(ntChanges) ## nucleotide changes by short-read positions ## [1] 12 72 ntChanges[1:5, 1:10] ## 1 2 3 4 5 6 7 8 9 10 ## A>C 4221 410 248 419 496 714 904 1274 1557 1119 ## A>G 534 1550 1003 1227 1140 789 715 1234 1310 1440 ## A>T 5155 367 218 229 321 446 1548 987 1442 1094 ## C>G 149 325 354 322 561 484 334 686 683 893 ## C>T 3057 764 518 334 513 377 1153 702 899 757
hi Michael, thanks for hint. i've been thinking looking at the documentation and examples on pileup() and thinking about it and i can't see how to use it because pileup() tallies nucleotides along the reference sequence, e.g.:
while i'm interested to tally nucleotides along positions of aligned reads (see the ntChanges matrix in my example above). i can't see how to transform the pileup data.frame into a read-level summary of substitutions. any further idea?
thanx!
Sorry I will edit my answer.