Hi Stefanie,
The "Finding mismatches between reads and reference sequence" post is old and the good news is that it's a little bit easier to do these things these days with sequenceLayer()
(which I added since then). sequenceLayer()
is defined in the GenomicAlignments package and supports insertions, deletions, and clipping (soft and hard).
library(RNAseqData.HNRNPC.bam.chr14)
bam <- RNAseqData.HNRNPC.bam.chr14_BAMFILES[1]
param <- ScanBamParam(what="seq")
reads <- readGAlignments(bam, param=param)
colSums(cigarOpTable(cigar(reads)))
## M I D N S H P =
## 57631946 2902 1560 659051789 0 0 0 0
## X
## 0
Our reads contain insertions (I), deletions (D), and skipped regions (N) (aka junctions).
Query sequences:
qseqs <- mcols(reads)$seq
Extract the reference sequences from the reference genome. Important: all the sequences must be extracted from the plus strand:
library(BSgenome.Hsapiens.UCSC.hg19)
genome <- BSgenome.Hsapiens.UCSC.hg19
read_ranges <- granges(reads)
strand(read_ranges) <- "+"
rseqs <- getSeq(genome, read_ranges)
Use sequenceLayer()
to bring the reference sequences "in the same space as the query sequences". Concretely this will remove from rseqs
the nucleotides that correspond to deletions (D) and junctions (N), and will insert the letter "-" where insertions (I) occurred:
rseqs2 <- sequenceLayer(rseqs, cigar(reads), from="reference", to="query")
A sanity check:
identical(width(qseqs), width(rseqs2)) # TRUE (that was not the case with rseqs)
From here we can re-use the mismatches()
function to find the positions of the mismatches between qseqs
and rseqs2
:
mismatches <- function(x, y)
{
if (!is(x, "DNAStringSet") || !is(y, "DNAStringSet"))
stop("'x' and 'y' must be DNAStringSet objects")
x_width <- width(x)
y_width <- width(y)
if (!identical(x_width, y_width))
stop("'x' and 'y' must have the same shape ",
"(i.e. same length and widths)")
unlisted_ans <- which(as.raw(unlist(x)) != as.raw(unlist(y)))
breakpoints <- cumsum(x_width)
ans_eltlens <- tabulate(findInterval(unlisted_ans - 1L,
breakpoints) + 1L,
nbins=length(x))
skeleton <- PartitioningByEnd(cumsum(ans_eltlens))
ans <- relist(unlisted_ans, skeleton)
offsets <- c(0L, breakpoints[-length(breakpoints)])
ans <- ans - offsets
ans
}
mm <- mismatches(qseqs, rseqs2)
The result is an IntegerList "parallel" to qseqs
(and to rseqs2
), that is, it has one list element per query sequence:
> mm
IntegerList of length 800484
[[1]] integer(0)
[[2]] integer(0)
[[3]] integer(0)
[[4]] integer(0)
[[5]] integer(0)
[[6]] integer(0)
[[7]] integer(0)
[[8]] 7 9
[[9]] 7 9
[[10]] 7 9
...
<800474 more elements>
Each list element is an integer vector giving the 1-based position of the mismatches with respect to the corresponding sequence in qseqs
(i.e. to the corresponding sequence in the SEQ field of the BAM file). Keep in mind that these sequences are the reverse complement of the original read sequences when they aligned to the minus strand. Also they could have been hard-clipped by the aligner.
Nb of mismatches per read:
elementLengths(mm)
Summary of nb of mismatches per read:
> summary(elementLengths(mm))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.0000 0.0000 0.2147 0.0000 2.0000
To see the nucleotides that correspopnd to these mismatches, qseqs
and rseqs2
can be subsetted with mm
(no need to use adhoc subsetByIntegerList()
anymore):
> head(qseqs[mm], n=10)
A DNAStringSet instance of length 10
width seq
[1] 0
[2] 0
[3] 0
[4] 0
[5] 0
[6] 0
[7] 0
[8] 2 GA
[9] 2 GA
[10] 2 GA
> head(rseqs2[mm], n=10)
A DNAStringSet instance of length 10
width seq
[1] 0
[2] 0
[3] 0
[4] 0
[5] 0
[6] 0
[7] 0
[8] 2 AG
[9] 2 AG
[10] 2 AG
Hope this helps. These tools have been added quite recently. Let me know if you run into problems and/or if they need improvement.
H.
Thanks so much for the detailed comment!