Question

Finding mismatches between reads and reference sequence

0

Entering edit mode

Stefanie Tauber ▴ 40

@stefanie-tauber-3978

Last seen 3.1 years ago

Germany

Hi,

following up the post Finding mismatches between reads and reference sequence and, in particular, the two very handy functions 'mismatches' and 'subsetByIntegerList' presented in this post by Herve Pages:

Is there already some functionality available for finding the position and identity of mismatches between reads and references sequence when having insertions in my alignments?

Best,

Stefanie

bsgenome cigar alignment • 2.5k views

ADD COMMENT • link updated 10.4 years ago by Hervé Pagès 16k • written 10.4 years ago by Stefanie Tauber ▴ 40

score 1 · Accepted Answer · 2014-12-09

Hi Stefanie,

The "Finding mismatches between reads and reference sequence" post is old and the good news is that it's a little bit easier to do these things these days with sequenceLayer() (which I added since then). sequenceLayer() is defined in the GenomicAlignments package and supports insertions, deletions, and clipping (soft and hard).

library(RNAseqData.HNRNPC.bam.chr14)
bam <- RNAseqData.HNRNPC.bam.chr14_BAMFILES[1]
param <- ScanBamParam(what="seq")
reads <- readGAlignments(bam, param=param)
colSums(cigarOpTable(cigar(reads)))
##        M         I         D         N         S         H         P         = 
## 57631946      2902      1560 659051789         0         0         0         0 
##        X 
##        0

Our reads contain insertions (I), deletions (D), and skipped regions (N) (aka junctions).

Query sequences:

qseqs <- mcols(reads)$seq

Extract the reference sequences from the reference genome. Important: all the sequences must be extracted from the plus strand:

library(BSgenome.Hsapiens.UCSC.hg19)
genome <- BSgenome.Hsapiens.UCSC.hg19
read_ranges <- granges(reads)
strand(read_ranges) <- "+"
rseqs <- getSeq(genome, read_ranges)

Use sequenceLayer() to bring the reference sequences "in the same space as the query sequences". Concretely this will remove from rseqs the nucleotides that correspond to deletions (D) and junctions (N), and will insert the letter "-" where insertions (I) occurred:

rseqs2 <- sequenceLayer(rseqs, cigar(reads), from="reference", to="query")

A sanity check:

identical(width(qseqs), width(rseqs2))  # TRUE (that was not the case with rseqs)

From here we can re-use the mismatches() function to find the positions of the mismatches between qseqs and rseqs2:

mismatches <- function(x, y)
{
    if (!is(x, "DNAStringSet") || !is(y, "DNAStringSet"))
        stop("'x' and 'y' must be DNAStringSet objects")
    x_width <- width(x)
    y_width <- width(y)
    if (!identical(x_width, y_width))
        stop("'x' and 'y' must have the same shape ",
             "(i.e. same length and widths)")
    unlisted_ans <- which(as.raw(unlist(x)) != as.raw(unlist(y)))
    breakpoints <- cumsum(x_width)
    ans_eltlens <- tabulate(findInterval(unlisted_ans - 1L,
                                         breakpoints) + 1L,
                            nbins=length(x))
    skeleton <- PartitioningByEnd(cumsum(ans_eltlens))
    ans <- relist(unlisted_ans, skeleton)
    offsets <- c(0L, breakpoints[-length(breakpoints)])
    ans <- ans - offsets
    ans
}

mm <- mismatches(qseqs, rseqs2)

The result is an IntegerList "parallel" to qseqs (and to rseqs2), that is, it has one list element per query sequence:

 > mm
IntegerList of length 800484
[[1]] integer(0)
[[2]] integer(0)
[[3]] integer(0)
[[4]] integer(0)
[[5]] integer(0)
[[6]] integer(0)
[[7]] integer(0)
[[8]] 7 9
[[9]] 7 9
[[10]] 7 9
...
<800474 more elements>

Each list element is an integer vector giving the 1-based position of the mismatches with respect to the corresponding sequence in qseqs (i.e. to the corresponding sequence in the SEQ field of the BAM file). Keep in mind that these sequences are the reverse complement of the original read sequences when they aligned to the minus strand. Also they could have been hard-clipped by the aligner.

Nb of mismatches per read:

elementLengths(mm)

Summary of nb of mismatches per read:

> summary(elementLengths(mm))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.0000  0.0000  0.2147  0.0000  2.0000

To see the nucleotides that correspopnd to these mismatches, qseqs and rseqs2 can be subsetted with mm (no need to use adhoc subsetByIntegerList() anymore):

> head(qseqs[mm], n=10)
  A DNAStringSet instance of length 10
     width seq
 [1]     0 
 [2]     0 
 [3]     0 
 [4]     0 
 [5]     0 
 [6]     0 
 [7]     0 
 [8]     2 GA
 [9]     2 GA
[10]     2 GA

> head(rseqs2[mm], n=10)
  A DNAStringSet instance of length 10
     width seq
 [1]     0 
 [2]     0 
 [3]     0 
 [4]     0 
 [5]     0 
 [6]     0 
 [7]     0 
 [8]     2 AG
 [9]     2 AG
[10]     2 AG

Hope this helps. These tools have been added quite recently. Let me know if you run into problems and/or if they need improvement.

H.