"S" bases on mapFromAlignments - counted or not?
1
0
Entering edit mode
@laurenfitch-11575
Last seen 4.6 years ago

I'm having trouble mapping from read coordinates to reference coordinates using GenomicAlignments::mapFromAlignments when my reads contain soft-clipped bases (CIGAR = "S"). Please see the example below:

> first.ref.locs <- mapFromAlignments(first.ranges, GenomicAlignments::first(bam))
> cigar(GenomicAlignments::first(bam))[26]
[1] "27S73="
> mcols(GenomicAlignments::first(bam))$pos[26]
[1] 1
> first.ref.locs[1]
IRanges object with 1 range and 0 metadata columns:
                                             start       end     width
                                         <integer> <integer> <integer>
  K00284:219:HGGG3BBXY:2:1101:18040:1209        69        69         1
> first.ranges[1]
IRanges object with 1 range and 0 metadata columns:
                                             start       end     width
                                         <integer> <integer> <integer>
  K00284:219:HGGG3BBXY:2:1101:18040:1209        69        69         1

The first.ranges object are the ranges I'm trying to map, and first.ref.locs is the result of the mapping. So I'm trying to map the 69th base to the reference. The output is 69, but this is not correct. Because of the soft clip, the 28th base of the read is the 1st base of the reference. Therefore the 69th base of the read should be the 42nd base of the reference, but that's not what mapFromAlignments outputs.

Simple example:

              12345678901
ref           ACGTACGTTAA
read   GTTTCAGACGTACGTTAA
       123456789012345678

In this case the CIGAR string would be 7S11=, pos would be 1. If I tried to map position 8 on the read to the reference, mapFromAlignments would currently return 8. Why is this the desired behavior?

Is it possible that the pos is the problem here? I aligned these reads with minimap2. Should the pos actually be negative when the read begins before the reference?

Thanks for any guidance you can provide, this question has been stumping me for a long time!

genomicalignments • 1.7k views
ADD COMMENT
2
Entering edit mode
@herve-pages-1542
Last seen 2 days ago
Seattle, WA, United States

First, let's clarify the meaning of POS, especially when soft-clipping is involved. According to the SAM Specs (https://samtools.github.io/hts-specs/SAMv1.pdf):

POS: 1-based leftmost mapping POSition of the first CIGAR operation that “consumes” a reference base.

This is a fancy way of saying that the reported POS is the reference position of the first base in the read that is mapped to the reference. In other words, this means that the soft-clipped bases are ignored. For example in the case of your 100-base read, the 27S73= cigar and the POS=1 mean that the read got aligned to the 1st base of the reference after the soft-clipping i.e. after trimming its first 27 bases.

Another thing to realize is how mapFromAlignments() interprets the ranges in its first argument (x). These ranges are expected to describe "local positions" i.e. positions with respect to the read. However what the man page should probably clarify is that local positions on a soft-clipped read should be given with respect to the read after soft-clipping. So if you are trying to map the 69th base of your read to the reference, assuming you mean the 69th base before soft-clipping, then yes, after soft-clipping this base becomes the 42th base on a 73-base read and this is what needs to go in x.

To programmatically convert local positions before clipping to local positions after clipping, here is a little function that will help:

## Calculate the number of soft-clipped bases on the left side of a read.
## Vectorized (i.e. cigars can be a vector of cigar strings).
left_softclipping <- function(cigars)
{
    ops <- as.character(heads(explodeCigarOps(cigars), n=1))
    ops_len <- as.integer(heads(explodeCigarOpLengths(cigars), n=1))
    ops_len[ops != "S"] <- 0L
    ops_len
}

If mypos contains a local position before soft-clipping (e.g. 69), we can convert it to local position after soft-clipping with:

x <- shift(mypos, -left_softclipping("27S73="))

before calling mapFromAlignments():

mapFromAlignments(x, alignments)
# IRanges object with 1 range and 0 metadata columns:
#             start       end     width
#         <integer> <integer> <integer>
#   read1        42        42         1

That was not too bad because we had only one position to map to the reference and we already knew the cigar associated with it. But a more typical use of mapFromAlignments is to map a vector of local positions to the reference. This is still not too bad if all our local positions (before soft-clipping) are with respect to the same read:

mypos <- IRanges(c(28, 45, 69, 100), width=1, names=rep("read1", 4))
alignments <- GAlignments("chr1", pos=1L, cigar="27S73=", strand("+"), names="read1")
x <- shift(mypos, -left_softclipping("27S73="))
mapFromAlignments(x, alignments)
# IRanges object with 4 ranges and 0 metadata columns:
#             start       end     width
#         <integer> <integer> <integer>
#   read1         1         1         1
#   read1        18        18         1
#   read1        42        42         1
#   read1        73        73         1

For the more general case where mypos contains local positions (before soft-clipping) on various reads, we need to fetch the vector of cigar strings associated with the positions before we can call shift(mypos, -left_softclipping(cigars)). This can be done by matching the names on mypos to the names on alignments.

Anyway this sounds like a little bit too much work for a use case that it seems mapFromAlignments() should be able to handle somehow. Maybe Michael (one of the original authors of mapFromAlignments) has some idea about how this can be done in an easier way?

ADD COMMENT
0
Entering edit mode
# First Read
first.soft.clip <- str_extract(cigar(GenomicAlignments::first(bam)), "^[[:digit:]]+(?=S)")
names(first.soft.clip) <- names(GenomicAlignments::first(bam))
first.ref.locs <- mapFromAlignments(first.ranges, GenomicAlignments::first(bam))
first.sc.adj <- as.integer(first.soft.clip[names(first.ref.locs)]) * -1
first.sc.adj[is.na(first.sc.adj)] <- 0
first.ref.locs <- shift(first.ref.locs, first.sc.adj)

I wrote this up before seeing your answer and I have the adjustment for the soft clip coming after the mapFromAlignments call rather than before. From spot-checking some reads it seems to give the correct answer but I think it could give the wrong result if there's an indel between the pos and the mapped location (x), so I'll change it.

Thanks a lot for your detailed answer!

ADD REPLY

Login before adding your answer.

Traffic: 957 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6