Question

get trimmed sequences based on cigar string in genomic alignments

0

Entering edit mode

Jake ▴ 90

@jake-7236

Last seen 2.6 years ago

United States

Hi,

I have rna seq data that I have mapped with STAR which allows soft clipping of reads. I have imported the bam file with the following code:

param <- ScanBamParam(flag=scanBamFlag(isSecondaryAlignment = F), tag=c('NH'), what='seq')
reads <- readGAlignments('mapped.bam', use.names=T, param = param)

I know that the cigar string shows how many nucleotides have been soft clipped from either the 5' and 3' end of a read with the #S#M or #M#S. I'd like to extract just the nucleotides that have been clipped from the 5' and the 3' end for each read. I saw that there are some functions to operate on cigar strings and genomicAlignments, but didn't see any that looked like they would easily do this. Is there an easy way to extract this information? Thanks

genomicalignments • 3.3k views

ADD COMMENT • link updated 9.3 years ago by Hervé Pagès 16k • written 9.4 years ago by Jake ▴ 90

score 0 · Answer 1 · 2015-12-04

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 10 hours ago

Seattle, WA, United States

Hi Jake,

Yep the cigar utilities in GenomicAlignments are pretty low-level and not particularly easy to use but they do the job:

library(GenomicAlignments)
cigar <- c("3S15M3I24S5H", "25M300N15M", "30M10S")
cigarRangesAlongQuerySpace(cigar, ops="S")
# IRangesList of length 3
# [[1]]
# IRanges of length 2
#     start end width
# [1]     1   3     3
# [2]    22  45    24
# 
# [[2]]
# IRanges of length 0
# 
# [[3]]
# IRanges of length 1
#     start end width
# [1]    31  40    10

Then use this IRangesList object to extract the soft-clipped sequences from the query sequences with extractAt() from the Biostrings package.

Cheers,

H.

ADD COMMENT • link 9.3 years ago Hervé Pagès 16k

0

Entering edit mode

Great this is exactly what I was looking for. Is there anyway to separate out 5' soft clip vs 3' soft clip?

ADD REPLY • link 9.3 years ago Jake ▴ 90

0

Entering edit mode

Yes. For alignments located on the + strand, 5' soft clip is on the left and 3' soft clip is on the right. For alignments on the - strand it's the other way around. Load the BAM file as a GAlignments object with gal <- readGAlignments(..., param=ScanBamParam(what="seq")). Then get the strand with strand(gal), the CIGAR strings with cigar(gal), and the query sequences with mcols(gal)$seq. Compute the soft clip ranges with Sranges <- cigarRangesAlongQuerySpace(cigar(gal), ops="S"). Sranges is an IRangesList object parallel to gal. Ranges in Sranges that start at 1 are on the left, and ranges that end at width(mcols(gal)$seq) (which is the same as qwidth(gal)) are on the right.

Hope this helps,

H.

ADD REPLY • link 9.3 years ago Hervé Pagès 16k