Hi,
I have rna seq data that I have mapped with STAR which allows soft clipping of reads. I have imported the bam file with the following code:
param <- ScanBamParam(flag=scanBamFlag(isSecondaryAlignment = F), tag=c('NH'), what='seq') reads <- readGAlignments('mapped.bam', use.names=T, param = param)
I know that the cigar string shows how many nucleotides have been soft clipped from either the 5' and 3' end of a read with the #S#M or #M#S. I'd like to extract just the nucleotides that have been clipped from the 5' and the 3' end for each read. I saw that there are some functions to operate on cigar strings and genomicAlignments, but didn't see any that looked like they would easily do this. Is there an easy way to extract this information? Thanks
Great this is exactly what I was looking for. Is there anyway to separate out 5' soft clip vs 3' soft clip?
Yes. For alignments located on the + strand, 5' soft clip is on the left and 3' soft clip is on the right. For alignments on the - strand it's the other way around. Load the BAM file as a GAlignments object with
gal <- readGAlignments(..., param=ScanBamParam(what="seq"))
. Then get the strand withstrand(gal)
, the CIGAR strings withcigar(gal)
, and the query sequences withmcols(gal)$seq
. Compute the soft clip ranges withSranges <- cigarRangesAlongQuerySpace(cigar(gal), ops="S")
.Sranges
is an IRangesList object parallel togal
. Ranges inSranges
that start at 1 are on the left, and ranges that end atwidth(mcols(gal)$seq)
(which is the same asqwidth(gal)
) are on the right.Hope this helps,
H.