Slice of aligned sequences
1
1
Entering edit mode
@thomas-poulsen-8302
Last seen 6.2 years ago
Denmark / Copenhagen / Novozymes

Dear BioConductors,

I'm looking for a way to "cut" out a "slice" of aligned sequences from a BAM file given start and end positions in the reference sequence.

I have looked at the GenomicAlignments package and GenomicRanges because I would expect those to be able to do the trick.

I can construct a ScanBamParam call that picks out the reads overlapping my region of interest, but the sequence I get out in the $seq metadata column is the full sequence of the read - not only the sequence  inside my range of interest.

Here's a small example of what I do trying to get the aligned "slice" between positions 19363800 and 19363805 on chromosome 14, but as it shows, I get the full reads that overlap the range - not the slice:

library(GenomicAlignments)
library(RNAseqData.HNRNPC.bam.chr14)
fl <- RNAseqData.HNRNPC.bam.chr14_BAMFILES[1]
bfl <- BamFile(fl)
aln <- readGAlignments(bfl, param=ScanBamParam(what="seq", which=GRanges("chr14",  IRanges(19363800,19363805 ))))

> aln
GAlignments object with 2 alignments and 1 metadata column:
      seqnames strand       cigar    qwidth     start       end     width
         <Rle>  <Rle> <character> <integer> <integer> <integer> <integer>
  [1]    chr14      +         72M        72  19363738  19363809        72
  [2]    chr14      -         72M        72  19363755  19363826        72
          njunc |
      <integer> |
  [1]         0 |
  [2]         0 |
                                                                           seq
                                                                <DNAStringSet>
  [1] CCCATATGTACATCAGGCCCCAGGTATACACTGGACTCCAGGTGGACACCAGCACTCAGTTGGATACACACA
  [2] CCCCAGGTATACACTGGACTCCAGGTGGACACCAGCACTCAGTTGGATACACACACTCAAGGTGGACACCAG
  -------
  seqinfo: 93 sequences from an unspecified genome

What I want to get out is only the "slice" between 19363800 and 19363805, ie

GATACA
GATACA

I'm sure there must be a way to do just this, but I have not been able to find it.

I realize there are complications with reads having insertions in the specified region, but I'll be happy for a solution that works for simpleCigar=TRUE.

My motivation for wanting this is to count occurrences of individual gene-variants (over the region of interest) and count codon- and amino acid frequencies per position in the region of interest.

Thank you all for some great packages.


> sessionInfo()

R version 3.2.0 (2015-04-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.2 LTS

locale:
 [1] LC_CTYPE=da_DK.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=da_DK.UTF-8        LC_COLLATE=da_DK.UTF-8    
 [5] LC_MONETARY=da_DK.UTF-8    LC_MESSAGES=da_DK.UTF-8   
 [7] LC_PAPER=da_DK.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=da_DK.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
 [1] RNAseqData.HNRNPC.bam.chr14_0.6.0 GenomicAlignments_1.4.1          
 [3] Rsamtools_1.20.4                  Biostrings_2.36.1                
 [5] XVector_0.8.0                     GenomicRanges_1.20.5             
 [7] GenomeInfoDb_1.4.1                IRanges_2.2.5                    
 [9] S4Vectors_0.6.1                   BiocGenerics_0.14.0              

loaded via a namespace (and not attached):
[1] bitops_1.0-6         futile.options_1.0.0 zlibbioc_1.14.0     
[4] futile.logger_1.4.1  lambda.r_1.1.7       BiocParallel_1.2.6  
[7] tools_3.2.0          compiler_3.2.0      
 

bam granges genomicalignments • 1.6k views
ADD COMMENT
3
Entering edit mode
@herve-pages-1542
Last seen 3 hours ago
Seattle, WA, United States

Hi Thomas,

Looks like what you want is stackStringsFromBam():

library(GenomicAlignments)
which <- GRanges("chr14", IRanges(19363800, 19363805))
stackStringsFromBam(bfl, param=ScanBamParam(which=which))
#   A DNAStringSet instance of length 2
#     width seq
# [1]     6 GATACA
# [2]     6 GATACA

See ?stackStringsFromBam in the GenomicAlignments package for more information.

Cheers,

H.

ADD COMMENT
0
Entering edit mode

Hi Hervé,

This does exactly what I need.

Thank you very much!

All the best,

Thomas

ADD REPLY

Login before adding your answer.

Traffic: 734 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6