Question

Slice of aligned sequences

1

Entering edit mode

Thomas Poulsen ▴ 10

@thomas-poulsen-8302

Last seen 6.6 years ago

Denmark / Copenhagen / Novozymes

Dear BioConductors,

I'm looking for a way to "cut" out a "slice" of aligned sequences from a BAM file given start and end positions in the reference sequence.

I have looked at the GenomicAlignments package and GenomicRanges because I would expect those to be able to do the trick.

I can construct a ScanBamParam call that picks out the reads overlapping my region of interest, but the sequence I get out in the $seq metadata column is the full sequence of the read - not only the sequence inside my range of interest.

Here's a small example of what I do trying to get the aligned "slice" between positions 19363800 and 19363805 on chromosome 14, but as it shows, I get the full reads that overlap the range - not the slice:

library(GenomicAlignments)
library(RNAseqData.HNRNPC.bam.chr14)
fl <- RNAseqData.HNRNPC.bam.chr14_BAMFILES[1]
bfl <- BamFile(fl)
aln <- readGAlignments(bfl, param=ScanBamParam(what="seq", which=GRanges("chr14",  IRanges(19363800,19363805 ))))

> aln
GAlignments object with 2 alignments and 1 metadata column:
      seqnames strand       cigar    qwidth     start       end     width
         <Rle>  <Rle> <character> <integer> <integer> <integer> <integer>
  [1]    chr14      +         72M        72  19363738  19363809        72
  [2]    chr14      -         72M        72  19363755  19363826        72
          njunc |
      <integer> |
  [1]         0 |
  [2]         0 |
                                                                           seq
                                                                <DNAStringSet>
  [1] CCCATATGTACATCAGGCCCCAGGTATACACTGGACTCCAGGTGGACACCAGCACTCAGTTGGATACACACA
  [2] CCCCAGGTATACACTGGACTCCAGGTGGACACCAGCACTCAGTTGGATACACACACTCAAGGTGGACACCAG
  -------
  seqinfo: 93 sequences from an unspecified genome

What I want to get out is only the "slice" between 19363800 and 19363805, ie

GATACA
GATACA

I'm sure there must be a way to do just this, but I have not been able to find it.

I realize there are complications with reads having insertions in the specified region, but I'll be happy for a solution that works for simpleCigar=TRUE.

My motivation for wanting this is to count occurrences of individual gene-variants (over the region of interest) and count codon- and amino acid frequencies per position in the region of interest.

Thank you all for some great packages.

> sessionInfo()

R version 3.2.0 (2015-04-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.2 LTS

locale:
[1] LC_CTYPE=da_DK.UTF-8       LC_NUMERIC=C
[3] LC_TIME=da_DK.UTF-8        LC_COLLATE=da_DK.UTF-8
[5] LC_MONETARY=da_DK.UTF-8    LC_MESSAGES=da_DK.UTF-8
[7] LC_PAPER=da_DK.UTF-8       LC_NAME=C
[9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=da_DK.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base

other attached packages:
[1] RNAseqData.HNRNPC.bam.chr14_0.6.0 GenomicAlignments_1.4.1
[3] Rsamtools_1.20.4                  Biostrings_2.36.1
[5] XVector_0.8.0                     GenomicRanges_1.20.5
[7] GenomeInfoDb_1.4.1                IRanges_2.2.5
[9] S4Vectors_0.6.1                   BiocGenerics_0.14.0

loaded via a namespace (and not attached):
[1] bitops_1.0-6         futile.options_1.0.0 zlibbioc_1.14.0
[4] futile.logger_1.4.1 lambda.r_1.1.7       BiocParallel_1.2.6
[7] tools_3.2.0          compiler_3.2.0

bam granges genomicalignments • 1.7k views

ADD COMMENT • link updated 9.8 years ago by Hervé Pagès 16k • written 9.8 years ago by Thomas Poulsen ▴ 10

score 3 · Accepted Answer · 2015-06-30

3

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 1 day ago

Seattle, WA, United States

Hi Thomas,

Looks like what you want is stackStringsFromBam():

library(GenomicAlignments)
which <- GRanges("chr14", IRanges(19363800, 19363805))
stackStringsFromBam(bfl, param=ScanBamParam(which=which))
#   A DNAStringSet instance of length 2
#     width seq
# [1]     6 GATACA
# [2]     6 GATACA

See ?stackStringsFromBam in the GenomicAlignments package for more information.

Cheers,

H.

ADD COMMENT • link 9.8 years ago Hervé Pagès 16k

0

Entering edit mode

Hi Hervé,

This does exactly what I need.

Thank you very much!

All the best,

Thomas

ADD REPLY • link 9.8 years ago Thomas Poulsen ▴ 10