Consensus sequence from BAM file
1
0
Entering edit mode
@o-william-mcclung-22004
Last seen 5.3 years ago
United States

I have a .bam file which aligns reads against a reference genome. The locations of the reads partition themselves into a small number of clusters. Can BioConductor be used, perhaps with other tools such as samtools, to compute a consensus sequence for each cluster?

consensus BAM • 1.8k views
ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 12 days ago
Seattle, WA, United States

Hi,

Depending on how you are clustering you reads exactly, here are some tools that I know of that can help you with extracting consensus sequences:

  • consensusMatrix() and consensusString() in the Biostrings package.

  • sequenceLayer() in the GenomicAlignments package.

  • pileup() in the Rsamtools package.

Let's say you have some read sequences:

## Let's load some read sequences:
library(pasillaBamSubset)
library(GenomicAlignments)
param <- ScanBamParam(what="seq")
reads <- readGAlignments(untreated1_chr4(), param=param)
readseqs <- mcols(reads)$seq
readseqs
#   A DNAStringSet instance of length 204355
#          width seq
#      [1]    75 CTGTGGTGACCAACACCACAGAA...CCCCTTTCCTGGCTAGGTTGTCC
#      [2]    75 TCGGGCCCAATTAGAGGGTTCCC...GCTCATTTCCTGGGCTGTTGTTG
#      [3]    75 CCCAATTAGAGGATTCTCTGCCC...TTTCCCGGGATGTTGTTGTGTCC
#      [4]    75 GTTCTCTGCCCCTTTCCTGGCTA...TTGTTGTGTCCCGGGACCCACCT
#      [5]    75 TTCCTGGCTAGGTTGTCCGCTAG...GGACACACCTTATTGTGAGTTTG
#      ...   ... ...
# [204351]    75 ATTTAATACAATATTTTCAAAAT...TATAGGCTTCTTCTTACTATGGT
# [204352]    75 ATTTAATACAATATTTTCAAAAT...TATAGGCTTCTTCTTACTATGGT
# [204353]    75 ATTTAATACAATATTTTCAAAAT...TATAGGCTTCTTCTTACTATGGG
# [204354]    75 CCTCCGCTTTGGTTCACGTTCTG...GGCTTCACTTTTAGCTACTGTTG
# [204355]    75 CCTCCGCTTTGGTTCACGTTCTG...GGCTTCACTTTTAGCTACTGTTG

And let's say you want to cluster them by position (POS field in the BAM file):

read_clusters <- split(readseqs, paste(seqnames(reads), start(reads), sep="-"))

## Sizes of the clusters:
lengths(read_clusters)[1:12]
#  chr4-10000  chr4-10002  chr4-10003  chr4-10004  chr4-10005  chr4-10007 
#           2           1           1           1           1           1 
#  chr4-10008  chr4-10009  chr4-10010 chr4-100112  chr4-10013  chr4-10014 
#           2           1           2           1           3           1 

You can compute the consensus matrix and consensus sequence of the reads positioned at nucleotide 10021 on chromosome 4 (RNAME=chr4 and POS=10021 in the BAM file) with something like:

cluster1 <- read_clusters[["chr4-10021"]]
cluster1
#   A DNAStringSet instance of length 8
#     width seq
# [1]    75 GCAGATGCCTACGATTAACTCCGAACTTTACTG...TCCACGATAGTGCTTGCATGGTTAAGCAAGCC
# [2]    75 GCAGATGCCTACGATTAACTCCGAACTTTACTG...TCCACGATAGTGCTTGCATGGTTAAGCAAGCC
# [3]    75 GCAGATGCCTACGATTAACTCCGAACTTTACTG...TCCACGATAGTGCTTGCATGGTTAAGCAAGCC
# [4]    75 GCAGATGCCTACGATTAACTCCGAACTTTACTG...TCCACGATAGTGCTTGCATGGTTAAGCAAGCC
# [5]    75 GCAGATGCCTACGACTAACTCCGAACTTTACTG...TCCACGATAGTGCTTGCATGGTTAAGCAAGCC
# [6]    75 GCAGATGCCTACGATTAACTCCGAACTTTACTG...TCCACGATAGTGCTTGCATGGTTAAGCAAGCC
# [7]    75 GCAGATGCCTACGATTAACTCCGAACTTTACTG...TCCACGATAGTGCTTGCATGGTTAAGCAAGCC
# [8]    75 GCAGATGCCTACGATTAACTCCGAACTTTACTG...TCCACGATAGTGCTTGCATGGTTAAGCAAGCC

cons_mat <- consensusMatrix(cluster1, baseOnly=TRUE)
dim(cons_mat)
# [1]  5 75

cons_mat[ , 1:12]
#       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
# A        0    0    8    0    8    0    0    0    0     0     8     0
# C        0    8    0    0    0    0    0    8    8     0     0     8
# G        8    0    0    8    0    0    8    0    0     0     0     0
# T        0    0    0    0    0    8    0    0    0     8     0     0
# other    0    0    0    0    0    0    0    0    0     0     0     0

consensusString(cluster1)
# [1] "GCAGATGCCTACGATTAACTCCGAACTTTACTGTTGGACGGACTCCACGATAGTGCTTGCATGGTTAAGCAAGCC"

HOWEVER: It's important to note that this approach only makes sense if the reads have no indels or junctions (i.e. the CIGAR strings have no I, D, or N in them). If they have indels or junctions, the consensus obtained by the above approach would be meaningless unless the reads sequences are first stretched or/and chrunk based on the corresponding CIGAR. sequenceLayer() could be used for this stretching/chrinking.

An alternative approach is to use pileup():

pileup_res <- pileup(untreated1_chr4())
dim(pileup_res)
# [1] 860358      5

head(pileup_res)
#   seqnames pos strand nucleotide count
# 1     chr4 892      -          C     1
# 2     chr4 893      -          T     1
# 3     chr4 894      -          G     1
# 4     chr4 895      -          T     1
# 5     chr4 896      -          G     1
# 6     chr4 897      -          G     1

But then there is some significant work to do to extract consensus sequences from this data.frame.

Hope this helps,

H.

ADD COMMENT

Login before adding your answer.

Traffic: 364 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6