Reading large DNA alignments and keeping polymorphic sites
1
1
Entering edit mode
ben.ward ▴ 30
@benward-7169
Last seen 8.7 years ago
United Kingdom

 

Hi Everyone at Bioconductor,

 

I've been doing work with large multiple sequence alignments in which I find the polymorphic sites, keeping them and their base positions, and rejecting all the non-polymorphic sites. I usually use a package like ape for this - which implements the DNAbin class. This time the MSA I have is simply too large and can't be represented as a DNAbin class and R gives an error about not supporting Long vectors of Raw. I've heard and read that the sequence types in Bioconductor are good because they don't store the entire sequence in memory and are built to handle very large sequences efficiently. This will be my first use of Bioconductor. I've currently identified the polymorphic sites using the following code:

library(Biostrings)
origMAlign <- readDNAStringSet(filepath = "~/Dropbox/MySequences.fas", format = "fasta")
consensusM <- consensusMatrix(origMAlign)[1:4,]
polymorphic <- which(colSums(consensusM != 0) > 1)
for(i in 1:length(origMAlign))
  origMAlign[[i]] <- origMAlign[[i]][polymorphic]

Is this the most efficient way to do this with the Biostring classes - being R, use of a loop feels instinctively wrong! 

Thanks,

Ben Ward - UEA & TSL.

 

sequence alignment biostrings • 1.8k views
ADD COMMENT
2
Entering edit mode
@herve-pages-1542
Last seen 23 hours ago
Seattle, WA, United States

Hi Ben,

A much faster way is to use an IntegerList subscript to subset origMAlign:

i <- rep.int(IntegerList(polymorphic), length(origMAlign))
origMAlign[i]

In the S4Vectors/IRanges/GenomicRanges/Biostrings infrastructure (and for S4 objects that rely on that infrastructure) we've extended the capabilities of the [ operator so you can:

  1. Use an Rle (logical-, integer-, numeric-, character-, or factor-Rle) or Ranges subscript to subset a Vector derivative.
  2. Use a list-like object (i.e. list or List) to subset a List derivative 'x'. In that case, the i-th list element in the subscript is used to subset the i-th list element in 'x'.

Hope this helps,

H.

ADD COMMENT
0
Entering edit mode

The example you give returns an error: 

Error in rep.int(IntegerList(InformativeBp), length(FullSequence)) : 

  attempt to replicate an object of type 'S4'

But something like: 

origMAlign[rep(list(1:10), 10)]

Seems to work.

ADD REPLY
0
Entering edit mode

Hi Ben,

Replicating an IntegerList object does work for me:

> library(IRanges)
> rep.int(IntegerList(1:10), 10)
IntegerList of length 10
[[1]] 1 2 3 4 5 6 7 8 9 10
[[2]] 1 2 3 4 5 6 7 8 9 10
[[3]] 1 2 3 4 5 6 7 8 9 10
[[4]] 1 2 3 4 5 6 7 8 9 10
[[5]] 1 2 3 4 5 6 7 8 9 10
[[6]] 1 2 3 4 5 6 7 8 9 10
[[7]] 1 2 3 4 5 6 7 8 9 10
[[8]] 1 2 3 4 5 6 7 8 9 10
[[9]] 1 2 3 4 5 6 7 8 9 10
[[10]] 1 2 3 4 5 6 7 8 9 10

Please make sure that you are using the current release of BioC (3.0) and that all your packages are up-to-date (especially the BiocGenerics, S4Vectors and IRanges packages). See my sessionInfo() below.

Cheers,

H.

> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] IRanges_2.0.1       S4Vectors_0.4.0     BiocGenerics_0.12.1
ADD REPLY

Login before adding your answer.

Traffic: 943 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6