Question

Reading large DNA alignments and keeping polymorphic sites

1

Entering edit mode

ben.ward ▴ 30

@benward-7169

Last seen 9.1 years ago

United Kingdom

Hi Everyone at Bioconductor,

I've been doing work with large multiple sequence alignments in which I find the polymorphic sites, keeping them and their base positions, and rejecting all the non-polymorphic sites. I usually use a package like ape for this - which implements the DNAbin class. This time the MSA I have is simply too large and can't be represented as a DNAbin class and R gives an error about not supporting Long vectors of Raw. I've heard and read that the sequence types in Bioconductor are good because they don't store the entire sequence in memory and are built to handle very large sequences efficiently. This will be my first use of Bioconductor. I've currently identified the polymorphic sites using the following code:

library(Biostrings)
origMAlign <- readDNAStringSet(filepath = "~/Dropbox/MySequences.fas", format = "fasta")
consensusM <- consensusMatrix(origMAlign)[1:4,]
polymorphic <- which(colSums(consensusM != 0) > 1)
for(i in 1:length(origMAlign))
  origMAlign[[i]] <- origMAlign[[i]][polymorphic]

Is this the most efficient way to do this with the Biostring classes - being R, use of a loop feels instinctively wrong!

Thanks,

Ben Ward - UEA & TSL.

sequence alignment biostrings • 1.9k views

ADD COMMENT • link updated 10.3 years ago by Hervé Pagès 16k • written 10.4 years ago by ben.ward ▴ 30

score 2 · Answer 1 · 2014-12-17

2

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 1 hour ago

Seattle, WA, United States

Hi Ben,

A much faster way is to use an IntegerList subscript to subset origMAlign:

i <- rep.int(IntegerList(polymorphic), length(origMAlign))
origMAlign[i]

In the S4Vectors/IRanges/GenomicRanges/Biostrings infrastructure (and for S4 objects that rely on that infrastructure) we've extended the capabilities of the [ operator so you can:

Use an Rle (logical-, integer-, numeric-, character-, or factor-Rle) or Ranges subscript to subset a Vector derivative.
Use a list-like object (i.e. list or List) to subset a List derivative 'x'. In that case, the i-th list element in the subscript is used to subset the i-th list element in 'x'.

Hope this helps,

H.

ADD COMMENT • link 10.3 years ago Hervé Pagès 16k

0

Entering edit mode

The example you give returns an error:

Error in rep.int(IntegerList(InformativeBp), length(FullSequence)) : 

  attempt to replicate an object of type 'S4'

But something like:

origMAlign[rep(list(1:10), 10)]

Seems to work.

ADD REPLY • link 10.2 years ago ben.ward ▴ 30

0

Entering edit mode

Hi Ben,

Replicating an IntegerList object does work for me:

> library(IRanges)
> rep.int(IntegerList(1:10), 10)
IntegerList of length 10
[[1]] 1 2 3 4 5 6 7 8 9 10
[[2]] 1 2 3 4 5 6 7 8 9 10
[[3]] 1 2 3 4 5 6 7 8 9 10
[[4]] 1 2 3 4 5 6 7 8 9 10
[[5]] 1 2 3 4 5 6 7 8 9 10
[[6]] 1 2 3 4 5 6 7 8 9 10
[[7]] 1 2 3 4 5 6 7 8 9 10
[[8]] 1 2 3 4 5 6 7 8 9 10
[[9]] 1 2 3 4 5 6 7 8 9 10
[[10]] 1 2 3 4 5 6 7 8 9 10

Please make sure that you are using the current release of BioC (3.0) and that all your packages are up-to-date (especially the BiocGenerics, S4Vectors and IRanges packages). See my sessionInfo() below.

Cheers,

H.

> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] IRanges_2.0.1       S4Vectors_0.4.0     BiocGenerics_0.12.1

ADD REPLY • link 10.2 years ago Hervé Pagès 16k