Hola Bioconductors , Biotransformers , Bioamplifiers ...
Before the question special thanx to team from UC Riverside who made nice R seq tutorial and of course Biostring developers. Oh and thank you for this Support site lunch ala StackExchange.
Biostrings did great job for my need (align short reads to long gene) but i got Memory issue when aligning 2 big genes using global alignement. There is also similar post regarding Memory issue (post from 2012).
It went ok till 40.000 bp , R was using ~3% of 160GB memory we have, and when running 50K it failed (bellow is the demo code). Please comment why this sudden memory jump saying to 16777216 Tb ?
One more question for the developer, was many local alignments (Waterman-Eggert) not considered for implementation ? (wiki link for aligners directs to tool "matcher")
oh and one note for support website: it is said "To create a new tag just type it in and press ENTER."... if i enter "pairwiseAlignment" and press enter to get new tag, nothing happenes)
Now the the demo code :
library(Biostrings)
set.seed (42)
refS=DNAString (paste( sample(c("A","T","G","C"), 90000, replace=T), collapse=""))
deb.mat <- nucleotideSubstitutionMatrix (match=1, mismatch=0, baseOnly=TRUE)
seqlns=seq(100, nchar(refS), 10000)
runT=rep(0,length(seqlns)); names(runT)= seqlns
for (L in 1:length(seqlns) ) { cat ( "using seq Length:" , seqlns[L] , "\n")
runT[L]= system.time(
glob.pwa <- pairwiseAlignment( substr(refS,1,seqlns[L]) , subject= substr(refS,1,seqlns[L]), type="global",substitutionMatrix=deb.mat, gapOpening=0, gapExtension=-1)
)[3]
}
# using seq Length: 100
# using seq Length: 10100
# using seq Length: 20100
# using seq Length: 30100
# using seq Length: 40100
# using seq Length: 50100
# Error in .Call2("XStringSet_align_pairwiseAlignment", pattern, subject, :
# cannot allocate memory block of size 16777216 Tb
# Timing stopped at: 0.02 0 0.017
plot (names(runT),runT, type="l", ylab="sec", xlab="seq.len")
runT
##100 10100 20100 30100 40100 50100 60100 70100 80100
0.045 2.747 9.323 20.103 35.792 0.000 0.000 0.000 0.000
traceback()
7: .Call(.NAME, ..., PACKAGE = PACKAGE)
6: .Call2("XStringSet_align_pairwiseAlignment", pattern, subject,
type, typeCode, scoreOnly, gapOpening, gapExtension, useQuality,
substitutionArray, dim(substitutionArray), substitutionLookupTable,
fuzzyMatrix, dim(fuzzyMatrix), fuzzyLookupTable, PACKAGE = "Biostrings")
5: XStringSet.pairwiseAlignment(pattern = pattern, subject = subject,
type = type, substitutionMatrix = substitutionMatrix, gapOpening = gapOpening,
gapExtension = gapExtension, scoreOnly = scoreOnly)
4: mpi.XStringSet.pairwiseAlignment(pattern, subject, type = type,
substitutionMatrix = substitutionMatrix, gapOpening = gapOpening,
gapExtension = gapExtension, scoreOnly = scoreOnly)
3: .local(pattern, subject, ...)
2: pairwiseAlignment(substr(refS, 1, seqlns[L]), subject = substr(refS,
1, seqlns[L]), type = "global", substitutionMatrix = deb.mat,
gapOpening = 0, gapExtension = -1)
1: pairwiseAlignment(substr(refS, 1, seqlns[L]), subject = substr(refS,
1, seqlns[L]), type = "global", substitutionMatrix = deb.mat,
gapOpening = 0, gapExtension = -1)
> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] Biostrings_2.32.0 XVector_0.4.0 IRanges_1.22.3
[4] BiocGenerics_0.10.0 BiocInstaller_1.14.1
loaded via a namespace (and not attached):
[1] stats4_3.1.1 tools_3.1.1 zlibbioc_1.10.0
Looks like you (or someone) figured out how to tag your post with "pairwisealignment".
: ))) I guess it worked at the end but will check when posting new question .