Dear altruists, the following code is running properly when I'm trying to compare AXT (from UCSC) sequence pairs and keep only [A, T, C, G] uppercases and compute the matched sequence lengths. But somehow, this function is producing a lot of NaN values. Actually it should not contain any NaN values at all. I implemented the same algorithm for Python and it worked properly. But it seems that Bioconductor's DNAStringSet has limitations like it can't compare lower cases. It converts lowercase to uppercase automatically and then compares. So, I converted the AXT file's lowercases to '-' and then used it as an input. Still, I'm getting NaN values now. How can I overcome this?
The full code can be found here: https://github.com/alexander-nash/kurtosis_conservation/blob/master/get_identical_seq_locations.R
N.B: I changed the logic for the identical_bases variable. After this change, I'm getting NaN values.
getLengthsOfIdenticalSeqs<-function(x){
#x must be an Axt object imported by CNEr
qr<-BStringSet(apply(as.data.frame(querySeqs(x)), 2, function(z){
tmp<-paste0("M", z)
out<-paste0(tmp, "N")
}))
tr<-BStringSet(apply(as.data.frame(targetSeqs(x)), 2, function(z){
tmp<-paste0("K", z)
out<-paste0(tmp, "V")
}))
# identical_bases <- mapply(x=qr, y=tr, function(x,y){
# out<-which(as.raw(x)==as.raw(y) & (as.raw(x)==charToRaw('A') | as.raw(x)==charToRaw('T') | as.raw(x)==charToRaw('C') | as.raw(x)==charToRaw('G')) & (as.raw(y)==charToRaw('A') | as.raw(y)==charToRaw('T') | as.raw(y)==charToRaw('C') | as.raw(y)==charToRaw('G')))
# })
identical_bases <- mapply(x=qr, y=tr, function(x,y){
out<-which((as.matrix(x) == as.matrix(y)) &
(as.matrix(x) != '-') & (as.matrix(y) != '-') &
(as.matrix(x)=='A' | as.matrix(x)=='T' | as.matrix(x)=='C' | as.matrix(x)=='G') &
(as.matrix(y)=='A' | as.matrix(y)=='T' | as.matrix(y)=='C' | as.matrix(y)=='G'))
})
relative_ranges <- lapply(identical_bases, function(x){
reduce(IRanges(x-2, x-2))
})
absolute_ranges <- mapply(z=relative_ranges, y=as.list(x), function(z,y){
if(length(z) == 0){
out<-list(data.frame())
} else {
out<-list(data.frame("first.seqnames"=seqnames(first(y)),
"first.start"=(start(first(y)) + start(z)),
"first.end"=(start(first(y)) + end(z)),
"first.width"=(end(z) - start(z) + 1),
"first.strand"="*",
"second.seqnames"=seqnames(second(y)),
"second.start"=(start(second(y)) + start(z)),
"second.end"=(start(second(y)) + end(z)),
"second.width"=(end(z) - start(z) + 1),
"second.strand"="*"))
}
})
return(absolute_ranges)
}