Hi all,
I'm not sure if this is even the correct forum to ask this, but I wonder if anyone can offer some advice on where I am going wrong with a loop I'm trying to write for cleaning up UniProt data names.
Basically, the name I have from proteomics analysis is something like tr|A0A02DLI66|A0A02DLI66_MYTGA, but I would like to strip it down to just A0A02DLI66.
I have gotten to the point where I can clean it up manually, per sample, using this code:
fData(x)$UNPROTKB <- fData(x)$DatabaseAccess
fData(x)$UNPROTKB <- gsub(pattern = "^[^ab][^ab].", replacement = "", x = fData(x)$UNPROTKB)
fData(x)$UNPROTKB <- gsub(pattern = "\\|..", replacement = "", x = fData(x)$UNPROTKB)
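(As an aside, the two gsub calls can be collapsed into a single capture-group sub(). This is only a sketch on a plain character vector, assuming the IDs follow the usual db|ACCESSION|NAME UniProt layout:

```r
## Sample IDs in the "db|ACCESSION|NAME" layout described above.
ids <- c("tr|A0A02DLI66|A0A02DLI66_MYTGA", "sp|P02769|ALBU_BOVIN")

## One sub(): keep whatever sits between the first two pipes.
acc <- sub("^[^|]+\\|([^|]+)\\|.*$", "\\1", ids)
acc
#> [1] "A0A02DLI66" "P02769"
```

The same call would slot into fData(x)$UNPROTKB in place of the two-step version.)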
But my sample count is getting quite high and this is time consuming, so I thought a loop would help. While it does process, it is not changing anything. I'm unsure what I am missing in the code, which is as follows:
tmp <- sapply(nms, function(.bap) {
    cat("Processing", .bap, "... ")
    x <- get(.bap, envir = .GlobalEnv)
    fData(x)$UNPROTKB <- fData(x)$DatabaseAccess
    fData(x)$UNPROTKB <- gsub(pattern = "^[^ab][^ab].", replacement = "", x = fData(x)$UNPROTKB)
    fData(x)$UNPROTKB <- gsub(pattern = "\\|..", replacement = "", x = fData(x)$UNPROTKB)
    varnm <- sub("bap", "bap", .bap)
    assign(varnm, x, envir = .GlobalEnv)
    cat("done\n")
})
Any help would be much appreciated.
L
These days the tradeoff between a loop and a vectorized operation really only starts to matter at very large N. At 5M elements the vectorized approach is ~40% faster, but that difference amounts to about ten seconds.
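A minimal harness along these lines can reproduce that comparison; the size N and the ID layout here are made up for illustration, and actual timings will vary by machine:

```r
## Sketch: timing a vectorized sub() against an element-wise loop.
N <- 1e5
ids <- sprintf("tr|ACC%06d|ACC%06d_MYTGA", seq_len(N), seq_len(N))

## Vectorized: one sub() call over the whole vector.
t_vec <- system.time(
    acc_vec <- sub("^[^|]+\\|([^|]+)\\|.*$", "\\1", ids)
)

## Looped: the same regex applied one element at a time.
t_loop <- system.time({
    acc_loop <- character(N)
    for (i in seq_len(N)) {
        acc_loop[i] <- sub("^[^|]+\\|([^|]+)\\|.*$", "\\1", ids[i])
    }
})

identical(acc_vec, acc_loop)  # both routes give the same accessions
```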
You are right. Your solution will be even twice as fast as the gsub if you use fixed = TRUE and vapply.
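The code block that presumably followed this reply did not survive; a sketch of what a fixed = TRUE / vapply version might look like (the exact code the replier had in mind may differ):

```r
## Split on the literal "|" -- fixed = TRUE bypasses the regex engine --
## and take the second field of each split, i.e. the accession.
ids <- c("tr|A0A02DLI66|A0A02DLI66_MYTGA", "sp|P02769|ALBU_BOVIN")

acc <- vapply(strsplit(ids, "|", fixed = TRUE),
              `[`, character(1), 2)
acc
#> [1] "A0A02DLI66" "P02769"
```

vapply is preferred over sapply here because the character(1) template guarantees the result is always a character vector.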