Hi,
Can someone help me with this question? I have a large data frame (say 'dat') with two columns: one holds genomic loci (chromosome-by-position, e.g. 'chr1_1253454'), the other holds Illumina sequences. I want to perform some operations on each UNIQUE locus, so I first derive the unique loci:

u.loc <- unique(dat[, 1])

and then build a loop that gives me access to the relevant data for each unique locus, so I can perform my operations:
for (i in 1:length(u.loc)) {
  subdat <- subset(dat, dat[, 1] == u.loc[i])
  # now the relevant sequence data are accessible for my operations...
}
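For illustration, here is a toy version of the setup (the loci, sequences, and the per-locus count are made up as a stand-in for my real data and operations); on data like this the loop behaves as intended:

```r
# toy stand-in for the real data frame: one locus column, one sequence column
dat <- data.frame(
  locus = c("chr1_100", "chr2_200", "chr1_100", "chr3_300", "chr2_200", "chr3_300"),
  seq   = c("ACGT", "TTAA", "ACGG", "GGCC", "TTAC", "GGCA"),
  stringsAsFactors = FALSE
)
u.loc <- unique(dat[, 1])

n.per.locus <- integer(length(u.loc))
for (i in 1:length(u.loc)) {
  subdat <- subset(dat, dat[, 1] == u.loc[i])
  # placeholder operation: count the sequences observed at each locus
  n.per.locus[i] <- nrow(subdat)
}
n.per.locus  # 2 2 2
```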
This works fine. But since the dat object has some 10 million rows, the subset() call takes time, so the whole loop is slow. I would therefore like to drop the rows already processed, which should speed up the code as the loop progresses. I thought of adding this as the last line inside the loop:

dat <- dat[-as.integer(row.names(subdat)), ]

This should eliminate the processed rows and continuously shrink the dat object. However, the output I get with this extra line is incorrect: it does not agree with the output I get without row deletion. The deletion does not seem to work as I expect. Any idea why this is, and how I could do the row elimination properly?
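A small reproducible example (toy loci and sequences, with a per-locus count standing in for my real operations) shows the disagreement I mean:

```r
dat <- data.frame(
  locus = c("chrA_1", "chrB_2", "chrA_1", "chrB_2", "chrC_3", "chrC_3"),
  seq   = c("AA", "CC", "AG", "CT", "GG", "GT"),
  stringsAsFactors = FALSE
)
u.loc <- unique(dat[, 1])

n.with.deletion <- integer(length(u.loc))
for (i in 1:length(u.loc)) {
  subdat <- subset(dat, dat[, 1] == u.loc[i])
  n.with.deletion[i] <- nrow(subdat)
  # the problematic line: row names are from the ORIGINAL dat,
  # but negative indexing removes by CURRENT position
  dat <- dat[-as.integer(row.names(subdat)), ]
}
n.with.deletion  # 2 2 1 -- without the deletion line every count is 2
```

Each locus occurs twice, so every count should be 2, yet the last locus comes out with only 1 row: the deletion step has removed a not-yet-processed row.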
Thanks!
Daniel Berner
Zoological Institute
University of Basel
Vesalgasse 1
4051 Basel
Switzerland
+41 (0)61 267 0328
daniel.berner@unibas.ch