Hi,
Can someone help me with this question? I have a large data frame (say 'dat') with two columns: one holds genomic loci (chromosome-by-position, e.g. 'chr1_1253454'), the other holds Illumina sequences. I want to perform some operations on each UNIQUE locus, so I first derive the unique loci:

u.loc <- unique(dat[, 1])

and then build a loop that gives me access to the relevant data for each unique locus, so I can perform my operations:
for (i in 1:length(u.loc)) {
  subdat <- subset(dat, dat[, 1] == u.loc[i])
  # now the relevant sequence data are accessible for my operations...
}
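For illustration, here is a toy version of the setup (the loci, sequences, and the per-locus count are made up as a stand-in for my real data and operations); on data like this the loop behaves as intended:

```r
# toy stand-in for the real data frame: one locus column, one sequence column
dat <- data.frame(
  locus = c("chr1_100", "chr2_200", "chr1_100", "chr3_300", "chr2_200", "chr3_300"),
  seq   = c("ACGT", "TTAA", "ACGG", "GGCC", "TTAC", "GGCA"),
  stringsAsFactors = FALSE
)
u.loc <- unique(dat[, 1])

n.per.locus <- integer(length(u.loc))
for (i in 1:length(u.loc)) {
  subdat <- subset(dat, dat[, 1] == u.loc[i])
  # placeholder operation: count the sequences observed at each locus
  n.per.locus[i] <- nrow(subdat)
}
n.per.locus  # 2 2 2
```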
This works fine. But since the dat object has some 10 million rows, the subset() call takes time, so the whole loop is slow. I would therefore like to drop the rows already processed, which should speed up the code as the loop progresses. I thought of adding this as the last line inside the loop:

dat <- dat[-as.integer(row.names(subdat)), ]

This should eliminate the processed rows and continuously shrink the dat object. However, the output I get with this extra line is incorrect: it does not agree with the output I get without row deletion. The deletion does not seem to work as I expect. Any idea why this is, and how I could do the row elimination properly?
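A small reproducible example (toy loci and sequences, with a per-locus count standing in for my real operations) shows the disagreement I mean:

```r
dat <- data.frame(
  locus = c("chrA_1", "chrB_2", "chrA_1", "chrB_2", "chrC_3", "chrC_3"),
  seq   = c("AA", "CC", "AG", "CT", "GG", "GT"),
  stringsAsFactors = FALSE
)
u.loc <- unique(dat[, 1])

n.with.deletion <- integer(length(u.loc))
for (i in 1:length(u.loc)) {
  subdat <- subset(dat, dat[, 1] == u.loc[i])
  n.with.deletion[i] <- nrow(subdat)
  # the problematic line: row names are from the ORIGINAL dat,
  # but negative indexing removes by CURRENT position
  dat <- dat[-as.integer(row.names(subdat)), ]
}
n.with.deletion  # 2 2 1 -- without the deletion line every count is 2
```

Each locus occurs twice, so every count should be 2, yet the last locus comes out with only 1 row: the deletion step has removed a not-yet-processed row.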
Thanks!
Daniel Berner
Zoological Institute
University of Basel
Vesalgasse 1
4051 Basel
Switzerland
+41 (0)61 267 0328
daniel.berner@unibas.ch