This is a speed-up question, like my last one: my data is too big for my usual approach, since I'm working with several TB of data.
So my question is how to match two GRangesList objects by name and calculate the distance between each matched pair (that is, pairs from the same transcript),
where the first list contains several ORFs per transcript, while the CDS list has only one row per transcript, namely the first exon.
I have already made these lists, so a solution would be:
    # uorfs: list of ORFs in the UTR
    # cdsFirstExon: list of all first exons that have uORFs, so it only contains transcripts that have uORFs
    merged = merge(uorfs, cdsFirstExon, by.x = names(uorfs), by.y = names(cdsFirstExon))
    distances = merged$uorf.end - merged$cdsFirstExon.start
    # distances now contains what I want
But the merging step is too slow with big data. Is there a vectorized way?
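To make the intended result concrete, here is a minimal sketch on toy data of the kind of name-based lookup I have in mind, using base R's `match()` on plain named integer vectors. The vectors `uorfEnd` and `cdsStart` and all coordinates are hypothetical stand-ins, not my real objects; with the actual GRangesList data, one coordinate per element would first have to be extracted, and I have not checked how well that step scales.

```r
# Hypothetical toy data: one end coordinate per uORF, named by transcript.
# Names repeat because a transcript can have several uORFs.
uorfEnd <- c(tx1 = 120L, tx1 = 180L, tx2 = 75L, tx3 = 410L, tx3 = 455L)

# Hypothetical toy data: one start coordinate per CDS first exon,
# with unique transcript names (order need not match uorfEnd).
cdsStart <- c(tx3 = 500L, tx1 = 200L, tx2 = 90L)

# For each uORF, find the position of its transcript in cdsStart.
idx <- match(names(uorfEnd), names(cdsStart))

# Same formula as in the merge version (uORF end minus CDS first-exon start),
# done as a single vectorized subtraction.
distances <- uorfEnd - cdsStart[idx]
```

Here `idx` has one entry per uORF, so no merged table is built; a uORF whose transcript is missing from `cdsStart` would get `NA`.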