Subsetting Rle with ORF-sequence-positions by the 3 frames
1
0
Entering edit mode
@hauken_heyken-13992
Last seen 2.1 years ago
Bergen

I have a list of 4.5 million orfs, so I need a fast method for this.

Right now the function uses 0.1 terabyte of RAM  per core, which I think is way too much. (2 terabyte server)

So I can only run  20 processes at a time.

 

Here is the problem:

First I get the coverage of ribo-seq per ORF.

Lets just make a toy example, so people can run:

library(GenomicRanges)

library(GenomicFeatures)
# The riboseq data

riboseq <- GRanges("1", IRanges(25, 25), "+") # In reality I have 103 datasets of 20M reads.

# Here are the 4.5 Million ORFs

 ORF <- GRanges(seqnames = "1",
                ranges = IRanges(start = c(1, 10, 20), end = c(5, 15, 25)),
                strand = "+")
 names(ORF) <- c("tx1", "tx1", "tx1")
 grl <- GRangesList(tx1_1 = ORF)

cov <- coverageByTranscript(riboseq, grl) # this is fast, < 1 min for whole set, great function!

# I acctually use my custome function here, since coverageByTranscript struggles with seqlengths.

Now I have the Rle.

I want to subset by frames(1,2,3).

Two ways that work are:

# lets make a toy example for first frame

lapply(cov, function(x){
      x[c(T,F,F)]
    })

# or

library(data.table)

dt <- as.data.table(cov)

dt[, .SD[seq.int(1, .N, 3)], by = group]

 

First one will never finish on 4.5 million, second one is faster, but never finished on whole set.

Another problem is that the conversion to data.table creates a 10 GB object, which I dont want.

Is there a smart function to do this ? (Right now im looking through the github code of S4Vectors and IRanges)

The Rle object of the coverage is only 450 MB, so in theory the frames should be 1/3 of that.
A funny note on that is that the names of the Rle is 95% of the total size..

Thank you.

 

UPDATE: I found a faster version, but still not finishing on the whole set:

positionFrame <- lapply(lengths(cov), function(x){seq.int(1, x, 3)}) # very fast
cov[positionFrame] # still too slow..

UPDATE2: I will try some heuristic improvements:

I made a unique orf function, so that I only do each unique ORF once(unique orf by sequence and exons), I also only check the ORFs that have ribo-seq hits, since the ORF score of no-hit ORFs are always 0.

This reduces the sets to 25% of original size.

Rle Big data coverageByTranscript • 1.3k views
ADD COMMENT
1
Entering edit mode
@hauken_heyken-13992
Last seen 2.1 years ago
Bergen

Ok, will answer this one my self.

I found a sufficient answer, which uses S4Vectors math operators.

A very fast way to find which frame a read hit is on in a big RleList:

This example finds the frame of  the read in all ORFs with only 1 hit.

cov # the Rle

sum(runLength(cov)[(cumsum(runValue(cov)) == 0) | (runValue(cov) == 1)]) %% 3

 

1 is inframe, 2 is 1 shifted right, 0 is 2 shifted right.

Rle is a fantastic class, that should get extended feature set in S4Vectors, thank you.

ADD COMMENT

Login before adding your answer.

Traffic: 623 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6