Memory-efficient method to calculate GC content of millions of GRanges
2
0
Entering edit mode
jma1991 ▴ 70
@jma1991-11856
Last seen 20 months ago
Cumbernauld

I need to calculate the GC content of windows across the entire mouse genome. The windows are not regularly spaced and are different sizes. In total there are 6,665,053 windows ranging from 1,239 bp to 3,102,000 bp. When I try to calculate the GC content of these windows my computer runs out of memory (16Gb RAM). Is there a more memory-efficient way than this approach?

> windowViews <- Views(BSgenome.Mmusculus.UCSC.mm10, windowRanges)
> gcFrequency <- letterFrequency(windowViews, letters="GC", as.prob=TRUE)

The variable windowRanges is a single GRanges object containing all the window ranges across the entire genome. If I split the ranges by chromosome, would this load/unload each chromosome? Similar to using bsapply to calculate the GC content of each chromsome?

> param <- new("BSParams", X=BSgenome.Mmusculus.UCSC.mm10 , FUN=letterFrequency)
> bsapply(param, letters="GC", as.prob=TRUE)
granges biostrings bsapply BSgenome • 2.7k views
ADD COMMENT
0
Entering edit mode

Right now letterFrequency,BSgenomeViews is loading all of the views into an XStringSet, which will consume a lot of memory. Iterating over the sequences does seem like a better strategy.
 

ADD REPLY
5
Entering edit mode
@herve-pages-1542
Last seen 4 hours ago
Seattle, WA, United States

Hi,

Iterating over the sequences should help. Another strategy is to process the BSgenomeViews object itself by chunk:

## A memory-efficient "letterFrequency" method for
## BSgenomeViews objects.
letterFrequency2 <- function(x, letters, OR="|",
                             as.prob=FALSE, ...)
{
    stopifnot(is(x, "BSgenomeViews"))
    chunksize <- 500000L
    chunks <- breakInChunks(length(x), chunksize)
    chunks <- as(chunks, "IRanges")
    ans_chunks <- lapply(seq_along(chunks),
        function(i) {
            x_chunk <- extractROWS(x, chunks[i])
            letterFrequency(x_chunk, letters, OR=OR,
                            as.prob=as.prob)
        })
    do.call(rbind, ans_chunks)
}

Then:

library(BSgenome.Mmusculus.UCSC.mm10)
windowViews <- Views(BSgenome.Mmusculus.UCSC.mm10, windowRanges)
gcFrequency <- letterFrequency2(windowViews, letters="GC", as.prob=TRUE)

At some point I should improve the current implementation of the "letterFrequency" method for BSgenomeViews objects along this idea.

Cheers,

H.

ADD COMMENT
0
Entering edit mode

Thank you for the reply. One question, does creating a BSgenome views object not also load the entire genome into memory?

ADD REPLY
1
Entering edit mode

Hi,

Creating the BSgenomeViews object doesn't load anything from disk. And displaying it only loads the content of the views that are being displayed.

Note that the above implementation of letterFrequency2 loads the content of 500000 views at a time. You can change this by modifying the value of chunksize. I'm thinking of making this a parameter that is stored in the BSgenomeViews object itself and that the user would specify at construction time. There would also be a chunksize getter and setter to let the user access the value and change it afterwards.

Cheers,

H.

ADD REPLY
0
Entering edit mode

Great, thanks for your reply.

ADD REPLY

Login before adding your answer.

Traffic: 627 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6