Calculating and plotting AT content in a tiling window across the genome with GRanges
1
0
Entering edit mode
@atotalamateur-22296
Last seen 5.0 years ago

Hi, Very new to R here. I have been trying to plot the relationship between AT content and proteins binding to a particualr stretch of DNA, as in a heat map or the figures seen in this paper: https://www.ncbi.nlm.nih.gov/pubmed/29267285. However, I have been having a lot of confusion with how exactly to go about doing this. I understand how to tile the genome into windows with tileGenome, this outputs a GRanges List, not an object. I also understand how to calcualte the GC content of specified ranges, such as "gcContent(Hsapiens[["chr1"]])". I don't understand how to go about merging these two approaches to calculate the AT content (1-GC content) of each interval in the GRanges List, as if I had a single GRanges object I could use something like:

> windowViews <- Views(BSgenome.Mmusculus.UCSC.mm10, windowRanges) gcFrequency <- letterFrequency(windowViews, letters="GC", as.prob=TRUE)

or

letterFrequency2 <- function(x, letters, OR="|",
                         as.prob=FALSE, ...)   {
stopifnot(is(x, "BSgenomeViews"))
chunksize <- 500000L
chunks <- breakInChunks(length(x), chunksize)
chunks <- as(chunks, "IRanges")
ans_chunks <- lapply(seq_along(chunks),
    function(i) {
        x_chunk <- extractROWS(x, chunks[i])
        letterFrequency(x_chunk, letters, OR=OR,
                        as.prob=as.prob)
    })
do.call(rbind, ans_chunks)

}

But I clearly am confused on what arguments to pass to View, how to store the AT content calculation for each range, and how to plot this (at all, or against the binding preferences of a specific protein). In short, I am not sure how to calculate and store the GC/AT contents for windows of arbitrary size across the genome. Any advice you can provide on how to do this would be great, sorry if this is extremely trivial!

Also, the above code was take from H. Page in a previous question, full credit.

bioconductor biostrings gc content biostring granges • 1.3k views
ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 3 hours ago
Seattle, WA, United States

Hi,

From ?tileGenome:

Value:

     If ‘cut.last.tile.in.chrom’ is ‘FALSE’ (the default), a
     GRangesList object with one list element per tile, each
     of them containing a number of genomic ranges equal
     to the number of chromosomes it overlaps with. Note
     that when the tiles are small (i.e.  much smaller than the
     chromosomes), most of them only overlap with a single
     chromosome.

     If ‘cut.last.tile.in.chrom’ is ‘TRUE’, a GRanges object
     with one element (i.e. one genomic range) per tile.

So:

library(BSgenome.Mmusculus.UCSC.mm10)
library(GenomicRanges)

tiles <- tileGenome(seqinfo(BSgenome.Mmusculus.UCSC.mm10), tilewidth=1e6,
                    cut.last.tile.in.chrom=TRUE)
genome_views <- Views(BSgenome.Mmusculus.UCSC.mm10, tiles)

### Loads the entire genome (3Gb) in memory at once.
AT_frequency <- letterFrequency(genome_views, letters="AT")

Takes about 24 sec on my laptop.

If memory usage is an issue, letterFrequency2() can be used instead of the letterFrequency(). It reduces memory usage by processing the views by batch so loading only the part of the genome that corresponds to the current batch. Here is a parallelized version of letterFrequency2():

library(BiocParallel)
letterFrequency2p <- function(x, letters, OR="|", as.prob=FALSE, ...,
                              BPPARAM=NULL)
{
    stopifnot(is(x, "BSgenomeViews")) 
    nchunk <- ceiling(sum(width(x)) / 5e8)
    if (!is.null(BPPARAM)) {
        stopifnot(is(BPPARAM, "BiocParallelParam"))
        nchunk <- nchunk * bpnworkers(BPPARAM)
    }
    chunks <- breakInChunks(length(x), nchunk=nchunk)
    chunks <- as(chunks, "IRanges")
    process_chunk <- function(i) {
        x_chunk <- extractROWS(x, chunks[i])
        letterFrequency(x_chunk, letters, OR=OR, as.prob=as.prob, ...)
    }
    if (is.null(BPPARAM)) {
        ans_chunks <- lapply(seq_along(chunks), process_chunk)
    } else {
        ans_chunks <- bplapply(seq_along(chunks), process_chunk,
                               BPPARAM=BPPARAM)
    }
    do.call(rbind, ans_chunks)
}

letterFrequency2p(genome_views, letters="AT", BPPARAM=MulticoreParam(6)) takes 7 sec on my laptop.

H.

ADD COMMENT

Login before adding your answer.

Traffic: 669 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6