Question

Obtain coordinates of 5' upstream region up to closest coding region

eggrandio • 0

Hi,

I am analysing transcription factor binding and need to retrieve the promoter region of every gene in the genome. I can easily obtain the 2 kb upstream of each gene from a TxDb object with:

library(GenomicRanges)
library(TxDb.Athaliana.BioMart.plantsmart28)
gnflanks = flank(genes(TxDb.Athaliana.BioMart.plantsmart28), width=2000)

However, some of these regions overlap coding regions of other genes located upstream.

I would like to obtain the upstream sequence only up to the nearest coding region, capped at 2 kb (so the full 2 kb if no gene overlaps).

I know how to subtract the coding regions from the promoters, but if some "promoter" region remains on the far side of the upstream gene, it will be retained.

example:

upstream region:

========================(TSS)gene

overlapping gene:

     ======
========================(TSS)gene

If I remove it, I am left with:

=====      =============(TSS)gene

And I want to retrieve only:

           =============

Thanks in advance!


updated 6.1 years ago by Hervé Pagès 16k • written 6.1 years ago by eggrandio • 0

score 0 · Answer 1 · 2019-03-08

Hi,

So IIUC, for each gene you want the longest upstream region that is <= 2kb and does not overlap with any known CDS. Put another way, you want to minimally trim (on the 5' side) the upstream regions in gnflanks so that they are CDS-free. Note that this trimming will sometimes result in an empty upstream region. This happens for genes where the upstream position immediately adjacent to the gene is already in a known CDS (presumably from another gene). Maybe that's a very rare and/or unrealistic occurrence, but for the sake of robustness our code should handle it. It does so by shrinking the upstream region to a zero-width range.

A naive but not very efficient way to do the above (reusing gnflanks from your code) is:

cds <- cds(TxDb.Athaliana.BioMart.plantsmart28)

## This lapply() loop takes about 40 min on my laptop!
gnflanks2 <- lapply(seq_along(gnflanks),
    function(i) {
        gnflank <- gnflanks[i]
        upstream_ranges <- setdiff(gnflank, cds)
        if (as.logical(strand(gnflank) == "+")) {
            ## We're on the plus strand
            upstream_range <- tail(upstream_ranges, n=1)
            if (length(upstream_range) == 0 ||
                end(upstream_range) != end(gnflank))
            {
                ## Shrink to zero-width range
                upstream_range <- gnflank
                start(upstream_range) <- end(upstream_range) + 1
            }
        } else {
            ## We're on the minus strand
            upstream_range <- head(upstream_ranges, n=1)
            if (length(upstream_range) == 0 ||
                start(upstream_range) != start(gnflank))
            {
                ## Shrink to zero-width range
                upstream_range <- gnflank
                end(upstream_range) <- start(upstream_range) - 1
            }
        }
        upstream_range
    })

gnflanks2 <- do.call("c", gnflanks2)
names(gnflanks2) <- names(gnflanks)  # propagate the gene ids

gnflanks2 is a GRanges object parallel to GRanges object gnflanks, that is, the 2 objects have the same length and the i-th range in one object corresponds to the i-th range in the other. Furthermore, each range in gnflanks2 is either the same as the corresponding range in gnflanks (if no CDS got in the way), or a trimmed version of it (if one or more CDS got in the way).
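
If the loop is too slow, the same trimming can probably be done without looping over the genes. The following is only a sketch under stated assumptions, not a tested replacement: trimFlanksToNearestCDS is a hypothetical helper name (not part of any package), and the toy flanks and cds objects are made up for illustration. It uses findOverlaps() to find the CDS hitting each flank, then moves the 5' end of the flank just past the CDS closest to the gene. Like setdiff() in the loop, it is strand-aware by default; pass ignore.strand=TRUE if a CDS on either strand should block the upstream region.

```r
library(GenomicRanges)

## Hypothetical helper (not from any package): trim each flank on its 5' side
## so that what remains is CDS-free and still adjacent to the gene.
trimFlanksToNearestCDS <- function(flanks, cds, ignore.strand=FALSE) {
    hits <- findOverlaps(flanks, cds, ignore.strand=ignore.strand)
    q <- queryHits(hits)
    s <- subjectHits(hits)

    new_start <- start(flanks)
    new_end <- end(flanks)
    ## As in the loop, anything that is not "+" is handled as minus strand
    on_plus <- as.logical(strand(flanks) == "+")

    ## Plus strand: the gene sits right after end(flank), so keep only what
    ## lies to the right of the rightmost overlapping CDS.
    keep <- on_plus[q]
    if (any(keep)) {
        max_cds_end <- tapply(end(cds)[s[keep]], q[keep], max)
        idx <- as.integer(names(max_cds_end))
        ## Capping at end(flank) yields a zero-width range when the CDS
        ## reaches the position adjacent to the gene
        new_start[idx] <- pmin(as.integer(max_cds_end), new_end[idx]) + 1L
    }

    ## Minus strand: the gene sits right before start(flank), so keep only
    ## what lies to the left of the leftmost overlapping CDS.
    keep <- !on_plus[q]
    if (any(keep)) {
        min_cds_start <- tapply(start(cds)[s[keep]], q[keep], min)
        idx <- as.integer(names(min_cds_start))
        new_end[idx] <- pmax(as.integer(min_cds_start), new_start[idx]) - 1L
    }

    start(flanks) <- new_start
    end(flanks) <- new_end
    flanks
}

## Toy data (made up): three 100-bp flanks and five CDS on one chromosome
flanks <- GRanges("chr1", IRanges(c(101, 501, 901), width=100),
                  strand=c("+", "-", "+"))
names(flanks) <- c("geneA", "geneB", "geneC")
cds <- GRanges("chr1",
               IRanges(start=c(91, 131, 521, 561, 990),
                       end=c(110, 140, 530, 570, 1010)),
               strand=c("+", "+", "-", "-", "+"))

trimmed <- trimFlanksToNearestCDS(flanks, cds)
trimmed
```

On the real data the call would be trimFlanksToNearestCDS(gnflanks, cds). The result is parallel to the input and keeps its names, so it should be directly comparable with gnflanks2 from the loop above.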

Lightly tested only. Hope this helps.

H.