Question

Better way to find head to head GRanges?

0

Entering edit mode

Charles Plessy ▴ 180

@charles-plessy-7857

Last seen 19 months ago

Japan

In order to detect head-to-head promoters I am looking for a good way to detect head-to-head genomic ranges, like B and C in the example below.

                         ------C---->  
     <---A---  <---B---          <----D----

They are sometimes also called divergent or bidirectional and are of biological interests. Here are two examples from the littereature, With CAGE data they are found in transcribed enhancer regions (Andersson 2014). In a GRO-seq analysis (Tome 2018), they have been categorised as "divergent pairs" of TSS when they are less than 300 bp apart.

I am tempted to implement head-to-head promoter detection in CAGEr. At the moment I found the way that I describe below, but I wonder if:

This is already done in a Bioconductor packages that I can depend on ?
There is a more efficient or elegant way to do it ?

Here is how I would do at the moment.

As a source of GRanges an example, I am taking a GRanges object from the AnnotationHub. Keep only genes, their names and their biotype to simplify example output.

ah = AnnotationHub::AnnotationHub()
gtf <- ah[["AH64858"]]
gr <- gtf[gtf$type == "gene"]
mcols(gr) <- DataFrame(gr$gene_id, gr$gene_name, gr$gene_biotype)
gr$source <- gr$phase <- gr$score <- gr$type <- gr$gene_version <- NULL

Index the ranges by their order.

names(gr) <- 1:length(gr)

Extract their 5-prime ends, copy the index, then sort.

p <- GPos(promoters(gr, 0, 1))
names(p) <- names(gr)
p <- sort(p, ignore.strand = TRUE)

Stitch the next range on the same line as the current one. Remove the pairs that are on different chromosomes.

p$Next <- c(p[-1], p[1])
p <- p[seqnames(p) == seqnames(p$Next)]

Find which pairs have a plus / minus (head to head) orientation.

p$strandStrand <- paste(strand(p), strand(p$Next), sep = "")
plusMinus <- which(p$strandStrand == "-+")
hh <- p[plusMinus]

Use the indexes to retrieve the original ranges.

i1 <- names(hh)
i2 <- names(hh$Next)
sort(c(gr[i1], gr[i2]), ignore.strand=TRUE)

## GRanges object with 9384 ranges and 3 metadata columns:
##           seqnames        ranges strand |         gr.gene_id      gr.gene_name
##              <Rle>     <IRanges>  <Rle> |        <character>       <character>
##       6          1 381289-396363      - | ENSTRUG00000009462             cox10
##       7          1 425325-445021      + | ENSTRUG00000009368 si:ch211-216b21.2
##       8          1 488176-495178      - | ENSTRUG00000009358              <NA>
##      11          1 509902-520076      + | ENSTRUG00000009053              sgca
##      14          1 551027-551347      - | ENSTRUG00000007292              <NA>
##     ...        ...           ...    ... .                ...               ...
##   20921 HE596567.1     3463-3571      + | ENSTRUG00000024526           RF00001
##   21029 HE597694.1        45-776      - | ENSTRUG00000019592              <NA>
##   21030 HE597694.1     1132-2168      + | ENSTRUG00000024912              <NA>
##   18947 HE598556.1     1435-2893      - | ENSTRUG00000001164          prss59.1
##   18948 HE598556.1    7000-11144      + | ENSTRUG00000022925            mrpl17
##         gr.gene_biotype
##             <character>
##       6  protein_coding
##       7  protein_coding
##       8  protein_coding
##      11  protein_coding
##      14  protein_coding
##     ...             ...
##   20921            rRNA
##   21029  protein_coding
##   21030  protein_coding
##   18947  protein_coding
##   18948  protein_coding
##   -------
##   seqinfo: 1627 sequences (1 circular) from FUGU5 genome; no seqlengths

Store the head to head pair information in an object that represents span of the head to head pair and provides the coordinates of the gap within. Keep only the pairs that are separated by less than 50 nt.

HH <- GRanges(seqnames(gr[i1]), IRanges(start(gr[i1]), end(gr[i2])))
names(HH) <- names(hh)
HH$interval <- GRanges(seqnames(hh), IRanges(pos(hh), pos(hh$Next)))
within50 <- width(HH$interval) < 50
sort(c(gr[i1][within50], gr[i2][within50]), ignore.strand=TRUE)

## GRanges object with 136 ranges and 3 metadata columns:
##           seqnames            ranges strand |         gr.gene_id
##              <Rle>         <IRanges>  <Rle> |        <character>
##     467          1   7758838-7765115      - | ENSTRUG00000020313
##     468          1   7765148-7774656      + | ENSTRUG00000025597
##     957          1 15182345-15189325      - | ENSTRUG00000019879
##     958          1 15189333-15194992      + | ENSTRUG00000017600
##    1329          1 22606072-22608556      - | ENSTRUG00000020162
##     ...        ...               ...    ... .                ...
##   18436 HE591882.1       41358-43374      + | ENSTRUG00000025239
##   18721 HE591938.1         3285-3398      - | ENSTRUG00000025173
##   18722 HE591938.1         3402-8100      + | ENSTRUG00000024399
##   19233 HE592076.1         6177-9784      - | ENSTRUG00000002645
##   19234 HE592076.1        9831-14047      + | ENSTRUG00000003176
##              gr.gene_name gr.gene_biotype
##               <character>     <character>
##     467            znf668  protein_coding
##     468            znf646  protein_coding
##     957            Scrn3a  protein_coding
##     958              CIR1  protein_coding
##    1329             maip1  protein_coding
##     ...               ...             ...
##   18436              lsm1  protein_coding
##   18721           RF00020           snRNA
##   18722 NXPH4 (1 of many)  protein_coding
##   19233              <NA>  protein_coding
##   19234              tlr2  protein_coding
##   -------
##   seqinfo: 1627 sequences (1 circular) from FUGU5 genome; no seqlengths

promoter • 1.6k views

ADD COMMENT • link updated 4.9 years ago by Hervé Pagès 16k • written 5.0 years ago by Charles Plessy ▴ 180

0

Entering edit mode

Hi Charles, can you please define what you mean by "head to head ranges"? I don't want to have to go thru the algorithm you're showing above to try and guess. I've learned that it's a lot more productive to spend some time defining exactly "what" we want to achieve before we work on the "how". Thanks!

ADD REPLY • link 4.9 years ago Hervé Pagès 16k

0

Entering edit mode

Thanks ! I edited the question to add background information.

ADD REPLY • link 4.9 years ago Charles Plessy ▴ 180

score 1 · Answer 1 · 2020-05-18

I would tackle this using findOverlaps. Essentially what you need are overlaps between promoters that are on different strands. For this lets define promoters as 300bp upstream of the start coordinate and then find overlaps, followed by retrieval of the pairs:

## get promoters as 300bp upstream
proms <- promoters(gr, upstream=300, downstream=0, use.names=TRUE)

## separate by strand
Plus_strand  <- proms[ strand(proms) == "+" ,]
Minus_strand <- proms[ strand(proms) == "-" ,]

## find overlaps between the promoter regions for genes on different strands
oLap <- findOverlaps(Plus_strand, Minus_strand, ignore.strand = TRUE)

## write into a data.frame (or any format you want)
Pairs <- data.frame( Plus  = data.frame(gr[strand(gr) == "+"][oLap@from]),
                     Minus = data.frame(gr[strand(gr) == "-"][oLap@to]))

With your example data I find 658 pairs with the above code.

score 1 · Answer 2 · 2020-05-19

Thanks for explaining the head-to-head thing, that helps a lot.

Here is a fast implementation based on findOverlaps():

findAllHeadToHeadRanges <- function(x, maxgap=50)
{
    stopifnot(is(x, "GenomicRanges"))
    y <- invertStrand(x)
    ans <- as(findOverlaps(x, y, maxgap=maxgap, select="all"),
              "SelfHits")

    ## Keep hits that put 'x' and 'y' in a head to head
    ## configuration.
    ans_from <- from(ans)
    ans_to <- to(ans)
    xx <- x[ans_from]
    yy <- y[ans_to]
    code <- pcompare(xx, yy)
    keep_idx <- (strand(xx) == "-") & (code %in% c(-6L, -5L))
    ans <- ans[keep_idx]

    ## Add 'gap' metadata column.
    ans_from <- from(ans)
    ans_to <- to(ans)
    mcols(ans)$gap <- start(y)[ans_to] - end(x)[ans_from] - 1L

    ans
}

Example:

gr <- GRanges(c("chr1:51-80:-",
                "chr1:131-150:+",
                "chr1:80-105:-",
                "chr1:50-79:-",
                "chr1:160-170:-",
                "chr1:80-105:+",
                "chr1:155-180:+",
                "chr1:51-70:-"))

hits1 <- findAllHeadToHeadRanges(gr)
hits1
# SelfHits object with 5 hits and 1 metadata column:
#            from        to |       gap
#       <integer> <integer> | <integer>
#   [1]         1         2 |        50
#   [2]         3         2 |        25
#   [3]         3         7 |        49
#   [4]         4         6 |         0
#   [5]         8         6 |         9
#   -------
#   nnode: 8

The returned object is a SelfHits object that contains all the pairs that are in a head-to-head configuration with a gap no greater than maxgap. For each pair from and to are the indices of the ranges in gr that form the pair. from is the index of the range on the minus strand and to the index of the range on the plus strand.

This SelfHits object can be used in many different ways e.g. you can turn it into a data frame with as.data.frame(hits1) or use it directly to construct a Pairs object:

Pairs(gr, gr, gap=mcols(hits1)$gap, hits=hits1)
# Pairs object with 5 pairs and 1 metadata column:
#               first         second |       gap
#           <GRanges>      <GRanges> | <integer>
#   [1]  chr1:51-80:- chr1:131-150:+ |        50
#   [2] chr1:80-105:- chr1:131-150:+ |        25
#   [3] chr1:80-105:- chr1:155-180:+ |        49
#   [4]  chr1:50-79:-  chr1:80-105:+ |         0
#   [5]  chr1:51-70:-  chr1:80-105:+ |         9

As its name suggests findAllHeadToHeadRanges() returns all pairs that are in a head-to-head configuration and with a gap no greater than maxgap. For example, with the following ranges

                                   -----D---->
                                ---C--->
          <---A---  <----B----

it will return all the possible pairs if the gap between A and D is <= maxgap, that is: (A,C), (A,D), (B,C), and (B,D), Even though this is unlikely to happen with real biological data, it's better to not assume that this will never happen.

Here is an alternative implementation that only returns the closest pairs (i.e. (B,C) in the above example):

findClosestHeadToHeadRanges <- function(x, maxgap=50)
{
    stopifnot(is(x, "GenomicRanges"))
    if (!isSingleNumber(maxgap))
        stop(wmsg("'maxgap' must be a single number"))
    promo <- promoters(x, upstream=0, downstream=1)
    promo_seqnames <- S4Vectors:::decodeRle(seqnames(promo))
    promo_start <- start(promo)
    promo_strand <- S4Vectors:::decodeRle(strand(promo))
    oo <- order(promo_seqnames, promo_start, promo_strand)
    sorted_strand <- Rle(promo_strand[oo])
    Lidx <- cumsum(runLength(sorted_strand))[runValue(sorted_strand) == "-"]
    ans_from <- oo[Lidx]
    ans_to <- oo[Lidx + 1L]
    ans_gap <- promo_start[ans_to] - promo_start[ans_from] - 1L
    ok <- promo_seqnames[ans_from] == promo_seqnames[ans_to] &
          ans_gap <= maxgap
    keep_idx <- which(ok)
    ans_from <- ans_from[keep_idx]
    ans_to <- ans_to[keep_idx]
    ans_gap <- ans_gap[keep_idx]
    SelfHits(ans_from, ans_to, nnode=length(x), gap=ans_gap)
}

Then:

hits2 <- findClosestHeadToHeadRanges(gr)
hits2
# SelfHits object with 2 hits and 1 metadata column:
#            from        to |       gap
#       <integer> <integer> | <integer>
#   [1]         4         6 |         0
#   [2]         3         2 |        25
#   -------
#   nnode: 8

Pairs(gr, gr, gap=mcols(hits2)$gap, hits=hits2)
# Pairs object with 2 pairs and 1 metadata column:
#               first         second |       gap
#           <GRanges>      <GRanges> | <integer>
#   [1]  chr1:50-79:-  chr1:80-105:+ |         0
#   [2] chr1:80-105:- chr1:131-150:+ |        25

Note that with your original example data, using findAllHeadToHeadRanges() or findClosestHeadToHeadRanges() doesn't make any difference: both return 68 pairs.

H.