Better way to find head to head GRanges?
2
0
Entering edit mode
@charles-plessy-7857
Last seen 14 months ago
Japan

In order to detect head-to-head promoters I am looking for a good way to detect head-to-head genomic ranges, like B and C in the example below.

                         ------C---->  
     <---A---  <---B---          <----D----

They are sometimes also called divergent or bidirectional and are of biological interests. Here are two examples from the littereature, With CAGE data they are found in transcribed enhancer regions (Andersson 2014). In a GRO-seq analysis (Tome 2018), they have been categorised as "divergent pairs" of TSS when they are less than 300 bp apart.

I am tempted to implement head-to-head promoter detection in CAGEr. At the moment I found the way that I describe below, but I wonder if:

  • This is already done in a Bioconductor packages that I can depend on ?
  • There is a more efficient or elegant way to do it ?

Here is how I would do at the moment.

As a source of GRanges an example, I am taking a GRanges object from the AnnotationHub. Keep only genes, their names and their biotype to simplify example output.

ah = AnnotationHub::AnnotationHub()
gtf <- ah[["AH64858"]]
gr <- gtf[gtf$type == "gene"]
mcols(gr) <- DataFrame(gr$gene_id, gr$gene_name, gr$gene_biotype)
gr$source <- gr$phase <- gr$score <- gr$type <- gr$gene_version <- NULL

Index the ranges by their order.

names(gr) <- 1:length(gr)

Extract their 5-prime ends, copy the index, then sort.

p <- GPos(promoters(gr, 0, 1))
names(p) <- names(gr)
p <- sort(p, ignore.strand = TRUE)

Stitch the next range on the same line as the current one. Remove the pairs that are on different chromosomes.

p$Next <- c(p[-1], p[1])
p <- p[seqnames(p) == seqnames(p$Next)]

Find which pairs have a plus / minus (head to head) orientation.

p$strandStrand <- paste(strand(p), strand(p$Next), sep = "")
plusMinus <- which(p$strandStrand == "-+")
hh <- p[plusMinus]

Use the indexes to retrieve the original ranges.

i1 <- names(hh)
i2 <- names(hh$Next)
sort(c(gr[i1], gr[i2]), ignore.strand=TRUE)
## GRanges object with 9384 ranges and 3 metadata columns:
##           seqnames        ranges strand |         gr.gene_id      gr.gene_name
##              <Rle>     <IRanges>  <Rle> |        <character>       <character>
##       6          1 381289-396363      - | ENSTRUG00000009462             cox10
##       7          1 425325-445021      + | ENSTRUG00000009368 si:ch211-216b21.2
##       8          1 488176-495178      - | ENSTRUG00000009358              <NA>
##      11          1 509902-520076      + | ENSTRUG00000009053              sgca
##      14          1 551027-551347      - | ENSTRUG00000007292              <NA>
##     ...        ...           ...    ... .                ...               ...
##   20921 HE596567.1     3463-3571      + | ENSTRUG00000024526           RF00001
##   21029 HE597694.1        45-776      - | ENSTRUG00000019592              <NA>
##   21030 HE597694.1     1132-2168      + | ENSTRUG00000024912              <NA>
##   18947 HE598556.1     1435-2893      - | ENSTRUG00000001164          prss59.1
##   18948 HE598556.1    7000-11144      + | ENSTRUG00000022925            mrpl17
##         gr.gene_biotype
##             <character>
##       6  protein_coding
##       7  protein_coding
##       8  protein_coding
##      11  protein_coding
##      14  protein_coding
##     ...             ...
##   20921            rRNA
##   21029  protein_coding
##   21030  protein_coding
##   18947  protein_coding
##   18948  protein_coding
##   -------
##   seqinfo: 1627 sequences (1 circular) from FUGU5 genome; no seqlengths

Store the head to head pair information in an object that represents span of the head to head pair and provides the coordinates of the gap within. Keep only the pairs that are separated by less than 50 nt.

HH <- GRanges(seqnames(gr[i1]), IRanges(start(gr[i1]), end(gr[i2])))
names(HH) <- names(hh)
HH$interval <- GRanges(seqnames(hh), IRanges(pos(hh), pos(hh$Next)))
within50 <- width(HH$interval) < 50
sort(c(gr[i1][within50], gr[i2][within50]), ignore.strand=TRUE)
## GRanges object with 136 ranges and 3 metadata columns:
##           seqnames            ranges strand |         gr.gene_id
##              <Rle>         <IRanges>  <Rle> |        <character>
##     467          1   7758838-7765115      - | ENSTRUG00000020313
##     468          1   7765148-7774656      + | ENSTRUG00000025597
##     957          1 15182345-15189325      - | ENSTRUG00000019879
##     958          1 15189333-15194992      + | ENSTRUG00000017600
##    1329          1 22606072-22608556      - | ENSTRUG00000020162
##     ...        ...               ...    ... .                ...
##   18436 HE591882.1       41358-43374      + | ENSTRUG00000025239
##   18721 HE591938.1         3285-3398      - | ENSTRUG00000025173
##   18722 HE591938.1         3402-8100      + | ENSTRUG00000024399
##   19233 HE592076.1         6177-9784      - | ENSTRUG00000002645
##   19234 HE592076.1        9831-14047      + | ENSTRUG00000003176
##              gr.gene_name gr.gene_biotype
##               <character>     <character>
##     467            znf668  protein_coding
##     468            znf646  protein_coding
##     957            Scrn3a  protein_coding
##     958              CIR1  protein_coding
##    1329             maip1  protein_coding
##     ...               ...             ...
##   18436              lsm1  protein_coding
##   18721           RF00020           snRNA
##   18722 NXPH4 (1 of many)  protein_coding
##   19233              <NA>  protein_coding
##   19234              tlr2  protein_coding
##   -------
##   seqinfo: 1627 sequences (1 circular) from FUGU5 genome; no seqlengths
promoter • 1.4k views
ADD COMMENT
0
Entering edit mode

Hi Charles, can you please define what you mean by "head to head ranges"? I don't want to have to go thru the algorithm you're showing above to try and guess. I've learned that it's a lot more productive to spend some time defining exactly "what" we want to achieve before we work on the "how". Thanks!

ADD REPLY
0
Entering edit mode

Thanks ! I edited the question to add background information.

ADD REPLY
1
Entering edit mode
ATpoint ★ 4.5k
@atpoint-13662
Last seen 2 days ago
Germany

I would tackle this using findOverlaps. Essentially what you need are overlaps between promoters that are on different strands. For this lets define promoters as 300bp upstream of the start coordinate and then find overlaps, followed by retrieval of the pairs:

## get promoters as 300bp upstream
proms <- promoters(gr, upstream=300, downstream=0, use.names=TRUE)

## separate by strand
Plus_strand  <- proms[ strand(proms) == "+" ,]
Minus_strand <- proms[ strand(proms) == "-" ,]

## find overlaps between the promoter regions for genes on different strands
oLap <- findOverlaps(Plus_strand, Minus_strand, ignore.strand = TRUE)

## write into a data.frame (or any format you want)
Pairs <- data.frame( Plus  = data.frame(gr[strand(gr) == "+"][oLap@from]),
                     Minus = data.frame(gr[strand(gr) == "-"][oLap@to]))

With your example data I find 658 pairs with the above code.

ADD COMMENT
0
Entering edit mode

Thanks, your solution is very neat and executes faster for the same window size. As your solution requires to set the upstream window before starting the search, it can not find head-to-head GRanges regardless of their distance. But in many situations this is not required.

ADD REPLY
1
Entering edit mode
@herve-pages-1542
Last seen 1 day ago
Seattle, WA, United States

Thanks for explaining the head-to-head thing, that helps a lot.

Here is a fast implementation based on findOverlaps():

findAllHeadToHeadRanges <- function(x, maxgap=50)
{
    stopifnot(is(x, "GenomicRanges"))
    y <- invertStrand(x)
    ans <- as(findOverlaps(x, y, maxgap=maxgap, select="all"),
              "SelfHits")

    ## Keep hits that put 'x' and 'y' in a head to head
    ## configuration.
    ans_from <- from(ans)
    ans_to <- to(ans)
    xx <- x[ans_from]
    yy <- y[ans_to]
    code <- pcompare(xx, yy)
    keep_idx <- (strand(xx) == "-") & (code %in% c(-6L, -5L))
    ans <- ans[keep_idx]

    ## Add 'gap' metadata column.
    ans_from <- from(ans)
    ans_to <- to(ans)
    mcols(ans)$gap <- start(y)[ans_to] - end(x)[ans_from] - 1L

    ans
}

Example:

gr <- GRanges(c("chr1:51-80:-",
                "chr1:131-150:+",
                "chr1:80-105:-",
                "chr1:50-79:-",
                "chr1:160-170:-",
                "chr1:80-105:+",
                "chr1:155-180:+",
                "chr1:51-70:-"))

hits1 <- findAllHeadToHeadRanges(gr)
hits1
# SelfHits object with 5 hits and 1 metadata column:
#            from        to |       gap
#       <integer> <integer> | <integer>
#   [1]         1         2 |        50
#   [2]         3         2 |        25
#   [3]         3         7 |        49
#   [4]         4         6 |         0
#   [5]         8         6 |         9
#   -------
#   nnode: 8

The returned object is a SelfHits object that contains all the pairs that are in a head-to-head configuration with a gap no greater than maxgap. For each pair from and to are the indices of the ranges in gr that form the pair. from is the index of the range on the minus strand and to the index of the range on the plus strand.

This SelfHits object can be used in many different ways e.g. you can turn it into a data frame with as.data.frame(hits1) or use it directly to construct a Pairs object:

Pairs(gr, gr, gap=mcols(hits1)$gap, hits=hits1)
# Pairs object with 5 pairs and 1 metadata column:
#               first         second |       gap
#           <GRanges>      <GRanges> | <integer>
#   [1]  chr1:51-80:- chr1:131-150:+ |        50
#   [2] chr1:80-105:- chr1:131-150:+ |        25
#   [3] chr1:80-105:- chr1:155-180:+ |        49
#   [4]  chr1:50-79:-  chr1:80-105:+ |         0
#   [5]  chr1:51-70:-  chr1:80-105:+ |         9

As its name suggests findAllHeadToHeadRanges() returns all pairs that are in a head-to-head configuration and with a gap no greater than maxgap. For example, with the following ranges

                                   -----D---->
                                ---C--->
          <---A---  <----B----            

it will return all the possible pairs if the gap between A and D is <= maxgap, that is: (A,C), (A,D), (B,C), and (B,D), Even though this is unlikely to happen with real biological data, it's better to not assume that this will never happen.

Here is an alternative implementation that only returns the closest pairs (i.e. (B,C) in the above example):

findClosestHeadToHeadRanges <- function(x, maxgap=50)
{
    stopifnot(is(x, "GenomicRanges"))
    if (!isSingleNumber(maxgap))
        stop(wmsg("'maxgap' must be a single number"))
    promo <- promoters(x, upstream=0, downstream=1)
    promo_seqnames <- S4Vectors:::decodeRle(seqnames(promo))
    promo_start <- start(promo)
    promo_strand <- S4Vectors:::decodeRle(strand(promo))
    oo <- order(promo_seqnames, promo_start, promo_strand)
    sorted_strand <- Rle(promo_strand[oo])
    Lidx <- cumsum(runLength(sorted_strand))[runValue(sorted_strand) == "-"]
    ans_from <- oo[Lidx]
    ans_to <- oo[Lidx + 1L]
    ans_gap <- promo_start[ans_to] - promo_start[ans_from] - 1L
    ok <- promo_seqnames[ans_from] == promo_seqnames[ans_to] &
          ans_gap <= maxgap
    keep_idx <- which(ok)
    ans_from <- ans_from[keep_idx]
    ans_to <- ans_to[keep_idx]
    ans_gap <- ans_gap[keep_idx]
    SelfHits(ans_from, ans_to, nnode=length(x), gap=ans_gap)
}

Then:

hits2 <- findClosestHeadToHeadRanges(gr)
hits2
# SelfHits object with 2 hits and 1 metadata column:
#            from        to |       gap
#       <integer> <integer> | <integer>
#   [1]         4         6 |         0
#   [2]         3         2 |        25
#   -------
#   nnode: 8

Pairs(gr, gr, gap=mcols(hits2)$gap, hits=hits2)
# Pairs object with 2 pairs and 1 metadata column:
#               first         second |       gap
#           <GRanges>      <GRanges> | <integer>
#   [1]  chr1:50-79:-  chr1:80-105:+ |         0
#   [2] chr1:80-105:- chr1:131-150:+ |        25

Note that with your original example data, using findAllHeadToHeadRanges() or findClosestHeadToHeadRanges() doesn't make any difference: both return 68 pairs.

H.

ADD COMMENT
1
Entering edit mode

I forgot to mention that allowing bigger gaps will of course make a difference between the 2 functions. For example calling them on your original example data with maxgap=1e4 returns 5532 and 3477 pairs, respectively. And the bigger the gap the more "polluted" the result returned by findAllHeadToHeadRanges() will be.

OTOH findClosestHeadToHeadRanges(gr) always returns a "clean" result in the sense that the gap between the 2 heads of any pair in the result is guaranteed to not contain the start of any gene. It can even be used with maxgap=Inf in which case it returns all closest head-to-head pairs with no limit on the size of the gaps.

Calling findClosestHeadToHeadRanges(gr, maxgap=Inf) on your original example data returns 4693 pairs:

hits3 <- findClosestHeadToHeadRanges(gr, maxgap=Inf)
length(hits3)
# [1] 4693

The smallest and biggest gaps are 1 and 286880 nucleotides, respectively:

pairs <- Pairs(gr, gr, gap=mcols(hits3)$gap, hits=hits3)

min(mcols(pairs)$gap)
# [1] 1

max(mcols(pairs)$gap)
# [1] 286880

and are obtained for the following gene pairs:

pairs[mcols(pairs)$gap == 1]
# Pairs object with 2 pairs and 1 metadata column:
#                      first               second |       gap
#                  <GRanges>            <GRanges> | <integer>
#   [1]  4:8468564-8472877:-  4:8472879-8478957:+ |         1
#   [2] 11:6664196-6674024:- 11:6674026-6675963:+ |         1

pairs[mcols(pairs)$gap == 286880]
# Pairs object with 1 pair and 1 metadata column:
#                            first                     second |       gap
#                        <GRanges>                  <GRanges> | <integer>
#   [1] HE591497.1:352842-390707:- HE591497.1:677588-680955:+ |    286880

paired_gene_ids <- data.frame(left_gene_id=mcols(first(pairs))$gene_id,
                              right_gene_id=mcols(second(pairs))$gene_id,
                              gap=mcols(pairs)$gap)

subset(paired_gene_ids, gap == 1)
#            left_gene_id      right_gene_id gap
# 777  ENSTRUG00000001336 ENSTRUG00000025835   1
# 2032 ENSTRUG00000001161 ENSTRUG00000025516   1

subset(paired_gene_ids, gap == 286880)
#            left_gene_id      right_gene_id    gap
# 3979 ENSTRUG00000022609 ENSTRUG00000005969 286880
ADD REPLY

Login before adding your answer.

Traffic: 516 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6