In order to detect head-to-head promoters I am looking for a good way to detect head-to-head genomic ranges, like B and C in the example below.
------C---->
<---A--- <---B--- <----D----
They are sometimes also called divergent or bidirectional and are of biological interests. Here are two examples from the littereature, With CAGE data they are found in transcribed enhancer regions (Andersson 2014). In a GRO-seq analysis (Tome 2018), they have been categorised as "divergent pairs" of TSS when they are less than 300 bp apart.
I am tempted to implement head-to-head promoter detection in CAGEr. At the moment I found the way that I describe below, but I wonder if:
- This is already done in a Bioconductor packages that I can depend on ?
- There is a more efficient or elegant way to do it ?
Here is how I would do at the moment.
As a source of GRanges an example, I am taking a GRanges object from the AnnotationHub. Keep only genes, their names and their biotype to simplify example output.
ah = AnnotationHub::AnnotationHub()
gtf <- ah[["AH64858"]]
gr <- gtf[gtf$type == "gene"]
mcols(gr) <- DataFrame(gr$gene_id, gr$gene_name, gr$gene_biotype)
gr$source <- gr$phase <- gr$score <- gr$type <- gr$gene_version <- NULL
Index the ranges by their order.
names(gr) <- 1:length(gr)
Extract their 5-prime ends, copy the index, then sort.
p <- GPos(promoters(gr, 0, 1))
names(p) <- names(gr)
p <- sort(p, ignore.strand = TRUE)
Stitch the next range on the same line as the current one. Remove the pairs that are on different chromosomes.
p$Next <- c(p[-1], p[1])
p <- p[seqnames(p) == seqnames(p$Next)]
Find which pairs have a plus / minus (head to head) orientation.
p$strandStrand <- paste(strand(p), strand(p$Next), sep = "")
plusMinus <- which(p$strandStrand == "-+")
hh <- p[plusMinus]
Use the indexes to retrieve the original ranges.
i1 <- names(hh)
i2 <- names(hh$Next)
sort(c(gr[i1], gr[i2]), ignore.strand=TRUE)
## GRanges object with 9384 ranges and 3 metadata columns:
## seqnames ranges strand | gr.gene_id gr.gene_name
## <Rle> <IRanges> <Rle> | <character> <character>
## 6 1 381289-396363 - | ENSTRUG00000009462 cox10
## 7 1 425325-445021 + | ENSTRUG00000009368 si:ch211-216b21.2
## 8 1 488176-495178 - | ENSTRUG00000009358 <NA>
## 11 1 509902-520076 + | ENSTRUG00000009053 sgca
## 14 1 551027-551347 - | ENSTRUG00000007292 <NA>
## ... ... ... ... . ... ...
## 20921 HE596567.1 3463-3571 + | ENSTRUG00000024526 RF00001
## 21029 HE597694.1 45-776 - | ENSTRUG00000019592 <NA>
## 21030 HE597694.1 1132-2168 + | ENSTRUG00000024912 <NA>
## 18947 HE598556.1 1435-2893 - | ENSTRUG00000001164 prss59.1
## 18948 HE598556.1 7000-11144 + | ENSTRUG00000022925 mrpl17
## gr.gene_biotype
## <character>
## 6 protein_coding
## 7 protein_coding
## 8 protein_coding
## 11 protein_coding
## 14 protein_coding
## ... ...
## 20921 rRNA
## 21029 protein_coding
## 21030 protein_coding
## 18947 protein_coding
## 18948 protein_coding
## -------
## seqinfo: 1627 sequences (1 circular) from FUGU5 genome; no seqlengths
Store the head to head pair information in an object that represents span of the head to head pair and provides the coordinates of the gap within. Keep only the pairs that are separated by less than 50 nt.
HH <- GRanges(seqnames(gr[i1]), IRanges(start(gr[i1]), end(gr[i2])))
names(HH) <- names(hh)
HH$interval <- GRanges(seqnames(hh), IRanges(pos(hh), pos(hh$Next)))
within50 <- width(HH$interval) < 50
sort(c(gr[i1][within50], gr[i2][within50]), ignore.strand=TRUE)
## GRanges object with 136 ranges and 3 metadata columns:
## seqnames ranges strand | gr.gene_id
## <Rle> <IRanges> <Rle> | <character>
## 467 1 7758838-7765115 - | ENSTRUG00000020313
## 468 1 7765148-7774656 + | ENSTRUG00000025597
## 957 1 15182345-15189325 - | ENSTRUG00000019879
## 958 1 15189333-15194992 + | ENSTRUG00000017600
## 1329 1 22606072-22608556 - | ENSTRUG00000020162
## ... ... ... ... . ...
## 18436 HE591882.1 41358-43374 + | ENSTRUG00000025239
## 18721 HE591938.1 3285-3398 - | ENSTRUG00000025173
## 18722 HE591938.1 3402-8100 + | ENSTRUG00000024399
## 19233 HE592076.1 6177-9784 - | ENSTRUG00000002645
## 19234 HE592076.1 9831-14047 + | ENSTRUG00000003176
## gr.gene_name gr.gene_biotype
## <character> <character>
## 467 znf668 protein_coding
## 468 znf646 protein_coding
## 957 Scrn3a protein_coding
## 958 CIR1 protein_coding
## 1329 maip1 protein_coding
## ... ... ...
## 18436 lsm1 protein_coding
## 18721 RF00020 snRNA
## 18722 NXPH4 (1 of many) protein_coding
## 19233 <NA> protein_coding
## 19234 tlr2 protein_coding
## -------
## seqinfo: 1627 sequences (1 circular) from FUGU5 genome; no seqlengths
Hi Charles, can you please define what you mean by "head to head ranges"? I don't want to have to go thru the algorithm you're showing above to try and guess. I've learned that it's a lot more productive to spend some time defining exactly "what" we want to achieve before we work on the "how". Thanks!
Thanks ! I edited the question to add background information.