Hi everyone,
I'm learning the course ph525.5x, which is a fantastic course to learn Bioconductor I think. When I repeated the code "res <- precede(erbs, ghs)
" as at 5:10 of this video, I could not get the same result with the instructor. Then I looked though precede()
and found that it does not perform exactly as what is described in its help page. There is a sentence in "Details" of ?`precede,GenomicRanges,GenomicRanges-method
`:
"x on * strand can match to ranges on any of +, - or * strands. In the case of a tie the first range by order is chosen."
First look at the source code.
> getMethod(precede, signature(x="GenomicRanges", subject="GenomicRanges")) Method Definition: function (x, subject = x, ...) { .local <- function (x, subject, select = c("arbitrary", "all"), ignore.strand = FALSE) { select <- match.arg(select) .GenomicRanges_findPrecedeFollow(x, subject, select, ignore.strand, "precede") } .local(x, subject, ...) } <environment: namespace:GenomicRanges> Signatures: x subject target "GenomicRanges" "GenomicRanges" defined "GenomicRanges" "GenomicRanges" > GenomicRanges:::.GenomicRanges_findPrecedeFollow function (query, subject, select, ignore.strand, where = c("precede", "follow")) { if (!length(query) || !length(subject)) return(Hits(nLnode = length(query), nRnode = length(subject), sort.by.query = TRUE)) leftOf <- "precede" == match.arg(where) if (ignore.strand) strand(query) <- strand(subject) <- "+" if (leftOf) { plusfun <- function(xstart, xend, ystart, yend, sentinel) .GenomicRanges_findNearest0(xend, ystart, sentinel, leftOf) minusfun <- function(xstart, xend, ystart, yend, sentinel) .GenomicRanges_findNearest0(xstart, yend, sentinel, !leftOf) } else { plusfun <- function(xstart, xend, ystart, yend, sentinel) .GenomicRanges_findNearest0(xstart, yend, sentinel, leftOf) minusfun <- function(xstart, xend, ystart, yend, sentinel) .GenomicRanges_findNearest0(xend, ystart, sentinel, !leftOf) } endq <- start(query) + width(query) ends <- start(subject) + width(subject) maxend <- max(max(endq), max(ends)) + 1 lvls <- union(seqlevels(query), seqlevels(subject)) offset <- setNames((seq_along(lvls) - 1) * maxend, lvls) stopifnot(typeof(offset) == "double") sentinel <- c(0, seq_along(lvls) * maxend) queryOff <- unname(offset[as.character(seqnames(query))]) queryStart <- start(query) + queryOff queryEnd <- end(query) + queryOff qid <- seq_along(query) subjectOff <- unname(offset[as.character(seqnames(subject))]) subjectStart <- start(subject) + subjectOff subjectEnd <- end(subject) + subjectOff spid <- which(strand(subject) != "-") smid <- which(strand(subject) != "+") idx <- which(strand(query) == "+") phit <- plusfun(queryStart[idx], queryEnd[idx], subjectStart[spid], subjectEnd[spid], sentinel) phit <- .Hits(qid[idx][queryHits(phit)], spid[subjectHits(phit)]) idx <- which(strand(query) == "-") mhit <- minusfun(queryStart[idx], queryEnd[idx], subjectStart[smid], subjectEnd[smid], sentinel) mhit <- .Hits(qid[idx][queryHits(mhit)], smid[subjectHits(mhit)]) idx <- which(strand(query) == "*") bhit <- local({ qid <- qid[idx] phit <- plusfun(queryStart[idx], queryEnd[idx], subjectStart, subjectEnd, sentinel) .Hits(qid[queryHits(phit)], subjectHits(phit), length(query), length(subject)) }) qryHits <- c(queryHits(phit), queryHits(mhit), queryHits(bhit)) subjHits <- c(subjectHits(phit), subjectHits(mhit), subjectHits(bhit)) if ("arbitrary" == select) { hits <- integer() hits[length(query)] <- NA_integer_ idx <- !duplicated(qryHits) hits[qryHits[idx]] <- subjHits[idx] } else { hits <- .Hits(qryHits, subjHits, length(query), length(subject)) } hits } <environment: namespace:GenomicRanges>
In fact, that sentence is right but misleading. For the ranges of x on "*" strand, what precede()
does is identical to precede(., ignore.strand = TRUE)
, say, just to treat all ranges x on * strand, and all ranges of subject as they are all on the "+" strand. However, what users desire (also what the instructor desire) is to treat the x as they were on "+" and do precede() with ranges of subject on "+" and "*", then treat x as they were on "-" and do precede() with ranges of subject on "-" and "*", then return the range for each x that is nearest to x among which x precede on either strand. We want it that way because it is very useful to find the nearest gene that each TFBS peak precede.
I write some code to realize that demand.
precedeGenes <- function(x, genes) { xPlus <- GRanges(x, strand = "+") xMinus <- GRanges(x, strand = "-") idxGenesPlus <- precede(xPlus, genes, ignore.strand = FALSE) idxGenesMinus <- precede(xMinus, genes, ignore.strand = FALSE) disPlus <- rep(Inf, length(x)) disMinus <- rep(Inf, length(x)) disPlus[!is.na(idxGenesPlus)] <- distance(xPlus[!is.na(idxGenesPlus)], genes[na.omit(idxGenesPlus)]) disMinus[!is.na(idxGenesMinus)] <- distance(xMinus[!is.na(idxGenesMinus)], genes[na.omit(idxGenesMinus)]) dis <- pmin(disPlus, disMinus) idxGenes <- ifelse(disPlus > disMinus, idxGenesMinus, idxGenesPlus) return(idxGenes) } precedeGenes(erbs, ghs) [1] 22817 16173 21772 7870 20199 22131 13257 20469 21943 8564 [11] 19061 15836 21282 8959 278 14620 1657 22774 21746 10777 [21] 14568 20910 6626 12830 9838 19991 21643 15740 12911 13326 [31] 8303 20834 934 10485 23027 4443 6613 21335 21125 14508 [41] 15979 2693 4122 9485 17665 18376 17932 2696 16990 16952 [51] 15735 9386 1642 3634 6037 20891 15062 1779 3316 11746 [61] 12168 21689 22406 19979 14099 13558 3653 6077 21280 9705 [71] 13095 3322 6343 1553 9552
It is right but very fragile in the wild. So I wish the precede()
could be modified to do the right thing. Now I have some idea but not complete. In the bold Italic,
phit <- plusfun(queryStart[idx], queryEnd[idx], subjectStart, subjectEnd, sentinel)
can be replaced by
phit <- plusfun(queryStart[idx], queryEnd[idx], subjectStart[spid], subjectEnd[spid], sentinel)
mhit <- minusfun(queryStart[idx], queryEnd[idx], subjectStart[smid], subjectEnd[smid], sentinel)
Then use some code to combine them appropriately.
You can reproduce results in the video as below.
# library(devtools) # install_github("genomicsclass/ERBS") library(ERBS) library(Homo.sapiens) data(HepG2) data(GM12878) res <- findOverlaps(HepG2, GM12878) index <- queryHits(res) erbs <- HepG2[index, ] erbs <- granges(erbs) ghs <- genes(Homo.sapiens) res <- precede(erbs, ghs) res [1] 16845 16173 21772 13617 18834 15172 4543 22522 21943 22493 [11] 846 15836 3903 21053 18112 17682 19650 22112 21746 22839 [21] 14568 9043 20477 12830 9841 5379 21643 14441 12911 553 [31] 7980 6028 22699 4480 11293 20120 11859 17016 14199 6719 [41] 22946 2693 6923 20570 6680 15953 17932 20311 8049 11969 [51] 3738 2914 1642 4173 6443 4446 15062 11268 16695 3496 [61] 8458 568 22406 3201 14099 13558 8545 7569 13718 16702 [71] 13095 17686 19348 12849 9552 precede(erbs, ghs, ignore.strand = TRUE) [1] 16845 16173 21772 13617 18834 15172 4543 22522 21943 22493 [11] 846 15836 3903 21053 18112 17682 19650 22112 21746 22839 [21] 14568 9043 20477 12830 9841 5379 21643 14441 12911 553 [31] 7980 6028 22699 4480 11293 20120 11859 17016 14199 6719 [41] 22946 2693 6923 20570 6680 15953 17932 20311 8049 11969 [51] 3738 2914 1642 4173 6443 4446 15062 11268 16695 3496 [61] 8458 568 22406 3201 14099 13558 8545 7569 13718 16702 [71] 13095 17686 19348 12849 9552 identical(precede(erbs, ghs, ignore.strand = TRUE), precede(erbs, ghs, ignore.strand = FALSE)) # should not be TRUE I think [1] TRUE precedeGenes <- function(x, genes) { xPlus <- GRanges(x, strand = "+") xMinus <- GRanges(x, strand = "-") idxGenesPlus <- precede(xPlus, genes, ignore.strand = FALSE) idxGenesMinus <- precede(xMinus, genes, ignore.strand = FALSE) disPlus <- rep(Inf, length(x)) disMinus <- rep(Inf, length(x)) disPlus[!is.na(idxGenesPlus)] <- distance(xPlus[!is.na(idxGenesPlus)], genes[na.omit(idxGenesPlus)]) disMinus[!is.na(idxGenesMinus)] <- distance(xMinus[!is.na(idxGenesMinus)], genes[na.omit(idxGenesMinus)]) dis <- pmin(disPlus, disMinus) idxGenes <- ifelse(disPlus > disMinus, idxGenesMinus, idxGenesPlus) return(idxGenes) } precedeGenes(erbs, ghs) # the result is identical to that in the video [1] 22817 16173 21772 7870 20199 22131 13257 20469 21943 8564 [11] 19061 15836 21282 8959 278 14620 1657 22774 21746 10777 [21] 14568 20910 6626 12830 9838 19991 21643 15740 12911 13326 [31] 8303 20834 934 10485 23027 4443 6613 21335 21125 14508 [41] 15979 2693 4122 9485 17665 18376 17932 2696 16990 16952 [51] 15735 9386 1642 3634 6037 20891 15062 1779 3316 11746 [61] 12168 21689 22406 19979 14099 13558 3653 6077 21280 9705 [71] 13095 3322 6343 1553 9552
sessionInfo() R version 3.3.2 (2016-10-31) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 7 x64 (build 7601) Service Pack 1 locale: [1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936 [2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936 [3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936 [4] LC_NUMERIC=C [5] LC_TIME=Chinese (Simplified)_People's Republic of China.936 attached base packages: [1] parallel stats4 stats graphics grDevices utils [7] datasets methods base other attached packages: [1] Homo.sapiens_1.3.1 [2] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 [3] org.Hs.eg.db_3.4.0 [4] GO.db_3.4.0 [5] OrganismDbi_1.16.0 [6] GenomicFeatures_1.26.2 [7] GenomicRanges_1.26.1 [8] GenomeInfoDb_1.10.1 [9] AnnotationDbi_1.36.0 [10] IRanges_2.8.1 [11] S4Vectors_0.12.1 [12] Biobase_2.34.0 [13] BiocGenerics_0.20.0 [14] ERBS_1.0 loaded via a namespace (and not attached): [1] graph_1.52.0 Rcpp_0.12.8 [3] XVector_0.14.0 zlibbioc_1.20.0 [5] GenomicAlignments_1.10.0 BiocParallel_1.8.1 [7] lattice_0.20-34 tools_3.3.2 [9] grid_3.3.2 SummarizedExperiment_1.4.0 [11] DBI_0.5-1 RBGL_1.50.0 [13] digest_0.6.10 Matrix_1.2-7.1 [15] rtracklayer_1.34.1 bitops_1.0-6 [17] RCurl_1.95-4.8 biomaRt_2.30.0 [19] memoise_1.0.0 RSQLite_1.1-1 [21] BiocInstaller_1.24.0 Biostrings_2.42.1 [23] Rsamtools_1.26.1 XML_3.98-1.5
Thanks in advance.
Can
Hi Michael,
I agree that adding options adds complexity, and appreciate that bioconductor people do so much work to make functions complete, including considering all combinations of options, and make exhaustive documents. In this aspect it does more work and does better than bedtools I think. It is really a hard task. Thank you all.
As you mentioned, resize(), flank(), promoters() take strand into account (it is meaningful for genes), but shift() and narrow() do not, maybe they are important in some case without genes I guess. What makes me satisfied is that when some ranges are on "*" strand, promoters() treats them as on "+" strand and throws out a warning. It make more sense than just do things silently. When some ranges were out of bounds, warnings would also be showed out. That makes users know maybe they should do things more reasonably and carefully by themselves. So even ignore.strand is not that satisfied in some case, if people knew what happens behind it, they could modify their input and call functions in a more reasonable way. So warnings can be a good thing to work together with nonsense combinations of parameters and can save you from deciding what are better results in a nonsense calling from users. Because in that case you can throw an somewhat arbitrary result and tell users they should know the result is nonsense for their bad use of the function.
I can imagine some inconsistency about meaning of '*' strand may come from the real world, not bioconductor. And as you said, "Changing default behavior, for full consistency, would be difficult now." I agree. Because people have been adapted to current behaviors of those functions. In that context, maybe more training and examples can make things better. Fortunately, your HelloRanges package is really a fantastic translator from bedtools to GenomicRanges, which would help people who have been already familiar with bedtools to do real tasks more efficiently in bioconductor. Thanks for that!
Return back to my post and Janet's GenomicRanges bug in follow?. I think the version of precede() and follow() before Janet's post are better than now. Because both of us can achieve our demands without a lot of codes. He can use ignore.strand = TRUE, I can use ignore.strand = FALSE or by default. When all the ranges in x (or query) and subject are on * strand, to make less people confused, you can throw out a warning: "precede() or follow() treats ranges on '*' as on both '+' and '-' strands, so it returns the nearest but not overlapped range(s) in subject for each range in x. If you want to do precede() or follow() on '+' strand, please use ignore.strand = TRUE, or change strands of x and subject to + manually". To take '*' as 'both' is more useful than 'none' in this case. Because I can solve the tasks:
(1) Set x as TF peaks and subject as genes and use precede() to find nearest but not overlapped gene(s) ('gene' when select = "arbitrary", 'gene or genes' when select = "all") that each TF peak precede on either strand (means in the 5' upstream of genes).
(2) Set x as genes, subject as TF peaks, follow() can find nearest but not overlapped TF(s) that each gene follow (means the TF is in its 5' upstream).
(3) Set x as TF1 peaks, subject as TF2 peaks, both have '*', precede() and follow() will both find the nearest but not overlapped TF2 peak(s) for each TF1 peak, which is useful (after filtered with distance()) to find co-regulators both bound to DNA. In that case, precede() and follow() are not equal to nearest() where 'nearest' including 'overlapped', and they still make sense so they can not be replaced by nearest(). Thus, what Hervé Pagès said in Janet's post is not totally right. I know, the last case is a little hacky, maybe nearestButNotOverlapped() can be created by users themselves to do that thing.
Maybe there are still some aspects not covered, but I think I have done my best to try to use precede() and follow() in as many cases as I can.
Thanks a lot to you and Valerie. Actions of you are much harder than words of mine. Appreciate your work on bioconductor.
Can
It would be nice to have a version of nearest() that does not allow overlaps. I discovered that when developing HelloRanges. Maybe a new verb, adjacent()? I kind of prefer that to a parameter like ignore.overlaps, just because it's simpler to list operations as functions, compared to function/parameter combinations. We want to avoid a combinatorial explosion, but I think we're far from that in this case. Then again, nearest(x, y, ignore.overlaps=TRUE) would be very readable.
Yes, I agree that adjacent() would be clearer and easier to maintain than nearest() with ignore.overlaps. nearest(x, y, ignore.overlaps = TRUE) is readable for users and can be a substitute for adjacent() temporarily.
Back to the post.
You said "It's not clear how the gene use case requires treating "*" in subject as "both". " The task (2) in my last reply gives a case that is clear. That is for each gene to find adjacent TFs in its 5' upstream.
I will illustrate current version of precede() in graphs (follow() is similar):
(1) For ranges on '+' in x
subspace:+ direction:+ precede( + in x, + and * in subject) --|
find nearest for each + range in x (just use the res on +)
subspace:- direction:- do nothing-------------------------|
(2) For ranges on '-' in x
subspace:+ direction:+ do nothing-------------------------|
find nearest for each - range in x (just use the res on -)
subspace:- direction:- precede( - in x, - and * in subject) --|
(3) For ranges on '*' in x
subspace:+ direction:+ precede( * in x, + and - and * in subject) --|
find nearest for each * range in x (just use res on +)
subspace:- direction:- do nothing-----------------------------------|
The (1) and (2) can make sure the second task I mentioned can be solved. The (3) is wrong and you agree.
(3)' in old version of precede():
subspace:+ direction:+ precede( * in x, + and * in subject) --|
find nearest for each * range in x
subspace:- direction:- precede( * in x, - and * in subject) --|
This (3)' would cause Janet's problem, but can be solved by throwing warnings. And this (3)' can solve task (1) and (3) in my reply.
Another (3)'':
subspace:+ direction:+ precede( * in x, + and * in subject) --|
find nearest for each * range in x
subspace:- direction:- precede( * in x, - in subject) --------|
This (3)'' can solve Janet's problem. And it can also solve task (1) in my post. Even though it can't solve task(3), adjacent() is more suitable for that task.
So which one will your team accept? Or some other solution that can solve both the problem of Janet and problem in this post appropriately?
Can
I don't think a gene can follow/precede a peak, because a peak has no orientation. It's better to ask, which peaks precede (the transcription of) a gene, and then find the closest peak for each gene. The
precede()
,follow()
interface is too low-level to make that simple. One option would be a high-level interface that returned a GRanges of the peak hits, formally grouped by gene. Then have a high-level function that finds the closest within each group. But given that we do not have such a high-level interface yet, maybe specify which argument determines the orientation:By default, the orientation comes from the subject.
Hi Michael,
As you may notice that. precede(A, B) are not same with follow(B, A). precede(A, B): for each range in A, find the nearest range in B it precedes, the result has the same length with A. follow(B, A): for each range B, find the nearest range in A it follows, the result has the same length with B. precede(A, B) focus on A, but follow(B, A) focus on B. So when I want to know for each peak, which gene's upstream it lies on. I can use precede(peaks, genes). The (3)' or (3)'' can let precede() do that. When I want to know for each gene, which peak lies on its upstream, I can use follow(genes, peaks). The (1) and (2) can help follow() do that. And this is indeed "to ask, which peaks precede (the transcription of) a gene, and then find the closest peak for each gene."
The old version of precede() and follow() have already had the capability to do the above two things. So we don't need any parameter like select = "nearest".
Can you tell me how to install an old version of GenomicRanges, for example, GenomicRanges_1.20.8. Maybe latter I can give an example to show you those things I imagine.
Can
As I tried to say in the previous post (which I later edited), I think the contract of the function would be simpler if we just say that the orientation is determined by one of the arguments, default "subject". If the orientation argument has "*" strand, it means "none". Then, Janet's use case works as expected, by default, as it is now. Your use case with the gene as the subject will also just work. If the gene is "x", just pass orientation="x".
It is a good idea to use
orientation
! I agree that it can sovle Janet's case, my case in this post (also task (1)), and task (2). Thanks for the long discussion.At last, I want to show how to solve Janet's problem and task(1)(2)(3) with old precede() and follow().
I installed
Bioconductor version 3.0 (BiocInstaller 1.16.5)
andGenomicRanges_1.18.4
onR 3.1.3
.In the end, hope orientation can be implemented to solve those problems. Thanks a lot to you and Valerie.
Best regards
Can