Question

`precede` does not perform as expected

0

Entering edit mode

wcstcyx ▴ 30

@wcstcyx-11636

Last seen 7.1 years ago

China/Beijing/AMSS,CAS

Hi everyone,

I'm learning the course ph525.5x, which is a fantastic course to learn Bioconductor I think. When I repeated the code "res <- precede(erbs, ghs)" as at 5:10 of this video, I could not get the same result with the instructor. Then I looked though precede() and found that it does not perform exactly as what is described in its help page. There is a sentence in "Details" of ?`precede,GenomicRanges,GenomicRanges-method`:

"x on * strand can match to ranges on any of +, - or * strands. In the case of a tie the first range by order is chosen."

First look at the source code.

> getMethod(precede, signature(x="GenomicRanges", subject="GenomicRanges"))
Method Definition:

function (x, subject = x, ...)
{
    .local <- function (x, subject, select = c("arbitrary", "all"),
        ignore.strand = FALSE)
    {
        select <- match.arg(select)
        .GenomicRanges_findPrecedeFollow(x, subject, select,
            ignore.strand, "precede")
    }
    .local(x, subject, ...)
}
<environment: namespace:GenomicRanges>

Signatures:
        x               subject       
target  "GenomicRanges" "GenomicRanges"
defined "GenomicRanges" "GenomicRanges"

> GenomicRanges:::.GenomicRanges_findPrecedeFollow
function (query, subject, select, ignore.strand, where = c("precede",
    "follow"))
{
    if (!length(query) || !length(subject))
        return(Hits(nLnode = length(query), nRnode = length(subject),
            sort.by.query = TRUE))
    leftOf <- "precede" == match.arg(where)
    if (ignore.strand)
        strand(query) <- strand(subject) <- "+"
    if (leftOf) {
        plusfun <- function(xstart, xend, ystart, yend, sentinel) .GenomicRanges_findNearest0(xend,
            ystart, sentinel, leftOf)
        minusfun <- function(xstart, xend, ystart, yend, sentinel) .GenomicRanges_findNearest0(xstart,
            yend, sentinel, !leftOf)
    }
    else {
        plusfun <- function(xstart, xend, ystart, yend, sentinel) .GenomicRanges_findNearest0(xstart,
            yend, sentinel, leftOf)
        minusfun <- function(xstart, xend, ystart, yend, sentinel) .GenomicRanges_findNearest0(xend,
            ystart, sentinel, !leftOf)
    }
    endq <- start(query) + width(query)
    ends <- start(subject) + width(subject)
    maxend <- max(max(endq), max(ends)) + 1
    lvls <- union(seqlevels(query), seqlevels(subject))
    offset <- setNames((seq_along(lvls) - 1) * maxend, lvls)
    stopifnot(typeof(offset) == "double")
    sentinel <- c(0, seq_along(lvls) * maxend)
    queryOff <- unname(offset[as.character(seqnames(query))])
    queryStart <- start(query) + queryOff
    queryEnd <- end(query) + queryOff
    qid <- seq_along(query)
    subjectOff <- unname(offset[as.character(seqnames(subject))])
    subjectStart <- start(subject) + subjectOff
    subjectEnd <- end(subject) + subjectOff
    spid <- which(strand(subject) != "-")
    smid <- which(strand(subject) != "+")
    idx <- which(strand(query) == "+")
    phit <- plusfun(queryStart[idx], queryEnd[idx], subjectStart[spid],
        subjectEnd[spid], sentinel)
    phit <- .Hits(qid[idx][queryHits(phit)], spid[subjectHits(phit)])
    idx <- which(strand(query) == "-")
    mhit <- minusfun(queryStart[idx], queryEnd[idx], subjectStart[smid],
        subjectEnd[smid], sentinel)
    mhit <- .Hits(qid[idx][queryHits(mhit)], smid[subjectHits(mhit)])
    idx <- which(strand(query) == "*")
    bhit <- local({
        qid <- qid[idx]
        phit <- plusfun(queryStart[idx], queryEnd[idx], subjectStart,
            subjectEnd, sentinel)
        .Hits(qid[queryHits(phit)], subjectHits(phit), length(query),
            length(subject))
    })
    qryHits <- c(queryHits(phit), queryHits(mhit), queryHits(bhit))
    subjHits <- c(subjectHits(phit), subjectHits(mhit), subjectHits(bhit))
    if ("arbitrary" == select) {
        hits <- integer()
        hits[length(query)] <- NA_integer_
        idx <- !duplicated(qryHits)
        hits[qryHits[idx]] <- subjHits[idx]
    }
    else {
        hits <- .Hits(qryHits, subjHits, length(query), length(subject))
    }
    hits
}
<environment: namespace:GenomicRanges>

In fact, that sentence is right but misleading. For the ranges of x on "*" strand, what precede() does is identical to precede(., ignore.strand = TRUE), say, just to treat all ranges x on * strand, and all ranges of subject as they are all on the "+" strand. However, what users desire (also what the instructor desire) is to treat the x as they were on "+" and do precede() with ranges of subject on "+" and "*", then treat x as they were on "-" and do precede() with ranges of subject on "-" and "*", then return the range for each x that is nearest to x among which x precede on either strand. We want it that way because it is very useful to find the nearest gene that each TFBS peak precede.

I write some code to realize that demand.

precedeGenes <- function(x, genes) {
  xPlus <- GRanges(x, strand = "+")
  xMinus <- GRanges(x, strand = "-")
  idxGenesPlus <- precede(xPlus, genes, ignore.strand = FALSE)
  idxGenesMinus <- precede(xMinus, genes, ignore.strand = FALSE)
  disPlus <- rep(Inf, length(x))
  disMinus <- rep(Inf, length(x))
  disPlus[!is.na(idxGenesPlus)] <- distance(xPlus[!is.na(idxGenesPlus)], genes[na.omit(idxGenesPlus)])
  disMinus[!is.na(idxGenesMinus)] <- distance(xMinus[!is.na(idxGenesMinus)], genes[na.omit(idxGenesMinus)])
  dis <- pmin(disPlus, disMinus)
  idxGenes <- ifelse(disPlus > disMinus, idxGenesMinus, idxGenesPlus)
  return(idxGenes)
}

precedeGenes(erbs, ghs)

 [1] 22817 16173 21772  7870 20199 22131 13257 20469 21943  8564
[11] 19061 15836 21282  8959   278 14620  1657 22774 21746 10777
[21] 14568 20910  6626 12830  9838 19991 21643 15740 12911 13326
[31]  8303 20834   934 10485 23027  4443  6613 21335 21125 14508
[41] 15979  2693  4122  9485 17665 18376 17932  2696 16990 16952
[51] 15735  9386  1642  3634  6037 20891 15062  1779  3316 11746
[61] 12168 21689 22406 19979 14099 13558  3653  6077 21280  9705
[71] 13095  3322  6343  1553  9552

It is right but very fragile in the wild. So I wish the precede() could be modified to do the right thing. Now I have some idea but not complete. In the bold Italic,

phit <- plusfun(queryStart[idx], queryEnd[idx], subjectStart, 
            subjectEnd, sentinel)

can be replaced by

phit <- plusfun(queryStart[idx], queryEnd[idx], subjectStart[spid], 
        subjectEnd[spid], sentinel)

mhit <- minusfun(queryStart[idx], queryEnd[idx], subjectStart[smid], 
        subjectEnd[smid], sentinel)

Then use some code to combine them appropriately.

You can reproduce results in the video as below.

# library(devtools)
# install_github("genomicsclass/ERBS")
library(ERBS)
library(Homo.sapiens)
data(HepG2)
data(GM12878)
res <- findOverlaps(HepG2, GM12878)
index <- queryHits(res)
erbs <- HepG2[index, ]
erbs <- granges(erbs)
ghs <- genes(Homo.sapiens)
res <- precede(erbs, ghs)
res
 [1] 16845 16173 21772 13617 18834 15172  4543 22522 21943 22493
[11]   846 15836  3903 21053 18112 17682 19650 22112 21746 22839
[21] 14568  9043 20477 12830  9841  5379 21643 14441 12911   553
[31]  7980  6028 22699  4480 11293 20120 11859 17016 14199  6719
[41] 22946  2693  6923 20570  6680 15953 17932 20311  8049 11969
[51]  3738  2914  1642  4173  6443  4446 15062 11268 16695  3496
[61]  8458   568 22406  3201 14099 13558  8545  7569 13718 16702
[71] 13095 17686 19348 12849  9552

precede(erbs, ghs, ignore.strand = TRUE)
 [1] 16845 16173 21772 13617 18834 15172  4543 22522 21943 22493
[11]   846 15836  3903 21053 18112 17682 19650 22112 21746 22839
[21] 14568  9043 20477 12830  9841  5379 21643 14441 12911   553
[31]  7980  6028 22699  4480 11293 20120 11859 17016 14199  6719
[41] 22946  2693  6923 20570  6680 15953 17932 20311  8049 11969
[51]  3738  2914  1642  4173  6443  4446 15062 11268 16695  3496
[61]  8458   568 22406  3201 14099 13558  8545  7569 13718 16702
[71] 13095 17686 19348 12849  9552

identical(precede(erbs, ghs, ignore.strand = TRUE), precede(erbs, ghs, ignore.strand = FALSE)) # should not be TRUE I think
[1] TRUE

precedeGenes <- function(x, genes) {
  xPlus <- GRanges(x, strand = "+")
  xMinus <- GRanges(x, strand = "-")
  idxGenesPlus <- precede(xPlus, genes, ignore.strand = FALSE)
  idxGenesMinus <- precede(xMinus, genes, ignore.strand = FALSE)
  disPlus <- rep(Inf, length(x))
  disMinus <- rep(Inf, length(x))
  disPlus[!is.na(idxGenesPlus)] <- distance(xPlus[!is.na(idxGenesPlus)], genes[na.omit(idxGenesPlus)])
  disMinus[!is.na(idxGenesMinus)] <- distance(xMinus[!is.na(idxGenesMinus)], genes[na.omit(idxGenesMinus)])
  dis <- pmin(disPlus, disMinus)
  idxGenes <- ifelse(disPlus > disMinus, idxGenesMinus, idxGenesPlus)
  return(idxGenes)
}
precedeGenes(erbs, ghs) # the result is identical to that in the video

 [1] 22817 16173 21772  7870 20199 22131 13257 20469 21943  8564
[11] 19061 15836 21282  8959   278 14620  1657 22774 21746 10777
[21] 14568 20910  6626 12830  9838 19991 21643 15740 12911 13326
[31]  8303 20834   934 10485 23027  4443  6613 21335 21125 14508
[41] 15979  2693  4122  9485 17665 18376 17932  2696 16990 16952
[51] 15735  9386  1642  3634  6037 20891 15062  1779  3316 11746
[61] 12168 21689 22406 19979 14099 13558  3653  6077 21280  9705
[71] 13095  3322  6343  1553  9552

sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936 
[2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936   
[3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
[4] LC_NUMERIC=C                                                   
[5] LC_TIME=Chinese (Simplified)_People's Republic of China.936    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils    
[7] datasets  methods   base     

other attached packages:
 [1] Homo.sapiens_1.3.1                     
 [2] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
 [3] org.Hs.eg.db_3.4.0                     
 [4] GO.db_3.4.0                            
 [5] OrganismDbi_1.16.0                     
 [6] GenomicFeatures_1.26.2                 
 [7] GenomicRanges_1.26.1                   
 [8] GenomeInfoDb_1.10.1                    
 [9] AnnotationDbi_1.36.0                   
[10] IRanges_2.8.1                          
[11] S4Vectors_0.12.1                       
[12] Biobase_2.34.0                         
[13] BiocGenerics_0.20.0                    
[14] ERBS_1.0                               

loaded via a namespace (and not attached):
 [1] graph_1.52.0               Rcpp_0.12.8               
 [3] XVector_0.14.0             zlibbioc_1.20.0           
 [5] GenomicAlignments_1.10.0   BiocParallel_1.8.1        
 [7] lattice_0.20-34            tools_3.3.2               
 [9] grid_3.3.2                 SummarizedExperiment_1.4.0
[11] DBI_0.5-1                  RBGL_1.50.0               
[13] digest_0.6.10              Matrix_1.2-7.1            
[15] rtracklayer_1.34.1         bitops_1.0-6              
[17] RCurl_1.95-4.8             biomaRt_2.30.0            
[19] memoise_1.0.0              RSQLite_1.1-1             
[21] BiocInstaller_1.24.0       Biostrings_2.42.1         
[23] Rsamtools_1.26.1           XML_3.98-1.5

Thanks in advance.

Can

genomicranges precede ph525.5x • 3.6k views

ADD COMMENT • link 8.3 years ago wcstcyx ▴ 30

0

Entering edit mode

Hi Michael,

I agree that adding options adds complexity, and appreciate that bioconductor people do so much work to make functions complete, including considering all combinations of options, and make exhaustive documents. In this aspect it does more work and does better than bedtools I think. It is really a hard task. Thank you all.

As you mentioned, resize(), flank(), promoters() take strand into account (it is meaningful for genes), but shift() and narrow() do not, maybe they are important in some case without genes I guess. What makes me satisfied is that when some ranges are on "*" strand, promoters() treats them as on "+" strand and throws out a warning. It make more sense than just do things silently. When some ranges were out of bounds, warnings would also be showed out. That makes users know maybe they should do things more reasonably and carefully by themselves. So even ignore.strand is not that satisfied in some case, if people knew what happens behind it, they could modify their input and call functions in a more reasonable way. So warnings can be a good thing to work together with nonsense combinations of parameters and can save you from deciding what are better results in a nonsense calling from users. Because in that case you can throw an somewhat arbitrary result and tell users they should know the result is nonsense for their bad use of the function.

I can imagine some inconsistency about meaning of '*' strand may come from the real world, not bioconductor. And as you said, "Changing default behavior, for full consistency, would be difficult now." I agree. Because people have been adapted to current behaviors of those functions. In that context, maybe more training and examples can make things better. Fortunately, your HelloRanges package is really a fantastic translator from bedtools to GenomicRanges, which would help people who have been already familiar with bedtools to do real tasks more efficiently in bioconductor. Thanks for that!

Return back to my post and Janet's GenomicRanges bug in follow?. I think the version of precede() and follow() before Janet's post are better than now. Because both of us can achieve our demands without a lot of codes. He can use ignore.strand = TRUE, I can use ignore.strand = FALSE or by default. When all the ranges in x (or query) and subject are on * strand, to make less people confused, you can throw out a warning: "precede() or follow() treats ranges on '*' as on both '+' and '-' strands, so it returns the nearest but not overlapped range(s) in subject for each range in x. If you want to do precede() or follow() on '+' strand, please use ignore.strand = TRUE, or change strands of x and subject to + manually". To take '*' as 'both' is more useful than 'none' in this case. Because I can solve the tasks:

(1) Set x as TF peaks and subject as genes and use precede() to find nearest but not overlapped gene(s) ('gene' when select = "arbitrary", 'gene or genes' when select = "all") that each TF peak precede on either strand (means in the 5' upstream of genes).

(2) Set x as genes, subject as TF peaks, follow() can find nearest but not overlapped TF(s) that each gene follow (means the TF is in its 5' upstream).

(3) Set x as TF1 peaks, subject as TF2 peaks, both have '*', precede() and follow() will both find the nearest but not overlapped TF2 peak(s) for each TF1 peak, which is useful (after filtered with distance()) to find co-regulators both bound to DNA. In that case, precede() and follow() are not equal to nearest() where 'nearest' including 'overlapped', and they still make sense so they can not be replaced by nearest(). Thus, what Hervé Pagès said in Janet's post is not totally right. I know, the last case is a little hacky, maybe nearestButNotOverlapped() can be created by users themselves to do that thing.

Maybe there are still some aspects not covered, but I think I have done my best to try to use precede() and follow() in as many cases as I can.

Thanks a lot to you and Valerie. Actions of you are much harder than words of mine. Appreciate your work on bioconductor.

Can

ADD REPLY • link 8.3 years ago wcstcyx ▴ 30

0

Entering edit mode

It would be nice to have a version of nearest() that does not allow overlaps. I discovered that when developing HelloRanges. Maybe a new verb, adjacent()? I kind of prefer that to a parameter like ignore.overlaps, just because it's simpler to list operations as functions, compared to function/parameter combinations. We want to avoid a combinatorial explosion, but I think we're far from that in this case. Then again, nearest(x, y, ignore.overlaps=TRUE) would be very readable.

ADD REPLY • link 8.3 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Yes, I agree that adjacent() would be clearer and easier to maintain than nearest() with ignore.overlaps. nearest(x, y, ignore.overlaps = TRUE) is readable for users and can be a substitute for adjacent() temporarily.

Back to the post.

You said "It's not clear how the gene use case requires treating "*" in subject as "both". " The task (2) in my last reply gives a case that is clear. That is for each gene to find adjacent TFs in its 5' upstream.

I will illustrate current version of precede() in graphs (follow() is similar):

(1) For ranges on '+' in x

subspace:+ direction:+ precede( + in x, + and * in subject) --|

find nearest for each + range in x (just use the res on +)

subspace:- direction:- do nothing-------------------------|

(2) For ranges on '-' in x

subspace:+ direction:+ do nothing-------------------------|

find nearest for each - range in x (just use the res on -)

subspace:- direction:- precede( - in x, - and * in subject) --|

(3) For ranges on '*' in x

subspace:+ direction:+ precede( * in x, + and - and * in subject) --|

find nearest for each * range in x (just use res on +)

subspace:- direction:- do nothing-----------------------------------|

The (1) and (2) can make sure the second task I mentioned can be solved. The (3) is wrong and you agree.

(3)' in old version of precede():

subspace:+ direction:+ precede( * in x, + and * in subject) --|

find nearest for each * range in x

subspace:- direction:- precede( * in x, - and * in subject) --|

This (3)' would cause Janet's problem, but can be solved by throwing warnings. And this (3)' can solve task (1) and (3) in my reply.

Another (3)'':

subspace:+ direction:+ precede( * in x, + and * in subject) --|

find nearest for each * range in x

subspace:- direction:- precede( * in x, - in subject) --------|

This (3)'' can solve Janet's problem. And it can also solve task (1) in my post. Even though it can't solve task(3), adjacent() is more suitable for that task.

So which one will your team accept? Or some other solution that can solve both the problem of Janet and problem in this post appropriately?

Can

ADD REPLY • link 8.3 years ago wcstcyx ▴ 30

0

Entering edit mode

I don't think a gene can follow/precede a peak, because a peak has no orientation. It's better to ask, which peaks precede (the transcription of) a gene, and then find the closest peak for each gene. The precede(), follow() interface is too low-level to make that simple. One option would be a high-level interface that returned a GRanges of the peak hits, formally grouped by gene. Then have a high-level function that finds the closest within each group. But given that we do not have such a high-level interface yet, maybe specify which argument determines the orientation:

follow(genes, peaks, orientation="x")

By default, the orientation comes from the subject.

ADD REPLY • link 8.3 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Hi Michael,

As you may notice that. precede(A, B) are not same with follow(B, A). precede(A, B): for each range in A, find the nearest range in B it precedes, the result has the same length with A. follow(B, A): for each range B, find the nearest range in A it follows, the result has the same length with B. precede(A, B) focus on A, but follow(B, A) focus on B. So when I want to know for each peak, which gene's upstream it lies on. I can use precede(peaks, genes). The (3)' or (3)'' can let precede() do that. When I want to know for each gene, which peak lies on its upstream, I can use follow(genes, peaks). The (1) and (2) can help follow() do that. And this is indeed "to ask, which peaks precede (the transcription of) a gene, and then find the closest peak for each gene."

The old version of precede() and follow() have already had the capability to do the above two things. So we don't need any parameter like select = "nearest".

Can you tell me how to install an old version of GenomicRanges, for example, GenomicRanges_1.20.8. Maybe latter I can give an example to show you those things I imagine.

Can

ADD REPLY • link 8.3 years ago wcstcyx ▴ 30

0

Entering edit mode

As I tried to say in the previous post (which I later edited), I think the contract of the function would be simpler if we just say that the orientation is determined by one of the arguments, default "subject". If the orientation argument has "*" strand, it means "none". Then, Janet's use case works as expected, by default, as it is now. Your use case with the gene as the subject will also just work. If the gene is "x", just pass orientation="x".

ADD REPLY • link 8.3 years ago Michael Lawrence ★ 11k

0

Entering edit mode

It is a good idea to use orientation! I agree that it can sovle Janet's case, my case in this post (also task (1)), and task (2). Thanks for the long discussion.

ADD REPLY • link 8.3 years ago wcstcyx ▴ 30

0

Entering edit mode

At last, I want to show how to solve Janet's problem and task(1)(2)(3) with old precede() and follow().

I installed Bioconductor version 3.0 (BiocInstaller 1.16.5) and GenomicRanges_1.18.4 on R 3.1.3.

> (peak1 <- GRanges("chr1", IRanges(c(3, 7), width = 1)))
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1    [3, 3]      *
  [2]     chr1    [7, 7]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> (peak2 <- GRanges("chr1", IRanges(c(1, 3, 5, 7, 9), width = 1)))
GRanges object with 5 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1    [1, 1]      *
  [2]     chr1    [3, 3]      *
  [3]     chr1    [5, 5]      *
  [4]     chr1    [7, 7]      *
  [5]     chr1    [9, 9]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> ## task (3) -- nearest but not overlapped -- adjacent()
> precede(peak1, peak2, select = "all") # nearest but not overlapped
Hits of length 4
queryLength: 2
subjectLength: 5
  queryHits subjectHits 
   <integer>   <integer> 
 1         1           1 
 2         1           3 
 3         2           3 
 4         2           5 
> identical(precede(peak1, peak2, select = "all"),
+           follow(peak1, peak2, select = "all"))
[1] TRUE
> nearest(peak1, peak2, select = "all") # including overlapped
Hits of length 2
queryLength: 2
subjectLength: 5
  queryHits subjectHits 
   <integer>   <integer> 
 1         1           2 
 2         2           4 
> ## Janet's problem can be solved by ignore.strand = TRUE as your anwser in his post
> precede(peak1, peak2)
[1] 1 3
> follow(peak1, peak2)
[1] 1 3
> precede(peak1, peak2, ignore.strand = TRUE)
[1] 3 5
> follow(peak1, peak2, ignore.strand = TRUE)
[1] 1 3
> 
> gene <- rep(peak2, each = 2)
> strand(gene) <- rep(c("+", "-"), times = 5)
> gene
GRanges object with 10 ranges and 0 metadata columns:
       seqnames    ranges strand
          <Rle> <IRanges>  <Rle>
   [1]     chr1    [1, 1]      +
   [2]     chr1    [1, 1]      -
   [3]     chr1    [3, 3]      +
   [4]     chr1    [3, 3]      -
   [5]     chr1    [5, 5]      +
   [6]     chr1    [5, 5]      -
   [7]     chr1    [7, 7]      +
   [8]     chr1    [7, 7]      -
   [9]     chr1    [9, 9]      +
  [10]     chr1    [9, 9]      -
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> ## task (1), also the original case in my post
> ## for each peak, find nearest gene it precede
> precede(peak1, gene, select = "all")
Hits of length 4
queryLength: 2
subjectLength: 10
  queryHits subjectHits 
   <integer>   <integer> 
 1         1           2 
 2         1           5 
 3         2           6 
 4         2           9 
> ## task (2)
> ## for each gene, find nearest peak lying on its upstream
> follow(gene, peak1)
 [1] NA  1 NA  2  1  2  1 NA  2 NA

ADD REPLY • link 8.3 years ago wcstcyx ▴ 30

0

Entering edit mode

In the end, hope orientation can be implemented to solve those problems. Thanks a lot to you and Valerie.

Best regards

Can

ADD REPLY • link 8.3 years ago wcstcyx ▴ 30

score 2 · Accepted Answer · 2016-12-30

2

Entering edit mode

Valerie Obenchain ★ 6.8k

@valerie-obenchain-4275

Last seen 3.2 years ago

United States

Hi Can,

As you can imagine, people have different opinions on 'right behavior' of a function. precede() and follow() handle strand the way they do because of user input at the time they were written.

It sounds like you want to make specific strand comparisons in a certain order. Instead of modifying the function to meet a specific use case it's better to make several calls to the function while imposing your strand restrictions as necessary.

This statement you made above is not true (or I've misinterpreted it):
"... For the ranges of x on "*" strand, what precede() does is identical to precede(., ignore.strand = TRUE), say, just to treat all ranges x on * strand, and all ranges of subject as they are all on the "+" strand. ..."

As stated in the man page, a '*' strand range can match to '+', '-' or '*'. In the case of a tie, the first range by order is chosen.

The '*' query matches to the "+' range subject because it's first by order.

query <- GRanges("A", IRanges(start=5, width=1), strand="*")
subject <- GRanges("A", IRanges(start=c(10, 10), width=1), strand=c("+", "-"))
> precede(query, subject)
[1] 1

Reversing the order of the subject now the '-' range is chosen because it's first by order.

> precede(query, rev(subject))
[1] 1

If you feel the documentation does not accurately reflect the behavior please cite the section and I'll try to make it more clear.

Thanks.
Valerie

ADD COMMENT • link 8.3 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Hi Valerie,

I agree that 'right behavior' of a function is in the eye of the beholder, even though I think what Rafael want precede() to do in the video is more meaningful.

I agree that to make several calls to reproduce the result of Rafael is a better way, and have warpt them up into function precedeGenes showed in my post.

I still persist the statement "For the ranges of x on "*" strand, what precede() does is identical to precede(., ignore.strand = TRUE), say, just to treat all ranges x on * strand, and all ranges of subject as they are all on the "+" strand." is ture. I will illustrate it from source code and examples. First look at the source code of GenomicRanges:::.GenomicRanges_findPrecedeFollow. You can see two segments in it:

if (ignore.strand)
        strand(query) <- strand(subject) <- "+"

and

idx <- which(strand(query) == "*")
bhit <- local({
    qid <- qid[idx]
    phit <- plusfun(queryStart[idx], queryEnd[idx], subjectStart,
         subjectEnd, sentinel)
    .Hits(qid[queryHits(phit)], subjectHits(phit), length(query),
        length(subject))
})

The first segment says that if ignore.strand = TRUE, the function treat all the ranges in query and subject as on "+" strand and do precede in the meaning of + strand. In detail, it does

phit <- plusfun(queryStart[idx], queryEnd[idx], subjectStart[spid],
        subjectEnd[spid], sentinel)

The second segment says that for ranges on * strand in query, do

phit <- plusfun(queryStart[idx], queryEnd[idx], subjectStart,
        subjectEnd, sentinel)

which means for ranges on * strand in query and all ranges on any strand in subject, treat all of them on + strand and do precede in the meaning of + strand. That is same to do precede with ignore.strand = TRUE for the ranges on * strand in query.

Next, I use an example from ?`precede,GenomicRanges,GenomicRanges-method` to show it

query <- GRanges("A", IRanges(10, width=1), c("+", "-", "*"))
subject <- GRanges("A", IRanges(c(5, 5, 5, 15, 15, 15), width=1),
             rep(c("+", "-", "*"), 2))

> query
GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        A  [10, 10]      +
  [2]        A  [10, 10]      -
  [3]        A  [10, 10]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> subject
GRanges object with 6 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        A  [ 5,  5]      +
  [2]        A  [ 5,  5]      -
  [3]        A  [ 5,  5]      *
  [4]        A  [15, 15]      +
  [5]        A  [15, 15]      -
  [6]        A  [15, 15]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

I use query[3] to do test.

> precede(query[3], subject, select = "all")
Hits object with 3 hits and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           4
  [2]         1           5
  [3]         1           6
  -------
  queryLength: 1 / subjectLength: 6

> precede(query[3], subject, select = "all", ignore.strand = TRUE)
Hits object with 3 hits and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           4
  [2]         1           5
  [3]         1           6
  -------
  queryLength: 1 / subjectLength: 6

The results are the same, and query[3] precedes subject[5]!!! Because they are all treated as on + strand and precede is in the meaning of + strand, as I explained before. However, is it meaningful that query[3] precedes subject[5] without ignore.strand = TRUE? No, because neither of them is on the + strand but precede performs in the meaning of + strand! It is easy for users to misuse it.

Your example shows that "In the case of a tie, the first range by order is chosen." That's true. However, it is still not showed by your example what the "match" means in "x on * strand can match to ranges on any of +, - or * strands." In fact, "match" means to do precede in the meaning of + strand. Nevertheless, users won't understand that unless they test that.

Looking forward to your reply.

Thanks,

Can

ADD REPLY • link 8.3 years ago wcstcyx ▴ 30

0

Entering edit mode

Thanks for the small, concise example. You're right, "*" strand is treating all ranges as if on the "+" strand. Here is my svn commit from 2015 confirming it:

------------------------------------------------------------------------
r108978 | v.obenchain | 2015-09-29 14:56:24 -0700 (Tue, 29 Sep 2015) | 3 lines

modify behavior of '*' strand in precede() / follow() to mimic
ignore.strand=TRUE

So the documentation on the man page was never updated to reflect this - my oversight, sorry about that. I have updated it in release (1.26.2) and devel ( 1.27.18). Unfortunately the svn commit doesn't go into detail as to why the behavior was changed to mimic ignore.strand=TRUE. My best guess is that the behavior came up on the mailing list, there was a discussion, and this was the consensus. I'll check with team members next week to see if anyone has better 'historical memory' than myself and remembers why this change was made.

Valerie

ADD REPLY • link 8.3 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Hi Valerie,

Thanks for your reply. I guess one reason for that commit can be that people would expect if both ranges in query and subject are all on * strand, precede() and follow() should behavior as ignore.strand = TRUE, say, perform in the meaning of + strand, which is the strand of reference genome. And precede() and follow() in 1.26.1 really do things that way. However, if precede() and follow() do what I want in my post, when all ranges of query and subject are on * strand, both of them will find the nearest but not overlapped range in subject for each range in query, because precede() and follow() are performed on both strand and give the nearest one. So both precede()s and follow()s in 1.26.1 and in the way I prefer will encounter some case people will misunderstand. However, I still persist that the precede() and follow() I prefer is a better version. Because the + strand and - strand should be considered equal in biology. Genes can appear on both + and - strand with nearly same frequency. People think the + strand should be the default direction of precede() and follow() just because the reference genome is displayed on only one strand of DNA for convenience and that very strand is named + strand. It is more like an artifact and is not fair for - strand in biology. So when ranges in query and subject are all on * strand, neither + strand nor - strand is a good choice, maybe take precede on both strand is a relatively better choice when people do not give an default strand. So I recommend add a parameter "default.strand" with "ignore.strand = TRUE" for precede() and follow(). When ranges in query and subject are all on * strand, if people want to do precede() on + strand, they should set "ignore.strand = TRUE" and default.strand = "+" (that is identical to do follow() on - strand with default.strand = "-"). They can also do precede() on - strand with default.strand = "-", which is identical to do follow() on + strand with default.strand = "+". A complete use of ignore.strand and default.strand can be:

     old                       new

ignore.strand = FALSE <==> ignore.strand = FALSE

ignore.strand = TRUE <==> (ignore.strand = TRUE and default.strand = "+")                      or (ignore.strand = TRUE and default.strand = "-") or    (ignore.strand = TRUE and default.strand = NA)

For findOverlaps(), ignore.strand = TRUE means to take all ranges as on * strand, which means to compare all pairs in query and subject. That is much easier than the case in precede() and follow(). Because precede() and follow() need a default direction in addition. So ignore.strand should be taken carefully by programmers, however the meaning comes from biology. Without biology, it is easy to create some thing that is an artifact.

To make a unified framework and understanding for ignore.strand, I recommend that to set ignore.strand = TRUE is equal to do strand(query) <- "*" and strand(subject) <- "*" and then do some things according to the meaning of the function that is called. If the meaning is still unclear for the function, other parameters specific to that function can be set for the very function itself. This framework can give an unified understanding of ignore.strand and easy for users to remember, and also easy to programming. Because funs can be written as:

fun(query, subject, ignore.strand, other.specific.para.for.this.fun, ...) {

if (ignore.strand) {

 strand(query) <- "*"

 strand(subject) <- "*"

}

.fun(query, subject, other.specific.para.for.this.fun, ...) {# do some thing without need of considering about whether ignore.strand is TRUE or FALSE}

}

It is just a recommend. Hope it can be helpful in some way. :)

Thanks.

Can

ADD REPLY • link 8.3 years ago wcstcyx ▴ 30

0

Entering edit mode

Hi Valerie,

I am so lucky that I found A: GenomicRanges bug in follow? which motivated you to change the behavior of precede() and follow() about ranges in query on * strand. My guess in my last reply is right: "I guess one reason for that commit ..."

That is why Janet Young think the behavior of precede() and follow() should change. Michael Lawrence and Hervé Pagès noticed that many people would be confused by it and you finally changed it. However, after that change, people like me are also confused by the new behavior. For example in my first reply to you:

"However, is it meaningful that query[3] precedes subject[5] without ignore.strand = TRUE? No, because neither of them is on the + strand but precede performs in the meaning of + strand! It is easy for users to misuse it."

So it is a trade off between people's old common sense and justifiability of function respect to the meaning of biology. I think the latter is more important!

Michael Lawrence said

"The "*" could be interpreted as meaning "ignore", in which case behavior should conform to that of ranges(gr) (as well as considering seqnames). That would still be consistent with the current findOverlaps() behavior."

I have also thought about that in my last reply to you. For ranges on * strand, GenomicRanges::findOverlaps() should compare them with ranges on both +/-/* strand, so it happened to behavior like IRanges::findOverlaps(). However,

"That is much easier than the case in precede() and follow(). Because precede() and follow() need a default direction in addition."

Why the default strand should be "+"? Just because people think the default behavior should like IRanges? I can't agree! IRanges is a tool to build GenomicRanges in a convenient way, the principle that instruct what GenomicRanges to do should come from biology, not IRanges. In biology,

"Genes can appear on both + and - strand with nearly same frequency (you can see that in table(strand(genes(TxDb.Hsapiens.UCSC.hg19.knownGene)))). People think the + strand should be the default direction of precede() and follow() just because the reference genome is displayed on only one strand of DNA for convenience and that very strand is named + strand. It is more like an artifact and is not fair for - strand in biology. So when ranges in query and subject are all on * strand, neither + strand nor - strand is a good choice, maybe take precede on both strand is a relatively better choice when people do not give an default strand. So I recommend add a parameter "default.strand" with "ignore.strand = TRUE" for precede() and follow()."

You can see description of ignore.strand in help pages of findOverlaps(), flank() and precede(). Only in help page of precede(), + strand is very special: "When TRUE, strand is set to '+'." That is not very consistent indeed.

I wish bioconductor can rethink about the * strand and how to treat it in a consistent and biologically meaningful way.

Sorry to make troubles for you.

Thanks!

Can

ADD REPLY • link 8.3 years ago wcstcyx ▴ 30

0

Entering edit mode

For the present use case, where the subject ranges have a strand, I agree that we should respect it. It's not clear how the gene use case requires treating "*" in subject as "both". The strand has two functions in GenomicRanges: (1) specifying a subspace used in overlap detection (where ranges on opposite strands do not overlap) and (2) direction. For (1), "*" means "either space", but for (2) I think it should mean "none". The only use case I know of for "direction" is the direction of transcription and any bi-directional transcription should be represented using two separate transcripts/genes.

Thus, the change made in the previous thread was wrong, because it should have only applied when both x and subject had "*" strand, not whenever x had "*" strand. The subject strand is what matters when considering direction.

If we were doing this all over again, I would argue for a more nuanced ignore.strand parameter, which defaults to "subspace" but could also be "direction" (ignore the direction aspect), FALSE (ignore neither aspect) and TRUE (ignore both aspects). It is very surprising to many users that ranges on different strands do not overlap by default.

ADD REPLY • link 8.3 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Hi Michael,

I agree with you on most of your opinions. I think the your discussions under function precede() not working with GRanges can make you recall some original ideas. Most of them are very helpful. At that time people want ignore.strand to be added. Now maybe only ignore.strand is not enough to specify what the function should do, because different function deal with different problems, and ignore.strand would has different meaning. Additional parameters that specific to a function can be added to help explicitly specify the actions. I think bedtools can be a reference to GenomicRanges for some functions (not mean bedtools is better, I prefer GenomicRanges because I can do things in an interactive manner). It has many parameters for different functions to treat strand in a meaningful way. For example, if I want to shift ChIP-seq reads of a histone modification 100 bps from 5' to 3', I can use -p 100 -m -100 according to this page. Now GenomicRanges::shift() can not do this directly. But I can do shift() on either strand differently then combine the results. However, this is not very convenient.

IMO, we bind too much meaning into ignore.strand, which brings ambiguity. I think to add specific parameters to different function will make ignore.strand much clearer and do not need to use a lot of words to illustrate what ignore.strand means in that particular function, because just to illustrate the specific parameters will make things clear, and sometimes those parameters are slef-explanatory. So additional parameters can eliminate ambiguity. ignore.strand really takes too many responsibilities now.

Can

ADD REPLY • link 8.3 years ago wcstcyx ▴ 30

0

Entering edit mode

I agree that a more explicit API can be a good idea. It makes it easier to document, and the code easier to read, up to a point. In this case though, I think your use and Janet's use case are both intuitively satisfied by ignoring strand when the subject has "*", not when x has "*". Adding options also adds complexity, as we feel compelled to support all combinations of options, even when they do not make sense.

Strand handling is a rough spot in GenomicRanges, because it developed in a somewhat organic fashion, and so inconsistencies have arisen. For example, resize() considers strand by default, but not shift(). Changing default behavior, for full consistency, would be difficult now. A lot of these weaknesses were exposed by my HelloRanges package. It compiles bedtools command lines to Ranges-based code. So for your example it shows us the complexity:

> HelloRanges::bedtools_shift("-i my.bed -p 100 -m -100")
{
    genome <- Seqinfo(genome = NULL)
    gr_a <- import("my.bed", genome = genome)
    ans <- gr_a
    ans[strand(ans) == "+"] <- shift(ans[strand(ans) == "+"],
        100)
    ans[strand(ans) == "-"] <- shift(ans[strand(ans) == "-"],
        -100)
    ans
}

ADD REPLY • link 8.3 years ago Michael Lawrence ★ 11k

1

Entering edit mode

After some discussion, the following changes were made to precede() and follow() in GenomicRanges 1.27.21:

1) The 'select' argument has been made consistent with the methods in IRanges: precede() now supports select=c("first", "all") and follow() supports select=c("last", "all").

2) When 'x' is '*' and 'subject' is '-' orientation is from right to left (subject determines strand); case (h) below.

3) When both 'x' and 'subject' are on '*' strand, both 'x' and 'subject' are set to '+'.

This summary of orientations has been included on the man page:

       x  |  subject  |  orientation
     -----+-----------+----------------
a)     +  |  +        |  --->
b)     +  |  -        |  NA
c)     +  |  *        |  --->
d)     -  |  +        |  NA
e)     -  |  -        |  <---
f)     -  |  *        |  <---
g)     *  |  +        |  --->
h)     *  |  -        |  <---
i)     *  |  *        |  --->  (the only situation where * arbitrarily means +)

Case (h) should address your original concern. Thanks to Herve for the ascii art and unit tests.

Valerie

ADD REPLY • link 8.2 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Thanks very much for your work!!!

Can

ADD REPLY • link 8.2 years ago wcstcyx ▴ 30

0

Entering edit mode

I can use the new precede() with selectNearest() to finish my task now.

subjectHits(selectNearest(precede(erbs, ghs, select = "all"), erbs, ghs))

Thanks!

By the way, selectNearest() in the IRanges package is not in the "Usage" part of its help page. Hope you can complete it.

Can

ADD REPLY • link 8.0 years ago wcstcyx ▴ 30

0

Entering edit mode

The new nearest(), precede() and follow() in GenomicRanges 1.27 consider strand in a more reasonable way as you showed in the beautiful ascii table. When select = "all", the result is meaningful.

> gene <- GRanges("chr1:5-6:*")
> peak <- GRanges(c("chr1:1-2:+", "chr1:13-14:-", "chr1:1-2:+", "chr1:13-14:-"))
> nearest(gene, peak, select = "all")
Hits object with 4 hits and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           1
  [2]         1           2
  [3]         1           3
  [4]         1           4
  -------
  queryLength: 1 / subjectLength: 4

It find the nearest subject on each strand, that is meaningful and user can do more things beyond that.

However when select is not "all", the result is not very satisfied.

> nearest(gene, peak, select = "arbitrary")
[1] 4

However, the nearest is 1 or 3 and I can use the following code to get it.

> subjectHits(selectNearest(nearest(gene, peak, select = "all"), gene, peak))
[1] 1 3

> subjectHits(selectNearest(nearest(gene, peak, select = "all"), gene, peak))[1] # keep only one in subject
[1] 1

follow() has a similar problem:

> follow(gene, peak, select = "all")
Hits object with 4 hits and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           1
  [2]         1           2
  [3]         1           3
  [4]         1           4
  -------
  queryLength: 1 / subjectLength: 4
> follow(gene, peak, select = "last")
[1] 4

I think follow(*, select = "last") should return a nearest one in the result of follow(*, select = "all"). The argument select = "last" should take the last one from 1 or 3, not 1, 2, 3, or 4. So the result should be 3.

The reason that follow() return 4 is from the following code block in GenomicRanges:::.GenomicRanges_findPrecedeFollow:

if (select == "all") {
        hits
    }
    else if (select == "first") {
        first <- rep(NA_integer_, length(query))
        idx <- which(!duplicated(queryHits(hits)))
        first[queryHits(hits)[idx]] <- subjectHits(hits)[idx]
        first
    }
    else if ("last" == select) {
        last <- rep(NA_integer_, length(query))
        rev_query <- rev(queryHits(hits))
        idx <- which(!duplicated(rev_query))
        last[rev_query[idx]] <- rev(subjectHits(hits))[idx]
        last
    }

To solve the problem, I suggest add the following one line code before that chunk:

hits <- selectNearest(hits, query, subject)

I believe the problem of nearest() when select is not "all" can also be solved with a small modification.

The modification is small, but I think it could give more reasonable results and keep better compatibility with the old behavior of nearest(), follow() and precede() in GenomicRanges 1.26. So people would not be influenced too much by the release of GenomicRanges 1.27 with Bioc 3.5. Otherwise a lot of codes involves nearest(), precede() and follow() should be change to something like

subjectHits(selectNearest(nearest(query, subject, select = "all"), query, subject))[1]

Looking forward to your reply.

Can

ADD REPLY • link 8.0 years ago wcstcyx ▴ 30