I am attempting to find the distance to the closest upstream but distanceToNearest does not appear to return the closest.
A simple reproduction is as follows:
> subject <- GRanges(seqnames="chr1", ranges=IRanges(start=c(1000, 1500, 2000, 2500, 3000, 3500), width=1),strand=c("+", "-", "+", "-", "+", "-")) > distanceToNearest(GRanges(seqnames="chr1", ranges=IRanges(start=1100, width=1), strand="*"), subject, select="all") Hits object with 2 hits and 1 metadata column: queryHits subjectHits | distance <integer> <integer> | <integer> [1] 1 1 | 99 [2] 1 2 | 399 ------- queryLength: 1 / subjectLength: 6
So far so good. We're 99bp downstream from the from the first subject row, and 399 downstream on the opposite strand from the second subject row. Without the subject parameter, one would expect this to return the 99 as that's less than 399.
> distanceToNearest(GRanges(seqnames="chr1", ranges=IRanges(start=1100, width=1), strand="*"), subject) Hits object with 1 hit and 1 metadata column: queryHits subjectHits | distance <integer> <integer> | <integer> [1] 1 2 | 399 ------- queryLength: 1 / subjectLength: 6
Why do I get the 399 result, and not the 99 result? The documentation states that select resolves arbitrarily for nearest(), but that doesn't make sense for distancetoNearest() when there is an actual distance metric returned. Why isn't the hit with the minimum distance returned? I'm confused as to how distancetoNearest() can be useful for queries with strand="*" with its current behaviour. Am I missing something?
> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
Matrix products: default
locale:
[1] LC_COLLATE=English_Australia.1252 LC_CTYPE=English_Australia.1252 LC_MONETARY=English_Australia.1252 LC_NUMERIC=C
[5] LC_TIME=English_Australia.1252
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] biomaRt_2.34.0 org.Hs.eg.db_3.5.0 bindrcpp_0.2 TxDb.Hsapiens.UCSC.hg38.knownGene_3.4.0
[5] GenomicFeatures_1.30.0 AnnotationDbi_1.40.0 Biobase_2.38.0 rtracklayer_1.38.0
[9] GenomicRanges_1.30.0 GenomeInfoDb_1.14.0 IRanges_2.12.0 S4Vectors_0.16.0
[13] BiocGenerics_0.24.0 forcats_0.2.0 stringr_1.2.0 dplyr_0.7.4
[17] purrr_0.2.4 readr_1.1.1 tidyr_0.7.2 tibble_1.3.4
[21] ggplot2_2.2.1 tidyverse_1.2.1 openxlsx_4.0.17
loaded via a namespace (and not attached):
[1] nlme_3.1-131 bitops_1.0-6 matrixStats_0.52.2 lubridate_1.7.1 bit64_0.9-7 progress_1.1.2
[7] httr_1.3.1 rprojroot_1.2 tools_3.4.2 backports_1.1.1 R6_2.2.2 DBI_0.7
[13] lazyeval_0.2.1 colorspace_1.3-2 prettyunits_1.0.2 mnormt_1.5-5 RMySQL_0.10.13 bit_1.1-12
[19] compiler_3.4.2 cli_1.0.0 rvest_0.3.2 xml2_1.1.1 DelayedArray_0.4.1 scales_0.5.0
[25] psych_1.7.8 digest_0.6.12 Rsamtools_1.30.0 foreign_0.8-69 rmarkdown_1.8 XVector_0.18.0
[31] base64enc_0.1-3 pkgconfig_2.0.1 htmltools_0.3.6 rlang_0.1.4 readxl_1.0.0 rstudioapi_0.7
[37] RSQLite_2.0 bindr_0.1 jsonlite_1.5 BiocParallel_1.12.0 RCurl_1.95-4.8 magrittr_1.5
[43] GenomeInfoDbData_0.99.1 Matrix_1.2-12 Rcpp_0.12.14 munsell_0.4.3 yaml_2.1.14 stringi_1.1.6
[49] SummarizedExperiment_1.8.0 zlibbioc_1.24.0 plyr_1.8.4 grid_3.4.2 blob_1.1.0 crayon_1.3.4
[55] lattice_0.20-35 Biostrings_2.46.0 haven_1.1.0 hms_0.4.0 knitr_1.17 reshape2_1.4.2
[61] XML_3.98-1.9 glue_1.2.0 evaluate_0.10.1 modelr_0.1.1 cellranger_1.1.0 gtable_0.2.0
[67] assertthat_0.2.0 broom_0.4.3 GenomicAlignments_1.14.1 memoise_1.1.0
I should also note that your intuition as to what should be nearest is only true if you ignore the strand information, which you can do with the ignore.strand argument:
I'm looking for the closest upstream TSS for viral insertion sites. This means my query strand is *, but I do care about transcript orientation (subjects are coordinate and strand of TSS). In the case of overlapping genes, distanceToNearest() is not returning the one with the closest TSS.
But the closest subject ranges are at positions 1 and 2, not 2 and 3 as can be seen by calling nearest():
> nearest(GRanges(seqnames="chr1", ranges=IRanges(start=1100, width=1), strand="*"), subject, select="all")
Hits object with 2 hits and 0 metadata columns:
queryHits subjectHits
<integer> <integer>
[1] 1 1
[2] 1 2
-------
queryLength: 1 / subjectLength: 6
Additionally, the distances of 99, and 399 (not 399 and 899) are returned by distanceToNearest() when select="all" is specified.
The problem I am having isn't with the strandedness of the nearest - the problem is that when there are two possible distances (one of each strand), distanceToNearest() does not return the lesser of the two distances.
Thanks for reporting this bug. We'll post back when it's fixed.
Valerie