rtracklayer can't access hg38 on UCSC?
2
0
Entering edit mode
@4db1a7b6
Last seen 15 days ago
United States

The code below worked for 'hg19', but returned an error for 'hg38', indicating that hg38 is not available.

Is there a different way of accessing hg38, or is it just not supported?


> targetTrack = with(targets_df, GRangesForUCSCGenome("hg38", "chr16", ranges, "+", "CDH13"))

Error in GRangesForGenome(genome, chrom = chrom, ranges = ranges, method = "UCSC",  : 
  Failed to obtain information for genome 'hg38'

hg38 does exist in the list of genomes from rtracklayer

> df = rtracklayer::ucscGenomes()
> filter(df, species == "Human")
    db species      date                                                                     name
1 hg16   Human July 2003                                                            NCBI Build 34
2 hg17   Human  May 2004                                                            NCBI Build 35
3 hg18   Human Mar. 2006                                                          NCBI Build 36.1
4 hg19   Human Feb. 2009  GRCh37 Genome Reference Consortium Human Reference 37 (GCA_000001405.1)
5 hg38   Human Dec. 2013 GRCh38 Genome Reference Consortium Human Reference 38 (GCA_000001405.15)
> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /opt/intel/compilers_and_libraries_2020.0.166/linux/mkl/lib/intel64_lin/libmkl_rt.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] forcats_0.5.1        stringr_1.4.0        dplyr_1.0.7          purrr_0.3.4          readr_2.0.2          tidyr_1.1.4          tibble_3.1.5        
 [8] ggplot2_3.3.5        tidyverse_1.3.1      rtracklayer_1.52.1   GenomicRanges_1.44.0 GenomeInfoDb_1.28.4  IRanges_2.26.0       S4Vectors_0.30.2    
[15] BiocGenerics_0.38.0 

loaded via a namespace (and not attached):
 [1] bitops_1.0-7                matrixStats_0.61.0          fs_1.5.0                    usethis_2.1.3               lubridate_1.8.0            
 [6] httr_1.4.2                  rprojroot_2.0.2             tools_4.1.2                 backports_1.3.0             utf8_1.2.2                 
[11] R6_2.5.1                    DBI_1.1.1                   colorspace_2.0-2            withr_2.4.2                 tidyselect_1.1.1           
[16] compiler_4.1.2              cli_3.1.0                   rvest_1.0.2                 Biobase_2.52.0              xml2_1.3.2                 
[21] desc_1.4.0                  DelayedArray_0.18.0         scales_1.1.1                Rsamtools_2.8.0             XVector_0.32.0             
[26] pkgconfig_2.0.3             sessioninfo_1.1.1           MatrixGenerics_1.4.3        dbplyr_2.1.1                rlang_0.4.12               
[31] readxl_1.3.1                rstudioapi_0.13             BiocIO_1.2.0                generics_0.1.1              jsonlite_1.7.2             
[36] BiocParallel_1.26.2         RCurl_1.98-1.5              magrittr_2.0.1              GenomeInfoDbData_1.2.6      Matrix_1.3-4               
[41] Rcpp_1.0.7                  munsell_0.5.0               fansi_0.5.0                 lifecycle_1.0.1             stringi_1.7.5              
[46] whisker_0.4                 yaml_2.2.1                  SummarizedExperiment_1.22.0 zlibbioc_1.38.0             grid_4.1.2                 
[51] crayon_1.4.1                lattice_0.20-45             Biostrings_2.60.2           haven_2.4.3                 hms_1.1.1                  
[56] knitr_1.36                  pillar_1.6.4                rjson_0.2.20                reprex_2.0.1                XML_3.99-0.8               
[61] glue_1.4.2                  BiocManager_1.30.16         modelr_0.1.8                vctrs_0.3.8                 tzdb_0.2.0                 
[66] cellranger_1.1.0            gtable_0.3.0                assertthat_0.2.1            xfun_0.27                   broom_0.7.9                
[71] restfulr_0.0.13             roxygen2_7.1.2              GenomicAlignments_1.28.0    ellipsis_0.3.2
rtracklayer Visualization • 1.8k views
ADD COMMENT
0
Entering edit mode
@james-w-macdonald-5106
Last seen 6 minutes ago
United States

It's not clear what is in your targets_df object, nor how that is supposed to interact with GRangesFromUCSCGenome. Also, other than the first three arguments, which are named and can be passed in positionally, any arguments you pass in via the ellipsis will just be added to the mcols of the resulting GRanges and are probably not a thing? I mean you could do

> GRangesForUCSCGenome("hg38", paste0("chr", 1:22), IRanges(1,1000), "+", "CDH13")
GRanges object with 22 ranges and 1 metadata column:
       seqnames    ranges strand |     "CDH13"
          <Rle> <IRanges>  <Rle> | <character>
   [1]     chr1    1-1000      + |       CDH13
   [2]     chr2    1-1000      + |       CDH13
   [3]     chr3    1-1000      + |       CDH13
   [4]     chr4    1-1000      + |       CDH13
   [5]     chr5    1-1000      + |       CDH13
   ...      ...       ...    ... .         ...
  [18]    chr18    1-1000      + |       CDH13
  [19]    chr19    1-1000      + |       CDH13
  [20]    chr20    1-1000      + |       CDH13
  [21]    chr21    1-1000      + |       CDH13
  [22]    chr22    1-1000      + |       CDH13
  -------
  seqinfo: 640 sequences (1 circular) from hg38 genome

Which doesn't really make any sense. The following seems more like a use case for the function

> GRangesForUCSCGenome("hg38", paste0("chr", 1:22))
GRanges object with 22 ranges and 0 metadata columns:
       seqnames      ranges strand
          <Rle>   <IRanges>  <Rle>
   [1]     chr1 1-248956422      *
   [2]     chr2 1-242193529      *
   [3]     chr3 1-198295559      *
   [4]     chr4 1-190214555      *
   [5]     chr5 1-181538259      *
   ...      ...         ...    ...
  [18]    chr18  1-80373285      *
  [19]    chr19  1-58617616      *
  [20]    chr20  1-64444167      *
  [21]    chr21  1-46709983      *
  [22]    chr22  1-50818468      *
  -------
  seqinfo: 640 sequences (1 circular) from hg38 genome

But maybe you are trying to do something different?

ADD COMMENT
0
Entering edit mode

I am trying to follow the rtracklayer vignette. The object setup is everything before section 2.2, but it is section 2.2 which describes how to launch the UCSC browser.

The annotations themselves are imported using rtracklayer::import() from gencode38.

here is what each of the objects look like:

> targets_df
DataFrame with 1 row and 21 columns
                     source     type     score     phase            gene_id      gene_type   gene_name       level     hgnc_id
                   <factor> <factor> <numeric> <integer>        <character>    <character> <character> <character> <character>
ENSG00000140945.17   HAVANA     gene        NA        NA ENSG00000140945.17 protein_coding       CDH13           1   HGNC:1753
                            havana_gene transcript_id transcript_type transcript_name transcript_support_level               tag
                            <character>   <character>     <character>     <character>              <character>       <character>
ENSG00000140945.17 OTTHUMG00000176635.3            NA              NA              NA                       NA overlapping_locus
                   havana_transcript exon_number     exon_id         ont  protein_id      ccdsid
                         <character> <character> <character> <character> <character> <character>
ENSG00000140945.17                NA          NA          NA          NA          NA          NA
> chr
factor-Rle of length 1 with 1 run
  Lengths:     1
  Values : chr16
Levels(25): chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 ... chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX chrY chrM
> ranges
IRanges object with 1 range and 0 metadata columns:
                         start       end     width
                     <integer> <integer> <integer>
  ENSG00000140945.17  82626965  83800640   1173676
> strand
factor-Rle of length 1 with 1 run
  Lengths: 1
  Values : +
Levels(3): + - *
> name
[1] "CDH13"

This works -- no errors, and it launches the UCSC browser correctly:

targetTrack = with(targets_df, GRangesForUCSCGenome("hg19", chr, ranges, strand, name))

session <- browserSession("UCSC")
track(session, "targets") <- targetTrack
subTargetTrack <- targetTrack[1] #not actually necessary here, but you can make targetTrack a list
view <- browserView(session, subTargetTrack * -10, pack = "targets")

This does not work:

targetTrack = with(targets_df, GRangesForUCSCGenome("hg38", chr, ranges, strand, name))

session <- browserSession("UCSC")
track(session, "targets") <- targetTrack
subTargetTrack <- targetTrack[1] #not actually necessary here, but you can make targetTrack a list
view <- browserView(session, subTargetTrack * -10, pack = "targets")

With the following error:

Error in GRangesForGenome(genome, chrom = chrom, ranges = ranges, method = "UCSC", : Failed to obtain information for genome 'hg38'
11.
stop("Failed to obtain information for genome '", genome, "'")
10.
GRangesForGenome(genome, chrom = chrom, ranges = ranges, method = "UCSC", seqinfo = NULL, ...)
9.
GRangesForUCSCGenome("hg38", chr, ranges, strand, name)
8.
eval(expr, as.env(envir, enclos))
7.
eval(expr, as.env(envir, enclos))
6.
eval(expr, as.env(envir, enclos))
5.
eval(expr, envir, enclos)
4.
eval(expr, envir, enclos)
3.
safeEval(substitute(expr), data, parent.frame(), ...)
2.
with(targets_df, GRangesForUCSCGenome("hg38", chr, ranges, strand, name))
1.
with(targets_df, GRangesForUCSCGenome("hg38", chr, ranges, strand, name))
ADD REPLY
0
Entering edit mode

The with function is intended to be used in the case where you have a bunch of data in an object and you want to simplify operations on it. As an example:

> d.f <- data.frame(x=1:4, y=5:8)
## by hand
> d.f[,2] - d.f[,1]
[1] 4 4 4 4
## much cooler
> with(d.f, y-x)
[1] 4 4 4 4

If you have a DataFrame that has a bunch of stuff in it and you are not planning on using those data for anything, then using with in that context is not useful and possibly problematic. In other words, there is nothing in your targets_df object that you are extracting, so you don't need to do what you are doing.

Anyway, the error you see has nothing to do with any of that. It's coming from rtracklayer:::GRangesForGenome, where it is trying to get the Seqinfo for hg38. Basically what happens is this:

session <- browserSession("UCSC")
genome(session) <- "hg38"
seqinfo(session)

## which if I do that, this happens
> session <- browserSession("UCSC")
> genome(session) <- "hg38"
> seqinfo(session)
Seqinfo object with 640 sequences (1 circular) from hg38 genome:
  seqnames             seqlengths isCircular genome
  chr1                  248956422      FALSE   hg38
  chr2                  242193529      FALSE   hg38
  chr3                  198295559      FALSE   hg38
  chr4                  190214555      FALSE   hg38
  chr5                  181538259      FALSE   hg38
  ...                         ...        ...    ...
  chr22_KN196486v1_alt     153027      FALSE   hg38
  chr22_KQ458387v1_alt     155930      FALSE   hg38
  chr22_KQ458388v1_alt     174749      FALSE   hg38
  chr22_KQ759761v1_alt     145162      FALSE   hg38
  chrX_KV766199v1_alt      188004      FALSE   hg38

What happens when you do that?

ADD REPLY
0
Entering edit mode
@herve-pages-1542
Last seen 2 minutes ago
Seattle, WA, United States

Hi,

rtracklayer::GRangesForUCSCGenome() calls rtracklayer::SeqinfoForUCSCGenome() internally which stopped working on hg18 a few weeks ago in BioC 3.13:

library(rtracklayer)
SeqinfoForUCSCGenome("hg38")
# NULL

And the reason SeqinfoForUCSCGenome("hg38") stopped working a few weeks ago is because of a change on the UCSC side that happened at that time. That change broke GenomeInfoDb::getChromInfoFromUCSC("hg18") which SeqinfoForUCSCGenome("hg38") calls internally:

library(GenomeInfoDb)
getChromInfoFromUCSC("hg38")
# Error in .order_seqlevels(chrom_sizes[, "chrom"]) : 
#   !anyNA(m32) is not TRUE

Unfortunately SeqinfoForUCSCGenome() uses a tryCatch mechanism to return a NULL in case of an error and hide the details of the error.

This was with BioC 3.13. See my sessionInfo() below.

The GenomeInfoDb::getChromInfoFromUCSC("hg18") issue has been reported on this site here and here.

TLDR: This is fixed in BioC 3.14 but won't be fixed in BioC 3.13. Time to upgrade!

Cheers,

H.

sessionInfo():

R version 4.1.0 beta (2021-05-06 r80268)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 21.10

Matrix products: default
BLAS:   /home/hpages/R/R-4.1.r80268/lib/libRblas.so
LAPACK: /home/hpages/R/R-4.1.r80268/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB              LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] BSgenome.Hsapiens.UCSC.hg38_1.4.3 BSgenome_1.60.0                  
 [3] Biostrings_2.60.2                 XVector_0.32.0                   
 [5] rtracklayer_1.52.1                GenomicRanges_1.44.0             
 [7] GenomeInfoDb_1.28.4               IRanges_2.26.0                   
 [9] S4Vectors_0.31.0                  BiocGenerics_0.38.0              

loaded via a namespace (and not attached):
 [1] rstudioapi_0.13             zlibbioc_1.38.0            
 [3] GenomicAlignments_1.28.0    BiocParallel_1.26.2        
 [5] lattice_0.20-45             rjson_0.2.20               
 [7] tools_4.1.0                 grid_4.1.0                 
 [9] SummarizedExperiment_1.22.0 Biobase_2.52.0             
[11] matrixStats_0.61.0          yaml_2.2.1                 
[13] crayon_1.4.1                BiocIO_1.2.0               
[15] Matrix_1.3-4                GenomeInfoDbData_1.2.6     
[17] restfulr_0.0.13             bitops_1.0-7               
[19] RCurl_1.98-1.5              DelayedArray_0.18.0        
[21] compiler_4.1.0              MatrixGenerics_1.4.3       
[23] Rsamtools_2.8.0             XML_3.99-0.8
ADD COMMENT

Login before adding your answer.

Traffic: 1272 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6