GenomeInfoDb's Seqinfo() error on hg38
1
2
Entering edit mode
@mikelove
Last seen 1 day ago
United States

I noticed recently that Seqinfo gives an error on hg38:

> Seqinfo(genome="hg38")
Error in FUN(genome = names(SUPPORTED_UCSC_GENOMES)[idx], circ_seqs = supported_genome$circ_seqs,  :
  cannot map the following UCSC seqlevel(s) to an NCBI seqlevel: chr8_KZ208915v1_fix,
  chr15_KN538374v1_fix, chr15_KQ031389v1_alt, chr16_KV880768v1_fix, chr12_KZ208916v1_fix,
  chr14_KZ208920v1_fix, chr7_KZ208913v1_alt, chr5_KV575244v1_fix, chr7_KZ208912v1_fix,
  chr1_KV880763v1_alt, chr12_KN538369v1_fix, chr2_KQ983256v1_alt, chr2_KQ031384v1_fix,
  chr16_KZ559113v1_fix, chr7_KV880765v1_fix, chr1_KQ031383v1_fix, chr1_KN538360v1_fix,
  chr3_KN196475v1_fix, chr4_KV766193v1_alt, chr10_KN538367v1_fix, chr3_KN538364v1_fix,
  chr3_KV766192v1_fix, chr18_KQ090028v1_fix, chr19_KQ458386v1_fix, chr3_KQ031385v1_fix,
  chr19_KN196484v1_fix, chr2_KN538363v1_fix, chr5_KV575243v1_alt, chr13_KN538372v1_fix,
  chr1_KQ458383v1_alt, chr9_KN196479v1_fix, chr1_KZ208906v1_fix, chr6_KQ031387v1_fix,
  chr12_KQ759760v1_fix, chr3_KN196476v1_fix, chr1_KN538361v1_fix, chr11_KZ559108v1_fix,
  chr3_KZ559103v1_alt, chr11_KZ559110v1_alt, chr19_KV575249v1_alt, chr17_KV766196v1_fix,
  chr11_KZ559109v1_fix, chr1_
Calls: Seqinfo ... .fetch_sequence_info_for_UCSC_genome -> fetchExtendedChromInfoFromUCSC -> FUN

The output with hg19:

> Seqinfo(genome="hg19")
Seqinfo object with 93 sequences (1 circular) from hg19 genome:
  seqnames       seqlengths isCircular genome
  chr1            249250621      FALSE   hg19
  chr2            243199373      FALSE   hg19
  chr3            198022430      FALSE   hg19
  chr4            191154276      FALSE   hg19
  chr5            180915260      FALSE   hg19
  ...                   ...        ...    ...
  chrUn_gl000245      36651      FALSE   hg19
  chrUn_gl000246      38154      FALSE   hg19
  chrUn_gl000247      36422      FALSE   hg19
  chrUn_gl000248      39786      FALSE   hg19
  chrUn_gl000249      38502      FALSE   hg19

My session info:

> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS  10.14.2

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices datasets  utils     methods   base

other attached packages:
[1] GenomeInfoDb_1.18.1 IRanges_2.16.0      S4Vectors_0.20.1    BiocGenerics_0.28.0
[5] testthat_2.0.1      rmarkdown_1.11      devtools_1.13.6

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0             knitr_1.20             magrittr_1.5           R6_2.3.0
 [5] rlang_0.3.1            stringr_1.4.0          tools_3.5.0            withr_2.1.2
 [9] htmltools_0.3.6        rprojroot_1.3-2        digest_0.6.18          GenomeInfoDbData_1.2.0
[13] BiocManager_1.30.4     bitops_1.0-6           RCurl_1.95-4.11        memoise_1.1.0
[17] evaluate_0.12          stringi_1.2.4          compiler_3.5.0         backports_1.1.3
genomeinfodb • 3.6k views
ADD COMMENT
2
Entering edit mode
@herve-pages-1542
Last seen 6 hours ago
Seattle, WA, United States

Hi Mike,

Sequences like chr15_KN538374v1_fix don't belong to GRCh38 but to recent patched versions of GRCh38 (e.g. GRCh38.p12, which is the latest). Unfortunately it looks like the UCSC folks updated the chromInfo table for hg38 a couple of days ago, and that the table now contains chrom info for GRCh38.p12 (or maybe GRCh38.p11). You can see a dump of this SQL table here:

http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/

The timestamp for the dump indicates that it was updated on Feb 10.

We have a lot of code around (e.g. in GenomeInfoDb, GenomicFeatures, and probably more) that assumes that hg38 is mapped to GRCh38, and not to a patched version of it. The fact that the UCSC folks are changing the reference genome of the hg38 tracks is totally new (I don't remember seeing this for any other organism they support in the last 10 years or so). This a bummer and a game changer.

Also the description they provide for the hg38 browser is misleading:

UCSC Genome Browser assembly ID: hg38
Sequencing/Assembly provider ID: Genome Reference Consortium Human GRCh38 (GCA_000001405.15)
Assembly date: Dec. 2013
Assembly accession: GCA_000001405.15
NCBI Genome ID: 51 (Homo sapiens (human))
NCBI Assembly ID: 883148 (GRCh38, GCA_000001405.15)

It means that from now on we'll have to guess what assembly the hg38 tracks are really based on :-/

I need to think of how to best deal with this in GenomeInfoDb and GenomicFeatures.

H.

ADD COMMENT
0
Entering edit mode

Thanks for the quick detective work. Yeah, this is a really useful function in GenomeInfoDb and figured that since I hadn’t updated versions something at the source was the issue.

ADD REPLY
1
Entering edit mode

Should be fixed in GenomeInfoDb 1.19.2 (devel, see commit here) and 1.18.2 (release). These new versions won't show up on the build reports (devel and release) before Wednesday morning.

H.

ADD REPLY
0
Entering edit mode

Thanks for the rapid fix, Hervé

ADD REPLY
0
Entering edit mode

BTW this change was finally announced recently on the Genome Browser's blog:

http://genome.ucsc.edu/blog/patches/

and on their mailing list:

https://groups.google.com/a/soe.ucsc.edu/forum/#!topic/genome-announce/oMMnhnPr6gg

The hg38 gateway page still doesn't say what assembly the hg38 tracks are really based on though:

http://genome.ucsc.edu/cgi-bin/hgGateway?db=hg38

ADD REPLY
0
Entering edit mode

Better late than never :-)

ADD REPLY
0
Entering edit mode

I have a similar issue: https://support.bioconductor.org/p/127017/ Could you please help me?

Thanks, M

ADD REPLY
1
Entering edit mode

I replied over on the other thread, update needed, then post back if the issue persists (after restarting R).

https://support.bioconductor.org/p/127017/#127019

ADD REPLY
0
Entering edit mode

Hi!

I am using GenomeInfoDb_1.22.0 and I am having the same issue with hg19:

> Seqinfo(genome="hg19")
Error in FUN(genome = names(SUPPORTED_UCSC_GENOMES)[idx], circ_seqs = supported_genome$circ_seqs,  : 
  cannot map the following UCSC seqlevel(s) to an NCBI seqlevel:
  chr1_jh636052_fix, chrX_jh806600_fix, chrX_jh806587_fix,
  chr7_jh159134_fix, chrX_jh159150_fix, chrX_jh806590_fix,
  chr10_jh591181_fix, chr1_jh636053_fix, chr5_gl339449_alt,
  chr14_kb021645_fix, chrX_jh720453_fix, chrX_jh806601_fix,
  chr7_gl582971_fix, chrX_jh806599_fix, chr19_gl949749_alt,
  chr19_gl949750_alt, chr19_gl949748_alt, chr19_kb021647_fix,
  chrX_jh806597_fix, chr10_ke332501_fix, chr19_gl949751_alt,
  chr19_gl949746_alt, chr19_gl949752_alt, chrX_jh806598_fix,
  chrX_jh720451_fix, chrX_jh806591_fix, chr11_jh806581_fix,
  chrX_jh806588_fix, chrX_jh806592_fix, chr19_gl949753_alt,
  chr1_jh636054_fix, chrX_jh720454_fix, chr19_gl949747_alt,
  chr7_jh636058_fix, chrX_jh806602_fix, chr17_gl383561_fix,
  chr8_gl949743_fix, chr2_kb663603_fix, chr19_gl582977_fix,
  chr19_ke332505_fix, chr11_jh159140_fix, chr5_ke332497_fix,
  chr17_gl383560_fix, chrX_jh720452_fix, chr4_ke332496_fix,
  chr6_kb663604_fix, chr

Other genomes are working fine.

My sessionInfo:

> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server 7.7 (Maipo)

Matrix products: default
BLAS/LAPACK: /nas/longleaf/apps/r/3.6.0/lib/libopenblas_haswellp-r0.3.5.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] GenomeInfoDb_1.22.0 IRanges_2.20.2      S4Vectors_0.24.3   
[4] BiocGenerics_0.32.0

loaded via a namespace (and not attached):
[1] compiler_3.6.0         tools_3.6.0            GenomeInfoDbData_1.2.2
[4] RCurl_1.98-1.1         bitops_1.0-6        
ADD REPLY
1
Entering edit mode

Hi,

This is fixed in BioC 3.11 (the current devel version of Bioconductor, to be released in one month) but is unlikely to get fixed in BioC 3.10 (the version you're using). See this issue for more information. You will need to install a recent version of R devel (R 4.0) in order to use BioC 3.11.

Best, H.

ADD REPLY
1
Entering edit mode

OK so this is also fixed in release (BioC 3.11):

hpages@spectre:~/GenomeInfoDb$ git log -n 1
commit a4109ce3ab5a98ba627f2346829a9dbf94b9f927
Author: Hervé Pagès <hpages@fredhutch.org>
Date:   Fri Mar 27 12:34:28 2020 -0700

    Fix Seqinfo(genome="hg19")

    Still returns the 93 sequences of the **original** hg19 (unlike in BioC
    3.11 where it returns all the new sequences).

Seems like too many things are currently broken in release because of this.

The fix is in GenomeInfoDb 1.22.1.

Cheers,

H.

ADD REPLY

Login before adding your answer.

Traffic: 637 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6