supportedUCSCtables(genome="hg38"): Is ensGene track really available?
@chao-jen-wong-7035
USA/Seattle/Fred Hutchinson Cancer Rese…

Hi,

I was using supportedUCSCtables() to find out which tracks are available on the UCSC Genome Browser and found the function a bit misleading. The "ensGene" table is indeed available for the hg19 genome, but not for hg38, so the call below fails with an error. I then used rtracklayer::trackNames() to find out which tracks are really available on the UCSC Genome Browser.

> supportedUCSCtables(genome="hg38")
                                               track           subtrack
knownGene                                GENCODE v22               <NA>
knownGeneOld3                         Old UCSC Genes               <NA>
ccdsGene                                        CCDS               <NA>
refGene                                 RefSeq Genes               <NA>
xenoRefGene                             Other RefSeq               <NA>
vegaGene                                  Vega Genes Vega Protein Genes
vegaPseudoGene                            Vega Genes   Vega Pseudogenes
ensGene                                Ensembl Genes               <NA>
> makeTxDbPackageFromUCSC(version="1.0.0",
+                         maintainer="Chao-Jen Wong <cwon2@fredhutch.org>",
+                         destDir="~/tapscott/hg38",
+                         author="Chao-Jen Wong",
+                         genome="hg38",
+                         tablename="ensGene")
Error in normArgTrack(track, trackids) : Unknown track: Ensembl Genes
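
For readers who want to reproduce the rtracklayer check mentioned above, here is a minimal sketch. It assumes network access to UCSC and that trackNames() returns a character vector whose names are the track labels (e.g. "Ensembl Genes"); treat the result as illustrative, not authoritative.

```r
library(rtracklayer)

# Open a UCSC Genome Browser session and point it at the genome of interest.
session <- browserSession("UCSC")
genome(session) <- "hg38"

# List the tracks UCSC offers for this genome.
# Assumption: the track labels are the names of the returned vector.
tracks <- trackNames(session)

# FALSE here would be consistent with the "Unknown track" error above.
"Ensembl Genes" %in% names(tracks)
```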

 

> sessionInfo()
R Under development (unstable) (2016-05-01 r70566)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] rtracklayer_1.31.13     GenomicFeatures_1.23.31 AnnotationDbi_1.33.15
 [4] Biobase_2.31.3          GenomicRanges_1.23.27   GenomeInfoDb_1.7.6
 [7] IRanges_2.5.47          S4Vectors_0.9.52        BiocGenerics_0.17.5
[10] biomaRt_2.27.2          BiocInstaller_1.21.6

loaded via a namespace (and not attached):
 [1] XML_3.98-1.4                Rsamtools_1.23.11
 [3] Biostrings_2.39.14          GenomicAlignments_1.7.21
 [5] bitops_1.0-6                DBI_0.4
 [7] RSQLite_1.0.0               zlibbioc_1.17.1
 [9] XVector_0.11.8              BiocParallel_1.5.22
[11] tools_3.4.0                 RCurl_1.95-4.8
[13] SummarizedExperiment_1.1.27
>
@herve-pages-1542
Seattle, WA, United States

Hi Chao-Jen,

Thanks for reporting this. I made a couple of changes and improvements to supportedUCSCtables() in BioC devel (i.e. BioC 3.4):

  • The table names are now returned in a proper column instead of the rownames of the returned data frame, so this data frame now has 3 columns (tablename, track, and subtrack) instead of 2 (track and subtrack).
  • Tracks that don't exist for the specified genome are now removed from the returned data frame. This cleaning is done by querying the UCSC Genome Browser to get the set of tracks for the specified genome (using rtracklayer::trackNames()) which slows down supportedUCSCtables() a bit. Before this change, supportedUCSCtables() was very snappy because it didn't need to access the internet (but yeah, as we all know, there are many fast ways to get the wrong result).
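
The filtering step described in the second bullet can be pictured with the toy sketch below. It is not the actual GenomicFeatures code: keep_existing_tracks() is a hypothetical helper, and the two inputs are made-up stand-ins for the internal table list and for the track labels obtained from rtracklayer::trackNames().

```r
# Hypothetical helper: keep only the rows whose track exists for the genome.
keep_existing_tracks <- function(supported, available_tracks) {
    supported[supported$track %in% available_tracks, , drop = FALSE]
}

# Toy stand-in for the unfiltered 3-column data frame.
supported <- data.frame(
    tablename = c("knownGene", "refGene", "ensGene"),
    track     = c("GENCODE v22", "RefSeq Genes", "Ensembl Genes"),
    subtrack  = NA_character_,
    stringsAsFactors = FALSE
)

# Toy stand-in for the track labels reported by UCSC for the genome.
available_tracks <- c("GENCODE v22", "RefSeq Genes")

# The ensGene row is dropped because its track is not available.
filtered <- keep_existing_tracks(supported, available_tracks)
filtered$tablename
```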

With GenomicFeatures 1.25.6:

supportedUCSCtables(genome="hg38")
#           tablename          track           subtrack
# 1     knownGeneOld3 Old UCSC Genes               <NA>
# 2          ccdsGene           CCDS               <NA>
# 3           refGene   RefSeq Genes               <NA>
# 4       xenoRefGene   Other RefSeq               <NA>
# 5           sibGene      SIB Genes               <NA>
# 6           sgpGene      SGP Genes               <NA>
# 7            geneid   Geneid Genes               <NA>
# 8           genscan  Genscan Genes               <NA>
# 9      augustusGene       Augustus               <NA>
# 10    augustusHints       Augustus     Augustus Hints
# 11      augustusXRA       Augustus   Augustus De Novo
# 12 augustusAbinitio       Augustus Augustus Ab Initio

Unfortunately, the returned data frame can still contain some spurious rows for tracks with subtracks (e.g. only the 1st row in the Augustus group above is correct and the 3 other rows are spurious, but for hg18 it would be the other way around). Removing these rows would slow down supportedUCSCtables() quite significantly, especially for genomes with many tracks, where it could take several minutes (unless I'm missing a quick way to map a set of tracks to their "central tables"; this is a one-to-many mapping when tracks have subtracks). See ?supportedUCSCtables for more information.

Cheers,

H.


Oops, I goofed with GenomicFeatures 1.25.6 (no more knownGene table). With GenomicFeatures 1.25.7:

supportedUCSCtables("hg38")
#           tablename          track           subtrack
# 1         knownGene    GENCODE v22               <NA>
# 2     knownGeneOld8 Old UCSC Genes               <NA>
# 3     knownGeneOld7 Old UCSC Genes               <NA>
# 4     knownGeneOld6 Old UCSC Genes               <NA>
# 5     knownGeneOld4 Old UCSC Genes               <NA>
# 6     knownGeneOld3 Old UCSC Genes               <NA>
# 7          ccdsGene           CCDS               <NA>
# 8           refGene   RefSeq Genes               <NA>
# 9       xenoRefGene   Other RefSeq               <NA>
# 10          sibGene      SIB Genes               <NA>
# 11          sgpGene      SGP Genes               <NA>
# 12           geneid   Geneid Genes               <NA>
# 13          genscan  Genscan Genes               <NA>
# 14     augustusGene       Augustus               <NA>
# 15    augustusHints       Augustus     Augustus Hints
# 16      augustusXRA       Augustus   Augustus De Novo
# 17 augustusAbinitio       Augustus Augustus Ab Initio

I also added browseUCSCtrack() for browsing the UCSC track page for a given genome/table. It can be used to quickly check that a combination of genome/tablename actually exists. See ?browseUCSCtrack for more information.
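
A minimal usage sketch, assuming a GenomicFeatures version that provides browseUCSCtrack() (1.25.7 or later in the devel line discussed here) and a machine with a web browser available:

```r
library(GenomicFeatures)

# Opens the UCSC track page for this genome/table pair in the default
# web browser, so you can check by eye that the combination exists.
browseUCSCtrack(genome = "hg38", tablename = "knownGene")
```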

H.


Thanks, Herve. browseUCSCtrack() is very helpful!!!!


Great! Glad you like it :-)

H.
