Erik van den Akker
Hi all,
I'm a PhD student in bioinformatics working at the Leiden University
Medical Center and at the Delft University of Technology in the
Netherlands. Currently I'm working on the visualization of genome-wide
data sources, such as linkage, GWAS and expression data.
To be able to quickly access information on gene locations (along with
the UTRs, CDS, exons, etc.), I thought it would be a good idea to make
use of the GenomicFeatures package. This package works perfectly and
very quickly for the example provided in the vignette (good job!):
> library(GenomicFeatures)
> system.time(mm9KG <- makeTranscriptDbFromUCSC(genome = "mm9", tablename = "knownGene"))
user system elapsed
49.50 0.69 100.05
> mm9KG
TranscriptDb object:
| Db type: TranscriptDb
| Data source: UCSC
| Genome: mm9
| UCSC Table: knownGene
| Type of Gene ID: Entrez Gene ID
| Full dataset: yes
| transcript_nrow: 49409
| exon_nrow: 237551
| cds_nrow: 204831
| Db created by: GenomicFeatures package from Bioconductor
| Creation time: 2010-07-14 14:07:54 +0200 (Wed, 14 Jul 2010)
| GenomicFeatures version at creation time: 1.0.3
| RSQLite version at creation time: 0.9-1
And even for a larger database (human), this works perfectly:
> system.time(hg19KG <- makeTranscriptDbFromUCSC(genome = "hg19", tablename = "knownGene"))
user system elapsed
82.09 1.11 162.53
> hg19KG
TranscriptDb object:
| Db type: TranscriptDb
| Data source: UCSC
| Genome: hg19
| UCSC Table: knownGene
| Type of Gene ID: Entrez Gene ID
| Full dataset: yes
| transcript_nrow: 77614
| exon_nrow: 281605
| cds_nrow: 236664
| Db created by: GenomicFeatures package from Bioconductor
| Creation time: 2010-07-14 14:11:03 +0200 (Wed, 14 Jul 2010)
| GenomicFeatures version at creation time: 1.0.3
| RSQLite version at creation time: 0.9-1
However, for tablename = "refGene" I had to kill my R session after
half an hour, with both genome = "mm9" and genome = "hg19":
> system.time(hg19KG <- makeTranscriptDbFromUCSC(genome = "mm9", tablename = "refGene"))
> system.time(hg19KG <- makeTranscriptDbFromUCSC(genome = "hg19", tablename = "refGene"))
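Rather than killing the session by hand, a slow call like this could be wrapped with an elapsed-time cap. Below is a minimal base-R sketch. The helper name time_with_limit is hypothetical and is not part of GenomicFeatures or rtracklayer. Note that setTimeLimit() only takes effect at points where R checks for interrupts, so it may not stop every long-running call.

```r
# Hypothetical helper (illustration only): evaluate an expression under an
# elapsed-time cap and return the elapsed seconds, or NA if it was cut off.
time_with_limit <- function(expr, limit_sec) {
  # Always lift the limit again when this function exits.
  on.exit(setTimeLimit(cpu = Inf, elapsed = Inf, transient = FALSE))
  setTimeLimit(elapsed = limit_sec, transient = TRUE)
  tryCatch(
    # 'expr' is evaluated lazily, i.e. inside system.time().
    system.time(expr)[["elapsed"]],
    error = function(e) {
      message("Stopped: ", conditionMessage(e))
      NA_real_
    }
  )
}

# Quick check with a call that is known to be interruptible:
time_with_limit(Sys.sleep(0.2), limit_sec = 5)    # completes within the cap
time_with_limit(Sys.sleep(3), limit_sec = 0.5)    # cut off, gives NA

# Intended use (assumes GenomicFeatures is loaded and UCSC is reachable):
# time_with_limit(
#   hg19RG <- makeTranscriptDbFromUCSC(genome = "hg19", tablename = "refGene"),
#   limit_sec = 600
# )
```

An NA result only says that the cap was reached. It does not distinguish a slow but working call from one that is stuck.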
As this package uses functionality provided by rtracklayer before the
actual SQLite db is stored, I verified whether that part was working
correctly:
> library(rtracklayer)
> session <- browserSession()
> genome(session) <- "hg19"
> query <- ucscTableQuery(session,"refGene")
> system.time(Table <- getTable(query))
user system elapsed
7.70 0.39 61.73
Typing "head(Table)" gave the expected results, suggesting that
something is going wrong in the creation of the SQLite database.
So, my questions:
Given that refGene is listed by supportedUCSCtables(), I wondered:
1) Did I do something wrong?
2) Should I just have more patience?
3) Could anyone confirm these problems?
And @PackageMaintainers: if this is a genuine bug, are you planning to
fix this or speed things up?
As I work with gene expression data, which are commonly annotated with
either RefSeq IDs or Ensembl transcript IDs, I would prefer to work
with TranscriptDb objects based on these identifiers. Although I can
think of many workarounds using "knownGene", I would prefer to work
with the package as originally intended, hence this post.
Thanks for the work already done on this great package!
Cheerz,
Erik van den Akker
> sessionInfo()
R version 2.11.1 (2010-05-31)
i386-pc-mingw32
locale:
[1] LC_COLLATE=Dutch_Netherlands.1252 LC_CTYPE=Dutch_Netherlands.1252 LC_MONETARY=Dutch_Netherlands.1252 LC_NUMERIC=C
[5] LC_TIME=Dutch_Netherlands.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] rtracklayer_1.8.1 RCurl_1.4-2 bitops_1.0-4.1 GenomicFeatures_1.0.3 GenomicRanges_1.0.5 IRanges_1.6.8
loaded via a namespace (and not attached):
[1] Biobase_2.8.0 biomaRt_2.4.0 Biostrings_2.16.7 BSgenome_1.16.5 DBI_0.2-5 RSQLite_0.9-1 tools_2.11.1 XML_3.1-0