Missing mm10 in geneLenDataBase
1
2
Entering edit mode
@robert-k-bradley-5997
Last seen 10 months ago
United States

I've noticed that some of the newer genome assemblies (most notably mm10) are missing from geneLenDataBase. Would it be possible to add these new assemblies to geneLenDataBase?

 

For example, using Bioconductor release 3.0, I find:

> grep("mm9\\..*\\.LENGTH",as.data.frame(data(package="geneLenDataBase")$results,stringsAsFactors=FALSE)$Item,ignore.case=TRUE,value=TRUE) [1] "mm9.acembly.LENGTH"     "mm9.ccdsGene.LENGTH"    "mm9.ensGene.LENGTH"     "mm9.exoniphy.LENGTH"    "mm9.geneSymbol.LENGTH"  "mm9.geneid.LENGTH"      "mm9.genscan.LENGTH"    
 [8] "mm9.knownGene.LENGTH"   "mm9.nscanGene.LENGTH"   "mm9.refGene.LENGTH"     "mm9.sgpGene.LENGTH"     "mm9.xenoRefGene.LENGTH"

> grep("mm10\\..*\\.LENGTH",as.data.frame(data(package="geneLenDataBase")$results,stringsAsFactors=FALSE)$Item,ignore.case=TRUE,value=TRUE)
character(0)

My question is related to this old thread:

mm10 goseq supported genome

 

Best,

Rob

geneLenDataBase mm10 • 2.2k views
ADD COMMENT
0
Entering edit mode
That is a contributed package. Have you contacted the maintainers directly?
ADD REPLY
0
Entering edit mode

I did send them a note alerting them to this post....

ADD REPLY
3
Entering edit mode
@herve-pages-1542
Last seen 9 days ago
Seattle, WA, United States

Hi Robert,

Looks like the data in this package has not been updated for at least the
last 3 years. We'll try to do something about it.

Also, unfortunately the downloadLengthFromUCSC() function seems to have been
removed from the package a long time ago so here is a replacement: 

library(GenomicFeatures)

## Args 'genome' and 'tablename' as in the makeTxDbFromUCSC()
## function.
fetchTranscriptLengthsFromUCSC <- function(genome, tablename)
{
    txdb <- makeTxDbFromUCSC(genome, tablename)
    ans <- mcols(transcripts(txdb, columns=c("tx_name", "gene_id")))
    ans$gene_id <- as.character(ans$gene_id)
    ex_by_tx <- exonsBy(txdb, by="tx", use.names=TRUE)
    tx_lengths <- sum(width(ex_by_tx))
    ans$tx_len <- tx_lengths[ans$tx_name]
    as.data.frame(ans)
}

Then:

mm10_tx_len <- fetchTranscriptLengthsFromUCSC("mm10", "knownGene")
head(mm10_tx_len)
#      tx_name gene_id tx_len
# 1 uc007afg.1   18777   2355
# 2 uc007afh.1   18777   2433
# 3 uc007afi.2   21399   2671
# 4 uc011wht.1   21399   2668
# 5 uc011whu.1   21399   2564
# 6 uc007afm.1  108664   2309

We should probably add this function to the GenomicFeatures package.

Cheers,
H.

ADD COMMENT
0
Entering edit mode

I think the functionality is in goseq::getlength(), where the additional step is to map transcripts to genes and take median transcript length. For mm10 / knownGene it would make sense to use the existing TxDb directly. And since I guess a typical scenario is "I / my lab is working on organism X with annotation Y and I'll have many questions about these annotations" maybe it's better to save the txdb as a package for sharing with colleagues. So

txdb <- makeTxDbFromUCSC(genome, tablename)
makeTxDbPackage(txdb, "0.0.1", "Martin Morgan",
    "Martin Morgan <mtmorgan@usual.place>", 
    "~/labshare/")

with the helper functions

txlengths <- function(txdb) {
    ans <- mcols(transcripts(txdb, columns=c("tx_name", "gene_id")))
    ans$gene_id <- as.character(ans$gene_id)
    ex_by_tx <- exonsBy(txdb, by="tx", use.names=TRUE)
    sum(width(ex_by_tx))
}

(I see that Herve has added GenomicFeatures::transcriptLengths()to devel...)

genelengths <- function(txdb, map=character()){
    len <- txlengths(txdb)
    if (missing(map))
        map <- mapIds(txdb, names(len), "GENEID", "TXNAME")
    median(splitAsList(len, map[names(len)]))
}

(map is a named character vector meant to allow some flexibility in how genes are named; the TxDb uses ENTREZ ids, but are these universally useful? The names of map would be the transcript ids, and the values  would be the corresponding genes)

ADD REPLY
0
Entering edit mode

It's indeed better to keep creation and querying of the TxDb separated so I added transcriptLengths() for extracting the transcript lengths (and other metrics) from a TxDb object:

library(TxDb.Mmusculus.UCSC.mm10.knownGene)
txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene
mm10_txlens <- transcriptLengths(txdb)

head(mm10_txlens)
#   tx_id    tx_name gene_id nexon tx_len
# 1     1 uc007afg.1   18777     8   2355
# 2     2 uc007afh.1   18777     9   2433
# 3     3 uc007afi.2   21399    10   2671
# 4     4 uc011wht.1   21399    10   2668
# 5     5 uc011whu.1   21399    10   2564
# 6     6 uc007afm.1  108664     6   2309

mm10_txlens <- transcriptLengths(txdb, with.cds_len=TRUE,
                                       with.utr5_len=TRUE,
                                       with.utr3_len=TRUE)
head(mm10_txlens)
#   tx_id    tx_name gene_id nexon tx_len cds_len utr5_len utr3_len
# 1     1 uc007afg.1   18777     8   2355     651       36     1668
# 2     2 uc007afh.1   18777     9   2433     693       21     1719
# 3     3 uc007afi.2   21399    10   2671     906      220     1545
# 4     4 uc011wht.1   21399    10   2668     903      220     1545
# 5     5 uc011whu.1   21399    10   2564     939       80     1545
# 6     6 uc007afm.1  108664     6   2309     591      149     1569

 

See ?transcriptLengths for the details.

This is with GenomicFeatures 1.19.20, which will become available to BioC devel users via biocLite() in the next 24 hours or so.

Cheers,

H.

ADD REPLY
0
Entering edit mode

Hello all,

GeneLenDataBase is becoming deprecated. As mentioned above, goseq will grab gene length information for mm10 and other new assemblies on the fly. Using TxDb is a more elegant solution, so I'll add this into goseq for future releases.

Cheers,

Nadia.

ADD REPLY
0
Entering edit mode

Yes please.

I found it is not clear how the 'Length' column was calculated in the 'geneLenDataBase' package, which doesn't seem to include the UTR regions?

ADD REPLY
0
Entering edit mode
sum(width(reduce(ex_by_tx)))

That could be the 'exon union' model?

Anyway, it's good to know that 'getlength()' takes the median values of transcript lengths available, which is not clearly mentioned in the document. 

 

ADD REPLY

Login before adding your answer.

Traffic: 688 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6