How to get list of known genes in TxDb.Mmusculus.UCSC.mm10.knownGene
msr3nf ▴ 10

I am trying to get the gene IDs (and eventually the Entrez Gene IDs) for the known genes in the TxDb.Mmusculus.UCSC.mm10.knownGene database. My ultimate goal is to get the 5' UTR regions for all known protein-coding mouse genes. I attempted this with an external list of about 24,000 genes while following the GenomicRanges vignette, but it returned the error "subscript contains invalid names".

Therefore, I would like to look "within" TxDb.Mmusculus.UCSC.mm10.knownGene to see which genes I can work with (given the error, I am assuming it contains fewer than 24,000).

Here is the code that was generating the error:

txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene
txbygene <- transcriptsBy(txdb, "gene")[entrez_list_all]
@james-w-macdonald-5106

Without providing any code it's hard to parse what you did and what might have caused you to get the error you describe. But getting all the NCBI Gene IDs is simple enough.

>  library(TxDb.Mmusculus.UCSC.mm10.knownGene)
Loading required package: GenomicFeatures
Loading required package: AnnotationDbi
> z <- keys(TxDb.Mmusculus.UCSC.mm10.knownGene)
> head(z)
[1] "100009600" "100009609" "100009614" "100009664" "100012"    "100017"   
> length(z)
[1] 24594

Thanks, I updated my post to show the code. Also, my list of Entrez IDs has 22433 entries, which is fewer than the 24594 shown above, so I am not sure why my list has invalid names...


Actually, I was able to resolve this issue by just finding the list of common IDs from my list and the list you showed how to generate with the keys function. Thanks!
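
For anyone following along, the fix described above might look roughly like the sketch below. The two vectors are small made-up stand-ins: in practice txdb_ids would be the result of keys(TxDb.Mmusculus.UCSC.mm10.knownGene) and entrez_list_all would be your own external list.

```r
# Hypothetical stand-ins for the real data (values chosen for illustration only).
# In a real session: txdb_ids <- keys(TxDb.Mmusculus.UCSC.mm10.knownGene)
txdb_ids <- c("100009600", "100009609", "100012", "100017")
entrez_list_all <- c("100012", "100017", "999999999")

# IDs present in both sets: safe to use as a subscript
common_ids <- intersect(entrez_list_all, txdb_ids)

# IDs only in the external list: these are the names the subscript rejects
missing_ids <- setdiff(entrez_list_all, txdb_ids)

print(common_ids)
print(missing_ids)
```

With the real objects, the original call would then presumably become transcriptsBy(txdb, "gene")[common_ids]. Note that a shorter list can still contain invalid names: what matters is whether every ID is among the TxDb keys, not how many IDs there are.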


That will work, but it might just be a cosmetic fix. Gene IDs come and go, and they mainly go when NCBI realizes that two IDs actually describe the same thing, so they deprecate one in favor of the other. You could therefore argue that some of the Gene IDs missing from the TxDb aren't actually missing per se, but have instead been subsumed into another ID. All things being equal, you might want to first map the 'missing' IDs to whatever current ID still exists and then use the updated Gene ID list.

That said, trying to do that mapping might be difficult or boring (I'm really not sure - maybe you could get those data from NCBI's E-utils, or using the reutils package), in which case what you are doing might be good enough.
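
As a rough illustration of that mapping step, here is a minimal sketch. It assumes you have a table pairing each discontinued Gene ID with its current replacement, such as NCBI's gene_history file; the column names GeneID and Discontinued_GeneID, and the use of "-" for IDs retired without a successor, are my assumptions about that file's layout, and every value below is made up.

```r
# Toy stand-in for a discontinued-ID table (assumed layout, made-up values):
#   GeneID              = current ID, or "-" if the old ID has no successor
#   Discontinued_GeneID = the retired ID
gene_history <- data.frame(
  GeneID = c("100017", "-"),
  Discontinued_GeneID = c("555555", "666666"),
  stringsAsFactors = FALSE
)

# Hypothetical IDs from the external list that are absent from the TxDb
missing_ids <- c("555555", "666666", "777777")

# Look up each missing ID among the discontinued IDs
idx <- match(missing_ids, gene_history$Discontinued_GeneID)
replacement <- gene_history$GeneID[idx]

# Treat "-" (retired without a successor) the same as "no record found"
replacement[!is.na(replacement) & replacement == "-"] <- NA

updated <- data.frame(old = missing_ids, current = replacement,
                      stringsAsFactors = FALSE)
print(updated)
```

IDs that end up with a non-missing value in the current column could be swapped into the gene list before subsetting; the rest would still need to be dropped or checked by hand.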
