Question

How to get list of known genes in TxDb.Mmusculus.UCSC.mm10.knownGene

1

msr3nf ▴ 10

@4b2696f4

Last seen 3.7 years ago

United States

I am trying to get gene IDs (and eventually Entrez Gene IDs) for the known genes in the TxDb.Mmusculus.UCSC.mm10.knownGene database. My ultimate goal is to get the 5' UTR regions for all known protein-coding mouse genes. I attempted this with an external list of about 24,000 genes while following the GenomicRanges vignette, but subsetting with that list returned the error "subscript contains invalid names".

Therefore, I would like to look "within" TxDb.Mmusculus.UCSC.mm10.knownGene to see which genes I can work with (I am assuming it contains fewer than 24,000, given my error).

Here is the code that was generating the error:

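library(TxDb.Mmusculus.UCSC.mm10.knownGene)
txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene
# entrez_list_all: character vector of Entrez Gene IDs from my external list
txbygene <- transcriptsBy(txdb, "gene")[entrez_list_all]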

TxDb.Mmusculus.UCSC.mm10.knownGene • 2.4k views

ADD COMMENT • link updated 3.7 years ago by James W. MacDonald 68k • written 3.7 years ago by msr3nf ▴ 10

score 0 · Answer 1 · 2021-08-10

0

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 2 days ago

United States

Without seeing any code, it's hard to tell what you did and what might have caused the error you describe. But getting all the NCBI Gene IDs is simple enough.

> library(TxDb.Mmusculus.UCSC.mm10.knownGene)
Loading required package: GenomicFeatures
Loading required package: AnnotationDbi
> z <- keys(TxDb.Mmusculus.UCSC.mm10.knownGene)
> head(z)
[1] "100009600" "100009609" "100009614" "100009664" "100012"    "100017"   
> length(z)
[1] 24594

ADD COMMENT • link 3.7 years ago James W. MacDonald 68k

0

Thanks, I updated my post to show the code. Also, my list contains 22,433 Entrez IDs, which is fewer than the 24,594 keys shown above, so I am not sure why my list has invalid names...

ADD REPLY • link 3.7 years ago msr3nf ▴ 10

0

Actually, I was able to resolve this issue by keeping only the IDs common to my list and the list you showed how to generate with the keys function. Thanks!
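
For anyone hitting the same error, here is a minimal sketch of that fix. It uses toy stand-in vectors so it runs without the annotation package: txdb_ids stands in for the output of keys(txdb), and the last ID in entrez_list_all is made up to play the role of an ID that is absent from the TxDb.

```r
# Toy stand-ins; in the thread these would be keys(txdb) and the external list.
txdb_ids <- c("100009600", "100009609", "100012", "100017")
entrez_list_all <- c("100012", "100017", "999999999")  # last ID is made up

# IDs present in both sets: safe to use as a subscript
common_ids <- intersect(entrez_list_all, txdb_ids)

# IDs absent from the TxDb: these are the "invalid names" in the subscript
missing_ids <- setdiff(entrez_list_all, txdb_ids)

print(common_ids)
print(missing_ids)

# With the real objects, the subsetting step would then be:
# txbygene <- transcriptsBy(txdb, "gene")[common_ids]
```

Inspecting missing_ids before discarding it shows how many genes are being dropped. A shorter list can still contain IDs that the TxDb does not know about, which is why the list length alone did not explain the error.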

ADD REPLY • link 3.7 years ago msr3nf ▴ 10

0

That will work, but it might just be a cosmetic fix. Gene IDs come and go, and they mainly go when NCBI realizes that two IDs actually describe the same thing and deprecates one in favor of the other. So you could argue that some of the Gene IDs that are missing from the TxDb aren't actually missing per se, but have instead been subsumed into another ID. All else being equal, you might want to first map the 'missing' IDs to whatever current ID still exists and then use the updated Gene ID list.

That said, trying to do that mapping might be difficult or boring (I'm really not sure - maybe you could get those data from NCBI's E-utils, or using the reutils package), in which case what you are doing might be good enough.
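
If you do want to try the mapping, here is a rough sketch of the idea, not a tested pipeline. It assumes you have a lookup table shaped like NCBI's gene_history file, which as far as I know has a GeneID column (the current ID, or "-" when there is no replacement) and a Discontinued_GeneID column. The table below is a toy with made-up IDs, and update_gene_ids is a hypothetical helper name, not a function from any package.

```r
# Toy table shaped like NCBI's gene_history file (assumed column names);
# all IDs in it are made up for illustration.
gene_history <- data.frame(
  GeneID = c("111", "-", "333"),
  Discontinued_GeneID = c("900", "901", "902"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: swap each discontinued ID for its current ID when one
# exists, drop IDs withdrawn without a replacement, keep all other IDs as-is.
update_gene_ids <- function(ids, history) {
  idx <- match(ids, history$Discontinued_GeneID)
  replacement <- history$GeneID[idx]
  updated <- ifelse(is.na(idx), ids, replacement)
  unique(updated[updated != "-"])
}

old_ids <- c("100012", "900", "901", "902")
new_ids <- update_gene_ids(old_ids, gene_history)
print(new_ids)
```

This does a single pass only, so it would not follow a chain in which the replacement ID was itself discontinued later. You would still need to intersect the updated list with keys(txdb) before subsetting.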

ADD REPLY • link 3.7 years ago James W. MacDonald 68k