Get MANE transcripts from TxDb.Hsapiens.UCSC.hg38.knownGene
1
@789c70a6
Last seen 12 weeks ago
Switzerland
Hello,
I am using TxDb.Hsapiens.UCSC.hg38.knownGene
to get a list of all transcripts for a set of genes I am interested in using this command:
transcripts <- transcriptsBy(txdb, by = "gene")
genes_of_interest <- transcripts[names(transcripts) %in% gene_names]
Now I want to continue looking at only one transcript per gene. Ideally I would use the MANE transcript if it is available for a gene.
So I was wondering if this information is somewhere stored in the TxDb.Hsapiens.UCSC.hg38.knownGene
package or if there is another way to incorperate this information.
Any help is much appreciated!
AnnotationDbi
• 2.1k views
@james-w-macdonald-5106
Last seen 6 hours ago
United States
There is the ncbiRefSeqSelect table that has the MANE and RefSeq Select transcripts. It's not just the MANE though.
> z <- makeTxDbFromUCSC("hg38", "ncbiRefSeqSelect")
Download the ncbiRefSeqSelect table ... OK
Extract the 'transcripts' data frame ... OK
Extract the 'splicings' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .extract_cds_locs_from_UCSC_txtable(ucsc_txtable) :
UCSC data anomaly in 142 transcript(s): the cds cumulative length is
not a multiple of 3 for transcripts 'NM_001282866.2' 'NM_001321967.2'
'NM_000218.3' 'NM_021008.4' 'NM_001170820.4' 'NM_181721.3'
'NM_015231.3' 'NM_022337.3' 'NM_017448.5' 'NM_001286615.2'
'NM_001170738.2' 'NM_001372106.1' 'NM_001024808.3' 'NM_178040.4'
'NM_006322.6' 'NM_002929.3' 'NM_001198950.3' 'NM_014608.6'
'NM_052903.6' 'NM_001277313.2' 'NM_030922.7' 'NM_001270974.2'
'NM_001301771.2' 'NM_153688.4' 'NM_001025200.4' 'NM_014494.4'
'NM_182705.2' 'NM_021962.5' 'NM_001143775.2' 'NM_004715.5'
'NM_025189.4' 'NM_003423.4' 'NM_133180.3' 'NM_133180.3' 'NM_133180.3'
'NM_006737.4' 'NM_133180.3' 'NM_133180.3' 'NM_133180.3' 'NM_133180.3'
'NM_133180.3' 'NM_014686.5' 'NM_001291485.2' 'NM_005481.3'
'NM_006737.4' 'NM_006737.4' 'NM_004829.7' 'NM_004829.7' 'NM_004829.7'
'NM_004829.7' 'NM_004829.7' 'NM_004829.7' 'NM_133180.3'
'NM_001042600.3' 'NM_006737.4' 'NM_006737.4' 'NM_004829.7'
'NM_006737.4' 'NM_0067 [... truncated]
> z
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: UCSC
# Genome: hg38
# Organism: Homo sapiens
# Taxonomy ID: 9606
# UCSC Table: ncbiRefSeqSelect
# UCSC Track: NCBI RefSeq
# Resource URL: http://genome.ucsc.edu/
# Type of Gene ID: no gene ids
# Full dataset: yes
# miRBase build ID: NA
# Nb of transcripts: 21868
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2022-05-18 09:12:57 -0400 (Wed, 18 May 2022)
# GenomicFeatures version at creation time: 1.48.1
# RSQLite version at creation time: 2.2.14
# DBSCHEMAVERSION: 1.2
> transcripts(z)
GRanges object with 21868 ranges and 2 metadata columns:
seqnames ranges strand | tx_id tx_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] chr1 65419-71585 + | 1 NM_001005484.2
[2] chr1 923923-944574 + | 2 NM_001385641.1
[3] chr1 960584-965719 + | 3 NM_198317.3
[4] chr1 966482-975865 + | 4 NM_032129.3
[5] chr1 1013497-1014540 + | 5 NM_005101.4
... ... ... ... . ... ...
[21864] chrX_ML143381v1_fix 297006-300328 - | 21864 NM_001080145.2
[21865] chrX_ML143381v1_fix 301876-305160 - | 21865 NM_001080146.3
[21866] chrX_ML143382v1_fix 24479-28824 + | 21866 NM_033031.3
[21867] chrUn_GL000195v1 42938-86735 - | 21867 NM_001347681.2
[21868] chrUn_GL000213v1 103809-139339 - | 21868 NM_001257362.3
-------
seqinfo: 640 sequences (1 circular) from hg38 genome
If you just want the MANE transcripts, you can get that from NCBI (and presumably EBI/EMBL but I didn't check)
> z <- makeTxDbFromGFF("https://ftp.ncbi.nlm.nih.gov/refseq/MANE/MANE_human/release_1.0/MANE.GRCh38.v1.0.ensembl_genomic.gff.gz")
Import genomic features from the file as a GRanges object ... trying URL 'https://ftp.ncbi.nlm.nih.gov/refseq/MANE/MANE_human/release_1.0/MANE.GRCh38.v1.0.ensembl_genomic.gff.gz'
Content type 'application/x-gzip' length 9421273 bytes (9.0 MB)
downloaded 9.0 MB
OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .get_cds_IDX(mcols0$type, mcols0$phase) :
The "phase" metadata column contains non-NA values for features of type
stop_codon. This information was ignored.
> z
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: https://ftp.ncbi.nlm.nih.gov/refseq/MANE/MANE_human/release_1.0/MANE.GRCh38.v1.0.ensembl_genomic.gff.gz
# Organism: NA
# Taxonomy ID: NA
# miRBase build ID: NA
# Genome: NA
# Nb of transcripts: 19120
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2022-05-18 09:45:54 -0400 (Wed, 18 May 2022)
# GenomicFeatures version at creation time: 1.48.1
# RSQLite version at creation time: 2.2.14
# DBSCHEMAVERSION: 1.2
> transcripts(z)
GRanges object with 19120 ranges and 2 metadata columns:
seqnames ranges strand | tx_id tx_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] chr1 65419-71585 + | 1 ENST00000641515.2
[2] chr1 923923-944574 + | 2 ENST00000616016.5
[3] chr1 960584-965719 + | 3 ENST00000338591.8
[4] chr1 966482-975865 + | 4 ENST00000379410.8
[5] chr1 1013497-1014540 + | 5 ENST00000649529.1
... ... ... ... . ... ...
[19116] chrY 22168542-22182923 - | 19116 ENST00000303766.12
[19117] chrY 23129355-23199008 - | 19117 ENST00000405239.6
[19118] chrY 24045793-24048019 - | 19118 ENST00000382407.1
[19119] chrY 24763069-24813393 - | 19119 ENST00000382365.7
[19120] chrY 25030901-25052104 - | 19120 ENST00000382287.5
-------
seqinfo: 24 sequences from an unspecified genome; no seqlengths
You can substitute 'refseq' for 'ensembl' in the GFF name if you want the RefSeq version.
Login before adding your answer.
Traffic: 553 users visited in the last hour
Great answer! Do you know if this is also possible with hg19? Thanks.
I doubt it. NCBI and EBI/Embl have been working on this for maybe 7 years now, and GRCh37 was superceded in 2009, so I can't imagine they want to add the extra work to deal with something that old. You could always lift it over to GRCh37 if you are still using that build.