Get MANE transcripts from TxDb.Hsapiens.UCSC.hg38.knownGene
1
0
Entering edit mode
nhaus ▴ 70
@789c70a6
Last seen 7 weeks ago
Switzerland

Hello,

I am using TxDb.Hsapiens.UCSC.hg38.knownGene to get a list of all transcripts for a set of genes I am interested in using this command:

transcripts <- transcriptsBy(txdb, by = "gene")
genes_of_interest <-  transcripts[names(transcripts) %in% gene_names]

Now I want to continue looking at only one transcript per gene. Ideally I would use the MANE transcript if it is available for a gene. So I was wondering if this information is somewhere stored in the TxDb.Hsapiens.UCSC.hg38.knownGene package or if there is another way to incorperate this information.

Any help is much appreciated!

AnnotationDbi • 2.1k views
ADD COMMENT
2
Entering edit mode
@james-w-macdonald-5106
Last seen 9 hours ago
United States

There is the ncbiRefSeqSelect table that has the MANE and RefSeq Select transcripts. It's not just the MANE though.

> z <- makeTxDbFromUCSC("hg38", "ncbiRefSeqSelect")
Download the ncbiRefSeqSelect table ... OK
Extract the 'transcripts' data frame ... OK
Extract the 'splicings' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .extract_cds_locs_from_UCSC_txtable(ucsc_txtable) :
  UCSC data anomaly in 142 transcript(s): the cds cumulative length is
  not a multiple of 3 for transcripts 'NM_001282866.2' 'NM_001321967.2'
  'NM_000218.3' 'NM_021008.4' 'NM_001170820.4' 'NM_181721.3'
  'NM_015231.3' 'NM_022337.3' 'NM_017448.5' 'NM_001286615.2'
  'NM_001170738.2' 'NM_001372106.1' 'NM_001024808.3' 'NM_178040.4'
  'NM_006322.6' 'NM_002929.3' 'NM_001198950.3' 'NM_014608.6'
  'NM_052903.6' 'NM_001277313.2' 'NM_030922.7' 'NM_001270974.2'
  'NM_001301771.2' 'NM_153688.4' 'NM_001025200.4' 'NM_014494.4'
  'NM_182705.2' 'NM_021962.5' 'NM_001143775.2' 'NM_004715.5'
  'NM_025189.4' 'NM_003423.4' 'NM_133180.3' 'NM_133180.3' 'NM_133180.3'
  'NM_006737.4' 'NM_133180.3' 'NM_133180.3' 'NM_133180.3' 'NM_133180.3'
  'NM_133180.3' 'NM_014686.5' 'NM_001291485.2' 'NM_005481.3'
  'NM_006737.4' 'NM_006737.4' 'NM_004829.7' 'NM_004829.7' 'NM_004829.7'
  'NM_004829.7' 'NM_004829.7' 'NM_004829.7' 'NM_133180.3'
  'NM_001042600.3' 'NM_006737.4' 'NM_006737.4' 'NM_004829.7'
  'NM_006737.4' 'NM_0067 [... truncated]

> z
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: UCSC
# Genome: hg38
# Organism: Homo sapiens
# Taxonomy ID: 9606
# UCSC Table: ncbiRefSeqSelect
# UCSC Track: NCBI RefSeq
# Resource URL: http://genome.ucsc.edu/
# Type of Gene ID: no gene ids
# Full dataset: yes
# miRBase build ID: NA
# Nb of transcripts: 21868
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2022-05-18 09:12:57 -0400 (Wed, 18 May 2022)
# GenomicFeatures version at creation time: 1.48.1
# RSQLite version at creation time: 2.2.14
# DBSCHEMAVERSION: 1.2
> transcripts(z)
GRanges object with 21868 ranges and 2 metadata columns:
                     seqnames          ranges strand |     tx_id        tx_name
                        <Rle>       <IRanges>  <Rle> | <integer>    <character>
      [1]                chr1     65419-71585      + |         1 NM_001005484.2
      [2]                chr1   923923-944574      + |         2 NM_001385641.1
      [3]                chr1   960584-965719      + |         3    NM_198317.3
      [4]                chr1   966482-975865      + |         4    NM_032129.3
      [5]                chr1 1013497-1014540      + |         5    NM_005101.4
      ...                 ...             ...    ... .       ...            ...
  [21864] chrX_ML143381v1_fix   297006-300328      - |     21864 NM_001080145.2
  [21865] chrX_ML143381v1_fix   301876-305160      - |     21865 NM_001080146.3
  [21866] chrX_ML143382v1_fix     24479-28824      + |     21866    NM_033031.3
  [21867]    chrUn_GL000195v1     42938-86735      - |     21867 NM_001347681.2
  [21868]    chrUn_GL000213v1   103809-139339      - |     21868 NM_001257362.3
  -------
  seqinfo: 640 sequences (1 circular) from hg38 genome

If you just want the MANE transcripts, you can get that from NCBI (and presumably EBI/EMBL but I didn't check)

> z <- makeTxDbFromGFF("https://ftp.ncbi.nlm.nih.gov/refseq/MANE/MANE_human/release_1.0/MANE.GRCh38.v1.0.ensembl_genomic.gff.gz")
Import genomic features from the file as a GRanges object ... trying URL 'https://ftp.ncbi.nlm.nih.gov/refseq/MANE/MANE_human/release_1.0/MANE.GRCh38.v1.0.ensembl_genomic.gff.gz'
Content type 'application/x-gzip' length 9421273 bytes (9.0 MB)
downloaded 9.0 MB

OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .get_cds_IDX(mcols0$type, mcols0$phase) :
  The "phase" metadata column contains non-NA values for features of type
  stop_codon. This information was ignored.
> z
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: https://ftp.ncbi.nlm.nih.gov/refseq/MANE/MANE_human/release_1.0/MANE.GRCh38.v1.0.ensembl_genomic.gff.gz
# Organism: NA
# Taxonomy ID: NA
# miRBase build ID: NA
# Genome: NA
# Nb of transcripts: 19120
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2022-05-18 09:45:54 -0400 (Wed, 18 May 2022)
# GenomicFeatures version at creation time: 1.48.1
# RSQLite version at creation time: 2.2.14
# DBSCHEMAVERSION: 1.2
> transcripts(z)
GRanges object with 19120 ranges and 2 metadata columns:
          seqnames            ranges strand |     tx_id            tx_name
             <Rle>         <IRanges>  <Rle> | <integer>        <character>
      [1]     chr1       65419-71585      + |         1  ENST00000641515.2
      [2]     chr1     923923-944574      + |         2  ENST00000616016.5
      [3]     chr1     960584-965719      + |         3  ENST00000338591.8
      [4]     chr1     966482-975865      + |         4  ENST00000379410.8
      [5]     chr1   1013497-1014540      + |         5  ENST00000649529.1
      ...      ...               ...    ... .       ...                ...
  [19116]     chrY 22168542-22182923      - |     19116 ENST00000303766.12
  [19117]     chrY 23129355-23199008      - |     19117  ENST00000405239.6
  [19118]     chrY 24045793-24048019      - |     19118  ENST00000382407.1
  [19119]     chrY 24763069-24813393      - |     19119  ENST00000382365.7
  [19120]     chrY 25030901-25052104      - |     19120  ENST00000382287.5
  -------
  seqinfo: 24 sequences from an unspecified genome; no seqlengths

You can substitute 'refseq' for 'ensembl' in the GFF name if you want the RefSeq version.

ADD COMMENT
0
Entering edit mode

Great answer! Do you know if this is also possible with hg19? Thanks.

ADD REPLY
0
Entering edit mode

I doubt it. NCBI and EBI/Embl have been working on this for maybe 7 years now, and GRCh37 was superceded in 2009, so I can't imagine they want to add the extra work to deal with something that old. You could always lift it over to GRCh37 if you are still using that build.

ADD REPLY

Login before adding your answer.

Traffic: 584 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6