Question

Get MANE transcripts from TxDb.Hsapiens.UCSC.hg38.knownGene

0

Entering edit mode

nhaus ▴ 70

@789c70a6

Last seen 5 months ago

Switzerland

Hello,

I am using TxDb.Hsapiens.UCSC.hg38.knownGene to get a list of all transcripts for a set of genes I am interested in using this command:

transcripts <- transcriptsBy(txdb, by = "gene")
genes_of_interest <-  transcripts[names(transcripts) %in% gene_names]

Now I want to continue looking at only one transcript per gene. Ideally I would use the MANE transcript if it is available for a gene. So I was wondering if this information is somewhere stored in the TxDb.Hsapiens.UCSC.hg38.knownGene package or if there is another way to incorperate this information.

Any help is much appreciated!

AnnotationDbi • 2.3k views

ADD COMMENT • link updated 18 months ago by James W. MacDonald 68k • written 2.8 years ago by nhaus ▴ 70

score 2 · Accepted Answer · 2022-05-18

There is the ncbiRefSeqSelect table that has the MANE and RefSeq Select transcripts. It's not just the MANE though.

> z <- makeTxDbFromUCSC("hg38", "ncbiRefSeqSelect")
Download the ncbiRefSeqSelect table ... OK
Extract the 'transcripts' data frame ... OK
Extract the 'splicings' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .extract_cds_locs_from_UCSC_txtable(ucsc_txtable) :
  UCSC data anomaly in 142 transcript(s): the cds cumulative length is
  not a multiple of 3 for transcripts 'NM_001282866.2' 'NM_001321967.2'
  'NM_000218.3' 'NM_021008.4' 'NM_001170820.4' 'NM_181721.3'
  'NM_015231.3' 'NM_022337.3' 'NM_017448.5' 'NM_001286615.2'
  'NM_001170738.2' 'NM_001372106.1' 'NM_001024808.3' 'NM_178040.4'
  'NM_006322.6' 'NM_002929.3' 'NM_001198950.3' 'NM_014608.6'
  'NM_052903.6' 'NM_001277313.2' 'NM_030922.7' 'NM_001270974.2'
  'NM_001301771.2' 'NM_153688.4' 'NM_001025200.4' 'NM_014494.4'
  'NM_182705.2' 'NM_021962.5' 'NM_001143775.2' 'NM_004715.5'
  'NM_025189.4' 'NM_003423.4' 'NM_133180.3' 'NM_133180.3' 'NM_133180.3'
  'NM_006737.4' 'NM_133180.3' 'NM_133180.3' 'NM_133180.3' 'NM_133180.3'
  'NM_133180.3' 'NM_014686.5' 'NM_001291485.2' 'NM_005481.3'
  'NM_006737.4' 'NM_006737.4' 'NM_004829.7' 'NM_004829.7' 'NM_004829.7'
  'NM_004829.7' 'NM_004829.7' 'NM_004829.7' 'NM_133180.3'
  'NM_001042600.3' 'NM_006737.4' 'NM_006737.4' 'NM_004829.7'
  'NM_006737.4' 'NM_0067 [... truncated]

> z
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: UCSC
# Genome: hg38
# Organism: Homo sapiens
# Taxonomy ID: 9606
# UCSC Table: ncbiRefSeqSelect
# UCSC Track: NCBI RefSeq
# Resource URL: http://genome.ucsc.edu/
# Type of Gene ID: no gene ids
# Full dataset: yes
# miRBase build ID: NA
# Nb of transcripts: 21868
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2022-05-18 09:12:57 -0400 (Wed, 18 May 2022)
# GenomicFeatures version at creation time: 1.48.1
# RSQLite version at creation time: 2.2.14
# DBSCHEMAVERSION: 1.2
> transcripts(z)
GRanges object with 21868 ranges and 2 metadata columns:
                     seqnames          ranges strand |     tx_id        tx_name
                        <Rle>       <IRanges>  <Rle> | <integer>    <character>
      [1]                chr1     65419-71585      + |         1 NM_001005484.2
      [2]                chr1   923923-944574      + |         2 NM_001385641.1
      [3]                chr1   960584-965719      + |         3    NM_198317.3
      [4]                chr1   966482-975865      + |         4    NM_032129.3
      [5]                chr1 1013497-1014540      + |         5    NM_005101.4
      ...                 ...             ...    ... .       ...            ...
  [21864] chrX_ML143381v1_fix   297006-300328      - |     21864 NM_001080145.2
  [21865] chrX_ML143381v1_fix   301876-305160      - |     21865 NM_001080146.3
  [21866] chrX_ML143382v1_fix     24479-28824      + |     21866    NM_033031.3
  [21867]    chrUn_GL000195v1     42938-86735      - |     21867 NM_001347681.2
  [21868]    chrUn_GL000213v1   103809-139339      - |     21868 NM_001257362.3
  -------
  seqinfo: 640 sequences (1 circular) from hg38 genome

If you just want the MANE transcripts, you can get that from NCBI (and presumably EBI/EMBL but I didn't check)

> z <- makeTxDbFromGFF("https://ftp.ncbi.nlm.nih.gov/refseq/MANE/MANE_human/release_1.0/MANE.GRCh38.v1.0.ensembl_genomic.gff.gz")
Import genomic features from the file as a GRanges object ... trying URL 'https://ftp.ncbi.nlm.nih.gov/refseq/MANE/MANE_human/release_1.0/MANE.GRCh38.v1.0.ensembl_genomic.gff.gz'
Content type 'application/x-gzip' length 9421273 bytes (9.0 MB)
downloaded 9.0 MB

OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .get_cds_IDX(mcols0$type, mcols0$phase) :
  The "phase" metadata column contains non-NA values for features of type
  stop_codon. This information was ignored.
> z
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: https://ftp.ncbi.nlm.nih.gov/refseq/MANE/MANE_human/release_1.0/MANE.GRCh38.v1.0.ensembl_genomic.gff.gz
# Organism: NA
# Taxonomy ID: NA
# miRBase build ID: NA
# Genome: NA
# Nb of transcripts: 19120
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2022-05-18 09:45:54 -0400 (Wed, 18 May 2022)
# GenomicFeatures version at creation time: 1.48.1
# RSQLite version at creation time: 2.2.14
# DBSCHEMAVERSION: 1.2
> transcripts(z)
GRanges object with 19120 ranges and 2 metadata columns:
          seqnames            ranges strand |     tx_id            tx_name
             <Rle>         <IRanges>  <Rle> | <integer>        <character>
      [1]     chr1       65419-71585      + |         1  ENST00000641515.2
      [2]     chr1     923923-944574      + |         2  ENST00000616016.5
      [3]     chr1     960584-965719      + |         3  ENST00000338591.8
      [4]     chr1     966482-975865      + |         4  ENST00000379410.8
      [5]     chr1   1013497-1014540      + |         5  ENST00000649529.1
      ...      ...               ...    ... .       ...                ...
  [19116]     chrY 22168542-22182923      - |     19116 ENST00000303766.12
  [19117]     chrY 23129355-23199008      - |     19117  ENST00000405239.6
  [19118]     chrY 24045793-24048019      - |     19118  ENST00000382407.1
  [19119]     chrY 24763069-24813393      - |     19119  ENST00000382365.7
  [19120]     chrY 25030901-25052104      - |     19120  ENST00000382287.5
  -------
  seqinfo: 24 sequences from an unspecified genome; no seqlengths

You can substitute 'refseq' for 'ensembl' in the GFF name if you want the RefSeq version.