
Question

TCGA data query (GDCquery): some "external_gene_name" values are missing


H. Z. Amini ▴ 10

@habibolla-24859

Last seen 3.7 years ago

Morgantown

Hi, I just got some odd output from a TCGA dataset. As you can see in the picture below, some of the "external_gene_name" values are missing. Could you please help me with this issue? Thank you.

query.seq <- GDCquery(project = "TCGA-BRCA", 
                      data.category = "Transcriptome Profiling", 
                      data.type = "Gene Expression Quantification",
                      sample.type = c("Solid Tissue Normal", "Primary Tumor"),
                      workflow.type = "HTSeq - Counts")


GDCdownload(query.seq)

seq.brca <- GDCprepare(query = query.seq, summarizedExperiment = TRUE)

[Image: screenshot of the resulting table, with the rows that have missing "external_gene_name" values circled]


score 0 · Answer 1 · 2021-06-01

Hi,

You already received an answer on Biostars: https://www.biostars.org/p/9473246/#9473248

The different annotation databases do not overlap perfectly. Each database includes transcripts and genes based on its own rules, so an ID present in one can be absent from another. There are many previous questions on this topic, and even publications. It is not strictly an issue with the TCGAbiolinks package.

Let's see if we can find out more about the two circled IDs, i.e., ENSG00000281904 and ENSG00000281920:

genes <- c('ENSG00000281904','ENSG00000281920')

1, via org.Hs.eg.db

require(org.Hs.eg.db)

mapIds(
  org.Hs.eg.db,
  keys = genes,
  column = 'SYMBOL',
  keytype = 'ENSEMBL')

Error in .testForValidKeys(x, keys, keytype, fks) : 
  None of the keys entered are valid keys for 'ENSEMBL'. Please use the keys method to see a listing of valid arguments.

Not there.
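
As a side note, mapIds() raises an error when none of the supplied keys are valid, which can interrupt a longer script. A minimal sketch of one way around that, assuming org.Hs.eg.db is installed, is to check the keys first and fill in NA for anything the database does not know about. The names `known` and `symbols` are illustrative only:

```r
suppressPackageStartupMessages({
  library(AnnotationDbi)
  library(org.Hs.eg.db)
})

genes <- c('ENSG00000281904', 'ENSG00000281920')

# Which of our IDs are valid ENSEMBL keys in this version of the database?
known <- genes %in% keys(org.Hs.eg.db, keytype = 'ENSEMBL')

# Start with NA for every ID, then fill in only the ones that can be mapped.
symbols <- setNames(rep(NA_character_, length(genes)), genes)
if (any(known)) {
  symbols[known] <- mapIds(
    org.Hs.eg.db,
    keys = genes[known],
    column = 'SYMBOL',
    keytype = 'ENSEMBL')
}

symbols
```

Whether a given ID is found depends on the installed annotation release, so the result may differ between versions of org.Hs.eg.db.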

2, via biomaRt

require(biomaRt)

ensembl <- useMart('ensembl', dataset = 'hsapiens_gene_ensembl')

annot <- getBM(
  attributes = c(
    'hgnc_symbol',
    'external_gene_name',
    'ensembl_gene_id',
    'entrezgene_id',
    'gene_biotype'),
  filters = 'ensembl_gene_id',
  values = genes,
  mart = ensembl)

annot <- merge(
  x = as.data.frame(genes),
  y = annot,
  by.y = 'ensembl_gene_id',
  all.x = TRUE,
  by.x = 'genes')

annot
            genes hgnc_symbol external_gene_name entrezgene_id gene_biotype
1 ENSG00000281904          NA                 NA            NA       lncRNA
2 ENSG00000281920          NA                 NA            NA       lncRNA

Nothing there either, but at least we can see that these are long non-coding RNAs.
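
If the missing names are a practical problem downstream (for example, when labelling plots), one common workaround is to fall back to the Ensembl ID wherever no name is available. The following is only a sketch in base R: the data frame is rebuilt by hand from the output printed above, and the column name `label` is a hypothetical choice, not something produced by biomaRt or TCGAbiolinks.

```r
# Annotation table as printed above (values copied from that output).
annot <- data.frame(
  genes = c('ENSG00000281904', 'ENSG00000281920'),
  hgnc_symbol = NA_character_,
  external_gene_name = NA_character_,
  entrezgene_id = NA_integer_,
  gene_biotype = 'lncRNA',
  stringsAsFactors = FALSE)

# Treat empty strings as missing as well, since annotation queries
# may return '' rather than NA for absent names.
missing_name <- is.na(annot$external_gene_name) | annot$external_gene_name == ''

# Hypothetical 'label' column: gene name where available, Ensembl ID otherwise.
annot$label <- ifelse(missing_name, annot$genes, annot$external_gene_name)

annot[, c('genes', 'label', 'gene_biotype')]
```

This keeps every gene in the table and leaves the original columns untouched, so it remains clear which labels are real names and which are fallback IDs.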

Kevin