biomaRt question (missing gene information?)
Mike ▴ 10

I am using the Peripheral Blood Mononuclear Cells (PBMC) data from the Seurat website (https://satijalab.org/seurat/pbmc3k_tutorial.html), and I am trying to get gene information with the biomaRt package in R.

 

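library(biomaRt)
# Connect to the human gene dataset in Ensembl BioMart
ensembl <- useMart("ensembl", 
                   dataset = "hsapiens_gene_ensembl")
# Variable genes from the Seurat object (HGNC symbols)
pbmc_filter <- pbmc@var.genes
# Note: newer Ensembl releases name the "entrezgene" attribute "entrezgene_id"
attr <- c("ensembl_gene_id", "hgnc_symbol","entrezgene","chromosome_name", "start_position", "end_position")
# Query by HGNC symbol; only symbols that match a current symbol are returned
Info <- getBM(attributes = attr,
              filters = "hgnc_symbol",
              values = pbmc_filter,
              mart = ensembl)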

 

This is the code I used. The problem is that pbmc_filter (i.e. pbmc@var.genes) contains 1838 genes, but the query returns information for only 1625 of them (after removing duplicated hgnc_symbol entries).

I found that some of the genes are not in the hsapiens_gene_ensembl dataset. For example, "CPSF3L" is in pbmc_filter but not in the dataset. Does anyone know how to get information for all 1838 genes using biomaRt?

Thank you in advance.

 

 

Mike Smith ★ 6.6k
EMBL Heidelberg

If you search Ensembl for CPSF3L you'll be directed to the page for INTS11, where CPSF3L is listed as a Gene Synonym. The exact reason isn't given, but usually it is because the old name has been retired, or two entries were merged at some point in the past.

Unfortunately, although a search of Ensembl will get you to the correct page, this synonym lookup is not available via the BioMart interface. If you don't use the current gene symbol you don't find anything, as you are experiencing.
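
To see which symbols are affected, you can compare what you submitted with what came back. This is only a minimal sketch in base R: the Info data frame here is a toy stand-in for the getBM() result (so it runs without a network connection), and missing_symbols is a name made up for illustration.

```r
# Symbols submitted to getBM(); stands in for pbmc@var.genes
pbmc_filter <- c("CPSF3L", "CDC6", "CD3E")

# Toy stand-in for the data frame returned by getBM()
Info <- data.frame(
  hgnc_symbol     = c("CDC6", "CD3E"),
  chromosome_name = c("17", "11"),
  stringsAsFactors = FALSE
)

# Symbols that were queried but never came back from the mart
missing_symbols <- setdiff(pbmc_filter, Info$hgnc_symbol)
print(missing_symbols)
```

With the real objects, length(missing_symbols) should account for the gap between the 1838 submitted genes and the unique symbols returned.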

Happily, there are several ways to convert old symbols to the current ones using Bioconductor packages. My favourite is the alias2Symbol() function in limma. Here's an example with CPSF3L:

> limma::alias2Symbol("CPSF3L")
[1] "INTS11"

You can then use biomaRt with these updated symbols.
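
If you want to keep track of which original symbol each row belongs to, one option is a small lookup table that you merge with the biomaRt result. The following is just a sketch with toy data: symbol_map and annotated are hypothetical names, and Info stands in for what getBM() would return when queried with the updated symbols.

```r
# Hypothetical lookup table: original symbol -> current symbol
symbol_map <- data.frame(
  original = c("CPSF3L", "CDC6", "BIM"),
  current  = c("INTS11", "CDC6", "BCL2L11"),
  stringsAsFactors = FALSE
)

# Toy stand-in for the getBM() result obtained with the current symbols
Info <- data.frame(
  hgnc_symbol     = c("BCL2L11", "CDC6", "INTS11"),
  chromosome_name = c("2", "17", "1"),
  stringsAsFactors = FALSE
)

# Attach the annotation to every original symbol; all.x = TRUE keeps
# symbols that found no match (their annotation columns become NA)
annotated <- merge(symbol_map, Info,
                   by.x = "current", by.y = "hgnc_symbol",
                   all.x = TRUE)

# merge() sorts by the key, so restore the original input order
annotated <- annotated[match(symbol_map$original, annotated$original), ]
print(annotated)
```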


If you want to convert more than one symbol, note that alias2Symbol() does not preserve the order of the vector you pass it, nor does it indicate a mapping back to the original order, e.g.

> limma::alias2Symbol(c("CPSF3L", "CDC6", "BIM"))
[1] "CDC6"    "BCL2L11" "INTS11"

This might not matter, but if you want to retain the original order (e.g. because it matches the rows in a counts matrix) I suggest using vapply():

> vapply(X = c("CPSF3L", "CDC6", "BIM"), 
+        FUN = limma::alias2Symbol,
+        FUN.VALUE = character(1))
   CPSF3L      CDC6       BIM 
 "INTS11"    "CDC6" "BCL2L11"
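
One caveat: vapply() with FUN.VALUE = character(1) stops with an error as soon as one input yields anything other than exactly one string, which I would expect to happen for an alias with no match or with several candidate symbols. Below is a rough sketch of a wrapper that returns NA in those cases. To keep it runnable without any packages it uses a toy lookup function (toy_lookup, with made-up entries AMBIG, GENE1 and GENE2); map_one is likewise a hypothetical helper, and in practice you would pass limma::alias2Symbol as the lookup argument.

```r
# Toy stand-in for limma::alias2Symbol(): returns zero, one, or several symbols
toy_lookup <- function(alias) {
  alias_table <- list(
    CPSF3L = "INTS11",
    CDC6   = "CDC6",
    BIM    = "BCL2L11",
    AMBIG  = c("GENE1", "GENE2")  # made-up ambiguous alias
  )
  hits <- alias_table[[alias]]
  if (is.null(hits)) character(0) else hits
}

# Return the symbol only when the lookup is unambiguous, otherwise NA
map_one <- function(alias, lookup) {
  hits <- lookup(alias)
  if (length(hits) == 1) hits else NA_character_
}

aliases <- c("CPSF3L", "NOTAGENE", "AMBIG", "BIM")

# Order and names follow the input vector
updated <- vapply(aliases, map_one,
                  FUN.VALUE = character(1),
                  lookup = toy_lookup)
print(updated)
```

Returning NA rather than picking the first hit is a deliberate choice here: it makes the ambiguous or unknown symbols easy to find with is.na() so you can resolve them by hand.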

As an alternative, you can use mapIds() from AnnotationDbi directly with the org.Hs.eg.db annotation package:

> mapIds(org.Hs.eg.db, c("CPSF3L", "CDC6", "BIM"), "SYMBOL", "ALIAS")
'select()' returned 1:1 mapping between keys and columns
   CPSF3L      CDC6       BIM
 "INTS11"    "CDC6" "BCL2L11"