Does anyone know of a method I can use to find out which genes are coding and non-coding from my gene list? I'm trying to avoid having to look up each individual gene. Thanks!
--Dr. S
Hey,
biomaRt is one solution. Take a look at this example, starting with HGNC symbols:
genes <- c('BRCA1', 'XIST', 'TXNIP', 'AFG3L1P')

library(biomaRt)

# connect to the Ensembl BioMart and select the human gene dataset
mart <- useMart("ENSEMBL_MART_ENSEMBL", host = "useast.ensembl.org")
mart <- useDataset("hsapiens_gene_ensembl", mart)

# look up IDs and biotype for each HGNC symbol
annotLookup <- getBM(
  mart = mart,
  attributes = c(
    "hgnc_symbol",
    "entrezgene_id",
    "ensembl_gene_id",
    "gene_biotype"),
  filters = "hgnc_symbol",
  values = genes,
  uniqueRows = TRUE)

annotLookup
  hgnc_symbol entrezgene_id ensembl_gene_id                   gene_biotype
1     AFG3L1P           172 ENSG00000223959 transcribed_unitary_pseudogene
2       BRCA1           672 ENSG00000012048                 protein_coding
3       TXNIP         10628 ENSG00000265972                 protein_coding
4        XIST            NA ENSG00000229807                         lncRNA
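With a table like annotLookup in hand, splitting the original list comes down to comparing gene_biotype against "protein_coding". The following is a minimal sketch in base R, with the table above hard-coded so that it runs without querying Ensembl. The names is_coding, coding, non_coding and unmatched are illustrative, and treating every biotype other than protein_coding (pseudogenes included) as non-coding is an assumption you may want to refine.

```r
genes <- c("BRCA1", "XIST", "TXNIP", "AFG3L1P")

# Hard-coded copy of the lookup table shown above (no Ensembl query needed)
annotLookup <- data.frame(
  hgnc_symbol     = c("AFG3L1P", "BRCA1", "TXNIP", "XIST"),
  entrezgene_id   = c(172L, 672L, 10628L, NA),
  ensembl_gene_id = c("ENSG00000223959", "ENSG00000012048",
                      "ENSG00000265972", "ENSG00000229807"),
  gene_biotype    = c("transcribed_unitary_pseudogene", "protein_coding",
                      "protein_coding", "lncRNA"),
  stringsAsFactors = FALSE
)

# Assumption: anything that is not 'protein_coding' counts as non-coding
is_coding  <- annotLookup$gene_biotype == "protein_coding"
coding     <- annotLookup$hgnc_symbol[is_coding]
non_coding <- annotLookup$hgnc_symbol[!is_coding]

# Symbols from the input list that the lookup did not return at all
unmatched <- setdiff(genes, annotLookup$hgnc_symbol)

coding
non_coding
unmatched
```

Checking unmatched is worthwhile on a real list, because symbols missing from the lookup would otherwise be silently dropped from both groups.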
Kevin
Another approach uses the ensembldb package and resources from AnnotationHub. Load the packages:
library(ensembldb)
library(AnnotationHub)
library(dplyr)
Discover and retrieve the appropriate database, here for Homo sapiens, Ensembl release 97:
hub = AnnotationHub()
query(hub, c("EnsDb", "Homo sapiens", "97"))
edb = hub[["AH73881"]]
Discover fields available for query (keytypes()) and for retrieval (columns()), and map all HGNC symbols to Entrez and Ensembl identifiers and gene biotypes, transforming to a tibble for convenience:
keytypes(edb)
columns(edb)
keys = keys(edb, "GENENAME")
columns = c("GENEID", "ENTREZID", "GENEBIOTYPE")
tbl =
    ensembldb::select(edb, keys, columns, keytype = "GENENAME") %>%
    as_tibble()
The result is
> tbl
# A tibble: 68,027 x 4
   GENENAME  GENEID          ENTREZID GENEBIOTYPE
   <chr>     <chr>              <int> <chr>
 1 A1BG      ENSG00000121410        1 protein_coding
 2 A1BG-AS1  ENSG00000268895       NA lncRNA
 3 A1CF      ENSG00000148584    29974 protein_coding
 4 A2M       ENSG00000175899        2 protein_coding
 5 A2M-AS1   ENSG00000245105   144571 lncRNA
 6 A2ML1     ENSG00000166535   144568 protein_coding
 7 A2ML1-AS1 ENSG00000256661       NA lncRNA
 8 A2ML1-AS2 ENSG00000256904       NA lncRNA
 9 A2MP1     ENSG00000256069        3 transcribed_unprocessed_pseudogene
10 A3GALT2   ENSG00000184389   127550 protein_coding
# … with 68,017 more rows
Filters are a very useful feature. List the supported filters with supportedFilters(), then use a filter expression to retrieve the same results as above, restricted to the protein_coding biotype:
supportedFilters(edb)
filter = ~ gene_name %in% keys & gene_biotype == "protein_coding"
tbl =
    ensembldb::select(edb, filter, columns) %>%
    as_tibble()
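The biotype column holds many more values than just protein_coding and lncRNA, so for a coding versus non-coding split it can help to collapse the detailed biotypes into a few broad classes first. The sketch below does this in base R on the ten rows shown in the output above, hard-coded so that it runs without AnnotationHub. The helper broad_class() and its three class labels are hypothetical, and keeping pseudogenes as a class of their own is a choice made here for illustration, not something ensembldb prescribes.

```r
# First ten rows of the tibble shown above, hard-coded for illustration
tbl_head <- data.frame(
  GENENAME    = c("A1BG", "A1BG-AS1", "A1CF", "A2M", "A2M-AS1",
                  "A2ML1", "A2ML1-AS1", "A2ML1-AS2", "A2MP1", "A3GALT2"),
  GENEBIOTYPE = c("protein_coding", "lncRNA", "protein_coding",
                  "protein_coding", "lncRNA", "protein_coding",
                  "lncRNA", "lncRNA",
                  "transcribed_unprocessed_pseudogene", "protein_coding"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: map a detailed biotype to one of three broad classes
broad_class <- function(biotype) {
  ifelse(biotype == "protein_coding", "coding",
         ifelse(grepl("pseudogene", biotype, fixed = TRUE),
                "pseudogene", "non_coding"))
}

tbl_head$CLASS <- broad_class(tbl_head$GENEBIOTYPE)

# How many genes fall into each broad class
class_counts <- table(tbl_head$CLASS)
class_counts
```

Running table() on the full GENEBIOTYPE column first is a quick way to see which biotypes are actually present before deciding how to group them.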
Hi Kevin,
I used biomaRt in the following way to filter out protein-coding genes from my list, but I think it is missing some genes. Any suggestions?
In all_coding_genes I got 19,391 gene names, of which 19,081 match my data. But in the non-coding list, obtained with
rawcount <- rawcount[!(row.names(rawcount) %in% all_coding_genes$hgnc_symbol),]
I can still find some genes that GeneCards lists as protein_coding, such as SEPT14, PRR26, etc.
There will always be some discrepancies between the different gene annotation databases, because they are constantly being updated.
In this case, it looks like SEPT14 is actually there, but has a different symbol:
Thanks for your reply. Do you know how to deal with this for a large number of genes?
It may be better to deal with this issue from the source of the data. For example, start with Ensembl or Entrez IDs, which are unique, in place of gene symbols. These IDs were obviously introduced due to these discrepancies that can exist with gene symbols. Dealing with the issue now is problematic.
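As a rough illustration of working from Ensembl IDs instead of symbols, the sketch below matches the row names of a count matrix to an annotation table by ensembl_gene_id. Everything in it is a toy assumption: the counts are made up, the version suffixes (such as ".23") are invented to show how they can be stripped, and ENSG00000999999 is a fictitious ID standing in for a gene that is absent from the annotation. In practice the annot table would come from a biomaRt or ensembldb lookup like the ones earlier in the thread.

```r
# Toy count matrix; row names are Ensembl gene IDs with invented version suffixes
rawcount <- matrix(
  c(10, 0,
     5, 7,
     3, 2,
     8, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("ENSG00000012048.23", "ENSG00000229807.13",
      "ENSG00000265972.6",  "ENSG00000999999.1"),  # last ID is fictitious
    c("sample1", "sample2"))
)

# Small annotation table keyed on unversioned Ensembl gene IDs
annot <- data.frame(
  ensembl_gene_id = c("ENSG00000012048", "ENSG00000265972", "ENSG00000229807"),
  gene_biotype    = c("protein_coding", "protein_coding", "lncRNA"),
  stringsAsFactors = FALSE
)

# Strip the version suffix, then look up each row's biotype by ID
ids     <- sub("\\.[0-9]+$", "", rownames(rawcount))
biotype <- annot$gene_biotype[match(ids, annot$ensembl_gene_id)]

# Split the matrix; rows without an annotation are reported, not silently dropped
coding_counts    <- rawcount[!is.na(biotype) & biotype == "protein_coding", , drop = FALSE]
noncoding_counts <- rawcount[!is.na(biotype) & biotype != "protein_coding", , drop = FALSE]
unannotated      <- ids[is.na(biotype)]

coding_counts
noncoding_counts
unannotated
```

Keeping the unannotated IDs as a separate vector makes it easy to check by hand the handful of genes that fall through, instead of having them end up in the non-coding set by default.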