Biomart annotation


jarod_v6@libero.it ▴ 40

@jarod_v6liberoit-6654

Last seen 6.1 years ago

Italy

Hi there!

I need to convert all my Ensembl gene IDs to HGNC symbols and Entrez gene IDs. My Ensembl release is number 72.

I use this script:

    dif.DEs$ensembl <- sapply(strsplit(rownames(dif.DEs),split="nn+"),"[",1)
    # use biomart
    library("biomaRt")
    ensembl = useMart(host = "jun2013.archive.ensembl.org",
                      biomart = "ENSEMBL_MART_ENSEMBL",
                      dataset = "hsapiens_gene_ensembl")
    genemap <- getBM(attributes = c("ensembl_gene_id", "entrezgene", "hgnc_symbol"),
                     filters = "ensembl_gene_id",
                     values = dif.DEs$ensembl,
                     mart = ensembl)
    # match() keeps the first BioMart row found for each Ensembl ID
    idx <- match(dif.DEs$ensembl, genemap$ensembl_gene_id)
    dif.DEs$entrez <- genemap$entrezgene[idx]
    dif.DEs$hgnc_symbol <- genemap$hgnc_symbol[idx]

    dif.DEs$entrez
     [1]  25870  89869  54465   2840     NA  80230  57673 123264     NA     NA
    [11]     NA 392364     NA     NA     NA     NA 221883     NA     NA     NA
    [21]     NA     NA     NA     NA

Many of them come back as NA. How can I annotate all the genes?

Thanks in advance for any help!

annotate convert • 3.2k views

updated 10.7 years ago by John Blischak ▴ 190 • written 10.7 years ago by jarod_v6@libero.it ▴ 40


John Blischak ▴ 190

@john-blischak-6562

Last seen 7.4 years ago

Hi,

I don't think there is a problem. Ensembl includes annotations for some genes that Entrez does not. For example, using the code below I found that RN7SL163P is a pseudogene included in Ensembl (ENSG00000266195) but not in Entrez. If you are not interested in pseudogenes, this should not be an issue for your analysis.

    library("biomaRt")
    ensembl <- useMart(host = "jun2013.archive.ensembl.org",
                       biomart = "ENSEMBL_MART_ENSEMBL",
                       dataset = "hsapiens_gene_ensembl")
    # Retrieve every Ensembl gene ID in the dataset
    ens_id <- getBM(attributes = "ensembl_gene_id", mart = ensembl)
    entrez_id <- getBM(attributes = c("ensembl_gene_id", "entrezgene", "hgnc_symbol"),
                       filters = "ensembl_gene_id",
                       values = ens_id$ensembl_gene_id,
                       mart = ensembl)
    dim(entrez_id)
    # [1] 66211     3
    # Count the rows that have no Entrez gene ID
    sum(is.na(entrez_id$entrezgene))
    # [1] 36545
    head(entrez_id)
    #   ensembl_gene_id entrezgene hgnc_symbol
    # 1 ENSG00000266195         NA   RN7SL163P
    # 2 ENSG00000264715         NA
    # 3 ENSG00000264800  100422895     MIR4294
    # 4 ENSG00000207390         NA
    # 5 ENSG00000206995         NA
    # 6 ENSG00000266431  100847076     MIR5580

John

On Fri, Jul 18, 2014 at 4:27 AM, jarod_v6@libero.it <jarod_v6@libero.it> wrote:
> Many of that are as NA. How can I annotate all the genes?
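An NA in the final column can arise in two different ways: the Ensembl ID was returned by BioMart but has no Entrez ID, or the ID was not returned at all (so `match()` itself gives NA). The following is a minimal offline sketch of how one might tell the two apart. The `genemap` rows are copied from the `head()` output in the reply above; the `query` vector and the ID `ENSG00000000000` are hypothetical and only serve to illustrate the "not returned" case.

```r
# Toy stand-in for a getBM() result; rows taken from the head() output above.
genemap <- data.frame(
  ensembl_gene_id = c("ENSG00000266195", "ENSG00000264715",
                      "ENSG00000264800", "ENSG00000266431"),
  entrezgene      = c(NA, NA, 100422895, 100847076),
  hgnc_symbol     = c("RN7SL163P", "", "MIR4294", "MIR5580"),
  stringsAsFactors = FALSE
)

# Hypothetical query IDs; the last one is deliberately absent from genemap.
query <- c("ENSG00000264800", "ENSG00000266195", "ENSG00000000000")

# match() returns NA for IDs that BioMart did not return at all.
idx <- match(query, genemap$ensembl_gene_id)

# Label each query ID with the reason its Entrez ID is (or is not) missing.
status <- ifelse(is.na(idx), "not returned by BioMart",
          ifelse(is.na(genemap$entrezgene[idx]), "no Entrez ID in Ensembl",
                 "mapped"))

result <- data.frame(ensembl = query,
                     entrez  = genemap$entrezgene[idx],
                     status  = status,
                     stringsAsFactors = FALSE)
print(result)
```

With real data, `table(result$status)` would give a quick count of each case before deciding whether the missing IDs matter for the analysis.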

10.7 years ago John Blischak ▴ 190


Also, you can specify the biotype filter so that you only retrieve protein-coding genes, by changing your query to:

    entrez_id <- getBM(attributes = c("ensembl_gene_id", "entrezgene", "hgnc_symbol"),
                       filters = c("ensembl_gene_id", "biotype"),
                       values = list(ens_id$ensembl_gene_id, "protein_coding"),
                       mart = ensembl)

This will remove most of the NAs for Entrez gene IDs.

    sum(is.na(entrez_id$entrezgene))
    [1] 1432

Best,
Steffen

On Fri, Jul 18, 2014 at 10:00 AM, John Blischak <jdblischak@gmail.com> wrote:
> I don't think there is a problem. Ensembl includes annotations for some
> genes that Entrez does not.
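To see which biotypes account for the missing Entrez IDs before filtering, one option is to cross-tabulate biotype against missingness. The sketch below runs offline on made-up data: the gene IDs, the biotype assignments, and the pairing of Entrez values with them are all invented for illustration. It also assumes the annotation table carries a `gene_biotype` column; whether that attribute name is available on a given mart should be checked with `listAttributes()`.

```r
# Made-up annotation table; in practice this would come from getBM()
# with a biotype attribute requested alongside the IDs (assumption).
annot <- data.frame(
  ensembl_gene_id = paste0("gene", 1:6),
  entrezgene      = c(25870, NA, NA, 2840, NA, NA),
  gene_biotype    = c("protein_coding", "pseudogene", "miRNA",
                      "protein_coding", "protein_coding", "pseudogene"),
  stringsAsFactors = FALSE
)

# Cross-tabulate biotype against whether the Entrez ID is missing.
missing_by_biotype <- table(annot$gene_biotype,
                            is.na(annot$entrezgene),
                            dnn = c("biotype", "entrez_missing"))
print(missing_by_biotype)

# Local equivalent of the biotype filter: keep protein-coding genes only.
coding <- annot[annot$gene_biotype == "protein_coding", ]
print(sum(is.na(coding$entrezgene)))
```

Filtering locally like this keeps the full table available, so the non-coding genes can still be inspected later without a second query.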

10.7 years ago Steffen Durinck ▴ 540
