find overlaping genes in ENSEMBL gene ID list and NCBI gene ID list

0

Entering edit mode

Fong Chun Chan ▴ 20

@fong-chun-chan-4536

Last seen 10.2 years ago

You can use the DAVID Bioinformatics tool (http://david.abcc.ncifcrf.gov/conversion.jsp) to convert your ENSEMBL gene ID to NCBI gene, or vice versa and then use a venn diagram software like http://www.pangloss.com/seidel/Protocols/venn.cgi to find the overlapping genes and non-overlapping genes. FOng

convert convert • 1.2k views

ADD COMMENT • link updated 13.7 years ago by Hervé Pagès 16k • written 13.7 years ago by Fong Chun Chan ▴ 20

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 2 days ago

Seattle, WA, United States

Fernando, I forgot to mention that the code below works fine and is reasonably fast with BioC 2.8 (current devel). However, with BioC 2.7 (current release) it might not run or you might run into some performance issues since some of the methods operating on GRangesList objects (like range()) didn't exist of were too slow. Cheers, H. On 03/08/2011 03:36 PM, Hervé Pagès wrote: > Hi Fernando, > > In addition to Vince's suggestions, here is an approach where you > actually compute the mapping yourself based on the gene locations. > > It uses the GenomicFeatures package to retrieve the transcript and > exon locations for each gene, and then findOverlaps() to find the > overlapping genes. All the information can be downloaded from the > "RefSeq Genes" and "Ensembl Genes" UCSC tracks. The first track is > associated with Entrez Gene IDs (NCBI) and the second track with > Ensembl Gene IDs. > > library(GenomicFeatures) > txdb1 <- makeTranscriptDbFromUCSC("bosTau4", tablename="refGene") > txdb2 <- makeTranscriptDbFromUCSC("bosTau4", tablename="ensGene") > genes1 <- range(transcriptsBy(txdb1, by="gene")) # 11668 Entrez genes > genes2 <- range(transcriptsBy(txdb2, by="gene")) # 25669 Ensembl genes > mm <- findOverlaps(genes1, genes2) > > ## Entrez genes that don't overlap: > unmapped_ezids <- names(genes1)[-unique(queryHits(mm))] # 101 Entrez genes > > ## Ensembl genes that don't overlap: > unmapped_ensids <- names(genes2)[-unique(subjectHits(mm))] # 12876 > Ensembl genes > > ## Mapping between Entrez genes and Ensembl genes: > map <- data.frame(ezid=names(genes1)[queryHits(mm)], > ensid=names(genes2)[subjectHits(mm)]) > > As you can see, it's not a 1-to-1 mapping: > > > head(map) > ezid ensid > 1 100034674 ENSBTAG00000006026 > 2 100036590 ENSBTAG00000038843 > 3 100048947 ENSBTAG00000009091 > 4 100048949 ENSBTAG00000033312 > 5 100101492 ENSBTAG00000032159 > 6 100101492 ENSBTAG00000032234 > > Note that alternatively makeTranscriptDbFromBiomart() could be used to > make 'txdb2' but that means you would first need to make sure that the > current version of Ensembl (Ensembl 61) is also mapping the genes, > transcripts and exons of Bos taurus to bosTau4. > > Hope this helps, > H. > > > On 03/08/2011 11:49 AM, Vincent Carey wrote: >> There are various approaches using Bioconductor. The fundamental >> resource is the package org.Bt.eg.db, which you can acquire using >> biocLite(). >> >> You can find associations between ENSEMBL ids and Entrez ids using >> mappings in that package. >> >> You may find GSEABase useful also. For example >> >>> dput(ensex) >> c("ENSBTAG00000000005", "ENSBTAG00000000008", "ENSBTAG00000000009", >> "ENSBTAG00000000010", "ENSBTAG00000000011", "ENSBTAG00000000012", >> "ENSBTAG00000000013", "ENSBTAG00000000014", "ENSBTAG00000000015", >> "ENSBTAG00000000016", "ENSBTAG00000000018", "ENSBTAG00000000019", >> "ENSBTAG00000000020", "ENSBTAG00000000021", "ENSBTAG00000000022", >> "ENSBTAG00000000023", "ENSBTAG00000000024", "ENSBTAG00000000025", >> "ENSBTAG00000000026", "ENSBTAG00000000027") >>> e1 = GeneSet(ensex, geneIdType=ENSEMBLIdentifier("org.Bt.eg.db")) >>> e1 >> setName: NA >> geneIds: ENSBTAG00000000005, ENSBTAG00000000008, ..., >> ENSBTAG00000000027 (total: 20) >> geneIdType: ENSEMBL (org.Bt.eg.db) >> collectionType: Null >> details: use 'details(object)' >>> g1 = e1 >>> geneIdType(g1) = EntrezIdentifier("org.Bt.eg.db") >>> g1 >> setName: NA >> geneIds: 282136, 539250, ..., 512788 (total: 21) >> geneIdType: EntrezId (org.Bt.eg.db) >> collectionType: Null >> details: use 'details(object)' >> >> This shows that there are 21 Entrez IDs associated with the 20 ENSEMBL >> ids given above. >> After converting sets to common ID type, you can use intersect, >> setdiff methods to answer some of the >> questions you pose. >> >> >>> sessionInfo() >> R version 2.13.0 Under development (unstable) (2011-03-01 r54628) >> Platform: x86_64-apple-darwin10.4.0/x86_64 (64-bit) >> >> locale: >> [1] C >> >> attached base packages: >> [1] stats graphics grDevices datasets tools utils methods >> [8] base >> >> other attached packages: >> [1] GSEABase_1.13.3 graph_1.29.3 annotate_1.29.3 >> [4] org.Bt.eg.db_2.4.6 RSQLite_0.9-4 DBI_0.2-5 >> [7] AnnotationDbi_1.13.15 Biobase_2.11.9 weaver_1.17.0 >> [10] codetools_0.2-8 digest_0.4.2 >> >> loaded via a namespace (and not attached): >> [1] Matrix_0.999375-47 XML_3.2-0 grid_2.13.0 lattice_0.19-17 >> [5] xtable_1.5-6 >> >> >> On Tue, Mar 8, 2011 at 12:13 PM, Biase, Fernando<biase at="" illinois.edu=""> >> wrote: >>> Hi everyone, >>> >>> I have a list of ENSEMBL gene _IDS and a list with NCBI gene_IDs. I >>> need to find which ids correspond to genes in both list (overlapping >>> genes) and each genes are in each one of them but not present in the >>> other list (non-overlapping genes). >>> Can anyone give me some advice on this task? Or indicate a material >>> do read? >>> In case it is relevant, the organism is Bos taurus. >>> >>> Thanks in advance, >>> Fernando >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD COMMENT • link 13.7 years ago Hervé Pagès 16k

Login before adding your answer.