Find overlapping genes in an ENSEMBL gene ID list and an NCBI gene ID list
@biase-fernando-4475
Hi everyone,

I have a list of ENSEMBL gene IDs and a list of NCBI gene IDs. I need to find which IDs correspond to genes present in both lists (overlapping genes) and which genes are present in one list but not in the other (non-overlapping genes).

Can anyone give me some advice on this task, or point me to some material to read? In case it is relevant, the organism is Bos taurus.

Thanks in advance,
Fernando
Organism: Bos taurus • 1.5k views
@vincent-j-carey-jr-4
United States
There are various approaches using Bioconductor. The fundamental resource is the package org.Bt.eg.db, which you can acquire using biocLite().

You can find associations between ENSEMBL IDs and Entrez IDs using mappings in that package.

You may find GSEABase useful also. For example:

    > library(GSEABase)
    > dput(ensex)
    c("ENSBTAG00000000005", "ENSBTAG00000000008", "ENSBTAG00000000009",
    "ENSBTAG00000000010", "ENSBTAG00000000011", "ENSBTAG00000000012",
    "ENSBTAG00000000013", "ENSBTAG00000000014", "ENSBTAG00000000015",
    "ENSBTAG00000000016", "ENSBTAG00000000018", "ENSBTAG00000000019",
    "ENSBTAG00000000020", "ENSBTAG00000000021", "ENSBTAG00000000022",
    "ENSBTAG00000000023", "ENSBTAG00000000024", "ENSBTAG00000000025",
    "ENSBTAG00000000026", "ENSBTAG00000000027")
    > e1 = GeneSet(ensex, geneIdType=ENSEMBLIdentifier("org.Bt.eg.db"))
    > e1
    setName: NA
    geneIds: ENSBTAG00000000005, ENSBTAG00000000008, ..., ENSBTAG00000000027 (total: 20)
    geneIdType: ENSEMBL (org.Bt.eg.db)
    collectionType: Null
    details: use 'details(object)'
    > g1 = e1
    > geneIdType(g1) = EntrezIdentifier("org.Bt.eg.db")
    > g1
    setName: NA
    geneIds: 282136, 539250, ..., 512788 (total: 21)
    geneIdType: EntrezId (org.Bt.eg.db)
    collectionType: Null
    details: use 'details(object)'

This shows that there are 21 Entrez IDs associated with the 20 ENSEMBL IDs given above, so the mapping is not one-to-one. After converting both sets to a common ID type, you can use the intersect and setdiff methods to answer some of the questions you pose.
    > sessionInfo()
    R version 2.13.0 Under development (unstable) (2011-03-01 r54628)
    Platform: x86_64-apple-darwin10.4.0/x86_64 (64-bit)

    locale:
    [1] C

    attached base packages:
    [1] stats     graphics  grDevices datasets  tools     utils     methods
    [8] base

    other attached packages:
     [1] GSEABase_1.13.3       graph_1.29.3          annotate_1.29.3
     [4] org.Bt.eg.db_2.4.6    RSQLite_0.9-4         DBI_0.2-5
     [7] AnnotationDbi_1.13.15 Biobase_2.11.9        weaver_1.17.0
    [10] codetools_0.2-8       digest_0.4.2

    loaded via a namespace (and not attached):
    [1] Matrix_0.999375-47 XML_3.2-0          grid_2.13.0        lattice_0.19-17
    [5] xtable_1.5-6

(Sent in reply to the message from Biase, Fernando of Tue, Mar 8, 2011 at 12:13 PM on the Bioconductor mailing list, https://stat.ethz.ch/mailman/listinfo/bioconductor; archives: http://news.gmane.org/gmane.science.biology.informatics.conductor)
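The final step described above (set operations once both lists use the same ID type) could look roughly like the following base-R sketch. The two vectors are hypothetical toy inputs; the IDs are borrowed from elsewhere in this thread purely for illustration and are not output from org.Bt.eg.db.

```r
# Toy inputs (hypothetical): Entrez IDs obtained by converting the ENSEMBL list,
# and the user's own NCBI (Entrez) gene ID list. IDs are kept as character
# strings so that numeric and character forms are never compared by accident.
ens_as_entrez <- c("100036590", "100048947", "100048949", "100101492")
ncbi_ids      <- c("100048947", "100101492", "100034674")

# Genes present in both lists
in_both <- intersect(ens_as_entrez, ncbi_ids)

# Genes found only in the converted ENSEMBL list
only_ensembl <- setdiff(ens_as_entrez, ncbi_ids)

# Genes found only in the NCBI list
only_ncbi <- setdiff(ncbi_ids, ens_as_entrez)

print(in_both)
print(only_ensembl)
print(only_ncbi)
```

Because one ENSEMBL ID can map to several Entrez IDs (21 from 20 above), it is probably worth keeping the mapping table around so that each overlapping Entrez ID can be traced back to its ENSEMBL ID.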
Hi Fernando,

In addition to Vince's suggestions, here is an approach where you actually compute the mapping yourself based on the gene locations. It uses the GenomicFeatures package to retrieve the transcript and exon locations for each gene, and then findOverlaps() to find the overlapping genes.

All the information can be downloaded from the "RefSeq Genes" and "Ensembl Genes" UCSC tracks. The first track is associated with Entrez Gene IDs (NCBI) and the second track with Ensembl Gene IDs.

    library(GenomicFeatures)
    txdb1 <- makeTranscriptDbFromUCSC("bosTau4", tablename="refGene")
    txdb2 <- makeTranscriptDbFromUCSC("bosTau4", tablename="ensGene")
    genes1 <- range(transcriptsBy(txdb1, by="gene"))  # 11668 Entrez genes
    genes2 <- range(transcriptsBy(txdb2, by="gene"))  # 25669 Ensembl genes
    mm <- findOverlaps(genes1, genes2)

    ## Entrez genes that don't overlap:
    unmapped_ezids <- names(genes1)[-unique(queryHits(mm))]  # 101 Entrez genes

    ## Ensembl genes that don't overlap:
    unmapped_ensids <- names(genes2)[-unique(subjectHits(mm))]  # 12876 Ensembl genes

    ## Mapping between Entrez genes and Ensembl genes:
    map <- data.frame(ezid=names(genes1)[queryHits(mm)],
                      ensid=names(genes2)[subjectHits(mm)])

As you can see, it's not a 1-to-1 mapping:

    > head(map)
           ezid              ensid
    1 100034674 ENSBTAG00000006026
    2 100036590 ENSBTAG00000038843
    3 100048947 ENSBTAG00000009091
    4 100048949 ENSBTAG00000033312
    5 100101492 ENSBTAG00000032159
    6 100101492 ENSBTAG00000032234

Note that alternatively makeTranscriptDbFromBiomart() could be used to make 'txdb2', but that means you would first need to make sure that the current version of Ensembl (Ensembl 61) is also mapping the genes, transcripts and exons of Bos taurus to bosTau4.

Hope this helps,
H.

(Sent in reply to the message from Vincent Carey of 03/08/2011 11:49 AM.)

--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
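To make the many-to-many nature of such a location-based mapping easier to inspect, here is a small base-R sketch. It rebuilds by hand the six rows of `map` printed in this post; `all_ezids` and the ID "999999999" are made-up stand-ins (for `names(genes1)` and for a gene without any overlap), not real results.

```r
# The six rows of 'map' shown in this post, rebuilt by hand for illustration.
map <- data.frame(
  ezid  = c("100034674", "100036590", "100048947",
            "100048949", "100101492", "100101492"),
  ensid = c("ENSBTAG00000006026", "ENSBTAG00000038843", "ENSBTAG00000009091",
            "ENSBTAG00000033312", "ENSBTAG00000032159", "ENSBTAG00000032234"),
  stringsAsFactors = FALSE
)

# Number of Ensembl genes overlapped by each Entrez gene
n_ens_per_ez <- table(map$ezid)

# Entrez genes that overlap more than one Ensembl gene
multi_ez <- names(n_ens_per_ez)[as.vector(n_ens_per_ez) > 1]

# Hypothetical full set of Entrez IDs (stands in for names(genes1));
# "999999999" is a made-up ID with no overlap.
all_ezids <- c(unique(map$ezid), "999999999")

# setdiff() keeps working when there are no hits at all, whereas negative
# indexing with an empty index, x[-integer(0)], selects nothing.
unmapped_ezids <- setdiff(all_ezids, map$ezid)

print(multi_ez)
print(unmapped_ezids)
```

The setdiff() form is only a suggestion for the edge case of zero overlaps; with the hit counts reported in this post, the negative-index version gives the same answer.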
Marc Carlson ★ 7.2k
@marc-carlson-2264
United States
Hi Fernando,

I see that the list has already provided a lot of very clever suggestions. I would like to add one that is slightly less exciting but which I hope might still be of some use in the event that you just want to do something simple.

If your set of Ensembl IDs was something like this:

    library(org.Bt.eg.db)
    ensIds <- c("ENSBTAG00000038843", "ENSBTAG00000009091",
                "ENSBTAG00000033312", "wrongID")

then we could simply use the appropriate mapping to get them back as a list, like so:

    mget(ensIds, org.Bt.egENSEMBL2EG, ifnotfound=NA)

Or you might find it easier to work with data.frames, in which case you could do it more like this:

    toTable(org.Bt.egENSEMBL2EG[ Rkeys(org.Bt.egENSEMBL2EG) %in% ensIds ])

Once you have converted all your lists to the same kind of IDs, you can use something like %in% (similar to how I used it above) to quickly see which things overlap.

Please let us know if you still have questions.

Hope this helps,
Marc

(Sent in reply to the message from Biase, Fernando of 03/08/2011 09:13 AM.)
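As a rough illustration of the last step, the sketch below flattens a list shaped like the one mget(..., ifnotfound=NA) returns and then checks membership with %in%. It uses base R only; `eg_list` and `ncbi_ids` are hypothetical stand-ins, and the Entrez values are illustrative pairs taken from elsewhere in this thread rather than actual org.Bt.eg.db output.

```r
# Hypothetical stand-in for the list returned by mget(..., ifnotfound = NA):
# one element per ENSEMBL ID, holding its Entrez ID(s), or NA if not found.
eg_list <- list(
  ENSBTAG00000038843 = "100036590",
  ENSBTAG00000009091 = "100048947",
  ENSBTAG00000033312 = "100048949",
  wrongID            = NA
)

# Flatten to a two-column table, repeating each ENSEMBL ID once per Entrez ID
ens2eg <- data.frame(
  ensid = rep(names(eg_list), lengths(eg_list)),
  ezid  = as.character(unlist(eg_list, use.names = FALSE)),
  stringsAsFactors = FALSE
)

# Hypothetical NCBI (Entrez) gene ID list to compare against
ncbi_ids <- c("100048947", "100048949", "100034674")

# Which converted ENSEMBL genes also appear in the NCBI list?
ens2eg$in_ncbi_list <- ens2eg$ezid %in% ncbi_ids

# ENSEMBL IDs that could not be mapped at all
unmapped_ens <- ens2eg$ensid[is.na(ens2eg$ezid)]

print(ens2eg)
print(unmapped_ens)
```

Unmapped IDs come out as NA in the `ezid` column and are reported as not overlapping, so it may be worth listing them separately (as `unmapped_ens` does here) before drawing conclusions about the non-overlapping genes.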