How to map HuEx-1_0-st-v2 transcript_cluster_ids to gene symbols
@stephane-plaisance-vib-6362
Last seen 5.4 years ago
I am looking for a file or package that can map my transcript_cluster_id values to HGNC symbols (human). My data was preprocessed using APT and I do not have the original probe IDs.

Apparently, only 'probeIDs' are supported by biomaRt, not 'transcript_cluster_id'. I tried hugene10sttranscriptcluster.db, but the IDs are different (probeIDs).

Where can I find a table listing 'transcript_cluster_id' and enough info to get symbols?

Thanks in advance,
Stephane
biomaRt biomaRt • 3.2k views
@vincent-j-carey-jr-4
Last seen 3 months ago
United States
I am no expert on the vocabulary here, but this shows two ways of working with the ChipDb package. First we use some SQL to figure out what is in there:

> library(hugene10sttranscriptcluster.db)
> con = hugene10sttranscriptcluster_dbconn()  # do dbListTables(con) too
> dbGetQuery(con, "select * from probes limit 5")
  probe_id gene_id is_multiple
1  7892501    <NA>           0
2  7892502    <NA>           0
3  7892503    <NA>           0
4  7892504    <NA>           0
5  7892505    <NA>           0
> peek = dbGetQuery(con, "select * from probes where gene_id is not null limit 5")
> peek
  probe_id   gene_id is_multiple
1  7896740     81099           0
2  7896742    346288           0
3  7896744     81399           0
4  7896754 100287934           0
5  7896756    157693           0

So now we know that there is some attribute 'probe_id' that maps to something called gene_id. Now let's use the select interface:

> somekeys = peek[,1]
> select(hugene10sttranscriptcluster.db, keytype="PROBEID", keys=somekeys, columns=c("SYMBOL", "ENTREZID"))
  PROBEID       SYMBOL  ENTREZID
1 7896740       OR4F17     81099
2 7896742       SEPT14    346288
3 7896744       OR4F16     81399
4 7896754 LOC100287934 100287934
5 7896756       FAM87A    157693

> sessionInfo()
R Under development (unstable) (2013-12-01 r64371)
Platform: x86_64-apple-darwin10.8.0/x86_64 (64-bit)

locale:
[1] en_US.US-ASCII/en_US.US-ASCII/en_US.US-ASCII/C/en_US.US-ASCII/en_US.US-ASCII

attached base packages:
[1] parallel  stats     graphics  grDevices datasets  utils     tools
[8] methods   base

other attached packages:
 [1] hugene10sttranscriptcluster.db_8.0.1 org.Hs.eg.db_2.10.1
 [3] RSQLite_0.11.4                       DBI_0.2-7
 [5] AnnotationDbi_1.25.9                 Biobase_2.23.3
 [7] BiocGenerics_0.9.3                   BiocInstaller_1.13.3
 [9] weaver_1.29.1                        codetools_0.2-8
[11] digest_0.6.4

loaded via a namespace (and not attached):
[1] IRanges_1.21.21 stats4_3.1.0
For the record, my reply goes down the wrong path, to Gene 1.0 ST annotation.

It seems likely that the solution lies in the AffyCompatible package, and that resource should be mentioned in http://www.bioconductor.org/help/workflows/arrays/
After you obtain credentials from NetAffx, you can do something like the following ([user] and [pass] are placeholders for your own credentials):

library(AffyCompatible)
rsrc = NetAffxResource(user=[user], password=[pass])
lk = grep("Ex", names(rsrc), value=TRUE)
names(rsrc[[lk[3]]])
fl = readAnnotation(rsrc, anno=rsrc[["HuEx-1_0-st-v2", "Transcript Cluster Annotations, CSV"]], content=FALSE)
con = unz(fl, "HuEx-1_0-st-v2.na33.1.hg19.transcript.csv")
smal = read.csv(con, skip=24, nrows=20, stringsAsFactors=FALSE)  # drop nrows for the full import

> smal[,1:7]
   transcript_cluster_id probeset_id seqname strand  start   stop total_probes
1                2315100     2315100    chr1      +  11884  14409           20
2                2315106     2315106    chr1      +  14760  15198            8
3                2315109     2315109    chr1      +  19408  19712            4
4                2315113     2315113    chr1      +  27563  27813            4
5                2315115     2315115    chr1      +  28425  29158            4
6                2315117     2315117    chr1      +  30623  30667            4
7                2315119     2315119    chr1      +  35690  35804            4
8                2315121     2315121    chr1      +  35928  35968            2
9                2315125     2315125    chr1      +  51913  70633           26
10               2315129     2315129    chr1      +  52452  63887           20
11               2315145     2315145    chr1      + 115060 116054            4
12               2315147     2315147    chr1      + 131158 146731           44
13               2315160     2315160    chr1      + 136242 136557            8
14               2315163     2315163    chr1      + 138440 662350           85
15               2315189     2315189    chr1      + 140341 140737            4
16               2315191     2315191    chr1      + 141298 320283           15
17               2315198     2315198    chr1      + 148041 682717           20
18               2315206     2315206    chr1      + 158150 164721            8
19               2315209     2315209    chr1      + 236570 238386            8
20               2315212     2315212    chr1      + 237097 237462            4

> substr(smal[,8], 1, 60)
 [1] "NR_046018 // DDX11L1 // DEAD/H (Asp-Glu-Ala-Asp/His) box hel"
 [2] "AK092583 // WASH7P // WAS protein family homolog 7 pseudogen"
 [3] "NR_033266 // WASH5P // WAS protein family homolog 5 pseudoge"
 [4] "---"
 [5] "ENST00000422679 // MIR1302-11 // microRNA 1302-11 // --- // "
 [6] "ENST00000473358 // MIR1302-11 // microRNA 1302-11 // --- // "
 [7] "NR_026818 // FAM138A // family with sequence similarity 138,"
 [8] "ENST00000448235 // FAM138F // family with sequence similarit"
 [9] "NM_001005240 // OR4F17 // olfactory receptor, family 4, subf"
[10] "ENST00000492842 // OR4G11P // olfactory receptor, family 4, "
[11] "---"
[12] "ENST00000432723 // CICP7 // capicua homolog (Drosophila) pse"
[13] "AK303004 // FLJ45445 // uncharacterized LOC399844 // 19p13.3"
[14] "AK302487 // FLJ45340 // uncharacterized LOC402483 // 7q32.1 "
[15] "NR_039983 // LOC729737 // uncharacterized LOC729737 // 1p36."
[16] "ENST00000570230 // LOC100653348 // uncharacterized LOC100653"
[17] "NR_039983 // LOC729737 // uncharacterized LOC729737 // 1p36."
[18] "NR_029401 // LOC731275 // uncharacterized LOC731275 // 1q43 "
[19] "ENST00000424587 // LOC100508047 // uncharacterized LOC100508"
[20] "NR_029406 // FLJ43681 // ribosomal protein L23a pseudogene /"

So it appears you have to parse the 8th column to get the symbols; in each entry the symbol is the second field when the string is split on " // ".
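The parsing of the 8th column could look roughly like the sketch below. It uses only base R and assumes that fields within one assignment are separated by " // ", that several assignments for one cluster are separated by " /// ", and that "---" means no assignment. The helper name `extract_symbols` and the last sample string are made up for illustration; the other sample strings are the truncated ones printed in this thread.

```r
# Hypothetical helper (not part of any package): pull gene symbols out of
# NetAffx-style gene assignment strings.
extract_symbols <- function(x) {
  # Assumption: multiple assignments per cluster are separated by " /// ".
  per_cluster <- strsplit(x, " /// ", fixed = TRUE)
  out <- vapply(per_cluster, function(assignments) {
    # Assumption: fields within one assignment are separated by " // ",
    # with the gene symbol in the second field.
    fields <- strsplit(assignments, " // ", fixed = TRUE)
    syms <- vapply(fields, function(f) {
      if (length(f) >= 2) trimws(f[2]) else NA_character_
    }, character(1))
    syms <- unique(syms[!is.na(syms) & syms != "---"])
    if (length(syms) == 0) NA_character_ else paste(syms, collapse = ";")
  }, character(1))
  unname(out)
}

gene_assignment <- c(
  "NR_046018 // DDX11L1 // DEAD/H (Asp-Glu-Ala-Asp/His) box hel",
  "---",
  "ENST00000422679 // MIR1302-11 // microRNA 1302-11 // --- // ",
  # Made-up string showing two assignments for one cluster:
  "ACC_A // GENE1 // desc // 1p1 // 1 /// ACC_B // GENE2 // desc // 1p1 // 2"
)

symbols <- extract_symbols(gene_assignment)
print(symbols)
```

With the full table imported, something like `smal$symbol <- extract_symbols(smal[, 8])` would add a symbol column next to transcript_cluster_id. Clusters that map to several genes get their symbols joined with ";", which is one possible convention, not the only reasonable one.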
Hi there,

What you need is huex10sttranscriptcluster.db for the HuEx 1.0 arrays. I'm not sure if James has made it available for download from the BioC site yet.

Previously, what I did was to use biomaRt to fetch the probe IDs, then look them up in the Affymetrix-supplied annotation files (which I got from the NetAffx website). That's another possibility.

Jeremy
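If huex10sttranscriptcluster.db is installed, the lookup might go through the same select interface shown earlier in the thread. The sketch below is untested against that package and rests on two assumptions: that the annotation object has the same name as the package, and that transcript cluster IDs are stored under the PROBEID key type, as in the hugene package. The helper name `map_clusters_to_symbols` is made up, and the example IDs are taken from the NetAffx table above.

```r
# Hypothetical helper (not from the thread): map transcript cluster IDs to
# gene symbols via a ChipDb package, returning NULL if it is not installed.
map_clusters_to_symbols <- function(ids, pkg = "huex10sttranscriptcluster.db") {
  if (!requireNamespace(pkg, quietly = TRUE) ||
      !requireNamespace("AnnotationDbi", quietly = TRUE)) {
    message(pkg, " is not installed; returning NULL")
    return(NULL)
  }
  # Assumption: the annotation object is exported under the package's name.
  db <- getExportedValue(pkg, pkg)
  ids <- as.character(ids)
  # Assumption: transcript cluster IDs live under the PROBEID key type.
  # Unknown IDs are dropped first so that select() is only given valid keys.
  known <- intersect(ids, AnnotationDbi::keys(db, keytype = "PROBEID"))
  if (length(known) == 0L) {
    return(data.frame(PROBEID = character(0), SYMBOL = character(0),
                      stringsAsFactors = FALSE))
  }
  AnnotationDbi::select(db, keys = known, keytype = "PROBEID",
                        columns = "SYMBOL")
}

res <- map_clusters_to_symbols(c("2315100", "2315106", "2315125"))
```

Filtering against `keys()` first means IDs missing from the package are silently dropped, so it is worth comparing the number of rows returned with the number of IDs supplied.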