Question

How to go from affymetrix to Ensembl transcript IDs

peter robinson ▴ 300

Hi all,

sorry if this is a dumb question, but rtfm has not helped so far.

I would like to get the Ensembl transcript IDs that correspond to Affymetrix probeset IDs using biomaRt. As a test case, I am using the ALL data set from Bioconductor. My code:

    library("biomaRt")
    library("ALL")
    data("ALL")  ## Note this dataset uses hgu95av2 Affymetrix chip

    dat <- exprs(ALL)
    affyids = rownames(dat)

    ## get mapping data from Ensembl via biomaRt
    ensembl <- useMart("ensembl")
    ensembl = useDataset("hsapiens_gene_ensembl", mart=ensembl)

    mapping <- getBM(attributes = c("affy_hg_u95av2", "ensembl_transcript_id"),
                     filters = "affy_hg_u95av2",
                     values = affyids, mart = ensembl)

Here is where the problem is. The "mapping" seems to be a random collection of transcript IDs.

    > which(mapping=="32337_at")
     [1]     8    46   139   155   203   267   320   327  7385  8701 18769 20533
    [13] 23728 23969 23972 24241 24242 24243 24244 25236 26157 26204 26218 26231
    [25] 26240 26321 26404

    > mapping[which(mapping=="32337_at"),]
          affy_hg_u95av2 ensembl_transcript_id
    8           32337_at       ENST00000404812
    46          32337_at       ENST00000393574
    139         32337_at       ENST00000403842
    155         32337_at       ENST00000397467
    203         32337_at       ENST00000407990
    267         32337_at       ENST00000399007
    320         32337_at       ENST00000404500
    327         32337_at       ENST00000399891
    7385        32337_at       ENST00000396599
    8701        32337_at       ENST00000403916
    18769       32337_at       ENST00000334328
    20533       32337_at       ENST00000377603
    23728       32337_at       ENST00000401418
    23969       32337_at       ENST00000046640
    23972       32337_at       ENST00000381870
    24241       32337_at       ENST00000326092
    24242       32337_at       ENST00000319826
    24243       32337_at       ENST00000272274
    24244       32337_at       ENST00000311549
    25236       32337_at       ENST00000404512
    26157       32337_at       ENST00000404609
    26204       32337_at       ENST00000402713
    26218       32337_at       ENST00000401464
    26231       32337_at       ENST00000407389
    26240       32337_at       ENST00000406161
    26321       32337_at       ENST00000402658
    26404       32337_at       ENST00000401595

At the end of the day, I would like to write the data matrix as a CSV file for further analysis, whereby the affy ID is replaced by an Ensembl ID.

Thanks
Peter

hgu95av2 affy biomaRt

score 0 · Answer 1 · 2009-04-09

On Thu, Apr 9, 2009 at 5:40 PM, Peter Robinson wrote:

> Here is where the problem is. The "mapping" seems to be a random collection
> of transcript IDs.
>
> At the end of the day, I would like to write the data matrix as a CSV file
> for further analysis, whereby the affy ID is replaced by an Ensembl ID.

Hi, Peter.

Ensembl does their own mapping of Affy probes, and the above is an example of what can happen: a probeset can map to multiple transcripts. In fact, there is no reason to think that a probeset should, in general, map to only one transcript. All that said, I think you have used biomaRt correctly and are faithfully reproducing the results available from Ensembl.

If you want an alternative based more closely on what Affy supplies, try the following. Note that this table gives Ensembl gene IDs (ENSG...), not transcript IDs:

    library(hgu95av2.db)
    dat = toTable(hgu95av2ENSEMBL)
    dat[dat[,1]=="32337_at",]
         probe_id      ensembl_id
    5562 32337_at ENSG00000122026
    dim(dat)
    [1] 12316     2

Hope that helps,
Sean

    sessionInfo()
    R version 2.9.0 Under development (unstable) (2009-02-21 r47969)
    x86_64-unknown-linux-gnu

    locale:
    LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] hgu95av2.db_2.2.11   RSQLite_0.7-1        DBI_0.2-4
    [4] AnnotationDbi_1.5.23 Biobase_2.3.11
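To see how widespread the one-to-many mapping is, it may help to count the distinct transcripts per probeset. The following is only a minimal base-R sketch: the `mapping` data frame is a toy stand-in for the `getBM()` result, the three `32337_at` rows are copied from the output in the question, and `probe_A`, `probe_B` and their `ENST_*` IDs are hypothetical placeholders.

```r
# Toy stand-in for the data frame returned by getBM(). Only the 32337_at rows
# come from the thread; probe_A, probe_B and the ENST_* IDs are hypothetical.
mapping <- data.frame(
  affy_hg_u95av2 = c("32337_at", "probe_A", "32337_at",
                     "probe_B", "32337_at", "probe_B"),
  ensembl_transcript_id = c("ENST00000404812", "ENST_A1", "ENST00000393574",
                            "ENST_B1", "ENST00000403842", "ENST_B2"),
  stringsAsFactors = FALSE
)

# Sort by probeset so that all rows for one probeset sit together; the rows
# in the original output are scattered, not random.
mapping <- mapping[order(mapping$affy_hg_u95av2,
                         mapping$ensembl_transcript_id), ]

# Count the distinct transcripts for each probeset (named integer vector).
n_tx <- vapply(split(mapping$ensembl_transcript_id, mapping$affy_hg_u95av2),
               function(x) length(unique(x)), integer(1))
print(n_tx)

# Probesets that map to more than one transcript.
multi <- names(n_tx)[n_tx > 1]
print(multi)
```

Run against the real `mapping` object, `table(n_tx)` would give a quick overview of how many probesets have one, two, or more transcripts.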

score 0 · Answer 2 · 2009-04-09

Hi Peter,

On Apr 9, 2009, at 5:40 PM, Peter Robinson wrote:

> Here is where the problem is. The "mapping" seems to be a random
> collection of transcript IDs.

Your query is right, so your results are not random. You can double-check by trying the small example in the ?getBM help.

Anyway, that probe looks like a weird one. Even Affy maps it to several locations. See:

https://www.affymetrix.com/analysis/netaffx/fullrecord.affx?pk=HG-U95AV2%3A32337_AT#a_ensembl

You will need an Affy NetAffx account to see that. Some relevant stats from that page are that the probe maps to 6 different Ensembl IDs. It even aligns to two different places:

    chr13:26725913-26728689(+)
    chr10:122104175-122104685(-)

You'll probably find this for many probes, so you'll need some policy to deal with that.

Hope that helps,
-steve

--
Steve Lianoglou
Graduate Student: Physiology, Biophysics and Systems Biology
Weill Medical College of Cornell University
http://cbio.mskcc.org/~lianos
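Two possible policies for multi-mapping probesets, followed by the CSV export asked for in the question, could look roughly like the sketch below. It is an illustration under stated assumptions, not a recommendation: `dat` and `mapping` are small toy objects shaped like `exprs(ALL)` and the `getBM()` result, and every ID other than `32337_at` and its two transcripts is hypothetical.

```r
# Hypothetical toy data: a 3 x 2 expression matrix and a probeset-to-transcript
# table. probe_A, probe_C and ENST_A1 are made up for illustration.
dat <- matrix(c(7.1, 5.3, 9.8, 7.4, 5.0, 9.9), nrow = 3,
              dimnames = list(c("32337_at", "probe_A", "probe_C"),
                              c("sample1", "sample2")))
mapping <- data.frame(
  affy_hg_u95av2 = c("32337_at", "32337_at", "probe_A"),
  ensembl_transcript_id = c("ENST00000404812", "ENST00000393574", "ENST_A1"),
  stringsAsFactors = FALSE
)

# Group transcript IDs by probeset.
tx_by_probe <- split(mapping$ensembl_transcript_id, mapping$affy_hg_u95av2)

# Policy 1: keep only probesets that map to exactly one transcript.
unique_probes <- names(tx_by_probe)[lengths(tx_by_probe) == 1]
dat_unique <- dat[rownames(dat) %in% unique_probes, , drop = FALSE]
rownames(dat_unique) <- unlist(tx_by_probe[rownames(dat_unique)],
                               use.names = FALSE)

# Policy 2: keep every mapped probeset and join its transcript IDs with ";".
collapsed <- vapply(tx_by_probe, paste, character(1), collapse = ";")
dat_all <- dat[rownames(dat) %in% names(collapsed), , drop = FALSE]
rownames(dat_all) <- unname(collapsed[rownames(dat_all)])

# Write the relabelled matrix as CSV (a temporary file in this sketch).
out_file <- tempfile(fileext = ".csv")
write.csv(dat_unique, file = out_file)
```

Policy 1 discards ambiguous probesets and unmapped ones (such as `probe_C` here), while policy 2 keeps the ambiguity visible in the row label; which is appropriate depends on the downstream analysis.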