Hi all,
Sorry if this is a dumb question, but RTFM has not helped so far.
I would like to get the Ensembl transcript IDs that correspond to
Affymetrix probeset IDs using biomaRt. As a test case, I am using the
ALL data set from Bioconductor. My code:
library("biomaRt")
library("ALL")
data("ALL")  ## Note: this dataset uses the hgu95av2 Affymetrix chip
dat <- exprs(ALL)
affyids <- rownames(dat)
## get mapping data from Ensembl via biomaRt
ensembl <- useMart("ensembl")
ensembl <- useDataset("hsapiens_gene_ensembl", mart = ensembl)
mapping <- getBM(attributes = c("affy_hg_u95av2", "ensembl_transcript_id"),
                 filters = "affy_hg_u95av2",
                 values = affyids, mart = ensembl)
Here is where the problem is. The "mapping" seems to be a random
collection of transcript IDs.
> which(mapping=="32337_at")
 [1]     8    46   139   155   203   267   320   327  7385  8701 18769 20533
[13] 23728 23969 23972 24241 24242 24243 24244 25236 26157 26204 26218 26231
[25] 26240 26321 26404
> mapping[which(mapping=="32337_at"),]
affy_hg_u95av2 ensembl_transcript_id
8 32337_at ENST00000404812
46 32337_at ENST00000393574
139 32337_at ENST00000403842
155 32337_at ENST00000397467
203 32337_at ENST00000407990
267 32337_at ENST00000399007
320 32337_at ENST00000404500
327 32337_at ENST00000399891
7385 32337_at ENST00000396599
8701 32337_at ENST00000403916
18769 32337_at ENST00000334328
20533 32337_at ENST00000377603
23728 32337_at ENST00000401418
23969 32337_at ENST00000046640
23972 32337_at ENST00000381870
24241 32337_at ENST00000326092
24242 32337_at ENST00000319826
24243 32337_at ENST00000272274
24244 32337_at ENST00000311549
25236 32337_at ENST00000404512
26157 32337_at ENST00000404609
26204 32337_at ENST00000402713
26218 32337_at ENST00000401464
26231 32337_at ENST00000407389
26240 32337_at ENST00000406161
26321 32337_at ENST00000402658
26404 32337_at ENST00000401595
At the end of the day, I would like to write the data matrix as a CSV
file for further analysis, whereby the affy ID is replaced by an
Ensembl ID.
Thanks,
Peter
On Thu, Apr 9, 2009 at 5:40 PM, Peter Robinson
<peter.robinson@t-online.de> wrote:
> Here is where the problem is. The "mapping" seems to be a random
> collection of transcript IDs.
Hi, Peter.
Ensembl does its own mapping of Affymetrix probes, and the above is an
example of what can happen: a probeset can map to multiple transcripts.
In fact, there is no reason to think that a probeset should, in general,
map to only one transcript. All that said, I think you have used biomaRt
correctly and are faithfully reproducing the results available from
Ensembl.
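To get a feel for how common this is, one could count the transcripts per probeset. This is only a minimal sketch: it assumes `mapping` has the two columns returned by the getBM() call above, and it uses a small stand-in data frame so it runs without a network connection. The first three rows are copied from the output above; the `1000_at` row and its `ENST_PLACEHOLDER` ID are made up for illustration.

```r
## Stand-in for the getBM() result. The first three rows are copied from
## the output above; the last row is a made-up placeholder.
mapping <- data.frame(
  affy_hg_u95av2 = c("32337_at", "32337_at", "32337_at", "1000_at"),
  ensembl_transcript_id = c("ENST00000404812", "ENST00000393574",
                            "ENST00000403842", "ENST_PLACEHOLDER"),
  stringsAsFactors = FALSE
)

## Number of distinct transcripts per probeset
n_tx <- tapply(mapping$ensembl_transcript_id, mapping$affy_hg_u95av2,
               function(x) length(unique(x)))

## Probesets that hit more than one transcript
multi <- names(n_tx)[n_tx > 1]
print(n_tx)
print(multi)
```

On the real `mapping`, `summary(as.vector(n_tx))` would give a quick overview of how many probesets are affected.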
If you want an alternative based more closely on what Affymetrix
supplies, try the following:
> library(hgu95av2.db)
> dat = toTable(hgu95av2ENSEMBL)
> dat[dat[,1]=="32337_at",]
     probe_id      ensembl_id
5562 32337_at ENSG00000122026
> dim(dat)
[1] 12316     2
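A table like this can be attached to the expression matrix with match(). The following is a rough sketch under a few assumptions: `gene_map` stands in for the toTable() result (named differently from `dat` so it does not overwrite the expression matrix from the original post), `expr` is a toy matrix standing in for exprs(ALL), and every ID and value other than the `32337_at` row shown above is a made-up placeholder.

```r
## Stand-in for toTable(hgu95av2ENSEMBL). The first row is from the output
## above; the second is a made-up placeholder.
gene_map <- data.frame(
  probe_id   = c("32337_at", "1000_at"),
  ensembl_id = c("ENSG00000122026", "ENSG_PLACEHOLDER"),
  stringsAsFactors = FALSE
)

## Toy expression matrix standing in for exprs(ALL); values are arbitrary
expr <- matrix(c(7.1, 5.2, 6.3, 7.4, 5.5, 6.6), nrow = 3,
               dimnames = list(c("1000_at", "32337_at", "AFFX-hypothetical_at"),
                               c("sample1", "sample2")))

## Look up the gene ID for each row; probesets without a mapping get NA.
## match() returns only the first hit, so a probeset listed more than
## once in gene_map would silently get its first ID.
gene_id <- gene_map$ensembl_id[match(rownames(expr), gene_map$probe_id)]

annotated <- data.frame(probe_id = rownames(expr), ensembl_id = gene_id,
                        expr, row.names = NULL, stringsAsFactors = FALSE)
print(annotated)
```

Keeping the probeset ID as its own column, rather than overwriting the row names, leaves the original identifiers available for later checks.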
Hope that helps,
Sean
sessionInfo()
R version 2.9.0 Under development (unstable) (2009-02-21 r47969)
x86_64-unknown-linux-gnu
locale:
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;
LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;
LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] hgu95av2.db_2.2.11 RSQLite_0.7-1 DBI_0.2-4
[4] AnnotationDbi_1.5.23 Biobase_2.3.11
Hi Peter,
On Apr 9, 2009, at 5:40 PM, Peter Robinson wrote:
> Here is where the problem is. The "mapping" seems to be a random
> collection of transcript IDs.
Your query is right, so ... your results are not random. You can
double check by trying the small example in the ?getBM help.
Anyway, that probe looks like a weird one. Even Affymetrix maps it to
several locations. See:
https://www.affymetrix.com/analysis/netaffx/fullrecord.affx?pk=HG-U95AV2%3A32337_AT#a_ensembl
You will need an Affymetrix NetAffx account to see that. One relevant
detail from that page is that the probe maps to 6 different Ensembl IDs.
It even aligns to two different places:
chr13:26725913-26728689(+)
chr10:122104175-122104685(-)
You'll probably find this for many probes, so you'll need some policy
to deal with that.
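One possible policy, sketched here only as an illustration, is to keep the probesets that map to exactly one Ensembl ID, rename the matrix rows, and write the result as a CSV file. The data below are toy stand-ins: the `32337_at` rows come from the output in the original post, while `1000_at`, `ENST_PLACEHOLDER` and the expression values are made up. The column names `probe_id` and `ensembl_id` are also just assumptions for this sketch.

```r
## Toy stand-ins: a probeset-to-Ensembl table and an expression matrix
mapping <- data.frame(
  probe_id   = c("32337_at", "32337_at", "1000_at"),
  ensembl_id = c("ENST00000404812", "ENST00000393574", "ENST_PLACEHOLDER"),
  stringsAsFactors = FALSE
)
expr <- matrix(1:4, nrow = 2,
               dimnames = list(c("1000_at", "32337_at"),
                               c("sample1", "sample2")))

## Policy: keep only probesets with exactly one Ensembl ID
mapping <- unique(mapping)
counts <- table(mapping$probe_id)
unique_probes <- names(counts)[counts == 1]
keep <- mapping[mapping$probe_id %in% unique_probes, ]

## Restrict to probesets that are actually rows of the matrix
keep <- keep[keep$probe_id %in% rownames(expr), ]

## Subset the matrix and swap the row names to Ensembl IDs
out <- expr[keep$probe_id, , drop = FALSE]
rownames(out) <- keep$ensembl_id

## Write to a temporary file; replace with a real path as needed
csv_file <- tempfile(fileext = ".csv")
write.csv(out, file = csv_file)
```

Note that this drops ambiguous probesets entirely, and that two different probesets can still end up with the same Ensembl ID, so the row names of the output are not guaranteed to be unique.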
Hope that helps,
-steve
--
Steve Lianoglou
Graduate Student: Physiology, Biophysics and Systems Biology
Weill Medical College of Cornell University
http://cbio.mskcc.org/~lianos