Hi,
I've been working with Affy files for a while. Until now I annotated my raw Affy files directly with data from NetAffx (merging expression data with annotation data by probe_id).
I recently switched to a more straightforward method using the annotate and hgu133plus2.db packages (the latter matches my array). However, some probesets that are associated with genes in NetAffx (both online and in the downloaded database) and in hgu133plus2.db come back without gene names.
For instance, I can use two methods to get gene names:
    > biocLite("hgu133plus2.db")
    > biocLite("annotate")
    > library(hgu133plus2.db)
    > library(annotate)
    > r = rownames(df_rma)
    > head(r)
    [1] "1053_at" "117_at" "121_at" "1255_g_at" "1316_at" "1320_at"
    > symb_ID = getSYMBOL(r, "hgu133plus2.db")
    > head(symb_ID)
    1053_at 117_at 121_at 1255_g_at 1316_at 1320_at
    [1] "RFC2" "HSPA6" "PAX8" "GUCA1A" "THRA" "PTPN21"
    > table(is.na(symb_ID))
    FALSE
    42358
    > eligibles = hgu133plus2SYMBOL[r]
    > annots = toTable(eligibles)
    > table(is.na(annots$symbol))
    FALSE
    42358
This looks OK (we start from 54675 rownames, so 12317 probesets aren't annotated). But when I look for a gene of interest (for instance BBC3), or for the probeset that Affy gives for it, I find nothing:
    > grep("BBC", annots$symbol)
    integer(0)
    > grep("211692_s_at", annots$probe_id)
    integer(0)
So this gene is missing from the result. Yet it is correctly annotated in hgu133plus2.db:
    > grep("211692_s_at", keys(hgu133plus2.db))
    [1] 21014
    > grep("BBC3", keys(hgu133plus2.db, keytype="SYMBOL"))
    [1] 10487
    > s = select(hgu133plus2.db, keys="211692_s_at", columns="SYMBOL")
    'select()' returned 1:many mapping between keys and columns
    > s
          PROBEID  SYMBOL
    1 211692_s_at    BBC3
    2 211692_s_at MIR3191
    3 211692_s_at MIR3190
So, is it that probesets that match several transcripts are "discarded"? Marc Carlson's very good documentation doesn't state this explicitly.
Thank you!
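To make the one-to-many suspicion concrete, here is a minimal base-R sketch on a toy table. The rows are copied from the select() and getSYMBOL() outputs above; the names `s_toy`, `n_symbols`, `multi_probes` and `collapsed` are made up for illustration, and on real data the table would presumably come from select() over all probesets rather than being typed by hand.

```r
# Toy probe-to-symbol table (hypothetical name s_toy).
# The 211692_s_at rows are copied from the select() output above,
# the 1053_at row from the getSYMBOL() output.
s_toy <- data.frame(
  PROBEID = c("1053_at", "211692_s_at", "211692_s_at", "211692_s_at"),
  SYMBOL  = c("RFC2", "BBC3", "MIR3191", "MIR3190"),
  stringsAsFactors = FALSE
)

# Count the distinct symbols attached to each probeset.
n_symbols <- tapply(s_toy$SYMBOL, s_toy$PROBEID,
                    function(x) length(unique(x)))

# Probesets that map to more than one symbol.
multi_probes <- names(n_symbols)[n_symbols > 1]

# One row per probeset, with all of its symbols pasted together,
# so that multi-mapping probesets are kept instead of lost.
collapsed <- aggregate(SYMBOL ~ PROBEID, data = s_toy,
                       FUN = paste, collapse = ";")

print(multi_probes)
print(collapsed)
```

If the probesets flagged this way on the full table turn out to be the ones missing from `annots`, that would support the idea that multi-mapping probesets are the ones being dropped.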
Thanks, was unaware of the mapIds method.