You can annotate your ExpressionSet using annotateEset in the affycoretools package, which explicitly allows you to control for the multiple mappings.
But do note that there are levels of multiple mappings here. When you ask for various annotations, there are ambiguities that arise between annotation groups (e.g., NCBI and EBI/EMBL don't always agree on what is and is not a gene or whatever), and even within annotation groups (e.g., HUGO symbols are intended to be unique, but they are inextricably linked with 'aliases' which are not intended to be unique, necessarily, and might be intended to be memorable instead). The same is true of Entrez Gene IDs - as we go through time, things that were once thought to be different things end up being the same thing, and one or more Entrez Gene IDs get retired. Maybe in future we will have these ambiguities ironed out, but we don't yet.
As an example of NCBI -> EBI disagreements:
> select(org.Hs.eg.db, "23", "ENSEMBL")
'select()' returned 1:many mapping between keys and columns
ENTREZID ENSEMBL
1 23 ENSG00000204574
2 23 ENSG00000225989
3 23 ENSG00000232169
4 23 ENSG00000231129
5 23 ENSG00000206490
6 23 ENSG00000236149
7 23 ENSG00000236342
And then there is the issue of Affy putting multiple probesets on the array that may or may not measure the same gene. That is a separate issue from the fact that Affy will say that a given probeset measures any transcript that is within the boundary of the genomic region that the probeset is intended to measure. In other words, say there is a gene X, and within the first intron of gene X there is another gene (gene Y), and in the first exon of gene X there is an miRNA. In that situation Affy may well say that the probeset that measures transcripts from that region will measure all of those things. Therefore you have a many-to-one problem (several probesets for one gene), as well as a one-to-many problem (one probeset annotated to several genes) with the Affy mappings.
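To see how much of each problem you have, you can count both directions in a probeset-to-gene table. This is just a base R sketch: the table below is made up (the probeset IDs and gene names are hypothetical), and in practice you would build it from whatever annotation source you use.

```r
# Hypothetical probeset-to-gene table; IDs and symbols are invented.
map <- data.frame(
  PROBEID = c("ps1", "ps1", "ps1", "ps2", "ps3", "ps4"),
  SYMBOL  = c("geneX", "geneY", "mirZ", "geneX", "geneW", "geneW"),
  stringsAsFactors = FALSE
)

# One-to-many: probesets annotated to more than one gene.
genes_per_probeset <- tapply(map$SYMBOL, map$PROBEID,
                             function(x) length(unique(x)))
multi_gene_probesets <- names(genes_per_probeset)[genes_per_probeset > 1]

# Many-to-one: genes measured by more than one probeset.
probesets_per_gene <- tapply(map$PROBEID, map$SYMBOL,
                             function(x) length(unique(x)))
multi_probeset_genes <- names(probesets_per_gene)[probesets_per_gene > 1]
```

Here ps1 plays the role of the probeset that sits over gene X, gene Y and the miRNA, while geneX and geneW each have two probesets.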
So the answer to your question as to which is most appropriate is to ask 'To whom?'. What is and is not appropriate is a rather personal thing, no? In annotateEset, I take the most naive approach possible, and simply take the first annotation of any multiple-mapping probeset. People have argued against that idea, with their own reasoning, and I don't necessarily disagree, but I don't necessarily agree either.
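The 'take the first one' idea can be illustrated in base R. This is not the code inside annotateEset, only a sketch of the same idea, using the Entrez Gene ID 23 mappings from the select() output shown earlier plus one hypothetical single-mapping row for contrast.

```r
# The 1:many result shown earlier for Entrez Gene ID 23, plus one
# hypothetical single-mapping row ("9999" / "ENSG_hypothetical").
res <- data.frame(
  ENTREZID = c(rep("23", 7), "9999"),
  ENSEMBL  = c("ENSG00000204574", "ENSG00000225989", "ENSG00000232169",
               "ENSG00000231129", "ENSG00000206490", "ENSG00000236149",
               "ENSG00000236342", "ENSG_hypothetical"),
  stringsAsFactors = FALSE
)

# Keep only the first row seen for each key; later mappings are dropped.
first_only <- res[!duplicated(res$ENTREZID), ]
```

Note that 'first' just means first in whatever order the annotation source returns the rows, which is part of why people argue about it.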
And as for the multiple probesets that measure the same gene, each of those may measure splice variants. Or not. Or maybe it's a combination of things. Who knows? The genefilter package has a function called findLargest that takes the (naive, to me) approach of choosing the probeset that has the biggest difference in a given comparison as the one to use. That's one way you could go. Or you could use an MBNI re-mapped probeset (I think they have those for the HTA array). Or you could just report the top set of probesets and then look more closely at them, and see if there are other probesets for a differentially expressed gene that don't agree, and then figure out why not. I think you could argue cogently for any of those approaches, and most reasonable people would say 'OK, why not?'.
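The 'biggest difference wins' idea can be sketched in base R as well. This is not genefilter's implementation of findLargest, just the same idea under the assumption that you already have one test statistic per probeset and a probeset-to-gene assignment; all names and values below are hypothetical.

```r
# Hypothetical per-probeset test statistics and gene assignments.
stats <- c(ps1 = 2.1, ps2 = -4.3, ps3 = 0.5, ps4 = 3.0, ps5 = -1.2)
gene  <- c(ps1 = "geneA", ps2 = "geneA", ps3 = "geneB",
           ps4 = "geneB", ps5 = "geneC")

# Group probeset IDs by gene, then keep the one with the largest |statistic|.
by_gene <- split(names(stats), gene[names(stats)])
chosen <- vapply(by_gene,
                 function(ids) ids[which.max(abs(stats[ids]))],
                 character(1))
```

The result has one probeset per gene, so you could subset your ExpressionSet or results table with it before reporting.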
Although old, very helpful response. Many thanks James.