Question

How to deal with Affymetrix GeneChip probeids that map to multiple genes

relathman ▴ 20

I am currently working with HTA 2.0 chips (GeneChip® Human Transcriptome Array 2.0). After normalization of the data with RMA, I get 70523 features which I need to annotate for the downstream analysis that includes differential expression and gene enrichment steps. I would like to assign only one kind of gene identifier (preferably using the Entrez ID) to a given probeid using the select() function of {AnnotationDbi}. However, this is not always possible since there are probeids that map ambiguously to several genes:

> annotation_testdata <- AnnotationDbi::select(hta20sttranscriptcluster.db,
+     keys = featureNames(norm_testdata),
+     columns = c("SYMBOL", "GENENAME", "ENTREZID", "ENSEMBL"))
'select()' returned 1:many mapping between keys and columns

While looking for an answer, I have found suggestions to:

discard all probes that map to multiple gene symbols (see "How to deal with Affymetrix probe that map to multiple genes"),
choose a single one of these mappings, or
combine them.
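
The three options listed above can be sketched in base R on a small stand-in for the data.frame that select() returns. This is only an illustration: the probe IDs and Entrez IDs are invented, and the object names (anno, unique_only, first_only, combined) are hypothetical.

```r
# Toy stand-in for the data.frame returned by AnnotationDbi::select().
# All probe IDs and Entrez IDs below are made up for illustration.
anno <- data.frame(
  PROBEID  = c("probe_A", "probe_B", "probe_B", "probe_C", "probe_C", "probe_C"),
  ENTREZID = c("101", "202", "203", "304", "305", "306"),
  stringsAsFactors = FALSE
)

# How many distinct genes does each probe ID map to?
n_genes <- tapply(anno$ENTREZID, anno$PROBEID, function(x) length(unique(x)))

# Option 1: discard every probe ID that maps to more than one gene.
unique_only <- anno[anno$PROBEID %in% names(n_genes)[n_genes == 1], ]

# Option 2: keep only the first mapping reported for each probe ID.
first_only <- anno[!duplicated(anno$PROBEID), ]

# Option 3: combine all mappings of a probe ID into one delimited string.
combined <- aggregate(ENTREZID ~ PROBEID, data = anno,
                      FUN = function(x) paste(unique(x), collapse = ";"))
```

Options 1 and 2 leave one row per probe ID with a single Entrez ID, which is what most downstream enrichment tools expect; option 3 keeps all the information but produces identifiers that need splitting again later.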

Which method is the most appropriate in your experience? And what is the biological reason behind these ambiguous mappings?

hta20sttranscriptcluster.db probe mapping annotation pd.hta.2.0 annotationdbi • 3.2k views

updated 8.6 years ago by James W. MacDonald 68k • written 8.6 years ago by relathman ▴ 20

score 8 · Accepted Answer · 2016-09-13

You can annotate your ExpressionSet using annotateEset in the affycoretools package, which explicitly lets you control how multiple mappings are handled.
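
A minimal sketch of such a call, wrapped in a hypothetical helper (annotate_hta is not part of any package). It assumes that the input is an ExpressionSet such as norm_testdata from the question, that affycoretools and hta20sttranscriptcluster.db are installed, and that the columns and multivals arguments behave as described in the affycoretools documentation.

```r
# Sketch only: `eset` is assumed to be an ExpressionSet of RMA-normalized
# HTA 2.0 data (e.g. norm_testdata in the question). The helper name
# annotate_hta is hypothetical.
annotate_hta <- function(eset, multivals = "first") {
  stopifnot(requireNamespace("affycoretools", quietly = TRUE),
            requireNamespace("hta20sttranscriptcluster.db", quietly = TRUE))
  affycoretools::annotateEset(
    eset,
    hta20sttranscriptcluster.db::hta20sttranscriptcluster.db,
    columns   = c("PROBEID", "ENTREZID", "SYMBOL", "GENENAME"),
    multivals = multivals  # "first" keeps one annotation per probeset
  )
}

# Usage (not run here): annotated <- annotate_hta(norm_testdata)
# The annotation is then available via Biobase::fData(annotated).
```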

But do note that there are levels of multiple mappings here. When you ask for various annotations, there are ambiguities that arise between annotation groups (e.g., NCBI and EBI/EMBL don't always agree on what is and is not a gene or whatever), and even within annotation groups (e.g., HUGO symbols are intended to be unique, but they are inextricably linked with 'aliases' which are not intended to be unique, necessarily, and might be intended to be memorable instead). The same is true of Entrez Gene IDs - as we go through time, things that were once thought to be different things end up being the same thing, and one or more Entrez Gene IDs get retired. Maybe in future we will have these ambiguities ironed out, but we don't yet.

As an example of NCBI -> EBI disagreements:

> select(org.Hs.eg.db, "23", "ENSEMBL")
'select()' returned 1:many mapping between keys and columns
  ENTREZID         ENSEMBL
1       23 ENSG00000204574
2       23 ENSG00000225989
3       23 ENSG00000232169
4       23 ENSG00000231129
5       23 ENSG00000206490
6       23 ENSG00000236149
7       23 ENSG00000236342
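
To find such cases systematically, one can count the distinct Ensembl IDs per Entrez Gene ID. The sketch below re-enters the rows printed above by hand so that it runs without org.Hs.eg.db; the object names (res, n_ensembl, ambiguous) are hypothetical.

```r
# The rows from the select() output above, typed in by hand so the example
# runs without org.Hs.eg.db installed.
res <- data.frame(
  ENTREZID = rep("23", 7),
  ENSEMBL  = c("ENSG00000204574", "ENSG00000225989", "ENSG00000232169",
               "ENSG00000231129", "ENSG00000206490", "ENSG00000236149",
               "ENSG00000236342"),
  stringsAsFactors = FALSE
)

# Number of distinct Ensembl gene IDs attached to each Entrez Gene ID.
n_ensembl <- tapply(res$ENSEMBL, res$ENTREZID, function(x) length(unique(x)))

# Entrez Gene IDs whose mapping to Ensembl is not one-to-one.
ambiguous <- names(n_ensembl)[n_ensembl > 1]
```

The same two lines at the end would work on the full result of a select() call across many Entrez Gene IDs.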

And then there is the issue of Affy putting multiple probesets on the array that may or may not measure the same gene. That is a separate issue from the fact that Affy will say that a given probeset measures any transcript that is within the boundary of the genomic region that the probeset is intended to measure. In other words, say there is a gene X, and within the first intron of gene X there is another gene (gene Y), and in the first exon of gene X there is an miRNA. In that situation Affy may well say that the probeset that measures transcripts from that region will measure all of those things. Therefore you have a many-to-one problem (several probesets for one gene) as well as a one-to-many problem (one probeset for several genes) with the Affy mappings.
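
Both directions of the problem can be seen by tabulating a probeset-to-gene table each way. The table below is invented to mirror the gene X / gene Y / miRNA scenario; none of the IDs are real, and the object names are hypothetical.

```r
# Hypothetical probeset-to-gene table; every ID below is invented.
# ps1 covers a region containing geneX, geneY and an miRNA (mirZ);
# ps2 is a second probeset for geneX.
map <- data.frame(
  PROBESET = c("ps1", "ps1", "ps1", "ps2", "ps3"),
  GENE     = c("geneX", "geneY", "mirZ", "geneX", "geneW"),
  stringsAsFactors = FALSE
)
map <- unique(map)  # guard against duplicated rows

genes_per_probeset <- table(map$PROBESET)
probesets_per_gene <- table(map$GENE)

# One probeset annotated to several genes.
one_to_many <- names(genes_per_probeset)[genes_per_probeset > 1]

# Several probesets annotated to the same gene.
many_to_one <- names(probesets_per_gene)[probesets_per_gene > 1]
```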

So the answer to your question as to which is most appropriate is to ask 'To whom?'. What is and is not appropriate is a rather personal thing, no? In annotateEset, I take the most naive approach possible, and simply take the first annotation of any multiple-mapping probeset. People have argued against that idea, with their own reasoning, and I don't necessarily disagree, but I don't necessarily agree either.

And as for the multiple probesets that measure the same gene, each of those may measure splice variants. Or not. Or maybe it's a combination of things. Who knows? The genefilter package has a function called findLargest that takes the (naive, to me) approach of choosing the probeset that has the biggest difference in a given comparison as the one to use. That's one way you could go. Or you could use an MBNI re-mapped probeset (I think they have those for the HTA array). Or you could just report the top set of probesets and then look more closely at them, and see if there are other probesets for a differentially expressed gene that don't agree, and then figure out why not. I think you could argue cogently for any of those approaches, and most reasonable people would say 'OK, why not?'.
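
The idea of keeping, for each gene, the probeset with the largest effect can be illustrated in base R. This is not the genefilter implementation of findLargest, only a sketch of the selection rule; the per-probeset statistics are invented, and using the absolute t-statistic as the measure of 'biggest difference' is an assumption.

```r
# Hypothetical per-probeset results; IDs and statistics are invented.
stats <- data.frame(
  PROBESET = c("ps1", "ps2", "ps3", "ps4", "ps5"),
  ENTREZID = c("101", "101", "202", "202", "303"),
  t        = c(1.2, -4.5, 0.3, 2.0, -0.8),
  stringsAsFactors = FALSE
)

# Order by decreasing absolute statistic, then keep the first row per gene,
# i.e. the probeset with the largest |t| for that Entrez Gene ID.
ord  <- stats[order(abs(stats$t), decreasing = TRUE), ]
best <- ord[!duplicated(ord$ENTREZID), ]
best <- best[order(best$ENTREZID), ]
```

Note that selecting on the test statistic uses the outcome of the comparison to choose the probeset, which is one reason this approach can be considered naive.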