I am trying to annotate transcripts from an Affymetrix Mouse Gene ST 2.0 microarray using 'oligo', but I have found so many resources and annotation approaches that I cannot figure out the relationships and differences between them.
So, to start with, what is the difference between the packages pd.mogene.2.0.st and mogene20sttranscriptcluster.db? How are they used?
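For context, this is roughly how I understand each package is meant to be used (the CEL file directory is just a placeholder, so please correct me if I have it wrong):

```r
library(oligo)
library(mogene20sttranscriptcluster.db)

## pd.mogene.2.0.st is the platform design package that oligo loads
## automatically to read the CEL files and summarize the probes
celFiles <- list.celfiles("celfiles", full.names = TRUE)  # placeholder directory
affyRaw  <- read.celfiles(celFiles)                       # pulls in pd.mogene.2.0.st
affyNorm <- rma(affyRaw, target = "core")                 # transcript-cluster level

## mogene20sttranscriptcluster.db maps the resulting transcript cluster IDs
## to gene-level annotation after summarization
anno <- AnnotationDbi::select(mogene20sttranscriptcluster.db,
                              keys    = featureNames(affyNorm),
                              columns = c("SYMBOL", "ENTREZID", "GENENAME"),
                              keytype = "PROBEID")
```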
Also, how do they differ from the annotation .csv file provided by Affymetrix, which you can obtain using getNetAffx()?
Finally, would annotating with BioMart give the same results as any of the previous approaches?
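And this is the kind of side-by-side comparison I have in mind, continuing from the affyNorm object above (the BioMart attribute name below is just my guess; I would look it up with listAttributes() first):

```r
library(biomaRt)

## NetAffx annotation distributed by Affymetrix, retrieved through oligo
netaffx <- getNetAffx(affyNorm, type = "transcript")
head(pData(netaffx))

## BioMart annotation of the same transcript cluster IDs
mart <- useMart("ensembl", dataset = "mmusculus_gene_ensembl")
grep("mogene", listAttributes(mart)$name, value = TRUE)  # find the right attribute
bm <- getBM(attributes = c("affy_mogene_2_1_st_v1",      # guessed attribute name
                           "mgi_symbol", "entrezgene_id"),
            filters    = "affy_mogene_2_1_st_v1",
            values     = featureNames(affyNorm),
            mart       = mart)
```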
Thanks.
Excellent answer James, thanks a lot!
Just a final question: when mapping the 'core' probesets to genes using the mogene20sttranscriptcluster.db package, should I expect duplicated genes in the collapsed matrix, or will ALL probesets mapping to the same gene have been collapsed?
I expected it to be the latter but, as a sanity check, after collapsing with affyNorm <- rma(affyRaw, target="core") I annotated all the "collapsed" probesets using BioMart and obtained many duplicated genes, like the following (the check itself is sketched after the table):
probeset.id   gene.symbol   gene.entrez
17427309      Jun           16476
17427312      Jun           16476
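In case it is useful, this is more or less the check that produced the table above (it assumes the bm data frame from the biomaRt sketch earlier in the thread, so the column names are whatever was requested there):

```r
## keep only rows whose Entrez ID is shared by more than one probeset
dupIds <- unique(bm$entrezgene_id[duplicated(bm$entrezgene_id)])
dupIds <- dupIds[!is.na(dupIds)]
bm[bm$entrezgene_id %in% dupIds, ]
```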
I checked the corresponding probeset sequences with BLAT, and they actually map to different regions of the Jun gene.
Why is this happening? What should we do with that?
No, you shouldn't expect that. Affymetrix arrays have historically had more than one probeset for some genes, and this pattern continues with the Gene ST arrays. Why this is so, and what you should do with it are good questions, but I have no answers for you. Given the number of duplicated probesets I would be surprised if Affymetrix had a single rationale for the duplication, and would instead assume that it is a gene-dependent thing.
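If you want a sense of how widespread the duplication is, a quick sketch like this, based purely on the mappings in the transcriptcluster.db package, will tell you how many genes are interrogated by more than one probeset:

```r
library(mogene20sttranscriptcluster.db)

## map every transcript cluster ID to its Entrez gene ID
anno <- AnnotationDbi::select(mogene20sttranscriptcluster.db,
                              keys    = keys(mogene20sttranscriptcluster.db),
                              columns = "ENTREZID",
                              keytype = "PROBEID")
anno <- anno[!is.na(anno$ENTREZID), ]

## number of probesets per gene, and how many genes get more than one
probesPerGene <- table(anno$ENTREZID)
table(probesPerGene)
sum(probesPerGene > 1)
```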
As an example, they could have just piled all the probes that interrogate Jun into one probeset and called it good. But maybe they think there are two predominant transcripts, and you can use the two probesets to infer differences between those two transcripts by looking at the expression values for the two probesets. Or maybe they think something else about Jun. I really don't know. This does make interpretation of the results more difficult, and using an MBNI cdf where they have all been collapsed at the gene level would make interpretation easier.
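For what it's worth, if you do decide you want a single value per gene without switching to an MBNI cdf, one crude option (just a sketch using the affyNorm object from your rma() call, and averaging is only one of several reasonable summaries) is to collapse the duplicated probesets with limma's avereps():

```r
library(limma)
library(mogene20sttranscriptcluster.db)

## expression matrix at the transcript cluster level
mat <- exprs(affyNorm)

## gene symbol for each probeset, from the transcriptcluster.db annotation
sym <- AnnotationDbi::mapIds(mogene20sttranscriptcluster.db,
                             keys    = rownames(mat),
                             column  = "SYMBOL",
                             keytype = "PROBEID")

## average all probesets sharing a symbol into one row per gene
keep      <- !is.na(sym)
geneLevel <- avereps(mat[keep, ], ID = sym[keep])
```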
On the other hand, there is a valid argument that simplifying data to the gene level ignores the complexity of the transcript: you could, for instance, think you have differential expression when in fact you have identical expression levels but very different transcripts. In other words, consider the hypothetical of a gene with four exons and two main transcripts, where transcript A has all four exons and transcript B has only two. If two sample types express exactly equivalent numbers of transcripts, but one sample type expresses only transcript A and the other only transcript B, then you may get very different signal and interpret it as differential gene expression, when in fact it is due entirely to differences in the form of the transcript.
As to what you should do with that, I have no idea. The answer depends on too many variables. There is always the tension between the 'bulk analysis' that we do with microarray data and the particular questions you may have. With 30,000 or more different probesets, you have to do some things in a pretty naive way. For example, you fit the same linear model on each probeset, which you would never do if you were doing conventional statistical analysis. But you cannot go through and decide what the best model is for each gene because nobody has that kind of time. Plus you usually don't have the replication to decide what the best model is anyway. But at some point you have a set of interesting genes, at which time you might want to look more closely at the probesets, what they are measuring, etc, in order to decide what the data mean.
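To be concrete about what I mean by fitting the same linear model on each probeset, the usual limma pipeline is roughly this (the two-group factor is obviously made up, and affyNorm is your rma() output):

```r
library(limma)

## hypothetical two-group design, three replicates per group
group  <- factor(c("ctrl", "ctrl", "ctrl", "trt", "trt", "trt"))
design <- model.matrix(~ group)

## the identical model is fit to every probeset in a single call
fit <- lmFit(exprs(affyNorm), design)
fit <- eBayes(fit)
topTable(fit, coef = 2)   # probesets ranked for the trt vs ctrl comparison
```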
That perfectly answers my question, thanks James!