The QC probes on the random-primer Affy arrays are completely different from those on the 3'-biased arrays, and you get at them differently as well: you have to use the oligo/pdInfo pipeline rather than the affy/makecdfenv pipeline. Normally you would be able to query the pd.hta.2.0 database directly, but unfortunately the current pd.hta.2.0 database doesn't have the right data for the probeset types. Benilton is refactoring it for the next release, so it won't get fixed until April.
Note that the problem only affects the part of the database that says what 'type' a probeset is. The mapping of probes to probesets is unaffected by this issue.
However, there is still a way to get these data using the netaffxTranscript.rda that comes with the pd.hta.2.0 package:
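> library(pd.hta.2.0)
> ## locate the transcript-level NetAffx annotation shipped with the package
> load(system.file("extdata", "netaffxTranscript.rda", package = "pd.hta.2.0"))
> ## one row per transcript-level probeset
> annot <- pData(netaffxTranscript)
> ## count probesets by category
> table(annot$category)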

                additional   control->affx->bac_spike
                       230                          4
       control->affx->ercc  control->affx->ercc->step
                        92                         63
control->affx->polya_spike  control->bgp->antigenomic
                         4                         23
                      main      main///normgene->exon
                     67528                       1465
            normgene->exon           normgene->intron
                       698                        646

The 'main' probesets are those intended to measure transcripts; all the other probesets are controls of one type or another.
Note that the ordering of this object (annot) will NOT be the same as the ordering of your data, so you have to make sure you get things ordered correctly!
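One way to handle the ordering is to match on the probeset ID rather than rely on row position. A minimal sketch with toy data follows; the `transcriptclusterid` column name and all the IDs are assumptions for illustration, so check `names(annot)` and the rownames of your own summarized data before adapting it.

```r
# Toy stand-ins: 'annot' mimics pData(netaffxTranscript), 'expr' mimics a
# matrix of summarized expression values with probeset IDs as rownames.
# The column name 'transcriptclusterid' and all IDs are assumptions.
annot <- data.frame(
  transcriptclusterid = c("ps_c", "ps_a", "ps_d", "ps_b"),
  category = c("main", "normgene->intron", "control->affx->ercc", "main"),
  stringsAsFactors = FALSE
)
expr <- matrix(1:8, nrow = 4,
               dimnames = list(c("ps_a", "ps_b", "ps_c", "ps_d"),
                               c("s1", "s2")))

# Reorder the annotation so that row i describes row i of the data
idx <- match(rownames(expr), annot$transcriptclusterid)
stopifnot(!anyNA(idx))  # every probeset needs an annotation row
annot_ord <- annot[idx, ]

# A logical index built from the annotation now lines up with the data
is_main <- annot_ord$category == "main"
expr_main <- expr[is_main, , drop = FALSE]
```

Indexing by `match()` rather than sorting both objects means the data keep their original order, and the `stopifnot()` call fails loudly if any probeset has no annotation.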
Thanks, James.
Do you know where to find documentation detailing what these control probesets are?
Also, do I understand correctly that there are probes used to both serve as control AND measure transcripts (main///normgene->exon)?
The product PDF gives some hints (http://www.affymetrix.com/support/technical/datasheets/hta_array_2_0_datasheet.pdf), and you can look at the names of the probes for more.
I don't know of other documentation, and I have just figured things out by poking around.
The 'additional' probesets are a mystery. If you search netaffx, they will helpfully let you know that these are 'additional' probesets. What they are for is beyond me.
The ERCC probes are for using the ERCC spike-ins for normalization. The antigenomic probesets are intended to be background probesets - Affymetrix claim that these sequences don't exist in the human genome.
The normgene->intron probes are supposed to be background as well, since they target intronic regions. I think Affy somehow forgot that with a random primer you are just as likely to pick up signal from nascent transcripts that have not yet been processed to excise the introns. These probes have an irritating ability to show up in almost all sets of 'top' genes, which of course makes people skittish.
The normgene->exon probes are (AFAICT) always just exon-level probes that get used twice. In other words, say you have Gene X, and there are 12 exon-level probesets (or more correctly, 'probe set region' or PSR probesets) that are aggregated to make the transcript probeset. There may also be a single PSR probeset from that gene that Affy labels as a normgene->exon probeset. So that single PSR probeset is used twice: once when aggregated at the Gene X transcript level, and once more as an individual control probeset.
This is the first I have seen of the main///normgene->exon type designation, and Affy seem to want to keep these things a mystery. Searching netaffx brings nothing up, for either a transcript or a probeset search. However, if you look in the netaffxProbeset.rda file that comes with the pd.hta.2.0 package, it appears that these are single PSR probesets (or JUC probesets) that are never aggregated into a larger transcript probeset. So it appears that your supposition is correct; these do seem to be probesets that are both 'main' and 'normgene->exon' at the same time.
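If you want to treat a compound label like main///normgene->exon as belonging to both groups, one option is to split the category string on '///'. This is only a sketch on a toy vector using category strings from the table above; `has_type` is a hypothetical helper, not something from oligo or pd.hta.2.0.

```r
# Category labels copied from the table() output above (toy vector)
category <- c("main", "main///normgene->exon", "normgene->exon",
              "normgene->intron", "control->bgp->antigenomic")

# Split compound labels on the literal '///' separator
parts <- strsplit(category, "///", fixed = TRUE)

# has_type() is a hypothetical helper: TRUE where any component matches
has_type <- function(type) {
  vapply(parts, function(p) type %in% p, logical(1))
}

is_main     <- has_type("main")
is_normexon <- has_type("normgene->exon")

# Probesets that carry both designations at once
category[is_main & is_normexon]
```

With real data you would use `annot$category` in place of the toy vector, and decide for yourself whether the doubly labelled probesets belong in the 'main' set or the control set for your analysis.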