Hello,
I'm looking for an automated way to choose between preprocessing microarray datasets with the affy or oligo package. I'm developing a pipeline that automates the acquisition and processing of data from ArrayExpress, and neither package is one-size-fits-all. My main question is thus:
- Is there a way to automatically determine which package is preferable, either from the name of a platform (e.g. "[HuGene-1_1-st] Affymetrix Human Gene 1.1 ST Array", "Affymetrix GeneChip Human Genome U133 Plus 2.0 [HG-U133_Plus_2]"), or from the header of a CEL file?
I will also include some answers that I've found which may be helpful if you're arriving from Google:
- Should I use oligo or affy?
  - Try oligo first, then if it doesn't work, try affy.*
  - oligo works for newer platforms and the popular old platforms. affy won't work for new platforms such as the Gene ST and Exon ST arrays.
  - Some datasets cause an error in oligo but still work with affy; I think this has to do (sometimes? always?) with custom CDFs in the dataset.
- What differences are there between the two, if both of them work?
  - The expression matrices produced by each are almost identical.**
  - oligo's read.celfiles() uses 33% less memory than affy's read.affybatch().*** Since this step is the most memory-demanding of a microarray analysis, and a big dataset can easily suck up tens of gigabytes of memory and reduce your computer to a thrashing mess, this can be significant.
  - affy::rma() is often 10% - 50% quicker than oligo::rma().
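As a rough illustration of the "Gene ST / Exon ST" rule above, here is a minimal sketch that flags ST designs from a platform or chip name. The function name is_st_array and the "-st" naming rule are my own assumptions based on the two example platform names in this post, not an official mapping, so treat it as a first guess rather than a guarantee:

```r
# Hypothetical helper: TRUE when a platform/chip name looks like a Gene ST or
# Exon ST design (e.g. "HuGene-1_1-st"). The "-st" rule is an assumption.
is_st_array <- function(platform_name) {
  grepl("-st\\b", platform_name, ignore.case = TRUE, perl = TRUE)
}

platforms <- c(
  "[HuGene-1_1-st] Affymetrix Human Gene 1.1 ST Array",
  "Affymetrix GeneChip Human Genome U133 Plus 2.0 [HG-U133_Plus_2]"
)
is_st_array(platforms)

# The chip name could also come from a CEL header, for example
# affyio::read.celfile.header(path)$cdfName (requires affyio; not run here).
```

A TRUE result would suggest starting with oligo; it says nothing about datasets with custom CDFs, which may still need affy.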
If anyone has any other reasons to choose one over the other, please do let me know.
Footnotes:
* They're quite easy to switch between. For a character vector of CEL file paths, rawfilepaths:
rawbatch = oligo::read.celfiles(filenames = rawfilepaths) ; RMA = oligo::rma(rawbatch)
rawbatch = affy::read.affybatch(filenames = rawfilepaths) ; RMA = affy::rma(rawbatch)
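For a pipeline, the "try oligo first, then affy" advice could be wrapped in tryCatch(). This is only a sketch: read_with_fallback and the readers list are hypothetical names I made up, and the oligo/affy calls are left commented out so the helper itself runs without either package installed:

```r
# Hypothetical helper: call each reader in order and return the first success.
# `readers` is a named list of functions that take a character vector of paths.
read_with_fallback <- function(paths, readers) {
  errors <- list()
  for (pkg in names(readers)) {
    result <- tryCatch(readers[[pkg]](paths), error = function(e) e)
    if (!inherits(result, "error")) {
      return(list(package = pkg, data = result))
    }
    # Remember why this reader failed so the final error is informative.
    errors[[pkg]] <- conditionMessage(result)
  }
  stop("All readers failed: ",
       paste(names(errors), unlist(errors), sep = ": ", collapse = "; "))
}

# Usage sketch (assumes oligo and affy are installed; not run here):
# readers <- list(
#   oligo = function(p) oligo::rma(oligo::read.celfiles(filenames = p)),
#   affy  = function(p) affy::rma(affy::read.affybatch(filenames = p))
# )
# eset <- read_with_fallback(rawfilepaths, readers)$data
```

The returned list records which package succeeded, which is worth logging given the row-count difference noted for Human Gene 1.0 ST.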
** For older chips, the expression matrices produced are virtually identical (the differences are just rounding error or similar). The only real difference I found was for Human Gene 1.0 ST, in which oligo produced an expression table with 33,297 rows, vs. 32,321 rows from affy.
*** I tested files from E-MTAB-1724 and found the relationships between number of files and peak memory usage (in GiB) to be:
For read.affybatch(): 0.0263 * length(rawfilepaths) + 0.102
For read.celfiles(): 0.0176 * length(rawfilepaths) + 0.139
(R^2 > 0.999 for both)
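If you want to plug these fits into a pipeline (say, to decide whether a dataset will fit in RAM), here is a small sketch. estimate_peak_gib is a hypothetical name, and the coefficients are just the ones fitted above on E-MTAB-1724, so other platforms will likely give different numbers:

```r
# Hypothetical helper: estimated peak memory (GiB) while reading n_files CEL
# files, using the linear fits reported above for E-MTAB-1724 only.
estimate_peak_gib <- function(n_files,
                              reader = c("read.celfiles", "read.affybatch")) {
  reader <- match.arg(reader)
  coefs <- switch(reader,
    read.affybatch = c(slope = 0.0263, intercept = 0.102),
    read.celfiles  = c(slope = 0.0176, intercept = 0.139)
  )
  unname(coefs["slope"] * n_files + coefs["intercept"])
}

estimate_peak_gib(100, "read.affybatch")  # about 2.73 GiB
estimate_peak_gib(100, "read.celfiles")   # about 1.90 GiB
```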
Another difference, which won't affect most people, is that the output object from oligo's read.celfiles() is far larger than that from affy's read.affybatch(). However, it's so much smaller (tens to hundreds of megabytes) than the peak memory usage (gigabytes to tens of gigabytes) that you shouldn't worry about it unless you're collecting a lot of these.
Hello,
I've got a dataset that oligo cannot process but affy can: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM840099 One of our customers wanted to process it in the software my company is developing. We call R scripts from Java code on the server side. The HuGene-1_0-st platform used by this dataset is usually processed with oligo, which works fine with many other datasets, but not with GSM840099_Hela-c.CEL.
Can you please tell me what is wrong with it? Should the code call oligo first, and then affy if oligo fails?
Please don't add comments to old posts. Instead submit a new question.