Question

How do I choose between using affy or oligo package for preprocessing microarrays?

stevessheridan • 0

Hello,

I'm looking for an automated way to choose between preprocessing microarray datasets with the affy or the oligo package. I'm developing a pipeline that automates the acquisition and processing of data from ArrayExpress, and neither package is one-size-fits-all. My main question is thus:

Is there a way to automatically determine which package is preferable, either from the name of a platform (e.g. "[HuGene-1_1-st] Affymetrix Human Gene 1.1 ST Array", "Affymetrix GeneChip Human Genome U133 Plus 2.0 [HG-U133_Plus_2]"), or from the header of a CEL file?
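
To make the question concrete, here is a minimal sketch of the kind of name-based heuristic I have in mind. It encodes only the rule of thumb from this post (Gene ST and Exon ST arrays need oligo; for everything else, try oligo first and keep affy as a fallback). The function name `packages_to_try` and the regular expression are my own assumptions, not part of either package, and the pattern has only been checked against the example names shown here.

```r
# Hypothetical helper: return the packages to try, in order, for a platform
# name or CDF name. Assumption: ST arrays are recognisable by "-st" in the
# CDF name or "ST Array" in the descriptive platform name.
packages_to_try <- function(platform_name) {
  is_st <- grepl("-st\\b|\\bST\\s+Array", platform_name,
                 ignore.case = TRUE, perl = TRUE)
  if (is_st) "oligo" else c("oligo", "affy")
}

# Hypothetical usage with a CEL file (assumes the affyio package is installed):
# cdf_name <- affyio::read.celfile.header("sample.CEL")$cdfName
# packages_to_try(cdf_name)
```

A name-based rule like this can only be a first guess; a dataset with custom CDFs may still need the other package.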

I will also include some answers that I've found which may be helpful if you're arriving from Google:

Should I use oligo or affy?
- Try oligo first, then if it doesn't work, try affy.*
- oligo works for newer platforms and the popular old platforms. affy won't work for new platforms such as the Gene ST and Exon ST arrays.
- Some datasets cause an error in oligo but still work with affy; I think this has to do (sometimes? always?) with custom CDFs in the dataset.
What differences are there between the two, if both of them work?
- The expression matrices produced by each are almost identical.**
- oligo's read.celfiles() uses 33% less memory than affy's read.affybatch().*** Since this step is the most memory-demanding of a microarray analysis, and a big dataset can easily suck up tens of gigabytes of memory and reduce your computer to a thrashing mess, this can be significant.
- affy::rma() is often 10% to 50% quicker than oligo::rma().

If anyone has any other reasons to choose one over the other, please do let me know.

Footnotes:

* They're quite easy to change between. For a vector of rawfilepaths:

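# oligo: read the CEL files, then run RMA
rawbatch <- oligo::read.celfiles(rawfilepaths); RMA <- oligo::rma(rawbatch)
# affy: the same two steps (namespacing rma() avoids a clash if both packages are loaded)
rawbatch <- affy::read.affybatch(filenames = rawfilepaths); RMA <- affy::rma(rawbatch)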
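
The "try oligo first, then affy" advice can be sketched as a small fallback wrapper. This is only an illustration under stated assumptions: `run_first_success` and `default_steps` are hypothetical names of my own, and the wrapper treats any R error as a reason to move on to the next package, which may hide problems unrelated to platform support.

```r
# Try each preprocessing function in order and return the first result
# that completes without an error, together with the name of the step used.
run_first_success <- function(rawfilepaths, steps) {
  errors <- character(0)
  for (name in names(steps)) {
    result <- tryCatch(steps[[name]](rawfilepaths), error = function(e) e)
    if (!inherits(result, "error")) {
      return(list(package = name, eset = result))
    }
    errors[name] <- conditionMessage(result)  # remember why this step failed
  }
  stop("All preprocessing attempts failed: ",
       paste(names(errors), errors, sep = ": ", collapse = "; "))
}

# The packages are only needed when a step is actually called.
default_steps <- list(
  oligo = function(paths) oligo::rma(oligo::read.celfiles(paths)),
  affy  = function(paths) affy::rma(affy::read.affybatch(filenames = paths))
)
```

With both packages installed, `res <- run_first_success(rawfilepaths, default_steps)` would return the RMA result in `res$eset` and record which package produced it in `res$package`.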

** For older chips, the expression matrices produced are virtually identical (the differences are just rounding error or similar). The only real difference I found was for Human Gene 1.0 ST, in which oligo produced an expression table with 33,297 rows, vs. 32,321 rows from affy.
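
A comparison of this kind can be sketched in base R as follows. The function name `compare_expression` is hypothetical, and the sketch assumes both matrices have probesets in rows and samples in columns, with row and column names that match wherever the two packages agree (which may not hold for every platform).

```r
# Compare two expression matrices on the rows and columns they share.
compare_expression <- function(a, b) {
  shared_rows <- intersect(rownames(a), rownames(b))
  shared_cols <- intersect(colnames(a), colnames(b))
  diffs <- abs(a[shared_rows, shared_cols, drop = FALSE] -
               b[shared_rows, shared_cols, drop = FALSE])
  list(
    rows_a = nrow(a),
    rows_b = nrow(b),
    shared_rows = length(shared_rows),
    # NA when the matrices have nothing in common
    max_abs_diff = if (length(diffs)) max(diffs) else NA_real_
  )
}
```

For the RMA outputs, the matrices would be extracted first, for example with `Biobase::exprs()` on each result.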

*** I tested files from E-MTAB-1724 and found the relationships between number of files and peak memory usage (in GiB) to be:

For read.affybatch(): 0.0263 * length(rawfilepaths) + 0.102
For read.celfiles(): 0.0176 * length(rawfilepaths) + 0.139
(R^2 > 0.999 for both)
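
As a worked example, the two fitted lines can be wrapped in a small function to estimate peak memory before loading a dataset. The name `predict_peak_gib` is hypothetical, and the coefficients are just the fits above for E-MTAB-1724; other platforms will have different slopes.

```r
# Estimated peak memory (GiB) when reading n_files CEL files,
# using the linear fits measured on E-MTAB-1724.
predict_peak_gib <- function(n_files, reader = c("oligo", "affy")) {
  reader <- match.arg(reader)
  coef <- switch(reader,
    oligo = c(slope = 0.0176, intercept = 0.139),  # read.celfiles()
    affy  = c(slope = 0.0263, intercept = 0.102)   # read.affybatch()
  )
  unname(coef["slope"] * n_files + coef["intercept"])
}

predict_peak_gib(1000, "affy")   # 0.0263 * 1000 + 0.102
predict_peak_gib(1000, "oligo")  # 0.0176 * 1000 + 0.139
```

The ratio of the slopes, 0.0176 / 0.0263, is about 0.67, which is where the "33% less memory" figure for large datasets comes from.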

Another difference, which won't affect most people, is that the output object from oligo's read.celfiles() is far larger than that from affy's read.affybatch(). However, either object (tens to hundreds of megabytes) is so much smaller than the peak memory usage (gigabytes to tens of gigabytes) that you shouldn't worry about it unless you're collecting a lot of these.

microarray affymetrix microarrays affy oligo • 4.3k views

ADD COMMENT • link updated 4.1 years ago by James W. MacDonald 68k • written 8.2 years ago by stevessheridan • 0

score 0 · Answer 1 · 2017-01-25

James W. MacDonald 68k

United States

These days I can't think of a compelling reason that you would need to use affy in lieu of oligo. Benilton added a generic array class, maybe two release cycles ago, that allows one to use non-standard CDFs like the MBNI re-mapped CDFs that you mention. And if you go to MBNI's download site, you can see that they have a pdInfoPackage for (AFAIK) all of the CDFs that they have generated.

The only reason I can think of for using affy is if you want to use one of the packages (like frma) that rely on affy rather than oligo. Preferring oligo matters most for the random-primer based arrays, which share probes across multiple probesets, and which the affy package does not handle correctly.

But if you have a particular array for which affy works, but oligo does not, please let me know.

ADD COMMENT • link 8.2 years ago James W. MacDonald 68k

Hello,

I have a dataset that oligo fails to process but affy handles: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM840099

One of our customers wanted to process it in the software my company is developing. We run R scripts from Java code on the server side. The HuGene-1_0-st platform used by this dataset is usually processed with oligo. That works fine with many other datasets, but not with GSM840099_Hela-c.CEL.

Can you please tell me what is wrong with it? Should the code call oligo first, and then fall back to affy if oligo fails?