Question

Unusual expression level values after normalization in Affymetrix microarray experiment

Amos Kirilovsky ▴ 50

@amos-kirilovsky-5407

Dear Bioconductor community,

I'm running a gene expression analysis on a subset of 145 samples (Affymetrix hgu133a) from a lung cancer cohort found in GEO.

I noticed something strange while splitting the cohort in two at the median expression level of each probe set: for several probe sets the cohort was not divided into two subgroups with equal numbers of patients, as it should be. For example, for probe set "212970_at", 30% of patients fell below the median expression level and 70% above it. I found that for many patients the expression intensity was exactly equal to the median. I can't figure out why. I checked, and the raw data differ for each sample.
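To make the symptom concrete, here is a minimal sketch of how exact ties at the median distort a median split. It is written in Python only so that it is self-contained; the values are hypothetical (not taken from the GEO cohort), and it assumes the split was coded as "below the median" versus "at or above the median", which the post does not state.

```python
from statistics import median

def median_split(values):
    """Count samples strictly below, exactly equal to, and strictly above the median."""
    m = median(values)
    below = sum(v < m for v in values)
    equal = sum(v == m for v in values)
    above = sum(v > m for v in values)
    return m, below, equal, above

# Hypothetical normalized values for 10 samples; 4 of them share one exact value.
tied = [2.1, 2.3, 2.5, 3.0, 3.0, 3.0, 3.0, 3.4, 3.9, 4.2]
m, below, equal, above = median_split(tied)

# With a "< median" vs ">= median" rule, the tied samples all land in the upper group.
low_fraction = below / len(tied)
high_fraction = (equal + above) / len(tied)
print(m, below, equal, above, low_fraction, high_fraction)
```

With no ties the two groups would each hold half of the samples; every sample sitting exactly on the median shifts the balance toward whichever group the comparison rule assigns it to.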

I simply import the data with the ReadAffy function from the affy package (1.52.0) and normalize it with the gcrma function (2.46.0), and that's all.

Have you already observed something similar?

If any information or data is missing please just tell me.

Thank you for your help,

Amos Kirilovsky

microarray gcrma affy hgu133a normalization • 1.7k views

ADD COMMENT • link updated 8.0 years ago by Gordon Smyth 52k • written 8.0 years ago by Amos Kirilovsky ▴ 50

Thank you Wolfgang and Gordon for your answers. The option fast = FALSE in gcrma did the trick. I plotted the intensities of one probe set after running both the full gcrma algorithm and the ad hoc approximation. The ad hoc approximation generated many ties (53), whereas the full algorithm did not. I'm not a specialist, but the correlation between the two methods doesn't seem very high (R² = 0.77). Should we be worried about that? If so, maybe fast should be FALSE by default. Could the generated ties have a bad influence on some kinds of analysis (e.g. survival)? I haven't yet found any documentation about the differences between the two methods.
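The two checks described here (counting exact ties and computing R² between two sets of values) can be sketched as follows. This is only an illustration in Python with made-up numbers: the vectors `fast_values` and `full_values` are hypothetical stand-ins for one probe set's values under the two settings, not real gcrma output, and R² is taken to be the squared Pearson correlation.

```python
from collections import Counter
from statistics import mean

def largest_tie(values):
    """Return (value, count) for the most frequent exact value."""
    return Counter(values).most_common(1)[0]

def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

# Hypothetical values for six samples under each setting (illustration only).
fast_values = [3.0, 3.0, 3.0, 3.2, 3.5, 4.1]   # several samples collapse to one value
full_values = [2.1, 2.4, 2.2, 2.9, 3.3, 4.0]   # no exact ties

tied_value, tied_count = largest_tie(fast_values)
print(tied_value, tied_count, largest_tie(full_values)[1])
print(round(r_squared(fast_values, full_values), 2))
```

Samples that collapse onto a single value in one vector but stay distinct in the other necessarily lower the correlation, so a modest R² is the expected companion of a large number of ties.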

I also plotted, for the same probe set as before, the log2 PM intensities, their median, and the values from the ad hoc approximation and from the full gcrma algorithm.

I would have expected at least a small correlation between the median raw intensities and the gcrma-processed data, but I suppose this is not new. The values from the full algorithm are lower than those from the ad hoc approximation. Is the main difference generated during background subtraction?

Thanks again,

Amos

ADD REPLY • link 8.0 years ago Amos Kirilovsky ▴ 50

Wolfgang Huber ★ 13k

@wolfgang-huber-3550

EMBL European Molecular Biology Laborat…

gcrma is a sophisticated and numerically somewhat complex algorithm; it is possible that ties might be induced as you describe.

I'd recommend:

Trying with plain rma
Plotting the data from some of the offending probe sets at the per-probe level, e.g. as in Fig. 12 of Chapter 3 of the Case Studies book (http://www.bioconductor.org/help/publications/books/bioconductor-case-studies/web-supplement/chapter_3/figures/figure_12.html, which is reached from http://www.bioconductor.org/help/publications/books/bioconductor-case-studies/web-supplement)

Wolfgang

ADD COMMENT • link 8.0 years ago Wolfgang Huber ★ 13k

score 2 · Accepted Answer · 2017-04-04

When you run gcrma(), try setting option fast=FALSE. This will cause it to run the full gcrma algorithm instead of an ad hoc approximation. In my experience, this solves the sort of problems you mention.

The gcrma publications only document the full algorithm. The "fast" option isn't described anywhere and no doubt was only implemented because computers tended to be slower 13 years ago.

You might say that fast=FALSE should be the default now, and I'd agree, but I'm not a gcrma author. I wouldn't use gcrma myself without setting fast=FALSE.