Question

neqc normalization results in bimodal histogram of expression values

0

Entering edit mode

alexvpickering ▴ 110

@alexvpickering

Last seen 3.1 years ago

Canada

I am using neqc to normalize (bg correct, quantile normalize, and log2 transform) Illumina microarray data downloaded from GEO but am getting results that I am suspicious of. I do not have access to the negative control probe files but do have access to Detection P values (GSE39313 and GSE49000). After editing the raw non-normalized expression txt files so that "AVG_Signal" and "Detection" are included in the column names along with the sample names, I run the following:

 data <- read.ilmn(data_paths, probeid="ID_REF")
 data <- neqc(data)$E

If I do a histogram of the now transformed values I get a bimodal distribution (a primary peak at log2 of ~ 6 and a secondary peak at log2 of ~14). I get a similar behaviour for both GSE's so I suspect it's not an oddity of the data. Additionally, after performing differential expression analysis I get a (suspiciously?) large number of differentially expressed genes with adjusted p-values > 0.05 and log-fold-change > 1 (~3000). Also, the top log-fold change values are also very high (10-14). If I import the data and just log2 transform (not using neqc), the data is approximately normally distributed (not bimodal). I suspect that neqc has done something strange and I'm hoping to figure out why so that I can still use it.

Any suggestions or thoughts?

neqc limma microarray illumina • 2.6k views

ADD COMMENT • link updated 9.2 years ago by Gordon Smyth 52k • written 9.2 years ago by alexvpickering ▴ 110

0

Entering edit mode

I think I figured this one out (bi-modality is perhaps to be expected - see here). It seems to be quite a common observation with RMA normalized expression data (neqc is very similar to RMA).

ADD REPLY • link 9.2 years ago alexvpickering ▴ 110

0

Entering edit mode

No, bimodality isn't particularly expected nor is it very common.

ADD REPLY • link 9.2 years ago Gordon Smyth 52k

score 0 · Answer 1 · 2016-02-21

Dear Alex,

Actually neqc() doesn't produce a bimodal distribution for these datasets. Nor is there any need to edit the data file to produce your own column names. Just download the non-normalized data file from GEO without any modification:

library(limma)
x <- read.ilmn("GSE39313_non-normalized.txt", probeid="ID_REF")
y <- neqc(x)
plotDensities(y[,1])

The density plot is beautifully unimodal.

It is important not to edit raw data files IMO. In this case I am guessing that you may have changed the data file so that it is no longer read correctly.