Getting the raw data :

Question

How many features an ExpressionSet can contain ?

0

Entering edit mode

OE • 0

@1b6d9c0f

Last seen 2.6 years ago

Morocco

I have some Raw Data from a microarray experiment with 473230 probes.

When I created an ExpressionSet object for this data, I found out that it contains only 47323 features.

I didn't perform any normalization.

So why that happened ? Does an ExpressionSet object have some limited capacity ?

Could anyone clarify that please ?

ExpressionSet • 1.9k views

ADD COMMENT • link 2.6 years ago OE • 0

1

Entering edit mode

Rory Stark ★ 5.2k

@rory-stark-5741

Last seen 24 days ago

Cambridge, UK

I had a look at GSE68849 and see that the array platform has about 47k probes. The dataset has 10 samples. So the $E expression matrix contains 47323 probes x 10 samples = 473230 values.

When the ExpressionSet is formed, you have the correct number of probes (47323), with values for all 10 samples in the dataset, totalling 473230 values.

ADD COMMENT • link 2.6 years ago Rory Stark ★ 5.2k

0

Entering edit mode

You are right that means 47323 for every single sample, totalling 473230 values.

However when I tried to feed the ExpressionSet to the rma function in order to normalize my data I got the following error:

Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘rma’ for signature ‘"ExpressionSet"’

Do you have any idea about how can I perform normalization on my raw data contained in the Expression Set ?

ADD REPLY • link 2.6 years ago OE • 0

0

Entering edit mode

James W. MacDonald 67k

@james-w-macdonald-5106

Last seen 13 hours ago

United States

> library(Biobase)
> bigboy <- matrix(rnorm(473230 * 10), ncol = 10)
> z <- ExpressionSet(bigboy)
> dim(z)
Features  Samples
  473230       10

ADD COMMENT • link 2.6 years ago James W. MacDonald 67k

0

Entering edit mode

The only limits I know of are the OS limits on the number of things you can have in a vector. For really huge things you might want to use a SummarizedExperiment with an HDF5 back-end so you don't have to keep all the data in memory at once. But so far as I know there isn't a limit on an ExpressionSet directly.

ADD REPLY • link 2.6 years ago James W. MacDonald 67k

0

Entering edit mode

Okay. I think I've found an explication of that but I'm not pretty sure.

In effect I've already learned in a previous course about Bioconductor that Affymetrix uses a design where a gene is being measured by many different probes simultaneously (i.e., probset). As part of the preprocessing step for Affymetrix arrays the measurements for all probes in a probeset needs to be combined into one expression measure.

So at the beggining ( i.e., in the raw data) I had 473230 rows which are in fact the 473230 probes used in the microarray, but when I put that in the ExpressionSet I got only 47323 features.

A simple computation lets us know that in the ExpressionSet there's only 10 % of what it is in the raw data.

I guess that the ExpressionSet takes automatically all the probes from the raw data and combines them by 10 as if we use 10 probes per gene.in the microarray.

So this is my reflexion. Any further clarification or explication would be greatly appreciated !

Anyway thank you James W. MacDonald .

ADD REPLY • link 2.6 years ago OE • 0

0

Entering edit mode

I am not exactly sure what you mean by 'when I put that in the ExpressionSet'. In general, one doesn't usually create an ExpressionSet, but instead it is generated by the software being used. As an example, if I were to do

> library(oligo)
> dat <- read.celfiles(list.celfiles())
> eset <- rma(dat)

I would technically be putting data in an ExpressionSet, but not really. I told oligo to summarize the data, it did its magic, and then I end up with an ExpressionSet that has fewer rows than the starting data because it's been summarized. So if that is what you mean, then yes. If you are doing something else (and you don't know exactly what you are doing), then you should probably not do that, and instead do what I have outlined above.

ADD REPLY • link 2.6 years ago James W. MacDonald 67k

0

Entering edit mode

By " i put that in the Expression Set" I meant create an ExpressionSet that contains it. To clear things up here is my code and my approach:

Getting the raw data :

rawdata <- read.ilmn("GSE68849_non-normalized.txt.gz")

Subsetting to get the expression data table:

rawdata$E

Checking how many probes I have got here:

length(rawdata$E)

473230

Creating an ExpressionSet that contains the expression data from the raw data:

esetraw <- new("ExpressionSet",exprs=rawdata$E)

Checking how many features I have got:

length(featureNames(esetraw))

47323

Here, you might have noticed that the ExpressionSet contains only 10 % of what is in "rawd$E". I hope you have already read what I wrote about this story of affymetrix and probsets.

So my question is why that happened, why here the ExpressionSet contains only 10% of the raw data. I gave an explanation but I'm not sure.

ADD REPLY • link 2.6 years ago OE • 0

score 2 · Accepted Answer · 2022-06-20

GSE68849 contains Illumina Beadchip arrays and you need to apply preprocessing and normalization appropriate for that type of microarray.

This is what you should be doing to read, background correct and normalize the arrays:

> library(limma)
> x <- read.ilmn("GSE68849_non-normalized.txt.gz", probeid="ID_REF")
Reading file GSE68849_non-normalized.txt.gz ... ...
> y <- neqc(x)
Note: inferring mean and variance of negative control probe intensities from the detection p-values.
> dim(y)
[1] 47323    10
> dim(y$E)
[1] 47323    10
> length(y$E)
[1] 473230

Note:

You should not be trying to use rma() because these are not Affymetrix arrays.
You should not be subsetting or dissasembling the EListRaw object from read.ilmn(), because doing so throws away crucial information that is required to properly background correct and normalize the arrays.
You should not be converting to an ExpressionSet, certainly not before the normalization is complete anyway.
There is no limit to the number of features that an ExpressionSet or EListRaw object can contain. The arrays here contain 47323 probes but the dataset contains 473230 values, because there are ten samples, each with one expression value per probe. In your posts here you have confused the number of expression values in a dataset with the number of probes or features. The number of probes is the number of rows, not the total number of values (i.e., length) of the data matrix.

Having normalized the arrays, you can then proceed to a differential expression analysis:

> targets <- readSampleInfoFromGEO("GSE68849_series_matrix.txt.gz")
> Donor <- factor(targets$CharacteristicsCh1[,"donor"])
> Agent <- factor(targets$CharacteristicsCh1[,"agent"])
> data.frame(Donor,Agent)
     Donor            Agent
1  Donor 1 No virus control
2  Donor 1      Influenza A
3  Donor 2 No virus control
4  Donor 2      Influenza A
5  Donor 3 No virus control
6  Donor 3      Influenza A
7  Donor 4 No virus control
8  Donor 4      Influenza A
9  Donor 5 No virus control
10 Donor 5      Influenza A
> design <- model.matrix(~Donor+Agent)
> fit <- lmFit(y,design)
> keep <- (fit$Amean > 5)
> fit2 <- eBayes(fit[keep,], trend=TRUE)
> topTable(fit2, coef=6)
             logFC AveExpr     t  P.Value adj.P.Val    B
ILMN_2202096 -8.91    8.93 -67.4 6.16e-20  7.29e-16 32.6
ILMN_1664595 -9.09    9.50 -66.7 7.26e-20  7.29e-16 32.5
ILMN_2176225 -8.25    8.73 -62.8 1.78e-19  9.44e-16 32.0
ILMN_1675192 -8.43    9.87 -62.3 2.00e-19  9.44e-16 31.9
ILMN_1660195 -8.70    8.87 -61.6 2.35e-19  9.44e-16 31.8
ILMN_2110561 -8.60    9.27 -57.7 6.16e-19  1.78e-15 31.3
ILMN_2186745 -9.34    9.64 -57.7 6.19e-19  1.78e-15 31.3
ILMN_1781057 -8.07    9.09 -56.1 9.34e-19  2.35e-15 31.0
ILMN_1688663 -8.43    9.14 -52.5 2.50e-18  5.59e-15 30.4
ILMN_1769988 -9.03    9.83 -49.1 6.89e-18  1.34e-14 29.7