Question

sample size determination in microarrays studies with limma analysis

0

Entering edit mode

eleonoregravier ▴ 70

@eleonoregravier-8219

Last seen 2.7 years ago

France

Hi BioC community,

I study the gene expression profiles of three groups by microarray data. The aim is to perform the pairwise comparisons of the three groups. The sample size of each of the group is n=10.

GCRMA normalization and filtering were applied before using limma to perform differential analysis.

Unfortunately, no gene is significantly differentially expressed after limma analysis (FDR at 5%).

I think that the sample size of my groups is too small.

That's why I want to compute the sample size which would be sufficient to get for example 100 significant genes.

Is there a package dedicated to sample size/power determination with limma analysis ? Which package do you recommend to assess sample size/power in microarray studies ?

Thanks a lot,

Bests,

Eléonore

limma sample size microarrays statistical power • 2.7k views

ADD COMMENT • link 7.0 years ago eleonoregravier ▴ 70

score 2 · Answer 1 · 2018-04-04

It is pretty easy to home-brew your own power analysis, especially given that you already have some "pilot" data.

The key to making it realistic is to pull out the empirical Bayes statistics from your current dataset.

df0 <- median(fit$df.prior) # median, just in case you used robust=TRUE
s2 <- fit$s2.prior # might be a vector if trend=TRUE
ameans <- fit$Amean

Now, let's say you want to see the effect of doubling the number of samples in each group. For simplicity, I'll only deal with two groups. We generate a design matrix where the second coefficient represents the effect of interest.

grouping <- gl(2, 20)
nlibs <- length(grouping)
new.design <- model.matrix(~grouping)

We now do some stuff to generate a pretend dataset with similar parameters to the pilot:

ngenes <- length(ameans)

# Getting true variances for each gene
sigma <- sqrt(s2 / rchisq(ngenes, df=df0))

# Generating the residual effects.
resid.effects <- matrix(rnorm(ngenes*nlibs, sd=sigma), nrow=ngenes)
resid.effects[,seq_len(ncol(new.design))] <- 0

# Generating the residual matrix.
resids <- t(qr.qty(qr(new.design), t(resid.effects)))

# Adding back the abundances.
new.y <- resids + ameans

# Adding a specified effect for a random 10% of genes.
# In this case, a log-fold change of 1.5.
chosen <- sample(ngenes, round(ngenes/10))
effect <- sample(c(-1.5, 1.5), length(chosen), replace=TRUE)
new.y[chosen,] <- new.y[chosen,] + outer(effect, new.design[,2], "*")

Then we can analyze new.y with limma again, and see how many of the DE genes are detected:

new.fit <- lmFit(new.y, new.design)
new.fit <- eBayes(new.fit)
table(decideTests(new.fit)[chosen,2])

Rinse and repeat until you get an experimental design with your desired detection rate. You can also tune the generation of the new dataset if you think your genes of interest are low or high-abundance (in which case, you just sample from the relevant interval).

I should add that the above simulation is not entirely correct, as the addition of the effect will change the average abundance of the simulated genes in new.y. I'm not entirely sure how to avoid that for an arbitrary design matrix; I guess one could define something like:

new.design2 <- sweep(new.design, 2, colMeans(new.design), "-")
new.design2[,"(Intercept)"] <- 1

... and use that instead, which ensures that only the intercept needs to be specified to determine the average abundance.

score 1 · Answer 2 · 2018-04-04

1

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 2 hours ago

United States

The sizepower package comes to mind.

ADD COMMENT • link 7.0 years ago James W. MacDonald 68k

score 0 · Answer 3 · 2018-04-05

0

Entering edit mode

eleonoregravier ▴ 70

@eleonoregravier-8219

Last seen 2.7 years ago

France

Thanks a lot Aaron for your detailed answer.

The idea here is to simulate a dataset with the same caracteristics than the pilot dataset and to increase the sample size to reach the desired number of significant genes.

I do not understand all your computations but I have some questions :

- Does this simulation keep the same correlation structure between genes than the one in the pilot dataset ?

- Why do you choose 10% of genes with an effect ? This parameter is very important in the sample size calculation : how do you estimate it from the pilot dataset ?

ADD COMMENT • link 7.0 years ago eleonoregravier ▴ 70

1

Entering edit mode

No. But that shouldn't matter for a standard DE analysis, where information is only shared between genes to calculate the empirical Bayes shrinkage statistics. And you have far more important things to worry about than recapitulating the correlation structure to encourage detection of co-regulated genes.
I made up the 10%. I also made up the 1.5 log-fold change. These are parameters that you always have to decide yourself when you do any power analysis. You could try to estimate the proportion of true nulls using Storey's method on the p-value distribution from your pilot data; but if your pilot didn't have enough power, the distribution of p-values under the alternative would probably be close to uniform, resulting in overestimation of the number of true nulls. Similar issues apply to estimation of the log-fold changes. After all, if you were able to estimate these parameters precisely from the pilot data, then you should have been able to identify some DE genes in the first place.

For the log-fold change, I would just go for 1, as this is the minimum I would consider for further experimental work. For the number of genes - this depends on how similar your biological groups are. Perhaps there are only 100 DEGs; maybe there are thousands. If you knew that already, you wouldn't need to do any experiments.

ADD REPLY • link 7.0 years ago Aaron Lun ★ 28k

0

Entering edit mode

Unless you're answering your own question, don't use "Add answer". Use the "ADD COMMENT" button instead.

ADD REPLY • link 7.0 years ago Aaron Lun ★ 28k