Dear all,
I would like to use limma package to find differential expressed genes.
I'm working on miRNA microarray data.
From the technical part:
We have used agilent v3.0 8x15K microarray.
The chip contains different probes for 2006 miRNAs.
PROBES without CONTROLS: 60180 ----------------------------------
(Non-CTRL) Unique Probe: 4774 (Non-CTRL) Unique Genes: 2006 ---------------------------------- DISTRIBUTION OF REPLICATED NonControl Probes reps 7 8 10 15 30 630 630 414 3094 6 ------------------------------------------------------ Replication at Probe level- MEDIAN CV
Each gene has 30 probes (one or more than 2 different probes).
As you can see most of probes are replicated 15 times across the chip.
In summary: Each gene has one or more probes and a total of 30.
The distribution of the replicates on the chip is not side-by-side.
From the biological part:
We took 2 samples from each patient (day 2 and 7) from control and obese patients. We have a total of 16 patients, so we have 32 samples.
So we have 2 levels in the analysis: day and disease.
OK! Now the question is...
We don't know if we could/should apply within-array replicate spots.
Following these recomendations:
Limma Design for Paired Samples Question http://www.statsci.org/smyth/pubs/2003to2014.html
we have treated
Person <- factor(Woman) ## It took values from 1 to 16
Treatmetn = Group ## It took values obesity_2, obesity_7, control_2, control_7
design <- model.matrix(~0+Treatment)
colnames(design) <- levels(Treatment)
corfit <- duplicateCorrelation(normalized_matrix,design,block=Person) ## normalized with background and betweenArray corrections.
corfit$consensus.correlation
And we've used the correlation to fit the model
fit <- lmFit(valores.pca_all, design, block=Person,correlation=corfit$consensus)
And then we've made the contrasts:
contrastes <- makeContrasts (
c1 = control_7 - control_2,
c2 = obesity_2 - control_2,
c3 = obesity_7 - obesity_2,
c4 = obesity_7 - control_7,
levels = design)
fit2=contrasts.fit(fit,contrastes)
fit2=eBayes(fit2)
However, we don't know if we should use or not within-array replicate spots. And if it is the case, how could we apply it.
http://www.statsci.org/smyth/pubs/dupcor.pdf
I've read your paper, but I'm not statistician so...
Is the following code correct ?
NDUPS = 30 ## we have 30 probes for each gene
corfit <- duplicateCorrelation (expr, design = des.mat, ndups = NDUPS, spacing = 1)
fit <- lmFit (expr, design = des.mat, ndups = NDUPS, spacing=1, correlation = corfit$consensus)
Thanks in advance
Thanks a lot Aaron Lun.
I'm using avereps before using duplicateCorrelation on patients.
But I don't know which of both approaches would be better.
Results using spots or patients are very different. That's why I would like to know if the following code is correct:
DUPS = 30 ## we have 30 probes for each gene (they can be different)
corfit <- duplicateCorrelation (expr, design = des.mat, ndups = NDUPS, spacing = 1)
fit <- lmFit (expr, design = des.mat, ndups = NDUPS, spacing=1, correlation = corfit$consensus)
I assume you've set it up so that the duplicate spots are side-by-side in
exprs
, otherwisespacing=1
would be incorrect. What exactly do you mean by "they can be different"? Because if any genes have a different number of spots, then many of the assignments will probably be wrong if you assume that they've all got 30.Anyway, modelling the correlation due to the patient factor seems more important than modelling the correlation between spots, given that patient-to-patient (biological) variability should be much larger than spot-to-spot (technical) variability. You haven't shown what
des.mat
is, but I assume that you've used the same~Treatment
formula. Presumably you haven't included a patient blocking factor; not that it would help, because that precludes an obese/normal comparison.Thanks a lot for your help. I hope this will be my last question :-)
This is my code (for the 2 approaches)
When I started the analysis I used the following code. But I didn't get any DEG.
x <-read.maimages(targets_all$FileName, source="agilent", green.only = TRUE)
colnames(x$E) <- targets_all$Code
x <- backgroundCorrect(x, method="normexp", offset = 0)
y <- normalizeBetweenArrays(x, method = "quantile")
y <- y[y$genes$ControlType == 0,]
################ pipeline 1: duplicatedCorrelation Between Women
y.ave <- avereps(y, ID=y$genes$SystematicName)
dim(y.ave) ### 2006 miRNAs and 32 samples
Person <- factor(targets_all$Woman)
Treat = targets_all$Group
design <- model.matrix(~0+Treat)
colnames(design) <- levels(Treat)
corfit <- duplicateCorrelation(y.ave$E,design,block=Person)
corfit$consensus.correlation
contrastes <- makeContrasts (
c1 = control_lh7 - control_lh2,
c2 = obesity_lh2 - control_lh2,
c3 = obesity_lh7 - obesity_lh2,
c4 = obesity_lh7 - control_lh7,
levels = design)
## fit model
fit <- lmFit(y.ave$E,design,block=Person,correlation=corfit$consensus.correlation)
## contrastes
fit2=contrasts.fit(fit,contrastes)
fit2=eBayes(fit2)
res_limma_m1_c1 <- topTable(fit2, coef = 1, number = 2006) ## 0 deg
res_limma_m1_c2 <- topTable(fit2, coef = 2, number = 2006) ## 0 deg
res_limma_m1_c3 <- topTable(fit2, coef = 3, number = 2006) ## 0 deg
res_limma_m1_c4 <- topTable(fit2, coef = 4, number = 2006) ## 0 deg
So, I decided to use correlation between spots and I got a lot of DEG
############### pipeline 2: Correlation within array: spots
expr <- y$E ## i'm using background corrected and normalized data (is it ok?) at probe level
rownames (expr) <- y$genes$SystematicName
table(table(rownames(expr))) ## each gene 30 spots: but probe can be different
ord <- order (rownames (expr))
expr <- expr[ord,]
expr[1:3,]
## Estimate the correlation structure for the genes / miRNAs
NDUPS <- unique (table (rownames (expr))) ## 30
if (length (NDUPS) > 1) stop ("NOT THE SAME NUMBER OF DUPLICATED SPOTS")
dupcor <- duplicateCorrelation (expr, design = design, ndups = NDUPS, spacing = 1)
dupcor$consensus
## fit the limma model with correlation
fit <- lmFit (expr, design = design, ndups = NDUPS, spacing=1, correlation = dupcor$consensus)
fit <- contrasts.fit (fit, contrastes)
fit2=eBayes(fit)
res_limma_m2_c1 <- topTable(fit2, coef = 1, number = 2006) ## 175 deg
res_limma_m2_c2 <- topTable(fit2, coef = 2, number = 2006) ## 27 deg
res_limma_m2_c3 <- topTable(fit2, coef = 3, number = 2006) ## 246 deg
res_limma_m2_c4 <- topTable(fit2, coef = 4, number = 2006) ## 193 deg
Do you think the both approches (and code) are correct?
In this case, why do you think the results are so different?
Huge thanks in advance :-D
Getting more DE genes doesn't make the analysis correct. Because your second approach doesn't consider the correlations due to the patient factor, it will likely be liberal as it overstates the amount of information that is present across dependent samples. This is why you get more DE genes at a nominal FDR threshold of (presumably) 5%, compared to the first analysis. Many of these are probably false positives, at least at this FDR threshold.
In short, the first approach seems a lot more valid to me. If you don't get any DE genes, then so be it. It's better to get nothing from a correct analysis than to get something from an incorrect one. In any case, you can just relax the FDR threshold in the first analysis to pick up some DE genes. Sure, the evidence for DE will be weaker, but at least you know it's weaker, whereas the second analysis will give you a false sense of security about the reliability of your results.
Thank you :-)