Help on factorial experiment analysis using limma
2
0
Entering edit mode
Yuk Fai Leung ▴ 100
@yuk-fai-leung-416
Last seen 10.2 years ago
Hi there, I have an experiment that looks like factorial experiment and I want to give it a try to use limma for its analysis. And I have a few questions The experimental scheme looks like this: 0: not-treated with the respective factor a: treated by factor a b: treated by factor b ab: treated by both factor a & b 00 -------> 0b | | | | | | | | v v a0 -------> ab There are also two diagonal hybridizations which I can't easily draw here: 00 ------> ab 0b ------> a0 The arrow points towards RNA sample labeled with Cy5 and I have two technical replications for each arrow. Therefore there are 6 experiments * 2 replicates = 12 arrays Here are my questions: 1. Is my experimental design suitable for limma analysis? 2. Should I combine the technical replicates before I calculate the liner model by lm.series, or just treat the 12 arrays like individual experiment and enter them to 12 different columns of the M matrix for lm.series? If I should combine them beforehand, how should I do that? 3. Is the following design matrix correct for my experiment? If I can treat the technical replicates as individual experiment, should I duplicate the each row of the design matrix to reflect the replicated data in the M matrix? b a ba 00 -> a0 0 1 0 0b -> ab 0 1 1 00 -> 0b 1 0 0 a0 -> ab 1 0 1 00 -> ab 1 1 1 0b -> a0 -1 1 0 4. Could someone give me some pointers on using heatdiagram to interpret my data? For example, if I am interested in the physiological effect of factor a, but the factor b is a potential confounding factor, what would the expression level of those genes only being regulated by factor a look like in the diagram? Besides, how can I interpret the interaction effect of a & b? Or is there other method to do these? Thanks in advance! Best regards, Fai ________ Yuk Fai Leung Bauer Center for Genomics Research Harvard University 7 Divinity Avenue Cambridge, MA 02138 Tel: 617-496-7134 Fax: 617-495-2196 email: yfleung@cgr.harvard.edu; yfleung@genomicshome.com URL: http://genomicshome.com
limma limma • 1.5k views
ADD COMMENT
0
Entering edit mode
@gordon-smyth
Last seen 33 minutes ago
WEHI, Melbourne, Australia
At 10:15 AM 13/09/2003, Yuk Fai Leung wrote: >Hi there, > >I have an experiment that looks like factorial experiment and I want to give >it a try to use limma for its analysis. And I have a few questions > >The experimental scheme looks like this: > >0: not-treated with the respective factor >a: treated by factor a >b: treated by factor b >ab: treated by both factor a & b > >00 -------> 0b >| | >| | >| | >| | >v v >a0 -------> ab > >There are also two diagonal hybridizations which I can't easily draw here: > >00 ------> ab >0b ------> a0 > >The arrow points towards RNA sample labeled with Cy5 and I have two >technical replications for each arrow. Therefore there are 6 experiments * 2 >replicates = 12 arrays This is a saturated direct design for a two-way factorial experiment. Good. >Here are my questions: > >1. Is my experimental design suitable for limma analysis? Certainly. >2. Should I combine the technical replicates before I calculate the liner >model by lm.series, or just treat the 12 arrays like individual experiment >and enter them to 12 different columns of the M matrix for lm.series? If I >should combine them beforehand, how should I do that? This is a good question. For the purposes of the limma analysis, I think I would treat the technical reps as ordinary reps, i.e., treat the experiment has having 12 independent arrays. This will have the consequence that the standard errors from the analysis will be slightly under-estimated, i.e., the significance results will be slightly over-stated, but the ranking of your genes in terms of evidence for differential expression will be close to optimal. >3. Is the following design matrix correct for my experiment? Yes, although of course it represents only one possible parametrization. You might investigate the functions makeContrasts() and contrasts.fit() if you want to make comparisons from the experiment other than the main effects and the interaction term represented by the columns of the design matrix. > If I can treat >the technical replicates as individual experiment, should I duplicate the >each row of the design matrix to reflect the replicated data in the M >matrix? Yes. > b a ba >00 -> a0 0 1 0 >0b -> ab 0 1 1 >00 -> 0b 1 0 0 >a0 -> ab 1 0 1 >00 -> ab 1 1 1 >0b -> a0 -1 1 0 > >4. Could someone give me some pointers on using heatdiagram to interpret my >data? For example, if I am interested in the physiological effect of factor >a, but the factor b is a potential confounding factor, what would the >expression level of those genes only being regulated by factor a look like >in the diagram? Besides, how can I interpret the interaction effect of a & >b? Or is there other method to do these? More good questions. The design matrix that you've written above corresponds to a classical interaction parametrization. Here the column 'ab' corresponds to extra effect that 'a' has in the presence of 'b'. The effect of 'a' by itself (a0-00) is represented by the coefficient 'a' and the effect of 'a' in the presence of 'b' (ab-0b) is represented by the sum of the coefficients for 'a' and 'ba'. If 'b' is a confounding factor, then you probably want to have the effects for 'a' with and without 'b' in your heatdiagram. You could do this by fit <- lm.series(MA, design) contrast.matrix <- makeContrasts(a,a+ba,levels=design) fit2 <- contrasts.fit(fit, contrast.matrix) eb2 <- ebayes(fit2) heatdiagram(stat=eb2$t,coef=fit2$coef) This would show whether genes which respond to factor 'a' still respond to 'a' in the presence of 'b', and whether in the same direction. Note that makeContrasts() is available only in the development version of limma. I am in the process of changing the syntax of heatdiagram to work with classifyTests() and vennDiagram(). See heatDiagram(). Regards Gordon >Thanks in advance! > >Best regards, >Fai >________ >Yuk Fai Leung >Bauer Center for Genomics Research >Harvard University >7 Divinity Avenue >Cambridge, MA 02138 >Tel: 617-496-7134 >Fax: 617-495-2196 >email: yfleung@cgr.harvard.edu; yfleung@genomicshome.com >URL: http://genomicshome.com
ADD COMMENT
0
Entering edit mode
Yuk Fai Leung ▴ 100
@yuk-fai-leung-416
Last seen 10.2 years ago
Dear Gordon, Thanks a lot for your prompt reply. I have more questions. > This is a saturated direct design for a two-way factorial experiment. > Good. Is it still ok to use limma to analyze the data if one or both diagonal experiment arrow is missing? I just realize I have another experiment that I want to discard the diagonal experiments for some reasons. It seems I can still run my analysis on the data without the diagonals by limma, but I have no clue whether this is a legitimate analysis or not. > This is a good question. For the purposes of the limma analysis, I think I > would treat the technical reps as ordinary reps, i.e., treat the > experiment > has having 12 independent arrays. This will have the consequence that the > standard errors from the analysis will be slightly under-estimated, i.e., > the significance results will be slightly over-stated, but the ranking of > your genes in terms of evidence for differential expression will be close > to optimal. This reminds me a very general question with regard to replication I have had for a while. What is the proper way to analyze the replicated data if there are both biological replication and technical replication in the raw data? Consider an example in which there are a hundred samples from different cancer patients and the microarray experiment for each sample is repeated three times. I heard some people would treat both biological and technical replicates equally in this case. But isn't it true that the technical replicates would have smaller variance and are somehow related with each other and should be treated differently? > More good questions. The design matrix that you've written above > corresponds to a classical interaction parametrization. Here the column > 'ab' corresponds to extra effect that 'a' has in the presence of 'b'. The > effect of 'a' by itself (a0-00) is represented by the coefficient 'a' and > the effect of 'a' in the presence of 'b' (ab-0b) is represented by the sum > of the coefficients for 'a' and 'ba'. If 'b' is a confounding factor, then > you probably want to have the effects for 'a' with and without 'b' in your > heatdiagram. You could do this by > > fit <- lm.series(MA, design) > contrast.matrix <- makeContrasts(a,a+ba,levels=design) > fit2 <- contrasts.fit(fit, contrast.matrix) > eb2 <- ebayes(fit2) > heatdiagram(stat=eb2$t,coef=fit2$coef) > > This would show whether genes which respond to factor 'a' still respond to > 'a' in the presence of 'b', and whether in the same direction. Note that > makeContrasts() is available only in the development version of limma. Could you give me further assistance on understanding the biological meaning of the heatdiagram? I know red means the gene is upregulated and green means the gene is downregulated. Does white color mean the statistical test is not significant for that gene? In my particular example, what does it mean when the gene i is i. red for a and green/less red for a+ba? Does it mean b is suppressing the effect of a on this gene? ii. red for a and white for a+ba? Does it mean b has no effect on this gene while a is upregulating it? iii. red for a and more red for a+ba? Does it mean a and b are both upregulating gene i? (I won't consider the green situations for a because they are essentially the same) This question sounds a little bit stupid, but I believe a step-by-step guide is actually very helpful for biologist like me to understand the result. It would be nice to have more tutorials on how to use heatdiagram, make contrast, venndiagram etc to interpret the biological meaning of analysis result in the manual of your future version of limma. Thanks very much! Best regards, Fai
ADD COMMENT
0
Entering edit mode
> Dear Gordon, > > Thanks a lot for your prompt reply. I have more questions. > >> This is a saturated direct design for a two-way factorial experiment. >> Good. > > Is it still ok to use limma to analyze the data if one or both diagonal > experiment arrow is missing? I just realize I have another experiment > that I want to discard the diagonal experiments for some reasons. It > seems I can still run my analysis on the data without the diagonals by > limma, but I have no clue whether this is a legitimate analysis or not. Yes, you can still run your analysis in limma with no problem, except of course that with less data you will be able to make comparisons with slightly less precision. >> This is a good question. For the purposes of the limma analysis, I >> think I would treat the technical reps as ordinary reps, i.e., treat >> the experiment >> has having 12 independent arrays. This will have the consequence that >> the standard errors from the analysis will be slightly >> under-estimated, i.e., the significance results will be slightly >> over-stated, but the ranking of your genes in terms of evidence for >> differential expression will be close to optimal. > > This reminds me a very general question with regard to replication I > have had for a while. What is the proper way to analyze the replicated > data if there are both biological replication and technical replication > in the raw data? Consider an example in which there are a hundred > samples from different cancer patients and the microarray experiment for > each sample is repeated three times. I heard some people would treat > both biological and technical replicates equally in this case. But isn't > it true that the technical replicates would have smaller variance and > are somehow related with each other and should be treated differently? Yes. An explicit treatment of technical replicates will be a future version of limma. >> More good questions. The design matrix that you've written above >> corresponds to a classical interaction parametrization. Here the >> column 'ab' corresponds to extra effect that 'a' has in the presence >> of 'b'. The effect of 'a' by itself (a0-00) is represented by the >> coefficient 'a' and the effect of 'a' in the presence of 'b' (ab- 0b) >> is represented by the sum of the coefficients for 'a' and 'ba'. If 'b' >> is a confounding factor, then you probably want to have the effects >> for 'a' with and without 'b' in your heatdiagram. You could do this by >> >> fit <- lm.series(MA, design) >> contrast.matrix <- makeContrasts(a,a+ba,levels=design) >> fit2 <- contrasts.fit(fit, contrast.matrix) >> eb2 <- ebayes(fit2) >> heatdiagram(stat=eb2$t,coef=fit2$coef) >> >> This would show whether genes which respond to factor 'a' still >> respond to 'a' in the presence of 'b', and whether in the same >> direction. Note that makeContrasts() is available only in the >> development version of limma. > > Could you give me further assistance on understanding the biological > meaning of the heatdiagram? I can explain the statistical meaning. The biological interpretation is really more in the domain of biologists such as yourself. > I know red means the gene is upregulated and > green means the gene is downregulated. Does white color mean the > statistical test is not significant for that gene? Yes. > In my particular > example, what does it mean when the gene i is > > i. red for a and green/less red for a+ba? Does it mean b is suppressing > the effect of a on this gene? red for a and green for a+ba (which remember means the comparison ab- 0b) means that a upregulates the gene but b *reverses* the effect of a. red for a and less red for a+ab could mean that b is reducing the effect of a. The interaction term ab tells you whether this reduction is actually significant. > ii. red for a and white for a+ba? Does it mean b has no effect on this > gene while a is upregulating it? No. a upregulates the gene but b blocks the effect of a. > iii. red for a and more red for a+ba? Does it mean a and b are both > upregulating gene i? Yes. a upregulates the gene and even more so in the presence of b. > (I won't consider the green situations for a because they are > essentially the same) > > This question sounds a little bit stupid, but I believe a step-by- step > guide is actually very helpful for biologist like me to understand the > result. It would be nice to have more tutorials on how to use > heatdiagram, make contrast, venndiagram etc to interpret the biological > meaning of analysis result in the manual of your future version of > limma. If you or someone else could provide me with an interesting data set on which the illustrate the methods, then I could include it in the User's Guide. At present I do not have permission to make public any data sets on which I have used these methods. Gordon > Thanks very much! > > Best regards, > Fai
ADD REPLY
0
Entering edit mode
Dear Fai, > This reminds me a very general question with regard to replication I have > had for a while. What is the proper way to analyze the replicated data if > there are both biological replication and technical replication in the raw > data? Consider an example in which there are a hundred samples from > different cancer patients and the microarray experiment for each sample is > repeated three times. I heard some people would treat both biological and > technical replicates equally in this case. But isn't it true that the > technical replicates would have smaller variance and are somehow related > with each other and should be treated differently? I think so. Gordon and I had an email exchange about these issues on late august (19 to 21). Just in case this could be of any help, this is what I ended up doing. I had a design with treatment and control hybridized on the same array. There were 3 biological replicates and 2 technical replicates for each biological replicate (with dye swap on the technical repl). Then, this is very much like a one-sample t-test (ignoring the dye-swap and technical replicates issues). My first try was to use linear mixed effects models (with library nlme); fit a model to just the intercept, and then re-organize the output of lme, and use, for example, ebayes. However, probably because of the tiny sample sizes, I was occasionally getting funny results from lme that didn't seem to make sense. In addition there could be very differences between analyzing all data ignoring technical replication (3 df) or analyzing the means of the technical replicates (2 df) (and my collaborators had made clear that they needed, for other various reasons, "hard rules" based on p-values). Thus, I finally decided to obtain the means of the technical replicates, and analyze the data as in the swirl example (like a one-sample t-test); this is a conservative approach and my collaborators and I agreed that it would be a lot easier to explain and to defend, specially with such small sample sizes. Of course, if everything works as it should, the estimates of the t-statistic from using the t-test on the means of replicates and the lme t-statistic should be very, very similar, but because of the differences in the df we can end up with differences in the p-values. Below is a function I concocted quickly to see how things differed between limma, lme and t-test on the means of the technical replicates. I used it to try to understand what might happen. Best, Ram?n ********************************************* library(nlme) library(limma) options(contrasts = c("contr.helmert", "contr.poly")) f3 <- function(effect.size = 0, s.biol = 2, s.tech = 1, biological.repls = 5, n.tech.reps = 10){ ## For one-sample comparisons, with dye-swaps ## generates data, and fits models using both limma ## and lme and one-sample t-tests on the mean of the ## technical replicates if(getOption("contrasts")[1] != "contr.helmert") stop("You need to set Helmert contrasts in options") if(n.tech.reps %% 2) stop("Need even number of n.tech.reps") ## simulate data tech.rep.effect <- rnorm(biological.repls, mean = 0, sd = s.biol) tech.rep.effect <- rep(tech.rep.effect, rep(n.tech.reps, biological.repls)) data.l <- data <- rnorm(biological.repls * n.tech.reps, mean = 0, sd = s.tech) + tech.rep.effect + effect.size dim(data.l) <- c(1, biological.repls * n.tech.reps) ## design matrices, etc. tech.reps <- factor(rep(1:biological.repls, rep(n.tech.reps, biological.repls))) dm1 <- model.matrix(data ~ tech.reps) dye.swap.codes <- rep(c(-1, 1), (n.tech.reps * biological.repls)/2) data.sp <- data.l * dye.swap.codes data.s <- data * dye.swap.codes ## model fitting tmp1 <- lm.series(data.l, dm1) tmp2 <- lm.series(data.sp, dm1 * dye.swap.codes) tmp3 <- summary(lme(data ~ 1, random = ~ 1|tech.reps, control = list(tolerance = 1e-10, msTol = 1e-10))) tmp4 <- summary(lme(data.s ~ -1 + dye.swap.codes, random = ~ 1|tech.reps, control = list(tolerance = 1e-10, msTol = 1e-10))) tmp1 <- c(tmp1$coefficients[[1]], tmp1$stdev.unscaled[[1]], tmp1$sigma, tmp1$sigma *tmp1$stdev.unscaled[[1]], tmp1$df) names(tmp1) <- c("coeff", "unscaled.stdev", "sigma", "scaled se", "df") tmp2 <- c(tmp2$coefficients[[1]], tmp2$stdev.unscaled[[1]], tmp2$sigma, tmp2$sigma *tmp2$stdev.unscaled[[1]], tmp2$df) cat("\n\n *************\n\n") mean.by.tech.rep <- tapply(data, tech.reps, mean) the.t.results <- unlist(t.test(mean.by.tech.rep)[1:6]) the.t.se <- sqrt(var(mean.by.tech.rep)/length(unique(tech.reps))) lme.and.t <- rbind(lme.positivized.t.table = tmp3$tTable, lme.swap.design.t.table = tmp4$tTable, t.test.results = c(the.t.results[6], the.t.se, the.t.results[c(2, 1, 3)])) rownames(lme.and.t) <- c("lme.positivized", "lme.swap", "t.test") return( lme.and.limma = rbind( lm.series.positivized = tmp1, lm.series.swap.design = tmp2, lme.positivized = c(tmp3$coefficients[[1]], sqrt(tmp3$varFix)/tmp3$sigma, tmp3$sigma, sqrt(tmp3$varFix), tmp3$tTable[[3]]), lme.swap.design = c(tmp4$coefficients[[1]], sqrt(tmp4$varFix)/tmp4$sigma, tmp4$sigma, sqrt(tmp4$varFix), tmp4$tTable[[3]])), lme.and.t) } ## an example call: f3(2, 2, 1, 3, 2) > > > More good questions. The design matrix that you've written above > > corresponds to a classical interaction parametrization. Here the column > > 'ab' corresponds to extra effect that 'a' has in the presence of 'b'. The > > effect of 'a' by itself (a0-00) is represented by the coefficient 'a' and > > the effect of 'a' in the presence of 'b' (ab-0b) is represented by the > > sum of the coefficients for 'a' and 'ba'. If 'b' is a confounding factor, > > then you probably want to have the effects for 'a' with and without 'b' > > in your heatdiagram. You could do this by > > > > fit <- lm.series(MA, design) > > contrast.matrix <- makeContrasts(a,a+ba,levels=design) > > fit2 <- contrasts.fit(fit, contrast.matrix) > > eb2 <- ebayes(fit2) > > heatdiagram(stat=eb2$t,coef=fit2$coef) > > > > This would show whether genes which respond to factor 'a' still respond > > to 'a' in the presence of 'b', and whether in the same direction. Note > > that makeContrasts() is available only in the development version of > > limma. > > Could you give me further assistance on understanding the biological > meaning of the heatdiagram? I know red means the gene is upregulated and > green means the gene is downregulated. Does white color mean the > statistical test is not significant for that gene? In my particular > example, what does it mean when the gene i is > > i. red for a and green/less red for a+ba? Does it mean b is suppressing the > effect of a on this gene? > ii. red for a and white for a+ba? Does it mean b has no effect on this gene > while a is upregulating it? > iii. red for a and more red for a+ba? Does it mean a and b are both > upregulating gene i? > (I won't consider the green situations for a because they are essentially > the same) > > This question sounds a little bit stupid, but I believe a step-by- step > guide is actually very helpful for biologist like me to understand the > result. It would be nice to have more tutorials on how to use heatdiagram, > make contrast, venndiagram etc to interpret the biological meaning of > analysis result in the manual of your future version of limma. > > Thanks very much! > > Best regards, > Fai > > _______________________________________________ > Bioconductor mailing list > Bioconductor@stat.math.ethz.ch > https://www.stat.math.ethz.ch/mailman/listinfo/bioconductor -- Ram?n D?az-Uriarte Bioinformatics Unit Centro Nacional de Investigaciones Oncol?gicas (CNIO) (Spanish National Cancer Center) Melchor Fern?ndez Almagro, 3 28029 Madrid (Spain) Fax: +-34-91-224-6972 Phone: +-34-91-224-6900 http://bioinfo.cnio.es/~rdiaz
ADD REPLY

Login before adding your answer.

Traffic: 915 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6