Question

block design and how it handles missing values in Limma

0

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 20 minutes ago

WEHI, Melbourne, Australia

You seem to have a major problem with your analysis, which is that you don't seem to have incorporated the dye swaps into your design matrix at all. You define a vector call 'vector', but then make no use of it. Perhaps you should investigate the use of the functions in limma such as modelMatrix() which use targets frame to construct design matrices. As far as the 'block' argument is concerned, what makes you think that it is "supposed" to return simple averages? There wouldn't be much point in this functionality if it simply returned the same answers as you would get without it. Gordon >Date: Sun, 27 Feb 2005 16:12:53 -0800 >From: "xiaocui zhu" <xzhu@caltech.edu> >Subject: [BioC] block design and how it handles missing values in > Limma >To: <bioconductor@stat.math.ethz.ch> > >Hello all, > >I have a cDNA array data sets collected from a time-course experiment. The >experiment design was similar to the following: > 1)Treat cells with ligands A, ligand B, or a vector control > 2)Harvest cells at 1h and 2h > 3)Measure expression changes in treated cells relative to >time-matched-controls using 2-color cDNA arrays with a dyeSwap design (each >treated and time-matched-control sample pairs were hybridized onto two >arrays with a dyeSwap). > >Step 1) to 3) were repeated three times, so that for each treatment >condition, we have three biological and two technical repeats. > >I used Limma to identify differentially expressed genes in response to each >ligand, and genes differentially expressed in response to ligand A vs. to >ligand B at each time point. Since each time the experiment was repeated, >the cell preparation and other experimental conditions might vary slightly, >I thought that the data collected from one experiment can be considered a >block to account for the batch variance. Parts of the codes taking into >account the dyeSwap design and block factor are as the following: > >#Identify differentially expressed genes at each time point to each ligand >treatments <- factor(c(1,1,1,1,1,1, 2,2,2,2,2,2, 3,3,3,3,3,3, 4,4,4,4,4,4)) >vector<- c(1,-1,1,-1,1,-1, 1,-1,1,-1,1,-1, 1,-1,1,-1,1,-1, >1,-1,1,-1,1,-1) >design <- model.matrix(~ 0+treatments) >colnames(design) <- c("A.1h","A.2h","B.1h","B.2h") >fit<- lmFit(MA, design, block=c(rep(c(1,1,2,2,3,3),4))) >efit <-eBayes(fit) >for (i in 1:length(colnames(design))){ > output<-topTable(efit, coef = i, number=16200, adj="fdr") > write.table(output, file = paste(colnames(design)[i], ".txt", >sep=""), sep="\t") > } > >When I examined the output files from the above codes, I was concerned that >the M value for some of the array features did not equal to the average of >the replicates, even though it's supposed to. This is only seen with >features if a pair of its dyeSwap measurements had a "NA" value in only one >of the arrays. If both arrays of a dyeSwap pair gave a "NA" value for the >feature, the M value would still be equal to the average of replicates as >it's supposed to. This problem seemed to be caused by including the block >factor in the lmFit statement, because no such inconsistency was found in >the output if I removed the block factor. I don't know whether this >inconsistency is due to some errors in my codes, or whether block design >somehow handles missing value in a dyeSwap pair differently. > >My questions are: >1) Is it appropriate to use block design in my case? >2) How does block design handles missing values of a dyeSwap pair? Why do I >see that in a block design, if a pair of dyeSwap measurements has only one >missing value for a feature, the M value of that feature does not equal to >the average of the replicates? > >Any help to this matter will be greatly appreciated! > >Xiaocui

limma limma • 1.2k views

ADD COMMENT • link 19.9 years ago Gordon Smyth 52k

score 0 · Answer 1 · 2005-02-28

Hello Gordon, > You seem to have a major problem with your analysis, which is that you > don't seem to have incorporated the dye swaps into your design matrix at > all. You define a vector call 'vector', but then make no use of it. > Perhaps > you should investigate the use of the functions in limma such as > modelMatrix() which use targets frame to construct design matrices. I am very sorry about the mistake in the code I gave. There should be a line " design<-design*vector " immediately following the definition of 'vector'. I missed the line when I was doing copy&paste between my existing script and the email. > As far as the 'block' argument is concerned, what makes you think that it > is "supposed" to return simple averages? There wouldn't be much point in > this functionality if it simply returned the same answers as you would get > without it. I assumed that the 'M' column output by the 'topTable' function returns the average of replicates. I found this was the case many times previously, so often so that I made the assumption and started to use the comparison between 'M' column of topTable output and the simple average to check if I have made any mistake in the codes. Apparently, I had jumped to the wrong conclusion. According to the definition of the 'topTable' function, "M is the estimate of the effect or the contrast, on the log2 scale". If you do not mind, I would like to hear how the calculation of this estimate is different from the simple averaging. If you haven't been completely turned off by my foolishness by now, I would like to ask for help with the two questions I posted in my first email: 1) Is it appropriate to use block design in my case? 2) How does block design handles missing values of a dyeSwap pair? Why do I see that in a block design, if a pair of dyeSwap measurements has only one missing value for a feature, the M value of that feature does not equal to the average of the replicates, while the M value equals the average of replicates without the block factor? Thanks, Xiaocui > >Date: Sun, 27 Feb 2005 16:12:53 -0800 > >From: "xiaocui zhu" <xzhu@caltech.edu> > >Subject: [BioC] block design and how it handles missing values in > > Limma > >To: <bioconductor@stat.math.ethz.ch> > > > >Hello all, > > > >I have a cDNA array data sets collected from a time-course experiment. > The > >experiment design was similar to the following: > > 1)Treat cells with ligands A, ligand B, or a vector control > > 2)Harvest cells at 1h and 2h > > 3)Measure expression changes in treated cells relative to > >time-matched-controls using 2-color cDNA arrays with a dyeSwap design > (each > >treated and time-matched-control sample pairs were hybridized onto two > >arrays with a dyeSwap). > > > >Step 1) to 3) were repeated three times, so that for each treatment > >condition, we have three biological and two technical repeats. > > > >I used Limma to identify differentially expressed genes in response to > each > >ligand, and genes differentially expressed in response to ligand A vs. to > >ligand B at each time point. Since each time the experiment was repeated, > >the cell preparation and other experimental conditions might vary > slightly, > >I thought that the data collected from one experiment can be considered a > >block to account for the batch variance. Parts of the codes taking into > >account the dyeSwap design and block factor are as the following: > > > >#Identify differentially expressed genes at each time point to each > ligand > >treatments <- factor(c(1,1,1,1,1,1, 2,2,2,2,2,2, 3,3,3,3,3,3, > 4,4,4,4,4,4)) > >vector<- c(1,-1,1,-1,1,-1, 1,-1,1,-1,1,-1, 1,-1,1,-1,1,-1, > >1,-1,1,-1,1,-1) > >design <- model.matrix(~ 0+treatments) > >colnames(design) <- c("A.1h","A.2h","B.1h","B.2h") > >fit<- lmFit(MA, design, block=c(rep(c(1,1,2,2,3,3),4))) > >efit <-eBayes(fit) > >for (i in 1:length(colnames(design))){ > > output<-topTable(efit, coef = i, number=16200, adj="fdr") > > write.table(output, file = paste(colnames(design)[i], ".txt", > >sep=""), sep="\t") > > } > > > >When I examined the output files from the above codes, I was concerned > that > >the M value for some of the array features did not equal to the average > of > >the replicates, even though it's supposed to. This is only seen with > >features if a pair of its dyeSwap measurements had a "NA" value in only > one > >of the arrays. If both arrays of a dyeSwap pair gave a "NA" value for the > >feature, the M value would still be equal to the average of replicates as > >it's supposed to. This problem seemed to be caused by including the block > >factor in the lmFit statement, because no such inconsistency was found in > >the output if I removed the block factor. I don't know whether this > >inconsistency is due to some errors in my codes, or whether block design > >somehow handles missing value in a dyeSwap pair differently. > > > >My questions are: > >1) Is it appropriate to use block design in my case? > >2) How does block design handles missing values of a dyeSwap pair? Why do > I > >see that in a block design, if a pair of dyeSwap measurements has only > one > >missing value for a feature, the M value of that feature does not equal > to > >the average of the replicates? > > > >Any help to this matter will be greatly appreciated! > > > >Xiaocui

score 0 · Answer 2 · 2005-02-28

At 01:49 PM 1/03/2005, xiaocui zhu wrote: >Hello Gordon, > > > You seem to have a major problem with your analysis, which is that you > > don't seem to have incorporated the dye swaps into your design matrix at > > all. You define a vector call 'vector', but then make no use of it. > > Perhaps > > you should investigate the use of the functions in limma such as > > modelMatrix() which use targets frame to construct design matrices. > >I am very sorry about the mistake in the code I gave. There should be a line >" design<-design*vector " immediately following the definition of 'vector'. >I missed the line when I was doing copy&paste between my existing script and >the email. > > > As far as the 'block' argument is concerned, what makes you think that it > > is "supposed" to return simple averages? There wouldn't be much point in > > this functionality if it simply returned the same answers as you would get > > without it. >I assumed that the 'M' column output by the 'topTable' function returns the >average of replicates. I found this was the case many times previously, so >often so that I made the assumption and started to use the comparison >between 'M' column of topTable output and the simple average to check if I >have made any mistake in the codes. Apparently, I had jumped to the wrong >conclusion. According to the definition of the 'topTable' function, "M is >the estimate of the effect or the contrast, on the log2 scale". If you do >not mind, I would like to hear how the calculation of this estimate is >different from the simple averaging. > >If you haven't been completely turned off by my foolishness by now, I would >like to ask for help with the two questions I posted in my first email: >1) Is it appropriate to use block design in my case? No it isn't appropriate here, for the reason explained in the "technical replication" section of the Limma User's Guide. >2) How does block design handles missing values of a dyeSwap pair? Why do >I see that in a block design, if a pair of dyeSwap measurements has only >one missing value for a feature, the M value of that feature does not equal >to the average of the replicates, while the M value equals the average of >replicates without the block factor? Fitted values end up being simple means when the design is entirely balanced. The estimation is by generalised least squares. The best reference is http://www.bioinformatics.oupjournals.org/cgi/content/abstract/bti270v 1 although here the blocking is on pairs of spots rather than blocks of arrays. Gordon >Thanks, > >Xiaocui > > > > > >Date: Sun, 27 Feb 2005 16:12:53 -0800 > > >From: "xiaocui zhu" <xzhu@caltech.edu> > > >Subject: [BioC] block design and how it handles missing values in > > > Limma > > >To: <bioconductor@stat.math.ethz.ch> > > > > > >Hello all, > > > > > >I have a cDNA array data sets collected from a time-course experiment. > > The > > >experiment design was similar to the following: > > > 1)Treat cells with ligands A, ligand B, or a vector control > > > 2)Harvest cells at 1h and 2h > > > 3)Measure expression changes in treated cells relative to > > >time-matched-controls using 2-color cDNA arrays with a dyeSwap design > > (each > > >treated and time-matched-control sample pairs were hybridized onto two > > >arrays with a dyeSwap). > > > > > >Step 1) to 3) were repeated three times, so that for each treatment > > >condition, we have three biological and two technical repeats. > > > > > >I used Limma to identify differentially expressed genes in response to > > each > > >ligand, and genes differentially expressed in response to ligand A vs. to > > >ligand B at each time point. Since each time the experiment was repeated, > > >the cell preparation and other experimental conditions might vary > > slightly, > > >I thought that the data collected from one experiment can be considered a > > >block to account for the batch variance. Parts of the codes taking into > > >account the dyeSwap design and block factor are as the following: > > > > > >#Identify differentially expressed genes at each time point to each > > ligand > > >treatments <- factor(c(1,1,1,1,1,1, 2,2,2,2,2,2, 3,3,3,3,3,3, > > 4,4,4,4,4,4)) > > >vector<- c(1,-1,1,-1,1,-1, 1,-1,1,-1,1,-1, 1,-1,1,-1,1,-1, > > >1,-1,1,-1,1,-1) > > >design <- model.matrix(~ 0+treatments) > > >colnames(design) <- c("A.1h","A.2h","B.1h","B.2h") > > >fit<- lmFit(MA, design, block=c(rep(c(1,1,2,2,3,3),4))) > > >efit <-eBayes(fit) > > >for (i in 1:length(colnames(design))){ > > > output<-topTable(efit, coef = i, number=16200, adj="fdr") > > > write.table(output, file = paste(colnames(design)[i], ".txt", > > >sep=""), sep="\t") > > > } > > > > > >When I examined the output files from the above codes, I was concerned > > that > > >the M value for some of the array features did not equal to the average > > of > > >the replicates, even though it's supposed to. This is only seen with > > >features if a pair of its dyeSwap measurements had a "NA" value in only > > one > > >of the arrays. If both arrays of a dyeSwap pair gave a "NA" value for the > > >feature, the M value would still be equal to the average of replicates as > > >it's supposed to. This problem seemed to be caused by including the block > > >factor in the lmFit statement, because no such inconsistency was found in > > >the output if I removed the block factor. I don't know whether this > > >inconsistency is due to some errors in my codes, or whether block design > > >somehow handles missing value in a dyeSwap pair differently. > > > > > >My questions are: > > >1) Is it appropriate to use block design in my case? > > >2) How does block design handles missing values of a dyeSwap pair? Why do > > I > > >see that in a block design, if a pair of dyeSwap measurements has only > > one > > >missing value for a feature, the M value of that feature does not equal > > to > > >the average of the replicates? > > > > > >Any help to this matter will be greatly appreciated! > > > > > >Xiaocui