limma, subsets of design matrix and p-values

0

Entering edit mode

Mike Schaffer ▴ 90

@mike-schaffer-424

Last seen 10.2 years ago

I've run limma for a few months and had a question about the p-values being calculated from a large set of data vs. just a subset. All of my 2-color array data is relative to an untreated sample and each is replicated three times. For example: Treatment1 vs untreated x 3, Treatment2 vs. untreated x 3, Treatment3 vs. untreated x 3 ...etc. So I have a large MA object of all the data that is normalized within arrays and a design matrix created by: design <- modelMatrix(targets,ref="untreated") My confusion stems from the fact that if I run eBayes on the entire dataset (code below), I get different p-values (but same M values) for the TreatmentX vs. untreated, than if I fit the subset that only includes data for one treatment (e.g. just Treatment 1 vs. untreated). For example, fit <- lmFit(MA,design=design) eb <- eBayes(fit) eb$p.values[1:10,1] gives different p-values than if I were to only initially subset on the Treatment1 vs. untreated data, and then run lmFit. For example, design <- modelMatrix(targets[1:3,],ref="untreated") fit <- lmFit(MA[,1:3],design=design) # the first three data sets include ALLof the Treatment1 vs. untreated data eb <- eBayes(fit) eb$p.values[1:10] Am I incorrect to assume that the p-values should be the same regardless of how much data is included in the MA object, as long as the design matrix has no overlap between experiments (e.g. treatment1 vs. untreated data is distinct from treatment2 vs. untreated data) -- aside from the fact they are all relative to an untreated sample? Or is the moderated t-statistic based on ALL of the data in the MA object regardless of each experiment's relationship to others? I'd like to read in all the data and keep it in one large RG and MA object. If I'm looking to determine the p-values for genes induced relative to untreated by just one of the treatments, should I use lmFit on the large MA object or should it be subset first (e.g MA[,1:3]) before doing the linear fit? Am I missing something? Thanks, in advance, for any help.

limma limma • 1.7k views

ADD COMMENT • link updated 19.7 years ago by Naomi Altman ★ 6.0k • written 19.7 years ago by Mike Schaffer ▴ 90

0

Entering edit mode

Naomi Altman ★ 6.0k

@naomi-altman-380

Last seen 3.6 years ago

United States

Analysis of variance (ANOVA) and related methods such as SAM and eBayes in Limma produce p-values that depend on the entire data array, not just the treatments of interest. There are 3 main reasons for this. In the discussion below "t-value" refers to the ordinary t-value for the contrast (for ANOVA), a rank version (for tests like the Wilcoxon), permutation versions (for SAM and other permutation routines) and (e)Bayesian versions for routines like Limma eBayes. All the available data are used to estimate the "within variance" which is the denominator of the test statistic so the "t-value" (or equivalent) depends on all the data, not just the 2 treatments being compared The degrees of freedom of the test are determined by the data used to estimate the "within variance", so even if the "t-value" is the same, the p-value will not be. In some cases, multiple comparisons adjustments are built-in to the procedures, and these become more stringent as the number of treatments increases. Any good intro text on ANOVA will elucidate these issues. --Naomi At 02:15 PM 3/10/2005, Mike Schaffer wrote: >I've run limma for a few months and had a question about the p-values >being calculated from a large set of data vs. just a subset. > > >All of my 2-color array data is relative to an untreated sample and each >is replicated three times. >For example: > Treatment1 vs untreated x 3, > Treatment2 vs. untreated x 3, > Treatment3 vs. untreated x 3 > ...etc. > >So I have a large MA object of all the data that is normalized within >arrays and a design matrix created by: > >design <- modelMatrix(targets,ref="untreated") > > >My confusion stems from the fact that if I run eBayes on the entire >dataset (code below), I get different p-values (but same M values) for the >TreatmentX vs. untreated, than if I fit the subset that only includes data >for one treatment (e.g. just Treatment 1 vs. untreated). > >For example, > >fit <- lmFit(MA,design=design) >eb <- eBayes(fit) >eb$p.values[1:10,1] > > >gives different p-values than if I were to only initially subset on the >Treatment1 vs. untreated data, and then run lmFit. >For example, > >design <- modelMatrix(targets[1:3,],ref="untreated") >fit <- lmFit(MA[,1:3],design=design) # the first three data sets >include ALLof the Treatment1 vs. untreated data >eb <- eBayes(fit) >eb$p.values[1:10] > > >Am I incorrect to assume that the p-values should be the same regardless >of how much data is included in the MA object, as long as the design >matrix has no overlap between experiments (e.g. treatment1 vs. untreated >data is distinct from treatment2 vs. untreated data) -- aside from the >fact they are all relative to an untreated sample? > >Or is the moderated t-statistic based on ALL of the data in the MA object >regardless of each experiment's relationship to others? > >I'd like to read in all the data and keep it in one large RG and MA >object. If I'm looking to determine the p-values for genes induced >relative to untreated by just one of the treatments, should I use lmFit on >the large MA object or should it be subset first (e.g MA[,1:3]) before >doing the linear fit? > >Am I missing something? > >Thanks, in advance, for any help. > >_______________________________________________ >Bioconductor mailing list >Bioconductor@stat.math.ethz.ch >https://stat.ethz.ch/mailman/listinfo/bioconductor Naomi S. Altman 814-865-3791 (voice) Associate Professor Bioinformatics Consulting Center Dept. of Statistics 814-863-7114 (fax) Penn State University 814-865-1348 (Statistics) University Park, PA 16802-2111

ADD COMMENT • link 19.7 years ago Naomi Altman ★ 6.0k

Login before adding your answer.