Aaron Mackey
My collaborators have an experimental design in which cells are treated experimentally with two conditions, and they naturally wish to know the differences in response between the two. Moreover, the experiments are set up in pairs of treatments, with each pair produced from the same "batch" of cells, inducing a natural pairing that we might want to include in the limma design. We would do this to take advantage of expected correlations in gene expression due to the source of cells in each experiment.
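For concreteness, here is a sketch of what the two designs might look like in base R, assuming a hypothetical three-batch experiment with two conditions per batch (the factor names are illustrative, not from our actual data; in limma either matrix would then be passed to lmFit):

```r
# Hypothetical layout: 3 cell batches, each contributing one sample
# per condition, columns ordered batch-by-batch.
batch <- factor(rep(1:3, each = 2))           # the "batch" pairing
treat <- factor(rep(c("A", "B"), times = 3))  # the two conditions

design_unpaired <- model.matrix(~ treat)          # treatment effect only
design_paired   <- model.matrix(~ batch + treat)  # blocks on batch (pairing)
```

The paired design spends two extra coefficients (hence degrees of freedom) on the batch blocks, which is exactly the trade-off explored below.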
However, when we run the analyses with either a paired or unpaired design, we find that the unpaired statistics are far more significant (~1000 probesets at FDR < 5%) than with the paired design (~100), which implies that there is not enough correlation across pairs, at least relative to the induced treatment effects. A bit stumped at first, I finally confirmed for myself that even in the presence of strong correlation, a larger treatment effect will remain more significant with an unpaired design:
> wt <- c(0.9, 1.0, 1.2)
> mean(wt)
[1] 1.033333
> mu <- c(6.2, 6.1, 5.9)
> mean(mu)
[1] 6.066667
> mean(mu) - mean(wt)
[1] 5.033333
> mean(mu-wt)
[1] 5.033333
In fact, no matter how you pair up mu and wt, you will always get 5.0333 as the paired fold change. However, the variance may change, depending on how correlated mu and wt are (it is this correlation that we are trying to take advantage of by pairing):
> cor(wt, mu)
[1] -1
> sd(mu-wt)
[1] 0.305505
> mu2 <- sort(mu)
> wt2 <- sort(wt)
> cor(wt2, mu2)
[1] 0.9285714
> sd(mu2-wt2)
[1] 0.05773503
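The identity behind those sd() values is worth making explicit: for paired differences, var(mu - wt) = var(mu) + var(wt) - 2*cov(mu, wt), so positive correlation subtracts from the paired variance and negative correlation adds to it. A quick check on the vectors above:

```r
wt <- c(0.9, 1.0, 1.2)
mu <- c(6.2, 6.1, 5.9)
# var of the paired differences equals the sum of the variances
# minus twice the covariance:
check1 <- all.equal(var(mu - wt), var(mu) + var(wt) - 2 * cov(mu, wt))
mu2 <- sort(mu); wt2 <- sort(wt)
check2 <- all.equal(var(mu2 - wt2), var(mu2) + var(wt2) - 2 * cov(mu2, wt2))
```

With cor = -1 the covariance term enlarges the paired variance; with cor = 0.93 it shrinks it, which is exactly the sd(mu-wt) vs sd(mu2-wt2) contrast above.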
Now let's see how this affects t-test significance:
> t.test(mu, wt, paired=F, var.equal=T)
Two Sample t-test
data: mu and wt
t = 40.3564, df = 4, p-value = 2.253e-06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
4.687050 5.379617
sample estimates:
mean of x mean of y
6.066667 1.033333
> t.test(mu, wt, paired=T, var.equal=T)
Paired t-test
data: mu and wt
t = 28.5363, df = 2, p-value = 0.001226
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
4.274417 5.792250
sample estimates:
mean of the differences
5.033333
> t.test(mu2, wt2, paired=F, var.equal=T)
Two Sample t-test
data: mu2 and wt2
t = 40.3564, df = 4, p-value = 2.253e-06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
4.687050 5.379617
sample estimates:
mean of x mean of y
6.066667 1.033333
> t.test(mu2, wt2, paired=T, var.equal=T)
Paired t-test
data: mu2 and wt2
t = 151, df = 2, p-value = 4.385e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
4.889912 5.176755
sample estimates:
mean of the differences
5.033333
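Those two t statistics can be reconstructed by hand in base R, which makes the mechanics transparent (the paired test uses the sd of the differences with n-1 = 2 df, while the unpaired test uses the pooled variance with 2n-2 = 4 df):

```r
wt2 <- sort(c(0.9, 1.0, 1.2))
mu2 <- sort(c(6.2, 6.1, 5.9))
n <- 3

# Paired: t = mean(d) / (sd(d) / sqrt(n)), df = n - 1
d <- mu2 - wt2
t_paired <- mean(d) / (sd(d) / sqrt(n))

# Unpaired, equal variances: pooled variance, df = 2n - 2
sp2 <- ((n - 1) * var(mu2) + (n - 1) * var(wt2)) / (2 * n - 2)
t_unpaired <- (mean(mu2) - mean(wt2)) / sqrt(sp2 * (1/n + 1/n))
```

This reproduces t_paired = 151 and t_unpaired = 40.3564 from the t.test() output above; pairing shrank the denominator a lot, but only the denominator.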
In the first case, when wt & mu were anti-correlated, the unpaired t-test gave much better P values; adding the pairing info made the variation in mu-wt larger, and so the P value got worse (the t statistic was smaller; also, for the same t statistic, the smaller df in the paired test makes the P value worse).
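The df effect alone is easy to see with the t distribution's tail probabilities: for an identical t statistic, fewer degrees of freedom means heavier tails and so a larger (worse) two-sided P value:

```r
# Same t statistic (t = 4), different degrees of freedom:
p_df2 <- 2 * pt(-4, df = 2)  # paired-style df
p_df4 <- 2 * pt(-4, df = 4)  # unpaired-style df
# p_df2 is several times larger than p_df4
```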
In the second case, when wt2 & mu2 were strongly correlated, the paired t-test was still very good, and had a much higher t statistic than the unpaired test, but the P value was still not quite as good as the unpaired -- this is due to the drop in df. Much of this has to do with the large difference between the two groups; if I make the difference between mu and wt a bit smaller, without changing the correlation structure:
> mu3 <- mu2 - 4.5 # cor(wt2, mu3) == cor(wt2, mu2)
> t.test(mu3, wt2, paired=F, var.equal=T)
Two Sample t-test
data: mu3 and wt2
t = 4.2762, df = 4, p-value = 0.01289
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.1870498 0.8796169
sample estimates:
mean of x mean of y
1.566667 1.033333
> t.test(mu3, wt2, paired=T, var.equal=T)
Paired t-test
data: mu3 and wt2
t = 16, df = 2, p-value = 0.003884
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.3899116 0.6767551
sample estimates:
mean of the differences
0.5333333
Then, finally, you see a change in the expected direction: the paired test is more significant than the unpaired test.
So, the question is -- how might you convince yourself (or a savvy and skeptical reviewer, for that matter) that your deliberate removal of pairing from your design is the statistically valid approach? My own thoughts were to show a distribution of observed correlations across pairings, to demonstrate that the within-pairing variances were much smaller than the between-treatment variances of interest.
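One way that diagnostic might look (a sketch only -- the matrix layout, names, and simulated data here are all assumptions, not our real dataset): for a genes-by-samples matrix with the wt columns followed by the matching mu columns in batch order, compute each gene's correlation across the pairs and inspect the distribution:

```r
set.seed(1)
n_pairs <- 4
# Toy stand-in for real expression data: 100 genes x 2*n_pairs samples,
# columns ordered wt_1..wt_n, mu_1..mu_n (hypothetical layout).
expr <- matrix(rnorm(100 * 2 * n_pairs), nrow = 100)
wt_mat <- expr[, 1:n_pairs]
mu_mat <- expr[, n_pairs + 1:n_pairs]

# Per-gene correlation between the two conditions across batches:
gene_cors <- sapply(seq_len(nrow(expr)),
                    function(i) cor(wt_mat[i, ], mu_mat[i, ]))
hist(gene_cors, main = "Within-pair correlation per gene")
```

If that histogram sits mostly near zero (as the simulated noise here does), it would support the claim that the batch pairing contributes little correlation worth modeling.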
Thanks for your time and attention,
-Aaron
--
Aaron J. Mackey, PhD
Assistant Professor
Center for Public Health Genomics
University of Virginia
amackey@virginia.edu
http://www.cphg.virginia.edu/mackey