LIMMA lmscFit: difference between linear models
2
0
Entering edit mode
Koen Bossers ▴ 20
@koen-bossers-1438
Last seen 10.3 years ago
Dear all, I'm currently studying gene expression in human brain samples using Agilent arrays (4 patients, 4 controls). I am analyzing the data using a single channel approach (lmscFit), which I think valid for my dataset, as all channels from each individual nicely cluster together, apart from all other channels. The hybridization setup is as follows: -------------- Cy3 Cy5 US12302316_251182110152_S01_A01 pat2 ctrl3 US12302316_251182110153_S01_A01 pat3 ctrl2 US12302316_251182110154_S01_A01 pat1 ctrl4 US12302316_251182110155_S01_A01 ctrl4 pat4 US12302316_251182110156_S01_A01 ctrl1 pat1 US12302316_251182110157_S01_A01 pat4 ctrl1 US12302316_251182110158_S01_A01 ctrl2 pat4 US12302316_251182110159_S01_A01 pat2 ctrl1 US12302316_251182110160_S01_A01 ctrl3 pat1 US12302316_251182110176_S01_A01 pat3 ctrl4 US12302316_251182110177_S01_A01 ctrl2 pat2 US12302316_251182110178_S01_A01 ctrl3 pat3 -------------- The first analysis I tried was the following: I replaced all individual labels with a generic one (thus: pat1 becomes pat, ctrl2 becomes ctrl), and calculated a contrast between pat&ctrl: -------------- targets2 <- targetsA2C(targets) u <- unique(targets2$Target) f <- factor(targets2$Target, levels=u) design <- model.matrix(~0+f) colnames(design) <- u corfit <- intraspotCorrelation(MA, design) fit <- lmscFit(MA, design, correlation=corfit$consensus) cont.matrix <- makeContrasts(pat-ctrl,levels=design) fit2 <- contrasts.fit(fit, cont.matrix) fit2 <- eBayes(fit2) -------------- This analysis does not yield many significantly regulated genes (5 or so), which is likely due to the small number of biological replicates and the large diversity in the human population. I also tried another approach, leaving the individual labels intact, and fitting a linear model in the following manner: -------------- cont.matrix <- makeContrasts((pat1+pat2+pat3+pat4-ctrl1-ctrl2-ctrl3-ctrl4)/4 ,levels=design) -------------- Is this linear model valid? This analysis yields loads of significantly regulated genes (hundreds)! Neither the MA plot or the M values in fit2 look suspicious, so I do not have a reason to distrust this data. I do not really understand why there is such a large discrepancy between the two analysis methods. Is this due to the way replication is handled? Could anybody comment on the validity of these two analyses, taking into account individual variation in the human population, and the way replication is handled in LIMMA? Thank you very much, Koen Bossers -- Koen Bossers, PhD student Netherlands Institute for Brain Research Meibergdreef 33 1105 AZ Amsterdam, The Netherlands Phone: 0031-20-5665512 Email: k.bossers at nih.knaw.nl
BRAIN BRAIN • 1.6k views
ADD COMMENT
0
Entering edit mode
Francois Pepin ★ 1.3k
@francois-pepin-1012
Last seen 10.3 years ago
Hi Koen, Both models are "valid", but that doesn't mean that they're equally good. Here is the main difference between the two models (please correct me if I'm wrong): In the first one, you are trying to fit a single coefficients for each gene in patient and control. This interprets all variations between patients and controls as error and gives you a rather high variance. So, for each gene you have a 2 value (for the patients and for the controls) which you won't trust much because of the variance. In the second case, you're fitting each patient and control separately first. Assuming you have decent arrays, then the variance between the replicates is pretty low. So now you have 8 values (4 for the patients, 4 for the controls) which you'll trust more, because they have lower variances. Then those are then used for the contrast. So the second method is more powerful, because it can model the technical variation separately from the patient variation. This is the intuition why the second method can call more genes as being differentially expressed. Francois On Mon, 2005-26-09 at 10:46 +0200, Koen Bossers wrote: > Dear all, > > I'm currently studying gene expression in human brain samples using > Agilent arrays (4 patients, 4 controls). I am analyzing the data using a > single channel approach (lmscFit), which I think valid for my dataset, > as all channels from each individual nicely cluster together, apart from > all other channels. > > The hybridization setup is as follows: > > -------------- > Cy3 Cy5 > US12302316_251182110152_S01_A01 pat2 ctrl3 > US12302316_251182110153_S01_A01 pat3 ctrl2 > US12302316_251182110154_S01_A01 pat1 ctrl4 > US12302316_251182110155_S01_A01 ctrl4 pat4 > US12302316_251182110156_S01_A01 ctrl1 pat1 > US12302316_251182110157_S01_A01 pat4 ctrl1 > US12302316_251182110158_S01_A01 ctrl2 pat4 > US12302316_251182110159_S01_A01 pat2 ctrl1 > US12302316_251182110160_S01_A01 ctrl3 pat1 > US12302316_251182110176_S01_A01 pat3 ctrl4 > US12302316_251182110177_S01_A01 ctrl2 pat2 > US12302316_251182110178_S01_A01 ctrl3 pat3 > > -------------- > > The first analysis I tried was the following: I replaced all individual > labels with a generic one (thus: pat1 becomes pat, ctrl2 becomes ctrl), > and calculated a contrast between pat&ctrl: > > -------------- > > targets2 <- targetsA2C(targets) > u <- unique(targets2$Target) > f <- factor(targets2$Target, levels=u) > design <- model.matrix(~0+f) > colnames(design) <- u > > corfit <- intraspotCorrelation(MA, design) > fit <- lmscFit(MA, design, correlation=corfit$consensus) > > cont.matrix <- makeContrasts(pat-ctrl,levels=design) > fit2 <- contrasts.fit(fit, cont.matrix) > fit2 <- eBayes(fit2) > > -------------- > > This analysis does not yield many significantly regulated genes (5 or > so), which is likely due to the small number of biological replicates > and the large diversity in the human population. > > I also tried another approach, leaving the individual labels intact, and > fitting a linear model in the following manner: > > -------------- > > cont.matrix <- makeContrasts((pat1+pat2+pat3+pat4-ctrl1-ctrl2-ctrl3-ctrl4)/4 > ,levels=design) > > > -------------- > > Is this linear model valid? > This analysis yields loads of significantly regulated genes (hundreds)! > Neither the MA plot or the M values in fit2 look suspicious, so I do not > have a reason to distrust this data. > > I do not really understand why there is such a large discrepancy between > the two analysis methods. Is this due to the way replication is handled? > > Could anybody comment on the validity of these two analyses, taking into > account individual variation in the human population, and the way > replication is handled in LIMMA? > > Thank you very much, > > Koen Bossers > >
ADD COMMENT
0
Entering edit mode
Koen Bossers ▴ 20
@koen-bossers-1438
Last seen 10.3 years ago
Francois (and other bioconductor users), Thanks for your explanation. My thinking was in that direction also, and you made it very clear to me. I do have some questions though. The second model is more powerful, because we can now model the technical variation first, and then the patient variation. This then gives me far more significant genes. But the thing is, I think that the technical variation is much smaller than the biological variation to begin with (high quality Agilent arrays vs human samples)! For example, the distance between all pat1 channels in a hierarchical clustering is much smaller than between individuals. Doesn't that imply that "removing" the technical variance shouldn't have such a big impact on my analysis results? So is the linear model doing something else that I'm missing here? Second, and this is something that is more related to some limitations in the LIMMA package, the technical replicates are now treated as biological replicates. So probably the number of significant genes is overestimated. But to what extend? Is there a way the estimate this overestimation? Or shoul I let the qPCR validation do the talking? Does anybody have any comments/suggestions on the best method to detect differential gene expression in an experiment with low technical variance and high biological variance? Thank you, Koen -- Koen Bossers, PhD student Netherlands Institute for Brain Research Meibergdreef 33 1105 AZ Amsterdam, The Netherlands Phone: +31-20-5665512 Email: k.bossers at nih.knaw.nl > Hi Koen, > > Both models are "valid", but that doesn't mean that they're equally good. > > Here is the main difference between the two models (please correct me if I'm wrong): > > In the first one, you are trying to fit a single coefficients for each gene in patient and control. This interprets all variations between patients and controls as error and gives you a rather high variance. So, for each gene you have a 2 value (for the patients and for the controls) which you won't trust much because of the variance. > > In the second case, you're fitting each patient and control separately first. Assuming you have decent arrays, then the variance between the replicates is pretty low. So now you have 8 values (4 for the patients, 4 for the controls) which you'll trust more, because they have lower variances. Then those are then used for the contrast. > > So the second method is more powerful, because it can model the > technical variation separately from the patient variation. This is the intuition why the second method can call more genes as being > differentially expressed. > > Francois > > On Mon, 2005-26-09 at 10:46 +0200, Koen Bossers wrote: >> Dear all, >> >> I'm currently studying gene expression in human brain samples using Agilent arrays (4 patients, 4 controls). I am analyzing the data using a single channel approach (lmscFit), which I think valid for my dataset, as all channels from each individual nicely cluster together, apart from all other channels. >> >> The hybridization setup is as follows: >> >> -------------- >> Cy3 Cy5 >> US12302316_251182110152_S01_A01 pat2 ctrl3 >> US12302316_251182110153_S01_A01 pat3 ctrl2 >> US12302316_251182110154_S01_A01 pat1 ctrl4 >> US12302316_251182110155_S01_A01 ctrl4 pat4 >> US12302316_251182110156_S01_A01 ctrl1 pat1 >> US12302316_251182110157_S01_A01 pat4 ctrl1 >> US12302316_251182110158_S01_A01 ctrl2 pat4 >> US12302316_251182110159_S01_A01 pat2 ctrl1 >> US12302316_251182110160_S01_A01 ctrl3 pat1 >> US12302316_251182110176_S01_A01 pat3 ctrl4 >> US12302316_251182110177_S01_A01 ctrl2 pat2 >> US12302316_251182110178_S01_A01 ctrl3 pat3 >> >> -------------- >> >> The first analysis I tried was the following: I replaced all individual labels with a generic one (thus: pat1 becomes pat, ctrl2 becomes ctrl), and calculated a contrast between pat&ctrl: >> >> -------------- >> >> targets2 <- targetsA2C(targets) >> u <- unique(targets2$Target) >> f <- factor(targets2$Target, levels=u) >> design <- model.matrix(~0+f) >> colnames(design) <- u >> >> corfit <- intraspotCorrelation(MA, design) >> fit <- lmscFit(MA, design, correlation=corfit$consensus) >> >> cont.matrix <- makeContrasts(pat-ctrl,levels=design) >> fit2 <- contrasts.fit(fit, cont.matrix) >> fit2 <- eBayes(fit2) >> >> -------------- >> >> This analysis does not yield many significantly regulated genes (5 or so), which is likely due to the small number of biological replicates and the large diversity in the human population. >> >> I also tried another approach, leaving the individual labels intact, and fitting a linear model in the following manner: >> >> -------------- >> >> cont.matrix <- >> makeContrasts((pat1+pat2+pat3+pat4-ctrl1-ctrl2-ctrl3-ctrl4)/4 >> ,levels=design) >> >> >> -------------- >> >> Is this linear model valid? >> This analysis yields loads of significantly regulated genes (hundreds)! Neither the MA plot or the M values in fit2 look suspicious, so I do not have a reason to distrust this data. >> >> I do not really understand why there is such a large discrepancy between the two analysis methods. Is this due to the way replication is handled? >> >> Could anybody comment on the validity of these two analyses, taking into account individual variation in the human population, and the way replication is handled in LIMMA? >> >> Thank you very much, >> >> Koen Bossers >> >> >
ADD COMMENT
0
Entering edit mode
Hi Koen, > But the thing is, I think that the technical variation is much smaller > than the biological variation to begin with (high quality Agilent arrays > vs human samples)! For example, the distance between all pat1 channels in > a hierarchical clustering is much smaller than between individuals. > Doesn't that imply that "removing" the technical variance shouldn't have > such a big impact on my analysis results? So is the linear model doing > something else that I'm missing here? Why would you want to remove it? And would you be planning on doing that? As far as I can see, it would be more work for you and screw up your analysis. Even if it doesn't have a big impact, it would still have an impact. > Second, and this is something that is more related to some limitations in > the LIMMA package, the technical replicates are now treated as biological > replicates. So probably the number of significant genes is overestimated. > But to what extend? Is there a way the estimate this overestimation? Or > shoul I let the qPCR validation do the talking? I do not think that is correct. But your second model takes both the technical (in the original fit) and the biological replicates (in the contrast) into account. I wouldn't think that it would overestimate the number of genes because of that. You would probably want to look at the false discovery rate though to make sure you're correcting for that. Of couse, you will want to validate those results anyway. > Does anybody have any comments/suggestions on the best method to detect > differential gene expression in an experiment with low technical variance > and high biological variance? We're in about the same situation as you are and we're quite happy with Limma. Francois
ADD REPLY

Login before adding your answer.

Traffic: 722 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6