Regression analysis problem

0

Entering edit mode

khadeeja ismail ▴ 400

@khadeeja-ismail-4711

Last seen 8.7 years ago

Hi everyone, I am wondering if anyone can help me understand the following output from limma. The problem goes like this: Say I have methylation values for four genes from three twin pairs. I created some dummy data so that values of gene1 is clearly different within twin pairs. My interest here is to find out the genes that are significantly different within the pairs (i.e. between T1a and T1b, T2a and T2b and T3a and T3b) T1a T1b T2a T2b T3a T3b "gene1" 16 4 18 4 19 5 "gene2" 3 1 4 5 2 6 "gene3" 2 1 4 3 4 5 "gene4" 3 1 2 3 3 4 I created a design matrix: > Pair<-c("TWIN1","TWIN1","TWIN2","TWIN2","TWIN3","TWIN3") >Classes<-c(2,1,2,1,2,1) >BMI<-c(45,20,34,12,25,19) >Pair<-factor(Pair) >Classes<-factor(Classes) >design <- model.matrix(~Pair+Classes) > design (Intercept) PairTWIN2 PairTWIN3 Classes2 1 1 0 0 1 2 1 0 0 0 3 1 1 0 1 4 1 1 0 0 5 1 0 1 1 6 1 0 1 0 ********** ...and fitted a linear model to the data to get the following results: ************************************************************ ID X.Intercept. PairTWIN2 PairTWIN3 Classes2 AveExpr F 1 gene1 3.333333 1.0 2.0 1.333333e+01 11.000000 106.298724 2 gene2 2.500000 2.5 2.0 -1.000000e+00 3.500000 8.745648 3 gene3 1.333333 2.0 3.0 3.333333e-01 3.166667 7.430245 4 gene4 2.000000 0.5 1.5 9.064933e-16 2.666667 4.799441 P.Value adj.P.Val 1 9.993042e-91 3.997217e-90 2 4.683758e-07 9.367515e-07 3 5.578117e-06 7.437489e-06 4 7.186523e-04 7.186523e-04 ************************************************************* As you can see, all the genes have adjusted p values less than 0.05. But if I just compute the differnces between the pairs T1 T2 T3 gene1 12 14 14 gene2 2 -1 -4 gene3 1 1 -1 gene4 2 -1 -1 ....and used the design matrix (1,1,1), gene 1 comes up as significantly different within the pairs. ID logFC t P.Value adj.P.Val B 1 gene1 13.3333333 10.6660452 5.234557e-06 2.093823e-05 46.016217 2 gene2 -1.0000000 -0.7999534 4.468388e-01 8.936777e-01 -5.513313 3 gene3 0.3333333 0.2666511 7.964821e-01 1.000000e+00 -5.772418 4 gene4 0.0000000 0.0000000 1.000000e+00 1.000000e+00 -5.804806 So, my questions are: Why am I not getting the same (expected) result from my first approach? Or am I misinterpreting the output? What does the p value in the 1st approach signify? And most importantly, am I applying the right model? Any help on this would be most appreciated. Thanks in advance, Khadeeja [[alternative HTML version deleted]]

limma limma • 1.8k views

ADD COMMENT • link updated 13.1 years ago by Sean Davis 21k • written 13.1 years ago by khadeeja ismail ▴ 400

0

Entering edit mode

Sean Davis 21k

@sean-davis-490

Last seen 12 weeks ago

United States

On Thu, Oct 6, 2011 at 7:15 AM, khadeeja ismail <hajjja at="" yahoo.com=""> wrote: > Hi everyone, > > I am wondering if anyone can help me understand the following output from limma. > > The problem goes like this: > Say I have methylation values for four genes from three twin pairs. I created some dummy data so that values of gene1 is clearly different within twin pairs. My interest here is to find out the genes that are significantly different within the pairs (i.e. between T1a and T1b,? T2a and T2b and T3a and T3b) > > > ??? ??? ??? ??? T1a T1b T2a T2b T3a T3b > > "gene1"??? 16??? 4??? 18??? 4??? 19??? 5 > "gene2"??? 3????? 1?? ? 4? ?? 5???? 2???? 6 > "gene3"??? 2????? 1?? ? 4???? 3???? 4? ?? 5 > "gene4"??? 3????? 1?? ? 2???? 3???? 3??? 4 > > > I created a design matrix: > > ?? > Pair<-c("TWIN1","TWIN1","TWIN2","TWIN2","TWIN3","TWIN3") > ??? >Classes<-c(2,1,2,1,2,1) > ??? >BMI<-c(45,20,34,12,25,19) > ??? >Pair<-factor(Pair) > ??? >Classes<-factor(Classes) > ??? >design <- model.matrix(~Pair+Classes) > >> design > ? (Intercept) PairTWIN2 PairTWIN3 Classes2 > 1?????????? 1???????? 0???????? 0??????? 1 > 2?????????? 1???????? 0???????? 0??????? 0 > 3?????????? 1???????? 1???????? 0??????? 1 > 4?????????? 1???????? 1???????? 0??????? 0 > 5?????????? 1???????? 0???????? 1??????? 1 > 6?????????? 1???????? 0???????? 1??????? 0 > > > ********** > > ...and fitted a linear model to the data to get the following results: > ************************************************************ > ID X.Intercept. PairTWIN2 PairTWIN3????? Classes2?? AveExpr????????? F > 1 gene1???? 3.333333?????? 1.0?????? 2.0? 1.333333e+01 11.000000 106.298724 > 2 gene2???? 2.500000?????? 2.5?????? 2.0 -1.000000e+00? 3.500000?? 8.745648 > 3 gene3???? 1.333333?????? 2.0?????? 3.0? 3.333333e-01? 3.166667?? 7.430245 > 4 gene4???? 2.000000?????? 0.5?????? 1.5? 9.064933e-16? 2.666667?? 4.799441 > ?????? P.Value??? adj.P.Val > 1 9.993042e-91 3.997217e-90 > 2 4.683758e-07 9.367515e-07 > 3 5.578117e-06 7.437489e-06 > 4 7.186523e-04 7.186523e-04 > ************************************************************* You didn't show us the rest of the code. In particular, I suspect that you are using something like: topTable(X) You might try: topTable(X,coef=4) Sean > > As you can see, all the genes have adjusted p values less than 0.05. > > But if I just compute the differnces between the pairs > > ???????????? T1 T2 T3 > gene1 ? 12? 14? 14 > gene2??? 2? -1? -4 > gene3 ?? 1?? 1? -1 > gene4??? 2? -1? -1 > > > ....and used the design matrix (1,1,1), > gene 1 comes up as significantly different within the pairs. > > ?ID????? logFC????????? t????? P.Value??? adj.P.Val???????? B > 1 gene1 13.3333333 10.6660452 5.234557e-06 2.093823e-05 46.016217 > 2 gene2 -1.0000000 -0.7999534 4.468388e-01 8.936777e-01 -5.513313 > 3 gene3? 0.3333333? 0.2666511 7.964821e-01 1.000000e+00 -5.772418 > 4 gene4? 0.0000000? 0.0000000 1.000000e+00 1.000000e+00 -5.804806 > > > So, my questions are: > Why am I not getting the same (expected) result from my first approach? Or am I misinterpreting? the output? > > What does the p value in the 1st approach signify? > > And most importantly, am I applying the right model? > > Any help on this would be most appreciated. > > Thanks in advance, > > Khadeeja > ? ? ? ?[[alternative HTML version deleted]] > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD COMMENT • link 13.1 years ago Sean Davis 21k

0

Entering edit mode

Got it. Thanks Sean. ________________________________ From: Sean Davis <sdavis2@mail.nih.gov> Cc: "bioconductor@r-project.org" <bioconductor@r-project.org> Sent: Thursday, October 6, 2011 2:25 PM Subject: Re: [BioC] Regression analysis problem > Hi everyone, > > I am wondering if anyone can help me understand the following output from limma. > > The problem goes like this: > Say I have methylation values for four genes from three twin pairs. I created some dummy data so that values of gene1 is clearly different within twin pairs. My interest here is to find out the genes that are significantly different within the pairs (i.e. between T1a and T1b,� T2a and T2b and T3a and T3b) > > > �� T1a T1b T2a T2b T3a T3b > > "gene1"�� 16�� 4�� 18�� 4�� 19�� 5 > "gene2"�� 3�� 1�� 4� �� 5�� 2�� 6 > "gene3"�� 2�� 1�� 4�� 3�� 4� �� 5 > "gene4"�� 3�� 1�� 2�� 3�� 3�� 4 > > > I created a design matrix: > > �� > Pair<-c("TWIN1","TWIN1","TWIN2","TWIN2","TWIN3","TWIN3") > �� >Classes<-c(2,1,2,1,2,1) > �� >BMI<-c(45,20,34,12,25,19) > �� >Pair<-factor(Pair) > �� >Classes<-factor(Classes) > �� >design <- model.matrix(~Pair+Classes) > >> design > � (Intercept) PairTWIN2 PairTWIN3 Classes2 > 1�� 1�� 0�� 0�� 1 > 2�� 1�� 0�� 0�� 0 > 3�� 1�� 1�� 0�� 1 > 4�� 1�� 1�� 0�� 0 > 5�� 1�� 0�� 1�� 1 > 6�� 1�� 0�� 1�� 0 > > > ********** > > ...and fitted a linear model to the data to get the following results: > ************************************************************ > ID X.Intercept. PairTWIN2 PairTWIN3�� Classes2�� AveExpr�� F > 1 gene1�� 3.333333�� 1.0�� 2.0� 1.333333e+01 11.000000 106.298724 > 2 gene2�� 2.500000�� 2.5�� 2.0 -1.000000e+00� 3.500000�� 8.745648 > 3 gene3�� 1.333333�� 2.0�� 3.0� 3.333333e-01� 3.166667�� 7.430245 > 4 gene4�� 2.000000�� 0.5�� 1.5� 9.064933e-16� 2.666667�� 4.799441 > �� P.Value�� adj.P.Val > 1 9.993042e-91 3.997217e-90 > 2 4.683758e-07 9.367515e-07 > 3 5.578117e-06 7.437489e-06 > 4 7.186523e-04 7.186523e-04 > ************************************************************* You didn't show us the rest of the code.� In particular, I suspect that you are using something like: topTable(X) You might try: topTable(X,coef=4) Sean > > As you can see, all the genes have adjusted p values less than 0.05. > > But if I just compute the differnces between the pairs > > �� T1 T2 T3 > gene1 � 12� 14� 14 > gene2�� 2� -1� -4 > gene3 �� 1�� 1� -1 > gene4�� 2� -1� -1 > > > ....and used the design matrix (1,1,1), > gene 1 comes up as significantly different within the pairs. > > �ID�� logFC�� t�� P.Value�� adj.P.Val�� B > 1 gene1 13.3333333 10.6660452 5.234557e-06 2.093823e-05 46.016217 > 2 gene2 -1.0000000 -0.7999534 4.468388e-01 8.936777e-01 -5.513313 > 3 gene3� 0.3333333� 0.2666511 7.964821e-01 1.000000e+00 -5.772418 > 4 gene4� 0.0000000� 0.0000000 1.000000e+00 1.000000e+00 -5.804806 > > > So, my questions are: > Why am I not getting the same (expected) result from my first approach? Or am I misinterpreting� the output? > > What does the p value in the 1st approach signify? > > And most importantly, am I applying the right model? > > Any help on this would be most appreciated. > > Thanks in advance, > > Khadeeja > � � � �[[alternative HTML version deleted]] > > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD REPLY • link 13.1 years ago khadeeja ismail ▴ 400

0

Entering edit mode

>In particular, I suspect >that you are using something like: >topTable(X) >You might try: >topTable(X,coef=4) It worked with the small dataset. When I tried the same with a larger set (11 pairs and and some 150k genes) I'm getting different results from the two approaches (even with using topTable(X, coef=12) in approach 1). I'm finding no significant genes from the first approach, but using the second approach I'm finding some. Is it possible? I thought I was doing the same thing in two different ways. Have I overlooked something again? Thanks, Khadeeja Targets<-readTargets(file="TargetsSelBMI.csv", sep="\t") Pair <- factor(Targets$Twin) Classes <- factor(Targets$HL) design <- model.matrix(~Pair + Classes) fit<-lmFit(data,design) fit.pair<-eBayes(fit) topGenes<-topTable(fit.pair, coef=12,number=2000) ________________________________ From: Sean Davis <sdavis2@mail.nih.gov> Cc: "bioconductor@r-project.org" <bioconductor@r-project.org> Sent: Thursday, October 6, 2011 2:25 PM Subject: Re: [BioC] Regression analysis problem > Hi everyone, > > I am wondering if anyone can help me understand the following output from limma. > > The problem goes like this: > Say I have methylation values for four genes from three twin pairs. I created some dummy data so that values of gene1 is clearly different within twin pairs. My interest here is to find out the genes that are significantly different within the pairs (i.e. between T1a and T1b, T2a and T2b and T3a and T3b) > > > T1a T1b T2a T2b T3a T3b > > "gene1" 16 4 18 4 19 5 > "gene2" 3 1 4 5 2 6 > "gene3" 2 1 4 3 4 5 > "gene4" 3 1 2 3 3 4 > > > I created a design matrix: > > > Pair<-c("TWIN1","TWIN1","TWIN2","TWIN2","TWIN3","TWIN3") > >Classes<-c(2,1,2,1,2,1) > >BMI<-c(45,20,34,12,25,19) > >Pair<-factor(Pair) > >Classes<-factor(Classes) > >design <- model.matrix(~Pair+Classes) > >> design > (Intercept) PairTWIN2 PairTWIN3 Classes2 > 1 1 0 0 1 > 2 1 0 0 0 > 3 1 1 0 1 > 4 1 1 0 0 > 5 1 0 1 1 > 6 1 0 1 0 > > > ********** > > ...and fitted a linear model to the data to get the following results: > ************************************************************ > ID X.Intercept. PairTWIN2 PairTWIN3 Classes2 AveExpr F > 1 gene1 3.333333 1.0 2.0 1.333333e+01 11.000000 106.298724 > 2 gene2 2.500000 2.5 2.0 -1.000000e+00 3.500000 8.745648 > 3 gene3 1.333333 2.0 3.0 3.333333e-01 3.166667 7.430245 > 4 gene4 2.000000 0.5 1.5 9.064933e-16 2.666667 4.799441 > P.Value adj.P.Val > 1 9.993042e-91 3.997217e-90 > 2 4.683758e-07 9.367515e-07 > 3 5.578117e-06 7.437489e-06 > 4 7.186523e-04 7.186523e-04 > ************************************************************* You didn't show us the rest of the code. In particular, I suspect that you are using something like: topTable(X) You might try: topTable(X,coef=4) Sean > > As you can see, all the genes have adjusted p values less than 0.05. > > But if I just compute the differnces between the pairs > > T1 T2 T3 > gene1 12 14 14 > gene2 2 -1 -4 > gene3 1 1 -1 > gene4 2 -1 -1 > > > ....and used the design matrix (1,1,1), > gene 1 comes up as significantly different within the pairs. > > ID logFC t P.Value adj.P.Val B > 1 gene1 13.3333333 10.6660452 5.234557e-06 2.093823e-05 46.016217 > 2 gene2 -1.0000000 -0.7999534 4.468388e-01 8.936777e-01 -5.513313 > 3 gene3 0.3333333 0.2666511 7.964821e-01 1.000000e+00 -5.772418 > 4 gene4 0.0000000 0.0000000 1.000000e+00 1.000000e+00 -5.804806 > > > So, my questions are: > Why am I not getting the same (expected) result from my first approach? Or am I misinterpreting the output? > > What does the p value in the 1st approach signify? > > And most importantly, am I applying the right model? > > Any help on this would be most appreciated. > > Thanks in advance, > > Khadeeja > [[alternative HTML version deleted]] > > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD REPLY • link 13.1 years ago khadeeja ismail ▴ 400

0

Entering edit mode

Hi Khadeeja, On 10/7/2011 9:42 AM, khadeeja ismail wrote: >> In particular, I suspect >> that you are using something like: >> topTable(X) >> You might try: >> topTable(X,coef=4) > > It worked with the small dataset. When I tried the same with a larger set (11 pairs and and some 150k genes) I'm getting different results from the two approaches (even with using topTable(X, coef=12) in approach 1). > I'm finding no significant genes from the first approach, but using the > second approach I'm finding some. Is it possible? I thought I was doing > the same thing in two different ways. Have I overlooked something again? I don't think you have overlooked anything, but there may be some misunderstanding. In your dummy data below, you have very few genes, and fewer samples. In the example you cite above, you use 150k genes and more samples. This will increase your power to detect differences in two ways. First, by increasing the number of genes, you can more accurately estimate an overall variance, which is then used in the eBayes() step to 'shrink' your observed variance towards this overall variance. This is one of the reasons that limma is so popular - by using information from all genes, you can increase the power to detect differences in individual genes. Second, when you increase the number of pairs you are using, your power to detect differences increases as well. This has nothing to do with limma per se; it is just that the power of a t-test increases as you increase the number of observations. Best, Jim > > > Thanks, > Khadeeja > > > > > Targets<-readTargets(file="TargetsSelBMI.csv", sep="\t") > > Pair<- factor(Targets$Twin) > Classes<- factor(Targets$HL) > design<- model.matrix(~Pair + Classes) > > fit<-lmFit(data,design) > fit.pair<-eBayes(fit) > > topGenes<-topTable(fit.pair, coef=12,number=2000) > > > > > > > > > ________________________________ > From: Sean Davis<sdavis2@mail.nih.gov> > > Cc: "bioconductor@r-project.org"<bioconductor@r-project.org> > Sent: Thursday, October 6, 2011 2:25 PM > Subject: Re: [BioC] Regression analysis problem > > >> Hi everyone, >> >> I am wondering if anyone can help me understand the following output from limma. >> >> The problem goes like this: >> Say I have methylation values for four genes from three twin pairs. I created some dummy data so that values of gene1 is clearly different within twin pairs. My interest here is to find out the genes that are significantly different within the pairs (i.e. between T1a and T1b, T2a and T2b and T3a and T3b) >> >> >> T1a T1b T2a T2b T3a T3b >> >> "gene1" 16 4 18 4 19 5 >> "gene2" 3 1 4 5 2 6 >> "gene3" 2 1 4 3 4 5 >> "gene4" 3 1 2 3 3 4 >> >> >> I created a design matrix: >> >> > Pair<-c("TWIN1","TWIN1","TWIN2","TWIN2","TWIN3","TWIN3") >> >Classes<-c(2,1,2,1,2,1) >> >BMI<-c(45,20,34,12,25,19) >> >Pair<-factor(Pair) >> >Classes<-factor(Classes) >> >design<- model.matrix(~Pair+Classes) >> >>> design >> (Intercept) PairTWIN2 PairTWIN3 Classes2 >> 1 1 0 0 1 >> 2 1 0 0 0 >> 3 1 1 0 1 >> 4 1 1 0 0 >> 5 1 0 1 1 >> 6 1 0 1 0 >> >> >> ********** >> >> ...and fitted a linear model to the data to get the following results: >> ************************************************************ >> ID X.Intercept. PairTWIN2 PairTWIN3 Classes2 AveExpr F >> 1 gene1 3.333333 1.0 2.0 1.333333e+01 11.000000 106.298724 >> 2 gene2 2.500000 2.5 2.0 -1.000000e+00 3.500000 8.745648 >> 3 gene3 1.333333 2.0 3.0 3.333333e-01 3.166667 7.430245 >> 4 gene4 2.000000 0.5 1.5 9.064933e-16 2.666667 4.799441 >> P.Value adj.P.Val >> 1 9.993042e-91 3.997217e-90 >> 2 4.683758e-07 9.367515e-07 >> 3 5.578117e-06 7.437489e-06 >> 4 7.186523e-04 7.186523e-04 >> ************************************************************* > You didn't show us the rest of the code. In particular, I suspect > that you are using something like: > > topTable(X) > > You might try: > > topTable(X,coef=4) > > Sean > >> As you can see, all the genes have adjusted p values less than 0.05. >> >> But if I just compute the differnces between the pairs >> >> T1 T2 T3 >> gene1 12 14 14 >> gene2 2 -1 -4 >> gene3 1 1 -1 >> gene4 2 -1 -1 >> >> >> ....and used the design matrix (1,1,1), >> gene 1 comes up as significantly different within the pairs. >> >> ID logFC t P.Value adj.P.Val B >> 1 gene1 13.3333333 10.6660452 5.234557e-06 2.093823e-05 46.016217 >> 2 gene2 -1.0000000 -0.7999534 4.468388e-01 8.936777e-01 -5.513313 >> 3 gene3 0.3333333 0.2666511 7.964821e-01 1.000000e+00 -5.772418 >> 4 gene4 0.0000000 0.0000000 1.000000e+00 1.000000e+00 -5.804806 >> >> >> So, my questions are: >> Why am I not getting the same (expected) result from my first approach? Or am I misinterpreting the output? >> >> What does the p value in the 1st approach signify? >> >> And most importantly, am I applying the right model? >> >> Any help on this would be most appreciated. >> >> Thanks in advance, >> >> Khadeeja >> [[alternative HTML version deleted]] >> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> > [[alternative HTML version deleted]] > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. Biostatistician Douglas Lab University of Michigan Department of Human Genetics 5912 Buhl 1241 E. Catherine St. Ann Arbor MI 48109-5618 734-615-7826 ********************************************************** Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues [[alternative HTML version deleted]]

ADD REPLY • link 13.1 years ago James W. MacDonald 67k

0

Entering edit mode

On Fri, Oct 7, 2011 at 8:15 AM, James W. MacDonald <jmacdon@med.umich.edu>wrote: > > First, by increasing the number of genes, you can more accurately > estimate an overall variance, which is then used in the eBayes() step to > 'shrink' your observed variance towards this overall variance. This is > one of the reasons that limma is so popular - by using information from > all genes, you can increase the power to detect differences in > individual genes. > Careful though -- this is methylation data, which tends to be strongly bimodal. It's not clear that the assumption of a common variance across unmethylated, partially-methylated, and methylated sites is appropriate. I seem to recall Gordon Smyth commenting upon this at one point -- perhaps he'll chime in. > Second, when you increase the number of pairs you are using, your power > to detect differences increases as well. This has nothing to do with > limma per se; it is just that the power of a t-test increases as you > increase the number of observations. > This is of course appropriate at any time that you can get more high- quality samples rather than fewer :-) -- If people do not believe that mathematics is simple, it is only because they do not realize how complicated life is.John von Neumann<http: www-groups.dcs.st-="" and.ac.uk="" ~history="" biographies="" von_neumann.html=""> [[alternative HTML version deleted]]

ADD REPLY • link 13.1 years ago Tim Triche ★ 4.2k

0

Entering edit mode

To follow up on Tim's point: I have yet to see any evidence that methylation data is anything other than ternary. The typical interpretation for the beta values is less than 0.25 = fully unmethylated greater than 0.75 = fully methylated between 0.25 and 0.75 = partially methylated Part of the evidence for this assertion is that we ahve several sets of data from samples treated with drugs that should fully methylate (or fully demethylate, respectively) everything. On those samples, 99% of the observed values are above 0.75 (or below 0.25, respectively). So I'm not convinced that t-tests have role to play in analyzing genome-wide methylation data.... Kevin On 10/7/2011 10:39 AM, Tim Triche, Jr. wrote: > On Fri, Oct 7, 2011 at 8:15 AM, James W. MacDonald<jmacdon at="" med.umich.edu="">wrote: > >> First, by increasing the number of genes, you can more accurately >> estimate an overall variance, which is then used in the eBayes() step to >> 'shrink' your observed variance towards this overall variance. This is >> one of the reasons that limma is so popular - by using information from >> all genes, you can increase the power to detect differences in >> individual genes. >> > Careful though -- this is methylation data, which tends to be strongly > bimodal. It's not clear that the assumption of a common variance across > unmethylated, partially-methylated, and methylated sites is appropriate. I > seem to recall Gordon Smyth commenting upon this at one point -- perhaps > he'll chime in. > > >> Second, when you increase the number of pairs you are using, your power >> to detect differences increases as well. This has nothing to do with >> limma per se; it is just that the power of a t-test increases as you >> increase the number of observations. >> > This is of course appropriate at any time that you can get more high-quality > samples rather than fewer :-) >

ADD REPLY • link 13.1 years ago Kevin Coombes ▴ 430

0

Entering edit mode

Dear Kevin, How you treat the DNA methylation data will very much depend on the specific biological question you are trying to address. It is pretty clear that if you have a mixture of different cell types, each with a different methylation pattern, that small yet biologically and clinically relevant changes in the cell type composition will lead to small yet statistically significant changes in methylation. The evidence for biologically relevant small changes in methylation (differences in means of ~0.1) is overwhelming, for instance in DNA methylation studies conducted on whole blood DNA where blood cell subtype composition changes in response to the presence of cancer. kind regards A. ********************************************************************** ********************************************************************** *** Andrew E Teschendorff PhD Heller Research Fellow Statistical Cancer Genomics Paul O'Gorman Building UCL Cancer Institute University College London 72 Huntley Street London WC1E 6BT, UK. Mob: +44 07876 561263 Email: a.teschendorff at ucl.ac.uk http://www.ucl.ac.uk/cancer/research- groups/statistical_cancer_genomics/index.htm ********************************************************************** ********************************************************************** ________________________________________ From: bioconductor-bounces@r-project.org [bioconductor- bounces@r-project.org] On Behalf Of Kevin R. Coombes [kevin.r.coombes@gmail.com] Sent: 07 October 2011 18:26 To: ttriche at usc.edu Cc: James W. MacDonald; Tim Triche, Jr.; bioconductor at r-project.org; Sean Davis; khadeeja ismail Subject: Re: [BioC] Regression analysis problem To follow up on Tim's point: I have yet to see any evidence that methylation data is anything other than ternary. The typical interpretation for the beta values is less than 0.25 = fully unmethylated greater than 0.75 = fully methylated between 0.25 and 0.75 = partially methylated Part of the evidence for this assertion is that we ahve several sets of data from samples treated with drugs that should fully methylate (or fully demethylate, respectively) everything. On those samples, 99% of the observed values are above 0.75 (or below 0.25, respectively). So I'm not convinced that t-tests have role to play in analyzing genome-wide methylation data.... Kevin On 10/7/2011 10:39 AM, Tim Triche, Jr. wrote: > On Fri, Oct 7, 2011 at 8:15 AM, James W. MacDonald<jmacdon at="" med.umich.edu="">wrote: > >> First, by increasing the number of genes, you can more accurately >> estimate an overall variance, which is then used in the eBayes() step to >> 'shrink' your observed variance towards this overall variance. This is >> one of the reasons that limma is so popular - by using information from >> all genes, you can increase the power to detect differences in >> individual genes. >> > Careful though -- this is methylation data, which tends to be strongly > bimodal. It's not clear that the assumption of a common variance across > unmethylated, partially-methylated, and methylated sites is appropriate. I > seem to recall Gordon Smyth commenting upon this at one point -- perhaps > he'll chime in. > > >> Second, when you increase the number of pairs you are using, your power >> to detect differences increases as well. This has nothing to do with >> limma per se; it is just that the power of a t-test increases as you >> increase the number of observations. >> > This is of course appropriate at any time that you can get more high-quality > samples rather than fewer :-) > _______________________________________________ Bioconductor mailing list Bioconductor at r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 13.1 years ago Andrew Teschendorff ▴ 60

0

Entering edit mode

Hi, Thanks for some very useful and interesting information. My problem could very much be related to the distribution of the values, but both methods use limma and eBayes, so both are using moderated t tests on the same data set. The difference is that for Method 1 the difference in methylation within each twin pair is calculated by hand and treated as one sample by limma, while in Method 2 limma identifies the pairs and computes the differences (in my understanding). Using Method 1 I'm getting really nice and 'expected' sort of results. Method 2 baffles me. My thought is that even if the methylation data is bimodal, the computed differences within pairs don't have to be. Since the pairs are monozygotic twins, we can expect the differences in methylation within each pair to be close to zero for most CpG sites. So I kind of believed in the results from Method 1 (and also since the results made sense). With Method 2 I expected limma to compute the differences and treat the data the same way, but apparently it doesn't since I'm not getting the same results. Please correct me if I'm having any misunderstandings here. Regards, Khadeeja ________________________________ From: Andrew Teschendorff <a.teschendorff@ucl.ac.uk> To: Kevin R. Coombes <kevin.r.coombes@gmail.com>; "ttriche@usc.edu" <ttriche@usc.edu> Cc: James W. MacDonald <jmacdon@med.umich.edu>; "Tim Triche, Jr." <tim.triche@gmail.com>; "bioconductor@r-project.org" <bioconductor@r-project.org>; S Sent: Friday, October 7, 2011 10:57 PM Subject: RE: [BioC] Regression analysis problem Dear Kevin, How you treat the DNA methylation data will very much depend on the specific biological question you are trying to address. It is pretty clear that if you have a mixture of different cell types, each with a different methylation pattern, that small yet biologically and clinically relevant changes in the cell type composition will lead to small yet statistically significant changes in methylation. The evidence for biologically relevant small changes in methylation (differences in means of ~0.1) is overwhelming, for instance in DNA methylation studies conducted on whole blood DNA where blood cell subtype composition changes in response to the presence of cancer. kind regards A. ********************************************************************** ********************************************************************** *** Andrew E Teschendorff PhD Heller Research Fellow Statistical Cancer Genomics Paul O'Gorman Building UCL Cancer Institute University College London 72 Huntley Street London WC1E 6BT, UK. Mob: +44 07876 561263 Email: a.teschendorff@ucl.ac.uk http://www.ucl.ac.uk/cancer/research- groups/statistical_cancer_genomics/index.htm ********************************************************************** ********************************************************************** ________________________________________ From: bioconductor-bounces@r-project.org [bioconductor- bounces@r-project.org] On Behalf Of Kevin R. Coombes [kevin.r.coombes@gmail.com] Sent: 07 October 2011 18:26 To: ttriche@usc.edu Cc: James W. MacDonald; Tim Triche, Jr.; bioconductor@r-project.org; Sean Davis; khadeeja ismail Subject: Re: [BioC] Regression analysis problem To follow up on Tim's point: I have yet to see any evidence that methylation data is anything other than ternary. The typical interpretation for the beta values is less than 0.25 = fully unmethylated greater than 0.75 = fully methylated between 0.25 and 0.75 = partially methylated Part of the evidence for this assertion is that we ahve several sets of data from samples treated with drugs that should fully methylate (or fully demethylate, respectively) everything. On those samples, 99% of the observed values are above 0.75 (or below 0.25, respectively). So I'm not convinced that t-tests have role to play in analyzing genome-wide methylation data.... Kevin On 10/7/2011 10:39 AM, Tim Triche, Jr. wrote: > On Fri, Oct 7, 2011 at 8:15 AM, James W. MacDonald<jmacdon@med.umich.edu>wrote: > >> First, by increasing the number of genes, you can more accurately >> estimate an overall variance, which is then used in the eBayes() step to >> 'shrink' your observed variance towards this overall variance. This is >> one of the reasons that limma is so popular - by using information from >> all genes, you can increase the power to detect differences in >> individual genes. >> > Careful though -- this is methylation data, which tends to be strongly > bimodal. It's not clear that the assumption of a common variance across > unmethylated, partially-methylated, and methylated sites is appropriate. I > seem to recall Gordon Smyth commenting upon this at one point -- perhaps > he'll chime in. > > >> Second, when you increase the number of pairs you are using, your power >> to detect differences increases as well. This has nothing to do with >> limma per se; it is just that the power of a t-test increases as you >> increase the number of observations. >> > This is of course appropriate at any time that you can get more high-quality > samples rather than fewer :-) > _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]]

ADD REPLY • link 13.1 years ago khadeeja ismail ▴ 400

0

Entering edit mode

Hi, Managed to find the problem which was to do with reading the targets file and now the results look good. But some of the probes that came up as significant using method 1 did not come up using method2, and interestingly they happen to be those not mapping to any gene. Thanks again for the prompt responses and some very useful information. Best regards, Khadeeja ________________________________ From: Andrew Teschendorff <a.teschendorff@ucl.ac.uk> To: Kevin R. Coombes <kevin.r.coombes@gmail.com>; "ttriche@usc.edu" <ttriche@usc.edu> Cc: James W. MacDonald <jmacdon@med.umich.edu>; "Tim Triche, Jr." <tim.triche@gmail.com>; "bioconductor@r-project.org" <bioconductor@r-project.org>; S Sent: Friday, October 7, 2011 8:57 PM Subject: RE: [BioC] Regression analysis problem Dear Kevin, How you treat the DNA methylation data will very much depend on the specific biological question you are trying to address. It is pretty clear that if you have a mixture of different cell types, each with a different methylation pattern, that small yet biologically and clinically relevant changes in the cell type composition will lead to small yet statistically significant changes in methylation. The evidence for biologically relevant small changes in methylation (differences in means of ~0.1) is overwhelming, for instance in DNA methylation studies conducted on whole blood DNA where blood cell subtype composition changes in response to the presence of cancer. kind regards A. ********************************************************************** ********************************************************************** *** Andrew E Teschendorff PhD Heller Research Fellow Statistical Cancer Genomics Paul O'Gorman Building UCL Cancer Institute University College London 72 Huntley Street London WC1E 6BT, UK. Mob: +44 07876 561263 Email: a.teschendorff@ucl.ac.uk http://www.ucl.ac.uk/cancer/research- groups/statistical_cancer_genomics/index.htm ********************************************************************** ********************************************************************** ________________________________________ From: bioconductor-bounces@r-project.org [bioconductor- bounces@r-project.org] On Behalf Of Kevin R. Coombes [kevin.r.coombes@gmail.com] Sent: 07 October 2011 18:26 To: ttriche@usc.edu Cc: James W. MacDonald; Tim Triche, Jr.; bioconductor@r-project.org; Sean Davis; khadeeja ismail Subject: Re: [BioC] Regression analysis problem To follow up on Tim's point: I have yet to see any evidence that methylation data is anything other than ternary. The typical interpretation for the beta values is less than 0.25 = fully unmethylated greater than 0.75 = fully methylated between 0.25 and 0.75 = partially methylated Part of the evidence for this assertion is that we ahve several sets of data from samples treated with drugs that should fully methylate (or fully demethylate, respectively) everything. On those samples, 99% of the observed values are above 0.75 (or below 0.25, respectively). So I'm not convinced that t-tests have role to play in analyzing genome-wide methylation data.... Kevin On 10/7/2011 10:39 AM, Tim Triche, Jr. wrote: > On Fri, Oct 7, 2011 at 8:15 AM, James W. MacDonald<jmacdon@med.umich.edu>wrote: > >> First, by increasing the number of genes, you can more accurately >> estimate an overall variance, which is then used in the eBayes() step to >> 'shrink' your observed variance towards this overall variance. This is >> one of the reasons that limma is so popular - by using information from >> all genes, you can increase the power to detect differences in >> individual genes. >> > Careful though -- this is methylation data, which tends to be strongly > bimodal. It's not clear that the assumption of a common variance across > unmethylated, partially-methylated, and methylated sites is appropriate. I > seem to recall Gordon Smyth commenting upon this at one point -- perhaps > he'll chime in. > > >> Second, when you increase the number of pairs you are using, your power >> to detect differences increases as well. This has nothing to do with >> limma per se; it is just that the power of a t-test increases as you >> increase the number of observations. >> > This is of course appropriate at any time that you can get more high-quality > samples rather than fewer :-) > _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]]

ADD REPLY • link 13.1 years ago khadeeja ismail ▴ 400

Login before adding your answer.