design matrix edge R pairwise comparison at different time points after infection with replicates

0

Entering edit mode

Kaat De Cremer ▴ 70

@kaat-de-cremer-5346

Last seen 10.3 years ago

Dear all, I am using edgeR to find genes differentially expressed between infected and mock-infected control plants, at 3 time points after infection. I have RNAseq data for 3 independent tests, so for every single test I have 6 libraries (control + infected at 3 time points). Having three replicates this makes 18 libraries in total. What I did until now is look at each time point separate and calculate DEgenes at that time point as shown in this script: > head(x) C1 C2 C3 T1 T2 T3 1 0 1 2 0 0 0 2 13 6 4 10 8 12 3 17 16 9 10 8 11 4 2 1 2 2 3 2 5. 1 3 1 2 1 3 0 6 958 457 438 565 429 518 > treatment<-factor(c("C","C","C","T","T","T")) > test<-factor(c(1,2,3,1,2,3)) > y<-DGEList(counts=x,group=treatment) Calculating library sizes from column totals. > cpm.y<-cpm(y) > y<-y[rowSums(cpm.y>2)>=3,] > y<-calcNormFactors(y) > design<-model.matrix(~test+treat) > y<-estimateGLMCommonDisp(y,design,verbose=TRUE) Disp = 0.0265 , BCV = 0.1628 > y<-estimateGLMTrendedDisp(y,design) Loading required package: splines > y<-estimateGLMTagwiseDisp(y,design) > fit<-glmFit(y,design) > lrt<-glmLRT(y,fit) This works fine but I wonder if I should do the analysis of the different time points all at once? Will this make a difference? Unfortunately I cannot figure out how to design the matrix. I hope someone can help me, Kaat [[alternative HTML version deleted]]

RNASeq edgeR RNASeq edgeR • 2.0k views

ADD COMMENT • link updated 12.5 years ago by Gordon Smyth 52k • written 12.5 years ago by Kaat De Cremer ▴ 70

0

Entering edit mode

Mark Robinson ▴ 880

@mark-robinson-4908

Last seen 6.1 years ago

Hi Kaat, It is probably better to fit all your data with a single call to glmFit(), over all 18 samples; you can test the differences of interest trough the 'coef' or 'contrast' argument on glmLRT(). That would afford you more degrees of freedom and presumably better estimates of dispersion, and so on.

ADD COMMENT • link 12.5 years ago Mark Robinson ▴ 880

0

Entering edit mode

Hi Mark, Thank you for your suggestion, I really appreciate your time. Working in R is new to me so it has been a struggle using edgeR, but I think I managed it using only 2 factors (test and treatment). Now that I will be including 3 factors (test, treatment and time) in one analysis it is clear to me that I still don't understand how it works exactly. Below you can see my workspace with the only design matrix I could come up with, but I don't see which coefficients I should include or which contrast vector to use in the glmLRT function to make the comparison of control-treatment at each time point separate, ignoring the other 2 time points. Is this possible with this design matrix? Or is the matrix wrong for this purpose? Thanks! Kaat > head(x) 12hpi C1 12hpi C2 12hpi C3 12hpi T1 12hpi T2 12hpi T3 24hpi C1 24hpi C2 Lsa000001.1 0 1 1 2 0 2 1 1 Lsa000002.1 5 4 0 5 6 6 6 4 Lsa000003.1 10 9 7 5 5 8 6 2 Lsa000004.1 1 1 1 1 1 1 1 3 Lsa000005.1 1 0 1 0 2 0 0 1 Lsa000006.1 510 223 228 287 222 268 303 358 24hpi C3 24hpi T1 24hpi T2 24hpi T3 48hpi C1 48hpi C2 48hpi C3 48hpi T1 Lsa000001.1 0 1 1 0 0 0 0 2 Lsa000002.1 7 5 2 5 10 6 12 12 Lsa000003.1 7 5 4 2 6 5 8 2 Lsa000004.1 1 3 1 2 1 3 2 3 Lsa000005.1 0 1 0 0 1 0 0 2 Lsa000006.1 372 362 237 320 472 440 411 858 48hpi T2 48hpi T3 Lsa000001.1 0 0 Lsa000002.1 1 5 Lsa000003.1 1 0 Lsa000004.1 0 2 Lsa000005.1 1 0 Lsa000006.1 375 275 > treat<-factor(c("C","C","C","T","T","T","C","C","C","T","T","T","C", "C","C","T","T","T")) > test<-factor(c(1,1,2,3,1,2,3,2,3,1,2,3,1,2,3,1,2,3)) time<-factor(c("12hpi","12hpi","12hpi","12hpi","12hpi","12hpi","24hpi" ,"24hpi","24hpi","24hpi","24hpi","24hpi","48hpi","48hpi","48hpi","48hp i","48hpi","48hpi")) > y<-DGEList(counts=x,group=treat) Calculating library sizes from column totals. > cpm.y<-cpm(y) > y<-y[rowSums(cpm.y>2)>=3,] > y<-calcNormFactors(y) design<-model.matrix(~test+treat+time) > design (Intercept) test2 test3 treatT time24hpi time48hpi 1 1 0 0 0 0 0 2 1 1 0 0 0 0 3 1 0 1 0 0 0 4 1 0 0 1 0 0 5 1 1 0 1 0 0 6 1 0 1 1 0 0 7 1 0 0 0 1 0 8 1 1 0 0 1 0 9 1 0 1 0 1 0 10 1 0 0 1 1 0 11 1 1 0 1 1 0 12 1 0 1 1 1 0 13 1 0 0 0 0 1 14 1 1 0 0 0 1 15 1 0 1 0 0 1 16 1 0 0 1 0 1 17 1 1 0 1 0 1 18 1 0 1 1 0 1 attr(,"assign") [1] 0 1 1 2 3 3 attr(,"contrasts") attr(,"contrasts")$test [1] "contr.treatment" attr(,"contrasts")$treat [1] "contr.treatment" attr(,"contrasts")$time [1] "contr.treatment" > y<-estimateGLMCommonDisp(y,design,verbose=TRUE) Disp = 0.07299 , BCV = 0.2702 > y<-estimateGLMTrendedDisp(y,design) Loading required package: splines > y<-estimateGLMTagwiseDisp(y,design) Warning message: In maximizeInterpolant(spline.pts, apl.smooth[j, ]) : max iterations exceeded > fit<-glmFit(y,design) -----Original Message----- From: Mark Robinson [mailto:mark.robinson@imls.uzh.ch] Sent: vrijdag 22 juni 2012 12:03 To: Kaat De Cremer Cc: bioconductor list Subject: Re: [BioC] design matrix edge R pairwise comparison at different time points after infection with replicates Hi Kaat, It is probably better to fit all your data with a single call to glmFit(), over all 18 samples; you can test the differences of interest trough the 'coef' or 'contrast' argument on glmLRT(). That would afford you more degrees of freedom and presumably better estimates of dispersion, and so on. >From your description, I can't quite figure out your design matrix. You have three factors: treatment, test and time point. First, you need to input all 18 samples and extend your 'treatment' and 'test' factor variables to have 18 values (corresponding to the columns of your table). And, then also include a time variable in your design. Some decisions might need to be made about interactions to include. Hope that gets you started. Best, Mark ---------- Prof. Dr. Mark Robinson Bioinformatics Institute of Molecular Life Sciences University of Zurich Winterthurerstrasse 190 8057 Zurich Switzerland v: +41 44 635 4848 f: +41 44 635 6898 e: mark.robinson at imls.uzh.ch o: Y11-J-16 w: http://tiny.cc/mrobin ---------- http://www.fgcz.ch/Bioconductor2012 On 21.06.2012, at 11:42, Kaat De Cremer wrote: > Dear all, > > > I am using edgeR to find genes differentially expressed between infected and mock-infected control plants, at 3 time points after infection. > I have RNAseq data for 3 independent tests, so for every single test I have 6 libraries (control + infected at 3 time points). > Having three replicates this makes 18 libraries in total. > > What I did until now is look at each time point separate and calculate DEgenes at that time point as shown in this script: > >> head(x) > C1 C2 C3 T1 T2 T3 > 1 0 1 2 0 0 0 > 2 13 6 4 10 8 12 > 3 17 16 9 10 8 11 > 4 2 1 2 2 3 2 > 5. 1 3 1 2 1 3 0 > 6 958 457 438 565 429 518 > >> treatment<-factor(c("C","C","C","T","T","T")) >> test<-factor(c(1,2,3,1,2,3)) >> y<-DGEList(counts=x,group=treatment) > Calculating library sizes from column totals. >> cpm.y<-cpm(y) >> y<-y[rowSums(cpm.y>2)>=3,] >> y<-calcNormFactors(y) >> design<-model.matrix(~test+treat) >> y<-estimateGLMCommonDisp(y,design,verbose=TRUE) > Disp = 0.0265 , BCV = 0.1628 >> y<-estimateGLMTrendedDisp(y,design) > Loading required package: splines >> y<-estimateGLMTagwiseDisp(y,design) >> fit<-glmFit(y,design) >> lrt<-glmLRT(y,fit) > > > This works fine but I wonder if I should do the analysis of the different time points all at once? Will this make a difference? > Unfortunately I cannot figure out how to design the matrix. > > I hope someone can help me, > > Kaat > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 12.5 years ago Kaat De Cremer ▴ 70

0

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 3 hours ago

WEHI, Melbourne, Australia

Hi Kaat, I'll jump in and continue on from Mark's help. To test for treatment effects separately at each time, the easiest way is to include the terms "time+time:treat" in your model formula. I'll assume that your "tests" are independent replicates of the whole experiment. If there are batch effects associated with the tests that you need to correct for, then your complete design matrix might be: design <- model.matrix(~test+time+time:treat) This produces a design matrix with the following columns: > colnames(design) [1] "(Intercept)" "test2" "test3" "time24hpi" [5] "time48hpi" "time12hpi:treatT" "time24hpi:treatT" "time48hpi:treatT" So testing for treatment effects at each time is easy. To test for treatment effect as time 12h: fit <- glmFit(y, design) lrt <- glmLRT(y, fit, coef="time12hpi:treatT") etc. To test for treatment effect at time 24h: lrt <- glmLRT(y, fit, coef="time24hpi:treatT") and so on. Best wishes Gordon > Date: Fri, 22 Jun 2012 13:11:41 +0000 > From: Kaat De Cremer <kaat.decremer at="" biw.kuleuven.be=""> > To: Mark Robinson <mark.robinson at="" imls.uzh.ch=""> > Cc: bioconductor list <bioconductor at="" r-project.org=""> > Subject: Re: [BioC] design matrix edge R pairwise comparison at > different time points after infection with replicates > > Hi Mark, > Thank you for your suggestion, > I really appreciate your time. > > Working in R is new to me so it has been a struggle using edgeR, but I > think I managed it using only 2 factors (test and treatment). Now that I > will be including 3 factors (test, treatment and time) in one analysis > it is clear to me that I still don't understand how it works exactly. > Below you can see my workspace with the only design matrix I could come > up with, but I don't see which coefficients I should include or which > contrast vector to use in the glmLRT function to make the comparison of > control-treatment at each time point separate, ignoring the other 2 time > points. Is this possible with this design matrix? Or is the matrix wrong > for this purpose? > > > Thanks! > Kaat > > >> head(x) > 12hpi C1 12hpi C2 12hpi C3 12hpi T1 12hpi T2 12hpi T3 24hpi C1 24hpi C2 > Lsa000001.1 0 1 1 2 0 2 1 1 > Lsa000002.1 5 4 0 5 6 6 6 4 > Lsa000003.1 10 9 7 5 5 8 6 2 > Lsa000004.1 1 1 1 1 1 1 1 3 > Lsa000005.1 1 0 1 0 2 0 0 1 > Lsa000006.1 510 223 228 287 222 268 303 358 > 24hpi C3 24hpi T1 24hpi T2 24hpi T3 48hpi C1 48hpi C2 48hpi C3 48hpi T1 > Lsa000001.1 0 1 1 0 0 0 0 2 > Lsa000002.1 7 5 2 5 10 6 12 12 > Lsa000003.1 7 5 4 2 6 5 8 2 > Lsa000004.1 1 3 1 2 1 3 2 3 > Lsa000005.1 0 1 0 0 1 0 0 2 > Lsa000006.1 372 362 237 320 472 440 411 858 > 48hpi T2 48hpi T3 > Lsa000001.1 0 0 > Lsa000002.1 1 5 > Lsa000003.1 1 0 > Lsa000004.1 0 2 > Lsa000005.1 1 0 > Lsa000006.1 375 275 >> treat<-factor(c("C","C","C","T","T","T","C","C","C","T","T","T","C" ,"C","C","T","T","T")) >> test<-factor(c(1,1,2,3,1,2,3,2,3,1,2,3,1,2,3,1,2,3)) > time<-factor(c("12hpi","12hpi","12hpi","12hpi","12hpi","12hpi","24hp i","24hpi","24hpi","24hpi","24hpi","24hpi","48hpi","48hpi","48hpi","48 hpi","48hpi","48hpi")) >> y<-DGEList(counts=x,group=treat) > Calculating library sizes from column totals. >> cpm.y<-cpm(y) >> y<-y[rowSums(cpm.y>2)>=3,] >> y<-calcNormFactors(y) > design<-model.matrix(~test+treat+time) >> design > (Intercept) test2 test3 treatT time24hpi time48hpi > 1 1 0 0 0 0 0 > 2 1 1 0 0 0 0 > 3 1 0 1 0 0 0 > 4 1 0 0 1 0 0 > 5 1 1 0 1 0 0 > 6 1 0 1 1 0 0 > 7 1 0 0 0 1 0 > 8 1 1 0 0 1 0 > 9 1 0 1 0 1 0 > 10 1 0 0 1 1 0 > 11 1 1 0 1 1 0 > 12 1 0 1 1 1 0 > 13 1 0 0 0 0 1 > 14 1 1 0 0 0 1 > 15 1 0 1 0 0 1 > 16 1 0 0 1 0 1 > 17 1 1 0 1 0 1 > 18 1 0 1 1 0 1 > attr(,"assign") > [1] 0 1 1 2 3 3 > attr(,"contrasts") > attr(,"contrasts")$test > [1] "contr.treatment" > > attr(,"contrasts")$treat > [1] "contr.treatment" > > attr(,"contrasts")$time > [1] "contr.treatment" > >> y<-estimateGLMCommonDisp(y,design,verbose=TRUE) > Disp = 0.07299 , BCV = 0.2702 >> y<-estimateGLMTrendedDisp(y,design) > Loading required package: splines >> y<-estimateGLMTagwiseDisp(y,design) > Warning message: > In maximizeInterpolant(spline.pts, apl.smooth[j, ]) : > max iterations exceeded >> fit<-glmFit(y,design) > > > > > > > > -----Original Message----- > From: Mark Robinson [mailto:mark.robinson at imls.uzh.ch] > Sent: vrijdag 22 juni 2012 12:03 > To: Kaat De Cremer > Cc: bioconductor list > Subject: Re: [BioC] design matrix edge R pairwise comparison at different time points after infection with replicates > > Hi Kaat, > > It is probably better to fit all your data with a single call to glmFit(), over all 18 samples; you can test the differences of interest trough the 'coef' or 'contrast' argument on glmLRT(). That would afford you more degrees of freedom and presumably better estimates of dispersion, and so on. > >> From your description, I can't quite figure out your design matrix. You have three factors: treatment, test and time point. First, you need to input all 18 samples and extend your 'treatment' and 'test' factor variables to have 18 values (corresponding to the columns of your table). And, then also include a time variable in your design. Some decisions might need to be made about interactions to include. > > Hope that gets you started. > > Best, > Mark > > > ---------- > Prof. Dr. Mark Robinson > Bioinformatics > Institute of Molecular Life Sciences > University of Zurich > Winterthurerstrasse 190 > 8057 Zurich > Switzerland > > v: +41 44 635 4848 > f: +41 44 635 6898 > e: mark.robinson at imls.uzh.ch > o: Y11-J-16 > w: http://tiny.cc/mrobin > > ---------- > http://www.fgcz.ch/Bioconductor2012 > > On 21.06.2012, at 11:42, Kaat De Cremer wrote: > >> Dear all, >> >> >> I am using edgeR to find genes differentially expressed between >> infected and mock-infected control plants, at 3 time points after >> infection. >> I have RNAseq data for 3 independent tests, so for every single test I >> have 6 libraries (control + infected at 3 time points). >> Having three replicates this makes 18 libraries in total. >> >> What I did until now is look at each time point separate and calculate DEgenes at that time point as shown in this script: >> >>> head(x) >> C1 C2 C3 T1 T2 T3 >> 1 0 1 2 0 0 0 >> 2 13 6 4 10 8 12 >> 3 17 16 9 10 8 11 >> 4 2 1 2 2 3 2 >> 5. 1 3 1 2 1 3 0 >> 6 958 457 438 565 429 518 >> >>> treatment<-factor(c("C","C","C","T","T","T")) >>> test<-factor(c(1,2,3,1,2,3)) >>> y<-DGEList(counts=x,group=treatment) >> Calculating library sizes from column totals. >>> cpm.y<-cpm(y) >>> y<-y[rowSums(cpm.y>2)>=3,] >>> y<-calcNormFactors(y) >>> design<-model.matrix(~test+treat) >>> y<-estimateGLMCommonDisp(y,design,verbose=TRUE) >> Disp = 0.0265 , BCV = 0.1628 >>> y<-estimateGLMTrendedDisp(y,design) >> Loading required package: splines >>> y<-estimateGLMTagwiseDisp(y,design) >>> fit<-glmFit(y,design) >>> lrt<-glmLRT(y,fit) >> >> >> This works fine but I wonder if I should do the analysis of the different time points all at once? Will this make a difference? >> Unfortunately I cannot figure out how to design the matrix. >> >> I hope someone can help me, >> >> Kaat >> ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}

ADD COMMENT • link 12.5 years ago Gordon Smyth 52k

0

Entering edit mode

Thank you so much to clarify this! Kaat -----Original Message----- From: Gordon K Smyth [mailto:smyth@wehi.EDU.AU] Sent: zondag 24 juni 2012 2:42 To: Kaat De Cremer Cc: Bioconductor mailing list; Mark Robinson Subject: design matrix edge R pairwise comparison at different time points after infection with replicates Hi Kaat, I'll jump in and continue on from Mark's help. To test for treatment effects separately at each time, the easiest way is to include the terms "time+time:treat" in your model formula. I'll assume that your "tests" are independent replicates of the whole experiment. If there are batch effects associated with the tests that you need to correct for, then your complete design matrix might be: design <- model.matrix(~test+time+time:treat) This produces a design matrix with the following columns: > colnames(design) [1] "(Intercept)" "test2" "test3" "time24hpi" [5] "time48hpi" "time12hpi:treatT" "time24hpi:treatT" "time48hpi:treatT" So testing for treatment effects at each time is easy. To test for treatment effect as time 12h: fit <- glmFit(y, design) lrt <- glmLRT(y, fit, coef="time12hpi:treatT") etc. To test for treatment effect at time 24h: lrt <- glmLRT(y, fit, coef="time24hpi:treatT") and so on. Best wishes Gordon > Date: Fri, 22 Jun 2012 13:11:41 +0000 > From: Kaat De Cremer <kaat.decremer at="" biw.kuleuven.be=""> > To: Mark Robinson <mark.robinson at="" imls.uzh.ch=""> > Cc: bioconductor list <bioconductor at="" r-project.org=""> > Subject: Re: [BioC] design matrix edge R pairwise comparison at > different time points after infection with replicates > > Hi Mark, > Thank you for your suggestion, > I really appreciate your time. > > Working in R is new to me so it has been a struggle using edgeR, but I > think I managed it using only 2 factors (test and treatment). Now that > I will be including 3 factors (test, treatment and time) in one > analysis it is clear to me that I still don't understand how it works exactly. > Below you can see my workspace with the only design matrix I could > come up with, but I don't see which coefficients I should include or > which contrast vector to use in the glmLRT function to make the > comparison of control-treatment at each time point separate, ignoring > the other 2 time points. Is this possible with this design matrix? Or > is the matrix wrong for this purpose? > > > Thanks! > Kaat > > >> head(x) > 12hpi C1 12hpi C2 12hpi C3 12hpi T1 12hpi T2 12hpi T3 24hpi C1 24hpi C2 > Lsa000001.1 0 1 1 2 0 2 1 1 > Lsa000002.1 5 4 0 5 6 6 6 4 > Lsa000003.1 10 9 7 5 5 8 6 2 > Lsa000004.1 1 1 1 1 1 1 1 3 > Lsa000005.1 1 0 1 0 2 0 0 1 > Lsa000006.1 510 223 228 287 222 268 303 358 > 24hpi C3 24hpi T1 24hpi T2 24hpi T3 48hpi C1 48hpi C2 48hpi C3 48hpi T1 > Lsa000001.1 0 1 1 0 0 0 0 2 > Lsa000002.1 7 5 2 5 10 6 12 12 > Lsa000003.1 7 5 4 2 6 5 8 2 > Lsa000004.1 1 3 1 2 1 3 2 3 > Lsa000005.1 0 1 0 0 1 0 0 2 > Lsa000006.1 372 362 237 320 472 440 411 858 > 48hpi T2 48hpi T3 > Lsa000001.1 0 0 > Lsa000002.1 1 5 > Lsa000003.1 1 0 > Lsa000004.1 0 2 > Lsa000005.1 1 0 > Lsa000006.1 375 275 >> treat<-factor(c("C","C","C","T","T","T","C","C","C","T","T","T","C"," >> C","C","T","T","T")) >> test<-factor(c(1,1,2,3,1,2,3,2,3,1,2,3,1,2,3,1,2,3)) > time<-factor(c("12hpi","12hpi","12hpi","12hpi","12hpi","12hpi","24hpi" > ,"24hpi","24hpi","24hpi","24hpi","24hpi","48hpi","48hpi","48hpi","48hp > i","48hpi","48hpi")) >> y<-DGEList(counts=x,group=treat) > Calculating library sizes from column totals. >> cpm.y<-cpm(y) >> y<-y[rowSums(cpm.y>2)>=3,] >> y<-calcNormFactors(y) > design<-model.matrix(~test+treat+time) >> design > (Intercept) test2 test3 treatT time24hpi time48hpi > 1 1 0 0 0 0 0 > 2 1 1 0 0 0 0 > 3 1 0 1 0 0 0 > 4 1 0 0 1 0 0 > 5 1 1 0 1 0 0 > 6 1 0 1 1 0 0 > 7 1 0 0 0 1 0 > 8 1 1 0 0 1 0 > 9 1 0 1 0 1 0 > 10 1 0 0 1 1 0 > 11 1 1 0 1 1 0 > 12 1 0 1 1 1 0 > 13 1 0 0 0 0 1 > 14 1 1 0 0 0 1 > 15 1 0 1 0 0 1 > 16 1 0 0 1 0 1 > 17 1 1 0 1 0 1 > 18 1 0 1 1 0 1 > attr(,"assign") > [1] 0 1 1 2 3 3 > attr(,"contrasts") > attr(,"contrasts")$test > [1] "contr.treatment" > > attr(,"contrasts")$treat > [1] "contr.treatment" > > attr(,"contrasts")$time > [1] "contr.treatment" > >> y<-estimateGLMCommonDisp(y,design,verbose=TRUE) > Disp = 0.07299 , BCV = 0.2702 >> y<-estimateGLMTrendedDisp(y,design) > Loading required package: splines >> y<-estimateGLMTagwiseDisp(y,design) > Warning message: > In maximizeInterpolant(spline.pts, apl.smooth[j, ]) : > max iterations exceeded >> fit<-glmFit(y,design) > > > > > > > > -----Original Message----- > From: Mark Robinson [mailto:mark.robinson at imls.uzh.ch] > Sent: vrijdag 22 juni 2012 12:03 > To: Kaat De Cremer > Cc: bioconductor list > Subject: Re: [BioC] design matrix edge R pairwise comparison at > different time points after infection with replicates > > Hi Kaat, > > It is probably better to fit all your data with a single call to glmFit(), over all 18 samples; you can test the differences of interest trough the 'coef' or 'contrast' argument on glmLRT(). That would afford you more degrees of freedom and presumably better estimates of dispersion, and so on. > >> From your description, I can't quite figure out your design matrix. You have three factors: treatment, test and time point. First, you need to input all 18 samples and extend your 'treatment' and 'test' factor variables to have 18 values (corresponding to the columns of your table). And, then also include a time variable in your design. Some decisions might need to be made about interactions to include. > > Hope that gets you started. > > Best, > Mark > > > ---------- > Prof. Dr. Mark Robinson > Bioinformatics > Institute of Molecular Life Sciences > University of Zurich > Winterthurerstrasse 190 > 8057 Zurich > Switzerland > > v: +41 44 635 4848 > f: +41 44 635 6898 > e: mark.robinson at imls.uzh.ch > o: Y11-J-16 > w: http://tiny.cc/mrobin > > ---------- > http://www.fgcz.ch/Bioconductor2012 > > On 21.06.2012, at 11:42, Kaat De Cremer wrote: > >> Dear all, >> >> >> I am using edgeR to find genes differentially expressed between >> infected and mock-infected control plants, at 3 time points after >> infection. >> I have RNAseq data for 3 independent tests, so for every single test >> I have 6 libraries (control + infected at 3 time points). >> Having three replicates this makes 18 libraries in total. >> >> What I did until now is look at each time point separate and calculate DEgenes at that time point as shown in this script: >> >>> head(x) >> C1 C2 C3 T1 T2 T3 >> 1 0 1 2 0 0 0 >> 2 13 6 4 10 8 12 >> 3 17 16 9 10 8 11 >> 4 2 1 2 2 3 2 >> 5. 1 3 1 2 1 3 0 >> 6 958 457 438 565 429 518 >> >>> treatment<-factor(c("C","C","C","T","T","T")) >>> test<-factor(c(1,2,3,1,2,3)) >>> y<-DGEList(counts=x,group=treatment) >> Calculating library sizes from column totals. >>> cpm.y<-cpm(y) >>> y<-y[rowSums(cpm.y>2)>=3,] >>> y<-calcNormFactors(y) >>> design<-model.matrix(~test+treat) >>> y<-estimateGLMCommonDisp(y,design,verbose=TRUE) >> Disp = 0.0265 , BCV = 0.1628 >>> y<-estimateGLMTrendedDisp(y,design) >> Loading required package: splines >>> y<-estimateGLMTagwiseDisp(y,design) >>> fit<-glmFit(y,design) >>> lrt<-glmLRT(y,fit) >> >> >> This works fine but I wonder if I should do the analysis of the different time points all at once? Will this make a difference? >> Unfortunately I cannot figure out how to design the matrix. >> >> I hope someone can help me, >> >> Kaat >> ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:6}}

ADD REPLY • link 12.5 years ago Kaat De Cremer ▴ 70

0

Entering edit mode

Again, Thank you both very much for your reply. I have analyzed my time course data now in several different ways and noticed some differences: 1) when I analyze the data of all time points together and look for DE genes at one time point, I find less DE genes compared to when I use only the data of that one time point in edgeR. I assume this is because the dispersion is larger when I include all the different time points at once? In that case, is this the right way to go? The dispersion can be larger at one time point compared to another I would think. 2) I noticed in the edgeR user's guide that in some examples you correct the library size after filtering for low expressed genes, and in other examples you don't. Correcting this library size gives less DE genes for my data at all time points when I analyze all data together and then look for DE genes at each time point. I don't see this difference when I only look at the data of one time point in edgeR. I hope you can comment on this, Thank you, Kaat -----Original Message----- From: Gordon K Smyth [mailto:smyth@wehi.EDU.AU] Sent: zondag 24 juni 2012 2:42 To: Kaat De Cremer Cc: Bioconductor mailing list; Mark Robinson Subject: design matrix edge R pairwise comparison at different time points after infection with replicates Hi Kaat, I'll jump in and continue on from Mark's help. To test for treatment effects separately at each time, the easiest way is to include the terms "time+time:treat" in your model formula. I'll assume that your "tests" are independent replicates of the whole experiment. If there are batch effects associated with the tests that you need to correct for, then your complete design matrix might be: design <- model.matrix(~test+time+time:treat) This produces a design matrix with the following columns: > colnames(design) [1] "(Intercept)" "test2" "test3" "time24hpi" [5] "time48hpi" "time12hpi:treatT" "time24hpi:treatT" "time48hpi:treatT" So testing for treatment effects at each time is easy. To test for treatment effect as time 12h: fit <- glmFit(y, design) lrt <- glmLRT(y, fit, coef="time12hpi:treatT") etc. To test for treatment effect at time 24h: lrt <- glmLRT(y, fit, coef="time24hpi:treatT") and so on. Best wishes Gordon > Date: Fri, 22 Jun 2012 13:11:41 +0000 > From: Kaat De Cremer <kaat.decremer at="" biw.kuleuven.be=""> > To: Mark Robinson <mark.robinson at="" imls.uzh.ch=""> > Cc: bioconductor list <bioconductor at="" r-project.org=""> > Subject: Re: [BioC] design matrix edge R pairwise comparison at > different time points after infection with replicates > > Hi Mark, > Thank you for your suggestion, > I really appreciate your time. > > Working in R is new to me so it has been a struggle using edgeR, but I > think I managed it using only 2 factors (test and treatment). Now that > I will be including 3 factors (test, treatment and time) in one > analysis it is clear to me that I still don't understand how it works exactly. > Below you can see my workspace with the only design matrix I could > come up with, but I don't see which coefficients I should include or > which contrast vector to use in the glmLRT function to make the > comparison of control-treatment at each time point separate, ignoring > the other 2 time points. Is this possible with this design matrix? Or > is the matrix wrong for this purpose? > > > Thanks! > Kaat > > >> head(x) > 12hpi C1 12hpi C2 12hpi C3 12hpi T1 12hpi T2 12hpi T3 24hpi C1 24hpi C2 > Lsa000001.1 0 1 1 2 0 2 1 1 > Lsa000002.1 5 4 0 5 6 6 6 4 > Lsa000003.1 10 9 7 5 5 8 6 2 > Lsa000004.1 1 1 1 1 1 1 1 3 > Lsa000005.1 1 0 1 0 2 0 0 1 > Lsa000006.1 510 223 228 287 222 268 303 358 > 24hpi C3 24hpi T1 24hpi T2 24hpi T3 48hpi C1 48hpi C2 48hpi C3 48hpi T1 > Lsa000001.1 0 1 1 0 0 0 0 2 > Lsa000002.1 7 5 2 5 10 6 12 12 > Lsa000003.1 7 5 4 2 6 5 8 2 > Lsa000004.1 1 3 1 2 1 3 2 3 > Lsa000005.1 0 1 0 0 1 0 0 2 > Lsa000006.1 372 362 237 320 472 440 411 858 > 48hpi T2 48hpi T3 > Lsa000001.1 0 0 > Lsa000002.1 1 5 > Lsa000003.1 1 0 > Lsa000004.1 0 2 > Lsa000005.1 1 0 > Lsa000006.1 375 275 >> treat<-factor(c("C","C","C","T","T","T","C","C","C","T","T","T","C"," >> C","C","T","T","T")) >> test<-factor(c(1,1,2,3,1,2,3,2,3,1,2,3,1,2,3,1,2,3)) > time<-factor(c("12hpi","12hpi","12hpi","12hpi","12hpi","12hpi","24hpi" > ,"24hpi","24hpi","24hpi","24hpi","24hpi","48hpi","48hpi","48hpi","48hp > i","48hpi","48hpi")) >> y<-DGEList(counts=x,group=treat) > Calculating library sizes from column totals. >> cpm.y<-cpm(y) >> y<-y[rowSums(cpm.y>2)>=3,] >> y<-calcNormFactors(y) > design<-model.matrix(~test+treat+time) >> design > (Intercept) test2 test3 treatT time24hpi time48hpi > 1 1 0 0 0 0 0 > 2 1 1 0 0 0 0 > 3 1 0 1 0 0 0 > 4 1 0 0 1 0 0 > 5 1 1 0 1 0 0 > 6 1 0 1 1 0 0 > 7 1 0 0 0 1 0 > 8 1 1 0 0 1 0 > 9 1 0 1 0 1 0 > 10 1 0 0 1 1 0 > 11 1 1 0 1 1 0 > 12 1 0 1 1 1 0 > 13 1 0 0 0 0 1 > 14 1 1 0 0 0 1 > 15 1 0 1 0 0 1 > 16 1 0 0 1 0 1 > 17 1 1 0 1 0 1 > 18 1 0 1 1 0 1 > attr(,"assign") > [1] 0 1 1 2 3 3 > attr(,"contrasts") > attr(,"contrasts")$test > [1] "contr.treatment" > > attr(,"contrasts")$treat > [1] "contr.treatment" > > attr(,"contrasts")$time > [1] "contr.treatment" > >> y<-estimateGLMCommonDisp(y,design,verbose=TRUE) > Disp = 0.07299 , BCV = 0.2702 >> y<-estimateGLMTrendedDisp(y,design) > Loading required package: splines >> y<-estimateGLMTagwiseDisp(y,design) > Warning message: > In maximizeInterpolant(spline.pts, apl.smooth[j, ]) : > max iterations exceeded >> fit<-glmFit(y,design) > > > > > > > > -----Original Message----- > From: Mark Robinson [mailto:mark.robinson at imls.uzh.ch] > Sent: vrijdag 22 juni 2012 12:03 > To: Kaat De Cremer > Cc: bioconductor list > Subject: Re: [BioC] design matrix edge R pairwise comparison at > different time points after infection with replicates > > Hi Kaat, > > It is probably better to fit all your data with a single call to glmFit(), over all 18 samples; you can test the differences of interest trough the 'coef' or 'contrast' argument on glmLRT(). That would afford you more degrees of freedom and presumably better estimates of dispersion, and so on. > >> From your description, I can't quite figure out your design matrix. You have three factors: treatment, test and time point. First, you need to input all 18 samples and extend your 'treatment' and 'test' factor variables to have 18 values (corresponding to the columns of your table). And, then also include a time variable in your design. Some decisions might need to be made about interactions to include. > > Hope that gets you started. > > Best, > Mark > > > ---------- > Prof. Dr. Mark Robinson > Bioinformatics > Institute of Molecular Life Sciences > University of Zurich > Winterthurerstrasse 190 > 8057 Zurich > Switzerland > > v: +41 44 635 4848 > f: +41 44 635 6898 > e: mark.robinson at imls.uzh.ch > o: Y11-J-16 > w: http://tiny.cc/mrobin > > ---------- > http://www.fgcz.ch/Bioconductor2012 > > On 21.06.2012, at 11:42, Kaat De Cremer wrote: > >> Dear all, >> >> >> I am using edgeR to find genes differentially expressed between >> infected and mock-infected control plants, at 3 time points after >> infection. >> I have RNAseq data for 3 independent tests, so for every single test >> I have 6 libraries (control + infected at 3 time points). >> Having three replicates this makes 18 libraries in total. >> >> What I did until now is look at each time point separate and calculate DEgenes at that time point as shown in this script: >> >>> head(x) >> C1 C2 C3 T1 T2 T3 >> 1 0 1 2 0 0 0 >> 2 13 6 4 10 8 12 >> 3 17 16 9 10 8 11 >> 4 2 1 2 2 3 2 >> 5. 1 3 1 2 1 3 0 >> 6 958 457 438 565 429 518 >> >>> treatment<-factor(c("C","C","C","T","T","T")) >>> test<-factor(c(1,2,3,1,2,3)) >>> y<-DGEList(counts=x,group=treatment) >> Calculating library sizes from column totals. >>> cpm.y<-cpm(y) >>> y<-y[rowSums(cpm.y>2)>=3,] >>> y<-calcNormFactors(y) >>> design<-model.matrix(~test+treat) >>> y<-estimateGLMCommonDisp(y,design,verbose=TRUE) >> Disp = 0.0265 , BCV = 0.1628 >>> y<-estimateGLMTrendedDisp(y,design) >> Loading required package: splines >>> y<-estimateGLMTagwiseDisp(y,design) >>> fit<-glmFit(y,design) >>> lrt<-glmLRT(y,fit) >> >> >> This works fine but I wonder if I should do the analysis of the different time points all at once? Will this make a difference? >> Unfortunately I cannot figure out how to design the matrix. >> >> I hope someone can help me, >> >> Kaat >> ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:6}}

ADD REPLY • link 12.5 years ago Kaat De Cremer ▴ 70

0

Entering edit mode

Dear Kaat, 1) It is not generally true that you will find more DE genes analysing just one time separately rather than using all the libraries in one linear model. It is possible in principle that the later time may show more variability, and that might justify separate analysis of the different times. However I would do that only when you have a good a priori biological reason for expecting such an effect and data exploration (such as an MDS plot) confirms it. Otherwise the extra instability of dispersion estimation with a small number of libraries is not justified. I would not make such decisions merely on the basis of which analysis gives more DE genes. 2) I don't think there is yet a definitive rule regarding filtering and library sizes, but I prefer to recompute library sizes and scale normalize (calcNormFactors) after filtering. Recomputing the library sizes doesn't make a lot of difference, because scale normalization will self correct anyway. Are you normalizing your data? Best wishes Gordon On Mon, 25 Jun 2012, Kaat De Cremer wrote: > Again, > Thank you both very much for your reply. > > I have analyzed my time course data now in several different ways and > noticed some differences: > > > 1) when I analyze the data of all time points together and look for DE > genes at one time point, I find less DE genes compared to when I use > only the data of that one time point in edgeR. I assume this is because > the dispersion is larger when I include all the different time points at > once? In that case, is this the right way to go? The dispersion can be > larger at one time point compared to another I would think. > > 2) I noticed in the edgeR user's guide that in some examples you correct > the library size after filtering for low expressed genes, and in other > examples you don't. Correcting this library size gives less DE genes for > my data at all time points when I analyze all data together and then > look for DE genes at each time point. I don't see this difference when I > only look at the data of one time point in edgeR. > > I hope you can comment on this, > > > Thank you, > Kaat > > > > -----Original Message----- > From: Gordon K Smyth [mailto:smyth at wehi.EDU.AU] > Sent: zondag 24 juni 2012 2:42 > To: Kaat De Cremer > Cc: Bioconductor mailing list; Mark Robinson > Subject: design matrix edge R pairwise comparison at different time points after infection with replicates > > Hi Kaat, > > I'll jump in and continue on from Mark's help. > > To test for treatment effects separately at each time, the easiest way is to include the terms "time+time:treat" in your model formula. > > I'll assume that your "tests" are independent replicates of the whole experiment. If there are batch effects associated with the tests that you need to correct for, then your complete design matrix might be: > > design <- model.matrix(~test+time+time:treat) > > This produces a design matrix with the following columns: > > > colnames(design) > [1] "(Intercept)" "test2" "test3" "time24hpi" > [5] "time48hpi" "time12hpi:treatT" "time24hpi:treatT" "time48hpi:treatT" > > So testing for treatment effects at each time is easy. To test for treatment effect as time 12h: > > fit <- glmFit(y, design) > lrt <- glmLRT(y, fit, coef="time12hpi:treatT") > > etc. To test for treatment effect at time 24h: > > lrt <- glmLRT(y, fit, coef="time24hpi:treatT") > > and so on. > > Best wishes > Gordon > >> Date: Fri, 22 Jun 2012 13:11:41 +0000 >> From: Kaat De Cremer <kaat.decremer at="" biw.kuleuven.be=""> >> To: Mark Robinson <mark.robinson at="" imls.uzh.ch=""> >> Cc: bioconductor list <bioconductor at="" r-project.org=""> >> Subject: Re: [BioC] design matrix edge R pairwise comparison at >> different time points after infection with replicates >> >> Hi Mark, >> Thank you for your suggestion, >> I really appreciate your time. >> >> Working in R is new to me so it has been a struggle using edgeR, but I >> think I managed it using only 2 factors (test and treatment). Now that >> I will be including 3 factors (test, treatment and time) in one >> analysis it is clear to me that I still don't understand how it works exactly. > >> Below you can see my workspace with the only design matrix I could >> come up with, but I don't see which coefficients I should include or >> which contrast vector to use in the glmLRT function to make the >> comparison of control-treatment at each time point separate, ignoring >> the other 2 time points. Is this possible with this design matrix? Or >> is the matrix wrong for this purpose? >> >> >> Thanks! >> Kaat >> >> >>> head(x) >> 12hpi C1 12hpi C2 12hpi C3 12hpi T1 12hpi T2 12hpi T3 24hpi C1 24hpi C2 >> Lsa000001.1 0 1 1 2 0 2 1 1 >> Lsa000002.1 5 4 0 5 6 6 6 4 >> Lsa000003.1 10 9 7 5 5 8 6 2 >> Lsa000004.1 1 1 1 1 1 1 1 3 >> Lsa000005.1 1 0 1 0 2 0 0 1 >> Lsa000006.1 510 223 228 287 222 268 303 358 >> 24hpi C3 24hpi T1 24hpi T2 24hpi T3 48hpi C1 48hpi C2 48hpi C3 48hpi T1 >> Lsa000001.1 0 1 1 0 0 0 0 2 >> Lsa000002.1 7 5 2 5 10 6 12 12 >> Lsa000003.1 7 5 4 2 6 5 8 2 >> Lsa000004.1 1 3 1 2 1 3 2 3 >> Lsa000005.1 0 1 0 0 1 0 0 2 >> Lsa000006.1 372 362 237 320 472 440 411 858 >> 48hpi T2 48hpi T3 >> Lsa000001.1 0 0 >> Lsa000002.1 1 5 >> Lsa000003.1 1 0 >> Lsa000004.1 0 2 >> Lsa000005.1 1 0 >> Lsa000006.1 375 275 >>> treat<-factor(c("C","C","C","T","T","T","C","C","C","T","T","T","C"," >>> C","C","T","T","T")) >>> test<-factor(c(1,1,2,3,1,2,3,2,3,1,2,3,1,2,3,1,2,3)) >> time<-factor(c("12hpi","12hpi","12hpi","12hpi","12hpi","12hpi","24hpi" >> ,"24hpi","24hpi","24hpi","24hpi","24hpi","48hpi","48hpi","48hpi","48hp >> i","48hpi","48hpi")) >>> y<-DGEList(counts=x,group=treat) >> Calculating library sizes from column totals. >>> cpm.y<-cpm(y) >>> y<-y[rowSums(cpm.y>2)>=3,] >>> y<-calcNormFactors(y) >> design<-model.matrix(~test+treat+time) >>> design >> (Intercept) test2 test3 treatT time24hpi time48hpi >> 1 1 0 0 0 0 0 >> 2 1 1 0 0 0 0 >> 3 1 0 1 0 0 0 >> 4 1 0 0 1 0 0 >> 5 1 1 0 1 0 0 >> 6 1 0 1 1 0 0 >> 7 1 0 0 0 1 0 >> 8 1 1 0 0 1 0 >> 9 1 0 1 0 1 0 >> 10 1 0 0 1 1 0 >> 11 1 1 0 1 1 0 >> 12 1 0 1 1 1 0 >> 13 1 0 0 0 0 1 >> 14 1 1 0 0 0 1 >> 15 1 0 1 0 0 1 >> 16 1 0 0 1 0 1 >> 17 1 1 0 1 0 1 >> 18 1 0 1 1 0 1 >> attr(,"assign") >> [1] 0 1 1 2 3 3 >> attr(,"contrasts") >> attr(,"contrasts")$test >> [1] "contr.treatment" >> >> attr(,"contrasts")$treat >> [1] "contr.treatment" >> >> attr(,"contrasts")$time >> [1] "contr.treatment" >> >>> y<-estimateGLMCommonDisp(y,design,verbose=TRUE) >> Disp = 0.07299 , BCV = 0.2702 >>> y<-estimateGLMTrendedDisp(y,design) >> Loading required package: splines >>> y<-estimateGLMTagwiseDisp(y,design) >> Warning message: >> In maximizeInterpolant(spline.pts, apl.smooth[j, ]) : >> max iterations exceeded >>> fit<-glmFit(y,design) >> >> >> >> >> >> >> >> -----Original Message----- >> From: Mark Robinson [mailto:mark.robinson at imls.uzh.ch] >> Sent: vrijdag 22 juni 2012 12:03 >> To: Kaat De Cremer >> Cc: bioconductor list >> Subject: Re: [BioC] design matrix edge R pairwise comparison at >> different time points after infection with replicates >> >> Hi Kaat, >> >> It is probably better to fit all your data with a single call to glmFit(), over all 18 samples; you can test the differences of interest trough the 'coef' or 'contrast' argument on glmLRT(). That would afford you more degrees of freedom and presumably better estimates of dispersion, and so on. >> >>> From your description, I can't quite figure out your design matrix. You have three factors: treatment, test and time point. First, you need to input all 18 samples and extend your 'treatment' and 'test' factor variables to have 18 values (corresponding to the columns of your table). And, then also include a time variable in your design. Some decisions might need to be made about interactions to include. >> >> Hope that gets you started. >> >> Best, >> Mark >> >> >> ---------- >> Prof. Dr. Mark Robinson >> Bioinformatics >> Institute of Molecular Life Sciences >> University of Zurich >> Winterthurerstrasse 190 >> 8057 Zurich >> Switzerland >> >> v: +41 44 635 4848 >> f: +41 44 635 6898 >> e: mark.robinson at imls.uzh.ch >> o: Y11-J-16 >> w: http://tiny.cc/mrobin >> >> ---------- >> http://www.fgcz.ch/Bioconductor2012 >> >> On 21.06.2012, at 11:42, Kaat De Cremer wrote: >> >>> Dear all, >>> >>> >>> I am using edgeR to find genes differentially expressed between >>> infected and mock-infected control plants, at 3 time points after >>> infection. > >>> I have RNAseq data for 3 independent tests, so for every single test >>> I have 6 libraries (control + infected at 3 time points). > >>> Having three replicates this makes 18 libraries in total. >>> >>> What I did until now is look at each time point separate and calculate DEgenes at that time point as shown in this script: >>> >>>> head(x) >>> C1 C2 C3 T1 T2 T3 >>> 1 0 1 2 0 0 0 >>> 2 13 6 4 10 8 12 >>> 3 17 16 9 10 8 11 >>> 4 2 1 2 2 3 2 >>> 5. 1 3 1 2 1 3 0 >>> 6 958 457 438 565 429 518 >>> >>>> treatment<-factor(c("C","C","C","T","T","T")) >>>> test<-factor(c(1,2,3,1,2,3)) >>>> y<-DGEList(counts=x,group=treatment) >>> Calculating library sizes from column totals. >>>> cpm.y<-cpm(y) >>>> y<-y[rowSums(cpm.y>2)>=3,] >>>> y<-calcNormFactors(y) >>>> design<-model.matrix(~test+treat) >>>> y<-estimateGLMCommonDisp(y,design,verbose=TRUE) >>> Disp = 0.0265 , BCV = 0.1628 >>>> y<-estimateGLMTrendedDisp(y,design) >>> Loading required package: splines >>>> y<-estimateGLMTagwiseDisp(y,design) >>>> fit<-glmFit(y,design) >>>> lrt<-glmLRT(y,fit) >>> >>> >>> This works fine but I wonder if I should do the analysis of the different time points all at once? Will this make a difference? >>> Unfortunately I cannot figure out how to design the matrix. >>> >>> I hope someone can help me, >>> >>> Kaat >>> > > ______________________________________________________________________ > The information in this email is confidential and inte...{{dropped:10}}

ADD REPLY • link 12.5 years ago Gordon Smyth 52k

Login before adding your answer.