Question

How to design matrix on edgeR to study genotype x environmental interaction

Gordon Smyth 52k

@gordon-smyth

WEHI, Melbourne, Australia

Dear Daniela,

What version of edgeR are you using? The posting guide asks you to give sessionInfo() output so we can see package versions.

Your code looks correct for testing an interaction, although you could estimate the same interaction more directly using an interaction formula as in Section 3.3.4 of the edgeR User's Guide.

However, the model you have used is correct only if all 12 samples correspond to the same physiological stage. I wonder why you are not analysing all 48 samples together. I would start with data exploration of all 48 samples, including exploration measures like transcript filtering, library sizes, normalization factors, an MDS plot, a BCV plot, and so on. The first step is to check the data quality before going on to test for differential expression.

edgeR has very high statistical power, even giving p-values smaller than I would like in some cases. So if you're not getting any differential expression, it is because there is none or because you have data quality problems.

Best wishes
Gordon

> Date: Fri, 9 Nov 2012 14:44:28 +0100
> From: Daniela Lopes Paim Pinto <d.lopespaimpinto at sssup.it>
> To: bioconductor at r-project.org
> Subject: Re: [BioC] How to design matrix on edgeR to study genotype x environmental interaction
>
> Dear Gordon,
>
> Thank you so much for the reference. I read the whole chapter on models and tried to set up the following code, considering a data frame like this:
>
>     > target
>        Sample Variety Location
>     1       1      CS     Mont
>     2       2      CS     Mont
>     3      25      CS      Bol
>     4      26      CS      Bol
>     5      49      CS      Ric
>     6      50      CS      Ric
>     7      13      SG     Mont
>     8      14      SG     Mont
>     9      37      SG      Bol
>     10     38      SG      Bol
>     11     61      SG      Ric
>     12     62      SG      Ric
>
>     > group <- factor(paste(target$Variety, target$Location, sep="_"))
>     > cbind(target, Group=group)
>     > d <- DGEList(counts=file, group=group)
>     > DGEnorm <- calcNormFactors(d)
>     > design <- model.matrix(~0+group, data=DGEnorm$samples)
>     > colnames(design) <- levels(group)
>
> Which gave me the design matrix:
>
>     > design
>               CS_Bol CS_Mont CS_Ric SG_Bol SG_Mont SG_Ric
>     CS_Mont        0       1      0      0       0      0
>     CS_Mont.1      0       1      0      0       0      0
>     CS_Bol         1       0      0      0       0      0
>     CS_Bol.1       1       0      0      0       0      0
>     CS_Ric         0       0      1      0       0      0
>     CS_Ric.1       0       0      1      0       0      0
>     SG_Mont        0       0      0      0       1      0
>     SG_Mont.1      0       0      0      0       1      0
>     SG_Bol         0       0      0      1       0      0
>     SG_Bol.1       0       0      0      1       0      0
>     SG_Ric         0       0      0      0       0      1
>     SG_Ric.1       0       0      0      0       0      1
>     attr(,"assign")
>     [1] 1 1 1 1 1 1
>     attr(,"contrasts")
>     attr(,"contrasts")$group
>     [1] "contr.treatment"
>
> Then I estimated the trended and tagwise dispersions and fitted the model:
>
>     > disp.tren <- estimateGLMTrendedDisp(DGEnorm, design)
>     > disp.tag <- estimateGLMTagwiseDisp(disp.tren, design)
>     > fit <- glmFit(disp.tag, design)
>
> I then made some contrasts to find DE miRNAs, for example:
>
>     > my.contrasts <- makeContrasts(CS_BolvsMont = CS_Bol-CS_Mont,
>         CSvsSG_BolvsMont = (CS_Bol-CS_Mont)-(SG_Bol-SG_Mont), levels=design)
>     > lrt <- glmLRT(fit, contrast=my.contrasts[,"CS_BolvsMont"])
>
> I expected to find DE miRNAs due to the environment effect (CS_BolvsMont) and, for example, DE miRNAs due to the genotype x environment interaction (CSvsSG_BolvsMont).
>
> However, the results do not seem to reflect this, since I did not get even a single DE miRNA with significant FDR (even less than 20%!!!!), and going back to the counts in the raw data I find reasonable differences in their expression, which was expected. I forgot to mention that I decided to consider each stage separately and not add one more factor to the model, since I am not interested, for the moment, in the time course (as I wrote in the previous email - see below).
>
> Could you (or anybody else from the list) give me some advice regarding the code? Is this matrix appropriate for the kind of comparisons I am interested in?
>
> Thank you in advance for any input.
>
> Daniela
>
> 2012/10/30 Gordon K Smyth <smyth at wehi.edu.au>
>
>> Dear Daniela,
>>
>> edgeR can work with any design matrix. Just set up your interaction model using a standard R model formula. See for example Chapter 11 of:
>>
>> http://cran.r-project.org/doc/manuals/R-intro.pdf
>>
>> Best wishes
>> Gordon
>>
>>> Date: Mon, 29 Oct 2012 16:24:31 +0100
>>> From: Daniela Lopes Paim Pinto <d.lopespaimpinto at sssup.it>
>>> To: bioconductor at r-project.org
>>> Subject: [BioC] How to design matrix on edgeR to study genotype x environmental interaction
>>>
>>> Dear all,
>>>
>>> I'm currently working with data coming from deep sequencing of 48 small RNA libraries and using edgeR to identify DE miRNAs. I could not figure out how to design my matrix for the following experimental design:
>>>
>>> I have 2 varieties (genotypes), cultivated in 3 different locations (environments) and collected at 4 physiological stages. None of them represents a control treatment. I'm particularly interested in identifying those miRNAs which modulate their expression dependent on genotypes (G), environments (E) and G x E interaction. For instance, the same variety in the 3 different locations, both varieties in the same location and both varieties in the 3 different locations.
>>>
>>> I was wondering if I could use Section 3.3 of the edgeR User's Guide as a reference, or if someone could suggest any alternative method.
>>>
>>> Thanks in advance
>>>
>>> Daniela
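The point that a group-means contrast and an interaction formula estimate the same interaction can be checked with base R alone. The following is a minimal sketch, not code from the thread: the group means in `mu` are made-up values on the log scale, and the factor levels are set explicitly (an assumption) so that `Bol` and `CS` are the reference levels. It only compares the two parametrizations; it does not fit any edgeR model.

```r
# Illustrative 2 x 3 layout with two replicates per cell (12 samples).
target <- data.frame(
  Variety  = factor(rep(c("CS", "SG"), each = 6), levels = c("CS", "SG")),
  Location = factor(rep(rep(c("Mont", "Bol", "Ric"), each = 2), times = 2),
                    levels = c("Bol", "Mont", "Ric"))
)

# Parametrization 1: one coefficient per Variety_Location group.
group <- factor(paste(target$Variety, target$Location, sep = "_"))
design_group <- model.matrix(~ 0 + group)
colnames(design_group) <- levels(group)

# Parametrization 2: factorial formula with main effects and interaction.
design_fact <- model.matrix(~ Variety * Location, data = target)

# Hypothetical group means on the log scale (invented for illustration).
mu <- c(CS_Bol = 5.0, CS_Mont = 4.2, CS_Ric = 4.8,
        SG_Bol = 5.5, SG_Mont = 3.9, SG_Ric = 5.1)

# Linear predictor for every sample under the group-means design.
eta <- drop(design_group %*% mu[colnames(design_group)])

# Express the same linear predictor with the factorial design.
beta <- qr.solve(design_fact, eta)
names(beta) <- colnames(design_fact)

# Interaction written as a contrast of group means ...
interaction_from_groups <-
  unname((mu["CS_Bol"] - mu["CS_Mont"]) - (mu["SG_Bol"] - mu["SG_Mont"]))
# ... and the matching coefficient of the factorial model.
interaction_from_formula <- unname(beta["VarietySG:LocationMont"])

print(c(groups = interaction_from_groups, formula = interaction_from_formula))
```

With these reference levels, the coefficient `VarietySG:LocationMont` equals the contrast `(CS_Bol-CS_Mont)-(SG_Bol-SG_Mont)`; choosing different reference levels would change which coefficient matches and possibly its sign.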

Sequencing miRNA Normalization edgeR • 1.7k views

12.5 years ago • updated 12.4 years ago • Gordon Smyth 52k

score 0 · Answer 1 · 2012-11-22

Dear Gordon,

Thank you so much for your valuable input. I took some time to study a bit more and to consider all the aspects you pointed out. I have now reconsidered the analysis and started again, with data exploration of all 48 samples.

First I filtered out the low-count reads, keeping just the ones with more than 1 cpm in at least 2 libraries (I have two replicates of each library). The MDS plot clearly separates one of the locations from the other two (dimension 1) and, with less distinction, the two varieties (dimension 2). The stages also seem to be separated into two groups (the first two together, separate from the last two) but, like the varieties, not so distinctly. The two replicates are also consistent.

With the BCV plot I could observe that reads with lower logCPM have bigger BCV (the common BCV value was 0.5941), and this brings me to my first question:

Should I choose a *prior.df* different from the default, due to this behavior, when estimating genewise dispersion?

To proceed with the DE analysis, I tried two approaches, this time with all 48 samples, as suggested. For both approaches, I have the following data frame:

    > target
       Sample Vineyard Variety Stage
    1       1     mont      CS    ps
    2       2     mont      CS    ps
    3       4     mont      CS    bc
    4       5     mont      CS    bc
    5       7     mont      CS   19b
    6       8     mont      CS   19b
    7      10     mont      CS    hv
    8      11     mont      CS    hv
    9      13     mont      SG    ps
    10     14     mont      SG    ps
    11     16     mont      SG    bc
    12     17     mont      SG    bc
    13     19     mont      SG   19b
    14     20     mont      SG   19b
    15     22     mont      SG    hv
    16     23     mont      SG    hv
    17     25      Bol      CS    ps
    18     26      Bol      CS    ps
    19     28      Bol      CS    bc
    20     29      Bol      CS    bc
    21     31      Bol      CS   19b
    22     32      Bol      CS   19b
    23     34      Bol      CS    hv
    24     35      Bol      CS    hv
    25     37      Bol      SG    ps
    26     38      Bol      SG    ps
    27     40      Bol      SG    bc
    28     41      Bol      SG    bc
    29     43      Bol      SG   19b
    30     44      Bol      SG   19b
    31     46      Bol      SG    hv
    32     47      Bol      SG    hv
    33     49      Ric      CS    ps
    34     50      Ric      CS    ps
    35     52      Ric      CS    bc
    36     53      Ric      CS    bc
    37     55      Ric      CS   19b
    38     56      Ric      CS   19b
    39     58      Ric      CS    hv
    40     59      Ric      CS    hv
    41     61      Ric      SG    ps
    42     62      Ric      SG    ps
    43     64      Ric      SG    bc
    44     65      Ric      SG    bc
    45     67      Ric      SG   19b
    46     68      Ric      SG   19b
    47     70      Ric      SG    hv
    48     71      Ric      SG    hv

In the first approach, I used the full interaction formula, as in the following code:

    > DGElist <- DGEList(counts=file)
    > keep <- rowSums(cpm(DGElist) > 1) >= 2
    > DGElist <- DGElist[keep,]
    > DGElist$samples$lib.size <- colSums(DGElist$counts)
    > DGElist_norm <- calcNormFactors(DGElist)
    > design <- model.matrix(~0 + Vineyard + Variety + Stage + Vineyard:Variety + Vineyard:Stage + Variety:Stage + Vineyard:Variety:Stage, data=target)

[or even

    > design <- model.matrix(~0 + Vineyard*Variety*Stage, data=target)

which gives the same result]

    > rownames(design) <- colnames(DGElist_norm)

However, when I print *design* I see that one Variety (i.e., CS) and one Stage (i.e., 19b) are not present in the design matrix, either as individual effects or in the interactions.

I then moved to the second approach, in which I create groups:

    > group <- factor(paste(target$Vineyard, target$Variety, target$Stage, sep="_"))
    > cbind(target, Group=group)
    > DGElist <- DGEList(counts=file, group=group)
    > keep <- rowSums(cpm(DGElist) > 1) >= 2
    > DGElist <- DGElist[keep,]
    > DGElist$samples$lib.size <- colSums(DGElist$counts)
    > DGElist_norm <- calcNormFactors(DGElist)
    > design <- model.matrix(~0+group, data=DGElist_norm$samples)
    > colnames(design) <- levels(group)

The design matrix in this case includes all the groups, and then I proceed as follows:

    > commondisp <- estimateGLMCommonDisp(DGElist_norm, design, verbose=TRUE)
    Disp = 0.35294 , BCV = 0.5941
    > trenddisp <- estimateGLMTrendedDisp(commondisp, design)
    > tagwisedisp <- estimateGLMTagwiseDisp(trenddisp, design)
    > fit <- glmFit(tagwisedisp, design)
    > my.contrasts <- makeContrasts(CS_ps_BolvsMont = Bol_CS_ps - mont_CS_ps,
        CS_ps_BolvsRic = Bol_CS_ps - Ric_CS_ps,
        Bol_ps_CSvsSG = Bol_CS_ps - Bol_SG_ps,
        levels=design)  # Just some examples of the contrasts I am interested in.
    > lrt <- glmLRT(fit, contrast=my.contrasts[,"CS_ps_BolvsMont"])

With this code I got results, but I am afraid that they are not very consistent with the data. To give one example, the DE results tell me that a given miRNA which has 0 and 1 reads in the two replicates of one sample is significantly different from another sample in which this miRNA has 5 and 10 reads in the two replicates. Yet in the same set of results, another miRNA which has 4259 and 2198 reads in the two replicates of one sample is not significantly different from the other sample, in which this miRNA has 352 and 599 reads in the two replicates. In other words, 0 and 1 are significantly different from 5 and 10, but 4259 and 2198 are not significantly different from 352 and 599. With these comparisons, I am just trying to interpret my data based on these results.

I know that the test for differential expression is not based on the raw reads, but I do not know exactly how it is done. In any case, I expect that if I use the correct model to describe my data, the results will describe the differences consistently.

Could you make any suggestions about my analysis? Is creating the groups as I showed above correct for testing all the interactions? Is there any explanation for the fact that one variety and one stage "disappear" from the design matrix when using the full interaction formula?

Sorry for the long email and thank you for all the advice,

Best wishes

Daniela Lopes Paim Pinto
PhD student - Agrobiosciences
Scuola Superiore Sant'Anna, Italy

    > sessionInfo()
    R version 2.15.2 (2012-10-26)
    Platform: x86_64-w64-mingw32/x64 (64-bit)

    locale:
    [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
    [5] LC_TIME=English_United States.1252

    attached base packages:
    [1] stats graphics grDevices utils datasets methods base

    other attached packages:
    [1] edgeR_3.0.3 limma_3.14.1

    loaded via a namespace (and not attached):
    [1] tools_2.15.2
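The filtering rule described above (more than 1 cpm in at least 2 libraries) and the relation between the reported dispersion and BCV can be sketched in base R. This is only an illustration: the count matrix and library sizes are invented, and `cpm_manual` is a hand-rolled stand-in for a counts-per-million calculation computed from raw library sizes only, without normalization factors.

```r
# Toy count matrix: 4 hypothetical miRNAs x 4 libraries (made-up numbers).
counts <- matrix(
  c(  3,   8,  20,  15,
    900, 450, 700, 820,
      0,   0,   1,   0,
    120, 150, 130, 160),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("miR_", 1:4), paste0("lib", 1:4))
)

# Assumed sequencing depths for each library (hypothetical).
lib_size <- c(lib1 = 2e5, lib2 = 1e5, lib3 = 8e5, lib4 = 1e6)

# Counts per million: divide each column by its library size.
cpm_manual <- t(t(counts) / lib_size) * 1e6

# Keep features with CPM > 1 in at least 2 libraries.
keep <- rowSums(cpm_manual > 1) >= 2
filtered <- counts[keep, , drop = FALSE]
print(keep)

# BCV is the square root of the negative binomial dispersion,
# so the reported Disp = 0.35294 corresponds to BCV = 0.5941.
bcv <- sqrt(0.35294)
print(round(bcv, 4))
```

Because the threshold is applied on the CPM scale, the same raw count can pass the filter in a shallow library and fail it in a deep one, which is one reason raw counts alone are hard to compare across samples.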

score 0 · Answer 2 · 2012-11-23

Dear Daniela,

I think you would be very well advised to seek out a statistical bioinformatician with whom you can collaborate on an ongoing basis. A GxE anova analysis would be statistically sophisticated even if you were analysing a simple univariate phenotypic trait. Attempting to do that sort of analysis in the context of an RNA-Seq experiment on miRNAs is far more difficult again. The design matrices you have created may be correct, but that's just the start of the analysis, and there are many layers of possible complexity.

The BCV in your experiment is so large that I feel there must be quality issues with your data that you have not successfully dealt with. It seems very likely, for example, that there are batch effects that you have not yet described.

To answer some specific questions:

You might be better off with prior.df=10 instead of the default, but this has little to do with the size of the BCV.

You ask why one variety and one stage are disappearing from your design matrix. If you omit the "0+" in the first formula (and you should), you will find that one vineyard will disappear as well. This is because the number of contrasts for any factor must be one less than the number of levels. This is a very fundamental feature of factors and model formulae that you need to become familiar with before you can make sense of any model formula.

Your email makes no mention of library sizes or sequencing depths, but obviously that has a fundamental effect on what is significantly different from what.

I think you know now how to use edgeR in principle. However, as you probably already appreciate, deciding what is the right analysis for your data is beyond the scope of the mailing list.
Best wishes
Gordon
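The remark about contrasts and factor levels can be reproduced with base R's `model.matrix`. The sketch below rebuilds the 3 x 2 x 4 layout with two replicates per cell; the replicate column `Rep` and the explicit level order (chosen so that `Bol`, `CS` and `19b` are the reference levels) are assumptions made for illustration, not taken from the original data files.

```r
# Full 3 x 2 x 4 layout with two replicates per cell (48 rows).
target <- expand.grid(
  Rep      = 1:2,
  Stage    = c("ps", "bc", "19b", "hv"),
  Variety  = c("CS", "SG"),
  Vineyard = c("mont", "Bol", "Ric"),
  stringsAsFactors = FALSE
)

# Fix the level order explicitly so the result does not depend on locale sorting.
target$Vineyard <- factor(target$Vineyard, levels = c("Bol", "mont", "Ric"))
target$Variety  <- factor(target$Variety,  levels = c("CS", "SG"))
target$Stage    <- factor(target$Stage,    levels = c("19b", "bc", "hv", "ps"))

with_intercept <- model.matrix(~ Vineyard * Variety * Stage, data = target)
no_intercept   <- model.matrix(~ 0 + Vineyard * Variety * Stage, data = target)

# Both parametrizations have 3 * 2 * 4 = 24 columns, one per estimable group mean.
print(c(ncol(with_intercept), ncol(no_intercept)))

# Main-effect columns only (names without ":").
main_with <- grep(":", colnames(with_intercept), invert = TRUE, value = TRUE)
main_no   <- grep(":", colnames(no_intercept),   invert = TRUE, value = TRUE)
print(main_with)  # intercept plus 2 vineyard, 1 variety, 3 stage columns
print(main_no)    # all 3 vineyards, but still only 1 variety and 3 stage columns
```

With "0+", only the first factor in the formula keeps all of its levels; every other factor is still coded with one column fewer than it has levels, so the reference variety and reference stage are absorbed into the remaining coefficients rather than lost.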