Dear Daniela,

I think you would be very well advised to seek out a statistical
bioinformatician with whom you can collaborate on an ongoing basis. A GxE
ANOVA analysis would be statistically sophisticated even if you were
analysing a simple univariate phenotypic trait. Attempting to do that
sort of analysis in the context of an RNA-Seq experiment on miRNAs is far
more difficult again. The design matrices you have created may be
correct, but that's just the start of the analysis, and there are many
layers of possible complexity.

The BCV in your experiment is so large that I feel there must be quality
issues with your data that you have not successfully dealt with. It seems
very likely, for example, that there are batch effects that you have not
yet described.

To answer some specific questions:

You might be better off with prior.df=10 instead of the default, but this has
little to do with the size of the BCV.

You ask why one variety and one stage are disappearing from your design
matrix. If you omit the "0+" in the first formula (and you should), you
will find that one vineyard will disappear as well. This is because the
number of contrasts for any factor must be one less than the number of
levels. This is a very fundamental feature of factors and model formulae
that you need to become familiar with before you can make sense of any
model formula.
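
The point about factor coding can be checked with a small sketch in base R. It rebuilds a 3 x 2 x 4 layout with two replicates per cell, using the factor level names from the target table quoted below. The use of expand.grid() and the explicit level ordering are assumptions made for illustration, not a copy of the original data frame.

```r
# Rebuild the 3 x 2 x 4 layout with two replicates per cell (48 samples).
# Level order is set explicitly so the result does not depend on the locale.
target <- expand.grid(
  Rep      = 1:2,
  Stage    = c("19b", "bc", "hv", "ps"),
  Variety  = c("CS", "SG"),
  Vineyard = c("Bol", "mont", "Ric"),
  stringsAsFactors = FALSE
)
target$Vineyard <- factor(target$Vineyard, levels = c("Bol", "mont", "Ric"))
target$Variety  <- factor(target$Variety,  levels = c("CS", "SG"))
target$Stage    <- factor(target$Stage,    levels = c("19b", "bc", "hv", "ps"))

# With an intercept, every factor is coded by (number of levels - 1) columns.
design_int <- model.matrix(~ Vineyard * Variety * Stage, data = target)

# Without an intercept, the first factor in the formula keeps all its levels,
# but the remaining factors still lose their first level.
design_noint <- model.matrix(~ 0 + Vineyard * Variety * Stage, data = target)

# Both parametrizations have 24 columns, one per Vineyard x Variety x Stage cell.
ncol(design_int)
ncol(design_noint)

# Main-effect columns in each design (interaction columns contain a ":").
main_int   <- grep(":", colnames(design_int),   value = TRUE, invert = TRUE)
main_noint <- grep(":", colnames(design_noint), value = TRUE, invert = TRUE)
main_int
main_noint
```

In both designs the columns for Variety CS and Stage 19b are absent because those are the first levels of their factors. Only the treatment of Vineyard changes when the intercept is removed.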

Your email makes no mention of library sizes or sequencing depths, but
obviously that has a fundamental effect on what is significantly different
from what.
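
As a rough illustration of why sequencing depth matters, the sketch below converts the raw counts 0, 1 versus 5, 10 (quoted from the message below) to counts per million under two sets of library sizes. The library sizes are hypothetical, since the thread does not report the real ones, and cpm_manual() is a helper defined here, not an edgeR function. edgeR's own cpm() also takes the normalization factors into account.

```r
# Raw counts for one miRNA: two replicates of sample A and two of sample B.
counts <- c(A1 = 0, A2 = 1, B1 = 5, B2 = 10)

# HYPOTHETICAL library sizes (not taken from the thread).
lib_equal   <- c(A1 = 2e6, A2 = 2e6, B1 = 2e6,  B2 = 2e6)
lib_unequal <- c(A1 = 1e6, A2 = 1e6, B1 = 10e6, B2 = 10e6)

# Counts per million: scale each count by the size of its own library.
cpm_manual <- function(counts, lib_size) counts / lib_size * 1e6

# Equal depth: sample B is clearly higher on the CPM scale.
cpm_manual(counts, lib_equal)

# Sample B sequenced ten times deeper: the apparent difference disappears.
cpm_manual(counts, lib_unequal)
```

The same raw counts can therefore support quite different conclusions depending on the depth of each library, which is why raw counts alone cannot be compared across samples.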

I think you know now how to use edgeR in principle. However, as you
probably already appreciate, deciding what is the right analysis for your
data is beyond the scope of the mailing list.

Best wishes
Gordon

On Thu, 22 Nov 2012, bioconductor-request at r-project.org wrote:
> Date: Thu, 22 Nov 2012 10:07:19 +0100
> From: Daniela Lopes Paim Pinto <d.lopespaimpinto at sssup.it>
> To: bioconductor at r-project.org
> Subject: Re: [BioC] How to design matrix on edgeR to study genotype x
> environmental interaction
> Message-ID:
>
> Dear Gordon,
>
> Thank you so much for your valuable input. I took some time to study a bit
> more so that I could consider all the aspects you pointed out. I have now
> reconsidered the analysis and started again, with data exploration of
> all 48 samples.
>
> First I filtered out the low-count reads, keeping only those with more
> than 1 cpm in at least 2 libraries (I have two replicates of each library).
> The MDS plot clearly separates one of the locations from the other two
> (dimension 1) and, less distinctly, the two varieties (dimension 2).
> The stages also seem to separate into two groups (the first two
> together, apart from the last two) but, like the varieties, not very
> distinctly. The two replicates are also consistent.
>
> With the BCV plot I could observe that reads with lower logCPM have bigger
> BCV (the BCV value was 0.5941), which leads to my first question:
>
> Should I choose a prior.df different from the default, due to this
> behavior, when estimating genewise dispersions?
>
> To proceed with the DE analysis, I tried two approaches, this time with all
> the 48 samples, as suggested.
> For both approaches, I have the following data frame:
>
>> target
> Sample Vineyard Variety Stage
> 1 1 mont CS ps
> 2 2 mont CS ps
> 3 4 mont CS bc
> 4 5 mont CS bc
> 5 7 mont CS 19b
> 6 8 mont CS 19b
> 7 10 mont CS hv
> 8 11 mont CS hv
> 9 13 mont SG ps
> 10 14 mont SG ps
> 11 16 mont SG bc
> 12 17 mont SG bc
> 13 19 mont SG 19b
> 14 20 mont SG 19b
> 15 22 mont SG hv
> 16 23 mont SG hv
> 17 25 Bol CS ps
> 18 26 Bol CS ps
> 19 28 Bol CS bc
> 20 29 Bol CS bc
> 21 31 Bol CS 19b
> 22 32 Bol CS 19b
> 23 34 Bol CS hv
> 24 35 Bol CS hv
> 25 37 Bol SG ps
> 26 38 Bol SG ps
> 27 40 Bol SG bc
> 28 41 Bol SG bc
> 29 43 Bol SG 19b
> 30 44 Bol SG 19b
> 31 46 Bol SG hv
> 32 47 Bol SG hv
> 33 49 Ric CS ps
> 34 50 Ric CS ps
> 35 52 Ric CS bc
> 36 53 Ric CS bc
> 37 55 Ric CS 19b
> 38 56 Ric CS 19b
> 39 58 Ric CS hv
> 40 59 Ric CS hv
> 41 61 Ric SG ps
> 42 62 Ric SG ps
> 43 64 Ric SG bc
> 44 65 Ric SG bc
> 45 67 Ric SG 19b
> 46 68 Ric SG 19b
> 47 70 Ric SG hv
> 48 71 Ric SG hv
>
> In the first instance, I used the full interaction formula, as in the
> following code:
>
>> DGElist <- DGEList(counts=file)
>> keep <- rowSums(cpm(DGElist) > 1) >= 2
>> DGElist <- DGElist[keep,]
>> DGElist$samples$lib.size <- colSums(DGElist$counts)
>> DGElist_norm <- calcNormFactors(DGElist)
>> design <- model.matrix(~0 + Vineyard + Variety + Stage +
>     Vineyard:Variety + Vineyard:Stage + Variety:Stage +
>     Vineyard:Variety:Stage, data=target)
>
> [or even design <- model.matrix(~0 + Vineyard*Variety*Stage, data=target),
> which gives the same result]
>
>> rownames(design) <- colnames(DGElist_norm)
>
> However, when I inspect the design matrix I see that one Variety (i.e., CS)
> and one Stage (i.e., 19b) are not present, either as individual effects or
> in the interactions.
>
> Then I moved on to the second approach, in which I create groups:
>
>> group <-
> factor(paste(target$Vineyard,target$Variety,target$Stage,sep="_"))
>> cbind(target,Group=group)
>> DGElist <- DGEList(counts=file,group=group)
>> keep <- rowSums(cpm(DGElist) > 1) >= 2
>> DGElist <- DGElist[keep,]
>> DGElist$samples$lib.size <- colSums(DGElist$counts)
>> DGElist_norm <- calcNormFactors(DGElist)
>> design <- model.matrix(~0+group, data=DGElist_norm$samples)
>> colnames(design) <- levels(group)
>
> The design matrix in this case includes all the groups, and then I proceed
> by doing:
>
>> commondisp <- estimateGLMCommonDisp(DGElist_norm, design, verbose=TRUE)
> Disp = 0.35294 , BCV = 0.5941
>> trenddisp <- estimateGLMTrendedDisp(commondisp, design)
>> tagwisedisp <- estimateGLMTagwiseDisp(trenddisp, design)
>> fit <- glmFit(tagwisedisp, design)
>> # Just some examples of the contrasts I am interested in.
>> my.contrasts <- makeContrasts(
>     CS_ps_BolvsMont = Bol_CS_ps - mont_CS_ps,
>     CS_ps_BolvsRic  = Bol_CS_ps - Ric_CS_ps,
>     Bol_ps_CSvsSG   = Bol_CS_ps - Bol_SG_ps,
>     levels=design)
>> lrt <- glmLRT(fit, contrast=my.contrasts[,"CS_ps_BolvsMont"])
>
> With this code I got results, but I am afraid that they are not very
> consistent with the data. To give one example, the DE results tell me that
> a given miRNA with 0 and 1 reads in the two replicates of one sample is
> significantly different from another sample in which this miRNA has 5 and
> 10 reads in the two replicates. But in the same set of results another
> miRNA with 4259 and 2198 reads in the two replicates of one sample is not
> significantly different from the other sample, in which this miRNA has
> 352 and 599 reads in the two replicates. In other words, 0 and 1 are
> significantly different from 5 and 10, but 4259 and 2198 are not
> significantly different from 352 and 599. With these comparisons, I am
> just trying to interpret my data based on these results.
>
> I know that the test for differential expression is not based on the
> raw reads, but I do not know exactly how it is done. Still, I expect that
> if I used the correct model to describe my data, the results would describe
> the differences consistently.
> Could you make any suggestions about my analysis? Is creating the groups as
> I showed above correct for testing all the interactions? Is there any
> explanation for the fact that one variety and one stage "disappear"
> from the design matrix when using the full interaction formula?
>
> Sorry for the long email and thank you for all the advice,
>
> Best wishes
>
> Daniela Lopes Paim Pinto
> PhD student - Agrobiosciences
> Scuola Superiore Sant'Anna, Italy
>
>> sessionInfo()
> R version 2.15.2 (2012-10-26)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
>
> locale:
> [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United
> States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
>
> [5] LC_TIME=English_United States.1252
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] edgeR_3.0.3 limma_3.14.1
>
> loaded via a namespace (and not attached):
> [1] tools_2.15.2
>
> 2012/11/11 Gordon K Smyth <smyth at wehi.edu.au>
>
>> Dear Daniela,
>>
>> What version of edgeR are you using? The posting guide asks you to
>> give sessionInfo() output so we can see package versions.
>>
>> Your code looks correct for testing an interaction, although you could
>> estimate the same interaction more directly using an interaction formula as
>> in Section 3.3.4 of the edgeR User's Guide.
>>
>> However the model you have used is correct only if all 12 samples
>> correspond to the same physiological stage. I wonder why you are not
>> analysing all the 48 samples together. I would start with data exploration
>> of all 48 samples, including exploration measures like transcript
>> filtering, library sizes, normalization factors, an MDS plot, a BCV plot,
>> and so on. The first step is to check the data quality before going on to
>> test for differential expression.
>>
>> edgeR has very high statistical power, even giving p-values smaller than I
>> would like in some cases. So if you're not getting any differential
>> expression, it is because there is none or because you have data quality
>> problems.
>>
>> Best wishes
>> Gordon
>>
>> Date: Fri, 9 Nov 2012 14:44:28 +0100
>>> From: Daniela Lopes Paim Pinto <d.lopespaimpinto at sssup.it>
>>> To: bioconductor at r-project.org
>>> Subject: Re: [BioC] How to design matrix on edgeR to study genotype x
>>> environmental interaction
>>>
>>> Dear Gordon,
>>>
>>> Thank you so much for the reference. I read the whole chapter on
>>> the models and I tried to set up the following code, considering a data
>>> frame like this:
>>>
>>> target
>>>>
>>> Sample Variety Location
>>> 1 1 CS Mont
>>> 2 2 CS Mont
>>> 3 25 CS Bol
>>> 4 26 CS Bol
>>> 5 49 CS Ric
>>> 6 50 CS Ric
>>> 7 13 SG Mont
>>> 8 14 SG Mont
>>> 9 37 SG Bol
>>> 10 38 SG Bol
>>> 11 61 SG Ric
>>> 12 62 SG Ric
>>>
>>>> group <- factor(paste(target$Variety,target$Location,sep="_"))
>>>> cbind(target,Group=group)
>>>> d <- DGEList(counts=file,group=group)
>>>> DGEnorm <- calcNormFactors(d)
>>>> design <- model.matrix(~0+group, data=DGEnorm$samples)
>>>> colnames(design) <- levels(group)
>>>>
>>>
>>> Which gave me the design matrix:
>>>
>>> design
>>>>
>>> CS_Bol CS_Mont CS_Ric SG_Bol SG_Mont SG_Ric
>>> CS_Mont 0 1 0 0 0 0
>>> CS_Mont.1 0 1 0 0 0 0
>>> CS_Bol 1 0 0 0 0 0
>>> CS_Bol.1 1 0 0 0 0 0
>>> CS_Ric 0 0 1 0 0 0
>>> CS_Ric.1 0 0 1 0 0 0
>>> SG_Mont 0 0 0 0 1 0
>>> SG_Mont.1 0 0 0 0 1 0
>>> SG_Bol 0 0 0 1 0 0
>>> SG_Bol.1 0 0 0 1 0 0
>>> SG_Ric 0 0 0 0 0 1
>>> SG_Ric.1 0 0 0 0 0 1
>>> attr(,"assign")
>>> [1] 1 1 1 1 1 1
>>> attr(,"contrasts")
>>> attr(,"contrasts")$group
>>> [1] "contr.treatment"
>>>
>>> And then I estimated the trended and tagwise dispersions and fit the model
>>> by doing:
>>>
>>>> disp.tren <- estimateGLMTrendedDisp(DGEnorm,design)
>>>> disp.tag <- estimateGLMTagwiseDisp(disp.tren,design)
>>>> fit <- glmFit(disp.tag,design)
>>>>
>>>
>>> When I made some contrasts to find DE miRNAs, for example:
>>>
>>>> my.contrasts <- makeContrasts(CS_BolvsMont = CS_Bol-CS_Mont,
>>>>     CSvsSG_BolvsMont = (CS_Bol-CS_Mont)-(SG_Bol-SG_Mont),
>>>>     levels=design)
>>>> lrt <- glmLRT(fit, contrast=my.contrasts[,"CS_BolvsMont"])
>>>>
>>>
>>> I expected to find DE miRNAs due to the environment effect (CS_BolvsMont)
>>> and, for example, DE miRNAs due to the genotype x environment interaction
>>> (CSvsSG_BolvsMont).
>>>
>>> However, the results do not seem to reflect this, since I did not get even
>>> a single DE miRNA with significant FDR (even less than 20%!!!!) and, going
>>> back to the counts in the raw data, I find reasonable differences in their
>>> expression, which was expected. I forgot to mention that I decided to
>>> consider stage by stage separately and not add one more factor to the
>>> model, since I am not interested, for the moment, in the time course (as I
>>> wrote in the previous email - see below).
>>>
>>> Could you (or anybody else from the list) give me some advice regarding
>>> the code? Is this matrix appropriate for the kind of comparisons I am
>>> interested in?
>>>
>>> Thank you in advance for any input.
>>>
>>> Daniela
>>>
>>>
>>>
>>>
>>> 2012/10/30 Gordon K Smyth <smyth at wehi.edu.au>
>>>
>>>> Dear Daniela,
>>>>
>>>> edgeR can work with any design matrix. Just set up your interaction
>>>> model using a standard R model formula. See for example Chapter 11 of:
>>>>
>>>> http://cran.r-project.org/doc/manuals/R-intro.pdf
>>>>
>>>> Best wishes
>>>> Gordon
>>>>
>>>> Date: Mon, 29 Oct 2012 16:24:31 +0100
>>>>
>>>>> From: Daniela Lopes Paim Pinto <d.lopespaimpinto at sssup.it>
>>>>> To: bioconductor at r-project.org
>>>>> Subject: [BioC] How to design matrix on edgeR to study genotype x
>>>>> environmental interaction
>>>>>
>>>>> Dear all,
>>>>>
>>>>> I'm currently working with data coming from deep sequencing of 48 small
>>>>> RNA libraries and using edgeR to identify DE miRNAs. I could not figure
>>>>> out how to design my matrix for the following experimental design:
>>>>>
>>>>> I have 2 varieties (genotypes), cultivated in 3 different locations
>>>>> (environments) and collected at 4 physiological stages. None of them
>>>>> represents a control treatment. I'm particularly interested in
>>>>> identifying those miRNAs which modulate their expression depending on
>>>>> genotype (G), environment (E) and G x E interaction: for instance, the
>>>>> same variety in the 3 different locations, both varieties in the same
>>>>> location, and both varieties in the 3 different locations.
>>>>>
>>>>> I was wondering if I could use Section 3.3 of the edgeR User's Guide as
>>>>> a reference, or if someone could suggest any alternative method.
>>>>>
>>>>> Thanks in advance
>>>>>
>>>>> Daniela
>>>>>
>>>>>