I'm dealing with a factorial RNA-seq data set in which cells have been
stimulated with various combinations of extra-cellular cues. As such,
I was interested in applying the GLM framework in edgeR to assess the
contribution of each extra-cellular cue to the differential expression
of certain genes. My concern, however, is that both the expression
level and the dispersion of each gene vary greatly with the
combination of cues. edgeR doesn't seem to estimate condition-specific
dispersions but rather one dispersion per gene (if the tagwise option
is used). My question is therefore two-fold:
1) Does it make sense to want to estimate condition-specific
dispersions?
2) Is there a way to modify the edgeR framework so that it does this?
Thanks
Thomas
Hi Thomas,
A couple of thoughts below.
On 02.10.2012, at 19:15, Thomas Frederick Willems wrote:
> 1) Does it make sense to want to estimate condition-specific
> dispersions?
Maybe. I haven't seen too much evidence of this in data I've
analyzed. Maybe you could show a compelling example?
> 2) Is there a way to modify the edgeR framework so that it does
> this?
It's not so easy. Unless I'm mistaken, the standard likelihood ratio
test isn't able to handle this setting. A conservative approach would
be to estimate the dispersions using the more-variable state, and use
these in the DE analysis. But then your dispersion estimates may be
less accurate (they use less data), so it might not buy you much in
the end.
A recent paper shows an extension that might be able to handle this
more general situation, but I haven't figured out all the details yet:
http://biostatistics.oxfordjournals.org/content/early/2012/09/16/biostatistics.kxs031.short
Hope that helps.
Best, Mark
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
Dear Thomas,
It does make sense to estimate condition-specific dispersions, but
most of the time it isn't worthwhile to do so, and the only penalty
for not doing so when you could have is some loss of statistical power
(fewer DE genes).
It makes sense when a perturbed condition is more variable than a
'normal' condition, for example cancer tumour vs normal tissue, or
knockout vs wildtype. For it to be worthwhile, there must be a
substantial difference in variability between conditions and a
relatively large number of replicate samples in each group. It is
almost certainly not worthwhile if you only have 2-3 replicates in
each condition.
I wonder how you have established that the dispersion varies with the
combination of cues? By running edgeR separately on different
conditions? Otherwise you might be examining standard deviations
rather than dispersions, and they are not the same thing.
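
To make the distinction concrete, here is a minimal sketch of a crude
method-of-moments dispersion estimate computed separately per
condition. It assumes the negative binomial variance function
var = mu + phi * mu^2 and equal library sizes within a condition. It
is not edgeR's estimator; the function name and the counts are
hypothetical, chosen only for illustration.

```python
import statistics


def moment_dispersion(counts):
    """Crude moment estimate of NB dispersion: (s^2 - mean) / mean^2.

    Assumes all replicates have the same library size. Estimates are
    floored at zero, since a negative dispersion is not meaningful.
    """
    mean = statistics.mean(counts)
    var = statistics.variance(counts)  # sample variance (n - 1 denominator)
    return max((var - mean) / mean ** 2, 0.0)


# Hypothetical replicate counts for one gene in two conditions.
condition_a = [40, 50, 60]
condition_b = [160, 200, 240]

for name, counts in (("A", condition_a), ("B", condition_b)):
    sd = statistics.stdev(counts)
    phi = moment_dispersion(counts)
    print(f"condition {name}: sd = {sd:.1f}, dispersion = {phi:.3f}")
```

In this toy example the standard deviation differs fourfold between
the two conditions, while the dispersion estimates stay within a
factor of two of each other. With only three replicates, such
per-condition estimates are very noisy in any case.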
Is the sequencing depth similar between the different conditions? If
the library sizes are different, then edgeR will assign different
variances to different observations, even though the dispersions might
be the same.
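
The library-size effect can be sketched as follows, assuming a
negative binomial model in which the expected count is the library
size times the gene's relative abundance and the variance is
mu + phi * mu^2. The function name and the numbers are hypothetical,
for illustration only.

```python
def nb_variance(library_size, proportion, dispersion):
    """Variance of an NB count whose mean is library_size * proportion."""
    mu = library_size * proportion
    return mu + dispersion * mu ** 2


# Hypothetical values: the same gene and the same dispersion,
# sequenced at two different depths.
proportion = 1e-5   # gene's share of all reads
dispersion = 0.1    # identical in both libraries

for library_size in (5_000_000, 20_000_000):
    mu = library_size * proportion
    var = nb_variance(library_size, proportion, dispersion)
    print(f"library size {library_size}: mean = {mu:.0f}, variance = {var:.0f}")
```

The deeper library has a much larger variance on the count scale even
though the dispersion is unchanged, so differences in raw variability
between conditions do not by themselves indicate different
dispersions.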
Anyway, edgeR is limited to estimating the dispersion at the gene
level. It cannot be easily modified to estimate the dispersion on a
condition-specific basis.
On the other hand, voom (a function in the limma package) estimates
observation-specific dispersions, and can be easily modified to do so
in a condition-specific manner. This is part of the work of Charity
Law, who is currently writing up her PhD thesis. If you really need to
go in this direction, I can show you how to do so using voom.
Best wishes
Gordon
> Date: Tue, 2 Oct 2012 17:15:47 +0000
> From: Thomas Frederick Willems <twillems at mit.edu>
> To: "bioconductor at stat.math.ethz.ch" <bioconductor at stat.math.ethz.ch>
> Subject: [BioC] EdgeR condition-specific dispersion
Dear Thomas,
If you have 10 or more samples per condition, you could try the
tweeDEseq package, which is based on a more flexible family of count
data distributions, the Poisson-Tweedie, and will estimate different
dispersions and shapes per condition. The shape is a third parameter
that provides additional flexibility over the negative binomial to
fit distributions with features such as heavy tails or
zero-inflation.
cheers,
robert.
--
Robert Castelo, PhD
Associate Professor
Dept. of Experimental and Health Sciences
Universitat Pompeu Fabra (UPF)
Barcelona Biomedical Research Park (PRBB)
Dr Aiguader 88
E-08003 Barcelona, Spain
telf: +34.933.160.514
fax: +34.933.160.550