Hi Simon,
edgeR does take into account the amount of biological variation within
groups, and it does de-prioritize genes that are inconsistent within
groups, although it seems not as strongly as you'd like in your case.

Here are some quick solutions. First, the filtering you've done sounds
good, but I would require a minimum cpm for at least ten samples for your
experiment, rather than eight. That's because both of your groups are of
size ten. From your description, if you filter genes that fail to achieve
at least 2 cpm in >= 10 samples, that may take care of the one-offs.
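
The filter described above can be sketched in R roughly as follows. This
is only an illustration on simulated counts, not the data from the
question: the matrix `counts`, its dimensions and the planted one-off
gene are assumptions made up for the example. cpm is computed by hand so
the snippet runs in base R; edgeR's cpm() could be used in its place.

```r
set.seed(1)

# Hypothetical stand-in data: 200 genes, 20 samples (10 per group)
n_genes <- 200
n_samples <- 20
counts <- matrix(rnbinom(n_genes * n_samples, mu = 100, size = 5),
                 nrow = n_genes)

# Hypothetical one-off: gene 1 is silent except for one large value
counts[1, ] <- 0
counts[1, 3] <- 1000

# Counts per million, using the column totals as library sizes
lib_size <- colSums(counts)
cpm_mat <- t(t(counts) / lib_size) * 1e6

# Keep genes that reach at least 2 cpm in at least 10 samples
keep <- rowSums(cpm_mat >= 2) >= 10
counts_filtered <- counts[keep, , drop = FALSE]
```

With the threshold set to the group size, a gene whose signal comes from
a single sample cannot pass, whatever its count in that sample.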

Second, edgeR (unlike limma) doesn't have the ability to automatically
adapt the degree of empirical Bayes smoothing, but you can adjust it
yourself. The default prior degrees of freedom for the edgeR empirical
Bayes procedure is set at 20. You might need a smaller value, perhaps a
lot smaller. Try prior df of 2, say, which you can achieve by setting
prior.n=2/18 when you run estimateTagwiseDisp(). The smaller you make
this value, the more strongly edgeR will down-weight genes that are
inconsistent within replicates.
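
The value 2/18 can be reproduced with a little arithmetic. The sketch
below assumes that prior.n is the prior df expressed in units of the
residual df (20 samples minus 2 groups = 18); that is how the 2/18 is
read here, and the variable names are made up for the example. The
edgeR call is left commented out because it needs a DGEList object, and
the argument name may differ between edgeR releases (check
?estimateTagwiseDisp for the installed version).

```r
n_samples <- 20                            # 10 control + 10 experimental
n_groups <- 2
residual_df <- n_samples - n_groups        # 18

target_prior_df <- 2
prior_n <- target_prior_df / residual_df   # 2/18, the suggested setting

default_prior_n <- 20 / residual_df        # same conversion for the default prior df of 20

# Call as described above (assumes edgeR is loaded and y is a DGEList):
# y <- estimateTagwiseDisp(y, prior.n = prior_n)
```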

A more radical solution would be to use edgeR's glm pipeline, and to use
glmQLFTest() in place of the more usual glmLRT(). In this quasi glm
pipeline, estimateGLMTagwiseDisp() is omitted, and instead edgeR calls
limma functions to do the empirical Bayes shrinkage, meaning that the
prior df is estimated rather than preset. This also provides a more
conservative statistical test that fully takes into account the
uncertainty with which the dispersion is estimated. This pipeline will
strongly de-prioritize genes that are inconsistent within replicates.
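
A rough sketch of a quasi-likelihood run is given below. It is written
against more recent edgeR releases, where the fit and the test are split
into glmQLFit() and glmQLFTest() and dispersions come from
estimateDisp(); the release discussed in this thread may use different
calls, so treat the function sequence as an assumption. The counts are
simulated and the object names are made up for the example.

```r
library(edgeR)  # Bioconductor package; must be installed

set.seed(1)

# Simulated stand-in data: 1000 genes, 10 control + 10 experimental
counts <- matrix(rnbinom(1000 * 20, mu = 50, size = 5), nrow = 1000)
rownames(counts) <- paste0("gene", seq_len(nrow(counts)))
group <- factor(rep(c("control", "experimental"), each = 10))

y <- DGEList(counts = counts, group = group)
y <- calcNormFactors(y)
design <- model.matrix(~ group)

y <- estimateDisp(y, design)      # dispersion estimates; the QL fit uses the trend
fit <- glmQLFit(y, design)        # empirical Bayes shrinkage of QL dispersions
qlf <- glmQLFTest(fit, coef = 2)  # QL F-test, experimental vs control

fit$df.prior                      # prior df estimated from the data
topTags(qlf)                      # top-ranked genes
```

A small estimated prior df indicates that gene-wise dispersions are being
shrunk only weakly, so genes that are inconsistent within replicates keep
their large dispersions and are ranked lower.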

Finally, you could consider removing outlier genes manually. There are a
few ways to do that. We always look at plotBCV() plots of the estimated
dispersions, and sometimes if there are obvious outliers we will identify
and filter them out. If you have a small percentage of extreme outliers,
this is the way to go.

Best wishes
Gordon
> Date: Mon, 7 May 2012 12:19:19 -0700
> From: Simon Melov <smelov at buckinstitute.org>
> To: "bioconductor at r-project.org" <bioconductor at r-project.org>
> Subject: [BioC] edgeR outlier question
>
> I have a reasonable RNASeq data set of 10 biological replicates of a
> control group versus 10 biological replicates experimental. I've gone
> through the edgeR workflow, and get a nice list of about 1000 genes
> differentially expressed due to the experimental manipulation. I input
> the data based on total reads per gene (I'd like to get to exons too,
> but first things first). The data is obtained via a paired end strategy,
> so it's pretty good quality. The number of reads per sample (library) is
> about 10 million reads each. My question is, as I go through list of
> significant genes which are differentially expressed between the two
> groups (normalized via the workflow), ranked by BH FDR down to 0.05, I
> see genes being judged as differentially expressed which have very low
> expression in most samples, yet are thrown off by 1 or 2 values, thereby
> achieving statistical significance. For example, a gene might have
> between 1 and 2 counts per million reads in one group, and be basically
> the same in the other group, but one of the values is perhaps at a
> 1000 or so counts, which seems to throw off the entire group, thereby
> becoming "significant".
>
> Shouldn't edgeR take into account this sort of biological variation
> within a group and account for it in assessing significance? It's clear
> that in the above example, that sample is an outlier, and therefore the
> variance is so high, so it shouldn't be ranked as being differentially
> expressed. I filtered the data by applying the criteria of at least 1
> count per sample, and I have to have at least 8 samples per group which
> have this. Should there be an additional filtering criteria to exclude
> these outliers? or doesn't edgeR take into account this sort of
> situation (I thought it did).
>
> Am I doing something wrong here?