Entering edit mode
Markus Grohme
▴
20
@markus-grohme-4974
Last seen 10.4 years ago
Dear all,
Simon Anders kindly asked me to repeat my question on this mailing
list
considering the statistical analysis of my RNA sequencing data.
I am rather new to Bioconductor and RNA sequencing analysis (molecular
biologist) and tried to read myself a bit into the statistics behind
the
analysis in DESeq and found the publication of "Storey & Tibshirani,
2003"
and tested also the R program "qvalue". In the paper he authors claim
that
that the Benjamini/Hochberg method is too conservative for most
genomics
studies. After running my analysis in DESeq I did some analysis of the
pvalues using "qvalue". How does the way the adjusted p-value is
calculated
using Benjamini/Hochberg and qvalue differ, the ones generated by
DESeq
(which uses the BH method as I understand it). The package DE"G"Seq
(Wang
et. al 2010) allows both B/H and qvalue methods to be used for adj-
pvalue
calculation. The Bourgon_2010 paper for independent filtering also
briefly
mentions qvalue (ref.23).
I am not really a statistician, not even a real bioinformatician so I
would
like to ask more experienced people dealing with this kind of data on
a
regular basis. Some of the programs have been written with microarrays
in
mind so I am not sure how this will perform on sequencing count data.
I am
also not sure how my experimental setup fits into all of this.
My experimental setup is (not standard I think):
4 biological samples no biological no technical replicate (I know far
from
perfect)
1 lane Illumina GAII 36bp single end each where i get around 7-8 Mio
reads
mapped (basis of my counts).
I do not sample over the whole transcript but sequenced in a window of
~300
bases of 3' (fragmented cDNA / PolyA selected). This means all my tags
are
clustered around the 3' of the transcript. This gives me the advantage
that
my counts are not distributed over sequences that I can't connect
informatically without a reference genome. My reference was a
transcriptome
assembly (non-model organism, Sanger/454/Illumina data). So I have
about
70.000 features as a reference but due to the nature of a
transcriptome
assembly I get some fragmentation. E.g. a transcript is represented in
3
separate contigs. Of these 3 contigs only the 3' most gets read counts
assigned. This means I am testing against 3 features of which only 1
is
capable of giving me informative results.
I know that the experiment is not the best setup. But there should be
relatively little biological variability as I pooled 200 synchronized
individuals that are also more or less clones because they reproduce
asexually. Also the 4 stages are a circular time course that is only
some
hours apart and the gene expression patterns should overlap. So the
e.g.
A->B->C->D(->A) share in part expression changes with their adjacent
sample
(D will finally end up being A again.) My interest is finding out what
is
different in B,C,D compared to A that is my normal physiological
state.
I am concerned how my overabundance of uninformative features
generates a
multiple testing correction problem due to my experimental setup. Many
tag
count rows give just 0,0,0,0 or spurious hits like 0,2,0,5 due to hits
more
5' in some fragments not connected by the assembly to the 3'. Should I
remove the rows that are just background noise? Or should I let DESeq
do
the filtering using the independent filtering option (if it is even
possible or makes sense for my data). I also attached a picture of A
vs B
and marked at 5% false discovery rate cutoff from DESeq. It seems that
the
changes are not very subtle. I understand that by not having any
replicates
my test will not have strong power to detect small differences. So
will
filtering actually help here? Also it seems that my sequencing depth
is
also on the lower end so I would need to get more data to
statistically
venture into the low expression fraction.
Does anybody have some clues how to approach this and what makes sense
and
what doesn't?
Cheers,
Markus
-------------- next part --------------
A non-text attachment was scrubbed...
Name: A_vs_B_q05.png
Type: image/png
Size: 39357 bytes
Desc: not available
URL: <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 20111124="" 744f00eb="" attachment.png="">