Susan-
Thank you for weighing in. I actually learned a good deal from your
recently submitted paper, Waste Not, Want Not:Why rarefying microbiome
data is inadmissible, available on Joey McMurdies website (for anyone
else interested). Reading that manuscript is actually what turned me
on to using DESeq2 for variance stabilization in the first place. I
appreciate your groups work in this area as often the techniques used
in RNA-seq are not easily transferred to metagenomic studies.
I agree that my particular study design makes statistical analyses
much more difficult, which we hope to rectify by sequencing further
samples across multiple years.
Thank you for your insights.
Kristina
------------------------------------------------------------------
Kristina Fontanez, Postdoctoral Fellow
fontanez@mit.edu<mailto:fontanez@mit.edu>
Massachusetts Institute of Technology
Department of Civil and Environmental Engineering
48-120E
15 Vassar Street
Cambridge, MA 02139
On Jan 24, 2014, at 11:36 AM, Susan Holmes
<susanatstat@gmail.com<mailto:susanatstat@gmail.com>> wrote:
Dear Kristina,
I don't quite know whether the BioConductor users group is the right
forum for this interesting scientific
discussion, the problem with your data is that you do not have
replicates and the varying library
sizes give a heteroscedasticity that reduces the power with which you
could try to compare
the different abundances. All the BioC people can do is help you fix
your code, not your experiment.
It's not because you fix the radio that the news will be good.
With Joey McMurdie, we are putting alot of work in making this passage
to standard normalization
procedures more understandable for biologists and your input is
valuable, however your experiment
seems to be have a design flaw for what you want to do.
I still believe that normalization the data will give a much more
reasonable scale of comparison
than taking the raw data. Data transformations are always arbitrary,
but the original measurements
scales are also. I think reading more thoroughly a chapter on the
Delta method and variance stabilization
in a book like John Rice's Introduction to Mathematical Statistics and
Data Analysis will enable to get a better handle on the transformation
and variance stabilization issues which are often encountered in these
settings.
Good Luck
Susan
Susan Holmes
Professor, Statistics and BioX
John Henry Samter University Fellow
in Undergraduate Education
Director, Mathematical and Computational Sciences
Stanford
http://www-stat.stanford.edu/~susan/
On Fri, Jan 24, 2014 at 5:12 PM, Kristina M Fontanez
<fontanez@mit.edu<mailto:fontanez@mit.edu>> wrote:
Dear Bioconductors-
After previous discussions on this list, I had settled on using
regularized log transformed counts in order to present relative
abundance heatmaps/barcharts of my metagenomic data. As this
transformation accounts for both differing library sizes and
stabilizes the variance among counts, it seemed an appropriate choice.
However, now I am running into the issue that the transformed data
does not make biological sense. For example, below is a small table
with 12 samples and the OTU counts for one taxa which have been either
rlog transformed (using the blind option to ignore experimental
design) or represent the raw counts. This particular taxa is one of
the most abundant in my dataset. As you can see, the rlog transformed
counts are much more stable across samples than the raw counts. This
is exactly the expected behavior of the rlog transformation in that
the variance of counts among samples has been reduced and the library
sizes are now much more similar.
The problem is that the results do not make biological sense. This
particular taxa is a copiotroph and so grows to high abundance in the
presence of short bursts of nutrients. That kind of burst of growth is
expected in samples 1,3,5, and 7, however samples 9-12 represent
seawater samples and so they are nutrient-poor (relatively). This taxa
ought to be in low abundance in those samples, which is the case in
the raw counts. However, the rlog transformation inflates their counts
so that they are now similar to those found in all other samples. This
result obscures what is likely the true abundance of this taxa in the
water column by inflating it beyond what is biologically reasonable
given the environment in which these samples were collected.
sample1 sample2 sample3 sample4 sample5 sample6 sample7
sample8 sample9 sample10 sample11 sample12
OTU1 rlog trans 136,280 93,429 131,635
93,351 120,957 93,334 120,358 93,317
93,304 93,310 93,308 93,322
OTU1 raw 468,459 1,190 168,280 330
237,248 94 232,483 195 249 1,647 331
501
With expression data, it seems appropriate to assume that in a
treatment vs control scenario, all genes are likely to be expressed at
some level (if even a low level) in all samples. However, I do not
think that holds true for the abundance of taxa, especially in a case
like mine where the samples are not so much different treatments as
different ways of collecting microbes. It very well may be that the
abundance of a particular taxa drops to a few hundred cells in any
given sample because it is simply outcompeted by other microbes - that
is a distinctly plausible biological event. The variance stabilization
appears to mute that effect thereby making it appear that all taxa
appear in all samples at relatively similar abundances.
Given these results, is rlog transformation really an appropriate
choice when comparing metagenomic data? The variance stabilization
seems to drastically change the biological patterns in the dataset.
Thanks,
Kristina
------------------------------------------------------------------
Kristina Fontanez, Postdoctoral Fellow
fontanez@mit.edu<mailto:fontanez@mit.edu><mailto:fontanez@mit.edu<mail to:fontanez@mit.edu="">>
Massachusetts Institute of Technology
Department of Civil and Environmental Engineering
48-120E
15 Vassar Street
Cambridge, MA 02139
[[alternative HTML version deleted]]
_______________________________________________
Bioconductor mailing list
Bioconductor@r-project.org<mailto:bioconductor@r-project.org>
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
[[alternative HTML version deleted]]