**I am including an earlier message which I neglected to cc to the
Bioconductor forum, pasted below**
Mike-
Just one more short, related question - why is it that some of the
rlogTransformed counts have negative values? I can certainly just add
1 to all of the values in order to make plotting relative abundance
easier, but it seems strange that there should be negative values at
all.
Thanks,
Kristina
------------------------------------------------------------------
Kristina Fontanez, Postdoctoral Fellow
fontanez@mit.edu<mailto:fontanez@mit.edu>
Massachusetts Institute of Technology
Department of Civil and Environmental Engineering
48-120E
15 Vassar Street
Cambridge, MA 02139
On Jan 16, 2014, at 5:04 PM, Kristina Fontanez
<fontanez@mit.edu<mailto:fontanez@mit.edu>> wrote:
Mike-
I want to make sure I understand your point here correctly.
Instead of using rlogTransformation, I am simply using the DESeq
function to do the differential expression analysis and then
subsequently exporting the normalized counts. An example code would
be:
> normalizedcounts<-counts(DEseqobject,normalized=TRUE)
> write.csv(normalizedcounts,file="Normalizedcounts.csvâ)
In this case, the normalized counts use the design formula associated
with the DEseqobject. In my case, that means the dispersion was
estimated by treating depths as biological replications (design
~TREATMENT) as opposed to with design ~ 1. However, Simon mentioned in
a previous post that I should be using the rlog-transformed data to
make relative abundance heatmaps/barcharts and NOT the normalized
counts, which only take into account sequencing depth.
So, that gets me back to the place where I am trying to use
rlogTransformation to export a matrix of rlog transformed values. I
was able to do this successfully but I found that the rlog-transformed
values created from 2 different DESeq objects with different designs
resulted in slightly different count tables despite using
rlogTransformation(DEseqobject, blind=TRUE) which should have ignored
the designs. The values are different but only if you go out to
several decimal places.
For example:
Design ~ Treatment
Count value for OTU 1 in sample 1 is
3.096214077
Design ~ Treatment + Depth
Count value for OTU 1 in sample 1 is
3.093644372
Why are these values different at all if blind was set to TRUE?
Thanks,
Kristina
------------------------------------------------------------------
Kristina Fontanez, Postdoctoral Fellow
fontanez@mit.edu<mailto:fontanez@mit.edu>
Massachusetts Institute of Technology
Department of Civil and Environmental Engineering
48-120E
15 Vassar Street
Cambridge, MA 02139
On Jan 13, 2014, at 5:27 PM, Michael Love
<michaelisaiahlove@gmail.com<mailto:michaelisaiahlove@gmail.com>>
wrote:
hi Kristina,
In the vignette we show the use of the rlogTransformation() function
which takes care of everything for the user:
The two functions return SummarizedExperiment objects, as the data are
no longer counts. The assay function is used to extract the matrix of
normalized values.
rld <- rlogTransformation(dds, blind=TRUE)
This by default estimates the dispersion with design of ~ 1 (as the
'blind' argument defaults to TRUE). We recommend in general estimating
dispersion with ~ 1, so that the transformation is entirely unaware of
the experimental design (though using the experimental design would
only inform or alter the transformation in so far as the estimation of
the dispersion-mean trend, so it's not a huge concern). rlogData is
the lower-level function called by rlogTransformation, after
everything has been "taken care of" for the user, i.e., checking if
there are size factors, if not estimating them, etc.
if you are surprised by the output of a function, it's always good to
check the help page for the function under the Value section. For
?rlogTransformation and ?rlogData.
Value
for rlogTransformation, a SummarizedExperiment with assay data
elements equal to log2 (qij ) = xj.βi, see formula at DESeq. for
rlogData, a matrix of the same dimension as the count data, containing
the transformed values.
Mike
On Mon, Jan 13, 2014 at 4:08 PM, Kristina M Fontanez
<fontanez@mit.edu<mailto:fontanez@mit.edu>> wrote:
Simon-
Thank you for your insights. Replicates arenât an option often in
marine metagenomics because many of us arenât working at depths
where dropping a bucket is not a feasible option. Notably, we are
working with medium-term deployments of in situ incubations where the
number of sampling spots was extremely limited. Itâs a first pass at
a difficult issue in marine microbiology but until we have the
physical infrastructure (sampling mechanisms) to get a large number of
samples, we often sacrifice replicates for better coverage across
depths. Looking forward, replicates is definitely a goal.
The treatments are in situ incubations representing live microbes,
poisoned microbes (dead) and surrounding seawater (not an incubation,
can be thought of as a control).
Your point about guessing the dispersions is a difficult one for me to
agree with. First, the differences between the treatments are likely
to far exceed the differences within the depths. Preliminary analyses
in bayseq where samples are grouped by treatment bear this point out.
So, I think itâs reasonable to combine the dispersions across depths
so that I can compare live and dead treatments. Second, as you point
out, guessing the appropriate dispersion makes the data difficult to
publish given I wonât have a good reason to argue for any value.
These particular marine metagenomes would be the first of their kind
so there really isnât a good reference point.
Are you following some code example here? Because it looks that you
are
mixing up two. The result of 'rlogData' is a matrix, you don't need
'assay' to extract the matrix. Simply save the content of 'rlogData'
into a CSV file, or pass the variable as is it to the heatmap
function,
or do something else with it. You need to use 'assay' if you have used
'rlogTransformation' instead of 'rlogDataâ.
/snip
As you suggest, I would prefer to make relative abundance bar
charts/heatmaps using regularized log transformed counts while
accounting for varying sequencing depth. This was my Option 2 where I
tried to follow a code example I found on the Bioconductor forums<http s:="" stat.ethz.ch="" pipermail="" bioconductor="" 2013-april="" 051878.html="">
relating to use of data without replicates, which used design ~ 1.
Based on your insights about exporting directly as .csv (and a
previous poster - thank you Steve), I plan to use the below code to
get counts normalized by sequencing depth and regularized log
transformed. I will estimate the dispersions by pooling the treatments
across depths. One remaining question is whether I need the design ~ 1
option. Iâm still not clear on that.
design(Deptreat)<- ~1
Deptreat
class: DESeqDataSet
dim: 1064 12
exptData(0):
assays(1): counts
rownames(1064): OTU1 OTU2 ... OTU1063 OTU1064
rowData metadata column names(0):
colnames(12): HD5_150L HD5_150D ... HD5_500D HD9_SW_500_22
colData names(5): Treatment Depth Cmg.m2.day Nmg.m2.day Pmg.m2.day
Deptreat<-estimateSizeFactors(Deptreat)
Deptreat<-estimateDispersions(Deptreat,fitType="local")
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
>Deptreat_rlog=rlogData(Deptreat)
>write.csv(Deptreat_rlog,âDeptreat_rlog_counts.csvâ)
Thank you again,
Kristina
------------------------------------------------------------------
Kristina Fontanez, Postdoctoral Fellow
fontanez@mit.edu<mailto:fontanez@mit.edu><mailto:fontanez@mit.edu<mail to:fontanez@mit.edu="">>
Massachusetts Institute of Technology
Department of Civil and Environmental Engineering
48-120E
15 Vassar Street
Cambridge, MA 02139
On Jan 13, 2014, at 3:25 PM, Simon Anders <anders@embl.de<mailto:ander s@embl.de=""><mailto:anders@embl.de<mailto:anders@embl.de>>> wrote:
Hi
On 13/01/14 20:08, Kristina Fontanez [guest] wrote:
I am working with a metagenomic dataset in DESeq2 that does not have
true biological replicates but has multiple factors and levels. Due
to the difficulty in obtaining these samples, replicates were not an
option. The study design consists of seawater microbes collected at 4
depths.
Out of genuine curiosity: Why are replicates not an option? Maybe I
have
a wrong idea about metagenomics, but I now imagine that you went
somewhere with a ship and lowered a bucket (or, maybe, a more
sophisticated sample-taking apparatus ;-) ) four times, each time to a
different depth, to take a sample. Why would it have been
prohibitively
expensive to let down the bucket twice for each depth? Even if you do
all sample takings at the very same spot, you should get a reasonable
estimate for variation within a depth layer, because your boat
probably
drifts a few hundred metres anyway in the time.
For each depth, there are 3 treatments (live, dead and none).
Can you explain more what treatments these are?
I expect there to be true biological differences in abundance both
among depths and among treatments. So, combining the samples across
depths is not a great option because I expect there to be some
variation in abundance of taxa among depths. The same holds true for
combining samples across treatments. However, if I have to choose one
I would expect there to be less variation among depths than among
treatments.
The goal is to assess the differences in abundance among treatments
and among depths. Ultimately, I would like to create relative
abundance heat maps and bar charts displaying the taxa which are
differentially abundant among treatments and among depths. I would
also like to cluster the samples using PCoA or nMDS to see the
effects of treatment and depth on sample similarity.
Is it possible to normalize these data (using variance-stabilized
normalization or regularized log transformation) without choosing a
study design? In my perusing of the mailing list, one suggestion was
to use a design ~ 1 to estimate the dispersions so that the
dispersions treat all samples as replicates. In my case, that is not
an ideal choice due to the expected variation in abundance of taxa
among depths and treatments.
The normalization does not use a study design, nor does the VST or
rlog
transform. (Note that the term "normalization" here only refers to
accounting for sequencing depth, and this has little to do with study
design.) The respective DESeq2 functions all ignore the 'design'
argument.
1) Make relative abundance heat maps and bar charts based on simple
counts/library size proportions.
You should base them on the rlog-transformed data, not on normalized
counts.
Later, to do differential abundance
tests, pool samples across depths to compare the 3 treatments (4
replicates per treatment), calculating the dispersion using
~Treatment as the study design.
I would be surprised if you find much this way. I don't know what the
treatments are but "live" and "dead" sounds like rather dramatic
differences, giving rise to very high dispersion estimates. If you are
lucky, you get some results nevertheless.
In the end, however, the best option for non-replicated data is to
_guess_ a good value for the dispersion instead of estimating it. For
example, instead of the usual "dds <- estimateDispersions(dds)", you
write "dispersions(dds) <- 0.1" to set the dispersion value for all
genes to your best guess, in this example 0.1.
Where to get the guess from? Well, hopefully you have some idea how
much
these counts typically vary between two samples taken from the depth
and
treated the same manner. If you think, they typically vary by 50%,
then
set the dispersion to 0.5^2=0.25. (If you don't have any idea at all,
then you really should have tried to find out experimentally.)
Of course, this approach may only be used to analyse preliminary data.
After all, if you mention in a paper that you used a mere guess, the
reviewers will (hopefully) ask how you arrived at it. As you don't
have
replicates, you cannot argue from your data that your guess is a
reasonable value. It may also be hard to argue from previously
published
data, because metagenomics is too new a field to have much to compare
with, and you cannot know whether your data is of as good quality as
the
one you based your guess on.
It should be clear that you cannot claim that a difference is related
to
depth or treatment if you don't know whether differences of this
magnitude could as well have been found between two samples taken at
the
same depth and treated the same way.
I have explained this in a bit more detail, because we haven't had the
topic on this mailing list for a while, and it might make sense from
time to time to remind newer readers why replicates (or better: data
from several independent samples) are so important.
2) Normalize count data using a design ~1 option in DEseq2, export
normalized counts. As far I can tell - there isnââ¬â¢t a documented
procedure to do this in manual, vignettes or on the mailing list. The
main problem here is that I canââ¬â¢t figure out a way to export
the
regularized log transformed counts.
Again, the normalization has nothing to with the study design, as it
_only_ normalizes for sequencing depth. You will get always the same
normalized counts, which you can access with
counts( dds, normalized=TRUE )
no matter what design formula you specify.
One approach to (2) for the regularized log transformation is below
and the variable Deptreat_rlog contains the regularized log
transformed count data which is the same size as the original dataset
(but the ââ¬Åcountsâ⬠matrix is inaccessible):
Are you following some code example here? Because it looks that you
are
mixing up two. The result of 'rlogData' is a matrix, you don't need
'assay' to extract the matrix. Simply save the content of 'rlogData'
into a CSV file, or pass the variable as is it to the heatmap
function,
or do something else with it. You need to use 'assay' if you have used
'rlogTransformation' instead of 'rlogData'.
Simon
_______________________________________________
Bioconductor mailing list
Bioconductor@r-project.org<mailto:bioconductor@r-project.org><mailto:b ioconductor@r-project.org<mailto:bioconductor@r-project.org="">>
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
[[alternative HTML version deleted]]
_______________________________________________
Bioconductor mailing list
Bioconductor@r-project.org<mailto:bioconductor@r-project.org>
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
[[alternative HTML version deleted]]