Hi.
I am preparing to run DESeq2 on samples and I am uncertain about how I might proceed.
My questions may be elementary: I am not a statistician and have not used R for statistical modeling or DESeq2 for differential expression analyses.
We have biological samples each of which consists of RNA from a cell type obtained using FACS. There are two biological replicates per cell type. Furthermore, the cell types belong to one of two organ classes. In order to generalize this for convenience, I label the organ classes o1 and o2, and the cell types within each class c1, c2, and c3. Additionally, the cell type replicates are r1 and r2. So I have for organ class 1, the samples
o1-c1-r1 and o1-c1-r2
o1-c2-r1 and o1-c2-r2
o1-c3-r1 and o1-c3-r2
and for organ class 2
o2-c1-r1 and o2-c1-r2
o2-c2-r1 and o2-c2-r2
o2-c3-r1 and o2-c3-r2
I want to look for differential expression between various combinations of these cell types. As examples,
case 1:
o1-c1 + o1-c2 + o1-c3 vs o2-c1 + o2-c2 + o2-c3
(expression difference between organ class 1 and organ class 2)
and
case 2:
o1-c1 vs o1-c2 + o1-c3
(expression difference between cell type 1 of organ class 1 and all other cell types of organ class 1)
and so on.
My first thought is to combine the gene coverage counts, for example for case 1, I add the counts gene-wise to create
four files of combined counts
for organ class 1:
m1-r1 = o1-c1-r1 + o1-c2-r1 + o1-c3-r1
m1-r2 = o1-c1-r2 + o1-c2-r2 + o1-c3-r2
and for organ class 2:
m2-r1 = o2-c1-r1 + o2-c2-r1 + o2-c3-r1
m2-r2 = o2-c1-r2 + o2-c2-r2 + o2-c3-r2
and run a 'standard' DESeq2 comparison m1 vs m2 in which I treat m1-r1 and m1-r2 as biological replicates for m1
and m2-r1 and m2-r2 for m2.
But this strategy concerns me because sample sequencing depths vary, sometimes substantially, and I convinced
myself that the genes in over-represented samples will dominate counts and so introduce errors in the analyses.
(Likewise, I imagine that if one were to combine physically the cells into four samples, one would need to control the number of cells of each type collected into each sample.)
So I consider normalizing the counts for each of the samples before combining them. That is, I would use DESeq2 size factors to normalize independently the samples in m1-r1, m1-r2, m2-r1, and m2-r2. My documentation and google search reading suggest that this strategy may lead to two other technical problems
o multiplying counts by factors that differ from 1.0 effectively masks the significance of the
counts, which impairs the p-value accuracies
o something about possibly ending up with variance that's less than the mean thereby making the
negative binomial model inappropriate
Notes:
o I would normalize independently the samples that contribute to m1-r1 within m1-r1 alone, etc
o why is variance affected? These are normalizations are made overall to all genes within a sample using the same factor
whereas the variance is calculated gene-wise. Or am I mistaken.
o normalization might affect variance calculation so will the sequencing depth (this reasoning makes no sense to me now)
o low depth may tend to result in patchiness, and scaling up counts tends to exaggerate the patchiness so it might make the p-values more conservative?
My first questions are about the importance of these two problems. I suppose that the answer depends on factors
such as the magnitudes of the size factors, but I wonder whether one could use this approach if there were no
better option? Or is it always wrong?
More specifically, I wonder
o would multiplying the gene counts within a sample file by a value affect the DESeq2 variance
calculation substantially? Aren't the gene variances calculated per gene, and so wouldn't the count
variations survive largely the normalization?
o wouldn't samples with low read counts tend to have 'patchy' gene coverage due to chance? And wouldn't
multiplying the read counts by a value greater than one exaggerate this patchiness, which might increase
the variance and tend to make the p-values conservative?
Perhaps there is a more fundamental reason for not applying normalization to the sample counts? Is it possible that
differences in the genes expressed in the samples and the levels of their expression get averaged into a normalization
factor, and applying this uniformly to the individual gene counts distorts the counts? (Maybe I'm not thinking about
this with sufficient care.)
I read also about more complex GLM 'models', which one can describe to DESeq2 using contrasts. Do contrasts
offer a statistically sound solution to these problems? If so, I imagine that I would write the colData sample
information table as
condition
o1-c1-r1 o1-c1
o1-c1-r2 o1-c1
o1-c2-r1 o1-c2
o1-c2-r2 o1-c2
o1-c3-r1 o1-c3
o1-c3-r2 o1-c3
o2-c1-r1 o2-c1
o2-c1-r2 o2-c1
o2-c2-r1 o2-c2
o2-c2-r2 o2-c2
o2-c3-r1 o2-c3
o2-c3-r2 o2-c3
and run an analysis for case1 that looks like either
dds <- DESeqDataSetFromMatrix( countData = countData, colData = colData, design = ~ condition )
lcontrast <- list( c('conditiono1-c1',conditiono1-c2','conditiono1-c3'), c('conditiono2-c1', 'conditiono2-c2', 'conditiono2-c3' ) )
results( dds, contrast=lcontrast, alpha=0.05 )
or
dds <- DESeqDataSetFromMatrix( countData = countData, colData = colData, design = ~ condition )
lcontrast <- c( 1, 1, 1, 1, 1, 1, -1, -1, -1, -1, -1, -1 )
results( dds, contrast=lcontrast, alpha=0.05 )
(I don't know how to use the numeric contrast vector to indicate the conditions that belong to the numerator and denominator.)
and for case 2 either
dds <- DESeqDataSetFromMatrix( countData = countData, colData = colData, design = ~ condition )
lcontrast <- list( c('conditiono1-c1'), c('conditiono1-c2', 'conditiono1-c3' ) )
results( dds, contrast=lcontrast, alpha=0.05 )
or
dds <- DESeqDataSetFromMatrix( countData = countData, colData = colData, design = ~ condition )
lcontrast <- c( 0, 0, 0, 0, 0, 0, 1, 1, -1, -1, -1, -1 )
results( dds, contrast=lcontrast, alpha=0.05 )
Do these look OK?
DESeq2 is a black box to me so I am wonder whether these
function calls
o 'normalize' within c('conditiono1-c1',conditiono1-c2','conditiono1-c3')
and within c('conditiono2-c1', 'conditiono2-c2', 'conditiono2-c3' )
o 'normalize' between normalized c('conditiono1-c1',conditiono1-c2','conditiono1-c3')
and normalized c('conditiono2-c1', 'conditiono2-c2', 'conditiono2-c3' )
o take care of any other important aspects of this type
of analyses?
Thank you.
Brent Ewing
Edit:
Continuing to read, I find that the description of contrasts in Michael Crawley's 'Statistical Computing' clears up most of the confusion that lead me to post this question. It appears that running DESeq2 with contrasts results in a comparison of the mean of the sample counts in the numerator to the mean of the sample counts in the denominator. (And I understand that using contrasts is superior to summing normalized counts prior to running DESeq2.)
Thank you.
Brent
Hi Michael,
Thank you for clarifications and corrections.
I fear that I may have mislead you with my description of the samples. The phrase, 'the cell types belong to one of two organ classes', was not meant to suggest a special relationship between any cell type in organ 1 and cell type in organ 2. More concretely, we have samples of various types of neuronal cells and samples of cells from various other organs and we want to explore how expression differs between the neuronal cells and the non-neuronal cells (case 1 of my examples) and between each neuronal cell type and the other neuronal cell types (case 2).
In light of this clarification, do I need to define interactions or are the DESeq2 run examples (repeated with corrections from the initial post) reasonable?
case 1
dds <- DESeqDataSetFromMatrix( countData = countData, colData = colData, design = ~ condition )
dds <- DESeq( dds )
lcontrast <- list( c('conditiono1-c1, conditiono1-c2', 'conditiono1-c3'), c('conditiono2-c1', 'conditiono2-c2', 'conditiono2-c3' ) )
results( dds, contrast=lcontrast, listValues=c(1/3,-1/3), alpha=0.05 )
case 2
dds <- DESeqDataSetFromMatrix( countData = countData, colData = colData, design = ~ condition )
dds <- DESeq( dds )
lcontrast <- list( c('conditiono1-c1'), c('conditiono1-c2', 'conditiono1-c3' ) )
results( dds, contrast=lcontrast, listValues=c(1,-1/2), alpha=0.05 )
I may limit myself to pair-wise comparisons or seek help from a local statistician.
Thank you again.
Brent Ewing
That example code looks correct for those comparisons.
One more note, I would recommend using betaPrior=TRUE for these comparisons, when you run DESeq(). (This will allow you to keep using the same code if you update your R version and R packages.)
Hi Michael,
I am grateful for your reassurance and advice.
Thank you.
Brent Ewing