Question

DESeq2 comparison of sample combinations

0

Entering edit mode

bge • 0

@bge-13787

Last seen 3.7 years ago

Hi.

I am preparing to run DESeq2 on samples and I am uncertain about how I might proceed.

My questions may be elementary: I am not a statistician and have not used R for statistical modeling or DESeq2 for differential expression analyses.

We have biological samples each of which consists of RNA from a cell type obtained using FACS. There are two biological replicates per cell type. Furthermore, the cell types belong to one of two organ classes. In order to generalize this for convenience, I label the organ classes o1 and o2, and the cell types within each class c1, c2, and c3. Additionally, the cell type replicates are r1 and r2. So I have for organ class 1, the samples

o1-c1-r1 and o1-c1-r2
o1-c2-r1 and o1-c2-r2
o1-c3-r1 and o1-c3-r2

and for organ class 2

o2-c1-r1 and o2-c1-r2
o2-c2-r1 and o2-c2-r2
o2-c3-r1 and o2-c3-r2

I want to look for differential expression between various combinations of these cell types. As examples,

case 1:

o1-c1 + o1-c2 + o1-c3 vs o2-c1 + o2-c2 + o2-c3

(expression difference between organ class 1 and organ class 2)

and

case 2:

o1-c1 vs o1-c2 + o1-c3

(expression difference between cell type 1 of organ class 1 and all other cell types of organ class 1)

and so on.

My first thought is to combine the gene coverage counts, for example for case 1, I add the counts gene-wise to create
four files of combined counts

for organ class 1:

m1-r1 = o1-c1-r1 + o1-c2-r1 + o1-c3-r1

m1-r2 = o1-c1-r2 + o1-c2-r2 + o1-c3-r2

and for organ class 2:

m2-r1 = o2-c1-r1 + o2-c2-r1 + o2-c3-r1

m2-r2 = o2-c1-r2 + o2-c2-r2 + o2-c3-r2

and run a 'standard' DESeq2 comparison m1 vs m2 in which I treat m1-r1 and m1-r2 as biological replicates for m1
and m2-r1 and m2-r2 for m2.

But this strategy concerns me because sample sequencing depths vary, sometimes substantially, and I convinced
myself that the genes in over-represented samples will dominate counts and so introduce errors in the analyses.

(Likewise, I imagine that if one were to combine physically the cells into four samples, one would need to control the number of cells of each type collected into each sample.)

So I consider normalizing the counts for each of the samples before combining them. That is, I would use DESeq2 size factors to normalize independently the samples in m1-r1, m1-r2, m2-r1, and m2-r2. My documentation and google search reading suggest that this strategy may lead to two other technical problems

o multiplying counts by factors that differ from 1.0 effectively masks the significance of the
counts, which impairs the p-value accuracies
o something about possibly ending up with variance that's less than the mean thereby making the
negative binomial model inappropriate

Notes:
o I would normalize independently the samples that contribute to m1-r1 within m1-r1 alone, etc
o why is variance affected? These are normalizations are made overall to all genes within a sample using the same factor
whereas the variance is calculated gene-wise. Or am I mistaken.
o normalization might affect variance calculation so will the sequencing depth (this reasoning makes no sense to me now)
o low depth may tend to result in patchiness, and scaling up counts tends to exaggerate the patchiness so it might make the p-values more conservative?

My first questions are about the importance of these two problems. I suppose that the answer depends on factors
such as the magnitudes of the size factors, but I wonder whether one could use this approach if there were no
better option? Or is it always wrong?

More specifically, I wonder

o would multiplying the gene counts within a sample file by a value affect the DESeq2 variance
calculation substantially? Aren't the gene variances calculated per gene, and so wouldn't the count
variations survive largely the normalization?

o wouldn't samples with low read counts tend to have 'patchy' gene coverage due to chance? And wouldn't
multiplying the read counts by a value greater than one exaggerate this patchiness, which might increase
the variance and tend to make the p-values conservative?

Perhaps there is a more fundamental reason for not applying normalization to the sample counts? Is it possible that
differences in the genes expressed in the samples and the levels of their expression get averaged into a normalization
factor, and applying this uniformly to the individual gene counts distorts the counts? (Maybe I'm not thinking about
this with sufficient care.)

I read also about more complex GLM 'models', which one can describe to DESeq2 using contrasts. Do contrasts
offer a statistically sound solution to these problems? If so, I imagine that I would write the colData sample
information table as

                      condition
o1-c1-r1        o1-c1
o1-c1-r2        o1-c1
o1-c2-r1        o1-c2
o1-c2-r2        o1-c2
o1-c3-r1        o1-c3
o1-c3-r2        o1-c3
o2-c1-r1        o2-c1
o2-c1-r2        o2-c1
o2-c2-r1        o2-c2
o2-c2-r2        o2-c2
o2-c3-r1        o2-c3
o2-c3-r2        o2-c3

and run an analysis for case1 that looks like either

dds <- DESeqDataSetFromMatrix( countData = countData, colData = colData, design = ~ condition )
lcontrast <- list( c('conditiono1-c1',conditiono1-c2','conditiono1-c3'), c('conditiono2-c1', 'conditiono2-c2', 'conditiono2-c3' ) )
results( dds, contrast=lcontrast, alpha=0.05 )

or

dds <- DESeqDataSetFromMatrix( countData = countData, colData = colData, design = ~ condition )
lcontrast <- c( 1, 1, 1, 1, 1, 1, -1, -1, -1, -1, -1, -1 )
results( dds, contrast=lcontrast, alpha=0.05 )

(I don't know how to use the numeric contrast vector to indicate the conditions that belong to the numerator and denominator.)

and for case 2 either

dds <- DESeqDataSetFromMatrix( countData = countData, colData = colData, design = ~ condition )
lcontrast <- list( c('conditiono1-c1'), c('conditiono1-c2', 'conditiono1-c3' ) )
results( dds, contrast=lcontrast, alpha=0.05 )

or

dds <- DESeqDataSetFromMatrix( countData = countData, colData = colData, design = ~ condition )
lcontrast <- c( 0, 0, 0, 0, 0, 0, 1, 1, -1, -1, -1, -1 )
results( dds, contrast=lcontrast, alpha=0.05 )

Do these look OK?

DESeq2 is a black box to me so I am wonder whether these
function calls

o 'normalize' within c('conditiono1-c1',conditiono1-c2','conditiono1-c3')
and within c('conditiono2-c1', 'conditiono2-c2', 'conditiono2-c3' )

o 'normalize' between normalized c('conditiono1-c1',conditiono1-c2','conditiono1-c3')
and normalized c('conditiono2-c1', 'conditiono2-c2', 'conditiono2-c3' )

o take care of any other important aspects of this type
of analyses?

Thank you.

Brent Ewing

Edit:

Continuing to read, I find that the description of contrasts in Michael Crawley's 'Statistical Computing' clears up most of the confusion that lead me to post this question. It appears that running DESeq2 with contrasts results in a comparison of the mean of the sample counts in the numerator to the mean of the sample counts in the denominator. (And I understand that using contrasts is superior to summing normalized counts prior to running DESeq2.)

Thank you.

Brent

deseq2 • 4.2k views

ADD COMMENT • link updated 7.3 years ago by Michael Love 43k • written 7.3 years ago by bge • 0

score 0 · Answer 1 · 2017-08-23

0

Entering edit mode

Michael Love 43k

@mikelove

Last seen 1 day ago

United States

Let me jump to the first part, and then we can see what consequence that has on the rest. From the beginning you have:

"I want to look for differential expression between various combinations of these cell types. As examples, case 1:

o1-c1 + o1-c2 + o1-c3 vs o2-c1 + o2-c2 + o2-c3

(expression difference between organ class 1 and organ class 2), and case 2:

o1-c1 vs o1-c2 + o1-c3

(expression difference between cell type 1 of organ class 1 and all other cell types of organ class 1)"

If you want both of these, you can use the recommendation in the vignette for combining factors into a group variable. Your case 2 here requires that you've fit organ specific cell-type effects, otherwise this would be the same result for organ 1 and organ 2, which sounds like it is not what you want. So by combining the factors into a group variable, you get unique levels of the factor for all the combinations of organ and cell-type that have samples. It's critical that you have biological replication for these 6 groups of samples, and it sounds like you have 2 for each, which is the bare minimum, but at least you have this, given that you are interested in organ-specific cell type differences. Then you can use

results(dds, contrast=list("o1c1", c("o1c2","o1c3")), listValues=c(1,-1/2))

For the second comparison, and so on. Note that you want to compare o1c1 to the average of {o1c2 and o1c3} so the list value needs to be -1/2 for the denominator, not -1. See ?results for more details.

So, use this recommendation from the vignette, and, for sure, do not combine columns which are biological replicates. This is a huge loss of the most important information from the experiment: biological variability.

ADD COMMENT • link 7.3 years ago Michael Love 43k

0

Entering edit mode

Hi Michael,

Thank you for clarifications and corrections.

I fear that I may have mislead you with my description of the samples. The phrase, 'the cell types belong to one of two organ classes', was not meant to suggest a special relationship between any cell type in organ 1 and cell type in organ 2. More concretely, we have samples of various types of neuronal cells and samples of cells from various other organs and we want to explore how expression differs between the neuronal cells and the non-neuronal cells (case 1 of my examples) and between each neuronal cell type and the other neuronal cell types (case 2).

In light of this clarification, do I need to define interactions or are the DESeq2 run examples (repeated with corrections from the initial post) reasonable?

case 1
dds <- DESeqDataSetFromMatrix( countData = countData, colData = colData, design = ~ condition )
dds <- DESeq( dds )
lcontrast <- list( c('conditiono1-c1, conditiono1-c2', 'conditiono1-c3'), c('conditiono2-c1', 'conditiono2-c2', 'conditiono2-c3' ) )
results( dds, contrast=lcontrast, listValues=c(1/3,-1/3), alpha=0.05 )

case 2
dds <- DESeqDataSetFromMatrix( countData = countData, colData = colData, design = ~ condition )
dds <- DESeq( dds )
lcontrast <- list( c('conditiono1-c1'), c('conditiono1-c2', 'conditiono1-c3' ) )
results( dds, contrast=lcontrast, listValues=c(1,-1/2), alpha=0.05 )

I may limit myself to pair-wise comparisons or seek help from a local statistician.

Thank you again.

Brent Ewing

ADD REPLY • link 7.3 years ago bge • 0

0

Entering edit mode

That example code looks correct for those comparisons.

One more note, I would recommend using betaPrior=TRUE for these comparisons, when you run DESeq(). (This will allow you to keep using the same code if you update your R version and R packages.)

ADD REPLY • link 7.2 years ago Michael Love 43k

0

Entering edit mode

Hi Michael,

I am grateful for your reassurance and advice.

Thank you.

Brent Ewing

ADD REPLY • link 7.2 years ago bge • 0