Question

Batch effects and model matrices

0

Entering edit mode

bvernot • 0

@bvernot-13293

Last seen 7.9 years ago

I have mice with two possible genotypes: homozygous wildtype for a gene of interest, or homozygous edited. We’ve collected cDNA of single cells from a few tissue types, although here I mostly ignore cell and tissue type. The primary issue is that I clearly have mouse-specific effects, but I don't know how to separate those from genotype. This is analogous to the batch example in section 3.12.1 of the DESeq2 vignette, but because the mouse itself is the batch, it's not like I can change the study design. I'm doing an ad-hoc correction where I discard genes where the test ~mouse vs ~genotype is highly significant (described below). Does this seem reasonable? Is there a better approach?

Thanks!

I simulated some data as an example: 50 cells from 4 mice (two wt, two hum). In reality, I have sibling pairs, so that does help to control for some of the batch effect (testing ~genotype vs ~genotype+sibling).

Here's the simulated expression for each gene. Gene 1 has no genotype effect, but a clear issue in mouse 1. Genes 2 and 3 have a genotype effect, and the rest do not.

Do a wald test for ~genotype:

deseq.data <- DESeqDataSetFromMatrix(expr_counts, cell_annotations, ~genotype)
deseq.data.wald.gt = DESeq(deseq.data)
deseq.data.wald.gt <- results(deseq.data.wald.gt)
deseq.data.wald.gt[deseq.data.wald.gt$padj < .1,]

# log2 fold change (MAP): genotype wt vs hum 
# Wald test p-value: genotype wt vs hum 
# DataFrame with 4 rows and 6 columns
#         baseMean log2FoldChange      lfcSE      stat       pvalue         padj
#        <numeric>      <numeric>  <numeric> <numeric>    <numeric>    <numeric>
# gene1   15.50258     -0.8298745 0.14021331 -5.918657 3.245816e-09 6.491631e-08
# gene2   24.30998      0.4379174 0.08033192  5.451350 4.998887e-08 4.998887e-07
# gene3   21.15279      0.2339134 0.08394988  2.786346 5.330586e-03 2.665293e-02
# gene19  18.54322     -0.2473045 0.08387366 -2.948536 3.192826e-03 2.128551e-02

Cannot do ~mouse+genotype vs ~mouse, because genotype is a subset of mouse.

Instead I test full = ~mouse vs reduced = ~genotype. This gives a highly significant result for gene1:

# log2 fold change (MLE): mousem4 
# LRT p-value: full vs reduced 
# DataFrame with 1 row and 6 columns
#        baseMean log2FoldChange     lfcSE      stat       pvalue         padj
#       <numeric>      <numeric> <numeric> <numeric>    <numeric>    <numeric>
# gene1  15.50258       3.361492 0.1976183  297.7472 2.213212e-65 4.426425e-64

This is what I'm currently doing: treat a gene as significant only if it has a significant result with ~genotype vs ~1, and if it has an unsignificant result for ~mouse vs ~genotype.

deseq.data.wald.gt[deseq.data.wald.gt$padj < .1 & deseq.data.lrt2$padj > .0001,]

# log2 fold change (MAP): genotype wt vs hum 
# Wald test p-value: genotype wt vs hum 
# DataFrame with 3 rows and 6 columns
#         baseMean log2FoldChange      lfcSE      stat       pvalue         padj
#        <numeric>      <numeric>  <numeric> <numeric>    <numeric>    <numeric>
# gene2   24.30998      0.4379174 0.08033192  5.451350 4.998887e-08 4.998887e-07
# gene3   21.15279      0.2339134 0.08394988  2.786346 5.330586e-03 2.665293e-02
# gene19  18.54322     -0.2473045 0.08387366 -2.948536 3.192826e-03 2.128551e-02

deseq2 single cell monocle • 1.6k views

ADD COMMENT • link updated 7.9 years ago by Michael Love 43k • written 7.9 years ago by bvernot • 0

score 0 · Answer 1 · 2017-06-21

0

Entering edit mode

Michael Love 43k

@mikelove

Last seen 9 days ago

United States

For single cell data, you may want to use a software that handles excess (technical) zeros, relative to a negative binomial or a log linear model.

In general, the LRT you are performing doesn't make sense. The reduced design needs to be nested within the full design (meaning that you remove some terms but you should not introduce other terms).

Here mouse is confounded with genotype (in an unavoidable way), so you can't use fixed effects to try to account for the correlation among expression measurements from the same mouse. The only approach I know of is to use the duplicateCorrelation() functionality of limma.

ADD COMMENT • link 7.9 years ago Michael Love 43k

0

Entering edit mode

Thanks for your thoughts, Mike. I was under the impression that ~genotype is a nested model of ~mouse, since it is just ~mouse constraining m1==m3 and m2==m4 (since those mice are the same genotype). But perhaps I'm wrong, or deseq2/monocle just can't handle that type of nesting? For what it's worth, the results from Deseq2 (and monocle) for that particular LRT do seem to make sense.

WRT the single cell aspect, I have used monocle in a similar way as I described here, and I do seem to get more stable results,

ADD REPLY • link 7.9 years ago bvernot • 0

0

Entering edit mode

I see what you're doing now. You can test ~mouse vs ~genotype, but think about what you're testing: it's whether or not there are differences between the two mice within a genotype. This isn't what you want. Like I said, the only way i know of to deal with the correlations between mice within genotype is using duplicateCorrelation() in the limma pipeline.

ADD REPLY • link 7.9 years ago Michael Love 43k