I have mice with two possible genotypes: homozygous wildtype for a gene of interest, or homozygous edited. We’ve collected cDNA of single cells from a few tissue types, although here I mostly ignore cell and tissue type. The primary issue is that I clearly have mouse-specific effects, but I don't know how to separate those from genotype. This is analogous to the batch example in section 3.12.1 of the DESeq2 vignette, but because the mouse itself is the batch, it's not like I can change the study design. I'm doing an ad-hoc correction where I discard genes where the test ~mouse vs ~genotype is highly significant (described below). Does this seem reasonable? Is there a better approach?
Thanks!
I simulated some data as an example: 50 cells from 4 mice (two wt, two hum). In reality, I have sibling pairs, so that does help to control for some of the batch effect (testing ~genotype vs ~genotype+sibling).
Here's the simulated expression for each gene. Gene 1 has no genotype effect, but a clear issue in mouse 1. Genes 2 and 3 have a genotype effect, and the rest do not.
Do a wald test for ~genotype:
deseq.data <- DESeqDataSetFromMatrix(expr_counts, cell_annotations, ~genotype) deseq.data.wald.gt = DESeq(deseq.data) deseq.data.wald.gt <- results(deseq.data.wald.gt) deseq.data.wald.gt[deseq.data.wald.gt$padj < .1,] # log2 fold change (MAP): genotype wt vs hum # Wald test p-value: genotype wt vs hum # DataFrame with 4 rows and 6 columns # baseMean log2FoldChange lfcSE stat pvalue padj # <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> # gene1 15.50258 -0.8298745 0.14021331 -5.918657 3.245816e-09 6.491631e-08 # gene2 24.30998 0.4379174 0.08033192 5.451350 4.998887e-08 4.998887e-07 # gene3 21.15279 0.2339134 0.08394988 2.786346 5.330586e-03 2.665293e-02 # gene19 18.54322 -0.2473045 0.08387366 -2.948536 3.192826e-03 2.128551e-02
Cannot do ~mouse+genotype vs ~mouse, because genotype is a subset of mouse.
Instead I test full = ~mouse
vs reduced = ~genotype
. This gives a highly significant result for gene1:
# log2 fold change (MLE): mousem4 # LRT p-value: full vs reduced # DataFrame with 1 row and 6 columns # baseMean log2FoldChange lfcSE stat pvalue padj # <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> # gene1 15.50258 3.361492 0.1976183 297.7472 2.213212e-65 4.426425e-64
This is what I'm currently doing: treat a gene as significant only if it has a significant result with ~genotype vs ~1
, and if it has an unsignificant result for ~mouse
vs ~genotype.
deseq.data.wald.gt[deseq.data.wald.gt$padj < .1 & deseq.data.lrt2$padj > .0001,] # log2 fold change (MAP): genotype wt vs hum # Wald test p-value: genotype wt vs hum # DataFrame with 3 rows and 6 columns # baseMean log2FoldChange lfcSE stat pvalue padj # <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> # gene2 24.30998 0.4379174 0.08033192 5.451350 4.998887e-08 4.998887e-07 # gene3 21.15279 0.2339134 0.08394988 2.786346 5.330586e-03 2.665293e-02 # gene19 18.54322 -0.2473045 0.08387366 -2.948536 3.192826e-03 2.128551e-02
Thanks for your thoughts, Mike. I was under the impression that ~genotype is a nested model of ~mouse, since it is just ~mouse constraining m1==m3 and m2==m4 (since those mice are the same genotype). But perhaps I'm wrong, or deseq2/monocle just can't handle that type of nesting? For what it's worth, the results from Deseq2 (and monocle) for that particular LRT do seem to make sense.
WRT the single cell aspect, I have used monocle in a similar way as I described here, and I do seem to get more stable results,
I see what you're doing now. You can test ~mouse vs ~genotype, but think about what you're testing: it's whether or not there are differences between the two mice within a genotype. This isn't what you want. Like I said, the only way i know of to deal with the correlations between mice within genotype is using duplicateCorrelation() in the limma pipeline.