Hey all,
We've been running ZINB-WaVE and DESeq2 on an inDrop dataset to look for differentially expressed genes across subsets of cells. This does a nice job for genes that are more highly expressed, but it looks like this can miss genes that look like they are markers but are expressed at a low level in just one condition.
If I select genes with a log2FoldChange > 10 and plot the baseMean distribution, there's a long tail of genes with a higher baseMean. For example:
Selecting the long-tail genes there (baseMean > 0.3) and jitter-plotting the counts for the two conditions I am testing:
Turning off the Cook's distance filtering doesn't help any, the p-values for these samples are ~1. I get why these aren't being called, the lfcSE is like 10x the log2FoldChange:
baseMean log2FoldChange lfcSE stat pvalue padj <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> Sulf2 0.3104184 13.87955 292.0464 0.04752516 0.9622228 NA Atp1a2 0.4537823 14.11711 290.0398 0.04867300 0.9612700 1 Csf1 0.3267646 15.11244 282.0484 0.05358100 0.9573912 1 H2afy 0.3259195 16.07577 274.9044 0.05847767 0.9535017 1 Mknk2 0.4025375 17.26996 266.7222 0.06474887 0.9484551 1 ... ... ... ... ... ... ... Dner 0.6504387 14.05085 290.5956 0.04835190 0.9615246 1 Akap12 0.8202702 13.61820 294.3050 0.04627242 0.9631704 1 Col6a3 0.3008975 11.87571 310.7945 0.03821081 0.9696493 NA Pi15 0.7426169 14.92787 283.4807 0.05265920 0.9580888 1 Marcks 0.3847278 14.81474 284.3462 0.05210108 0.9585412 1
These are also pretty unbalanced and small groups, there's only 54 cells in the mutant_anterior group and 17 in the wt_anterior group and on average ~25% of the cells in the mutant_anterior group express the gene. Despite all that, these do kind of look like they should be called to me. What do you all think?
At Bioc2018, someone was saying that UMI counts might not require zero-inflated component, but I forget who said this and what was the evidence or paper that backed it up.
I think Wolfgang Huber mentioned that about UMI counts right? I'm running a DE analysis using zimbwave and will look out for this too. Thanks Rory.
I agree with Mike's suggestions, and would check those first. In addition, could you also let us know how the weights look like for the zeros of those specific genes? Can you share some of your code, e.g. did you include the treatment effect in the ZINB-WaVE model?
Note that in our simulations, we also did not find substantial benefits to account for zero inflation in UMI data: "In general, bulk RNA-seq methods perform well in this simulation, probably because the extremely high zero abundance in combination with low counts can be reasonably accommodated by the negative binomial distribution".
In my experience, accounting for zero inflation is particularly useful for data from full-length protocols like SMART-Seq2.
EDIT: external research (https://www.biorxiv.org/content/early/2018/08/17/225177) seems to suggest that the benefit depends on the dataset, also for UMI data, and we will be looking into diagnostics for this.
Hi everyone, to answer everyone's questions:
I ran ZINB-WaVE with all ~1700 cells in the experiment, and then subset down to a group I was interested in to do DE across conditions within the different subgroups. I fit it like this:
I ran with LRT and doing the Wald test, and these genes aren't captured either way, the same low expressing but condition-specific genes that look like they should be hits to me get missed. I also tried doing the LRT but not using the weights at all, and the same genes get missed. For all of these, I subset down the DESeq2 object like this to do the group-wise tests:
which ends up having the dispersions recalculated using only the subset of cells, which may be where I am losing some power. I can provide all the data and the reports if that would be helpful.
I'm surprised that LRT doesn't capture this without the weights, but can't figure out why that would be.
What if you do, after subsetting:
Thanks-- I'm an idiot, I was grabbing from the weighted dds object when doing the test instead of the one I created from just the counts. You're right that DESeq2 without the ZINB-WaVE weights catches most of these.
No worries. I catch myself doing a dozen slips like this every day before noon. How can you characterize the ones that still aren't caught (and are these missed by Wilcoxon also)?
Ok, it's interesting that the weights trip up the inference on these particular genes, and I'll be interested to see how the treatment design approach that Koen mentions works.
row-wise Wilcoxon does get most of these but only if I grab all the high LFC change genes with the baseMean cutoff in order to not lose them during multiple hypothesis testing correction. True here is padj < 0.05, using BH correction.
Thank you for checking this. In the zinbFit function, could you specify your treatment design as well? This will allow the zero excess probability to systematically vary with the treatment. Also, we recommend setting commondispersion=TRUE.
In the meanwhile, I would be happy to take a look at the data, if you could upload it somewhere or send it via email to koen.vandenberge@ugent.be
Thanks everyone for the awesome discussion. I put up a github repository here (https://github.com/roryk/zinbwave-deseq2-indrop) that has an rmarkdown report summarizing all of this. In the report are links to the raw data, I put up Seurat object I started with, and a script to calculate the weights and save a DDS object with and without the weights.
link for the lazy:
https://htmlpreview.github.io/?https://raw.githubusercontent.com/roryk/zinbwave-deseq2-indrop/master/zinbwave-deseq2-comparison.html