Can DESeq2 deal with zero-inflated data?
KELVINLEE • 0
@kelvinlee-9111
Last seen 9.1 years ago
Singapore

I have an RNA-seq data set that has many zeros, owing to insufficient sequencing depth and low abundance of certain genes. I want to use DESeq2 to analyse my data, but I am not sure whether DESeq2 can deal with a zero-inflated data set like mine. I know that DESeq2 uses the negative binomial model rather than the zero-inflated negative binomial model. I hope someone can help me out. Thank you.

deseq2 • 4.8k views
Robert Castelo ★ 3.4k
@rcastelo
Last seen 18 days ago
Barcelona/Universitat Pompeu Fabra

Try the Bioconductor package tweeDEseq. It uses the Poisson-Tweedie family of count distributions, which can fit unusual distributional features such as heavy tails or zero inflation. You will find more details in the package vignette and in the corresponding article:

Esnaola et al. A flexible count data model to fit the wide diversity of expression profiles arising from extensively replicated RNA-seq experiments. BMC Bioinformatics, 14:254, 2013.

cheers,

robert.

Simon Anders ★ 3.8k
@simon-anders-3855
Last seen 4.4 years ago
Zentrum für Molekularbiologie, Universi…

Yes, you can use DESeq2 for this, because I doubt that you have "zero-inflated" data.

Note that the term "zero-inflated" does not simply mean that your data has more zeroes than usual RNA-Seq data sets. Rather, it means that the proportion of samples with zero values is larger than what a negative-binomial or similar model would predict, given the average count for the gene across all samples.

Now, Poisson-mixture models of sequencing (such as the negative binomial model used in DESeq2 and similar tools) do predict that the proportion of zero counts increases if sequencing depth is low, so there is no inflation of zeroes compared to the model, i.e., no need for a special zero-inflated null distribution. Hence, if your large number of zeroes is only because of the low sequencing depth, then DESeq2 (or any similar tool) should work fine.
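To make the depth argument concrete, here is a minimal base-R sketch. The gene mean, the NB size parameter and the depth fractions are invented for illustration and do not come from any real data set. It shows only that a plain negative binomial already assigns a higher probability to zero as the expected count shrinks with depth.

```r
# Hypothetical gene: NB mean of 20 reads at full depth, size (1/dispersion) of 2.
# Both values are made up for illustration.
full_depth_mean <- 20
nb_size <- 2

# Fractions of the full sequencing depth to compare
depth_fraction <- c(1, 0.5, 0.1, 0.02)
expected_mean <- full_depth_mean * depth_fraction

# Probability of observing a zero count under the NB at each depth
p_zero <- dnbinom(0, mu = expected_mean, size = nb_size)

print(data.frame(depth_fraction, expected_mean, p_zero))
```

The zero probability rises as depth drops, with no extra zero component in the model.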

Some authors claim that certain types of data or of experimental design (especially data with strong experimental [not: technical] noise) cause zero inflation, and that the negative binomial is then a bad fit. As far as I understand, however, these authors do not claim that low sequencing depth is among the reasons for using a zero-inflated null distribution, because in that case the conventional models predict the increase in zero counts quite well.


so is there a way I can check whether my data is zero-inflated?


hi, there are different approaches to model and test for goodness of fit to a zero-inflated distribution, see for instance here and here. One way to approach this question with tweeDEseq is simply to estimate the shape parameter of the Poisson-Tweedie distribution and check whether it is close to the shape value for the negative binomial (a=0) or to something else (not negative binomial):

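library(tweeDEseq)  # provides mlePoissonTweedie() and getParam()

# counts for one gene across samples
y <- c(0,63,1,4,1,44,2,2,1,0,1,0,0,0,0,1,0,0,3,0,0,2,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,6,1,11,1,1,0,0,0,2)
# maximum-likelihood fit of the Poisson-Tweedie distribution
thetahat <- mlePoissonTweedie(y)
getParam(thetahat)
#         mu           D           a
#  3.0408163 102.5138255   0.5753331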

In this case, the distribution of counts, with its many zeroes, seems close to a Poisson-inverse Gaussian (a=0.5; see Esnaola et al., 2013, Fig. 4). The tweeDEseq vignette shows how to run goodness-of-fit tests on every row of a count matrix and produce a Q-Q plot, so you can judge what fraction of genes follow the count distribution you are interested in.

cheers,

robert.


A first diagnostic is to look at the scatterplot of counts between replicates, and check the frequency of having a very large count in one replicate and a zero in another replicate, for the same gene. However, I don't know about a quantitative diagnostic that would then help you objectively decide whether the data are zero-inflated or not. (And see the fortune(234) quote below, which also applies here - i.e. the question is not whether zero-inflation is detectable but whether it's bad enough to distort the inference.)
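A rough sketch of that replicate diagnostic in base R follows. The two replicates are simulated from a plain negative binomial, and the gene means, the size parameter and the "large count" threshold of 50 are all arbitrary assumptions. With real data you would substitute two replicate columns of your own count matrix for `rep1` and `rep2`.

```r
set.seed(1)

# Simulated stand-in for two replicates of the same condition (assumed NB, no zero inflation)
n_genes <- 2000
gene_mean <- rlnorm(n_genes, meanlog = 2, sdlog = 1.5)
rep1 <- rnbinom(n_genes, mu = gene_mean, size = 5)
rep2 <- rnbinom(n_genes, mu = gene_mean, size = 5)

# Scatterplot on the log(1 + count) scale; only drawn in an interactive session
if (interactive()) {
  plot(log1p(rep1), log1p(rep2), pch = 20, cex = 0.4,
       xlab = "log(1 + count), replicate 1",
       ylab = "log(1 + count), replicate 2")
}

# Genes with a zero in one replicate and a large count in the other
large_count <- 50  # arbitrary threshold, pick one suited to your depth
discordant <- (rep1 == 0 & rep2 >= large_count) |
              (rep2 == 0 & rep1 >= large_count)

cat("Discordant genes:", sum(discordant), "of", n_genes, "\n")
```

Under the plain NB simulated here, such discordant genes should be very rare. A noticeably larger fraction in real replicates would be a hint, not a proof, that something beyond the NB is going on.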

Models that explicitly model the data as a mixture of a point mass at zero and another, more dispersed distribution are interesting - but I wonder whether, in those cases where they would apply, the real data doesn't also have an excess of other small numbers (e.g. 1, 2, ...), and how they handle that?

 

library("fortunes")
fortune(234)

The issue really comes down to the fact that the questions: "exactly normal?", and "normal enough?" are 2 very different questions (with the difference becoming greater with increased sample size) and while the first is the easier to answer, the second is generally the more useful one.
   -- Greg Snow (answering a question about a "normality test" 
      suitable for large data)
      R-help (April 2009)

@Simon, wouldn't zeros from sequencing "errors" or being under threshold or something like that constitute exactly a zero-inflated model? x = ifelse(<zero for sequencing reason>, 0, <real distribution>)
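The `ifelse()` pseudocode above can be turned into a small simulation. This is only a sketch: the dropout probability `pi_zero` and the NB parameters are hypothetical values, and the "NB only" figure refers to the count component of the mixture, not to an NB refitted to the mixed data.

```r
set.seed(42)

n <- 10000
pi_zero <- 0.3   # hypothetical probability of a zero for a "sequencing reason"
mu <- 5          # hypothetical mean of the NB count component
size <- 2        # hypothetical NB size (1/dispersion)

# Mixture: a point mass at zero with probability pi_zero, otherwise an NB draw
is_dropout <- rbinom(n, 1, pi_zero) == 1
x <- ifelse(is_dropout, 0, rnbinom(n, mu = mu, size = size))

# Zero probability of the NB component alone, and of the full mixture
p0_nb <- dnbinom(0, mu = mu, size = size)
p0_zinb <- pi_zero + (1 - pi_zero) * p0_nb

print(c(observed = mean(x == 0), nb_only = p0_nb, zero_inflated = p0_zinb))
```

The observed zero fraction tracks the mixture formula and sits well above what the NB component alone would give, which is the sense in which such data are "zero-inflated".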


BTW this thread is quite old, so here are some relevant links from the 4 years since:

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1406-4

https://bioconductor.org/packages/release/bioc/vignettes/zinbwave/inst/doc/intro.html#differential-expression-with-deseq2

https://github.com/mikelove/zinbwave-deseq2/blob/master/zinbwave-deseq2.knit.md

The question remains whether a given dataset requires a zero component, but if you do require it, we have built out the infrastructure.

@ryan-c-thompson-5618
Last seen 3 months ago
Icahn School of Medicine at Mount Sinai…

DESeq2 and edgeR both use only the negative binomial distribution and do not support zero inflation, as far as I know. However, are you sure your data is zero-inflated? The NB distribution allows for a certain amount of zeros in the data on its own. Just having zeros in your data does not make it "zero-inflated". Zero-inflation would mean that the non-zero data follows an NB distribution, but the number of zeros is in excess of what would be predicted from the NB.
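One way to act on that definition for a single gene is sketched below in base R. It is an ad hoc check, not something DESeq2 or edgeR does internally: fit an NB by maximum likelihood with `optim()`, then compare the observed fraction of zeros with the zero probability of the fitted NB. The counts are the example vector posted elsewhere in this thread, and `nb_negloglik` is a helper name made up for this sketch.

```r
# Example counts for one gene across samples (vector posted elsewhere in this thread)
y <- c(0,63,1,4,1,44,2,2,1,0,1,0,0,0,0,1,0,0,3,0,0,2,0,0,0,0,0,2,0,0,
       0,0,0,0,0,0,0,0,0,0,6,1,11,1,1,0,0,0,2)

# Negative log-likelihood of an NB; parameters are on the log scale
# so that optim() always evaluates valid (positive) mu and size values
nb_negloglik <- function(par, counts) {
  -sum(dnbinom(counts, mu = exp(par[1]), size = exp(par[2]), log = TRUE))
}

# Maximum-likelihood fit, started at the sample mean and size = 1
fit <- optim(c(log(mean(y)), 0), nb_negloglik, counts = y)
mu_hat <- exp(fit$par[1])
size_hat <- exp(fit$par[2])

# Compare the observed zero fraction with what the fitted NB predicts
observed_zero_frac <- mean(y == 0)
expected_zero_frac <- dnbinom(0, mu = mu_hat, size = size_hat)

print(c(observed = observed_zero_frac, expected_nb = expected_zero_frac))
```

An observed fraction clearly above the fitted NB value would point towards excess zeros. With a single gene and few samples the comparison is noisy, so treat it as a rough guide only.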

If you really believe you have zero-inflated data, the only package I've heard of for analyzing RNA-seq data using the ZI-NB distribution is ShrinkBayes, but its website seems to be down now.


so is there a way I can check whether my data is zero-inflated?
