DESeq2 error estimating size factors -- zeros in count matrix
ssabri ▴ 20
@ssabri-9464
Last seen 4.6 years ago

I am posting in regard to an error I've experienced when running DESeq2's estimateSizeFactors. The specific error is posted below: 

> dds    = DESeqDataSetFromMatrix(countData=filteredDGE, colData=design, design=~condition)
converting counts to integer mode
> dds    = estimateSizeFactors(dds)
Error in estimateSizeFactorsForMatrix(counts(object), locfunc = locfunc,  : 
  every gene contains at least one zero, cannot compute log geometric means

I've run DESeq2 several times in the past but have never encountered this error. I have compared my current input with input from previous analyses but cannot see any differences in formatting. The error message seems straightforward, yet I do not understand it: my data.frame will of course contain zeros, because some genes are not detected or expressed in every sample. I have posted below the code I am using. `filteredDGE` is a data.frame with 500 columns (samples) and about 200 rows (genes).

DGE = read.table(paste(getwd(), "/output/AGGCAGAA_final/AGGCAGAA_final_Collapsed_DGE_HUMAN.txt", sep=""), row.names=1, header=TRUE, sep="\t")
n = 80
# keep genes with zeros in at most n% of samples
filteredDGE = DGE[rowSums(DGE==0)<=ncol(DGE)*(n/100),] 
library(DESeq2)
design = data.frame(row.names=names(filteredDGE), condition=as.factor(names(filteredDGE)))
dds    = DESeqDataSetFromMatrix(countData=filteredDGE, colData=design, design=~condition)
dds    = estimateSizeFactors(dds)

I would greatly appreciate any help you could toss my way. 

EDIT: I should state that this is single-cell data. I'm only interested in performing a counts normalization/transformation; there is no need to run a differential analysis.

deseq2 estimatesizefactors

You might try the edgeR normalization, which I believe can handle data with a zero in every row. Unfortunately, you'll have to work out how to convert edgeR normalization factors to DESeq2 size factors. I don't remember the conversion off the top of my head (which is why I made this a comment and not an answer).

@mikelove
Last seen 2 days ago
United States

Just to add to Ryan and Wolfgang's answers:

First, as practical advice, you can use the type="iterate" method that Wolfgang describes. It is motivated by a model for the data: it seeks a size factor vector that optimizes the likelihood of the data under the assumption that most genes are not differentially expressed. The iterative algorithm ignores some percentage of the genes with the lowest likelihood, as these may be the genes that are in fact differentially expressed.
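
A minimal sketch of that call (assuming `dds` is the DESeqDataSet from the question):

dds <- estimateSizeFactors(dds, type="iterate")   # iterative, likelihood-based size factor estimation
sizeFactors(dds)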

Or you can estimate size factors in other ways, including from an upper quantile of the count matrix. You can then provide these size factors to DESeq2 with sizeFactors(dds) <- x. It's best to divide out the geometric mean first, so that the estimated size factors leave the mean of the normalized counts close to the mean of the unnormalized counts. This would look like:

qs <- apply(counts(dds), 2, quantile, .9)   # per-sample 90th-percentile count
sf <- qs / exp(mean(log(qs)))               # divide out the geometric mean
sizeFactors(dds) <- sf

You say, 

"My data.frame will contain zeros in it because there are genes not being hit or expressed within samples"

That is true, but this error indicates that every row in the dataset has a zero, which is not typical for the kind of "bulk" RNA-seq data DESeq2 was designed for. DESeq2 can certainly handle zeros in the model. But for bulk RNA-seq there are typically thousands of genes which have positive counts for all samples, and so the default size factor estimation works well and without error.

For example, with the Bodymap dataset prepared by ReCount:

download.file("http://bowtie-bio.sourceforge.net/recount/ExpressionSets/bodymap_eset.RData", "body.rda")
load("body.rda")
e <- exprs(bodymap.eset)
table(rowSums(e == 0) == 0)

FALSE  TRUE 
45936  6644

So even across a variety of tissues, we still have thousands of genes which have positive counts for all samples.


Excellent! Thank you so much for the clarification. This has answered my question.


Hi Michael,

I tried Wolfgang's and your suggestion on a very sparse count matrix (an OTU matrix from a MiSeq amplicon experiment), but I get an error:

(dds <- DESeqDataSetFromMatrix( countData = OTU.Mat,
                                  colData = Meta.Mat,
                                  design = ~1 ))
filter <- rowSums(OTU.Mat >= 10) >= 2 # I tried with and without filtering
dds <- dds[filter,]
dds <- estimateSizeFactors(dds, type="iterate")

Gives:

Error in estimateSizeFactorsIterate(object) :
  iterative size factor normalization did not converge

Similarly:

qs <- apply(counts(dds), 2, quantile, .9)
sf <- qs / exp(mean(log(qs)))
sizeFactors(dds) <- sf

This also doesn't work, since some of the qs are zero and therefore exp(mean(log(qs))) == 0

But your suggestion from A: DESeq2 Error in varianceStabilizingTransformation works:

cts <- counts(dds)
# geometric mean of each gene computed over its non-zero counts only
geoMeans <- apply(cts, 1, function(row) if (all(row == 0)) 0 else exp(mean(log(row[row != 0]))))
dds <- estimateSizeFactors(dds, geoMeans=geoMeans)

But is this the best practice for such a dataset?

 


I can't tell you best practice, because I only work with bulk RNA-seq, so I'm not the expert on how to analyze these datasets where > 90% of a column = 0, after already removing the rows with all zeros.

Note that I came up with the 0.9 quantile out of the blue. You could try higher values such that the 'qs' do not have a 0.
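
One possible sketch of that (the starting quantile and step size here are arbitrary, just to illustrate raising the quantile until every sample gets a positive value):

q <- 0.9
qs <- apply(counts(dds), 2, quantile, q)
while (any(qs == 0) && q < 1) {
  q <- min(q + 0.01, 1)                   # raise the quantile until no sample's quantile is zero
  qs <- apply(counts(dds), 2, quantile, q)
}
sf <- qs / exp(mean(log(qs)))             # divide out the geometric mean, as above
sizeFactors(dds) <- sf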

I'm sure there are now dozens of papers discussing normalization strategies for OTU data.

@wolfgang-huber-3550
Last seen 3 months ago
EMBL European Molecular Biology Laborat…

Have a look at the manual page of estimateSizeFactors for options beyond the default. E.g. you could try type = "iterate" to get an estimator that iterates between estimating the dispersions with a design of ~1 and finding a size factor vector by numerically optimizing the likelihood of the ~1 model.

Or, use sizeFactors<- to specify your own factors, e.g. based on the sum of counts in each sample.
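
A minimal sketch of that, assuming you want size factors proportional to the per-sample totals, rescaled so their geometric mean is 1 (which keeps the normalized counts on roughly the original scale):

libSizes <- colSums(counts(dds))                         # total counts per sample
sizeFactors(dds) <- libSizes / exp(mean(log(libSizes)))  # rescale to geometric mean 1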

And, depending on the analysis you want to do (e.g. transformation, data exploration, quality control), maybe it is better to use no size factors at all and just look at the actual data? 

 


For this particular analysis I simply intend to do some data visualizations. I don't believe there is any need for differential expression testing, which is why I only want to transform my data (via the VST). 
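
A minimal sketch of that step (assuming size factors have already been obtained one way or another):

vsd <- varianceStabilizingTransformation(dds, blind=TRUE)   # VST for visualization/exploration
head(assay(vsd))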

I must have overlooked the `type="iterate"` option! Thank you for bringing this to my attention. I've also found a workaround by simply pre-filtering my data: rather than looking at, say, 500 cells, I can filter (by overall expression) down to a fraction of these. 
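
A minimal sketch of that kind of cell-level pre-filter (the threshold and the use of total counts as the measure of overall expression are just illustrative assumptions):

totals <- colSums(filteredDGE)               # total counts per cell
keep   <- totals >= quantile(totals, 0.75)   # e.g. keep the top 25% of cells
filteredDGE <- filteredDGE[, keep]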

 

 

