Hi,
I am working with an RNA-seq data set that contains a lot of 0s across the genes. When I do Deseq2 to look for differential gene expression, it finds several significantly dysregulated genes. However, when I look in the normalised count file, I am finding very little support for these genes across all my samples under one condition.
I have been investigating how the normalisation algorithm works, though I hit a bit of a stumbling block when it comes to 0s. In my understanding, the normalisation works as follows. Deseq2 normalisation divides counts by the sample-specific size factors. It generates this size-factor by the median ratio of gene counts relative to the mean per gene. Normalization works via the assumption that most of the genes are not DE and that the median ratio captures the scaling factor from the non-DE genes. This can also be summarised as the steps below.
Overview :
- Step 1: creates a pseudo-reference sample (row-wise geometric mean)
- Step 2: calculates the ratio of each sample to the reference
- Step 3: calculate the normalization factor for each sample (size factor)
- Step 4: calculate normalised counts
I have 71 samples, across 6 conditions, and I do my process as following :
dds <- DESeqDataSetFromMatrix(countData=countdata, colData=coldata, design=~cond)
dds <- collapseReplicates(dds, dds$ID2) ## have 3 wells per sample/patient so combining them
dds <- estimateSizeFactors(dds)
# filter genes where there are less than 5 samples with normalised counts greater or equal to 10
keep <- rowSums( counts(dds, normalized=TRUE) >= 10 ) >= 5
dds <- dds[keep,]
dds <- DESeq(dds)
res <- results(dds)
c1_vs_c2 <- results(dds, contrast=c("cond", "c1", "c2"), independentFiltering=TRUE, alpha=0.05, pAdjustMethod="BH")
c3_vs_c4 <- results(dds, contrast=c("cond", "c3", "c4"), independentFiltering=TRUE, alpha=0.05, pAdjustMethod="BH")
c5_vs_c6 <- results(dds, contrast=c("cond", "c5", "c6"), independentFiltering=TRUE, alpha=0.05, pAdjustMethod="BH")
##lfc shrink
c1_vs_c2_lfc <- lfcShrink(dds, contrast=c("cond", "c1", "c2"), res=c1_vs_c2)
c3_vs_c4_lfc <- lfcShrink(dds, contrast=c("cond", "c3", "c4"), res=c3_vs_c4)
c5_vs_c6_lfc <- lfcShrink(dds, contrast=c("cond", "c5", "c6"), res=c5_vs_c6)
When I view the counts, the normalised counts and on the sizeFactors, I have results and I get no errors. Even though every row contains at least 1 zero.
head(counts(dds))
c1_r1 c1_r2 c1_r3 c2_r1 c2_r2 c2_r3
A2M 2 0 4 5 25 5 0 1 0
AAAS 0 0 0 23 30 28 0 2 0
AAGAB 0 0 24 54 90 0 0 211 0
head(counts(dds, normalized=TRUE))
c1_r1 c1_r2 c1_r3 c2_r1 c2_r2 c2_r3
A2M 1.349375 0.00000 3.703930 1.2096526 9.425593 1.945083
AAAS 0.000000 0.00000 0.000000 5.5644018 11.310711 10.892464
AAGAB 0.000000 0.00000 22.223580 13.0642476 33.932133 0.000000
head(dds$sizeFactor)
c1_r1 c1_r2 c1_r3 c2_r1 c2_r2 c2_r3
1.4821678 0.9565306 1.0799340 4.1334183 2.6523531 2.5705847
In a geometric mean typically one of three options occur:
- 1 is added to each value in the set and then 1 is subtracted from the result.
- Blank and 0 values are ignored in the calculation
- 0s are converted to 1 for the calculation
I am at a loss on how the normalisation factor is handling zeros. I have been crazy enough to do the maths on my raw data to see if I can reproduce the normalised counts using all three methods but so far I haven't.
How does Deseq2 handle 0s in the input counts? I would appreciate any help on this matter.
thanks for getting back to me so quickly.
I did read the paper but I clearly am too tired on a Friday evening (where I am) to have properly understood the parts about 0. So Deseq2 ignored the zeros and calculates the geometric mean of the row without them?
It is a bit of an unusual dataset. It is not typically bulk but not single cell. It was data that was initially sorted with flow cytometry to look at pathogen-specific cells. Which is why I am trying to understand what effect this non-standard data might be having on the algorithim.
You may need an alternative size factor estimation method if there are no genes that don't have a zero. But I don't know what would be appropriate for your data. Perhaps look into sum factors from scran.
Sorry Double post.
Just to clarify, I don't have a single gene without at least one zero.