Hi,
I am trying to use DESeq2 to perform differential analysis on a large 16S microbiome dataset of around 600 samples. It is clear from our current non-parametric analysis that many of our OTUs of interest are associated with one or more unwanted covariates, so we want to use a parametric model like DESeq2 to account for these. As with many OTU tables, ours is extremely sparse, and I find that there is a lot of variation in the size factors as well as an unusual-looking dispersion plot.
The OTU table is extremely sparse; I have filtered the initial 15,000 OTUs down to only those present in more than 10% of samples, which leaves 302 OTUs.
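The prevalence filter was along these lines (a minimal sketch; countData here is assumed to be the raw OTU-by-sample count matrix):

# keep only OTUs detected (count > 0) in more than 10% of the samples
keep <- rowSums(countData > 0) > 0.10 * ncol(countData)
countData <- countData[keep, ]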
> dim(counts(dds))
[1] 302 653

> quantile(rowSums(counts(dds) > 0), seq(0, 1, 0.1))
   0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100%
 66.0  73.0  81.0  89.3 101.0 113.0 136.0 175.1 209.8 269.2 588.0
# most OTUs are extremely sparse

My metadata looks like this:

> dds@colData[1:10, ]
DataFrame with 10 rows and 6 columns
           Condition   Gender  ABusage Location      Cutage sizeFactor
            <factor> <factor> <factor> <factor>    <factor>  <numeric>
SPRRPN1131       G2A   Female     PAST       UC (44.6,57.4]   4.481373
SPRRPN2087       G2A   Female     TRUE       UC (44.6,57.4]   2.562013
SPRRPN1532        G2     Male     PAST       UC (44.6,57.4]   4.234307
SPRRPN1186       G2A     Male     PAST       UC (44.6,57.4]   4.215819
SPRRPN1974        G2     Male     PAST       KI (44.6,57.4]   1.563766
SPRRPN1110        G2   Female     PAST       JP (44.6,57.4]   7.166926
SPRRPN1525       G2A     Male     PAST       JP (57.4,70.2]   2.753161
SPRRPN1437       G2A   Female     TRUE       UC (31.8,44.6]   4.289375
SPRRPN1884       G2A   Female    FALSE       JP (57.4,70.2]   1.567005
SPRRPN1559       G2A     Male     PAST       UC (18.9,31.8]   1.396909

library(DESeq2)
library(BiocParallel)

# run DESeq2 with covariates in the design
dds <- DESeqDataSetFromMatrix(countData = countData,
                              colData   = datMeta,
                              design    = ~ Gender + ABusage + Location + Cutage + Condition)

# phyloseq-style geometric mean that tolerates zeros
gm_mean <- function(x, na.rm = TRUE) {
  exp(sum(log(x[x > 0]), na.rm = na.rm) / length(x))
}

# calculate size factors from the zero-tolerant geometric means
geoMeans <- apply(counts(dds), 1, gm_mean)
dds <- estimateSizeFactors(dds, geoMeans = geoMeans)

> quantile(sizeFactors(dds), seq(0, 1, 0.1))
        0%        10%        20%        30%        40%        50%        60%        70%        80%        90%       100%
 0.3734477  1.5644135  2.2461473  2.8414593  3.5440500  4.0941265  4.7938855  5.8795089  7.4389091 12.3545886 66.8361032

# use parallel workers and fit the model
register(MulticoreParam(4))
diagdds <- DESeq(dds, fitType = "local", parallel = TRUE, minReplicatesForReplace = Inf)

# define contrasts
contrasts <- list(
  c("Condition", "G2",  "CTRL"),
  c("Condition", "G2A", "CTRL"),
  c("Condition", "G1A", "CTRL"),
  c("Condition", "G1",  "CTRL"))

# extract the first contrast
res <- results(diagdds, contrast = contrasts[[1]],
               cooksCutoff = FALSE, pAdjustMethod = "fdr",
               independentFiltering = FALSE)
res <- res[order(res$padj, na.last = NA), ]
res <- as(res, "data.frame")
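The size-factor spread and the dispersion plot I refer to below can be reproduced with the standard DESeq2 helpers, for example:

# distribution of the estimated size factors
hist(log2(sizeFactors(diagdds)), breaks = 30,
     main = "Size factors", xlab = "log2 size factor")

# gene-wise estimates (black), fitted trend (red), final values (blue)
plotDispEsts(diagdds)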
Given the variability in the size factors, I'm concerned that our data may be too variable and may not fit the negative binomial model very well. As I'm quite new to this, could someone comment on the size factors? Given that our library sizes range from 2,500 to 10,000, size factors ranging from 0.3 to 67 seem extraordinarily high, and it's unlike anything I've seen before. Also, what about the dispersions? Whilst the plot looks "smooth", it also looks very flat, which again is unlike any of the examples I have come across.
Thanks,
Jess
Thanks Michael,
I tried calculating the size factors using the upper-quantile method, based on a code snippet mentioned in another thread:
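The snippet was roughly as follows (a sketch of the upper-quantile idea: take the 0.90 quantile of each sample's counts and scale so the size factors have geometric mean 1):

# per-sample 0.90 quantile of the raw counts
qs <- apply(counts(dds), 2, quantile, probs = 0.90)
# scale to geometric mean 1 and assign as size factors
sizeFactors(dds) <- qs / exp(mean(log(qs)))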
Around 20 samples have a count of 0 at the 0.90 quantile; I would have to go up to the 0.99 quantile to make all samples non-zero. To evaluate the effect, I removed these samples for now.
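Identifying and dropping those samples was done roughly like this (sketch):

# samples whose 0.90 count quantile is zero get an undefined size factor
qs <- apply(counts(dds), 2, quantile, probs = 0.90)
table(qs == 0)        # roughly 20 such samples
dds <- dds[, qs > 0]  # drop them for now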
The size factors are smaller, but there is still an 80-fold difference between the smallest and the largest.
This also changed the dispersion plot considerably. I also tried calculating the size factors using the iterative method, but it failed to converge.
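For reference, the iterative estimator I mean is the one built into DESeq2; the call was presumably along these lines:

# model-based, iterative size factor estimation; can fail to converge on very sparse data
dds <- estimateSizeFactors(dds, type = "iterate")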
So I'm unsure how to proceed. Do you think the analysis is unusable, given the variability in the size factors from the median-ratio approach? The significant bugs are mostly concordant with what I find from Wilcoxon tests, so I'm inclined to believe them: both approaches find around 50 significant bugs, with an intersection of 30, at an FDR < 0.05 with no covariates in the design. Also, I am filtering out the very rare species before running DESeq2; is it more appropriate to calculate the size factors on the entire dataset and then fit the model on the filtered data (something like the sketch below)?
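Concretely, I mean something like this (a sketch; countDataFull stands in for the unfiltered OTU table, and "poscounts" is the zero-tolerant geometric-mean method built into DESeq2, used here in place of the gm_mean trick above):

# estimate size factors on the full, unfiltered table
dds.full <- DESeqDataSetFromMatrix(countData = countDataFull, colData = datMeta,
                                   design = ~ Gender + ABusage + Location + Cutage + Condition)
dds.full <- estimateSizeFactors(dds.full, type = "poscounts")

# carry those size factors over to the filtered object before fitting
sizeFactors(dds) <- sizeFactors(dds.full)
diagdds <- DESeq(dds, fitType = "local", parallel = TRUE, minReplicatesForReplace = Inf)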
I really can't say what's appropriate for metagenomic data analysis.
I don't know if the estimated size factors are reasonable or not. Just because the range is different from the range of the total counts doesn't mean it's wrong. I just don't know the appropriate way to assess size factors for a given metagenomics dataset.
The dispersion plot looks totally fine; basically, there is no shrinkage when you have so many degrees of freedom.