Hi All,
I've been attempting to use MAST to perform differential gene expression analysis between 3 conditions (control, disease_state1, and disease_state2) from single-cell data and have been running into issues with very low logFCs (all of them with absolute values of <1) that are different in comparison to using Wilcoxon Rank Sum test with Seurat or when I perform pseudobulk methods (such as DESeq2 or edgeR).
Code should be placed in three backticks as shown below
# Subsetting my cluster of interest
cluster_seurat_obj <- subset(seurat_obj, idents = cluster_1")
DefaultAssay(cluster_seurat_obj) <- "RNA"
# Converting the Seurat object into a SingleCellExperiment object
sce <- as.SingleCellExperiment(DietSeurat(cluster_seurat_obj, graphs = c("pca", "umap")))
# Log-2 transforming the raw counts
logcounts(sce) <- log2(counts(sce) + 1)
# Converting the SingleCellExperiment object into a MAST SingleCellAssay object
sca <- MAST::SceToSingleCellAssay(sce, class = "SingleCellAssay")
# Scale gene detection rate
colData(sca)$nFeature_RNA = scale(colData(sca)$nFeature_RNA)
# Cellular detection rate
cdr2 <- colSums(assay(sca)>0)
colData(sca)$cngeneson <- scale(cdr2)
# Using column header for my variable of interest
cond <- factor(colData(sca)$process)
cond <- relevel(cond, "control") #setting my control, as there are 2 different disease processes as well
colData(sca)$condition <- cond
# Adding batch to cond to control for batch effect (using each individual sample as its own batch)
cond <- factor(colData(sca)$sample_ID)
colData(sca)$batch_ID <- cond
# Filtering genes to only include those with non-zero expression in >5% of cells
expressed_genes <- freq(sca) >0.05
sca <- sca[expressed_genes,]
# Performing the DEG analysis
zlmCond <- suppressMessages(MAST::zlm(~condition + cngeneson + (1 | batch_ID), exprs_value = "logcounts", sca,method='glmer',ebayes = FALSE,strictConvergence = FALSE, parallel=TRUE))
# Using "condition<diseaseofinterest>" annotation to highlight my condition of interest
summaryCond <- summary(zlmCond, doLRT='conditiondisease1')
summaryDt <- summaryCond$datatable
fcHurdle <- merge(summaryDt[contrast=='conditiondisease1' & component=='H',.(primerid, `Pr(>Chisq)`)],
summaryDt[contrast=='conditiondisease1' & component=='logFC', .(primerid, coef, ci.hi, ci.lo)], by='primerid')
fcHurdle[,fdr:=p.adjust(`Pr(>Chisq)`, 'fdr')]
I do get the following warning when I run the zlm, suggesting that MAST cannot find a good model because the variability is too high. However, this is for a cluster with ~15,000 cells and a very common cell type, so I'm wondering if there's something wrong with my code itself:
Warning messages:
1: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model failed to converge with max|grad| = 0.00241813 (tol = 0.002, component 1)
2: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model failed to converge with max|grad| = 0.0464445 (tol = 0.002, component 1)
3: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model failed to converge with max|grad| = 0.0719693 (tol = 0.002, component 1)
4: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model is nearly unidentifiable: very large eigenvalue
- Rescale variables?
5: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model failed to converge with max|grad| = 0.0824029 (tol = 0.002, component 1)
6: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model is nearly unidentifiable: very large eigenvalue
- Rescale variables?;Model is nearly unidentifiable: large eigenvalue ratio
- Rescale variables?
The results end up with around 70 significant genes (FDR <0.05) but in the "coef" column correlating to the logFC, the absolute values range from 0.03 to 0.9 or so. This is very different in comparison to the Wilcoxon Rank Sum test or other pseudobulk approaches (where the log2FCs are typically in the 3s for a lot of the genes). My suspicion is that I'm doing something wrong, as a relative beginner to MAST, in what I'm coding but I'm not sure what steps to take next to troubleshoot this.
Any guidance would be much appreciated. Thanks in advance!
I am facing the same issue. Did you ever figure out why the logFC is so different?