Hi!
On top of a standard DE test with limma-voom, I want to carry out a differential variability analysis between two conditions (diets 1 and 2), controlling for categorical batches (e.g., sequencing days) and surrogate variables explaining non-condition/batch structure in the data. The goal is to identify differences in variability that may not purely be a consequence of differences in mean (e.g., control of transcription/degradation rather than increased transcription). My dataset has >100 biological replicates per condition so I got good power to detect them if they exist.
My main questions are if the vst()-based Levene's test approach below is correctly coded and/or sensible?
counts # count matrix
all_covariates # dataframe with conditions (1 or 2), seqday (1 or 2 or 3) and SVs (continuous)
conditions<- as.factor(all_covariates$conditions)
seqday <- as.factor(all_covariates $seqday)
sv <- as.numeric(all_covariates $sv)
data <- DESeqDataSetFromMatrix(countData = counts,
colData = all_covariates,
design = ~conditions)
vst.data = vst(data, blind=F)
contrasts(seqday) <- contr.sum(levels(seqday))
batches <- model.matrix(~seqday+sv)
vst.counts.bc <- limma::removeBatchEffect(vst.counts, covariates = batches[,-1], design = ~conditions)
pvalues <- sapply(1:nrow(vst.counts.bc), function(i){
data <- cbind.data.frame(gene = as.numeric(t(vst.counts.bc[i,])), conditions)
result <- leveneTest(gene~conditions, data, center='median')
p <- result$`Pr(>F)`[1]
return(p)
})
bh <- p.adjust(pvalues, method = "BH")
For some extra context and correct me if I'm wrong - my reading of the source code is that the vst() corrects for library sizes and a mean-variance relationship that varies from gene to gene per condition, in particular addressing the dispproportionately high variances of genes that are either super-low and super-high in mean expression. Furthermore, the median-centred Levene's test should be robust to deviations from normality. Hence, the above approach feels sound. Yet, I haven't seen a paper go for this. Instead, they use packages to fit negative binomials models on TMM+CPM data that model changes in variability as an overdispersion parameter obeying a quadratic mean-variance relationship (e.g., https://journals.physiology.org/doi/full/10.1152/physiolgenomics.00128.2018?), in spite of some papers reporting higher-order polynomial mean-variance relationships in RNAseq data (https://bmcbiol.biomedcentral.com/articles/10.1186/s12915-020-00842-z).
Your approach to differential variability modeling using vst() looks promising! It mirrors the challenge of mastering levels in Geometry Dash both require a nuanced understanding of underlying structures. Don't hesitate to explore alternative packages if needed, as they might reveal additional layers of variability in your data, akin to discovering secret paths in a game.