Dear Bioconductor community,
I am working with RNA-seq timecourse data (from C. elegans) from experiments typically containing 8-16 equally-spaced timepoints and 2 treatment arms (control, perturbation), with a single sample per timepoint per arm.
I previously modeled expression as being smooth over time using b-splines on log-transformed normalised expression in limma. A typical nuisance parameter is the number of splines in the basis, which strongly influences the modeling. In limma, it was relatively straightforward to model using different numbers of splines, derive gene-wise penalised model performance metrics for each, and select the optimal number of splines for each gene. That said, I wanted to move to edgeR, to benefit from glmQLFit's more robust & RNA-seq-specific modelling. However, as far as I can tell, this precludes the ability to get gene-level penalised performance metrics.
My question is: is there some way to optimise the number of splines for each gene using glmQLFit(), or should I fall back upon glmFit() (for which - I gather - that gene-wise penalised model performance metrics can be derived), or is this type of model optimisation simply not appropriate when using edgeR?
I provide some code below to illustrate the modelling process; I can provide a full reproducible example with simulated data if need be.
## build DGEList object
# design_df contains notably factor Tmt (ctrl/pert) and numeric TP (integers 1 to t)
dge <- DGEList(counts = expr_mat, samples = design_df, genes = gene_df)
## build b-spline basis evaluated at observed times
# <<< using a single nb splines for this example
bsb <- bs( dge$samples$TP, df = 4, degree = 3, intercept = FALSE) # 4 b-splines
rownames(bsb) <- rownames(dge$samples)
## create a model matrix with:
# intercept, perturbation offset, splines for common variation (ie that of the control arm), and splines for perturbation-specific variation
MM <- model.matrix(
~ 1 + Tmt + bsb + Tmt:bsb, data=dge$samples,
contrasts.arg = list("Tmt"=car::contr.Treatment) )
## contrasts matrix for testing DE at each TP
# simplified construction by editing the MM
CM <- MM
CM[ , !grepl("^Tmt", colnames(CM))] <- 0
CM <- CM[grepl("^pert", rownames(CM)),]
rownames(CM) <- paste0("delta@", dge$samples[rownames(CM),"TP"])
## edgeR analysis
dge <- estimateDisp(dge, MM, robust = T)
dge_fit <- glmQLFit( dge, dge$design, prior.count = 8) # using a prior count that we've used historically and works well with our data
## <<< this is where the nb splines optimisation would need to occur
## perform DE testing at each TP
# and assemble results into a single table
mdl_res <- list()
for (cn in rownames(CM)) {
qlf <- glmQLFTest(dge_fit, coef=NULL, contrast=CM[cn,])
mdl_res[[cn]] <- cbind("cname" = cn, tibble::rownames_to_column(
topTags(qlf, n=Inf)$table[,c("logFC", "F", "PValue", "FDR")], "gene_id"))
}
mdl_res <- do.call(rbind, mdl_res) ; rownames(mdl_res) <- NULL
# ==> works fine, but chosen nb of splines might overfit some genes, and underfit others
Thank you in advance for your time and consideration. Best regards to all,
-- Alex
Dear Gordon,
Thank you very much for your prompt reply.
I agree with the limma/edgeR perspective of including all potentially explanatory variables in a model, and then querying the terms in that model, rather than performing any sort of feature selection. I would argue that that is exactly how the contrast-based querying I am doing works. However, setting the number of splines is not a question about which splines provide explanatory power, but how much smoothing should be used across the timecourse. For example, a model with a b-spline basis with 6 splines, of which 1 is non-explanatory, is fundamentally different from a model with only 5 splines: the b-splines are in different locations, and the smoothing is different.
In practice, we have observed that different genes require different amounts of smoothing, due to radically different trajectories and differences in noise levels. For example, a gene with an oscillatory trajectory of frequency 2*F may require a number P of splines to fit believably, while another gene with oscillations of frequency F would only require P'<P. Simply using the higher number of splines (i.e. P) across all genes will lead to cases of fitting to noise in genes with lower complexity trajectories. I'll admit I unfortunately don't have a way of quantifying this, it is just my overall impression from having worked on these models, but this is why I am trying to figure out a way of optimising the number of splines for each gene.
My attempts of doing this in limma involved running the modelling on all genes with different numbers of splines, then extracting the contrast results I am interested in for each subset of genes that share a same optimal number of splines. If this is inappropriate, I will abandon this. This would not, however, solve the question of what number of splines to use for the entire dataset. The choice can be to the discretion of the analyst, of course, but I was hoping that there would be some way of optimising this, either gene-wise or holistically.
Thank you again for your feedback and best regards,
-- Alex