Model performance for spline modelling of RNA-seq timecourses in edgeR
1
0
Entering edit mode
skiaphrene ▴ 10
@skiaphrene-6914
Last seen 7 days ago
Switzerland

Dear Bioconductor community,

I am working with RNA-seq timecourse data (from C. elegans) from experiments typically containing 8-16 equally-spaced timepoints and 2 treatment arms (control, perturbation), with a single sample per timepoint per arm.

I previously modeled expression as being smooth over time using b-splines on log-transformed normalised expression in limma. A typical nuisance parameter is the number of splines in the basis, which strongly influences the modeling. In limma, it was relatively straightforward to model using different numbers of splines, derive gene-wise penalised model performance metrics for each, and select the optimal number of splines for each gene. That said, I wanted to move to edgeR, to benefit from glmQLFit's more robust & RNA-seq-specific modelling. However, as far as I can tell, this precludes the ability to get gene-level penalised performance metrics.

My question is: is there some way to optimise the number of splines for each gene using glmQLFit(), or should I fall back upon glmFit() (for which - I gather - that gene-wise penalised model performance metrics can be derived), or is this type of model optimisation simply not appropriate when using edgeR?

I provide some code below to illustrate the modelling process; I can provide a full reproducible example with simulated data if need be.

## build DGEList object
# design_df contains notably factor Tmt (ctrl/pert) and numeric TP (integers 1 to t)
dge <- DGEList(counts = expr_mat, samples = design_df, genes = gene_df) 

## build b-spline basis evaluated at observed times
# <<< using a single nb splines for this example
bsb <- bs( dge$samples$TP, df = 4, degree = 3, intercept = FALSE) # 4 b-splines
rownames(bsb) <- rownames(dge$samples)

## create a model matrix with:
# intercept, perturbation offset, splines for common variation (ie that of the control arm), and splines for perturbation-specific variation
MM <- model.matrix( 
  ~ 1 + Tmt + bsb + Tmt:bsb, data=dge$samples, 
  contrasts.arg = list("Tmt"=car::contr.Treatment) )

## contrasts matrix for testing DE at each TP 
# simplified construction by editing the MM
CM <- MM
CM[ , !grepl("^Tmt", colnames(CM))] <- 0
CM <- CM[grepl("^pert", rownames(CM)),]
rownames(CM) <- paste0("delta@", dge$samples[rownames(CM),"TP"])

## edgeR analysis
dge <- estimateDisp(dge, MM, robust = T) 
dge_fit <- glmQLFit( dge, dge$design, prior.count = 8)  # using a prior count that we've used historically and works well with our data

## <<< this is where the nb splines optimisation would need to occur

## perform DE testing at each TP
# and assemble results into a single table
mdl_res <- list()
for (cn in rownames(CM)) {
  qlf <- glmQLFTest(dge_fit, coef=NULL, contrast=CM[cn,])
  mdl_res[[cn]] <- cbind("cname" = cn, tibble::rownames_to_column(
    topTags(qlf, n=Inf)$table[,c("logFC", "F", "PValue", "FDR")], "gene_id"))
}
mdl_res <- do.call(rbind, mdl_res) ; rownames(mdl_res) <- NULL
# ==> works fine, but chosen nb of splines might overfit some genes, and underfit others

Thank you in advance for your time and consideration. Best regards to all,

-- Alex

edgeR limma • 423 views
ADD COMMENT
0
Entering edit mode
@gordon-smyth
Last seen 6 hours ago
WEHI, Melbourne, Australia

I don't understand your question, because there is no difference between limma or glmFit or glmQLFit in terms of spline modeling or the tests than can be done.

limma and edgeR assume regression splines where the degree of the spline is set in advance. This approach is more flexible that it appears and does not prejudge the complexity of the curve for each gene. You simply set the degree of the spline to be maximum that is likely to be needed for any gene. I would never use degree greater than 5 for any practical analysis myself.

I generally like to generate natural splines, because they extend gracefully outside the range of the data.

I do not recommend trying to optimize the degree of the spline that is fitted on a per-gene basis. Such as approach is incompatible with rigorous statistical assessment of DE, whether in limma or edgeR or in any other software. limma and edgeR have always been based on the principle that the user will fit the maximal model to each gene, and then you can conduct DE tests to determine how many of the coefficients were significant for each gene.

ADD COMMENT
0
Entering edit mode

Dear Gordon,

Thank you very much for your prompt reply.

I agree with the limma/edgeR perspective of including all potentially explanatory variables in a model, and then querying the terms in that model, rather than performing any sort of feature selection. I would argue that that is exactly how the contrast-based querying I am doing works. However, setting the number of splines is not a question about which splines provide explanatory power, but how much smoothing should be used across the timecourse. For example, a model with a b-spline basis with 6 splines, of which 1 is non-explanatory, is fundamentally different from a model with only 5 splines: the b-splines are in different locations, and the smoothing is different.

In practice, we have observed that different genes require different amounts of smoothing, due to radically different trajectories and differences in noise levels. For example, a gene with an oscillatory trajectory of frequency 2*F may require a number P of splines to fit believably, while another gene with oscillations of frequency F would only require P'<P. Simply using the higher number of splines (i.e. P) across all genes will lead to cases of fitting to noise in genes with lower complexity trajectories. I'll admit I unfortunately don't have a way of quantifying this, it is just my overall impression from having worked on these models, but this is why I am trying to figure out a way of optimising the number of splines for each gene.

My attempts of doing this in limma involved running the modelling on all genes with different numbers of splines, then extracting the contrast results I am interested in for each subset of genes that share a same optimal number of splines. If this is inappropriate, I will abandon this. This would not, however, solve the question of what number of splines to use for the entire dataset. The choice can be to the discretion of the analyst, of course, but I was hoping that there would be some way of optimising this, either gene-wise or holistically.

Thank you again for your feedback and best regards,

-- Alex

ADD REPLY

Login before adding your answer.

Traffic: 836 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6