I am trying to decide which covariates I should include in my model for EdgeR. To do this, I have read that I should compare a full model including the covariate and a reduced model that lacks the covariate using a likelihood ratio test. How would I go about this in EdgeR and what metrics should I use to decide whether to include the covariate (e.g. patient age) or not.
In conventional linear modeling you would do what you say; fit a full and reduced model and then compare the two to see which fits the data better. But when you are planning to fit maybe 20,000 models, it's not so simple, because the covariate may be significant for some subset of the genes, but not the others. In which case what do you do?
You could just fit a possibly over-parameterized model to all the genes, and accept the fact that for some of the genes you are losing degrees of freedom to fit a coefficient that isn't necessary, and doesn't improve the model. This is what people usually do, so long as they have the degrees of freedom to spare.
An alternative would be to fit the models including (for example) age, and then test the age coefficient. If < N genes have a significant age coefficient, where N is some ad hoc number that you choose, then you could say it's not useful for enough genes and omit it.
And these days people use a quasi-likelihood F test with edgeR, rather than the likelihood ratio test, and you should too.
Great thank you. And how would I find out whether the coefficient for age is significant or not? Is this stored somewhere in the output of glmQLFTest if my model is ~age + disease. I couldn't locate the p-values for the coefficients?
Thank you, I don't think I understood correctly when I was reading it. I know how to test for differential expression based on a condition of interest but I always get confused when it comes to the covariates - I couldn't find anything specifically relating to testing the significance of covariates in the EdgeR manual (unless I am missing it?). If I just set my model as ~disease + age, would this provide me with the correct info?
In a linear model, a 'condition of interest' and a 'covariate' or 'nuisance variable' are not different. They are all just independent variables that we think might explain something about the observed gene expression. It so happens that age in your model is a nuisance variable, that you think might affect the gene expression, but isn't of interest otherwise. But you test it the same way you test a condition of interest.
Presuming you have two disease states, and age is continuous (rather than a factor, which wouldn't make sense), then your third coefficient captures the slope for age. And using topTags to select that coefficient will give you the genes that are significantly affected by age, just like using topTags to select the disease coefficient tells you the genes that vary between the two disease states.
But do note that your model forces the slope to be the same for both disease states, which might not actually be the case for any number of genes. If you use ~ disease*age you will allow separate intercepts and slopes for the (assumed by me) two disease states, and you would test the disease:age coefficient to see if there is evidence that the gene expression is affected by a combination of disease and age.
We recommend that you use FDRs to assess whether a variable in significant, and it is exactly the same for a covariate as for any other variable.
Generally speaking, a covariate should be included if it is highly significant for a large number of genes or can be ignored if it has only moderate significance for a limited number of genes, especially if those genes are not of primary interest to you.
To find out whether the coefficient for age is significant:
Thank you, this is really helpful. How do you deal with covariates such as sequencing pool that are factors. Say I had 4 different sequencing pools, would I have to do pairwise comparisons between each of these, or how would I test whether pool is having an effect?
Great thank you. And how would I find out whether the coefficient for age is significant or not? Is this stored somewhere in the output of
glmQLFTest
if my model is~age + disease
. I couldn't locate the p-values for the coefficients?Have you read the edgeR User's Guide? There are multiple examples that should answer your question.
Thank you, I don't think I understood correctly when I was reading it. I know how to test for differential expression based on a condition of interest but I always get confused when it comes to the covariates - I couldn't find anything specifically relating to testing the significance of covariates in the EdgeR manual (unless I am missing it?). If I just set my model as
~disease + age
, would this provide me with the correct info?In a linear model, a 'condition of interest' and a 'covariate' or 'nuisance variable' are not different. They are all just independent variables that we think might explain something about the observed gene expression. It so happens that age in your model is a nuisance variable, that you think might affect the gene expression, but isn't of interest otherwise. But you test it the same way you test a condition of interest.
Presuming you have two disease states, and age is continuous (rather than a factor, which wouldn't make sense), then your third coefficient captures the slope for age. And using
topTags
to select that coefficient will give you the genes that are significantly affected by age, just like usingtopTags
to select the disease coefficient tells you the genes that vary between the two disease states.Thank you, that makes sense - I always struggle to get my head around these things.
But do note that your model forces the slope to be the same for both disease states, which might not actually be the case for any number of genes. If you use
~ disease*age
you will allow separate intercepts and slopes for the (assumed by me) two disease states, and you would test thedisease:age
coefficient to see if there is evidence that the gene expression is affected by a combination of disease and age.