I have RNAseq data of patient and control samples with different age and gender.
I would like to know whether these covariate affect gene expression levels. If they do not have any effect remove them from model and create simpler model for differential expression analysis using edgeR and DESeq2.
Would you please let me know if there is a function in these packages to find the answer?
There aren't any functions per se, but it's not an uncommon thing to do. Do note however that you are fitting (tens of) thousands of models simultaneously, and age and/or gender (especially) will have an effect on at least some of those genes. One aspect of this sort of statistics is that you are doing things in bulk, and you don't have the ability to do lots of model checking, etc, and sometimes have to use an over-specified model because it is useful for some subset of the genes you care about (and you are thereby 'wasting' degrees of freedom on those genes for which the extra parameter(s) are not needed).
One thing you can do is a MDS or PCA plot to see if there are large sex or age related differences. You can also fit the 'full' model and then see how many genes have a significant coefficient for age or sex. If there are not too many such genes (for some definition of 'not too many'), you could decide to drop one or more coefficients. Or you could take the opposite approach and just fit the model with age and sex and burn the two degrees of freedom. If you have enough replication it might not matter that much.
So now we fit a model that includes sex and age, and then extracted all the genes for which the 'sex' coefficient is significant at an FDR of 5%. We then see how many genes fulfill that criterion. You can then decide if that number is large enough for you to think that the sex coefficient is needed or not.
I agree with Jim that for just 2 possible co-variates, I would just include them both as coefficients in the model. Even if there are only a few genes that one or both affect, it is better for them and the remaining genes likely won't be affected that much by losing 2 DF. However, another approach I recently used with some human data with 7 possible covariates was to first run WGCNA on the genes to divide the them into modules based on similar expression patterns, then use the eigengene values for each module in a BIC stepwise regression to select the best covariates for each module, and was pretty pleased with this broad level approach. I'm playing around with then running a separate limma/edgeR for the set of genes in each module using the best model to get gene-level results, but I'm not sure if it's a good idea or not because of the separate variance estimates for each gene set. I may broach this topic in a separate post...
Thanks James. Would you please explain about what is " significant coefficient ". sorry If this question is unrelevant.
So let's say you did something like this, where you have some data (called dglst, in my example) and a design matrix that includes sex and age.
So now we fit a model that includes sex and age, and then extracted all the genes for which the 'sex' coefficient is significant at an FDR of 5%. We then see how many genes fulfill that criterion. You can then decide if that number is large enough for you to think that the sex coefficient is needed or not.