Hello,
I have an RNA-seq dataset consisting of 16 species, each subjected to three experimental treatments, with three replicate individuals each. The species fall into three physiological categories. I would like to identify genes whose response to the treatments varies by physiological category.
I think I've set up my model and contrasts effectively in edgeR in a way that treats each species as independent (I get apparently sensible results given prior knowledge) but we know from a long history of work in comparative methods that they are not, and I am concerned this may mislead my analysis.
I think the expectation under a phylogenetic regression is that the errors in the model will covary with a structure determined by the phylogeny and an underlying evolutionary model (e.g. brownian motion). It has been suggested to me that because edgeR is fitting a glm, it may be possible to input a covariance matrix that accounts for this expectation. I have considered extracting the log fold changes for each species separately and modeling them in a separate analysis, but that seems extremely undesirable as it would lock in point estimates that are likely to be highly uncertain at our level of replication.
I am new to both comparative methods and functional genomics analysis, so I'm sure I'm explaining my issue in a muddy way, but any advice would be appreciated. I would be happy to provide more detail on the experiment or my analysis. I've searched the message board history and can't find any information on how to deal with phylogenetic non-independence in multi-species expression datasets.
Thanks,
Noah
Dear Noah,
You do have a dependence problem here, but it doesn't have anything to do with the phylogenetic relationships between the species. The problem is simply that your data has a repeated measures structure, with samples within species and species within categories. This sort of dataset is treated in Section 3.5 of the edgeR User Guide. When you read that section, think of species as patients and categories as disease status. Aaron's answer follows the approach of that section.
There are some nuances however. The correct analysis of your data depends on what conclusions you want to make and how the species were selected. You are studying 16 species. Are these all the species you want to make conclusions about, or are they just example species from the three categories that happened to be convenient? If you did the RNA-seq experiment again, would you necessarily use the same species, or might you use different ones?
Can you respond to the above questions? Then I might be able to say more .