Question

accounting for phylogenetic non-independence in differential expression analysis

0

Entering edit mode

noah.reid ▴ 10

@noahreid-13414

Last seen 2.6 years ago

United States

Hello,

I have an RNA-seq dataset consisting of 16 species, each subjected to three experimental treatments, with three replicate individuals each. The species fall into three physiological categories. I would like to identify genes whose response to the treatments varies by physiological category.

I think I've set up my model and contrasts effectively in edgeR in a way that treats each species as independent (I get apparently sensible results given prior knowledge) but we know from a long history of work in comparative methods that they are not, and I am concerned this may mislead my analysis.

I think the expectation under a phylogenetic regression is that the errors in the model will covary with a structure determined by the phylogeny and an underlying evolutionary model (e.g. brownian motion). It has been suggested to me that because edgeR is fitting a glm, it may be possible to input a covariance matrix that accounts for this expectation. I have considered extracting the log fold changes for each species separately and modeling them in a separate analysis, but that seems extremely undesirable as it would lock in point estimates that are likely to be highly uncertain at our level of replication.

I am new to both comparative methods and functional genomics analysis, so I'm sure I'm explaining my issue in a muddy way, but any advice would be appreciated. I would be happy to provide more detail on the experiment or my analysis. I've searched the message board history and can't find any information on how to deal with phylogenetic non-independence in multi-species expression datasets.

Thanks,

Noah

edger differential expression phylogenetic comparative methods • 1.6k views

ADD COMMENT • link updated 7.8 years ago by Aaron Lun ★ 28k • written 7.8 years ago by noah.reid ▴ 10

0

Entering edit mode

Dear Noah,

You do have a dependence problem here, but it doesn't have anything to do with the phylogenetic relationships between the species. The problem is simply that your data has a repeated measures structure, with samples within species and species within categories. This sort of dataset is treated in Section 3.5 of the edgeR User Guide. When you read that section, think of species as patients and categories as disease status. Aaron's answer follows the approach of that section.

There are some nuances however. The correct analysis of your data depends on what conclusions you want to make and how the species were selected. You are studying 16 species. Are these all the species you want to make conclusions about, or are they just example species from the three categories that happened to be convenient? If you did the RNA-seq experiment again, would you necessarily use the same species, or might you use different ones?

Can you respond to the above questions? Then I might be able to say more .

ADD REPLY • link 7.8 years ago Gordon Smyth 52k

score 1 · Answer 1 · 2017-07-07

Regardless of whether the species are "independent", if you haven't modelled the species factor explicitly, this will already cause problems. Presumably, replicates from the same species are more similar than replicates from different species - this will induce hidden correlations in the residuals of the fitted model. As a result, we overstate the residual d.f. available in the study, which results in inaccurate inferences from edgeR.

The better approach would be to model the species explicitly. I would combine the species and treatment factors into a single grouping factor, and feed that to model.matrix as a one-way layout with no intercept term. Here, correlations between closely-related species are absorbed into the fixed effects and can be ignored. They have no bearing on the modelling of the random component (which just deals with technical and biological variability across replicates within a species), so no GLMM-like fiddling with the covariance matrix is necessary.

Now it is easy to compute the treatment effect for each species. For example, to get the effect of treatment 1 for species A, you could use:

makeContrasts(A.treatment1 - A.control, levels=design)

If you have species A, B and C for the same physiological state, you can get a single statistic for this state with:

makeContrasts(((A.treatment1 - A.control) + (B.treatment1 - B.control) +
               (C.treatment1 - C.control))/3, levels=design)

... which tests whether the average of the treatment effects across these species is equal to zero. A straightforward extension of this logic can be applied to determine if treatment effect varies between physiological states. If species X and Y belong to a different state than A, B and C, and you want to determine if these two states differ in their treatment responses, you can do:

makeContrasts(((A.treatment1 - A.control) + (B.treatment1 - B.control) +
               (C.treatment1 - C.control))/3 - 
              ((X.treatment1 - X.control) + (Y.treatment1 - Y.control))/2, levels=design)

... which tests if the average treatment effect of A, B and C differs from that of X and Y.