Hi! I'm new with edgeR and Differential Expression analysis in general so I'm having some trouble designing the analysis for my single-cell RNAseq data.
First of all, my input is a count matrix in a txt file with genes in rows, cells in columns, and expression level as values (number of mapped reads). With this, I've performed a clustering analysis using Seurat so now I have groups of cells that potentially are of the same cell type. Moreover, I have cells coming from two samples from different conditions (WT and mutant). So, to sum up, I have something similar to this toy experimental design:
> exp.design
cluster condition
cell1 cluster1 WT
cell2 cluster1 WT
cell3 cluster1 mutant
cell4 cluster1 mutant
cell5 cluster2 WT
cell6 cluster2 WT
cell7 cluster2 mutant
cell8 cluster2 mutant
cell9 cluster3 WT
cell10 cluster3 WT
cell11 cluster3 mutant
cell12 cluster3 mutant
And now I wanted to use edgeR (since it is highly recommended by the scientific community) to perform a DE to answer the following questions:
- Which genes are differentially expressed for each cluster (cell type) independently of the condition? That is, which genes help differentiate each one of the clusters from the others? That would help me to find genes that characterize the cluster so I can infer the cell type and potentially identify marker genes.
- Within one cluster, which genes are differentially expressed between conditions? That would help to see if one condition affects the gene expression in one particular cell type.
Now that I've explained the context, a few questions have come to my mind when designing the steps for the analysis:
- With the experimental design provided above, each cell will be treated as a replicate. Is this okay? I've read in some blogs that it's recommendable to perform a "pseudobulk approach", that is, to aggregate the counts coming from cells grouped in the same cluster. They say so because they consider that each cell is not truly a replicate since it comes from the same biological sample (it's the case of cell1 and cell2, for example). However, in other sites, they don't recommend this practice since it doesn't take advantage of the power of single-cell data. Which are your opinions? what's the best way to proceed?
And regarding the analysis with edgeR specifically:
Which would be the best way to proceed to answer the first question? Is it okay to block the "condition" factor as it is explained in section 3.4.2 of the manual?
Regarding the first question as well, what would be the best procedure to test one cluster against the others (once we control the "condition" factor)?
3.1. Do all the pairwise comparisons between clusters (e.g. cluster1 vs cluster2, and cluster1 vs cluster3) and keep the DE genes in all comparisons? 3.2. Compare the expression of the desired cluster against the average of the rest? With:
average.model<- makeContrasts( clust1 - (clust2 + clust3) / 2 , levels = designmatrix)
de.average<- glmLRT(fit, contrast = average model)
- To address the second question, it would be fine to combine cluster+condition into one group and perform DE for each cluster? For example, clust1.WT vs clust1.mutant. Something like:
> design.matrix <- model.matrix(~0+cluster.condition)
> contrast.c1 <- makeContrasts(cluster1.WT - cluster1.mutant, levels = design.matrix)
> de.cluster1 <- glmLRT(fit, contrast = contrast.c1)
Sorry for the long questions, let me know if I didn't explain myself well at any point. Thanks in advance!
Thank you very much for your answer and thanks to Gordon Smyth too! I was going further with the analysis and now I'm stuck again in how to block the "condition" factor. Sorry for the bother, but I don't fully understand how the
model.matrix()
works.As I've mentioned in my post, what I want is to know which genes are differentially expressed for each cluster (cell type) independently of the condition. If I go as the manual says I obtain a matrix like this:
As far as I've understood since there's an "Intercept" and the columns "conditionsWT" and "clusterscluster_1" don't appear, the condition "WT" (wild-type) and the cluster "cluster_1" are treated as a reference. I'm fine with the wild-type condition being treated as a reference. What I don't want is to treat "cluster_1" as such, since there's no prior reason to do that.
Eventually, I would like to compare each cluster against the other but handling the "condition" effect somehow in the model matrix. I would like to do something like this:
How should I build the model.matrix()? Is
model.design <- model.matrix(~0+clusters+conditions)
correct? That seems to do the trick:Any help will be appreciated,
Marta.
Hi all! In case someone gets until here with the same doubts as me, I have encountered this amazing, helpful guide: "A guide to creating design matrices for gene expression experiments". You can jump directly to the section of interest, but I really encourage people who are new to this type of analysis to read it entirely.
Now I will try to answer my own question on how to build the design matrix in my particular case. When the explanatory variable is categorical, the model matrix is equivalent with and without the intercept term. See section 4.4 of the above-mentioned guide for more details. So I shouldn't worry too much about putting the
~0
and treating a cluster or condition as a "reference". However, I prefer to put it for clarity.Moreover, it is the same to do
model.matrix(clusters+conditions)
asmodel.matrix(conditions+clusters)
. It doesn't matter the order, both models are adding the effects of the two variables. The only difference is that columns in the model matrix will be ordered differently. So what we should care about is to indicate correctly the indices in thecoef
argument in theglmQLFTest()
andglmLRT()
functions. Another way to do it avoiding the indices is by using themakeContrasts()
function. See section 7.3 of the guide to learn more about accounting for factors that are not of interest.To sum up, I ended building this matrix and performing the DE like this:
This is my solution so far! If anyone finds an error, I would be very grateful if you could comment on it.