Question

Error when calculating DE using the classical approach in edgeR

0

Entering edit mode

lynski008 • 0

@lynski008-11394

Last seen 8.6 years ago

I am trying to use edgeR to analyse differential gene expression from my RNA-seq dataset. The experimental design is as follows: I had 3 different treatment conditions i.e. 1 factor with 3 different levels. And there are 3 replicate RNA samples (counts libraries) for each of the 3 treatments i.e. 9 samples in total. I have made a DGEList with the counts for the 9 samples. I have filtered to remove features with metatags and low expression. I have also made a design matrix giving the information that there are 3 groups with 3 samples assigned to each group. The design matrix (called design) looks like this:

             Group1       Group2        Group3    
Sample1         1           0              0
Sample2         1           0              0
Sample3         1           0              0
Sample4         0           1              0
Sample5         0           1              0
Sample6         0           1              0
Sample7         0           0              1
Sample8         0           0              1          
Sample9         0           0              1

I have also estimated dispersion using the following


DG <- estimateDisp(DG, design, robust=TRUE)

where DG is my DGEList and design is my design matrix. Because my experimental design is very simple I don't think I need to use a GLM approach to analyse differential expression and I only want to do pair-wise comparisons i.e. find DE between group 1 and group 2 and between group 1 and group 3. So I think I want to use the edgeR classic approach. However, when I try the following code:

> et <- exactTest(DG, pair=c("group1","group2"))

I get the following error:

At least one element of given pair is not a group.
Groups are: 1

Can someone please help explain where I am going wrong? I think that my DGEList is not getting the information from my data matrix (design) or perhaps my data matrix is over-complicated for my experimental design and I need to somehow include the group information in my DGEList. At the moment if I print my DGEList the $samples table looks like this:

$samples
                      files group lib.size norm.factors
counts3D7_1-1 counts3D7_1-1     1 13251744    2.0632398
counts3D7_2-1 counts3D7_2-1     1 13809955    0.9912349
counts3D7_3-1 counts3D7_3-1     1 12328705    1.3071140
counts3D7_1-2 counts3D7_1-2     1 12605616    0.6537380
counts3D7_2-2 counts3D7_2-2     1 13392599    0.6290064

(with not all rows printed) so I think that the correct group information isn't there. Any help is greatly appreciated.

edgeR differential expression rna-seq • 3.7k views

ADD COMMENT • link updated 8.6 years ago by Gordon Smyth 52k • written 8.6 years ago by lynski008 • 0

score 4 · Answer 1 · 2016-09-05

4

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 9 hours ago

The city by the bay

To answer your immediate question, you need to specify the groupings in group:

DG$samples$group <- rep(c("group1", "group2", "group3", each=3))

... and then re-run estimateDisp and exactTest. The design matrix is not used in the classic method, so you can't expect the grouping information there to carry through to exactTest. Indeed, supplying design to estimateDisp will use the GLM approach for estimating the dispersions. While this is not wrong per se, it doesn't make much sense to use one approach for estimation and another for testing.

Now, at a more conceptual level, a simple experimental design doesn't preclude the use of the GLM approach in edgeR. I always use GLMs to analyze one-way layouts - and, in your case, since you've already set up the design matrix, why not go all the way? In fact, I dare say the QL-based methods (which use GLMs) would outperform the exact test as the former accounts for the uncertainty of the dispersion estimates and the latter does not. There's no trade-off in performance here; both the GLM-based approach and the exact test work well, but one works for a variety of designs and the other... doesn't.

ADD COMMENT • link 8.6 years ago Aaron Lun ★ 28k

0

Entering edit mode

Many thanks for your helpful suggestions. I am now calculating differential expression using the GLM approach. One further question: you say that supplying design to estimateDisp will use the GLM approach for estimating the dispersions. Does this mean that I don't need to go back and use qlmQLFit to estimate dispersions prior to calculating differential expression using glmQLFTest ?

ADD REPLY • link 8.6 years ago lynski008 • 0

1

Entering edit mode

You still need to use both of them. estimateDisp estimates the negative binomial dispersions, while glmQLFit estimates the quasi-likelihood dispersions. The book chapter has more details on this:

http://link.springer.com/protocol/10.1007/978-1-4939-3578-9_19

ADD REPLY • link 8.6 years ago Aaron Lun ★ 28k