Question

EdgeR, between/within subject comparisons

0

Entering edit mode

tassos.dam ▴ 10

@tassosdam-11612

Last seen 2.5 years ago

Sweden

Hi All

The short version of the question.

I have a study that is basically the same as in section 3.5 page 40 in the EdgeR manual. There it shows how you can extract the differential genes in various comparisons. One comparison that escapes me is what if I would like to know if there is a difference in gene expression between non treated samples in the healthy versus disease patients. Basically, "DiseaseDisease1:TreatmentNone"-"DiseaseHealthy:TreatmentNone". How would I be able to extract that contrast from the design proposed in the manual?

The longer version.

In my study all the patients have the disease, just for a different length of time (long vs short). And I have samples from control tissues and disease tissue from the same patient. Basically it looks like this.

pat<-gl(10,2)
time<-gl(2,10, labels=c("short","long"))
tissue<-gl(2,1,length=20,labels=c("ctrl","dis"))

data.frame(time, pat, tissue)

   time pat tissue
1  short   1   ctrl
2  short   1    dis
3  short   2   ctrl
4  short   2    dis
5  short   3   ctrl
6  short   3    dis
7  short   4   ctrl
8  short   4    dis
9  short   5   ctrl
10 short   5    dis
11  long   6   ctrl
12  long   6    dis
13  long   7   ctrl
14  long   7    dis
15  long   8   ctrl
16  long   8    dis
17  long   9   ctrl
18  long   9    dis
19  long  10   ctrl
20  long  10    dis

Renumber the patients within the time groups as per section 3.5 in the manual.

patb<-gl(5,2,length=20)

data.frame(time, patb, tissue)

    time patb tissue
1  short    1   ctrl
2  short    1    dis
3  short    2   ctrl
4  short    2    dis
5  short    3   ctrl
6  short    3    dis
7  short    4   ctrl
8  short    4    dis
9  short    5   ctrl
10 short    5    dis
11  long    1   ctrl
12  long    1    dis
13  long    2   ctrl
14  long    2    dis
15  long    3   ctrl
16  long    3    dis
17  long    4   ctrl
18  long    4    dis
19  long    5   ctrl
20  long    5    dis

design<-model.matrix(~time+time:tissue+time:patb)
colnames(design)

 [1] "(Intercept)"         "timelong"            "timeshort:tissuedis" "timelong:tissuedis"  "timeshort:patb2"     "timelong:patb2"      "timeshort:patb3"    
 [8] "timelong:patb3"      "timeshort:patb4"     "timelong:patb4"      "timeshort:patb5"     "timelong:patb5"

So if I would like to see if there is difference in gene expression between patients that have had the disease long time vs short time I could use coef="timelong" or contrast=c(0,1,0,0,0,0,0,0,0,0,0,0).

For differences in the diseased tissue in short time vs long time patients I could use contrast=c(0,0,1,-1,0,0,0,0,0,0,0,0) .

But what If I would like to see if there is differential expression in the control tissue between the short/long time patients? Would contrast=c(0,1,-1,-1,0,0,0,0,0,0,0,0) work?

Also, if I would like to make the comparison "diseased tissue" - "control tissue", would it be correct to use a contrast like c(0,0,-1,-1,0,0,0,0,0,0,0,0)

Cheers

edger • 1.0k views

ADD COMMENT • link updated 8.3 years ago by Aaron Lun ★ 28k • written 8.3 years ago by tassos.dam ▴ 10

score 3 · Accepted Answer · 2016-10-06

With these designs, you need to sit down and have a think about what the coefficients actually mean:

The intercept represents the (fitted) log-expression in the short control sample of patient 1.
timelong represents the (fitted) log-fold change in long control over short control of patient 1 (well, after renaming; it's actually the log-FC of the control of patient 6 over the control of patient 1).
timeshort:tissuedis represents the log-fold change in diseased short over control short.
timelong:tissuedis represents the log-fold change in diseased long over control long.
The remaining terms are patient-specific blocking factors, representing the average log-fold change in expression for each other patient over the first patient. Generally uninteresting.

Do not be deceived by timelong, it does not represent the average log-fold change of long over short samples across patients. Such information is used to estimate the patient-specific blocking factors and is no longer available for computing the DE log-fold change. In fact, the only relevant long/short comparison you could do is:

con <- c(0,0,1,-1,0,0,0,0,0,0,0,0)

... which tests for differences in the diseased/control log-fold change between long and short patients. This allows you to determine whether the effect of disease on transcription is itself affected by the disease duration. Obviously, you could also test each of timeshort:tissuedis and timelong:tissuedis separately, in order to examine the specific disease effect in either long or short patients. Or you can test their average, to determine if there is any overall disease effect:

con <- c(0,0,1/2,1/2,0,0,0,0,0,0,0,0) # Halve to get average log-FC.

If you want to directly compare between groups, you'll need to subset your data and run edgeR on that subset. In your case, you have 4 groups, i.e., diseased + long, diseased + short, control + long, and control + short. If you want to compare between long and short groups, you need to subset the data to only include samples in those two groups, and then fit a one-way layout to the subsetted data. This removes the other sample for each patient, thus avoiding problems due to correlations between samples from the same patient.