EdgeR, between/within subject comparisons
1
0
Entering edit mode
tassos.dam ▴ 10
@tassosdam-11612
Last seen 2.5 years ago
Sweden

Hi All

The short version of the question.

I have a study that is basically the same as in section 3.5 page 40 in the EdgeR manual. There it shows how you can extract the differential genes in various comparisons. One comparison that escapes me is what if I would like to know if there is a difference in gene expression between non treated samples in the healthy versus disease patients. Basically, "DiseaseDisease1:TreatmentNone"-"DiseaseHealthy:TreatmentNone". How would I be able to extract that contrast from the design proposed in the manual?

 

The longer version.

In my study all the patients have the disease, just for a different length of time (long vs short). And I have samples from control tissues and disease tissue from the same patient. Basically it looks like this.

 

pat<-gl(10,2)
time<-gl(2,10, labels=c("short","long"))
tissue<-gl(2,1,length=20,labels=c("ctrl","dis"))

data.frame(time, pat, tissue)

   time pat tissue
1  short   1   ctrl
2  short   1    dis
3  short   2   ctrl
4  short   2    dis
5  short   3   ctrl
6  short   3    dis
7  short   4   ctrl
8  short   4    dis
9  short   5   ctrl
10 short   5    dis
11  long   6   ctrl
12  long   6    dis
13  long   7   ctrl
14  long   7    dis
15  long   8   ctrl
16  long   8    dis
17  long   9   ctrl
18  long   9    dis
19  long  10   ctrl
20  long  10    dis


Renumber the patients within the time groups as per section 3.5 in the manual.

patb<-gl(5,2,length=20)

data.frame(time, patb, tissue)

    time patb tissue
1  short    1   ctrl
2  short    1    dis
3  short    2   ctrl
4  short    2    dis
5  short    3   ctrl
6  short    3    dis
7  short    4   ctrl
8  short    4    dis
9  short    5   ctrl
10 short    5    dis
11  long    1   ctrl
12  long    1    dis
13  long    2   ctrl
14  long    2    dis
15  long    3   ctrl
16  long    3    dis
17  long    4   ctrl
18  long    4    dis
19  long    5   ctrl
20  long    5    dis

 

design<-model.matrix(~time+time:tissue+time:patb)
colnames(design)
 [1] "(Intercept)"         "timelong"            "timeshort:tissuedis" "timelong:tissuedis"  "timeshort:patb2"     "timelong:patb2"      "timeshort:patb3"    
 [8] "timelong:patb3"      "timeshort:patb4"     "timelong:patb4"      "timeshort:patb5"     "timelong:patb5"     

 

So if I would like to see if there is difference in gene expression between patients that have had the disease long time vs short time I could use coef="timelong" or contrast=c(0,1,0,0,0,0,0,0,0,0,0,0).

For differences in the diseased tissue in short time vs long time patients I could use contrast=c(0,0,1,-1,0,0,0,0,0,0,0,0) .

But what If I would like to see if there is differential expression in the control tissue between the short/long time patients? Would contrast=c(0,1,-1,-1,0,0,0,0,0,0,0,0) work?

Also, if I would like to make the comparison "diseased tissue" - "control tissue", would it be correct to use a contrast like c(0,0,-1,-1,0,0,0,0,0,0,0,0)

 

Cheers

edger • 1.0k views
ADD COMMENT
3
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 6 hours ago
The city by the bay

With these designs, you need to sit down and have a think about what the coefficients actually mean:

  • The intercept represents the (fitted) log-expression in the short control sample of patient 1.
  • timelong represents the (fitted) log-fold change in long control over short control of patient 1 (well, after renaming; it's actually the log-FC of the control of patient 6 over the control of patient 1).
  • timeshort:tissuedis represents the log-fold change in diseased short over control short.
  • timelong:tissuedis represents the log-fold change in diseased long over control long.
  • The remaining terms are patient-specific blocking factors, representing the average log-fold change in expression for each other patient over the first patient. Generally uninteresting.

Do not be deceived by timelong, it does not represent the average log-fold change of long over short samples across patients. Such information is used to estimate the patient-specific blocking factors and is no longer available for computing the DE log-fold change. In fact, the only relevant long/short comparison you could do is:

con <- c(0,0,1,-1,0,0,0,0,0,0,0,0)

... which tests for differences in the diseased/control log-fold change between long and short patients. This allows you to determine whether the effect of disease on transcription is itself affected by the disease duration. Obviously, you could also test each of timeshort:tissuedis and timelong:tissuedis separately, in order to examine the specific disease effect in either long or short patients. Or you can test their average, to determine if there is any overall disease effect:

con <- c(0,0,1/2,1/2,0,0,0,0,0,0,0,0) # Halve to get average log-FC.

If you want to directly compare between groups, you'll need to subset your data and run edgeR on that subset. In your case, you have 4 groups, i.e., diseased + long, diseased + short, control + long, and control + short. If you want to compare between long and short groups, you need to subset the data to only include samples in those two groups, and then fit a one-way layout to the subsetted data. This removes the other sample for each patient, thus avoiding problems due to correlations between samples from the same patient.

ADD COMMENT
0
Entering edit mode

Hi Aaron

Thanks a lot for taking the time and for your very helpful answer. I have been avoiding these kind of interaction designs and usually break down everything to smaller subgroups, I have easier time wrapping my head around those. So indeed,  subsetting the data was my first thing to do  but thought to explore if there are other possibilities.

If you would have the time, one followup question.

Instead of subsetting the data could one make a design like:

timeTissue<-paste(time,tissue, sep="_")
design2<-model.matrix(~0+timeTissue)
colnames(design2)<-gsub("timeTissue","", colnames(design2))

and then to get the difference in the control tissue for long vs short time patients use 

constrast=c(-1,0,1,0)

while for the long_dis vs short_dis 

constrast=c(0,-1,0,1)

 

Cheers

 

ADD REPLY
0
Entering edit mode

You still need to subset, otherwise you won't be account for the correlations between samples from the same patient. Your design only works if you were using limma with duplicateCorrelation, which models these correlations explicitly. With edgeR, subsetting keeps only one sample per patient so you don't need to worry about these correlations. 

ADD REPLY
0
Entering edit mode

OK, great, thanks. I have been using limma with dublicateCorrelation previously in the way you mention so my design thinking is still influenced by that.

Cheers

ADD REPLY

Login before adding your answer.

Traffic: 845 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6