Question

Building contrasts for combined treatment groups to compare to a control

0

Entering edit mode

mart1139 • 0

@mart1139-14954

Last seen 5.6 years ago

Purdue University

Hello, my name is Alex and I am currently a graduate student at Purdue University. I am using edgeR to conduct differential gene expression for a project wherein I'd like to compare 3 treatment groups [control (c), treatment-low (tl), and treatment-high (th)] across 3 population groups (lm, lc, and ct) and I have a question regarding forming contrasts. I want to make two general types of comparisons:

compare all treated individuals (i.e., combined th + tl) to all control individuals across populations
compare treated individuals (again, combined th+tl) to control individuals within each population

Using the edgeR User Guide section 3.3 as a guide, I combined population+treatment into a single factor and created the following design matrix:

design <- model.matrix(~0+group)

ct_c ct_th ct_tl lc_c lc_th lc_tl lm_c lm_th lm_tl
ct_c_1     1     0     0    0     0     0    0     0     0
ct_c_2     1     0     0    0     0     0    0     0     0
ct_c_3     1     0     0    0     0     0    0     0     0
ct_th_1    0     1     0    0     0     0    0     0     0
ct_th_2    0     1     0    0     0     0    0     0     0
ct_th_3    0     1     0    0     0     0    0     0     0
ct_tl_1    0     0     1    0     0     0    0     0     0
ct_tl_2    0     0     1    0     0     0    0     0     0
ct_tl_3    0     0     1    0     0     0    0     0     0
lc_c_1     0     0     0    1     0     0    0     0     0
lc_c_2     0     0     0    1     0     0    0     0     0
lc_c_3     0     0     0    1     0     0    0     0     0
lc_th_1    0     0     0    0     1     0    0     0     0
lc_th_2    0     0     0    0     1     0    0     0     0
lc_th_3    0     0     0    0     1     0    0     0     0
lc_tl_1    0     0     0    0     0     1    0     0     0
lc_tl_2    0     0     0    0     0     1    0     0     0
lc_tl_3    0     0     0    0     0     1    0     0     0
lm_c_1     0     0     0    0     0     0    1     0     0
lm_c_2     0     0     0    0     0     0    1     0     0
lm_c_3     0     0     0    0     0     0    1     0     0
lm_c_4     0     0     0    0     0     0    1     0     0
lm_c_5     0     0     0    0     0     0    1     0     0
lm_th_1    0     0     0    0     0     0    0     1     0
lm_th_2    0     0     0    0     0     0    0     1     0
lm_tl_1    0     0     0    0     0     0    0     0     1
lm_tl_2    0     0     0    0     0     0    0     0     1
lm_tl_3    0     0     0    0     0     0    0     0     1
lm_tl_4    0     0     0    0     0     0    0     0     1
lm_tl_5    0     0     0    0     0     0    0     0     1

With the above design matrix I created the following contrasts to make the comparisons outlined above:

1. allpops_contrast <- c(-1/2,1/4,1/4,-1/2,1/4,1/4,-1/2,1/4,1/4)

2. my.contrasts <- makeContrasts(lm_cvt=(lm_th+lm_tl)/2-lm_c, lc_cvt=(lc_th+lc_tl)/2-lc_c, ct_cvt=(ct_th+ct_tl)/2-ct_c, levels=design)

I then used edgeR's GLM functionality to carry out the comparisons. My question is whether these contrasts are comparing the groups I intend for them to compare based on my objectives listed above. The results for the all population comparison yield a large number of DEGs (> 4,000), whereas each individual population comparison yields a much more modest number (<80 in any population comparison). Am I missing something when it comes to forming contrasts?

Thanks in advance for any insight!

Alex

edger design and contrast matrix • 1.6k views

ADD COMMENT • link updated 7.0 years ago by Aaron Lun ★ 28k • written 7.0 years ago by mart1139 • 0

score 3 · Accepted Answer · 2018-05-10

For your first hypothesis; your allpops_contrast doesn't look right to me. It should look something like this:

allpops_contrast <- c(-1/3,1/6,1/6,-1/3,1/6,1/6,-1/3,1/6,1/6)

... because you want to know whether the grand average of the three control groups is different from the grand average of the six treatment groups. I would recommend being explicit about it and writing it out:

con <- makeContrasts(
    (lm_th + lm_tl + lc_th + lc_tl + ct_th + ct_tl)/6
    - (lm_c + lc_c + ct_c)/3, levels=design)

This will ensure that your code is robust to reordering of the columns in design.

For your second set of hypotheses; my.contrasts looks fine.

While the aforementioned fix to allpops_contrast should now yield correct log-fold changes, it shouldn't change the number of DE genes, unless you're using TREAT to test against a log-fold change threshold. Assuming you ran glmQLFTest correctly, the discrepancy in the number of DE genes is most likely caused by the fact that you have a lot more power when you're using information across 30 libraries (in the first contrast) compared to 9-12 libraries (in each of the population-specific contrasts).

To convince yourself, you can do a little thought experiment based on the standard errors of the log-fold changes (we'll use a limma-like linear model for convenience). The sample mean of each condition has a variance of v/n where v is the sample variance and n is the number of replicates. This means that the variance of the log-fold change in the first contrast would be:

(v/3 + v/3 + v/5)/3^2 + (v/3 + v/3 + v/2 + v/3 + v/3 + v/5)/6^2 = v * 0.152

... remembering that any scaling of a random variable has a squared effect on the variance (during calculation of the grand mean of each treated/control group). By comparison, the variance of the log-fold change in the ct-specific contrast is much larger:

(v/3 + v/3)/2^2 + v/3 = v * 0.5

... which results in reduced power to detect population-specific DE genes.

P.S. Keep in mind that when you supply a contrast matrix to the various GLM testing functions in edgeR, they will perform ANODEVs where the global null hypothesis (formed from the individual null hypotheses corresponding to each column) is tested. If you want to test the null hypothesis for each population, make sure you only supply the corresponding column to contrasts=.