Question

DESEQ2 linear model of dose

1

Entering edit mode

jshouse ▴ 10

@jshouse-10956

Last seen 23 months ago

United States

I have a question about a linear models in DESeq2.

The experimental design is far from optimal, but I wasn't asked about that part. The samples consist of 12 controls and 3 treatments with different doses of treatment.

coldataset:

                         Treatment Sample.ID DOSE               Group
002_S2_L001_R1_001   Isoproterenol        24  0.1 Isoproterenol_C_0.1
004_S4_L001_R1_001   Isoproterenol        24    1   Isoproterenol_C_1
006_S6_L001_R1_001   Isoproterenol        24   10  Isoproterenol_C_10
098_S98_L001_R1_001        VEHICLE         V    0         VEHICLE_C_0
100_S100_L001_R1_001       VEHICLE         V    0         VEHICLE_C_0
102_S102_L001_R1_001       VEHICLE         V    0         VEHICLE_C_0
110_S110_L001_R1_001       VEHICLE         V    0         VEHICLE_C_0
112_S112_L001_R1_001       VEHICLE         V    0         VEHICLE_C_0
114_S114_L001_R1_001       VEHICLE         V    0         VEHICLE_C_0
248_S248_L001_R1_001       VEHICLE         V    0         VEHICLE_C_0
250_S250_L001_R1_001       VEHICLE         V    0         VEHICLE_C_0
252_S252_L001_R1_001       VEHICLE         V    0         VEHICLE_C_0
260_S260_L001_R1_001       VEHICLE         V    0         VEHICLE_C_0
262_S262_L001_R1_001       VEHICLE         V    0         VEHICLE_C_0
264_S264_L001_R1_001       VEHICLE         V    0         VEHICLE_C_0

Normally, I run these experiments at lowest level (Group) and use contrast statements to examine max dose comparisons to their vehicle controls (among others). (design ~ Group)

Is it possible to fit a linear model on DOSE in DESeq2? To get a list of genes with adjusted p-values that have a linear relationship between counts and dose? The log(dose) vector would be c(rep(-2,12),-1,0,1).

Should I just take the normalized counts from DESeq2 and run linear models for normcounts=log(dose) for each gene?

Lastly, if I run a design~DOSE, and examine results(dds), what do the log2foldchange represent below? Are they not meaningful without contrasts statements?

> resultsNames(dds)
[1] "Intercept" "DOSE0"     "DOSE0.1"   "DOSE1"     "DOSE10"

    baseMean log2FoldChange     lfcSE      stat       pvalue        padj
1 816.064720      1.3116114 0.2635343  4.977004 6.457587e-07 0.001850744
2  74.737786     -1.0885462 0.2493632 -4.365304 1.269462e-05 0.018191391
3 184.634430     -0.8549204 0.2158798 -3.960168 7.489703e-05 0.064807331
4  53.630486      1.0040814 0.2564783  3.914878 9.044987e-05 0.064807331
5  40.440782      0.9380167 0.2972477  3.155674 1.601278e-03 0.415725274
6   7.345135      1.0853250 0.3465192  3.132078 1.735739e-03 0.415725274

deseq2 • 4.7k views

ADD COMMENT • link updated 8.2 years ago by Simon Anders ★ 3.8k • written 8.2 years ago by jshouse ▴ 10

0

Entering edit mode

Usually, the first line of the output of results tells you what the contrast was. You seem to have cut this off when copying and pasting. Can you fix this, please?

ADD REPLY • link 8.2 years ago Simon Anders ★ 3.8k

0

Entering edit mode

There was no contrast. This was just design ~ DOSE as listed above. Followed by results(dds).

ADD REPLY • link 8.2 years ago jshouse ▴ 10

0

Entering edit mode

Are you sure you pasted the complete beginning of the output of results? I still think there are a few lines missing -- or you have a very outdated version of DESeq2.

Usually, it looks like this:

> dds <- makeExampleDESeqDataSet()
> dds <- DESeq( dds )
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
> results(dds)
log2 fold change (MAP): condition B vs A 
Wald test p-value: condition B vs A 
DataFrame with 1000 rows and 6 columns
           baseMean log2FoldChange     lfcSE        stat     pvalue      padj
          <numeric>      <numeric> <numeric>   <numeric>  <numeric> <numeric>
gene1    11.5279202    -0.43863697 0.3334728 -1.31536065  0.1883887 0.9223485
gene2     0.5509324    -0.02871079 0.1698456 -0.16904051  0.8657648 0.9967501
gene3     2.4920175    -0.08257666 0.3125119 -0.26423523  0.7915987 0.9967501
[...]

ADD REPLY • link 8.2 years ago Simon Anders ★ 3.8k

0

Entering edit mode

Thank you Simon! I see my mistake. I had used head(arrange(testdf,padj)) from testdf<-as.data.frame(results(dds)). When I just did results(dds) it showed me the answer.

> results(dds)
log2 fold change (MAP): DOSE 10 vs 0 
Wald test p-value: DOSE 10 vs 0 
DataFrame with 2982 rows and 6 columns
                  baseMean log2FoldChange     lfcSE        stat    pvalue      padj
                 <numeric>      <numeric> <numeric>   <numeric> <numeric> <numeric>
A2M_12371        33.547083    -0.44207838 0.3372159  -1.3109652 0.1898695         1
AARS_3          241.719869    -0.03126903 0.3102425  -0.1007890 0.9197180         1
AASDHPPT_21936  134.906874     0.07619819 0.2842790   0.2680401 0.7886684         1
AATF_26432       32.640388     0.20345819 0.3409460   0.5967461 0.5506769         1
ABCA6_22690       0.300474    -0.05016478 0.1264196  -0.3968117 0.6915063         1

ADD REPLY • link 8.2 years ago jshouse ▴ 10

Michael Love · Answer 1 · 2016-10-17

0

Entering edit mode

Simon Anders ★ 3.8k

@simon-anders-3855

Last seen 4.4 years ago

Zentrum für Molekularbiologie, Universi…

Now, that we have solved the contrast issue, let's get to your actual question:

Is it possible to fit a linear model on DOSE in DESeq2? To get a list of genes with adjusted p-values that have a linear relationship between counts and dose? The log(dose) vector would be c(rep(-2,12),-1,0,1).

Yes, of course. Let's say, you have a numeric column in your sample table called. say, logdose, then you simply use the design

~ logdose

and DESeq will perform a linear regression of log2 gene expression on log dose. The results table will then contain the regression coeffcient in the table log2FoldChange. That the column is called "log2FoldChange" is then, of course, a bit a misnomer; it is the slope of the regression line, i.e., change in expression divided by change in the numeric predictor. (The expression is measured in units of log2 normalized counts, the predictor in whatever unit you use; in your case log10 of the dose.).

The log(dose) vector would be c(rep(-2,12),-1,0,1).

Well, whether -2 is really the best value to use instead of log(0)=-inf is of course not clear a priori, but it might in fact be good enough for a first try.

ADD COMMENT • link 8.2 years ago Simon Anders ★ 3.8k

0

Entering edit mode

Thank you Simon! I will model this.

ADD REPLY • link 8.2 years ago jshouse ▴ 10

0

Entering edit mode

Simon,

I was rearranging the code when I got stuck. The data are arranged as posted in the first box of the original question. When design ~ Group, this worked fine as I was using contrasts to pull out Treatment from Vehicle comparisons. The dataset actually has 73 different treatments structured this way and I would of course, want to normalize all concurrently. How do I do that with a design ~ logdose as we discussed? That will compare the logdose of features (genes) summed across all treatments (I think) which I do not want.

A use of design ~ Treatment + logdose will fail because each treatment/logdose combo consists of a chemical plus its vehicle controls which I do want to model as a "logdose".

I was going to split it into 73 runs, but then I don't get the experiment wide normalization. To further complicate things, not every chemical has the same vehicle controls. There are 4 subplates with a varying number of treatments assigned to the same 12 vehicle controls. Thus the "C" you see in Group. That means that Vehicle C replicates get assigned to Isoproterenol, but they will also get assigned to another XX chemicals that were on that plate.

ADD REPLY • link 8.2 years ago jshouse ▴ 10

0

Entering edit mode

So, Treatment is a drug here, and you have several drugs? Then you need to write

~ treatment * logdose

because you want to get a regression slope for each treatment. With a '+', you get an intercept per treatment, but only one slope common for all drugs.

Also, you want to get not one intercept per drug, but per drug group (where, by "drug group" I mean a group of drugs sharing the same vehicle control). So, from the top of my head, it would be

~ group + treatment * logdose

where you should add some constant to logdoes beforehands, such that control samples have logdose 0.

ADD REPLY • link 8.2 years ago Simon Anders ★ 3.8k

0

Entering edit mode

Simon, I am sorry that this is taking so much of your time, and I appreciate you sharing it. Using ~ Treatment * logdose returns a full rank error. I'll try to explain more clearly the experimental design by pasting example below. These two treatments share the same 12 vehicle controls. On purpose, I have the Vehicle dose set to be a log order of magnitude that is the next one down from the lowest treatment dose. I understand this is up for debate in the dose response curve fitting world, and it is what I use at this point with tcpl and DRM.

The "B" actually uniquely identifies which control is associated with which chemicals. Thus Group is unique for a given dose/chemical. There are actually 4 different controls that each match to a subset of chemicals. Thus the ~ Group works with a contrast statement later that compares Xylitol at 10mM to its Vehicle B.

The linear fit I am trying to achieve would be to normalize the entire experiment and then get linear fits (singly) of xylitol with its control (B), vernacalant with its control (B) and isoproteronol (first post) with its control (C) using the numeric variable logdose.

                       Treatment DOSE Index.Set logdose                               Group
1                        Xylitol  0.1         B      -1                    Xylitol_0.1_B
2                        Xylitol    1         B       0                      Xylitol_1_B
3                        Xylitol   10         B       1                     Xylitol_10_B
4                    vernacalant  0.1         B      -1                vernacalant_0.1_B
5                    vernacalant    1         B       0                  vernacalant_1_B
6                    vernacalant   10         B       1                 vernacalant_10_B
7                        VEHICLE 0.01         B      -2                   VEHICLE_0.01_B
8                        VEHICLE 0.01         B      -2                   VEHICLE_0.01_B
9                        VEHICLE 0.01         B      -2                   VEHICLE_0.01_B
  9 more vehicle rows....

ADD REPLY • link updated 8.1 years ago by Michael Love 43k • written 8.2 years ago by jshouse ▴ 10

0

Entering edit mode

Is the full rank error because there exists only 1 dose per non-vehicle treatment?

ADD REPLY • link 8.1 years ago jshouse ▴ 10

0

Entering edit mode

Sorry, I've overlooked your second question. Do you still need help, or have you solved it already?

If not, can you post an R command to create your full model frame /column data (e.g. using "dump")? This would help me try myself to create a valid model matrix.

ADD REPLY • link 8.1 years ago Simon Anders ★ 3.8k

0

Entering edit mode

I'd recommend you use model.matrix to explore how the design matrices will look, outside of DESeq2 (whenever you have interaction terms, as here, you will get the same design matrix within DESeq2 as using model.matrix).

For example, the following coldata and formula produces a term for the baseline of each treatment, and a slope for each treatment:

> coldata <- data.frame(treatment=factor(rep(1:2,each=4)),
                        dose=c(0:3,0:3))
> coldata
  treatment dose
1         1    0
2         1    1
3         1    2
4         1    3
5         2    0
6         2    1
7         2    2
8         2    3
> model.matrix(~0 + treatment + treatment:dose, coldata)
  treatment1 treatment2 treatment1:dose treatment2:dose
1          1          0               0               0
2          1          0               1               0
3          1          0               2               0
4          1          0               3               0
5          0          1               0               0
6          0          1               0               1
7          0          1               0               2
8          0          1               0               3

Maybe we can work from simple examples like this, to the kind of coefficients you want to model.

ADD REPLY • link 8.1 years ago Michael Love 43k

0

Entering edit mode

Thanks Michael and Simon.

My data are structured as follows, where A and B share the zero dose from V (vehicle).

> coldata <- data.frame(treatment=factor(rep(c("A","B","V"),each=3)),
+                         dose=c(1:3,1:3,0,0,0))
> coldata
  treatment dose
1         A    1
2         A    2
3         A    3
4         B    1
5         B    2
6         B    3
7         V    0
8         V    0
9         V    0

ADD REPLY • link 8.1 years ago jshouse ▴ 10

0

Entering edit mode

If you code like so:

coldata <- data.frame(treatment=factor(rep(c("A","B","A"),each=3)),dose=c(1:3,1:3,0,0,0))

You would get:

> model.matrix(~treatment:dose,coldata)
  (Intercept) treatmentA:dose treatmentB:dose
1           1               1               0
2           1               2               0
3           1               3               0
4           1               0               1
5           1               0               2
6           1               0               3
7           1               0               0
8           1               0               0
9           1               0               0

Does that look correct?

ADD REPLY • link 8.1 years ago Michael Love 43k

0

Entering edit mode

For treatment A, wont that design assign B treatment values to a zero dose? Likewise for B with regard to A.

ADD REPLY • link 8.1 years ago jshouse ▴ 10

0

Entering edit mode

Can you rephrase your concern in terms of the coefficients that will be estimated: Intercept, treatmentA:dose and treatmentB:dose? That is, what is your specific concern regarding these estimated coefficients?

ADD REPLY • link 8.1 years ago Michael Love 43k

0

Entering edit mode

My concern is the coefficient for TreatmentB:dose will use count data from treatmentA: doses 1,2,3. Or to put another way, the model as written will model 6 doses of (0) with 1 dose each of 1,2,3 for both TreatmentA:dose and TreatmentB:dose. This may be ignorance on my part regarding the model matrix notation.

ADD REPLY • link 8.1 years ago jshouse ▴ 10

0

Entering edit mode

It's true that the samples will all affect each other with this design (e.g. observations 1-3 affect the calculation of the intercept, and also the slope associated with treatmentB). But this seems to me to be a natural consequence of having all the treatments share the same control samples (zero dose).

ADD REPLY • link 8.1 years ago Michael Love 43k