Controlling numerical covariates: differential gene expression with limma voom
1
0
Entering edit mode
Onyi Ukay • 0
@onyi-ukay-17718
Last seen 5.0 years ago

Here is a snapshot of a data.frame, df

Subject | Cell_type | Condition | Viral_load
--------|-----------|-----------|------------
1       | A         | normal    | 10
1       | B         | normal    | 180
2       | A         | diseased  | 500 
2       | B         | diseased  | 90
3       | A         | normal    | 720
3       | B         | normal    | 63

I would like to find differentially expressed genes in the following comparisons:
      i. A.normal vs A.diseased
      ii. B.normal vs B.diseased
      iii. A.normal vs B.normal
      iv. A.diseased vs B.diseased

What is the right way to incorporate into my analysis an interactive design formula for Cell_type and Condition, while blocking for the Subject and controlling for Viral_load covariate?


My idea is to add a supplementary column to df called Group,

Subject | Cell_type | Condition | Viral_load | Group
--------|-----------|-----------|------------|-------
1       | A         | normal    | 10         | A.normal 
1       | B         | normal    | 180        | B.normal
2       | A         | diseased  | 500        | A.diseased
2       | B         | diseased  | 90         | B.diseased
3       | A         | normal    | 720        | A.normal
3       | B         | normal    | 63         | B.normal

Then run the following analysis:

design <- model.matrix(~ 0 + Group + Viral_load, data = df)
colnames(design) <- c(levels(df$Group), "Viral_load")
v <- voom(TMM, design = design)
dupcor <- duplicateCorrelation(v, design = v$design, block = df$Subject)
fit <- lmFit(v, block = df$Subject, correlation = dupcor$consensus.correlation)
cm <- makeContrasts(A.normal_vs_diseased = A.normal   - A.diseased,
                    B.normal_vs_diseased = B.normal   - B.diseased,
                    normal.A_vs_B        = A.normal   - B.normal,
                    diseased.A_vs_B      = A.diseased - B.diseased,
                    levels = v$design)
fit <- contrasts.fit(fit, cm)
fit <- eBayes(fit)

Have I accurately controlled for Viral_load as a covariate, blocked by Subject and extracted the appropriate contrasts?

limma voom differential gene expression rnaseq R • 2.5k views
ADD COMMENT
0
Entering edit mode

Hi, I would really appreciate it if you upvote my answer if you find it helpful :)

ADD REPLY
3
Entering edit mode
@mikhaelmanurung-17423
Last seen 2.6 years ago
Netherlands

Hi Onyi,

Your design and contrast matrices are correct.

I am not sure about your variable naming convention, but please make sure the TMM input variable to your voom command is the expression set.

Last but not least, it is now recommended to do the duplicateCorrelation calculation twice each. Also keep in mind to include the block to your voom call! Here is an example:

v <- voom(TMM, design = design, block=df$Subject)
dupcor <- duplicateCorrelation(v, design = v$design, block = df$Subject)
v <- voom(TMM, design = design, correlation=dupcor$consensus.correlation, block=df$Subject)
dupcor <- duplicateCorrelation(v, design = v$design, block = df$Subject)

v <- voom(TMM, design = design, correlation=dupcor$consensus.correlation, block=df$Subject)

Further discussion on this can be seen at using duplicateCorrelation with limma+voom for RNA-seq data

I hope this answers your question.

Best, Mikhael

ADD COMMENT
0
Entering edit mode

Thank you for your answer Mikhael. Could a limma developer, or someone with a higher reputation please affirm Mikhael's answer?

ADD REPLY
0
Entering edit mode

I'm not technically a limma developer, but I hope my reputation is high enough for you.

As Mikhael said, your design matrices and contrasts are correct. I will just note that edgeR uses log-link GLMs, which means that your model assumes that log-expression scales linearly with viral load (i.e., raw expression increases exponentially with viral load). Perhaps a more reasonable relationship for most genes would be that log-expression scales linearly with log-viral load. Or you could be even more general and use a spline on the (log-)viral load, allowing you to block out any smooth relationship with viral load. All of this assumes, of course, that the viral load is not confounded with your conditions of interest, which will cause problems of varying levels to the different blocking strategies mentioned above.

The other thing I should point out is that the TMM should typically be a DGEList that has been run through calcNormFactors. If you give voom a raw matrix of counts, it won't do any normalization by default (aside from library size normalization when the log-CPMs are computed). You can set normalize.method but the available choices tend to be unnecessarily aggressive, being designed to deal with the horrors of microarrays.

ADD REPLY
0
Entering edit mode

Thanks for affirming Mikhael's answer Aaron.

TMM is indeed a DGEList, defined as:

dge  <- DGEList(counts=counts)
keep <- filterByExpr(dge, design, min.count=5) 
dge  <- dge[keep,,keep.lib.sizes=FALSE]
TMM  <- calcNormFactors(dge)

Thank you for also pointing out a different problem that I could have. On the suggested more general case, what exactly do you mean by "use a spline on the log-viral_load?" Does using a spline take care of the unknown relationship between log-cpm values and log-viral_load, or does it simply "block out any smooth relationship with viral load" [I assume you mean linear relationship here]?

 
ADD REPLY
0
Entering edit mode

No, I meant any smooth relationship (depending on the number of d.f.). See section 9.6.2 of the limma user's guide.

ADD REPLY

Login before adding your answer.

Traffic: 553 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6