Question

DGE with a Continuous Variable

0

Entering edit mode

picasa1983 • 0

@picasa1983-14806

Last seen 24 days ago

USA

Hi everyone,

I'm working on a differential gene expression (DGE) analysis using DESeq2, with the goal of identifying genes that are differentially expressed according to age (which is a numerical value in info).

Here is the code I have so far:

    dds <- DESeqDataSetFromMatrix(countData = data, colData = info, design = ~ Age)

    keep <- rowSums(counts(dds) >= 10) >= smallestGroupSize
    dds <- dds[keep,]

    dds <- DESeq(dds)

    res <- results(dds, alpha = 0.05)

1) Is it necessary to perform log fold change shrinkage (lfcShrink)?

    res.ape <- lfcShrink(dds, coef = "Age", type = "apeglm", res = res)

2) I found an interesting gene with the following statistics:

               baseMean log2FoldChange     lfcSE      stat      pvalue        padj
               <numeric>      <numeric> <numeric> <numeric>   <numeric>   <numeric>
Gene_X      6273.312       0.260054 0.0441626   5.88856 3.89567e-09 0.000079569

How can I plot the expression of this gene for all samples and draw a linear regression line using the intercept and slope computed by DESeq2?

Should I use the counts from:

vsd <- vst(dds, blind=TRUE)
assay(vsd)

?

Thank you!

DESeq2 model • 1.4k views

ADD COMMENT • link updated 8 months ago by Michael Love 43k • written 9 months ago by picasa1983 • 0

score 1 · Answer 1 · 2024-07-16

1

Entering edit mode

Michael Love 43k

@mikelove

Last seen 1 day ago

United States

1) you can, this is useful for ranking. See the apeglm paper.

2) "draw a linear regression line using the intercept and slope computed by DESeq2"

You can use normTransform(), which is log2 scaled counts. This is close enough to what you get from the coefficients in coef(dds), however the glm takes into account count variance, while simple regression on log counts does not.

ADD COMMENT • link 9 months ago Michael Love 43k

0

Entering edit mode

Thank you for your quick support. I created two figures using the following code:

The counts from a particular gene were taken from assay(normTransform(dds))

1) Using coef(dds)

intercept <- coef(dds)[gene_to_check, "Intercept"]
slope <- coef(dds)[gene_to_check, "Age"]

ggplot(plot_data, aes(x=Age, y=GeneExpression)) +
  geom_point() + 
  geom_abline(intercept = intercept, slope = slope, color = "red")

deseq2

2) Using lm

lm_fit <- lm(expression ~ dds@colData$Age)

ggplot(plot_data, aes(x=Age, y=GeneExpression)) +
  geom_point() +  # Add points
  geom_abline(intercept = lm_fit$coefficients[1], slope = lm_fit$coefficients[2], color='turquoise')

How can these be very different, and which one makes more sense?

Thanks

ADD REPLY • link 9 months ago picasa1983 • 0

1

Entering edit mode

How can these be very different, and which one makes more sense?

I directly answered that above in my post, did you see?

ADD REPLY • link 9 months ago Michael Love 43k

0

Entering edit mode

I think I read it too quickly. Thanks :)

ADD REPLY • link 9 months ago picasa1983 • 0

0

Entering edit mode

Is there a way to get the R-squared R^2 from the model ?

ADD REPLY • link 8 months ago picasa1983 • 0

0

Entering edit mode

You could compute R2 from the predicted values, using the fitted coefficients, correlated to the observed values, but as with the regression line, a GLM/likelihood approach takes into account count variance while simple regression on log counts does not.

ADD REPLY • link 8 months ago Michael Love 43k

0

Entering edit mode

I read in a Biostars thread that the fitted coefficient is in assays(dds)[["mu"]], but I'm not sure if this is correct. Would this be the correct way to get the R2 ?

fitted_values <- assays(dds)[["mu"]][gene_id, ]
observed_counts <- counts(dds, normalized = TRUE)[gene_id, ]
r_squared <- cor(fitted_values, observed_counts)^2

ADD REPLY • link 8 months ago picasa1983 • 0

1

Entering edit mode

The fitted counts in mu will work but they contain size factor scaling. So you can divide those by sizeFactors(dds).

mu <- assays(dds)[["mu"]]
t( t(mu) / sizeFactors(dds) ) )

ADD REPLY • link 8 months ago Michael Love 43k

0

Entering edit mode

Hi Michael, using an other dataset, I got this strange message:

the design formula contains one or more numeric variables with integer values, specifying a model with increasing fold change for higher values. did you mean for this to be a factor? if so, first convert this variable to a factor using the factor() function the design formula contains one or more numeric variables that have mean or standard deviation larger than 5 (an arbitrary threshold to trigger this message). Including numeric variables with large mean can induce collinearity with the intercept. Users should center and scale numeric variables in the design to improve GLM convergence.

In order to solve it, I did:

meta$scaled_Age <- scale(meta$Age, center = TRUE, scale = TRUE)
dds <- DESeqDataSetFromMatrix(data, colData = meta, design = ~ scaled_Age)

Then classical following steps.

Is it the right way to do ?

ADD REPLY • link 8 months ago picasa1983 • 0

0

Entering edit mode

That's not strange to me :)

Scaling the numerical variables is useful, and the other message is just making sure users know how the variables will be modeled, so it isn't ambiguous.

ADD REPLY • link 8 months ago Michael Love 43k

score 1 · Answer 2 · 2024-07-16

I had this code just hanging around, you can try to tweak it to make it fit your data. Ignore all the quotes and the new lines; its a template for making R code that runs within an R markdown document, that's why it needs those.

  "expression <- log(counts(dds, normalized = TRUE)[gene_name,] + 0.001,2)\n",


  "lm_fit <- lm(expression ~ dds@colData$Day)\n",


  "d<- plotCounts(dds, gene=gene_name, intgroup=c('Donor', 'Day'), returnData = TRUE)\n",

  "p <- ggplot(d, aes(x=Day, y=log(count,2), color = Donor)) +\n",
  "geom_point(position=position_jitter(w=0.1,h=0)) +\n",
  "labs(title = my_gene_name) +\n",
  "theme(plot.title = element_text(hjust = 0.5))\n",

  "p + geom_abline(intercept = lm_fit$coefficients[1], slope =  lm_fit$coefficients[2], color = 'turquoise')\n",