DGE with a Continuous Variable
2
0
Entering edit mode
picasa1983 • 0
@picasa1983-14806
Last seen 25 days ago
USA

Hi everyone,

I'm working on a differential gene expression (DGE) analysis using DESeq2, with the goal of identifying genes that are differentially expressed according to age (which is a numerical value in info).

Here is the code I have so far:

    dds <- DESeqDataSetFromMatrix(countData = data, colData = info, design = ~ Age)

    keep <- rowSums(counts(dds) >= 10) >= smallestGroupSize
    dds <- dds[keep,]

    dds <- DESeq(dds)

    res <- results(dds, alpha = 0.05)

1) Is it necessary to perform log fold change shrinkage (lfcShrink)?

    res.ape <- lfcShrink(dds, coef = "Age", type = "apeglm", res = res)

2) I found an interesting gene with the following statistics:

               baseMean log2FoldChange     lfcSE      stat      pvalue        padj
               <numeric>      <numeric> <numeric> <numeric>   <numeric>   <numeric>
Gene_X      6273.312       0.260054 0.0441626   5.88856 3.89567e-09 0.000079569

How can I plot the expression of this gene for all samples and draw a linear regression line using the intercept and slope computed by DESeq2?

Should I use the counts from:

vsd <- vst(dds, blind=TRUE)
assay(vsd)

?

Thank you!

DESeq2 model • 808 views
ADD COMMENT
1
Entering edit mode
@mikelove
Last seen 1 day ago
United States

1) you can, this is useful for ranking. See the apeglm paper.

2) "draw a linear regression line using the intercept and slope computed by DESeq2"

You can use normTransform(), which is log2 scaled counts. This is close enough to what you get from the coefficients in coef(dds), however the glm takes into account count variance, while simple regression on log counts does not.

ADD COMMENT
0
Entering edit mode

Thank you for your quick support. I created two figures using the following code:

The counts from a particular gene were taken from assay(normTransform(dds))

1) Using coef(dds)

intercept <- coef(dds)[gene_to_check, "Intercept"]
slope <- coef(dds)[gene_to_check, "Age"]

ggplot(plot_data, aes(x=Age, y=GeneExpression)) +
  geom_point() + 
  geom_abline(intercept = intercept, slope = slope, color = "red")

deseq2

2) Using lm

lm_fit <- lm(expression ~ dds@colData$Age)

ggplot(plot_data, aes(x=Age, y=GeneExpression)) +
  geom_point() +  # Add points
  geom_abline(intercept = lm_fit$coefficients[1], slope = lm_fit$coefficients[2], color='turquoise')

lm

How can these be very different, and which one makes more sense?

Thanks

ADD REPLY
1
Entering edit mode

How can these be very different, and which one makes more sense?

I directly answered that above in my post, did you see?

ADD REPLY
0
Entering edit mode

I think I read it too quickly. Thanks :)

ADD REPLY
0
Entering edit mode

Is there a way to get the R-squared R^2 from the model ?

ADD REPLY
0
Entering edit mode

You could compute R2 from the predicted values, using the fitted coefficients, correlated to the observed values, but as with the regression line, a GLM/likelihood approach takes into account count variance while simple regression on log counts does not.

ADD REPLY
0
Entering edit mode

I read in a Biostars thread that the fitted coefficient is in assays(dds)[["mu"]], but I'm not sure if this is correct. Would this be the correct way to get the R2 ?

fitted_values <- assays(dds)[["mu"]][gene_id, ]
observed_counts <- counts(dds, normalized = TRUE)[gene_id, ]
r_squared <- cor(fitted_values, observed_counts)^2
ADD REPLY
1
Entering edit mode

The fitted counts in mu will work but they contain size factor scaling. So you can divide those by sizeFactors(dds).

mu <- assays(dds)[["mu"]]
t( t(mu) / sizeFactors(dds) ) )
ADD REPLY
0
Entering edit mode

Hi Michael, using an other dataset, I got this strange message:

the design formula contains one or more numeric variables with integer values, specifying a model with increasing fold change for higher values. did you mean for this to be a factor? if so, first convert this variable to a factor using the factor() function the design formula contains one or more numeric variables that have mean or standard deviation larger than 5 (an arbitrary threshold to trigger this message). Including numeric variables with large mean can induce collinearity with the intercept. Users should center and scale numeric variables in the design to improve GLM convergence.

In order to solve it, I did:

meta$scaled_Age <- scale(meta$Age, center = TRUE, scale = TRUE)
dds <- DESeqDataSetFromMatrix(data, colData = meta, design = ~ scaled_Age)

Then classical following steps.

Is it the right way to do ?

ADD REPLY
0
Entering edit mode

That's not strange to me :)

Scaling the numerical variables is useful, and the other message is just making sure users know how the variables will be modeled, so it isn't ambiguous.

ADD REPLY
1
Entering edit mode
swbarnes2 ★ 1.4k
@swbarnes2-14086
Last seen 3 days ago
San Diego

I had this code just hanging around, you can try to tweak it to make it fit your data. Ignore all the quotes and the new lines; its a template for making R code that runs within an R markdown document, that's why it needs those.

  "expression <- log(counts(dds, normalized = TRUE)[gene_name,] + 0.001,2)\n",


  "lm_fit <- lm(expression ~ dds@colData$Day)\n",


  "d<- plotCounts(dds, gene=gene_name, intgroup=c('Donor', 'Day'), returnData = TRUE)\n",

  "p <- ggplot(d, aes(x=Day, y=log(count,2), color = Donor)) +\n",
  "geom_point(position=position_jitter(w=0.1,h=0)) +\n",
  "labs(title = my_gene_name) +\n",
  "theme(plot.title = element_text(hjust = 0.5))\n",

  "p + geom_abline(intercept = lm_fit$coefficients[1], slope =  lm_fit$coefficients[2], color = 'turquoise')\n",
ADD COMMENT
0
Entering edit mode

Thanks for the code, but you don't use the coefficient estimated by the model ? with coef(dds)[gene_name,]

ADD REPLY

Login before adding your answer.

Traffic: 472 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6