Question

edgeR: boxplot the expression of a gene in two conditions after glmFit

0

Entering edit mode

zoppoli pietro ▴ 10

@zoppoli-pietro-4792

Last seen 6.5 years ago

United States

I have a DGEGLM object from edgeR called fit2

fit2 <- glmFit(counts(Data), design, disp2$tagwise.dispersion,offset=-offst(dataOffset))

I want the logFC values so

logFC <- fit2$coefficients[geneA,2] # first column intercept, second column is my comparison

fit2$fitted.values[geneA,] # is the vector with the fitted values

TableWithFC <- aggregate(log2(fit2$fitted.values[geneA,]) ~ (condition),FUN= mean)

logFC2 <- TableWithFC[2,2]-TableWithFC [1,2]

logFC is NOT equal to logFC2

If logFC is the right one (is the same I get if I continue the pipeline and I do glmLRT and topTags), how I can obtain the expression values of geneA for each sample in order to obtain a boxplot like this:

boxplot(log2(fit2$fitted.values["geneA",]) ~ (fit2$design[,2]))

but with the values giving the logFC of the fit2$coefficients[geneA,2] ?

In other words the boxplot with the fitted.values is not representative of the result with coefficients.

I want a boxplot representative of the logFC obtained with coefficients.

Thanks,

P

edgeR • 2.4k views

ADD COMMENT • link updated 6.4 years ago by Aaron Lun ★ 28k • written 6.4 years ago by zoppoli pietro ▴ 10

score 1 · Answer 1 · 2018-11-21

1

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 2 minutes ago

The city by the bay

There are many reasons why those two values are different:

The log-fold changes derived from coefficients are computed with a prior count added to the counts, which squeezes differences towards zero (see ?glmFit). If you want the unshrunk log-fold changes, you should use unshrunk.coefficients.
The mean of the log-values is not generally equal to the log-mean value. So computing the mean of the log-fitted values doesn't necessarily give you the log-fold change between the means.
In fact, the sample mean of each group will not be exactly proportional to the group-specific abundance reported by edgeR (unless your libraries are exactly the same size). This is because the latter is computed from the GLM, which effectively weights each observation based on its variance.
The coefficients are computed with a natural log, while you're using log2 for your log-transform.

I think you're worrying too much about this. Just use cpm(y, log=TRUE, prior=5) to get log-CPMs and make your boxplot; any discrepancy should be too minor for people to notice or care, at least for the purposes of visualization.

ADD COMMENT • link 6.4 years ago Aaron Lun ★ 28k

0

Entering edit mode

Hi Aaron,

thanks a lot for your answer, you made me reflect on the many reasons the values can be different.

I expected a minor difference in FC as you state in your answer BUT I find FCs with different sign comparing the results of my analysis and the cpm.

Let me explain that i'm using both EDAseq and RUVSeq prior edgeR to obtain my result.

According to EDASeq manual:

dataOffset <- withinLaneNormalization(data,"gc",which="upper",offset=TRUE)

dataOffset <- betweenLaneNormalization(dataOffset,which="upper",offset=TRUE)

then according to RUVSeq using control genes taken from

Eisenberg E, Levanon EY. Human housekeeping genes, revisited. Trends Genet. 2013

I do:

set1 <- RUVg(x = dataOffset cIdx = HKgenes, k=1)

RUVg does NOT fill the offset slot of the EDASeq object.

To use the offsets I need the result from EDASeq but I still want to Remove Unwanted Variation

so I modify the edgeR paragraph of the EDASeq manual:

W_1 <- pData(set1)$ W_1

pData(dataOffset) <- cbind(pData(dataOffset),W_1)

design <- model.matrix(~conditions + W_1, data=pData(dataOffset))

then all by manual

disp2 <- estimateDisp(counts(dataOffset),design, offset=-offst(dataOffset))

fit2 <- glmFit(counts(dataOffset), design, disp2$tagwise.dispersion,
offset=-offst(dataOffset))

lrt <- glmLRT(fit, coef=2)
topTags(lrt)

contrast_BvsA <- makeContrasts(TvsN=conditionsB-conditionA, levels=design)

contrast_BvsA

#                      Contrasts
# Levels               BvsA
# conditionA            -1
# conditionsB           1
#  W_1                  0           

#  W_1 is a vector to Remove Unwanted Variation

lrt2 <- glmLRT(fit2,contrast = contrast_BvsA )

lrt2$table["geneB",]

#         logFC   logCPM       LR       PValue
# geneB -1.088669 31.60722 23.45083 1.281471e-06

If I do

y <- DGEList(counts = counts(dataOffset), lib.size = colSums(counts(dataOffset)),
         samples = (dataOffset@phenoData@data),
        group = design[,2], genes = dataOffset@featureData@data, remove.zeros = FALSE)
 y <- calcNormFactors(y)

CPMs <- cpm(y, log=TRUE, prior=5)
CPMs[geneB,]
b <- aggregate(CPMs["geneB",]~ (condition),FUN= mean)
b[2,2]-b[1,2]

# 0.2840857

As you can see it's not a small difference.

Thanks,

P

ADD REPLY • link 6.4 years ago zoppoli pietro ▴ 10

0

Entering edit mode

In my answer, I was assuming that you were performing the simplest possible comparison between group means, i.e., a GLM fitted with a one-way layout. However, if you're using blocking factors that are not balanced between groups, all bets are off. To illustrate with a simple linear model:

set.seed(0)
groups <- gl(2, 10)
batch <- factor(rep(c(1,2,1,2), c(8,2,2,8)))
y <- rnorm(length(batch), mean=as.integer(batch)) # complete batch effect

# Difference in sample means:
mean(y[groups=="2"]) - mean(y[groups=="1"])
## [1] -0.1214053

# Linear model:
fit <- lm(y ~ groups + batch)
coef(fit)
## (Intercept)     groups2       batch
## -0.01288533 -0.90730992  1.30984108

So, you can see that the difference in sample means (-0.12) is quite different from the difference returned by the linear model (-0.91). This is because the latter accounts for the effect of the blocking variables on the apparent differences between groups. If you want to represent this in a boxplot, I'd suggest using removeBatchEffect on the log-CPMs with your groups in design= and all blocking factors in covariates=.

ADD REPLY • link 6.4 years ago Aaron Lun ★ 28k

0

Entering edit mode

Dear Aaron,
thanks for your answer.

I tried your solution but with no luck.

Maybe it's related to the used of the offset from EDAseq package together with the RUVg function from RUVSeq package.

As you can read in the previous post my design is

design <- model.matrix(~conditions + W_1, data=pData(dataOffset))

so the only block factor i see is W_1 (which is the the estimated factors of unwanted variation).

so:

CPMsNoBatch<- removeBatchEffect(CPMs, batch=NULL, batch2=NULL,

 covariates=Data2@phenoData@data$W_1, design=as.matrix(design[,2]))

 bb <- aggregate(CPMsNoBatch["geneB",]~ design[,2],FUN= mean)

b[2,2]-b[1,2]
# 9.359853

A difference bigger than before.

I remind you the

lrt2$table["geneB",]

gives FC -1

Maybe I'm doing something wrong or there is an easiest way to integrate EDAseq normalization and RUVg results into edgeR pipeline.

Sorry for the many questions but I'd really like to use this pipeline (or a modified working one) in some papers we are preparing.

Best,

P

ADD REPLY • link 6.4 years ago zoppoli pietro ▴ 10

0

Entering edit mode

set.seed(0)
groups <- gl(2, 10)
batch <- factor(rep(c(1,2,1,2), c(8,2,2,8)))
design <- model.matrix(~groups + batch)

# Mocking up some data.
ngenes <- 1000
batch.means <- matrix(rnorm(2*ngenes), ncol=2)
sample.means <- 100*2^batch.means[,batch]
y <- matrix(rnbinom(length(sample.means), mu=sample.means, size=10), ncol=length(batch))

# Using glmFit/glmLRT for demonstration purposes: normally I would use QL methods.
library(edgeR)
fit <- glmFit(y, design, offset=0, dispersion=0.1)
res <- glmLRT(fit, coef=2)

# Using removeBatchEffect.
lcpms <- cpm(y, log=TRUE, prior.count=1)
corrected <- removeBatchEffect(lcpms, design=design[,1:2,drop=FALSE], covariates=design[,3])
alt.lfc <- rowMeans(corrected[,groups==2]) - rowMeans(corrected[,groups==1])

# Looks pretty consistent to me.
plot(alt.lfc, res$table$logFC)
abline(a=0, b=1, col="red", lwd=2, lty=2)

At low counts, this is less likely to match up due to the effect of the prior.count and the fact that the variance relative to the mean is higher (such that the mean of the logs becomes even more different to the log of the means). But otherwise it should be pretty consistent, as you can see from the plot above. If you're not getting this result, something probably went wrong - try working back from the code I've just provided above.

ADD REPLY • link 6.4 years ago Aaron Lun ★ 28k