Same contrasts returning different edgeR results
1
0
Entering edit mode
s.connell • 0
@sconnell-22580
Last seen 5.1 years ago

Hi there,

I'm working with some RNASeq data to try and identify any differential gene expression (DGE) between disease states in edgeR.

Disease can be one of three levels: No Disease, Low-risk disease, and High-risk disease. I was hoping to set up a single regression call that I could then use to explore all the contrasts of disease state and then also no disease vs any disease:

RNADesign <- model.matrix(~ 0 + Status)

RNAContrasts <- makeContrasts(
      #Only three contrasts here:
      NCvL = "StatusL - StatusNC",
      NCvH = "StatusH - StatusNC",
      LvH  = "StatusH - StatusL",
      CvNC = "(StatusL + StatusH) - 2*StatusNC",
      levels = RNADesign)

RNAContrasts
           Contrasts
Levels      NCvL NCvH LvH CvNC
  StatusNC   -1   -1   0   -2
  StatusC     1    0  -1    1
  StatusC     0    1   1    1

Where I'm running into some weirdness is the last contrast, looking for DGE between No Disease and Any Disease (a combination of the latter two levels). Depending on how I set up the contrast, I get wildly different numbers of DGE.

My understanding of contrasts was that the absolute values do not matter so long as they work out to what you are interested in. So (StatusL + StatusH) - 2*StatusNC should be the same as ((StatusL + StatusH) / 2) - StatusNC and also the identical to 2*(StatusL + StatusH) - 4*StatusNC .

However after testing each contrast with glmTreat I get 184, 326 and 473 differentially expressed genes, respectively. I've hunted this down to the different contrasts changing the logFC value that is output from glmQLFit.

I'm not sure why this is happening, and wondering if someone could help set me straight?

Thanks in advance!

edger limma contrasts DGE • 1.5k views
ADD COMMENT
3
Entering edit mode
@james-w-macdonald-5106
Last seen 16 hours ago
United States

Your assumption about the contrasts isn't correct. The contrast is how the fold change is being calculated, and if you use anything but (StatusL + StatusH)/2 - StatusNC, the fold change will be off by a factor of N, and your results will consequently be off as well. As an example, cribbing liberally from example(glmLRT):

> library(edgeR)
Loading required package: limma
> nlibs <- 9
> ngenes <- 100
> dispersion.true <- 0.1
> grps <- gl(3,3)
> design <- model.matrix(~0+grps)
> contrast <- matrix(c(0.5, 0.5, -1, 1,1,-2), ncol = 2)
> contrast
     [,1] [,2]
[1,]  0.5    1
[2,]  0.5    1
[3,] -1.0   -2
> beta.true <- cbind(Beta1=2,Beta2=2, Beta3=c(2,rep(0,ngenes-1)))
> mu.true <- 2^(beta.true %*% t(design))
> y <- rnbinom(ngenes*nlibs,mu=mu.true,size=1/dispersion.true)
> y <- matrix(y,ngenes,nlibs)
> colnames(y) <- paste0("Sample_", 1:9)
> rownames(y) <- paste0("gene", 1:100)
> d <- DGEList(y)
> d <- calcNormFactors(d)
> fit <- glmFit(d, design, dispersion = dispersion.true)
> topTags(glmTreat(fit, contrast = contrast[,1]))
Coefficient:  0.5*grps1 0.5*grps2 -1*grps3 
           logFC unshrunk.logFC   logCPM      PValue       FDR
gene89 -1.929178  -1.977236e+00 14.40385 0.001173633 0.1173633
gene24  4.982705   1.442695e+08 14.10156 0.002387623 0.1193812
gene70  2.660949   2.869588e+00 14.34994 0.011409445 0.2581814
gene50 -1.821671  -1.926589e+00 13.98772 0.012368778 0.2581814
gene86  2.596709   2.801657e+00 14.34974 0.014456109 0.2581814
gene1  -1.484763  -1.529912e+00 14.32752 0.019371575 0.2581814
gene88  2.496031   2.698986e+00 14.26905 0.019809159 0.2581814
gene65  2.478759   2.682506e+00 14.23898 0.020654515 0.2581814
gene22  2.262478   2.459458e+00 14.10063 0.038848685 0.4266688
gene46  2.230123   2.425242e+00 14.09958 0.042666878 0.4266688
> topTags(glmTreat(fit, contrast = contrast[,2]))
Coefficient:  1*grps1 1*grps2 -2*grps3 
           logFC unshrunk.logFC   logCPM       PValue        FDR
gene89 -3.858356  -3.954473e+00 14.40385 0.0009126225 0.09126225
gene24  9.965409   2.885390e+08 14.10156 0.0022654893 0.11327446
gene70  5.321899   5.739176e+00 14.34994 0.0106779128 0.24531566
gene50 -3.643341  -3.853178e+00 13.98772 0.0111683907 0.24531566
gene86  5.193418   5.603314e+00 14.34974 0.0136168017 0.24531566
gene1  -2.969526  -3.059823e+00 14.32752 0.0170713972 0.24531566
gene88  4.992062   5.397972e+00 14.26905 0.0188051667 0.24531566
gene65  4.957518   5.365013e+00 14.23898 0.0196252525 0.24531566
gene22  4.524956   4.918915e+00 14.10063 0.0374513706 0.41215662
gene46  4.460246   4.850484e+00 14.09958 0.0412156620 0.41215662

ADD COMMENT
0
Entering edit mode

Good to know, thanks for clearing that up!

ADD REPLY
0
Entering edit mode

Thanks for this example. It shows clearly that the logFC is multiplied by a factor 2 due to the coefficients in the contrast definition.

Just a small remark. The PValue and FDR are nearly unchanged. As the lfc coefficient of the glmTreat function is 0 by default, the differentially expressed genes are the same in this example. I think the lfc coefficient should be set to improve this example and I think the OP should have given the glmTreat call to make the question clearer.

ADD REPLY
0
Entering edit mode

Why do you think the default lfc argument is 0? It's like 0.26 or so.

> args(glmTreat)
function (glmfit, coef = ncol(glmfit$design), contrast = NULL, 
    lfc = log2(1.2), null = "interval") 
NULL

And even if the example used a whopping lfc argument it wouldn't change (I would have had to choose an lfc > 1.48, which is a massive fold change to use for glmTreat).

ADD REPLY
0
Entering edit mode

Google fools me? https://www.rdocumentation.org/packages/edgeR/versions/3.14.0/topics/glmTreat rdocumentation.org is not synchronized with bioconductor. I should have searched bioconductor. My fault.

Could you elaborate what you expect by setting Beta3 as you did? If I understand correctly, the OP stated the DGE is changing. You showed that logFC changed with the coefficients of the contrast. So, my naive question: did the DGE list change in your example?

ADD REPLY
0
Entering edit mode

DGE is filtered by FDR, and at a 10% FDR the first contrast has no genes and the second has one. So for this contrived example the DGE changes by 1. Presumably a 'real' analysis would have more changes than that.

As for the rationale for Beta3, I just cribbed from the example for glmLRT, where the different beta values are used to ensure that there are differentially expressed genes. I didn't put much thought into that, as my point was to show that the OP's contention that any contrast that sums to zero is identical. So far as I know, a real contrast sums to zero, and all the positive and negative coefficients sum to 1 or -1.

ADD REPLY
0
Entering edit mode

Thanks. Concerning your last sentence, I think you mean "all positive (resp.negative) coefficients sum to 1 (resp. -1)".

Interestingly, the gene that is below the FDR threshold is not the one that is differential by construction aka gene1. I am wondering if there is not too much low values in the group3 due to the value 1. I will try to increase Beta3 and to set a seed for reproducibility.

Any idea why the p-value are not exactly the same while the counts are the same by definition? If there are so fluctuations due to some stochasticity in the calculation, is there a practical difference between 0.001173633 and 0.0009126225 for the top gene (or *100 if one looks at FDR)?

ADD REPLY
0
Entering edit mode

The p-values change for different contrasts because the logFC changes changes. This is deliberate. A logFC of 3.8 is obviously greater than log2(1.2) with more confidence than a logFC of 1.9 and hence gets a smaller p-value. There is no "stochasticity".

ADD REPLY

Login before adding your answer.

Traffic: 684 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6