Hi there,
I'm working with some RNASeq data to try and identify any differential gene expression (DGE) between disease states in edgeR.
Disease can be one of three levels: No Disease, Low-risk disease, and High-risk disease. I was hoping to set up a single regression call that I could then use to explore all the contrasts of disease state and then also no disease vs any disease:
RNADesign <- model.matrix(~ 0 + Status)
RNAContrasts <- makeContrasts(
#Only three contrasts here:
NCvL = "StatusL - StatusNC",
NCvH = "StatusH - StatusNC",
LvH = "StatusH - StatusL",
CvNC = "(StatusL + StatusH) - 2*StatusNC",
levels = RNADesign)
RNAContrasts
Contrasts
Levels NCvL NCvH LvH CvNC
StatusNC -1 -1 0 -2
StatusC 1 0 -1 1
StatusC 0 1 1 1
Where I'm running into some weirdness is the last contrast, looking for DGE between No Disease and Any Disease (a combination of the latter two levels). Depending on how I set up the contrast, I get wildly different numbers of DGE.
My understanding of contrasts was that the absolute values do not matter so long as they work out to what you are interested in. So
(StatusL + StatusH) - 2*StatusNC
should be the same as ((StatusL + StatusH) / 2) - StatusNC
and also the identical to
2*(StatusL + StatusH) - 4*StatusNC
.
However after testing each contrast with glmTreat
I get 184, 326 and 473 differentially expressed genes, respectively. I've hunted this down to the different contrasts changing the logFC value that is output from glmQLFit
.
I'm not sure why this is happening, and wondering if someone could help set me straight?
Thanks in advance!
Good to know, thanks for clearing that up!
Thanks for this example. It shows clearly that the logFC is multiplied by a factor 2 due to the coefficients in the contrast definition.
Just a small remark. The PValue and FDR are nearly unchanged. As the lfc coefficient of the glmTreat function is 0 by default, the differentially expressed genes are the same in this example. I think the lfc coefficient should be set to improve this example and I think the OP should have given the glmTreat call to make the question clearer.
Why do you think the default lfc argument is 0? It's like 0.26 or so.
And even if the example used a whopping lfc argument it wouldn't change (I would have had to choose an lfc > 1.48, which is a massive fold change to use for
glmTreat
).Google fools me? https://www.rdocumentation.org/packages/edgeR/versions/3.14.0/topics/glmTreat rdocumentation.org is not synchronized with bioconductor. I should have searched bioconductor. My fault.
Could you elaborate what you expect by setting Beta3 as you did? If I understand correctly, the OP stated the DGE is changing. You showed that logFC changed with the coefficients of the contrast. So, my naive question: did the DGE list change in your example?
DGE is filtered by FDR, and at a 10% FDR the first contrast has no genes and the second has one. So for this contrived example the DGE changes by 1. Presumably a 'real' analysis would have more changes than that.
As for the rationale for Beta3, I just cribbed from the example for
glmLRT
, where the different beta values are used to ensure that there are differentially expressed genes. I didn't put much thought into that, as my point was to show that the OP's contention that any contrast that sums to zero is identical. So far as I know, a real contrast sums to zero, and all the positive and negative coefficients sum to 1 or -1.Thanks. Concerning your last sentence, I think you mean "all positive (resp.negative) coefficients sum to 1 (resp. -1)".
Interestingly, the gene that is below the FDR threshold is not the one that is differential by construction aka gene1. I am wondering if there is not too much low values in the group3 due to the value 1. I will try to increase Beta3 and to set a seed for reproducibility.
Any idea why the p-value are not exactly the same while the counts are the same by definition? If there are so fluctuations due to some stochasticity in the calculation, is there a practical difference between 0.001173633 and 0.0009126225 for the top gene (or *100 if one looks at FDR)?
The p-values change for different contrasts because the logFC changes changes. This is deliberate. A logFC of 3.8 is obviously greater than log2(1.2) with more confidence than a logFC of 1.9 and hence gets a smaller p-value. There is no "stochasticity".