Hello,
I have RNA-seq data of 5 control and 15 case samples. I want to find modules related to the disease I'm studying. One of the steps in WGCNA is creating the correlation matrix in order to find co-expressed genes later on. From Peter Langfelder's previous posts on this forum, I understand that all samples (case + control) should be used together. But it is not clear to me how this works, because if, for example, gene A correlates with gene B in control but not in case, then the correlation value will be based on mixed signals. What does this value really mean then? Is the assumption here that all genes will stay correlated in both groups, and only the expression value will differ?
What sounds logical to me (an undergrad, so please bear with me) is to create modules for case and modules for control, and then search for modules which are not preserved. These will then be the modules related to the disease. Unfortunately, with my low sample size I don't think this is a possibility for me. (Maybe it is?)
Thanks in advance.
Edit: I have 5 wild-type samples and 15 disease samples. I'm interested in the processes related to the disease. What happens in the body transcription wise in disease samples? I think WGCNA is a good fit because the modules contain co-expressed genes that are thus (hopefully) related to a shared function. Modules with high correlation to trait data (and perhaps with many DEGs?) should be annotated to see what functions the gene products perform.
As you said, with such a low sample size you can't build a reliable network with just 5 samples (even with just the 15 cases it is difficult to obtain a good network). I would recommend building your network and correlating the modules with a variable that indicates whether a sample is control or case.
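To make the "correlate the modules with a case/control variable" step concrete, here is a minimal sketch in Python with NumPy (not the WGCNA R code itself). It assumes a module is summarized by its first principal component, which is how a module eigengene is usually defined. The function name `module_eigengene` and the toy data are made up for illustration.

```python
import numpy as np

def module_eigengene(expr):
    """Summarize one module by the first principal component of its genes.

    expr: array of shape (n_samples, n_genes) holding the genes of ONE module.
    Returns a vector of length n_samples.
    """
    # Standardize each gene so that highly expressed genes do not dominate.
    z = (expr - expr.mean(axis=0)) / expr.std(axis=0, ddof=1)
    # The first left singular vector is the sample-wise summary of the module.
    u, s, vt = np.linalg.svd(z, full_matrices=False)
    eigengene = u[:, 0]
    # The sign of a singular vector is arbitrary; align it with mean expression.
    if np.corrcoef(eigengene, z.mean(axis=1))[0, 1] < 0:
        eigengene = -eigengene
    return eigengene

# Toy data (invented): 5 controls followed by 15 cases, one module of 4 genes.
rng = np.random.default_rng(0)
is_case = np.array([0] * 5 + [1] * 15)
shift = 2.0 * is_case                     # the module is shifted upward in cases
expr = shift[:, None] + rng.normal(scale=0.5, size=(20, 4))

me = module_eigengene(expr)
# Module-trait correlation: eigengene versus the 0/1 case indicator.
module_trait_cor = np.corrcoef(me, is_case)[0, 1]
print(round(module_trait_cor, 2))
```

A module whose eigengene correlates strongly with the 0/1 indicator is a candidate disease-related module. With only 20 samples the correlation estimate is still noisy.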
The correlation tells you whether, when the expression of gene A increases across samples, the expression of gene B tends to increase proportionally as well, or not. The network is not built just from the correlation between two genes, but also from whether the correlations of those two genes with other genes are similar (see the documentation on the Topological Overlap Measure). This provides a robust way to build the network: if those genes really behave differently, their correlations with other genes will also differ, and they will end up in separate modules.
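As a rough illustration of the topological overlap idea, the sketch below computes an unsigned TOM in Python with NumPy. It assumes the usual unsigned formulation: the adjacency is |cor|^beta, and the overlap between genes i and j is (l_ij + a_ij) / (min(k_i, k_j) + 1 - a_ij). The soft-threshold power `beta=6` and the toy data are assumptions chosen for the example, and this is not the WGCNA implementation.

```python
import numpy as np

def topological_overlap(expr, beta=6):
    """Unsigned topological overlap matrix from an expression matrix.

    expr: array of shape (n_samples, n_genes).
    beta: soft-threshold power (an assumed value; it is normally tuned).
    """
    cor = np.corrcoef(expr, rowvar=False)      # gene x gene correlation
    adj = np.abs(cor) ** beta                  # soft-thresholded adjacency
    np.fill_diagonal(adj, 0.0)                 # no self-connections
    # l_ij: connection strength that i and j share through all other genes.
    shared = adj @ adj
    k = adj.sum(axis=1)                        # connectivity of each gene
    min_k = np.minimum.outer(k, k)
    tom = (shared + adj) / (min_k + 1.0 - adj)
    np.fill_diagonal(tom, 1.0)
    return tom

# Toy data (invented): genes 0-2 follow one latent factor, genes 3-5 another.
rng = np.random.default_rng(0)
f1 = rng.normal(size=20)
f2 = rng.normal(size=20)
expr = np.column_stack(
    [f1 + rng.normal(scale=0.3, size=20) for _ in range(3)]
    + [f2 + rng.normal(scale=0.3, size=20) for _ in range(3)]
)

tom = topological_overlap(expr)
print(np.round(tom, 2))
```

Genes that share neighbours get a high overlap even when their direct correlation is only moderate, which is why modules are usually defined on 1 - TOM rather than on the raw correlation.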
Thanks for the response Lluis. So if I understand it correctly, we call two genes co-expressed only if they show co-expression between samples? Following my example, gene A and B are not co-expressed? Wouldn't this in my case be very biased because I have three times more samples in case than control?
We call two genes co-expressed if the (absolute) correlation of their expression is higher than a predefined threshold. In your example, genes A and B could still be correlated above your threshold and thus be considered co-expressed. You can, of course, compute the correlation in the 5 controls and the 15 cases separately to see if it differs (you could have a look at the MergeMaid package to see how many genes keep the same relation with other genes). The correlation can take weights into account (see corr in the boot package), or can be made more robust to outliers (see bicor in the WGCNA package), if you want to account for those biases.
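The following Python/NumPy sketch reproduces the situation from the question with invented data: gene B follows gene A in the 5 controls but not in the 15 cases. It compares the per-group correlations, the pooled correlation, and a weighted correlation in which each group gets the same total weight. The weighting scheme is an assumption made for illustration (similar in spirit to a weighted correlation, not a re-implementation of any R function).

```python
import numpy as np

rng = np.random.default_rng(1)
n_ctrl, n_case = 5, 15

# Controls: gene B follows gene A closely.
a_ctrl = np.linspace(-1.0, 1.0, n_ctrl)
b_ctrl = a_ctrl + rng.normal(scale=0.05, size=n_ctrl)
# Cases: gene A and gene B vary independently.
a_case = rng.normal(size=n_case)
b_case = rng.normal(size=n_case)

r_ctrl = np.corrcoef(a_ctrl, b_ctrl)[0, 1]
r_case = np.corrcoef(a_case, b_case)[0, 1]

# Pooled correlation: all 20 samples together, cases outnumber controls 3:1.
a_all = np.concatenate([a_ctrl, a_case])
b_all = np.concatenate([b_ctrl, b_case])
r_pooled = np.corrcoef(a_all, b_all)[0, 1]

def weighted_correlation(x, y, w):
    """Pearson correlation with per-sample weights."""
    cov = np.cov(x, y, aweights=w)
    return cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])

# Give each group the same total weight so 15 cases do not swamp 5 controls.
weights = np.concatenate([np.full(n_ctrl, 1.0 / n_ctrl),
                          np.full(n_case, 1.0 / n_case)])
r_weighted = weighted_correlation(a_all, b_all, weights)

print(f"control: {r_ctrl:.2f}  case: {r_case:.2f}")
print(f"pooled: {r_pooled:.2f}  group-balanced: {r_weighted:.2f}")
```

The pooled value is a single summary over both groups, so a relationship present in only one group is diluted by the other, and more so when that other group is larger.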
Ah, but going back to the original question, I do understand the meaning of co-expression as you just nicely explained, but my concern was that the correlation is now based on how correlated the two genes are in each group. Two genes are more correlated if they correlate in both groups vs one, but what is the biological meaning of this? I hope my concern is a bit clear, and thanks for the package recommendations to help remove the bias.
In my opinion, a high correlation of expression between gene A and gene B in both groups wouldn't mean anything by itself (with such a low number of samples, and without independent cohorts). But if a group of genes (or, with a higher number of samples and independent cohorts, even a couple of genes) shows a pattern of co-expression in both groups, this might indicate that they are functionally related, either directly or indirectly (although it could also mean that they are expressed similarly for reasons unrelated to their function).
It might be due to their function, or it could still be due to non-biological sources of variation. The type of relationship between two genes (or groups of genes) can't be inferred from expression data alone. That's why enrichment methods are used to see whether a set of genes is enriched in a pathway (KEGG, REACTOME, ...), in a collection of genes (GSE), in a function from the Gene Ontology (GO)... Most of these methods require a list/group of identifiers, and ultimately these associations must be checked in the wet lab.
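One common form of enrichment analysis is an over-representation test based on the hypergeometric distribution. The sketch below uses only the Python standard library. The function name `enrichment_p_value` and all counts in the example are made up, and real tools add multiple-testing correction on top of this.

```python
from math import comb

def enrichment_p_value(n_universe, n_pathway, n_module, n_overlap):
    """One-sided hypergeometric p-value for over-representation.

    Probability of seeing at least n_overlap pathway genes in a module of
    n_module genes drawn at random from a universe of n_universe genes,
    of which n_pathway belong to the pathway.
    """
    total = comb(n_universe, n_module)
    upper = min(n_pathway, n_module)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_module - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Invented numbers: 20000 measured genes, a pathway of 150 genes,
# a module of 200 genes, 10 of which are in the pathway.
p = enrichment_p_value(20000, 150, 200, 10)
print(f"p = {p:.3g}")
```

By chance alone one would expect about 200 * 150 / 20000 = 1.5 pathway genes in such a module, so an overlap of 10 is far more than expected.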
Thank you again for the reply. I really appreciate it, but I think my point isn't really coming across, or perhaps I'm terribly bad at understanding your point. I understand that co-expression of gene A and B in both groups wouldn't tell us much on its own and that we instead need to look at the topological overlap. What I do not understand is how we can use the correlation value if they are not co-expressed in both groups. What does this value mean? I edited my post to give some more information about what I want to find. Perhaps using both samples is not the correct way to answer my question?
I might not be the right person to answer this, but the value of the correlation between two genes that are not co-expressed in both groups is noise. That's why it is always better to build a separate network for each condition. Even if it is noise, it can help to set the "background" level against which the really co-expressed genes stand out.
>What happens in the body transcription wise in disease samples?
Some genes might be differentially expressed between the groups, some will be co-expressed and keep a similar expression profile, while others might be differentially co-expressed between both groups. See the first paragraph of Robert's answer.
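To illustrate what "differentially co-expressed" could mean for a single gene pair, one simple option (an assumption here, not something the linked methods necessarily use) is to compare the two group-wise correlations with Fisher's z-transformation. The sketch uses only the Python standard library, and the correlation values are invented.

```python
from math import atanh, erfc, sqrt

def compare_correlations(r1, n1, r2, n2):
    """Two-sided test for a difference between two independent correlations.

    Uses Fisher's z-transformation; requires n1 > 3 and n2 > 3.
    Returns (z statistic, p-value).
    """
    z1, z2 = atanh(r1), atanh(r2)
    se = sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    p = erfc(abs(z) / sqrt(2.0))   # two-sided normal tail probability
    return z, p

# Invented example: r = 0.9 in 5 controls versus r = 0.1 in 15 cases.
z, p = compare_correlations(0.9, 5, 0.1, 15)
print(f"z = {z:.2f}, p = {p:.3f}")
```

Even this large difference does not reach the conventional 0.05 level with 5 controls, which is consistent with the concern above that separate per-group networks are hard to justify at this sample size.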
You might find this article interesting; it describes a method to compare different conditions. Or this other one, which is not based on WGCNA. In any case, I hope a new answer to your question will come.
A big thank you again for your help, Lluis.