Question

Unclear of calculating correlations with case+control samples together

mocharrout ▴ 20

@mocharrout-11719

Last seen 7.6 years ago

Hello,

I have RNA-seq data from 5 control and 15 case samples. I want to find modules related to the disease I'm studying. One of the steps in WGCNA is creating the correlation matrix in order to find co-expressed genes later on. From Peter Langfelder's previous posts on this forum, I understand that all samples (case + control) should be used together. But it is not clear to me how this works: if, for example, gene A correlates with gene B in control but not in case, then the correlation value will be based on mixed signals. What does this value really mean then? Is the assumption here that all genes stay correlated in both groups and only the expression level differs?

What sounds logical to me (an undergrad, so please bear with me) is to create modules for case and modules for control, and then search for modules which are not preserved. These would then be the modules related to the disease. Unfortunately, with my low sample size I don't think this is a possibility for me. (Maybe it is?)

Thanks in advance.

Edit: I have 5 wild-type samples and 15 disease samples. I'm interested in the processes related to the disease: what happens in the body, transcription-wise, in disease samples? I think WGCNA is a good fit because the modules contain co-expressed genes, which are thus (hopefully) related to a shared function. Modules with high correlation to trait data (and perhaps with many DEGs?) should be annotated to see what functions the gene products perform.

WGCNA wgcna • 1.6k views

ADD COMMENT • link 8.0 years ago • updated 7.9 years ago mocharrout ▴ 20

As you said, with such a low sample size you can't build a reliable network from just 5 samples (even with just the 15 cases it is difficult to obtain a good network). I would recommend building your network and correlating the modules with a variable that indicates whether a sample is control or case.
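As a rough illustration of correlating a module with a case/control indicator, here is a minimal Python sketch on made-up data. WGCNA itself summarises a module by its eigengene (the first principal component); the per-sample average of z-scored genes used here is only a simplified stand-in, and every name and value below is hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def zscore(values):
    """Centre and scale one gene's expression across samples."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def module_summary(module_expr):
    """Per-sample average of z-scored genes (assumed stand-in for an eigengene)."""
    z = [zscore(gene) for gene in module_expr]
    return [mean(sample) for sample in zip(*z)]

# Hypothetical design: 5 controls (0) followed by 15 cases (1).
trait = [0] * 5 + [1] * 15

# Hypothetical module of 3 genes, each shifted upwards in cases plus a small wiggle.
wiggle = [-0.2, -0.1, 0.0, 0.1, 0.2]
module_expr = [
    [base + 1.0 * t + wiggle[(s + g) % 5] for s, t in enumerate(trait)]
    for g, base in enumerate([5.0, 7.0, 9.0])
]

summary = module_summary(module_expr)
module_trait_cor = pearson(summary, trait)
print(round(module_trait_cor, 3))
```

A module whose summary correlates strongly with the indicator is a candidate disease-related module; with only 20 samples such a value should be read cautiously.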

The correlation shows whether, when the expression of gene A increases, the expression of gene B changes proportionally (for example, A going up by 10 and B going up by a similar amount), or not. The network is not built only from the correlation between two genes, but also from whether those two genes have similar correlations with other genes (see the documentation on the Topological Overlap Measure). This makes network construction more robust: if two genes really behave differently, their correlations with other genes will also differ, and they will end up in separate modules.
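To make the topological overlap idea concrete, here is a small Python sketch of the usual formula for an unsigned weighted network, TOM_ij = (sum_u a_iu * a_uj + a_ij) / (min(k_i, k_j) + 1 - a_ij), where a is the soft-thresholded adjacency and k the connectivity. The correlation matrix and the power beta = 1 are made up for readability; this is not a replacement for the WGCNA functions.

```python
def adjacency(cor, beta):
    """Unsigned soft-threshold adjacency: a_ij = |cor_ij| ** beta, zero diagonal."""
    n = len(cor)
    return [[0.0 if i == j else abs(cor[i][j]) ** beta for j in range(n)]
            for i in range(n)]

def topological_overlap(adj):
    """Topological overlap for every gene pair of a weighted adjacency matrix."""
    n = len(adj)
    k = [sum(row) for row in adj]  # connectivity of each gene
    tom = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Strength of connections to shared neighbours u.
            shared = sum(adj[i][u] * adj[u][j]
                         for u in range(n) if u != i and u != j)
            tom[i][j] = (shared + adj[i][j]) / (min(k[i], k[j]) + 1.0 - adj[i][j])
    return tom

# Hypothetical correlations among genes A, B, C, D.
# A-B and C-D have the same direct correlation (0.5),
# but only A and B share a strongly connected neighbour (C).
cor = [
    [1.0, 0.5, 0.9, 0.0],
    [0.5, 1.0, 0.9, 0.0],
    [0.9, 0.9, 1.0, 0.5],
    [0.0, 0.0, 0.5, 1.0],
]
tom = topological_overlap(adjacency(cor, beta=1))
print(round(tom[0][1], 3), round(tom[2][3], 3))
```

In this toy case the A-B pair gets a higher overlap than the C-D pair even though their direct correlations are identical, which is the sense in which the shared neighbourhood adds information beyond the pairwise value.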

ADD REPLY • link 7.9 years ago Lluís Revilla Sancho ▴ 750

Thanks for the response Lluis. So if I understand it correctly, we call two genes co-expressed only if they show co-expression between samples? Following my example, gene A and B are not co-expressed? Wouldn't this in my case be very biased because I have three times more samples in case than control?

ADD REPLY • link 7.9 years ago mocharrout ▴ 20

We call two genes co-expressed if the (absolute) correlation of their expression is higher than a predefined threshold. In your example, genes A and B could still be correlated above your threshold and thus be considered co-expressed. You can, of course, compute the correlation for the 5 controls and the 15 cases separately to see whether it differs (you could have a look at the MergeMaid package to see how many genes keep the same relationship with other genes). The correlation can take weights into account (see corr in the boot package) or be made more robust to outliers (see bicor in the WGCNA package) if you want to account for those biases.
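A toy Python sketch of computing the correlation separately and pooled, on invented values for genes A and B (5 controls, 15 cases), may help show what the pooled number reflects. The data are constructed so that B tracks A in controls and is uncorrelated with A in cases; nothing here comes from a real dataset.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical controls: gene B follows gene A exactly.
ctrl_a = [1.0, 2.0, 3.0, 4.0, 5.0]
ctrl_b = [1.0, 2.0, 3.0, 4.0, 5.0]

# Hypothetical cases: B varies independently of A (zero correlation by construction).
case_a = [1.0, 2.0, 3.0, 4.0, 5.0] * 3
case_b = [3.0] * 5 + [1.0] * 5 + [5.0] * 5

r_ctrl = pearson(ctrl_a, ctrl_b)                      # 1.0
r_case = pearson(case_a, case_b)                      # 0.0
r_pooled = pearson(ctrl_a + case_a, ctrl_b + case_b)  # about 0.22

print(r_ctrl, r_case, round(r_pooled, 3))
```

The pooled value sits between the two group values and is pulled towards the larger group, so on its own it cannot distinguish "weakly correlated everywhere" from "strongly correlated in one group only". In real data a shift in mean expression between groups can additionally create pooled correlation by itself, which this toy example deliberately avoids by giving both groups the same means.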

ADD REPLY • link 7.9 years ago Lluís Revilla Sancho ▴ 750

Ah, but going back to the original question, I do understand the meaning of co-expression as you just nicely explained, but my concern was that the correlation is now based on how correlated the two genes are in each group. Two genes are more correlated if they correlate in both groups vs one, but what is the biological meaning of this? I hope my concern is a bit clear, and thanks for the package recommendations to help remove the bias.

ADD REPLY • link 7.9 years ago mocharrout ▴ 20

In my opinion, a high correlation of expression between gene A and gene B in both groups wouldn't mean anything on its own (with such a low number of samples, and without independent cohorts). But if a set of genes (or, with a higher number of samples and independent cohorts, even a couple of genes) shows a pattern of co-expression in both groups, this might indicate that they are functionally related, either directly or indirectly (although it could also mean that they are expressed similarly for reasons unrelated to their function).

It might be due to their function, or it could still be due to non-biological sources of variation. The type of relationship between two genes (or groups of genes) can't be inferred from expression data alone. That's why enrichment methods are used to see whether genes are enriched in a pathway (KEGG, REACTOME, ...), in a collection of genes (GSE), in a function from the gene ontologies (GO)... Most of these methods require a list or group of identifiers, and ultimately these associations must be checked in the wet lab.

ADD REPLY • link 7.9 years ago Lluís Revilla Sancho ▴ 750

Thank you again for the reply. I really appreciate it, but I think my point isn't really coming across, or perhaps I'm terribly bad at understanding your point. I understand that co-expression of gene A and B in both groups wouldn't tell us much on its own and that we instead need to look at the topological overlap. What I do not understand is how we can use the correlation value if they are not co-expressed in both groups. What does this value mean? I edited my post to give some more information about what I want to find. Perhaps using both sample groups together is not the correct way to answer my question?

ADD REPLY • link 7.9 years ago mocharrout ▴ 20

I might not be the right person to answer you, but the correlation value between two genes that are not co-expressed in both groups is noise. That's why it is always better to build a separate network for each condition. Even if it is noise, it can help set the "background" noise level against which the really co-expressed genes stand out.

>What happens in the body transcription wise in disease samples?

Some genes might be differentially expressed between the groups, some will be co-expressed and keep a similar expression profile, while others might be differentially co-expressed between both groups. See the first paragraph of Robert's answer.

You might find this article interesting; it describes a method to compare different conditions. Or this other one, which is not based on WGCNA. However, I hope a new answer to your question will come.

ADD REPLY • link 7.9 years ago Lluís Revilla Sancho ▴ 750

A big thank you again for your help, Lluis.

ADD REPLY • link 7.9 years ago mocharrout ▴ 20

score 1 · Answer 1 · 2016-10-25

Hi,

this is not at all a trivial question. In my opinion, you should first think about what you are looking for. Are you looking for correlations that occur within both case and control samples? Are you looking for correlations that occur in only one condition but not in the other? Do you want to identify pairs of genes that are correlated irrespective of the reason why they are correlated, or would you like to discard those pairs that are correlated due to non-biological sources of variation, such as batch effects?

If you are interested in finding pairs of correlated genes whose correlation occurs across conditions, adjusted for indirect effects, you may want to have a look at the function qpGenNrr() of the qpgraph package and the corresponding paper describing what it does. It was designed for microarray data, so you should first transform the RNA-seq counts into a continuous measurement, such as log-CPM values.
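For readers unsure what a log-CPM transformation involves, here is a minimal Python sketch: counts are scaled by library size to counts per million and log2-transformed, with a small offset so that zeros stay finite. The offsets (0.5 on the count, 1 on the library size) are an assumed, simple choice; dedicated RNA-seq packages apply their own variants, and the count matrix below is invented.

```python
from math import log2

def log_cpm(counts, prior_count=0.5):
    """Convert a genes x samples matrix of raw counts to log2 counts per million.

    prior_count is an assumed small offset that keeps zero counts finite.
    """
    lib_sizes = [sum(sample) for sample in zip(*counts)]  # total counts per sample
    return [
        [log2((c + prior_count) / (lib + 1.0) * 1e6)
         for c, lib in zip(gene, lib_sizes)]
        for gene in counts
    ]

# Hypothetical counts: 2 genes x 2 samples, second sample sequenced twice as deep.
counts = [
    [10, 20],
    [90, 180],
]
values = log_cpm(counts)
print([[round(v, 2) for v in gene] for gene in values])
```

After the transformation the two samples give nearly the same value for each gene despite the twofold difference in sequencing depth, which is the property that makes the values comparable across samples.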

cheers,

robert.