Question

WGCNA: 1) low soft thresholding power, 2) large modules, 3) best correlation for different types of trait variables

0

stu111538 • 0

@stu111538-13994

Last seen 4.3 years ago

Germany, Kiel, University Hospital Kiel

Hello, I performed WGCNA on RNA-Seq data of 55 samples and used the code exactly as provided at the WGCNA website for the network analysis of the female mice data. There are three issues I am not sure about:

1) According to the tutorial recommendations, I would need to choose a soft-thresholding power of 3, since it already reaches an R^2 of 0.8 and is also the maximum. However, the power recommendations in the FAQ table suggest a power of 6-12 for my sample size. What would you recommend?

2) I am using about 20,000 genes as input, and both the signed and the unsigned network analysis yield 4 or 5 modules containing thousands of genes (the largest module contains 9,000 genes), and about 15 modules containing hundreds of genes. Should I be concerned about the large modules?

3) I want to correlate the gene modules with continuous (BMI), categorical (e.g. smoking habits) and binary variables (e.g. mutation yes/no). Which correlation is best for all types of variables: bicor(x,y, robustY = FALSE, maxPOutliers = 0.05) or simple Pearson? Or is it best to use a separate correlation for each variable type? I have NAs in every kind of variable.

Thank you in advance

WGCNA • 7.8k views

ADD COMMENT • link updated 5.4 years ago by Peter Langfelder ★ 3.0k • written 5.4 years ago by stu111538 • 0

0

1) According to the tutorial recommendations, I would need to choose a soft-thresholding power of 3, since it already reaches an R^2 of 0.8 and is also the maximum. However, the power recommendations in the FAQ table suggest a power of 6-12 for my sample size. What would you recommend?

A soft-thresholding power of 3 is really low. I would recommend looking at your data (just do a PCA), because you might have a very strong driver of variation, which would explain why you end up with a module of 9,000 genes; perhaps it is smoking habits or another categorical variable that you did not take into account.
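To make the effect of the power concrete, here is a minimal sketch of the soft-thresholding step for the three network types as WGCNA defines them. It is plain Python rather than WGCNA's R API; the function name `adjacency` and the example correlation of 0.3 are illustrative assumptions, not taken from the thread.

```python
def adjacency(cor, power, network_type="unsigned"):
    """Soft-thresholded adjacency for a single correlation value.

    unsigned:      |cor|^power
    signed:        ((1 + cor) / 2)^power
    signed hybrid: cor^power if cor > 0, else 0
    """
    if network_type == "unsigned":
        return abs(cor) ** power
    if network_type == "signed":
        return ((1.0 + cor) / 2.0) ** power
    if network_type == "signed hybrid":
        return cor ** power if cor > 0 else 0.0
    raise ValueError(f"unknown network type: {network_type}")


# Hypothetical weak correlation between two genes.
weak_cor = 0.3
for power in (3, 6, 12):
    values = {
        t: adjacency(weak_cor, power, t)
        for t in ("unsigned", "signed", "signed hybrid")
    }
    print(power, values)
```

In an unsigned network, a correlation of 0.3 keeps an adjacency of 0.027 at power 3 but drops below 0.001 at power 6, so a low power leaves many weak connections in the network. The signed transformation maps a correlation of 0 to 0.5 before raising it to the power, which is one way to see why signed networks are usually given about twice the power of unsigned ones.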

I want to correlate the gene modules with continuous (BMI), categorical (e.g. smoking habits) and binary variables (e.g. mutation yes/no). Which correlation is best for all types of variables: bicor(x,y, robustY = FALSE, maxPOutliers = 0.05) or simple Pearson? Or is it best to use a separate correlation for each variable type? I have NAs in every kind of variable.

I would use Pearson correlation for both categorical and continuous variables. NAs should not be a problem.
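As a rough illustration of how missing trait values can be handled, the sketch below computes a Pearson correlation on pairwise-complete observations, which is the idea behind calling `cor(..., use = "p")` in R. It is plain Python, not WGCNA code; the helper name `pearson_pairwise` and the eigengene and BMI values are made up for the example, and missing values are represented as `None`.

```python
import math


def pearson_pairwise(x, y):
    """Pearson correlation over positions where both values are present.

    Returns (r, n), where n is the number of sample pairs actually used.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 3:
        raise ValueError("need at least 3 complete pairs")
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        raise ValueError("one variable is constant over the complete pairs")
    return sxy / math.sqrt(sxx * syy), n


# Hypothetical module eigengene and traits for six samples.
eigengene = [0.10, -0.20, 0.30, 0.00, -0.10, 0.25]
bmi = [22.0, None, 30.5, 25.0, 21.0, 29.0]   # continuous, one NA
mutation = [0, 0, 1, None, 0, 1]             # binary coded 0/1, one NA

print(pearson_pairwise(eigengene, bmi))
print(pearson_pairwise(eigengene, mutation))
```

Because each trait can be missing in different samples, the number of pairs used differs from trait to trait, and any p-value computed afterwards should be based on that per-trait count rather than on the total number of samples.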

ADD REPLY • link 5.4 years ago andres.firrincieli ▴ 50

score 2 · Answer 1 · 2019-09-20

2

Peter Langfelder ★ 3.0k

@peter-langfelder-4469

Last seen 3 months ago

United States

I'd go with 6 for unsigned or signed hybrid networks, and 12 for a signed network. Power 3 is really too low with 55 samples. As Andres mentioned, check the sample clustering tree for large drivers (strong branches); large modules are often the result of having very strong global drivers of expression. For working with categorical variables with more than 2 levels, you may want to read https://peterlangfelder.com/2018/11/25/working-with-categorical-variables/
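One common way to make a categorical variable with more than two levels usable in a correlation is to turn it into binary indicator variables. The sketch below shows two such encodings in plain Python: each level against all others, and each pair of levels against each other. The function names, the column naming scheme and the smoking data are assumptions made for illustration; this is not the implementation from the linked article or from WGCNA.

```python
def level_vs_all(values):
    """One 0/1 indicator per level: 1 for that level, 0 for any other level.

    Missing values (None) stay missing.
    """
    levels = sorted({v for v in values if v is not None})
    return {
        f"{level}.vs.all": [None if v is None else int(v == level) for v in values]
        for level in levels
    }


def pairwise_contrasts(values):
    """One indicator per pair of levels: 1 for the first-named level,
    0 for the second-named level, None for samples in neither level."""
    levels = sorted({v for v in values if v is not None})
    contrasts = {}
    for i, low in enumerate(levels):
        for high in levels[i + 1:]:
            contrasts[f"{high}.vs.{low}"] = [
                1 if v == high else 0 if v == low else None for v in values
            ]
    return contrasts


# Hypothetical smoking status for five samples, one missing.
smoking = ["never", "former", "current", None, "never"]

for name, column in level_vs_all(smoking).items():
    print(name, column)
for name, column in pairwise_contrasts(smoking).items():
    print(name, column)
```

Each resulting column is an ordinary binary variable, so it can be correlated with module eigengenes in the same way as a yes/no trait, with samples marked as missing simply left out of that particular correlation.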

ADD COMMENT • link 5.4 years ago Peter Langfelder ★ 3.0k

0

Thank you for your comments and support! I will check whether there are global drivers of expression. It might just be that those drivers are exactly the variables I am interested in.

ADD REPLY • link 5.3 years ago stu111538 • 0

0

I have a similar situation. I am working with 40 samples (20 groupA + 20 groupB). For the first part of my analysis, I used DESeq2 to identify DEGs between the groupA and groupB samples. I then used the VST-transformed values of ~16k genes (all protein-coding genes, filtered for low counts) for WGCNA. The sample dendrogram with traits clearly showed two groups. However, the soft-thresholding power I got was 4 at the 0.8 cutoff. Using this, I obtained 12 modules, of which a single module contained ~10k genes; I also observed that it had a high negative correlation with my trait of interest. Is getting such a large module usual? For the module-trait relationship, as the samples were from two groups, I coded groupA as 1 and groupB as 2. Is this the correct approach? Also, is there a way to tell which modules are related to groupA and which are related to groupB?
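On the coding question, a small sketch can show what the 1/2 coding does to a Pearson correlation. Pearson correlation does not change when a variable is shifted or rescaled by a positive factor, so coding the groups as 1/2 gives the same value as 0/1, and the sign indicates which group has the higher eigengene values. The code is plain Python; the helper `pearson` and the eigengene values are invented for illustration.

```python
import math


def pearson(x, y):
    """Plain Pearson correlation for two equal-length lists without missing values."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical eigengene: low in the first three samples (groupA),
# high in the last three (groupB).
eigengene = [-0.30, -0.25, -0.20, 0.22, 0.28, 0.25]

group_12 = [1, 1, 1, 2, 2, 2]   # groupA = 1, groupB = 2
group_01 = [0, 0, 0, 1, 1, 1]   # groupA = 0, groupB = 1
group_21 = [2, 2, 2, 1, 1, 1]   # coding reversed: groupA = 2, groupB = 1

r_12 = pearson(eigengene, group_12)
r_01 = pearson(eigengene, group_01)
r_21 = pearson(eigengene, group_21)

print(r_12, r_01, r_21)
```

With groupA coded 1 and groupB coded 2, a positive correlation means the module eigengene is higher in groupB and a negative correlation means it is higher in groupA; reversing the coding only flips the sign.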

ADD REPLY • link 5.2 years ago Arindam ▴ 80