Question

WGCNA - networking RNA seq gives only one module!

avelarbio46 ▴ 10

Hello everyone!

I'm running a WGCNA co-expression analysis on 29 samples related to a specific disease, using RNA-seq data with 100 million reads. After quality control, I had 53,000 genes measured in FPM. I then filtered out genes with zero variance, miRNAs and pseudogenes, as well as genes with fewer than 10 reads in more than 90% of samples. That left me with 14,000 genes.

When I run goodSamplesGenes, everything seems OK. I ran the soft-thresholding power analysis and ended up choosing a power of 7 (graphs attached).

[Image: soft power threshold plots]

Then I ran WGCNA blockwiseModules() with a block size of 5000 (I also tried 7000 and 8000, with no good results). I will run the analysis on the full data on our HPC.

But I got one big turquoise module, which is very unexpected.

[Image: WGCNA cluster dendrogram]

If I change the network type to "signed hybrid", I start getting a few more clusters.

Any ideas?

WGCNA R network co-expression • 5.6k views

Updated 2.3 years ago by Peter Langfelder ★ 3.0k • written 6.0 years ago by avelarbio46 ▴ 10

score 5 · Accepted Answer · 2019-05-01

Peter Langfelder ★ 3.0k

Did you actually look at the resulting module labels/colors (e.g. table(bwMods$colors)), or are you just looking at the dendrogram of the first block?

There are two issues. The first is that your scale-free fit plot suggests your network looks nothing like scale-free. This is often an indication that you have some sort of global expression driver (a strong batch, normalization or biological effect). Sample clustering may help visualize it.

The second issue is that with the low maxBlockSize you used, the calculation in blockwiseModules split your data into 3, possibly more blocks. What you see is the dendrogram for block 1. I suggest you read Tutorial I, Section 2a, paragraph "A second word of caution concerning block size" on page 3, and also Section 2c. Try plotting all of the block dendrograms (the code is in Section 2c of the tutorial).
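The two checks suggested above can be sketched in R as follows. This is a minimal sketch, not the tutorial's exact code: `bwMods` is assumed to be the list returned by blockwiseModules, and the helper names `min_blocks` and `plot_all_blocks` are made up for illustration. `min_blocks` only gives a lower bound on the number of blocks, because the actual split comes from the pre-clustering step inside blockwiseModules.

```r
# Lower bound on the number of blocks for a given maxBlockSize
# (hypothetical helper; the real split is decided by pre-clustering).
min_blocks <- function(nGenes, maxBlockSize) ceiling(nGenes / maxBlockSize)

min_blocks(14000, 5000)  # 3

# Tabulate module sizes over all blocks and plot one dendrogram per block.
# `bwMods` is assumed to be the return value of WGCNA::blockwiseModules().
plot_all_blocks <- function(bwMods) {
  if (!requireNamespace("WGCNA", quietly = TRUE)) stop("WGCNA is required")

  # Module sizes across ALL blocks, not only the first one
  print(table(bwMods$colors))

  # Convert numeric labels to color names if needed
  moduleColors <- if (is.numeric(bwMods$colors)) {
    WGCNA::labels2colors(bwMods$colors)
  } else {
    bwMods$colors
  }

  for (block in seq_along(bwMods$dendrograms)) {
    WGCNA::plotDendroAndColors(
      bwMods$dendrograms[[block]],
      moduleColors[bwMods$blockGenes[[block]]],
      "Module colors",
      main = paste("Gene dendrogram and module colors in block", block),
      dendroLabels = FALSE, hang = 0.03,
      addGuide = TRUE, guideHang = 0.05
    )
  }
  invisible(length(bwMods$dendrograms))
}
```

Calling `plot_all_blocks(bwMods)` on a blockwise result should draw one dendrogram per block, so a single large module in block 1 can be compared with what the other blocks contain.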

BTW, you don't need that much RAM for 14k genes; I have done such calculations in single block on an old laptop with 16GB RAM. It may be doable with 8GB, although it would probably be pushing it.
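As a rough illustration of the memory point, the sketch below estimates the size of one dense gene-by-gene matrix of doubles and shows a single-block call. The helper names `dense_matrix_gib` and `run_single_block` are made up for this example, the default arguments are assumptions (power 7 is the value chosen in the question), and the estimate covers only one matrix; the calculation holds several such matrices at once, and the exact number is not stated in this thread.

```r
# Size in GiB of one n x n matrix of doubles (8 bytes per entry).
dense_matrix_gib <- function(n) n^2 * 8 / 2^30

dense_matrix_gib(14000)  # about 1.46 GiB per matrix

# Single-block run: setting maxBlockSize to the number of genes keeps all
# genes in one block. `datExpr` is assumed to have samples in rows and
# genes in columns.
run_single_block <- function(datExpr, power = 7, networkType = "unsigned") {
  if (!requireNamespace("WGCNA", quietly = TRUE)) stop("WGCNA is required")
  WGCNA::blockwiseModules(
    datExpr,
    power = power,
    maxBlockSize = ncol(datExpr),  # no split into blocks
    networkType = networkType,
    numericLabels = TRUE,
    verbose = 3
  )
}
```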

Written 6.0 years ago by Peter Langfelder ★ 3.0k

Indeed, using all genes at once changed it a bit, but not too much as the turquoise block is still big. It is very strange to find a big cluster which is not expected for the disease I'm studying, but might be very helpful also! One thing is that I'm working with one tissue, so maybe the turquoise genes are correlated because of this (biological effect).

When doing cluster analysis, I get 2 to 3 clusters, but by using cluster validation they seem robust. I will do consensus analysis with my clinical data.

Do you have any advice on how to deal with categorical variables with more than 2 classes when correlating them with module eigengenes?

Written 6.0 years ago by avelarbio46 ▴ 10

You seem to have ignored the first issue I raised. I do suggest making sure your data do not contain large spurious expression drivers.

Regarding categorical variables, you may want to read https://peterlangfelder.com/2018/11/25/working-with-categorical-variables/.
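One common way to handle a trait with more than two classes is to turn it into 0/1 indicator variables and correlate each indicator with the module eigengenes. The base R sketch below shows a one-vs-rest coding; it does not reproduce the linked post, the helper name `binarize_vs_rest` is made up, and the trait and eigengene values are invented purely for illustration. WGCNA also provides its own helpers for this, such as binarizeCategoricalVariable.

```r
# One-vs-rest indicator (0/1) columns for a categorical trait.
binarize_vs_rest <- function(trait) {
  f <- factor(trait)
  ind <- sapply(levels(f), function(lv) as.numeric(f == lv))
  colnames(ind) <- paste0(levels(f), ".vs.rest")
  ind
}

# Illustrative data only: 6 samples, a 3-level trait, one made-up eigengene
stage <- c("A", "A", "B", "B", "C", "C")
ME <- c(0.5, 0.4, -0.1, 0.0, -0.4, -0.4)

ind <- binarize_vs_rest(stage)

# Correlation of the eigengene with each indicator (a 1 x 3 matrix)
cor(ME, ind)
```

Each correlation then reads as "this class versus all other samples", which is usually easier to interpret than a single correlation with arbitrary numeric class codes.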

Written 6.0 years ago by Peter Langfelder ★ 3.0k

Are there any techniques to test for spurious expression drivers? My PCA seemed good, with Dim 1 explaining 66% of the variation in the data and Dim 2 explaining 6%, which seemed good given how much data I used. I have also looked at VAT plots of the distance matrices, and they look pretty good too, indicating clear clustering.

I'm also starting consensus analysis and correlations to see whether any GO terms are enriched (biological validation of the correlations, also using the clValid package), but I'm having a hard time with categorical traits that are not ordinal and have more than two categories (those that might need dummy coding), simply because I have never worked with dummy variables and their interpretation. The same problem with categories applies to WGCNA. I think it would be nice if the package could use Gower distances, so that categorical data would not be treated so differently from numeric data (at least in terms of analysis, not interpretation).

Written 6.0 years ago by avelarbio46 ▴ 10

The first PC explains 66% of the variance? That's more than a module eigengene explains in the typical analyses that I do, which is usually in the 30-60% range. So obviously you have a major expression driver. What it is and whether it is spurious is difficult to say without a deep dive into what your data represent, how they were collected/generated, preprocessed etc. Some suggestions are laid out in WGCNA FAQ point 4 ("Can WGCNA be used to analyze RNA-Seq data?"). I would especially check the quantile scatterplots (calculate, say, the 30%, 50%, 70%, 80% and 90% quantiles for each sample, so you get 5 vectors of 29 quantiles, then use pairs to plot them; if you see strong correlations, chances are you have a normalization problem).
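The quantile check described above can be sketched in base R roughly like this. The helper name `sample_quantiles` is made up, `datExpr` is assumed to have samples in rows and genes in columns, and the `demo` matrix is simulated purely so the example runs; it is not real expression data.

```r
# Per-sample quantiles: one row per sample, one column per quantile.
sample_quantiles <- function(datExpr, probs = c(0.3, 0.5, 0.7, 0.8, 0.9)) {
  q <- t(apply(datExpr, 1, quantile, probs = probs, na.rm = TRUE))
  colnames(q) <- paste0("q", round(probs * 100))
  q
}

# Simulated stand-in for real data: 29 samples x 200 genes
set.seed(1)
demo <- matrix(rexp(29 * 200), nrow = 29)

q <- sample_quantiles(demo)

# Scatterplot matrix of the 5 quantile vectors across the 29 samples
pairs(q, main = "Per-sample expression quantiles")

# Numeric companion to the plot: pairwise correlations of the quantiles
round(cor(q), 2)
```

With real data, replace `demo` with the expression matrix. Strong correlations between the quantile columns would point to sample-wide shifts of the kind a normalization step is meant to remove.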

Written 6.0 years ago by Peter Langfelder ★ 3.0k

After some reading, I've been trying some other normalization techniques. Following the WGCNA guidance, I applied log2(x+1), which radically changed my results: I get more clusters, the scale-free fit now exceeds 0.9 at a power of 7, and PCA dimension 1 dropped to 17% of the variance. Do you have any advice on scaling (log2 vs. Z-score)?

Written 6.0 years ago by avelarbio46 ▴ 10

I am not sure what you mean by x. You mentioned above that you started with FPM; I suggested that you need to use some sort of normalization before anything else. Once the data are normalized, use log. The offset can be 1, although you could also try to choose an offset that makes the mean-variance relationship approximately constant.

Alternatively, you can start from the raw counts and apply varianceStabilizingTransformation in DESeq2. This includes both a normalization and a transformation step.
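Both routes can be sketched as below. This is an illustration under stated assumptions, not a recommended pipeline: upper-quartile scaling is just one possible normalization (the thread does not name a specific one), the helper names are made up, `mean_sd_trend` is a crude heuristic for comparing offsets, and `counts` is assumed to be a matrix of raw counts with genes in rows and samples in columns.

```r
# Route 1: normalize, then log-transform with an offset.
# Upper-quartile scaling is an assumption chosen for illustration.
upper_quartile_normalize <- function(counts) {
  uq <- apply(counts, 2, function(x) quantile(x[x > 0], 0.75))
  sweep(counts, 2, uq / mean(uq), "/")
}

log_transform <- function(norm, offset = 1) log2(norm + offset)

# Crude check of the mean-variance relationship: Spearman correlation
# between per-gene mean and SD. Values near 0 suggest a flatter trend.
mean_sd_trend <- function(logExpr) {
  cor(rowMeans(logExpr), apply(logExpr, 1, sd), method = "spearman")
}

# Route 2: variance-stabilizing transformation from raw counts (DESeq2).
# The result is transposed to samples in rows, as WGCNA expects.
vst_from_counts <- function(counts) {
  if (!requireNamespace("DESeq2", quietly = TRUE)) stop("DESeq2 is required")
  t(DESeq2::varianceStabilizingTransformation(as.matrix(counts), blind = TRUE))
}

# Usage sketch (not run), comparing a few offsets:
# norm <- upper_quartile_normalize(counts)
# sapply(c(1, 4, 16), function(o) mean_sd_trend(log_transform(norm, o)))
```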

Written 6.0 years ago by Peter Langfelder ★ 3.0k

Hi Peter, I am working on WGCNA for DNA methylation EPIC array data. For co-methylation module detection, I selected the 400,000 CpG probes with the highest variance across 71 samples. But I always get one big module, which makes the cluster dendrogram look very strange. Do you think I have the same problem as avelarbio46?

[Image: cluster dendrogram]

Do you have any suggestions? Thanks.

Written 2.3 years ago by shuo • 0

Chances are you have used the blockwise functionality of blockwiseModules. In that case, the dendrogram you see only refers to the first block. See the WGCNA tutorial, Section 2c: you need to loop over all blocks to plot the dendrograms in all blocks. Also, if you execute table(bwLabels) (substitute the variable that holds the module labels), you should see more modules and their sizes.

Written 2.3 years ago by Peter Langfelder ★ 3.0k