Hello everyone,
I'm using WGCNA on a pretty large RNA-seq dataset from soil: 600,000 genes after filtering out poor, spurious hits. I did a trial run with a subset of 4,000 genes on my laptop and it worked fantastically, so I am now applying it to the full dataset. I have a 68-CPU server with 1 TB of RAM to run the analysis on, and am currently using the following R call:
bwnet = blockwiseModules(datExpr,
                         maxBlockSize = 46000,
                         power = 18,
                         networkType = "signed",
                         TOMType = "signed",
                         minModuleSize = 30,
                         reassignThreshold = 0,
                         mergeCutHeight = 0.2,
                         numericLabels = TRUE,
                         saveTOMs = TRUE,
                         saveTOMFileBase = "permafrostmetaT-blockwise",
                         verbose = 3)
So I have 16 blocks to process, and after 56 hours I am at this point:
" Calculating module eigengenes block-wise from all genes
Flagging genes and samples with too many missing values...
..step 1
....pre-clustering genes to determine blocks..
Projective K-means:
..k-means clustering..
..merging smaller clusters...
Block sizes:
gBlocks
1 2 3 4 5 6 7 8 9 10 11 12 13
45968 45966 45643 45616 45425 44946 42957 41659 40476 38567 37969 37211 35505
14 15 16
34049 33345 18019
..Working on block 1 .
TOM calculation: adjacency..
adjacency: replaceMissing: 0
..will use 47 parallel threads.
Fraction of slow calculations: 0.000000
..connectivity..
..matrix multiplication.."
The server stats show that WGCNA is using 3.5% of the memory but only 1 CPU, so it doesn't seem like the parallelisation is working at the matrix multiplication step. It would be great to use all 47 threads to get things moving.
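For a rough sense of scale, here is a back-of-the-envelope sketch of how much memory one dense matrix takes for a block of this size. It assumes 8-byte double-precision values and a full square genes-by-genes matrix; the helper name block_matrix_gib is made up for illustration and is not part of WGCNA. How many such matrices WGCNA holds at once is not something this estimate tries to answer.

```r
# Approximate size (in GiB) of one dense n x n matrix of doubles.
# block_matrix_gib is a hypothetical helper, not a WGCNA function.
block_matrix_gib <- function(n_genes, bytes_per_value = 8) {
  n_genes^2 * bytes_per_value / 1024^3
}

max_block <- 46000                      # maxBlockSize from the call above
one_matrix <- block_matrix_gib(max_block)

# Print the estimate, rounded to one decimal place.
cat("One", max_block, "x", max_block, "matrix:",
    round(one_matrix, 1), "GiB\n")
```

This comes out at roughly 16 GiB per matrix, so the 3.5% of 1 TB reported by the server is in the same ballpark as a couple of such matrices held in memory. Memory does not look like the limiting factor here.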
Does anyone have any experience of running such a large dataset? And any tips on how to get it to use more threads? At this point, it looks as though things will take a very long time, even though there are a lot more CPU resources that could be used.
Thank you,
Caitlin
Hello Peter,
Thank you so much for that. It's great to know what sort of time to expect for each block. I will install a BLAS library and give it another go!
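For anyone following along, a minimal sketch of how one might check which BLAS the R session is linked against and time a dense matrix product before and after switching libraries. It assumes R 3.4 or later (for the BLAS field in sessionInfo() and for La_library()); the matrix size is arbitrary and only meant as a quick test, not a benchmark of the real blocks.

```r
# Report the BLAS and LAPACK libraries this R session is using (R >= 3.4).
si <- sessionInfo()
blas_path <- si$BLAS          # may be empty on some builds
lapack_path <- La_library()
cat("BLAS:  ", blas_path, "\n")
cat("LAPACK:", lapack_path, "\n")

# Time a dense matrix product on a small test matrix.
n <- 1000                      # arbitrary test size, far smaller than a real block
set.seed(1)
a <- matrix(rnorm(n * n), n, n)
timing <- system.time(b <- a %*% a)
print(timing)
```

With a multi-threaded BLAS, the "user" time is typically noticeably larger than the "elapsed" time, because several cores are working at once. With the reference BLAS that ships with R, the two are usually about the same.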
Kind regards,
Caitlin