How to compare two or more GSVA outputs, when it is not possible to run GSVA at once
1
1
Entering edit mode
hsiowa2 ▴ 20
@hsiowa2-23534
Last seen 15 months ago
South Korea

Hello.

I want to ask a question whether it is possible to compare two or more GSVA outputs (of same gene sets), when it is not possible to merge the data (i.e. too large gene expression data matrix for GSVA to run) and run GSVA at once.

It looks like when there are too many samples, GSVA takes too much time to run, or even have no progress even after long time of waiting. I'm curious if it is possible to split the samples into two or more groups, run GSVA separately to each groups, and compare the GSVA values with each other.

If it is not possible or right to do so, is there any way to do instead?

Thank you very much.

gsva • 4.4k views
ADD COMMENT
1
Entering edit mode
Robert Castelo ★ 3.4k
@rcastelo
Last seen 14 days ago
Barcelona/Universitat Pompeu Fabra

hi,

the last release version of GSVA 1.36.x, part of Bioconductor 3.11, running on top of the last release of R 4.0, allows one to run quickly GSVA on large input expression matrices with thousands of samples, as in those derived from consortium projects such as TCGA, using multiple cores and the BiocParallel infrastructure. Please check the argument BPPARAM in the help page of gsva().

if you are unsure about what version of GSVA and Bioconductor are you using, please give us the output of:

BiocManager::version()

cheers,

robert.

ADD COMMENT
0
Entering edit mode

Thank you very much for your reply. I'm really happy to hear about the updated version of GSVA.

I took a look at BPPARAM parameter in the new gsva function, but it seems difficult for me to understand. Could you give me an example code? Below is what I tried, and I don't know if this is the right way.

output <- gsva(geneexpressionforgsva ,genesets ,min.sz=5, max.sz=500, method = "gsva",mx.diff=TRUE, verbose=T, parallel.sz=3, BPPARAM=SerialParam(progressbar=T))

*in the help page originally it was BPPARAM=SerialParam(progressbar=verbose) but this gave an error saying verbose does not exist, and I changed it to T.

Thank you very much.

ADD REPLY
0
Entering edit mode

sure, let me first clarify that the "Usage:" section of a help page from a function in R normally shows the function call with all the arguments and their default values in the form f(argument=value) but this does not mean that you should call the function f() exactly in that way, it means that if you do not set that argument, then the function will use that default value. in fact, i see that in your case, it may suffice to set the argument parallel.sz=3, in addition to the input expression data matrix, the input list of gene sets and your desired cutoffs on the minimum and maximum gene set size.

the following simulation of an input expression matrix with 10k genes and 1k samples, and 1000 gene sets, runs in about three minutes using 4 cores in my computer:

p <- 10000
n <- 1000
sizeGeneSets <- sample(5:500, size=1000, replace=TRUE)
geneSets <- lapply(as.list(sizeGeneSets), sample,
                   x=paste0("g", 1:p), replace=FALSE)
names(geneSets) <- paste0("gs", 1:length(geneSets))
length(geneSets)
[1] 1000
y <- matrix(rnorm(n*p), nrow=p, ncol=n,
            dimnames=list(paste("g", 1:p, sep="") , paste("s", 1:n, sep="")))
dim(y)
[1] 10000  1000

library(GSVA)

es <- gsva(y, geneSets, parallel.sz=4))
Setting parallel calculations through a MulticoreParam back-end
with workers=4 and tasks=100.
Estimating GSVA scores for 1000 gene sets.
Estimating ECDFs with Gaussian kernels
Estimating ECDFs in parallel
iteration: 100
  |======================================================================| 100%
dim(y)
[1] 10000  1000

by setting the argument parallel.sz the gsva() function will try to run the calculations in parallel using a multicore backend with the number of cores specified in parallel.sz. if you need additional fine tuning of the parallelization then you need to set the argument BPPARAM. to learn how to set that argument you should read sections 1 to 3 of the vignette "Introduction to BiocParallel` from the BiocParallel package.

ADD REPLY
0
Entering edit mode

Thank you very much for your reply.

I tried gsvaparallel.sz=4) and it seemed to work, but after 100 iterations it said

Error: cannot allocate vector of size 7.9 Gb

which was quite frustrating. In fact, my gene expression matrix is much much larger than your example, and it took quite a long time for 100 iterations to finish.

Nevertheless, I am using 64GB RAM computer (Windows 10), and upon checking memory usage after the error message it was still about 48Gb, which seemed to have still more space to allocate 7.9 Gb.

Would there be any suggestions?

Thank you very much.

ADD REPLY
0
Entering edit mode

Could you tell us what are the dimensions of your gene expression matrix? Could you try running this in a linux machine?

ADD REPLY
0
Entering edit mode

Sure, the number of genes are about 58000 and the samples about 19000.

Um, how could I try this in Linux? Do I need separate computer or could I use Ubuntu in Windows? Or perhaps virtual machine?

ADD REPLY
0
Entering edit mode

You could filter out lowly-expressed genes and reduce the gene dimension, this can greatly reduce the memory requirements. what kind of expression units do you have? normalized log2-CPM?

my guess is that a linux system may have a better memory management so I think it would be best if you have access to a separate computer running linux, alternatively you may try running linux (e.g., Ubuntu as you suggest) in a VM on windows, or also you may install docker and run GSVA from the docker container. you can find instructions on how to run a pre-configured docker container for Bioconductor in this link.

ADD REPLY
0
Entering edit mode

Yes I guess it would be useful to filter out lowly-expressed genes. As I know people use criteria something like "genes that have 0 expressions in at least 70% of all samples" something like this, is this correct?

My expression matrix is RNASeq data, DESeq2-normalized, log2 transformed.

Thank you for the linux and related suggestions. I learned a lot.

ADD REPLY
0
Entering edit mode

The strategy of selecting genes with a minimum expression level in a minimum number of samples is usually applied when you have some intuition about what this minimum number of samples can be, such as the size of the smallest group in a comparison across two or more groups of samples. With 19000 samples, it may not be sensible to follow that strategy, so I'd simply plot the distribution of the mean expression values of genes, identify the lower mode corresponding to lowly-expressed genes across the 19000 samples and set a cutoff that discards genes in that mode, e.g., if y is your gene expression data matrix of normalized log2-transformed gene expression units, do something like:

avgexppergene <- rowMeans(y)
plot(density(avgexppergene), xlab="Gene average expression level")

let's say you consider your cutoff on the x-axis to be 1, but you should determine this from the plot:

mask <- avgexppergene > 1
y.filt <- y[mask, ]

and then call gsva() using y.filt.

ADD REPLY
0
Entering edit mode

Your suggestion worked perfectly. Thank you so much.

ADD REPLY

Login before adding your answer.

Traffic: 558 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6