GSVA (running group samples separately)
1
1
Entering edit mode
Jenn ▴ 30
@jenn-11033
Last seen 19 months ago
United States

Hi,

First of all, thanks for this great GSVA package.

When I ran GSVA (mx.diff=T) on healthy and diseased samples together, the enrichment scores are quite different from when I ran GSVA on healthy and diseased samples separately.

For example, when I plotted the distribution of enrichment scores for diseased samples when the samples were ran using GSVA separately from the healthy samples, i get a bimodal (a peak at somewhere ES > 0 and one at somewhere ES < 0 ).

When the diseased samples were ran using GSVA together with the healthy samples, the distribution of ES for diseased is mostly unimodal, for example ES > 0 (for genes in gene set that should be upregulated in disease vs. in healthy).

I looked at Fhat GSVA paper equation (1) and see that it is dependent on the distribution of the gene in all samples. So i tried to map out the intermediate steps to understand what each output is-- basically I see that when a gene that is expressed highly when compared to other genes within a diseased sample itself, but lowly expressed as compared to healthy samples and similarly expressed to other disease samples, the ranking of the gene dropped within the sample. However, when no healthy samples were provided, the rank of the gene is higher within the sample.

So, when only diseased samples are included, does it "exaggerate" the difference of the gene expression present in the gene set between diseased samples, hence resulting in the bimodal distribution of ES?

When diseased samples vs. healthy are ran together, technically healthy gene expression in the gene set should be quite different from diseased, hence it gives some weight to the genes in the diseased /healthy that results in the ES separating out disease from healthy.

Do i have the right picture or am I misunderstanding something?

Thanks,

Jenn

GSVA • 3.0k views
ADD COMMENT
0
Entering edit mode
Robert Castelo ★ 3.4k
@rcastelo
Last seen 4 weeks ago
Barcelona/Universitat Pompeu Fabra

hi Jenn,

Indeed, the output of GSVA depends on the number of input samples because the first step of the algorithm calculates an expression statistic that attempts to bring distinct gene expression profiles to a common scale. As explained in the paper, the motivation for this step is that sample-by-gene specific effects, caused for instance by differences in GC content or gene length, may result in identical expression values meaning slightly different things in different genes. While this is not relevant in a classical differential expression analysis, we thought this is important when you want to put together those expression values in a gene set. Because this first step is calculated for every gene across all samples, it will lead to different results depending on how many samples are available, which by the way, we recommend in the paper to be at least 10, but in general the larger the sample size, the more robust will be this first step to differences in the type and number of samples. The algorithm runs in an "unsupervised" manner simply meaning that it does not take into account any phenotype, it does not know whether a sample belongs to a healthy or to a diseased individual.

Regarding the distribution of GSVA enrichment scores, to give an informed opinion I would have to look at your actual data or see at least the plots of the distributions you're talking about. Do we talk about the distribution of scores across genes? across samples? have lowly-expressed genes been filtered out? how many samples are we talking about? what kind of samples are they? has data been normalized? etc.. ==========================Edited answer after discussion below==========

Since I don't have your data, I only can try to simulate it, you can run this by yourself:

library(GSVA)

p <- 10000 ## number of genes
n <- 100   ## number of samples
nGrp1 <- 50 ## number of samples in group 1
nGrp2 <- n - nGrp1 ## number of samples in group 2

## consider three disjoint gene sets
geneSets <- list(set1=sample(paste("g", 1:p, sep=""), size=200, replace=FALSE))

## sample data from a normal distribution with mean 0 and st.dev. 1
y <- matrix(rnorm(n*p), nrow=p, ncol=n,
            dimnames=list(paste("g", 1:p, sep="") , paste("s", 1:n, sep="")))

## consider samples divided in two groups and defined group2 as follows
grp2 <- (nGrp1+1):n

## genes in set1 are expressed at higher levels in the
## second group of samples
y[geneSets$set1, grp2] <- y[geneSets$set1, grp2] + 2
dim(y)
## [1] 10000 100

## run GSVA in all samples
es <- gsva(y, geneSets)

## subset data and the run GSVA
y2 <- y[, grp2]
dim(y2)
## [1] 10000   50
es1 <- gsva(y2, geneSets)
dim(es1)
## [1]  1 50

## run GSVA and then subset data
es2 <- gsva(y, geneSets)[, grp2, drop=FALSE]
dim(es2)
## [1]  1 50

## plot the results
par(mfrow=c(3, 1), mar=c(4, 5, 3, 2))
plot(density(es[1, ]), main="All data")
plot(density(es1[1, ]), main="Subset data")
plot(density(es2[1, ]), main="Subset results")

Comparison GSVA scores

As the plots show, only when using all the samples you get the bimodal distribution for a gene set which is differentially expressed in half of the samples. In the other two scenarios, where either you subset one half of the data and calculate GSVA enrichment scores, or you calculate the scores on all the data and look at the distribution of the corresponding half of the scores, the distribution is unimodal.

cheers,

robert.

ADD COMMENT
0
Entering edit mode

Hi Robert,

Thanks for the insights. To answer your questions,

I was referring to the distribution of scores across samples. https://ibb.co/7pT6b31 Lowly expressed have not been filtered out, i.e. I included all genes regardless of their expression. I have 83 tissue biopsy samples ( 36 healthy and 47 disease). They have been normalized by using neqc in limma because these are illumina bead chips microarrays data. 182 genes in Gene set A.

Thanks again!

ADD REPLY
0
Entering edit mode

Hi Jenn,

I would recommend to filter out lowly-expressed genes. Imagine a gene set for which a substantial fraction of its genes are not expressed above the detection threshold of the instrument throughout all the samples. In such a case, we are trying to get an estimate of relative expression for that gene set using a fraction of the data that is, in the case of a microarray, essentially fluorescence noise. Subsection 17.3 from the limma User's Guide illustrates how to process Illumina bead chip microarray data, including how to filter for lowly expressed genes after normalization with neqc(). In your case, I would keep probes that are expressed at least in 36 samples, which is the size of the smallest group.

Your last message ends with the sentence "Distribution of ES scores". If this was intended to be the caption of some plot, the plot did not make it through, I can't see any plot. Please provide as well the exact call that you made to gsva() with its arguments and corresponding values and any warning or relevant message of the gsva() function, if any.

ADD REPLY
0
Entering edit mode

Sorry! The plot image is attached here https://ibb.co/7pT6b31

I actually did use the p value detection level in limma to filter out genes detected in at least 36 samples.

I have the expression data in a matrix form. The genes included in the gene set are in genes$V1


setA<-list(setA=genes$V1)
Es<- gsva(as.matrix(alldata), setA, min.sz=0, max.sz=Inf, verbose=TRUE,  method="gsva")


diseasedata<-alldata[,grep("disease"), colnames(alldata)]
DiseaseEs<- gsva(as.matrix(diseasedata), setA, min.sz=0, max.sz=Inf, verbose=TRUE,  method="gsva")

ADD REPLY
0
Entering edit mode

The titles of the plot are a bit confusing, I understand that the plots show the non-parametric density estimation of the GSVA enrichment scores for one gene set (the one you mentioned about with 182 genes, I guess), but which one of the plots is produced with the calculations made from all the data and which one is made from the subset of the data with disease samples only (corresponding to the gsva() function called you give before). Could you also post the whole set of instructions? (the two calls to the gsva() function and the calls that produce the plot).

ADD REPLY
0
Entering edit mode

Hi Robert,

Please see below:

setA<-list(setA=genes$V1)

# subset data first and then run GSVA
diseasedata<-alldata[,grep("disease"), colnames(alldata)]
DiseaseEs<- gsva(as.matrix(diseasedata), setA, min.sz=0, max.sz=Inf, verbose=TRUE,  method="gsva")
plot(density(DiseaseEs), main="disease only Gene set A",xlab="Enrichment score")


# run GSVA first and then subset
Es<- gsva(as.matrix(alldata), setA, min.sz=0, max.sz=Inf, verbose=TRUE,  method="gsva")
Disease_es_subset<-Es[,grep("disease",colnames(Es))]
plot(density(Disease_es_subset), main="disease ES subset from disease+Healthy run ES Gene set A",xlab="Enrichment score")


Thanks!

ADD REPLY
0
Entering edit mode

Thanks, now I understand what you mean. However, the code above has an error in the call to the grep() function, concretely here:

diseasedata<-alldata[,grep("disease"), colnames(alldata)]

because grep() needs two arguments, I guess the actual code that should run is:

diseasedata<-alldata[,grep("disease", colnames(alldata))]

since the code you gave is not reproducible makes me think that in that very step something may be going differently than you think.

could you check that the dimensions of the object diseasedata are the expected ones, i.e., have the number of samples corresponding to the subset of data that you are supposed to select?

I'm saying this because I've made a simulation trying to recreate your data, which I'll add to my answer above, and I cannot see the effect you describe. By now, my only explanation is that you made a mistake in subsetting the data and your top plot corresponds to calculations made from the whole dataset. Please see my edited answer.

ADD REPLY
0
Entering edit mode

Hi Robert,

That was a typo, I was trying to replace the colnames with a more generic name, "disease", accidentally added a parenthesis there.

Thanks for the example given. Based on that, the "Subset data" ES scores distribution is unimodal at peak 0 whereas the "Subset results" ES scores distribution is unimodal peak at ~0.5.

Does this mean that gsva should be run on all the samples together instead of doing healthy and diseased separately? If I did diseased data only, the ES scores would be around zeros.

Although the method does not take into account the phenotypes info, the distribution of gene expression across samples would infer the phenotype info in calculating the enrichment scores, thus adjusting the scores? Thanks!

ADD REPLY
0
Entering edit mode

Hi Robert,

Just to add on top of my previous question. I think Illumina or Affy data are not normally distributed after normalization and removing lowly expressed genes, so I'll get a bimodal ES scores if I subset the healthy or diseased samples out before running GSVA. Agilent normalized data, however, are normally distributed and I can get the unimodal plot that you have up there if only I did it on agilent data.

ADD REPLY
0
Entering edit mode

hi Jenn,

yes, expression data is in general not normally distributed. What you select as a subset of disease samples may be formed by a very heterogeneous collection of cells. This means that you may have easily genes that are expressed in some of those disease samples and not expressed in other ones, leading to a non-unimodal distribution for your disease samples. I'm not surprised that different microarray technologies with distinct probe designs may capture those changes in different ways. Even probe annotations for a single technology change over time.

ADD REPLY
0
Entering edit mode

hi Jenn,

What you see in "Subset results" it's calculated exactly in the same way (check it on the code) as from "All data", but by subsetting the results you are simply zooming into the second mode.

When you give only a fraction of the data, then that's the dynamic range of expression that the algorithm is seeing with, in this case, very little variability and that's the reason why the distribution is symmetric and centered around zero.

Whether you should give as input all the samples or only a fraction of them depends on the question you want to answer. If you give to the algorithm a fraction of the data with no variability, the algorithm it's not going to capture any variability. If you give variability, it'll attempt to capture it. If that variability is important for what you're investigating, then better give it and give all the data. But I'd say this is in general for any kind of analysis, differential expression, etc.

Regarding your last question, think in terms of a single gene. When a single gene is upregulated in disease samples and downregulated in healthy samples, would you say that the gene "infers" the phenotype? I'd simply say that the gene is associated with the phenotype, but I might not have understood the question.

ADD REPLY
0
Entering edit mode

Hi Robert's

That's been very helpful. Thank you so much for the feedbacks and insights!

ADD REPLY

Login before adding your answer.

Traffic: 812 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6