Question

The difference when using Limma compare between 3 conditions vs 2 conditions

0

Entering edit mode

Chris ▴ 20

@3fdb6f97

Last seen 2 days ago

United States

Hi all,

I use limma to compare pathway score (from GSVA tool) between 3 conditions (normal control, early state, late state), I got no difference. Each condition have multiple samples. But when I remove early state observation from the data matrix, there are pathways significant different between normal control and late state. Would you please explain why? I thought when limma compare 3 conditions, it will pairwise compare them so if 2 conditions have different, 3 conditions also should have. I think apply limma is better than using anova and also can work on other data than microarray and RNA-seq. Thank you so much!

limma • 4.2k views

ADD COMMENT • link 12 months ago • updated 11 months ago Chris ▴ 20

score 2 · Accepted Answer · 2024-03-29

2

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 12 minutes ago

WEHI, Melbourne, Australia

I don't know what you mean when you say that you using limma to compare pathway scores from GSVA. Please show your code so we know what you're doing. Have a read of the posting guide.

When you say that you are comparing pathway scores between 3 conditions (normal control, early state, late state), does that mean you are conducting an F-test between all three conditions or does it mean that you are comparing late state to normal control? It would have to be the latter otherwise it's not comparable to the 2 condition comparison. You need to show code so that we know which test you are refering to.

Added 7 days later

I finally recalled that you gave code for what appears to be same problem 18 days ago on Biostars: https://www.biostars.org/p/9590349/ . This makes the problem clear. With three groups, you didn't specify which contrast you wanted to test, so limma did an anova-like F-test for any differences between the three groups. This means that the significant differences between NC and LS were diluted by the non-significant comparisons so you didn't get any significant results overall. When analysing the three conditions, to test for differences specifically between LS and NC you simply specify the appropriate contrast to topTable:

topTable(fit2, coef="NCvsLS")

That way, you will be doing the same comparison with 3 conditions as you did with 2 conditions.

ADD COMMENT • link 12 months ago Gordon Smyth 52k

0

Entering edit mode

Thank you for reply early on Saturday morning. Sorry if the question is not clear to you. Here is my code when compare 3 conditions:

design <- model.matrix(~ 0 + paths_vector) 
contrast_matrix <- makeContrasts(NCvsES=paths_vectorNC-paths_vectorES, NCvsLS=paths_vectorNC-paths_vectorLS, ESvsLS=paths_vectorES-paths_vectorLS, levels=design)

GSVA is a tool that will give me a matrix with rows instead of genes, it is pathways name. The value instead of gene expression, it is pathway score.

ADD REPLY • link 12 months ago Chris ▴ 20

3

Entering edit mode

Regarding your original question, limma results will change if you remove one of the conditions because the residual standard deviations will change. However you should not be removing any of the groups. You should instead explore why the early-state might be more variable than the other conditions using plotMDS(). If there are differences in variability between conditions, you might want to use arrayWeights() to account for that.

The limma analysis of GSVA pathway scores can be improved by allowing limma to take set size into account when modelling the variance. To do that, use trend = sqrt(setsize) when running eBayes(), where setsize is the number of genes in the pathway.

The analysis of pathway scores isn't terribly powerful unless you have large sample sizes. Have you considered analysing the pathways directly in limma using camera()?

ADD REPLY • link 12 months ago Gordon Smyth 52k

0

Entering edit mode

Thank you so much for a detail reply! I analyze thousand of pathways so the number of genes vary a lot, so how can I use setsize in this case?

ADD REPLY • link 12 months ago Chris ▴ 20

1

Entering edit mode

setsize is a vector, of length equal to the number of rows of data, i.e., the number of pathways. It gives the size of each individual pathway.

If you type ?eBayes the help page says: "trend can be a row-wise numeric vector, which will be used as the covariate for the prior variance"

ADD REPLY • link 12 months ago Gordon Smyth 52k

0

Entering edit mode

Thanks Gordon! I takes the pathways from MSigDB and I think they don't mention how many genes in a pathway. At the beginning, I want to compare if pathway score different between groups using anova test. Then I think limma maybe better even though people use it for microarray and RNA-seq but not pathway score.

ADD REPLY • link 12 months ago Chris ▴ 20

1

Entering edit mode

Yes, limma is much better than univariate anova and it is used to analyse pathway scores in the GSVA vignette. I am just giving you suggestions that would do even better.

Pathways are just lists of genes in R and there are various ways to count how many genes are in each set. For example:

> download.file("https://bioinf.wehi.edu.au/MSigDB/v7.1/Hs.c2.all.v7.1.entrez.rds", "c2.rds")
> c2 <- readRDS("c2.rds")
> lengths(c2)[1:10]
              KEGG_GLYCOLYSIS_GLUCONEOGENESIS                  KEGG_CITRATE_CYCLE_TCA_CYCLE 
                                           62                                            31 
               KEGG_PENTOSE_PHOSPHATE_PATHWAY KEGG_PENTOSE_AND_GLUCURONATE_INTERCONVERSIONS 
                                           27                                            28 
         KEGG_FRUCTOSE_AND_MANNOSE_METABOLISM                     KEGG_GALACTOSE_METABOLISM 
                                           34                                            26 
       KEGG_ASCORBATE_AND_ALDARATE_METABOLISM                    KEGG_FATTY_ACID_METABOLISM 
                                           25                                            42 
                    KEGG_STEROID_BIOSYNTHESIS           KEGG_PRIMARY_BILE_ACID_BIOSYNTHESIS 
                                           17                                            16

If you don't know how to do this with GSVA, you could ask the GSVA authors.

ADD REPLY • link 12 months ago Gordon Smyth 52k

0

Entering edit mode

I see. I ran different expression at pathway level without notice that limma being used. Thank you. Will try camera(). Do we have anything explain a little bit easier than this? https://www.rdocumentation.org/packages/limma/versions/3.28.14/topics/camera

I adjusted to follow reply by email but it didn't work, so I have to manually check the post if someone reply.

ADD REPLY • link 12 months ago Chris ▴ 20

1

Entering edit mode

Examples of camera workflows:

By the way, have you considered simply typing ?camera at the R prompt to see the help page for the function? Instead of doing a Google search and reading the help page from a decade ago.

ADD REPLY • link 12 months ago Gordon Smyth 52k

0

Entering edit mode

Hi Gordon,

I don't have rds file but gmt file https://data.broadinstitute.org/gsea-msigdb/msigdb/release/7.5/c2.all.v7.5.entrez.gmt. So may I know the equivalent function with readRDS to count the number of gene? Thanks Gordon.

ADD REPLY • link 12 months ago Chris ▴ 20

2

Entering edit mode

Hi Chris, here the GSVA maintainer, getting the gene set lengths that match the gene sets employed with GSVA may not be obvious for many users, so here is a way to do it from the GMT file you're referring to:

## use function read.gmt() from the qusage package to read the GMT file, another option would be
## the getGmt() function from the GSEABase package, but it chokes because this GMT file apparently
## contains a couple of gene sets with identical names (likely a misannotation)
c2 <- qusage::read.gmt("https://data.broadinstitute.org/gsea-msigdb/msigdb/release/7.5/c2.all.v7.5.entrez.gmt")

## discard duplicated gene sets
c2 <- c2[!duplicated(names(c2))]

## calculate sizes of gene sets
setsize <- lengths(c2)

## assuming that the output of gsva(), and input to the limma-trend pipeline, is called 'gsva_es'
## arrange the 'setsize' vector in one-to-one correspondence with the rows of 'gsva_es'
## overwriting the previous 'setsize' vector
setsize <- setsize[match(rownames(gsva_es), names(setsize))]

now you should be able to use setsize as argument to the trend parameter in the call to the eBayes() function from the limma-trend pipeline.

still, the final gene set sizes employed by gsva() might be slightly different to the input ones, due to internal filtering, I've opened an issue at the GSVA GitHub repo to give a solution to this problem.

ADD REPLY • link 12 months ago Robert Castelo ★ 3.4k

0

Entering edit mode

Thanks Robert! I successful ran limma with setsize parameter and the result slightly change. So the difference in pathway score between groups also depend on the number of genes in each pathway? Gordon explained above about the difference when comparing 3 vs 2 conditions but I am still not clear. Like when I compare how salty 3 types of pizza, I got no difference but comparing the first and third type of pizza, I got difference. When comparing 3 types, it also compare the first and the third one, right?

ADD REPLY • link 12 months ago Chris ▴ 20

1

Entering edit mode

Chris, the process of comparing three conditions is exactly the same for limma as it is for anova, and the same for pizza as it is for pathway scores. It is well known that a comparison that is significant for a two-group t-test can become nonsignificant in an anova context.

You need to show the limma code you are using for testing so that we can see what you are doing. I asked you to do that 4 days ago. The way you are asking the question is so vague that we can't tell which comparison you are testing.

I also advised you to make an MDS plot. Can you please make that plot and show the results.

ADD REPLY • link 12 months ago Gordon Smyth 52k

0

Entering edit mode

Hi Gordon, I have more than 2000 samples (around 1500 samples are normal control), so MDS plot is very messy. enter image description here

plotMDS(normalized_data)

ADD REPLY • link 12 months ago Chris ▴ 20

1

Entering edit mode

MDS <- plotMDS(normalized_data, plot=FALSE, gene.selection="common")
col <- paths_vector <- factor(paths_vector, levels=c("NC","ES","LS"))
levels(col) <- c("black","blue","red")
plotMDS(MDS, pch=16, cex=0.3, col=as.character(col))

ADD REPLY • link 12 months ago Gordon Smyth 52k

0

Entering edit mode

Thanks Gordon! We can change color to distinguish between blue and black. For RNA-seq, we normalize by gene length and sequencing depth. In pathway score, I am not sure what we normalize when using:

normalized_paths <- normalizeBetweenArrays(pathway_data, method = "quantile")

enter image description here

ADD REPLY • link 12 months ago Chris ▴ 20

1

Entering edit mode

This is not helpful. I asked you to make an MDS plot of the same pathway scores that you input to limma, but this is something different.

The plot you've posted today is showing different data than the MDS plot yesterday. I had thought that you would just change the labels into coloured dots without changing the data.

You say that you have two groups, but the plot shows only two groups.

You say that you have 2000 samples but the plot shows only around 100 points. Perhaps you have not plotted the control samples at all.

The plot shows absolutely no evidence of any differences between the two groups shown, whatever they are.

ADD REPLY • link 12 months ago Gordon Smyth 52k

1

Entering edit mode

When you ask a question here on the Bioconductor forum, you are always asked to show the code that is causing you a problem. For some reason, you didn't want to do that here, even after being asked for it. I finally remembered though that you did show code 18 days ago on Biostars for what seems like the same analysis: https://www.biostars.org/p/9590349/

Seeing the code makes your problem clear, so I have edited my answer above to give the solution.

ADD REPLY • link 12 months ago Gordon Smyth 52k

0

Entering edit mode

Sorry for not provide full code! I thought it was not necessary.
compare_paths is a dataframe with around 2,000 rows, the first column is state (with values NC, ES, LS), other columns are pathway name.

vector_paths <- compare_paths$state
paths_vector <- factor(vector_paths)
design <- model.matrix(~ 0 + paths_vector)
contrast_matrix <- makeContrasts(ESvsNC=paths_vectorES-paths_vectorNC, LSvsNC=paths_vectorLS-paths_vectorNC, ESvsLS=paths_vectorES-paths_vectorLS, levels=design)

Convert value to numeric

rownames_transpose_paths <- rownames(transpose_paths)
numeric_paths <- apply(transpose_paths, 2, as.numeric)
rownames(numeric_paths) <- rownames_transpose_paths

normalized_paths <- normalizeBetweenArrays(numeric_paths, method = "quantile")

MDS <- plotMDS(normalized_paths, plot=FALSE, gene.selection="common")
col <- paths_vector <- factor(paths_vector, levels=c("NC","ES","LS"))
levels(col) <- c("black","blue","red")
plotMDS(MDS, pch=16, cex=0.3, col=as.character(col))

So as you suggested, I should not remove the early state groups, so when using topTable(), what I should put in coeff? Normally, I see people compare between 2 conditions (disease vs control, treated vs control) so I am not sure what statistic should use in 3 conditions.

ADD REPLY • link 12 months ago Chris ▴ 20

0

Entering edit mode

Hi Robert, If I don't want to use all pathway in c2 gene set but only a few pathways of interested, how can I do that? So we try to get a gmt file of only some pathways? In my case, the result when apply trend =sqrt(setsize) or not, doesn't change the result but want to try the good practice. Thank you so much!

ADD REPLY • link 12 months ago Chris ▴ 20

1

Entering edit mode

Hi, your c2 object storing all the gene sets is probably a list object, where element names are the names of the pathways given by MSigDb. The setsize object is a vector of integer values storing the sizes of gene sets, in parallel to c2, this means the first value of setsize correspond to the number of genes in the first element of c2, and so on. Therefore, to work with a few out of those pathways in the c2 list object, you only need to _subset_ both, the c2 object and the setsize object, to those few pathways of interest. For that purpose you need to know either the names given by MSigDb of those few pathways or the position where they are in c2 and setsize. If you are unsure about how to do this, probably you need to take a course on the basics of using R.

ADD REPLY • link 12 months ago Robert Castelo ★ 3.4k

0

Entering edit mode

Thanks Robert!

ADD REPLY • link 12 months ago Chris ▴ 20

0

Entering edit mode

When using arrayWeights(), you mean the variability within the same group, is that correct? I alway run arrayWeights() because I naive think it would be helpful most of the time.

ADD REPLY • link 11 months ago Chris ▴ 20