Contrasts in GOexpress analysis?
2
0
Entering edit mode
mforde84 ▴ 20
@mforde84-12150
Last seen 7.4 years ago

Hi,

I'm a little confused by what's actually going on with the analysis for this package.

The main function can take a factor of any size and returns enriched GO terms and pval based upon those factors using random forest regressions.

I'm having difficulty understanding what the returned pvalues because there's no mention what what the contrasts are. For example, say I ran the analysis with a treatment factor having three independent levels. It appears that the pvalue returned is for the overall effect a GO term across all factors (e.g., sort of like an F statistic). I'd like to be able to do contrasts between groups, e.g., grps1 - 2, grps1 - 3, and grps2 - 3.  I was able to use subEset() function to run a 2 factor analysis (e.g., subEset(eSet, subset=list(Treatment=c("grp1", "grp2"))). I'm assuming this is a one tail test (e.g., only significant term enrichment and doesn't include depletion) I wasn't sure enriched terms are being contrasted grp1-2 or grp2-1, or if it's a combination of both.

Any one have any thoughts on what these pvalues mean or suggestions of R packages which might be better suited for the contrasts I want to analyze?

Martin

gene ontology goexpress • 1.8k views
ADD COMMENT
3
Entering edit mode
kevin.rue ▴ 350
@kevinrue-6757
Last seen 7 months ago
University of Oxford

Dear Martin,

"I'm having difficulty understanding what the returned pvalues because there's no mention what what the contrasts are. [...] It appears that the pvalue returned is for the overall effect a GO term across all factors (e.g., sort of like an F statistic)."

Indeed you are right: GOexpress was designed to estimate how well GO discriminate all levels of the factor analysed. As Gordon pointed out, there are already various methods designed to address pairwise contrasts. 

Now, regarding the actual meaning of the P-value, I assume you already had a look at the vignette, section "Permutation-based P-value for ontologies", which I reference below:

"To assess the significance of GO term ranking — or scoring —, we implemented a permutation-based function randomising the gene feature ranking, and counting how many times each GO term is ranked (scored) equal or higher than the real rank (score)."

In other words, the P-values do not estimate enrichment of GO terms within a gene list (the random forest return the full list of genes). Rather P-values estimate how often each GO term would rank/score higher by chance, than they did in the ranking/scoring of genes obtained from the random forest (RF).
More clearly, P-values are obtained as follows:

  • the full list of genes is randomised N times,
  • for each randomisation, the rank/score of each GO term is calculated and compared to rank/score obtained by the RF
  • P-value(GOterm) = {count of randomisation where GOterm ranked/scored better than in RF} / {N}

As a consequence, to answer your other question, this is indeed equivalent to a one-tailed test. However, not for enrichment of genes in a list but rather for significance of the rank/score relative to a random gene list.
A correlate of the above explanation is that terms are not contrasted in any direction between the groups of samples, rather they are tested for the frequency at which their associated genes rank/score higher than expected by chance in the RF in the task of classifying the samples into their known groups.

Given your interest in pairwise contrasts, I think Gordon's suggestion is better suited. But for what it is worth, your use of subset() would have been exactly my suggestion for an alternative approach using GOexpress. Actually, I would be quite interested if you were keen to run both and share your impression of the comparison.

All the best,
Kevin

 

ADD COMMENT
0
Entering edit mode

Hi Kevin,

Thanks for the clarifications. I'll run some test tomorrow, and do some comparisons and see what looks suitable. Again, thanks for the nice response.

Martin

ADD REPLY
2
Entering edit mode
@gordon-smyth
Last seen 1 hour ago
WEHI, Melbourne, Australia

Do you require random forest regression? Since you're already using the limma package ( limma -- posthoc analysis when design factor has 3 levels ), have you tried the goana function for a more traditional GO analysis?

I'll continue the code that Aaron gave you in response to your previous post. If your data is human and has Entrez Gene IDs as row names, you might proceed:

group <- factor(rep(1:3, c(nGrp1, nGrp2, nGrp3))
design <- model.matrix(~0 + group)
con <- makeContrasts(
   G1vs2 = group1 - group2,
   G1vs3 = group1 - group3,
   G3vs2 = group3 - group2, levels=design)
fit <- lmFit(y, design)
​fit2 <- contrasts.fit(fit, con)
fit2 <- eBayes(fit2)
g <- goana(fit2, coef="G1vs2", species="Hs")
topGO(g)
g <- goana(fit2, coef="G1vs3", species="Hs")
topGO(g)

etc.

 

ADD COMMENT
0
Entering edit mode

Thanks, Gordon. This is very helpful.

ADD REPLY

Login before adding your answer.

Traffic: 462 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6