Question

GOstats - are significantly enriched GO terms with a small "size" meaningful?

1

Entering edit mode

joasmile84 ▴ 20

@joasmile84-20206

Last seen 2.2 years ago

Switzerland

Hi all,

I am trying to perform a GO enrichment analysis on a list of differentially expressed genes and I am having doubts on the robustness of the results. I am using the GOstats package and the organism is Apis mellifera, so I had to make a custom gene universe set, which consists of 7903 genes for which GO terms were available. My codes are as follows:

library(GOstats)
library(GSEABase)
library(GOstatsPlus)

require(biomaRt)

mart <- useMart('metazoa_mart', host = 'metazoa.ensembl.org') 
mart <- useDataset('amellifera_eg_gene', mart)

annot <- getBM(
    mart = mart,
    attributes = c(
        'go_id',
        'go_linkage_type',
        'ensembl_gene_id'),
    uniqueRows = TRUE)

library(dplyr)
annot<-subset(annot, trimws(annot$go_id) !="")

goFrame=GOFrame(annot,organism="Apis mellifera")
goAllFrame=GOAllFrame(goFrame)

library("GSEABase")
gsc <- GeneSetCollection(goAllFrame, setType = GOCollection())

gene_universe = head(unique(unlist(geneIds(gsc))),100000)

library(stringi)
gene_IDs_of_interest<-stri_join(AllDEGs,  sep="")

params <- GSEAGOHyperGParams(name="My Custom GSEA based annot Params",
                                                           geneSetCollection=GO_universe,
                                                           geneIds = gene_IDs_of_interest,
                                                           universeGeneIds = gene_universe,
                                                           ontology = "BP",
                                                           pvalueCutoff = 0.05,
                                                           conditional = FALSE,
                                                           testDirection = "over")

BP_Over <- hyperGTest(params)

summary(BP_Over)

My DEG list consisted of 96 genes, so rather short, but I get 52 biological process terms to be significant at a 0.05 p value cutoff. However, when I look more closely I see that many of these significant terms have a size of 1. For example:

GOBPID Pvalue OddsRatio ExpCount Count Size

GO:0046098 0.007278609 Inf 0.007278609 1 1

So, my question is whether this is an OK result? Are terms with size 1 ok to keep in the test or should have I applied some initial filtering? I could not find a discussion of this in the GOstats vignettes so I hope you can help me.

Another doubt is, should I correct these GO term p values for multiple testing? My approach has so far been to not apply an FDR correction at this stage but then "condense" the results with a tool like Revigo to bin terms by semantic similarity to better summarize the results.

Would you have any other suggestion to improve the accuracy of these analyses?

Thank you very much in advance for your kind help!

GOstats GO RNASeq • 2.3k views

ADD COMMENT • link updated 3.4 years ago by James W. MacDonald 68k • written 3.4 years ago by joasmile84 ▴ 20

score 0 · Answer 1 · 2021-11-19

0

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 2 hours ago

United States

I normally do

summary(BP_over, categorySize = 10)

Because significant results for GO terms with fewer than 10 genes may reflect mere chance rather than evidence for enrichment.

ADD COMMENT • link 3.4 years ago James W. MacDonald 68k

score 0 · Answer 2 · 2021-11-19

hi, one could say that is easier to enrich a gene set that has just one gene, than a gene set that has 100, and from that perspective you may want to filter out gene sets below certain size. In the end, what you're looking for is replicability, i.e., the biological function that you see altered in an experiment, by seeing some enriched gene set, you expect to see it altered in another dataset obtained by conducting the same experiment. A way to attempt selecting altered biological functions that replicate is to keep gene sets with a robust enrichment and that robustness comes usually from having a minimum amount of genes enriching that gene set, something like 3, 4 or 5 genes. you can do that by filtering on the column Count from the data.frame returned by summary(). another useful way to look at those results is ordering them by odds ratio (column OddsRatio) in decreasing order, since that gives you a magnitude of the enrichment, and maybe discarding those enrichments with OR < 1.5.

with respect to multiple testing, it's more tricky because here the one-tailed Fisher's exact test p-values returned by GOstats are not independent, in the sense that overlapping GO terms may have correlated p-values. In general this means that you can potentially be over-correcting those p-values, which in the end means the correction may be conservative. Others in this forum may have more informed or specific opinions or advice about this. If you decide anyway to apply some p-value correction, one problem with your code above is that the function summary() is only giving you the GO terms with a p-value below the threshold you put on the GSEAGOHyperGParams object (pvalueCutoff = 0.05). So, you would need to build another data.frame with all the GO terms tested by GOstats and apply the correction there. You can build such a data.frame as follows:

BP_Over_All <- data.frame(GOBPID=names(geneIdUniverse(BP_Over)),
                          Pvalue=pvalues(BP_Over),
                          OddsRatio=oddsRatios(BP_Over),
                          ExpCount=expectedCounts(BP_Over),
                          Count=geneCounts(BP_Over),
                          Size=universeCounts(BP_Over))

and then do the correction on BP_Over_All$Pvalue.