Hi all,
I am trying to perform a GO enrichment analysis on a list of differentially expressed genes and I am having doubts on the robustness of the results. I am using the GOstats package and the organism is Apis mellifera, so I had to make a custom gene universe set, which consists of 7903 genes for which GO terms were available. My codes are as follows:
library(GOstats)
library(GSEABase)
library(GOstatsPlus)
require(biomaRt)
mart <- useMart('metazoa_mart', host = 'metazoa.ensembl.org')
mart <- useDataset('amellifera_eg_gene', mart)
annot <- getBM(
mart = mart,
attributes = c(
'go_id',
'go_linkage_type',
'ensembl_gene_id'),
uniqueRows = TRUE)
library(dplyr)
annot<-subset(annot, trimws(annot$go_id) !="")
goFrame=GOFrame(annot,organism="Apis mellifera")
goAllFrame=GOAllFrame(goFrame)
library("GSEABase")
gsc <- GeneSetCollection(goAllFrame, setType = GOCollection())
gene_universe = head(unique(unlist(geneIds(gsc))),100000)
library(stringi)
gene_IDs_of_interest<-stri_join(AllDEGs, sep="")
params <- GSEAGOHyperGParams(name="My Custom GSEA based annot Params",
geneSetCollection=GO_universe,
geneIds = gene_IDs_of_interest,
universeGeneIds = gene_universe,
ontology = "BP",
pvalueCutoff = 0.05,
conditional = FALSE,
testDirection = "over")
BP_Over <- hyperGTest(params)
summary(BP_Over)
My DEG list consisted of 96 genes, so rather short, but I get 52 biological process terms to be significant at a 0.05 p value cutoff. However, when I look more closely I see that many of these significant terms have a size of 1. For example:
GOBPID Pvalue OddsRatio ExpCount Count Size
GO:0046098 0.007278609 Inf 0.007278609 1 1
So, my question is whether this is an OK result? Are terms with size 1 ok to keep in the test or should have I applied some initial filtering? I could not find a discussion of this in the GOstats vignettes so I hope you can help me.
Another doubt is, should I correct these GO term p values for multiple testing? My approach has so far been to not apply an FDR correction at this stage but then "condense" the results with a tool like Revigo to bin terms by semantic similarity to better summarize the results.
Would you have any other suggestion to improve the accuracy of these analyses?
Thank you very much in advance for your kind help!
I believe that BH remains accurate under positive dependence, and BY remains accurate under arbitrary dependence structures. That is, if I understand the second reference in ?p.adjust.
Gordon Smyth is probably the one to ask though, since he's the one who contributed the code for BH and BY in
p.adjust
, and IIRC, the interpretation of BH in R and the original are slightly different (the original estimating the proportion of false positives and Gordon's version estimating the maximum proportion of false positives).