I'm trying to use clusterProfiler to perform gene set enrichment analysis. My differentially-expressed gene dataset has Ensembl IDs in addition to log2foldChange values.
First I ran the following chunk of code, and returned a "no term enriched under specific pvalueCutoff" error:
> gse_hfd <- gseGO(geneList = hfd_gene_list, ont = "ALL", keyType = "ENSEMBL", minGSSize = 3, verbose = TRUE, OrgDb = org.Mm.eg.db, pAdjustMethod = "none")
preparing geneSet collections...
GSEA analysis...
no term enriched under specific pvalueCutoff...
Warning message:
In preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, :
There are ties in the preranked stats (16.93% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
I re-ran the code with the p-value cutoff set to 1 and only returned 3 GO terms, each associated to one of the sub-ontologies:
> gse_hfd <- gseGO(geneList = hfd_gene_list, ont = "ALL", keyType = "ENSEMBL", minGSSize = 1, pvalueCutoff = 1.00, verbose = TRUE, OrgDb = org.Mm.eg.db, pAdjustMethod = "none")
preparing geneSet collections...
GSEA analysis...
leading edge analysis...
done...
Warning message:
In preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, :
There are ties in the preranked stats (16.93% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
> print(gse_hfd)
#
# Gene Set Enrichment Analysis
#
#...@organism Mus musculus
#...@setType GOALL
#...@keytype ENSEMBL
#...@geneList Named num [1:34284] 10.05 5.79 5.06 4.84 4.63 ...
- attr(*, "names")= chr [1:34284] "ENSMUSG00000110631.1" "ENSMUSG00000117907.1" "ENSMUSG00000021803.10" "ENSMUSG00000118251.1" ...
#...nPerm
#...pvalues adjusted by 'none' with cutoff <1
#...3 enriched terms found
'data.frame': 3 obs. of 12 variables:
$ ONTOLOGY : chr "MF" "CC" "BP"
$ ID : chr "GO:0003674" "GO:0005575" "GO:0008150"
$ Description : chr "molecular_function" "cellular_component" "biological_process"
$ setSize : int 1 1 1
$ enrichmentScore: num -0.719 -0.719 -0.719
$ NES : num -0.962 -0.962 -0.962
$ pvalue : num 0.551 0.551 0.551
$ p.adjust : num 0.551 0.551 0.551
$ qvalues : num 0.551 0.551 0.551
$ rank : num 9632 9632 9632
$ leading_edge : chr "tags=100%, list=28%, signal=72%" "tags=100%, list=28%, signal=72%" "tags=100%, list=28%, signal=72%"
$ core_enrichment: chr " " " " " "
When I tried to plot my results, I got a term similarity error, so I did the following:
> emapplot(gse_hfd)
Error in has_pairsim(x) :
Term similarity matrix not available. Please use pairwise_termsim function to deal with the results of enrichment analysis.
> pairwise_termsim(gse_hfd, method = "JC", semData = NULL, showCategory = 300)
#
# Gene Set Enrichment Analysis
#
#...@organism Mus musculus
#...@setType GOALL
#...@keytype ENSEMBL
#...@geneList Named num [1:34284] 10.05 5.79 5.06 4.84 4.63 ...
- attr(*, "names")= chr [1:34284] "ENSMUSG00000110631.1" "ENSMUSG00000117907.1" "ENSMUSG00000021803.10" "ENSMUSG00000118251.1" ...
#...nPerm
#...pvalues adjusted by 'none' with cutoff <1
#...3 enriched terms found
'data.frame': 3 obs. of 12 variables:
$ ONTOLOGY : chr "MF" "CC" "BP"
$ ID : chr "GO:0003674" "GO:0005575" "GO:0008150"
$ Description : chr "molecular_function" "cellular_component" "biological_process"
$ setSize : int 1 1 1
$ enrichmentScore: num -0.719 -0.719 -0.719
$ NES : num -0.962 -0.962 -0.962
$ pvalue : num 0.551 0.551 0.551
$ p.adjust : num 0.551 0.551 0.551
$ qvalues : num 0.551 0.551 0.551
$ rank : num 9632 9632 9632
$ leading_edge : chr "tags=100%, list=28%, signal=72%" "tags=100%, list=28%, signal=72%" "tags=100%, list=28%, signal=72%"
$ core_enrichment: chr " " " " " "
I tried again with only the Biological Processes sub-ontology instead of using ALL as I had been doing earlier, and gain only returned the GO term associated with the sub-ontology:
> gse_hfd_bp <- gseGO(geneList = hfd_gene_list, ont = "BP", keyType = "ENSEMBL", minGSSize = 1, pvalueCutoff = 1.00, verbose = TRUE, OrgDb = org.Mm.eg.db, pAdjustMethod = "none")
preparing geneSet collections...
GSEA analysis...
leading edge analysis...
done...
Warning message:
In preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, :
There are ties in the preranked stats (16.93% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
> print(gse_hfd_bp)
#
# Gene Set Enrichment Analysis
#
#...@organism Mus musculus
#...@setType BP
#...@keytype ENSEMBL
#...@geneList Named num [1:34284] 10.05 5.79 5.06 4.84 4.63 ...
- attr(*, "names")= chr [1:34284] "ENSMUSG00000110631.1" "ENSMUSG00000117907.1" "ENSMUSG00000021803.10" "ENSMUSG00000118251.1" ...
#...nPerm
#...pvalues adjusted by 'none' with cutoff <1
#...1 enriched terms found
'data.frame': 1 obs. of 11 variables:
$ ID : chr "GO:0008150"
$ Description : chr "biological_process"
$ setSize : int 1
$ enrichmentScore: num -0.719
$ NES : num -0.957
$ pvalue : num 0.538
$ p.adjust : num 0.538
$ qvalues : logi NA
$ rank : num 9632
$ leading_edge : chr "tags=100%, list=28%, signal=72%"
$ core_enrichment: chr " "
Now, when I began manually inputting my data into WebGestalt's online API, none of my genes were mapped until I started deleting the decimals from my Ensembl IDs (ex. ENSMUSG00000110631.1 -> ENSMUSG00000110631).
Only then did I start returning more "real" GO terms from my data. I'd really prefer to leave them untouched, but is my Ensembl ID formatting probably to blame for my inability to return more results?
You can get rid of gene number version :
Then just perform your GSEA the same way.
A strict "." token match would be safer, i.e.,
"\\.[0-9]+$"
.