Hi there! I would like to perform KEGG enrichment with some differentially expressed gene data from RNAseq data. I am working on a non-model organism.
I have 1) KEGG to GeneName Mapping
head(expr5_FS_final)
KEGG unigene_FS
1 K02727 FS_gene_1
2 K17277 FS_gene_3
3 K17307 FS_gene_10
4 K14453 FS_gene_11
5 K14700 FS_gene_11
6 K14701 FS_gene_11
2) A list of genes of interest (DEGs)
head(DEG)
DEG
1 FS_gene_1
2 FS_gene_38
3 FS_gene_101
I used the the list of DEG to grep KEGG to GeneName Mapping file, then got a KEGG to GeneName Mapping file just for DEGs.
head(DEG_KEGG)
KEGG DEG
1 K02727 FS_gene_1
2 K17277 FS_gene_1
3 K17307 FS_gene_38
...
Then I use enr_results <- enrichKEGG(DEG_KEGG$KEGG, organism='ko', minGSSize = 1, pvalueCutoff = 0.05, qvalueCutoff = 0.05)
to do KEGG enrichment.
I also tried enr_results2 <- enricher(DEG$DEG , TERM2GENE=expr5_FS_final, pvalueCutoff = 0.05, pAdjustMethod = "BH", qvalueCutoff = 0.05, minGSSize = 5)
Which one is correct? Could some one help me have a look?
Thanks a lot!
In essence both functions should should give identical results, provided the input is also the same.
enricher
is a universal function for gene set overrepresentation analysis (ORA), and requires all arguments to be provided by the user, includingTERM2GENE
.TERM2GENE
links genes to gene sets, and these sets can be defined by Gene Ontology, KEGG, or whatever source.enrichKEGG
can be considered as a convenience function in the sense that it has been 'configured' to perform an ORA using gene sets defined by KEGG. In other words, there is no need for a user to provide input for theTERM2GENE
(andTERM2NAME
) arguments, since these will be automagically generated.So, if the inputs are identical, the the results should be the same. Please note that the values that you used for
minGSSize
,pAdjustMethod
andqvalueCutoff
differ between your 2 function calls.Thank you so much! I am confused because that I think gene set KEGG overrepresentation analysis (ORA) should take 2 arguments: 1) The KEGG universe, in my case, the KEGG annotation of the whole transcriptome (KEGG to gene mappings) 2) A set of genes of interests, in my case, up-regulated gene list from DE comparison
However,
enrichKEGG
seems only takes the KEGG to gene mappings of the DEGs. How does it perform ORA without the "The KEGG universe (file 1)". Does it compare to the whole KEGG database?By default, the argument
universe
is indeed set toNULL
.If you check the source code of the functions, it becomes clear that in case
enrichKEGG
is run withuniverse = NULL
, as universe (background) all genes are used that have been annotated to a gene set by the KEGG curators. For example, for human KEGG has annotated 6451 genes to gene sets (pathways). Also, genes that are in your input, but are not present in a gene set, are filtered out.If you provide your own gene sets using the function
enricher
(through the argumentTERM2GENE
), withuniverse=NULL
, then all genes that are present in the data frameTERM2GENE
are used as universe (background genes).Lastly, if you do provide a list of background genes (i.e.
universe = names(geneList)
), then the gene sets are filtered so that only genes that are present in the list of background genes are kept within each gene set. ORA will then be performed on the filtered gene sets.See below for some code to illustrate these points.
This is so helpful! Thank you so much!
Last question, how do you determine those 3 parameters
minGSSize = 10, maxGSSize = 500, qvalueCutoff = 0.2
?I know qvalue is kind of similar to adjusted p value, would 0.2 be too high? What does
minGSSize
andmaxGSSize
mean? and are there any "gold standards" for it?I haven't been able to make a new post on the forum for some reason, but I had a question relating to the custome annotations in
enricher
. InTERM2GENE
, I pass the genes targeted by a TF in column 2, and column 1 is the TF, which is the same for all rows. Input is a list of genes. I wanted to check if the overlap between these two lists is significant.However, I get a NULL result - No gene sets have size between 0 and 50 .. > return NULL - for all TF target genes and input genes combination.
Could anyone point me in the right direction? Thanks.
It is not appreciated to post the same question multiple times!
Also: what is not clear on the message
No gene sets have size between 0 and 50
?