Hello there,
I am running a KEGG enrichment analysis, but about 2,500 of my 14,700 genes are not converted to KEGG identifiers by keggConv. Presumably this is because they are not in the KEGG database.
Should we exclude those 2,500 NA genes from the Wilcoxon test, since they would always be counted as "not in pathway" when their p-values are compared against those of genes that are "in pathway"? In an extreme case, if the NA genes were all strongly significant for differential expression (DE), they could dilute the signal from DE genes that actually are in pathways, potentially preventing those pathways from reaching significant enrichment. This makes me think the NA genes should be excluded...
Alternatively, we could consider the NA genes an essential part of the "baseline" transcriptome, of which the KEGG pathways and their genes are also a component, and therefore the NA genes are still needed to test for pathway enrichment. In that case all genes, including those not in the KEGG database, should be included in the Wilcoxon test...
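To make the two options concrete, here is a minimal Python sketch with made-up p-values (not my real data, and not my actual R workflow). It assumes the test is a one-sided Wilcoxon rank-sum test with a normal approximation, asking whether "in pathway" p-values tend to be smaller than the background. The names `pathway`, `annotated_other`, and `unannotated` are hypothetical, and the unannotated genes are deliberately set up as the extreme case where they are all strongly DE.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_p(in_set, background):
    """One-sided p-value that in_set values tend to be smaller than background.

    Uses the normal approximation without a tie or continuity correction,
    so this is an illustration rather than a replacement for a real test.
    """
    n1, n2 = len(in_set), len(background)
    ranks = average_ranks(list(in_set) + list(background))
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sd
    return 0.5 * math.erfc(-z / math.sqrt(2))  # standard normal CDF at z


# Hypothetical DE p-values, for illustration only.
pathway = [0.001, 0.004, 0.01, 0.03, 0.2]            # genes in the pathway
annotated_other = [0.05, 0.1, 0.3, 0.4, 0.5,
                   0.6, 0.7, 0.8, 0.9, 0.95]         # in KEGG, not in pathway
unannotated = [0.0001, 0.0002, 0.0005,
               0.002, 0.003, 0.006]                  # no KEGG identifier (NA)

# Option 1: background is restricted to genes that have a KEGG identifier.
p_excluding_na = rank_sum_p(pathway, annotated_other)

# Option 2: background is every other gene, including the NA genes.
p_including_na = rank_sum_p(pathway, annotated_other + unannotated)

print(f"NA genes excluded: p = {p_excluding_na:.4f}")
print(f"NA genes included: p = {p_including_na:.4f}")
```

With these toy numbers the pathway looks clearly enriched when the NA genes are excluded and much less so when they are included, which is the dilution effect I am worried about. Real data would of course not be this extreme.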
I was wondering if there is a standard or suggested "protocol" for handling this issue (genes that have no KEGG identifier)? Any insight would be greatly appreciated!
Thank you!