Question

Properly constructing a hypergeometric test

charles.foster ▴ 180

@charlesfoster-17652

Australia

I'm having some conceptual challenges ensuring that I have properly constructed a hypergeometric test in R. I would appreciate some feedback.

For some background, we have carried out transcriptomic analyses and determined a set of differentially expressed genes (DEGs) between our experimental conditions. We wish to determine whether genes associated with a particular syndrome are overrepresented in the set of DEGs. We've obtained a set of curated genes associated with the syndrome of interest.

I have the following:

Set of differentially expressed gene IDs = DEGs (character vector)
Gene IDs for all genes detected in the study (i.e., those that are DEGs + those that are not DEGs) = universe (character vector)
Set of gene IDs associated with the syndrome of interest, filtered to only include those detected in the universe set = syndrome_genes (character vector)
Overlap between DEGs and syndrome_genes

I see the hypergeometric test being set up as follows:

### Formulation 1 ###
overlap <- intersect(DEGs, syndrome_genes)
result <- phyper(q = length(overlap) - 1,
                 m = length(syndrome_genes),
                 n = length(universe) - length(syndrome_genes),
                 k = length(DEGs),
                 lower.tail = FALSE)

I've formulated the test this way because of the classical urn terminology used by phyper. I see the DEGs as the balls sampled from the urn, the syndrome_genes as the white balls in the urn, the overlap as the white balls drawn when sampling the DEGs, and the genes in the universe set that are not associated with the syndrome of interest as the black balls in the urn.
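One way to sanity-check the urn mapping described above is to run it on toy data and compare it with a one-sided Fisher's exact test on the equivalent 2x2 table. This is only a sketch: the gene IDs, the set sizes, and the seed are all made up for illustration, and the table layout is just one of several equivalent ways to arrange the counts.

```r
# Toy data: every gene ID and set size here is invented for illustration.
set.seed(1)
universe       <- paste0("gene", 1:1000)
syndrome_genes <- sample(universe, 50)
DEGs           <- sample(universe, 100)
overlap        <- intersect(DEGs, syndrome_genes)

# Urn mapping as described in the question
x <- length(overlap)                             # white balls drawn
m <- length(syndrome_genes)                      # white balls in the urn
n <- length(universe) - length(syndrome_genes)   # black balls in the urn
k <- length(DEGs)                                # balls drawn

# P[X >= x]: probability of an overlap at least as large as the observed one
p_phyper <- phyper(q = x - 1, m = m, n = n, k = k, lower.tail = FALSE)

# The same counts as a 2x2 table (rows: DEG yes/no, columns: syndrome yes/no)
tab <- matrix(c(x,     k - x,
                m - x, n - (k - x)),
              nrow = 2, byrow = TRUE,
              dimnames = list(DEG = c("yes", "no"),
                              syndrome = c("yes", "no")))
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

# The two p-values should agree
all.equal(p_phyper, p_fisher)
```

If the two values disagree on your real data, the usual suspects are a universe that does not contain all of the DEGs or syndrome genes, or duplicated IDs in one of the vectors.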

However, a collaborator formulates the test differently:

### Formulation 2 ###
overlap <- intersect(DEGs, syndrome_genes)
result <- phyper(q = length(overlap),
                 m = length(DEGs),
                 n = length(universe) - length(DEGs),
                 k = length(syndrome_genes),
                 lower.tail = FALSE)

Which of these formulations is correct? Thanks in advance.

overrepresentation overlap R hypergeometric • 3.7k views

updated 2.6 years ago by ATpoint ★ 4.8k • written 2.6 years ago by charles.foster ▴ 180

score 1 · Answer 1 · 2022-10-05

To me, yours is correct; your collaborator's does not adhere to the urn model used to specify the parameters on the help page of the phyper() function. For instance, to get the one-tailed (upper-tail) probability you need to set lower.tail = FALSE. According to the help page's description of the lower.tail parameter, that gives you P[X > x]. Because you actually want P[X >= x], you need to pass length(overlap) - 1 as the first argument, as you rightly do.
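The effect of the `- 1` can be seen with a few hypothetical counts. This is a minimal sketch, with `N`, `m`, `k` and `x` invented for illustration. It also shows a side point that holds for the hypergeometric distribution in general: swapping which set is treated as "white balls" and which as "balls drawn" leaves the probability unchanged, so the numerical difference between the two formulations in the question comes from the first argument.

```r
# Hypothetical counts, chosen only for illustration
N <- 1000   # genes in the universe
m <- 50     # syndrome genes (white balls in the urn)
k <- 100    # DEGs (balls drawn)
x <- 9      # observed overlap (white balls drawn)

# P[X >= x] summed directly from the point probabilities
p_direct <- sum(dhyper(x:min(m, k), m = m, n = N - m, k = k))

# lower.tail = FALSE returns P[X > q], so q = x - 1 gives P[X >= x]
p_ge <- phyper(x - 1, m = m, n = N - m, k = k, lower.tail = FALSE)

# With q = x the observed overlap itself is left out: P[X > x]
p_gt <- phyper(x, m = m, n = N - m, k = k, lower.tail = FALSE)

# Roles of DEGs and syndrome genes swapped, still with q = x - 1
p_swapped <- phyper(x - 1, m = k, n = N - k, k = m, lower.tail = FALSE)

all.equal(p_ge, p_direct)    # both routes to P[X >= x] agree
all.equal(p_ge, p_swapped)   # the swap does not change the value
p_ge - p_gt                  # the probability mass at x that q = x drops
```

Using `q = x` gives a smaller p-value than `q = x - 1`, so it overstates the evidence for overrepresentation, most noticeably when the overlap is small.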

cheers,

robert.

score 0 · Answer 2 · 2022-10-05

ATpoint ★ 4.8k

@atpoint-13662

Germany

Here is an answer from (I think) the clusterProfiler author that I used as a guideline: https://www.biostars.org/p/485827/#9483835

The key point is to define the background properly. In my opinion, that would be all genes in your analysis that have any annotation in the database you enrich against.
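As a rough sketch of what restricting the background could look like in practice: the `annotated_genes` vector below is hypothetical and stands in for "all genes with any annotation in the database", and all IDs and sizes are invented. The idea is to intersect every set with the background before counting, so that the urn only contains genes that could have been annotated in the first place.

```r
# Toy data: all IDs, sizes, and the `annotated_genes` vector are made up.
set.seed(42)
detected_genes  <- paste0("gene", 1:2000)         # all genes detected in the study
annotated_genes <- sample(detected_genes, 1200)   # hypothetical: genes with any annotation
DEGs_all        <- sample(detected_genes, 200)    # DEGs before restricting
syndrome_all    <- sample(annotated_genes, 60)    # curated syndrome genes

# Restrict every set to the background before counting
universe       <- intersect(detected_genes, annotated_genes)
DEGs           <- intersect(DEGs_all, universe)
syndrome_genes <- intersect(syndrome_all, universe)
overlap        <- intersect(DEGs, syndrome_genes)

# Same urn mapping as Formulation 1, now on the restricted sets
p <- phyper(q = length(overlap) - 1,
            m = length(syndrome_genes),
            n = length(universe) - length(syndrome_genes),
            k = length(DEGs),
            lower.tail = FALSE)
p
```

Note that the DEG list shrinks as well, since DEGs without any annotation are dropped along with the rest of the unannotated genes.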
