Entering edit mode
Noah Dowell
▴
410
@noah-dowell-3791
Last seen 10.3 years ago
Hello to All,
I could not find this in the archives so I figured I could post this
to the board. It might be to much of an R question, but it arose
during the use of the excellent ChIPpeakAnno package.
Problem and Example:
I have ChIP-chip binding data for two transcription factors (TF) and I
have mapped their binding to genomic positions and I would like to see
if there is significant overlap between the binding peaks. The
makeVennDiagram function in ChIPpeakAnno does this for me, but it
still seems unclear how the parameters for p.value generation should
be set.
Here are the function inputs:
makeVennDiagram(RangedDataList(smallAnnoTFA, smallAnnoTFB),
NameOfPeaks=c("TFA", "TFB"), maxgap = 0, totalTest= 10000, cex = 1,
counts.col = "purple")
totalTest specifies the total number of tests performed to obtain the
list of peaks. This number should be larger than the largest number
of peaks in A and B.
So here are example data:
TFA peaks = 5118
TFB peaks = 1067
overlapTFA_TFB = 708
totalTest = 7500 gives a p.value = 0.9279632
totalTest = 10000 gives a p.value = 2.39585e-26
So I looked at the makeVennDiagram function to see how this p value is
being generated and find the phyper function is being used.
For example:
p.value.overall1 = phyper(p1.and.p2.and.p3 - 1, p1, totalTest - p1,
p2.and.p3, lower.tail = FALSE, log.p = FALSE)
where:
p1.and.p2.and.p3 =
length(unique(overlappingPeaks123[[NameOfPeaks[1]]]))
p1 = length(rownames(Peaks[[1]]))
p2.and.p3 = counts3[4, 3]
in my example data from above:
p1.and.p2.and.p3 = 708
p1 =5118
totalTest - p1 = 7500 -5118
p2.and.p3 = 1067 (I think???)
So I try to relate this to the base function in R and I get confused:
phyper(q, m, n, k, lower.tail = TRUE, log.p = FALSE)
the assignment of q, m, n, and k in the makeVennDiagram function does
not seem to follow the definition of q, m, n, k on the R help page.
The variable m follows as expected, but I figured the k variable
should be 708 in my example data and not 1067. Additionally, the
variable n in the makeVennDiagram function depends on how totalTest is
defined, but shouldn't it be defined by the peak number of TFB? The
larger I make n (by increasing my totalTest number) in the function
then the smaller my p.value gets. I am a little worried that I could
be creating an artifact due to my incomplete understanding of how the
hypergeometric test is being implemented.
Thank you for any help you can provide.
Noah
> sessionInfo()
R version 2.11.0 (2010-04-22)
i386-apple-darwin9.8.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] org.Sc.sgd.db_2.4.1 ChIPpeakAnno_1.4.0
limma_3.4.0
[4] org.Hs.eg.db_2.4.1 GO.db_2.4.1
RSQLite_0.8-4
[7] DBI_0.2-5 AnnotationDbi_1.10.0
BSgenome.Ecoli.NCBI.20080805_1.3.16
[10] BSgenome_1.16.0 GenomicRanges_1.0.1
Biostrings_2.16.0
[13] multtest_2.4.0 Biobase_2.8.0
biomaRt_2.4.0
[16] rtracklayer_1.8.1 RCurl_1.3-1
bitops_1.0-4.1
[19] IRanges_1.6.0
loaded via a namespace (and not attached):
[1] MASS_7.3-5 splines_2.11.0 survival_2.35-8 tools_2.11.0
XML_2.8-1
[[alternative HTML version deleted]]