totalTest number in ChIPpeakAnno


Noah Dowell ▴ 410

@noah-dowell-3791

Last seen 10.6 years ago

Hello Binbin,

It would be helpful to describe your problem and post it to the whole message board. (There are many experts who can probably be more helpful than I can :-)) That said, I think you are referring to the "NaN" error, and my thoughts are below. (Julie Zhu has also answered this a couple of times, and her reply is probably in the archives.)

When calling the makeVennDiagram function, you want to set totalTest to something larger than the experimentally determined number of peaks. As far as I know, totalTest is used for the hypergeometric sampling that determines whether the overlap between two datasets is greater than would be expected by chance. So one way to sort this out using biological information is to think about the maximum number of possible binding events and use that as totalTest.

For example, if you are studying a sequence-specific DNA-binding protein with a known motif, you could count the number of times that motif occurs in the genome and compare that to the number of peaks you have determined experimentally:

Motifs = 500
Peaks = 200
Peaks w/ motif = 180 (90%)
"upper limit" = 500
new "upper limit" for totalTest = 0.9 x 500 = 450

Now, if you're working with a sequence-independent binding factor, it can get tricky. One approach would be to determine the mean peak width, then divide the length of the whole genome by this number to get an upper limit. This is probably way too high, so using additional information, such as whether the protein binds intergenic regions or ORFs, could bring the number down and make it more relevant to the biological experiment. For example:

peaks = 75
intergenic peaks = 70
ORF peaks = 5
mean peak width = 50 base pairs
genome size = 10000 base pairs
"upper limit" = 10000/50 = 200 (possible peaks)
intergenic seq = 4000 base pairs
new "upper limit" = 4000/50 = 80 (possible intergenic peaks)

I was working with something more like the second case. I felt that a totalTest based on the total genome was quite relaxed and one based on the intergenic sequence only was quite stringent, so somewhere in the middle might be better. Most importantly, I feel I am standing on solid biological reasoning for determining the amount of sampling.

Hope this helps. I would be interested to hear if anybody has critiques of this approach or additional suggestions.

Best,
Noah

On Nov 16, 2010, at 7:31 AM, Binbin Liu wrote:

> Dear Noah,
>
> I saw your post on the Bioconductor mailing list regarding the totalTest number for the p-value calculation in ChIPpeakAnno::makeVennDiagram(). I am having the same problem. Can I ask how you got it sorted?
>
> Many thanks.
>
> Binbin
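
The two back-of-the-envelope estimates in the post above, and the effect the choice of totalTest has on a p-value, can be sketched in Python. This is only an illustration under stated assumptions: the function names are made up for this example, the p-value is a textbook hypergeometric upper tail, and none of it has been checked against how makeVennDiagram computes its p-value. The second peak set of 60 and the overlap of 30 are invented numbers.

```python
from math import comb


def motif_based_total(motif_sites, peaks, peaks_with_motif):
    """Scale the genome-wide motif count by the fraction of peaks carrying the motif."""
    return round(motif_sites * peaks_with_motif / peaks)


def width_based_total(region_bp, mean_peak_width_bp):
    """Count the non-overlapping peak-sized windows that fit in a region."""
    return region_bp // mean_peak_width_bp


def hypergeom_upper_tail(total_test, n_a, n_b, overlap):
    """P(X >= overlap) for a textbook hypergeometric model (an assumption here).

    total_test candidate sites, n_a of them bound in set A, and n_b sites
    drawn without replacement for set B.
    """
    if total_test < max(n_a, n_b):
        # The model is undefined when a peak set is larger than the universe.
        raise ValueError("total_test must be at least as large as each peak set")
    denom = comb(total_test, n_b)
    numer = sum(
        comb(n_a, k) * comb(total_test - n_a, n_b - k)
        for k in range(overlap, min(n_a, n_b) + 1)
    )
    return numer / denom


print(motif_based_total(500, 200, 180))   # 450
print(width_based_total(10000, 50))       # 200
print(width_based_total(4000, 50))        # 80

# Hypothetical overlap: 30 of the 75 peaks shared with a second set of 60 peaks.
p_genome = hypergeom_upper_tail(200, 75, 60, 30)
p_intergenic = hypergeom_upper_tail(80, 75, 60, 30)
print(p_genome < p_intergenic)            # True
```

With these made-up numbers the smaller totalTest gives the larger p-value, which is consistent with the remark that the intergenic-only estimate is the more stringent choice.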

ChIPpeakAnno ChIPpeakAnno • 1.4k views

ADD COMMENT • link updated 14.4 years ago by Binbin Liu ▴ 30 • written 14.4 years ago by Noah Dowell ▴ 410


Binbin Liu ▴ 30

@binbin-liu-4350

Last seen 10.6 years ago

Dear Noah,

Many thanks for your detailed explanation of how totalTest is defined. What I am doing is similar to the second case. However, the TF we are interested in could bind anywhere in the genome. So with an mm9 genome size of 2.7E+9 bp and peak widths <= 200 bp, the totalTest is 1.35E+7. It seems very computationally costly to run ChIPpeakAnno. Nevertheless, do you think it is reasonable?

Thanks,
Binbin

On 16 Nov 2010, at 18:41, Noah Dowell wrote:

> [...]
> Now if you're working with a sequence-independent binding factor it can get tricky. One approach would be to determine the mean peak width. Then divide the whole genome sequence by this number to get an upper limit.
> [...]
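
The genome-wide figure quoted in this message is simple to reproduce. A minimal Python check, assuming the rounded mm9 size of 2.7E+9 bp given in the message and treating 200 bp as the window size (the variable names are illustrative only):

```python
GENOME_BP = 2_700_000_000   # mm9 size as quoted in the message (rounded)
MAX_PEAK_WIDTH_BP = 200     # upper bound on peak width from the message

# Number of non-overlapping 200 bp windows that tile the genome.
total_test = GENOME_BP // MAX_PEAK_WIDTH_BP
print(total_test)  # 13500000, i.e. 1.35E+7
```

Because 200 bp is an upper bound on the peak width, dividing by a narrower mean width would give an even larger number of windows.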

ADD COMMENT • link 14.4 years ago Binbin Liu ▴ 30


Binbin,

In the current implementation of makeVennDiagram, the time used to calculate the p-value does not depend on totalTest.

Noah, thanks so much for sharing your insights!

Best regards,
Julie

On 11/18/10 11:26 AM, "Binbin Liu" <b.b.liu at leeds.ac.uk> wrote:

> [...]
> So with mm9 of 2.7E+9 and peak width <=200 bps, the totalTest is 1.35E+7. It seems very computationally costly to run ChIPpeakAnno. Nevertheless, do you think it is reasonable?
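
One way to see why the cost of such a p-value need not grow with totalTest is a log-gamma formulation of the hypergeometric tail: the number of terms summed depends on the peak set sizes and the overlap, not on the size of the universe. The Python sketch below is an illustration only and is not taken from the ChIPpeakAnno source; the function names, the peak set sizes (2000 and 1500) and the overlap (50) are all invented for the example.

```python
from math import exp, lgamma


def log_comb(n, k):
    """log of the binomial coefficient C(n, k), via log-gamma."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def hypergeom_upper_tail_log(total_test, n_a, n_b, overlap):
    """P(X >= overlap) under a textbook hypergeometric model (an assumption)."""
    log_denom = log_comb(total_test, n_b)
    # Smallest and largest overlaps that are possible at all.
    lower = max(overlap, n_a + n_b - total_test, 0)
    upper = min(n_a, n_b)
    total = 0.0
    for k in range(lower, upper + 1):
        log_term = log_comb(n_a, k) + log_comb(total_test - n_a, n_b - k) - log_denom
        total += exp(log_term)  # underflows quietly to 0.0 for negligible terms
    return min(1.0, total)


# Same (hypothetical) peak sets, two very different universes: the loop runs
# over the same range of k in both calls, so the work is about the same.
p_small = hypergeom_upper_tail_log(100_000, 2000, 1500, 50)
p_large = hypergeom_upper_tail_log(13_500_000, 2000, 1500, 50)
print(p_small, p_large)
```

Each term costs a fixed number of lgamma calls, so under this formulation a totalTest of 1.35E+7 is no more expensive than a totalTest of 1E+5.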

ADD REPLY • link 14.4 years ago Julie Zhu ★ 4.3k
