DiffBind counts
Naomi Altman ★ 6.0k
@naomi-altman-380
Last seen 3.7 years ago
United States
I am trying to understand what DiffBind is doing. Notice that I executed exactly the same command on the same data twice and got different counts.

counts=dba.count(mydata,minOverlap=3,score="DBA_SCORE_READS",
bRemoveDuplicates=TRUE, bCorPlot=FALSE)
head(counts$peaks[[1]])

    Chr   Start     End Score      RPKM Reads    cRPKM cReads
1 chr19 4113108 4113591     1  34.83796    13 29.97905     12
2 chr19 4878390 4879327   126 192.01349   139 16.92521     13
3 chr19 4961642 4962405    47 103.48129    61 22.59234     14
4 chr19 5724175 5724774    46 121.00902    56 20.72008     10
5 chr19 5798432 5799635   137 163.54396   152 15.47547     15
6 chr19 5801387 5802104    90 176.91451    98 13.46340      8

countsR1=dba.count(mydata,minOverlap=3,score="DBA_SCORE_READS",
bRemoveDuplicates=TRUE, bCorPlot=FALSE)
head(countsR1$peaks[[1]])

    Chr   Start     End Score      RPKM Reads    cRPKM cReads
1 chr19 4113108 4113591     1  34.84423    13 29.98011     12
2 chr19 4878390 4879327    52  78.75352    57  6.62314      5
3 chr19 4961642 4962405    44  98.40975    58 22.59313     14
4 chr19 5724175 5724774    46 121.03080    56 20.72081     10
5 chr19 5798432 5799635   141 164.64953   153 12.61009     12
6 chr19 5801387 5802104    90 176.94635    98 13.46387      8

R version 3.0.2 (2013-09-25)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] parallel  stats  graphics  grDevices  utils  datasets  methods
[8] base

other attached packages:
[1] DiffBind_1.8.4       GenomicRanges_1.14.4 XVector_0.2.0
[4] IRanges_1.20.7       BiocGenerics_0.8.0

loaded via a namespace (and not attached):
 [1] amap_0.8-12        bitops_1.0-6       caTools_1.16
 [4] edgeR_3.4.2        gdata_2.13.2       gplots_2.12.1
 [7] gtools_3.3.1       KernSmooth_2.23-12 limma_3.18.13
[10] RColorBrewer_1.0-5 stats4_3.0.2       tools_3.0.2
[13] zlibbioc_1.8.0
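For anyone wanting to pin down which peaks disagree between the two runs, here is a minimal base R sketch. It is not from the original post: the two data frames are re-typed from the first six rows printed above, and the names `run1`, `run2`, `differs`, and `mismatches` are illustrative only.

```r
# First six peaks from each run, re-typed from the output in the post.
starts <- c(4113108, 4878390, 4961642, 5724175, 5798432, 5801387)
run1 <- data.frame(Start  = starts,
                   Reads  = c(13, 139, 61, 56, 152, 98),
                   cReads = c(12, 13, 14, 10, 15, 8))
run2 <- data.frame(Start  = starts,
                   Reads  = c(13, 57, 58, 56, 153, 98),
                   cReads = c(12, 5, 14, 10, 12, 8))

# A peak differs if either the ChIP read count or the control read count changed.
differs <- run1$Reads != run2$Reads | run1$cReads != run2$cReads

# Side-by-side view of only the disagreeing peaks.
mismatches <- data.frame(Start       = run1$Start[differs],
                         Reads_run1  = run1$Reads[differs],
                         Reads_run2  = run2$Reads[differs],
                         cReads_run1 = run1$cReads[differs],
                         cReads_run2 = run2$cReads[differs])
print(mismatches)
```

On these six rows, peaks 2, 3, and 5 disagree. On a real DiffBind result the same comparison could presumably be applied to `counts$peaks[[1]]` and `countsR1$peaks[[1]]` directly.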
Gord Brown ▴ 670
@gord-brown-5664
Last seen 4.0 years ago
United Kingdom
Hi, Naomi et al,

This is a bug in DiffBind's duplicate-removal code. The bug is fixed in BioC 2.13, and will (I hope) make it into the last build, if there is one more. It's also fixed in the development stream, so will make it into the next release. Counting without removing duplicates is (afaik) correct.

Sorry for the inconvenience and many thanks for bringing it to our attention.

Cheers,
- Gord
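One way to check whether a given set of options gives stable counts is to run the call twice and compare the results. The following is only a sketch under stated assumptions: `counts_reproducible` is a hypothetical helper, not a DiffBind function, it compares only the `Reads` column seen in the output earlier in the thread, and the stand-in functions replace the real `dba.count` call so that the snippet runs without the package.

```r
# Hypothetical helper: call a counting function twice and test whether
# the Reads column is identical across the two calls.
counts_reproducible <- function(count_fn) {
  first <- count_fn()
  second <- count_fn()
  identical(first$Reads, second$Reads)
}

# Stand-in that always returns the same counts.
stable_fn <- function() data.frame(Reads = c(13, 139, 61))

# Stand-in that returns different counts on each call, mimicking the
# behaviour reported in this thread.
make_unstable_fn <- function() {
  calls <- 0
  function() {
    calls <<- calls + 1
    data.frame(Reads = c(13, 139, 61) + calls)
  }
}
unstable_fn <- make_unstable_fn()

stable_ok <- counts_reproducible(stable_fn)
unstable_ok <- counts_reproducible(unstable_fn)
print(c(stable = stable_ok, unstable = unstable_ok))

# With DiffBind loaded, the check might look like this (untested here;
# arguments copied from the original post, with duplicate removal off):
# counts_reproducible(function()
#   dba.count(mydata, minOverlap = 3, score = "DBA_SCORE_READS",
#             bRemoveDuplicates = FALSE, bCorPlot = FALSE)$peaks[[1]])
```

Passing a function rather than a result keeps the helper independent of DiffBind, so the same check can be reused after upgrading to a fixed build.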
Hi All,

I am still trying to understand DiffBind. After reading in my data, I find that the peaks component looks something like this:

head(myCHIP$peaks[[1]])

     V1      V2      V3         V8
1 chr19 3182597 3183033 0.10326322
2 chr19 3589475 3589990 0.09515837
3 chr19 3831795 3832326 0.06208947
4 chr19 4122385 4123105 0.06524229
5 chr19 4504682 4505416 0.15118871
6 chr19 4558434 4559635 0.22387278

The peaks were called by MACS, and looking at the code, dba pulls the peak score out of "column 8" and normalizes it to be between 0 and 1. However, the peak spreadsheet has 9 columns and none of them appear to be normalizable to obtain the numbers in this column.

Where are these numbers coming from? What do they mean? And should I care?

Thanks,
Naomi
Rory Stark ★ 5.2k
@rory-stark-5741
Last seen 10 weeks ago
Cambridge, UK
Hi Naomi-

For MACS .xls files, the score is derived from column 7: -10*log10(pvalue). Can you let me know where it is documented as column 8 so I can fix that?

These scores aren't very important. They are only used in the correlation heatmaps and PCA plots after peaks are read in. These scores are discarded once reads are counted for a consensus peakset. Generally, these peak-derived plots are driven more by which samples have the peak called and which don't, rather than the specific scores for where they are called.

Cheers-
Rory
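To make the score description concrete, here is a small base R sketch of how a MACS-style score could end up in the 0 to 1 range. The p-values are made up for illustration, and the scaling step (dividing by the largest score in the sample) is an assumption; the thread only says the score is normalized to lie between 0 and 1, not how.

```r
# Made-up p-values for three peaks (illustrative, not from any real MACS file).
pvalues <- c(1e-5, 1e-20, 1e-10)

# MACS .xls column 7 as described above: -10 * log10(pvalue).
macs_score <- -10 * log10(pvalues)

# ASSUMPTION: scale into (0, 1] by dividing by the largest score in the sample.
# DiffBind's actual normalization is not stated in this thread.
scaled_score <- macs_score / max(macs_score)

print(data.frame(pvalue = pvalues, macs_score, scaled_score))
```

Under this assumed scaling the most significant peak gets a score of 1 and every other peak a fraction of it, which would explain why the values read back from the `peaks` component do not match any raw column in the MACS spreadsheet.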