Dear all,
I'm using DiffBind to compare histone modifications in two groups of samples.
I'm comparing two cell-types, from the same 8 individuals. So I have 16 samples, 8 of each cell-type.
The aim is to find peaks that are significantly diferentially bound between the two groups of samples, and thus identify epigenetic differences between the two cell-types, using the 8 matched pairs of cell-type from the same individuals.
Whenever I analyze all 8 samples together, I notice that there are six samples that cluster in the wrong group on the heatmap and the PCA. I would expect to find two nice clusters of each cell-type, but there are these six samples that go to the wrong group. Four of these samples, by the away, are matched pairs of the same two individuals. Finally, I get only one differentially bound peak between the two groups of samples when using all 16 of them.
Then, I took those 6 samples out of the analysis and run again.
So now I have 10 samples, 5 of each cell-type.
They cluster nicely on the heatmap and on the PCA.
And there are 1594 differentially bound peaks identified with edgeR, using the default threshold of FDR <= 0.05.
However, I have one doubt:
These 1594 peaks are identified when the .bam files do not have the 'chr' string in the first column for chromosome name, but the .bed files with the peaks do have the 'chr' string and the number.
When I run the analysis with no 'chr' string in either .bed or .bam files, DiffBind finds only two differentially bound peaks, instead of 1594... so I guess it doesn't work.
Even the FRiP value is generated for each peakset both when .bam and .bed chromosome columns are mismatched (.bed with 'chr'+number and .bam with no 'chr', just number) and also when they match, but only the differentially bound peaks are not identified when the files match. I'm puzzled by this.
Any insight, suggestion or opinion will be of great help.
Code:
c5 <- dba(sampleSheet=read.csv("c5.csv"))
c5
# 10 Samples, 50849 sites in matrix (83459 total):
# ID Tissue Factor Condition Replicate Caller Intervals
# 1 A1 LV H3K27ac a 1 bed 37472
# 2 A2 LV H3K27ac a 2 bed 32165
# 3 A3 LV H3K27ac a 3 bed 43512
# 4 A4 LV H3K27ac a 4 bed 46404
# 5 A5 LV H3K27ac a 5 bed 29335
# 6 B1 RV H3K27ac b 1 bed 22295
# 7 B2 RV H3K27ac b 2 bed 18116
# 8 B3 RV H3K27ac b 3 bed 47573
# 9 B4 RV H3K27ac b 4 bed 49963
# 10 B5 RV H3K27ac b 5 bed 38031
c5 <- dba.count(c5)
c5
# 10 Samples, 50849 sites in matrix:
# ID Tissue Factor Condition Replicate Caller Intervals FRiP
# 1 A1 LV H3K27ac a 1 counts 50849 0.23
# 2 A2 LV H3K27ac a 2 counts 50849 0.20
# 3 A3 LV H3K27ac a 3 counts 50849 0.24
# 4 A4 LV H3K27ac a 4 counts 50849 0.30
# 5 A5 LV H3K27ac a 5 counts 50849 0.18
# 6 B1 RV H3K27ac b 1 counts 50849 0.15
# 7 B2 RV H3K27ac b 2 counts 50849 0.13
# 8 B3 RV H3K27ac b 3 counts 50849 0.24
# 9 B4 RV H3K27ac b 4 counts 50849 0.29
# 10 B5 RV H3K27ac b 5 counts 50849 0.20
c5 <- dba.contrast(c5, categories=DBA_CONDITION, minMembers=2)
c5 <- dba.analyze(c5, method=DBA_ALL_METHODS)
c5
# 1 Contrast:
# Group1 Members1 Group2 Members2 DB.edgeR DB.DESeq DB.DESeq2
# 1 a 5 b 5 1594 1437 1366
Thank you in advance,
Daiane.
> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] DiffBind_1.16.3 RSQLite_1.0.0
[3] DBI_0.5 locfit_1.5-9.1
[5] GenomicAlignments_1.6.3 Rsamtools_1.25.0
[7] Biostrings_2.38.4 XVector_0.10.0
[9] limma_3.29.14 SummarizedExperiment_1.0.2
[11] Biobase_2.33.0 GenomicRanges_1.22.4
[13] GenomeInfoDb_1.6.3 IRanges_2.4.8
[15] S4Vectors_0.8.11 BiocGenerics_0.19.1
loaded via a namespace (and not attached):
[1] edgeR_3.12.1 splines_3.2.3
[3] gtools_3.5.0 Formula_1.2-1
[5] latticeExtra_0.6-28 amap_0.8-14
[7] RBGL_1.49.1 Category_2.39.0
[9] backports_1.0.3 lattice_0.20-33
[11] chron_2.3-47 digest_0.6.10
[13] RColorBrewer_1.1-2 checkmate_1.8.1
[15] colorspace_1.2-6 Matrix_1.2-6
[17] plyr_1.8.4 GSEABase_1.35.0
[19] DESeq2_1.10.1 XML_3.98-1.4
[21] pheatmap_1.0.8 ShortRead_1.28.0
[23] biomaRt_2.29.2 genefilter_1.52.1
[25] zlibbioc_1.19.0 xtable_1.8-2
[27] GO.db_3.2.2 scales_0.4.0
[29] brew_1.0-6 gdata_2.17.0
[31] BiocParallel_1.7.4 annotate_1.51.0
[33] ggplot2_2.1.0 GenomicFeatures_1.22.13
[35] nnet_7.3-12 survival_2.39-5
[37] magrittr_1.5 systemPipeR_1.4.8
[39] fail_1.3 gplots_3.0.1
[41] RcppArmadillo_0.7.200.2.0 hwriter_1.3.2
[43] foreign_0.8-66 GOstats_2.36.0
[45] graph_1.51.0 data.table_1.9.6
[47] tools_3.2.3 BBmisc_1.10
[49] stringr_1.0.0 sendmailR_1.2-1
[51] munsell_0.4.3 cluster_2.0.4
[53] AnnotationDbi_1.32.3 DESeq_1.22.1
[55] caTools_1.17.1 grid_3.2.3
[57] RCurl_1.95-4.8 rjson_0.2.15
[59] AnnotationForge_1.12.2 bitops_1.0-6
[61] base64enc_0.1-3 gtable_0.2.0
[63] gridExtra_2.2.1 rtracklayer_1.30.4
[65] Hmisc_3.17-4 KernSmooth_2.23-15
[67] stringi_1.1.1 BatchJobs_1.6
[69] Rcpp_0.12.6 geneplotter_1.48.0
[71] rpart_4.1-10 acepack_1.3-3.3