One of the changes from DiffBind 1.12.3 to DiffBind 1.14.0 is at line 1045 of analyze.R, where the line:
idx = match(as.integer(sites),1:nrow(pv$allvectors))
in the pv.getsites
function is replaced with:
idx = match(as.integer(sites), rownames(pv$allvectors))
Reading through the rest of the code indicates that sites
is a vector taken from the annotation field from topTags
(which, in turn, is the genes
object in the DGEList
) and pv
is a DBA
object. I presume that this command is meant to match up the p-value-sorted sites
from topTags
to the original order of the rows in pv
.
However, the DGEList
is constructed in line 232 of analyze.R (in the pv.DEinit
function) as:
res = DGEList(counts,lib.size=libsize, group=groups,genes=as.character(1:nrow(counts)))
Consider a case where the row names of pv$allvectors
are not consecutive integers starting at 1. This means that matching of sites to the row names in pv.getsites
will not recover the original ordering of rows. Rather, they should either be matched to 1:nrow(pv$allvectors)
, as done in version 1.12.3; or, the genes
object should be set to the row names during DGEList
construction.
This issue can be illustrated using the tamoxifen example:
data(tamoxifen_counts) tamoxifen = dba.analyze(tamoxifen) dba.report(tamoxifen)
This gives a top hit at "chr18:1302062-1303518" (paraphrased from GRanges
output). If you shuffle the row names before proceeding:
rownames(tamoxifen$allvectors) <- sample(rownames(tamoxifen$allvectors)) tamoxifen = dba.analyze(tamoxifen) dba.report(tamoxifen)
you get a different result for version 1.14.0. In contrast, these two sets of commands yield the same result for version 1.12.3 (top hit of "chr18:23914866-23916341" in both cases).
Here's some session information for the run with version 1.14.0:
> sessionInfo() R version 3.2.0 (2015-04-16) Platform: x86_64-unknown-linux-gnu (64-bit) Running under: CentOS release 6.4 (Final) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets [8] methods base other attached packages: [1] DiffBind_1.14.0 RSQLite_1.0.0 DBI_0.3.1 [4] locfit_1.5-9.1 GenomicAlignments_1.4.1 Rsamtools_1.20.1 [7] Biostrings_2.36.0 XVector_0.8.0 limma_3.24.2 [10] GenomicRanges_1.20.3 GenomeInfoDb_1.4.0 IRanges_2.2.1 [13] S4Vectors_0.6.0 BiocGenerics_0.14.0 loaded via a namespace (and not attached): [1] Rcpp_0.11.5 lattice_0.20-31 GO.db_3.1.2 [4] gtools_3.4.2 digest_0.6.8 plyr_1.8.2 [7] futile.options_1.0.0 BatchJobs_1.6 ShortRead_1.26.0 [10] ggplot2_1.0.1 gplots_2.16.0 zlibbioc_1.14.0 [13] annotate_1.46.0 gdata_2.13.3 Matrix_1.2-0 [16] checkmate_1.5.2 systemPipeR_1.2.0 proto_0.3-10 [19] GOstats_2.34.0 splines_3.2.0 BiocParallel_1.2.0 [22] stringr_0.6.2 pheatmap_1.0.2 munsell_0.4.2 [25] sendmailR_1.2-1 base64enc_0.1-2 BBmisc_1.9 [28] fail_1.2 edgeR_3.10.0 XML_3.98-1.1 [31] AnnotationForge_1.10.0 MASS_7.3-40 bitops_1.0-6 [34] grid_3.2.0 RBGL_1.44.0 xtable_1.7-4 [37] GSEABase_1.30.1 gtable_0.1.2 scales_0.2.4 [40] graph_1.46.0 KernSmooth_2.23-14 amap_0.8-14 [43] hwriter_1.3.2 reshape2_1.4.1 genefilter_1.50.0 [46] latticeExtra_0.6-26 futile.logger_1.4.1 brew_1.0-6 [49] rjson_0.2.15 lambda.r_1.1.7 RColorBrewer_1.1-2 [52] tools_3.2.0 Biobase_2.28.0 Category_2.34.0 [55] survival_2.38-1 AnnotationDbi_1.30.1 colorspace_1.2-6 [58] caTools_1.17.1
I first encountered this behaviour in an analysis with some real paired-end ChIP-seq data, running DiffBind on peaks called by HOMER in each of 8 libraries (4 groups of 2 replicates). I reformatted the HOMER peak files into BED format beforehand. The DiffBind commands were:
where
bam.files
andall.peakfiles
are character vectors of the respective files, andprefix
is a character vector of file accessions. There are 100132 sites in total; most of these have row names ofallvectors
that are in order, but scattered throughout are 5 sites that are not in order.