Hi,
I'm trying to analyze existing contact matrices generated by the Juicer pipeline. Basically I have sparse matrices in 3-column txt format where rows are { row_idx, col_idx, cell_value }.
I have three replicates each in mut and wt, and wt has more reads than mut (342M reads vs 309M).
I followed the instructions I found in these posts: How to use diffHiC with existing contact matrices? and problem recasting InteractionSet into a DGEList, then had to debug from there, but these are a few years old and I don't know if the finer points of the object structures have changed in this time.
Basically the whole workflow runs without error, but in result$table, all mut/wt LFC< 0 and 99% of p-values are basically 0. It does not appear to normalize for library size differences.
Thanks, Ariel
library(diffHic)
library(Matrix)
library(edgeR)
library(csaw)
bed <- read.delim("bin_1000000.bed", header=FALSE)
colnames(bed) <- c("Chrom","Start","End","Bin")
bed.gr <- makeGRangesFromDataFrame(bed)
samples <- c("mut1","mut2","mut3","wt1","wt2","wt3")
S <- length(samples)
design <- cbind(mut=c(1,1,1,0,0,0), wt=c(0,0,0,1,1,1))
conn <- keep <- setNames(vector("list",S),samples)
for (i in 1:S) {
d <- read.delim(paste0(samples[i],".matrix.gz"), header=FALSE)
sm <- sparseMatrix(i=d[,1], j=d[,2], x=d[,3])
conn[[i]] <- ContactMatrix(sm, bed.gr, bed.gr)
keep[[i]] <- sm!=0
}
libsizes <- sapply(1:S, function(i) sum(conn[[i]]@matrix) )
libsizes
# mut1 mut2 mut3 wt1 wt2 wt3
# 104674719 102308339 102586499 119988904 95368796 126548928
to.keep <- Reduce("|", keep)
iset <- lapply(conn, function(x) deflate(x, extract=to.keep) )
data <- do.call(cbind, iset)
interactions(data) <- as(interactions(data), "ReverseStrictGInteractions")
assayNames(data) <- "counts"
colnames(data) <- samples
colnames(attributes(data)$assays@data[[1]]) <- samples
data$totals <- libsizes
data <- normOffsets(data, method="loess", se.out=TRUE)
norm <- estimateDisp(asDGEList(data), design)
result <- glmQLFTest(glmQLFit(norm, design, robust=TRUE))
table(sign(result$table$logFC))
# -1
# 446035
summary(result$table$PValue)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 0.000e+00 0.000e+00 0.000e+00 1.270e-09 0.000e+00 6.957e-05
sessionInfo( )
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS: /n/apps/CentOS7/install/r-4.1.0/lib64/R/lib/libRblas.so
LAPACK: /n/apps/CentOS7/install/r-4.1.0/lib64/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] openxlsx_4.2.3 csaw_1.26.0
[3] edgeR_3.34.0 limma_3.48.0
[5] Matrix_1.3-4 diffHic_1.24.0
[7] InteractionSet_1.20.0 SummarizedExperiment_1.22.0
[9] Biobase_2.52.0 MatrixGenerics_1.4.0
[11] matrixStats_0.59.0 GenomicRanges_1.44.0
[13] GenomeInfoDb_1.28.0 IRanges_2.26.0
[15] S4Vectors_0.30.0 BiocGenerics_0.38.0
loaded via a namespace (and not attached):
[1] zip_2.2.0 Rcpp_1.0.6 compiler_4.1.0
[4] restfulr_0.0.13 XVector_0.32.0 bitops_1.0-7
[7] rhdf5filters_1.4.0 tools_4.1.0 zlibbioc_1.38.0
[10] metapod_1.0.0 rhdf5_2.36.0 lattice_0.20-44
[13] BSgenome_1.60.0 DelayedArray_0.18.0 rstudioapi_0.13
[16] yaml_2.2.1 GenomeInfoDbData_1.2.6 rtracklayer_1.52.0
[19] Biostrings_2.60.1 locfit_1.5-9.4 grid_4.1.0
[22] XML_3.99-0.6 BiocParallel_1.26.0 Rhdf5lib_1.14.1
[25] Rhtslib_1.24.0 Rsamtools_2.8.0 GenomicAlignments_1.28.0
[28] stringi_1.6.2 RCurl_1.98-1.3 crayon_1.4.1
[31] rjson_0.2.20 BiocIO_1.2.0
I was having such a time trying to get the intermediate steps to work, I completely forgot the contrast object, of course... That fixed it!