I found a bug in the CAGEr package when dealing with consensus clusters: the assays and rowRanges are disordered for consensusClusters.
This is caused by the code in CumulativeDistributionFunctions.R
from line 93 to 95 and line 99:
setMethod(".getCumsum", c("CTSS", "GRanges"), function(ctss, clusters, useMulticore , nrCores) {
getCumSumChrStrand <- function(chrom) {
plus.cumsum <- .getCumsumChr2(clusters = clusters, ctss = ctss, chrom = chrom, str = "+")
minus.cumsum <- .getCumsumChr2(clusters = clusters, ctss = ctss, chrom = chrom, str = "-")
c(plus.cumsum, minus.cumsum)
}
clusters.cumsum <- unlist(bplapply( seqlevels(clusters), getCumSumChrStrand
, BPPARAM = CAGEr_Multicore(useMulticore, nrCores)))
names(clusters.cumsum) <- names(clusters)
clusters.cumsum
})
This deals with '+' strand data first then '-' and concat the two results for the same chromosome together and assigned the names at line 99. These code assume the order of genome ranges are sorted by strand with all '+' ranges before '-' ranges for the same chromosome. Thus, if the ranges are mixed with disordered '+' and '-', the names(clusters.cumsum)
will not correspond to names(clusters)
and would generated disordered results.
Here is the situation when calling consensus clusters in the AggregationMethods.R
line 163, the rs
would be ordered by the as.factor(names)
, which would reorder the results by alphabet and disrupt the original order that all '+' ranges come before '-' ranges for the same chromosome. Thus, in line 172, the consensus.clusters[rownames(counts)]
recorded the consensus clusters by alphabet. This leads to the conflict with the underlying assumption in the above CumulativeDistributionFunctions.R
function and cause the bug.
setMethod( ".CCtoSE"
, c(se = "RangedSummarizedExperiment")
, function(se, consensus.clusters, tpmThreshold = 1) {
if (is.null(assays(se)[["normalizedTpmMatrix"]]))
stop("Needs normalised data; run ", sQuote("normalizeTagCount()"), " first.")
if (is.null(rowRanges(se)$cluster))
rowRanges(se)$cluster <- ranges2names(rowRanges(se), consensus.clusters)
if (tpmThreshold > 0)
se <- se[rowSums(DelayedArray(assays(se)[["normalizedTpmMatrix"]])) > tpmThreshold,]
.rowsumAsMatrix <- function(DF, names) {
rs <- rowsum(as.matrix(DelayedArray(DF)), as.factor(names))
if (rownames(rs)[1] == "") # If some CTSS were not in clusters
rs <- rs[-1, , drop = FALSE]
rs
}
counts <- .rowsumAsMatrix(assays(se)[["counts"]], rowRanges(se)$cluster)
norm <- .rowsumAsMatrix(assays(se)[["normalizedTpmMatrix"]], rowRanges(se)$cluster)
SummarizedExperiment( rowRanges = consensus.clusters[rownames(counts)]
, assays = SimpleList( counts = counts
, normalized = norm))
})
To solve this, reorder = FALSE
should be added in rowsum
at line 163. Thus, the code should change from
rs <- rowsum(as.matrix(DelayedArray(DF)), as.factor(names))
to
rs <- rowsum(as.matrix(DelayedArray(DF)), as.factor(names), reorder = FALSE)
Thank you very much, this is very useful! I regret that your post did not catch my attention earlier!