CAGEr bug report: wrong consensus cluster results
3
1
Entering edit mode
Meng ▴ 30
@b971d31d
Last seen 16 months ago
United States

I found a bug in the CAGEr package when dealing with consensus clusters: the assays and rowRanges are disordered for consensusClusters.

This is caused by the code in CumulativeDistributionFunctions.R from line 93 to 95 and line 99:

setMethod(".getCumsum", c("CTSS", "GRanges"), function(ctss, clusters, useMulticore , nrCores) {
  getCumSumChrStrand <- function(chrom) {
    plus.cumsum  <- .getCumsumChr2(clusters = clusters, ctss = ctss, chrom = chrom, str = "+")
    minus.cumsum <- .getCumsumChr2(clusters = clusters, ctss = ctss, chrom = chrom, str = "-")
    c(plus.cumsum, minus.cumsum)
  }
  clusters.cumsum <- unlist(bplapply( seqlevels(clusters), getCumSumChrStrand
                                    , BPPARAM = CAGEr_Multicore(useMulticore, nrCores)))
    names(clusters.cumsum) <- names(clusters)
    clusters.cumsum
})

This deals with '+' strand data first then '-' and concat the two results for the same chromosome together and assigned the names at line 99. These code assume the order of genome ranges are sorted by strand with all '+' ranges before '-' ranges for the same chromosome. Thus, if the ranges are mixed with disordered '+' and '-', the names(clusters.cumsum) will not correspond to names(clusters) and would generated disordered results.

Here is the situation when calling consensus clusters in the AggregationMethods.R line 163, the rs would be ordered by the as.factor(names), which would reorder the results by alphabet and disrupt the original order that all '+' ranges come before '-' ranges for the same chromosome. Thus, in line 172, the consensus.clusters[rownames(counts)] recorded the consensus clusters by alphabet. This leads to the conflict with the underlying assumption in the above CumulativeDistributionFunctions.R function and cause the bug.

setMethod( ".CCtoSE"
         , c(se = "RangedSummarizedExperiment")
         , function(se, consensus.clusters, tpmThreshold = 1) {
    if (is.null(assays(se)[["normalizedTpmMatrix"]]))
      stop("Needs normalised data; run ", sQuote("normalizeTagCount()"), " first.")
    if (is.null(rowRanges(se)$cluster))
      rowRanges(se)$cluster <- ranges2names(rowRanges(se), consensus.clusters)

    if (tpmThreshold > 0)
      se <- se[rowSums(DelayedArray(assays(se)[["normalizedTpmMatrix"]])) > tpmThreshold,]

    .rowsumAsMatrix <- function(DF, names) {
      rs <- rowsum(as.matrix(DelayedArray(DF)), as.factor(names))
      if (rownames(rs)[1] == "") # If some CTSS were not in clusters
        rs <- rs[-1, , drop = FALSE]
      rs
      }

    counts <- .rowsumAsMatrix(assays(se)[["counts"]], rowRanges(se)$cluster)
    norm   <- .rowsumAsMatrix(assays(se)[["normalizedTpmMatrix"]], rowRanges(se)$cluster)

      SummarizedExperiment( rowRanges = consensus.clusters[rownames(counts)]
                          , assays    = SimpleList( counts     = counts
                                                  , normalized = norm))
})

To solve this, reorder = FALSE should be added in rowsum at line 163. Thus, the code should change from

rs <- rowsum(as.matrix(DelayedArray(DF)), as.factor(names))

to

rs <- rowsum(as.matrix(DelayedArray(DF)), as.factor(names), reorder = FALSE)
CAGEr • 1.5k views
ADD COMMENT
0
Entering edit mode

Thank you very much, this is very useful! I regret that your post did not catch my attention earlier!

ADD REPLY
2
Entering edit mode
Meng ▴ 30
@b971d31d
Last seen 16 months ago
United States

Another bug in the start and end coordinations of consensus clusters. For AggregationMethods.R line 132 and 133

end(x)   <- mcols(x)[[paste0("q_", qUp) ]] + start(x)
start(x) <- mcols(x)[[paste0("q_", qLow)]] + start(x)

should be

end(x)   <- decode(mcols(x)[[paste0("q_", qUp) ]]) + start(x) - 1
start(x) <- decode(mcols(x)[[paste0("q_", qLow)]]) + start(x) - 1
ADD COMMENT
0
Entering edit mode

Thanks again! I found the bug by chance last week. In my fix I did not decode but I had to specify integer type for the addition (+ 1L).

ADD REPLY
2
Entering edit mode
Meng ▴ 30
@b971d31d
Last seen 16 months ago
United States

I found another bug that not only consensus cluster results are disordered but also the tag clusters results are disordered.

This is due to that in CumulativeDistributionFunctions.R line 97 deals with each chromosome data in the order or seqlevels(clusters), which is chr1, chr2 ... chr10, chr11, ...

However, when generating tagClusters in ClusteringFunctions.R line 210 seqnames = Rle(factor(clusters$chr)), with default levels, the order of chromosome would become chr1, chr10, chr11 ... chr2 ... after sorting at line 218. This doesn't match the order of results generated in CumulativeDistributionFunctions.R.

To fix this, line 210 of ClusteringFunctions.R:

gr <- GRanges( seqnames = Rle(factor(clusters$chr))

should be changed to:

gr <- GRanges( seqnames = Rle(factor(clusters$chr, levels = unique(clusters$chr)))
ADD COMMENT
0
Entering edit mode

Thanks! I think that I will fix this one using seqinfo= to inherit the seqlevel order from the BSgenome object or the CTSS genomic ranges..

ADD REPLY
0
Entering edit mode

Thanks, my issue has been fixed.

ADD REPLY
0
Entering edit mode

Cool, can I ask you to mark it as such by clicking the accept button of my last answer about updates? This will remove the issue from the list of unsolved ones.

ADD REPLY
1
Entering edit mode
@charles-plessy-7857
Last seen 14 months ago
Japan

Thank you again for your report. I beleive that I have corrected the bugs in the following updates of Bioconductor's release and devel branches:

ADD COMMENT

Login before adding your answer.

Traffic: 571 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6