Clustering GRanges combining percent overlap and categorical variable
1
0
Entering edit mode
enrique • 0
@enrique-11124
Last seen 7.4 years ago

Hi all,

I am analysing CNV data via GenomicRanges R package. To move forward, I would like to group (clustering) some features (CNVs) combining two criteria: 1) a pre-defined percent overlap and 2) a categorical variable (example, a variable associating the "source/collaborator" with the feature). I would like to say: "I have removed (or grouped in one instance) those features (CNVs) coming from the same study/collaborator and overlapping over 50%"

How could I to perform this task? (preferably using GenomicRanges' functions)

Thanks in advance!

genomicranges • 2.3k views
ADD COMMENT
2
Entering edit mode
@herve-pages-1542
Last seen 5 days ago
Seattle, WA, United States

Hi Enrique,

This can be tricky. Here is one way to proceed. In 4 steps. Let's say you have your features in a GRanges object gr, and that gr has a metadata column (study) that indicates the study the feature is coming from.

(1) Find self hits

hits <- findOverlaps(gr, drop.self=TRUE, drop.redundant=TRUE)

(2) Subset hits to keep only the hits that satisfy some criteria

  2-a) Keep the hits that are between features coming from the same study:

study <- mcols(gr)$study
hits <- hits[study[queryHits(hits)] == study[subjectHits(hits)]]

  2-b) Keep the hits that achieve at least 50% overlap:

x <- gr[queryHits(hits)]
y <- gr[subjectHits(hits)]
relative_overlap <- width(pintersect(x, y)) / pmin(width(x), width(y))
hits <- hits[relative_overlap >= 0.5]

(3) Cluster the features

## Think of 'hits' as a graph (the elments in 'hits' are the edges of the graph,
## the nodes are the features in 'gr'). Like the transitive.closure() function
## from the RBGL package, the clustering function below computes the transitive
## closure of the graph but in a much more efficient way.
## I already posted this function about 1 year ago here:
##   https://support.bioconductor.org/p/68021/#68129
## Maybe I should add it somewhere to our infrastructure.
extractClustersFromSelfHits <- function(hits)
{
    stopifnot(is(hits, "Hits"))
    N <- queryLength(hits)
    stopifnot(N == subjectLength(hits))
    h <- union(hits, t(hits))
    qh <- queryHits(h)
    sh <- subjectHits(h)
    cid <- cid0 <- seq_len(N)  # cluster ids
    while (TRUE) {
        cid2 <- pmin(cid, selectHits(h, "first"))
        if (identical(cid2, cid))
            break
        cid <- cid2
        h <- Hits(qh, cid[sh], N, N)
    }
    unname(splitAsList(cid0, cid))
}

clusters <- extractClustersFromSelfHits(hits)

clusters is an IntegerList object with 1 list element per cluster. Each cluster contains the indices of the features in gr that come from the same study and overlap over 50% with at least another feature in the cluster.

(4) Remove (or group in 1 instance) the features in gr that come from the same study and overlap over 50%

See A: Merge multiple genomic intervals if overlapping more than xx percents for how to "group in 1 instance" by merging all the features in each cluster into a single feature.

If you want to remove the "redundant features" from each cluster and keep only 1 feature, you would need to precise what criteria you want to use for picking "the most representative feature in each cluster". Note that a given cluster can be big and span a big region, and also some features in the cluster can be very far away from other features in the same cluster (because the clustering was based on transitive closure). So I don't think there is an obvious/natural way to do this in general (i.e. a way that would be meaningful from a biological perspective).

Anyway, hope this will help get you started.

H.

ADD COMMENT
1
Entering edit mode

Just wanted to mention that I have implemented something similar using graphs.

setAs("Hits", "graph", function(from) {
          stopifnot(queryLength(from) == subjectLength(from))
          NEL <- split(subjectHits(from), queryHits(from))
          graph::graphNEL(as.character(seq_len(queryLength(from))), NEL)
      })

findConnectedComps <- function(x) {
    g <- as(x, "graph")
    ManyToOneGrouping(unname(RBGL::connectedComp(g)))
}
comps <- findConnectedComps(hits)
gr$cluster <- as(comps, "factor")

 

ADD REPLY
0
Entering edit mode

Hello Pagès:

Thank you very much for your response. I was testing the code and works well. I would like only to point out that this approach could involve two o more categorical variable like clustering criteria. Just adding it in the step 2.a, example (also grouping by gene):

study <- mcols(gr)$study
hits <- hits[study[queryHits(hits)] == study[subjectHits(hits)]]

genes <- mcols(gr)$genes
hits <- hits[genes[queryHits(hits)] == genes[subjectHits(hits)]]

....


For the step 4, I am merging those features in the cluster rather than to pick one. I also found very difficult to try to get "a representative feature from a cluster".

Again, I appreciate your response,

Thanks,

enrique

ADD REPLY
0
Entering edit mode

Hi all,

After to be able to run this code, I was trying to scale up the analysis running, for example, a dataset with around 1.5 millions of features. However, I got some difficulties to do it in a efficient way. First, I would like to know if using a Corei7 32GB RAM Ubuntu 16.04 64bit PC is enough to perform efficiently this kind of processing on a "big data".

Regarding the code, a question is: What is the best way to process large GRangesList? Example, using lapply combined with my own function I don’t get success.

A strategy (learned from other posts), is split the GRanges object by sequences and to process each "piece" separately, OK, but even following this approach, I am getting message like: "Error:  cannot allocate vector of size 2.8 Gb". Looking at Traceback, the critical operation is:

 hits <- findOverlaps(gr, drop.self=TRUE, drop.redundant=TRUE, ignore.strand=TRUE)

Where gr is a GRanges object containing around 120 000 features. Apparently, the returned object (Hits) is huge and unmanageable by the system. How could I to improve this issue (using a code-oriented approach)? Any suggestion in general?

Thank you in advance,

enrique

ADD REPLY
1
Entering edit mode

Hi Enrique,

There is no best way to process large GRangesList objects in general. It all depends what you're doing with it exactly. Some kinds of operations can be performed very efficiently without the need for any loop by using the unlist/transform/relist trick or other trick.

In the worst case (i.e. when all the ranges overlap with each others), finding self overlaps in a GRanges object of length N will return a Hits object of length N^2. Even if you use drop.self=TRUE and drop.redundant=TRUE, the length of the returned object is still N(N-1)/2. So yes, a GRanges object that contains around 120 000 ranges can have billions of hits.

Note that for this kind of object, the process of clustering and merging all features in each cluster should reduce considerably the number of features. This suggests an incremental approach for the clustering and merging process where you start by clustering and merging a small fraction of gr (e.g. 10% of the features), then concatenate the result with the next 10% features, cluster and merge again etc... In pseudo code:

clustered <- NULL
while gr not empty:
    chunk <- head(gr, n=10000)
    clustered <- clusterAndMerge(c(clustered, chunk))
    gr <- tail(gr, n=-10000)

It seems that this should give a result close to what you would get by doing the clustering and merging of all the features in gr at once (i.e. clusterAndMerge(gr)), modulo the order of the genomic ranges in the final object (clustered). But the incremental approach should require a lot less memory because it should generate much smaller temporary Hits objects (inside the calls to clusterAndMerge). How different the final object will be from what clusterAndMerge(gr) would actually return is an interesting question per se that would deserve some close examination.

Cheers,

H.

ADD REPLY

Login before adding your answer.

Traffic: 532 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6