fast apply on RleList
2
0
Entering edit mode
@storybenjamin-11722
Last seen 3 days ago
Switzerland

Hi all,

I'm trying to parse a pure text file with coverage (essentially it's an "RleList"-style raw text file).

Anyway, converting it an RleList is a piece of cake it's just when I want to do any summarization of the data its running much longer than I would expect?

There are ~200k Rle elements of varying length (i.e. every exon in the human genome).

Any suggestions for speeding up the sapply part? I feel like there was a way to run a function quickly on an RleList but it's been a while maybe I'm getting rusty...

Specifically: cov.out <- t(sapply(cov.gr,compute_coverage_metrics))


# load it back into R
test.gr <- import.bed('targets.bed')
coverage.df <- import.bed('/mnt/HDD2/testing_ben/test.coverage.tsv')
coverage.df <- coverage.df[-c(1)]
coverage.df <- head(coverage.df,length(coverage.df)-24)
##colnames(coverage.df) <- c('chr','start','end','scores')
##coverage.df <- GRanges(coverage.df)
cov <- coverage(coverage.df, weight=as.numeric(values(coverage.df)$name))

## sapply sum stats
cov.gr <- cov[test.gr]
cov.out <- t(sapply(cov.gr,compute_coverage_metrics))

## cov_views <- Views(cov, as(test.gr, "RangesList"))


compute_coverage_metrics <- function(cov_rle) {
  rl <- runLength(cov_rle)
  rv <- runValue(cov_rle)
  ## len
  total <- sum(rl)
  cov_nucl_20      <- sum(rl[rv >= 20])
  mean_cov <- sum(rl * rv) / total
  return(data.frame(
    tot_nucl = total,
    cov_nucl_20 = cov_nucl_20,
    coverage = mean_cov
  ))
}

R is locked in v4.3.1 if it matters.

GenomicRanges RleList • 768 views
ADD COMMENT
0
Entering edit mode

Here's a test data.set if you need:

library(GenomicRanges)
library(rtracklayer)

# gr test
test.gr <- GRanges(seqnames = c("chr1", "chr1", "chr1"),
                   ranges = IRanges(start = c(100, 200, 300),
                                    end = c(150, 250, 350)))
## cov test
coverage.df <- test.gr
coverage.df$name <- as.character(c(5, 10, 15))
ADD REPLY
1
Entering edit mode
@storybenjamin-11722
Last seen 3 days ago
Switzerland

solved with viewApply (code is simplified below) + map reduce on chrs

test.split <- split(test.gr,seqnames(test.gr))

final.stats <- mclapply(unique(seqnames(test.gr)),function(chr){
    tested <- ranges(test.split[[chr]])
    views <- Views(cov[chr],tested)[[1]]
    x <- viewApply(views,compute_coverage_metrics)
    x <- data.frame(apply(x,1,as.numeric))
    return(x)
  },mc.cores=28)
final.stats <- do.call(rbind,final.stats)
ADD COMMENT
0
Entering edit mode
@james-w-macdonald-5106
Last seen 7 hours ago
United States

There is already a 'mean' function that will dispatch on an RleList that you can use. I don't have a long RleList in hand to test it on, but it should be fast.

0
Entering edit mode

Hi James, thanks for the reply. I ultimately figured it out ... viewApply showed a massive speed increase for me (see other answer) - I was also doing not just the mean but a bunch of other calculations (simplified for the example)

ADD REPLY

Login before adding your answer.

Traffic: 470 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6