Hi all,
I'm trying to parse a plain-text coverage file (essentially an "RleList"-style raw text file).
Converting it to an RleList is a piece of cake; it's just that any summarization of the data runs much longer than I would expect.
There are ~200k Rle elements of varying length (i.e. every exon in the human genome).
Any suggestions for speeding up the sapply part? I feel like there was a way to run a function quickly across a whole RleList (see the untested sketch after the code below), but it's been a while and maybe I'm getting rusty...
Specifically: cov.out <- t(sapply(cov.gr, compute_coverage_metrics))
# load it back into R
library(rtracklayer)     # import.bed()
library(GenomicRanges)   # coverage(), RleList subsetting

test.gr <- import.bed('targets.bed')
coverage.df <- import.bed('/mnt/HDD2/testing_ben/test.coverage.tsv')
coverage.df <- coverage.df[-c(1)]                            # drop the first record
coverage.df <- head(coverage.df, length(coverage.df) - 24)   # drop the trailing 24 records
##colnames(coverage.df) <- c('chr','start','end','scores')
##coverage.df <- GRanges(coverage.df)
## per-base coverage, weighted by the score stored in the 'name' column
cov <- coverage(coverage.df, weight = as.numeric(values(coverage.df)$name))
## helper: summary stats for one Rle of per-base coverage
compute_coverage_metrics <- function(cov_rle) {
  rl <- runLength(cov_rle)
  rv <- runValue(cov_rle)
  total <- sum(rl)                   # total bases in the range
  cov_nucl_20 <- sum(rl[rv >= 20])   # bases covered at >= 20x
  mean_cov <- sum(rl * rv) / total   # mean coverage over the range
  return(data.frame(
    tot_nucl = total,
    cov_nucl_20 = cov_nucl_20,
    coverage = mean_cov
  ))
}

## sapply sum stats -- this is the slow part
cov.gr <- cov[test.gr]
cov.out <- t(sapply(cov.gr, compute_coverage_metrics))
## cov_views <- Views(cov, as(test.gr, "RangesList"))
R is locked at v4.3.1, if it matters.
Here's a test dataset if you need it:
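If the real file is awkward to grab, the purely synthetic stand-in below (made-up ranges and random depths, nothing like my actual data) should reproduce the shape of the problem at a smaller scale:

## purely synthetic stand-in -- fake ranges and random depths, just to reproduce the slowdown
library(GenomicRanges)
set.seed(1)
n <- 5000                    # scale this up toward ~200k to see the real pain
fake_targets <- GRanges("chr1",
                        IRanges(start = seq(1, by = 500, length.out = n), width = 200))
fake_cov <- coverage(fake_targets, weight = rpois(n, lambda = 30))   # fake per-range depths
fake_cov.gr <- fake_cov[fake_targets]
system.time(t(sapply(fake_cov.gr, compute_coverage_metrics)))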