Hi all,
I'm trying to parse a plain-text coverage file (essentially an "RleList"-style raw text file).
Converting it to an RleList is a piece of cake; it's just that any summarization of the data runs much longer than I would expect.
There are ~200k Rle elements of varying length (i.e. every exon in the human genome).
Any suggestions for speeding up the sapply part? I feel like there was a way to run a function quickly across a whole RleList (see the untested sketch after the code below), but it's been a while and maybe I'm getting rusty...
Specifically: cov.out <- t(sapply(cov.gr, compute_coverage_metrics))
# load it back into R
library(rtracklayer)     # import.bed()
library(GenomicRanges)   # coverage(), RleList subsetting

test.gr <- import.bed('targets.bed')
coverage.df <- import.bed('/mnt/HDD2/testing_ben/test.coverage.tsv')
coverage.df <- coverage.df[-c(1)]                            # drop the first record
coverage.df <- head(coverage.df, length(coverage.df) - 24)   # drop the trailing 24 records
##colnames(coverage.df) <- c('chr','start','end','scores')
##coverage.df <- GRanges(coverage.df)
## per-base coverage, weighted by the score stored in the 'name' column
cov <- coverage(coverage.df, weight = as.numeric(values(coverage.df)$name))
## helper: summary stats for one Rle of per-base coverage
compute_coverage_metrics <- function(cov_rle) {
  rl <- runLength(cov_rle)
  rv <- runValue(cov_rle)
  total <- sum(rl)                   # total bases in the range
  cov_nucl_20 <- sum(rl[rv >= 20])   # bases covered at >= 20x
  mean_cov <- sum(rl * rv) / total   # mean coverage over the range
  return(data.frame(
    tot_nucl = total,
    cov_nucl_20 = cov_nucl_20,
    coverage = mean_cov
  ))
}

## sapply sum stats -- this is the slow part
cov.gr <- cov[test.gr]
cov.out <- t(sapply(cov.gr, compute_coverage_metrics))
## cov_views <- Views(cov, as(test.gr, "RangesList"))
R is locked at v4.3.1, if it matters.
Here's a test dataset if you need it:
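If the real file is awkward to grab, the purely synthetic stand-in below (made-up ranges and random depths, nothing like my actual data) should reproduce the shape of the problem at a smaller scale:

## purely synthetic stand-in -- fake ranges and random depths, just to reproduce the slowdown
library(GenomicRanges)
set.seed(1)
n <- 5000                    # scale this up toward ~200k to see the real pain
fake_targets <- GRanges("chr1",
                        IRanges(start = seq(1, by = 500, length.out = n), width = 200))
fake_cov <- coverage(fake_targets, weight = rpois(n, lambda = 30))   # fake per-range depths
fake_cov.gr <- fake_cov[fake_targets]
system.time(t(sapply(fake_cov.gr, compute_coverage_metrics)))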