I have a large RangedSummarizedExperiment generated by recount3 (>1000 samples, >10M exon junctions). I'd like to compute some simple summary statistics for each sample. My code works but is rather slow (~30s per sample on my machine).
library(recount3)

# Takes about 5 minutes with cached files; another couple of minutes to download
ccle_jxn = create_rse(
  subset(available_projects(), project == "SRP186687" & project_type == "data_sources"),
  type = "jxn"
)

# For each sample, count the junctions supported by at least min_reads reads
min_reads = 5
total_junctions = lapply(colData(ccle_jxn)$sra.sample_name, function(cell_line){
  x = subset(ccle_jxn, select = sra.sample_name == cell_line)
  return(sum(assay(x) >= min_reads))
})
This tweak makes it about 2x faster per sample:
min_reads = 5
total_junctions = lapply(colData(ccle_jxn)$sra.sample_name, function(cell_line){
  # Index the assay matrix directly instead of subsetting the whole object
  sample_idx = which(colData(ccle_jxn)$sra.sample_name == cell_line)
  return(sum(assay(ccle_jxn)[, sample_idx] >= min_reads))
})
However, it is a bit less readable. Is there a way to optimise this further?
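One direction that might be worth checking is replacing the per-sample loop with a single column-wise sum over the thresholded matrix. The sketch below uses a small toy sparse matrix, `counts` (a hypothetical stand-in, not the real data), and assumes that `assay(ccle_jxn)` is a sparse matrix from the Matrix package, or at least an object for which `colSums()` is defined. It only shows that the loop and the vectorised call give the same per-sample counts; it says nothing about timings on the real object.

```r
library(Matrix)

# Toy stand-in for assay(ccle_jxn): rows are junctions, columns are samples.
# (Hypothetical data, used only to illustrate the idea.)
set.seed(1)
counts = rsparsematrix(1000, 6, density = 0.1, rand.x = function(n) rpois(n, 5))
colnames(counts) = paste0("sample", 1:6)

min_reads = 5

# Per-sample loop, equivalent to the approach in the question
loop_result = sapply(colnames(counts), function(s) sum(counts[, s] >= min_reads))

# One vectorised pass: threshold the whole matrix, then sum each column
vec_result = colSums(counts >= min_reads)
```

On the real object this would presumably read `colSums(assay(ccle_jxn) >= min_reads)`, with the result in the same column order as `colData(ccle_jxn)`.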