I've gotten base-level derfinder to work with smaller groups (~12 BAM files), but when I try it on our larger experiment (~500 BAM files) I run into an issue:
2017-03-15 10:12:00 fullCoverage: processing chromosome 9
2017-03-15 10:12:03 loadCoverage: finding chromosome lengths
2017-03-15 10:12:03 loadCoverage: loading BAM file <path>
(more BAM files)
2017-03-15 12:03:08 loadCoverage: applying the cutoff to the merged data
2017-03-15 12:03:08 filterData: normalizing coverage
2017-03-15 12:14:23 filterData: done normalizing coverage
2017-03-15 12:14:29 filterData: originally there were 138394717 rows, now there are 138394717 rows. Meaning that 0 percent was filtered.
extendedMapSeqlevels: sequence names mapped from NCBI to UCSC for species homo_sapiens
2017-03-15 12:20:22 filterData: originally there were 138394717 rows, now there are 5145304 rows. Meaning that 96.28 percent was filtered.
2017-03-15 12:20:22 sampleDepth: Calculating sample quantiles
2017-03-15 12:46:30 sampleDepth: Calculating sample adjustments
extendedMapSeqlevels: sequence names mapped from NCBI to UCSC for species homo_sapiens
2017-03-15 12:46:31 analyzeChr: Pre-processing the coverage data
Error in .Call2("Rle_constructor", values, lengths, check, 0L, PACKAGE = "S4Vectors") :
integer overflow while summing elements in 'lengths'
Calls: main ... <Anonymous> -> Rle -> Rle -> new_Rle -> .Call2 -> .Call
Execution halted
This is with:
filtered_coverage <- map(full_coverage, filterData, cutoff = 30)

analyzeChr(chr, filtered_coverage, models,
           groupInfo = test_vars,
           writeOutput = FALSE,
           cutoffFstat = 5e-02,
           nPermute = 50,
           returnOutput = TRUE,
           mc.cores = workers,
           runAnnotation = FALSE)
Is this just too many samples?
Thanks!
-Andrew