Question

Rsubread: subjunc() loads index repeatedly

Gerhard Thallinger ▴ 180

@gerhard-thallinger-1552

Austria

I am aligning RNA-seq data from 36 samples to the CHM13v2.0 reference using subjunc():

align.res <- subjunc(index="chm13v2.0_maskedY", readfile1=fwdname, readfile2=revname, output_file=bamname, nthreads = 12)

where fwdname, revname, and bamname represent character vectors with 36 elements. Total alignment time is 16 minutes per sample on average; of these, 5 minutes are spent on "Global environment is initialised" for each sample, where it seems that the index (~18 GB) is loaded into memory. This is also reflected in the working set of the Rgui process, which drops to 1.6 GB after completion of a sample, increasing to ~19 GB during preparation and peaking at 21.4 GB during alignment.
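To put those figures in proportion, here is a small sketch of the arithmetic. It uses only the numbers quoted above (36 samples, 16 minutes per sample, 5 minutes of index loading); the variable names are illustrative and nothing is measured by the snippet itself.

```r
# Figures quoted in the post; nothing here is measured by this snippet.
n_samples <- 36
total_min <- 16   # average wall time per sample, in minutes
load_min  <- 5    # time spent in "Global environment is initialised"

load_share     <- load_min / total_min        # fraction of each run spent loading
total_load_hrs <- n_samples * load_min / 60   # cumulative loading time, in hours
total_run_hrs  <- n_samples * total_min / 60  # cumulative run time, in hours

cat(sprintf("Index loading: %.0f%% of each run, %.1f of %.1f hours overall\n",
            100 * load_share, total_load_hrs, total_run_hrs))
```

In other words, roughly 31% of each run, or about 3 of the 9.6 hours for the whole batch, goes into loading the same index.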

My question is whether there is a parameter that tells subjunc() to reuse the index already loaded for the first sample when aligning the subsequent samples.

The environment is as follows:

R version 4.3.0 (2023-04-21 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Rsubread_2.14.2

loaded via a namespace (and not attached):
[1] compiler_4.3.0 Matrix_1.5-4.1 tools_4.3.0    grid_4.3.0     lattice_0.21-8

P.S.: The ungapped, single-block index was created with buildindex(basename="chm13v2.0_maskedY", reference="chm13v2.0_maskedY.fa.gz", memory=18000)

score 1 · Answer 1 · 2023-07-13

Hi Gerhard, I think it is a very good suggestion to reuse the index-in-memory for mapping many samples in the same run. But for now the subjunc and subread aligners don't have this option. Each sample (a pair of input files in your case) is mapped from the step of loading the index.
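Since each pair of files is mapped from the index-loading step anyway, one way to see the per-sample cost is to call the aligner once per sample and time each call. The following is only a sketch: `time_per_sample` is a hypothetical helper, not part of Rsubread, and the `aligner` argument exists solely so the loop can be tried with a stand-in function. It assumes subjunc() accepts the same arguments as in the call in the question.

```r
# Map each sample in its own call and record the elapsed time per sample.
# `time_per_sample` is a hypothetical helper, not an Rsubread function.
# `aligner` defaults to Rsubread::subjunc; the default is only evaluated
# when no stand-in function is supplied.
time_per_sample <- function(index, fwd, rev, bam, nthreads = 12,
                            aligner = Rsubread::subjunc) {
  stopifnot(length(fwd) == length(rev), length(fwd) == length(bam))
  elapsed <- vapply(seq_along(fwd), function(i) {
    t <- system.time(
      aligner(index = index, readfile1 = fwd[i], readfile2 = rev[i],
              output_file = bam[i], nthreads = nthreads)
    )
    t[["elapsed"]]  # wall-clock seconds for this sample
  }, numeric(1))
  data.frame(sample = basename(bam), elapsed_min = elapsed / 60)
}

# Usage with the vectors from the question:
# timings <- time_per_sample("chm13v2.0_maskedY", fwdname, revname, bamname)
```

The returned data frame has one row per sample, which makes it easy to compare runs before and after a change such as moving the index to faster storage.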

I noticed that you ran the code on Windows. In my experience, Windows is slow at allocating and operating on large blocks of memory (such as the Subread index). If you can run it on Linux, loading the index will be much faster.

score 1 · Answer 2 · 2023-07-16

The Subjunc and Subread aligners do not reuse a previously loaded index because they support split indices, where only part of the index is in memory at any time. Different parts of the index are therefore in memory at different times, and it is impossible to reuse them for processing subsequent samples.

score 0 · Answer 3 · 2023-08-14

Thank you both for your answers.

> I noticed that you ran the code on Windows. In my experience, Windows is slow at allocating and operating on large blocks of memory (such as the Subread index). If you can run it on Linux, loading the index will be much faster.

Unfortunately, I don't have access to a comparable Linux-based system to test this. However, I moved the index to an NVMe-based SSD, which reduced index loading to about 2 minutes per mapping.

> The Subjunc and Subread aligners do not reuse a previously loaded index because they support split indices, where only part of the index is in memory at any time. Different parts of the index are therefore in memory at different times, and it is impossible to reuse them for processing subsequent samples.

If I understand you correctly, this applies to split indices only; in my case this is a single-block index and it should be possible to reuse the already loaded index for the mapping of subsequent samples. This would reduce total mapping time considerably, especially when processing a large number of samples at once. As more memory tends to be available in general, this might be a worthwhile change that many users could benefit from.
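As a rough illustration of the potential saving, here is a back-of-the-envelope estimate using the figures from this thread. It assumes that the mapping itself still takes 16 - 5 = 11 minutes per sample and that only the index-loading time changes. The "reuse" case is hypothetical, since the aligners do not offer this option at present.

```r
# Hypothetical estimate: assumes the per-sample mapping time stays at
# 16 - 5 = 11 minutes and that only the index-loading time changes.
n_samples <- 36
map_min   <- 16 - 5   # mapping time without index loading (from the question)
load_min  <- 2        # index loading from the NVMe SSD (from this reply)

current_min <- n_samples * (map_min + load_min)  # index loaded for every sample
reuse_min   <- n_samples * map_min + load_min    # index loaded once (not supported today)
saved_min   <- current_min - reuse_min

cat(sprintf("Current: %.0f min, with reuse: %.0f min, saved: %.0f min (%.0f%%)\n",
            current_min, reuse_min, saved_min, 100 * saved_min / current_min))
```

Under these assumptions the batch would take 398 instead of 468 minutes, a saving of 70 minutes (about 15%); with the original 5-minute loading time the saving would be correspondingly larger.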