Question

findSpliceOverlaps Buffer Error

1

Entering edit mode

Dario Strbenac ★ 1.5k

@dario-strbenac-5916

Last seen 3 days ago

Australia

I have the GENCODE human annotation and a moderately sized RNA-seq sample. I created exonsForTranscripts from a TxDb object.

> length(exonsForTranscripts)
[1] 197538
> length(sampleReads)
[1] 35929048

When I use findSpliceOverlaps, it runs for almost one hour, then ends with an error:

> splicedOverJunctions <- findSpliceOverlaps(sampleReads, exonsForTranscripts)
Error in elementLengths(setdiff(qrng, srng)) :
  error in evaluating the argument 'x' in selecting a method for function 'elementLengths': Error in gaps(union(gaps(x), y), start = startx, end = endx) :
  error in evaluating the argument 'x' in selecting a method for function 'gaps': Error in .Call2("CompressedIRangesList_reduce", x, drop.empty.ranges,  :
  _get_new_buflength(): MAX_BUFLENGTH reached

Is it possible to do splice junction counting on a transcript database with as many transcript as GENCODE's ? I am using a 64-bit server with 512 GB of memory.

GenomicAlignments findSpliceOverlaps • 1.5k views

ADD COMMENT • link updated 8.2 years ago by Hervé Pagès 16k • written 8.2 years ago by Dario Strbenac ★ 1.5k

score 1 · Answer 1 · 2016-11-17

1

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 4 weeks ago

Seattle, WA, United States

Hi,

Please provide code that we can use to reproduce the problem and also make sure that you use the latest version of BioC (the error message shows internal use of elementLengths which was replaced with elementNROWS in BioC 3.2, that is more than 1 year ago).

findSpliceOverlaps() calls the gaps method for IRangesList objects internally. It looks like you're hitting the maximum size of the internal buffers used by this method at the C level. This maximum size was doubled back in July (see commit 119473 in S4Vectors) and is now very close to 2^31. So maybe using the current release will help. Note that we can't make those buffers bigger at the moment without starting to support long Vector objects first, which is on our TODO list for BioC 3.5.

Finally please note that providing your sessionInfo() is a must when reporting a problem like this one.

Thanks,

H.

ADD COMMENT • link 8.2 years ago Hervé Pagès 16k

0

Entering edit mode

You are correct that I have been using Bioincoductor 3.2 packages. I mistakenly thought this part of Bioconductor was constant over an extended time, so I did not provide session information. I tried using the 3.4 set of packages, but I get the error

*** caught bus error ***
address 0x13cf1068, cause 'non-existent physical address'

The R process was using 45 GB of RAM at that time and the computer was barely responsive to input.

27418 dario 20 0 47.240g 0.045t 3984 S 0.0 47.1 12:12.80 R

The code I used is simply:

GENCODEgenesDatabase <- makeTxDbFromGFF("databases/hg19/gencode.v25.annotation.gff3")
exonsByTranscripts <- exonsBy(GENCODEgenesDatabase, "tx", use.names = TRUE)
splicedReads <- findSpliceOverlaps(bamFilePath, exonsByTranscripts)

Can you reproduce this excessive RAM consumption? I use GenomicAlignments 1.10 in a new session of R 3.3.1.

ADD REPLY • link 8.2 years ago Dario Strbenac ★ 1.5k

0

Entering edit mode

Sorry but I can't use this code to reproduce the problem. I don't have the gencode.v25.annotation.gff3 file on my computer and you're not telling us where you got it from. Also I don't know what bamFilePath is. In other words, it's not standalone code. Standalone code is code other people can copy/paste in their session and it should just work. That means the code should also contain the necessary library() statements so we don't have to guess which packages to load to make it work.

Also you're still not providing your sessionInfo().

Thanks,

H.

ADD REPLY • link 8.2 years ago Hervé Pagès 16k

0

Entering edit mode

I am unable to reproduce it with the latest versions of packages and publicly available data.

ADD REPLY • link 8.2 years ago Dario Strbenac ★ 1.5k

0

Entering edit mode

Thanks for sending me the dataset privately (35929048 paired-end reads). With this dataset findSpliceOverlaps() took about 2h to complete and used more than 200 Gb of RAM on a Linux server at Fred Hutch. It returned a Hits object with 144261871 annotated hits between the reads and the transcripts. I spent some time looking at the implementation of findSpliceOverlaps() and the reason it uses so much memory is because of its intensive use of set operations like setdiff(), gaps(), and intersect() on GRangesList and IRangesList objects that are parallel to (i.e. same length as) the returned Hits object. setdiff() in particular is very expensive on these objects. Addressing this would require some important refactoring of the function (e.g. make it use overlap encodings and isCompatibleWithSplicing() internally) but I don't have time to work on this at the moment.

Note that the GenomicAlignments package also provides findCompatibleOverlaps() which is similar to findSpliceOverlaps() but is about 15x faster and uses 10x less memory on the dataset you sent me. It uses overlap encodings and isCompatibleWithSplicing() internally. See ?findCompatibleOverlaps for more information. I just committed a couple of fixes and improvements to this function in devel so make sure to use GenomicAlignments >= 1.11.5 if you intend to try it.

Cheers,

H.

ADD REPLY • link 8.2 years ago Hervé Pagès 16k

0

Entering edit mode

This is useful to know. It's not clear from the documentation what the difference to the end user is between findSpliceOverlaps and findCompatibleOverlaps. findSpliceOverlaps mentions unique exons whereas findCompatibleOverlaps doesn't, so it's ambiguous what exactly findCompatibleOverlaps considers as compatible. The Overlap Encodings vignette uses "compatible" a lot, but nowhere defines it explicitly.

ADD REPLY • link 8.2 years ago Dario Strbenac ★ 1.5k

1

Entering edit mode

findCompatibleOverlaps() is only about detecting aligned reads that are compatible with the splicing of the transcript that they overlap. findSpliceOverlaps() does a little bit more: it also tells you whether the read is compatible with only one annotated transcript.

The Overlap Encodings vignette contains some ASCII art showing splice compatible overlaps e.g.:

  - read (no gap):            oooooooo
  - transcript:     ...   >>>>>>>>>>>>>>   ...

  - read (1 gap):             ooooo---ooo
  - transcript:     ...   >>>>>>>>>   >>>>>>>>>   ...

  - read (2 gaps):              oo---ooooo---o
  - transcript:     ...   >>>>>>>>   >>>>>   >>>>>>>   ...

and so on... (you can easily guess what it would look like for a read with 3 gaps).

The document also contains ASCII art showing splice compatible overlaps for paired-end reads.

Note that I shouldn't use the term gaps anymore in this vignette to design the regions on the reference sequence that correspond to the N's in the CIGAR. These regions are called skipped regions in the SAM Spec, so, a couple of years ago, I started to use that term consistently in GenomicAlignments. However I forgot to modify the Overlap Encodings vignette. This is now corrected in GenomicAlignments 1.11.6.

Finally be aware that, according to some comments I saw in its source code, the current implementation of findSpliceOverlaps() doesn't work properly on paired-end reads where the 2 ends overlap, which is a situation I've seen in the dataset you sent me. I added a note in the man page about this.

H.

ADD REPLY • link 8.2 years ago Hervé Pagès 16k

0

Entering edit mode

I'm glad to know of this limitation with overlapping paired end reads. Indeed, the sample preparation by the core facility was not as good as hoped for.

ADD REPLY • link 8.2 years ago Dario Strbenac ★ 1.5k