I have the GENCODE human annotation and a moderately sized RNA-seq sample. I created exonsForTranscripts
from a TxDb object.
> length(exonsForTranscripts) [1] 197538 > length(sampleReads) [1] 35929048
When I use findSpliceOverlaps
, it runs for almost one hour, then ends with an error:
> splicedOverJunctions <- findSpliceOverlaps(sampleReads, exonsForTranscripts) Error in elementLengths(setdiff(qrng, srng)) : error in evaluating the argument 'x' in selecting a method for function 'elementLengths': Error in gaps(union(gaps(x), y), start = startx, end = endx) : error in evaluating the argument 'x' in selecting a method for function 'gaps': Error in .Call2("CompressedIRangesList_reduce", x, drop.empty.ranges, : _get_new_buflength(): MAX_BUFLENGTH reached
Is it possible to do splice junction counting on a transcript database with as many transcript as GENCODE's ? I am using a 64-bit server with 512 GB of memory.
You are correct that I have been using Bioincoductor 3.2 packages. I mistakenly thought this part of Bioconductor was constant over an extended time, so I did not provide session information. I tried using the 3.4 set of packages, but I get the error
*** caught bus error ***
address 0x13cf1068, cause 'non-existent physical address'
The R process was using 45 GB of RAM at that time and the computer was barely responsive to input.
27418 dario 20 0 47.240g 0.045t 3984 S 0.0 47.1 12:12.80 R
The code I used is simply:
Can you reproduce this excessive RAM consumption? I use GenomicAlignments 1.10 in a new session of R 3.3.1.
Sorry but I can't use this code to reproduce the problem. I don't have the gencode.v25.annotation.gff3 file on my computer and you're not telling us where you got it from. Also I don't know what
bamFilePath
is. In other words, it's not standalone code. Standalone code is code other people can copy/paste in their session and it should just work. That means the code should also contain the necessarylibrary()
statements so we don't have to guess which packages to load to make it work.Also you're still not providing your
sessionInfo()
.Thanks,
H.
I am unable to reproduce it with the latest versions of packages and publicly available data.
Thanks for sending me the dataset privately (35929048 paired-end reads). With this dataset
findSpliceOverlaps()
took about 2h to complete and used more than 200 Gb of RAM on a Linux server at Fred Hutch. It returned a Hits object with 144261871 annotated hits between the reads and the transcripts. I spent some time looking at the implementation offindSpliceOverlaps()
and the reason it uses so much memory is because of its intensive use of set operations likesetdiff()
,gaps()
, andintersect()
on GRangesList and IRangesList objects that are parallel to (i.e. same length as) the returned Hits object.setdiff()
in particular is very expensive on these objects. Addressing this would require some important refactoring of the function (e.g. make it use overlap encodings andisCompatibleWithSplicing()
internally) but I don't have time to work on this at the moment.Note that the GenomicAlignments package also provides
findCompatibleOverlaps()
which is similar tofindSpliceOverlaps()
but is about 15x faster and uses 10x less memory on the dataset you sent me. It uses overlap encodings andisCompatibleWithSplicing()
internally. See?findCompatibleOverlaps
for more information. I just committed a couple of fixes and improvements to this function in devel so make sure to use GenomicAlignments >= 1.11.5 if you intend to try it.Cheers,
H.
This is useful to know. It's not clear from the documentation what the difference to the end user is between
findSpliceOverlaps
andfindCompatibleOverlaps
.findSpliceOverlaps
mentions unique exons whereasfindCompatibleOverlaps
doesn't, so it's ambiguous what exactlyfindCompatibleOverlaps
considers as compatible. The Overlap Encodings vignette uses "compatible" a lot, but nowhere defines it explicitly.findCompatibleOverlaps()
is only about detecting aligned reads that are compatible with the splicing of the transcript that they overlap.findSpliceOverlaps()
does a little bit more: it also tells you whether the read is compatible with only one annotated transcript.The Overlap Encodings vignette contains some ASCII art showing splice compatible overlaps e.g.:
and so on... (you can easily guess what it would look like for a read with 3 gaps).
The document also contains ASCII art showing splice compatible overlaps for paired-end reads.
Note that I shouldn't use the term gaps anymore in this vignette to design the regions on the reference sequence that correspond to the N's in the CIGAR. These regions are called skipped regions in the SAM Spec, so, a couple of years ago, I started to use that term consistently in GenomicAlignments. However I forgot to modify the Overlap Encodings vignette. This is now corrected in GenomicAlignments 1.11.6.
Finally be aware that, according to some comments I saw in its source code, the current implementation of
findSpliceOverlaps()
doesn't work properly on paired-end reads where the 2 ends overlap, which is a situation I've seen in the dataset you sent me. I added a note in the man page about this.H.
I'm glad to know of this limitation with overlapping paired end reads. Indeed, the sample preparation by the core facility was not as good as hoped for.