Hey, I have good experience with Rcpp, I manage to create IRanges objects etc, and pass them back to R, but GRanges is a bit trickier, since it uses factor for some columns like strand etc.
IRanges is easy, something like this:
S4 getIRangesFromC(int start, int end){ S4 iranges("IRanges"); iranges.slot("start") = start; iranges.slot("width) = end-start+1; return iranges; }
This is my code so far, it can return GRanges but is not valid for conversion to GRangesList, because columns are wrong types.
#IN R ->:
library(Rcpp) require("IRanges") require("GenomicRanges") Sys.setenv("PKG_CXXFLAGS"="-std=c++11") sourceCpp("foo.cpp") a = GRangesC() #<----GRanges object returned from rcpp here
//IN foo.cpp ->
#include <Rcpp.h> using namespace Rcpp; using ui = unsigned int; // [[Rcpp::export]] S4 GRangesC(){ S4 G("GRanges"); //Construct empty GRanges ui seqnames = 1; //Chromosone number S4 ranges ("IRanges"); //Ranges used ranges.slot("start") = 0; ranges.slot("width") = 2; char strand = '+'; //Strand string names = "abc"; //Names of GRanges G.slot("seqnames")= seqnames; G.slot("ranges")= ranges; G.slot("strand")= strand; return G; }
Error is this:
> GRangesList(a) Error in validObject(.Object) : invalid class “GRanges” object: 1: invalid object for slot "seqnames" in class "GRanges": got class "numeric", should be or extend class "Rle" invalid class “GRanges” object: 2: invalid object for slot "strand" in class "GRanges": got class "character", should be or extend class "Rle" invalid class “GRanges” object: 3: 'mcols(x)' is not parallel to 'x'
Is there a better and more safe way to make GRanges from Cpp that uses a safe constructor for GRanges, and not the .slot method ?
BTW: reason I do this is because GRanges is too slow in R for millions of rows, it takes days to combine with lapply etc.
Direct slot access violates the API; I'd instead arrange to call the constructor (
GenomicRanges::GRanges())
from C++.Okey, thanks, I tried this:
I use something else than depend to include the constructor ?
Use the 'ADD REPLY' button rather than 'Add Answer'. I was thinking more along the lines of http://gallery.rcpp.org/articles/r-function-from-c++/
Hmm, reading your question all the way through ;) ... GRanges can easily work on millions of rows, but you're probably trying to do operations one row at a time, instead of on vectors. e.g.,
And scaling the 'vectorized' version
So what is it that you're actually trying to do?
My objective is: find orfs, do some fancy stuff with each orf, return a grangesList of all processed orfs, there are around 2 million orfs in each experiment, and I do 2000 experiments.
So can you construct a simple example of a GRanges object that is big, and a simple operation that you've implemented and that has driven you to C++?
Ok, here is how it works, if you think c++ is overkill for this, maybe I can stay on vectorization.
So I do not know how many uORFs there will be before I am finished, the findInFrameUORF function looks like this:
Do I need c++ for this, or would vectorization be just as good ?
BTW: Right now the running time for the top lapply is 8 hours.
As written,
getFastqSeq()
andfind_in_frame_ORFs()
are loop invariants (don't depend on iteration, and can be moved outside the lapply. The iteration is nowSo the question is, can
map_granges()
be vectorize?What is
map_granges()
?For what it's worth, my changes as a github gist.
BTW, the getFastaSeq and find_ind_frame_ORFs are not loop invariants, the number is passed by global name.
This is map_granges: it maps IRanges to GRanges with correct columns, so map means transform to
So final question now, is there a good way to make all of this vectorizable or should I make it in cpp ?
It would help a lot to have code that I could actually execute. For instance, it looks like fiveUTRs is a list-GRangesList ? but it could be any number of things. Can you make a simple fully reproducible example (one that takes just a few seconds to run). This of course requires that you provide the definition of mapto_GRanges... Maybe this is a package or code that is or could be made available in GitHub?
yes, I found out it's not possible to vectorize, so I am going to make a C function for making huge GRangesList files that are not vectorizable, I think this will be usefull for others, so it might be shared on GitHub, thank you for your help.