I am trying to use BSgenome::forgeBSgenomeDataPkg to build a BSgenome package for this genome, which has over 7000 unplaced contigs and scaffolds. This means that the seqnames: line in the BSgenome seed file must be very long (136762 characters), and I seem to be running into a line length limitation in R that causes this line to be truncated. However, it seems that the Debian control file format requires each field to be on one line, so I can't break up this long line [Edit: I broke up the line as suggested by Dan, but it just gave a different error]. So it seems I can't forge a BSgenome package for this genome, and I suspect I will run into the same problem when trying to build a TxDb. It looks like the problematic code is in copySubstitute or a function called by it. I tried to make a test case, but I keep running into line length limitations and can't get an example that actually parses. I'm not sure where to go from here. Can anyone offer some help?
Here is the seed file I'm using. I've also added a log of the error that I get when trying to forge it.
EDIT: I have tried again using a 2bit file as suggested. Unfortunately, this simply causes an immediate error. I've updated the gist above with the latest content and error text.
I know that in DCF files you can break up long lines as long as continuation lines start with whitespace. But I don't know whether this helps you, and can't offer anything about this specific issue.
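As a rough illustration of the continuation-line layout described above, a seqnames: field split across several lines might look like the fragment below. The sequence names are hypothetical placeholders, not taken from the actual seed file, and the sketch assumes the field holds an R expression such as a c(...) call. It only shows the DCF syntax, where every continuation line starts with whitespace; whether forgeBSgenomeDataPkg then handles the field correctly is a separate question.

```
seqnames: c("chr1", "chr2", "chr3",
    "scaffold_0001", "scaffold_0002",
    "scaffold_0003", "scaffold_0004")
```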
I tried that, and it got farther than before, but I got a new error, which I will add to the gist shortly.