Question

Forging a BSgenome with lots of seqnames fails

0

Entering edit mode

Ryan C. Thompson ★ 7.9k

@ryan-c-thompson-5618

Last seen 6 months ago

Icahn School of Medicine at Mount Sinai…

I am trying to use BSgenome::forgeBSgenomeDataPkg to build a genome for this genome, which has over 7000 unplaced contigs and scaffolds. This means that the seqnames: line in the BSgenome seed file must be very very long (136762 characters), and I seem to be running into some line length limitation in R that causes this line to be truncated. However, it seems the the Debian control file format requires each field to be on one line, so I can't break up this long line [Edit: I broke up the line as suggested by Dan but it just gave a different error]. So it seems I can't forge a BSgenome package for this genome, and I suspect I will run into the same problem trying to build a TxDb as well. It looks like the problematic code is in copySubstitute or a function called by it. I tried to make a test case, but I keep running into line length limitations and can't get an example that actually parses. I'm not sure where to go from here. Can anyone offer some help?

Here is the seed file I'm using. I've also added a log of the error that I get when trying to forge it.

EDIT: I have tried again using a 2bit file as suggested. Unfortunately, this simply causes an immediate error. I've updated the gist above with the latest content and error text.

bsgenome • 1.5k views

ADD COMMENT • link 10.3 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

I know that in DCF files you can break up long lines as long as continuation lines start with whitespace. But I don't know whether this helps you, and can't offer anything about this specific issue.

ADD REPLY • link 10.3 years ago Dan Tenenbaum ★ 8.2k

0

Entering edit mode

I tried that, and it got farther than before, but I got a new error, which I will add to the gist shortly.

ADD REPLY • link 10.3 years ago Ryan C. Thompson ★ 7.9k

score 0 · Answer 1 · 2015-01-16

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 2 days ago

Seattle, WA, United States

Hi Ryan,

Thanks for the report, I'll look into this. In the meantime a workaround is to supply the sequences in the .2bit format instead of FASTA. Then the seqnames field is not required anymore (you could still use it if you wanted to select only a subset of the file or reorder the sequences). See the BSgenomeForge vignette for more information.

To convert from FASTA to .2bit:

library(Biostrings)
seqs <- readDNAStringSet("path/to/fasta/file")

... maybe reorder or drop some sequences ...

library(rtracklayer)
export.2bit(seqs, "path/to/output/file")

HTH,

H.

ADD COMMENT • link 10.3 years ago Hervé Pagès 16k

0

Entering edit mode

Actually, I am using 2bit format. I didn't realize the seqnames field was optional in that case. I'll try this.

ADD REPLY • link 10.3 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

I tried the 2bit suggestion, and it didn't work. The error message was

Error in makeS4FromList("BSgenomeDataPkgSeed", x) :
  some names in 'x' are not valid BSgenomeDataPkgSeed slots (seqfile_name)

Please see the gist for the the full error details and sessionInfo().

ADD REPLY • link 10.3 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

Hi Ryan,

I see at least 2 problems with your seed file:

SrcDataFiles1 is not a valid field, use SrcDataFiles instead
seqfile_name must be the short file name, not the path to the file

Cheers,

H.

ADD REPLY • link 10.3 years ago Hervé Pagès 16k