Forging a BSgenome with lots of seqnames fails
1
0
Entering edit mode
@ryan-c-thompson-5618
Last seen 2 days ago
Icahn School of Medicine at Mount Sinai…

I am trying to use BSgenome::forgeBSgenomeDataPkg to build a genome for this genome, which has over 7000 unplaced contigs and scaffolds. This means that the seqnames: line in the BSgenome seed file must be very very long (136762 characters), and I seem to be running into some line length limitation in R that causes this line to be truncated. However, it seems the the Debian control file format requires each field to be on one line, so I can't break up this long line [Edit: I broke up the line as suggested by Dan but it just gave a different error]. So it seems I can't forge a BSgenome package for this genome, and I suspect I will run into the same problem trying to build a TxDb as well. It looks like the problematic code is in copySubstitute or a function called by it. I tried to make a test case, but I keep running into line length limitations and can't get an example that actually parses. I'm not sure where to go from here. Can anyone offer some help?

Here is the seed file I'm using. I've also added a log of the error that I get when trying to forge it.

EDIT: I have tried again using a 2bit file as suggested. Unfortunately, this simply causes an immediate error. I've updated the gist above with the latest content and error text.

bsgenome • 1.4k views
ADD COMMENT
0
Entering edit mode

I know that in DCF files you can break up long lines as long as continuation lines start with whitespace. But I don't know whether this helps you, and can't offer anything about this specific issue.

ADD REPLY
0
Entering edit mode

I tried that, and it got farther than before, but I got a new error, which I will add to the gist shortly.

ADD REPLY
0
Entering edit mode
@herve-pages-1542
Last seen 3 days ago
Seattle, WA, United States

Hi Ryan,

Thanks for the report, I'll look into this. In the meantime a workaround is to supply the sequences in the .2bit format instead of FASTA. Then the seqnames field is not required anymore (you could still use it if you wanted to select only a subset of the file or reorder the sequences). See the BSgenomeForge vignette for more information.

To convert from FASTA to .2bit:

library(Biostrings)
seqs <- readDNAStringSet("path/to/fasta/file")

... maybe reorder or drop some sequences ...

library(rtracklayer)
export.2bit(seqs, "path/to/output/file") 

HTH,

H.

ADD COMMENT
0
Entering edit mode

Actually, I am using 2bit format. I didn't realize the seqnames field was optional in that case. I'll try this.

ADD REPLY
0
Entering edit mode

I tried the 2bit suggestion, and it didn't work. The error message was

Error in makeS4FromList("BSgenomeDataPkgSeed", x) :
  some names in 'x' are not valid BSgenomeDataPkgSeed slots (seqfile_name)

Please see the gist for the the full error details and sessionInfo().

ADD REPLY
0
Entering edit mode

Hi Ryan,

I see at least 2 problems with your seed file:

  1. SrcDataFiles1 is not a valid field, use SrcDataFiles instead
  2. seqfile_name must be the short file name, not the path to the file

Cheers,

H.

 

ADD REPLY

Login before adding your answer.

Traffic: 233 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6