Entering edit mode
Hi Marco,
Sorry for the delay in answering this. Seems like you are hitting
a limitation in the implementation of the read.dcf() function.
Having so many entries in the 'seqnames' field of your seed file is
probably a bad idea anyway. This field is typically expected to list
the chromosomes or "top-level sequences" so it will always have a
small number of entries (i.e. < 100).
Things like contigs or scaffolds, which are generally counted
by hundreds or thousands, should go in the 'mseqnames' field.
Here they are grouped and stored in 1 (or a small number) of
DNAStringSet objects (i.e. < 100) but each object can contain
millions of sequences. One entry per object should be reported
in the 'mseqnames' field.
See for example the 'chrUn.scaffolds' multiple sequence in the
BSgenome.Btaurus.UCSC.bosTau4 package, or the 'Zv9_NA' and
'Zv9_scaffold' multiple sequences in the BSgenome.Drerio.UCSC.danRer7
package.
See the BSgenomeForge vignette for details about to handle the
'mseqnames' field when forging your own package.
Hope this helps. Please let me know if you have further questions.
H.
On 03/29/2013 12:10 PM, Blanchette, Marco wrote:
> I traced back the error "Error: Line longer than buffer size" that I
am
> getting from forgeBSgenomeDataPkg() to a call to read.dcf() made in
the
> forgeBSgenomeDataPkg() that is used to read the seed file. I came to
the
> realization that there is an upper limit to the number of character
> allowed per single line for the DCF files.
>
> For instance:
>
> This works
> cat("test:",paste(sample(letters,8184,TRUE),collapse=""),"\n",file="
test.dc
> f");t <- read.dcf("test.dcf")
>
> While this breaks with the same error I get from
forgeBSgenomeDataPkg()
> cat("test:",paste(sample(letters,8184,TRUE),collapse=""),"\n",file="
test.dc
> f");t <- read.dcf("test.dcf")
>
> Since the seqnames: field I creates in my seed file contains several
> thousands entries, I am busting that upper limit. I can reproduce
the
> error just by trying to read the seed file with
read.dcf("mySeedFile.txt")
>
> At this point, I am not sure if there is an easy workaround and
whether
> this should be consider a bug in BSgenome or read.dcf() that should
be
> reported...
>
> Advise?
>
> -- Marco Blanchette, Ph.D.
> Stowers Institute for Medical Research
> 1000 East 50th Street
> Kansas City MO 64110
> www.stowers.org
>
>
> Tel: 816-926-4071
> Cell: 816-726-8419
> Fax: 816-926-2018
>
>
>
>
>
>
> On 3/28/13 11:58 AM, "Blanchette, Marco" <mab at="" stowers.org=""> wrote:
>
>> Kasper,
>>
>> I see your line of thought, is there a particular fasta file
causing
>> forgeBSgenomeDataPkg() to break?
>>
>> The answer is no. Once I reach a certain number of fasta files,
adding one
>> more contig breaks the function. For instance, taking the first 454
>> contigs of C. brenneri breaks while removing the last or the first
fasta
>> file from the list (keeping only 453) compile without a problem
(neither
>> the last or the first fasta files are responsible for breaking the
>> function, the number of file is the trigger)
>>
>> What's even more puzzling is that the number that breaks is not a
fixed
>> number. Selecting a random selection of contigs or changing genome
will
>> change the number that triggers the function to break... However
it's
>> always around 440 files, which might be due to the size of the
fasta files
>> being all of very similar sizes.
>>
>> Any clues?
>>
>>
>> -- Marco Blanchette, Ph.D.
>> Stowers Institute for Medical Research
>> 1000 East 50th Street
>> Kansas City MO 64110
>> www.stowers.org
>>
>>
>> Tel: 816-926-4071
>> Cell: 816-726-8419
>> Fax: 816-926-2018
>>
>>
>>
>>
>>
>>
>> On 3/27/13 8:22 PM, "Kasper Daniel Hansen" <kasperdanielhansen at="" gmail.com="">
>> wrote:
>>
>>> Marco,
>>>
>>> You are probably right in diagnosing the problem, but sometimes I
>>> think I have seen FASTA files with the entire sequence on a single
>>> line, instead of (say) 80 nucleotides and then a newline. I could
>>> believe that a really long contig on a single line without a
newline,
>>> could cause an error like this. You could quickly check if there
is a
>>> suspicious file by
>>> wc -l *
>>> and look for files with #lines like 2-3. Somehow 460 seems a
weird
>>> number to fail at.
>>>
>>> This may not be your problem, and I am sure Herve will respond in
due
>>> time.
>>>
>>> Best,
>>> Kasper
>>>
>>> On Wed, Mar 27, 2013 at 4:28 PM, Blanchette, Marco <mab at="" stowers.org="">
>>> wrote:
>>>> Hi,
>>>>
>>>> Is there a maximum number of sequence files (chromosomes or
contigs in
>>>> my case) that can be fed to the forgeBSgenomeDataPkg() function?
I am
>>>> trying to build a BSgenome for C. brenneri and C. japonica
available
>>> >from EnsemblGenomes. These genomes are made from thousands of
contigs
>>>> with genes annotated to them. Currently, I get the following
error when
>>>> running "Error: Line longer than buffer size" when running on the
full
>>>> set of contigs. However, it works fine on a seed file containing
a
>>>> subset of the contigs (I can forge a genome with 450 contigs but
not
>>>> with 460!)
>>>>
>>>> Any suggestions will be appreciated (I can provide a toy example
but I
>>>> am not sure what would be the merit of it at this point)
>>>>
>>>> Thanks
>>>>
>>>> -- Marco Blanchette, Ph.D.
>>>> Stowers Institute for Medical Research
>>>> 1000 East 50th Street
>>>> Kansas City MO 64110
>>>> www.stowers.org
>>>>
>>>> Tel: 816-926-4071
>>>> Cell: 816-726-8419
>>>> Fax: 816-926-2018
>>>>
>>>> [[alternative HTML version deleted]]
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
Hi
I have a similar problem building a BSgenome for a Yeast strain. The 'Genome' consists ONLY of scaffold and unplaced sequences. There are around 500 scaffolds and 100 unplaced sequences (and nothing else). I CAN forge and load a BSgenome if I list the names of all sequences under 'seqnames' (I had to continue on extra lines in the _seed file with whitespace starting continuation lines). However, when I tried to use mseqnames (as above) - I made two fasta files, one with all the scaffolds another with all the unplaced and added the line
Then forgeBSgenomeDataPkg() failed with
Error in .forgeTwobitFile(seqnames, prefix, suffix, seqs_srcdir, seqs_destdir, : 'seqnames' must be a character vector In addition: Warning message: In forgeSeqFiles(.seqnames, mseqnames = .mseqnames, seqfile_name = x@seqfile_name, : 'seqnames' is empty
Which is correct, but not very helpful. As far as I can tell, mseqnames only works when there is also some seqnames.
Otherwise, well done on some really useful software.
Dave Gerrard
Nevermind!
If I set
Then it builds.