Limit on number of sequence files for forging a BSgenome

2


Marco Blanchette ▴ 220

@marco-blanchette-5439

Last seen 10.5 years ago

United States/Kansas City/Stowers Insti…

Hi,

Is there a maximum number of sequence files (chromosomes, or contigs in my case) that can be fed to the forgeBSgenomeDataPkg() function? I am trying to build BSgenome packages for C. brenneri and C. japonica, both available from EnsemblGenomes. These genomes are made up of thousands of contigs with genes annotated on them.

Currently, running on the full set of contigs fails with "Error: Line longer than buffer size". However, it works fine with a seed file containing a subset of the contigs (I can forge a genome with 450 contigs but not with 460!).

Any suggestions will be appreciated. (I can provide a toy example, but I am not sure what the merit of it would be at this point.)

Thanks,

Marco

BSgenome genomes • 1.5k views

ADD COMMENT • link updated 12.0 years ago by Kasper Daniel Hansen ★ 6.5k • written 12.0 years ago by Marco Blanchette ▴ 220

0


Kasper Daniel Hansen ★ 6.5k

@kasper-daniel-hansen-2979

Last seen 21 months ago

United States

Marco,

You are probably right in diagnosing the problem, but I think I have sometimes seen FASTA files with the entire sequence on a single line, instead of (say) 80 nucleotides followed by a newline. I could believe that a really long contig on a single line could cause an error like this. You could quickly check whether there is a suspicious file by running

    wc -l *

and looking for files with only 2-3 lines. Somehow 460 seems a weird number to fail at.

This may not be your problem, and I am sure Herve will respond in due time.

Best,
Kasper
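The same check can be sketched in R, which also reports the longest line per file rather than just the line count. This is only an illustration: the function name `summarize_fasta_lines`, the `.fa` extension, and the demo files are assumptions made up for the example, not anything from the thread.

```r
# Report the number of lines and the longest line for each FASTA file in a
# directory. The function name and the default ".fa" pattern are hypothetical.
summarize_fasta_lines <- function(fasta_dir, pattern = "\\.fa$") {
  files <- list.files(fasta_dir, pattern = pattern, full.names = TRUE)
  stats <- lapply(files, function(f) {
    lines <- readLines(f, warn = FALSE)
    data.frame(
      file = basename(f),
      n_lines = length(lines),
      max_chars = if (length(lines) > 0) max(nchar(lines)) else 0L,
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, stats)
}

# Demo on two made-up files: one wrapped at 80 columns, one on a single line.
demo_dir <- tempfile("fasta_demo")
dir.create(demo_dir)
writeLines(c(">wrapped", rep(strrep("ACGT", 20), 4)),
           file.path(demo_dir, "wrapped.fa"))
writeLines(c(">single", strrep("ACGT", 5000)),
           file.path(demo_dir, "single.fa"))

res <- summarize_fasta_lines(demo_dir)
# Files with very few lines are the suspicious ones.
print(res[res$n_lines <= 3, ])
```

Note that readLines() loads each file fully into memory, which should be fine for small contigs but is worth keeping in mind for chromosome-sized files.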

ADD COMMENT • link 12.0 years ago Kasper Daniel Hansen ★ 6.5k

0


Kasper,

I see your line of thought: is there a particular FASTA file causing forgeBSgenomeDataPkg() to break?

The answer is no. Once I reach a certain number of FASTA files, adding one more contig breaks the function. For instance, taking the first 454 contigs of C. brenneri breaks, while removing either the last or the first FASTA file from the list (keeping only 453) compiles without a problem. Neither the last nor the first FASTA file is responsible for breaking the function; the number of files is the trigger.

What's even more puzzling is that the number that breaks is not fixed. Taking a random selection of contigs or changing genome changes the number that triggers the failure. However, it's always around 440 files, which might be because the FASTA files are all of very similar size.

Any clues?

Marco

ADD REPLY • link 12.0 years ago Marco Blanchette ▴ 220

1


I traced the error "Error: Line longer than buffer size" that I am getting from forgeBSgenomeDataPkg() back to the read.dcf() call that forgeBSgenomeDataPkg() uses to read the seed file. It turns out there is an upper limit on the number of characters allowed on a single line of a DCF file. For instance, this works:

    cat("test:", paste(sample(letters, 8184, TRUE), collapse=""), "\n", file="test.dcf"); t <- read.dcf("test.dcf")

While a value just one character longer breaks with the same error I get from forgeBSgenomeDataPkg():

    cat("test:", paste(sample(letters, 8185, TRUE), collapse=""), "\n", file="test.dcf"); t <- read.dcf("test.dcf")

Since the seqnames: field I create in my seed file contains several thousand entries, I am busting that upper limit. This would also explain why the failing number of contigs is not fixed: what matters is the total length of the names on that line, not the number of files. I can reproduce the error just by trying to read the seed file with read.dcf("mySeedFile.txt").

At this point, I am not sure whether there is an easy workaround, or whether this should be considered a bug in BSgenome or in read.dcf() that should be reported...

Any advice?

Marco
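One possible workaround, sketched here under assumptions, is to wrap the long field over several physical lines: in the DCF format a line that starts with whitespace continues the previous field, so no single line needs to approach the limit. The helper names `longest_line` and `write_wrapped_field`, the contig names, and the file are all made up for this example. Whether forgeBSgenomeDataPkg() evaluates a multi-line seqnames value the same way is an assumption I have not verified, and newer R releases may have lifted the line limit, so the original error may not reproduce there.

```r
# Length of the longest physical line in a file (hypothetical helper).
longest_line <- function(path) max(nchar(readLines(path, warn = FALSE)))

# Write a single DCF field, wrapping the value at whitespace. Every line after
# the first starts with a space, which DCF treats as a continuation line.
write_wrapped_field <- function(tag, value, path, width = 80) {
  pieces <- strwrap(value, width = width)
  prefixes <- c(paste0(tag, ": "), rep(" ", max(length(pieces) - 1L, 0L)))
  writeLines(paste0(prefixes, pieces), path)
}

# Made-up contig names, written as an R expression like a seed file's seqnames.
contigs <- sprintf("Contig%d", 1:3000)
expr <- paste0("c(", paste0('"', contigs, '"', collapse = ", "), ")")
nchar(expr)  # far more characters than fit on one 8 KB line

seed <- tempfile(fileext = ".dcf")
write_wrapped_field("seqnames", expr, seed)
longest_line(seed)  # each physical line is now short

# Read the field back and evaluate it; the line breaks fall between vector
# elements, so the expression still parses to the same character vector.
fields <- read.dcf(seed)
seqnames <- eval(parse(text = fields[1, "seqnames"]))
identical(seqnames, contigs)
```

The wrapping relies on strwrap() breaking only at spaces, which here occur only after the commas separating the names, so no quoted name is split across lines.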

ADD REPLY • link 12.0 years ago Marco Blanchette ▴ 220

0


Advice: repost this to R-devel with a new subject and an example demonstrating the bug in the base package.

ADD REPLY • link 12.0 years ago Malcolm Cook ★ 1.6k
