forge BSgenome data package

0

Entering edit mode

Steve Shen ▴ 330

@steve-shen-3743

Last seen 10.3 years ago

Hi list, This wasn't sent out with a none registered ids. I have a problem with forging a new BSgenome data package. The sequence data file is clof.v3.fa and the mask file is clof.v3.fa.masked. Below are seed file, command, error and sessioninfo. Your help will be much appreciated. Thanks a lot, Steve Package: BSgenome.Clof.yu.v3 Title: Clof (insects) full genome (version 3) Description: Clof (insects) full genome as provided by YU (V3.3, Jan. 2011) # and will store in Biostrings objects. Version: 1.0.0 organism: clof species: insect provider: YU provider_version: Assembly V3.3 release_date: Jan, 2011 release_name: insects Genome Reference Consortium source_url: NA # http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/ organism_biocview: Clof BSgenomeObjname: Clof seqnames: "clof_v3.fa" mseqnames: NA #paste("upstream", c("1000", "2000", "5000"), sep="") nmask_per_seq: 2 #SrcDataFiles1: sequences: chromFa.zip, upstream1000.zip, upstream2000.zip, upstream5000.zip #from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/ #SrcDataFiles2: AGAPS masks: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/gap.txt.gz #RM masks: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromOut.tar.gz #TRF masks: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromTrf.tar.gz PkgExamples: Hsapiens seqlengths(Hsapiens) Hsapiens$chr1 # same as Hsapiens[["chr1"]] seqs_srcdir: /home/steve/Data/Genomes masks_srcdir: /home/steve/Data/Genomes AGAPSfiles_name: clof_v3.fa.masked The command, error and sessionInfo are below > forgeBSgenomeDataPkg("Clof_seed.R", seqs_srcdir=".", masks_srcdir=".", destdir=".", verbose=TRUE) Loading required package: Biobase Welcome to Bioconductor Vignettes contain introductory material. To view, type 'openVignette()'. To cite Bioconductor, see 'citation("Biobase")' and for packages 'citation(pkgname)'. Attaching package: 'Biobase' The following object(s) are masked from 'package:IRanges': updateObject Error in forgeBSgenomeDataPkg(y, seqs_srcdir = seqs_srcdir, masks_srcdir = masks_srcdir, : values for symbols NMASKPERSEQ are not single strings In addition: Warning message: In storage.mode(x$nmask_per_seq) <- "integer" : NAs introduced by coercion > sessionInfo() R version 2.12.1 (2010-12-16) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] Biobase_2.6.1 BSgenome_1.18.2 Biostrings_2.18.0 [4] GenomicRanges_1.2.1 IRanges_1.8.8 loaded via a namespace (and not attached): [1] tools_2.12.1 [[alternative HTML version deleted]]

BSgenome Biostrings BSgenome BSgenome Biostrings BSgenome • 2.9k views

ADD COMMENT • link updated 13.9 years ago by Hervé Pagès 16k • written 13.9 years ago by Steve Shen ▴ 330

0

Entering edit mode

steve Shen ▴ 70

@steve-shen-4443

Last seen 10.3 years ago

Hi list, I have a problem with forging a new BSgenome data package. The sequence data file is clof.v3.fa and the mask file is clof.v3.fa.masked. Below are seed file, command, error and sessioninfo. Thanks a lot, Steve Package: BSgenome.Clof.yu.v3 Title: Clof (insects) full genome (version 3) Description: Clof (insects) full genome as provided by YU (V3.3, Jan. 2011) # and will store in Biostrings objects. Version: 1.0.0 organism: clof species: insect provider: YU provider_version: Assembly V3.3 release_date: Jan, 2011 release_name: insects Genome Reference Consortium source_url: NA # http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/ organism_biocview: Clof BSgenomeObjname: Clof seqnames: "clof_v3.fa" mseqnames: NA #paste("upstream", c("1000", "2000", "5000"), sep="") nmask_per_seq: 2 #SrcDataFiles1: sequences: chromFa.zip, upstream1000.zip, upstream2000.zip, upstream5000.zip #from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/ #SrcDataFiles2: AGAPS masks: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/gap.txt.gz #RM masks: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromOut.tar.gz #TRF masks: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromTrf.tar.gz PkgExamples: Hsapiens seqlengths(Hsapiens) Hsapiens$chr1 # same as Hsapiens[["chr1"]] seqs_srcdir: /home/steve/Data/Genomes masks_srcdir: /home/steve/Data/Genomes AGAPSfiles_name: clof_v3.fa.masked The command, error and sessionInfo are below > forgeBSgenomeDataPkg("Clof_seed.R", seqs_srcdir=".", masks_srcdir=".", destdir=".", verbose=TRUE) Loading required package: Biobase Welcome to Bioconductor Vignettes contain introductory material. To view, type 'openVignette()'. To cite Bioconductor, see 'citation("Biobase")' and for packages 'citation(pkgname)'. Attaching package: 'Biobase' The following object(s) are masked from 'package:IRanges': updateObject Error in forgeBSgenomeDataPkg(y, seqs_srcdir = seqs_srcdir, masks_srcdir = masks_srcdir, : values for symbols NMASKPERSEQ are not single strings In addition: Warning message: In storage.mode(x$nmask_per_seq) <- "integer" : NAs introduced by coercion > sessionInfo() R version 2.12.1 (2010-12-16) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] Biobase_2.6.1 BSgenome_1.18.2 Biostrings_2.18.0 [4] GenomicRanges_1.2.1 IRanges_1.8.8 loaded via a namespace (and not attached): [1] tools_2.12.1 [[alternative HTML version deleted]]

ADD COMMENT • link 13.9 years ago steve Shen ▴ 70

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 3 days ago

Seattle, WA, United States

Hi Steve, There are several things that look wrong with your seed file. 1. The # must be the first character in a line to make it a line of comment (ignored). For example, those 2 lines will certainly not be interpreted as you might expect: mseqnames: NA #paste("upstream", c("1000", "2000", "5000"), sep="") source_url: NA # http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/ 2. If you don't have multiple sequences, specify: mseqnames: character(0) 3. As explained in the BSgenomeForge vignette, the default for seqfiles_prefix is .fa so this won't work: seqnames: "clof_v3.fa" Unless your file is named clof_v3.fa.fa? If your file is named clof_v3.fa, then you should specify: seqnames: "clof_v3" 4. This doesn't look like a valid file of assembly gaps: AGAPSfiles_name: clof_v3.fa.masked Please refer to the BSgenomeForge vignette for what kind of masks and what file formats are supported. 5. Having this PkgExamples: Hsapiens seqlengths(Hsapiens) Hsapiens$chr1 # same as Hsapiens[["chr1"]] in a seed file for clof obviously doesn't make sense and your package won't pass 'R CMD check' because it will contain broken examples. There might be other problems with your seed file. All you need to do is read and follow carefully the instructions described in the BSgenomeForge vignette. Let me know if things are not clear in the vignette and I'll try to improve it. Thanks! H. On 01/18/2011 05:32 PM, Steve Shen wrote: > Hi list, > > This wasn't sent out with a none registered ids. I have a problem with > forging a new BSgenome data package. The sequence data file is clof.v3.fa > and the mask file is clof.v3.fa.masked. Below are seed file, command, error > and sessioninfo. Your help will be much appreciated. > > Thanks a lot, > Steve > > Package: BSgenome.Clof.yu.v3 > Title: Clof (insects) full genome (version 3) > Description: Clof (insects) full genome as provided by YU (V3.3, Jan. 2011) > # and will store in Biostrings objects. > Version: 1.0.0 > organism: clof > species: insect > provider: YU > provider_version: Assembly V3.3 > release_date: Jan, 2011 > release_name: insects Genome Reference Consortium > source_url: NA # http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/ > organism_biocview: Clof > BSgenomeObjname: Clof > seqnames: "clof_v3.fa" > mseqnames: NA #paste("upstream", c("1000", "2000", "5000"), sep="") > nmask_per_seq: 2 > #SrcDataFiles1: sequences: chromFa.zip, upstream1000.zip, upstream2000.zip, > upstream5000.zip > #from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/ > #SrcDataFiles2: AGAPS masks: > http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/gap.txt.gz > #RM masks: > http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromOut.tar.gz > #TRF masks: > http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromTrf.tar.gz > PkgExamples: Hsapiens > seqlengths(Hsapiens) > Hsapiens$chr1 # same as Hsapiens[["chr1"]] > seqs_srcdir: /home/steve/Data/Genomes > masks_srcdir: /home/steve/Data/Genomes > AGAPSfiles_name: clof_v3.fa.masked > > The command, error and sessionInfo are below > >> forgeBSgenomeDataPkg("Clof_seed.R", seqs_srcdir=".", masks_srcdir=".", > destdir=".", verbose=TRUE) > Loading required package: Biobase > > Welcome to Bioconductor > > Vignettes contain introductory material. To view, type > 'openVignette()'. To cite Bioconductor, see > 'citation("Biobase")' and for packages 'citation(pkgname)'. > > > Attaching package: 'Biobase' > > The following object(s) are masked from 'package:IRanges': > > updateObject > > Error in forgeBSgenomeDataPkg(y, seqs_srcdir = seqs_srcdir, masks_srcdir = > masks_srcdir, : > values for symbols NMASKPERSEQ are not single strings > In addition: Warning message: > In storage.mode(x$nmask_per_seq)<- "integer" : NAs introduced by coercion >> sessionInfo() > R version 2.12.1 (2010-12-16) > Platform: x86_64-pc-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] Biobase_2.6.1 BSgenome_1.18.2 Biostrings_2.18.0 > [4] GenomicRanges_1.2.1 IRanges_1.8.8 > > loaded via a namespace (and not attached): > [1] tools_2.12.1 > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD COMMENT • link 13.9 years ago Hervé Pagès 16k

0

Entering edit mode

Hi Herve, Thank you so much for your quick reply. Your vignette is pretty clear. I think the problem is on my side. I just lack of experience on dealing with this issue and maybe the file format doesn't fit. I sort of gave up using the seed file but just using low level commands. I still have errors. I really have no idea what is going on. Your help will be much appreciated. Best, Steve 1. with seed file, I got > forgeBSgenomeDataPkg("./cflo_seed.R") Error in as.list(.readSeedFile(x, verbose = verbose)) : error in evaluating the argument 'x' in selecting a method for function 'as.list' 2. with command forgeSeqlengthsFile, > forgeSeqlengthsFile("cflo_v3.3.fold", prefix="", suffix=".fa",seqs_srcdir=".", seqs_destdir=".", verbose=TRUE) Saving 'seqlengths' object to compressed data file './seqlengths.rda'... DONE Warning messages: 1: In FUN("cflo_v3.3.fold"[[1L]], ...) : In file './cflo_v3.3.fold.fa': 24026 sequences found, using first sequence only 2: In FUN("cflo_v3.3.fold"[[1L]], ...) : In file './cflo_v3.3.fold.fa': sequence description "scaffold9scaffold12scaffold16scaffold2scaffold4scaffold20scaffold10sc affold11scaffold24scaffold1scaffold3scaffold6scaffold15scaffold28scaff old30scaffold22scaffold18scaffold21scaffold7scaffold32scaffold36scaffo ld13scaffold23scaffold39scaffold19scaffold41scaffold29scaffold37scaffo ld33scaffold45scaffold38scaffold17scaffold5scaffold46scaffold48scaffol d31scaffold25scaffold51scaffold49scaffold44scaffold47scaffold54scaffol d43scaffold55scaffold60scaffold35scaffold42scaffold63scaffold53scaffol d50scaffold40scaffold67scaffold61scaffold69scaffold34scaffold70scaffol d27scaffold73scaffold8scaffold75scaffold71scaffold74scaffold14scaffold 66scaffold65scaffold80scaffold62scaffold76scaffold79scaffold85scaffold 78scaffold86scaffold88scaffold81scaffold87scaffold26scaffold92scaffold 93scaffold94scaffold59scaffold52scaffold84scaffold77scaffold89scaffold 100scaffold97scaffold99scaffold64scaffold103scaffold90scaffold106scaff old56scaffold98scaffold109scaffold68sc [... truncated] 3. with forgeSeqfiles, > forgeSeqFiles("cflo_v3.3.fold", mseqnames=NULL, prefix="", suffix=".fa",seqs_srcdir=".", seqs_destdir="./BSgenome", verbose=TRUE) Loading FASTA file './cflo_v3.3.fold.fa' in 'cflo_v3.3.fold' object... DONE Saving 'cflo_v3.3.fold' object to compressed data file './BSgenome/cflo_v3.3.fold.rda'... DONE Warning message: In .forgeSeqFile(name, prefix, suffix, seqs_srcdir, seqs_destdir, : file contains 24026 sequences, using the first sequence only 2011/1/19 Hervé Pagès <hpages@fhcrc.org> > Hi Steve, > > There are several things that look wrong with your seed file. > > 1. The # must be the first character in a line to make it a > line of comment (ignored). > For example, those 2 lines will certainly not be interpreted > as you might expect: > > > mseqnames: NA #paste("upstream", c("1000", "2000", "5000"), sep="") > > source_url: NA # > http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/ > > 2. If you don't have multiple sequences, specify: > > mseqnames: character(0) > > 3. As explained in the BSgenomeForge vignette, the default for > seqfiles_prefix is .fa so this won't work: > > seqnames: "clof_v3.fa" > > Unless your file is named clof_v3.fa.fa? > If your file is named clof_v3.fa, then you should specify: > > seqnames: "clof_v3" > > 4. This doesn't look like a valid file of assembly gaps: > > > AGAPSfiles_name: clof_v3.fa.masked > > Please refer to the BSgenomeForge vignette for what kind of masks > and what file formats are supported. > > 5. Having this > > > PkgExamples: Hsapiens > seqlengths(Hsapiens) > Hsapiens$chr1 # same as Hsapiens[["chr1"]] > > in a seed file for clof obviously doesn't make sense and your > package won't pass 'R CMD check' because it will contain broken > examples. > > There might be other problems with your seed file. All you need to do > is read and follow carefully the instructions described in the > BSgenomeForge vignette. Let me know if things are not clear in the > vignette and I'll try to improve it. Thanks! > > H. > > > > On 01/18/2011 05:32 PM, Steve Shen wrote: > >> Hi list, >> >> This wasn't sent out with a none registered ids. I have a problem with >> forging a new BSgenome data package. The sequence data file is clof.v3.fa >> and the mask file is clof.v3.fa.masked. Below are seed file, command, >> error >> and sessioninfo. Your help will be much appreciated. >> >> Thanks a lot, >> Steve >> >> Package: BSgenome.Clof.yu.v3 >> Title: Clof (insects) full genome (version 3) >> Description: Clof (insects) full genome as provided by YU (V3.3, Jan. >> 2011) >> # and will store in Biostrings objects. >> Version: 1.0.0 >> organism: clof >> species: insect >> provider: YU >> provider_version: Assembly V3.3 >> release_date: Jan, 2011 >> release_name: insects Genome Reference Consortium >> source_url: NA # http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/ >> organism_biocview: Clof >> BSgenomeObjname: Clof >> seqnames: "clof_v3.fa" >> mseqnames: NA #paste("upstream", c("1000", "2000", "5000"), sep="") >> nmask_per_seq: 2 >> #SrcDataFiles1: sequences: chromFa.zip, upstream1000.zip, >> upstream2000.zip, >> upstream5000.zip >> #from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/ >> #SrcDataFiles2: AGAPS masks: >> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/gap.txt.gz >> #RM masks: >> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromOut.tar.gz >> #TRF masks: >> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromTrf.tar.gz >> PkgExamples: Hsapiens >> seqlengths(Hsapiens) >> Hsapiens$chr1 # same as Hsapiens[["chr1"]] >> seqs_srcdir: /home/steve/Data/Genomes >> masks_srcdir: /home/steve/Data/Genomes >> AGAPSfiles_name: clof_v3.fa.masked >> >> The command, error and sessionInfo are below >> >> forgeBSgenomeDataPkg("Clof_seed.R", seqs_srcdir=".", masks_srcdir=".", >>> >> destdir=".", verbose=TRUE) >> Loading required package: Biobase >> >> Welcome to Bioconductor >> >> Vignettes contain introductory material. To view, type >> 'openVignette()'. To cite Bioconductor, see >> 'citation("Biobase")' and for packages 'citation(pkgname)'. >> >> >> Attaching package: 'Biobase' >> >> The following object(s) are masked from 'package:IRanges': >> >> updateObject >> >> Error in forgeBSgenomeDataPkg(y, seqs_srcdir = seqs_srcdir, masks_srcdir = >> masks_srcdir, : >> values for symbols NMASKPERSEQ are not single strings >> In addition: Warning message: >> In storage.mode(x$nmask_per_seq)<- "integer" : NAs introduced by coercion >> >>> sessionInfo() >>> >> R version 2.12.1 (2010-12-16) >> Platform: x86_64-pc-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] Biobase_2.6.1 BSgenome_1.18.2 Biostrings_2.18.0 >> [4] GenomicRanges_1.2.1 IRanges_1.8.8 >> >> loaded via a namespace (and not attached): >> [1] tools_2.12.1 >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M2-B876 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > [[alternative HTML version deleted]]

ADD REPLY • link 13.9 years ago Steve Shen ▴ 330

0

Entering edit mode

Hi Steve, On 01/19/2011 01:04 PM, Steve Shen wrote: > Hi Herve, > > Thank you so much for your quick reply. Your vignette is pretty clear. I > think the problem is on my side. I just lack of experience on dealing > with this issue and maybe the file format doesn't fit. I sort of gave up > using the seed file but just using low level commands. I still have > errors. I really have no idea what is going on. Your help will be much > appreciated. > > Best, > Steve > > 1. with seed file, I got > > forgeBSgenomeDataPkg("./cflo_seed.R") > Error in as.list(.readSeedFile(x, verbose = verbose)) : > error in evaluating the argument 'x' in selecting a method for > function 'as.list' I agree that the error message is kind of obscure but the problem is probably that the specified path to your seed file is invalid. I'm about to commit a change to the BSgenomeForge code that will produce this error message instead: > forgeBSgenomeDataPkg("aaaaa") Error in .readSeedFile(x, verbose = verbose) : seed file 'aaaaa' not found An easy way to make sure the path is valid is to press <tab> when you are in the middle of typing the path: this will trigger the auto-completion feature. > > 2. with command forgeSeqlengthsFile, > > forgeSeqlengthsFile("cflo_v3.3.fold", prefix="", > suffix=".fa",seqs_srcdir=".", seqs_destdir=".", verbose=TRUE) > Saving 'seqlengths' object to compressed data file './seqlengths.rda'... > DONE > Warning messages: > 1: In FUN("cflo_v3.3.fold"[[1L]], ...) : > In file './cflo_v3.3.fold.fa': 24026 sequences found, using first > sequence only > 2: In FUN("cflo_v3.3.fold"[[1L]], ...) : > In file './cflo_v3.3.fold.fa': sequence description > "scaffold9scaffold12scaffold16scaffold2scaffold4scaffold20scaffold10 scaffold11scaffold24scaffold1scaffold3scaffold6scaffold15scaffold28sca ffold30scaffold22scaffold18scaffold21scaffold7scaffold32scaffold36scaf fold13scaffold23scaffold39scaffold19scaffold41scaffold29scaffold37scaf fold33scaffold45scaffold38scaffold17scaffold5scaffold46scaffold48scaff old31scaffold25scaffold51scaffold49scaffold44scaffold47scaffold54scaff old43scaffold55scaffold60scaffold35scaffold42scaffold63scaffold53scaff old50scaffold40scaffold67scaffold61scaffold69scaffold34scaffold70scaff old27scaffold73scaffold8scaffold75scaffold71scaffold74scaffold14scaffo ld66scaffold65scaffold80scaffold62scaffold76scaffold79scaffold85scaffo ld78scaffold86scaffold88scaffold81scaffold87scaffold26scaffold92scaffo ld93scaffold94scaffold59scaffold52scaffold84scaffold77scaffold89scaffo ld100scaffold97scaffold99scaffold64scaffold103scaffold90scaffold106sca ffold56scaffold98scaffold109scaffold68sc > [... truncated] > > 3. with forgeSeqfiles, > > forgeSeqFiles("cflo_v3.3.fold", mseqnames=NULL, prefix="", > suffix=".fa",seqs_srcdir=".", seqs_destdir="./BSgenome", verbose=TRUE) > Loading FASTA file './cflo_v3.3.fold.fa' in 'cflo_v3.3.fold' object... DONE > Saving 'cflo_v3.3.fold' object to compressed data file > './BSgenome/cflo_v3.3.fold.rda'... DONE > Warning message: > In .forgeSeqFile(name, prefix, suffix, seqs_srcdir, seqs_destdir, : > file contains 24026 sequences, using the first sequence only Sorry but I can only try to help if you stick to the vignette and use a seed file. Using low-level functions like forgeSeqlengthsFile() or forgeSeqFiles() is undocumented/unsupported. The reason for this is that this route is *much* more complicated and error-prone than using a seed file! This is exactly why seed files where invented: to make the whole process much easier for you, and to make troubleshooting much easier for us. Cheers, H. > > > > 2011/1/19 Hervé Pagès <hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">> > > Hi Steve, > > There are several things that look wrong with your seed file. > > 1. The # must be the first character in a line to make it a > line of comment (ignored). > For example, those 2 lines will certainly not be interpreted > as you might expect: > > > mseqnames: NA #paste("upstream", c("1000", "2000", "5000"), sep="") > > source_url: NA # > http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/ > > 2. If you don't have multiple sequences, specify: > > mseqnames: character(0) > > 3. As explained in the BSgenomeForge vignette, the default for > seqfiles_prefix is .fa so this won't work: > > seqnames: "clof_v3.fa" > > Unless your file is named clof_v3.fa.fa? > If your file is named clof_v3.fa, then you should specify: > > seqnames: "clof_v3" > > 4. This doesn't look like a valid file of assembly gaps: > > > AGAPSfiles_name: clof_v3.fa.masked > > Please refer to the BSgenomeForge vignette for what kind of masks > and what file formats are supported. > > 5. Having this > > > PkgExamples: Hsapiens > seqlengths(Hsapiens) > Hsapiens$chr1 # same as Hsapiens[["chr1"]] > > in a seed file for clof obviously doesn't make sense and your > package won't pass 'R CMD check' because it will contain broken > examples. > > There might be other problems with your seed file. All you need to do > is read and follow carefully the instructions described in the > BSgenomeForge vignette. Let me know if things are not clear in the > vignette and I'll try to improve it. Thanks! > > H. > > > > On 01/18/2011 05:32 PM, Steve Shen wrote: > > Hi list, > > This wasn't sent out with a none registered ids. I have a > problem with > forging a new BSgenome data package. The sequence data file is > clof.v3.fa > and the mask file is clof.v3.fa.masked. Below are seed file, > command, error > and sessioninfo. Your help will be much appreciated. > > Thanks a lot, > Steve > > Package: BSgenome.Clof.yu.v3 > Title: Clof (insects) full genome (version 3) > Description: Clof (insects) full genome as provided by YU (V3.3, > Jan. 2011) > # and will store in Biostrings objects. > Version: 1.0.0 > organism: clof > species: insect > provider: YU > provider_version: Assembly V3.3 > release_date: Jan, 2011 > release_name: insects Genome Reference Consortium > source_url: NA # > http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/ > organism_biocview: Clof > BSgenomeObjname: Clof > seqnames: "clof_v3.fa" > mseqnames: NA #paste("upstream", c("1000", "2000", "5000"), sep="") > nmask_per_seq: 2 > #SrcDataFiles1: sequences: chromFa.zip, upstream1000.zip, > upstream2000.zip, > upstream5000.zip > #from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/ > #SrcDataFiles2: AGAPS masks: > http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/gap.txt.gz > #RM masks: > http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromOut.tar.gz > #TRF masks: > http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromTrf.tar.gz > PkgExamples: Hsapiens > seqlengths(Hsapiens) > Hsapiens$chr1 # same as Hsapiens[["chr1"]] > seqs_srcdir: /home/steve/Data/Genomes > masks_srcdir: /home/steve/Data/Genomes > AGAPSfiles_name: clof_v3.fa.masked > > The command, error and sessionInfo are below > > forgeBSgenomeDataPkg("Clof_seed.R", seqs_srcdir=".", > masks_srcdir=".", > > destdir=".", verbose=TRUE) > Loading required package: Biobase > > Welcome to Bioconductor > > Vignettes contain introductory material. To view, type > 'openVignette()'. To cite Bioconductor, see > 'citation("Biobase")' and for packages 'citation(pkgname)'. > > > Attaching package: 'Biobase' > > The following object(s) are masked from 'package:IRanges': > > updateObject > > Error in forgeBSgenomeDataPkg(y, seqs_srcdir = seqs_srcdir, > masks_srcdir = > masks_srcdir, : > values for symbols NMASKPERSEQ are not single strings > In addition: Warning message: > In storage.mode(x$nmask_per_seq)<- "integer" : NAs introduced by > coercion > > sessionInfo() > > R version 2.12.1 (2010-12-16) > Platform: x86_64-pc-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] Biobase_2.6.1 BSgenome_1.18.2 Biostrings_2.18.0 > [4] GenomicRanges_1.2.1 IRanges_1.8.8 > > loaded via a namespace (and not attached): > [1] tools_2.12.1 > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M2-B876 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> > Phone: (206) 667-5791 > Fax: (206) 667-1319 > > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 13.9 years ago Hervé Pagès 16k

0

Entering edit mode

Hi Herve, Thank you very much. I tried carefully with correct path and file name by triggering auto-complete feature. But it didn't help either. It may be something wrong with my seed file, so I also appended the file below. Please have a look. Thanks in advance, Steve > forgeBSgenomeDataPkg("/home/steve/Data/Genomes/seed") Error in as.list(.readSeedFile(x, verbose = verbose)) : error in evaluating the argument 'x' in selecting a method for function 'as.list' > forgeBSgenomeDataPkg("seed") Error in as.list(.readSeedFile(x, verbose = verbose)) : error in evaluating the argument 'x' in selecting a method for function 'as.list' Seed file starts: Package: BSgenome.CFlo.YU.v3 Title: Camponotus floridanus (Ants) full genome (YU version V3.3) Description: Camponotus floridanus (Ants) full genome as provided by YU (V3.3, Jan. 2011) Version: 1.0.0 organism: camponotus floridanus species: Ant provider: YU provider_version: Assembly V3.3 release_date: Jan, 2011 release_name: Ant Genome Reference Consortium source_url: NA organism_biocview: C_flo BSgenomeObjname: CFlo seqnames: "cflo_v3.3.fold" mseqnames: character(0) nmask_per_seq:2 seqs_srcdir: /home/steve/Data/Genomes/ # masks_srcdir: /home/steve/Data/Masks_src # AGAPSfiles_name: cflo_v3.3.fa.masked 2011/1/19 Hervé Pagès <hpages@fhcrc.org> > Hi Steve, > > > On 01/19/2011 01:04 PM, Steve Shen wrote: > >> Hi Herve, >> >> Thank you so much for your quick reply. Your vignette is pretty clear. I >> think the problem is on my side. I just lack of experience on dealing >> with this issue and maybe the file format doesn't fit. I sort of gave up >> using the seed file but just using low level commands. I still have >> errors. I really have no idea what is going on. Your help will be much >> appreciated. >> >> Best, >> Steve >> >> 1. with seed file, I got >> > forgeBSgenomeDataPkg("./cflo_seed.R") >> Error in as.list(.readSeedFile(x, verbose = verbose)) : >> error in evaluating the argument 'x' in selecting a method for >> function 'as.list' >> > > I agree that the error message is kind of obscure but the problem > is probably that the specified path to your seed file is invalid. > I'm about to commit a change to the BSgenomeForge code that will > produce this error message instead: > > > forgeBSgenomeDataPkg("aaaaa") > Error in .readSeedFile(x, verbose = verbose) : > seed file 'aaaaa' not found > > An easy way to make sure the path is valid is to press <tab> when > you are in the middle of typing the path: this will trigger the > auto-completion feature. > > > >> 2. with command forgeSeqlengthsFile, >> > forgeSeqlengthsFile("cflo_v3.3.fold", prefix="", >> suffix=".fa",seqs_srcdir=".", seqs_destdir=".", verbose=TRUE) >> Saving 'seqlengths' object to compressed data file './seqlengths.rda'... >> DONE >> Warning messages: >> 1: In FUN("cflo_v3.3.fold"[[1L]], ...) : >> In file './cflo_v3.3.fold.fa': 24026 sequences found, using first >> sequence only >> 2: In FUN("cflo_v3.3.fold"[[1L]], ...) : >> In file './cflo_v3.3.fold.fa': sequence description >> >> "scaffold9scaffold12scaffold16scaffold2scaffold4scaffold20scaffold1 0scaffold11scaffold24scaffold1scaffold3scaffold6scaffold15scaffold28sc affold30scaffold22scaffold18scaffold21scaffold7scaffold32scaffold36sca ffold13scaffold23scaffold39scaffold19scaffold41scaffold29scaffold37sca ffold33scaffold45scaffold38scaffold17scaffold5scaffold46scaffold48scaf fold31scaffold25scaffold51scaffold49scaffold44scaffold47scaffold54scaf fold43scaffold55scaffold60scaffold35scaffold42scaffold63scaffold53scaf fold50scaffold40scaffold67scaffold61scaffold69scaffold34scaffold70scaf fold27scaffold73scaffold8scaffold75scaffold71scaffold74scaffold14scaff old66scaffold65scaffold80scaffold62scaffold76scaffold79scaffold85scaff old78scaffold86scaffold88scaffold81scaffold87scaffold26scaffold92scaff old93scaffold94scaffold59scaffold52scaffold84scaffold77scaffold89scaff old100scaffold97scaffold99scaffold64scaffold103scaffold90scaffold106sc affold56scaffold98scaffold109scaffold68sc >> [... truncated] >> >> 3. with forgeSeqfiles, >> > forgeSeqFiles("cflo_v3.3.fold", mseqnames=NULL, prefix="", >> suffix=".fa",seqs_srcdir=".", seqs_destdir="./BSgenome", verbose=TRUE) >> Loading FASTA file './cflo_v3.3.fold.fa' in 'cflo_v3.3.fold' object... >> DONE >> Saving 'cflo_v3.3.fold' object to compressed data file >> './BSgenome/cflo_v3.3.fold.rda'... DONE >> Warning message: >> In .forgeSeqFile(name, prefix, suffix, seqs_srcdir, seqs_destdir, : >> file contains 24026 sequences, using the first sequence only >> > > Sorry but I can only try to help if you stick to the vignette and > use a seed file. Using low-level functions like forgeSeqlengthsFile() > or forgeSeqFiles() is undocumented/unsupported. The reason for this > is that this route is *much* more complicated and error-prone > than using a seed file! This is exactly why seed files where > invented: to make the whole process much easier for you, and to > make troubleshooting much easier for us. > > Cheers, > H. > > >> >> >> 2011/1/19 Hervé Pagès <hpages@fhcrc.org <mailto:hpages@fhcrc.org="">> >> >> >> Hi Steve, >> >> There are several things that look wrong with your seed file. >> >> 1. The # must be the first character in a line to make it a >> line of comment (ignored). >> For example, those 2 lines will certainly not be interpreted >> as you might expect: >> >> >> mseqnames: NA #paste("upstream", c("1000", "2000", "5000"), >> sep="") >> >> source_url: NA # >> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/ >> >> 2. If you don't have multiple sequences, specify: >> >> mseqnames: character(0) >> >> 3. As explained in the BSgenomeForge vignette, the default for >> seqfiles_prefix is .fa so this won't work: >> >> seqnames: "clof_v3.fa" >> >> Unless your file is named clof_v3.fa.fa? >> If your file is named clof_v3.fa, then you should specify: >> >> seqnames: "clof_v3" >> >> 4. This doesn't look like a valid file of assembly gaps: >> >> >> AGAPSfiles_name: clof_v3.fa.masked >> >> Please refer to the BSgenomeForge vignette for what kind of masks >> and what file formats are supported. >> >> 5. Having this >> >> >> PkgExamples: Hsapiens >> seqlengths(Hsapiens) >> Hsapiens$chr1 # same as Hsapiens[["chr1"]] >> >> in a seed file for clof obviously doesn't make sense and your >> package won't pass 'R CMD check' because it will contain broken >> examples. >> >> There might be other problems with your seed file. All you need to do >> is read and follow carefully the instructions described in the >> BSgenomeForge vignette. Let me know if things are not clear in the >> vignette and I'll try to improve it. Thanks! >> >> H. >> >> >> >> On 01/18/2011 05:32 PM, Steve Shen wrote: >> >> Hi list, >> >> This wasn't sent out with a none registered ids. I have a >> problem with >> forging a new BSgenome data package. The sequence data file is >> clof.v3.fa >> and the mask file is clof.v3.fa.masked. Below are seed file, >> command, error >> and sessioninfo. Your help will be much appreciated. >> >> Thanks a lot, >> Steve >> >> Package: BSgenome.Clof.yu.v3 >> Title: Clof (insects) full genome (version 3) >> Description: Clof (insects) full genome as provided by YU (V3.3, >> Jan. 2011) >> # and will store in Biostrings objects. >> Version: 1.0.0 >> organism: clof >> species: insect >> provider: YU >> provider_version: Assembly V3.3 >> release_date: Jan, 2011 >> release_name: insects Genome Reference Consortium >> source_url: NA # >> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/ >> organism_biocview: Clof >> BSgenomeObjname: Clof >> seqnames: "clof_v3.fa" >> mseqnames: NA #paste("upstream", c("1000", "2000", "5000"), sep="") >> nmask_per_seq: 2 >> #SrcDataFiles1: sequences: chromFa.zip, upstream1000.zip, >> upstream2000.zip, >> upstream5000.zip >> #from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/ >> #SrcDataFiles2: AGAPS masks: >> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/gap.txt.gz >> #RM masks: >> >> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromOut.tar.gz >> #TRF masks: >> >> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromTrf.tar.gz >> PkgExamples: Hsapiens >> seqlengths(Hsapiens) >> Hsapiens$chr1 # same as Hsapiens[["chr1"]] >> seqs_srcdir: /home/steve/Data/Genomes >> masks_srcdir: /home/steve/Data/Genomes >> AGAPSfiles_name: clof_v3.fa.masked >> >> The command, error and sessionInfo are below >> >> forgeBSgenomeDataPkg("Clof_seed.R", seqs_srcdir=".", >> masks_srcdir=".", >> >> destdir=".", verbose=TRUE) >> Loading required package: Biobase >> >> Welcome to Bioconductor >> >> Vignettes contain introductory material. To view, type >> 'openVignette()'. To cite Bioconductor, see >> 'citation("Biobase")' and for packages 'citation(pkgname)'. >> >> >> Attaching package: 'Biobase' >> >> The following object(s) are masked from 'package:IRanges': >> >> updateObject >> >> Error in forgeBSgenomeDataPkg(y, seqs_srcdir = seqs_srcdir, >> masks_srcdir = >> masks_srcdir, : >> values for symbols NMASKPERSEQ are not single strings >> In addition: Warning message: >> In storage.mode(x$nmask_per_seq)<- "integer" : NAs introduced by >> coercion >> >> sessionInfo() >> >> R version 2.12.1 (2010-12-16) >> Platform: x86_64-pc-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods >> base >> >> other attached packages: >> [1] Biobase_2.6.1 BSgenome_1.18.2 Biostrings_2.18.0 >> [4] GenomicRanges_1.2.1 IRanges_1.8.8 >> >> loaded via a namespace (and not attached): >> [1] tools_2.12.1 >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org <mailto:bioconductor@r-project.org> >> >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M2-B876 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages@fhcrc.org <mailto:hpages@fhcrc.org> >> >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 >> >> >> > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M2-B876 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > [[alternative HTML version deleted]]

ADD REPLY • link 13.9 years ago Steve Shen ▴ 330

0

Entering edit mode

Steve, On 01/19/2011 04:37 PM, Steve Shen wrote: > Hi Herve, > > Thank you very much. I tried carefully with correct path and file name > by triggering auto-complete feature. But it didn't help either. It may > be something wrong with my seed file, so I also appended the file below. > Please have a look. Thanks in advance, > > Steve > > > forgeBSgenomeDataPkg("/home/steve/Data/Genomes/seed") > Error in as.list(.readSeedFile(x, verbose = verbose)) : > error in evaluating the argument 'x' in selecting a method for > function 'as.list' > > forgeBSgenomeDataPkg("seed") > Error in as.list(.readSeedFile(x, verbose = verbose)) : > error in evaluating the argument 'x' in selecting a method for > function 'as.list' An easy way to check if your seed file is a valid DCF file is: > read.dcf("seed") Error in read.dcf("seed") : Line starting 'I'm a broken DCF fil ...' is malformed! Again, the error message you get when calling forgeBSgenomeDataPkg() on a broken DCF file should return exactly that, but, unfortunately, it doesn't with the current version of BSgenome. Updated versions of BSgenome (1.18.3 in release and 1.19.3 in devel) are correcting this and will become available tomorrow around noon, and midnight, respectively (Seattle time). > > Seed file starts: > > Package: BSgenome.CFlo.YU.v3 > Title: Camponotus floridanus (Ants) full genome (YU version V3.3) > Description: Camponotus floridanus (Ants) full genome as provided by YU > (V3.3, Jan. 2011) ^^^^^^^^^^^^^^^^^ Is this on a line of its own in your file? This might be the problem. Please consult the help for the read.dcf() function (?read.dcf from your R session) for how to format a DCF file. Note that a 'tag:value' should either be placed on a single line, or, if the value spans several lines, then the extra lines (aka "continuation lines") must start with a whitespace. Like this: Description: Camponotus floridanus (Ants) full genome as provided by YU (V3.3, Jan. 2011) (I'm using 4 whitespaces here but 1 is enough. Using a <tab> would also be OK.) > Version: 1.0.0 > organism: camponotus floridanus > species: Ant > provider: YU > provider_version: Assembly V3.3 > release_date: Jan, 2011 > release_name: Ant Genome Reference Consortium > source_url: NA > organism_biocview: C_flo > BSgenomeObjname: CFlo > seqnames: "cflo_v3.3.fold" > mseqnames: character(0) > nmask_per_seq:2 > > seqs_srcdir: /home/steve/Data/Genomes/ So you have a FASTA file named "cflo_v3.3.fold.fa" located in /home/steve/Data/Genomes/ that contains a *single* FASTA record that corresponds to the cflo_v3.3.fold sequence? Please double check this. (You can see the list of records contained in a FASTA file with the fasta.info() function from Biostrings.) H. > # masks_srcdir: /home/steve/Data/Masks_src > # AGAPSfiles_name: cflo_v3.3.fa.masked > > > 2011/1/19 Hervé Pagès <hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">> > > Hi Steve, > > > On 01/19/2011 01:04 PM, Steve Shen wrote: > > Hi Herve, > > Thank you so much for your quick reply. Your vignette is pretty > clear. I > think the problem is on my side. I just lack of experience on > dealing > with this issue and maybe the file format doesn't fit. I sort of > gave up > using the seed file but just using low level commands. I still have > errors. I really have no idea what is going on. Your help will > be much > appreciated. > > Best, > Steve > > 1. with seed file, I got > > forgeBSgenomeDataPkg("./cflo_seed.R") > Error in as.list(.readSeedFile(x, verbose = verbose)) : > error in evaluating the argument 'x' in selecting a method for > function 'as.list' > > > I agree that the error message is kind of obscure but the problem > is probably that the specified path to your seed file is invalid. > I'm about to commit a change to the BSgenomeForge code that will > produce this error message instead: > > > forgeBSgenomeDataPkg("aaaaa") > Error in .readSeedFile(x, verbose = verbose) : > seed file 'aaaaa' not found > > An easy way to make sure the path is valid is to press <tab> when > you are in the middle of typing the path: this will trigger the > auto-completion feature. > > > > 2. with command forgeSeqlengthsFile, > > forgeSeqlengthsFile("cflo_v3.3.fold", prefix="", > suffix=".fa",seqs_srcdir=".", seqs_destdir=".", verbose=TRUE) > Saving 'seqlengths' object to compressed data file > './seqlengths.rda'... > DONE > Warning messages: > 1: In FUN("cflo_v3.3.fold"[[1L]], ...) : > In file './cflo_v3.3.fold.fa': 24026 sequences found, using first > sequence only > 2: In FUN("cflo_v3.3.fold"[[1L]], ...) : > In file './cflo_v3.3.fold.fa': sequence description > "scaffold9scaffold12scaffold16scaffold2scaffold4scaffold20sc affold10scaffold11scaffold24scaffold1scaffold3scaffold6scaffold15scaff old28scaffold30scaffold22scaffold18scaffold21scaffold7scaffold32scaffo ld36scaffold13scaffold23scaffold39scaffold19scaffold41scaffold29scaffo ld37scaffold33scaffold45scaffold38scaffold17scaffold5scaffold46scaffol d48scaffold31scaffold25scaffold51scaffold49scaffold44scaffold47scaffol d54scaffold43scaffold55scaffold60scaffold35scaffold42scaffold63scaffol d53scaffold50scaffold40scaffold67scaffold61scaffold69scaffold34scaffol d70scaffold27scaffold73scaffold8scaffold75scaffold71scaffold74scaffold 14scaffold66scaffold65scaffold80scaffold62scaffold76scaffold79scaffold 85scaffold78scaffold86scaffold88scaffold81scaffold87scaffold26scaffold 92scaffold93scaffold94scaffold59scaffold52scaffold84scaffold77scaffold 89scaffold100scaffold97scaffold99scaffold64scaffold103scaffold90scaffo ld106scaffold56scaffold98scaffold109scaffold68sc > [... truncated] > > 3. with forgeSeqfiles, > > forgeSeqFiles("cflo_v3.3.fold", mseqnames=NULL, prefix="", > suffix=".fa",seqs_srcdir=".", seqs_destdir="./BSgenome", > verbose=TRUE) > Loading FASTA file './cflo_v3.3.fold.fa' in 'cflo_v3.3.fold' > object... DONE > Saving 'cflo_v3.3.fold' object to compressed data file > './BSgenome/cflo_v3.3.fold.rda'... DONE > Warning message: > In .forgeSeqFile(name, prefix, suffix, seqs_srcdir, seqs_destdir, : > file contains 24026 sequences, using the first sequence only > > > Sorry but I can only try to help if you stick to the vignette and > use a seed file. Using low-level functions like forgeSeqlengthsFile() > or forgeSeqFiles() is undocumented/unsupported. The reason for this > is that this route is *much* more complicated and error-prone > than using a seed file! This is exactly why seed files where > invented: to make the whole process much easier for you, and to > make troubleshooting much easier for us. > > Cheers, > H. > > > > > 2011/1/19 Hervé Pagès <hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org="">>> > > > Hi Steve, > > There are several things that look wrong with your seed file. > > 1. The # must be the first character in a line to make it a > line of comment (ignored). > For example, those 2 lines will certainly not be interpreted > as you might expect: > > > mseqnames: NA #paste("upstream", c("1000", "2000", > "5000"), sep="") > > source_url: NA # > http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/ > > 2. If you don't have multiple sequences, specify: > > mseqnames: character(0) > > 3. As explained in the BSgenomeForge vignette, the default for > seqfiles_prefix is .fa so this won't work: > > seqnames: "clof_v3.fa" > > Unless your file is named clof_v3.fa.fa? > If your file is named clof_v3.fa, then you should specify: > > seqnames: "clof_v3" > > 4. This doesn't look like a valid file of assembly gaps: > > > AGAPSfiles_name: clof_v3.fa.masked > > Please refer to the BSgenomeForge vignette for what kind > of masks > and what file formats are supported. > > 5. Having this > > > PkgExamples: Hsapiens > seqlengths(Hsapiens) > Hsapiens$chr1 # same as Hsapiens[["chr1"]] > > in a seed file for clof obviously doesn't make sense and your > package won't pass 'R CMD check' because it will contain > broken > examples. > > There might be other problems with your seed file. All you > need to do > is read and follow carefully the instructions described in the > BSgenomeForge vignette. Let me know if things are not clear > in the > vignette and I'll try to improve it. Thanks! > > H. > > > > On 01/18/2011 05:32 PM, Steve Shen wrote: > > Hi list, > > This wasn't sent out with a none registered ids. I have a > problem with > forging a new BSgenome data package. The sequence data > file is > clof.v3.fa > and the mask file is clof.v3.fa.masked. Below are seed file, > command, error > and sessioninfo. Your help will be much appreciated. > > Thanks a lot, > Steve > > Package: BSgenome.Clof.yu.v3 > Title: Clof (insects) full genome (version 3) > Description: Clof (insects) full genome as provided by > YU (V3.3, > Jan. 2011) > # and will store in Biostrings objects. > Version: 1.0.0 > organism: clof > species: insect > provider: YU > provider_version: Assembly V3.3 > release_date: Jan, 2011 > release_name: insects Genome Reference Consortium > source_url: NA # > http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/ > organism_biocview: Clof > BSgenomeObjname: Clof > seqnames: "clof_v3.fa" > mseqnames: NA #paste("upstream", c("1000", "2000", > "5000"), sep="") > nmask_per_seq: 2 > #SrcDataFiles1: sequences: chromFa.zip, upstream1000.zip, > upstream2000.zip, > upstream5000.zip > #from > http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/ > #SrcDataFiles2: AGAPS masks: > http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/gap.txt.gz > #RM masks: > http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromOut.tar.gz > #TRF masks: > http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromTrf.tar.gz > PkgExamples: Hsapiens > seqlengths(Hsapiens) > Hsapiens$chr1 # same as Hsapiens[["chr1"]] > seqs_srcdir: /home/steve/Data/Genomes > masks_srcdir: /home/steve/Data/Genomes > AGAPSfiles_name: clof_v3.fa.masked > > The command, error and sessionInfo are below > > forgeBSgenomeDataPkg("Clof_seed.R", seqs_srcdir=".", > masks_srcdir=".", > > destdir=".", verbose=TRUE) > Loading required package: Biobase > > Welcome to Bioconductor > > Vignettes contain introductory material. To view, type > 'openVignette()'. To cite Bioconductor, see > 'citation("Biobase")' and for packages 'citation(pkgname)'. > > > Attaching package: 'Biobase' > > The following object(s) are masked from 'package:IRanges': > > updateObject > > Error in forgeBSgenomeDataPkg(y, seqs_srcdir = seqs_srcdir, > masks_srcdir = > masks_srcdir, : > values for symbols NMASKPERSEQ are not single strings > In addition: Warning message: > In storage.mode(x$nmask_per_seq)<- "integer" : NAs > introduced by > coercion > > sessionInfo() > > R version 2.12.1 (2010-12-16) > Platform: x86_64-pc-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets > methods base > > other attached packages: > [1] Biobase_2.6.1 BSgenome_1.18.2 > Biostrings_2.18.0 > [4] GenomicRanges_1.2.1 IRanges_1.8.8 > > loaded via a namespace (and not attached): > [1] tools_2.12.1 > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > <mailto:bioconductor at="" r-project.org=""> <mailto:bioconductor at="" r-project.org="">> > > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M2-B876 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> > <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">> > > Phone: (206) 667-5791 > Fax: (206) 667-1319 > > > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M2-B876 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> > Phone: (206) 667-5791 > Fax: (206) 667-1319 > > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 13.9 years ago Hervé Pagès 16k

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 3 days ago

Seattle, WA, United States

Hi Steve, Since we've started this on the list, let's keep it there. Please provide your seed file so I can get a better picture of what you are doing. As I already mentioned before, you need to provide 1 FASTA file per chromosome. But your cflo_v3.3.fold.fa file seems to contain 24026 sequences! So you have 2 options: (1) either you split it into 1 file per sequence, (2) either you store it as a multiple sequence in your BSgenome data package. Option 1: It's easy to split a FASTA file into 1 file per FASTA record using the Biostrings package. You can do something like: library(Biostrings) x <- read.DNAStringSet("cflo_v3.3.fold.fa") for (i in seq_len(length(x))) { filepath <- paste(names(x)[i], ".fa", sep="") write.XStringSet(x[i], filepath) } The problem is that then you will need to list *all* the sequences in the seqnames field of your seed file. Another problem with this is that it's probably not the right thing to do. See below. Option 2: First some background. As you've probably noticed, the sequence objects stored in a BSgenome data package are divided in 2 groups: the group of single sequences and the group of multiple sequences. If you've ever displayed a BSgenome object that contains upstream sequences, you can't have missed it. For example: > library(BSgenome.Celegans.UCSC.ce2) > Celegans Worm genome | | organism: Caenorhabditis elegans (Worm) | provider: UCSC | provider version: ce2 | release date: Mar. 2004 | release name: WormBase v. WS120 | | single sequences (see '?seqnames'): | chrI chrII chrIII chrIV chrV chrX chrM | | multiple sequences (see '?mseqnames'): | upstream1000 upstream2000 upstream5000 | | (use the '$' or '[[' operator to access a given sequence) Objects in the 1st group are DNAString (or MaskedDNAString) objects. Each DNAString (or MaskedDNAString) object can only contain 1 sequence. Objects in the 2nd group are DNAStringSet objects. Each DNAStringSet object can contain an (almost) arbitrary number of sequences. The 2 groups are respectively defined thru the seqnames and mseqnames fields of the seed file. I'm just paraphrasing the existing documentation here: all this is documented in the man page for BSgenome objects and in the BSgenomeForge vignette. Typically the 1st group will contain chromosome sequences, or, more generally (and to use Ensembl terminology), the "top-level" sequences of the genome. This can include the Mitochondrial DNA (often denoted chrM), the 2-micron plasmid, the haplotype sequences, and other exotic DNA material that is considered to be at the "top-level". The number of top-level sequences varies from one genome to the other but it remains generally small (< 50). Typically the 2nd group will contain 1 or more big collections of sequences. Each collection itself can have thousands of sequences. But the number of collections should preferably remain small (i.e. < 50). So I guess this would be the place of choice for your cflo_v3.3.fold object (which seems to be a collection of 24026 scaffolds). Now for the problem with your gap file, I can't really help without seeing your seed file first. However, I'm a little bit puzzled that you have a file containing assembly gaps for your scaffolds: that doesn't seem to make a lot of sense to me. Anyway, you can't put masks on a multiple sequence object so, if you choose option 2, you will have to set nmask_per_seq to 0. Also, as a general rule when forging a BSgenome data package, you should ask yourself whether it is worth the extra effort to deal with the masks at all. A BSgenome data package doesn't *have* to contain masks. Cheers, H. On 01/26/2011 02:53 PM, Steve Shen wrote: > Hi Herve, > > Thank you so much for your help. I was finally able to have a working > seed file. However, there seem to be still long way to go. I guess that > most likely these errors are due to genome itself and gap file. I just > wonder if you could kindly give me some hints what do these error > message mean. Is gap file required for building up data package? Why do > package only use first sequence and others are truncated? Is this > because of two many sequences? What steps can I take in order to make it > work with GenomicRange packages? I really appreciate your help. Thanks > so much, > > Steve > > > forgeBSgenomeDataPkg("Cflo_seed.R") > Loading required package: Biobase > > Welcome to Bioconductor > > Vignettes contain introductory material. To view, type > 'openVignette()'. To cite Bioconductor, see > 'citation("Biobase")' and for packages 'citation(pkgname)'. > > > Attaching package: 'Biobase' > > The following object(s) are masked from 'package:IRanges': > > updateObject > > Creating package in ./BSgenome.CFlo.NYU.v3 > Saving 'seqlengths' object to compressed data file > './BSgenome.CFlo.NYU.v3/inst/extdata/seqlengths.rda'... DONE > Loading FASTA file '/home/steve/Data/Genomes/cflo_v3.3.fold.fa' in > 'cflo_v3.3.fold' object... DONE > Saving 'cflo_v3.3.fold' object to compressed data file > './BSgenome.CFlo.NYU.v3/inst/extdata/cflo_v3.3.fold.rda'... DONE > Error in .guessGapFileCOL2CLASS(file) : > unable to guess the column names in "gap" file > './cflo_v3.3.fold_gap.txt', sorry > In addition: Warning messages: > 1: In FUN("cflo_v3.3.fold"[[1L]], ...) : > In file '/home/steve/Data/Genomes/cflo_v3.3.fold.fa': 24026 sequences > found, using first sequence only > 2: In FUN("cflo_v3.3.fold"[[1L]], ...) : > In file '/home/steve/Data/Genomes/cflo_v3.3.fold.fa': sequence > description > "scaffold9scaffold12scaffold16scaffold2scaffold4scaffold20scaffold10 scaffold11scaffold24scaffold1scaffold3scaffold6scaffold15scaffold28sca ffold30scaffold22scaffold18scaffold21scaffold7scaffold32scaffold36scaf fold13scaffold23scaffold39scaffold19scaffold41scaffold29scaffold37scaf fold33scaffold45scaffold38scaffold17scaffold5scaffold46scaffold48scaff old31scaffold25scaffold51scaffold49scaffold44scaffold47scaffold54scaff old43scaffold55scaffold60scaffold35scaffold42scaffold63scaffold53scaff old50scaffold40scaffold67scaffold61scaffold69scaffold34scaffold70scaff old27scaffold73scaffold8scaffold75scaffold71scaffold74scaffold14scaffo ld66scaffold65scaffold80scaffold62scaffold76scaffold79scaffold85scaffo ld78scaffold86scaffold88scaffold81scaffold87scaffold26scaffold92scaffo ld93scaffold94scaffold59scaffold52scaffold84scaffold77scaffold89scaffo ld100scaffold97scaffold99scaffold64scaffold103scaffold90scaffold106sca ffold56scaffold98 > [... truncated] > 3: In .forgeSeqFile(name, prefix, suffix, seqs_srcdir, seqs_destdir, : > file contains 24026 sequences, using the first sequence only > 4: In file(file, "rt") : > cannot open file './cflo_v3.3.fold_gap.txt': No such file or directory > 5: In file(file, "rt") : > cannot open file './cflo_v3.3.fold_gap.txt': No such file or directory > > > 2011/1/20 Hervé Pagès <hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">> > > Steve, > > > On 01/19/2011 04:37 PM, Steve Shen wrote: > > Hi Herve, > > Thank you very much. I tried carefully with correct path and > file name > by triggering auto-complete feature. But it didn't help either. > It may > be something wrong with my seed file, so I also appended the > file below. > Please have a look. Thanks in advance, > > Steve > > > forgeBSgenomeDataPkg("/home/steve/Data/Genomes/seed") > Error in as.list(.readSeedFile(x, verbose = verbose)) : > error in evaluating the argument 'x' in selecting a method for > function 'as.list' > > forgeBSgenomeDataPkg("seed") > Error in as.list(.readSeedFile(x, verbose = verbose)) : > error in evaluating the argument 'x' in selecting a method for > function 'as.list' > > > An easy way to check if your seed file is a valid DCF file is: > > > read.dcf("seed") > Error in read.dcf("seed") : > Line starting 'I'm a broken DCF fil ...' is malformed! > > Again, the error message you get when calling forgeBSgenomeDataPkg() on > a broken DCF file should return exactly that, but, unfortunately, it > doesn't with the current version of BSgenome. Updated versions of > BSgenome (1.18.3 in release and 1.19.3 in devel) are correcting this > and will become available tomorrow around noon, and midnight, > respectively (Seattle time). > > > > Seed file starts: > > Package: BSgenome.CFlo.YU.v3 > Title: Camponotus floridanus (Ants) full genome (YU version V3.3) > Description: Camponotus floridanus (Ants) full genome as > provided by YU > (V3.3, Jan. 2011) > > ^^^^^^^^^^^^^^^^^ > Is this on a line of its own in your file? This might be the problem. > > Please consult the help for the read.dcf() function (?read.dcf from > your R session) for how to format a DCF file. Note that a 'tag:value' > should either be placed on a single line, or, if the value spans > several lines, then the extra lines (aka "continuation lines") must > start with a whitespace. Like this: > > > Description: Camponotus floridanus (Ants) full genome as provided > by YU (V3.3, Jan. 2011) > > (I'm using 4 whitespaces here but 1 is enough. Using a <tab> would > also be OK.) > > > Version: 1.0.0 > organism: camponotus floridanus > species: Ant > provider: YU > provider_version: Assembly V3.3 > release_date: Jan, 2011 > release_name: Ant Genome Reference Consortium > source_url: NA > organism_biocview: C_flo > BSgenomeObjname: CFlo > seqnames: "cflo_v3.3.fold" > mseqnames: character(0) > nmask_per_seq:2 > > seqs_srcdir: /home/steve/Data/Genomes/ > > > So you have a FASTA file named "cflo_v3.3.fold.fa" located in > /home/steve/Data/Genomes/ that contains a *single* FASTA record > that corresponds to the cflo_v3.3.fold sequence? Please double check > this. (You can see the list of records contained in a FASTA file with > the fasta.info <http: fasta.info="">() function from Biostrings.) > > H. > > # masks_srcdir: /home/steve/Data/Masks_src > # AGAPSfiles_name: cflo_v3.3.fa.masked > > > 2011/1/19 Hervé Pagès <hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org="">>> > > Hi Steve, > > > On 01/19/2011 01:04 PM, Steve Shen wrote: > > Hi Herve, > > Thank you so much for your quick reply. Your vignette is > pretty > clear. I > think the problem is on my side. I just lack of > experience on > dealing > with this issue and maybe the file format doesn't fit. I > sort of > gave up > using the seed file but just using low level commands. I > still have > errors. I really have no idea what is going on. Your > help will > be much > appreciated. > > Best, > Steve > > 1. with seed file, I got > > forgeBSgenomeDataPkg("./cflo_seed.R") > Error in as.list(.readSeedFile(x, verbose = verbose)) : > error in evaluating the argument 'x' in selecting a > method for > function 'as.list' > > > I agree that the error message is kind of obscure but the > problem > is probably that the specified path to your seed file is > invalid. > I'm about to commit a change to the BSgenomeForge code that will > produce this error message instead: > > > forgeBSgenomeDataPkg("aaaaa") > Error in .readSeedFile(x, verbose = verbose) : > seed file 'aaaaa' not found > > An easy way to make sure the path is valid is to press <tab> > when > you are in the middle of typing the path: this will trigger the > auto-completion feature. > > > > 2. with command forgeSeqlengthsFile, > > forgeSeqlengthsFile("cflo_v3.3.fold", prefix="", > suffix=".fa",seqs_srcdir=".", seqs_destdir=".", > verbose=TRUE) > Saving 'seqlengths' object to compressed data file > './seqlengths.rda'... > DONE > Warning messages: > 1: In FUN("cflo_v3.3.fold"[[1L]], ...) : > In file './cflo_v3.3.fold.fa': 24026 sequences found, > using first > sequence only > 2: In FUN("cflo_v3.3.fold"[[1L]], ...) : > In file './cflo_v3.3.fold.fa': sequence description > "scaffold9scaffold12scaffold16scaffold2scaffold4scaffold20sc affold10scaffold11scaffold24scaffold1scaffold3scaffold6scaffold15scaff old28scaffold30scaffold22scaffold18scaffold21scaffold7scaffold32scaffo ld36scaffold13scaffold23scaffold39scaffold19scaffold41scaffold29scaffo ld37scaffold33scaffold45scaffold38scaffold17scaffold5scaffold46scaffol d48scaffold31scaffold25scaffold51scaffold49scaffold44scaffold47scaffol d54scaffold43scaffold55scaffold60scaffold35scaffold42scaffold63scaffol d53scaffold50scaffold40scaffold67scaffold61scaffold69scaffold34scaffol d70scaffold27scaffold73scaffold8scaffold75scaffold71scaffold74scaffold 14scaffold66scaffold65scaffold80scaffold62scaffold76scaffold79scaffold 85scaffold78scaffold86scaffold88scaffold81scaffold87scaffold26scaffold 92scaffold93scaffold94scaffold59scaffold52scaffold84scaffold77scaffold 89scaffold100scaffold97scaffold99scaffold64scaffold103scaffold90scaffo ld106scaffold56scaffold98scaffold109scaffold68sc > [... truncated] > > 3. with forgeSeqfiles, > > forgeSeqFiles("cflo_v3.3.fold", mseqnames=NULL, prefix="", > suffix=".fa",seqs_srcdir=".", seqs_destdir="./BSgenome", > verbose=TRUE) > Loading FASTA file './cflo_v3.3.fold.fa' in 'cflo_v3.3.fold' > object... DONE > Saving 'cflo_v3.3.fold' object to compressed data file > './BSgenome/cflo_v3.3.fold.rda'... DONE > Warning message: > In .forgeSeqFile(name, prefix, suffix, seqs_srcdir, > seqs_destdir, : > file contains 24026 sequences, using the first > sequence only > > > Sorry but I can only try to help if you stick to the > vignette and > use a seed file. Using low-level functions like > forgeSeqlengthsFile() > or forgeSeqFiles() is undocumented/unsupported. The reason > for this > is that this route is *much* more complicated and error- prone > than using a seed file! This is exactly why seed files where > invented: to make the whole process much easier for you, and to > make troubleshooting much easier for us. > > Cheers, > H. > > > > > 2011/1/19 Hervé Pagès <hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org=""> > <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">> > <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org=""> > > <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">>>> > > > Hi Steve, > > There are several things that look wrong with your > seed file. > > 1. The # must be the first character in a line to > make it a > line of comment (ignored). > For example, those 2 lines will certainly not be > interpreted > as you might expect: > > > mseqnames: NA #paste("upstream", c("1000", "2000", > "5000"), sep="") > > source_url: NA # > http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/ > > 2. If you don't have multiple sequences, specify: > > mseqnames: character(0) > > 3. As explained in the BSgenomeForge vignette, the > default for > seqfiles_prefix is .fa so this won't work: > > seqnames: "clof_v3.fa" > > Unless your file is named clof_v3.fa.fa? > If your file is named clof_v3.fa, then you should > specify: > > seqnames: "clof_v3" > > 4. This doesn't look like a valid file of assembly gaps: > > > AGAPSfiles_name: clof_v3.fa.masked > > Please refer to the BSgenomeForge vignette for > what kind > of masks > and what file formats are supported. > > 5. Having this > > > PkgExamples: Hsapiens > seqlengths(Hsapiens) > Hsapiens$chr1 # same as Hsapiens[["chr1"]] > > in a seed file for clof obviously doesn't make > sense and your > package won't pass 'R CMD check' because it will > contain > broken > examples. > > There might be other problems with your seed file. > All you > need to do > is read and follow carefully the instructions > described in the > BSgenomeForge vignette. Let me know if things are > not clear > in the > vignette and I'll try to improve it. Thanks! > > H. > > > > On 01/18/2011 05:32 PM, Steve Shen wrote: > > Hi list, > > This wasn't sent out with a none registered ids. > I have a > problem with > forging a new BSgenome data package. The > sequence data > file is > clof.v3.fa > and the mask file is clof.v3.fa.masked. Below > are seed file, > command, error > and sessioninfo. Your help will be much appreciated. > > Thanks a lot, > Steve > > Package: BSgenome.Clof.yu.v3 > Title: Clof (insects) full genome (version 3) > Description: Clof (insects) full genome as > provided by > YU (V3.3, > Jan. 2011) > # and will store in Biostrings objects. > Version: 1.0.0 > organism: clof > species: insect > provider: YU > provider_version: Assembly V3.3 > release_date: Jan, 2011 > release_name: insects Genome Reference Consortium > source_url: NA # > http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/ > organism_biocview: Clof > BSgenomeObjname: Clof > seqnames: "clof_v3.fa" > mseqnames: NA #paste("upstream", c("1000", "2000", > "5000"), sep="") > nmask_per_seq: 2 > #SrcDataFiles1: sequences: chromFa.zip, > upstream1000.zip, > upstream2000.zip, > upstream5000.zip > #from > http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/ > #SrcDataFiles2: AGAPS masks: > http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/gap.txt.gz > #RM masks: > http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromOut.tar.gz > #TRF masks: > http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromTrf.tar.gz > PkgExamples: Hsapiens > seqlengths(Hsapiens) > Hsapiens$chr1 # same as Hsapiens[["chr1"]] > seqs_srcdir: /home/steve/Data/Genomes > masks_srcdir: /home/steve/Data/Genomes > AGAPSfiles_name: clof_v3.fa.masked > > The command, error and sessionInfo are below > > forgeBSgenomeDataPkg("Clof_seed.R", > seqs_srcdir=".", > masks_srcdir=".", > > destdir=".", verbose=TRUE) > Loading required package: Biobase > > Welcome to Bioconductor > > Vignettes contain introductory material. To > view, type > 'openVignette()'. To cite Bioconductor, see > 'citation("Biobase")' and for packages 'citation(pkgname)'. > > > Attaching package: 'Biobase' > > The following object(s) are masked from > 'package:IRanges': > > updateObject > > Error in forgeBSgenomeDataPkg(y, seqs_srcdir = > seqs_srcdir, > masks_srcdir = > masks_srcdir, : > values for symbols NMASKPERSEQ are not single > strings > In addition: Warning message: > In storage.mode(x$nmask_per_seq)<- "integer" : NAs > introduced by > coercion > > sessionInfo() > > R version 2.12.1 (2010-12-16) > Platform: x86_64-pc-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 > LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=C > LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets > methods base > > other attached packages: > [1] Biobase_2.6.1 BSgenome_1.18.2 > Biostrings_2.18.0 > [4] GenomicRanges_1.2.1 IRanges_1.8.8 > > loaded via a namespace (and not attached): > [1] tools_2.12.1 > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > <mailto:bioconductor at="" r-project.org=""> <mailto:bioconductor at="" r-project.org="">> > <mailto:bioconductor at="" r-project.org=""> <mailto:bioconductor at="" r-project.org=""> > <mailto:bioconductor at="" r-project.org=""> <mailto:bioconductor at="" r-project.org="">>> > > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M2-B876 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> > <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">> > <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org=""> > <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">>> > > > Phone: (206) 667-5791 > Fax: (206) 667-1319 > > > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M2-B876 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> > <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">> > Phone: (206) 667-5791 > Fax: (206) 667-1319 > > > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M2-B876 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> > Phone: (206) 667-5791 > Fax: (206) 667-1319 > > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD COMMENT • link 13.9 years ago Hervé Pagès 16k

0

Entering edit mode

Hi Herve, Sorry about that I forgot using reply all. Below is my seed file. I am dealing with a brand new genome that was published a couple of months ago. The genome consists of lots of scaffolds and contigs, no chromosome information available. That's why it looks weird. Yous explanation and suggestion does make lot of sense to me. I begin to understand some basic mechanism of BSgenome. This is great and thanks so much. I understand now that the nmask_per_seq parameter should be set to 0 other than 2 (which I made up previously). However, it looks to me that I should probably try some other strategies for this reference genome. Thanks so much for your help. Steve ## seed file start here Package: BSgenome.CFlo.YU.v3 Title: Camponotus floridanus (Ants) full genome (YU version V3.3) Description: Camponotus floridanus (Ants) full genome as provided by YU (V3.3, Jan. 2011) Version: 1.0.0 organism: camponotus floridanus species: Ant provider: YU provider_version: BGI Assembly V3.3 release_date: Jan, 2011 release_name: Ant Genome Reference Consortium source_url: NA organism_biocview: C_flo BSgenomeObjname: CFlo seqnames: "cflo_v3.3.fold" mseqnames: character(0) nmask_per_seq:2 seqs_srcdir: /home/steve/Data/Genomes #masks_srcdir: /home/steve/Data/Masks_src #AGAPSfiles_name: cflo_v3.3.fa.masked ## seed file end here --- 2011/1/26 Hervé Pagès <hpages@fhcrc.org> > Hi Steve, > > Since we've started this on the list, let's keep it there. > > Please provide your seed file so I can get a better picture of what you > are doing. As I already mentioned before, you need to provide 1 FASTA > file per chromosome. But your cflo_v3.3.fold.fa file seems to contain > 24026 sequences! So you have 2 options: (1) either you split it into 1 > file per sequence, (2) either you store it as a multiple sequence in > your BSgenome data package. > > Option 1: It's easy to split a FASTA file into 1 file per FASTA > record using the Biostrings package. You can do something like: > > library(Biostrings) > x <- read.DNAStringSet("cflo_v3.3.fold.fa") > for (i in seq_len(length(x))) { > filepath <- paste(names(x)[i], ".fa", sep="") > write.XStringSet(x[i], filepath) > } > > The problem is that then you will need to list *all* the sequences in > the seqnames field of your seed file. Another problem with this is that > it's probably not the right thing to do. See below. > > Option 2: First some background. As you've probably noticed, the > sequence objects stored in a BSgenome data package are divided in 2 > groups: the group of single sequences and the group of multiple > sequences. If you've ever displayed a BSgenome object that contains > upstream sequences, you can't have missed it. For example: > > > library(BSgenome.Celegans.UCSC.ce2) > > Celegans > Worm genome > | > | organism: Caenorhabditis elegans (Worm) > | provider: UCSC > | provider version: ce2 > | release date: Mar. 2004 > | release name: WormBase v. WS120 > | > | single sequences (see '?seqnames'): > | chrI chrII chrIII chrIV chrV chrX chrM > | > | multiple sequences (see '?mseqnames'): > | upstream1000 upstream2000 upstream5000 > | > | (use the '$' or '[[' operator to access a given sequence) > > Objects in the 1st group are DNAString (or MaskedDNAString) objects. > Each DNAString (or MaskedDNAString) object can only contain 1 sequence. > Objects in the 2nd group are DNAStringSet objects. Each DNAStringSet > object can contain an (almost) arbitrary number of sequences. > The 2 groups are respectively defined thru the seqnames and mseqnames > fields of the seed file. I'm just paraphrasing the existing > documentation here: all this is documented in the man page for > BSgenome objects and in the BSgenomeForge vignette. > > Typically the 1st group will contain chromosome sequences, or, more > generally (and to use Ensembl terminology), the "top-level" sequences > of the genome. This can include the Mitochondrial DNA (often denoted > chrM), the 2-micron plasmid, the haplotype sequences, and other exotic > DNA material that is considered to be at the "top-level". The number > of top-level sequences varies from one genome to the other but it > remains generally small (< 50). > > Typically the 2nd group will contain 1 or more big collections of > sequences. Each collection itself can have thousands of sequences. > But the number of collections should preferably remain small > (i.e. < 50). So I guess this would be the place of choice for your > cflo_v3.3.fold object (which seems to be a collection of 24026 > scaffolds). > > Now for the problem with your gap file, I can't really help without > seeing your seed file first. However, I'm a little bit puzzled that > you have a file containing assembly gaps for your scaffolds: that > doesn't seem to make a lot of sense to me. Anyway, you can't put > masks on a multiple sequence object so, if you choose option 2, you > will have to set nmask_per_seq to 0. > > Also, as a general rule when forging a BSgenome data package, you > should ask yourself whether it is worth the extra effort to deal > with the masks at all. A BSgenome data package doesn't *have* to > contain masks. > > Cheers, > H. > > > > On 01/26/2011 02:53 PM, Steve Shen wrote: > >> Hi Herve, >> >> Thank you so much for your help. I was finally able to have a working >> seed file. However, there seem to be still long way to go. I guess that >> most likely these errors are due to genome itself and gap file. I just >> wonder if you could kindly give me some hints what do these error >> message mean. Is gap file required for building up data package? Why do >> package only use first sequence and others are truncated? Is this >> because of two many sequences? What steps can I take in order to make it >> work with GenomicRange packages? I really appreciate your help. Thanks >> so much, >> >> Steve >> >> > forgeBSgenomeDataPkg("Cflo_seed.R") >> Loading required package: Biobase >> >> Welcome to Bioconductor >> >> Vignettes contain introductory material. To view, type >> 'openVignette()'. To cite Bioconductor, see >> 'citation("Biobase")' and for packages 'citation(pkgname)'. >> >> >> Attaching package: 'Biobase' >> >> The following object(s) are masked from 'package:IRanges': >> >> updateObject >> >> Creating package in ./BSgenome.CFlo.YU.v3 >> Saving 'seqlengths' object to compressed data file >> './BSgenome.CFlo.YU.v3/inst/extdata/seqlengths.rda'... DONE >> Loading FASTA file '/home/steve/Data/Genomes/cflo_v3.3.fold.fa' in >> 'cflo_v3.3.fold' object... DONE >> Saving 'cflo_v3.3.fold' object to compressed data file >> './BSgenome.CFlo.YU.v3/inst/extdata/cflo_v3.3.fold.rda'... DONE >> Error in .guessGapFileCOL2CLASS(file) : >> unable to guess the column names in "gap" file >> './cflo_v3.3.fold_gap.txt', sorry >> In addition: Warning messages: >> 1: In FUN("cflo_v3.3.fold"[[1L]], ...) : >> In file '/home/steve/Data/Genomes/cflo_v3.3.fold.fa': 24026 sequences >> found, using first sequence only >> 2: In FUN("cflo_v3.3.fold"[[1L]], ...) : >> In file '/home/steve/Data/Genomes/cflo_v3.3.fold.fa': sequence >> description >> >> "scaffold9scaffold12scaffold16scaffold2scaffold4scaffold20scaffold1 0scaffold11scaffold24scaffold1scaffold3scaffold6scaffold15scaffold28sc affold30scaffold22scaffold18scaffold21scaffold7scaffold32scaffold36sca ffold13scaffold23scaffold39scaffold19scaffold41scaffold29scaffold37sca ffold33scaffold45scaffold38scaffold17scaffold5scaffold46scaffold48scaf fold31scaffold25scaffold51scaffold49scaffold44scaffold47scaffold54scaf fold43scaffold55scaffold60scaffold35scaffold42scaffold63scaffold53scaf fold50scaffold40scaffold67scaffold61scaffold69scaffold34scaffold70scaf fold27scaffold73scaffold8scaffold75scaffold71scaffold74scaffold14scaff old66scaffold65scaffold80scaffold62scaffold76scaffold79scaffold85scaff old78scaffold86scaffold88scaffold81scaffold87scaffold26scaffold92scaff old93scaffold94scaffold59scaffold52scaffold84scaffold77scaffold89scaff old100scaffold97scaffold99scaffold64scaffold103scaffold90scaffold106sc affold56scaffold98 >> [... truncated] >> 3: In .forgeSeqFile(name, prefix, suffix, seqs_srcdir, seqs_destdir, : >> file contains 24026 sequences, using the first sequence only >> 4: In file(file, "rt") : >> cannot open file './cflo_v3.3.fold_gap.txt': No such file or directory >> 5: In file(file, "rt") : >> cannot open file './cflo_v3.3.fold_gap.txt': No such file or directory >> >> >> 2011/1/20 Hervé Pagès <hpages@fhcrc.org <mailto:hpages@fhcrc.org="">> >> >> >> Steve, >> >> >> On 01/19/2011 04:37 PM, Steve Shen wrote: >> >> Hi Herve, >> >> Thank you very much. I tried carefully with correct path and >> file name >> by triggering auto-complete feature. But it didn't help either. >> It may >> be something wrong with my seed file, so I also appended the >> file below. >> Please have a look. Thanks in advance, >> >> Steve >> >> > forgeBSgenomeDataPkg("/home/steve/Data/Genomes/seed") >> Error in as.list(.readSeedFile(x, verbose = verbose)) : >> error in evaluating the argument 'x' in selecting a method for >> function 'as.list' >> > forgeBSgenomeDataPkg("seed") >> Error in as.list(.readSeedFile(x, verbose = verbose)) : >> error in evaluating the argument 'x' in selecting a method for >> function 'as.list' >> >> >> An easy way to check if your seed file is a valid DCF file is: >> >> > read.dcf("seed") >> Error in read.dcf("seed") : >> Line starting 'I'm a broken DCF fil ...' is malformed! >> >> Again, the error message you get when calling forgeBSgenomeDataPkg() on >> a broken DCF file should return exactly that, but, unfortunately, it >> doesn't with the current version of BSgenome. Updated versions of >> BSgenome (1.18.3 in release and 1.19.3 in devel) are correcting this >> and will become available tomorrow around noon, and midnight, >> respectively (Seattle time). >> >> >> >> Seed file starts: >> >> Package: BSgenome.CFlo.YU.v3 >> Title: Camponotus floridanus (Ants) full genome (YU version V3.3) >> Description: Camponotus floridanus (Ants) full genome as >> provided by YU >> (V3.3, Jan. 2011) >> >> ^^^^^^^^^^^^^^^^^ >> Is this on a line of its own in your file? This might be the problem. >> >> Please consult the help for the read.dcf() function (?read.dcf from >> your R session) for how to format a DCF file. Note that a 'tag:value' >> should either be placed on a single line, or, if the value spans >> several lines, then the extra lines (aka "continuation lines") must >> start with a whitespace. Like this: >> >> >> Description: Camponotus floridanus (Ants) full genome as provided >> by YU (V3.3, Jan. 2011) >> >> (I'm using 4 whitespaces here but 1 is enough. Using a <tab> would >> also be OK.) >> >> >> Version: 1.0.0 >> organism: camponotus floridanus >> species: Ant >> provider: YU >> provider_version: Assembly V3.3 >> release_date: Jan, 2011 >> release_name: Ant Genome Reference Consortium >> source_url: NA >> organism_biocview: C_flo >> BSgenomeObjname: CFlo >> seqnames: "cflo_v3.3.fold" >> mseqnames: character(0) >> nmask_per_seq:2 >> >> seqs_srcdir: /home/steve/Data/Genomes/ >> >> >> So you have a FASTA file named "cflo_v3.3.fold.fa" located in >> /home/steve/Data/Genomes/ that contains a *single* FASTA record >> that corresponds to the cflo_v3.3.fold sequence? Please double check >> this. (You can see the list of records contained in a FASTA file with >> the fasta.info <http: fasta.info="">() function from Biostrings.) >> >> >> H. >> >> # masks_srcdir: /home/steve/Data/Masks_src >> # AGAPSfiles_name: cflo_v3.3.fa.masked >> >> >> 2011/1/19 Hervé Pagès <hpages@fhcrc.org>> <mailto:hpages@fhcrc.org> <mailto:hpages@fhcrc.org>> <mailto:hpages@fhcrc.org>>> >> >> Hi Steve, >> >> >> On 01/19/2011 01:04 PM, Steve Shen wrote: >> >> Hi Herve, >> >> Thank you so much for your quick reply. Your vignette is >> pretty >> clear. I >> think the problem is on my side. I just lack of >> experience on >> dealing >> with this issue and maybe the file format doesn't fit. I >> sort of >> gave up >> using the seed file but just using low level commands. I >> still have >> errors. I really have no idea what is going on. Your >> help will >> be much >> appreciated. >> >> Best, >> Steve >> >> 1. with seed file, I got >> > forgeBSgenomeDataPkg("./cflo_seed.R") >> Error in as.list(.readSeedFile(x, verbose = verbose)) : >> error in evaluating the argument 'x' in selecting a >> method for >> function 'as.list' >> >> >> I agree that the error message is kind of obscure but the >> problem >> is probably that the specified path to your seed file is >> invalid. >> I'm about to commit a change to the BSgenomeForge code that >> will >> produce this error message instead: >> >> > forgeBSgenomeDataPkg("aaaaa") >> Error in .readSeedFile(x, verbose = verbose) : >> seed file 'aaaaa' not found >> >> An easy way to make sure the path is valid is to press <tab> >> when >> you are in the middle of typing the path: this will trigger the >> auto-completion feature. >> >> >> >> 2. with command forgeSeqlengthsFile, >> > forgeSeqlengthsFile("cflo_v3.3.fold", prefix="", >> suffix=".fa",seqs_srcdir=".", seqs_destdir=".", >> verbose=TRUE) >> Saving 'seqlengths' object to compressed data file >> './seqlengths.rda'... >> DONE >> Warning messages: >> 1: In FUN("cflo_v3.3.fold"[[1L]], ...) : >> In file './cflo_v3.3.fold.fa': 24026 sequences found, >> using first >> sequence only >> 2: In FUN("cflo_v3.3.fold"[[1L]], ...) : >> In file './cflo_v3.3.fold.fa': sequence description >> >> "scaffold9scaffold12scaffold16scaffold2scaffold4scaffold20scaffold 10scaffold11scaffold24scaffold1scaffold3scaffold6scaffold15scaffold28s caffold30scaffold22scaffold18scaffold21scaffold7scaffold32scaffold36sc affold13scaffold23scaffold39scaffold19scaffold41scaffold29scaffold37sc affold33scaffold45scaffold38scaffold17scaffold5scaffold46scaffold48sca ffold31scaffold25scaffold51scaffold49scaffold44scaffold47scaffold54sca ffold43scaffold55scaffold60scaffold35scaffold42scaffold63scaffold53sca ffold50scaffold40scaffold67scaffold61scaffold69scaffold34scaffold70sca ffold27scaffold73scaffold8scaffold75scaffold71scaffold74scaffold14scaf fold66scaffold65scaffold80scaffold62scaffold76scaffold79scaffold85scaf fold78scaffold86scaffold88scaffold81scaffold87scaffold26scaffold92scaf fold93scaffold94scaffold59scaffold52scaffold84scaffold77scaffold89scaf fold100scaffold97scaffold99scaffold64scaffold103scaffold90scaffold106s caffold56scaffold98scaffold109scaffold68sc >> [... truncated] >> >> 3. with forgeSeqfiles, >> > forgeSeqFiles("cflo_v3.3.fold", mseqnames=NULL, prefix="", >> suffix=".fa",seqs_srcdir=".", seqs_destdir="./BSgenome", >> verbose=TRUE) >> Loading FASTA file './cflo_v3.3.fold.fa' in >> 'cflo_v3.3.fold' >> object... DONE >> Saving 'cflo_v3.3.fold' object to compressed data file >> './BSgenome/cflo_v3.3.fold.rda'... DONE >> Warning message: >> In .forgeSeqFile(name, prefix, suffix, seqs_srcdir, >> seqs_destdir, : >> file contains 24026 sequences, using the first >> sequence only >> >> >> Sorry but I can only try to help if you stick to the >> vignette and >> use a seed file. Using low-level functions like >> forgeSeqlengthsFile() >> or forgeSeqFiles() is undocumented/unsupported. The reason >> for this >> is that this route is *much* more complicated and error- prone >> than using a seed file! This is exactly why seed files where >> invented: to make the whole process much easier for you, and to >> make troubleshooting much easier for us. >> >> Cheers, >> H. >> >> >> >> >> 2011/1/19 Hervé Pagès <hpages@fhcrc.org>> <mailto:hpages@fhcrc.org> >> <mailto:hpages@fhcrc.org <mailto:hpages@fhcrc.org="">> >> <mailto:hpages@fhcrc.org <mailto:hpages@fhcrc.org=""> >> >> <mailto:hpages@fhcrc.org <mailto:hpages@fhcrc.org="">>>> >> >> >> >> Hi Steve, >> >> There are several things that look wrong with your >> seed file. >> >> 1. The # must be the first character in a line to >> make it a >> line of comment (ignored). >> For example, those 2 lines will certainly not be >> interpreted >> as you might expect: >> >> >> mseqnames: NA #paste("upstream", c("1000", "2000", >> "5000"), sep="") >> >> source_url: NA # >> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/ >> >> 2. If you don't have multiple sequences, specify: >> >> mseqnames: character(0) >> >> 3. As explained in the BSgenomeForge vignette, the >> default for >> seqfiles_prefix is .fa so this won't work: >> >> seqnames: "clof_v3.fa" >> >> Unless your file is named clof_v3.fa.fa? >> If your file is named clof_v3.fa, then you should >> specify: >> >> seqnames: "clof_v3" >> >> 4. This doesn't look like a valid file of assembly >> gaps: >> >> >> AGAPSfiles_name: clof_v3.fa.masked >> >> Please refer to the BSgenomeForge vignette for >> what kind >> of masks >> and what file formats are supported. >> >> 5. Having this >> >> >> PkgExamples: Hsapiens >> seqlengths(Hsapiens) >> Hsapiens$chr1 # same as Hsapiens[["chr1"]] >> >> in a seed file for clof obviously doesn't make >> sense and your >> package won't pass 'R CMD check' because it will >> contain >> broken >> examples. >> >> There might be other problems with your seed file. >> All you >> need to do >> is read and follow carefully the instructions >> described in the >> BSgenomeForge vignette. Let me know if things are >> not clear >> in the >> vignette and I'll try to improve it. Thanks! >> >> H. >> >> >> >> On 01/18/2011 05:32 PM, Steve Shen wrote: >> >> Hi list, >> >> This wasn't sent out with a none registered ids. >> I have a >> problem with >> forging a new BSgenome data package. The >> sequence data >> file is >> clof.v3.fa >> and the mask file is clof.v3.fa.masked. Below >> are seed file, >> command, error >> and sessioninfo. Your help will be much >> appreciated. >> >> Thanks a lot, >> Steve >> >> Package: BSgenome.Clof.yu.v3 >> Title: Clof (insects) full genome (version 3) >> Description: Clof (insects) full genome as >> provided by >> YU (V3.3, >> Jan. 2011) >> # and will store in Biostrings objects. >> Version: 1.0.0 >> organism: clof >> species: insect >> provider: YU >> provider_version: Assembly V3.3 >> release_date: Jan, 2011 >> release_name: insects Genome Reference Consortium >> source_url: NA # >> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/ >> organism_biocview: Clof >> BSgenomeObjname: Clof >> seqnames: "clof_v3.fa" >> mseqnames: NA #paste("upstream", c("1000", "2000", >> "5000"), sep="") >> nmask_per_seq: 2 >> #SrcDataFiles1: sequences: chromFa.zip, >> upstream1000.zip, >> upstream2000.zip, >> upstream5000.zip >> #from >> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/ >> #SrcDataFiles2: AGAPS masks: >> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/gap.txt.gz >> #RM masks: >> >> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromOut.tar.gz >> #TRF masks: >> >> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromTrf.tar.gz >> PkgExamples: Hsapiens >> seqlengths(Hsapiens) >> Hsapiens$chr1 # same as Hsapiens[["chr1"]] >> seqs_srcdir: /home/steve/Data/Genomes >> masks_srcdir: /home/steve/Data/Genomes >> AGAPSfiles_name: clof_v3.fa.masked >> >> The command, error and sessionInfo are below >> >> forgeBSgenomeDataPkg("Clof_seed.R", >> seqs_srcdir=".", >> masks_srcdir=".", >> >> destdir=".", verbose=TRUE) >> Loading required package: Biobase >> >> Welcome to Bioconductor >> >> Vignettes contain introductory material. To >> view, type >> 'openVignette()'. To cite Bioconductor, see >> 'citation("Biobase")' and for packages 'citation(pkgname)'. >> >> >> Attaching package: 'Biobase' >> >> The following object(s) are masked from >> 'package:IRanges': >> >> updateObject >> >> Error in forgeBSgenomeDataPkg(y, seqs_srcdir = >> seqs_srcdir, >> masks_srcdir = >> masks_srcdir, : >> values for symbols NMASKPERSEQ are not single >> strings >> In addition: Warning message: >> In storage.mode(x$nmask_per_seq)<- "integer" : NAs >> introduced by >> coercion >> >> sessionInfo() >> >> R version 2.12.1 (2010-12-16) >> Platform: x86_64-pc-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 >> LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=C >> LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats graphics grDevices utils >> datasets >> methods base >> >> other attached packages: >> [1] Biobase_2.6.1 BSgenome_1.18.2 >> Biostrings_2.18.0 >> [4] GenomicRanges_1.2.1 IRanges_1.8.8 >> >> loaded via a namespace (and not attached): >> [1] tools_2.12.1 >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org <mailto:bioconductor@r-project.org> >> <mailto:bioconductor@r-project.org>> <mailto:bioconductor@r-project.org>> >> <mailto:bioconductor@r-project.org>> <mailto:bioconductor@r-project.org> >> <mailto:bioconductor@r-project.org>> <mailto:bioconductor@r-project.org>>> >> >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M2-B876 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages@fhcrc.org <mailto:hpages@fhcrc.org> >> <mailto:hpages@fhcrc.org <mailto:hpages@fhcrc.org="">> >> <mailto:hpages@fhcrc.org <mailto:hpages@fhcrc.org=""> >> <mailto:hpages@fhcrc.org <mailto:hpages@fhcrc.org="">>> >> >> >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 >> >> >> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M2-B876 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages@fhcrc.org <mailto:hpages@fhcrc.org> >> <mailto:hpages@fhcrc.org <mailto:hpages@fhcrc.org="">> >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 >> >> >> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M2-B876 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages@fhcrc.org <mailto:hpages@fhcrc.org> >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 >> >> >> > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M2-B876 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > [[alternative HTML version deleted]]

ADD REPLY • link 13.9 years ago Steve Shen ▴ 330

0

Entering edit mode

Steve, On 01/26/2011 06:41 PM, Steve Shen wrote: > Hi Herve, > > Sorry about that I forgot using reply all. Below is my seed file. I am > dealing with a brand new genome that was published a couple of months > ago. The genome consists of lots of scaffolds and contigs, no chromosome > information available. That's why it looks weird. Yous explanation and > suggestion does make lot of sense to me. I begin to understand some > basic mechanism of BSgenome. This is great and thanks so much. Good. Thanks for showing your seed file. I can confirm that you should probably set seqnames to character(0) and mseqnames to "cflo_v3.3.fold". However, I should point out that making a BSgenome package with a single DNAStringSet object in it offers limited added value compared to the solution that consists in simply loading the DNAStringSet object with read.DNAStringSet() every time you need to work with it. The small added value of the BSgenome wrapping here is that it's versioned and annotated and also it makes it easier to distribute and update thru the standard R package management infrastructure (local CRAN-style repository / install.packages() / update.packages()). > > I understand now that the nmask_per_seq parameter should be set to 0 > other than 2 (which I made up previously). Yes it will be easier if you set nmask_per_seq to 0, at least for now. If it turns out that you really want to have masks, then you can always add them later e.g. in the version 1.1.0 of the package. > However, it looks to me that > I should probably try some other strategies for this reference genome. Good luck. Let us know how it goes. H. > Thanks so much for your help. > > Steve > > ## seed file start here > Package: BSgenome.CFlo.YU.v3 > Title: Camponotus floridanus (Ants) full genome (YU version V3.3) > Description: Camponotus floridanus (Ants) full genome as provided by YU > (V3.3, Jan. 2011) > Version: 1.0.0 > organism: camponotus floridanus > species: Ant > provider: YU > provider_version: BGI Assembly V3.3 > release_date: Jan, 2011 > release_name: Ant Genome Reference Consortium > source_url: NA > organism_biocview: C_flo > BSgenomeObjname: CFlo > seqnames: "cflo_v3.3.fold" > mseqnames: character(0) > nmask_per_seq:2 > seqs_srcdir: /home/steve/Data/Genomes > #masks_srcdir: /home/steve/Data/Masks_src > #AGAPSfiles_name: cflo_v3.3.fa.masked > ## seed file end here --- > > 2011/1/26 Hervé Pagès <hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">> > > Hi Steve, > > Since we've started this on the list, let's keep it there. > > Please provide your seed file so I can get a better picture of what you > are doing. As I already mentioned before, you need to provide 1 FASTA > file per chromosome. But your cflo_v3.3.fold.fa file seems to contain > 24026 sequences! So you have 2 options: (1) either you split it into 1 > file per sequence, (2) either you store it as a multiple sequence in > your BSgenome data package. > > Option 1: It's easy to split a FASTA file into 1 file per FASTA > record using the Biostrings package. You can do something like: > > library(Biostrings) > x <- read.DNAStringSet("cflo_v3.3.fold.fa") > for (i in seq_len(length(x))) { > filepath <- paste(names(x)[i], ".fa", sep="") > write.XStringSet(x[i], filepath) > } > > The problem is that then you will need to list *all* the sequences in > the seqnames field of your seed file. Another problem with this is that > it's probably not the right thing to do. See below. > > Option 2: First some background. As you've probably noticed, the > sequence objects stored in a BSgenome data package are divided in 2 > groups: the group of single sequences and the group of multiple > sequences. If you've ever displayed a BSgenome object that contains > upstream sequences, you can't have missed it. For example: > > > library(BSgenome.Celegans.UCSC.ce2) > > Celegans > Worm genome > | > | organism: Caenorhabditis elegans (Worm) > | provider: UCSC > | provider version: ce2 > | release date: Mar. 2004 > | release name: WormBase v. WS120 > | > | single sequences (see '?seqnames'): > | chrI chrII chrIII chrIV chrV chrX chrM > | > | multiple sequences (see '?mseqnames'): > | upstream1000 upstream2000 upstream5000 > | > | (use the '$' or '[[' operator to access a given sequence) > > Objects in the 1st group are DNAString (or MaskedDNAString) objects. > Each DNAString (or MaskedDNAString) object can only contain 1 sequence. > Objects in the 2nd group are DNAStringSet objects. Each DNAStringSet > object can contain an (almost) arbitrary number of sequences. > The 2 groups are respectively defined thru the seqnames and mseqnames > fields of the seed file. I'm just paraphrasing the existing > documentation here: all this is documented in the man page for > BSgenome objects and in the BSgenomeForge vignette. > > Typically the 1st group will contain chromosome sequences, or, more > generally (and to use Ensembl terminology), the "top-level" sequences > of the genome. This can include the Mitochondrial DNA (often denoted > chrM), the 2-micron plasmid, the haplotype sequences, and other exotic > DNA material that is considered to be at the "top-level". The number > of top-level sequences varies from one genome to the other but it > remains generally small (< 50). > > Typically the 2nd group will contain 1 or more big collections of > sequences. Each collection itself can have thousands of sequences. > But the number of collections should preferably remain small > (i.e. < 50). So I guess this would be the place of choice for your > cflo_v3.3.fold object (which seems to be a collection of 24026 > scaffolds). > > Now for the problem with your gap file, I can't really help without > seeing your seed file first. However, I'm a little bit puzzled that > you have a file containing assembly gaps for your scaffolds: that > doesn't seem to make a lot of sense to me. Anyway, you can't put > masks on a multiple sequence object so, if you choose option 2, you > will have to set nmask_per_seq to 0. > > Also, as a general rule when forging a BSgenome data package, you > should ask yourself whether it is worth the extra effort to deal > with the masks at all. A BSgenome data package doesn't *have* to > contain masks. > > Cheers, > H. > > > > On 01/26/2011 02:53 PM, Steve Shen wrote: > > Hi Herve, > > Thank you so much for your help. I was finally able to have a > working > seed file. However, there seem to be still long way to go. I > guess that > most likely these errors are due to genome itself and gap file. > I just > wonder if you could kindly give me some hints what do these error > message mean. Is gap file required for building up data package? > Why do > package only use first sequence and others are truncated? Is this > because of two many sequences? What steps can I take in order to > make it > work with GenomicRange packages? I really appreciate your help. > Thanks > so much, > > Steve > > > forgeBSgenomeDataPkg("Cflo_seed.R") > Loading required package: Biobase > > Welcome to Bioconductor > > Vignettes contain introductory material. To view, type > 'openVignette()'. To cite Bioconductor, see > 'citation("Biobase")' and for packages 'citation(pkgname)'. > > > Attaching package: 'Biobase' > > The following object(s) are masked from 'package:IRanges': > > updateObject > > Creating package in ./BSgenome.CFlo.YU.v3 > Saving 'seqlengths' object to compressed data file > './BSgenome.CFlo.YU.v3/inst/extdata/seqlengths.rda'... DONE > Loading FASTA file '/home/steve/Data/Genomes/cflo_v3.3.fold.fa' in > 'cflo_v3.3.fold' object... DONE > Saving 'cflo_v3.3.fold' object to compressed data file > './BSgenome.CFlo.YU.v3/inst/extdata/cflo_v3.3.fold.rda'... DONE > Error in .guessGapFileCOL2CLASS(file) : > unable to guess the column names in "gap" file > './cflo_v3.3.fold_gap.txt', sorry > In addition: Warning messages: > 1: In FUN("cflo_v3.3.fold"[[1L]], ...) : > In file '/home/steve/Data/Genomes/cflo_v3.3.fold.fa': 24026 > sequences > found, using first sequence only > 2: In FUN("cflo_v3.3.fold"[[1L]], ...) : > In file '/home/steve/Data/Genomes/cflo_v3.3.fold.fa': sequence > description > "scaffold9scaffold12scaffold16scaffold2scaffold4scaffold20sc affold10scaffold11scaffold24scaffold1scaffold3scaffold6scaffold15scaff old28scaffold30scaffold22scaffold18scaffold21scaffold7scaffold32scaffo ld36scaffold13scaffold23scaffold39scaffold19scaffold41scaffold29scaffo ld37scaffold33scaffold45scaffold38scaffold17scaffold5scaffold46scaffol d48scaffold31scaffold25scaffold51scaffold49scaffold44scaffold47scaffol d54scaffold43scaffold55scaffold60scaffold35scaffold42scaffold63scaffol d53scaffold50scaffold40scaffold67scaffold61scaffold69scaffold34scaffol d70scaffold27scaffold73scaffold8scaffold75scaffold71scaffold74scaffold 14scaffold66scaffold65scaffold80scaffold62scaffold76scaffold79scaffold 85scaffold78scaffold86scaffold88scaffold81scaffold87scaffold26scaffold 92scaffold93scaffold94scaffold59scaffold52scaffold84scaffold77scaffold 89scaffold100scaffold97scaffold99scaffold64scaffold103scaffold90scaffo ld106scaffold56scaffold98 > [... truncated] > 3: In .forgeSeqFile(name, prefix, suffix, seqs_srcdir, > seqs_destdir, : > file contains 24026 sequences, using the first sequence only > 4: In file(file, "rt") : > cannot open file './cflo_v3.3.fold_gap.txt': No such file or > directory > 5: In file(file, "rt") : > cannot open file './cflo_v3.3.fold_gap.txt': No such file or > directory > > > 2011/1/20 Hervé Pagès <hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org="">>> > > > Steve, > > > On 01/19/2011 04:37 PM, Steve Shen wrote: > > Hi Herve, > > Thank you very much. I tried carefully with correct path and > file name > by triggering auto-complete feature. But it didn't help > either. > It may > be something wrong with my seed file, so I also appended the > file below. > Please have a look. Thanks in advance, > > Steve > > > forgeBSgenomeDataPkg("/home/steve/Data/Genomes/seed") > Error in as.list(.readSeedFile(x, verbose = verbose)) : > error in evaluating the argument 'x' in selecting a > method for > function 'as.list' > > forgeBSgenomeDataPkg("seed") > Error in as.list(.readSeedFile(x, verbose = verbose)) : > error in evaluating the argument 'x' in selecting a > method for > function 'as.list' > > > An easy way to check if your seed file is a valid DCF file is: > > > read.dcf("seed") > Error in read.dcf("seed") : > Line starting 'I'm a broken DCF fil ...' is malformed! > > Again, the error message you get when calling > forgeBSgenomeDataPkg() on > a broken DCF file should return exactly that, but, > unfortunately, it > doesn't with the current version of BSgenome. Updated > versions of > BSgenome (1.18.3 in release and 1.19.3 in devel) are > correcting this > and will become available tomorrow around noon, and midnight, > respectively (Seattle time). > > > > Seed file starts: > > Package: BSgenome.CFlo.YU.v3 > Title: Camponotus floridanus (Ants) full genome (YU > version V3.3) > Description: Camponotus floridanus (Ants) full genome as > provided by YU > (V3.3, Jan. 2011) > > ^^^^^^^^^^^^^^^^^ > Is this on a line of its own in your file? This might be the > problem. > > Please consult the help for the read.dcf() function > (?read.dcf from > your R session) for how to format a DCF file. Note that a > 'tag:value' > should either be placed on a single line, or, if the value spans > several lines, then the extra lines (aka "continuation > lines") must > start with a whitespace. Like this: > > > Description: Camponotus floridanus (Ants) full genome as > provided > by YU (V3.3, Jan. 2011) > > (I'm using 4 whitespaces here but 1 is enough. Using a <tab> > would > also be OK.) > > > Version: 1.0.0 > organism: camponotus floridanus > species: Ant > provider: YU > provider_version: Assembly V3.3 > release_date: Jan, 2011 > release_name: Ant Genome Reference Consortium > source_url: NA > organism_biocview: C_flo > BSgenomeObjname: CFlo > seqnames: "cflo_v3.3.fold" > mseqnames: character(0) > nmask_per_seq:2 > > seqs_srcdir: /home/steve/Data/Genomes/ > > > So you have a FASTA file named "cflo_v3.3.fold.fa" located in > /home/steve/Data/Genomes/ that contains a *single* FASTA record > that corresponds to the cflo_v3.3.fold sequence? Please > double check > this. (You can see the list of records contained in a FASTA > file with > the fasta.info <http: fasta.info=""> <http: fasta.info="">() > function from Biostrings.) > > > H. > > # masks_srcdir: /home/steve/Data/Masks_src > # AGAPSfiles_name: cflo_v3.3.fa.masked > > > 2011/1/19 Hervé Pagès <hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org=""> > <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">> > <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org=""> > <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">>>> > > Hi Steve, > > > On 01/19/2011 01:04 PM, Steve Shen wrote: > > Hi Herve, > > Thank you so much for your quick reply. Your > vignette is > pretty > clear. I > think the problem is on my side. I just lack of > experience on > dealing > with this issue and maybe the file format > doesn't fit. I > sort of > gave up > using the seed file but just using low level > commands. I > still have > errors. I really have no idea what is going on. Your > help will > be much > appreciated. > > Best, > Steve > > 1. with seed file, I got > > forgeBSgenomeDataPkg("./cflo_seed.R") > Error in as.list(.readSeedFile(x, verbose = > verbose)) : > error in evaluating the argument 'x' in > selecting a > method for > function 'as.list' > > > I agree that the error message is kind of obscure > but the > problem > is probably that the specified path to your seed file is > invalid. > I'm about to commit a change to the BSgenomeForge > code that will > produce this error message instead: > > > forgeBSgenomeDataPkg("aaaaa") > Error in .readSeedFile(x, verbose = verbose) : > seed file 'aaaaa' not found > > An easy way to make sure the path is valid is to > press <tab> > when > you are in the middle of typing the path: this will > trigger the > auto-completion feature. > > > > 2. with command forgeSeqlengthsFile, > > forgeSeqlengthsFile("cflo_v3.3.fold", prefix="", > suffix=".fa",seqs_srcdir=".", seqs_destdir=".", > verbose=TRUE) > Saving 'seqlengths' object to compressed data file > './seqlengths.rda'... > DONE > Warning messages: > 1: In FUN("cflo_v3.3.fold"[[1L]], ...) : > In file './cflo_v3.3.fold.fa': 24026 > sequences found, > using first > sequence only > 2: In FUN("cflo_v3.3.fold"[[1L]], ...) : > In file './cflo_v3.3.fold.fa': sequence > description > "scaffold9scaffold12scaffold16scaffold2scaffold4scaffold20sc affold10scaffold11scaffold24scaffold1scaffold3scaffold6scaffold15scaff old28scaffold30scaffold22scaffold18scaffold21scaffold7scaffold32scaffo ld36scaffold13scaffold23scaffold39scaffold19scaffold41scaffold29scaffo ld37scaffold33scaffold45scaffold38scaffold17scaffold5scaffold46scaffol d48scaffold31scaffold25scaffold51scaffold49scaffold44scaffold47scaffol d54scaffold43scaffold55scaffold60scaffold35scaffold42scaffold63scaffol d53scaffold50scaffold40scaffold67scaffold61scaffold69scaffold34scaffol d70scaffold27scaffold73scaffold8scaffold75scaffold71scaffold74scaffold 14scaffold66scaffold65scaffold80scaffold62scaffold76scaffold79scaffold 85scaffold78scaffold86scaffold88scaffold81scaffold87scaffold26scaffold 92scaffold93scaffold94scaffold59scaffold52scaffold84scaffold77scaffold 89scaffold100scaffold97scaffold99scaffold64scaffold103scaffold90scaffo ld106scaffold56scaffold98scaffold109scaffold68sc > [... truncated] > > 3. with forgeSeqfiles, > > forgeSeqFiles("cflo_v3.3.fold", mseqnames=NULL, prefix="", > suffix=".fa",seqs_srcdir=".", > seqs_destdir="./BSgenome", > verbose=TRUE) > Loading FASTA file './cflo_v3.3.fold.fa' in > 'cflo_v3.3.fold' > object... DONE > Saving 'cflo_v3.3.fold' object to compressed > data file > './BSgenome/cflo_v3.3.fold.rda'... DONE > Warning message: > In .forgeSeqFile(name, prefix, suffix, seqs_srcdir, > seqs_destdir, : > file contains 24026 sequences, using the first > sequence only > > > Sorry but I can only try to help if you stick to the > vignette and > use a seed file. Using low-level functions like > forgeSeqlengthsFile() > or forgeSeqFiles() is undocumented/unsupported. The > reason > for this > is that this route is *much* more complicated and > error-prone > than using a seed file! This is exactly why seed > files where > invented: to make the whole process much easier for > you, and to > make troubleshooting much easier for us. > > Cheers, > H. > > > > > 2011/1/19 Hervé Pagès <hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org=""> > <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">> > <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org=""> > <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">>> > <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org=""> > <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">> > > <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org=""> > <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">>>>> > > > > Hi Steve, > > There are several things that look wrong > with your > seed file. > > 1. The # must be the first character in a > line to > make it a > line of comment (ignored). > For example, those 2 lines will certainly > not be > interpreted > as you might expect: > > > mseqnames: NA #paste("upstream", > c("1000", "2000", > "5000"), sep="") > > source_url: NA # > http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/ > > 2. If you don't have multiple sequences, > specify: > > mseqnames: character(0) > > 3. As explained in the BSgenomeForge > vignette, the > default for > seqfiles_prefix is .fa so this won't work: > > seqnames: "clof_v3.fa" > > Unless your file is named clof_v3.fa.fa? > If your file is named clof_v3.fa, then > you should > specify: > > seqnames: "clof_v3" > > 4. This doesn't look like a valid file of > assembly gaps: > > > AGAPSfiles_name: clof_v3.fa.masked > > Please refer to the BSgenomeForge > vignette for > what kind > of masks > and what file formats are supported. > > 5. Having this > > > PkgExamples: Hsapiens > seqlengths(Hsapiens) > Hsapiens$chr1 # same as > Hsapiens[["chr1"]] > > in a seed file for clof obviously doesn't > make > sense and your > package won't pass 'R CMD check' because > it will > contain > broken > examples. > > There might be other problems with your seed > file. > All you > need to do > is read and follow carefully the instructions > described in the > BSgenomeForge vignette. Let me know if > things are > not clear > in the > vignette and I'll try to improve it. Thanks! > > H. > > > > On 01/18/2011 05:32 PM, Steve Shen wrote: > > Hi list, > > This wasn't sent out with a none > registered ids. > I have a > problem with > forging a new BSgenome data package. The > sequence data > file is > clof.v3.fa > and the mask file is clof.v3.fa.masked. > Below > are seed file, > command, error > and sessioninfo. Your help will be much > appreciated. > > Thanks a lot, > Steve > > Package: BSgenome.Clof.yu.v3 > Title: Clof (insects) full genome > (version 3) > Description: Clof (insects) full genome as > provided by > YU (V3.3, > Jan. 2011) > # and will store in Biostrings objects. > Version: 1.0.0 > organism: clof > species: insect > provider: YU > provider_version: Assembly V3.3 > release_date: Jan, 2011 > release_name: insects Genome Reference > Consortium > source_url: NA # > http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/ > organism_biocview: Clof > BSgenomeObjname: Clof > seqnames: "clof_v3.fa" > mseqnames: NA #paste("upstream", > c("1000", "2000", > "5000"), sep="") > nmask_per_seq: 2 > #SrcDataFiles1: sequences: chromFa.zip, > upstream1000.zip, > upstream2000.zip, > upstream5000.zip > #from > http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/ > #SrcDataFiles2: AGAPS masks: > http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/gap.txt.gz > #RM masks: > http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromOut.tar.gz > #TRF masks: > http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromTrf.tar.gz > PkgExamples: Hsapiens > seqlengths(Hsapiens) > Hsapiens$chr1 # same as > Hsapiens[["chr1"]] > seqs_srcdir: /home/steve/Data/Genomes > masks_srcdir: /home/steve/Data/Genomes > AGAPSfiles_name: clof_v3.fa.masked > > The command, error and sessionInfo are below > > forgeBSgenomeDataPkg("Clof_seed.R", > seqs_srcdir=".", > masks_srcdir=".", > > destdir=".", verbose=TRUE) > Loading required package: Biobase > > Welcome to Bioconductor > > Vignettes contain introductory > material. To > view, type > 'openVignette()'. To cite Bioconductor, see > 'citation("Biobase")' and for packages 'citation(pkgname)'. > > > Attaching package: 'Biobase' > > The following object(s) are masked from > 'package:IRanges': > > updateObject > > Error in forgeBSgenomeDataPkg(y, > seqs_srcdir = > seqs_srcdir, > masks_srcdir = > masks_srcdir, : > values for symbols NMASKPERSEQ are > not single > strings > In addition: Warning message: > In storage.mode(x$nmask_per_seq)<- > "integer" : NAs > introduced by > coercion > > sessionInfo() > > R version 2.12.1 (2010-12-16) > Platform: x86_64-pc-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 > LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 > LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=C > LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C > [9] LC_ADDRESS=C > LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 > LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils > datasets > methods base > > other attached packages: > [1] Biobase_2.6.1 BSgenome_1.18.2 > Biostrings_2.18.0 > [4] GenomicRanges_1.2.1 IRanges_1.8.8 > > loaded via a namespace (and not attached): > [1] tools_2.12.1 > > [[alternative HTML version deleted]] > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > <mailto:bioconductor at="" r-project.org=""> <mailto:bioconductor at="" r-project.org="">> > <mailto:bioconductor at="" r-project.org=""> <mailto:bioconductor at="" r-project.org=""> > <mailto:bioconductor at="" r-project.org=""> <mailto:bioconductor at="" r-project.org="">>> > <mailto:bioconductor at="" r-project.org=""> <mailto:bioconductor at="" r-project.org=""> > <mailto:bioconductor at="" r-project.org=""> <mailto:bioconductor at="" r-project.org="">> > <mailto:bioconductor at="" r-project.org=""> <mailto:bioconductor at="" r-project.org=""> > <mailto:bioconductor at="" r-project.org=""> <mailto:bioconductor at="" r-project.org="">>>> > > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M2-B876 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > <mailto:hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org="">> > <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org=""> > <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">>> > <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org=""> > <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">> > <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org=""> > <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">>>> > > > Phone: (206) 667-5791 > Fax: (206) 667-1319 > > > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M2-B876 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> > <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">> > <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org=""> > <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">>> > Phone: (206) 667-5791 > Fax: (206) 667-1319 > > > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M2-B876 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> > <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">> > Phone: (206) 667-5791 > Fax: (206) 667-1319 > > > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M2-B876 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> > Phone: (206) 667-5791 > Fax: (206) 667-1319 > > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 13.9 years ago Hervé Pagès 16k

0

Entering edit mode

Hi Herve, Thank you very very much for your time and very valuable advice. The distributable standard data package is definitely a way to go. Since the genomes I am working on are still in evolvement, there are lots of problems with current version of contigs and scaffolds. However, I will try it again when the more seq data come and between assembly can be obtained. I will certainly keep you posted. Right now, I am thinking about a quick way to build a genomicRange and genomicfeatures data set from both sequences and annotations and use it for my RNA-seq experiment. I shouldn't take advantage of your time to discuss further in this discussion. But please feel free to let me know your advice and suggestions. I really appreciate your help. steve 2011/1/27 Hervé Pagès <hpages@fhcrc.org> > Steve, > > > On 01/26/2011 06:41 PM, Steve Shen wrote: > >> Hi Herve, >> >> Sorry about that I forgot using reply all. Below is my seed file. I am >> dealing with a brand new genome that was published a couple of months >> ago. The genome consists of lots of scaffolds and contigs, no chromosome >> information available. That's why it looks weird. Yous explanation and >> suggestion does make lot of sense to me. I begin to understand some >> basic mechanism of BSgenome. This is great and thanks so much. >> > > Good. Thanks for showing your seed file. I can confirm that you should > probably set seqnames to character(0) and mseqnames to "cflo_v3.3.fold". > However, I should point out that making a BSgenome package with a single > DNAStringSet object in it offers limited added value compared to the > solution that consists in simply loading the DNAStringSet object with > read.DNAStringSet() every time you need to work with it. The small > added value of the BSgenome wrapping here is that it's versioned and > annotated and also it makes it easier to distribute and update thru > the standard R package management infrastructure (local CRAN-style > repository / install.packages() / update.packages()). > > > >> I understand now that the nmask_per_seq parameter should be set to 0 >> other than 2 (which I made up previously). >> > > Yes it will be easier if you set nmask_per_seq to 0, at least for now. > If it turns out that you really want to have masks, then you can always > add them later e.g. in the version 1.1.0 of the package. > > > However, it looks to me that >> I should probably try some other strategies for this reference genome. >> > > Good luck. Let us know how it goes. > > H. > > Thanks so much for your help. >> >> Steve >> >> ## seed file start here >> Package: BSgenome.CFlo.YU.v3 >> Title: Camponotus floridanus (Ants) full genome (YU version V3.3) >> Description: Camponotus floridanus (Ants) full genome as provided by YU >> (V3.3, Jan. 2011) >> Version: 1.0.0 >> organism: camponotus floridanus >> species: Ant >> provider: YU >> provider_version: BGI Assembly V3.3 >> release_date: Jan, 2011 >> release_name: Ant Genome Reference Consortium >> source_url: NA >> organism_biocview: C_flo >> BSgenomeObjname: CFlo >> seqnames: "cflo_v3.3.fold" >> mseqnames: character(0) >> nmask_per_seq:2 >> seqs_srcdir: /home/steve/Data/Genomes >> #masks_srcdir: /home/steve/Data/Masks_src >> #AGAPSfiles_name: cflo_v3.3.fa.masked >> ## seed file end here --- >> >> 2011/1/26 Hervé Pagès <hpages@fhcrc.org <mailto:hpages@fhcrc.org="">> >> >> Hi Steve, >> >> Since we've started this on the list, let's keep it there. >> >> Please provide your seed file so I can get a better picture of what you >> are doing. As I already mentioned before, you need to provide 1 FASTA >> file per chromosome. But your cflo_v3.3.fold.fa file seems to contain >> 24026 sequences! So you have 2 options: (1) either you split it into 1 >> file per sequence, (2) either you store it as a multiple sequence in >> your BSgenome data package. >> >> Option 1: It's easy to split a FASTA file into 1 file per FASTA >> record using the Biostrings package. You can do something like: >> >> library(Biostrings) >> x <- read.DNAStringSet("cflo_v3.3.fold.fa") >> for (i in seq_len(length(x))) { >> filepath <- paste(names(x)[i], ".fa", sep="") >> write.XStringSet(x[i], filepath) >> } >> >> The problem is that then you will need to list *all* the sequences in >> the seqnames field of your seed file. Another problem with this is that >> it's probably not the right thing to do. See below. >> >> Option 2: First some background. As you've probably noticed, the >> sequence objects stored in a BSgenome data package are divided in 2 >> groups: the group of single sequences and the group of multiple >> sequences. If you've ever displayed a BSgenome object that contains >> upstream sequences, you can't have missed it. For example: >> >> > library(BSgenome.Celegans.UCSC.ce2) >> > Celegans >> Worm genome >> | >> | organism: Caenorhabditis elegans (Worm) >> | provider: UCSC >> | provider version: ce2 >> | release date: Mar. 2004 >> | release name: WormBase v. WS120 >> | >> | single sequences (see '?seqnames'): >> | chrI chrII chrIII chrIV chrV chrX chrM >> | >> | multiple sequences (see '?mseqnames'): >> | upstream1000 upstream2000 upstream5000 >> | >> | (use the '$' or '[[' operator to access a given sequence) >> >> Objects in the 1st group are DNAString (or MaskedDNAString) objects. >> Each DNAString (or MaskedDNAString) object can only contain 1 sequence. >> Objects in the 2nd group are DNAStringSet objects. Each DNAStringSet >> object can contain an (almost) arbitrary number of sequences. >> The 2 groups are respectively defined thru the seqnames and mseqnames >> fields of the seed file. I'm just paraphrasing the existing >> documentation here: all this is documented in the man page for >> BSgenome objects and in the BSgenomeForge vignette. >> >> Typically the 1st group will contain chromosome sequences, or, more >> generally (and to use Ensembl terminology), the "top-level" sequences >> of the genome. This can include the Mitochondrial DNA (often denoted >> chrM), the 2-micron plasmid, the haplotype sequences, and other exotic >> DNA material that is considered to be at the "top-level". The number >> of top-level sequences varies from one genome to the other but it >> remains generally small (< 50). >> >> Typically the 2nd group will contain 1 or more big collections of >> sequences. Each collection itself can have thousands of sequences. >> But the number of collections should preferably remain small >> (i.e. < 50). So I guess this would be the place of choice for your >> cflo_v3.3.fold object (which seems to be a collection of 24026 >> scaffolds). >> >> Now for the problem with your gap file, I can't really help without >> seeing your seed file first. However, I'm a little bit puzzled that >> you have a file containing assembly gaps for your scaffolds: that >> doesn't seem to make a lot of sense to me. Anyway, you can't put >> masks on a multiple sequence object so, if you choose option 2, you >> will have to set nmask_per_seq to 0. >> >> Also, as a general rule when forging a BSgenome data package, you >> should ask yourself whether it is worth the extra effort to deal >> with the masks at all. A BSgenome data package doesn't *have* to >> contain masks. >> >> Cheers, >> H. >> >> >> >> On 01/26/2011 02:53 PM, Steve Shen wrote: >> >> Hi Herve, >> >> Thank you so much for your help. I was finally able to have a >> working >> seed file. However, there seem to be still long way to go. I >> guess that >> most likely these errors are due to genome itself and gap file. >> I just >> wonder if you could kindly give me some hints what do these error >> message mean. Is gap file required for building up data package? >> Why do >> package only use first sequence and others are truncated? Is this >> because of two many sequences? What steps can I take in order to >> make it >> work with GenomicRange packages? I really appreciate your help. >> Thanks >> so much, >> >> Steve >> >> > forgeBSgenomeDataPkg("Cflo_seed.R") >> Loading required package: Biobase >> >> Welcome to Bioconductor >> >> Vignettes contain introductory material. To view, type >> 'openVignette()'. To cite Bioconductor, see >> 'citation("Biobase")' and for packages 'citation(pkgname)'. >> >> >> Attaching package: 'Biobase' >> >> The following object(s) are masked from 'package:IRanges': >> >> updateObject >> >> Creating package in ./BSgenome.CFlo.YU.v3 >> Saving 'seqlengths' object to compressed data file >> './BSgenome.CFlo.YU.v3/inst/extdata/seqlengths.rda'... DONE >> Loading FASTA file '/home/steve/Data/Genomes/cflo_v3.3.fold.fa' in >> 'cflo_v3.3.fold' object... DONE >> Saving 'cflo_v3.3.fold' object to compressed data file >> './BSgenome.CFlo.YU.v3/inst/extdata/cflo_v3.3.fold.rda'... DONE >> Error in .guessGapFileCOL2CLASS(file) : >> unable to guess the column names in "gap" file >> './cflo_v3.3.fold_gap.txt', sorry >> In addition: Warning messages: >> 1: In FUN("cflo_v3.3.fold"[[1L]], ...) : >> In file '/home/steve/Data/Genomes/cflo_v3.3.fold.fa': 24026 >> sequences >> found, using first sequence only >> 2: In FUN("cflo_v3.3.fold"[[1L]], ...) : >> In file '/home/steve/Data/Genomes/cflo_v3.3.fold.fa': sequence >> description >> >> "scaffold9scaffold12scaffold16scaffold2scaffold4scaffold20scaffold 10scaffold11scaffold24scaffold1scaffold3scaffold6scaffold15scaffold28s caffold30scaffold22scaffold18scaffold21scaffold7scaffold32scaffold36sc affold13scaffold23scaffold39scaffold19scaffold41scaffold29scaffold37sc affold33scaffold45scaffold38scaffold17scaffold5scaffold46scaffold48sca ffold31scaffold25scaffold51scaffold49scaffold44scaffold47scaffold54sca ffold43scaffold55scaffold60scaffold35scaffold42scaffold63scaffold53sca ffold50scaffold40scaffold67scaffold61scaffold69scaffold34scaffold70sca ffold27scaffold73scaffold8scaffold75scaffold71scaffold74scaffold14scaf fold66scaffold65scaffold80scaffold62scaffold76scaffold79scaffold85scaf fold78scaffold86scaffold88scaffold81scaffold87scaffold26scaffold92scaf fold93scaffold94scaffold59scaffold52scaffold84scaffold77scaffold89scaf fold100scaffold97scaffold99scaffold64scaffold103scaffold90scaffold106s caffold56scaffold98 >> [... truncated] >> 3: In .forgeSeqFile(name, prefix, suffix, seqs_srcdir, >> seqs_destdir, : >> file contains 24026 sequences, using the first sequence only >> 4: In file(file, "rt") : >> cannot open file './cflo_v3.3.fold_gap.txt': No such file or >> directory >> 5: In file(file, "rt") : >> cannot open file './cflo_v3.3.fold_gap.txt': No such file or >> directory >> >> >> 2011/1/20 Hervé Pagès <hpages@fhcrc.org>> <mailto:hpages@fhcrc.org> <mailto:hpages@fhcrc.org>> <mailto:hpages@fhcrc.org>>> >> >> >> Steve, >> >> >> On 01/19/2011 04:37 PM, Steve Shen wrote: >> >> Hi Herve, >> >> Thank you very much. I tried carefully with correct path >> and >> file name >> by triggering auto-complete feature. But it didn't help >> either. >> It may >> be something wrong with my seed file, so I also appended >> the >> file below. >> Please have a look. Thanks in advance, >> >> Steve >> >> > forgeBSgenomeDataPkg("/home/steve/Data/Genomes/seed") >> Error in as.list(.readSeedFile(x, verbose = verbose)) : >> error in evaluating the argument 'x' in selecting a >> method for >> function 'as.list' >> > forgeBSgenomeDataPkg("seed") >> Error in as.list(.readSeedFile(x, verbose = verbose)) : >> error in evaluating the argument 'x' in selecting a >> method for >> function 'as.list' >> >> >> An easy way to check if your seed file is a valid DCF file is: >> >> > read.dcf("seed") >> Error in read.dcf("seed") : >> Line starting 'I'm a broken DCF fil ...' is malformed! >> >> Again, the error message you get when calling >> forgeBSgenomeDataPkg() on >> a broken DCF file should return exactly that, but, >> unfortunately, it >> doesn't with the current version of BSgenome. Updated >> versions of >> BSgenome (1.18.3 in release and 1.19.3 in devel) are >> correcting this >> and will become available tomorrow around noon, and midnight, >> respectively (Seattle time). >> >> >> >> Seed file starts: >> >> Package: BSgenome.CFlo.YU.v3 >> Title: Camponotus floridanus (Ants) full genome (YU >> version V3.3) >> Description: Camponotus floridanus (Ants) full genome as >> provided by YU >> (V3.3, Jan. 2011) >> >> ^^^^^^^^^^^^^^^^^ >> Is this on a line of its own in your file? This might be the >> problem. >> >> Please consult the help for the read.dcf() function >> (?read.dcf from >> your R session) for how to format a DCF file. Note that a >> 'tag:value' >> should either be placed on a single line, or, if the value >> spans >> several lines, then the extra lines (aka "continuation >> lines") must >> start with a whitespace. Like this: >> >> >> Description: Camponotus floridanus (Ants) full genome as >> provided >> by YU (V3.3, Jan. 2011) >> >> (I'm using 4 whitespaces here but 1 is enough. Using a <tab> >> would >> also be OK.) >> >> >> Version: 1.0.0 >> organism: camponotus floridanus >> species: Ant >> provider: YU >> provider_version: Assembly V3.3 >> release_date: Jan, 2011 >> release_name: Ant Genome Reference Consortium >> source_url: NA >> organism_biocview: C_flo >> BSgenomeObjname: CFlo >> seqnames: "cflo_v3.3.fold" >> mseqnames: character(0) >> nmask_per_seq:2 >> >> seqs_srcdir: /home/steve/Data/Genomes/ >> >> >> So you have a FASTA file named "cflo_v3.3.fold.fa" located in >> /home/steve/Data/Genomes/ that contains a *single* FASTA record >> that corresponds to the cflo_v3.3.fold sequence? Please >> double check >> this. (You can see the list of records contained in a FASTA >> file with >> the fasta.info <http: fasta.info=""> <http: fasta.info="">() >> >> function from Biostrings.) >> >> >> H. >> >> # masks_srcdir: /home/steve/Data/Masks_src >> # AGAPSfiles_name: cflo_v3.3.fa.masked >> >> >> 2011/1/19 Hervé Pagès <hpages@fhcrc.org>> <mailto:hpages@fhcrc.org> >> <mailto:hpages@fhcrc.org <mailto:hpages@fhcrc.org="">> >> <mailto:hpages@fhcrc.org <mailto:hpages@fhcrc.org=""> >> <mailto:hpages@fhcrc.org <mailto:hpages@fhcrc.org="">>>> >> >> Hi Steve, >> >> >> On 01/19/2011 01:04 PM, Steve Shen wrote: >> >> Hi Herve, >> >> Thank you so much for your quick reply. Your >> vignette is >> pretty >> clear. I >> think the problem is on my side. I just lack of >> experience on >> dealing >> with this issue and maybe the file format >> doesn't fit. I >> sort of >> gave up >> using the seed file but just using low level >> commands. I >> still have >> errors. I really have no idea what is going on. >> Your >> help will >> be much >> appreciated. >> >> Best, >> Steve >> >> 1. with seed file, I got >> > forgeBSgenomeDataPkg("./cflo_seed.R") >> Error in as.list(.readSeedFile(x, verbose = >> verbose)) : >> error in evaluating the argument 'x' in >> selecting a >> method for >> function 'as.list' >> >> >> I agree that the error message is kind of obscure >> but the >> problem >> is probably that the specified path to your seed file >> is >> invalid. >> I'm about to commit a change to the BSgenomeForge >> code that will >> produce this error message instead: >> >> > forgeBSgenomeDataPkg("aaaaa") >> Error in .readSeedFile(x, verbose = verbose) : >> seed file 'aaaaa' not found >> >> An easy way to make sure the path is valid is to >> press <tab> >> when >> you are in the middle of typing the path: this will >> trigger the >> auto-completion feature. >> >> >> >> 2. with command forgeSeqlengthsFile, >> > forgeSeqlengthsFile("cflo_v3.3.fold", prefix="", >> suffix=".fa",seqs_srcdir=".", seqs_destdir=".", >> verbose=TRUE) >> Saving 'seqlengths' object to compressed data file >> './seqlengths.rda'... >> DONE >> Warning messages: >> 1: In FUN("cflo_v3.3.fold"[[1L]], ...) : >> In file './cflo_v3.3.fold.fa': 24026 >> sequences found, >> using first >> sequence only >> 2: In FUN("cflo_v3.3.fold"[[1L]], ...) : >> In file './cflo_v3.3.fold.fa': sequence >> description >> >> "scaffold9scaffold12scaffold16scaffold2scaffold4scaffold20scaffold 10scaffold11scaffold24scaffold1scaffold3scaffold6scaffold15scaffold28s caffold30scaffold22scaffold18scaffold21scaffold7scaffold32scaffold36sc affold13scaffold23scaffold39scaffold19scaffold41scaffold29scaffold37sc affold33scaffold45scaffold38scaffold17scaffold5scaffold46scaffold48sca ffold31scaffold25scaffold51scaffold49scaffold44scaffold47scaffold54sca ffold43scaffold55scaffold60scaffold35scaffold42scaffold63scaffold53sca ffold50scaffold40scaffold67scaffold61scaffold69scaffold34scaffold70sca ffold27scaffold73scaffold8scaffold75scaffold71scaffold74scaffold14scaf fold66scaffold65scaffold80scaffold62scaffold76scaffold79scaffold85scaf fold78scaffold86scaffold88scaffold81scaffold87scaffold26scaffold92scaf fold93scaffold94scaffold59scaffold52scaffold84scaffold77scaffold89scaf fold100scaffold97scaffold99scaffold64scaffold103scaffold90scaffold106s caffold56scaffold98scaffold109scaffold68sc >> [... truncated] >> >> 3. with forgeSeqfiles, >> > forgeSeqFiles("cflo_v3.3.fold", mseqnames=NULL, prefix="", >> suffix=".fa",seqs_srcdir=".", >> seqs_destdir="./BSgenome", >> verbose=TRUE) >> Loading FASTA file './cflo_v3.3.fold.fa' in >> 'cflo_v3.3.fold' >> object... DONE >> Saving 'cflo_v3.3.fold' object to compressed >> data file >> './BSgenome/cflo_v3.3.fold.rda'... DONE >> Warning message: >> In .forgeSeqFile(name, prefix, suffix, seqs_srcdir, >> seqs_destdir, : >> file contains 24026 sequences, using the first >> sequence only >> >> >> Sorry but I can only try to help if you stick to the >> vignette and >> use a seed file. Using low-level functions like >> forgeSeqlengthsFile() >> or forgeSeqFiles() is undocumented/unsupported. The >> reason >> for this >> is that this route is *much* more complicated and >> error-prone >> than using a seed file! This is exactly why seed >> files where >> invented: to make the whole process much easier for >> you, and to >> make troubleshooting much easier for us. >> >> Cheers, >> H. >> >> >> >> >> 2011/1/19 Hervé Pagès <hpages@fhcrc.org>> <mailto:hpages@fhcrc.org> >> <mailto:hpages@fhcrc.org <mailto:hpages@fhcrc.org="">> >> <mailto:hpages@fhcrc.org <mailto:hpages@fhcrc.org=""> >> <mailto:hpages@fhcrc.org <mailto:hpages@fhcrc.org="">>> >> <mailto:hpages@fhcrc.org <mailto:hpages@fhcrc.org=""> >> <mailto:hpages@fhcrc.org <mailto:hpages@fhcrc.org="">> >> >> <mailto:hpages@fhcrc.org <mailto:hpages@fhcrc.org=""> >> <mailto:hpages@fhcrc.org <mailto:hpages@fhcrc.org="">>>>> >> >> >> >> Hi Steve, >> >> There are several things that look wrong >> with your >> seed file. >> >> 1. The # must be the first character in a >> line to >> make it a >> line of comment (ignored). >> For example, those 2 lines will certainly >> not be >> interpreted >> as you might expect: >> >> >> mseqnames: NA #paste("upstream", >> c("1000", "2000", >> "5000"), sep="") >> >> source_url: NA # >> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/ >> >> 2. If you don't have multiple sequences, >> specify: >> >> mseqnames: character(0) >> >> 3. As explained in the BSgenomeForge >> vignette, the >> default for >> seqfiles_prefix is .fa so this won't work: >> >> seqnames: "clof_v3.fa" >> >> Unless your file is named clof_v3.fa.fa? >> If your file is named clof_v3.fa, then >> you should >> specify: >> >> seqnames: "clof_v3" >> >> 4. This doesn't look like a valid file of >> assembly gaps: >> >> >> AGAPSfiles_name: clof_v3.fa.masked >> >> Please refer to the BSgenomeForge >> vignette for >> what kind >> of masks >> and what file formats are supported. >> >> 5. Having this >> >> >> PkgExamples: Hsapiens >> seqlengths(Hsapiens) >> Hsapiens$chr1 # same as >> Hsapiens[["chr1"]] >> >> in a seed file for clof obviously doesn't >> make >> sense and your >> package won't pass 'R CMD check' because >> it will >> contain >> broken >> examples. >> >> There might be other problems with your seed >> file. >> All you >> need to do >> is read and follow carefully the instructions >> described in the >> BSgenomeForge vignette. Let me know if >> things are >> not clear >> in the >> vignette and I'll try to improve it. Thanks! >> >> H. >> >> >> >> On 01/18/2011 05:32 PM, Steve Shen wrote: >> >> Hi list, >> >> This wasn't sent out with a none >> registered ids. >> I have a >> problem with >> forging a new BSgenome data package. The >> sequence data >> file is >> clof.v3.fa >> and the mask file is clof.v3.fa.masked. >> Below >> are seed file, >> command, error >> and sessioninfo. Your help will be much >> appreciated. >> >> Thanks a lot, >> Steve >> >> Package: BSgenome.Clof.yu.v3 >> Title: Clof (insects) full genome >> (version 3) >> Description: Clof (insects) full genome as >> provided by >> YU (V3.3, >> Jan. 2011) >> # and will store in Biostrings objects. >> Version: 1.0.0 >> organism: clof >> species: insect >> provider: YU >> provider_version: Assembly V3.3 >> release_date: Jan, 2011 >> release_name: insects Genome Reference >> Consortium >> source_url: NA # >> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/ >> organism_biocview: Clof >> BSgenomeObjname: Clof >> seqnames: "clof_v3.fa" >> mseqnames: NA #paste("upstream", >> c("1000", "2000", >> "5000"), sep="") >> nmask_per_seq: 2 >> #SrcDataFiles1: sequences: chromFa.zip, >> upstream1000.zip, >> upstream2000.zip, >> upstream5000.zip >> #from >> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/ >> #SrcDataFiles2: AGAPS masks: >> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/gap.txt.gz >> #RM masks: >> >> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromOut.tar.gz >> #TRF masks: >> >> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromTrf.tar.gz >> PkgExamples: Hsapiens >> seqlengths(Hsapiens) >> Hsapiens$chr1 # same as >> Hsapiens[["chr1"]] >> seqs_srcdir: /home/steve/Data/Genomes >> masks_srcdir: /home/steve/Data/Genomes >> AGAPSfiles_name: clof_v3.fa.masked >> >> The command, error and sessionInfo are >> below >> >> forgeBSgenomeDataPkg("Clof_seed.R", >> seqs_srcdir=".", >> masks_srcdir=".", >> >> destdir=".", verbose=TRUE) >> Loading required package: Biobase >> >> Welcome to Bioconductor >> >> Vignettes contain introductory >> material. To >> view, type >> 'openVignette()'. To cite Bioconductor, see >> 'citation("Biobase")' and for packages 'citation(pkgname)'. >> >> >> Attaching package: 'Biobase' >> >> The following object(s) are masked from >> 'package:IRanges': >> >> updateObject >> >> Error in forgeBSgenomeDataPkg(y, >> seqs_srcdir = >> seqs_srcdir, >> masks_srcdir = >> masks_srcdir, : >> values for symbols NMASKPERSEQ are >> not single >> strings >> In addition: Warning message: >> In storage.mode(x$nmask_per_seq)<- >> "integer" : NAs >> introduced by >> coercion >> >> sessionInfo() >> >> R version 2.12.1 (2010-12-16) >> Platform: x86_64-pc-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 >> LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 >> LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=C >> LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C >> [9] LC_ADDRESS=C >> LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 >> LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats graphics grDevices utils >> datasets >> methods base >> >> other attached packages: >> [1] Biobase_2.6.1 BSgenome_1.18.2 >> Biostrings_2.18.0 >> [4] GenomicRanges_1.2.1 IRanges_1.8.8 >> >> loaded via a namespace (and not attached): >> [1] tools_2.12.1 >> >> [[alternative HTML version >> deleted]] >> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org <mailto:bioconductor@r-project.org> >> <mailto:bioconductor@r-project.org>> <mailto:bioconductor@r-project.org>> >> <mailto:bioconductor@r-project.org>> <mailto:bioconductor@r-project.org> >> <mailto:bioconductor@r-project.org>> <mailto:bioconductor@r-project.org>>> >> <mailto:bioconductor@r-project.org>> <mailto:bioconductor@r-project.org> >> <mailto:bioconductor@r-project.org>> <mailto:bioconductor@r-project.org>> >> <mailto:bioconductor@r-project.org>> <mailto:bioconductor@r-project.org> >> <mailto:bioconductor@r-project.org>> <mailto:bioconductor@r-project.org>>>> >> >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M2-B876 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages@fhcrc.org >> <mailto:hpages@fhcrc.org> <mailto:hpages@fhcrc.org>> <mailto:hpages@fhcrc.org>> >> <mailto:hpages@fhcrc.org <mailto:hpages@fhcrc.org=""> >> <mailto:hpages@fhcrc.org <mailto:hpages@fhcrc.org="">>> >> <mailto:hpages@fhcrc.org <mailto:hpages@fhcrc.org=""> >> <mailto:hpages@fhcrc.org <mailto:hpages@fhcrc.org="">> >> <mailto:hpages@fhcrc.org <mailto:hpages@fhcrc.org=""> >> <mailto:hpages@fhcrc.org <mailto:hpages@fhcrc.org="">>>> >> >> >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 >> >> >> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M2-B876 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages@fhcrc.org <mailto:hpages@fhcrc.org> >> <mailto:hpages@fhcrc.org <mailto:hpages@fhcrc.org="">> >> <mailto:hpages@fhcrc.org <mailto:hpages@fhcrc.org=""> >> <mailto:hpages@fhcrc.org <mailto:hpages@fhcrc.org="">>> >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 >> >> >> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M2-B876 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages@fhcrc.org <mailto:hpages@fhcrc.org> >> <mailto:hpages@fhcrc.org <mailto:hpages@fhcrc.org="">> >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 >> >> >> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M2-B876 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages@fhcrc.org <mailto:hpages@fhcrc.org> >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 >> >> >> > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M2-B876 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > [[alternative HTML version deleted]]

ADD REPLY • link 13.9 years ago Steve Shen ▴ 330

0

Entering edit mode

Hi Herve, Sorry for bothering you again. With you help, I was able to build this BSgenome data package. Now it brings me another level of problem. How do I manipulate the sequences since it is quite different from the examples in your documentation? For example, how can I find the sequences that length is 1000bp or less? I have issued couple of commands and list the output below. Thanks so much for your help. By the way, do you guys offer a training class in a near future? Steve > seq <- CFlo$cflo_v3.3.fold > str(seq) Formal class 'DNAStringSet' [package "Biostrings"] with 5 slots ..@ pool :Formal class 'SharedRaw_Pool' [package "IRanges"] with 2 slots .. .. ..@ xp_list :List of 1 .. .. .. ..$ :<externalptr> .. .. ..@ .link_to_cached_object_list:List of 1 .. .. .. ..$ :<environment: 0x7a68ec0=""> ..@ ranges :Formal class 'GroupedIRanges' [package "IRanges"] with 7 slots .. .. ..@ group : int [1:24026] 1 1 1 1 1 1 1 1 1 1 ... .. .. ..@ start : int [1:24026] 1 372 921 1459 2139 3015 17388 34650 67069 78840 ... .. .. ..@ width : int [1:24026] 371 549 538 680 876 14373 17262 32419 11771 16416 ... .. .. ..@ NAMES : chr [1:24026] "scaffold9" "scaffold12" "scaffold16" "scaffold2" ... .. .. ..@ elementMetadata: NULL .. .. ..@ elementType : chr "integer" .. .. ..@ metadata : list() ..@ elementMetadata: NULL ..@ elementType : chr "DNAString" ..@ metadata : list() > seq A DNAStringSet instance of length 24026 width seq names [1] 371 GTATATAAATAAAAATCTTA...AAACTGTCATACAAACAGTT scaffold9 [2] 549 TATCTATATTTTATAAATTA...GAAATTTTTATAAGCTTATT scaffold12 [3] 538 ATAATGAGCAAGAATAAATT...TACATATTAGTTGGTTCTCT scaffold16 [4] 680 CGCCAATCTATGAAGACGAA...GATCATGAATTCAAATGGGA scaffold2 [5] 876 CACATGATTTATTTGCAGAA...CGCCGATTGTACATGTGATA scaffold4 [6] 14373 CGTTCGGTCTGCTCAATGTT...ACTGCCGCATGTTTAGCATT scaffold20 [7] 17262 ATTATCTACAAGCTTGGTAA...TAGGAGCCGCCCGACCGCTC scaffold10 [8] 32419 AGAGAAGGAACAGAGAAGAA...TACTTTTGTTATTTTAGAAG scaffold11 [9] 11771 CAATGTCAAACTGATAACAT...GTCTGATTAGGATTCATCGC scaffold24 ... ... ... [24018] 5878 TTTTTTTTTTTTTCGAAAAT...ATAGTTATTTTACATATTAG C4054188 [24019] 5945 TTATATTAGATTCTTAAAGT...AAATAATAAATAATAAATAA C4054414 [24020] 5777 ATAATAATACCGTTCACAAG...CGATGCGTCCTTTTCTTTTT C4053796 [24021] 6301 TTTATTATTTATTAAGCTTT...TATTTTACGTAGCGAAAAAA C4055556 [24022] 6987 TAAAAAAAATAGTTATATTT...AGAAATTTATTATTATTATT C4057106 [24023] 7532 TCTCAGGGGCTGGTCTGCCT...CGGACGCGTGGAGAGAGTCA C4058062 [24024] 9748 TCCTCGCGCTTACCGTTTGA...AAAAATTTGGTTGCGTTAAT C4059958 [24025] 32760 GCTTCCGAGGTCGGGGACGC...AACGACGGTTCGCCTGCTAT scaffold5109 [24026] 18211 TTATTTATTAAAATAAATAA...AGCGAGGGTGCGGCCCCGGC scaffold4322 [[alternative HTML version deleted]]

ADD REPLY • link 13.9 years ago Steve Shen ▴ 330

0

Entering edit mode

Hi Steve, On 01/27/2011 10:37 PM, Steve Shen wrote: > Hi Herve, > > Sorry for bothering you again. With you help, I was able to build this > BSgenome data package. Now it brings me another level of problem. How do > I manipulate the sequences since it is quite different from the examples > in your documentation? For example, how can I find the sequences that > length is 1000bp or less? I have issued couple of commands and list the > output below. Thanks so much for your help. By the way, do you guys > offer a training class in a near future? > > Steve > > > seq <- CFlo$cflo_v3.3.fold > > str(seq) > Formal class 'DNAStringSet' [package "Biostrings"] with 5 slots > ..@ pool :Formal class 'SharedRaw_Pool' [package "IRanges"] > with 2 slots > .. .. ..@ xp_list :List of 1 > .. .. .. ..$ :<externalptr> > .. .. ..@ .link_to_cached_object_list:List of 1 > .. .. .. ..$ :<environment: 0x7a68ec0=""> > ..@ ranges :Formal class 'GroupedIRanges' [package "IRanges"] > with 7 slots > .. .. ..@ group : int [1:24026] 1 1 1 1 1 1 1 1 1 1 ... > .. .. ..@ start : int [1:24026] 1 372 921 1459 2139 3015 > 17388 34650 67069 78840 ... > .. .. ..@ width : int [1:24026] 371 549 538 680 876 14373 > 17262 32419 11771 16416 ... > .. .. ..@ NAMES : chr [1:24026] "scaffold9" "scaffold12" > "scaffold16" "scaffold2" ... > .. .. ..@ elementMetadata: NULL > .. .. ..@ elementType : chr "integer" > .. .. ..@ metadata : list() > ..@ elementMetadata: NULL > ..@ elementType : chr "DNAString" > ..@ metadata : list() > > > seq > A DNAStringSet instance of length 24026 > width seq names > [1] 371 GTATATAAATAAAAATCTTA...AAACTGTCATACAAACAGTT scaffold9 > [2] 549 TATCTATATTTTATAAATTA...GAAATTTTTATAAGCTTATT scaffold12 > [3] 538 ATAATGAGCAAGAATAAATT...TACATATTAGTTGGTTCTCT scaffold16 > [4] 680 CGCCAATCTATGAAGACGAA...GATCATGAATTCAAATGGGA scaffold2 > [5] 876 CACATGATTTATTTGCAGAA...CGCCGATTGTACATGTGATA scaffold4 > [6] 14373 CGTTCGGTCTGCTCAATGTT...ACTGCCGCATGTTTAGCATT scaffold20 > [7] 17262 ATTATCTACAAGCTTGGTAA...TAGGAGCCGCCCGACCGCTC scaffold10 > [8] 32419 AGAGAAGGAACAGAGAAGAA...TACTTTTGTTATTTTAGAAG scaffold11 > [9] 11771 CAATGTCAAACTGATAACAT...GTCTGATTAGGATTCATCGC scaffold24 > ... ... ... > [24018] 5878 TTTTTTTTTTTTTCGAAAAT...ATAGTTATTTTACATATTAG C4054188 > [24019] 5945 TTATATTAGATTCTTAAAGT...AAATAATAAATAATAAATAA C4054414 > [24020] 5777 ATAATAATACCGTTCACAAG...CGATGCGTCCTTTTCTTTTT C4053796 > [24021] 6301 TTTATTATTTATTAAGCTTT...TATTTTACGTAGCGAAAAAA C4055556 > [24022] 6987 TAAAAAAAATAGTTATATTT...AGAAATTTATTATTATTATT C4057106 > [24023] 7532 TCTCAGGGGCTGGTCTGCCT...CGGACGCGTGGAGAGAGTCA C4058062 > [24024] 9748 TCCTCGCGCTTACCGTTTGA...AAAAATTTGGTTGCGTTAAT C4059958 > [24025] 32760 GCTTCCGAGGTCGGGGACGC...AACGACGGTTCGCCTGCTAT scaffold5109 > [24026] 18211 TTATTTATTAAAATAAATAA...AGCGAGGGTGCGGCCCCGGC scaffold4322 > 'seq' is a DNAStringSet object so you could start by looking at the man page for those objects (?DNAStringSet) for the basics of how to manipulate yours. This is a good starting point. From there you will eventually end up visiting other man pages in the Biostrings package by following the suggestions given in the "See also" section. Biostrings has a lot of man pages! Most man pages provide a lot of examples that you will definitely want to look at. For your particular problem of selecting the elements that are <= 1000bp, just do 'seq[width(seq) <= 1000]'. Please see our website for course announcements: http://bioconductor.org/ Cheers, H. -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 13.9 years ago Hervé Pagès 16k

0

Entering edit mode

Hi Herve, Thanks so much for your time and advices. Best, Steve 2011/1/28 Hervé Pagès <hpages@fhcrc.org> > Hi Steve, > > > On 01/27/2011 10:37 PM, Steve Shen wrote: > >> Hi Herve, >> >> Sorry for bothering you again. With you help, I was able to build this >> BSgenome data package. Now it brings me another level of problem. How do >> I manipulate the sequences since it is quite different from the examples >> in your documentation? For example, how can I find the sequences that >> length is 1000bp or less? I have issued couple of commands and list the >> output below. Thanks so much for your help. By the way, do you guys >> offer a training class in a near future? >> >> Steve >> >> > seq <- CFlo$cflo_v3.3.fold >> > str(seq) >> Formal class 'DNAStringSet' [package "Biostrings"] with 5 slots >> ..@ pool :Formal class 'SharedRaw_Pool' [package "IRanges"] >> with 2 slots >> .. .. ..@ xp_list :List of 1 >> .. .. .. ..$ :<externalptr> >> .. .. ..@ .link_to_cached_object_list:List of 1 >> .. .. .. ..$ :<environment: 0x7a68ec0=""> >> ..@ ranges :Formal class 'GroupedIRanges' [package "IRanges"] >> with 7 slots >> .. .. ..@ group : int [1:24026] 1 1 1 1 1 1 1 1 1 1 ... >> .. .. ..@ start : int [1:24026] 1 372 921 1459 2139 3015 >> 17388 34650 67069 78840 ... >> .. .. ..@ width : int [1:24026] 371 549 538 680 876 14373 >> 17262 32419 11771 16416 ... >> .. .. ..@ NAMES : chr [1:24026] "scaffold9" "scaffold12" >> "scaffold16" "scaffold2" ... >> .. .. ..@ elementMetadata: NULL >> .. .. ..@ elementType : chr "integer" >> .. .. ..@ metadata : list() >> ..@ elementMetadata: NULL >> ..@ elementType : chr "DNAString" >> ..@ metadata : list() >> >> > seq >> A DNAStringSet instance of length 24026 >> width seq names >> [1] 371 GTATATAAATAAAAATCTTA...AAACTGTCATACAAACAGTT scaffold9 >> [2] 549 TATCTATATTTTATAAATTA...GAAATTTTTATAAGCTTATT scaffold12 >> [3] 538 ATAATGAGCAAGAATAAATT...TACATATTAGTTGGTTCTCT scaffold16 >> [4] 680 CGCCAATCTATGAAGACGAA...GATCATGAATTCAAATGGGA scaffold2 >> [5] 876 CACATGATTTATTTGCAGAA...CGCCGATTGTACATGTGATA scaffold4 >> [6] 14373 CGTTCGGTCTGCTCAATGTT...ACTGCCGCATGTTTAGCATT scaffold20 >> [7] 17262 ATTATCTACAAGCTTGGTAA...TAGGAGCCGCCCGACCGCTC scaffold10 >> [8] 32419 AGAGAAGGAACAGAGAAGAA...TACTTTTGTTATTTTAGAAG scaffold11 >> [9] 11771 CAATGTCAAACTGATAACAT...GTCTGATTAGGATTCATCGC scaffold24 >> ... ... ... >> [24018] 5878 TTTTTTTTTTTTTCGAAAAT...ATAGTTATTTTACATATTAG C4054188 >> [24019] 5945 TTATATTAGATTCTTAAAGT...AAATAATAAATAATAAATAA C4054414 >> [24020] 5777 ATAATAATACCGTTCACAAG...CGATGCGTCCTTTTCTTTTT C4053796 >> [24021] 6301 TTTATTATTTATTAAGCTTT...TATTTTACGTAGCGAAAAAA C4055556 >> [24022] 6987 TAAAAAAAATAGTTATATTT...AGAAATTTATTATTATTATT C4057106 >> [24023] 7532 TCTCAGGGGCTGGTCTGCCT...CGGACGCGTGGAGAGAGTCA C4058062 >> [24024] 9748 TCCTCGCGCTTACCGTTTGA...AAAAATTTGGTTGCGTTAAT C4059958 >> [24025] 32760 GCTTCCGAGGTCGGGGACGC...AACGACGGTTCGCCTGCTAT scaffold5109 >> [24026] 18211 TTATTTATTAAAATAAATAA...AGCGAGGGTGCGGCCCCGGC scaffold4322 >> >> > 'seq' is a DNAStringSet object so you could start by looking at the > man page for those objects (?DNAStringSet) for the basics of how to > manipulate yours. This is a good starting point. From there you > will eventually end up visiting other man pages in the Biostrings > package by following the suggestions given in the "See also" section. > Biostrings has a lot of man pages! Most man pages provide a lot of > examples that you will definitely want to look at. > > For your particular problem of selecting the elements that are > <= 1000bp, just do 'seq[width(seq) <= 1000]'. > > Please see our website for course announcements: > > http://bioconductor.org/ > > Cheers, > H. > > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M2-B876 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > [[alternative HTML version deleted]]

ADD REPLY • link 13.9 years ago Steve Shen ▴ 330

Login before adding your answer.