Hi Steve,
Since we've started this on the list, let's keep it there.
Please provide your seed file so I can get a better picture of what
you
are doing. As I already mentioned before, you need to provide 1 FASTA
file per chromosome. But your cflo_v3.3.fold.fa file seems to contain
24026 sequences! So you have 2 options: (1) either you split it into 1
file per sequence, (2) either you store it as a multiple sequence in
your BSgenome data package.
Option 1: It's easy to split a FASTA file into 1 file per FASTA
record using the Biostrings package. You can do something like:
library(Biostrings)
x <- read.DNAStringSet("cflo_v3.3.fold.fa")
for (i in seq_len(length(x))) {
filepath <- paste(names(x)[i], ".fa", sep="")
write.XStringSet(x[i], filepath)
}
The problem is that then you will need to list *all* the sequences in
the seqnames field of your seed file. Another problem with this is
that
it's probably not the right thing to do. See below.
Option 2: First some background. As you've probably noticed, the
sequence objects stored in a BSgenome data package are divided in 2
groups: the group of single sequences and the group of multiple
sequences. If you've ever displayed a BSgenome object that contains
upstream sequences, you can't have missed it. For example:
> library(BSgenome.Celegans.UCSC.ce2)
> Celegans
Worm genome
|
| organism: Caenorhabditis elegans (Worm)
| provider: UCSC
| provider version: ce2
| release date: Mar. 2004
| release name: WormBase v. WS120
|
| single sequences (see '?seqnames'):
| chrI chrII chrIII chrIV chrV chrX chrM
|
| multiple sequences (see '?mseqnames'):
| upstream1000 upstream2000 upstream5000
|
| (use the '$' or '[[' operator to access a given sequence)
Objects in the 1st group are DNAString (or MaskedDNAString) objects.
Each DNAString (or MaskedDNAString) object can only contain 1
sequence.
Objects in the 2nd group are DNAStringSet objects. Each DNAStringSet
object can contain an (almost) arbitrary number of sequences.
The 2 groups are respectively defined thru the seqnames and mseqnames
fields of the seed file. I'm just paraphrasing the existing
documentation here: all this is documented in the man page for
BSgenome objects and in the BSgenomeForge vignette.
Typically the 1st group will contain chromosome sequences, or, more
generally (and to use Ensembl terminology), the "top-level" sequences
of the genome. This can include the Mitochondrial DNA (often denoted
chrM), the 2-micron plasmid, the haplotype sequences, and other exotic
DNA material that is considered to be at the "top-level". The number
of top-level sequences varies from one genome to the other but it
remains generally small (< 50).
Typically the 2nd group will contain 1 or more big collections of
sequences. Each collection itself can have thousands of sequences.
But the number of collections should preferably remain small
(i.e. < 50). So I guess this would be the place of choice for your
cflo_v3.3.fold object (which seems to be a collection of 24026
scaffolds).
Now for the problem with your gap file, I can't really help without
seeing your seed file first. However, I'm a little bit puzzled that
you have a file containing assembly gaps for your scaffolds: that
doesn't seem to make a lot of sense to me. Anyway, you can't put
masks on a multiple sequence object so, if you choose option 2, you
will have to set nmask_per_seq to 0.
Also, as a general rule when forging a BSgenome data package, you
should ask yourself whether it is worth the extra effort to deal
with the masks at all. A BSgenome data package doesn't *have* to
contain masks.
Cheers,
H.
On 01/26/2011 02:53 PM, Steve Shen wrote:
> Hi Herve,
>
> Thank you so much for your help. I was finally able to have a
working
> seed file. However, there seem to be still long way to go. I guess
that
> most likely these errors are due to genome itself and gap file. I
just
> wonder if you could kindly give me some hints what do these error
> message mean. Is gap file required for building up data package? Why
do
> package only use first sequence and others are truncated? Is this
> because of two many sequences? What steps can I take in order to
make it
> work with GenomicRange packages? I really appreciate your help.
Thanks
> so much,
>
> Steve
>
> > forgeBSgenomeDataPkg("Cflo_seed.R")
> Loading required package: Biobase
>
> Welcome to Bioconductor
>
> Vignettes contain introductory material. To view, type
> 'openVignette()'. To cite Bioconductor, see
> 'citation("Biobase")' and for packages 'citation(pkgname)'.
>
>
> Attaching package: 'Biobase'
>
> The following object(s) are masked from 'package:IRanges':
>
> updateObject
>
> Creating package in ./BSgenome.CFlo.NYU.v3
> Saving 'seqlengths' object to compressed data file
> './BSgenome.CFlo.NYU.v3/inst/extdata/seqlengths.rda'... DONE
> Loading FASTA file '/home/steve/Data/Genomes/cflo_v3.3.fold.fa' in
> 'cflo_v3.3.fold' object... DONE
> Saving 'cflo_v3.3.fold' object to compressed data file
> './BSgenome.CFlo.NYU.v3/inst/extdata/cflo_v3.3.fold.rda'... DONE
> Error in .guessGapFileCOL2CLASS(file) :
> unable to guess the column names in "gap" file
> './cflo_v3.3.fold_gap.txt', sorry
> In addition: Warning messages:
> 1: In FUN("cflo_v3.3.fold"[[1L]], ...) :
> In file '/home/steve/Data/Genomes/cflo_v3.3.fold.fa': 24026
sequences
> found, using first sequence only
> 2: In FUN("cflo_v3.3.fold"[[1L]], ...) :
> In file '/home/steve/Data/Genomes/cflo_v3.3.fold.fa': sequence
> description
> "scaffold9scaffold12scaffold16scaffold2scaffold4scaffold20scaffold10
scaffold11scaffold24scaffold1scaffold3scaffold6scaffold15scaffold28sca
ffold30scaffold22scaffold18scaffold21scaffold7scaffold32scaffold36scaf
fold13scaffold23scaffold39scaffold19scaffold41scaffold29scaffold37scaf
fold33scaffold45scaffold38scaffold17scaffold5scaffold46scaffold48scaff
old31scaffold25scaffold51scaffold49scaffold44scaffold47scaffold54scaff
old43scaffold55scaffold60scaffold35scaffold42scaffold63scaffold53scaff
old50scaffold40scaffold67scaffold61scaffold69scaffold34scaffold70scaff
old27scaffold73scaffold8scaffold75scaffold71scaffold74scaffold14scaffo
ld66scaffold65scaffold80scaffold62scaffold76scaffold79scaffold85scaffo
ld78scaffold86scaffold88scaffold81scaffold87scaffold26scaffold92scaffo
ld93scaffold94scaffold59scaffold52scaffold84scaffold77scaffold89scaffo
ld100scaffold97scaffold99scaffold64scaffold103scaffold90scaffold106sca
ffold56scaffold98
> [... truncated]
> 3: In .forgeSeqFile(name, prefix, suffix, seqs_srcdir, seqs_destdir,
:
> file contains 24026 sequences, using the first sequence only
> 4: In file(file, "rt") :
> cannot open file './cflo_v3.3.fold_gap.txt': No such file or
directory
> 5: In file(file, "rt") :
> cannot open file './cflo_v3.3.fold_gap.txt': No such file or
directory
>
>
> 2011/1/20 Hervé Pagès <hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">>
>
> Steve,
>
>
> On 01/19/2011 04:37 PM, Steve Shen wrote:
>
> Hi Herve,
>
> Thank you very much. I tried carefully with correct path and
> file name
> by triggering auto-complete feature. But it didn't help
either.
> It may
> be something wrong with my seed file, so I also appended the
> file below.
> Please have a look. Thanks in advance,
>
> Steve
>
> > forgeBSgenomeDataPkg("/home/steve/Data/Genomes/seed")
> Error in as.list(.readSeedFile(x, verbose = verbose)) :
> error in evaluating the argument 'x' in selecting a
method for
> function 'as.list'
> > forgeBSgenomeDataPkg("seed")
> Error in as.list(.readSeedFile(x, verbose = verbose)) :
> error in evaluating the argument 'x' in selecting a
method for
> function 'as.list'
>
>
> An easy way to check if your seed file is a valid DCF file is:
>
> > read.dcf("seed")
> Error in read.dcf("seed") :
> Line starting 'I'm a broken DCF fil ...' is malformed!
>
> Again, the error message you get when calling
forgeBSgenomeDataPkg() on
> a broken DCF file should return exactly that, but,
unfortunately, it
> doesn't with the current version of BSgenome. Updated versions
of
> BSgenome (1.18.3 in release and 1.19.3 in devel) are correcting
this
> and will become available tomorrow around noon, and midnight,
> respectively (Seattle time).
>
>
>
> Seed file starts:
>
> Package: BSgenome.CFlo.YU.v3
> Title: Camponotus floridanus (Ants) full genome (YU version
V3.3)
> Description: Camponotus floridanus (Ants) full genome as
> provided by YU
> (V3.3, Jan. 2011)
>
> ^^^^^^^^^^^^^^^^^
> Is this on a line of its own in your file? This might be the
problem.
>
> Please consult the help for the read.dcf() function (?read.dcf
from
> your R session) for how to format a DCF file. Note that a
'tag:value'
> should either be placed on a single line, or, if the value spans
> several lines, then the extra lines (aka "continuation lines")
must
> start with a whitespace. Like this:
>
>
> Description: Camponotus floridanus (Ants) full genome as
provided
> by YU (V3.3, Jan. 2011)
>
> (I'm using 4 whitespaces here but 1 is enough. Using a <tab>
would
> also be OK.)
>
>
> Version: 1.0.0
> organism: camponotus floridanus
> species: Ant
> provider: YU
> provider_version: Assembly V3.3
> release_date: Jan, 2011
> release_name: Ant Genome Reference Consortium
> source_url: NA
> organism_biocview: C_flo
> BSgenomeObjname: CFlo
> seqnames: "cflo_v3.3.fold"
> mseqnames: character(0)
> nmask_per_seq:2
>
> seqs_srcdir: /home/steve/Data/Genomes/
>
>
> So you have a FASTA file named "cflo_v3.3.fold.fa" located in
> /home/steve/Data/Genomes/ that contains a *single* FASTA record
> that corresponds to the cflo_v3.3.fold sequence? Please double
check
> this. (You can see the list of records contained in a FASTA file
with
> the fasta.info <http: fasta.info="">() function from Biostrings.)
>
> H.
>
> # masks_srcdir: /home/steve/Data/Masks_src
> # AGAPSfiles_name: cflo_v3.3.fa.masked
>
>
> 2011/1/19 Hervé Pagès <hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org="">>>
>
> Hi Steve,
>
>
> On 01/19/2011 01:04 PM, Steve Shen wrote:
>
> Hi Herve,
>
> Thank you so much for your quick reply. Your
vignette is
> pretty
> clear. I
> think the problem is on my side. I just lack of
> experience on
> dealing
> with this issue and maybe the file format doesn't
fit. I
> sort of
> gave up
> using the seed file but just using low level
commands. I
> still have
> errors. I really have no idea what is going on. Your
> help will
> be much
> appreciated.
>
> Best,
> Steve
>
> 1. with seed file, I got
> > forgeBSgenomeDataPkg("./cflo_seed.R")
> Error in as.list(.readSeedFile(x, verbose =
verbose)) :
> error in evaluating the argument 'x' in selecting
a
> method for
> function 'as.list'
>
>
> I agree that the error message is kind of obscure but
the
> problem
> is probably that the specified path to your seed file is
> invalid.
> I'm about to commit a change to the BSgenomeForge code
that will
> produce this error message instead:
>
> > forgeBSgenomeDataPkg("aaaaa")
> Error in .readSeedFile(x, verbose = verbose) :
> seed file 'aaaaa' not found
>
> An easy way to make sure the path is valid is to press
<tab>
> when
> you are in the middle of typing the path: this will
trigger the
> auto-completion feature.
>
>
>
> 2. with command forgeSeqlengthsFile,
> > forgeSeqlengthsFile("cflo_v3.3.fold", prefix="",
> suffix=".fa",seqs_srcdir=".", seqs_destdir=".",
> verbose=TRUE)
> Saving 'seqlengths' object to compressed data file
> './seqlengths.rda'...
> DONE
> Warning messages:
> 1: In FUN("cflo_v3.3.fold"[[1L]], ...) :
> In file './cflo_v3.3.fold.fa': 24026 sequences
found,
> using first
> sequence only
> 2: In FUN("cflo_v3.3.fold"[[1L]], ...) :
> In file './cflo_v3.3.fold.fa': sequence
description
> "scaffold9scaffold12scaffold16scaffold2scaffold4scaffold20sc
affold10scaffold11scaffold24scaffold1scaffold3scaffold6scaffold15scaff
old28scaffold30scaffold22scaffold18scaffold21scaffold7scaffold32scaffo
ld36scaffold13scaffold23scaffold39scaffold19scaffold41scaffold29scaffo
ld37scaffold33scaffold45scaffold38scaffold17scaffold5scaffold46scaffol
d48scaffold31scaffold25scaffold51scaffold49scaffold44scaffold47scaffol
d54scaffold43scaffold55scaffold60scaffold35scaffold42scaffold63scaffol
d53scaffold50scaffold40scaffold67scaffold61scaffold69scaffold34scaffol
d70scaffold27scaffold73scaffold8scaffold75scaffold71scaffold74scaffold
14scaffold66scaffold65scaffold80scaffold62scaffold76scaffold79scaffold
85scaffold78scaffold86scaffold88scaffold81scaffold87scaffold26scaffold
92scaffold93scaffold94scaffold59scaffold52scaffold84scaffold77scaffold
89scaffold100scaffold97scaffold99scaffold64scaffold103scaffold90scaffo
ld106scaffold56scaffold98scaffold109scaffold68sc
> [... truncated]
>
> 3. with forgeSeqfiles,
> > forgeSeqFiles("cflo_v3.3.fold", mseqnames=NULL,
prefix="",
> suffix=".fa",seqs_srcdir=".",
seqs_destdir="./BSgenome",
> verbose=TRUE)
> Loading FASTA file './cflo_v3.3.fold.fa' in
'cflo_v3.3.fold'
> object... DONE
> Saving 'cflo_v3.3.fold' object to compressed data
file
> './BSgenome/cflo_v3.3.fold.rda'... DONE
> Warning message:
> In .forgeSeqFile(name, prefix, suffix, seqs_srcdir,
> seqs_destdir, :
> file contains 24026 sequences, using the first
> sequence only
>
>
> Sorry but I can only try to help if you stick to the
> vignette and
> use a seed file. Using low-level functions like
> forgeSeqlengthsFile()
> or forgeSeqFiles() is undocumented/unsupported. The
reason
> for this
> is that this route is *much* more complicated and error-
prone
> than using a seed file! This is exactly why seed files
where
> invented: to make the whole process much easier for you,
and to
> make troubleshooting much easier for us.
>
> Cheers,
> H.
>
>
>
>
> 2011/1/19 Hervé Pagès <hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org="">
> <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">>
> <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">
>
> <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">>>>
>
>
> Hi Steve,
>
> There are several things that look wrong with
your
> seed file.
>
> 1. The # must be the first character in a line
to
> make it a
> line of comment (ignored).
> For example, those 2 lines will certainly not
be
> interpreted
> as you might expect:
>
>
> mseqnames: NA #paste("upstream", c("1000",
"2000",
> "5000"), sep="")
>
> source_url: NA #
>
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/
>
> 2. If you don't have multiple sequences,
specify:
>
> mseqnames: character(0)
>
> 3. As explained in the BSgenomeForge vignette,
the
> default for
> seqfiles_prefix is .fa so this won't work:
>
> seqnames: "clof_v3.fa"
>
> Unless your file is named clof_v3.fa.fa?
> If your file is named clof_v3.fa, then you
should
> specify:
>
> seqnames: "clof_v3"
>
> 4. This doesn't look like a valid file of
assembly gaps:
>
>
> AGAPSfiles_name: clof_v3.fa.masked
>
> Please refer to the BSgenomeForge vignette
for
> what kind
> of masks
> and what file formats are supported.
>
> 5. Having this
>
>
> PkgExamples: Hsapiens
> seqlengths(Hsapiens)
> Hsapiens$chr1 # same as
Hsapiens[["chr1"]]
>
> in a seed file for clof obviously doesn't
make
> sense and your
> package won't pass 'R CMD check' because it
will
> contain
> broken
> examples.
>
> There might be other problems with your seed
file.
> All you
> need to do
> is read and follow carefully the instructions
> described in the
> BSgenomeForge vignette. Let me know if things
are
> not clear
> in the
> vignette and I'll try to improve it. Thanks!
>
> H.
>
>
>
> On 01/18/2011 05:32 PM, Steve Shen wrote:
>
> Hi list,
>
> This wasn't sent out with a none registered
ids.
> I have a
> problem with
> forging a new BSgenome data package. The
> sequence data
> file is
> clof.v3.fa
> and the mask file is clof.v3.fa.masked.
Below
> are seed file,
> command, error
> and sessioninfo. Your help will be much
appreciated.
>
> Thanks a lot,
> Steve
>
> Package: BSgenome.Clof.yu.v3
> Title: Clof (insects) full genome (version
3)
> Description: Clof (insects) full genome as
> provided by
> YU (V3.3,
> Jan. 2011)
> # and will store in Biostrings objects.
> Version: 1.0.0
> organism: clof
> species: insect
> provider: YU
> provider_version: Assembly V3.3
> release_date: Jan, 2011
> release_name: insects Genome Reference
Consortium
> source_url: NA #
>
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/
> organism_biocview: Clof
> BSgenomeObjname: Clof
> seqnames: "clof_v3.fa"
> mseqnames: NA #paste("upstream", c("1000",
"2000",
> "5000"), sep="")
> nmask_per_seq: 2
> #SrcDataFiles1: sequences: chromFa.zip,
> upstream1000.zip,
> upstream2000.zip,
> upstream5000.zip
> #from
>
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/
> #SrcDataFiles2: AGAPS masks:
>
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/gap.txt.gz
> #RM masks:
>
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromOut.tar.gz
> #TRF masks:
>
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromTrf.tar.gz
> PkgExamples: Hsapiens
> seqlengths(Hsapiens)
> Hsapiens$chr1 # same as
Hsapiens[["chr1"]]
> seqs_srcdir: /home/steve/Data/Genomes
> masks_srcdir: /home/steve/Data/Genomes
> AGAPSfiles_name: clof_v3.fa.masked
>
> The command, error and sessionInfo are below
>
> forgeBSgenomeDataPkg("Clof_seed.R",
> seqs_srcdir=".",
> masks_srcdir=".",
>
> destdir=".", verbose=TRUE)
> Loading required package: Biobase
>
> Welcome to Bioconductor
>
> Vignettes contain introductory material.
To
> view, type
> 'openVignette()'. To cite Bioconductor, see
> 'citation("Biobase")' and for packages 'citation(pkgname)'.
>
>
> Attaching package: 'Biobase'
>
> The following object(s) are masked from
> 'package:IRanges':
>
> updateObject
>
> Error in forgeBSgenomeDataPkg(y, seqs_srcdir
=
> seqs_srcdir,
> masks_srcdir =
> masks_srcdir, :
> values for symbols NMASKPERSEQ are not
single
> strings
> In addition: Warning message:
> In storage.mode(x$nmask_per_seq)<- "integer"
: NAs
> introduced by
> coercion
>
> sessionInfo()
>
> R version 2.12.1 (2010-12-16)
> Platform: x86_64-pc-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8
LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8
> LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=C
> LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
> [9] LC_ADDRESS=C
LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8
LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils
datasets
> methods base
>
> other attached packages:
> [1] Biobase_2.6.1 BSgenome_1.18.2
> Biostrings_2.18.0
> [4] GenomicRanges_1.2.1 IRanges_1.8.8
>
> loaded via a namespace (and not attached):
> [1] tools_2.12.1
>
> [[alternative HTML version deleted]]
>
>
_______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org="">
> <mailto:bioconductor at="" r-project.org=""> <mailto:bioconductor at="" r-project.org="">>
> <mailto:bioconductor at="" r-project.org=""> <mailto:bioconductor at="" r-project.org="">
> <mailto:bioconductor at="" r-project.org=""> <mailto:bioconductor at="" r-project.org="">>>
>
>
https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
>
http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M2-B876
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org="">
> <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">>
> <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">
> <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">>>
>
>
> Phone: (206) 667-5791
> Fax: (206) 667-1319
>
>
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M2-B876
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org="">
> <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">>
> Phone: (206) 667-5791
> Fax: (206) 667-1319
>
>
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M2-B876
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org="">
> Phone: (206) 667-5791
> Fax: (206) 667-1319
>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319