Looking at the vignette for this package, I can see how you might be confused, so here's a step-by-step walkthrough.
1.) Download the FASTA file to your computer.
2.) If you downloaded the tar.gz file, unpack it with tar xvfz S_lycopersicum_chromosomes.3.00.fa.tar.gz. Otherwise, get the full FASTA file to begin with.
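If you'd rather do steps 1 and 2 from inside R, something like the following should work. It's only a sketch: fetch_genome is a name I made up, and I'm assuming the archive sits directly under the build_3.00 FTP directory, so check the listing before relying on the URL.

```r
# Hypothetical helper: download a tar.gz (unless it is already there) and unpack it.
fetch_genome <- function(url, destfile = basename(url), exdir = ".") {
  if (!file.exists(destfile)) {
    # mode = "wb" keeps the binary archive intact on Windows
    download.file(url, destfile, mode = "wb")
  }
  dir.create(exdir, showWarnings = FALSE, recursive = TRUE)
  members <- untar(destfile, list = TRUE)  # names of the files in the archive
  untar(destfile, exdir = exdir)
  file.path(exdir, members)
}

# Assumed URL (directory from the source, file name from step 2):
# fetch_genome("ftp://ftp.solgenomics.net/tomato_genome/assembly/build_3.00/S_lycopersicum_chromosomes.3.00.fa.tar.gz")
```

The function returns the paths of the unpacked files, so you can pass the result straight on to the next step.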
3.) In R, after loading BSgenome, run
> fasta.seqlengths("S_lycopersicum_chromosomes.3.00.fa")
SL3.0ch00 SL3.0ch01 SL3.0ch02 SL3.0ch03 SL3.0ch04 SL3.0ch05 SL3.0ch06
20852292 98455869 55977580 72290146 66557038 66723567 49794276
SL3.0ch07 SL3.0ch08 SL3.0ch09 SL3.0ch10 SL3.0ch11 SL3.0ch12
68175699 65987440 72906345 65633393 56597135 68126176
4.) Note that
> paste0("SL3.0ch", sprintf("%02d", 0:12))
[1] "SL3.0ch00" "SL3.0ch01" "SL3.0ch02" "SL3.0ch03" "SL3.0ch04" "SL3.0ch05"
[7] "SL3.0ch06" "SL3.0ch07" "SL3.0ch08" "SL3.0ch09" "SL3.0ch10" "SL3.0ch11"
[13] "SL3.0ch12"
generates the same chromosome names. This is important!
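Rather than comparing the two sets of names by eye, you can let R do it. This is a small base R sketch; check_seqnames is my own helper, not part of BSgenome, and I trim whitespace on the assumption that FASTA headers sometimes carry trailing blanks.

```r
# Hypothetical helper: compare the names found in the FASTA file with the
# names that the seqnames expression generates.
check_seqnames <- function(fasta_names, expected) {
  fasta_names <- trimws(fasta_names)
  list(
    missing_from_fasta  = setdiff(expected, fasta_names),
    unexpected_in_fasta = setdiff(fasta_names, expected),
    same_order          = identical(fasta_names, expected)
  )
}

expected <- paste0("SL3.0ch", sprintf("%02d", 0:12))
# With the real file you would pass
#   names(fasta.seqlengths("S_lycopersicum_chromosomes.3.00.fa"))
# as the first argument; here is a toy call with two names:
check_seqnames(c("SL3.0ch00 ", "SL3.0ch01"), expected[1:2])
```

If both setdiff results are empty and same_order is TRUE, the expression is safe to put in the seed file.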
5.) For FASTA files, you need one FASTA file per chromosome (it says so in the vignette).
> z <- readDNAStringSet("S_lycopersicum_chromosomes.3.00.fa")
> dir.create("S_lycopersicum_chromosomes.3.00")
> for(i in seq_along(z)) writeXStringSet(z[i], paste0("S_lycopersicum_chromosomes.3.00/", gsub("\\s+", "", names(z)[i], perl = TRUE), ".fa"))
> dir("S_lycopersicum_chromosomes.3.00/")
[1] "SL3.0ch00.fa" "SL3.0ch01.fa" "SL3.0ch02.fa" "SL3.0ch03.fa" "SL3.0ch04.fa"
[6] "SL3.0ch05.fa" "SL3.0ch06.fa" "SL3.0ch07.fa" "SL3.0ch08.fa" "SL3.0ch09.fa"
[11] "SL3.0ch10.fa" "SL3.0ch11.fa" "SL3.0ch12.fa"
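Before moving on, it can be worth checking that there really is one file per chromosome name. A minimal sketch in base R, assuming (as in the listing above) that each file is named after its chromosome plus ".fa"; missing_fasta_files is a hypothetical helper of mine.

```r
# Hypothetical helper: which per-chromosome FASTA files are missing from srcdir?
missing_fasta_files <- function(srcdir, seqnames, suffix = ".fa") {
  files <- file.path(srcdir, paste0(seqnames, suffix))
  files[!file.exists(files)]
}

seqnames <- paste0("SL3.0ch", sprintf("%02d", 0:12))
missing_fasta_files("S_lycopersicum_chromosomes.3.00", seqnames)
# an empty character vector means every expected file was found
```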
6.) Now you need a seed file. It should look like this:
Package: BSgenome.Slycopersicum.SGN.SL3
Title: Full genome sequences for Solanum lycopersicum (SGN version 3)
Description: Full genome sequences for Solanum lycopersicum as provided by SGN.
Version: 0.0.1
Suggests: GenomicFeatures
organism: Solanum lycopersicum
common_name: Tomato
provider: SGN
provider_version: SL3.00
release_date: Feb 2017
release_name: SL3.00
source_url: ftp://ftp.solgenomics.net/tomato_genome/assembly/build_3.00/
organism_biocview: Solanum_lycopersicum
BSgenomeObjname: Slycopersicum
SrcDataFiles: S_lycopersicum_chromosomes.3.00.fa from ftp://ftp.solgenomics.net/tomato_genome/assembly/build_3.00/
seqs_srcdir: C:/Users/jmacdon/Desktop/S_lycopersicum_chromosomes.3.00
seqnames: paste0("SL3.0ch", sprintf("%02d", 0:12))
EDIT: Note that the last line is the same R code that generates the chromosome names in step 4! Also, the seed file is a plain text file, which I saved on my computer as "Slycopersicum-seed".
ALSO NOTE THAT seqs_srcdir has to point to the directory that holds your FASTA files! Mine points to a directory on my computer, so don't use that path.
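If you want to sanity-check the seed file before building, here is a rough sketch. It rests on two assumptions of mine: that the seed file can be read with base R's read.dcf (it has the same "key: value" layout as a package DESCRIPTION file), and that the FASTA files are named after the chromosome plus ".fa". check_seed is a hypothetical helper, not a BSgenome function.

```r
# Hypothetical helper: read a seed file and check that it points at real files.
check_seed <- function(seed_file) {
  seed <- read.dcf(seed_file)
  srcdir <- unname(seed[1, "seqs_srcdir"])
  # the seqnames field holds R code, so evaluate it to get the actual names
  seqnames <- eval(parse(text = seed[1, "seqnames"]))
  files <- file.path(srcdir, paste0(seqnames, ".fa"))
  list(
    srcdir_exists = dir.exists(srcdir),
    seqnames      = seqnames,
    missing_files = files[!file.exists(files)]
  )
}

# check_seed("Slycopersicum-seed")
```

You want srcdir_exists to be TRUE and missing_files to be empty. Since the seqnames field gets evaluated as R code, only run this on a seed file you wrote yourself.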
7.) Build and install
> forgeBSgenomeDataPkg("Slycopersicum-seed")
Creating package in ./BSgenome.Slycopersicum.SGN.SL3
Loading 'SL3.0ch00' sequence from FASTA file 'C:/Users/jmacdon/Desktop/S_lycopersicum_chromosomes.3.00/SL3.0ch00.fa' ... DONE
<snip>
Writing all sequences to './BSgenome.Slycopersicum.SGN.SL3/inst/extdata/single_sequences.2bit' ... DONE
> install.packages("BSgenome.Slycopersicum.SGN.SL3/", repos = NULL, type = "source")
## I'm on Windows so I need to say 'source'
<snip>
* DONE (BSgenome.Slycopersicum.SGN.SL3)
> library(BSgenome.Slycopersicum.SGN.SL3)
> ls(2)
[1] "BSgenome.Slycopersicum.SGN.SL3" "Slycopersicum"
> Slycopersicum
Tomato genome:
# organism: Solanum lycopersicum (Tomato)
# provider: SGN
# provider version: SL3.00
# release date: Feb 2017
# release name: SL3.00
# 13 sequences:
# SL3.0ch00 SL3.0ch01 SL3.0ch02 SL3.0ch03 SL3.0ch04 SL3.0ch05 SL3.0ch06
# SL3.0ch07 SL3.0ch08 SL3.0ch09 SL3.0ch10 SL3.0ch11 SL3.0ch12
# (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator
# to access a given sequence)
Et voilà!
Thank you so much for such a detailed step by step explanation. Although I got a few warnings, it worked and the package is loaded.
Hi Jim,
I'd love to improve the vignette so if you could provide more details about what you find confusing that would be great. Thanks!
H.
Hi Herve,
I think the confusing part is that there isn't a basic overview to get people oriented. All they need are two files: a genome and a text file that describes it. The tricky part is the acceptable format of the genome and the seed file.
The vignette is complete as is, but it can be TL;DR if the end user already has a genome in an acceptable format.
For example, if an end user has a 2bit file, they don't need to know anything more about the genome, and can go on to generating the seed file. The same is true if they have a multi-chromosome FASTA file. It's only tricky if they have something different (like the OP) and have to convert to either a multi-chromosome FASTA file or a 2bit file. If the vignette were HTML you could say what they need, and if they have the 2bit or multi-chromosome FASTA file, provide a link to go to the seed file section. If they don't, then provide a link to go to a section that has more information about generating the correct format for the genome.
It's a bit different for the seed file. You have a whole section that shows all the fields that people could use, and what goes in each field. If you just want to build a basic package and don't need to get fancy, the easiest thing to do is to copy an existing seed file and modify it to suit (which is what I did). If the vignette just said to do that, provided code to copy an existing file to the working directory, and gave a basic idea of what should go into the fields, that might be sufficient for most. You could then have a link that takes people to the more detailed description of all the fields, for those who want or need to include more detail.
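The copying step could be as simple as the sketch below. copy_seed_template is a hypothetical helper, and the location of the example seed files (extdata/GentlemanLab inside the installed BSgenome package) is my assumption; it may differ between versions, so list the directory first.

```r
# Hypothetical helper: copy an existing seed file to use as a starting point.
copy_seed_template <- function(seed_dir, template, dest) {
  src <- file.path(seed_dir, template)
  if (!file.exists(src)) stop("No such seed file: ", src)
  # overwrite = FALSE so an already edited seed file is never clobbered
  if (!file.copy(src, dest, overwrite = FALSE))
    stop("Could not copy to ", dest, " (does it already exist?)")
  invisible(dest)
}

# Assumed location of the example seed files:
# seed_dir <- system.file("extdata", "GentlemanLab", package = "BSgenome")
# templates <- list.files(seed_dir, pattern = "-seed$")
# copy_seed_template(seed_dir, templates[1], "Slycopersicum-seed")
```

After that, open the copy in a text editor and change the fields (Package, organism, seqs_srcdir, seqnames and so on) to describe your own genome.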
Thanks for the feedback. Very useful. I'll work on that.
H.