BioStrings for current R: RE: BSgenomeForge seed file

BioStrings for current R: RE: BSgenomeForge seed file - seqnames field

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 7 hours ago

Seattle, WA, United States

Hi Kelly, On 12/06/2013 12:23 PM, Vining, Kelly wrote: > Hi Herve, > It has been a while since I have worked with Biostrings and the MEDIPS package. When I went to install Biostrings just now, got an error message that Biostrings isn't available: > >> install.packages("Biostrings") > Installing package into ?/raid1/home/pi/viningk/R? > (as ?lib? is unspecified) > --- Please select a CRAN mirror for use in this session --- Here you are asked to choose a CRAN mirror but Biostrings is not a CRAN package. This suggests that you might be on the wrong path. > Warning message: > package ?Biostrings? is not available (for R version 3.0.0) > > Revisiting the online documentation, it does look like the package is still available. > Is there an available update, or a different alternative I should be using? The only proper way to install a Bioconductor package is to use biocLite(), as documented here: http://bioconductor.org/install/ Unlike install.packages(), biocLite() "knows" where to find Bioconductor packages. Also make sure you're using the latest BioC release (BioC 2.13). Cheers, H. > > Thanks, > --Kelly Vining > ________________________________________ > From: Hervé Pagès [hpages at fhcrc.org] > Sent: Thursday, May 02, 2013 10:49 AM > To: Vining, Kelly > Cc: Kelly V [guest]; bioconductor at r-project.org; MEDIPS Maintainer > Subject: Re: [BioC] BSgenomeForge seed file - seqnames field > > Hi Kelly, > > On 05/02/2013 06:44 AM, Vining, Kelly wrote: >> Hi Herve, >> Thanks for your helpful response, and for pointing me to the proper scripts. Given this, it appears that I should treat my gene feature annotation files (promoters, genes) as mseqnames objects, similarly to how I will treat the extrachromosomal scaffolds. Is there any way to include additional features that I may find of interest after I forge this initial package (say, using gff files), or am I limited such that I would have to re-forge a new reference? > > A BSgenome data package is for storing the sequences of a given > genome/assembly. No annotations should go there. > > In Bioconductor we try to maintain a clear separation between > packages that contain the sequences of a genome/assembly (BSgenome > data packages), and packages that contain annotations for that > genome/assembly (TxDb, FDb, OrganismDb, SNPlocs, etc... packages). > > There are at least 2 reasons for this: > > 1. For a given assembly there are many kinds of annotations available > in many places on the internet, and some of them tend to be updated > frequently. By contrast, the sequences of a given assembly > never change. So putting only the reference sequences in a > BSgenome package make it very stable. It almost never needs to > be updated, except for updating a man page or when something > changes in the way sequences are stored in the package. This is > good with such big packages (800 Mb for the biggest ones). > > 2. Storing the DNA sequences of the genes, promoters, or other genomic > feature in a BSgenome package would duplicate a lot of DNA > sequences that are already in the reference sequences (those > small sequences are substrings of the reference sequences). > This would make the BSgenome package unnecessarily bigger. > It's better to store only the locations of those genomic features, > not in the BSgenome package, but in a separate package, and then > to use those locations to extract the corresponding sequences from > the BSgenome object. This extraction can be done with getSeq() > (defined in the BSgenome software package) and is relatively > cheap. This approach gives you a lot more flexibility than having > those sequences pre-extracted and bundled with the BSgenome data > package. > > If your annotations are in GFF files, you don't really need to > put them in a package: you can load them directly in GRanges > objects with import() (defined in the rtracklayer package), > or if you are particularly interested in the exon/transcript > structure of your genes, you could use makeTranscriptDbFromGFF() > to load a GFF file containing genes, mRNAs, exons, and CDSs into > a TranscriptDb. See ?makeTranscriptDbFromGFF in the > GenomicFeatures package for more information. Then it's easy to > extract the locations of transcripts, exons, or CDSs as a GRanges > or GRangesList object from this TranscriptDb object. See > ?transcripts and ?exonsBy. Finally once you have the locations > in a GRanges object, you can use getSeq() to extract the > corresponding sequences from the BSgenome object. > > Other more specialized functions are promoters() and > getPromoterSeq() for extracting the promoter locations and > their sequences, respectively. Also extractTranscriptsFromGenome() > for extracting the full transcriptome from a BSgenome package. > > HTH, > H. > > PS: The upstream sequences that you see in some of our BSgenome data > packages are a relic of the past and will be removed soon. > >> >> Thanks much, >> --Kelly V. >> >> -----Original Message----- >> From: Hervé Pagès [mailto:hpages at fhcrc.org] >> Sent: Tuesday, April 30, 2013 11:12 PM >> To: Kelly V [guest] >> Cc: bioconductor at r-project.org; Vining, Kelly; MEDIPS Maintainer >> Subject: Re: [BioC] BSgenomeForge seed file - seqnames field >> >> Hi Kelly, >> >> On 04/30/2013 03:52 PM, Kelly V [guest] wrote: >>> >>> I'm preparing a custom reference genome for use with the MEDIPS package. I see that one field of the seed file, which is apparently not optional, is the 'seqnames' field. The example given in the documentation is this: >>> >>> paste("chr", c(1:20, "X", "M", "Un", paste(c(1:20, "X", "Un"), >>> "_random", sep="")), sep="") >>> >>> I have two simple questions about this. >>> >>> 1. Does R match this information with the source sequence file? For example, if I have a single fasta file with fasta headers chr_01...chr_20, must the seqnames entries exactly match those headers? >> >> No. You need to provide 1 FASTA file per single sequence, that is, 1 file per name you put in the 'seqnames' field. That means that each file is expected to contain only 1 sequence. What's in the FASTA header of each file is not important. What's important is that the name of each file be of the form <prefix><seqname><suffix>, where <seqname> is the name of the sequence as it appears in the 'seqnames' field, and <prefix> and <suffix> are a prefix and a suffix (eventually empty) that are the same for all the files. >> >> If what you have is a big FASTA file containing all the chromosome sequences, then you first need to split it into 1 file per chromosome. >> This is easy to do in R. For example, here is the script I used to split the big bosTau6.fa file provided by UCSC for the bosTau6 genome: >> >> library(Biostrings) >> bosTau6 <- readDNAStringSet("bosTau6.fa") >> >> ### Partitioning: >> is_chrUn <- grepl("^chrUn", names(bosTau6)) >> is_chrom <- !is_chrUn >> >> ### Send each chromosome to a FASTA file. >> seqnames <- paste("chr", c(1:29, "X", "M"), sep="") >> stopifnot(setequal(seqnames, names(bosTau6)[is_chrom])) >> >> for (seqname in seqnames) { >> seq <- bosTau6[match(seqname, names(bosTau6))] >> filename <- paste(seqname, ".fa", sep="") >> cat("writing ", filename, "\n", sep="") >> writeXStringSet(seq, file=filename, width=50L) >> } >> >> ### Send the 3286 chrUn_* sequences to 1 FASTA file. >> chrUn_mseq <- bosTau6[is_chrUn] >> writeXStringSet(chrUn_mseq, file="chrUn.fa", width=50L) >> >> The input is the bosTau6.fa file containing 3317 FASTA records: >> 31 records for the chromosomes, and 3286 for the chrUn_* sequences. >> The script produces 32 FASTA files: 1 per chromosome (chr1.fa, chr2.fa, chr3.fa, ..., chr29.fa, chrX.fa, chrM.fa), and the chrUn.fa file (containing the 3286 chrUn_* sequences). >> >> Note that you can find this script in the BSgenome package, and display its source with: >> >> > splitbigfasta_R <- system.file("extdata", >> "GentlemanLab", >> "BSgenome.Btaurus.UCSC.bosTau6-tools", >> "splitbigfasta.R", >> package="BSgenome") >> >> > cat(readLines(splitbigfasta_R), sep="\n") >> >> It should not be too hard to adapt this script to your own needs. >> >>> >>> 2. Revealing the reason for my first question:In my genome fasta file, I have 1427 extrachromosomal scaffolds, but they are not all sequentially numbered, so that I have scaffold_1..scaffold_3681. Do I need to use a regular expression in my seqnames field to tell R to look for scaffold_ followed by 1-4 digits? >> >> No, not in the 'seqnames' field, because those 1427 extrachromosomal scaffolds should not go there. They would need to go in the 'mseqnames' >> field. >> >> Unlike the 'seqnames' field where you enumerate objects that can only contain 1 sequence, in the 'mseqnames' field you can enumerate objects that contain more than 1 sequence. More precisely, in the final BSgenome data package, each entry in the 'seqnames' field will correspond to a DNAString object (the DNAString container can hold >> 1 sequence only), and each entry in the 'mseqnames' field will correspond to a DNAStringSet object (the DNAStringSet container can hold multiple sequences). >> >> So typically, all the extrachromosomal scaffolds would go in one DNAStringSet object in the final BSgenome package (this is for example what is done in the BSgenome.Drerio.UCSC.danRer7 package). >> To achieve this, you need to put 1 entry in the 'mseqnames' field (e.g. "scaffolds"), and to put the 1427 extrachromosomal scaffolds in one FASTA file named accordingly to that entry (e.g. scaffolds.fa). >> >> In the above script, replace: >> >> is_chrUn <- grepl("^chrUn", names(bosTau6)) >> >> with: >> >> is_scaffold <- grepl("^scaffold", names(<your_genome>)) >> >> then every occurrence of 'is_chrUn' with 'is_scaffold', and finally those 3 lines: >> >> ### Send the 3286 chrUn_* sequences to 1 FASTA file. >> chrUn_mseq <- bosTau6[is_chrUn] >> writeXStringSet(chrUn_mseq, file="chrUn.fa", width=50L) >> >> with: >> >> ### Send the 1427 scaffold_* sequences to 1 FASTA file. >> scaffolds_mseq <- <your_genome>[is_scaffold] >> writeXStringSet(scaffolds_mseq, file="scaffolds.fa", width=50L) >> >> and that should take care of sending all the scaffold sequences to the scaffolds.fa file. >> >> Let me know if you need further assistance with this. >> >> Cheers, >> H. >> >>> >>> Thanks for any help, >>> --Kelly V. >>> >>> -- output of sessionInfo(): >>> >>> R version 3.0.0 (2013-04-03) >>> Platform: i386-w64-mingw32/i386 (32-bit) >>> >>> locale: >>> [1] LC_COLLATE=English_United States.1252 [2] LC_CTYPE=English_United >>> States.1252 [3] LC_MONETARY=English_United States.1252 [4] >>> LC_NUMERIC=C [5] LC_TIME=English_United States.1252 >>> >>> attached base packages: >>> [1] stats graphics grDevices utils datasets methods base >>> >>> -- >>> Sent via the guest posting facility at bioconductor.org. >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages at fhcrc.org >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 >> > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

GO Cancer BSgenome OrganismDb SNPlocs TranscriptDb Biostrings BSgenome rtracklayer MEDIPS • 2.3k views

ADD COMMENT • link 11.4 years ago Hervé Pagès 16k

Login before adding your answer.