How can I extract sequences from a FASTA file for each of the intervals defined in a BED file using R?
The reference genome used is "Gallus gallus" that can be obtained by:
source("http://bioconductor.org/biocLite.R") biocLite("BSgenome.Ggallus.UCSC.galGal4") library(BSgenome.Ggallus.UCSC.galGal4)
My data file is a result of gRanges package
library("GenomicRanges") > olaps GRanges object with 2141 ranges and 0 metadata columns: seqnames ranges strand <Rle> <IRanges> <Rle> [1] chr14 [ 1665929, 1673673] * [2] chr14 [ 2587465, 2595209] * [3] chr14 [ 8143785, 8151529] * [4] chr14 [ 9779705, 9787449] * [5] chr14 [10281129, 10288873] * ... ... ... ... [2137] chr24 [3280553, 3288297] * [2138] chr24 [3330889, 3338633] * [2139] chr24 [3005641, 3015321] * [2140] chr24 [3319273, 3327017] * [2141] chr24 [5549545, 5557289] * ------- seqinfo: 31 sequences from an unspecified genome; no seqlengths
That I can transform in data.table
olaps<- as.data.table(olaps)
Example to be used:
olaps<-"seqnames start end width strand chr1 1665929 1673673 7745 * chr1 2587465 2595209 7745 * chr1 8143785 8151529 7745 * chr2 9779705 9787449 7745 * chr2 10281129 10288873 7745 *" olaps<-read.table(text=olaps,header=T)
Expected outcome:
something like this (fasta format):
>SEQUENCE_1 ACTGACTAGCATCGCAT... >SEQUENCE_2 ACGTAGAGAGGGACATA... >SEQUENCE_3...
I have tried to use this package unsuccessful until now: I got the Reference Genome fasta:
seq = BSgenome::getSeq(BSgenome.Ggallus.UCSC.galGal4, olaps)
And I am trying this:
Biostrings::writeXStringSet(seq, olaps)
With this error message:
Error in .open_output_file(filepath, append, compress, compression_level) : 'filepath' must be a single string
Don't worry about making a data.table, it's irrelevant to your work flow here.
It seems like you've extracted the sequences you're interested in
and you're just missing the last step
Before writing the output, you might wish to add names to the sequences,
Thank you @Martin Morgan, the problem is that when I run this command:
The follow error message appear:
Error in loadFUN(x, seqname, ranges) :
trying to load regions beyond the boundaries of non-circular sequence "chr4"
What is weird, because this olaps is a result of CNV call from CNV-seq package and comes from the same reference genome...
The software isn't lying; some of your ranges are outside the sequence lengths of the BSgenome object. You're right that the most common cause for this is the use of different reference genomes. Other problems might be use of 0-based versus 1-based coordinates (Bioconductor uses 1-based coordinates; did you import your BED file using
rtracklayer::import()
?) or a similar but not identical genome build, especially for mitochondrial or similar chromosomes. You can get the genomic ranges of the genome in the BSgenome object withand use this to find the overlapping ranges that are out of bounds
If you have the reference genome FASTA file used for copy number calling, it is relatively easy to use it instead of a BSgenome package. See C: problem creating txdb object and bsgenome for solanum pennellii.
Thank you @Martin
Do you know why this error?
The command looks like correct according to the manual:
http://finzi.psych.upenn.edu/library/GenomicRanges/html/GRanges-class.html
> genomerngs = as(seqinfo(BSgenome.Ggallus.UCSC.galGal4, "GRanges"))
Error in .identC(Class, "double") :
argument "Class" is missing, with no default
Because there was a typo in the placement of parentheses, now fixed ---
as(seqinfo(BSgenome.Ggallus.UCSC.galGal4), "GRanges")
I further updated my response to use
%within%
rather than%in%
.