Question

How can I extract sequences from a FASTA file for each of the intervals defined in a BED file using R?

0

Entering edit mode

pertille • 0

@pertille-9104

Last seen 5.8 years ago

Sweden

How can I extract sequences from a FASTA file for each of the intervals defined in a BED file using R?
The reference genome used is "Gallus gallus" that can be obtained by:

    source("http://bioconductor.org/biocLite.R")
    biocLite("BSgenome.Ggallus.UCSC.galGal4")
        library(BSgenome.Ggallus.UCSC.galGal4)

My data file is a result of gRanges package

    library("GenomicRanges")

    > olaps
    GRanges object with 2141 ranges and 0 metadata columns:
             seqnames               ranges strand
                <Rle>            <IRanges>  <Rle>
         [1]    chr14 [ 1665929,  1673673]      *
         [2]    chr14 [ 2587465,  2595209]      *
         [3]    chr14 [ 8143785,  8151529]      *
         [4]    chr14 [ 9779705,  9787449]      *
         [5]    chr14 [10281129, 10288873]      *
         ...      ...                  ...    ...
      [2137]    chr24   [3280553, 3288297]      *
      [2138]    chr24   [3330889, 3338633]      *
      [2139]    chr24   [3005641, 3015321]      *
      [2140]    chr24   [3319273, 3327017]      *
      [2141]    chr24   [5549545, 5557289]      *
      -------
      seqinfo: 31 sequences from an unspecified genome; no seqlengths

That I can transform in data.table

    olaps<- as.data.table(olaps)

Example to be used:

    olaps<-"seqnames    start      end width strand
    chr1  1665929  1673673  7745      *
    chr1  2587465  2595209  7745      *
    chr1  8143785  8151529  7745      *
    chr2  9779705  9787449  7745      *
    chr2 10281129 10288873  7745      *"
    olaps<-read.table(text=olaps,header=T)

Expected outcome:
something like this (fasta format):

    >SEQUENCE_1
    ACTGACTAGCATCGCAT...
    >SEQUENCE_2
    ACGTAGAGAGGGACATA...
    >SEQUENCE_3...

I have tried to use this package unsuccessful until now: I got the Reference Genome fasta:

seq = BSgenome::getSeq(BSgenome.Ggallus.UCSC.galGal4, olaps)

And I am trying this:

Biostrings::writeXStringSet(seq, olaps)

With this error message:

Error in .open_output_file(filepath, append, compress, compression_level) :
  'filepath' must be a single string

R bioconductor granges • 9.7k views

ADD COMMENT • link 8.8 years ago • updated 8.4 years ago pertille • 0

2

Entering edit mode

Don't worry about making a data.table, it's irrelevant to your work flow here.

It seems like you've extracted the sequences you're interested in

seq = BSgenome::getSeq(BSgenome.Ggallus.UCSC.galGal4, olaps)

and you're just missing the last step

Biostrings::writeXStringSet(seq, "my.fasta")

Before writing the output, you might wish to add names to the sequences,

names(seq) = paste0("SEQUENCE_", seq_along(seq))

ADD REPLY • link 8.8 years ago Martin Morgan 25k

0

Entering edit mode

Thank you @Martin Morgan, the problem is that when I run this command:

{seq = BSgenome::getSeq(BSgenome.Ggallus.UCSC.galGal4, olaps)}

The follow error message appear:

Error in loadFUN(x, seqname, ranges) :
trying to load regions beyond the boundaries of non-circular sequence "chr4"

What is weird, because this olaps is a result of CNV call from CNV-seq package and comes from the same reference genome...

ADD REPLY • link 8.8 years ago pertille • 0

1

Entering edit mode

The software isn't lying; some of your ranges are outside the sequence lengths of the BSgenome object. You're right that the most common cause for this is the use of different reference genomes. Other problems might be use of 0-based versus 1-based coordinates (Bioconductor uses 1-based coordinates; did you import your BED file using rtracklayer::import()?) or a similar but not identical genome build, especially for mitochondrial or similar chromosomes. You can get the genomic ranges of the genome in the BSgenome object with

genomerngs = as(seqinfo(BSgenome.Ggallus.UCSC.galGal4), "GRanges")

and use this to find the overlapping ranges that are out of bounds

olaps[!olaps %within% genomerngs]

If you have the reference genome FASTA file used for copy number calling, it is relatively easy to use it instead of a BSgenome package. See C: problem creating txdb object and bsgenome for solanum pennellii.

ADD REPLY • link 8.8 years ago Martin Morgan 25k

0

Entering edit mode

Thank you @Martin
Do you know why this error?
The command looks like correct according to the manual:
http://finzi.psych.upenn.edu/library/GenomicRanges/html/GRanges-class.html

> genomerngs = as(seqinfo(BSgenome.Ggallus.UCSC.galGal4, "GRanges"))

Error in .identC(Class, "double") :
argument "Class" is missing, with no default

ADD REPLY • link 8.8 years ago pertille • 0

0

Entering edit mode

Because there was a typo in the placement of parentheses, now fixed ---as(seqinfo(BSgenome.Ggallus.UCSC.galGal4), "GRanges")

ADD REPLY • link 8.8 years ago Martin Morgan 25k

0

Entering edit mode

I further updated my response to use %within% rather than %in%.

ADD REPLY • link 8.8 years ago Martin Morgan 25k

score 1 · Answer 1 · 2016-02-05

It looks like you're fairly close to what you want to get done - you should be able to do it using GenomicRanges and the appropriate BSgenome package.

I would suggest looking at the GenomicRanges vignette here (http://bioconductor.org/packages/3.3/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesHOWTOs.pdf), specifically section 2.14. This has an example of create a GRanges object specifying regions of interest (which you've already done) and then how to apply that with a BSgenome object. You will then be able to write the result out to a fasta file.