Anh Tran writes:
> Hi all,
>
> I'm new to this mailing list and have a very simple question
> about digestion with restriction enzyme for the whole genome.
>
> I'm using package BSgenome.Hsapiens.UCSC.hg18 to find the cut
> site of enzyme. Is there a faster way (ie. pre-built package)
> to get the fragment sequence and it's location in UCSC genome
> browser format (Chr#: start bp - end bp).
>
> I'm planning on finding the restriction site location, then
> find the substring between two consecutive cut sites.
>
getSeq accepts vectors of start and stop sites, so you can try this:
library(Biostrings)
library(BSgenome.Hsapiens.UCSC.hg18)
c1<-Hsapiens$chr1
m<-matchPattern('CTAG', c1)
starts <- m at start[1:length(m at start)-1]
ends <- m at start[2:length(m at start)]
seqs <- getSeq(Hsapiens, 'chr1', starts, ends)
Which works great for a "few" sequences, but stalls on my windows box
with this many sequences due to memory constraints. While Biostrings
is fantastic (for example, the matchPattern above takes only a few
seconds), it does come at a cost (IMHO) in terms of memory overhead
(the authors know this, and explicitly state that retrieving this many
sequences at once is not likely to work). But you might be able to
loop through in chunks and rbind together.
An alternative may be to use biomaRt's getSequence in mysql mode
(requires mySql libraries to be installed and the RMySQL package). I
am not sure if you can send multiple sites, though (and I can't test
it at the moment due to firewall issues), but you could try:
mart <- useMart("ensembl",dataset="hsapiens_gene_ensembl", mysql=TRUE)
getSequence(chromosome="1", start=starts, end=ends, mart=ensembl)
<driftingofftopic>
An unpublished alternative is to use a new library I have concocted
which will suck sequences directly out of a 2bit file. You have to
download the human genome .2bit file (800Mb) from here:
ftp://hgdownload.cse.ucsc.edu/gbdb/hg18/hg18.2bit
And then, if you had my library installed (which I can send to you if
you want it), you could do this:
library(r2bit)
chr <- rep("chr1", length(starts))
seq <- fetch2bit("/path/to/hg18.2bit", chr, starts, ends)
On my box, this takes 35 seconds to retrieve the 623,302 sequences (as
character strings, not at Biostrings objects) from my local copy of
hg18.2bit.
Personally, as we move into the massively high throughput sequencing
era, I wonder if this isn't a good alternative (i.e., coupling
biomaRt's ability to quickly figure out what sequences are of
interest, and then retrieving those sequences quickly out of a local
compressed file)
</driftingofftopic>
Cheers,
Eric
> Is there any faster way to do this?
>
> Here's the code that I'm thinking of:
>
> library(BSgenome.Hsapiens.UCSC.hg18)
> c1<-Hsapiens$chr1
> m<-matchPatterns('CTAG', c1)
>
> #And the for loop
> reslt<-NULL
> for (i in 1:(length(m)-1)) {
> reslt<-rbind(reslt, cbind(start(m[i])+1, start(m[i+1])-3,
> getSeq(Hsapiens, 'chr1', start(m[i])+2, start(m[i+1])-2))) }
>
> It seems like this code is no where near perfect. Please give
> me some input.
>
> Thanks
>
> --
> Regards,
> Anh Tran
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
>
https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
>
http://news.gmane.org/gmane.science.biology.informatics.conductor
>
This email message, including any attachments, is for
th...{{dropped:6}}