Fetch scaffold sequences from BSgenome package
1
0
Entering edit mode
Julie Zhu ★ 4.3k
@julie-zhu-3596
Last seen 11 months ago
United States
Herve, It seems that the getSeq function does not work for scaffold sequence such as Zv9_scaffold3564:93,507-93,556 and Zv9_NA384:3,507-3,556 for BSgenome.Drerio.UCSC.danRer7_1.3.17. What is the best way to obtain such sequences using BSgenome package? Many thanks! Best regards, Julie [[alternative HTML version deleted]]
BSgenome BSgenome BSgenome BSgenome • 760 views
ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 2 days ago
Seattle, WA, United States
Hi Julie, On 09/18/2012 09:05 AM, Zhu, Lihua (Julie) wrote: > Herve, > > It seems that the getSeq function does not work for scaffold sequence > such as Zv9_scaffold3564:93,507-93,556 and Zv9_NA384:3,507-3,556 for > BSgenome.Drerio.UCSC.danRer7_1.3.17. What is the best way to obtain such > sequences using BSgenome package? Many thanks! Indeed: > library(BSgenome.Drerio.UCSC.danRer7) > getSeq(Drerio, "Zv9_scaffold3564") Error in .getOneSeqFromBSgenomeMultipleSequences(x, names[i], start[i], : sequence Zv9_scaffold3564 found more than once, please use a non-ambiguous name The problem is this: > grep("Zv9_scaffold3564", names(Drerio$Zv9_scaffold), value=TRUE) [1] "Zv9_scaffold3564" > grep("Zv9_scaffold3564", names(Drerio$upstream1000), value=TRUE) [1] "ENSDART00000099775_up_1000_Zv9_scaffold3564_137113_r Zv9_scaffold3564:137113-138112" [2] "ENSDART00000062791_up_1000_Zv9_scaffold3564_129501_r Zv9_scaffold3564:129501-130500" and also that getSeq() here is looking for a sequence that *contains* "Zv9_scaffold3564" in its name. Note that when used with a GRanges object, getSeq() is looking for a sequence with the exact specified name so that should address the problem: > getSeq(Drerio, GRanges("Zv9_scaffold3564", IRanges(3507, 3556))) A DNAStringSet instance of length 1 width seq [1] 50 CCTAAGTATCCACTTTAGTATCCATAACACAATAATCAGATGCTATTGTT Note that our plan for BioC 2.12 is to get rid of the upstream sequences in the BSgenome packages (they should never have been included here in the first place, people will be able to use Paul's new getPromoterSeq() from the GenomicFeatures package instead) so that will entirely solve the ambiguous name problem. Let me know if you have any questions about this. Thanks, H. > > Best regards, > > Julie -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD COMMENT
0
Entering edit mode
Herve, Thanks so much for the quick response and the alternative solutions! Best regards, Julie On 9/18/12 2:49 PM, "Hervé Pagès" <hpages at="" fhcrc.org=""> wrote: > Hi Julie, > > On 09/18/2012 09:05 AM, Zhu, Lihua (Julie) wrote: >> Herve, >> >> It seems that the getSeq function does not work for scaffold sequence >> such as Zv9_scaffold3564:93,507-93,556 and Zv9_NA384:3,507-3,556 for >> BSgenome.Drerio.UCSC.danRer7_1.3.17. What is the best way to obtain such >> sequences using BSgenome package? Many thanks! > > Indeed: > >> library(BSgenome.Drerio.UCSC.danRer7) >> getSeq(Drerio, "Zv9_scaffold3564") > Error in .getOneSeqFromBSgenomeMultipleSequences(x, names[i], > start[i], : > sequence Zv9_scaffold3564 found more than once, please use a > non-ambiguous name > > The problem is this: > >> grep("Zv9_scaffold3564", names(Drerio$Zv9_scaffold), value=TRUE) > [1] "Zv9_scaffold3564" >> grep("Zv9_scaffold3564", names(Drerio$upstream1000), value=TRUE) > [1] "ENSDART00000099775_up_1000_Zv9_scaffold3564_137113_r > Zv9_scaffold3564:137113-138112" > [2] "ENSDART00000062791_up_1000_Zv9_scaffold3564_129501_r > Zv9_scaffold3564:129501-130500" > > and also that getSeq() here is looking for a sequence that *contains* > "Zv9_scaffold3564" in its name. > > Note that when used with a GRanges object, getSeq() is looking for a > sequence with the exact specified name so that should address the > problem: > >> getSeq(Drerio, GRanges("Zv9_scaffold3564", IRanges(3507, 3556))) > A DNAStringSet instance of length 1 > width seq > [1] 50 CCTAAGTATCCACTTTAGTATCCATAACACAATAATCAGATGCTATTGTT > > Note that our plan for BioC 2.12 is to get rid of the upstream sequences > in the BSgenome packages (they should never have been included here in > the first place, people will be able to use Paul's new getPromoterSeq() > from the GenomicFeatures package instead) so that will entirely solve > the ambiguous name problem. > > Let me know if you have any questions about this. > > Thanks, > H. > >> >> Best regards, >> >> Julie
ADD REPLY

Login before adding your answer.

Traffic: 529 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6