How does one subset an XStringViews or PDict object?

Noah Dowell ▴ 410

@noah-dowell-3791

Last seen 10.6 years ago

Hello to all,

I am using the excellent BSgenome and Biostrings packages to look at the variety and number of occurrences of a transcription factor DNA binding motif across the E. coli genome. From biochemistry and molecular biology experiments we know our favorite transcription factor binds a fairly degenerate motif. I want to count how many times each particular motif occurs in the E. coli genome and see whether specific motifs map to specific genome locations.

Here is a working example of what I have done:

    library(BSgenome.Ecoli.NCBI.20080805)

    # create an object to work with one genome: E. coli str. K-12 substr. MG1655
    genome12 <- Ecoli$NC_000913
    consensus <- "TGTTCAAAAAATAAGCA"
    TFmotifDict = DNAStringSet(consensus)

    ConsMatch = matchPDict(TFmotifDict, genome12, max.mismatch=7)
    z = extractAllMatches(genome12, ConsMatch)
    x = PDict(z)

    table(patternFrequency(x))
    #     1     2     3     4     5
    # 17088   128    60    52    80

So this is working great and providing some interesting results, but in reading through the archives and vignettes I have not figured out how to subset my motif dictionary into the small class of motifs that occur more than once (see the output of the table function above). I want to get the start and end genome locations and the sequence info for the 128 + 60 + 52 + 80 patterns.

I can do the following to get one:

    x[[61]]

Or I can do this:

    freq = patternFrequency(x)
    getit = which(freq != 1)

But this only tells me which ones they are.

This could be a pretty basic R task or something specific to these types of objects, but I seem to be stuck with my newbie R skills. Thank you in advance for any help.

Best,
Noah

    > sessionInfo()
    R version 2.12.1 (2010-12-16)
    Platform: x86_64-pc-linux-gnu (64-bit)

    locale:
     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
     [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
     [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
     [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
     [9] LC_ADDRESS=C               LC_TELEPHONE=C
    [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] BSgenome.Ecoli.NCBI.20080805_1.3.16 BSgenome_1.16.5
    [3] Biostrings_2.16.9                   GenomicRanges_1.0.7
    [5] IRanges_1.6.11

    loaded via a namespace (and not attached):
    [1] Biobase_2.8.0 tools_2.12.1

Transcription BSgenome Biostrings • 1.4k views

ADD COMMENT • link updated 14.2 years ago by Martin Morgan 25k • written 14.2 years ago by Noah Dowell ▴ 410

Martin Morgan 25k

@martin-morgan-1513

Last seen 12 weeks ago

United States

On 02/04/2011 06:40 PM, Noah Dowell wrote:
> [...] I have not figured out how to subset my motif dictionary into the small class of motifs that occur more than once.

Hi Noah,

I ended up at

    unique(tb(x)[patternFrequency(x)==5])

This came mostly from looking at the help page for patternFrequency, guided by a little discovery of the methods that might be relevant to 'x' with

    showMethods(class=class(x), where=getNamespace("Biostrings"))

(this last is definitely obscure).

Martin

--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793
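The one-liner in this reply returns the distinct sequences for a single frequency class. To also recover the start and end locations for every motif that occurs more than once, one possible approach is to subset the views object rather than the dictionary. The following is a minimal sketch, not the method given in the thread: it assumes that pattern i of the dictionary corresponds to view i (which should hold when the dictionary is built directly from the views), and it uses a made-up toy subject and consensus in place of the E. coli chromosome so that it runs without the BSgenome data package.

```r
library(Biostrings)

# Hypothetical toy subject standing in for the E. coli chromosome.
subject <- DNAString("ACGTACGTTTACGAACGT")

# All 4-mers within one mismatch of a made-up consensus; this plays the
# role of 'z' (an XStringViews object) in the thread.
z <- matchPattern("ACGT", subject, max.mismatch = 1)

# Dictionary built from the views, as in the original post.
x <- PDict(z)

# Frequency of each pattern within the dictionary, one value per view.
freq <- patternFrequency(x)

# Keep only the views whose sequence occurs more than once.
keep <- freq > 1
repeated <- z[keep]

# Coordinates, sequence and frequency of every repeated occurrence.
repeated_hits <- data.frame(
  start     = start(repeated),
  end       = end(repeated),
  sequence  = as.character(repeated),
  frequency = freq[keep]
)
print(repeated_hits)

# Distinct repeated sequences, generalising the ==5 one-liner to >1.
print(unique(tb(x)[keep]))
```

Subsetting z instead of x keeps the coordinates, because the views carry start and end positions while the dictionary only holds the pattern sequences.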

ADD COMMENT • link 14.2 years ago Martin Morgan 25k

Thank you Martin! That should work nicely; the patternFrequency man page was one I missed. The showMethods call is a good general tip that I can put to use.

Best,
Noah

On Feb 4, 2011, at 7:52 PM, Martin Morgan wrote:
> I ended up at
>
> unique(tb(x)[patternFrequency(x)==5])

ADD REPLY • link 14.2 years ago Noah Dowell ▴ 410
