kmer and zscore calculation
2
1
Entering edit mode
@fabrice-tourre-4394
Last seen 10.3 years ago
Dear list, I have a list of bed regions. each region is 10bp length. I want to calculate the hexamers and pentamers in theses regions and get the zscore. Is there any existed packages to do this? Thank you very much in advance.
• 3.2k views
ADD COMMENT
3
Entering edit mode
@herve-pages-1542
Last seen 2 hours ago
Seattle, WA, United States
[Oops, forgot to Cc the list when I answered this. Sending it again...] Hi Fabrice, The oligonucleotideFrequency() function in the Biostrings package counts the nb of occurrences of all possible 5-mers (use 'width=5') or 6-mers (use 'width=6'). You need to store your sequence(s) in a DNAString or DNAStringSet object first. On a DNAStringSet, the counts are returned in a matrix with 1 row per sequence and 1 column per k-mer: library(Biostrings) library(hgu95av2probe) probes <- DNAStringSet(hgu95av2probe) count5 <- oligonucleotideFrequency(probes, width=5) Then: > dim(count5) [1] 201800 1024 > count5[1:6, 1:10] AAAAA AAAAC AAAAG AAAAT AAACA AAACC AAACG AAACT AAAGA AAAGC [1,] 0 0 0 0 0 0 0 0 0 0 [2,] 0 0 0 0 0 0 0 0 0 0 [3,] 0 0 0 0 0 0 0 0 0 0 [4,] 0 0 0 0 0 0 0 0 0 0 [5,] 0 0 0 0 0 0 0 0 0 0 [6,] 0 0 0 0 0 0 0 0 0 0 Maybe this function should have been called kmerFrequency()... Cheers, H. On 12/17/2013 10:28 AM, Fabrice Tourre wrote: > Dear list, > > I have a list of bed regions. each region is 10bp length. I want to > calculate the hexamers and pentamers in theses regions and get the > zscore. Is there any existed packages to do this? > > Thank you very much in advance. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD COMMENT
0
Entering edit mode
Thank you very much. It is helpful. On Thu, Dec 19, 2013 at 1:33 PM, Hervé Pagès <hpages at="" fhcrc.org=""> wrote: > [Oops, forgot to Cc the list when I answered this. Sending it again...] > > > Hi Fabrice, > > The oligonucleotideFrequency() function in the Biostrings > package counts the nb of occurrences of all possible 5-mers > (use 'width=5') or 6-mers (use 'width=6'). You need to store > your sequence(s) in a DNAString or DNAStringSet object first. > On a DNAStringSet, the counts are returned in a matrix with 1 > row per sequence and 1 column per k-mer: > > library(Biostrings) > library(hgu95av2probe) > probes <- DNAStringSet(hgu95av2probe) > count5 <- oligonucleotideFrequency(probes, width=5) > > Then: > > > dim(count5) > [1] 201800 1024 > > count5[1:6, 1:10] > AAAAA AAAAC AAAAG AAAAT AAACA AAACC AAACG AAACT AAAGA AAAGC > [1,] 0 0 0 0 0 0 0 0 0 0 > [2,] 0 0 0 0 0 0 0 0 0 0 > [3,] 0 0 0 0 0 0 0 0 0 0 > [4,] 0 0 0 0 0 0 0 0 0 0 > [5,] 0 0 0 0 0 0 0 0 0 0 > [6,] 0 0 0 0 0 0 0 0 0 0 > > Maybe this function should have been called kmerFrequency()... > > Cheers, > H. > > > On 12/17/2013 10:28 AM, Fabrice Tourre wrote: >> >> Dear list, >> >> I have a list of bed regions. each region is 10bp length. I want to >> calculate the hexamers and pentamers in theses regions and get the >> zscore. Is there any existed packages to do this? >> >> Thank you very much in advance. >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319
ADD REPLY
0
Entering edit mode
Julian Gehring ★ 1.3k
@julian-gehring-5818
Last seen 5.6 years ago
Hi Fabrice, Could you give an example for this, especially on what the z-score statistics is about to tell you? Best wishes Julian On 12/17/2013 07:28 PM, Fabrice Tourre wrote: > Dear list, > > I have a list of bed regions. each region is 10bp length. I want to > calculate the hexamers and pentamers in theses regions and get the > zscore. Is there any existed packages to do this? > > Thank you very much in advance. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >
ADD COMMENT
0
Entering edit mode
Julian, Thank you for your reply. For example, I have a sample fastq file. >chr1:14404-14435(-) GGCACA >chr1:14409-14440(-) AAAACG >chr1:14423-14454(-) AGAGGC >chr1:14424-14455(-) AAGAGG I want to calculate 6kmers in this sample file. Also I have extract a control fastq file. >chr1:14404-14435(-) CCTACA >chr1:14409-14440(-) TCGACG >chr1:14423-14454(-) TCAGAT >chr1:14424-14455(-) CAAGGC The z-score was calculated for each pentamer as: (occurrence in sequences ? average occurrence in control sequences) / standard deviation of occurrence in control sequences On Wed, Dec 18, 2013 at 4:07 AM, Julian Gehring <julian.gehring at="" embl.de=""> wrote: > > > Hi Fabrice, > > Could you give an example for this, especially on what the z-score > statistics is about to tell you? > > Best wishes > Julian > > > > On 12/17/2013 07:28 PM, Fabrice Tourre wrote: >> >> Dear list, >> >> I have a list of bed regions. each region is 10bp length. I want to >> calculate the hexamers and pentamers in theses regions and get the >> zscore. Is there any existed packages to do this? >> >> Thank you very much in advance. >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor-0bNBQ1PAWB4BXFe83j6qeQ at public.gmane.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > >
ADD REPLY
0
Entering edit mode
Hi Fabrice, I'm not sure whether there is a package which offers this out of the box. However, it should be easy to achieve with some standard functionality of bioconductor. 1. Import the sequences with the 'Rsamtools' or 'ShortRead' package. 2. Define your k-mers and count the occurrance with functions like 'countPattern' or 'vcountPattern' from the 'Biostrings' package. 3. Calculate your z-scores. Hope this helps. Best wishes Julian On 12/18/2013 06:38 PM, Fabrice Tourre wrote: > Julian, > > Thank you for your reply. > > For example, I have a sample fastq file. > >> chr1:14404-14435(-) > GGCACA >> chr1:14409-14440(-) > AAAACG >> chr1:14423-14454(-) > AGAGGC >> chr1:14424-14455(-) > AAGAGG > > I want to calculate 6kmers in this sample file. Also I have extract a > control fastq file. > >> chr1:14404-14435(-) > CCTACA >> chr1:14409-14440(-) > TCGACG >> chr1:14423-14454(-) > TCAGAT >> chr1:14424-14455(-) > CAAGGC > > The z-score was calculated for each pentamer as: > > (occurrence in sequences ? average occurrence in control sequences) / > standard deviation of occurrence in control sequences > > On Wed, Dec 18, 2013 at 4:07 AM, Julian Gehring <julian.gehring at="" embl.de=""> wrote: >> >> >> Hi Fabrice, >> >> Could you give an example for this, especially on what the z-score >> statistics is about to tell you? >> >> Best wishes >> Julian >> >> >> >> On 12/17/2013 07:28 PM, Fabrice Tourre wrote: >>> >>> Dear list, >>> >>> I have a list of bed regions. each region is 10bp length. I want to >>> calculate the hexamers and pentamers in theses regions and get the >>> zscore. Is there any existed packages to do this? >>> >>> Thank you very much in advance. >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor-0bNBQ1PAWB4BXFe83j6qeQ at public.gmane.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >
ADD REPLY
0
Entering edit mode
Julian, Thank you for your reply. For example, I have a sample fastq file. >chr1:14404-14435(-) GGCACA >chr1:14409-14440(-) AAAACG >chr1:14423-14454(-) AGAGGC >chr1:14424-14455(-) AAGAGG I want to calculate 6kmers in this sample file. Also I have extract a control fastq file. >chr1:14404-14435(-) CCTACA >chr1:14409-14440(-) TCGACG >chr1:14423-14454(-) TCAGAT >chr1:14424-14455(-) CAAGGC The z-score was calculated for each pentamer as: (occurrence in sequences ? average occurrence in control sequences) / standard deviation of occurrence in control sequences On Wed, Dec 18, 2013 at 4:07 AM, Julian Gehring <julian.gehring at="" embl.de=""> wrote: > > > Hi Fabrice, > > Could you give an example for this, especially on what the z-score > statistics is about to tell you? > > Best wishes > Julian > > > > On 12/17/2013 07:28 PM, Fabrice Tourre wrote: >> >> Dear list, >> >> I have a list of bed regions. each region is 10bp length. I want to >> calculate the hexamers and pentamers in theses regions and get the >> zscore. Is there any existed packages to do this? >> >> Thank you very much in advance. >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor-0bNBQ1PAWB4BXFe83j6qeQ at public.gmane.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > >
ADD REPLY

Login before adding your answer.

Traffic: 780 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6