PWM matching using matchPWM in Biostrings

0

Entering edit mode

Kristoffer Rigbolt ▴ 10

@kristoffer-rigbolt-3460

Last seen 10.6 years ago

Hi Bioconductor I was pleased to find the matchPWM function in the Biostrings package, but I have one fundamental question to the function, or maybe to the general approach. To test the specificity of the approach I started out from the example given in the matchPWM vignette: pwm <- rbind(A=c( 1, 0, 19, 20, 18, 1, 20, 7), C=c( 1, 0, 1, 0, 1, 18, 0, 2), G=c(17, 0, 0, 0, 1, 0, 0, 3), T=c( 1, 20, 0, 0, 0, 1, 0, 8)) chr3R <- unmasked(Dmelanogaster$chr3R) Next, I generated a randomized DNA string with nucleotide frequencies identical to chr3R: propBP <- prop.table(table(s2c(as.character(chr3R)))) chr3R.random <- sample(names(propBP), length(s2c(as.character(chr3R))), TRUE, propBP) chr3R.random <- DNAString(c2s(chr3R.random)) To extract the number of occurrences of the PWM in the two DNA strings as a function of the minimum score threshold I used this for-loop: outMat <- cbind(seq(80,100,2), NA, NA) for(i in 1:nrow(outMat)){ outMat[i,2] <- countPWM(pwm, chr3R, min.score=c2s(c(outMat[i,1], "%"))) outMat[i,3] <- countPWM(pwm, chr3R.random, min.score=c2s(c(outMat[i,1], "%"))) } As expected the number of matches decrease as the minimum score are increased, but much to my surprise an almost identical number of matches of the PWM in the D. melanogaster DNA string and the randomized DNA string occurred. I believe this corresponds to a FDR of 100%(!). > outMat min.score chr3R randomized [1,] 80 52695 51122 [2,] 82 52695 51122 [3,] 84 43225 39996 [4,] 86 27684 24943 [5,] 88 14037 10459 [6,] 90 2836 2435 [7,] 92 2836 2435 [8,] 94 2836 2435 [9,] 96 2836 2435 [10,] 98 1878 1376 [11,] 100 660 722 Based on this observation my question is whether this issue relates to a property of the matchPWM function itself or if it is an inherent property of using PWMs to predict transcription factor binding sites? If the latter is the case what is then the scientific value of this approach. Hope you can help clarify this issue. Best regards Kristoffer Rigbolt University of Southern Denmark > sessionInfo() R version 2.9.0 (2009-04-17) i386-pc-mingw32 locale: LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] seqinr_2.0-3 [2] BSgenome.Dmelanogaster.UCSC.dm3_1.3.11 [3] BSgenome_1.10.5 [4] Biostrings_2.10.22 [5] IRanges_1.0.16 loaded via a namespace (and not attached): [1] grid_2.9.0 lattice_0.17-25 Matrix_0.999375-26 tools_2.9.0 [[alternative HTML version deleted]]

Transcription BSgenome Biostrings BSgenome Transcription BSgenome Biostrings BSgenome • 2.6k views

ADD COMMENT • link updated 15.9 years ago by Patrick Aboyoun ★ 1.6k • written 15.9 years ago by Kristoffer Rigbolt ▴ 10

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 1 day ago

Seattle, WA, United States

Hi Kristoffer, Not easy to read/reproduce your code since you don't show what the c2s() or s2c() functions are. I guess c2s() is turning a character vector into a single string, e.g. something like c2s <- function(x) paste(x, collapse="") and s2c() is exploding it e.g. something like s2c <- function(x) strsplit(x, NULL, fixed=TRUE)[[1]] more below... Kristoffer Rigbolt wrote: > Hi Bioconductor > > > > I was pleased to find the matchPWM function in the Biostrings package, > but I have one fundamental question to the function, or maybe to the > general approach. > > > > To test the specificity of the approach I started out from the example > given in the matchPWM vignette: > > > > pwm <- rbind(A=c( 1, 0, 19, 20, 18, 1, 20, 7), > > C=c( 1, 0, 1, 0, 1, 18, 0, 2), > > G=c(17, 0, 0, 0, 1, 0, 0, 3), > > T=c( 1, 20, 0, 0, 0, 1, 0, 8)) > > > > chr3R <- unmasked(Dmelanogaster$chr3R) > > > > Next, I generated a randomized DNA string with nucleotide frequencies > identical to chr3R: > > > > propBP <- prop.table(table(s2c(as.character(chr3R)))) Note that alphabetFrequency(chr3R, baseOnly=TRUE, freq=TRUE)[1:4] is *much* faster! > > chr3R.random <- sample(names(propBP), length(s2c(as.character(chr3R))), > TRUE, propBP) > > chr3R.random <- DNAString(c2s(chr3R.random)) > > > > To extract the number of occurrences of the PWM in the two DNA strings > as a function of the minimum score threshold I used this for-loop: > > > > outMat <- cbind(seq(80,100,2), NA, NA) > > for(i in 1:nrow(outMat)){ > > outMat[i,2] <- countPWM(pwm, chr3R, min.score=c2s(c(outMat[i,1], > "%"))) > > outMat[i,3] <- countPWM(pwm, chr3R.random, > min.score=c2s(c(outMat[i,1], "%"))) > > } > > > > As expected the number of matches decrease as the minimum score are > increased, but much to my surprise an almost identical number of matches > of the PWM in the D. melanogaster DNA string and the randomized DNA > string occurred. I believe this corresponds to a FDR of 100%(!). The 'pwm' given in the man page of matchPWM() (and that you are using here) was chosen arbitrarily and is very short i.e. it has a small number of cols. More precisely the consensus string associated with this 'pwm' is GTAAACAT and the distribution of such a small pattern in a genome of the size of the Fly genome can almost be considered random. So I'm not too surprised that there is no significant difference between chr3R and your chr3R.random. > > > >> outMat > > min.score chr3R randomized > > [1,] 80 52695 51122 > > [2,] 82 52695 51122 > > [3,] 84 43225 39996 > > [4,] 86 27684 24943 > > [5,] 88 14037 10459 > > [6,] 90 2836 2435 > > [7,] 92 2836 2435 > > [8,] 94 2836 2435 > > [9,] 96 2836 2435 > > [10,] 98 1878 1376 > > [11,] 100 660 722 > > > > Based on this observation my question is whether this issue relates to a > property of the matchPWM function itself or if it is an inherent > property of using PWMs to predict transcription factor binding sites? If > the latter is the case what is then the scientific value of this > approach. I don't think this is a property of the matchPWM() function itself. You should get almost the same if you counted the occurences of GTAAACAT with countPattern(). I think this is just due to the length of the pattern: there *must* be a lot of occurences of such a short pattern. Furthermore, if you counted the occurences of all the possible 7-mers in a given chromosome (with oligonucleotideFrequency()), you would always get numbers very close to what you get with a fake (i.e. randomly-generated) chromosome. As long as the length of the chromosome is much bigger than 4^7 (at least 10x or 20x bigger). Cheers, H. > > > > Hope you can help clarify this issue. > > > > Best regards > > Kristoffer Rigbolt > > University of Southern Denmark > > > >> sessionInfo() > > R version 2.9.0 (2009-04-17) > > i386-pc-mingw32 > > > > locale: > > LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United > Kingdom.1252;LC_MONETARY=English_United > Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252 > > > > attached base packages: > > [1] stats graphics grDevices utils datasets methods base > > > > > other attached packages: > > [1] seqinr_2.0-3 > > [2] BSgenome.Dmelanogaster.UCSC.dm3_1.3.11 > > [3] BSgenome_1.10.5 > > [4] Biostrings_2.10.22 > > [5] IRanges_1.0.16 > > > > loaded via a namespace (and not attached): > > [1] grid_2.9.0 lattice_0.17-25 Matrix_0.999375-26 tools_2.9.0 > > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD COMMENT • link 15.9 years ago Hervé Pagès 16k

0

Entering edit mode

Patrick Aboyoun ★ 1.6k

@patrick-aboyoun-6734

Last seen 10.6 years ago

United States

Kristoffer, I don't know the details of the matchPWM and countPWM examples in the man page for matchPWM, but my guess is that they were created to show how the software is used rather than demonstrating a hunt for a real TF binding site in D. melanogaster using a PWM. From my experience, PWM are useful for a first-order approximation of where transcription factors bind. One nice aspect they have is they lend themselves to relatively fast computations. The downside is that the strong PWM assumption of positional independence most likely does not hold for the TF-binding site of interest, so it is hard to believe the hits you get constitute reality. Patrick Kristoffer Rigbolt wrote: > Hi Bioconductor > > > > I was pleased to find the matchPWM function in the Biostrings package, > but I have one fundamental question to the function, or maybe to the > general approach. > > > > To test the specificity of the approach I started out from the example > given in the matchPWM vignette: > > > > pwm <- rbind(A=c( 1, 0, 19, 20, 18, 1, 20, 7), > > C=c( 1, 0, 1, 0, 1, 18, 0, 2), > > G=c(17, 0, 0, 0, 1, 0, 0, 3), > > T=c( 1, 20, 0, 0, 0, 1, 0, 8)) > > > > chr3R <- unmasked(Dmelanogaster$chr3R) > > > > Next, I generated a randomized DNA string with nucleotide frequencies > identical to chr3R: > > > > propBP <- prop.table(table(s2c(as.character(chr3R)))) > > chr3R.random <- sample(names(propBP), length(s2c(as.character(chr3R))), > TRUE, propBP) > > chr3R.random <- DNAString(c2s(chr3R.random)) > > > > To extract the number of occurrences of the PWM in the two DNA strings > as a function of the minimum score threshold I used this for-loop: > > > > outMat <- cbind(seq(80,100,2), NA, NA) > > for(i in 1:nrow(outMat)){ > > outMat[i,2] <- countPWM(pwm, chr3R, min.score=c2s(c(outMat[i,1], > "%"))) > > outMat[i,3] <- countPWM(pwm, chr3R.random, > min.score=c2s(c(outMat[i,1], "%"))) > > } > > > > As expected the number of matches decrease as the minimum score are > increased, but much to my surprise an almost identical number of matches > of the PWM in the D. melanogaster DNA string and the randomized DNA > string occurred. I believe this corresponds to a FDR of 100%(!). > > > > >> outMat >> > > min.score chr3R randomized > > [1,] 80 52695 51122 > > [2,] 82 52695 51122 > > [3,] 84 43225 39996 > > [4,] 86 27684 24943 > > [5,] 88 14037 10459 > > [6,] 90 2836 2435 > > [7,] 92 2836 2435 > > [8,] 94 2836 2435 > > [9,] 96 2836 2435 > > [10,] 98 1878 1376 > > [11,] 100 660 722 > > > > Based on this observation my question is whether this issue relates to a > property of the matchPWM function itself or if it is an inherent > property of using PWMs to predict transcription factor binding sites? If > the latter is the case what is then the scientific value of this > approach. > > > > Hope you can help clarify this issue. > > > > Best regards > > Kristoffer Rigbolt > > University of Southern Denmark > > > > >> sessionInfo() >> > > R version 2.9.0 (2009-04-17) > > i386-pc-mingw32 > > > > locale: > > LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United > Kingdom.1252;LC_MONETARY=English_United > Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252 > > > > attached base packages: > > [1] stats graphics grDevices utils datasets methods base > > > > > other attached packages: > > [1] seqinr_2.0-3 > > [2] BSgenome.Dmelanogaster.UCSC.dm3_1.3.11 > > [3] BSgenome_1.10.5 > > [4] Biostrings_2.10.22 > > [5] IRanges_1.0.16 > > > > loaded via a namespace (and not attached): > > [1] grid_2.9.0 lattice_0.17-25 Matrix_0.999375-26 tools_2.9.0 > > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD COMMENT • link 15.9 years ago Patrick Aboyoun ★ 1.6k

Login before adding your answer.