Entering edit mode
Kristoffer Rigbolt
▴
10
@kristoffer-rigbolt-3460
Last seen 10.4 years ago
Hi Bioconductor
I was pleased to find the matchPWM function in the Biostrings package,
but I have one fundamental question to the function, or maybe to the
general approach.
To test the specificity of the approach I started out from the example
given in the matchPWM vignette:
pwm <- rbind(A=c( 1, 0, 19, 20, 18, 1, 20, 7),
C=c( 1, 0, 1, 0, 1, 18, 0, 2),
G=c(17, 0, 0, 0, 1, 0, 0, 3),
T=c( 1, 20, 0, 0, 0, 1, 0, 8))
chr3R <- unmasked(Dmelanogaster$chr3R)
Next, I generated a randomized DNA string with nucleotide frequencies
identical to chr3R:
propBP <- prop.table(table(s2c(as.character(chr3R))))
chr3R.random <- sample(names(propBP),
length(s2c(as.character(chr3R))),
TRUE, propBP)
chr3R.random <- DNAString(c2s(chr3R.random))
To extract the number of occurrences of the PWM in the two DNA strings
as a function of the minimum score threshold I used this for-loop:
outMat <- cbind(seq(80,100,2), NA, NA)
for(i in 1:nrow(outMat)){
outMat[i,2] <- countPWM(pwm, chr3R,
min.score=c2s(c(outMat[i,1],
"%")))
outMat[i,3] <- countPWM(pwm, chr3R.random,
min.score=c2s(c(outMat[i,1], "%")))
}
As expected the number of matches decrease as the minimum score are
increased, but much to my surprise an almost identical number of
matches
of the PWM in the D. melanogaster DNA string and the randomized DNA
string occurred. I believe this corresponds to a FDR of 100%(!).
> outMat
min.score chr3R randomized
[1,] 80 52695 51122
[2,] 82 52695 51122
[3,] 84 43225 39996
[4,] 86 27684 24943
[5,] 88 14037 10459
[6,] 90 2836 2435
[7,] 92 2836 2435
[8,] 94 2836 2435
[9,] 96 2836 2435
[10,] 98 1878 1376
[11,] 100 660 722
Based on this observation my question is whether this issue relates to
a
property of the matchPWM function itself or if it is an inherent
property of using PWMs to predict transcription factor binding sites?
If
the latter is the case what is then the scientific value of this
approach.
Hope you can help clarify this issue.
Best regards
Kristoffer Rigbolt
University of Southern Denmark
> sessionInfo()
R version 2.9.0 (2009-04-17)
i386-pc-mingw32
locale:
LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United
Kingdom.1252;LC_MONETARY=English_United
Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] seqinr_2.0-3
[2] BSgenome.Dmelanogaster.UCSC.dm3_1.3.11
[3] BSgenome_1.10.5
[4] Biostrings_2.10.22
[5] IRanges_1.0.16
loaded via a namespace (and not attached):
[1] grid_2.9.0 lattice_0.17-25 Matrix_0.999375-26
tools_2.9.0
[[alternative HTML version deleted]]