Entering edit mode
Hi Lax,
On 09/25/2013 09:52 AM, Lakshmanan Iyer wrote:
> In Bioconductor, Is it possible to create a PWM from DNA sequence
alignment
> with gaps to be used to scan sequences?
>
> When I try it as suggested in "Biostrings" it gives the following
error:
>
>> pwm <- PWM(pfm)
> Error in .normargPfm(x) :
> invalid PFM 'x': IUPAC ambiguity letters are represented
>
> The help for "PWM" does say "with no IUPAC ambiguity letters" and
Gaps are
> ambiguity letters.
Gaps in alignments are not represented with ambiguity letters but with
the "-" letter:
> library(BSgenome.Dmelanogaster.UCSC.dm3)
> pairwiseAlignment("ACCACTATCCATTACGGGCCCAGT",
unmasked(Dmelanogaster$chr2L), type="global-local")
Global-Local PairwiseAlignmentsSingleSubject (1 of 1)
pattern: [1] ACCACTATCCATTACGGGCCCAGT
subject: [1216159] ACCACTAT-CATTACGG--CCAGT
score: 9.616879
So let's say you want to compute the consensus matrix from a
collection of DNA sequences that contain alignment gaps:
> dna <- DNAStringSet(c("CTAATGG--CCAGT",
"AT-CATTG-CCAGA",
"CTATCAACGG--AG"))
> cm <- consensusMatrix(dna)
> dim(cm)
[1] 17 14
> rownames(cm)
[1] "A" "C" "G" "T" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N"
"-" "+"
If you try to pass this matrix directly to PWM(), it will be rejected
with the error you reported because PWM() only accepts a position
frequency matrix with 4 rows: 1 for each DNA base letter.
> PWM(cm)
Error in .normargPfm(x) :
invalid PFM 'x': IUPAC ambiguity letters are represented
I realize the error message is mis-leading (because "-" is not an
ambiguity letter) and I'm going to modify it to make it a little bit
more informative.
To work around this, one might want to subset 'cm' to keep only the
A, C, G, and T rows before passing it to PWM(). Unfortunately that
doesn't work either because the Wasserman & Sandelin algo used
internally by PWM() to compute a Position Weight Matrix from a
Position
Frequency Matrix expects the latter to have all its columns sum up to
the same value:
> PWM(cm[DNA_BASES, ])
Error in .normargPfm(x) :
invalid PFM 'x': all columns in 'x' must sum to the same value.
If the PFM was obtained by calling consensusMatrix() on a
DNAStringSet
object, please make sure that this object is rectangular (i.e.
has a
constant width).
To me this sounds like a legitimate situation for bypassing the
pfm-to-pwm transformation and use the pfm directly as a pwm:
pwm <- cm[DNA_BASES, ]
matchPWM(pwm, subject)
Hope this helps,
H.
>
>> sessionInfo()
> R version 3.0.1 (2013-05-16)
> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] parallel stats graphics grDevices utils datasets
methods
> [8] base
>
> other attached packages:
> [1] Biostrings_2.28.0 IRanges_1.18.3 BiocGenerics_0.6.0
>
> loaded via a namespace (and not attached):
> [1] stats4_3.0.1 tools_3.0.1
>
> -best
> -Lax
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319