Hey there,
I am new to Biostrings - any help appreciated.
Target:
I would like to scan all WIPO patent sequences from lens.org (https://www.lens.org/lens/bio/patseqdata#globe/WO/, aa-all.fa) and find out which patents contain a given sequence.
Beginning of the FASTA file:
>gnl|patseq|WO_2012_007458-2 Sequence 2 from pre-grant Patent WO_2012_007458
GCAGCTGCGCGCTCGCTCGCTCACTGAGGCCGCCCGGGCAAAGCCCGGGCGTCGGGCGACCTTTGGTCGCCCGGCCTCAGTGAGCGAGCGAGCGCGCAGAGAGGGAGTGGCCAACTCCATCACTAGGGGTTCCTTGTAGTTAATGATTAACCCGCCATGCTACTTATCTACGTAGCCATGCTCTAGACATGGCTCGACAGATCTCAATATTGGCCATTAGCCATATTATTCATTGGTTATATAGCATAAATCAATATTGGCTATTGGCCATTGCATACGTTGTATCTATATCATAATATGTACATTTATATTGGCTCATGTCCAATATGACCGCCATGTTGGCATTGATTATTGACTAGTTATTAATAGTAATCAATTACGGGGTCATTAGTTCATAGCCCATATATGGAGTTCCGCGTTACATAACTTACGGTAAATGGCCCGCCTGGCTGACCGCCCAACGACCCCCGCCCATTGACGTCAATAATGACGTATGTTCCCATAGTAACGCCAATAGGGACTTTCCATTGACGTCAATGGGTGGAGTATTTACGGTAAACTGCCCACTTGGCAGTACATCAAGTGTATCATATGCCAAGTCCGCCCCCTATTGACGTCAATGACGGTAAATGGCCCGCCTGGCATTATGCCCAGTACATGACCTTACGGGACTTTCCTACTTGGCAGTACATCTACGTATTAGTCATCGCTATTACCATGGTGATGCGGTTTTGGCAGTACACCAATGGGCGTGGATAGCGGTTTGACTCACGGGGATTTCCAAGTCTCCACCCCATTGACGTCAATGGGAGTTTGTTTTGGCACCAAAATCAACGGGACTTTCCAAAATGTCGTAACAACTGCGATCGCCCGCCCCGTTGACGCAAATGGGCGGTAGGCGTGTACGGTGGGAGGTCTATATAAGCAGAGCTCGTTTAGTGAACCGTCAGATCACTAGAAGCTTTATTGCGGTAGTTTATCACAGTTAAATTGCTAACGCAGTCAGTGCTTCTGACACAACAGTCTCGAACTTAAGCTGCAGTGACTCTCTTAAGGTAGCCTTGCAGAAGTTGGTCGTGAGGCACTGGGCAGGTAAGTATCAAGGTTACAAGACAGGTTTAAGGAGACCAATAGAAACTGGGCTTGTCGAGACAGAGAAGACTCTTGCGTTTCTGATAGGCACCTATTGGTCTTACTGACATCCACTTTGCCTTTCTCTCCACAGGTGTCCACTCCCAGTTCAATTACAGCTCTTAAGGCTAGAGTACTTAATACGACTCACTATAGGCTAGCCTCGAGAATTCCCTCAGCCAGACAGTCCTTACCTGCAACAGGTGGCCTCAGGAGTCAGGAACATCTCTACTTCCCCAACGACCCCTGGGTTGTCCTCTCAGAGATGGCTATGGATACTACAAGGTGTGGAGCCCAGTTGTTGACTCTGGTCGAGCAGATCCTGGCAGAGTTCCAGCTGCAGGAGGAAGACCTGAAGAAGGTGATGAGCCGGATGCAGAAGGAGATGGACCGTGGCCTGAGGCTGGAGACCCACGAGGAGGCCAGTGTAAAGATGTTACCCACCTACGTGCGTTCCACCCCAGAAGGCTCAGAAGTCGGAGACTTTCTCTCCTTAGACCTGGGAGGAACCAACTTCAGAGTGATGCTGGTCAAAGTGGGAGAGGGGGAGGCAGGGCAGTGGAGCGTGAAGACAAAACACCAGATGTACTCCATCCCCGAGGACGCCATGACGGGCACTGCCGAGATGCTCTTTGACTACATCTCTGAATGCATCTCTGACTTCCTTGACAAGCATCAGATGAAGCACAAGAAACTGCCCCTGGGCTTCACCTTCTCCTTCCCTGTGAGGCACGAAGACCTAGACAAGGGCATCCTCCTCAATTGGACCAAGGGCTTCAAGGCCTCTGGAGCAGAAGGGAACAACATCGTAGGACTTCTCCGAGATGCTATCAAGAGGAGAGGGGACTTTGAGATGGATGTGGTGGCAA
TGGTGAACGACACAGTGGCCACAATGATCTCCTGCTACTATGAAGACCGCCAATGTGAGGTCGGCATGATTGTGGGCACTGGCTGCAATGCCTGCTACATGGAGGAAATGCAGAATGTGGAGCTGGTGGAAGGGGATGAGGGACGCATGTGCGTCAACACGGAGTGGGGCGCCTTCGGGGACTCGGGCGAGCTGGATGAGTTCCTACTGGAGTATGACCGGATGGTGGATGAAAGCTCAGCGAACCCCGGTCAGCAGCTGTACGAGAAGATCATCGGTGGGAAGTATATGGGCGAGCTGGTACGACTTGTGCTGCTTAAGCTGGTGGACGAGAACCTTCTGTTCCACGGAGAGGCCTCGGAGCAGCTGCGCACGCGTGGTGCTTTTGAGACCCGTTTCGTGTCACAAGTGGAGAGCGACTCCGGGGACCGAAAGCAGATCCACAACATCCTAAGCACTCTGGGGCTTCGACCCTCTGTCACCGACTGCGACATTGTGCGCCGTGCCTGTGAAAGCGTGTCCACTCGCGCCGCCCATATGTGCTCCGCAGGACTAGCTGGGGTCATAAATCGCATGCGCGAAAGCCGCAGTGAGGACGTGATGCGCATCACTGTGGGCGTGGATGGCTCCGTGTACAAGCTGCACCCGAGCTTCAAGGAGCGGTTTCACGCCAGTGTGCGCAGGCTGACACCCAACTGCGAAATCACCTTCATCGAATCAGAGGAGGGCAGCGGCAGGGGAGCCGCACTGGTCTCTGCGGTGGCCTGCAAGAAGGCTTGCATGCTGGCCCAGTGAAATCCAGGTCATATGGACCGGGACCTGGGTTCCACGGGGACTCCACACACCACAAATGCTCCCAGCCCACCGGGGCAGGAGACCTATTCTGCTGCTACCCCTGGAAAATGGGGAGAGGCCCCTGCAAGCCGAGTCGGCCAGTGGGACAGCCCTAGGGCTCTCAGCCTGGGGCAGGGGGCTGGGAGGAAGAAGAGGATCAGAGGCGCCAAGGCCTTTCTTGCTAGAATCAACTACAGAAAATGGCGGAAAATACTCAGGACTTGCACTTTCACGATTCTTGCTTCCCAAGCGTGGGTCTGGCCTCCCAAGGGAATGCTTCCTGGACCTTGCAATGGCCTGGCTTCCCTGGGGGGGACACACCTTCATGGGGAGGTAACTTCAGCAGTTCGGCCAGACCAGACCCCAGGAGAGTAAGGGCTGCTAGTCACCCAGACCTGGCTGTTTTCTTGTCTGTGGCTGAAGAGGCCGGGGAGCCATGAGAGACTGACTATCCGGCTACATGGAGAGGACTTTCCAGGCATGAACATGCCAGAGACTGTTGCCTTCATATACCTCCACCCGAGTGGCTTACAGTTCTGGGATGAACCCTCCCAGGAGATGCCAGAGGTTAGAGCCCCAGAGTCCTTGCTCTAAGGGGACCAGAAAGGGGAGGCCTCACTCTGCACTATTCAAGCAGGAATCATCTCCAACACTCAGGTCCCTGACCCAGGAGGAAGAAGCCACCCTCAGTGTCCCTCCAAGAGACCACCCAGGTCCTTCTCTCCCTCGTTCCCAAATGCCAGCCTCTCTACCTGGGACTGTGGGGGAGTTTTTAATTAAATATTTAAAACTACTTCAAAAAAAAAAAAAGGAATTCACGCGTGGTACCTCTAGAGTCGACCCGGGCGGCCGCTTCCCTTTAGTGAGGGTTAATGCTTCGAGCAGACATGATAAGATACATTGATGAGTTTGGACAAACCACAACTAGAATGCAGTGAAAAAAATGCTTTATTTGTGAAATTTGTGATGCTATTGCTTTATTTGTAACCATTATAAGCTGCAATAAACAAGTTAACAACAACAATTGCATTCATTTTATGTTTCAGGTTCAGGGGGAGATGTGGGAGGTTTTTTAAAGCAAGTAAAACCTCTACAAATGTGGTAAAATCCGATAAGGGACTAGAGCATGGCTACGTAGATAAGTAGCATGGCGGGTTAATCATTAAC
TACAAGGAACCCCTAGTGATGGAGTTGGCCACTCCCTCTCTGCGCGCTCGCTCGCTCACTGAGGCCGGGCGACCAAAGGTCGCCCGACGCCCGGGCTTTGCCCGGGCGGCCTCAGTGAGCGAGCGAGCGCGC
As my computer (8 GB RAM) cannot load the whole FASTA file into memory, I tried to use an indexed FASTA file.
What I have done so far:
library(Rsamtools)

indexFa("Lens PatSeq/aa-all.fa")      # create an index of file 'aa-all.fa'
fa = FaFile("Lens PatSeq/aa-all.fa")  # reference the fasta file and its index
gr = as(seqinfo(fa), "GRanges")

searchpattern <- DNAString("GGGCCCAAGTTCACTTAAAAAGGAGATCAACAATGAAAGCAATTTTCGTACTGAAACATCTTAATCATGCACAGGAGACTTTCTAATG")
sourcepattern <- getSeq(fa, gr[2:1])
mindex <- vmatchPattern(searchpattern, sourcepattern)

nmatch_per_seq <- elementLengths(mindex)  # Get the number of matches per subject element.
sum(nmatch_per_seq)    # Total number of matches.
table(nmatch_per_seq)
My Question:
Is this the correct way to use vmatchPattern to search for a given sequence in an indexed FASTA file?
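To make clearer what I am aiming for, here is a rough sketch of scanning the indexed file in chunks so that only part of it is in memory at any time. This is only a sketch under assumptions, not tested on the full file: the helper name scan_fasta_in_chunks and the chunk_size value are made up for illustration, I assume all records are nucleotide sequences that can be read as a DNAStringSet, and only exact matches on the given strand are counted.

```r
library(Rsamtools)   # indexFa(), FaFile(), scanFaIndex(), getSeq()
library(Biostrings)  # DNAString(), vcountPattern()

# Hypothetical helper (name and defaults are illustrative only):
# returns the names of all FASTA records that contain `query`.
scan_fasta_in_chunks <- function(fa_path, query, chunk_size = 10000L) {
  # Create the .fai index only if it is not there yet.
  if (!file.exists(paste0(fa_path, ".fai"))) indexFa(fa_path)
  fa <- FaFile(fa_path)
  gr <- scanFaIndex(fa)                  # one full-length range per record
  rec_names <- as.character(seqnames(gr))
  hits <- character(0)
  if (length(gr) == 0L) return(hits)

  for (first in seq(1L, length(gr), by = chunk_size)) {
    last <- min(first + chunk_size - 1L, length(gr))
    idx <- first:last
    seqs <- getSeq(fa, gr[idx])            # load only this chunk
    n_match <- vcountPattern(query, seqs)  # exact matches per record
    hits <- c(hits, rec_names[idx][n_match > 0L])
  }
  hits
}
```

With my file this would be called as scan_fasta_in_chunks("Lens PatSeq/aa-all.fa", searchpattern), and the returned record names (e.g. gnl|patseq|WO_2012_007458-2) would identify the patents.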
Many thanks in advance!
Warm regards,
Giulio
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
    LC_MONETARY=English_United States.1252 LC_NUMERIC=C
    LC_TIME=English_United States.1252

attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages:
[1] Rsamtools_1.22.0 Biostrings_2.38.0 XVector_0.10.0 GenomicRanges_1.22.1
    GenomeInfoDb_1.6.1 IRanges_2.4.1 S4Vectors_0.8.1 BiocGenerics_0.16.1

loaded via a namespace (and not attached):
[1] zlibbioc_1.16.0 futile.logger_1.4.1 tools_3.2.2 lambda.r_1.1.7
    futile.options_1.0.0 BiocParallel_1.4.0 bitops_1.0-6