readFasta and gzipped FASTA files
1
0
Entering edit mode
@ivan-gregoretti-3975
Last seen 10.2 years ago
Canada
Hello everybody, The library ShortRead includes two very useful functions: readFastq() and readFasta() While readFastq() can open FASTQ files as either plain text or gzipped files, readFasta() can only open files in plain text. For example: # FASTQ: success > readFastq("t01213R0QU.fq.gz") class: ShortReadQ length: 43608 reads; width: 178..486 cycles # FASTA: failure > readFasta("t01213R0QU.P.fa.gz") Error in .normargInputFilepath(filepath) : file "t01213R0QU.P.fa.gz" has unsupported type: gzfile Is this the current status or it's time for me to update my BioC? Can someone offer a work around that does not involve decompressing the FASTA file to disc? I tried, yet unsuccessfully: readFasta(gzfile("t01213R0QU.P.fa.gz","r")) Error in function (classes, fdef, mtable) : unable to find an inherited method for function ?readFasta? for signature ?"gzfile"? Thank you, Ivan > sessionInfo() R Under development (unstable) (2012-11-30 r61184) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] Rsamtools_1.11.16 Biostrings_2.27.11 GenomicRanges_1.11.29 [4] IRanges_1.17.32 BiocGenerics_0.5.6 loaded via a namespace (and not attached): [1] bitops_1.0-5 stats4_2.16.0 tools_2.16.0 zlibbioc_1.5.0
ShortRead ShortRead • 6.5k views
ADD COMMENT
0
Entering edit mode
@martin-morgan-1513
Last seen 4 months ago
United States
On 2/14/2013 12:32 PM, Ivan Gregoretti wrote: > Hello everybody, > > The library ShortRead includes two very useful functions: readFastq() > and readFasta() > > While readFastq() can open FASTQ files as either plain text or gzipped > files, readFasta() can only open files in plain text. > > For example: > > # FASTQ: success >> readFastq("t01213R0QU.fq.gz") > class: ShortReadQ > length: 43608 reads; width: 178..486 cycles > > # FASTA: failure >> readFasta("t01213R0QU.P.fa.gz") > Error in .normargInputFilepath(filepath) : > file "t01213R0QU.P.fa.gz" has unsupported type: gzfile > > > Is this the current status or it's time for me to update my BioC? > > Can someone offer a work around that does not involve decompressing > the FASTA file to disc? I tried, yet unsuccessfully: readFasta is just an alias for readDNAStringSet, which does not support compressed files. Not sure exactly what is in your fasta file, but Rsamtools::FaFile might be what you are looking for (typically, relatively few long sequences; you need to create an index (using indexFa) if one does not already exist then something along the lines of fa = FaFile("some.fa.gz") ## index is some.fa.gz. scanFa(fa, param=scanFaIndex(fa)) would read the whole file, or a more selective GRanges to go after particular sub-sequences. A workaround might also involve http://comments.gmane.org/gmane.comp.lang.r.sequencing/1981 all <- readLines("s_1.fa.gz") sread <- DNAStringSet(all[c(FALSE, TRUE)]) id <- BStringSet(all[c(TRUE, FALSE)]) fas <- ShortRead(sread=sread, id=id) > > readFasta(gzfile("t01213R0QU.P.fa.gz","r")) > Error in function (classes, fdef, mtable) : > unable to find an inherited method for function ?readFasta? for > signature ?"gzfile"? > > > Thank you, > > Ivan > > >> sessionInfo() > R Under development (unstable) (2012-11-30 r61184) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods > [8] base > > other attached packages: > [1] Rsamtools_1.11.16 Biostrings_2.27.11 GenomicRanges_1.11.29 > [4] IRanges_1.17.32 BiocGenerics_0.5.6 > > loaded via a namespace (and not attached): > [1] bitops_1.0-5 stats4_2.16.0 tools_2.16.0 zlibbioc_1.5.0 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Dr. Martin Morgan, PhD Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
ADD COMMENT
0
Entering edit mode
Thank you Martin. I realize now that I have faced this problem before and that I can work around it with the ShortRead constructor. I appreciate it. Ivan On 14 Feb 2013 21:24, "Martin Morgan" <mtmorgan@fhcrc.org> wrote: > On 2/14/2013 12:32 PM, Ivan Gregoretti wrote: > >> Hello everybody, >> >> The library ShortRead includes two very useful functions: readFastq() >> and readFasta() >> >> While readFastq() can open FASTQ files as either plain text or gzipped >> files, readFasta() can only open files in plain text. >> >> For example: >> >> # FASTQ: success >> >>> readFastq("t01213R0QU.fq.gz") >>> >> class: ShortReadQ >> length: 43608 reads; width: 178..486 cycles >> >> # FASTA: failure >> >>> readFasta("t01213R0QU.P.fa.gz"**) >>> >> Error in .normargInputFilepath(**filepath) : >> file "t01213R0QU.P.fa.gz" has unsupported type: gzfile >> >> >> Is this the current status or it's time for me to update my BioC? >> >> Can someone offer a work around that does not involve decompressing >> the FASTA file to disc? I tried, yet unsuccessfully: >> > > readFasta is just an alias for readDNAStringSet, which does not support > compressed files. > > Not sure exactly what is in your fasta file, but Rsamtools::FaFile might > be what you are looking for (typically, relatively few long sequences; you > need to create an index (using indexFa) if one does not already exist then > something along the lines of > > fa = FaFile("some.fa.gz") ## index is some.fa.gz. > scanFa(fa, param=scanFaIndex(fa)) > > would read the whole file, or a more selective GRanges to go after > particular sub-sequences. > > A workaround might also involve > > http://comments.gmane.org/**gmane.comp.lang.r.sequencing/**1981<http :="" comments.gmane.org="" gmane.comp.lang.r.sequencing="" 1981=""> > > all <- readLines("s_1.fa.gz") > sread <- DNAStringSet(all[c(FALSE, TRUE)]) > id <- BStringSet(all[c(TRUE, FALSE)]) > fas <- ShortRead(sread=sread, id=id) > > > >> readFasta(gzfile("t01213R0QU.**P.fa.gz","r")) >> Error in function (classes, fdef, mtable) : >> unable to find an inherited method for function ‘readFasta’ for >> signature ‘"gzfile"’ >> >> >> Thank you, >> >> Ivan >> >> >> sessionInfo() >>> >> R Under development (unstable) (2012-11-30 r61184) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=C LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] parallel stats graphics grDevices utils datasets methods >> [8] base >> >> other attached packages: >> [1] Rsamtools_1.11.16 Biostrings_2.27.11 GenomicRanges_1.11.29 >> [4] IRanges_1.17.32 BiocGenerics_0.5.6 >> >> loaded via a namespace (and not attached): >> [1] bitops_1.0-5 stats4_2.16.0 tools_2.16.0 zlibbioc_1.5.0 >> >> ______________________________**_________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.e="" thz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: http://news.gmane.org/gmane.** >> science.biology.informatics.**conductor<http: news.gmane.org="" gmane="" .science.biology.informatics.conductor=""> >> >> > > -- > Dr. Martin Morgan, PhD > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. > PO Box 19024 Seattle, WA 98109 > [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
Actually, now I also remember what problems I encountered with that strategy: not all my FASTA records are limited to 2 lines. Some sequences are long and span 6 lines. So, I can't take advantage of the elegant solution sread <- DNAStringSet(all[c(FALSE, TRUE)]) I'll have to think about a strategy where I read the file one line at a time while I identify the start of a new record every time the character ">" is found at the beginning of the line. Still, thank you. Ivan On 14 Feb 2013 22:50, "Ivan Gregoretti" <ivangreg@gmail.com> wrote: > Thank you Martin. I realize now that I have faced this problem before and > that I can work around it with the ShortRead constructor. > > I appreciate it. > > Ivan > On 14 Feb 2013 21:24, "Martin Morgan" <mtmorgan@fhcrc.org> wrote: > >> On 2/14/2013 12:32 PM, Ivan Gregoretti wrote: >> >>> Hello everybody, >>> >>> The library ShortRead includes two very useful functions: readFastq() >>> and readFasta() >>> >>> While readFastq() can open FASTQ files as either plain text or gzipped >>> files, readFasta() can only open files in plain text. >>> >>> For example: >>> >>> # FASTQ: success >>> >>>> readFastq("t01213R0QU.fq.gz") >>>> >>> class: ShortReadQ >>> length: 43608 reads; width: 178..486 cycles >>> >>> # FASTA: failure >>> >>>> readFasta("t01213R0QU.P.fa.gz"**) >>>> >>> Error in .normargInputFilepath(**filepath) : >>> file "t01213R0QU.P.fa.gz" has unsupported type: gzfile >>> >>> >>> Is this the current status or it's time for me to update my BioC? >>> >>> Can someone offer a work around that does not involve decompressing >>> the FASTA file to disc? I tried, yet unsuccessfully: >>> >> >> readFasta is just an alias for readDNAStringSet, which does not support >> compressed files. >> >> Not sure exactly what is in your fasta file, but Rsamtools::FaFile might >> be what you are looking for (typically, relatively few long sequences; you >> need to create an index (using indexFa) if one does not already exist then >> something along the lines of >> >> fa = FaFile("some.fa.gz") ## index is some.fa.gz. >> scanFa(fa, param=scanFaIndex(fa)) >> >> would read the whole file, or a more selective GRanges to go after >> particular sub-sequences. >> >> A workaround might also involve >> >> http://comments.gmane.org/**gmane.comp.lang.r.sequencing/**1981<htt p:="" comments.gmane.org="" gmane.comp.lang.r.sequencing="" 1981=""> >> >> all <- readLines("s_1.fa.gz") >> sread <- DNAStringSet(all[c(FALSE, TRUE)]) >> id <- BStringSet(all[c(TRUE, FALSE)]) >> fas <- ShortRead(sread=sread, id=id) >> >> >> >>> readFasta(gzfile("t01213R0QU.**P.fa.gz","r")) >>> Error in function (classes, fdef, mtable) : >>> unable to find an inherited method for function ‘readFasta’ for >>> signature ‘"gzfile"’ >>> >>> >>> Thank you, >>> >>> Ivan >>> >>> >>> sessionInfo() >>>> >>> R Under development (unstable) (2012-11-30 r61184) >>> Platform: x86_64-unknown-linux-gnu (64-bit) >>> >>> locale: >>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>> [7] LC_PAPER=C LC_NAME=C >>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>> >>> attached base packages: >>> [1] parallel stats graphics grDevices utils datasets methods >>> [8] base >>> >>> other attached packages: >>> [1] Rsamtools_1.11.16 Biostrings_2.27.11 GenomicRanges_1.11.29 >>> [4] IRanges_1.17.32 BiocGenerics_0.5.6 >>> >>> loaded via a namespace (and not attached): >>> [1] bitops_1.0-5 stats4_2.16.0 tools_2.16.0 zlibbioc_1.5.0 >>> >>> ______________________________**_________________ >>> Bioconductor mailing list >>> Bioconductor@r-project.org >>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.="" ethz.ch="" mailman="" listinfo="" bioconductor=""> >>> Search the archives: http://news.gmane.org/gmane.** >>> science.biology.informatics.**conductor<http: news.gmane.org="" gman="" e.science.biology.informatics.conductor=""> >>> >>> >> >> -- >> Dr. Martin Morgan, PhD >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N. >> PO Box 19024 Seattle, WA 98109 >> > [[alternative HTML version deleted]]
ADD REPLY

Login before adding your answer.

Traffic: 656 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6