daniel.magnus.bader
Dear all,
I am creating a local organism-specific version of UniProt on our server.
Using the Bioconductor Biostrings package, I ran into an "embedded nul in string" error with the readAAStringSet() function, but only on the TrEMBL download, not on the Swiss-Prot FASTA file.
Q1: Can you reproduce the error?
Q2: How can it be fixed?
Session info:
- R 3.5.2
- Biostrings 2.50.2
Below I provide the error message and the code to reproduce the error for a specific protein. Note that the download and the index creation are time- and memory-consuming steps.
Best, Daniel
# Download uniprot trembl fasta sequences
# to server with ~100GB memory
#
dl_link <- "ftp://ftp.uniprot.org//pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.fasta.gz"
# file size 65GB
file_db <- "uniprot_trembl.fasta.gz"
# Use a server for that step,
# takes up to 80GB RAM during generation
library(data.table)
library(Biostrings)
fai_trembl <- as.data.table(fasta.index(file_db))
# search for a specific human protein
# and retrieve "recno" index
fai_trembl[grepl("Q5HYB6_HUMAN", desc)]
# read the sequence of this protein from file
# using the precomputed index
readAAStringSet(fai_trembl[12538693, ])
# ERROR MESSAGE:
#
# A AAStringSet instance of length 1
# width seq names
#Error in .Call2("new_CHARACTER_from_XString", x, xs_dec_lkup(x), PACKAGE = "Biostrings") :
# embedded nul in string: 'PIAALGAKLNTWTYRWMAA\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0'
Hi,
Such big files are insane!
Ok, so I have access to a server with a lot of memory (384GB) so that should be enough. However I started the download of the uniprot_trembl.fasta.gz file (using wget from the Unix command line) and I see an ETA of more than 5h! The download speed I get is only 2 MB/s which is very slow. We have fast internet here at my institution (e.g. I easily see download speeds of 30 MB/s or more when downloading from other places) so it seems that the bottleneck is on the UniProt FTP server side.

We cannot exclude that the file actually contains embedded \0 bytes. Maybe somehow they got introduced when the file was compressed, or the file got corrupted during your download. Do the UniProt people provide md5sums somewhere for their files so we can check them? This is pretty standard practice for institutions that provide big files for download.

Right now the AA alphabet is not enforced for AAStringSet objects so the readAAStringSet() function will accept any byte value, even the \0 byte. \0 bytes in the file are treated like any other byte value so they would end up in the AAStringSet object. Here is how such an object can be created:

Note that, strictly speaking, there is nothing wrong with the object itself in the sense that you can do most of the usual operations on it:

It's just that it cannot be displayed or coerced to an ordinary character vector (the show() method for these objects actually calls as.character() on the parts of the object that are displayed):

If we trim the first 4 bytes, then the object can be displayed:
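The same behaviour can be illustrated with plain base R, without Biostrings. This is only a sketch on a toy raw vector (the 4 leading nul bytes and the peptide are made up for the illustration, and this is not the Biostrings code discussed above): conversion to a character string fails as long as the nul bytes are present, and works once the first 4 bytes are dropped.

```r
# Toy byte vector (assumption for illustration): 4 leading nul bytes
# followed by a short peptide.
bytes <- c(as.raw(rep(0, 4)), charToRaw("PIAALGAK"))

# Coercion to a character string fails because of the nul bytes;
# capture the error message instead of stopping.
msg <- tryCatch(rawToChar(bytes), error = function(e) conditionMessage(e))
print(msg)

# Dropping the first 4 bytes leaves only printable letters,
# so the conversion succeeds.
trimmed <- rawToChar(bytes[-(1:4)])
print(trimmed)
```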
FWIW you can replace the nulls with the letter of your choice with:
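As a rough byte-level equivalent in base R (a sketch only, not the Biostrings call referred to above), the nul bytes can be overwritten in a raw vector before converting it to a string. The helper name `replace_nul` and the placeholder letter "X" are assumptions made for this example.

```r
# Hypothetical helper: swap every nul byte for a placeholder letter,
# then convert the raw vector to an ordinary character string.
replace_nul <- function(bytes, letter = "X") {
  stopifnot(is.raw(bytes), nchar(letter) == 1L)
  bytes[bytes == as.raw(0)] <- charToRaw(letter)
  rawToChar(bytes)
}

# Toy input: a peptide tail padded with 3 trailing nul bytes.
tail_bytes <- c(charToRaw("TWTYRWMAA"), as.raw(rep(0, 3)))
cleaned <- replace_nul(tail_bytes)
print(cleaned)
```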
OK so that was only to show you that AAStringSet objects (like BStringSet objects) are actually allowed to contain embedded nuls.
Just to discard the possibility that these nul bytes are an artefact of the compression/decompression mechanism, do you think you can uncompress the file and try to reproduce the error on the uncompressed file? I know that decompressing such a big file is going to require a lot of resources and might take a long time but you seem to have access to some powerful hardware.
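One way to check a file for nul bytes without loading it all into memory is to scan it in chunks with base R. The sketch below assumes a hypothetical helper `count_nul_bytes()` and an arbitrary chunk size; `gzfile()` reads gzip-compressed and uncompressed files alike, so the same function could be run on both versions of the file and the counts compared.

```r
# Hypothetical helper: count 0x00 bytes in a (possibly gzip-compressed) file,
# reading fixed-size chunks so the whole file never sits in memory.
count_nul_bytes <- function(path, chunk_size = 1e6) {
  # gzfile() decompresses gzip files and passes plain files through
  con <- gzfile(path, open = "rb")
  on.exit(close(con))
  n_nul <- 0  # kept as double to avoid integer overflow on huge files
  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_size)
    if (length(chunk) == 0L) break
    n_nul <- n_nul + sum(chunk == as.raw(0))
  }
  n_nul
}

# Toy FASTA file (made up for the illustration) with 3 embedded nul bytes.
demo_fasta <- tempfile(fileext = ".fasta")
writeBin(c(charToRaw(">toy\nPIAALGAK"), as.raw(c(0, 0, 0)), charToRaw("\n")),
         demo_fasta)
print(count_nul_bytes(demo_fasta))
```

A count of zero on the uncompressed file but not on the .gz file (or the other way round) would point at the compression step rather than at the content itself.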
Thanks,
H.
My download failed after a few hours with some error message I don't remember.
Were you able to download the file again, uncompress it, and reproduce the error on the uncompressed file?
H.
Hello Herve,
Sorry for sleeping so long. I am just at the next update cycle right now. I wrote my own file parser and do not use Biostrings at the moment, but I did not check the sanity of the download again.

UniProt offers this solution for download sanity: https://www.uniprot.org/help/metalink including md5 sums.
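Once the expected md5 sum for a file is known, the comparison itself can be done from R with `tools::md5sum()`. This is a minimal sketch: the helper name `verify_md5` is an assumption, and the expected checksum has to be copied by hand from whatever UniProt publishes (extracting it from the metalink file is not shown here).

```r
# Hypothetical helper: TRUE if the file's md5 sum matches the expected one.
# The comparison ignores letter case and surrounding whitespace.
verify_md5 <- function(path, expected) {
  observed <- unname(tools::md5sum(path))
  identical(tolower(observed), tolower(trimws(expected)))
}

# Toy file containing the 3 bytes "abc", whose md5 sum is the
# well-known test value 900150983cd24fb0d6963f7d28e17f72.
demo_file <- tempfile()
writeBin(charToRaw("abc"), demo_file)
print(verify_md5(demo_file, "900150983cd24fb0d6963f7d28e17f72"))
```

For the real download, `path` would be the local uniprot_trembl.fasta.gz and `expected` the published checksum for that release.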
Once I have made sure that the download worked, I will give Biostrings another try. Compared to Rsamtools::indexFa(), I like that the complete FASTA header lines are returned in the index table, which I wanted to use to search for entries from specific organisms.

Side note:
I found that Rsamtools::indexFa() checks for the following errors

Best, Daniel
The way I see it, you have to download the bundle with all sequencing files again. If you want to get the "RELEASE.metalink" file that contains the information on the *fasta.gz files from an older version, e.g. "2019_03":
ftp://ftp.uniprot.org/pub/databases/uniprot/previousreleases/release-201903/knowledgebase
The minimal bundle would be the uniprot_sprot-only2019_03.tar.gz file I guess.