Question

DNAString: Standard checksum function?

Henrik Bengtsson ★ 2.4k

Hi,

in a reverse-engineering task, I'd like to compare the content of two FASTA files whose origin I cannot trace. Comparing the FASTA index files shows that the sequence/chromosome names and the lengths of the individual sequences agree, although they appear in a different order. Next I'd like to compare the actual sequences. I could read these in using the FaFile class of Rsamtools and do a pairwise comparison in memory.

However, if I had to redo the same task in the future, I figured: why not calculate checksums of the sequences and store them in a file (imagine an enhanced FASTA index file that also contains a checksum per sequence)? Then one could quickly load and compare checksums instead of having to parse the whole FASTA file.
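The idea of a per-sequence checksum table can be sketched in a few lines. The following is only an illustration in Python using the standard library, not an existing tool or file format: the function name `fasta_checksums` is hypothetical, MD5 on the uppercased sequence is just the choice floated in this thread, and the toy FASTA content is made up.

```python
import hashlib
import io

def fasta_checksums(handle):
    """Return {sequence name: MD5 hex digest} for a FASTA text stream."""
    checksums = {}
    name, md5 = None, None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                checksums[name] = md5.hexdigest()
            # Assumption: the name is the first word of the header line
            parts = line[1:].split()
            name = parts[0] if parts else ""
            md5 = hashlib.md5()
        elif md5 is not None:
            # Normalize case so lowercase (soft-masked) bases compare equal;
            # line breaks are dropped, so line wrapping does not matter
            md5.update(line.upper().encode("ascii"))
    if name is not None:
        checksums[name] = md5.hexdigest()
    return checksums

# Two toy FASTA "files": same sequences, different order, wrapping and case
fasta_a = io.StringIO(">chr1\nACGTAC\nGT\n>chr2\nTTTTGGGG\n")
fasta_b = io.StringIO(">chr2 some description\nttttgggg\n>chr1\nACGTACGT\n")

sums_a = fasta_checksums(fasta_a)
sums_b = fasta_checksums(fasta_b)

# Dictionaries compare by content, so the sequence order is irrelevant
same_content = sums_a == sums_b
print(same_content)
```

Storing the resulting name/checksum pairs next to the index would let a later comparison skip parsing the sequences entirely, provided both sides used the same algorithm and the same normalization.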

To my question:

I cannot be the first one to do this, so is there a standard/preferred checksum algorithm I should use for genomic sequences? Is there even an existing method for this that I'm not aware of? If I were to implement this myself, I would probably go with md5 on toupper(as.character(seq)), but I'd prefer to use a de facto standard if one already exists.

Thanks

Biostrings FASTA sequencing • 3.2k views

8.8 years ago • updated 7 months ago • Henrik Bengtsson

score 0 · Answer 1 · 2016-03-11

Hi Henrik,

MD5 sounds good. I guess any decent checksum algo would probably do. Note that if seq is a DNAString object, you don't need to pass the result of as.character(seq) through toupper(), since a DNAString already stores its letters in uppercase.

Maybe I should add something like this to Biostrings, but that means I would need to commit to whatever checksum algo I choose so people in the future can rely on the checksums they've calculated (and saved) in the past. I guess the reason I didn't add this to Biostrings so far is to avoid this kind of commitment.

H.

score 0 · Answer 2 · 2016-03-11

I found the following presentation:

Bassi, Sebastian and Gonzalez, Virginia. New checksum functions for Biopython. Available from Nature Precedings, 2007

Abstract: Checksum algorithms are used in biological databases for integrity check and identification purposes. CRC64 is the only checksum algorithm already included in Biopython. This work proposes two new implementation of known algorithms (GCG Checksum and SEGUID). There is also an application based on SEGUID: Looking for redundancy between two FASTA files full of protein sequences based only in sequence information, by comparing the SEGUIDs of both files. The code is shown in the manuscript and may be available at Biopython.org.

Download presentation: http://dx.doi.org/10.1038/npre.2007.278.1 (PDF/PPT without paywall)

To summarize, they mention the following checksum algorithms:

CRC64: Proteins in Uniprot.
GCG-Checksum: DNA and Protein sequences in the file format of GCG and compatible programs.
SEGUID: “A SEquence Globally Unique IDentifier” Proteome Database
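To get a feel for how little room a short checksum leaves, here is a rough Python sketch of a GCG-style checksum. The weighting scheme (position weights cycling from 1 to 57, result taken modulo 10000) is my understanding of the Biopython implementation and should be treated as an assumption; the names `gcg_checksum` and `find_collision` are made up for this example. With at most 10,000 possible values, a brute-force search over short DNA strings is guaranteed to find two different sequences with the same checksum.

```python
from itertools import product

def gcg_checksum(seq):
    """GCG-style checksum: position-weighted character sum, modulo 10000."""
    index = checksum = 0
    for char in seq.upper():
        index += 1
        checksum += index * ord(char)
        if index == 57:  # assumed: weights cycle 1..57
            index = 0
    return checksum % 10000

def find_collision(length=8, alphabet="ACGT"):
    """Return the first pair of distinct sequences sharing a checksum."""
    seen = {}
    for letters in product(alphabet, repeat=length):
        seq = "".join(letters)
        value = gcg_checksum(seq)
        if value in seen:
            return seen[value], seq
        seen[value] = seq
    return None

# 4**8 = 65536 candidate sequences but at most 10000 checksum values,
# so a collision must exist
seq1, seq2 = find_collision()
print(seq1, seq2, gcg_checksum(seq1), gcg_checksum(seq2))
```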

If you read the slides, you find that the first two are not strong enough, i.e. it is too easy for two different sequences to get identical checksums. SEGUID looks very promising:

“We propose the use of a unique sequence identifier (SEGUID) that is derived from the primary sequence itself and easily generated by any user. SEGUIDs are resilient to changes in public and private databases as they remain constant throughout the lifetime of a given protein sequence. The SEGUID Proteome Database (http://bioinformatics.anl.gov/seguid/ ) provides aliases for the annotated entries available from several public databases and can be downloaded or generated easily at remote sites. SEGUIDs have been used in our proteomics laboratory for years and proved to be useful integrating mass spectrometry results, two-dimensional gelelectrophoresis data, and bioinformatics information”

Source: SEGUID: Overview. http://bioinformatics.anl.gov/seguid/overview.aspx (broken URL; Most recent Web Archive version: http://web.archive.org/web/20130214121710/http://bioinformatics.anl.gov/seguid/overview.aspx)

There is also a reference to a 2006 Proteomics article (http://dx.doi.org/10.1002/pmic.200600032), which is behind a paywall and which I have not read.

In other words, it might be that SEGUID is a better checksum algorithm for genomic sequences than a generic algorithm such as MD5. The fact that Biopython has decided to implement it may help with the decision and with initial validation.
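For reference, a SEGUID can be sketched with the Python standard library alone. As far as I understand it, a SEGUID is the SHA-1 digest of the uppercased sequence, Base64-encoded with the trailing "=" padding removed; that description is an assumption here and should be checked against Biopython's implementation before relying on stored values. The function name `seguid` below is simply chosen for this example.

```python
import base64
import hashlib

def seguid(seq):
    """SEGUID-style identifier: Base64 of SHA-1 of the uppercased sequence."""
    # Assumption: the sequence is plain ASCII letters (e.g. A, C, G, T, N)
    digest = hashlib.sha1(seq.upper().encode("ascii")).digest()
    # A 20-byte digest encodes to 28 Base64 characters ending in one "=";
    # stripping the padding leaves a 27-character identifier
    return base64.b64encode(digest).decode("ascii").rstrip("=")

id_upper = seguid("ACGTACGTACGT")
id_lower = seguid("acgtacgtacgt")
id_other = seguid("ACGTACGTACGA")
print(id_upper, id_upper == id_lower, id_upper == id_other)
```

Seen this way, SEGUID is a generic cryptographic hash plus a fixed normalization and encoding, so its practical advantage over plain MD5 would mostly be the shared convention rather than a sequence-specific algorithm.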