I found the following presentation:
Bassi, Sebastian and Gonzalez, Virginia. New checksum functions for Biopython. Available from Nature Precedings, 2007
Abstract: Checksum algorithms are used in biological databases for integrity check and identification purposes. CRC64 is the only checksum algorithm already included in Biopython. This work proposes two new implementation of known algorithms (GCG Checksum and SEGUID). There is also an application based on SEGUID: Looking for redundancy between two FASTA files full of protein sequences based only in sequence information, by comparing the SEGUIDs of both files. The code is shown in the manuscript and may be available at Biopython.org.
Download presentation: http://dx.doi.org/10.1038/npre.2007.278.1 (PDF/PPT without paywall)
To summarize, they mention the following checksum algorithms:
- CRC64: Proteins in Uniprot.
- GCG-Checksum: DNA and Protein sequences in the file format of GCG and compatible programs.
- SEGUID: “A SEquence Globally Unique IDentifier” Proteome Database
If you read the slides you find that the first two are not strong enough, i.e. two different sequences can get identical checksums. The SEGUID looks very promising:
“We propose the use of a unique sequence identifier (SEGUID) that is derived from the primary sequence itself and easily generated by any user. SEGUIDs are resilient to changes in public and private databases as they remain constant throughout the lifetime of a given protein sequence. The SEGUID Proteome Database (http://bioinformatics.anl.gov/seguid/ ) provides aliases for the annotated entries available from several public databases and can be downloaded or generated easily at remote sites. SEGUIDs have been used in our proteomics laboratory for years and proved to be useful integrating mass spectrometry results, two-dimensional gelelectrophoresis data, and bioinformatics information”
Source: SEGUID: Overview. http://bioinformatics.anl.gov/seguid/overview.aspx (broken URL; Most recent Web Archive version: http://web.archive.org/web/20130214121710/http://bioinformatics.anl.gov/seguid/overview.aspx)
There is also a reference to a 2006 Proteomics article (http://dx.doi.org/10.1002/pmic.200600032), which is behind a paywall, and that I can't be bothered to read.
In other words, it might be that SEGUID is a better checksum algorithm for genomic sequences than a generic algorithms such as MD5. The fact that Biopython has decided to implement may help the decision and to do initial validation.
All finite-length hash functions have collisions. The longer the checksum, the less likely it is to generate collisions. But long checksums are also more expensive to generate, store, and compare. So it's a question of compromise.
It seems that SEGUID was primarily designed for protein sequences. IIUC there still seems to be a lack of consensus about which checksum the Bioinformatics community as a whole should adopt for DNA sequences. Right now different DNA sequence databases use different sequence IDs but it would be great if they all agreed on an ID that is unique across databases.
A big advantage of md5 is that it's already available via the digest package.
H.
I would favor long enough checksums to minimize the risk for collisions; the cost for troubleshooting when there are collisions is much higher. It's fairly simple to memoize/cache results to file.
> Right now different DNA sequence databases use different sequence IDs but it would be great if they all agreed on an ID that is unique across databases.
At least, should we try to gather what the common ones are?
> A big advantage of md5 is that it's already available via the digest package.
Agree about MD5. SEGUID and others could be kept on the todo list. Also, when using digest it's fairly easy to select among its supported checksum algorithms; currently md5, sha1, crc32, sha256, sha512, xxhash32, xxhash64, murmur32.
However, it would be helpful to define exactly what should be checksum:ed, e.g.
For completeness:
SEGUID v2 (https://www.seguid.org/), implemented in CRAN package seguid (also available in Python, JavaScript, ...), supports various types of sequences.
PS. Funny, I had completely forgotten that I have asked this question eight years ago. Yes, I'm one of the co-authors of SEGUID v2.