Archives

Categories

Y
Y

Why Computer Science Professors Dislike Hash Functions

Homolog.us blog is written by professional janitors dedicated to clean up US science. During lunch breaks and other time off from the job, we discuss bioinformatics. The name 'homolog.us' is not a spelling mistake, but is derived by taking Arabic translation of the 'O' in the original word.

Please follow us on twitter – @homolog_us.

-------------------------------------------------------------------------------------------------------------

We have been going through various web-based resources on high-quality hash functions and made a startling discovery. None of the good websites was maintained by the members of computer science departments at top universities or even second-rank universities. Based on our highly anecdotal evidence, computer science professors stopped thinking about hash functions many decades back. That seemed puzzling, because in the world where it matters, research on hash function still attracts big money.

We were thinking about the apparent anomaly and came up with the following explanation. ‘Hash functions’ destroy the fantasy called ‘computer science’, and therefore computer science professors do not want to acknowledge their continued development. By promoting a researcher working on novel hash functions, they will be doing disservice to their motto of ‘universal programming’ and ‘commodity hardware’. Computer science will not be seen as ‘science’ but be ‘looked down upon’ as another engineering field. Therefore, the best strategy is to marginalize advancements on hash functions.

Best Websites on Hash Functions

Before we continue with that line of thinking, let us present the evidence. We will be happy, if readers can provide evidence disputing our claim. Here are the top hash function-related websites and success stories found by us.

1. Keccak won the NIST competition as SHA-3 or top cryptographic hash function. It was developed by Guido Bertoni, Joan Daemen, Michaël Peeters and Gilles Van Assche, four persons working in two semiconductor companies, not computer science researchers as we expected them to be.

2. Bob Jenkins’ website is web’s best resource on Hash functions. Here is how Bob describes himself.

A year at UC Berkeley in the doctoral program for theoretical computer science convinced Bob that he didn’t want to become a professor.

3. Bob Jenkins’ 1997 paper, which he continually updates with latest information, is possibly the best single paper on non-crypto hash functions. It was published in Dr Dobbs, not one of the popular CS journals.

4. Murmurhash was the latest big advancement in the hash function world. It was published in Austin Appleby’s personal blog, but a not-so-insignificant company hired him for his research (see below).

5. Google has taken over Murmurhash project and maintains another highly informative website (smhasher) comparing the performances of various hash functions.

6. Two other informative website on non-crypto hash functions are maintained by dedicated non-academics –

Non-Cryptographic Hash Function Zoo

General Purpose Hash Function Algorithms

Academic Literature

We did some web search to find out respectable technical publications on hash functions. The most cited one came from China and is worth reading, especially by those Americans, who want to fund giant military projects at the cost of math research. It was published in 2005 and already received 1100 highly deserved citations.

How to Break MD5 and Other Hash Functions – Xiaoyun Wang and Hongbo Yu

Another paper from last 15 years with over 200 citations was

Merkle-Damgard˚ Revisited : how to Construct a Hash Function

However, it was based on Merkle–Damgård construction described in Ralph Merkle’s 1979 Ph.D. thesis. Other top hits were mostly from 1970s and 80s as well. It seems like the excitement stopped after Ron Rivest’s work on RSA in early 90s.

A Secure One-way Hash Function Built from DES – Robert S. Winternitz, ISL, Stanford

Universal classes of hash functions, J.Lawrence Carter, Mark N. Wegman, IBM Thomas J. Watson Research Center, Yorktown Heights, New York 10598, USA

New hash functions and their use in authentication and set equality, Mark N. Wegman, J.Lawrence Carter, IBM Thomas J. Watson Research Center, Yorktown Heights, New York 10598, USA

(1996) Tiger: A fast new hash function – Ross Anderson, Eli Biham – Purchase on Springer.com
$29.95 / €24.95 / £19.95 + VAT

Why Computer Science Academics Do not Like Hash Functions

As PW Anderson argued in his ‘More is Different’ article, every academic field is based on a set of abstractions. ‘Computer science’ is an abstraction built on top of computer hardware, and hash function is the direct connection between hardware and computer science. The von Neumann architecture used by computer science uses linear memory. A hash function is the abstraction to go from large input set to small linear memory.

write-blog

To the dismay of computer science researchers, hash functions must be highly hardware dependent. As we said earlier, that goes against the notions of ‘universal programming algorithms’ or ‘commodity hardware’. For example, if Intel decides to put CRC32 or Murmurhash as part of their instruction code, relative performances of all programs using those hash functions will change substantially. The same is true for new kind of data. Hash functions tested for English text may not be good for bioinformatics text. There is no universal ‘one-size-fit-all’ hash function. Unless computer hardware and data stop changing, research on hash functions will continue to find ‘better’ hash functions.

Is CRC32 a Good Hash Function?

Bob Jenkins says yes, and the following threads say no.

Can CRC32 be used as a hash function?

Can one construct a “good” hash function using CRC32C as a base

One thing for sure, Intel’s adoption of CRC32 as part of SSE 4.2 changes many performance comparisons.

Edit.
Ruibang Luo, author of BGI’s SOAPdenovo assembler, chimed in:

In your latest blog post “Why computer scientists dislike hash functions”, the reference website showing that CRC32 is not suitable for hash function at all and failed all avalanche tests is inconsistent with my test (somebody mentioned that they have used buggy CRC32 table but I haven’t checked that). A test using genome text (both ascii and 2-bit encoding) instead of using English text should be more informative and accurate. According to my experience, different state of the art hash functions including murmurhash, cityhash, spookyhash, Jenkins and CRC32c do not make too much difference to the evenness of 2-bit encoding.

Choosing a Good Hash Function

For Bioinformatics, we do not need crypto hash functions. Among non-crypto ones, SpookyHash and Murmur3 are currently the hottest closely followed by Jenkins and City. Spooky hash is Bob Jenkins’ latest contribution. Cityhash is being developed by Google as an improvement over Murmur3. The following website presents very good comparative analysis, but do keep in mind that the tests were done on English text and Intel architecture. If you move to new kind of text and new hardware architecture, results will change.

Choosing a Good Hash Function, Part 3

Choosing a Good Hash Function, Part 2

What is a good hash function for genomic data?




-------------------------------------------------------------------------------------------------------------
Heroes and Heroines of New Media--2014

I am strongly influenced by Charles Hugh Smith, who runs his insightful social blog of Two Minds. I hope he will not mind, if I copy his style of acknowledgement to the supporters of our blog.

Our blog is deeply honored by the generous contribution of the following readers. Without their patronage, this site would go away.

Outstandingly Generous:  
Amemiya C. Schnable J. Bowman B.

We are also looking for subscribers to get help to finish the tutorials. Please see this post for details.

5 comments to Why Computer Science Professors Dislike Hash Functions

  • Heng Li

    For integer randomization hash functions, my experience is which to use does not matter too much. Most commonly used randomization functions have similar performance on real data. Strings may be different, but I doubt the several popular string hash functions make obvious practical differences.

  • samanta

    Thanks Heng and welcome to our blog.

    For integer randomization hash functions, my experience is which to use does not matter too much.

    Does that mean for nucleotide sequences, we can pick first 14 bits and get done with it?

  • samanta

    Added Ruibang’s comment. He seems to agree with Heng.

  • A CS professor

    I think you are over-thinking it. It’s not uncommon to see papers that run on specialized hardware, such as GPUs, hybrid computing, etc. There are also a zillion papers that do not propose “one-size-fits-all” solutions. You might be thinking of theoretical CS, and generalizing to all CS. Some of your thinking does apply to theoretical CS. For example, the traditional way to analyze algorithms is worst case behavior, which can result in algorithms that are good in practice being ignored due to bad worst-case performance.

    On the other hand, you are right that hash functions aren’t a hot topic in CS currently. But there are lots of important and useful things that aren’t hot topics in CS.

  • Yves Orton

    I thought I would add that these days in many areas the question is not just hash functions, but rather randomly seeded hash functions. If a hash function is used to hash arbitrary data then it needs to be robust to various attacks, such as key discovery attacks, multi-collision attacks, and precomputed attacks. A “well known” and unseeded algorithm may perform well in tests, but allow an attacker to brute force a set of colliding keys offline, and then use them to attack the hash function. Similarly a “good” hashing function sometimes turn out to be susceptible to multi-collision attacks.

    Another consideration in designing and choosing a hash function is how suitable it is for small hash tables. Many hash table implementation use the lower bits of the hash, sometimes as few as 3 bits, and still demand good properties when doing so. Many “good” hash functions produce poor bit distribution in the low bits and are unsuitable for such changes.

    Another consideration is how suitable the hash function is to resizing the hash table. Some algorithms require one to rehash all keys when resizing the table, others allow one to mask off a set of bits from the hash value.

    It is ironic that hash functions are one of the most commonly used data structures in common and web computing, yet the academic community generally ignores the subject. The only hash function that has any cred these days that actually came from academia is SipHash, and it is considered by many to be too slow, all the rest came from individual contributors or corps. Only cryptographic strength hashing has apparently caught the interest of academia, even tho non-cryptographic hashing is by far more common. Consider that most scripting language have a built in hash table implementation that is widely used for many purposes.

    Anyway, I am not sure I agree with your why, but I definitely agree with your what. The computing academia has let down the wider computing community by not doing more and better research on this important subject. (At least if you assume that academia is there to do research for the rest of us.)

Leave a Reply

  

  

  

You can use these HTML tags

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>