Hashing Primer



Searching for an element in an un-ordered array has cost of O(n). Binary search does a better job of searching for an element by sorting and dividing the array into two halves repeatedly and eliminating half the size of array elements for effective lookup. BST has an efficiency of O(log2n) but it requires a sorting of the array. In order to achieve near constant time lookup i.e O(1), we could use a hash table with key as index to the value. Hash table is an implementation of dictionary abstract type. When key is Integer then it can be directly taken as index. However if the key is any other object type, we need to use an hashing strategy to utilize less amount of memory to store large range of keys (hash codes) without sacrificing the quality of hashing.

In Java hash code and equals should follow the basic contract that if two object have same hash code, they may not be equal but if two objects are same, both should yield same hash code value (deterministic).

A collision of hash code occurs when two different object have same hash code. In other words if we try to store two unequal keys by their hash code value, they fall into the same location. This is not allowed to preserve the basic quality of hash table.

To reduce the likelihood of collision, computer science recommends uniform hashing where each item to be hashed has an equal probability of being placed into a hash location (otherwise known as memory slot), regardless of the other elements already placed at the position. In other words keys are distributed evenly in the allocated memory range or they are not clustered together. In common sense, there is a greater likely hood of bumping into someone in a crowded room rather than a large open place. Perfect hashing is a technique for building a hash table with no collisions. It is only possible to build one when we know all of the keys in advance. Perfect hashing guarantees O(1) for insert and lookup of keys in hash table. Another less important and idealistic concept is Minimal perfect hashing which implies that the resulting table contains one entry for each key, and no empty slots. It has limited use because its impossible to achieve minimal perfect hashing unless you know all keys in advance. Two other property of hashing for a practical use are optimal usage of available memory and the speed of calculating hash. If we have a near perfect hash function which takes forever to calculate or expects huge amount of memory, it will not be practical.

Universal hash function has the probability of collision is less than or equal to 1/n (where n is the number of bucket). In simple words the an hash function from a family of hash functions should not do a worse job than randomly assigning keys over a decent size of buckets/slots. It is also called SUHA (Simple universal hashing assumption) principle. Expected search time with chaining with load factor N/M is O(1+N/M).

The load factor of a hash function is N/M where N is number of items to hash and M is the size of the hash table.

There is always a trade off between space and time. With decent speed and manageable space an hash function will eventually be susceptible to collision. There are two types of collision resolution strategies:

  1. Open Hashing or Separate Chaining : Instead of using a single key elements, use linked list of key elements for all keys with same hash. Linked list is preferred over a simple array so that if the key is deleted, we do not have to reinitialize or resize the array. Never the less this strategy requires an overflow area or auxilliary set for storing colliding keys and some way to link them to index of main key storage slot. Not to mention that the performance of lookup would degrade with the length of the chain.
  1. Closed Hashing or Probing : Here we need to find an alternate location for the key if collision occurs. There are different techniques for finding alternate location as well:
  • Linear probing : Look for alternate key location starting from adjacent location of original hash move towards the end and then from beginning to preceding slot. Linear probing suffers from primary and secondary clustering problem where probe path is same for colliding keys (secondary) or when a key hashes into the probe path of other collision.
  • Quadratic Probing : A quadratic probing uses following sequence for finding alternate locations: (hash(key) + 1) MOD size,  (hash(key)   + 4)  MOD size, (hash(key)   + 9)  MOD size, (hash(key)   + 16)  MOD size...(hash(key)   + (add successive odd number to previous incremental factor))  MOD size
    One issue which must be resolved is whether we will actually be able to find an open slot with this method under reasonable circumstances. A relatively easy number-theoretic result ensures that if the capacity is prime and the load is under 50%, then the probe sequence will succeed.
  • Double-hash probing : Another hash function is used in combination main hash function. This helps in eliminating clustering problem and distributes keys more evenly across the allocated space reducing the chances of collision. However this adds complexity and suffers from slowness. The probe sequence with double hashing follows : h1(key) MOD size, (h1(key) + 1*h2(key)) MOD size, (h1(key) + 2*h2(key)) MOD size, (h1(key) + 3*h2(key)) MOD size and so on. A double hash should never evaluate to 0. A good choice for h2 is [P - (key MOD P)] where P is a prime number. 
In some cases Re-hashing is needed to transfer all keys into a new table by rehashing all existing keys. This would help in spreading the keys and insert/delete and lookup on new table would perform better. Re-hashing is an expensive operation.


Hash Functions:


Java's hashcode for Integer is the int primitive of the Integer. For hashcode of Long the result is The result is the exclusive OR of the two halves of the primitive long value held by the Long object

(int)(this.longValue()^(this.longValue()>>>32))

Integer which is the type of hashcode has a word size of 32 bits in Java. Long is 8 bytes i.e 64 bits in size. If the value of long falls in integer range then the hashcode would compute to same value.

The hash code for a String object in Java is computed as

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

31 is considered a good prime number for better distribution. 31 is also a Mersenne prime (like 127 or 8191) which is a prime number that is one less than a power of 2. This means that the mod can be done with one shift and one subtract if the machine's multiply instruction is slow. and multiplication by 31 can be achieved by left shift and subtraction i.e 31 * X = (X<<5) - X

Division Method: Reminder value after diving the key with size of of hash table. The size should not be selected power of two or one less than any power of two number. Also the size m should not be a power of 10. Good value of m is a prime number not close a power of two.

h(key) = key MOD m

Knuth variant of Division Hash function is :

h(key) = key(key + 3) MOD m


Multiplication Method: It can be used for a size of table as power of two. Here the key is multiplied by a constant and bit shifted to compute hash code.
h(key) = key * A << x

Knuth suggested a magic fraction for constant value calculation as :

A = Size Of Table * (Sqrt(5)-1)/2 = m * 0.6180339887498948


The number of right shifts should be equal to the log2 N subtracted from the number of bits in a data item key. For instance, for a 1024 position table (or 210) and a 16-bit data item key, you should shift the product key * k right six (or 16 - 10) places.

Folding Method:
Two folding methods are of two types
  1. Fold Shift : In fold shift, hashing key is divided into parts whose size match size of the required address. Left and right parts are shifted and added to the middle part.
  2. Fold Boundary: In fold boundary, left and right numbers are folded on a fixed boundary between them and center number.

CRC 32 : Hashing is computing for checksums to verify the integrity of data  downloaded from a site of comparing files (typically certificate files or compressed files). Java provides java.util.zip.CRC32. A CRC of a data stream is the remainder after performing a long division of the data (treated as a large binary number), but using exclusive or instead of subtraction at each long division step. This corresponds to computing a remainder in the field of polynomials with binary coefficients. CRCs can be computed very quickly in specialized hardware. Fast software CRC algorithms rely on accessing precomputed tables of data.

CRC32 checksum = new CRC32();
checksum.update(s.getBytes());
System.out.println(checksum.getValue());

The logic for CRC32 is some what like this:

highorder = h & 0xf8000000    // extract high-order 5 bits from h
                                   // 0xf8000000 is the hexadecimal representation
                                   //   for the 32-bit number with the first five 
                                   //   bits = 1 and the other bits = 0   
     h = h << 5                    // shift h left by 5 bits
     h = h ^ (highorder >> 27)     // move the highorder 5 bits to the low-order
                                   //   end and XOR into h
     h = h ^ ki                    // XOR h and ki

One at a time Hash  (SHA, MD5) : These are cryptographically secure hash functions. SHA-1 and MD5 can be computed by converting string to bytes, or when reading in bytes 1 at a time. JDK has java.security.MessageDigest to compute these hash values. 

Fowler–Noll–Vo hash function (FNV) : The purpose of FNV hashing was to be fast and maintain low collision rate. It is best suited for predictable texts such as URLs, machine names, host names, ip addresses, router tables, file names, person names, country names, station names, domain entitiy names such as stock ticker etc. There are different variations of FNV hash. They are listed at http://www.isthe.com/chongo/tech/comp/fnv/ FNV and Knuth are the basis for many advanced hash functions although they are rarely used in their original form now a days. Please refer to wikipedia for details on FNV https://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function

Murmur Hash: Murmur hash thoroughly mixes the bits of a value by application of left rotation and multiplication by magic number for certain number of times (usually 15).
x *= m;
x = rotate_left(x,r);

m, r, and x are pseudo-randomized. Unfortunately multiply+rotate has a few major weaknesses when used in a hash function, so some implementation use multiply+shift+xor. Java implementation is available at http://d3s.mff.cuni.cz/~holub/sw/javamurmurhash/MurmurHash.java

Paul Hsieh's SuperFastHash : SuperFast Hash uses one at a time hashing as a model. A brief description behind all the avalanche and magic numbers used for rotation and multiplication can be found at the author's website http://www.azillionmonkeys.com/qed/hash.html. Java implementation of super fast hash is available here

Google's CityHash : Not suitable for cryptography, CityHash mixes inputs bits thoroughly. It is used for strings. The C++ code for cityhash family of hash functions are available at https://github.com/google/cityhash

Cuckoo hashing: Cuckoo hashing achieves constant average time insertion and constant worst-case search: each item has two possible slots. Put in either of two available slots if empty; if not, eject another item in one of the two slots and move to its other slot (and recur). "The name derives from the behavior of some species of cuckoo, where the mother bird pushes eggs out of another bird's nest to lay her own." Rehash everything if you get into a relocation cycle. Maximum load with uniform hashing is log n / log log n. Improve to log log n by choosing least loaded of two.

Djb2, SDBM, Lose Lose : These hash functions are well defined at http://www.cse.yorku.ca/~oz/hash.html

An old profiling numbers of all the major hash functions implemented in C can be found at http://www.strchr.com/hash_functions?allcomments=1 . New and improved hash functions are still a subject of many research scholars for specifically targeted for various usage. In my personal experience SuperFastHash implementation worked well for integer arrays and strings. Bob Jenkin has analyzed different hashing techniques and compared against his implementation in his website: http://burtleburtle.net/bob/hash/doobs.html

Many of the hashing function implementation is available in experimental form in google's guava library under com.google.common.hash.Hashing. We should test all of them and chose what works best for expected data set.