Data Structures: Hash Tables

Open and Closed Hashing, etc.

All hash tables share the idea that a hash function \(H(x)\) maps a value to a location. For simplicity, we’ll treat this location as an index into an array of size \(n\), i.e. an integer value between 0 and \(n-1\).

This scheme works perfectly as long as there are no hash collisions, that is, as long as no two distinct values hash to the same location.

The best way to deal with collisions is to make them rare: choose a clever hash function, one that takes the statistical distribution of your data into account, so that collisions become very unlikely.

But since the number of data values you have is usually much larger than the size of your array, by the Pigeonhole Principle (which you learned about, or will learn about, in CS40), there will ALWAYS be some collisions. What to do?

There are two basic approaches, with some variations on each. Sadly, the terminology used for these techniques can be quite confusing.

One way of thinking about the two approaches:


| How it works | Hashing: do we leave the hash table, or not, to find data? | Addressing: how do we determine an address? | Other names that may apply |
|---|---|---|---|
| All elements are stored in the array itself, and you probe through the array to resolve collisions. | Closed Hashing: nothing is outside the hash table. | Open Addressing: the address is determined not only by the key, but also by what else is already in the hash table. | There are various kinds of probing: linear probing, quadratic probing, double hashing. |
| The elements are stored outside the array; each array element is the head of a linked list (or other data structure). | Open Hashing: we may have to go outside the hash table to find our data. | Closed Addressing: the hash function tells us exactly where to go (though we may have to iterate through a list once we get there). | Separate Chaining is the term often used for this technique. |

The Underlying Idea

Below is a summary of the three most common operations supported by a hash table. Each operation is listed with its average-case and worst-case complexity.

find(item)

\(O(1)\) in average case

\(O(n)\) in worst case

insert(item)

\(O(1)\) in average case

\(O(n)\) in worst case

delete(item)

\(O(1)\) in average case

\(O(n)\) in worst case

Hash tables are a data structure that takes advantage of random access to achieve extremely fast lookups. “Random access” in this case is a term of art that means we can directly access a piece of data at any arbitrary address in memory (i.e. RAM) in a single operation. The term RAM for memory refers to this property—it is an acronym for “random access memory”.

An array in C++, for example, supports random access: you can index into an array by adding an offset (computed from the array index) to the base address, and directly retrieve any element in one operation.