Lecture 6, Tue 10/15

Hashing

Hashing is the ability to address unique key values in an array whose size may be smaller than the set of possible key values.
Generally, keys are unique values that identify some record of information.
- In order to put data (value) in the appropriate place in the collection, a hash function is required.
  - The hash function outputs a position in the collection based on the key value
  - Hash function outputs should be uniformly distributed across the available indices
    - Otherwise, many keys would map to the same few indices and compete for the same slots.
  - For example, if you have an array of 100 elements, a simple hash function could be:
    - key % 100, which yields an index in the range 0-99
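
The modulo example above can be sketched in C++ as follows. This is only a minimal illustration, assuming non-negative integer keys; the name hash_index and its parameters are made up for this example and are not from the lecture.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: maps a non-negative integer key to an index in [0, table_size).
std::size_t hash_index(unsigned int key, std::size_t table_size) {
    assert(table_size > 0);  // modulo by zero is undefined
    return key % table_size;
}
```

With a table size of 100, hash_index(1234, 100) gives 34, so the record with key 1234 would be stored at index 34.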

Hash Table

Hashing is very efficient for searching for data in an array.
- Recall binary search: O(log n) search time (if elements are sorted).
- Linear search: O(n) search time.
- Hash Table search: O(1) average search time, and the elements do not need to be sorted.
  - Hash Table searching provides near-instant access to an element in an array since the hash function computes the index where the data is stored, so no scan through the other elements is needed.

Collisions

It’s possible that two different keys may be hashed to the same index.
- This is known as a collision.

Open Address

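Double hashing is an open-addressing technique that uses a 2nd hash function when resolving a collision.
- If the index from the first hash function results in a collision, then use the 2nd hash function to determine how far to step in the array to look for an empty slot.
- Helps reduce the clustering effect, since keys that collide at the same index usually step by different amounts.
Problems
- If the step returned by hash2 is large, the index plus the step can go out of bounds, so the result has to wrap around (modulo the table size).
- Depending on the table size and hash2, the probe sequence may not reach every slot (for example, when the step and the table size share a common factor), so the indices won’t be uniformly distributed.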
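
A minimal sketch of the stepping idea described above, assuming non-negative integer keys and no removals. The class name DoubleHashTable and the particular hash1 / hash2 formulas are illustrative choices, not the lecture's own code. The step is wrapped with modulo to stay in bounds, and hash2 never returns 0 so each probe moves to a different slot.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative open-addressed table of unsigned keys using double hashing.
// A prime capacity makes every step coprime with it, so a probe sequence can reach every slot.
class DoubleHashTable {
public:
    explicit DoubleHashTable(std::size_t capacity)
        : keys_(capacity, 0), used_(capacity, false) {
        assert(capacity >= 2);  // hash2 divides by capacity - 1
    }

    // Returns false if no free slot was found after probing capacity times.
    bool insert(unsigned int key) {
        std::size_t index = hash1(key);
        const std::size_t step = hash2(key);
        for (std::size_t probes = 0; probes < keys_.size(); ++probes) {
            if (!used_[index] || keys_[index] == key) {
                keys_[index] = key;
                used_[index] = true;
                return true;
            }
            index = (index + step) % keys_.size();  // wrap around instead of going out of bounds
        }
        return false;
    }

    bool contains(unsigned int key) const {
        std::size_t index = hash1(key);
        const std::size_t step = hash2(key);
        for (std::size_t probes = 0; probes < keys_.size(); ++probes) {
            if (!used_[index]) return false;  // an empty slot ends the probe sequence
            if (keys_[index] == key) return true;
            index = (index + step) % keys_.size();
        }
        return false;
    }

private:
    std::size_t hash1(unsigned int key) const { return key % keys_.size(); }
    // Step is in [1, capacity - 1], so it is never zero.
    std::size_t hash2(unsigned int key) const { return 1 + key % (keys_.size() - 1); }

    std::vector<unsigned int> keys_;
    std::vector<bool> used_;
};
```

For example, with a capacity of 7, the keys 3, 10, and 17 all start at index 3, but their steps are 4, 5, and 6, so 10 lands at index 1 and 17 lands at index 2 instead of piling up next to each other.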

Double Hashing

The biggest problem with open-address hashing is
- If the table is full, no more elements can be added.
- Similar to a vector, the table could expand its capacity “under the hood” when needed, but…
  - All elements will probably have to be rehashed, since each index depends on the table size
  - The new capacity shouldn’t be wasteful (too big) or too small
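
The rehashing cost mentioned above can be sketched as follows. This is an illustration under assumptions: keys are non-negative ints, -1 marks an empty slot, and collisions are resolved with a step of 1 (linear probing) purely to keep the example short. The names kEmpty, place, and grow are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

const int kEmpty = -1;  // sentinel marking an unused slot; keys are assumed non-negative

// Places key using a step of 1 on collision, chosen here only for brevity.
// Assumes the table has at least one empty slot.
void place(std::vector<int>& slots, int key) {
    assert(key >= 0);
    std::size_t index = static_cast<std::size_t>(key) % slots.size();
    while (slots[index] != kEmpty) {
        index = (index + 1) % slots.size();
    }
    slots[index] = key;
}

// Builds a larger table and re-inserts every key, because key % capacity
// changes when the capacity changes.
std::vector<int> grow(const std::vector<int>& old_slots, std::size_t new_capacity) {
    assert(new_capacity > old_slots.size());
    std::vector<int> new_slots(new_capacity, kEmpty);
    for (int key : old_slots) {
        if (key != kEmpty) place(new_slots, key);
    }
    return new_slots;
}
```

In a table of size 4, the keys 5, 9, and 6 end up at indices 1, 2, and 3. After growing to size 8 they move to indices 5, 1, and 6, which is why the old positions cannot simply be copied over.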
Chained hashing (chaining)
- Each index in the hash table references a list of data records; if a collision occurs, the new record is added to the list at that index.
- Linked lists are a common choice for storing collided data records.
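
A minimal sketch of chaining, assuming unsigned integer keys with string values and a fixed number of buckets. The class name ChainedTable and its member names are illustrative, not from the lecture. Each bucket holds a std::list of key / value records, so colliding records simply share a list.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <utility>
#include <vector>

class ChainedTable {
    typedef std::pair<unsigned int, std::string> Record;

public:
    explicit ChainedTable(std::size_t bucket_count) : buckets_(bucket_count) {
        assert(bucket_count > 0);
    }

    // Inserts a record, or overwrites the value if the key is already present.
    void put(unsigned int key, const std::string& value) {
        std::list<Record>& chain = buckets_[key % buckets_.size()];
        for (Record& record : chain) {
            if (record.first == key) {
                record.second = value;
                return;
            }
        }
        chain.push_back(Record(key, value));
    }

    // Returns a pointer to the stored value, or nullptr if the key is absent.
    const std::string* find(unsigned int key) const {
        const std::list<Record>& chain = buckets_[key % buckets_.size()];
        for (const Record& record : chain) {
            if (record.first == key) return &record.second;
        }
        return nullptr;
    }

    // Number of records stored in one bucket's list.
    std::size_t chain_length(std::size_t bucket) const {
        return buckets_.at(bucket).size();
    }

private:
    std::vector<std::list<Record> > buckets_;
};
```

With 10 buckets, the keys 7, 17, and 27 all hash to bucket 7 and end up in the same list. Unlike open addressing, the table never becomes "full", although long lists make searches slower.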

Chaining

std::map is typically implemented with a balanced tree structure (commonly a red-black tree), which gives O(log n) search time.
- Useful if you care about the ordering of keys.
std::unordered_map is used similarly to std::map, but std::unordered_map is implemented with a hash table.
- When printed with iterators, std::unordered_map’s key / value pairs do not appear in sorted order by keys, but std::map’s do.
- Hash Tables have a better average case performance (O(1)), but data order is not guaranteed when traversing the structure with iterators.
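
The ordering difference can be observed with a small sketch like the one below. The helper names keys_in_iteration_order and compare_iteration_order are made up for this example. The order printed for std::unordered_map is unspecified and may differ between standard library implementations.

```cpp
#include <algorithm>
#include <cassert>
#include <iostream>
#include <map>
#include <string>
#include <unordered_map>
#include <vector>

// Collects the keys in the order an iterator visits them.
template <typename Map>
std::vector<int> keys_in_iteration_order(const Map& m) {
    std::vector<int> keys;
    for (const auto& entry : m) keys.push_back(entry.first);
    return keys;
}

// Stores the same three records in both containers and prints the key order of each.
void compare_iteration_order() {
    const std::map<int, std::string> tree_map = {{30, "thirty"}, {10, "ten"}, {20, "twenty"}};
    const std::unordered_map<int, std::string> hash_map(tree_map.begin(), tree_map.end());

    const std::vector<int> tree_keys = keys_in_iteration_order(tree_map);
    assert(std::is_sorted(tree_keys.begin(), tree_keys.end()));  // guaranteed by std::map

    std::cout << "std::map keys:";
    for (int key : tree_keys) std::cout << ' ' << key;  // ascending key order
    std::cout << "\nstd::unordered_map keys:";
    for (int key : keys_in_iteration_order(hash_map)) std::cout << ' ' << key;  // unspecified order
    std::cout << '\n';
}
```

Calling compare_iteration_order() prints the std::map keys as 10 20 30, while the std::unordered_map line contains the same three keys in whatever order the hash table happens to store them.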