A hash table is a randomized data structure that allows fast insert / delete / search.
- insert: stick it into the data structure based on key
- delete: delete from data structure
- search: get a pointer into the data structure, or null
goal: \(O\qty(1)\) (expected) time insert, delete, and search
constituents
- \(U\) is a universe, where \(|U| = M\)
- \(K \subseteq U\) keys are going to show up, \(|K| = n\)
- \(M \gg n\)
- \(h: U \to \qty {1 \dots n}\) is a function that maps elements of \(U\) to buckets
slight abuse of notation: there will be \(n\) buckets, and \(n\) elements that come in. These two counts should be of similar order of magnitude but need not be equal (since you can stick multiple elements into a bucket and run a constant-time search within it).
requirements
A hash table with chaining consists of \(n\) buckets, where each bucket is a linked list: once a key is hashed, we can insert into its list in \(O\qty(1)\); to find something takes \(O\qty(\text{length}\qty(ll))\), which is expected constant per bucket if \(n \approx n_{\text{items}}\).
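As a concrete illustration, here is a minimal sketch of chaining in Python (the class and method names are ours, and Python's built-in `hash` stands in for the random \(h\) discussed below):

```python
class ChainedHashTable:
    """Hash table with chaining: n buckets, each bucket a list ("chain")."""

    def __init__(self, n):
        self.n = n
        self.buckets = [[] for _ in range(n)]

    def _bucket(self, key):
        # stand-in for h : U -> {1 ... n}; Python's hash is deterministic,
        # unlike the randomized h analyzed below
        return hash(key) % self.n

    def insert(self, key, value):
        self.buckets[self._bucket(key)].append((key, value))  # O(1)

    def search(self, key):
        # O(length(ll)): scan only the one chain the key hashes to
        for k, v in self.buckets[self._bucket(key)]:
            if k == key:
                return v
        return None

    def delete(self, key):
        chain = self.buckets[self._bucket(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain.pop(i)
                return
```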
additional information
showing that a random hash function is good
It is bad if some adversarial input can land a bunch of elements into one bucket using a particular hash function \(h\).
Let \(h\) be a uniformly random hash function. For any choice of \(u_1, \dots, u_{n} \in U\) and any \(i\), \(\mathbb{E}\qty[\abs{\qty{j : h\qty(u_{j}) = h\qty(u_{i})}}] \leq 2\) (i.e. the expected size of the bucket containing \(u_{i}\), after all \(n\) inputs are inserted, is at most 2). In particular, this bounds the expected number of elements in filled buckets (i.e. we don't consider empty buckets).
\begin{align} \mathbb{E}\qty[\abs{\qty{j : h\qty(u_{j}) = h\qty(u_{i})}}] &= \mathbb{E}\qty[\sum_{j=1}^{n} 1\qty{h\qty(u_{i}) = h\qty(u_{j})}] \\ &= \sum_{j=1}^{n} P \qty(h\qty(u_{i}) = h\qty(u_{j})) \\ &= 1 + \sum_{j \neq i}^{} P\qty(h\qty(u_{i}) = h\qty(u_{j})) \\ &= 1 + \sum_{j \neq i}^{} \frac{1}{n} \\ &= 1 + \frac{n-1}{n} \leq 2 \end{align}
We can pull out the \(1\) because \(P\qty(h\qty(u_{i}) = h\qty(u_{i})) = 1\); for \(j \neq i\), the two bucket assignments are independent and uniform, so they collide with probability \(\frac{1}{n}\). As desired, this means that any input, including an adversarial one, results in buckets that are, in expectation, not too large.
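A quick empirical sanity check of this bound (a sketch: the uniformly random \(h\) is simulated by drawing each element's bucket with `random.randrange`; the function name is ours):

```python
import random

def avg_bucket_size_of_inserted(n, trials=10_000):
    """Estimate E[size of the bucket containing u_1] under a random h."""
    total = 0
    for _ in range(trials):
        # a fresh uniformly random h on these n inputs:
        # each u_j is assigned an independent uniform bucket
        assignments = [random.randrange(n) for _ in range(n)]
        total += sum(1 for a in assignments if a == assignments[0])
    return total / trials

print(avg_bucket_size_of_inserted(100))  # ~ 1 + 99/100, just under 2
```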
picking a hash function
- we don’t want too many buckets, so around \(n\) (otherwise we take too much space)
- we want items spread out so that adversarial inputs can’t stack too many in one bucket
“Why not a hash function with state?” Recovering the state at lookup time would take more than \(O\qty(1)\) (e.g., looking up which ID you assigned per token).
deterministic hash functions can’t work
Since \(M \gg n\), by pigeonhole some bucket’s preimage in \(U\) must contain at least \(\frac{M}{n} \gg n\) elements. Therefore, a deterministic hash can be broken in the worst case by choosing all \(n\) inputs from that one preimage, so everything lands in the same bucket.
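For instance, against the (hypothetical, purely illustrative) deterministic hash \(h\qty(u) = u \bmod n\), the adversary simply picks multiples of \(n\):

```python
n = 100
h = lambda u: u % n                         # some fixed, deterministic hash
adversarial = [n * k for k in range(n)]     # n inputs from one preimage
assert all(h(u) == 0 for u in adversarial)  # all land in bucket 0
# searches in bucket 0 now cost O(n) instead of O(1)
```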
uniformly random hash function
A very basic approach is to generate a random function (at great cost) which independently and uniformly maps each element of \(U\) to an integer in \(\qty{1 \dots n}\). This, however, takes a lot of memory to store and time to compute.
This is a bad idea: each bucket index takes \(\log n\) bits to write down, and there are \(M\) items in the universe, so the table representing this function is \(M \log \qty(n)\) bits. Yet direct addressing takes only \(M\) slots. This is lame.
Getting around this (^) is actually quite hard. There are \(n^{M}\) possible hash functions (for each element in the universe, we choose one of \(n\) buckets); specifying an arbitrary one takes \(\log \qty(n^{M}) = M \log n\) bits no matter how we encode it; this is bad.
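For a sense of scale (numbers chosen purely for illustration): with \(M = 2^{64}\) and \(n = 2^{20}\), \begin{align} M \log n = 2^{64} \cdot 20 \text{ bits} \approx 4.6 \times 10^{19} \text{ bytes}, \end{align} i.e. tens of exabytes just to write the function down.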
We essentially have to sample from a smaller set of functions so the sampled function can be represented compactly. Enter hash families.
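As a preview (the standard Carter–Wegman construction, stated here rather than derived; the prime \(p\) and the names below are illustrative): sampling \(h_{a,b}\qty(u) = \qty(\qty(au + b) \bmod p) \bmod n\) requires storing only the two integers \(a, b\), i.e. \(O\qty(\log M)\) bits, instead of \(M \log n\):

```python
import random

p = (1 << 61) - 1  # a Mersenne prime; assumes |U| <= p

def sample_hash(n):
    """Sample h_{a,b}(u) = ((a*u + b) mod p) mod n from the family."""
    a = random.randrange(1, p)
    b = random.randrange(p)
    # storing h costs two O(log M)-bit integers, not M log n bits
    return lambda u: ((a * u + b) % p) % n

h = sample_hash(1000)
print(h(42), h(43))
```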
direct addressing
A very primitive system for hashing (sketched after the list below) would just be:
- know the range of possible keys in advance
- create a big ol’ array with one slot per possible key
- stick each element directly into its own slot
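A minimal sketch, assuming keys are integers in \(\qty{0 \dots M-1}\):

```python
class DirectAddressTable:
    """Direct addressing: one slot per possible key in the universe."""

    def __init__(self, M):
        self.slots = [None] * M  # big ol' array, sized to the whole universe

    def insert(self, key, value):
        self.slots[key] = value  # O(1); no collisions possible

    def search(self, key):
        return self.slots[key]   # O(1)

    def delete(self, key):
        self.slots[key] = None   # O(1)
```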
problem: the universe may be far bigger than \(n\), and you don’t know what inputs you are going to get in advance
why not radixing?
Worst case, one bucket can be very full: keys that share the same leading digits all land together, so adversarial (or just skewed) inputs break the \(O\qty(1)\) search.