Expand description
§The Bytell Hash Table
The bytell (byte linked list) hash table is an algorithm by Malte Skarupke, which we implement here in Rust.
The overview of the algorithm can be found here: A new fast hash table in response to Google’s new fast hash table
And Matle Skarupke’s C++ implementation can be found in he’s GitHub repository.
§Properties
The bytell hash table is optimized for memory usage, it uses 1 byte overhead per entry. Insetion, query and removal are quite fast as well.
§Basic layout
Bytell hash table internally is built from a contiguous sequence of blocks. The number of blocks is always a power of two, at any time.
Each block is composed of N entries, where N is also a power of two, and it
is fixed, depending on key and value types only. Typically N is a low number like 8 or 16.
A block entry is a triple (control_byte, key, value), where control_byte is
a single byte (explained later), key is a comparable, hashable key and value
is any value.
The exact layout of block is not a contiguous sequence of (control_byte, key, value)
triples. On the contrary, we first store a sequence of N control bytes, and then
a sequence of N key-value pairs. In particular N has to be at least the alignment
of (key, value) pair (in reality it is at least 8 as well).
The capacity of bytell hash table is calculated as number_of_blocks * N. And so
it is a power of two as well.
Additionally a hash function hash(x) is attached to the bytell hash table. And also
a hash_to_index(x) function, which turns the hashed value into an index usable by
the bytell hash table internally. The hash_to_index(x) value is guaranteed to be
in 0..capacity range.
§control_byte
The control_byte is a byte (8-bit unsigned number) which is divided into two pieces:
- The first bit, if set, denotes that the entry is “direct”, meaning some hash maps
exactly to this entry. If it is zero, then it is a “storage” entry, meaning it is used
for storing a linked list of (key, value) pairs that share the hash. We will call
this field
control_byte.type. - The remaining 7 bits are used as jump distance indexes. Given such an entry with non-zero
jump distance index we can calculate the next item in the linked list by calculating an
appropriate offset based on the index. So this field is a 7-bit integer denoted by
control_byte.distance_index.
The distance_index is not an offset to the next entry. There’s an indirection: the
distance_index retrieves the offset from a global static JUMP_DISTANCES array, where
each distance is carefuly chosen for the highest quality.
Nevertheless, the next_entry that follows prev_entry through distance_index will
be denoted by next_entry := prev_entry + prev_entry.control_byte.distance_index. We
keep in mind that in reality there’s this JUMP_DISTANCES array indirection.
§Query
For a given query_key the retrieval of the value works through the following steps:
- Let
hash_value := hash(query_key) - Let
index := hash_to_index(hash_value) - Let
block := blocks[index % number_of_blocks] - Let
entry := block.entries[index % N] - If
entry.control_byte.type != DIRECT_HITthen returnNONE - If
entry.control_byte.type == DIRECT_HITthen:- loop:
- If
entry.key == query_keythen return(entry.key, entry.value) - If
entry.control_byte.distance_index == 0then returnNONE entry := entry + entry.control_byte.distance_index
- If
- loop:
§Delete
For a given delete_key the deletion of a key works like this:
- Let
hash_value := hash(delete_key) - Let
index := hash_to_index(hash_value) - Let
block := blocks[index % number_of_blocks] - Let
entry := block.entries[index % N] - If
entry.control_byte.type != DIRECT_HITthen returnNONE - loop:
- If
entry.key == delete_keythen break withfound_entry := entry - If
entry.control_byte.distance_index == 0then returnNONE entry := entry + entry.control_byte.distance_index
- If
- We found matching entry. But in order to delete it, we first swap it with the last in the linked list.
- Let
tail := last_linked_entry(found_entry) - If
tail != found_entrythenswap(tail, found_entry). Nowfound_entryis at the end of the list, it has no successor. - Let
parent_entry := parent(found_entry) parent_entry.control_byte.distance_index := 0found_entry.control_byte := EMPTY- Return
found_entry
§Insert
For a given (insert_key, insert_value) pair the insertion works like this:
- Let
hash_value := hash(insert_key) - Let
index := hash_to_index(hash_value) - Let
block := blocks[index % number_of_blocks] - Let
entry := block.entries[index % N] - If
entry.control_byte.type == DIRECT_HIT:- loop “direct”:
- If
entry.key == insert_keythenentry.value := insert_valueand returnold_value. We did an in-place update. - If
entry.control_byte.distance_index != 0then:entry := entry + entry.control_byte.distance_index- continue loop “direct”
- We reached the end of the linked list. We need to insert a new element.
- If
should_grow()then grow the table - loop “inner”:
- Let
(free_entry, free_entry_distance) := search_for_free_entry(entry) - If found then break “inner”, otherwise grow the table. Panic if that is not possible.
- Let
entry.control_byte.distance_index := free_entry_distancefree_entry.control_byte.type := STORAGEfree_entry.control_byte.distance_index := 0free_entry.key := insert_keyfree_entry.value := value- RETURN
- If
- loop “direct”:
- If
entry.control_byte.type != DIRECT_HIT:- We have direct hit on an entry that is actually a storage entry. What we need to do is to move around values to make place for this hit.
- Let
parent_entry := parent(entry) - Let
tmp_current := entryandentry := reserved. We reserve the entry, because we don’t know the operation of moving around stuff to potentially hit it again. - Let
original_entry := entry - loop “outer”:
- loop “inner”:
- Let
(free_entry, free_entry_distance) := search_for_free_entry(entry) - If found then break “inner”, otherwise grow the table. Panic if that is not possible
- Let
- copy
tmp_currenttofree_entry, set distance onparent_entry - If
tmp_current.control_byte.distance_index == 0then break “outer” parent_entry := tmp_current, andtmp_current := tmp_current + tmp_current.control_byte.distance_index
- loop “inner”:
- Now we have place for a direct hit
original_entry.control_byte.type := DIRECT_HIToriginal_entry.control_byte.distance_index := 0original_entry.key := insert_keyoriginal_entry.value := insert_value- RETURN
Modules§
- configuration
- Contains the configuration for the bytell hash table.
- defaults
- Contains the default, recommended configuration for the bytell hash table.
- errors
- Contains bytell specific errors.
- hash_
table - Contains the bytell (byte linked list) hash table implementation.
- hash_
to_ index - Contains strategies for converting hash values to indices in the bytell hash table.