The team built their hash table in two parts. They had a primary data structure, in which the items are stored without any wasted bits at all, and a secondary data structure, which helps a query find the item it’s looking for. While the group did not invent the notion of a secondary data structure, they did make a crucial discovery that made their hyperefficient hash table possible: The hash table’s overall memory efficiency depends on how the primary structure arranges its stored items.
The basic idea is that every item in the primary structure has preferred storage locations — a best location, a second-best one, a third-best and so on. If an item is in its best spot, the number 1 is affixed to it, and that number is stored in the secondary data structure. In response to a query, the secondary structure provides just the number 1, which pinpoints the item’s exact location in the primary structure.
If the item is in its 100th-best spot, the secondary data structure attaches the number 100. And because the system uses binary, it represents the number 100 as 1100100. It takes more memory, of course, to store the seven-bit number 1100100 than the single-bit number 1 — the number assigned to an item when it’s in its best spot. Differences like that become significant if you’re storing, say, a million items.
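To make that scheme concrete, here is a minimal Python sketch of the encoding idea. It is a toy illustration, not the researchers’ actual construction: the function names, the hash-derived slot ordering and the sample words are all assumptions made for the example. The primary array holds the items themselves, while the secondary dictionary stores only the binary rank of the slot each item landed in, so an item sitting in a better spot costs fewer bits.

```python
# Toy sketch of the rank-encoding idea; not the construction from the paper.
import hashlib
import random

NUM_SLOTS = 8

def preferred_slots(item: str, num_slots: int) -> list[int]:
    """The item's ranked storage locations: a permutation of all slots, best slot first."""
    seed = int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")
    order = list(range(num_slots))
    random.Random(seed).shuffle(order)
    return order

def insert(primary: list, secondary: dict, item: str) -> None:
    """Store the item in its best free slot; record only that slot's rank, in binary."""
    for rank, slot in enumerate(preferred_slots(item, len(primary)), start=1):
        if primary[slot] is None:
            primary[slot] = item
            # Rank 1 is stored as just "1"; rank 100 would be stored as "1100100".
            secondary[item] = bin(rank)[2:]
            return
    raise RuntimeError("table is full")

def lookup(primary: list, secondary: dict, item: str) -> int:
    """Turn the stored rank back into the item's exact slot in the primary structure."""
    rank = int(secondary[item], 2)
    return preferred_slots(item, len(primary))[rank - 1]

primary = [None] * NUM_SLOTS
secondary = {}
for word in ["apple", "pear", "plum", "fig", "cherry"]:
    insert(primary, secondary, word)

for word, bits in secondary.items():
    print(f"{word}: rank {int(bits, 2)}, stored as '{bits}' ({len(bits)} bits)")
print("plum lives in slot", lookup(primary, secondary, "plum"))
```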
So the team realized that if you continually shift items in the primary data structure into their more preferred locations, you could significantly reduce the memory consumed by the secondary structure without having to increase query times.
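Continuing that same toy sketch (again, an illustrative assumption rather than the Bender team’s actual algorithm), a single compaction pass moves each item into the best-ranked slot that happens to be free. Since a stored rank can only get smaller, the secondary structure’s total bit count can only shrink:

```python
# Toy compaction pass for the sketch above; not the paper's algorithm.
def delete(primary: list, secondary: dict, item: str) -> None:
    """Remove an item, freeing up its slot in the primary structure."""
    rank = int(secondary.pop(item), 2)
    primary[preferred_slots(item, len(primary))[rank - 1]] = None

def compact(primary: list, secondary: dict) -> None:
    """Shift items toward their most preferred free slots, shrinking the stored ranks."""
    for item in list(secondary):
        current_rank = int(secondary[item], 2)
        order = preferred_slots(item, len(primary))  # defined in the sketch above
        for rank, slot in enumerate(order[: current_rank - 1], start=1):
            if primary[slot] is None:                    # a strictly better spot is free
                primary[order[current_rank - 1]] = None  # vacate the old slot
                primary[slot] = item
                secondary[item] = bin(rank)[2:]          # a smaller rank takes fewer bits
                break

delete(primary, secondary, "apple")  # a deletion frees a slot that other items may prefer
bits_before = sum(len(b) for b in secondary.values())
compact(primary, secondary)
bits_after = sum(len(b) for b in secondary.values())
print(f"secondary structure: {bits_before} bits before compaction, {bits_after} bits after")
```

The real data structure does this shifting continually and far more cleverly, but the accounting is the same: the closer items sit to their best spots, the fewer bits the secondary structure needs.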
“Before this work, no one had realized you could further compress the data structure by moving information around,” Pagh said. “That was the big insight of the Bender paper.”
The authors showed that their invention established a new upper bound for the most efficient hash tables, meaning that it was the best data structure yet devised in terms of both time and space efficiency. But the possibility remained that someone else might do even better.
Bound to Succeed
The next year, a team led by Huacheng Yu, a computer scientist at Princeton University, tried to improve the Bender team’s hash table. “We worked really hard and couldn’t do it,” said Renfei Zhou, a student at Tsinghua University in Beijing and a member of Yu’s team. “That’s when we suspected that their upper bound was [also] a lower bound” — the best that can possibly be achieved. “When the upper bound equals the lower bound, the game is over, and you have your answer.” No matter how clever you are, no hash table can do any better.
To find out whether that hunch was correct, Yu’s team employed a novel strategy: calculating a lower bound from first principles. First, they reasoned that to perform an insertion or a deletion, a hash table — or, really, any data structure — must access the computer’s memory some number of times. If they could figure out the minimum number of accesses a space-efficient hash table needs, they could multiply that by the time required per access (a constant), giving them a lower bound on the runtime.
But if they didn’t know anything about the hash table (except that it was space-efficient), how could the researchers figure out the minimum number of times it must access memory? They derived it purely from theory, drawing on a seemingly unrelated field called communication complexity, which studies how many bits are required to convey information between two parties. Eventually, the team succeeded: They figured out the minimum number of times a space-efficient data structure must access its memory per operation.