
Comments (10)

matthieu-m commented on August 17, 2024

https://raw.githubusercontent.com/joaquintides/usingstdcpp2023/main/More%20than%20a%20rehash.pdf

It may actually be worth looking into the design of boost's new hash table to see if it's worth adopting it instead of SwissTable for hashbrown. They claim that it's faster in benchmarks.

There are two key differences, from what I can tell, which I would guess are independent from one another. See slide 24 (Mechanics):

  1. Instead of using a residual (h2) in the [0, 127] range, they use one in the [2, 255] range, nearly doubling the residual value space and thus roughly halving the collision rate.
  2. They use groups of 15 elements instead of 16, and use the leading metadata byte as a bit-indexed overflow marker. The advantage of an overflow marker is that it may allow the quadratic probing sequence to stop early even when all other slots of the group are non-empty; the advantage of making it bit-indexed is that it packs 8 markers into 1 byte, so that a lone overflowing element only forces the quadratic probing sequence to continue in 1/8th of look-ups.

I have no idea which of the two differences could have the bigger impact.

With that said, the first is easy to try; the second, not so much.
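To make the second difference concrete, here is a rough sketch of a bit-indexed overflow marker. The names and layout are my own illustration, not boost's actual code: when an element's probe sequence passes over a full group, one bit of the group's leading metadata byte is set, chosen from the element's hash; a later look-up whose bit is clear can stop probing even though the group is full.

```rust
// Illustrative sketch of a bit-indexed overflow marker
// (hypothetical names, not boost::unordered's actual code).

/// Select which of the 8 overflow bits a given hash maps to.
fn overflow_bit(hash: u64) -> u8 {
    1u8 << (hash % 8)
}

/// Record that an element with `hash` overflowed past this full group.
fn mark_overflow(marker: &mut u8, hash: u64) {
    *marker |= overflow_bit(hash);
}

/// During look-up: if the bit for `hash` is clear, no element with this
/// hash ever overflowed this group, so the probe sequence can stop here
/// even though every slot of the group is occupied.
fn may_have_overflowed(marker: u8, hash: u64) -> bool {
    marker & overflow_bit(hash) != 0
}

fn main() {
    let mut marker = 0u8;
    mark_overflow(&mut marker, 42); // 42 % 8 == 2, so bit 2 is set
    assert!(may_have_overflowed(marker, 42));
    // A hash mapping to a different bit aborts the probe immediately.
    assert!(!may_have_overflowed(marker, 43)); // 43 % 8 == 3
    println!("marker = {marker:#010b}");
}
```

Since 8 bits cover 256 residual values, each bit stands in for 32 residuals, which is where the "1/8th of look-ups" figure for a lone overflowing element comes from.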


The logic for the first can be found at https://github.com/boostorg/unordered/blob/develop/include/boost/unordered/detail/foa/core.hpp#L335-L390 for SSE2, and a bit further down for Neon.

In short:

  • Empty is 0.
  • Deleted is 1.
  • The residual is the low byte of the hash, and if it collides with a special value, it's bumped by 8.

The choice of 8 seems fairly arbitrary, though I note that bumping by a multiple of 8 preserves which overflow bit the residual maps to, so that all 256 values remain evenly split across the 8 bits of the overflow marker.
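A sketch of that residual computation, as I read the linked core.hpp (not a verbatim port):

```rust
// Sketch of boost::unordered's reduced-hash computation, as read from
// the linked core.hpp (not a verbatim port).

const EMPTY: u8 = 0;
const DELETED: u8 = 1;

/// The residual is the low byte of the hash; if it collides with one of
/// the two special sentinel values, it is bumped by 8 into the [2, 255]
/// usable range.
fn residual(hash: u64) -> u8 {
    let low = hash as u8;
    if low == EMPTY || low == DELETED { low + 8 } else { low }
}

fn main() {
    assert_eq!(residual(0x1234_5600), 8); // low byte 0 collides with EMPTY
    assert_eq!(residual(0x1234_5601), 9); // low byte 1 collides with DELETED
    assert_eq!(residual(0x1234_56ab), 0xab); // other low bytes pass through
}
```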

from hashbrown.

matthieu-m commented on August 17, 2024

Comment from joaquintides, general hash map guru, and more recently one of the authors of boost::unordered_flat_map:

I haven't observed the situation you describe. Take into account, however, that "holes" will be reused upon new insertions, so it's not as if they keep growing indefinitely. Also, in Abseil the natural distribution of occupied slots and holes is quite random to begin with; see slide 29 of

https://raw.githubusercontent.com/joaquintides/usingstdcpp2023/main/More%20than%20a%20rehash.pdf

boost::unordered_flat_map, on the other hand, tends to have clustered elements (which is good for cache locality), so relocation upon erasure could be beneficial, but that would impose extra overhead on erase and invalidate iterators to the relocated elements. Some hash tables do this, for instance ankerl::unordered_dense::map.

So, as expected, my idea isn't original, though it could potentially be beneficial.


Amanieu commented on August 17, 2024

I'm hesitant to add this because it adds an extra cost when removing an element, and the benefit is not obvious. I would be more inclined to accept this if you have a benchmark that shows a clear performance benefit.


matthieu-m commented on August 17, 2024

I submitted #507 as an attempt to measure the performance loss from leaving holes behind.

I can't be certain that the benchmark is doing the right thing. In particular, I attempted to construct benchmarks that place elements exactly where I'd want them for maximum effect, by exploiting the particulars of the implementation, and it's hard to judge whether I succeeded.

The benchmark results, on my machine (11th Gen Intel(R) Core(TM) i9-11900K @ 3.50GHz), are as follows:

test swiss_cheese_empty_cluster_l2_quarter          ... bench:     107,253 ns/iter (+/- 2,729)
test swiss_cheese_empty_random_l2_quarter           ... bench:     109,898 ns/iter (+/- 6,124)
test swiss_cheese_lone_cluster_l2_quarter           ... bench:   4,141,660 ns/iter (+/- 245,388)
test swiss_cheese_lone_random_l2_quarter            ... bench:   4,106,439 ns/iter (+/- 63,492)
test swiss_cheese_eigth_cluster_l2_quarter          ... bench:   8,329,706 ns/iter (+/- 603,492)
test swiss_cheese_eigth_random_l2_quarter           ... bench:   8,366,143 ns/iter (+/- 841,366)
test swiss_cheese_quarter_cluster_l2_quarter        ... bench:  17,091,345 ns/iter (+/- 851,073)
test swiss_cheese_quarter_random_l2_quarter         ... bench:  16,941,942 ns/iter (+/- 734,029)
test swiss_cheese_half_cluster_l2_quarter           ... bench:  33,936,353 ns/iter (+/- 2,089,683)
test swiss_cheese_half_random_l2_quarter            ... bench:  33,779,245 ns/iter (+/- 901,914)
test swiss_cheese_full_cluster_l2_quarter           ... bench:  66,712,969 ns/iter (+/- 2,283,823)
test swiss_cheese_full_random_l2_quarter            ... bench:  67,051,106 ns/iter (+/- 1,346,535)

test swiss_cheese_empty_cluster_l2_three_quarters   ... bench:      35,830 ns/iter (+/- 1,213)
test swiss_cheese_empty_random_l2_three_quarters    ... bench:      35,789 ns/iter (+/- 1,370)
test swiss_cheese_lone_cluster_l2_three_quarters    ... bench:     889,015 ns/iter (+/- 32,123)
test swiss_cheese_lone_random_l2_three_quarters     ... bench:     889,737 ns/iter (+/- 70,205)
test swiss_cheese_eigth_cluster_l2_three_quarters   ... bench:   1,811,351 ns/iter (+/- 176,593)
test swiss_cheese_eigth_random_l2_three_quarters    ... bench:   1,813,937 ns/iter (+/- 124,895)
test swiss_cheese_quarter_cluster_l2_three_quarters ... bench:   3,679,451 ns/iter (+/- 225,465)
test swiss_cheese_quarter_random_l2_three_quarters  ... bench:   3,664,752 ns/iter (+/- 118,518)
test swiss_cheese_half_cluster_l2_three_quarters    ... bench:   7,550,932 ns/iter (+/- 707,679)
test swiss_cheese_half_random_l2_three_quarters     ... bench:   7,335,543 ns/iter (+/- 453,063)
test swiss_cheese_full_cluster_l2_three_quarters    ... bench:  14,246,891 ns/iter (+/- 464,697)
test swiss_cheese_full_random_l2_three_quarters     ... bench:  14,968,833 ns/iter (+/- 724,493)

test swiss_cheese_empty_cluster_l2_single           ... bench:      26,941 ns/iter (+/- 1,121)
test swiss_cheese_empty_random_l2_single            ... bench:      26,936 ns/iter (+/- 1,546)
test swiss_cheese_lone_cluster_l2_single            ... bench:     575,818 ns/iter (+/- 24,934)
test swiss_cheese_lone_random_l2_single             ... bench:     576,248 ns/iter (+/- 29,015)
test swiss_cheese_eigth_cluster_l2_single           ... bench:   1,201,152 ns/iter (+/- 98,435)
test swiss_cheese_eigth_random_l2_single            ... bench:   1,222,738 ns/iter (+/- 145,428)
test swiss_cheese_quarter_cluster_l2_single         ... bench:   2,464,260 ns/iter (+/- 191,353)
test swiss_cheese_quarter_random_l2_single          ... bench:   2,402,873 ns/iter (+/- 135,769)
test swiss_cheese_half_cluster_l2_single            ... bench:   4,849,259 ns/iter (+/- 534,081)
test swiss_cheese_half_random_l2_single             ... bench:   4,831,783 ns/iter (+/- 449,949)
test swiss_cheese_full_cluster_l2_single            ... bench:   9,451,680 ns/iter (+/- 457,181)
test swiss_cheese_full_random_l2_single             ... bench:   9,500,475 ns/iter (+/- 372,736)

test swiss_cheese_empty_cluster_l2_double           ... bench:      13,509 ns/iter (+/- 362)
test swiss_cheese_empty_random_l2_double            ... bench:      13,595 ns/iter (+/- 832)
test swiss_cheese_lone_cluster_l2_double            ... bench:     216,011 ns/iter (+/- 10,948)
test swiss_cheese_lone_random_l2_double             ... bench:     216,045 ns/iter (+/- 16,825)
test swiss_cheese_eigth_cluster_l2_double           ... bench:     495,331 ns/iter (+/- 58,196)
test swiss_cheese_eigth_random_l2_double            ... bench:     448,442 ns/iter (+/- 25,042)
test swiss_cheese_quarter_cluster_l2_double         ... bench:     959,318 ns/iter (+/- 68,571)
test swiss_cheese_quarter_random_l2_double          ... bench:     941,993 ns/iter (+/- 57,970)
test swiss_cheese_half_cluster_l2_double            ... bench:   1,870,896 ns/iter (+/- 114,250)
test swiss_cheese_half_random_l2_double             ... bench:   1,883,339 ns/iter (+/- 192,821)
test swiss_cheese_full_cluster_l2_double            ... bench:   3,766,748 ns/iter (+/- 106,637)
test swiss_cheese_full_random_l2_double             ... bench:   3,756,684 ns/iter (+/- 228,687)

Where:

  • swiss_cheese is a fixed prefix to allow running only these benchmarks.
  • Followed by the "density" of the hashmap:
    • empty: completely empty; all look-ups are misses, and no elements are accessed.
    • lone: for each group, 16 elements are inserted and only the first is retained; all look-ups are on retained elements.
    • eigth: for each group, 16 elements are inserted and only the first and ninth are retained; all look-ups are on retained elements.
    • quarter: for each group, 16 elements are inserted and only the first, fifth, ninth, and thirteenth are retained; all look-ups are on retained elements.
    • half: for each group, 16 elements are inserted and only the odd ones are retained; all look-ups are on retained elements.
    • full: for each group, 16 elements are inserted and all are retained; all look-ups are on retained elements.
  • Followed by the walk strategy:
    • cluster: iterates over the first 2 elements of each group, then the next 2, and so on.
    • random: iterates over the first element of each group, then the next element, and so on, maximally spacing out hits to the same cache line.
  • Followed by l2, as the benchmarks are tuned to use 16 times the L2 cache (and thus fit in the L3 cache).
  • Followed by the size of the element, relative to size of the cache line.

Due to the structure of the benchmark, there are simply more elements as the element size decreases, in order to reach 16 times the L2 cache size, so benchmarks cannot be compared across element sizes.

Within an element size, however, the trend is fairly clear:

  • Whether the walk is cluster or random has barely any impact, even though cluster was expected to trigger half the cache misses of random.
  • Runtime roughly doubles as the number of look-ups performed doubles.

The fact that cluster yields no significant advantage over random suggests, to me, that cache misses have little impact here.

I would appreciate a review of the strategy and implementation, if you have time.

Unless I made a mistake -- which is not all that unlikely -- there doesn't seem to be any reason to try to improve the locality of elements within a group any further, and thus this issue is moot.


Amanieu commented on August 17, 2024

I don't expect cache locality to have much of an impact in practice because the access pattern of hash table lookup is effectively random due to the hash function. Additionally, due to the way we scan control bytes with SIMD instructions, we will immediately "seek" to the first entry with a matching H2 hash (matching top 7 bits of the hash value) without looking at other elements. This means that unless we get a false positive on the H2 value (<1% chance per lookup) then the layout of elements within a group is irrelevant because we are only ever accessing a single element in that group.
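The H1/H2 split described here can be sketched roughly as follows (a simplified illustration; hashbrown's real code differs in details):

```rust
// Simplified sketch of the H1/H2 split described above
// (illustrative, not hashbrown's exact code).

/// H1: the bits used to select the starting group in the table.
fn h1(hash: u64) -> usize {
    hash as usize
}

/// H2: the top 7 bits of the hash, stored in the control byte. A SIMD
/// scan of a group's control bytes finds slots matching this value, so a
/// look-up only touches elements whose H2 matches; with a 7-bit value,
/// false positives occur in well under 1% of probes.
fn h2(hash: u64) -> u8 {
    (hash >> (64 - 7)) as u8
}

fn main() {
    let hash = 0xfedc_ba98_7654_3210u64;
    let _group = h1(hash) % 16; // group selection for a 16-group table
    assert_eq!(h2(hash), 0x7f); // top 7 bits of 0xfe... are all ones
    assert!(h2(12345) < 128); // H2 always fits in 7 bits
}
```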

The only case where this could conceivably make a difference might be iterating over all the elements of a hash table, but this is a relatively rare use case that we shouldn't be optimizing for: removing an entry is a much more common operation than iteration.



matthieu-m commented on August 17, 2024

The only case where this could conceivably make a difference might be iterating over all the elements of a hash table, but this is a relatively rare use case that we shouldn't be optimizing for: removing an entry is a much more common operation than iteration.

I don't think iteration should be prioritized either.

I don't expect cache locality to have much of an impact in practice because the access pattern of hash table lookup is effectively random due to the hash function.

I disagree with the assessment.

In essence, we can think of hash-table access as randomly accessing elements of an array. Let's imagine two access patterns:

  • Access only the first half of the array.
  • Access only the odd elements of the array.

Both access patterns access the same number of elements, but the first accesses half as many cache lines as the second.

If accesses are infrequent, then it shouldn't matter: whatever cache line was cached is most likely long gone. If accesses are very frequent, I would expect the cache lines to stay cached, unless there are too many of them for the cache, and thus I'd expect an access pattern that minimizes the number of cache lines touched to also minimize the number of cache misses.
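A back-of-the-envelope illustration of the two patterns, assuming 32-byte elements and 64-byte cache lines (the numbers are purely illustrative):

```rust
use std::collections::HashSet;

// Count distinct 64-byte cache lines touched by each access pattern over
// an array of 32-byte elements (illustrative model, not a measurement).
const ELEM_SIZE: usize = 32;
const LINE_SIZE: usize = 64;

fn lines_touched(indices: impl Iterator<Item = usize>) -> usize {
    indices
        .map(|i| i * ELEM_SIZE / LINE_SIZE) // cache line holding element i
        .collect::<HashSet<_>>()
        .len()
}

fn main() {
    let n = 1024;
    // Pattern 1: access only the first half of the array.
    let first_half = lines_touched(0..n / 2);
    // Pattern 2: access only the odd elements of the array.
    let odd = lines_touched((0..n).filter(|i| i % 2 == 1));
    // Same number of elements accessed, but twice as many lines touched.
    assert_eq!(first_half * 2, odd);
    println!("first half: {first_half} lines, odd: {odd} lines");
}
```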

Benchmarks seem to disprove the effect -- so either I'm bad at intuiting, or I'm bad at designing benchmarks.

This means that unless we get a false positive on the H2 value (<1% chance per lookup) then the layout of elements within a group is irrelevant because we are only ever accessing a single element in that group.

Despite being random, the accesses are clustered in time, which should lead to cache hits for a certain proportion of elements.

Additionally, due to the way we scan control bytes with SIMD instructions, we will immediately "seek" to the first entry with a matching H2 hash (matching top 7 bits of the hash value) without looking at other elements.

The effect of H2 is quite different depending on scenarios.

In particular, the scenario benchmarked in the above PR was for known-present elements with distinct H2 values within a group, in which case the only benefit of H2 is in helping locate the exact offset of the element. No benefit from "skipping" elements -- as occurs when looking up absent elements, or "colliding" present elements -- should be seen in such a case.


ktprime commented on August 17, 2024

Boost's flat hash map is considerably faster than Swiss Table in more than 10 different third-party benchmarks.
c++ hash benchmark.


matthieu-m commented on August 17, 2024

@ktprime Yes, unfortunately that's not that helpful.

Firstly, hashbrown is inspired by Swiss Table, but is not an exact port as far as I can tell. While at a high level it roughly follows Swiss Table, its performance in a benchmark is not necessarily the same, so benchmarks against the C++ implementation of Swiss Table are not an exact match.

Still, I've been investigating the most obvious differences (254-value residuals and overflow tracking) in an attempt to check whether they could improve performance. Overall, though, the patches are not that great performance-wise. That's not unexpected, as both only offer a performance boost in relatively rare situations, while having a potential cost for every element.

Unfortunately, my rough attempts did not show any great improvement. Performance is roughly on par, often with more regressions than improvements. This suggests that while in theory they could improve performance, in practice they may require significant micro-optimization effort, or larger-scale integration changes, for those improvements to kick in. It's hard to guess which path would be more fruitful, and costly (time-wise) to investigate them all.

Still, the PRs are out there for anyone who wishes to investigate them further:

  • #513: Investigates switching to 254-value residuals, instead of 128-value residuals.
  • #517: Investigates switching to overflow tracking, with various overflow-tracking strategies (F14's, boost's flat hash map's, hybrids, ...).


ktprime commented on August 17, 2024

I agree with you. I have also ported a Swiss table implementation, which has slightly slower performance compared to the Boost version.

Most benchmarks consist of simple loops that focus on testing find, hit/miss, and insertion operations. However, in real-world code, the scenario is different, especially when metadata is frequently accessed and needs to be kept hot in the cache. In such cases, I prefer to use a flat hash map where the key-value pair and metadata are located in the same slot, even if its performance may not be as good as the benchmark results suggest.

Based on my benchmark results, if the data fits fully in the CPU cache, or if the metadata cannot fully reside in the cache (e.g., when the size of the hash map is less than 10,000 or greater than 500,000,000), a SIMD-based hash map does not hold any advantage over other hash map designs.

