TransWikia.com

unordered_map excess calls to hash function

Stack Overflow Asked by Amir Kirsh on November 29, 2021

The following code results with unexplained calls to the hash function:

namespace foo {
    using Position = tuple <int, int, int>;
    
    std::ostream& operator<<(std::ostream& out, const Position& pos) noexcept{
        return out << get<0>(pos) << ", " << get<1>(pos) << ", " << get<2>(pos);
    }

    struct hashFunc{
        std::size_t operator()(const Position& pos) const noexcept{
            int res = get<0>(pos) * 17 ^ get<1>(pos) * 11 ^ get<2>(pos);
            cout << "@@@ hash function called for key: " << pos 
                 << ", hash: " << res << endl;
            return res;
        }
    };

    template<typename T>
    void print_buckets(T&& map) {
        auto num_buckets = map.bucket_count();
        cout << "------------------------------" << endl;
        cout << "NUM BUCKETS: " << num_buckets << endl;
        for(size_t i=0; i<num_buckets; ++i) {
            auto bucket_size = map.bucket_size(i);
            if(bucket_size) {
                cout << "BUCKET " << i << " size: " << bucket_size << endl;        
            }
        }
        cout << "------------------------------" << endl;
    }
}

Main:

using namespace foo;

int main() {
    // note: bucket_count specified
    unordered_map <Position, std::string, hashFunc> test(10); 
    
    auto x = tuple{1,0,0};
    auto z = tuple{0,1,0};
    auto w = tuple{0,0,1};
            
    cout << "==================================" << endl;
    cout << "about to insert: " << x << endl;
    test[x] =  "hello";
    print_buckets(test);
    cout << "after insert of: " << x << endl;
    
    cout << "==================================" << endl;
    cout << "about to insert: " << z << endl;
    test[z] = "hey";
    print_buckets(test);
    cout << "after insert of: " << z << endl;
    
    cout << "==================================" << endl;
    cout << "about to insert: " << w << endl;
    test.insert({w, "hello"});
    print_buckets(test);
    cout << "after insert of: " << w << endl;    
    cout << "==================================" << endl;
}

Output:

==================================
about to insert: 1, 0, 0
@@@ hash function called for key: 1, 0, 0, hash: 17
------------------------------
NUM BUCKETS: 11
BUCKET 6 size: 1
------------------------------
after insert of: 1, 0, 0
==================================
about to insert: 0, 1, 0
@@@ hash function called for key: 0, 1, 0, hash: 11
@@@ hash function called for key: 1, 0, 0, hash: 17   <= why?
------------------------------
NUM BUCKETS: 11
@@@ hash function called for key: 1, 0, 0, hash: 17   <= why?
BUCKET 0 size: 1
BUCKET 6 size: 1
------------------------------
after insert of: 0, 1, 0
==================================
about to insert: 0, 0, 1
@@@ hash function called for key: 0, 0, 1, hash: 1
@@@ hash function called for key: 0, 1, 0, hash: 11   <= why?
------------------------------
NUM BUCKETS: 11
@@@ hash function called for key: 1, 0, 0, hash: 17   <= why?
BUCKET 0 size: 1
@@@ hash function called for key: 0, 1, 0, hash: 11   <= why?
BUCKET 1 size: 1
BUCKET 6 size: 1
------------------------------
after insert of: 0, 0, 1
==================================

Code (same behavior for gcc and clang)


Notes:

1. Trying the same without the bucket_count parameter for the constructor, calls to hash function become even more excessive, due to rehash. But in the scenario above it seems that there is no rehash and there are no collisions.

2. Related, but specifically on MSVC: Inserting to std::unordered_map calls hash function twice in MSVC++'s STL, bad design or special reason?

4 Answers

As others pointed out, an unordered map, which is just a form of a hash table, is in libstdc++ implemented basically just as a single ("global") linked list. Plus, there is an array of buckets that point into this list. What is important is that the pointer stored in bucket[i] does not point to the first node that belongs to this bucket (according to hash function mapping), but to its predecessor in the global list instead. The reason is obvious — when you add an item into the singly-linked list, you need to update its predecessor. Here, when you need to insert an element into some bucket, you need to update the predecessor of the first node of this bucket.

However, the very first node of the global linked list does not have any predecessor. To make things unified, there is a sentinel node that plays this role. In libstdc++, it's a member variable _M_before_begin.

Let's assume that we have a hash table with keys A and B that belong to bucket[0] and a key C that belongs to bucket[1]. It may, for instance, look like as follows:

global linked list          buckets[]
------------------          ---------

_M_before_begin  <--------  bucket[0]
       |
       v
node_with_key_A 
       |
       v
node_with_key_B  <--------  bucket[1]
       |
       v
node_with_key_C
       |
       x

Now, when a new key, say D, is added into an empty bucket, say bucket[2], libstdc++ inserts it at the beginning of the global linked list.

Therefore, the situation after this insertion is as follows:

global linked list          buckets[]
------------------          ---------

_M_before_begin  <--------  bucket[2]
       |
       v
node_with_key_D  <--------  bucket[0]
       |
       v
node_with_key_A 
       |
       v
node_with_key_B  <--------  bucket[1]
       |
       v
node_with_key_C
       |
       x

Note that bucket[0] that corresponds with node_with_key_A pointed to by _M_before_begin needs to be updated. And, since, as again pointed to by others, libstdc++ does not cache hash values by default, the only option how to find a bucket index for node_with_key_A is to trigger a hash function.

Note that basically I just said the same as others, but wanted to add some illustrations that may help.


Another consequence of this approach is that hash function might be called during lookup: https://godbolt.org/z/K6qhWc. The reason is that the first element for some bucket is known, but not the last one. Therefore, the hash function for node keys needs to be resolved to find out whether a node still belongs to the actual bucket during the linked list traversal.

Answered by Daniel Langr on November 29, 2021

Firstly, a couple of observations:

  • The unordered map is both a hash table, and a singly-linked list.

    See here that begin returns an iterator which models LegacyForwardIterator.

  • Inserting an entry into the map requires updating both the hash table and the linked list.

Secondly, a couple of notes on these containers' implementation decisions:

  • For singly-linked lists, it's common to have a sentinel node which doesn't contain any data (for something like Node<T>, it'll still have a T, just default-initialized). We only want it for its next pointer, because it helps keep list operations regular (ie, we don't have to write insert-at-the-head and insert-after-node as different special cases).

  • For hash tables (assuming linked-list buckets, since it's required by the standard) we can either use Node table[N] (so each bucket has its own sentinel preallocated) or Node* table[N].

    In this case, since we're actually using Node<T> and don't know the size of T, it seems reasonable to store a pointer for each bucket.

  • For a hash table which is also a singly-linked list, it makes sense to use the per-bucket list as (part of) the all-elements list. Otherwise we'd need to store two pointers per node, next_in_bucket and next_in_list.

    This means that the "sentinel" (one-before-the-beginning) node pointed to by a bucket is actually the last node of the previous bucket ... except for the bucket at the front of the list, when it really is the overall list sentinel.

    The comments in the code say

      /* ...
      *  The non-empty buckets contain the node before the first node in the
      *  bucket. This design makes it possible to implement something like a
      *  std::forward_list::insert_after on container insertion and
      *  std::forward_list::erase_after on container erase
      *  calls. _M_before_begin is equivalent to
      *  std::forward_list::before_begin. Empty buckets contain
      *  nullptr.  Note that one of the non-empty buckets contains
      *  &_M_before_begin which is not a dereferenceable node so the
      *  node pointer in a bucket shall never be dereferenced, only its
      *  next node can be.
    

    (the sentinel is _M_before_begin in this code)

So, when we add an element to an already-populated bucket, the steps are roughly

void insert_to_non_empty_bucket(Node *n, Key k) {
  Node *sentinel = table[k];
  n->next = sentinel->next;
  sentinel->next = n;
}

Note again that we don't know or care whether the sentinel here is the last element of the previous bucket, or the overall list sentinel. The code is the same either way (which was one of the reasons for using a sentinel in the first place).

However, when we add the first element to an empty bucket (and it is not the only non-empty bucket), we have one extra step: we need to update the sentinel pointer for the next bucket, to point to our new node. Otherwise we'd have two buckets both pointing to the list sentinel.

void insert_to_empty_bucket(Node *n, Key k) {
  Node *sentinel = &list_sentinel; // ie, &_M_before_begin
  n->next = sentinel->next;
  sentinel->next = n;

  // update the *next* bucket in the table
  table[n->next->key] = n;
}

Finally: in this implementation, Node doesn't cache the key, so there is no n->next->key. There is actually a trait controlling this, but it's clearly false in this case, which means that final line has to re-compute the hash in order to update the next bucket.


NB. just to clarify, when I say previous bucket or next bucket, I'm just talking about position in the list, where buckets appear in reverse order of when they became non-empty. It doesn't have anything to do with position in the table, or imply any intrinsic ordering.

Answered by Useless on November 29, 2021

Maybe it's the implementation of std::unordered_map. The hash_value is not stored into each node. So, it will re-hash the first element in next bucket when inserting new element or calcualte the bucket size.

You can try to use <tr1/unordered_map> to avoid this issue. Example:

#include <tr1/unordered_map>
using std::tr1::unordered_map;

NOTE: I don't know if tr1/unordered_map or unordered_map is better.

Answered by binhgreat on November 29, 2021

I can't explain why it is done that way, but it doesn't fit in a comment, so I leave it here in the answer section. You have two parts in the stdlib (10.1.0) upon insertion of an element:

__hash_code __code = __h->_M_hash_code(__k);

Which calculates the hash value of the element to insert __k.

And later on this part of the code:

    {
      // The bucket is empty, the new node is inserted at the
      // beginning of the singly-linked list and the bucket will
      // contain _M_before_begin pointer.
      __node->_M_nxt = _M_before_begin._M_nxt;
      _M_before_begin._M_nxt = __node;
      if (__node->_M_nxt)
        // We must update former begin bucket that is pointing to
        // _M_before_begin.
        _M_buckets[_M_bucket_index(__node->_M_next())] = __node;
      _M_buckets[__bkt] = &_M_before_begin;
    }

Where _M_bucket_index calculates the hash for __node->_M_next(), __node referes to the node created for __k.

Maybe that helps you or someone else to explain it further.

Answered by t.niese on November 29, 2021

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP