HashSet and principle analysis in .NET

In .NET, a series of collection classes are provided in the namespace. HashSet is defined in it and is a generic collection that is not repeated and unordered. This article learns the working principle of HashSet.

The principle of hash table

HashSet is implemented based on the principle of hash table. To learn HashSet, you must first understand the hash table.

A hash table (also called hash table) is a data structure that directly accesses the storage location based on the key. It maps the data required to query to a position in the table through a key-value function, which speeds up the search speed.

The above function is a hash function. The hash function should be calculated as simple as possible to improve the insertion and retrieval efficiency; the calculated addresses should be distributed as evenly as possible to reduce hash conflicts; it should have greater compression to save memory. Common hash function construction methods include direct addressing method, residual remainder method, numerical analysis method, etc. HashSet uses the method of dividing the remainder of the element, dividing the HashCode of an element by the remainder of a certain constant (hash table Size) as the address. The constant usually selects a prime number.

The hash values of two equal objects are the same, but the hash values of two unequal objects are possible, which is a hash conflict. Methods for handling conflicts include open addressing method, linked list method, double hash method, etc. HashSet uses the linked list method to place conflicting elements in the linked list.

Hash table is a data structure used for high-performance set operations. It has the following characteristics:

Unordered, non-repeat; the time complexity of insertion and search is O(1);
No indexing is used;
Automatic capacity expansion when the capacity is insufficient, but the capacity expansion cost is high;
It can provide many high-performance collection operations, such as merging, cropping, etc.;

HashSet implementation

HashSet has two arrays built in, as follows. _buckets stores the index value calculated by the hash function. The value in _buckets starts from 1, so -1 is required when using it. This value is the relative index of the _entries array. If there is no conflict, the value pointed to is the relative index of the element to be searched. If a conflict occurs, the element can also be quickly located according to the conflict list. _entries stores Entry objects, and the Entry type is as follows. HashCode is the hash value of an element, which is used in search, insert, delete, expand capacity and other operations. Value stores data. Next has different functions at different moments. When Entry is in the list, it forms a conflicting link list, and its Next points to the next element of the conflicting link list, and the Next value of the last element of the linked list is -1; if Entry has been deleted by the list, it forms a vacant link list, and its Next points to the next element of the vacant link list, and the last element of the vacant link list is -2.

HashSet also has several key members: _count, _freeList, _freeCount. _count indicates the number of elements added, note that it is not the number of elements actually stored, because it is not updated when the element is deleted. _freeList is the empty list header, whose value points to the deleted _entries index, and _entries[_freeList].Next points to the relative position of the next empty. _freeCount represents the number of empty spaces, and the actual number of elements stored in the list is _count - _freeCount.

private int[]? _buckets; // _entries index arrayprivate Entry[]? _entries; // Entity array
private int _count; // The actual number of elements storedprivate int _freeList; // The header index of the empty list, initial value -1private int _freeCount; // Number of vacant seatsprivate struct Entry
{
    public int HashCode;
    public int Next;
    public T Value;
}
public int Count =&gt; _count - _freeCount; // _countDeleted elements will not be recorded，Therefore, the actual number of elements is_count - _freeCount

HashSet provides multiple constructor overloading, and _buckets and _entries are not initialized if no parameters are passed. When adding an element, Initialize(0) is called. The Initialize method accepts an int parameter that indicates the list capacity that needs to be initialized. The actual initialized list capacity is the minimum prime number greater than or equal to that value. The prime number is taken as the list length because this value is used as the divisor of the hash function constructed using the divisor remainder method, and the distribution of the results of the prime number remnants is more uniform, reducing the occurrence of conflicts.

private int Initialize(int capacity)
{
    int size = (capacity); // Get the minimum prime number of >=capacity    var buckets = new int[size];
    var entries = new Entry[size];
    // ...
    return size;
}

When searching for an element, first call the GetHashCode method of the element to calculate the hash value, and then call the GetBucketRef method to perform the hash function operation to obtain the index. The return value of GetBucketRef is -1, and if i is -1, no element is found. If i>=0, it means that there is an element in the list with the same hash value as the element to be searched, but the equal hash value does not necessarily mean that the elements are equal. You need to further judge the HashCode. If the HashCode is equal, then determine whether the elements are equal. If the elements are satisfied, find the element and return the index i of _entries.

private int FindItemIndex(T item)
{
    // ...
    int hashCode = item != null ? () : 0;
    if (typeof(T).IsValueType)
    {
        int i = GetBucketRef(hashCode) - 1; // The _buckets element starts at 1        while (i &gt;= 0) // There is the same hash value as the item, and it does not necessarily have the item.        {
            ref Entry entry = ref entries[i];
            if ( == hashCode &amp;&amp; EqualityComparer&lt;T&gt;.(, item))
            {
                return i; // HashCodes are equal and the elements are equal, then the element is found and the _entries index is returned            }
            i = ;

            collisionCount++;
            // ...
        }
    }
    // ...
    return -1;
}

private ref int GetBucketRef(int hashCode)
{
    int[] buckets = _buckets!;
    return ref buckets[(uint)hashCode % (uint)]; // Construct the hash function using the residual method}

When inserting an element, first, it will look for whether the element to be inserted exists. HashSet is not repeated, so if the inserted element already exists, it will directly return false. If no element exists, the index where the element is stored will be searched. If there is a deleted vacancy, the element is placed on the vacancy pointed to by _freeList; if there is no vacancy, the elements are inserted in the order of _entries. After finding the index, you can assign the HashCode and element of the element to the corresponding field of _entries[index]. When there is no conflict, the Next value is -1; if there is a conflict, a linked list is formed and added to the header of the linked list, and Next points to the next position of the conflict.

private bool AddIfNotPresent(T value, out int location)
{
    bucket = ref GetBucketRef(hashCode); // bucket is ref int type. If there is no conflict, bucket should be 0, because the default value of int is 0    // ...

    int index;
    if (_freeCount &gt; 0) // There is a deleted vacancy, put the element on the vacancy    {
        index = _freeList;
        _freeCount--; // Update the number of empty spaces after deletion        _freeList = StartOfFreeList - entries[_freeList].Next; // Update the empty space index    }
    else // Store elements in _entries order    {
        int count = _count;
        if (count == ) // Insufficient capacity, expansion        {
            Resize();
            bucket = ref GetBucketRef(hashCode);
        }
        index = count;
        _count = count + 1;
        entries = _entries;
    }

    {   // Assignment        ref Entry entry = ref entries![index];
         = hashCode;
         = bucket - 1; // If there is no conflict, it is -1, otherwise a linked list is formed, pointing to the next element index of the conflict         = value;
        bucket = index + 1; // Assign a value to bucket here, that is, change the corresponding element of _buckets, that is, the _buckets value indexed with the hash value of the insert element pointed to the index of the _entries that store the inserted element +1        _version++;
        location = index;
    }

    // ...
    return true;
}

If the list capacity is insufficient during insertion, the Resize method will be called to expand the capacity. The size after expansion is the minimum prime number that is greater than or equal to 2 times the original size. After obtaining the size to be expanded, re-allocate the entries memory at the new size and call the method to copy the original content to the new location. Since the length of the list changes, the hash value will change, so the content of _buckets (_entries index) needs to be updated, and the same value must be updated.

private void Resize() =&gt; Resize((_count), forceNewHashCodes: false);

public static int ExpandPrime(int oldSize)
{
    int num = 2 * oldSize;
    if ((uint)num &gt; 2146435069u &amp;&amp; 2146435069 &gt; oldSize)
    {
        return 2146435069;
    }

    return GetPrime(num); // Returns the minimum prime number twice the original size}

private void Resize(int newSize, bool forceNewHashCodes)
{
    var entries = new Entry[newSize];
    (_entries, entries, count);
    // ...

    _buckets = new int[newSize];
    for (int i = 0; i &lt; count; i++)
    {
        ref Entry entry = ref entries[i];
        if ( &gt;= -1) // Update _buckets content        {
            ref int bucket = ref GetBucketRef(); // Get the hash value obtained by the hash function with the new size as the divisor             = bucket - 1; // Update Next            bucket = i + 1; // Update the element of _buckets, pointing to the recalculated _entry index +1        }
    }

    _entries = entries;
}

When deleting an element, first look for whether the element to be deleted exists. If there is a conflict in the hash value, the previous index of the conflicted linked list will be recorded. After finding the element, you need to update the pointer to the conflicting linked list. After deleting an element, the number of _freeCount vacancy will be updated, and the delete element index will be assigned to _freeList, the delete vacancy will be recorded, and the vacancy will be added to the vacancy list header, and its Next points to the relative position of the next vacancy. When an element is inserted, the element is inserted at the empty index of the _freeList record, and the value of the _freeList is updated according to the Next of the vacancies.

public bool Remove(T item)
{
    int last = -1;
    int hashCode = item != null ? (_comparer?.GetHashCode(item) ?? ()) : 0;
    ref int bucket = ref GetBucketRef(hashCode);
    int i = bucket - 1;

    while (i &gt;= 0)
    {
        ref Entry entry = ref entries[i];
        if ( == hashCode &amp;&amp; (_comparer?.Equals(, item) ?? EqualityComparer&lt;T&gt;.(, item)))
        {
            // Find the element to be deleted            if (last &lt; 0) // The element to be deleted is located at the head of the linked list, and the value of the updated _buckets element points to the next position of the linked list            {
                bucket =  + 1;
            }
            else // The element header to be deleted must be updated.                entries[last].Next = ;
             = StartOfFreeList - _freeList; // Vacancy link list, record the relative position of the next vacancy index, update according to the value when inserting_freeList            if (&lt;T&gt;())
                 = default!;
            _freeList = i; // Record the deleted element position, which will be inserted next time an element is inserted            _freeCount++;  // Update the number of empty spaces after deletion            return true;
        }
        last = i; // There is a conflict, record the location on the linked list        i = ;
    }
    return false;
}

Summarize

Through the above analysis, we can see that HashSet is a clever design and flexible data structure. Based on the idea of hash table, the insertion and searching of HashSet is very fast, and only simple calculations are required. Based on this, HashSet also has the conditions for high-performance set operations, which can efficiently perform merge, crop and other operations. But this also causes HashSet to fail to store duplicate elements. The design of deleting the space-time link list is very clever, but it also causes HashSet to be disordered, so the index cannot be used. Therefore, when selecting a data structure, if it is necessary to include duplicate elements, or if it needs to be ordered, other data structures, such as List, should be considered.

In addition, Dictionary and HashSet principle are the same, except that HashSet only has Key and no Value.

Reference article

HashSet Class
Introduction to DotNet Dictionary Implementation
The principle of hash table algorithm

This is all about this article about HashSet in .NET. For more related HashSet content in .NET, please search for my previous articles or continue browsing the related articles below. I hope everyone will support me in the future!