SoFunction
Updated on 2025-03-02

Analysis on the principle of PHP hash table implementation algorithm

In the PHP kernel, one of the most important data structures is HashTable. The arrays we use commonly used are implemented in the kernel using HashTable. So, how is PHP's HashTable implemented? I have been reading the data structure of HashTable recently, but there is no specific implementation algorithm in the algorithm book. I happened to be reading the source code of PHP recently. So I referred to the implementation of HashTable in PHP and implemented a simple version of HashTable myself. I have summarized some experiences. Let me share them with you below.

Introduction to HashTable

Hash table is an effective data structure that implements dictionary operations.

definition

Simply put, HashTable is a data structure of key-value pairs. Supports insertion, search, deletion and other operations. Under some reasonable assumptions, the time complexity of all operations in the hash table is O(1) (those who are interested in the relevant proof can be consulted by themselves).

The key to implementing a hash table

In the hash table, instead of using keywords as subscripts, the hash value of the key is calculated as subscripts through the hash function, and then the hash value of the key is calculated when searching/deleting, so as to quickly locate the location where the element is saved.

In a hash table, different keywords may calculate the same hash value, which is called "hash conflict", which is to deal with the same hash value of two or more keys. There are many ways to resolve hash conflicts, such as open addressing method, zipper method, etc.

Therefore, the key to implementing a good hash table is a good hash function and a method to handle hash conflicts.

Hash function

To determine the quality of a hash algorithm, there are the following definitions:

  • Consistency, equivalent bonds must produce equal hash values;
  • Efficiency, easy calculation;
  • Uniformity, hash all bonds evenly.
  • The hash function establishes the correspondence between the key value and the hash value, that is: h = hash_func(key). The corresponding relationship is shown in the figure below:

hash-exam

Let’s leave it to the experts to design a perfect hash function. We just need to use the existing more mature hash function. The hash function used by the PHP kernel is the time33 function, also known as DJBX33A, and its implementation is as follows:

static inline ulong zend_inline_hash_func(const char *arKey, uint nKeyLength)
{
     register ulong hash = 5381;

    /* variant with the hash unrolled eight times */
    for (; nKeyLength >= 8; nKeyLength -= 8) {
      hash = ((hash << 5) + hash) + *arKey++;
      hash = ((hash << 5) + hash) + *arKey++;
      hash = ((hash << 5) + hash) + *arKey++;
      hash = ((hash << 5) + hash) + *arKey++;
      hash = ((hash << 5) + hash) + *arKey++;
      hash = ((hash << 5) + hash) + *arKey++;
      hash = ((hash << 5) + hash) + *arKey++;
      hash = ((hash << 5) + hash) + *arKey++;
  }

  switch (nKeyLength) {
    case 7: hash = ((hash << 5) + hash) + *arKey++; /* fallthrough... */
    case 6: hash = ((hash << 5) + hash) + *arKey++; /* fallthrough... */
    case 5: hash = ((hash << 5) + hash) + *arKey++; /* fallthrough... */
    case 4: hash = ((hash << 5) + hash) + *arKey++; /* fallthrough... */
    case 3: hash = ((hash << 5) + hash) + *arKey++; /* fallthrough... */
    case 2: hash = ((hash << 5) + hash) + *arKey++; /* fallthrough... */
    case 1: hash = ((hash << 5) + hash) + *arKey++; break;
    case 0: break;
    EMPTY_SWITCH_DEFAULT_CASE()
  }
  return hash;
}

Note: The function uses an 8-loop + switch to optimize the for loop, reducing the number of runs of the loop, and then executing the remaining elements that are not traversed in the switch.

Zipper method

The method of storing all elements with the same hash value in a linked list is called the zipper method. When searching, first calculate the hash value corresponding to the key, then find the corresponding linked list based on the hash value, and finally find the corresponding value in the order of the linked list.

PHP's HashTable Structure

After briefly introducing the data structure of the hash table, continue to see how the hash table is implemented in PHP.

The definition of PHP kernel hashtable:

typedef struct _hashtable {
     uint nTableSize;
     uint nTableMask;
     uint nNumOfElements;
     ulong nNextFreeElement;
     Bucket *pInternalPointer;
     Bucket *pListHead;
     Bucket *pListTail; 
     Bucket **arBuckets;
     dtor_func_t pDestructor;
     zend_bool persistent;
     unsigned char nApplyCount;
     zend_bool bApplyProtection;
     #if ZEND_DEBUG
        int inconsistent;
     #endif
} HashTable;
  • nTableSize, the size of HashTable, grows in multiples of 2
  • nTableMask is used to obtain the index value of the hash value when performing and operation with the hash value. After arBuckets is initialized, it is always nTableSize-1.
  • nNumOfElements, the number of elements currently owned by HashTable, the count function directly returns this value
  • nNextFreeElement, represents the position of the next numeric index in the array of numeric key values
  • pInternalPointer, internal pointer, pointing to the current member, used to traverse elements
  • pListHead, points to the first element of the HashTable and the first element of the array
  • pListTail, points to the last element of the HashTable and the last element of the array. Combined with the above pointer, it is very convenient when it comes to iterating through arrays, such as reset and endAPI
  • arBuckets, an array of two-way linked lists composed of buckets, indexing key hash value and nTableMask for operation generation
  • pDestructor, destructor used to delete elements in hash table
  • persistent, identify the memory allocation function. If it is TRUE, use the memory allocation function of the operating system itself, otherwise use the PHP memory allocation function.
  • nApplyCount saves the number of times the current bucket is accessed recursively, preventing multiple recursions
  • bApplyProtection, which identifies whether the hash table needs to be recursively protected. The default is 1, and it needs to be used

Take an example of combining hash with mask:

For example, the true hash value of "foo" (using the DJBX33A hash function) is 193491849. If we now have a hash table of 64 capacity, we obviously can't use it as a subscript for the array. Instead, apply the mask of the hash table and then take only the low bits of the hash table.

hash      |    193491849 |   0b1011100010000111001110001001
& mask     | &       63 | &  0b0000000000000000000000111111
----------------------------------------------------------------------
= index    | = 9        | =  0b0000000000000000000000001001

Therefore, in the hash table, foo is stored in a bucket vector with subscript of 9 in arBuckets.

Definition of bucket structure

typedef struct bucket {
   ulong h;
   uint nKeyLength;
   void *pData;
   void *pDataPtr;
   struct bucket *pListNext;
   struct bucket *pListLast;
   struct bucket *pNext;
   struct bucket *pLast;
   const char *arKey;
} Bucket;
  • h, hash value (or key of numeric key value)
  • nKeyLength, length of key
  • pData, pointer to data
  • pDataPtr, pointer data
  • pListNext, pointing to the next element in the arBuckets linked list in HashTable
  • pListLast, point to the previous element in the arBuckets linked list in HashTable
  • pNext, pointing to the next element in the bucket linked list with the same hash value
  • pLast, pointing to the previous element in the bucket linked list with the same hash value
  • arKey, the name of key

HashTable in PHP adopts the implementation method of vector plus bidirectional linked list. The vector is stored in the arBuckets variable. The vector contains pointers of multiple buckets. Each pointer points to a bidirectional linked list composed of multiple buckets. The addition of new elements uses pre-interpolation method, that is, the new element is always at the first position of the bucket. As can be seen from the above, the hash table implementation of PHP is quite complex. This is the price it pays to use super flexible array types.

HashTable related API

  • zend_hash_init
  • zend_hash_add_or_update
  • zend_hash_find
  • zend_hash_del_key_or_index

zend_hash_init

Function execution steps

Set the hash table size

Set the initial value of other member variables of the structure (including the destructor pDescructor for freeing memory)
Click for detailed code comments:zend_hash_init source code

Note:

1. pHashFunction is not used here. The hash function of php uses the internal zend_inline_hash_func

2. After the zend_hash_init is executed, the memory is not really allocated for arBuckets and the size of nTableMask is calculated. The real memory is allocated and the calculation of nTableMask is performed when the CHECK_INIT check is performed when the element is inserted.

zend_hash_add_or_update

Function execution steps

  • Check the length of the key
  • Check initialization
  • Calculate hash and subscript
  • Iterate through the bucket where the hash value is located. If the same key is found and the value needs to be updated, update the data, otherwise continue to point to the next element of the bucket until it points to the last position of the bucket.
  • Assign buckets to newly added elements, set the property value of the new bucket, and then add it to the hash table

If the hash table space is full, resize the hash table

zend_hash_add_or_update

CONNECT_TO_BUCKET_DLLIST is to add a new element to a bucket linked list with the same hash value.

CONNECT_TO_GLOBAL_DLLIST is a bidirectional linked list that adds new elements to a HashTable.

For detailed codes and annotations, please click:zend_hash_add_or_update code annotation

zend_hash_find

Function execution steps

  • Calculate hash and subscript
  • Iterate through the bucket where the hash value is located. If the bucket where the key is found, the value will be returned. Otherwise, point to the next bucket until it points to the last position in the bucket linked list.

For detailed codes and annotations, please click:zend_hash_find code annotation

zend_hash_del_key_or_index

Function execution steps

  • Calculate the hash value and subscript of the key
  • Iterate through the bucket where the hash value is. If the bucket where the key is found, then proceed to the third step. Otherwise, point to the next bucket until it points to the last position in the bucket linked list.
  • If the first element you want to delete is the first element, point arBucket[nIndex] directly to the second element; the rest is to execute the current next of the last pointer
  • Adjust related pointers
  • Free data memory and bucket structure memory

For detailed codes and annotations, please click:zend_hash_del_key_or_index code annotation

Performance Analysis

Advantages of PHP's hash table: PHP's HashTable provides great convenience for array operations. Whether it is the creation of arrays, the addition of new elements or deleting elements, the hash table provides good performance, but its shortcomings are more obvious when the data volume is large. Judging from the time complexity and space complexity, it is insufficient.

The shortcomings are as follows:

  • The structure zval that stores data needs to be allocated separately and needs to manage this extra memory. Each zval takes up 16bytes of memory;
  • When adding a bucket, the bucket is also an additional allocation and requires 16 bytes of memory;
  • In order to be able to perform sequential traversal, a two-way linked list is used to connect the entire HashTable, and each pointer also requires 16bytes of memory;
  • During traversal, if the element is at the end of the bucket link list, it is also necessary to traverse the entire bucket link list to find the value to be found

The main drawback of PHP's HashTable is that the extra pointers and zval and buckets in its bidirectional linked list require additional memory allocation, which results in a lot of memory space and a lot of time spent in searching.

Subsequent

The shortcomings mentioned above are well solved in PHP7. PHP7 has made a major transformation of the data structure in the kernel, making PHP much more efficient. Therefore, it is recommended that PHP developers all develop and deploy versions to update. Take a look at the following PHP code:

&lt;?php
$size = pow(2, 16); 
 
$startTime = microtime(true);
$array = array();
for ($key = 0, $maxKey = ($size - 1) * $size; $key &lt;= $maxKey; $key += $size) {
  $array[$key] = 0;
}
$endTime = microtime(true);
echo 'Insert', $size, 'A malicious element needs ', $endTime - $startTime, ' Second', "\n";
 
$startTime = microtime(true);
$array = array();
for ($key = 0, $maxKey = $size - 1; $key &lt;= $maxKey; ++$key) {
  $array[$key] = 0;
}
$endTime = microtime(true);
echo 'Insert', $size, 'A normal element is needed', $endTime - $startTime, ' Second', "\n";

The above demo is a comparison of the time consumption of multiple hash conflicts and no conflicts. The author runs this code under PHP5.4, and the results are as follows

Inserting 65536 malicious elements requires 43.72204709053 seconds

Inserting 65536 ordinary elements requires 0.009843111038208 seconds

And the result of running on PHP7:

Inserting 65536 malicious elements takes 4.4028408527374 seconds

Inserting 65536 ordinary elements requires 0.0018510818481445 seconds

It can be seen that PHP7's performance has been improved a lot, regardless of conflicting and non-conflicting array operations. Of course, conflicting performance has been improved more significantly. As for why PHP7 has improved so much, it is worth exploring.

Finally, let’s recommend it. There is a simple version of HashTable implementation on github:HashTable implementation

In addition, I have more detailed annotations on PHP source code in github. If you are interested, you can watch it and give it a star.PHP5.4 source code annotation. Can be passedcommit recordView added notes.

The above is all the content of this article. I hope it will be helpful to everyone's study and I hope everyone will support me more.