1. Preamble
Hash tables, or hash tables, are a common and very frequently used data storage scheme.
Hash table belongs to the abstract data structure, need to developers according to the storage requirements of the hash table data structure API customization, for most high-level languages, will provide has been implemented, can be directly used API, such as JAVA in the MAP collection, C + + MAP container, Python in the dictionary... ...
Users can use the methods in the API to complete a series of operations to add, delete, change, check ...... the hash table.
How to learn about hash tables?
There are 2 angles to start with:
- User perspective: only need to know the hash table is based on the key, value pair storage solution, another need to be familiar with different computer languages provide based on the hash table data structure of the API implementation, learn to use the API in the method.
- Developer's point of view: then you need to know the underlying principle of hash table implementation, as well as the implementation process needs to be resolved in a variety of issues. In this paper, we will stand in the developer's point of view, take you together to explore the world of hash.
2. Hash tables
What is a hash table?
A hash table is a data structure based on the storage of key and value pairs, and the underlying layer is generally a list (array).
As you know, list (array) based queries are very fast with O(1) time complexity, constant level.
The underlying storage structure of a list is a contiguous region of memory that can be queried directly for data given its position in the list (array). This is true in theory, but in practice, the time complexity of querying the data is not necessarily constant level.
If you store the following student information, the student information includes the student's name and school number. When storing the student data, if the school number is0
The students are stored in the list0
Position, school number1
The students are stored in the list1
Location ......
Here the student's school number and the index number of the list are associated, the query of a student, knowing the student's school number also knows the location of the student's data stored in the list, it can be assumed that the time complexity of the query isO(1)
。
The reason why it can reach the constant level is because there is information associated here (the student's academic number is associated to the location where the data is stored).
Another point is that a student's school number is public information as well as commonly used information and is easily accessible.
However, when not storing any data, you can find the information associated with the list position. For example, if you store all the English words, it is impossible to number each English word, even if you do, the number here is just a running number, data without data meaning is not user-friendly, who can not remember which English word corresponds to which number.
So when you use a list to store English words and then need to query them, there is no storage location for the words. You still need to use a query algorithm such as linear, bisection ...... and so on, and the time complexity at this point is determined by the time complexity of the query algorithm used.
If the above student information stored in the list of insertion, deletion ...... and other operations, change the original location of the data, due to the destruction of the student number and location of the associated information, and then query can only use other query algorithms, it is not possible to reach the constant level.
Does a solution exist that maximizes the optimization of data storage and querying?
Through the above analysis, it can be concluded that to improve the speed of the query, you have to find a way to correlate the data with the location. And this is the core idea of hash tables.
2.1 Hash functions
Hash tables introduce the concept of keywords, which can be thought of as aliases for data. As in the above table, each student can be given an alias and this is the keyword.
Tip: The keyword here is the pinyin abbreviation of the name, and the keyword is more relevant to the data, making it easier to remember and query.
Once you have a keyword, then you map the keyword to a valid position in the list, the mapping method is the hash function, the most important concept in the hash table.
The keyword is a bridge, i.e., it is associated to the real data and then to the position in the hash table.
The keyword can also be the data itself that needs to be saved.
The function of the hash function: provides the algorithm for mapping keywords to positions in a list, which is the core of the hash table to store data. The following figure, demonstrates the relationship between data, hash function, hash table, it can be said that the hash function is the entrance of the data into the hash table.
Where in the list the data will end up being stored is determined entirely by the hashing algorithm.
When it is necessary to query the student data, it is also necessary to call the hash function to convert the keywords, and it is easy to query the data after calculating the position of the data in the list.
If the time complexity of the hash function is ignored, the time complexity of storing and querying data based on hash tables is O(1).
In this way the merit of the hash function algorithm design is the key to the performance of the hash table.
2.2 Hash algorithms
Hash algorithm determines the final storage location of the data, different hash algorithm design scheme, but also about the overall performance of the hash table, so the hash algorithm becomes particularly important.
The following section describes and compares the design of several common hashing algorithms.
Tip:Regardless of the hash algorithm used, there is a fundamental that the result of the hash must be a number that represents a valid position in the list (hash table). Also known as a hash value.
When using a hash table to store data, the keyword can be of numeric or non-numeric type; in fact, the keyword can be of any type. The basic idea of designing a hash algorithm when the keyword is a non-numeric type is discussed here first.
As mentioned earlier, a keyword has been provided for each student in the form of a phonetic abbreviation of their name.
Now how do I map the keyword to a valid position in the list?
Here you can simply think of pinyin as the letters of the alphabet in the English language, first calculating the position of each letter in the alphabet individually, and then adding them together to get a number.
Use the hashing idea above to hash each student's keyword:
-
zjl
The hash value of26+10+12=48
。 -
llj
The hash value of12+12+10=34
。 -
cl
The hash value of3+12=15
。 -
zxy
The hash value of26+25+24=75
。
It was stated earlier that the hash value is an indication of where the data is stored in the list, so now assume an idealized state where the names of the students are all3
characters, meaning that the keyword is also3
letters, using the hash algorithm above, the maximum hash value should bezzz=26+26+26=78
, which means that at least one length should be provided78
The list of .
If, for now, just saving the4
The number of students, although only4
students, as there is no guarantee that the students' keywords will not appearzzz
, so the list length still needs to be78
. As shown in the figure below.
Using this hashing algorithm leads to a serious waste of space in the list, and the most intuitive idea is to make another constraint on the hash value, such as dividing by4
and then take the remainder, limiting the hash to4
within4
The data corresponds to4
a hash value. We call this remainder taking scheme the remainder taking algorithm.
In the remainder method, the divisor is generally chosen to be a prime number smaller than the length of the hash table. This paper introduces other hash algorithms, will also use the remainder method to shrink the hash value of the appropriate range.
reopen consideration of4
The keywords of the first student are hashed.
-
zjl
The hash value of26+10+12=48
,48
divided by4
Taking the remainder results in0
。 -
llj
The hash value of12+12+10=34
,34
divided by4
Taking the remainder results in2
。 -
cl
The hash value of3+12=15
,15
divided by4
Taking the remainder results in3
。 -
zzz
The hash value of26+26+26=78
,78
divided by4
Taking the remainder results in2
。
A very strange phenomenon appears on the demo image, there is no Jet Li storage information to be seen.
4
storage locations to store4
students, which should be just right, but only stores the3
students. And there are also1
A position is free. Now the coding is verified to see if it's caused by human factors.
''' Hash Functions ''' def hash_code(key): # Set the position of the letter A in the alphabet to 1 pos = 0 for i in key: i = () res = ord(i) - ord('a') + 1 pos += res return pos % 4
Test code:
# Hash tables hash_table = [None] * 4 # Calculate the keyword hash idx = hash_code('zjl') # Store data based on locations converted from keywords hash_table[idx] = 'Jay Chou' idx = hash_code('llj') hash_table[idx] = 'Jet Li' idx = hash_code('cl') hash_table[idx] = '*' idx = hash_code('zzz') hash_table[idx] = 'Zhang Zhizhong' print('Data in the hash table:', hash_table) ''' Output results: Data in the hash table: ['Jay Chou', None, 'Zhang Zhizhong', '*'] '''
Execute the code, output the results, and still still don't see Jet Li's message.
What are the reasons?
This is because both Jet Li and Cheung's hash values are2
, resulting in storage, the data stored later will overwrite the data stored earlier, which is a typical problem in hashing, hash conflict problem.
The so-called hash conflict refers to the fact that different keywords get the same hash value after the hash algorithm is performed, which means that the data corresponding to different keywords will be stored in the same location, which will surely occur data loss, so there is a need to provide algorithms to solve the conflict problem.
Tip: The study of hash tables is, ultimately, the study of how to compute hash values and how to resolve hash value conflicts.
In response to the above problem, there is a taken-for-granted conflict solution that extends the storage length of the list, e.g., extending the list to a length of8
。
The intuitive thinking is that by extending the length of the list, the range of hashes will increase and the likelihood of conflicts will decrease.
''' Hash Functions ''' def hash_code(key): # Set the position of the letter A in the alphabet to 1 pos = 0 for i in key: i = () res = ord(i) - ord('a') + 1 pos += res return pos % 8 # Hash tables hash_table = [None] * 8 # Save all students idx = hash_code('zjl') hash_table[idx] = 'Jay Chou' idx = hash_code('llj') hash_table[idx] = 'Jet Li' idx = hash_code('cl') hash_table[idx] = '*' idx = hash_code('zzz') hash_table[idx] = 'Zhang Zhizhong' print('Data in the hash table:', hash_table) ''' Output results: Data in the hash table: ['Jay Chou', None, 'Jet Li', None, None, None, 'Zhang Zhizhong', '*'] '''
It seems to solve the conflict problem, but it doesn't, when trying to set the length of the list to 6, 7, 8, 9, 10, only when the length is 8 there is no conflict, and this is still an attempt when the data to be stored is known.
If the data is dynamically changing, it is clear that this extension of the length of the program is definitely not the essential solution to the conflict. That is, it can not solve the conflict, and produce a large amount of space waste.
How to resolve hash conflicts will be described in detail later, but let's go back to the hash algorithm here.
In summary, our ideal requirement for a hashing algorithm is:
- A unique hash value is generated for each keyword, ensuring that each piece of data has a storage location that belongs only to it.
- The performance time complexity of hash algorithms is to be low.
The reality is that it is almost impossible to have a hash algorithm that satisfies these 2 conditions at the same time, and in the face of large amounts of data, hash conflicts are the norm. So, it can only be satisfied as much as possible.
Due to the existence of conflicts, even if 100 effective storage space is provided for 100 data, there will still be unused space. Here the actual space used and the effective space provided by the list are divided to get the result, which is called the occupancy of the hash table (load factor).
As mentioned above, when the length of the list is 4, the occupancy rate is 3/4=0.75, and when the length of the list is 8, the occupancy rate is 4/8=0.5, and the general requirement is to control the occupancy rate between 0.6 and 0.9.
2.3 Common Hash Algorithms
In the previous section on what hashing algorithms are, the method of taking the remainder was mentioned, and in addition to that, there are several other common hashing algorithms.
2.3.1 Folding method
Folding method: the keyword is divided into several parts with the same number of bits (the last part of the number of bits can be different) and then take the sum of the superposition of these parts (rounding off the rounding) as the hash value.
The folding method is subdivided into shift stacking and inter-boundary stacking.
- shift superposition: Align the lowest bits of each part of the split and add them together.
- superposition of boundaries: Fold back and forth from one end along the dividing line and then align to add up.
The folding method is suitable for numeric types or keywords that can be converted to numeric types because of the summing calculation. Assume that there is now a lot of information about the order of goods, in order to simplify the problem, the order only includes the order number and the amount of the order.
Now use to store the order data in a hash table and with order number as the keyword and order amount as the value.
Order Number | Order Amount |
---|---|
20201011 | 400.00 |
19981112 | 300.00 |
20221212 | 200 |
Ideas for shifting stacks to convert keywords:
Step one:Split the order number 20201011 in groups of every 3 digits, the result of the split: 202, 010, 11.
Splitting by 2-digit groups or 3-digit groups can be decided on a case-by-case basis.
Step two: Adding the partitioned numbers 202 + 010 + 11 gives the result: 223. then using the remainder method, if the length of the hash table is 10, the remainder after dividing by 10 is 3.
Dividing by 10 is used here only to simplify the details of the problem; the length of the list is rarely chosen for specific operations.
Step Three:The same treatment scheme is used for other keywords.
Keywords. | hash value |
---|---|
20201011 | 3 |
19981112 | 2 |
20221212 | 6 |
The coding implementation saves the merchandise order information:
''' Shift-stacked hash algorithm ''' def hash_code(key, hash_table_size): # Convert to String key_s = str(key) # Save the results of the summation s = 0 # Use slicing for i in range(0, len(key_s), 3): s += int(key_s[i:i + 3]) return s % hash_table_size # Commodity information products = [[20201011, 400.00], [19981112, 300], [20221212, 200]] # Hash table length hash_size = 10 # Hash tables hash_table = [None] * hash_size # Stored in hash tables for p in products: key = hash_code(p[0], hash_size) hash_table[key] = p[1] # Display the data in the hash table print("Data in the hash table:",hash_table) # Queries based on order number hash_val = hash_code(19981112, hash_size) val = hash_table[hash_val] print("The order number is{0}The amount of{1}".format(19981112, val)) ''' Output results Data in hash table: [None, None, 300, 400.0, None, None, 200, None, None, None] Order number 19981112 has an amount of 300 '''
Interboundary superposition method:
The inter-boundary superposition method will invert the numbers to be added at intervals.
Such as the order number 19981112 by a group of 3-bit segmentation, segmentation of the results: 199, 811, 12, between the boundaries of the superposition operation and the expression for 199 + 118 + 12 = 339, and then the results of the 339% 10 = 9.
Coding implementation of the inter-boundary superposition algorithm:
''' Inter-boundary superposition hash algorithm ''' def hash_code(key, hash_table_size): # Convert to String key_s = str(key) # Save the results of the summation s = 0 # Use slicing for i in range(0, len(key_s), 3): # Slicing tmp_s = key_s[i:i + 3] # Reverse if i % 2 != 0: tmp_s = tmp_s[::-1] s += int(tmp_s) return s % hash_table_size # Commodity information (sample data) products = [[20201011, 400.00], [19981112, 300], [20221212, 200]] # Hash table length hash_size = 10 # Hash tables hash_table = [None] * hash_size # Stored in hash tables for p in products: key = hash_code(p[0], hash_size) hash_table[key] = p[1] # Display the data in the hash table print("Data in the hash table:", hash_table) # Queries based on order number hash_val = hash_code(19981112, hash_size) val = hash_table[hash_val] print("The order number is{0}The amount of{1}".format(19981112, val)) ''' Output results: Data in hash table: [None, None, None, 400.0, None, None, 200, None, None, 300] Order number 19981112 has an amount of 300 '''
2.3.2 Squared center method
square-jumping (i.e. plotting a square root): First you square the keyword, then you take the number in the middle position in the result.
Squaring and then take the middle of the algorithm, is a more common hash algorithm, from the mathematical formula can be seen, squaring the middle few digits obtained after the keyword is related to each of the keyword, take the middle of the method can make the final calculation of the hash value is more uniform.
Because to square the keyword, the keyword can only be a number or a type that can be converted to a number, as for the keyword itself, the size of the range of restrictions, according to the use of computer language flexibility.
As in the following book data, the book consists of book number and book name. Now you need to save the book information using a hash table with book number as the keyword and book name as the value.
Book Number | Book Name |
---|---|
58 | python from beginner to master |
67 | C++ STL |
78 | Java Memory Model |
Calculate the hash value of a keyword using the square-median method:
Step one:Squaring book number 58 results in 3364.
second step: Take the middle value of 3364, 36, and then use the remainder scheme. If the length of the hash table is 10, then 36%10 = 6.
Step Three:The same calculation scheme is used for the other keywords.
Coding implementation of the square-taking algorithm:
''' Hash Algorithm Squared Fetch Center ''' def hash_code(key, hash_table_size): # Square it res = key ** 2 # Take the middle value, in this case the middle 2 digits (simplifies the problem) res = int(str(res)[1:3]) # Take the remainder return res % hash_table_size hash_table_size = 10 hash_table = [None]*hash_table_size # Book information books = [[58, "python from beginner to master."], [67, "C++ STL"], [78, "Java Memory Model."]] for b in books: hash_val = hash_code(b[0],hash_table_size) hash_table[hash_val]=b[1] # Display the data in the hash table print("Data in the hash table:", hash_table) # Search by number hash_val = hash_code(67, hash_table_size) val = hash_table[hash_val] print("number{0}The title of the book is{1}".format(67, val))
The above algorithm for squaring and taking the middle value is only for the book data provided in this paper, if the algorithm needs to be generalized, it needs to be modified according to the actual situation.
Don't be fooled by the word middle in the fetch, it doesn't have to be the number in the absolute middle position.
2.3.3 Direct address method
direct address method: Provide a linear function associated with a keyword. For example, for the book data above, a linear function f(k)=2*key+10 can be provided.
The choice of factor 2 and constant 10 affects the size of the final generated hash value. It can be chosen by yourself according to the size of the hash table and the data meaning of the operation.
key is the book number. When the keyword is not the same, the value obtained by using the linear function is also unique, so there will be no hash conflict, but it will require the storage length of the hash table to be larger than the actual data.
Such algorithms are not common in practice.
Practical applications, the specific choice of what hash algorithms, completely determined by the developer, the choice of hash algorithms do not have a fixed pattern to follow, although the above introduces several algorithms, just to provide a kind of algorithmic thinking.
2.4 Hash conflicts
How hash conflicts are caused has already been said before. Now talk about a few common hash conflict solutions.
2.4.1 Linear detection
When a hash conflict occurs, it looks for an available empty position after the conflicting position. As shown in the figure below, the data is saved to the hash table using the take remainder hash algorithm.
The length of the hash table is set to 15 and the divisor is set to 13.
Conflict resolution process:
-
78
cap (a poem)26
The hashes of the0
. And because78
exist26
The front of the78
Occupying the hash table first0
Location. - When storing
26
When you do, you can only use the0
position is the starting position, looking backward for an empty position due to the1
position is not occupied by other data and is ultimately saved in the hash table's1
Location. - When storing numbers
14
When the hash is calculated by the hash algorithm, its hash value is1
It's supposed to be kept in a hash table.1
The location of the1
The location has been26
occupied and had to look backward for an empty spot, eventually landing on the2
Location.
Linear probing method to allow the occurrence of hash conflict data saved in the hash location of other data, if the conflict is more data, it occupies the hash location that should belong to other data is also more, this phenomenon is known as hash aggregation.
Query Process:
to query the data14
As an example.
- count
14
The hash of the hash gets the value1
, find the corresponding position in the hash table based on the hash value. - Check if data exists in the corresponding location, if not, declare the query failed, if it exists, you need to provide a data comparison method.
- reason
1
Data on location26
is not equal to14
. So, the search continues backwards and compares them one by one. - Ultimately, the conclusion can be drawn
14
The number in the hash table is2
The location.
So, in addition to providing a hash function during the query, you also need to provide a data comparison function.
Delete the process:
to delete the number26
As an example.
Follow the search process above to find the number26
Position in the hash table1
。
Setting position1
For a deleted state, be sure to mark this location as having had data saved in it, rather than setting it to an empty state. Why?
If it is set to the empty state, the query number in the14
will produce an incorrect return result when it is assumed that the14
Doesn't exist. Why? Think for yourself.
Coding Realization of the Linear Detection Method:
Add data:
''' Linear probing method for resolving hash conflicts ''' def hash_code(key, hash_table, num): # Length of the hash table size = len(hash_table) # Calculate the hash by taking the remainder hash_val = key % num # Check if other data is already stored in this location if hash_table[hash_val] is not None: # then look for the empty position after hash_val for i in range(hash_val + 1, size + hash_val): if i >= size: i = i % size if hash_table[i] is None: hash_val = i break return hash_val # Hash tables hash_table = [None] * 15 src_nums = [25, 78, 56, 32, 88, 26, 73, 81, 14] for n in src_nums: hash_val = hash_code(n, hash_table, 13) hash_table[hash_val] = n print("Data in the hash table:", hash_table) ''' Output results: Data in hash table: [78, 26, 14, 81, 56, None, 32, None, 73, None, 88, None, 25, None, None] '''
Tip:In order to ensure that when a hash value conflict occurs, if the empty position is still not found by checking from the conflict position to the end position of the hash table, then it is checked again from the start position of the hash table, that is, the0
The location is then searched to the conflict location. The conflict location is the starting point and the end point, and a logical loop of search is constructed to ensure that the empty location will definitely be found.
for i in range(hash_val + 1, size + hash_val): pass
The data query procedure based on linear probing is roughly the same as the stored procedure:
def get(key, hash_table, num): # Length of the hash table size = len(hash_table) # Calculate the hash by taking the remainder hash_val = key % num is_exist = False # Check if other data is already stored in this location if hash_table[hash_val] is None: # Doesn't exist return None if hash_table[hash_val] != key: # then look for the empty position after hash_val for i in range(hash_val + 1, size + hash_val): if i >= size: i = i % size if hash_table[i] == key: hash_val = i is_exist = True break else: is_exist=True if is_exist: return hash_val # Testing res = get(25, hash_table, 13) print(res)
In order to reduce data aggregation, incremental linear probing can be used, by incremental I mean that when probing for empty positions after a hash conflict occurs, use a step value greater than1
way of jumping forward in the lookup. The goal is to distribute the data evenly and reduce data aggregation.
In addition to using incremental probing, a re-hashing scheme can also be used. That is, to provide2
hash function, the first1
After a conflict occurs in the second hash, then call the first2
The hash function is hashed again until the conflict no longer arises. This scheme increases the computation time.
2.4.2 The linked-table method
The core idea of the conflict solution described above is to find another valid empty position in the hash table when a conflict occurs.
The advantage of this scheme is that it does not generate additional storage space, but it is prone to data aggregation, which can make the storage of data unbalanced, and it will defeat the original purpose, the hash value calculated by the keyword does not accurately describe the correct location of the data.
The chain table method should be the more perfect solution of all the hash conflicts. The so-called chain table method, refers to when a hash conflict occurs, to the conflict position as the first node to build a chain table, to chain table way to save all the data in conflict. As shown in the figure below:
Chained table scheme to resolve conflicts, whether in the storage, query, deletion will not affect the independence and uniqueness of other data locations, and because of the faster operation speed of the chained table, for the overall performance of the hash table have a better improvement.
When using the chained table method, the hash table holds the first node of the chained table. The first node may or may not hold data.
Coded implementation of the chained table method: The chained table implementation requires the definition of 2 classes, 1 for the node class and 1 for the hash class.
''' Node Classes ''' class HashNode(): def __init__(self, value): = value self.next_node = None ''' Hash class ''' class HashTable(): def __init__(self): # Hash table, initial size 15, can be dynamically modified as needed = [None] * 15 # Actual data size = 0 ''' Stored Data key:keyword value:value ''' def put(self, key, value): hash_val = self.hash_code(key) # New nodes new_node = HashNode(value) if [hash_val] is None: # This code uses the first node save data scheme [hash_val] = new_node +=1 else: move = [hash_val] while move.next_node is not None: move = move.next_node move.next_node = new_node +=1 ''' Query Data ''' def get(self, key): hash_val = self.hash_code(key) if [hash_val] is None: # Data not available return -1 if [hash_val].value == key: # The first node is the data to be found return [hash_val].value # Move the pointer move = [hash_val].next_node while != key and move is not None: move = move.next_node if move is None: return -1 else: return def hash_code(self, key): # For illustrative purposes only, the choice of 13 is fixed # hash_val = key % 13 return hash_val # Raw data src_nums = [25, 78, 56, 32, 88, 26, 39, 82, 14] # Hash objects hash_table = HashTable() # Add data to the hash table for n in src_nums: hash_table.put(n, n) # Output the first node data in the hash table for i in hash_table.table: if i is not None: print(,end=" ") print("\n------------- query -----------") print(hash_table.get(26)) ''' Output results: 78 14 56 32 88 25 ------------- query ----------- 26 '''
3. Summary
Hash table is an advanced data structure, its storage, query performance is very good, without considering the hash hash algorithm and hash conflict time complexity, hash lookup time complexity can reach the constant level, become the first choice for many practical application scenarios.
The study of hash tables focuses on figuring out hash algorithms and how to resolve hash conflicts. In the world of algorithms, there is no fixed pattern, developers can design their own hash algorithms according to their own needs.
Above is a detailed explanation of the use of hash tables in Python in detail, more information about Python hash tables please pay attention to my other related articles!