Hash tables and hash functions are standard topics in a university data structures course. In day-to-day development we often reach for the Hashtable structure: when we need key-value storage, a Hashtable lookup outperforms searching an ArrayList. Why? And what price do we pay for that performance? (As the famous line attributed to the Qing-dynasty merchant Hu Xueyan puts it: before you can enjoy what others cannot, you must endure the hardships and humiliations that others cannot.) Is using a Hashtable really a free lunch? I analyze these questions below, and I hope the analysis draws some attention to the trade-offs.
1. Why hashing makes key-value lookup fast
Anyone who has studied data structures knows that in linear lists and trees, the position of a record within the structure is essentially random; there is no fixed relationship between a record and its key. Searching for a record therefore requires a series of key comparisons, that is, the search is comparison-based. The .NET collection types Array, ArrayList, and List all store data this way.
For example, suppose we have data for the students in a class, including name, gender, age, and student ID:
Name | Gender | Age | Student ID |
---|---|---|---|
Zhang San | male | 15 | 1 |
Li Si | female | 14 | 2 |
Wang Wu | male | 14 | 3 |
Suppose we look up by name, with a search function FindByName(string name):
1) Finding "Zhang San"
Only one match is needed, on the first row.
2) Finding "Wang Wu"
Match the first row: fail.
Match the second row: fail.
Match the third row: success.
These two cases are the best and worst cases respectively, so the average number of comparisons is (1+3)/2 = 2; in general, the average number of comparisons is (total records + 1)/2.
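The linear scan above can be sketched as follows. This is a minimal Python sketch (the article's context is .NET, but the idea is language-independent); the record layout and the comparison counter are illustration choices, not part of the original text:

```python
# Linear (comparison-based) search over the student records from the table above.
# We count key comparisons to make the best, worst, and average cases visible.
students = [
    ("Zhang San", "male", 15, 1),
    ("Li Si", "female", 14, 2),
    ("Wang Wu", "male", 14, 3),
]

def find_by_name(name):
    """Return (record, comparisons), or (None, comparisons) if not found."""
    comparisons = 0
    for record in students:
        comparisons += 1
        if record[0] == name:
            return record, comparisons
    return None, comparisons

best = find_by_name("Zhang San")[1]   # best case: 1 comparison
worst = find_by_name("Wang Wu")[1]    # worst case: 3 comparisons
average = (best + worst) / 2          # (1 + 3) / 2 = 2.0
```

For n records this averages (n + 1)/2 comparisons, which is why the lookup cost grows linearly with the table size.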
Some optimized algorithms (such as binary search over sorted data) can improve search efficiency, but the complexity remains on the order of log2n.
How can we search even faster? Ideally we would land on the record's location in a single step, giving a time complexity of 1, the fastest possible. Suppose we assign each record a number in advance, place the records by number, and know the rule used to assign those numbers. Then, to find a record, we only need to compute its number from the rule, and that number leads us straight to the record's position in the linear structure.
Note that this description contains two concepts. One is the rule used to number the students, which in data structures is called the hash function; the other is the sequential structure in which the students are arranged according to that rule, which is called the hash table.
Take the students above as an example, and suppose the rule is the student ID. The teacher holds the rule table, and the seats are arranged by the same rule. To find Li Si, the teacher first works out from the rule that Li Si's number is 2, meaning he is in seat No. 2, then walks straight over: "Li Si, ha, there you are!"
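The teacher's one-step lookup can be sketched like this (Python; the seat list and the ID-to-index convention of "seat = student ID - 1" are assumptions made for this illustration):

```python
# The "rule" maps a student ID directly to a seat, so finding Li Si is
# a single indexed access instead of a row-by-row scan.
seats = ["Zhang San", "Li Si", "Wang Wu"]  # seats 1, 2, 3

def seat_of(student_id):
    # One step: compute the position from the rule, then index into it.
    return seats[student_id - 1]
```

No comparisons against other records are needed; the cost is the same whether the class has 3 students or 3,000.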
Let's take a look at the general process:
As the figure illustrates, a hash table can be pictured as two buckets: one bucket holds the records' position numbers, the other holds the records themselves, and a set of rules links each record to its number. How is this rule usually formulated?
a) Direct addressing method:
In my previous article comparing the performance of GetHashCode(), I mentioned that for an integer, GetHashCode() returns the integer value itself; this is exactly the direct addressing method. For example, take a set of values from 0 to 100 representing people's ages.
The hash table built by direct addressing is then:
0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
0 years old | 1 year old | 2 years old | 3 years old | 4 years old | 5 years old |
.....
This addressing method is simple and convenient, and suits situations where the raw data can be expressed numerically or already has an obvious ordering.
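A minimal sketch of direct addressing, assuming a table of 101 slots for ages 0-100 (the stored values here are invented for illustration):

```python
# Direct addressing: the key (an age in 0..100) is used as the array index
# itself, i.e. H(key) = key. One slot is reserved per possible key value.
TABLE_SIZE = 101                 # ages 0..100 inclusive
table = [None] * TABLE_SIZE

def insert(age, record):
    table[age] = record          # H(age) = age: the key is the address

def lookup(age):
    return table[age]            # one indexed access, no comparisons

insert(14, "Li Si")
```

The price is visible immediately: all 101 slots are allocated even if only one age is ever stored.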
b) Digit analysis method:
Suppose we have a set of data describing some people's dates of birth:
Year | Month | Day |
---|---|---|
75 | 10 | 01 |
75 | 12 | 10 |
75 | 02 | 14 |
Analyzing the digits, the leading digits (the year and the first digit of the month) are almost identical, so collisions there would be very likely, while the last three digits differ considerably; therefore the last three digits are used as the hash address.
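A sketch of the digit analysis rule, assuming the birth dates are encoded as six-digit YYMMDD integers (an encoding assumed here for illustration):

```python
# Digit analysis: for birth dates like 751001 (YYMMDD), the leading digits
# barely vary across the data set, so we keep only the last three digits,
# which spread the keys out much better.
def digit_analysis_hash(yymmdd):
    return yymmdd % 1000   # keep the last three digits as the hash address

keys = [751001, 751210, 750214]
addresses = [digit_analysis_hash(k) for k in keys]
```

The three dates above map to the well-separated addresses 001, 210, and 214, whereas any rule based on the leading digits would collide on "75".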
c) Mid-square method:
Square the keyword, then take the middle digits of the result as the hash address.
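A sketch of the mid-square method; the keyword 4567 and the choice of three middle digits are arbitrary assumptions for this illustration:

```python
# Mid-square: square the key, then slice the middle digits of the result
# as the hash address. The middle digits depend on every digit of the key,
# which helps scatter similar keys.
def mid_square_hash(key, width=3):
    squared = str(key * key)
    mid = (len(squared) - width) // 2
    return int(squared[mid:mid + width])

# e.g. 4567 * 4567 = 20857489 -> middle three digits "857"
mid_square_hash(4567)
```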
d) Folding method:
Split the keyword into parts with the same number of digits (the last part may be shorter), then add these parts together, discarding any carry out of the top, and use the sum as the hash address. For example, take the number 20-1445-4547-3.
Splitting the digits into 4-digit groups from the right and adding:
5473
+ 4454
+ 201
= 10128
Discard the carry digit 1 and take 0128 as the hash address.
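The folding example above can be sketched like this (shift folding over 4-digit groups; the digit-string input format is a convenience for the sketch):

```python
# Shift folding: split the digit string into 4-digit groups from the right
# (20144545473 -> 201 | 4454 | 5473), add the groups, and discard the
# carry out of the top so the address stays within 4 digits.
def fold_hash(key_digits, group=4):
    total = 0
    while key_digits:
        total += int(key_digits[-group:])   # take the next group from the right
        key_digits = key_digits[:-group]
    return total % 10 ** group              # drop the carry (10128 -> 0128)

fold_hash("20144545473")  # 5473 + 4454 + 201 = 10128 -> address 128 (i.e. 0128)
```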
e) Division method (taking the remainder):
Divide the keyword by some number p not greater than the hash table length m, and take the remainder as the hash address: H(key) = key MOD p (p <= m).
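A sketch of the division method; the table length and the particular choice of p here are assumptions (p is commonly chosen as a prime near m, which tends to scatter keys more evenly):

```python
# Division method: H(key) = key mod p, with p <= table length m.
M = 13          # hash table length (assumed for this sketch)
P = 13          # a number (here a prime) not greater than m

def mod_hash(key):
    return key % P

addresses = [mod_hash(k) for k in (100, 27, 13)]
```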
f) Random number method
Select a random function and take its value on the keyword as the hash address: H(key) = random(key), where random is a (pseudo-)random function. This method is usually used when the keywords have unequal lengths.
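A sketch of the random-number method using Python's seedable generator; the table size is an assumption. The essential point is that "random" here must be a pseudo-random function: the same key must always produce the same address, even across runs:

```python
import random

# Random-number method: H(key) = random(key). Seeding a generator with the
# key makes the result deterministic per key, while still scattering keys
# of very different lengths across the table.
TABLE_SIZE = 16

def random_hash(key):
    rng = random.Random(key)            # seed the generator with the key
    return rng.randrange(TABLE_SIZE)    # same key -> same address, always
```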
In short, the rule for a hash function is: through some transformation, distribute the keywords appropriately across a sequential structure of a given size. The more evenly the keys are scattered, the lower the time complexity of later lookups, but the higher the space complexity.
2. What do we pay for using hashing?
Hashing is a classic algorithm that trades space for time. Suppose we originally search an array of length 100 simply by traversing it and matching records; if the array holds byte data, it occupies 100 bytes of space. Now apply hashing: as noted above, a hash must have a rule that ties each key to a storage position, so we need a fixed-length hash table. The records still occupy a 100-byte array, and if another 100 bytes are needed to record the mapping between keys and positions, the total space becomes 200 bytes; moreover, the size of the table that records the rule varies with the rule and may not be fixed in advance. In the LZW algorithm, for instance, a very long byte array of pixel data must be mapped into a table of positions and keys whose recommended size is an integer expressible in 12 bits. How can an arbitrary pixel array be scattered into such a fixed-length table? LZW uses variable-length coding, which will be introduced in detail when I cover the LZW algorithm.
Note: The most prominent problem with hash tables is collision, that is, two different keys can easily hash to the same index. This problem will be explained in the next article.
Note: The reason for this brief introduction to hashing is to prepare for studying the LZW algorithm, and LZW in turn serves the study of the GIF file format. In the end I will explain in detail how a GIF file is structured and how to operate on this type of file efficiently.
That is all for this article; I hope it serves as a useful reference.