Encryption strategies for password fields, in light of the "drag library" (database dump) attacks (database encryption)

The most shocking of these incidents was the intrusion into RSA, which led directly to follow-on attacks against a number of industry giants; many security companies themselves also use RSA tokens. DigiNotar, a Dutch certificate authority far smaller than RSA, declared bankruptcy after it was compromised.

In the first half of 2011 we were still discussing these events as bystanders. Then came the data leaks from CSDN, Duowan, Tianya and others. The most sensitive data was user information on the one hand and, of course, user passwords on the other. Because of real-name registration and the habit of using one password everywhere, everyone felt at risk for a while, and every site was on edge.

In fact, the evidence suggests that these intrusions all happened some time ago, which means the databases had long been circulating underground. Their surfacing at the same time may be a kind of collective psychological effect.

Some attackers call this kind of theft of database records "dragging the library", which naturally gave rise to a joke that works as a homophone in Chinese: "taking off the pants". But attackers are becoming less and less kind. They used to just steal someone's pants; now they hang them in the street and post a notice saying, "Look, there are patches on these pants."

If the theft of a database is hard to prevent entirely, then a sound encryption strategy is needed to limit the damage once an attacker has the database.

The era of storing passwords in plain text is definitely coming to an end, but is encryption safe?

The wrong encryption strategies

Plain text passwords are unacceptable, but a wrong encryption strategy is bad too. Let's look at the following cases.

Using a standard HASH on its own

I am reminded of a hacker joke from the 1990s. Someone broke into a UNIX host and grabbed the shadow file, but could not crack it. So he set up a decoy on his own machine, deliberately left that shadow file on it, watched which passwords other intruders tried, and finally used those passwords to get into the original host. Unfortunately, we all treated this as a joke at the time and replied at best, "Hats off to you!", without reflecting on the problem of using standard algorithms.

At present, the most widely used algorithm for storing passwords is standard MD5. But for a long time we have overlooked the fact that HASH was designed for verification, not for encryption. System designers "borrowed" it for storing passwords because HASH algorithms are irreversible. That irreversibility, however, rests on the assumption that the set of plaintexts is infinitely large. Passwords are different: their length is limited, and so is the set of characters that can be used. The set of all passwords is, in practice, a finite set (it is hard to imagine anyone using a 100-character password).

For example, if someone's password is "123456", every website database that uses standard MD5 will store the same MD5 value: E10ADC3949BA59ABBE56E057F20F883E
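This is easy to check with Python's standard hashlib module. The snippet below is only a minimal sketch; the helper name md5_hex is ours and is not taken from any library.

```python
import hashlib

def md5_hex(password: str) -> str:
    """Return the plain, unsalted MD5 of a password as lowercase hex."""
    return hashlib.md5(password.encode("utf-8")).hexdigest()

# Two unrelated sites that both store plain MD5 end up with the same value.
site_a = md5_hex("123456")
site_b = md5_hex("123456")

print(site_a.upper())    # E10ADC3949BA59ABBE56E057F20F883E
print(site_a == site_b)  # True
```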

Because identical passwords produce identical ciphertexts and the HASH algorithm is one-way, the attackers' early method was "ciphertext comparison + frequency statistics", from which they built a ciphertext dictionary. Since most websites and systems turn the same plaintext password into the same ciphertext, the users whose ciphertexts occur most often are very likely the users of the most common plaintext passwords. For standard algorithms, an attacker can precompute ciphertext files for common plaintexts and simply look values up. For non-standard algorithms, frequency-statistics attacks are still widely applicable.

Table lookup attacks quickly overtook frequency statistics, however, because there have been a number of site-scale plain-text password leaks since 2000. After each such leak, attackers ran the leaked passwords through common HASH algorithms such as MD5 and SHA1 and used the results against databases that store passwords as HASH values.

As supercomputing resources have become cheap, GPUs commonplace and storage ever larger, a threat that cannot be ignored has arrived on the desktop: these huge HASH tables are no longer built only from leaked passwords and dictionaries of common strings. Over a long period, many attackers have worked together to build result sets that map password strings up to a certain number of characters to their values under multiple algorithms. These result sets range from hundreds of GB to tens of TB. This is the legendary rainbow table.

The one-way property of HASH is of only theoretical value here. That property is guaranteed by the design of the algorithm: mapping an infinite set onto a finite one is bound to be irreversible. But the attacker recovers the plaintext password from the HASH by table lookup, without inverting anything, so the one-way nature of the algorithm loses its significance.
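The idea can be sketched in a few lines of Python. This is a plain precomputed lookup table over a tiny, hypothetical candidate list, not a real rainbow table (which stores hash chains to trade computation for storage), but it shows why the one-way property does not help once the plaintext set is small enough to enumerate. The names build_lookup_table and crack are ours.

```python
import hashlib

def md5_hex(text: str) -> str:
    """Plain, unsalted MD5 as lowercase hex."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def build_lookup_table(candidates):
    """Precompute HASH -> plaintext once; the table can be reused against any site."""
    return {md5_hex(p): p for p in candidates}

# Hypothetical candidate set: every 4-digit PIN plus a few common passwords.
candidates = [f"{n:04d}" for n in range(10000)] + ["123456", "password", "qwerty"]
table = build_lookup_table(candidates)

def crack(stolen_hash: str):
    """Recover the plaintext by lookup; returns None if it is not in the table."""
    return table.get(stolen_hash.lower())

print(crack("E10ADC3949BA59ABBE56E057F20F883E"))  # 123456
```

Nothing is inverted here: the attacker pays the hashing cost once, in advance, and every later "crack" is a dictionary lookup.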

Using HASH algorithms in combination

Some people mistakenly believe that HASH is insecure because the HASH algorithm is not strong enough, and therefore use MD5 and SHA1 in combination. This is actually worthless (it only consumes storage). As mentioned above, HASH is insecure because the correspondence between a huge number of passwords and their HASH values has long since been compiled into rainbow tables. As long as the combination includes any one of these HASH algorithms, the password can still be looked up.

Similarly, it is meaningless to use "MD5's head + SHA's tail" or any other scheme that mixes two values. The attacker can easily spot the pattern of the combination, take the value apart, and then carry on cracking it by table lookup.
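As an illustration, here is a sketch of one such home-made mix and how a table lookup still breaks it. The scheme (first 16 hex characters of MD5 followed by the last 16 of SHA1) and the candidate list are hypothetical, chosen only to make the point concrete.

```python
import hashlib

def md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def sha1_hex(text: str) -> str:
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

def mixed(password: str) -> str:
    """Hypothetical scheme: MD5's head (16 hex chars) + SHA1's tail (16 hex chars)."""
    return md5_hex(password)[:16] + sha1_hex(password)[-16:]

# The attacker only needs an ordinary MD5 table, indexed by its first 16 characters.
candidates = ["123456", "password", "qwerty", "111111"]
md5_prefix_table = {md5_hex(p)[:16]: p for p in candidates}

def crack_mixed(stored: str):
    """Split the stored value, look up the MD5 half, confirm with the full value."""
    guess = md5_prefix_table.get(stored[:16])
    if guess is not None and mixed(guess) == stored:
        return guess
    return None

print(crack_mixed(mixed("qwerty")))  # qwerty
```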

Designing your own algorithm

I have always believed that since we are engineers and programmers, not cryptographers, it is rather foolish to develop our own encryption algorithms instead of using the good ones that are readily available. I believe many programmers have racked their brains to come up with a "new algorithm", only to find that it had already been proposed in a mathematics paper in the 1980s.

Moreover, in the open source era, many algorithms have not only been implemented and released, but have also stood up to long-term use and scrutiny. An algorithm you design and implement yourself cannot compare.

One incident left a deep impression on me about the insecurity of home-made algorithms. When I was working in the securities industry, we had just taken over a newly acquired sales department and needed to migrate a counter system written in Clipper, but the original developer could no longer be reached. We planned two approaches. An expert, Mr. Li, would attack the data to see whether the plain text could be recovered, and I would work on the algorithm. If Mr. Li did not get through, I would need to work out the algorithm, encrypt every number from 000000 to 999999, and then match the results against the cipher text (at that time all securities trading was done at the counter, there was no online trading, and passwords were entered at the counter on a numeric keypad).

Because the original developer had added some tricks, I had made no headway when the engineers watching Mr. Li on the other side let out cries of amazement. I ran over and saw that, from the encrypted results of a few specially constructed passwords, Mr. Li had worked out on paper something that looked very much like Yang Hui's triangle (Pascal's triangle). In less than half an hour he had finished the decryption program.

The point of this story is that, however you design your own algorithm, you cannot match the combined intelligence of the world's mathematicians.

So designing and implementing your own algorithm is not a good idea. There is also the question of implementation bugs, such as an overflow when an overly long string is entered.

Using a symmetric algorithm alone

After standard HASH was shown to be insecure, I saw people calling for AES instead, which is actually not good advice. Symmetric algorithms such as AES are not one-way. Websites are compromised in different ways: sometimes only the database is stolen, sometimes the entire environment falls. In the latter case, once the AES key is obtained, every password can be recovered, which is even worse than a table lookup.

Of course, we have also seen the idea of using AES as a HASH, that is, keeping only part of the AES output so that it can be verified but not restored. But used this way AES does not necessarily have any advantage over HASH. For example, suppose the attacker does not get the key and only steals the database, but registered enough accounts beforehand with a large number of different short passwords. He then holds a set of short plain texts and their corresponding cipher texts, and from that material it is entirely possible that the key could be analyzed.

Whether you use algorithms such as DES and AES, use a standard HASH, or design your own algorithm, if you do not fix the statistical weakness that different users with the same password get the same ciphertext, then even without the key an attacker can first register accounts with a number of common passwords, then steal the database and compare ciphertexts. That pins down a large number of users who use common passwords.
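The pre-registration attack can be sketched as follows. Here site_protect stands in for any deterministic scheme without per-user variation; the keyed HMAC, the key, the account names and the passwords are all hypothetical. The attacker never calls site_protect himself: he only compares stored values.

```python
import hashlib
import hmac

SITE_KEY = b"unknown-to-the-attacker"  # hypothetical secret held only by the site

def site_protect(password: str) -> str:
    """Stand-in for any deterministic scheme with no per-user salt."""
    return hmac.new(SITE_KEY, password.encode("utf-8"), hashlib.sha256).hexdigest()

# What the stolen user table looks like (account -> stored ciphertext).
stolen = {
    "alice": site_protect("123456"),
    "bob": site_protect("s3cure-Pa55"),
    "carol": site_protect("123456"),
    "attacker01": site_protect("123456"),    # registered in advance
    "attacker02": site_protect("password"),  # registered in advance
}

# The attacker knows the plaintext of his own accounts.
known = {"attacker01": "123456", "attacker02": "password"}

def match_known(stolen_rows, known_accounts):
    """Map every other user whose ciphertext equals one of the attacker's own."""
    by_cipher = {stolen_rows[acct]: pw for acct, pw in known_accounts.items()}
    return {
        user: by_cipher[cipher]
        for user, cipher in stolen_rows.items()
        if cipher in by_cipher and user not in known_accounts
    }

print(match_known(stolen, known))  # {'alice': '123456', 'carol': '123456'}
```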

Adding "a grain of salt"

In fact, many colleagues have pointed out that salted hashing (HASH+SALT) is the solution to the problem. Salting (SALT) is actually very simple: a perturbation is added when the HASH is generated, so that the HASH value differs from the standard HASH result and therefore resists rainbow table lookup.

For example, suppose the user's password is 123456 and we add a salt, the random string "1cd73466fdc24040b5". The two are combined and the MD5 is calculated, giving 6c9055e7cc9b1bd9b48475aaab59358e. With this operation, even if the user chose a weak password, what is actually hashed is a long string, which gives some protection against exhaustive attacks and rainbow table attacks.
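A minimal sketch of this operation in Python follows. The text does not say how the salt and the password are combined, so the "salt + password" order used here is an assumption, and the sketch makes no attempt to reproduce the exact value quoted above. MD5 is used only to mirror the example; the helper names are ours.

```python
import hashlib
import secrets

def make_salt(nbytes: int = 9) -> str:
    """Random salt as a hex string (9 bytes -> 18 hex chars, like the example)."""
    return secrets.token_hex(nbytes)

def salted_md5(password: str, salt: str) -> str:
    """MD5 over salt + password; the concatenation order is an assumption."""
    return hashlib.md5((salt + password).encode("utf-8")).hexdigest()

salt = make_salt()
stored = salted_md5("123456", salt)

# The stored value no longer matches the well-known plain MD5 of "123456",
# so a precomputed table of standard MD5 values does not contain it.
print(stored != hashlib.md5(b"123456").hexdigest())  # True

# Verification repeats the same computation with the stored salt.
print(salted_md5("123456", salt) == stored)  # True
```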

But judging from the implementations we have audited, many people add only "one grain of salt" for the whole site. In other words, when different users of the same site use the same password, the ciphertext is still the same. This brings us back to the problems of frequency-statistics attacks, pre-registration attacks and so on.

Password security policy

In the eyes of traditional cryptographers, only one kind of encryption is ideal: the one-time pad ("one secret at a time"), which is of course impractical in reality. But borrowing that wording, we can say that the ideal password security strategy is: one-way, one secret per person, and one secret per site.

One-way: Although the standard HASH algorithm has been devalued in this scenario, the idea of being one-way is still correct. If the site can restore a password, so can an attacker, and the protection loses its meaning. Using a one-way algorithm is therefore necessary.

One secret per person: When different users set the same password on the same site, their stored ciphertexts differ. This effectively defeats ciphertext matching and statistical attacks, and dictionary-based attacks essentially stop converging.

One secret per site: Ensuring one secret per person is not enough. It is also necessary to ensure that when a user registers on different websites with the same information and the same password, the stored results on those sites differ. Given how many users register on different websites with the same information and the same password, achieving this further reduces the value of a stolen database, and attackers will essentially give up trying to build a ciphertext dictionary.

Implementing all this is very simple; it is still HASH+SALT. The key is that each site must have a different SALT and each user must have a different SALT.
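One possible shape for this, as a sketch under stated assumptions: the site salt is used as an HMAC-SHA256 key and the per-user salt is prepended to the password. This particular construction, and the names protect, register and verify, are our illustration and are not taken from the text or from any specific product.

```python
import hashlib
import hmac
import secrets

def new_salt() -> bytes:
    """16 random bytes; used for both the site salt and each user salt."""
    return secrets.token_bytes(16)

def protect(password: str, user_salt: bytes, site_salt: bytes) -> str:
    """One-way value that depends on the password, the user salt and the site salt."""
    message = user_salt + password.encode("utf-8")
    return hmac.new(site_salt, message, hashlib.sha256).hexdigest()

def register(password: str, site_salt: bytes) -> dict:
    """Create the record stored for a new user: a fresh salt plus the digest."""
    user_salt = new_salt()
    return {"salt": user_salt, "digest": protect(password, user_salt, site_salt)}

def verify(password: str, record: dict, site_salt: bytes) -> bool:
    """Recompute with the stored salt and compare in constant time."""
    expected = protect(password, record["salt"], site_salt)
    return hmac.compare_digest(expected, record["digest"])

site_a, site_b = new_salt(), new_salt()  # generated once per site
alice = register("123456", site_a)
bob = register("123456", site_a)

print(alice["digest"] != bob["digest"])  # True: one secret per person
print(verify("123456", alice, site_a))   # True
print(verify("123456", alice, site_b))   # False: one secret per site
```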

But if the attacker obtains not only the database but also the relevant encryption parameters and keys, we must recognize that he can still call the algorithm with those parameters and keys, generate the ciphertext of common passwords for each user, and check whether any of them matches. Of course, thanks to the "one grain of salt per person" strategy, the attacker's computing cost has changed. Previously each common password had to be computed only once for the whole site; now, to try 100 common passwords, up to 100 operations must be performed for every user until one matches. But this is still a threat that cannot be underestimated, because so many users like to use common passwords.
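The change in cost is simple arithmetic. The sketch below counts worst-case hash computations; the user count is a hypothetical figure chosen only for illustration.

```python
def hash_operations(num_users: int, num_candidates: int, per_user_salt: bool) -> int:
    """Worst-case number of hash computations to test every candidate on every user."""
    if per_user_salt:
        # Each user's salt forces a separate computation per candidate.
        return num_users * num_candidates
    # Without per-user salts, one computation per candidate covers all users.
    return num_candidates

users, candidates = 1_000_000, 100  # hypothetical site size, 100 common passwords

print(hash_operations(users, candidates, per_user_salt=False))  # 100
print(hash_operations(users, candidates, per_user_salt=True))   # 100000000
```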

Therefore, maintaining a list of banned passwords so that users cannot choose common ones makes crackers pay a still higher price: they burn computing resources without converging and eventually give up. This is a strategy worth considering. But web developers should be reminded that it also increases the risk of users forgetting their passwords.
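Such a check at registration time can be as small as the sketch below. The list contents and the normalization (ignoring case and surrounding spaces) are hypothetical choices; a real list would be far longer.

```python
# Hypothetical banned list; a real deployment would load a much larger one.
BANNED_PASSWORDS = {"123456", "password", "qwerty", "111111", "abc123"}

def is_allowed(password: str) -> bool:
    """Reject passwords on the banned list (case and surrounding spaces ignored)."""
    return password.strip().lower() not in BANNED_PASSWORDS

print(is_allowed("123456"))            # False
print(is_allowed("Password"))          # False
print(is_allowed("correct-horse-42"))  # True
```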

That said, do users have the freedom to set their password to 123456? I think that, outside national defense, aerospace, classified systems and corporate environments with security requirements, if the site is just a place for lurking and venting, it can simply warn the user; enforcement may not be necessary.

Specific implementation

After all this, how do we implement a strategy of one secret per site and one secret per person? On December 23, 2011, we decided that, rather than preaching algorithm principles and strategies in the abstract, we should provide some very direct example programs and documents.

So my colleagues wrote an open source program called Antiy Password Mixer. There is nothing technically sophisticated about it, nor is it a "domestic algorithm with independent intellectual property rights". It simply demonstrates a sounder way of using popular open source algorithm packages. The current Python version has only 300 lines of code; it wraps the use of RSA and HASH+SALT, and comes with a document giving specific examples of how to use it during initialization, registration and authentication.

You can find it here: /p/password-mixer/

Of course, just as we regret that many application developers pay too little attention to security, we ourselves do not really understand application development, so this code and documentation may look very ugly to application developers. Even at the risk of being looked down on, we still have to open our doors and show that security teams are not closed off.

At the same time, we have to get closer to applications, because we too use applications that we believe violate some security principles yet cannot modify, since we are not their developers.

Over the past 10 years, China's web applications have raced ahead and left security behind. Developers built today's landscape through their own diligence and drive, but in running so fast they also dropped a few things along the way, security among them. Maybe it is time to pick up what was dropped.

China's security industry, because of its conservatism, its sensitivity and many reasons of its own making, has drifted farther and farther from applications. While we were still imagining some perfect picture of security, we found that we could no longer even see the applications' backs ahead of us. Perhaps, at the moment when applications turn around to wait for us, it is time for us to pick up the pace, gather up the security they left behind, and catch up.