Garbled text between GBK and UTF-8
We know that computer memory stores binary data and that network transmission also carries binary data, yet what is ultimately presented to the user is a string. Converting between binary and strings requires encoding and decoding. If there were only one character encoding in the world, garbled text would not exist. In reality there are many encodings, such as UTF-8, UTF-32, UTF-16, GBK, GB2312, ISO-8859-1, Big5, Unicode, and so on. Since each encoding follows different rules, data encoded with one generally cannot be decoded with another.
For example, in UTF-8 a letter takes one byte, a common Chinese character takes three bytes, and some rare Chinese characters take four bytes, while in GBK a letter takes one byte and a Chinese character takes two bytes.
There is a view that what is stored in memory is the Unicode code point, and that only when the data needs to be stored or transmitted is it converted, once, from Unicode into the binary of some other encoding (to be verified). I personally find this reasonable, since every character has a unique Unicode code point.
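In Java, for example, a String in memory is a sequence of UTF-16 code units, i.e. Unicode, no matter which encoding the bytes originally came from. A tiny sketch (class name is mine; compile the source file with the matching -encoding so the literal is read correctly):

public class UnicodeInMemory {
    public static void main(String[] args) {
        String s = "你";
        // Java stores strings as UTF-16 code units; the code point of '你' is U+4F60,
        // regardless of which charset the file or console uses.
        System.out.println(Integer.toHexString(s.codePointAt(0))); // prints 4f60
    }
}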
The difficulty of troubleshooting garbled text lies in finding which step went wrong, but the essence is always the same: the encoding used to read the binary is not the encoding that originally converted the string into binary.
To be clear about terms: encoding converts a string into binary, and decoding converts binary back into a string.
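In Java terms, encoding is String.getBytes(charset) and decoding is new String(bytes, charset). A minimal sketch:

import java.nio.charset.StandardCharsets;

public class EncodeDecodeDemo {
    public static void main(String[] args) {
        String text = "你好";
        // Encoding: String -> bytes under a specific charset.
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        // Decoding: bytes -> String; it must use the same charset to round-trip.
        String back = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(back.equals(text)); // true only when both sides agree on the charset
    }
}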
UTF-8 encoding, GBK decoding
Here we discuss the garbled text produced when converting between GBK and UTF-8. Let's go straight to the code:
package ; import ; public class CodingTest { public static void main(String[] args) throws UnsupportedEncodingException { String str = "Hello World"; ("String length:"+()); byte[] utfBytes = ("utf-8"); ("UTF-8 requires"++"Byte Storage"); byte[] gbkBytes = ("gbk"); ("Gbk requires"++"Byte Storage"); } }
Run the above code and it prints the following:
String length: 5
UTF-8 requires 15 bytes of storage
GBK requires 10 bytes of storage
As you can see, UTF-8 needs 3 bytes to store one Chinese character, while GBK needs 2 bytes.
Now let's test with a single character.
package ; import ; public class CodingTest { public static void main(String[] args) throws UnsupportedEncodingException { String str = "you"; byte[] utfBytes = ("utf-8"); for(byte utfByte:utfBytes){ //The decimal corresponding to bytes is a negative number. Because the two's in Java are represented by complement, here 0xff is used to restore the data represented by int, and then convert it into hexadecimal. (((utfByte & 0xFF)) +","); } (); String utf2gbkStr = new String(("utf-8"),"gbk"); ("utf-8 converted to gbk:"+utf2gbkStr); byte[] gbkBytes = ("gbk"); for(byte gbkByte:gbkBytes){ (((gbkByte & 0xFF))+","); } (); String gbk2utfStr = new String(("gbk"),"utf-8"); ("Gbk converts to utf-8:"+gbk2utfStr); } }
Run the above code and the result is:
e4,bd,a0,
UTF-8 decoded as GBK: 浣�
e4,bd,3f,
GBK re-encoded, decoded as UTF-8: �?
Now test with two characters: change String str = "你" in the code above to String str = "你好". Run it and the result is:
e4,bd,a0,e5,a5,bd,
UTF-8 decoded as GBK: 浣犲ソ
e4,bd,a0,e5,a5,bd,
GBK re-encoded, decoded as UTF-8: 你好
In the experiment above, garbled text appeared when the UTF-8 bytes were decoded as GBK, which is easy to understand. But when we reverse the process and decode the GBK bytes as UTF-8 again, a single Chinese character stays garbled while two characters are displayed correctly. What is going on?
After some research: to explain this clearly, we need to start from their encoding rules.
ISO-8859-1
A single-byte encoding, backward compatible with ASCII. Its range is 0x00-0xFF: 0x00-0x7F is identical to ASCII, 0x80-0x9F holds control characters, and 0xA0-0xFF holds printable characters.
GBK
A variable-length encoding: English letters use a single byte, fully compatible with ASCII, while Chinese characters use two bytes. The double-byte range runs from 8140 to FEFE (excluding xx7F).
- Single byte: 00000000 - 01111111
- Double byte: 10000001 01000000 - 11111110 11111110 (excluding xxxxxxxx 01111111)
Single-byte and double-byte sequences are distinguished by the high bit of the first byte: 0 means single byte, 1 means double byte.
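As a rough illustration of that rule, here is a small Java sketch that walks a GBK byte array and splits it into single-byte and double-byte groups (names are mine; a real validator would also check the 8140-FEFE range and the excluded 0x7F trailing byte):

import java.io.UnsupportedEncodingException;

public class GbkWalker {
    // Rough walk over GBK bytes: high bit 0 -> single byte, otherwise consume two bytes.
    static void walk(byte[] gbk) {
        for (int i = 0; i < gbk.length; ) {
            int b = gbk[i] & 0xFF;
            if (b < 0x80) {
                System.out.println("single byte: " + Integer.toHexString(b));
                i += 1;
            } else {
                int trail = (i + 1 < gbk.length) ? (gbk[i + 1] & 0xFF) : -1;
                System.out.println("double byte: " + Integer.toHexString(b)
                        + " " + (trail >= 0 ? Integer.toHexString(trail) : "<missing>"));
                i += 2;
            }
        }
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        walk("a你".getBytes("gbk")); // prints one single-byte group and one double-byte group
    }
}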
UTF-8
A variable-length character encoding and a concrete implementation of Unicode. UTF-8 was originally designed to encode Unicode characters in 1 to 6 bytes (the current standard uses at most 4).
UTF-8 encoding rules: if a character takes one byte, its highest bit is 0; if it takes multiple bytes, the number of leading 1 bits in the first byte equals the number of bytes in the sequence, and every remaining byte starts with 10 (see the sketch after the list below).
- 1 byte 0xxxxxxx
- 2 bytes 110xxxxx 10xxxxxx
- 3 bytes 1110xxxx 10xxxxxx 10xxxxxx
- 4 bytes 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
- 5 bytes 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
- 6 bytes 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
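Based on the table above, the length of a UTF-8 sequence can be read straight from its lead byte. A minimal Java sketch (method and class names are mine, limited to the 4-byte forms that are valid today):

public class Utf8LeadByte {
    // Number of bytes in a UTF-8 sequence, judged only from the lead byte (not a full validator).
    static int sequenceLength(byte lead) {
        int b = lead & 0xFF;
        if ((b & 0b1000_0000) == 0)           return 1; // 0xxxxxxx
        if ((b & 0b1110_0000) == 0b1100_0000) return 2; // 110xxxxx
        if ((b & 0b1111_0000) == 0b1110_0000) return 3; // 1110xxxx
        if ((b & 0b1111_1000) == 0b1111_0000) return 4; // 11110xxx
        return -1; // 10xxxxxx continuation byte or invalid lead byte
    }

    public static void main(String[] args) {
        System.out.println(sequenceLength((byte) 0x41)); // 'A' -> 1
        System.out.println(sequenceLength((byte) 0xE4)); // lead byte of '你' -> 3
        System.out.println(sequenceLength((byte) 0xBD)); // continuation byte -> -1
    }
}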
With the GBK and UTF-8 rules above in mind, let's analyze why a single Chinese character stays garbled while two characters display correctly.
"you"
Binary of its UTF-8 encoding: 11100100 10111101 10100000
Decode this binary with GBK. By GBK's rule, the high bit of the first byte is 1, so double-byte decoding is applied.
"11100100 10111101" is decoded into "Huan", "10100000" is illegal for GBK, so it is decoded into a special character "�".
Now, can "浣�" be restored back to "你"?
Binary of its GBK encoding: 11100100 10111101 00111111 ("�" cannot be encoded in GBK, so it is replaced with "?", i.e. 00111111)
This binary no longer conforms to UTF-8's rules at all, so decoding it with UTF-8 produces the special characters "�?".
From this we can see: if a byte sequence does not conform to the rules of the current encoding, it is decoded into a replacement character, and once that replacement character is re-encoded, it does not map back to the original bytes.
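If silently substituting replacement characters is not what you want, Java's java.nio.charset API can be asked to report the error instead. A small sketch, assuming the runtime provides the GBK charset (standard JDKs do); the strict decode below fails on the dangling 0xA0:

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

public class StrictDecode {
    public static void main(String[] args) throws Exception {
        byte[] utf8Bytes = "你".getBytes("utf-8"); // e4 bd a0
        try {
            // REPORT makes the decoder throw instead of inserting replacement characters,
            // so the mismatch is detected rather than silently producing garbled text.
            Charset.forName("GBK").newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(utf8Bytes));
            System.out.println("bytes are valid GBK");
        } catch (CharacterCodingException e) {
            System.out.println("not valid GBK: " + e);
        }
    }
}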
In the same way, let's analyze why "你好" is displayed correctly in the end.
Binary of its UTF-8 encoding: 11100100 10111101 10100000 11100101 10100101 10111101
Decoding this binary with GBK, double-byte decoding is applied throughout: "11100100 10111101" is decoded into "浣", "10100000 11100101" into "犲", and "10100101 10111101" into "ソ".
Now, can "浣犲ソ" be restored to "你好"?
Binary of its GBK encoding: 11100100 10111101 10100000 11100101 10100101 10111101
The binary is identical to the original, so decoding it with UTF-8 necessarily gives back "你好".
So, when a string is encoded with UTF-8, decoded with GBK, the resulting string re-encoded with GBK, and those bytes finally decoded with UTF-8, whether the original string comes back depends on whether the UTF-8 bytes happen to conform to GBK's encoding rules: if they do, the string can be restored; if they do not, it cannot.
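Based on that conclusion, whether a particular string survives the round trip can be checked programmatically. A minimal sketch (the method name is mine):

import java.io.UnsupportedEncodingException;
import java.util.Arrays;

public class RoundTripCheck {
    // True if the string survives: UTF-8 encode -> GBK decode -> GBK encode -> UTF-8 decode.
    static boolean survives(String original) throws UnsupportedEncodingException {
        byte[] utf8 = original.getBytes("utf-8");
        String garbled = new String(utf8, "gbk");   // the garbled intermediate string
        byte[] back = garbled.getBytes("gbk");      // re-encode the garbled string
        return Arrays.equals(utf8, back)            // bytes survived the GBK round trip
                && new String(back, "utf-8").equals(original);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        System.out.println(survives("你"));   // false: the lone 0xA0 byte is lost
        System.out.println(survives("你好")); // true: every byte pair happens to be valid GBK
    }
}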
GBK encoding, UTF-8 decoding
package ; import ; public class CodingTest { public static void main(String[] args) throws UnsupportedEncodingException { String str = "Hello"; byte[] gbkBytes = ("gbk"); for(byte gbkByte:gbkBytes){ //The decimal corresponding to bytes is a negative number. Because the two's in Java are represented by complement, here 0xff is used to restore the data represented by int, and then convert it into hexadecimal. (((gbkByte & 0xFF)) +","); } (); String gbk2utfStr = new String(("gbk"),"utf-8"); ("Gbk converts to utf-8:"+gbk2utfStr); byte[] utfBytes = ("utf-8"); for(byte utfByte:utfBytes){ (((utfByte & 0xFF))+","); } (); String utf2gbkStr = new String(("utf-8"),"gbk"); ("utf-8 converted to gbk:"+utf2gbkStr); } }
Run the above code and the result is:
c4,e3,ba,c3,
GBK decoded as UTF-8: ���
ef,bf,bd,ef,bf,bd,ef,bf,bd,
UTF-8 re-encoded, decoded as GBK: 锟斤拷锟�
The results above should be what we expect; let's analyze them with the same method.
"Hello" GBK encoding binary: 11000100 11100011 10111010 11000011
The GBK-encoded bytes do not match UTF-8's rules, so UTF-8 can only decode them as follows. The first byte starts with "110", which should introduce a two-byte sequence, but the next byte does not start with "10", so "11000100" is decoded into "�". The second byte starts with "1110", which should introduce a three-byte sequence; the following byte does start with "10", but the byte after that starts with "110", so "11100011 10111010" is decoded into "�". Likewise, the final "11000011" is decoded into "�". This "�" is the replacement character used whenever no rule matches.
Binary of the UTF-8 encoding of "���": 11101111 10111111 10111101 11101111 10111111 10111101 11101111 10111111 10111101
This binary is completely different from the original, so the original string cannot come back. Decoding it by GBK's rules: "11101111 10111111" is decoded into "锟", "10111101 11101111" into "斤", "10111111 10111101" into "拷", the next "11101111 10111111" into "锟", and the final lone "10111101" cannot form a valid GBK character, so it becomes the replacement character "�".
In general, a string that was GBK-encoded and then decoded with UTF-8 cannot be restored by reversing the process, because GBK-encoded bytes rarely happen to satisfy UTF-8's strict encoding rules.
Summary
In theory, once garbled text appears, whether it can be restored depends on the direction. With UTF-8 encoding and GBK decoding it can sometimes be restored and sometimes not, depending on whether the UTF-8 bytes happen to conform to GBK's rules; with GBK encoding and UTF-8 decoding there is essentially no hope of restoring it.
In practice, there is one case where garbled text can always be restored to the original string: encoding in any format and decoding with ISO-8859-1. Because ISO-8859-1 is a single-byte encoding that maps every byte value, the garbled string can always be re-encoded back to the original bytes.
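A minimal sketch of that ISO-8859-1 round trip: the garbled intermediate string is unreadable, but re-encoding it with ISO-8859-1 gives back exactly the original bytes.

import java.io.UnsupportedEncodingException;
import java.util.Arrays;

public class Latin1RoundTrip {
    public static void main(String[] args) throws UnsupportedEncodingException {
        byte[] original = "你好".getBytes("utf-8");
        // Decoding with ISO-8859-1 produces unreadable text, but nothing is lost:
        String garbled = new String(original, "iso-8859-1");
        byte[] restored = garbled.getBytes("iso-8859-1");
        System.out.println(Arrays.equals(original, restored)); // true
        System.out.println(new String(restored, "utf-8"));     // 你好
    }
}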
A small extra piece of knowledge:
There are two ways to indicate the base of a number: prefix notation and suffix notation.
Prefix notation
- Hexadecimal: 0x
- Decimal: No prefix
- Octal: 0
- Binary: no standard prefix (some languages use 0b)
Suffix notation
- B: Binary number
- Q: Octal number
- D: Decimal number
- H: Hexadecimal number
For decimal numbers, the suffix D can be omitted.
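For reference, Java itself uses the prefix style; a quick sketch (Java has no suffix notation, and since Java 7 its binary prefix is 0b):

public class RadixLiterals {
    public static void main(String[] args) {
        int hex = 0xFF;   // hexadecimal prefix 0x
        int oct = 017;    // octal prefix 0
        int bin = 0b1111; // binary prefix 0b (Java 7+)
        int dec = 255;    // decimal, no prefix
        System.out.println(hex + " " + oct + " " + bin + " " + dec); // 255 15 15 255
    }
}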
The above is based on my personal experience. I hope it can serve as a useful reference, and I appreciate your support.