To make UTF-8 clear, it is easiest to start from a table:
U-00000000 – U-0000007F: 0xxxxxxx
U-00000080 – U-000007FF: 110xxxxx 10xxxxxx
U-00000800 – U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
U-00010000 – U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 – U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 – U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
To understand this table, it is enough to look carefully at its first two lines.
U-00000000 – U-0000007F: 0xxxxxxx
The first line means: if a UTF-8 encoded byte has the binary form 0xxxxxxx, i.e. it starts with 0 and its decimal value is between 0 and 127, then that single byte represents a character by itself, with exactly the same meaning as the corresponding ASCII code. Every other UTF-8 encoded byte has the form 1xxxxxxx, starting with 1 and therefore greater than 127, and a symbol encoded with such bytes takes at least 2 bytes. So the first bit of a byte is a switch that says whether this byte is an ASCII character. This is the compatibility mentioned earlier. The English definition states it as two properties of UTF-8 encoding:
UCS characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility). This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.
All UCS characters >U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set. Therefore, no ASCII byte (0x00-0x7F) can appear as part of any other character.
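As a quick illustration of the first property, here is a minimal sketch using the standard TextEncoder API (available in modern browsers and Node.js) to show that a pure-ASCII string has exactly the same bytes under UTF-8 as under ASCII:
var ascii = 'Hello';
var bytes = new TextEncoder().encode(ascii);   // the UTF-8 bytes of 'Hello'
alert(Array.from(bytes));                      // 72,101,108,108,111
alert(Array.from(ascii, function (c) { return c.charCodeAt(0); })); // the same numbers: the ASCII codes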
Then we look at the second line:
U-00000080 – U-000007FF: 110xxxxx 10xxxxxx
Let’s look at the first byte, 110xxxxx. Its meaning is: “I am not an ASCII code (my first bit is not 0); I am the first byte of a multi-byte character (my second bit is 1); the character I help represent consists of 2 bytes (my third bit is 0); and from the fourth bit onward I store the character’s information.”
Now the second byte, 10xxxxxx, which means: “I am not an ASCII code (my first bit is not 0); I am not the first byte of a multi-byte character (my second bit is 0); and from the third bit onward I store the character’s information.”
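To make these two byte roles concrete, here is a small sketch; byteRole is a hypothetical helper name of mine, and the classification follows directly from the leading bits just described:
function byteRole(b) {
    if ((b & 0x80) === 0) return 'single-byte ASCII character';   // 0xxxxxxx
    if ((b & 0xC0) === 0x80) return 'continuation byte';          // 10xxxxxx
    return 'first byte of a multi-byte character';                // 11xxxxxx
}
alert(byteRole(0x41)); // 'A' -> single-byte ASCII character
alert(byteRole(0xE5)); // first byte of '口' -> first byte of a multi-byte character
alert(byteRole(0x8F)); // second byte of '口' -> continuation byte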
From this two-byte example we can summarize: under UTF-8, in a long run of consecutive bytes, anywhere from 2 to 6 bytes may together represent one symbol. Compared with ASCII, which spends exactly one byte per symbol, we need room to store two extra pieces of information:
1. The starting position of each symbol, the position of a "starter"; in biological terms, it plays the role of the start codon AUG in protein translation.
2. The number of bytes the symbol occupies (strictly speaking, if every symbol has a starter, the length could be omitted, but carrying the length improves fault tolerance when some bytes are lost).
The solution: use the second bit of a byte to indicate whether this byte begins a character (the first bit is already taken: 0 means ASCII, 1 means non-ASCII), so the first byte of a multi-byte symbol must look like 11xxxxxx, a binary value between 192 and 255. The length information then starts at the third bit: a 0 in the third bit means the symbol takes 2 bytes, and each additional leading 1 adds one more byte. UTF-8 as originally defined goes up to 6-byte characters, whose starter needs four more leading 1s than the 2-byte starter 110xxxxx, giving 1111110x, as shown in the table above.
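As a sketch of the whole mechanism, here is a hand-written encoder that follows the table at the top of this section. The name utf8Encode is mine; note also that the modern standard (RFC 3629) restricts UTF-8 to at most 4 bytes and U+10FFFF, while this sketch implements the original 6-byte table shown above:
function utf8Encode(cp) {
    if (cp < 0x80) return [cp];                    // 0xxxxxxx: plain ASCII
    // Find the byte count from the range boundaries in the table.
    var limits = [0x800, 0x10000, 0x200000, 0x4000000];
    var len = 2;
    while (len - 2 < limits.length && cp >= limits[len - 2]) len++;
    var bytes = [];
    for (var i = len - 1; i > 0; i--) {            // continuation bytes carry 6 bits each
        bytes[i] = 0x80 | (cp & 0x3F);             // 10xxxxxx
        cp >>>= 6;
    }
    bytes[0] = ((0xFF << (8 - len)) & 0xFF) | cp;  // starter: len ones, a zero, then the high bits
    return bytes;
}
alert(utf8Encode(0x53E3).map(function (b) { return b.toString(16); })); // e5,8f,a3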
Let’s look at the English definition again, which expresses the same idea:
The first byte of a multibyte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD and it indicates how many bytes follow for this character. All further bytes in a multibyte sequence are in the range 0x80 to 0xBF. This allows easy resynchronization and makes the encoding stateless and robust against missing bytes.
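That “easy resynchronization” property can be sketched directly: landing on an arbitrary byte offset, we can find the start of the current character by skipping backwards over continuation bytes (0x80-0xBF), since only they match the 10xxxxxx pattern. syncToCharStart is a hypothetical helper of mine, and TextEncoder is assumed as before:
function syncToCharStart(bytes, i) {
    while (i > 0 && (bytes[i] & 0xC0) === 0x80) i--; // skip 10xxxxxx continuation bytes
    return i;
}
var buf = new TextEncoder().encode('a口b');  // bytes: 61 E5 8F A3 62
alert(syncToCharStart(buf, 2));              // 1: offset 2 is mid-character; '口' starts at offset 1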
The real information bits (i.e. the character's numeric value in the character set) are placed directly, in binary and in order, onto the 'x' positions in the table above. Take Chinese characters, the ones we Chinese programmers deal with most: their encoding interval lies within U-00000800 – U-0000FFFF, and the table shows that this interval is encoded in UTF-8 with three bytes (which is why a Chinese character under UTF-8 takes more storage than under the 2-byte EUC-CN encoding of the gb2312 character set). Let's use the character 口 (kǒu, 'mouth') as an example. Its number in Unicode is:
口: 21475 == 0x53E3 == binary 101001111100011
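Filling these bits into the three-byte template 1110xxxx 10xxxxxx 10xxxxxx is pure arithmetic: pad the number to 16 bits (0101001111100011), split it into 4 + 6 + 6 bits (0101, 001111, 100011), and drop the pieces onto the x positions, giving 11100101 10001111 10100011, i.e. the bytes 0xE5 0x8F 0xA3. Keep this result in mind; it reappears in the last example below.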
In JavaScript, run the following code (in the Firebug console, or by placing it between a pair of script tags in an HTML file):
alert('\u53e3');                   // get '口'
alert(escape('口'));               // get '%u53E3'
alert(String.fromCharCode(21475)); // get '口'
alert('口'.charCodeAt(0));         // get '21475'
alert(encodeURI('口'));            // get '%E5%8F%A3'
As you can see, a string literal can use the \u + hexadecimal Unicode code form to produce the character 口, and the String.fromCharCode method accepts the decimal Unicode code and produces 口 as well.
The second alert yields '%u53E3'. This %uXXXX form is a non-standard Unicode flavor of URI percent-encoding; it has been rejected by the W3C and appears in no RFC. The ECMA-262 standard does specify the behavior of escape, but presumably only as a temporary concession.
More interesting is the '%E5%8F%A3' produced by the fifth alert. What is it, and how was it obtained?
This is percent-encoding, specified in RFC 3986 and used pervasively in URIs: each of the three UTF-8 bytes of 口 is written as '%' followed by its hexadecimal value.
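To connect the two halves, here is a small sketch reproducing '%E5%8F%A3' by hand: take the three UTF-8 bytes of 口 and write each as '%' plus its hexadecimal value (TextEncoder assumed as before):
var bytes = new TextEncoder().encode('口');  // [0xE5, 0x8F, 0xA3], as derived above
var percent = Array.from(bytes)
    .map(function (b) { return '%' + b.toString(16).toUpperCase(); }) // bytes below 0x10 would need a leading zero
    .join('');
alert(percent);                     // '%E5%8F%A3', same as encodeURI('口')
alert(decodeURIComponent(percent)); // '口': the round trip back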