SoFunction
Updated on 2025-03-04

How to convert between Unicode and UTF-8 in PHP

Recently I needed to convert between Unicode codes and UTF-8, so I searched PHP's library functions and could not find one that encodes and decodes Unicode strings directly. If you can't find one, you can write it yourself.
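The difference between Unicode and UTF-8

Unicode is a character set that assigns a number (a code point) to every character, and UTF-8 is one of the encodings used to store those code points as bytes. In this article, "Unicode encoding" means the two-byte form (UCS-2/UTF-16), in which every character of the Basic Multilingual Plane occupies exactly two bytes. UTF-8 is variable-length. A Chinese character occupies two bytes in the two-byte form and three bytes in UTF-8, so the two-byte form is one byte shorter.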
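As a quick check of these byte counts, the sketch below builds the character 你 (U+4F60) in both forms from raw bytes and measures them. It assumes the two-byte form is big-endian UTF-16; the variable names are only illustrative.

```php
<?php
// "你" (U+4F60) written as raw bytes, so the result does not depend on
// the encoding in which this source file is saved.
$utf8  = "\xE4\xBD\xA0";     // UTF-8 form
$utf16 = pack('n', 0x4F60);  // assumed two-byte form: one 16-bit big-endian unit

echo strlen($utf8), "\n";    // 3
echo strlen($utf16), "\n";   // 2
echo bin2hex($utf16), "\n";  // 4f60
```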
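As originally designed, a UTF-8 encoded character could be up to 6 bytes long; the current standard (RFC 3629) limits sequences to 4 bytes, which covers U+0000 to U+10FFFF. Characters in the 16-bit BMP (Basic Multilingual Plane) need at most 3 bytes. The original UTF-8 encoding table is shown below: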

U-00000000 - U-0000007F: 0xxxxxxx
U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

The x positions are filled with the bits of the character's code number in binary, with the rightmost x holding the least significant bit. Only the shortest multibyte sequence that can represent the code number may be used. In a multibyte sequence, the number of leading 1 bits in the first byte equals the number of bytes in the whole sequence. The first row starts with 0 for ASCII compatibility and takes one byte, the second row is a two-byte sequence, the third row is three bytes (this is where Chinese characters fall), and so on. (Put simply, the count of leading 1s is the byte count.)
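The leading-1s rule can be sketched as a small function. This is only an illustration: the name utf8_sequence_length is hypothetical, and the sketch accepts only the 1 to 4 byte forms that current UTF-8 allows, returning 0 for anything else.

```php
<?php
// Hypothetical helper: number of bytes in a UTF-8 sequence,
// judged from the bit pattern of its first byte.
function utf8_sequence_length(int $leadByte): int {
    if (($leadByte & 0x80) === 0x00) return 1; // 0xxxxxxx (ASCII)
    if (($leadByte & 0xE0) === 0xC0) return 2; // 110xxxxx
    if (($leadByte & 0xF0) === 0xE0) return 3; // 1110xxxx
    if (($leadByte & 0xF8) === 0xF0) return 4; // 11110xxx
    return 0; // continuation byte (10xxxxxx) or an obsolete 5/6-byte lead
}

echo utf8_sequence_length(0xE4), "\n"; // 3
```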

How to convert Unicode to UTF-8

To convert Unicode to UTF-8, you first need to know how the two differ. In UTF-8, a character whose code is less than 0x80 (128) is an ASCII character; it occupies one byte and needs no conversion, because UTF-8 is compatible with ASCII. Take the Chinese character "你" ("you"): its Unicode code is 4F60, which is 100111101100000 in binary. To convert it the UTF-8 way, take the bits of the Unicode number from low to high, 6 bits at a time. Each group is placed into the matching format from the table, prefixed with that format's marker bits, and a leading group that is too short is padded with 0 on the left. The result looks like this:

unicode: 100111101100000                   4F60

utf-8:    11100100,10111101,10100000       E4BDA0

From the above, the relationship between Unicode and UTF-8 is easy to see. Once the UTF-8 format is known, the inverse operation is also possible: take the bits out of their positions in the format and reassemble them into the Unicode code (this is done with bit shifts). For example, the code of "你" is at least 0x800 and less than 0x10000, so it is stored in three bytes. For the first byte, shift the code right by 12 bits and OR (|) the result with the three-byte lead marker 11100000 (0xE0). For the second byte, shift the code right by 6 bits; the bits that belong to the first byte are still present, so AND (&) the value with 111111 (0x3F) to keep only the low six bits, then OR (|) it with 10000000 (0x80). The third byte needs no shift: AND the code with 111111 (0x3F) and OR the result with 10000000 (0x80).
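The shift-and-mask steps described here extend naturally to the other rows of the table. The following is a minimal sketch, not the article's own function: code_point_to_utf8 is a hypothetical name, it takes the code point as an integer, covers the 1 to 4 byte forms only, and assumes the input is a valid code point (it does not reject surrogates or values above 0x10FFFF).

```php
<?php
// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
function code_point_to_utf8(int $cp): string {
    if ($cp < 0x80) {
        return chr($cp);                          // 0xxxxxxx
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6))             // 110xxxxx
             . chr(0x80 | ($cp & 0x3F));          // 10xxxxxx
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))            // 1110xxxx
             . chr(0x80 | (($cp >> 6) & 0x3F))    // 10xxxxxx
             . chr(0x80 | ($cp & 0x3F));          // 10xxxxxx
    }
    return chr(0xF0 | ($cp >> 18))                // 11110xxx
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(code_point_to_utf8(0x4F60)), "\n"; // e4bda0
```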

How to convert UTF-8 back to Unicode

The conversion from UTF-8 to Unicode is also done with shifts: extract the payload bits from their positions in the UTF-8 format. In the example above, "你" is three bytes, so each byte is processed in turn from high to low. In UTF-8, "你" is 11100100, 10111101, 10100000. Starting from the high end, the first byte 11100100 carries the bits "0100"; to extract them, AND (&) the byte with 1111 (0x0F). (0x1F gives the same result here, because bit 4 of a three-byte lead byte is always 0.) Since each of the two following bytes contributes six bits, these bits belong above bit 12, so the result is shifted left by 12 bits, giving 0100,000000,000000. The second byte carries "111101": AND 10111101 with 111111 (0x3F), shift the result left by 6 bits, and OR (|) it with the previous result, giving 0100,111101,000000. Finally, AND the last byte with 111111 (0x3F) and OR it with the previous result to get 0100,111101,100000.
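The same extraction can be written for every sequence length. This is a hedged sketch rather than the article's implementation: utf8_to_code_point is a hypothetical name, it returns the code point as an integer, and it assumes the input holds one complete, well-formed UTF-8 character (continuation bytes are not validated).

```php
<?php
// Hypothetical helper: decode the first UTF-8 character of $bytes
// into its Unicode code point.
function utf8_to_code_point(string $bytes): int {
    $b0 = ord($bytes[0]);
    if ($b0 < 0x80) {                       // 0xxxxxxx
        return $b0;
    }
    if (($b0 & 0xE0) === 0xC0) {            // 110xxxxx
        return (($b0 & 0x1F) << 6)
             | (ord($bytes[1]) & 0x3F);
    }
    if (($b0 & 0xF0) === 0xE0) {            // 1110xxxx
        return (($b0 & 0x0F) << 12)
             | ((ord($bytes[1]) & 0x3F) << 6)
             | (ord($bytes[2]) & 0x3F);
    }
    if (($b0 & 0xF8) === 0xF0) {            // 11110xxx
        return (($b0 & 0x07) << 18)
             | ((ord($bytes[1]) & 0x3F) << 12)
             | ((ord($bytes[2]) & 0x3F) << 6)
             | (ord($bytes[3]) & 0x3F);
    }
    throw new InvalidArgumentException('Not a valid UTF-8 lead byte');
}

echo dechex(utf8_to_code_point("\xE4\xBD\xA0")), "\n"; // 4f60
```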

PHP code implementation: 

/**
 * Convert one three-byte UTF-8 character to its Unicode code
 * @param string $utf8_str A single UTF-8 character (three bytes)
 * @return string The Unicode code as a hexadecimal string
 */
function utf8_str_to_unicode($utf8_str) {
  // 0x0F keeps the four payload bits of the lead byte 1110xxxx
  $unicode = (ord($utf8_str[0]) & 0x0F) << 12;
  $unicode |= (ord($utf8_str[1]) & 0x3F) << 6;
  $unicode |= (ord($utf8_str[2]) & 0x3F);
  return dechex($unicode);
}

/**
 * Convert one Unicode code (U+0800 to U+FFFF) to a UTF-8 character
 * @param string $unicode_str The Unicode code as a hexadecimal string
 * @return string The UTF-8 character (three bytes)
 */
function unicode_to_utf8($unicode_str) {
  // The code must be an integer so that the bitwise operations work correctly
  $code = intval(hexdec($unicode_str));
  $ord_1 = 0xe0 | ($code >> 12);
  $ord_2 = 0x80 | (($code >> 6) & 0x3f);
  $ord_3 = 0x80 | ($code & 0x3f);
  return chr($ord_1) . chr($ord_2) . chr($ord_3);
}

Test it

$utf8_str = '我';

// This is the Unicode code of the Chinese character "你" ("you")
$unicode_str = '4f60';

// Outputs 6211
echo utf8_str_to_unicode($utf8_str) . "<br/>";

// Outputs the Chinese character "你"
echo unicode_to_utf8($unicode_str);

The conversions above were tested with Chinese characters (non-ASCII, three bytes in UTF-8) and support only a single character at a time [one complete UTF-8 character or one complete Unicode code]. I hope this is helpful for your learning.
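For whole strings rather than single characters, one possible shortcut is to lean on PHP's JSON functions, since json_encode() escapes non-ASCII characters as \uXXXX by default and json_decode() reverses that. The sketch below is an alternative to the bit-shifting approach, not part of it; the two helper names are hypothetical, and it assumes the JSON extension is available and the input is valid UTF-8. Note that json_encode() also escapes quotes and slashes, so this suits text without those characters.

```php
<?php
// Hypothetical helper: UTF-8 string -> \uXXXX escape sequences.
function utf8_to_u_escapes(string $utf8): string {
    $json = json_encode($utf8);
    if ($json === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return substr($json, 1, -1); // strip the surrounding double quotes
}

// Hypothetical helper: \uXXXX escape sequences -> UTF-8 string.
function u_escapes_to_utf8(string $escaped): string {
    $decoded = json_decode('"' . $escaped . '"');
    if (!is_string($decoded)) {
        throw new InvalidArgumentException('Input is not a valid escape string');
    }
    return $decoded;
}

$s = "\xE4\xBD\xA0\xE5\xA5\xBD";      // "你好" as raw UTF-8 bytes
echo utf8_to_u_escapes($s), "\n";     // \u4f60\u597d
```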