SoFunction
Updated on 2025-04-08

Tips for processing Chinese text in Perl

Since version 5.6, Perl has used UTF-8 internally to represent characters, so in principle it can handle Chinese and other non-ASCII text without trouble. We only need to make good use of the Encode module to take full advantage of Perl's UTF-8 characters.

Let's take Chinese text as an example. Suppose we have a Chinese string such as "测试文本" ("test text") and want to split it into individual characters. We can write:

use strict;
use warnings;
use Encode;

# "测试文本" means "test text"; this source file is assumed to be saved in GB2312.
my $dat = "测试文本";
my $str = decode("gb2312", $dat);          # byte stream -> Perl string
my @chars = split //, $str;                # one element per character
foreach my $char (@chars) {
    print encode("gb2312", $char), "\n";   # Perl string -> GB2312 byte stream
}

Try it and you will see the result: each character is printed on its own line.

Here we mainly use the decode and encode functions of the Encode module. To understand the role of these two functions, we need to understand several concepts:

1. Perl strings are encoded internally in UTF-8. They are made up of Unicode characters rather than individual bytes, and each UTF-8-encoded Unicode character occupies 1 to 4 bytes (variable length).

2. When data enters or leaves Perl's processing environment (for example, printing to the screen, or reading and writing files), Perl strings are not used directly; they must be converted to byte streams. Which encoding is used for that conversion is entirely up to you (or Perl will choose one for you). Once a Perl string has been encoded into a byte stream, the notion of characters no longer exists; what remains is a plain sequence of bytes, and how to interpret it is your own job.
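
The difference between the two points above can be seen by measuring the same text before and after decoding. The following is a minimal sketch: the variable names are illustrative, and the bytes are assumed to be the GB2312 encoding of "测试文本", written as hex escapes so the example does not depend on how the script file itself is saved.

```perl
use strict;
use warnings;
use Encode qw(decode);

# GB2312 bytes for the four-character string "测试文本" (assumed sample data).
my $octets = "\xB2\xE2\xCA\xD4\xCE\xC4\xB1\xBE";

my $byte_len = length $octets;              # length of the byte stream, in bytes
my $string   = decode("gb2312", $octets);   # byte stream -> Perl string
my $char_len = length $string;              # length of the Perl string, in characters

print "bytes: $byte_len, characters: $char_len\n";   # bytes: 8, characters: 4
```

The same length function reports 8 for the byte stream and 4 for the decoded Perl string, because only the latter is made up of characters.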

So if we want Perl to treat text according to our notion of characters, the text has to be stored as a Perl string. However, the text we normally write (including string literals typed directly into a program) is stored as plain bytes, that is, as a byte stream. This is where the encode and decode functions come in.

The encode function, as the name suggests, encodes a Perl string. It encodes the characters of the Perl string in the specified encoding and produces a byte stream, so it is needed whenever data is handed to something outside Perl's processing environment. Its format is very simple:
$octets = encode(ENCODING, $string [, CHECK])

$string: the Perl string to be encoded
ENCODING: the name of the encoding to use
$octets: the resulting encoded byte stream
CHECK: how to handle malformed characters (that is, characters that Perl cannot recognize) during conversion. It can usually be omitted.
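
As a hedged illustration of the CHECK argument, the sketch below encodes a Chinese character into ASCII, which cannot represent it. The variable names are illustrative; Encode::FB_CROAK and Encode::LEAVE_SRC are constants provided by the Encode module.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One Chinese character (U+6D4B) that plain ASCII cannot represent.
my $string = "\x{6D4B}";

# CHECK omitted: the unrepresentable character is replaced by a
# substitution character instead of raising an error.
my $lossy = encode("ascii", $string);

# CHECK = FB_CROAK: the same conversion dies, so it is wrapped in eval.
# LEAVE_SRC asks Encode not to modify $string while checking.
my $strict = eval {
    encode("ascii", $string, Encode::FB_CROAK | Encode::LEAVE_SRC);
};
my $failed = defined $strict ? 0 : 1;

print "without CHECK: $lossy\n";
print "with FB_CROAK, conversion failed: ", ($failed ? "yes" : "no"), "\n";
```

Omitting CHECK is convenient, but it silently loses data; passing FB_CROAK is one way to find out that the chosen encoding does not fit the text.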

The available encodings vary considerably with the environment. By default, utf8, ascii, ascii-ctrl, iso-8859-1, and a few others are available.
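
To check which encodings a particular installation offers, the Encode module can be queried directly. This is a small sketch with illustrative variable names; the lists it prints depend on the Perl installation, so no output is shown.

```perl
use strict;
use warnings;
use Encode;

# Encodings that are available as soon as Encode itself is loaded.
my @loaded = Encode->encodings();

# Every encoding this installation can provide, including those in
# modules that have not been loaded yet.
my @all = Encode->encodings(":all");

# find_encoding returns an encoding object, or undef for an unknown name.
my $gb = find_encoding("gb2312");

print "loaded by default: @loaded\n";
print "total available: ", scalar(@all), "\n";
print "gb2312 resolves to: ", (defined $gb ? $gb->name : "(not available)"), "\n";
```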

The decode function decodes a byte stream. It interprets the given byte stream according to the encoding you specify and converts it into a Perl string encoded in UTF-8. Generally speaking, text data obtained from the terminal or from a file should be converted into Perl strings with decode. Its format is:

$string = decode(ENCODING, $octets [, CHECK])
The meanings of $string, ENCODING, $octets and CHECK are the same as above.
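
A sketch of decoding text read from a file might look like the following. To keep it self-contained, an in-memory file handle stands in for a real GB2312 text file; the sample bytes (two lines, "测试" and "文本") and all variable names are assumptions made for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Stand-in for a GB2312 text file: two lines held in memory.
my $file_bytes = "\xB2\xE2\xCA\xD4\n\xCE\xC4\xB1\xBE\n";
open my $in, '<', \$file_bytes or die "cannot open in-memory file: $!";

my @lines;
while (my $octets = <$in>) {
    chomp $octets;
    push @lines, decode("gb2312", $octets);   # byte stream -> Perl string
}
close $in;

# Each line is now two characters long and can be re-encoded in a
# different encoding (here UTF-8) on the way out.
my @lengths = map { length } @lines;
my $utf8    = encode("UTF-8", join "", @lines);

print "line lengths: @lengths, UTF-8 bytes: ", length($utf8), "\n";
# line lengths: 2 2, UTF-8 bytes: 12
```

Decoding on input and encoding on output also makes it straightforward to convert between encodings, since the Perl string in the middle is independent of both.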

The program above is now easy to understand. Because the string is written as plain text in the source, it is already stored as a byte stream and has lost its meaning as characters, so the first step is to convert it into a Perl string with decode. Since Chinese text is commonly encoded in gb2312, decode is told to use gb2312 as well. After the conversion, Perl treats characters the same way we do: the usual string functions handle characters correctly, except for functions that inherently treat a string as a sequence of bytes (such as vec, pack, and unpack). That is why split can cut the string into individual characters. Finally, because a UTF-8-encoded Perl string should not be written out directly, each character is encoded back into a gb2312 byte stream with encode and then printed.
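
The point about ordinary string functions can be illustrated with substr, reverse, and index, which behave character by character once the text has been decoded. This is only a sketch: the sample bytes are assumed to be "测试文本" in GB2312, the variable names are illustrative, and results are printed as hex so the output does not depend on the terminal's encoding.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $octets = "\xB2\xE2\xCA\xD4\xCE\xC4\xB1\xBE";   # "测试文本" in GB2312
my $string = decode("gb2312", $octets);

# On the decoded Perl string, these functions work on whole characters.
my $second   = substr $string, 1, 1;                        # second character
my $reversed = scalar reverse $string;                      # characters in reverse order
my $position = index $string, decode("gb2312", "\xCE\xC4"); # character offset of "文"

# On the raw byte stream, substr cuts a character in half.
my $half = substr $octets, 1, 1;

print "second character: ", unpack("H*", encode("gb2312", $second)), "\n";   # cad4
print "reversed: ", unpack("H*", encode("gb2312", $reversed)), "\n";         # b1becec4cad4b2e2
print "offset of third character: $position\n";                              # 2
print "second byte only: ", unpack("H*", $half), "\n";                       # e2
```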