In-depth analysis of JSP and Servlet's handling of Chinese

Every region of the world has its own language, and these regional differences lead directly to differences in locale. Handling language issues correctly is therefore an important part of developing an internationalized program.

This problem exists all over the world, so Java provides a global solution. The methods described in this article are applied to Chinese, but they extend equally to the languages of other countries and regions.

Chinese characters are double-byte characters: each one occupies two bytes (16 bits), called the high byte and the low byte. The Chinese character encoding mandated in China is GB2312, and almost every application that handles Chinese supports it. GB2312 contains the level-1 and level-2 Chinese characters plus nine zones of symbols. The high byte ranges from 0xa1 to 0xfe, and the low byte also ranges from 0xa1 to 0xfe. Within that space, Chinese characters occupy 0xb0a1 to 0xf7fe.

There is another encoding called GBK, but it is a specification rather than a mandatory standard. GBK provides 20902 Chinese characters, is compatible with GB2312, and has an encoding range of 0x8140 to 0xfefe. Every character in GBK maps one-to-one to Unicode 2.0.

In the near future, China will issue another standard: GB18030-2000 (GBK2K). It adds the scripts of ethnic minorities such as Tibetan and Mongolian and fundamentally solves the shortage of code positions. Note that it is no longer fixed-length: the two-byte part is compatible with GBK, and the four-byte part holds the extended characters and glyphs. In a four-byte code, the first and third bytes range from 0x81 to 0xfe, and the second and fourth bytes range from 0x30 to 0x39.
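The byte ranges quoted above can be turned into a small classifier. This is only a sketch: the class and method names (GbRangeDemo, isGb2312Hanzi, sequenceLength) are made up for illustration, and the checks apply just the ranges stated in this article instead of consulting a real code table.

```java
final class GbRangeDemo {
    // True if the byte pair lies in the GB2312 Chinese-character range
    // 0xB0A1..0xF7FE, with the low byte between 0xA1 and 0xFE.
    static boolean isGb2312Hanzi(int high, int low) {
        return high >= 0xB0 && high <= 0xF7 && low >= 0xA1 && low <= 0xFE;
    }

    // Length of a sequence starting with bytes b1, b2 under the layout described
    // in the text: 1 (ASCII), 2 (GBK-compatible), 4 (four-byte extension),
    // or -1 if the pair fits none of the stated ranges.
    static int sequenceLength(int b1, int b2) {
        if (b1 <= 0x7F) return 1;
        if (b1 < 0x81 || b1 > 0xFE) return -1;
        if (b2 >= 0x30 && b2 <= 0x39) return 4;
        if (b2 >= 0x40 && b2 <= 0xFE) return 2;
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(isGb2312Hanzi(0xD6, 0xD0));  // a byte pair inside the Chinese-character range
        System.out.println(sequenceLength(0x81, 0x40)); // lowest code of the GBK range
        System.out.println(sequenceLength(0x81, 0x30)); // start of a four-byte code
    }
}
```

The second byte alone is enough to tell a two-byte code from a four-byte one, because 0x30 to 0x39 lies below the GBK low-byte range.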

This article does not intend to introduce Unicode. If you are interested, you can browse "/" for more information. Unicode has one key property: it aims to cover every character in the world. Every regional character set can therefore be mapped to Unicode, and Java relies on this to convert between different languages.

In JDK, Chinese-related encodings are:

Table 1 List of Chinese-related encodings in JDK

Encoding name	Description
ASCII	7-bit, same as ascii7
ISO8859-1	8-bit, same as 8859_1, ISO-8859-1, ISO_8859-1, latin1, etc.
GB2312-80	16-bit, same as gb2312, gb2312-1980, EUC_CN, euccn, 1381, Cp1381, 1383, Cp1383, ISO2022CN, ISO2022CN_GB, etc.
GBK	Same as MS936; note: case sensitive
UTF8	Same as UTF-8
GB18030	Same as cp1392 and 1392; currently supported by few JDKs

In actual programming, the encodings you meet most often are GB2312 (GBK) and ISO8859-1.
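To check which of these names a particular runtime accepts, the standard java.nio.charset.Charset class can be queried. The sketch below is illustrative only: CharsetAliasDemo and canonical are hypothetical names, and whether GB2312, GBK, or GB18030 is available depends on the JDK in use, so those are only probed, not assumed.

```java
import java.nio.charset.Charset;

final class CharsetAliasDemo {
    // Canonical name that the running JVM reports for an alias.
    static String canonical(String alias) {
        return Charset.forName(alias).name();
    }

    public static void main(String[] args) {
        for (String alias : new String[] {"ascii7", "8859_1", "latin1", "UTF8"}) {
            System.out.println(alias + " -> " + canonical(alias));
        }
        // The Chinese encodings are not guaranteed on every runtime, so probe first.
        for (String name : new String[] {"GB2312", "GBK", "GB18030"}) {
            System.out.println(name + " supported: " + Charset.isSupported(name));
        }
    }
}
```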

Why does "?" appear

As mentioned above, the conversion between different languages is done through Unicode. Suppose there are two different languages A and B. The conversion step is: first convert A into Unicode, and then convert Unicode into B.

Here is an example. GB2312 contains the Chinese character "李" (Li), encoded as "C0EE", and we want to convert it to ISO8859-1. The steps are: first convert "李" to Unicode, which gives "674E", and then convert "674E" to an ISO8859-1 character. Of course this mapping fails, because ISO8859-1 has no character corresponding to "674E".

The problem appears when a mapping fails. When converting from some language to Unicode, if the source language has no such character, the result is the Unicode code "\ufffd" ("\u" denotes a Unicode escape). When converting from Unicode to some language, if the target language has no corresponding character, the result is "0x3f" ("?"). This is where the "?" comes from.
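Both failure directions can be reproduced with charsets that every Java runtime provides. The following is a minimal sketch, with QuestionMarkDemo and its method names invented for illustration; it uses ISO8859-1 as the target that lacks "674E" and US-ASCII as a source in which byte 0x80 has no meaning.

```java
import java.nio.charset.StandardCharsets;

final class QuestionMarkDemo {
    // Unicode -> ISO8859-1: a character with no counterpart is encoded as 0x3F ("?").
    static int encodeToLatin1(char c) {
        byte[] bytes = String.valueOf(c).getBytes(StandardCharsets.ISO_8859_1);
        return bytes[0] & 0xFF;
    }

    // Bytes -> Unicode: a byte that is invalid in the source charset becomes U+FFFD.
    static char decodeFromAscii(int b) {
        return new String(new byte[] {(byte) b}, StandardCharsets.US_ASCII).charAt(0);
    }

    public static void main(String[] args) {
        System.out.printf("U+674E as ISO8859-1: 0x%02x%n", encodeToLatin1('\u674e'));
        System.out.printf("0x80 as US-ASCII: U+%04X%n", (int) decodeFromAscii(0x80));
    }
}
```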

For example, applying new String(buf, "gb2312") to the byte stream buf = "0x80 0x40 0xb0 0xa1" gives "\ufffd\u554a", and printing it with println gives "?啊", because "0x80 0x40" is a character in GBK that does not exist in GB2312.

As another example, perform new String(str.getBytes("GBK")) on the string str = "\u00d6\u00ec\u00e9\u0046\u00bb\u00f9". The intermediate bytes are "3fa8aca8a6463fa8b4": "\u00d6" has no corresponding character in GBK and becomes "3f", "\u00ec" maps to "a8ac", "\u00e9" maps to "a8a6", "\u0046" maps to "46" (it is an ASCII character), "\u00bb" is not found and becomes "3f", and finally "\u00f9" maps to "a8b4". Printing the resulting string with println gives "?ìéF?ù". Note that the output is not all question marks, because some of these characters do have a mapping between GBK and Unicode; this example is the best proof.
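The byte values in this example can be inspected with a short program. This is a sketch under the assumption that the runtime ships a GBK charset, which is not guaranteed, so the code checks first; GbkQuestionMarkDemo and encodeHex are hypothetical names. According to the walkthrough above, a runtime with GBK should print 3fa8aca8a6463fa8b4.

```java
import java.nio.charset.Charset;

final class GbkQuestionMarkDemo {
    // Lowercase hex of the bytes produced by encoding s in the named charset.
    static String encodeHex(String s, String charsetName) {
        StringBuilder hex = new StringBuilder();
        for (byte b : s.getBytes(Charset.forName(charsetName))) {
            hex.append(String.format("%02x", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        String s = "\u00d6\u00ec\u00e9\u0046\u00bb\u00f9";
        // GBK may be missing on some Java runtimes, so check before using it.
        if (Charset.isSupported("GBK")) {
            System.out.println(encodeHex(s, "GBK"));
        } else {
            System.out.println("GBK is not available on this runtime");
        }
    }
}
```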

So when Chinese characters are transcoded incorrectly, the result is not necessarily a question mark. Still, wrong is wrong: whether the output is partly or entirely garbled makes no qualitative difference.

One might ask: what happens if a character exists in the source character set but not in Unicode? I do not know, because I have no source character set at hand that would allow this test. One thing is certain, though: such a source character set is not properly standardized. In Java, if this happens, an exception will be thrown.

What is UTF

UTF stands for Unicode Transformation Format, a way of writing Unicode text as a sequence of bytes. The form discussed here (UTF-8) is defined as follows:

(1) If the first 9 bits of the 16-bit Unicode character are 0, it is represented by one byte. The first bit of this byte is "0", and the remaining 7 bits are the same as the last 7 bits of the original character. For example, "\u0034" (0000 0000 0011 0100) is represented by "34" (0011 0100), the same as the source Unicode character;

(2) If the first 5 bits of the 16-bit Unicode character are 0, it is represented by 2 bytes. The first byte starts with "110", and its next 5 bits are the highest 5 bits that remain after removing the first 5 zeros from the source character; the second byte starts with "10", and its next 6 bits are the lower 6 bits of the source character. For example, "\u025d" (0000 0010 0101 1101) is converted into "c99d" (1100 1001 1001 1101);

(3) If neither of the above rules applies, it is represented by three bytes. The first byte starts with "1110", and its last four bits are the upper four bits of the source character; the second byte starts with "10", and its last six bits are the middle six bits of the source character; the third byte starts with "10", and its last six bits are the lower six bits of the source character. For example, "\u9da7" (1001 1101 1010 0111) is converted into "e9b6a7" (1110 1001 1011 0110 1010 0111);
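The three rules can be written out as bit operations and checked against the worked examples. The following is a sketch of the rules exactly as listed, for a single 16-bit character; UtfRulesDemo and encode are hypothetical names, and special cases outside these three rules are not handled.

```java
final class UtfRulesDemo {
    // Encodes one 16-bit character with the three rules and returns lowercase hex.
    static String encode(char c) {
        int v = c;
        int[] out;
        if (v < 0x80) {
            // Rule 1: the first 9 bits are zero, so one byte 0xxxxxxx.
            out = new int[] {v};
        } else if (v < 0x800) {
            // Rule 2: the first 5 bits are zero, so 110xxxxx 10xxxxxx.
            out = new int[] {0xC0 | (v >> 6), 0x80 | (v & 0x3F)};
        } else {
            // Rule 3: everything else, so 1110xxxx 10xxxxxx 10xxxxxx.
            out = new int[] {0xE0 | (v >> 12), 0x80 | ((v >> 6) & 0x3F), 0x80 | (v & 0x3F)};
        }
        StringBuilder hex = new StringBuilder();
        for (int b : out) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        for (char c : new char[] {'\u0034', '\u025d', '\u9da7'}) {
            System.out.printf("U+%04X -> %s%n", (int) c, encode(c));
        }
    }
}
```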

The relationship between Unicode and UTF in Java programs can be described as follows, although it is not absolute: while a string is in memory it is represented as Unicode, and when it is saved to a file or another medium it is stored as UTF. This conversion is performed by writeUTF and readUTF.
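A round trip through writeUTF and readUTF shows the two forms side by side. This is a minimal sketch using an in-memory buffer in place of a file; WriteUtfDemo, store, and load are names chosen for illustration. Note that writeUTF prefixes the UTF bytes with a two-byte length.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class WriteUtfDemo {
    // Memory (Unicode) -> storage (UTF): serialize s with writeUTF.
    static byte[] store(String s) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            new DataOutputStream(buffer).writeUTF(s);
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Storage (UTF) -> memory (Unicode): read the string back with readUTF.
    static String load(byte[] stored) {
        try {
            return new DataInputStream(new ByteArrayInputStream(stored)).readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] stored = store("\u4e2d\u6587");
        System.out.println(hex(stored)); // length prefix followed by the UTF bytes
        System.out.println(load(stored).equals("\u4e2d\u6587"));
    }
}
```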

Okay, the basic discussion is almost done, let’s get to the topic.

Think of this question as a black box first. Let’s look at the first-level representation of the black box:

input(charsetA)->process(Unicode)->output(charsetB)

Simple: this is an IPO model, namely input, processing, and output. The same content goes through the conversion "charsetA to Unicode to charsetB".

Let's look at the second-level representation:

SourceFile(jsp,java)->class->output

This diagram shows that the input is the JSP and Java source files, that the CLASS file serves as the carrier during processing, and that the result is then output. Refining it to the third level:

jsp->temp file->class->browser,os console,db

app,servlet->class->browser,os console,db

This diagram is clearer still. A JSP file is first translated into an intermediate Java file, from which the CLASS file is generated. Servlets and ordinary applications are compiled directly into CLASS files. The CLASS file then outputs to the browser, the console, the database, and so on.

JSP: The process from source file to class

A JSP source file is a text file ending with ".jsp". This section explains how JSP files are interpreted and compiled, and tracks how the Chinese characters change along the way.

1. The JSP conversion tool (jspc) provided by the JSP/Servlet engine searches for charset specified in <%@ page contentType ="text/html; charset=<Jsp-charset>"%> in the JSP file. If <Jsp-charset> is not specified in the JSP file, the default settings in the JVM are taken. Generally, this value is ISO8859-1;

2. jspc interprets every character in the JSP file, Chinese and ASCII alike, with the equivalent of "javac -encoding <Jsp-charset>", converts the characters to Unicode, then converts them to UTF format and saves the result as a JAVA file. Converting an ASCII character to Unicode simply means putting "00" in front of it: "A" becomes "\u0041" (no particular reason; that is how the Unicode code table is laid out). After the conversion to UTF it becomes "41" again, which is why the JAVA files generated from JSP can be viewed in an ordinary text editor;

3. The engine uses the command equivalent to "javac -encoding UNICODE" to compile the JAVA file into a CLASS file;

Let's first look at how Chinese characters are converted during these steps. Consider the following source code:

<%@ page contentType="text/html; charset=gb2312"%>
<html><body>
<%
String a="中文";
out.println(a);
%>
</body></html>

This code was written in UltraEdit for Windows. After saving, the hexadecimal code of the two characters "中文" (Chinese) is "D6 D0 CE C4" (GB2312 encoding). Looking them up in the code table, the Unicode code of "中文" is "\u4E2D\u6587", which in UTF is "E4 B8 AD E6 96 87". Opening the JAVA file that the engine generated from the JSP file shows that "中文" has indeed been replaced by "E4 B8 AD E6 96 87". Checking the CLASS file compiled from that JAVA file shows exactly the same result.

Let’s look at the situation where the CharSet specified in JSP is ISO-8859-1.

<%@ page contentType="text/html; charset=ISO-8859-1"%>
<html><body>
<%
String a="中文";
out.println(a);
%>
</body></html>

As before, the file is written in UltraEdit, and the two characters "中文" (Chinese) are again stored as the GB2312 bytes "D6 D0 CE C4". First, simulate the generation of the JAVA and CLASS files: jspc interprets the bytes as ISO-8859-1 and maps them to Unicode. Since ISO-8859-1 is an 8-bit Latin encoding, its mapping rule is to put "00" before each byte, so the resulting Unicode should be "\u00D6\u00D0\u00CE\u00C4", and after conversion to UTF it should be "C3 96 C3 90 C3 8E C3 84". Opening the files confirms this: in both the JAVA file and the CLASS file, "中文" is indeed represented as "C3 96 C3 90 C3 8E C3 84".
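The two cases just described (the same four source bytes read as GB2312 or as ISO-8859-1) can be simulated outside a JSP engine. The sketch below only imitates the decoding and UTF steps; it does not invoke jspc or javac. The class and method names are hypothetical, and the GB2312 branch runs only if the runtime provides that charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class JspTranslationDemo {
    // The two characters as saved by the editor in GB2312.
    static final byte[] SOURCE = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};

    // Unicode code units obtained by reading the source bytes as jspCharset.
    static String unicode(byte[] source, String jspCharset) {
        String s = new String(source, Charset.forName(jspCharset));
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%04X", (int) c));
        }
        return sb.toString();
    }

    // UTF bytes that would end up in the generated JAVA and CLASS files.
    static String utf(byte[] source, String jspCharset) {
        String s = new String(source, Charset.forName(jspCharset));
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("ISO-8859-1: " + unicode(SOURCE, "ISO-8859-1") + " -> " + utf(SOURCE, "ISO-8859-1"));
        if (Charset.isSupported("GB2312")) {
            System.out.println("GB2312: " + unicode(SOURCE, "GB2312") + " -> " + utf(SOURCE, "GB2312"));
        }
    }
}
```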

If <Jsp-charset> is not specified in the code above, that is, if the first line is written as <%@ page contentType="text/html" %>, jspc interprets the JSP file using the JVM's default encoding setting. On RedHat 6.2, the result is exactly the same as when ISO-8859-1 is specified.

This completes the explanation of how Chinese characters are mapped on the way from the JSP file to the CLASS file. In a word: "JspCharSet to Unicode to UTF". The following table summarizes the process:

Table 2 The conversion process of "中文" from JSP to CLASS

Jsp-CharSet	In JSP file	In JAVA file	In the CLASS file
GB2312	D6 D0 CE C4(GB2312)	From \u4E2D\u6587(Unicode) to E4 B8 AD E6 96 87 (UTF)	E4 B8 AD E6 96 87 (UTF)
ISO-8859-1	D6 D0 CE C4 (GB2312)	From \u00D6\u00D0\u00CE\u00C4 (Unicode) to C3 96 C3 90 C3 8E C3 84 (UTF)	C3 96 C3 90 C3 8E C3 84 (UTF)
None (default = JVM default setting)	Same as ISO-8859-1	Same as ISO-8859-1	Same as ISO-8859-1

The next section discusses the conversion process of servlets from JAVA files to CLASS files, and then explains how to output from CLASS files to the client. The reason for this is that JSP and Servlet are the same when outputting.

Servlet: The process from source file to class

The Servlet source file is a text file ending with ".java". This section discusses the compilation process of Servlets and tracks Chinese changes.

The Servlet source file is compiled with "javac". javac accepts the "-encoding <Compile-charset>" parameter, which means "interpret the Servlet source file using the encoding specified in <Compile-charset>".

When the source file is compiled, <Compile-charset> is used to interpret all characters, Chinese and ASCII alike. The string constants are then converted to Unicode characters, and finally Unicode is converted to UTF.

A Servlet also has a place to set the character set of the output stream. Usually, before any result is output, the setContentType method of HttpServletResponse is called to achieve the same effect as setting <Jsp-charset> in JSP. This setting is referred to here as <Servlet-charset>.

Note that there are three variables mentioned in the article: <Jsp-charset>, <Compile-charset> and <Servlet-charset>. Among them, JSP files are only related to <Jsp-charset>, while <Compile-charset> and <Servlet-charset> are only related to Servlet.

See the following example:

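import java.io.*;
import javax.servlet.*;
import javax.servlet.http.*;

public class testServlet extends HttpServlet
{
  public void doGet(HttpServletRequest req, HttpServletResponse resp)
  throws ServletException, IOException
  {
    // <Servlet-charset>: the encoding used when the response is written
    resp.setContentType("text/html; charset=GB2312");
    PrintWriter out = resp.getWriter();
    out.println("<html>");
    out.println("#中文#");
    out.println("</html>");
  }
}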

This file is also written in UltraEdit for Windows, and the two characters "中文" (Chinese) are saved as "D6 D0 CE C4" (GB2312 encoding).

Now compile. The following table shows the hexadecimal code of "中文" in the CLASS file for different values of <Compile-charset>. <Servlet-charset> plays no role during compilation; it only affects the output of the CLASS file. In fact, <Servlet-charset> and <Compile-charset> together achieve the same effect as <Jsp-charset> in a JSP file, because <Jsp-charset> affects both compilation and the output of the CLASS file.

Table 3 The conversion process of "中文" from Servlet source file to CLASS

Compile-charset	In the Servlet source file	In the Class file	Equivalent Unicode code
GB2312	D6 D0 CE C4 (GB2312)	E4 B8 AD E6 96 87 (UTF)	\u4E2D\u6587 (in Unicode = "中文")
ISO-8859-1	D6 D0 CE C4 (GB2312)	C3 96 C3 90 C3 8E C3 84 (UTF)	\u00D6 \u00D0 \u00CE \u00C4 (00 is added in front of each of D6, D0, CE, C4)
None (default)	D6 D0 CE C4 (GB2312)	Same as ISO-8859-1	Same as ISO-8859-1

The compilation process of ordinary Java programs is exactly the same as that of Servlets.

Is the representation of Chinese in the CLASS file clear now? OK, next let's see how the CLASS file outputs Chinese.

Class: Output string

As mentioned above, strings are held in memory as Unicode. What a given Unicode string represents depends on which character set it was mapped from, that is, on its ancestry. It is like checked luggage: from the outside every piece is a cardboard box, and what it contains depends on what the sender actually put in it.

Look at the example above. Take the Unicode string "00D6 00D0 00CE 00C4". Read directly against the Unicode code table without any conversion, it is four characters (special characters at that). Mapped with "ISO8859-1", simply dropping the leading "00" gives "D6 D0 CE C4", which is four characters in the ISO8859-1 code table. Mapped with GB2312, the result is likely to be garbage, because GB2312 may or may not contain characters corresponding to 00D6 and the others. If there is no correspondence, the result is 0x3f, a question mark; if there is, then since code points such as 00D6 sit very early in the Unicode table, the result is probably some special symbol. The real Chinese characters in Unicode start at 4E00.

As you have seen, the same Unicode characters can be interpreted in different ways, and of course only one of them is the result we expect. In the example above, "D6 D0 CE C4" is what we want: when "D6 D0 CE C4" is output to IE and viewed as "Simplified Chinese", the characters "中文" are displayed clearly. (Of course, if you insist on viewing it as "Western European", nothing can be done and you will get nothing meaningful.) Why? Because "00D6 00D0 00CE 00C4" was originally converted with ISO8859-1.

This leads to the following conclusions:

Before a CLASS outputs a string, the Unicode string is converted back into a byte stream according to a certain internal encoding, and the byte stream is then output. This is equivalent to performing a "String.getBytes(???)" operation, where ??? stands for a certain character set.

For a Servlet, this internal encoding is the one specified in the setContentType() method, that is, the <Servlet-charset> defined above.

For JSP, this internal encoding is the one specified in <%@ page contentType=""%>, that is, the <Jsp-charset> defined above.

For an ordinary Java program, this internal encoding is the one given by the JVM's default encoding setting, which is generally ISO8859-1.
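The equivalence between writing a string to an output stream and calling getBytes can be checked with the standard OutputStreamWriter class. The following sketch uses an in-memory sink in place of a real response stream; OutputEncodingDemo and its method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;

final class OutputEncodingDemo {
    // Bytes that reach the output when s is written through a Writer using charsetName.
    static String written(String s, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(sink, Charset.forName(charsetName))) {
            out.write(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hex(sink.toByteArray());
    }

    // The same conversion expressed directly as String.getBytes.
    static String viaGetBytes(String s, String charsetName) {
        return hex(s.getBytes(Charset.forName(charsetName)));
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String inMemory = "\u00d6\u00d0\u00ce\u00c4";
        System.out.println(written(inMemory, "ISO-8859-1"));
        System.out.println(viaGetBytes(inMemory, "ISO-8859-1"));
        // What an ordinary program uses when no charset is given depends on the JVM.
        System.out.println("JVM default charset: " + Charset.defaultCharset());
    }
}
```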

When the output object is a browser

Take the popular browser IE as an example. IE supports multiple encodings. If IE receives the byte stream "D6 D0 CE C4", you can try viewing it with various encodings, and you will find that "Simplified Chinese" gives the correct result, because "D6 D0 CE C4" is the Simplified Chinese encoding of the two characters "中文" (Chinese).

OK, let's walk through the whole process.

JSP: The source file is a text file in GB2312 format, and it contains the two Chinese characters "中文"

If <Jsp-charset> is specified as GB2312, the conversion process is as follows.

Table 4 Change process when Jsp-charset = GB2312

Serial number	Step description	result
1	Write the JSP source file and save it in GB2312 format	D6 D0 CE C4 (D6D0=中, CEC4=文)
2	jspc converts JSP source files into temporary JAVA files, maps strings to Unicode according to GB2312, and writes them into JAVA files in UTF format.	E4 B8 AD E6 96 87
3	Compile temporary JAVA file into CLASS file	E4 B8 AD E6 96 87
4	At run time, the string is first read from the CLASS file with readUTF; its Unicode encoding in memory is	4E 2D 65 87 (in Unicode, 4E2D=中, 6587=文)
5	Convert Unicode into byte stream according to Jsp-charset=GB2312	D6 D0 CE C4
6	Output the byte stream to IE and set the encoding of IE to GB2312 (Author’s note: This information is hidden in the HTTP header)	D6 D0 CE C4
7	IE views the result as "Simplified Chinese"	"中文" (correctly displayed)

If <Jsp-charset> is specified as ISO8859-1, the conversion process is as follows.

Table 5 Change process when Jsp-charset = ISO8859-1

Serial number	Step description	result
1	Write the JSP source file and save it in GB2312 format	D6 D0 CE C4 (D6D0=中, CEC4=文)
2	jspc converts JSP source files into temporary JAVA files, maps strings to Unicode according to ISO8859-1, and writes them into JAVA files in UTF format.	C3 96 C3 90 C3 8E C3 84
3	Compile temporary JAVA file into CLASS file	C3 96 C3 90 C3 8E C3 84
4	At run time, the string is first read from the CLASS file with readUTF; its Unicode encoding in memory is	00 D6 00 D0 00 CE 00 C4 (not meaningful text at all!)
5	Convert Unicode into byte stream according to Jsp-charset=ISO8859-1	D6 D0 CE C4
6	Output the byte stream to IE and set the encoding of IE to ISO8859-1 (Author’s note: This information is hidden in the HTTP header)	D6 D0 CE C4
7	IE views the result as "Western European"	Garbled text. It is actually four ISO8859-1 characters, but because their codes are greater than 128 they look strange.
8	Change the page encoding of IE to "Simplified Chinese"	"中文" (correctly displayed)

Strange! Why is "中文" displayed correctly whether <Jsp-charset> is set to GB2312 or to ISO8859-1? Because steps 2 and 5 in Table 4 and Table 5 are inverse operations that cancel each other out. With ISO8859-1, however, the extra step 8 is needed, which is inconvenient.
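The cancellation can be simulated by running the source bytes through decode, UTF storage, and encode with the same charset on both sides. This is a sketch of the data flow only (no JSP engine is involved), with RoundTripDemo, pipeline, and survives as hypothetical names. US-ASCII is included as a contrasting assumption: it cannot represent the byte 0xD6, so the two steps no longer cancel.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class RoundTripDemo {
    static final byte[] SOURCE = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};

    // Source bytes -> Unicode -> UTF in the CLASS file -> Unicode -> output bytes.
    static byte[] pipeline(byte[] source, Charset jspCharset) {
        String translated = new String(source, jspCharset);              // decode the source file
        byte[] inClass = translated.getBytes(StandardCharsets.UTF_8);    // stored as UTF
        String atRuntime = new String(inClass, StandardCharsets.UTF_8);  // read back into memory
        return atRuntime.getBytes(jspCharset);                           // encode for output
    }

    // True if the output bytes equal the source bytes, i.e. the two steps cancel.
    static boolean survives(byte[] source, String jspCharset) {
        return Arrays.equals(pipeline(source, Charset.forName(jspCharset)), source);
    }

    public static void main(String[] args) {
        System.out.println("ISO-8859-1: " + survives(SOURCE, "ISO-8859-1"));
        System.out.println("US-ASCII:   " + survives(SOURCE, "US-ASCII"));
        if (Charset.isSupported("GB2312")) {
            System.out.println("GB2312:     " + survives(SOURCE, "GB2312"));
        }
    }
}
```

The round trip only cancels when the chosen charset can decode every source byte without loss, which is why ISO8859-1 works as a carrier for GB2312 bytes.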

Let’s look at the situation when <Jsp-charset> is not specified.

Table 6 Changes when Jsp-charset is not specified

Serial number	Step description	result
1	Write the JSP source file and save it in GB2312 format	D6 D0 CE C4 (D6D0=中, CEC4=文)
2	jspc converts JSP source files into temporary JAVA files, maps strings to Unicode according to ISO8859-1, and writes them into JAVA files in UTF format.	C3 96 C3 90 C3 8E C3 84
3	Compile temporary JAVA file into CLASS file	C3 96 C3 90 C3 8E C3 84
4	When running, first read the string from the CLASS file using readUTF, and the Unicode encoding in memory is	00 D6 00 D0 00 CE 00 C4
5	Convert Unicode into byte stream according to Jsp-charset=ISO8859-1	D6 D0 CE C4
6	Output byte stream to IE	D6 D0 CE C4
7	IE views the result using the encoding the page had when the request was issued	It depends. If that encoding is Simplified Chinese, the text is displayed correctly; otherwise step 8 of Table 5 is needed

Servlet: The source file is a JAVA file in GB2312 format, and it contains the two Chinese characters "中文"

If <Compile-charset>=GB2312, <Servlet-charset>=GB2312

Table 7 Change process when Compile-charset=Servlet-charset=GB2312

Serial number	Step description	result
1	Write the Servlet source file and save it in GB2312 format	D6 D0 CE C4 (D6D0=中, CEC4=文)
2	Use javac -encoding GB2312 to compile the JAVA source file into a CLASS file	E4 B8 AD E6 96 87 (UTF)
3	When running, first read the string from the CLASS file using readUTF, and the Unicode encoding in memory is	4E 2D 65 87 (Unicode)
4	Convert Unicode into byte stream according to Servlet-charset=GB2312	D6 D0 CE C4 (GB2312)
5	Output the byte stream to IE and set the encoding attribute of IE to Servlet-charset=GB2312	D6 D0 CE C4 (GB2312)
6	IE views the result as "Simplified Chinese"	"中文" (correctly displayed)

If <Compile-charset>=ISO8859-1, <Servlet-charset>=ISO8859-1

Table 8 Changes when Compile-charset=Servlet-charset=ISO8859-1

Serial number	Step description	result
1	Write the Servlet source file and save it in GB2312 format	D6 D0 CE C4 (D6D0=中, CEC4=文)
2	Use javac -encoding ISO8859-1 to compile the JAVA source file into a CLASS file	C3 96 C3 90 C3 8E C3 84 (UTF)
3	When running, first read the string from the CLASS file using readUTF, and the Unicode encoding in memory is	00 D6 00 D0 00 CE 00 C4
4	Convert Unicode into byte stream according to Servlet-charset=ISO8859-1	D6 D0 CE C4
5	Output the byte stream to IE and set the encoding attribute of IE to Servlet-charset=ISO8859-1	D6 D0 CE C4 (GB2312)
6	IE uses "Western European Characters" to view the results	Garbled code (the same reason as Table 5)
7	Change the page encoding of IE to "Simplified Chinese"	"中文" (correctly displayed)

If Compile-charset or Servlet-charset is not specified, the default value of each is ISO8859-1.

When Compile-charset = Servlet-charset, steps 2 and 4 are inverse operations that cancel each other out, and the displayed result is correct. Readers can work through the case where Compile-charset and Servlet-charset differ; the result is certainly incorrect.
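The mismatched case can be tried without a servlet container by decoding the source bytes with one charset and encoding them with another. This is a sketch of the data flow only; CharsetMismatchDemo and output are hypothetical names, UTF-8 stands in as an arbitrary second charset, and the GB2312 case runs only where that charset is available.

```java
import java.nio.charset.Charset;

final class CharsetMismatchDemo {
    static final byte[] SOURCE = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};

    // Hex of the bytes sent to the client when the source is compiled with
    // compileCharset and the response is written with servletCharset.
    static String output(byte[] source, String compileCharset, String servletCharset) {
        String inMemory = new String(source, Charset.forName(compileCharset));
        StringBuilder hex = new StringBuilder();
        for (byte b : inMemory.getBytes(Charset.forName(servletCharset))) {
            hex.append(String.format("%02x", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // Matching charsets: the original bytes come back.
        System.out.println(output(SOURCE, "ISO-8859-1", "ISO-8859-1"));
        // Mismatched charsets: the client receives different bytes.
        System.out.println(output(SOURCE, "ISO-8859-1", "UTF-8"));
        if (Charset.isSupported("GB2312")) {
            System.out.println(output(SOURCE, "GB2312", "ISO-8859-1"));
        }
    }
}
```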

When the output object is a database

Output to a database follows the same principle as output to a browser. This section uses only a Servlet as an example; the JSP case is left for the reader to work out.

Suppose a servlet receives a Chinese string from the client (IE, Simplified Chinese), writes it to a database whose internal encoding is ISO8859-1, then reads the string back from the database and displays it to the client.

Table 9 The process of changing when the output object is a database (1)

Serial number	Step description	Result	Domain
1	Enter "中文" in IE	D6 D0 CE C4	IE
2	IE converts the string to UTF and sends it into the transport stream	E4 B8 AD E6 96 87	IE
3	The Servlet receives the input stream and reads it with readUTF	4E 2D 65 87 (Unicode)	Servlet
4	The programmer must restore the string to a byte stream in the Servlet according to GB2312	D6 D0 CE C4
5	The programmer generates a new string according to the database encoding ISO8859-1	00 D6 00 D0 00 CE 00 C4
6	Submit the newly generated string to JDBC	00 D6 00 D0 00 CE 00 C4
7	JDBC detects that the database encoding is ISO8859-1	00 D6 00 D0 00 CE 00 C4	JDBC
8	JDBC generates a byte stream according to ISO8859-1	D6 D0 CE C4
9	JDBC writes the byte stream into the database	D6 D0 CE C4
10	Data storage is complete	D6 D0 CE C4	database
The following is the process of fetching the data from the database
11	JDBC fetches the byte stream from the database	D6 D0 CE C4	JDBC
12	JDBC generates a string according to the database character set ISO8859-1 and submits it to the Servlet	00 D6 00 D0 00 CE 00 C4 (Unicode)
13	The Servlet gets the string	00 D6 00 D0 00 CE 00 C4 (Unicode)	Servlet
14	The programmer must restore the original byte stream according to the database encoding ISO8859-1	D6 D0 CE C4
15	The programmer must generate a new string according to the client character set GB2312	4E 2D 65 87 (Unicode)
The Servlet is ready to output the string to the client
16	The Servlet generates a byte stream according to Servlet-charset	D6 D0 CE C4	Servlet
17	The Servlet outputs the byte stream to IE. If <Servlet-charset> is specified, the IE encoding is also set to <Servlet-charset>	D6 D0 CE C4	Servlet
18	IE views the result according to the specified or default encoding	"中文" (correctly displayed)	IE

To explain: steps 4 and 5 and steps 14 and 15 in the table are the conversions the programmer has to write. Steps 4 and 5 are in fact a single statement: "new String(s.getBytes("GB2312"), "ISO8859-1")", where s is the string being converted. Steps 14 and 15 are likewise one statement: "new String(s.getBytes("ISO8859-1"), "GB2312")". Dear readers, were you aware of every detail involved when you wrote code like this?

The cases where the client encoding and the database encoding take other values, and the case where the output target is the system console, are left for you to think through. Once you understand the principle of the process above, you should be able to work them out easily.
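The two programmer-written conversions can be wrapped in a pair of helper methods and tested as a round trip. This is a minimal sketch that assumes the database encoding is ISO8859-1, as in the table; DbRecodeDemo, toDatabase, and fromDatabase are hypothetical names, and no JDBC connection is involved.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DbRecodeDemo {
    // Store side: take the client-encoded bytes of s and re-read them as ISO8859-1,
    // so that each char of the result carries exactly one of those bytes.
    static String toDatabase(String s, Charset client) {
        return new String(s.getBytes(client), StandardCharsets.ISO_8859_1);
    }

    // Fetch side: recover the original bytes via ISO8859-1 and decode them
    // with the client character set again.
    static String fromDatabase(String s, Charset client) {
        return new String(s.getBytes(StandardCharsets.ISO_8859_1), client);
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587";
        // Fall back to UTF-8 as the client charset if GB2312 is unavailable.
        Charset client = Charset.isSupported("GB2312")
                ? Charset.forName("GB2312")
                : StandardCharsets.UTF_8;
        String stored = toDatabase(text, client);
        System.out.println("chars handed to JDBC: " + stored.length());
        System.out.println("round trip ok: " + fromDatabase(stored, client).equals(text));
    }
}
```

The trick works because ISO8859-1 maps every byte value to exactly one character, so no byte is lost on the way into or out of the database.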

The article has come to an end, and the end point has returned to the starting point: for programmers, almost nothing changes in practice.

Because we were told to do it this way long ago.

The conclusions are given below.

1. In a JSP file, specify contentType, with the charset value equal to the character set used by the client browser. String constants in the file need no encoding conversion. String variables must be restorable to a byte stream the client can recognize according to the character set specified in contentType; put simply, "string variables are based on the <Jsp-charset> character set";

2. In a Servlet, set the charset with setContentType() so that it matches the client's encoding. For string constants, specify the encoding when compiling with javac; it must be the same as the character set of the platform on which the source file was written, generally GB2312 or GBK. String variables, as in JSP, must be "based on the <Servlet-charset> character set".