Chinese character encoding problem in JSP/Servlet

There are many excellent articles and discussions on the DBCS character encoding problem in JSP/Servlet. This article summarizes them and explains them with reference to solutions on IBM WebSphere Application Server 3.5 (WAS). I hope it is not redundant.
Contents:

The origin of the problem
GB2312-80, GBK, GB18030-2000 Chinese character sets and encodings
The origin of '?' and garbled code when transcoding in Chinese
JSP/Servlet Chinese character encoding problem and solutions in WAS
Conclusion
Reference article

1. The origin of the problem

Each country (or region) stipulates a set of character encodings for computer information exchange, such as the extended ASCII code of the United States, GB2312-80 of China, and JIS of Japan. As the basis for information processing in that country or region, each plays an important role in unifying encoding. Character encoding sets fall into two categories by length: SBCS (single-byte character set) and DBCS (double-byte character set). In early software (especially operating systems), various localized versions (L10N) appeared in order to handle local character information. To distinguish them, concepts such as LANG and Codepage were introduced. However, because the code ranges of the local character sets overlap, it is difficult to exchange information between them, and maintaining each localized version of the software independently is expensive. It is therefore necessary to extract what the localization work has in common and handle it consistently, so that the locale-specific processing is kept to a minimum. This is what is called internationalization (I18N). Language-specific information is further standardized as Locale information, and the underlying character set becomes Unicode, which contains almost all glyphs.

Nowadays, the core character processing of most internationalized software is based on Unicode. When the software runs, it determines the local character encoding from the Locale/Lang/Codepage settings in effect at that time and processes local characters accordingly. During processing, it must convert between Unicode and the local character set, or even between two different local character sets with Unicode as the intermediate form. In a network environment this approach extends further: character information at either end of the network also has to be converted into acceptable content according to the character set settings.

The Java language uses Unicode to represent characters, complying with Unicode V2.0. Whether Java programs read/write files from/to the file system in a character stream, write HTML information to a URL connection, or read parameter values from a URL connection, there will be character encoding conversion. Although doing this increases the complexity of programming and easily causes confusion, it is in line with the idea of internationalization.
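
The conversion described above can be made visible with a character stream that names its charset explicitly. This is only a minimal sketch: the class and method names are hypothetical, and in-memory byte arrays stand in for a file or a URL connection.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class StreamCharsetDemo {
    // Unicode chars -> bytes: the Writer encodes with the given charset.
    static byte[] write(String text, Charset cs) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(sink, cs)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    // Bytes -> Unicode chars: the Reader decodes with the given charset.
    static String read(byte[] data, Charset cs) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(data), cs)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587"; // two Chinese characters, written as escapes
        byte[] utf8 = write(text, StandardCharsets.UTF_8);
        // Same charset on both sides: the text survives.
        System.out.println(text.equals(read(utf8, StandardCharsets.UTF_8)));
        // Different charset on the reading side: the text is distorted.
        System.out.println(text.equals(read(utf8, StandardCharsets.ISO_8859_1)));
    }
}
```

The text survives only when both sides agree on the charset, which is the condition the rest of the article keeps returning to.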

In theory, these conversions based on character set settings should not cause many problems. In practice, because applications run in different environments, because Unicode and the local character sets keep being supplemented and revised, and because systems and applications are not always implemented to the standard, the problems that arise during transcoding keep plaguing programmers and users.

2. GB2312-80, GBK, GB18030-2000 Chinese character sets and encodings

In fact, the fix for a Chinese character encoding problem in a Java program is often very simple, but to understand the reasons behind it and to locate the problem, you also need to understand the existing Chinese character encodings and the conversions between them.

GB2312-80 was formulated in the early stage of Chinese computer information technology in China. It contains most commonly used first- and second-level Chinese characters, plus 9 zones of symbols. It is supported by almost all Chinese systems and internationalized software, and it is the most basic Chinese character set. Its encoding range is 0xa1-0xfe for the high byte and 0xa1-0xfe for the low byte; Chinese characters start at 0xb0a1 and end at 0xf7fe.

GBK is an extension of GB2312-80 and is upward compatible with it. It contains 20902 Chinese characters, and its encoding range is 0x8140-0xfefe, excluding the code positions with high byte 0x80. All of its characters can be mapped one-to-one to Unicode 2.0, which means that Java actually supports the GBK character set. GBK is currently the default character set of Windows and some other Chinese operating systems, but not all internationalized software supports it; it seems that their authors do not fully understand what GBK is. Note that GBK is not a national standard but only a specification. With the release of the GB18030-2000 national standard, it will complete its historical mission in the near future.
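
As a rough illustration of the ranges quoted above, the following sketch classifies a byte pair. The class and method names are hypothetical, and the GBK check is deliberately coarse: it only tests the 0x8140-0xfefe envelope given in the text, not the exact set of assigned code positions.

```java
final class GbRanges {
    // GB2312-80 Chinese characters: 0xb0a1..0xf7fe, low byte within 0xa1..0xfe.
    static boolean isGb2312Hanzi(int hi, int lo) {
        return hi >= 0xb0 && hi <= 0xf7 && lo >= 0xa1 && lo <= 0xfe;
    }

    // Coarse GBK envelope from the text: high byte 0x81..0xfe, low byte 0x40..0xfe.
    static boolean isInGbkRange(int hi, int lo) {
        return hi >= 0x81 && hi <= 0xfe && lo >= 0x40 && lo <= 0xfe;
    }

    public static void main(String[] args) {
        // 0xb0a1 is the first GB2312 Chinese character; 0x8140 lies only in GBK.
        System.out.println(isGb2312Hanzi(0xb0, 0xa1) + " " + isInGbkRange(0xb0, 0xa1));
        System.out.println(isGb2312Hanzi(0x81, 0x40) + " " + isInGbkRange(0x81, 0x40));
    }
}
```

A byte pair that passes the GBK check but fails the GB2312 check is exactly the kind of character that gets lost when a system assumes GB2312.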

GB18030-2000 (GBK2K) further expands the Chinese characters on the basis of GBK and adds glyphs for ethnic minority scripts such as Tibetan and Mongolian. GBK2K fundamentally solves the problem of insufficient code positions and missing glyphs. It has several characteristics:

It does not fix all the glyphs; it only specifies the encoding range and leaves room for later expansion.
The encoding is variable length. Its two-byte part is compatible with GBK; the four-byte part holds the extended glyphs and code positions, and its encoding range is 0x81-0xfe for the first byte, 0x30-0x39 for the second byte, 0x81-0xfe for the third byte, and 0x30-0x39 for the fourth byte.
Its rollout is phased; the first requirement is a complete mapping to all glyphs of the Unicode 3.0 standard.
It is a national standard and is mandatory.
No operating system or software has implemented GBK2K support yet; this is work for now and for the future.
As for an introduction to Unicode... it is skipped here.
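
The four-byte shape described in the list above can be checked mechanically. The sketch below is only a shape test under the byte ranges quoted in the text (the names are hypothetical); it does not tell whether a code position is actually assigned.

```java
final class Gb18030Shape {
    private static boolean inRange(byte b, int lo, int hi) {
        int v = b & 0xff; // treat the byte as unsigned
        return v >= lo && v <= hi;
    }

    // True if the four bytes starting at off match 0x81-0xfe, 0x30-0x39, 0x81-0xfe, 0x30-0x39.
    static boolean looksLikeFourByte(byte[] s, int off) {
        return off >= 0
                && s.length - off >= 4
                && inRange(s[off], 0x81, 0xfe)
                && inRange(s[off + 1], 0x30, 0x39)
                && inRange(s[off + 2], 0x81, 0xfe)
                && inRange(s[off + 3], 0x30, 0x39);
    }

    public static void main(String[] args) {
        byte[] fourByte = {(byte) 0x81, 0x30, (byte) 0x81, 0x30};
        byte[] twoTwoByte = {(byte) 0xb0, (byte) 0xa1, (byte) 0xb0, (byte) 0xa1};
        System.out.println(looksLikeFourByte(fourByte, 0));
        System.out.println(looksLikeFourByte(twoTwoByte, 0));
    }
}
```

Because the second byte of a four-byte sequence is an ASCII digit (0x30-0x39), it can never be mistaken for the second byte of a two-byte GBK character, which starts at 0x40.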

Among the encodings supported by Java, those relevant to Chinese programming are the following (several of them are not listed in the JDK documentation):

ASCII: 7-bit, same as ascii7
ISO8859-1: 8-bit, same as 8859_1, ISO-8859-1, ISO_8859-1, latin1...
GB2312-80: same as GB2312, GB2312-1980, EUC_CN, euccn, 1381, Cp1381, 1383, Cp1383, ISO2022CN, ISO2022CN_GB...
GBK (note the case): same as MS936
UTF8: UTF-8
GB18030 (currently supported only by IBM JDK1.3.?): same as Cp1392, 1392
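
Which of these names a particular JVM accepts, and what it treats them as, can be checked at run time with java.nio.charset.Charset. This is a small sketch with hypothetical class and method names; the set of supported charsets and aliases varies between JDK vendors and versions, so the output for the Chinese names is not guaranteed.

```java
import java.nio.charset.Charset;

final class CharsetNames {
    // Returns the canonical name the JVM uses for an alias, or null if it is unsupported.
    static String canonical(String alias) {
        return Charset.isSupported(alias) ? Charset.forName(alias).name() : null;
    }

    public static void main(String[] args) {
        String[] names = {"ascii7", "8859_1", "latin1", "UTF8", "GB2312", "GBK", "GB18030"};
        for (String n : names) {
            System.out.println(n + " -> " + canonical(n));
        }
    }
}
```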

The Java language uses Unicode to process characters. From another perspective, however, non-Unicode transcoding can also be used in Java programs; what matters is that the Chinese character information is not distorted at the entry and exit points of the program. If ISO-8859-1 is used throughout to process Chinese characters, the correct result can also be obtained. Many popular solutions on the Internet are of this type. To avoid confusion, this article does not discuss that method.

3. The origin of '?' and garbled code when transcoding in Chinese

Conversion in either direction can produce incorrect results:

Unicode-->Byte: if the target code set has no corresponding code, the result is 0x3f.
For example:
The result of "\u00d6\u00ec\u00e9\u0046\u00bb\u00f9".getBytes("GBK") is "?ìéF?ù", and the hex value is 3fa8aca8a6463fa8b4.
If you look at the result carefully, you will find that \u00ec is converted to 0xa8ac, \u00e9 to 0xa8a6, and so on: the effective length has grown! This is because some symbols in the Chinese symbol area are mapped to common symbol code points. Because these symbols also appear in ISO-8859-1 or other SBCS character sets, they are encoded at relatively early (low) positions in Unicode, and some of them have only 8 significant bits, so they overlap with the encoding of Chinese characters. (In fact, this is only a mapping between codes, and the two are not the same when displayed: the symbols in Unicode are one byte wide, while the symbols in the Chinese character set are two bytes wide.) There are 20 such symbols between \u00a0 and \u00ff in Unicode. It is very important to understand this feature! It explains why, in Java programming, an incorrectly encoded Chinese string often shows some garbled characters (which are actually symbol characters) rather than only '?' characters, as in the example above.
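
The example above can be reproduced with a few lines of Java. This is a sketch with hypothetical names; it assumes the running JVM ships the GBK charset, which is true of common full JDK builds but is not guaranteed everywhere.

```java
import java.nio.charset.Charset;

final class LossyEncode {
    // Lower-case hex dump of a byte array, two digits per byte.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Encodes s with the named charset; unmappable characters become the
    // charset's replacement byte, which is 0x3f ('?') for these charsets.
    static String encodeHex(String s, String charsetName) {
        return hex(s.getBytes(Charset.forName(charsetName)));
    }

    public static void main(String[] args) {
        // The string from the text: four of the six characters have GBK codes.
        System.out.println(encodeHex("\u00d6\u00ec\u00e9\u0046\u00bb\u00f9", "GBK"));
        // A Chinese character has no ISO-8859-1 code at all.
        System.out.println(encodeHex("\u554a", "ISO-8859-1"));
    }
}
```

Once a character has been replaced by 0x3f, the original cannot be recovered from the bytes, so this kind of loss has to be prevented rather than repaired.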

Byte-->Unicode: if the character identified by the bytes does not exist in the source code set, the result is 0xfffd.
For example:
byte[] ba = {(byte)0x81,(byte)0x40,(byte)0xb0,(byte)0xa1}; new String(ba, "GB2312");
The result is "?啊", and the hex value is "\ufffd\u554a". 0x8140 is a GBK character. Looking it up in the GB2312 conversion table yields no corresponding value, so \ufffd is used. (Please note: when this Unicode value is displayed, there is no corresponding local character, so the previous case applies as well and it is shown as a "?".)
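
The decoding direction can be tried the same way. The sketch below uses hypothetical names and assumes the JVM provides the GB2312 and GBK charsets. The exact characters a modern JDK produces for the undecodable bytes may differ from the result quoted above, so the code only looks for the replacement character rather than relying on a fixed output.

```java
import java.nio.charset.Charset;

final class LossyDecode {
    // Decodes bytes with the named charset; undecodable input becomes U+FFFD.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    // Renders each char as a backslash-u escape so the result is visible on any console.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            sb.append(String.format("\\u%04x", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] ba = {(byte) 0x81, (byte) 0x40, (byte) 0xb0, (byte) 0xa1};
        System.out.println(escape(decode(ba, "GB2312"))); // contains the replacement character
        System.out.println(escape(decode(ba, "GBK")));    // both characters decode
    }
}
```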

In actual programming, the JSP/Servlet program obtains incorrect Chinese character information, which is often the superposition of these two processes, and sometimes even the result of repeated actions after the superposition of the two processes.
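
A sketch of such a superposition, with hypothetical names and assuming the GBK charset is available: the bytes of a Chinese character are first decoded with the wrong charset (garbled but still recoverable), and the garbled string is then encoded into a charset that cannot represent it (irreversibly turned into question marks).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DoubleConversion {
    // Step 1: GBK bytes are wrongly decoded as ISO-8859-1; each byte becomes one char.
    static String misdecode(String original) {
        byte[] gbk = original.getBytes(Charset.forName("GBK"));
        return new String(gbk, StandardCharsets.ISO_8859_1);
    }

    // Step 2: the garbled chars are pushed through US-ASCII, which cannot encode them.
    static String reencodeAscii(String garbled) {
        byte[] ascii = garbled.getBytes(StandardCharsets.US_ASCII);
        return new String(ascii, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String garbled = misdecode("\u554a");       // two Latin-1 symbols
        System.out.println(garbled.length());       // 2
        System.out.println(reencodeAscii(garbled)); // ??
    }
}
```

After the first step the original bytes are still intact inside the string; after the second step they are gone, which is why the order in which the conversions happened matters when diagnosing a problem.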

4. JSP/Servlet Chinese character encoding problem and solutions in WAS

4.1 Common encoding problems
The JSP/Servlet encoding problems commonly reported on the Internet generally show up on the browser side or the application side, for example:
Why do all the Chinese characters in a JSP/Servlet page appear as '?' in the browser?
Why do all the Chinese characters in a Servlet page appear garbled in the browser?
Why do all the Chinese characters in a Java application's interface appear as squares?
The JSP/Servlet page cannot display GBK Chinese characters.
The Chinese text in Java code embedded in a JSP page inside tags such as <%...%> and <%=...%> is garbled, but the other Chinese characters on the page are correct.
The JSP/Servlet cannot receive Chinese characters submitted by a form.
The JSP/Servlet cannot read or write the correct content from or to the database.
Behind these problems lie various incorrect character conversions and processing steps (except the third, which is caused by incorrect Java font settings). To solve character encoding problems of this kind, you need to understand how a JSP/Servlet runs and check each point where a problem may occur.

4.2 Encoding issues in JSP/Servlet web programming
The JSP/Servlet running on the Java application server provides HTML content to the Browser, and the process is shown in the figure below:

Character encoding conversions take place at the following points:

a. JSP compilation. The Java application server reads the JSP source file according to the JVM's encoding value, compiles it into a Java source file, and writes that file back to the file system according to the same value. If the current system language supports GBK, there is no encoding problem at this stage. On an English system, such as Linux, AIX, or Solaris with LANG set to en_US, the JVM value should be set to GBK. If the system language is GB2312, decide whether to set it as needed; setting it to GBK avoids potential garbling of GBK characters.

b. The Java source must be compiled into .class files before it can run in the JVM, and this step has the same problems as a. From here on, servlets and JSPs run in a similar way, except that servlets are not compiled automatically. For JSP programs, the generated intermediate Java file is compiled automatically (the compiler class is called directly from the program). Therefore, if a problem occurs at this step, also check the encoding and the OS locale, or convert the static Chinese characters embedded in the JSP's Java code to Unicode escapes, or avoid placing static output text in Java code. For servlets, specify the -encoding parameter manually when compiling with javac.
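
Converting static Chinese text to Unicode escapes, as suggested above, makes the source file pure ASCII, so it compiles the same way whatever encoding the compiler assumes. The helper below is a hypothetical sketch of that conversion, not a tool referenced by this article.

```java
final class UnicodeEscaper {
    // Replaces every non-ASCII char with its backslash-u escape; ASCII is kept as-is.
    static String escapeNonAscii(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04x", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The literal below is itself written with an escape, so this file is ASCII-only.
        System.out.println(escapeNonAscii("title=\u554a"));
        // For a servlet whose source does contain GBK text, the alternative is to tell
        // the compiler, for example: javac -encoding GBK MyServlet.java
        // (MyServlet.java is a hypothetical file name.)
    }
}
```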

c. The Servlet has to convert the HTML page content into an encoding the browser accepts and send it out. Depending on how each Java application server is implemented, some query the browser's accept-charset and accept-language parameters or determine the encoding by some other guess, while others ignore the matter. Using a fixed encoding is therefore perhaps the best solution. For Chinese web pages, you can set contentType="text/html; charset=GB2312" in the JSP or Servlet; if the page contains GBK characters, set contentType="text/html; charset=GBK". Because IE and Netscape differ in their support for GBK, this setting needs to be tested.
Because the high 8 bits of a 16-bit Java char are discarded when it is transmitted over the network, and in order to ensure that the Chinese characters in the Servlet page (both embedded ones and those obtained while the servlet runs) arrive in the expected internal code, you can use a PrintWriter for output instead of a ServletOutputStream. The PrintWriter converts according to the charset specified in the contentType (the contentType must be specified before the writer is obtained!). You can also wrap the ServletOutputStream in an OutputStreamWriter and use write(String) to output Chinese strings.
For JSP, the Java application server should be able to ensure that the embedded Chinese characters are transmitted correctly at this stage.
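
The difference between writing characters straight to a byte stream and writing them through a charset-aware writer can be seen without a servlet container. In this sketch a ByteArrayOutputStream stands in for the servlet's output stream; the class and method names are hypothetical, and the GBK charset is assumed to be available.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;

final class ResponseWriterDemo {
    // Writes each char directly to the byte stream: only the low 8 bits survive.
    static byte[] writeRaw(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < text.length(); i++) {
            out.write(text.charAt(i)); // OutputStream.write(int) keeps the low-order byte
        }
        return out.toByteArray();
    }

    // Wraps the byte stream in an OutputStreamWriter so chars are encoded properly.
    static byte[] writeEncoded(String text, String charsetName) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, Charset.forName(charsetName))) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(writeRaw("\u554a").length);            // 1 byte: high byte lost
        System.out.println(writeEncoded("\u554a", "GBK").length); // 2 bytes: b0 a1
    }
}
```

The charset passed to the writer should be the same one announced in the contentType, otherwise the browser decodes the bytes with a different table than the one used to produce them.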

d. This is the URL character encoding problem. If a parameter value returned from the browser via GET/POST contains Chinese characters, the servlet does not get the correct value. In SUN's J2SDK, the browser's language settings are not taken into account when parameters are parsed; instead, the received value is parsed byte by byte. This is the most widely discussed encoding problem online. Because it is a design flaw, you can only re-parse the obtained string as binary bytes, or work around it by hacking the HttpUtils class. Reference article 2 describes this, but it is best to change the Chinese encodings GB2312 and CP1381 used there to GBK; otherwise there will still be problems when GBK characters are encountered.
Servlet API 2.3 provides a new function for specifying the encoding the application expects before a parameter such as "param_name" is retrieved, which will help to resolve this problem completely.
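
The "re-parse as bytes" workaround can be sketched as follows. It assumes that the container decoded the request bytes one byte per char, which is equivalent to ISO-8859-1, and that the browser actually sent GBK; both are assumptions about the environment, and the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ParamRepair {
    // Recovers the original bytes from a byte-per-char string and decodes them again.
    static String reparse(String garbled, String charsetName) {
        if (garbled == null) {
            return null; // a missing parameter stays missing
        }
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Simulate what the servlet receives: GBK bytes decoded one byte per char.
        byte[] sent = {(byte) 0x81, (byte) 0x40, (byte) 0xb0, (byte) 0xa1};
        String received = new String(sent, StandardCharsets.ISO_8859_1);
        String repaired = reparse(received, "GBK");
        System.out.println(repaired.length()); // 2 characters instead of 4
    }
}
```

This only works while the garbled string still holds every original byte; if a '?' substitution has already happened somewhere on the way, the re-parse cannot bring the character back.
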
4.3 Solutions in IBM WebSphere Application Server

WebSphere Application Server extends the standard Servlet API and provides better multilingual support. When it runs on a Chinese operating system, it handles Chinese characters well without any settings. The following instructions apply only to a WAS that runs on an English system or that requires GBK support.

In cases c and d above, WAS has to query the browser's language settings. By default, zh, zh-cn, and so on are mapped to the Java encoding CP1381 (note: CP1381 is just a code page equivalent to GB2312, without GBK support). Presumably this is because it cannot be confirmed whether the operating system running the browser supports GB2312 or GBK, so the smaller set is chosen. However, real application systems still need GBK characters to appear on the page, the most famous being "镕" (rong2, 0xe946, \u9555), so it is sometimes necessary to specify the encoding/charset as GBK. Of course, changing the default encoding in WAS is not as troublesome as described above. For a and b, see reference article 5: just specify -=GBK in the command-line parameters of the Application Server. For d, specify -=GBK in the command-line parameters of the Application Server. If -=GBK is specified, the charset no longer needs to be specified in case c.

One of the problems listed above concerns static text inside Java code in the tags <%...%> and <%=...%> that cannot be displayed correctly. The solution in WAS is, in addition to setting the encoding correctly, to set -=zh -=CN in the same way. This is related to the Java locale settings.

4.4 Identification issues during database reading and writing

Another place where encoding problems often occur in JSP/Servlet programming is reading and writing data in the database.

Popular relational database systems all support a database encoding: when a database is created, its character set can be specified, and the data is stored in that encoding. When an application accesses the data, an encoding conversion takes place on the way in and on the way out. For Chinese data, the database encoding should be chosen so that the integrity of the data is preserved. GB2312, GBK, UTF-8, and so on are all possible database encodings. You can also choose ISO8859-1 (8-bit), but in that case the application must split each 16-bit Chinese character or Unicode value into two 8-bit characters before writing the data, and after reading the data it must merge the two bytes again and also identify the SBCS characters among them. This fails to make use of the database encoding and only increases the complexity of programming, so ISO8859-1 is not a recommended database encoding. When programming a JSP/Servlet, you can first use the management tools provided by the database management system to check whether the Chinese data is correct.

Then you should pay attention to the encoding of the read data. What you usually get in JAVA programs is Unicode. The opposite is true when writing data.

4.5 Common techniques for locating problems

The clumsiest yet most effective way to locate a Chinese encoding problem is to print the internal code of the string after each piece of code you suspect. By printing the internal code of the string, you can find out when the Chinese characters are converted to Unicode, when Unicode is converted back to the Chinese internal code, when one Chinese character becomes two Unicode characters, when the Chinese string turns into a string of question marks, and when the high byte of each Chinese character is truncated...
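
A minimal helper for printing internal codes might look like the following sketch. The class and method names are hypothetical; the output format is just one readable choice.

```java
import java.nio.charset.Charset;

final class CodeDump {
    // Lists the UTF-16 code units of a string as backslash-u escapes, space separated.
    static String chars(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("\\u%04x", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    // Lists the bytes of a string in the named charset as two-digit hex, space separated.
    static String bytes(String s, String charsetName) {
        byte[] data = s.getBytes(Charset.forName(charsetName));
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < data.length; i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%02x", data[i] & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String suspect = "a\u554a";
        System.out.println(chars(suspect));          // one entry per char
        System.out.println(bytes(suspect, "UTF-8")); // 61 e5 95 8a
    }
}
```

In the char dump, a correctly decoded Chinese character shows up as a single value of 0x4e00 or above, a byte-per-char decoding as two values below 0x0100, a failed decoding as fffd, and a failed encoding as 003f.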

Choosing appropriate sample strings also helps to distinguish the type of problem. For example, use strings in which Chinese and English characters alternate, all-Chinese strings, all-English strings, and characters that are characteristic of GB and GBK. Generally speaking, English characters are not distorted however they are converted or processed (if you do see them distorted, try increasing the length of the runs of consecutive English letters).
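
One way to use such samples is a round-trip probe: encode the sample with a charset, decode it again, and see whether it survives. The sketch below uses hypothetical names and an assumed sample that mixes English letters with one GB2312 character (\u554a) and one GBK-only character (\u4e02); it also assumes the GB2312 and GBK charsets are available in the JVM.

```java
import java.nio.charset.Charset;

final class SampleProbe {
    // True if the sample is unchanged after an encode/decode round trip.
    static boolean survives(String sample, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        return sample.equals(new String(sample.getBytes(cs), cs));
    }

    public static void main(String[] args) {
        String mixed = "aa\u554aaa\u4e02aa"; // English + GB2312 char + GBK-only char
        String[] charsets = {"ISO-8859-1", "GB2312", "GBK", "UTF-8"};
        for (String cs : charsets) {
            System.out.println(cs + ": " + survives(mixed, cs));
        }
    }
}
```

A sample that survives GBK but not GB2312 points to a GBK-only character being pushed through a GB2312 conversion somewhere on its path.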

5. Conclusion

In fact, the Chinese encoding of JSP/Servlet is not as complicated as one might imagine. Although there is no fixed procedure for locating and solving these problems, and operating environments differ, the underlying principles are the same. Knowledge of character sets is the basis for solving character problems. However, as Chinese character sets continue to change, problems in Chinese information processing, not only in Java programming, will persist for some time.

6. Reference article
Character Problem Review
Analysis and solution of Chinese character problems in Java programming technology
GB18030
Setting language encoding in web applications: WebSphere Application Server