Background knowledge:
1. Bytes and Unicode
Java is Unicode internally, and so are class files, but many media, including files and streams, store their data as byte streams. Java therefore has to convert between bytes and chars. A char is Unicode; a byte is just a byte.
The byte/char conversion functions live in the sun.io package, where the ByteToCharConverter class acts as the dispatcher: it hands you the converter for the encoding you want to use. Its two most commonly used static functions are
    public static ByteToCharConverter getDefault();
    public static ByteToCharConverter getConverter(String encoding);
If you do not specify a converter, the system uses the current default encoding: gb2312 on a GB (Chinese) platform and 8859_1 on an English platform.
Let's take a simple example:
The GB code of "你" ("you") is 0xC4E3; its Unicode code is 0x4F60.
If you use:
    String encoding = "gb2312";
    byte b[] = {(byte) '\u00c4', (byte) '\u00E3'};
    ByteToCharConverter convertor = ByteToCharConverter.getConverter(encoding);
    char[] c = convertor.convertAll(b);
    for (int i = 0; i < c.length; i++)
    {
        System.out.println(Integer.toHexString(c[i]));
    }
the printout is 0x4F60.
But if the 8859_1 encoding is used, the printout is
0x00C4, 0x00E3
    (Example 1)
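In the other direction:
    String encoding = "gb2312";
    char c[] = {'\u4F60'};
    CharToByteConverter convertor = CharToByteConverter.getConverter(encoding);
    byte[] b = convertor.convertAll(c);
    for (int i = 0; i < b.length; i++)
    {
        // mask with 0xFF so a negative byte prints as two hex digits
        System.out.println(Integer.toHexString(b[i] & 0xFF));
    }
the printout is 0xC4, 0xE3.
    (Example 2)
If 8859_1 is used, the result is 0x3F, which is "?", meaning that the character cannot be converted.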
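The sun.io converter classes are internal JDK classes and have been removed from recent JDKs. As a rough equivalent, the sketch below performs the same two conversions with String and java.nio.charset.Charset. The class name ByteCharDemo and its helper methods are made up for illustration, and the code assumes the running JDK provides the GB2312 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ByteCharDemo {
    // Decode the bytes with the given charset and return each char as hex.
    static String decodeToHex(byte[] bytes, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (char ch : new String(bytes, cs).toCharArray()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(Integer.toHexString(ch));
        }
        return sb.toString();
    }

    // Encode the text with the given charset and return each byte as hex.
    static String encodeToHex(String text, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(cs)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(Integer.toHexString(b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Charset gb = Charset.forName("GB2312"); // assumption: charset is installed
        byte[] b = {(byte) 0xC4, (byte) 0xE3};
        System.out.println(decodeToHex(b, gb));                                  // 4f60
        System.out.println(decodeToHex(b, StandardCharsets.ISO_8859_1));        // c4 e3
        System.out.println(encodeToHex("\u4F60", gb));                           // c4 e3
        System.out.println(encodeToHex("\u4F60", StandardCharsets.ISO_8859_1)); // 3f
    }
}
```

The last line shows the "?" case: String.getBytes substitutes the charset's replacement byte when a character cannot be encoded.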
Many problems with Chinese text stem from these two simple classes. Many classes, however, do not let you pass in an encoding directly, which is a real inconvenience. Many programs rarely specify an encoding at all and simply use the default one, which makes porting much harder.

UTF-8
UTF-8 maps one-to-one onto Unicode, and its implementation is very simple:

     7-bit Unicode: 0_______
    11-bit Unicode: 110_____ 10______
    16-bit Unicode: 1110____ 10______ 10______
    21-bit Unicode: 11110___ 10______ 10______ 10______

In most cases only Unicode values of 16 bits or fewer are used.
The GB code of "你" is 0xC4E3; its Unicode code is 0x4F60. Let's reuse the example above.
Example 1: 0xC4E3 in binary:
    11000100 11100011
Since there are only two bytes, we try to read them as a two-byte sequence, but this does not work: the 7th bit of the second byte is not 0 (a continuation byte must start with 10), so "?" is returned.

Example 2: 0x4F60 in binary:
    01001111 01100000
Filling these 16 bits into the three-byte UTF-8 pattern gives:
    11100100 10111101 10100000
    E4       BD       A0
So 0xE4, 0xBD, 0xA0 is returned.

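To make the bit patterns concrete, here is a minimal sketch that fills a code point into the UTF-8 templates by hand and cross-checks the result against the JDK's own encoder. The class name Utf8Demo and its methods are illustrative only, and the sketch does not reject the surrogate range as a real encoder would.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Demo {
    // Encode one code point (up to 21 bits) by filling the UTF-8 bit templates.
    static byte[] encode(int cp) {
        if (cp < 0x80) {                       // 7 bits: 0_______
            return new byte[] {(byte) cp};
        } else if (cp < 0x800) {               // 11 bits: 110_____ 10______
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))};
        } else if (cp < 0x10000) {             // 16 bits: 1110____ 10______ 10______
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        } else {                               // 21 bits: 11110___ and three 10______
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        }
    }

    // Format bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02X ", b & 0xFF));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(hex(encode(0x4F60))); // E4 BD A0
        // Cross-check against the JDK encoder.
        System.out.println(Arrays.equals(encode(0x4F60),
                "\u4F60".getBytes(StandardCharsets.UTF_8))); // true
    }
}
```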
String and byte[]
At its core a String is a char[], but converting bytes into a String requires an encoding.
String.length() is really the length of the char array. If you decode with a different encoding, the bytes are very likely to be split in the wrong places, producing broken characters and garbled text.
Example:
    byte[] b = {(byte) '\u00c4', (byte) '\u00e3'};
    String str = new String(b, encoding);
If encoding = 8859_1, str has two characters, but with encoding = gb2312 it has only one.
This problem comes up frequently when splitting text into pages.
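A small sketch of the length difference and of the paging hazard: cutting a byte array at an arbitrary offset can split a two-byte character. StringLengthDemo is a hypothetical class name, and the code assumes the JDK provides the GB2312 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class StringLengthDemo {
    // Number of chars obtained when the bytes are decoded with the given charset.
    static int decodedLength(byte[] bytes, Charset cs) {
        return new String(bytes, cs).length();
    }

    public static void main(String[] args) {
        Charset gb = Charset.forName("GB2312"); // assumption: charset is installed
        byte[] b = {(byte) 0xC4, (byte) 0xE3};
        System.out.println(decodedLength(b, StandardCharsets.ISO_8859_1)); // 2
        System.out.println(decodedLength(b, gb));                           // 1

        // Paging hazard: the text is two characters (four bytes); a page
        // boundary after three bytes cuts the second character in half.
        byte[] two = "\u4F60\u4F60".getBytes(gb);
        String page = new String(two, 0, 3, gb);
        System.out.println(page.equals("\u4F60\u4F60")); // false
    }
}
```

Splitting on char boundaries of the decoded String, rather than on byte offsets, avoids the problem.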
Reader, Writer / InputStream, OutputStream
Reader and Writer work on chars at their core; InputStream and OutputStream work on bytes.
The main purpose of Reader and Writer, however, is to read chars from an InputStream and write chars to an OutputStream.
A Reader example:
The file contains only the one character "你", i.e. the bytes 0xC4, 0xE3.
    String encoding = ...;  // gb2312 or 8859_1
    InputStreamReader reader = new InputStreamReader(
        new FileInputStream(""), encoding);
    char[] c = new char[10];
    int length = reader.read(c);
    for (int i = 0; i < length; i++)
        System.out.println(c[i]);
If encoding is gb2312, only one character is read; if encoding = 8859_1, two characters are read.

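The Reader example can be tried without a file by feeding the two bytes from memory. This is only a sketch: ReaderDemo and countChars are invented names, a ByteArrayInputStream stands in for the FileInputStream, and the GB2312 charset is assumed to be available.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ReaderDemo {
    // Read every char from the bytes through an InputStreamReader and count them.
    static int countChars(byte[] data, Charset cs) {
        try (InputStreamReader reader =
                 new InputStreamReader(new ByteArrayInputStream(data), cs)) {
            char[] c = new char[10];
            int total = 0;
            int n;
            while ((n = reader.read(c)) != -1) {
                total += n;
            }
            return total;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] file = {(byte) 0xC4, (byte) 0xE3}; // contents of the one-character file
        System.out.println(countChars(file, Charset.forName("GB2312")));   // 1
        System.out.println(countChars(file, StandardCharsets.ISO_8859_1)); // 2
    }
}
```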
2. We need to have some understanding of the Java compiler:
    javac -encoding
We often leave out the -encoding parameter, yet it is very important for cross-platform work.
If -encoding is not specified, the compiler uses the system's default encoding: gb2312 on a GB platform and ISO8859_1 on an English platform.
The Java compiler is really just a class that gets called to compile files. That class has a compile function containing an encoding variable, and the value of the -encoding parameter is passed directly to that variable.
The compiler reads the .java file according to this variable and then compiles it into a class file in UTF-8 form.
An example:
    public void test() throws IOException
    {
        String str = "你";
        FileWriter write = new FileWriter("");
        write.write(str);
        write.close();
    }
    (Example 3)
If you compile with gb2312, you will find the bytes E4 BD A0 in the class file.

If you compile with 8859_1, the two source bytes are read as the chars 00C4 00E3, in binary:
    00000000 11000100 00000000 11100011
Because each character needs more than 7 bits, it is encoded with the 11-bit pattern:
    11000011 10000100 11000011 10100011
    C3       84       C3       A3
You will find C3 84 C3 A3 in the class file.
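The effect of the source encoding can be imitated without running the compiler: decode the two source bytes with the chosen encoding, then encode the resulting chars as UTF-8. This is a sketch of the idea only, not of javac itself. CompileEncodingDemo and storedBytes are hypothetical names, and plain UTF-8 is used in place of the class file's modified UTF-8, which gives the same bytes for these characters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CompileEncodingDemo {
    // Decode the literal's source bytes with the source encoding, then show
    // the UTF-8 bytes that would be stored for the resulting chars.
    static String storedBytes(byte[] source, Charset sourceEncoding) {
        String literal = new String(source, sourceEncoding);
        StringBuilder sb = new StringBuilder();
        for (byte b : literal.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        byte[] source = {(byte) 0xC4, (byte) 0xE3}; // the literal as saved in gb2312
        System.out.println(storedBytes(source, Charset.forName("GB2312")));   // E4 BD A0
        System.out.println(storedBytes(source, StandardCharsets.ISO_8859_1)); // C3 84 C3 A3
    }
}
```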
But we often ignore this parameter, which regularly causes cross-platform problems:
    Example 3 compiled on the Chinese platform produces ZhClass.
    Example 3 compiled on the English platform produces EnClass.
    1. ZhClass runs correctly on the Chinese platform, but not on the English platform.
    2. EnClass runs correctly on the English platform, but not on the Chinese platform.
Reason:
    1. After compiling on the Chinese platform, the char[] of str at run time is 0x4F60.
       On the Chinese platform the default encoding of FileWriter is gb2312, so CharToByteConverter automatically converts str with the gb2312 converter into bytes for the FileOutputStream, and 0xC4, 0xE3 are written to the file.
       On the English platform the default for CharToByteConverter is 8859_1, so FileWriter automatically converts str with 8859_1, which cannot represent the character, and "?" is written.
    2. After compiling on the English platform, the char[] of str at run time is 0x00C4 0x00E3.
       On the Chinese platform these chars are not recognized as Chinese, so "??" is written.
       On the English platform 0x00C4 --> 0xC4 and 0x00E3 --> 0xE3, so 0xC4, 0xE3 are written to the file.

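The four combinations can be reproduced on one machine by writing through an OutputStreamWriter with an explicit charset, which here stands in for FileWriter's platform default. This is a sketch under that assumption: WriterPlatformDemo and written are invented names, and the GB2312 charset is assumed to be available.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class WriterPlatformDemo {
    // Write str with the given "platform" charset and return the bytes as hex.
    static String written(String str, Charset platformDefault) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, platformDefault)) {
            w.write(str);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : out.toByteArray()) sb.append(String.format("%02X ", b & 0xFF));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        Charset zh = Charset.forName("GB2312");   // Chinese platform default
        Charset en = StandardCharsets.ISO_8859_1; // English platform default
        String zhClassStr = "\u4F60";             // literal as compiled with gb2312
        String enClassStr = "\u00C4\u00E3";       // literal as compiled with 8859_1
        System.out.println(written(zhClassStr, zh)); // C4 E3
        System.out.println(written(zhClassStr, en)); // 3F
        System.out.println(written(enClassStr, zh)); // 3F 3F
        System.out.println(written(enClassStr, en)); // C4 E3
    }
}
```

Passing the charset explicitly, as this sketch does, is also the way to make the output independent of the platform.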
1. Interpretation of the JSP body:
    Tomcat first checks whether your page contains a "<%@page" directive. If it does, Tomcat sets setContentType(..) at the same place and reads the file according to the given encoding; if not, it reads the file as 8859_1. It then writes a .java file in UTF-8, reads that file back (in UTF-8, of course), and compiles it into a class file.
    setContentType changes a property of out; the default encoding of the out variable is 8859_1.
2. Interpretation of parameters:
    Unfortunately, parameters are only ever interpreted as ISO8859_1; this can be seen in the servlet implementation code.
3. Interpretation of include:
    include is meant to support an encoding, but unfortunately the person who wrote "" forgot to add one parameter, encoding, so this method is not supported. You can compile the source code yourself and add support for encoding.
Summary:
If you are on NT, the easiest way is to fool Java by not adding any encoding settings at all:
<html>
Hello <%=request.getParameter("value")%>
</html>
http://localhost/test/?value=你
Result: Hello 你
However, this approach has serious limitations: for tasks such as splitting an uploaded article into pages it is fatal. The best solution is the following:
<%@ page contentType="text/html;charset=gb2312" %>
<html>
Hello <%=new String(request.getParameter("value").getBytes("8859_1"),"gb2312")%>
</html>
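The getBytes("8859_1") trick works because 8859_1 maps every byte to exactly one char and back, so the original bytes can be recovered and decoded again. A minimal sketch outside a servlet container follows. ParameterFixDemo and fix are hypothetical names, the raw bytes stand in for what the browser sends, and the gb2312 charset is assumed to be available.

```java
import java.nio.charset.Charset;

class ParameterFixDemo {
    // Undo a wrong 8859_1 decoding: recover the raw bytes, then decode as gb2312.
    static String fix(String misdecoded) {
        byte[] raw = misdecoded.getBytes(Charset.forName("8859_1"));
        return new String(raw, Charset.forName("gb2312"));
    }

    public static void main(String[] args) {
        byte[] sent = {(byte) 0xC4, (byte) 0xE3};            // bytes sent by the browser
        String asDecodedByContainer =
                new String(sent, Charset.forName("8859_1")); // two chars: 00C4 00E3
        String repaired = fix(asDecodedByContainer);
        System.out.println(asDecodedByContainer.length());   // 2
        System.out.println(repaired.equals("\u4F60"));       // true
    }
}
```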
A must-read article, but the solution is not satisfactory
--------------------------------------------------------------------------------
1. Passing parameters between pages with the GET method is not recommended; the user can configure whether the browser sends URLs as UTF-8.
2. I recommend not using the charset=gb2312 page directive in JSP. Chinese can be displayed correctly either way, with or without it, but I find adding it less convenient: without it you at least do not have to write the conversion code. I think the following configuration makes Chinese display correctly:
a. Compile all JavaBeans with iso8859-1.
b. Do not write the charset=gb2312 statement above in the JSP file (writing it is what goes wrong).
With Tomcat, paying attention to the two points above is enough.
c. Set the operating system language on the server to English (a Linux without the BluePoint Chinese system installed is usually in English anyway).
That is all you need to do.
If I have got anything wrong, please point it out...
Re: A must-read article, but the solution is not satisfactory
--------------------------------------------------------------------------------
Tomcat decodes parameters with 8859_1 for both GET and POST. This can be seen in the source code of Tomcat's servlet implementation:
a) For the POST method
parsePostData method (for POST form data):
    String postedBody = new String(postedBytes, 0, len, "8859_1");
There is no problem at this step, because at this point the Chinese characters are still in their encoded form. However, the parseName function does not combine the bytes of a Chinese character; it simply appends them one at a time, so it is effectively applying the 8859_1 encoding rules:
    sb.append((char) Integer.parseInt(s.substring(i+1, i+3), 16));
    i += 2;

b) For the GET method
    line = new String(buf, 0, count, ...);
where the last argument is a constant whose value is 8859_1.
This code is not easy to trace, so do not be misled by appearances. HttpRequestAdapter is derived from RequestImpl. However, the server on port 8080 does not use RequestImpl directly; it uses HttpRequestAdapter to obtain the queryString.
As for whether to add the encoding or not, I stand by my opinion: if you want to solve the problem of paginating uploaded files, you have to do the encoding conversion. Moreover, explicit encoding keeps the data consistent as it is passed between beans.
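The byte-at-a-time behaviour described for parseName can be sketched as follows and compared with a decoder that knows the charset. This is an illustration, not Tomcat's code: PercentDecodeDemo, naiveDecode, and charsetDecode are made-up names, and java.net.URLDecoder is used only as a charset-aware reference.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class PercentDecodeDemo {
    // Turn every %XX escape into one char, which amounts to 8859_1 decoding.
    static String naiveDecode(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (ch == '%' && i + 3 <= s.length()) {
                sb.append((char) Integer.parseInt(s.substring(i + 1, i + 3), 16));
                i += 2; // skip the two hex digits just consumed
            } else if (ch == '+') {
                sb.append(' ');
            } else {
                sb.append(ch);
            }
        }
        return sb.toString();
    }

    // Charset-aware decoding: escapes are collected as bytes, then decoded together.
    static String charsetDecode(String s, String encoding) {
        try {
            return URLDecoder.decode(s, encoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String query = "%C4%E3";
        System.out.println(naiveDecode(query).equals("\u00C4\u00E3"));       // true
        System.out.println(charsetDecode(query, "GB2312").equals("\u4F60")); // true
    }
}
```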
It seems I need to explain myself here
--------------------------------------------------------------------------------
Tomcat is only a standard implementation of JSP 1.1 and Servlet 2.2. We should not expect this free software to be thorough in every detail or in performance. It mainly has English users in mind, which is also why passing Chinese characters in a URL causes problems. Most of our browsers have an option in their advanced settings to send URLs as UTF-8. You may call this a bug in Tomcat if you like. In addition, Tomcat seems to compile JSPs as Iso8859 regardless of the language of the current operating system, which seems somewhat inappropriate; but whatever the case, implementations of new standards and popular software always consider English first in their language support.
Why do I say it is better not to add it?
1. As I said, software from English-speaking countries always considers English first. The Java virtual machine specification requires a virtual machine to implement three encodings: iso8859, unicode, and UTF-8. Others are not required. The virtual machine in the JDK we use is like this, let alone embedded ones. In other words, other encodings are probably not supported directly by the Java virtual machine, and our Chinese is naturally not among them; external packages are needed to support the conversion, as in Sun's JDK. Iso8859 is the fastest: no extra calls or conversions are needed, and there is no I/O to load a package.
2. At the very least you write less code and there are no extra operations; nobody dislikes a simple style.
3. The JSP pages I write are quite international. I wrote chat room software with JSP + JavaBeans (no servlets; JSP is really good). With the same program, Americans get the English interface in their browsers and Chinese users get the Chinese interface. Adding charset=gb2312 would at the very least make that more troublesome.
4. gb2312 is limiting. What if the user wants to use GBK? Not adding it is better: whatever the character set, as long as the browser is set to it, the page displays correctly.
Summary: in terms of speed, development efficiency, and scalability, my solution is better than yours. Besides, I cannot find anything your solution does better than mine.