Updated on 2025-02-28

Encoding problems often encountered in HTML and JavaScript (page 1/2)

Here I will briefly cover the encoding problems that come up in day-to-day front-end HTML and JavaScript work.
Computers store information as binary numbers. The conversion between the symbols we see on screen (English letters, Chinese characters, and so on) and the binary numbers used for storage is what we call encoding.

Two basic concepts need explaining first: charset and character encoding.

charset (character set): a table that maps symbols to numbers. It decides, for example, that 107 is the letter 'k' and 21475 is the Chinese character '口' (mouth). Different tables define different mappings: ASCII, gb2312, Unicode, and so on. Through this table of numbers and characters, we can turn a number stored in binary into a specific character.
character encoding (encoding method): given that '口' is the number 21475, should it be written as the escape \u53e3, or as the bytes %E5%8F%A3? That is what the character encoding decides.
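The two concepts can be seen directly from a browser console or Node.js. This is a minimal sketch: `charCodeAt` reveals the charset side (character to number), while the escape form and `encodeURIComponent` show two different encodings of the same number.

```javascript
const ch = '口';

// charset side: the character <-> number mapping
const codePoint = ch.charCodeAt(0);      // 21475 (0x53E3)

// encoding #1: the JavaScript escape form of that number
const escaped = '\u53e3';                // the same character '口'

// encoding #2: UTF-8 percent-encoding, as used in URLs
const percent = encodeURIComponent(ch);  // "%E5%8F%A3"

console.log(codePoint, escaped, percent);
```

Note that 21475 in hex is 0x53E3, which is exactly the escape above, and that the three percent-encoded bytes are its UTF-8 form.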

Characters like 'abc' are what Americans use every day, so they drew up a character set for them: ASCII, in full the American Standard Code for Information Interchange. The 128 numbers 0-127 (2^7, 0x00-0x7F) represent the 128 commonly used characters such as 123abc. That takes only 7 bits; the eighth bit that completes the byte is the sign bit in signed arithmetic, used with complement notation to represent negative numbers. Americans were a little stingy back then. Had they designed the unit as 16 or 32 bits from the start, the world would have been spared many problems, but at the time they probably thought 8 bits, enough for 128 different characters, was plenty!
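The ASCII range is easy to inspect from JavaScript, since Unicode kept ASCII as its first 128 code points. A small sketch:

```javascript
// Every ASCII character fits in 7 bits, so its code stays below 0x80
// and the high bit of the byte is always 0.
const samples = ['1', '2', '3', 'a', 'b', 'c', 'k'];
for (const ch of samples) {
  const code = ch.charCodeAt(0);
  console.log(ch, code, code.toString(2).padStart(8, '0'));
}
// 'a' is 97 (0x61), 'k' is 107 (0x6B); all samples stay below 128.
```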

Because computers were made by Americans, they saved themselves trouble by encoding only the symbols they themselves use, which made things very comfortable for them. But when computers began to go international, problems arose. Take China as an example: there are tens of thousands of Chinese characters. What do you do with one byte?

The existing 8-bit byte is the foundation of the system; it cannot be thrown away or changed to 16 bits, because that would break too much. The only other path is to use several ASCII bytes together to represent one character, which is MBCS (Multi-Byte Character System).
With MBCS we can represent many more characters. For example, two ASCII bytes give 16 bits, in theory 2^16 = 65,536 characters. But how are these codes assigned to characters? For example, the Unicode number for '口' (the character in 口碑, "word of mouth") is 21475; who decided that? The character set, i.e. the charset introduced above. ASCII is the most basic character set; on top of it there are character sets such as gb2312 and big5, MBCS schemes targeting Simplified and Traditional Chinese respectively. Finally, an organization called the Unicode Consortium set out to create a character set covering all characters (UCS, Universal Character Set) together with the corresponding encoding methods, namely Unicode. Since 1991 it has published the Unicode International Standard (ISBN 0-321-18578-1), and the International Organization for Standardization has taken part in this work with ISO/IEC 10646, the Universal Character Set. In short, Unicode is a character standard that covers essentially every symbol in use on earth, and it is used more and more widely. The ECMA standard also stipulates that JavaScript uses Unicode for its characters internally (which means JavaScript variable names, function names, and so on may be written in Chinese!).

Developers in China more often run into conversions between gbk, gb2312, and utf-8. Strictly speaking, that phrasing is not quite accurate: gbk and gb2312 are character sets (charsets), while utf-8 is an encoding method (character encoding), namely one way of encoding the UCS character set of the Unicode standard. Because web pages that use the Unicode character set are mostly encoded as UTF-8, people tend to lump the two together, which is imprecise.

With Unicode we have a master key, at least until human civilization meets aliens, so use it. The most widely used Unicode encoding is UTF-8 (8-bit UCS/Unicode Transformation Format), which has several particularly good properties:

It encodes the UCS character set, so it is universal worldwide.
It is a variable-length encoding, compatible with ASCII.
The second point is a great advantage: systems that used pure ASCII remain compatible without any extra storage (had a fixed-length scheme of 2 bytes per character been chosen instead, the space taken by ASCII text would double).