1 Overview
Go source code is UTF-8, and Go strings conventionally hold UTF-8 encoded text. UTF-8 is one of the encodings of Unicode. This article covers the relationship between UTF-8 and Unicode, and the use of Go's unicode and unicode/utf8 packages.
2 The relationship between UTF-8 and Unicode
Unicode is a character set intended to cover the characters of every writing system. It is maintained by the Unicode Consortium and kept in sync with ISO/IEC 10646, the Universal Coded Character Set (UCS) published by ISO. Unicode assigns each character a unique number called a code point. For example, the code point of 康 is 24247, or 5EB7 in hexadecimal (written U+5EB7).
Unicode only defines the mapping between characters and code points; it does not define how a code point is stored as bytes. Storage raises two problems. If each code point is stored in as many bytes as its value needs, the computer cannot tell how many bytes the next character occupies. If every code point is stored in a fixed number of bytes, space is wasted, because ASCII characters need only one byte.
UTF-8 is an encoding scheme that solves this problem; it is one of the encoding forms of Unicode. It is a variable-length encoding that uses 1 to 4 bytes per code point, depending on the code point's value. UTF-8 has two encoding rules:
- For single-byte characters, the first bit of the byte is 0 and the remaining 7 bits hold the code point. For ASCII characters, the UTF-8 encoding is therefore identical to ASCII.
- For n-byte characters (n from 2 to 4), the first n bits of the first byte are 1 and bit n + 1 is 0; the first two bits of each following byte are 10. All remaining bits hold the code point.
The following table summarizes the encoding rules:
Unicode code point range (hex) | UTF-8 byte sequence (binary)
-------------------------------+------------------------------------
0000 0000-0000 007F            | 0xxxxxxx
0000 0080-0000 07FF            | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF            | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF            | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
-------------------------------+------------------------------------
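The encoding rules can be sketched by hand and compared with the standard library. This is only an illustration: encodeUTF8 is a hypothetical helper, not part of any package, and it assumes its argument is a valid code point (it does not reject surrogates or out-of-range values).

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodeUTF8 is a hypothetical helper that applies the encoding rules by hand.
// It assumes r is a valid Unicode code point and performs no validation.
func encodeUTF8(r rune) []byte {
	switch {
	case r <= 0x7F: // 1 byte: 0xxxxxxx
		return []byte{byte(r)}
	case r <= 0x7FF: // 2 bytes: 110xxxxx 10xxxxxx
		return []byte{
			0xC0 | byte(r>>6),
			0x80 | byte(r&0x3F),
		}
	case r <= 0xFFFF: // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
		return []byte{
			0xE0 | byte(r>>12),
			0x80 | byte((r>>6)&0x3F),
			0x80 | byte(r&0x3F),
		}
	default: // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
		return []byte{
			0xF0 | byte(r>>18),
			0x80 | byte((r>>12)&0x3F),
			0x80 | byte((r>>6)&0x3F),
			0x80 | byte(r&0x3F),
		}
	}
}

func main() {
	for _, r := range []rune{'k', 'é', '康', '😀'} {
		manual := encodeUTF8(r)
		buf := make([]byte, utf8.UTFMax)
		n := utf8.EncodeRune(buf, r) // standard library result for comparison
		fmt.Printf("%c U+%04X manual=%v stdlib=%v\n", r, r, manual, buf[:n])
	}
}
```

For 康 (U+5EB7) both approaches yield the three bytes 229, 186, 183.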
In Go, Unicode and UTF-8 support is provided by the unicode and unicode/utf8 packages. The sections below summarize their APIs.
3 The unicode package
The unicode package provides functions for testing and converting Unicode code points, listed below:
Is(rangeTab *RangeTable, r rune) bool
Reports whether rune r is in the range table rangeTab.
rangeTab is a set of Unicode code points, usually one of the tables defined in the unicode package.
Check whether a character belongs to the Han script:
unicode.Is(unicode.Scripts["Han"], 'k') // returns false
unicode.Is(unicode.Scripts["Han"], '康') // returns true
In(r rune, ranges ...*RangeTable) bool
Reports whether rune r is a member of at least one of the given range tables.
Each range table is a set of Unicode code points, usually one of the tables defined in the unicode package.
unicode.In('康', unicode.Scripts["Han"], unicode.Scripts["Latin"]) // returns true
unicode.In('k', unicode.Scripts["Han"], unicode.Scripts["Latin"]) // returns true
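A minimal sketch of how Is and In might be used on whole strings. The helper names isHanOrLatin and countHan are hypothetical; unicode.Han and unicode.Latin are the package-level tables that unicode.Scripts["Han"] and unicode.Scripts["Latin"] refer to.

```go
package main

import (
	"fmt"
	"unicode"
)

// isHanOrLatin is a hypothetical helper: true when r belongs to
// the Han script or the Latin script.
func isHanOrLatin(r rune) bool {
	return unicode.In(r, unicode.Han, unicode.Latin)
}

// countHan is a hypothetical helper that counts the Han characters in s.
func countHan(s string) int {
	n := 0
	for _, r := range s { // range over a string yields runes, not bytes
		if unicode.Is(unicode.Scripts["Han"], r) {
			n++
		}
	}
	return n
}

func main() {
	fmt.Println(isHanOrLatin('康'), isHanOrLatin('k'), isHanOrLatin('9'))
	fmt.Println(countHan("小韩说课 Go"))
}
```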
IsOneOf(ranges []*RangeTable, r rune) bool
Reports whether rune r is a member of one of the range tables in ranges. It behaves like In; In is the preferred form.
IsSpace(r rune) bool
Reports whether rune r is a whitespace character. In the Latin-1 range, the whitespace characters are:
'\t', '\n', '\v', '\f', '\r', ' ', U+0085 (NEL), U+00A0 (NBSP)
Other whitespace characters are defined by category Z and the property Pattern_White_Space.
IsDigit(r rune) bool
Reports whether rune r is a decimal digit.
unicode.IsDigit('9') // returns true
unicode.IsDigit('k') // returns false
IsNumber(r rune) bool
Reports whether rune r is a Unicode number character (category N). This is broader than IsDigit: it also covers characters such as Roman numerals and fractions.
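The difference between IsDigit and IsNumber is easiest to see on a few sample runes. The sketch below uses a hypothetical helper, describeNumeric, and writes the non-ASCII runes as escapes so that the code points are unambiguous.

```go
package main

import (
	"fmt"
	"unicode"
)

// describeNumeric is a hypothetical helper that separates decimal digits
// from other Unicode number characters.
func describeNumeric(r rune) string {
	switch {
	case unicode.IsDigit(r):
		return "decimal digit"
	case unicode.IsNumber(r):
		return "number, not a decimal digit"
	default:
		return "not numeric"
	}
}

func main() {
	samples := []rune{
		'9',      // ASCII digit
		'\u0663', // ARABIC-INDIC DIGIT THREE
		'\u2167', // ROMAN NUMERAL EIGHT
		'\u00BD', // VULGAR FRACTION ONE HALF
		'k',
	}
	for _, r := range samples {
		fmt.Printf("U+%04X: %s\n", r, describeNumeric(r))
	}
}
```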
IsLetter(r rune) bool
Reports whether rune r is a letter.
unicode.IsLetter('9') // returns false
unicode.IsLetter('k') // returns true
IsGraphic(r rune) bool
Reports whether rune r is a Unicode graphic character. Graphic characters include letters, marks, numbers, symbols, punctuation, and spaces.
unicode.IsGraphic('9') // returns true
unicode.IsGraphic(',') // returns true
IsControl(r rune) bool
Reports whether rune r is a Unicode control character.
IsMark(r rune) bool
Reports whether rune r is a mark character (category M), such as a combining accent.
IsPrint(r rune) bool
Reports whether rune r is a printable character. The definition is the same as for graphic characters, except that the only space character counted as printable is the ASCII space U+0020.
IsPunct(r rune) bool
Reports whether rune r is a Unicode punctuation character.
unicode.IsPunct('9') // returns false
unicode.IsPunct(',') // returns true
IsSymbol(r rune) bool
Reports whether rune r is a Unicode symbol character.
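The classification functions above can be combined into a small classifier. This is a sketch only: classify is a hypothetical helper, and the order of the cases is a design choice (for example, '\n' is both a control character and whitespace, and is reported here as a space).

```go
package main

import (
	"fmt"
	"unicode"
)

// classify is a hypothetical helper that returns a coarse label for r.
// The first matching case wins.
func classify(r rune) string {
	switch {
	case unicode.IsLetter(r):
		return "letter"
	case unicode.IsDigit(r):
		return "digit"
	case unicode.IsSpace(r):
		return "space"
	case unicode.IsPunct(r):
		return "punct"
	case unicode.IsSymbol(r):
		return "symbol"
	case unicode.IsControl(r):
		return "control"
	default:
		return "other"
	}
}

func main() {
	for _, r := range "k9 ,+康\n" {
		fmt.Printf("%q: %s\n", r, classify(r))
	}
}
```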
IsLower(r rune) bool
Reports whether rune r is a lowercase letter.
unicode.IsLower('h') // returns true
unicode.IsLower('H') // returns false
IsUpper(r rune) bool
Reports whether rune r is an uppercase letter.
unicode.IsUpper('h') // returns false
unicode.IsUpper('H') // returns true
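IsTitle(r rune) bool
Reports whether rune r is a title-case letter (Unicode category Lt). Only a small number of characters belong to this category, for example ᾏ, ᾟ and ᾯ. For most letters the title-case mapping is simply the uppercase letter, which belongs to category Lu, so IsTitle returns false for ordinary capitals such as 'H'.
unicode.IsTitle('ᾯ') // returns true
unicode.IsTitle('h') // returns false
unicode.IsTitle('H') // returns false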
To(_case int, r rune) rune
Converts rune r to the specified case. _case must be one of unicode.UpperCase, unicode.LowerCase, or unicode.TitleCase.
unicode.To(unicode.UpperCase, 'h') // returns 'H'
ToLower(r rune) rune
Converts rune r to lowercase.
unicode.ToLower('H') // returns 'h'
func (special SpecialCase) ToLower(r rune) rune
Converts rune r to lowercase, giving priority to the mappings in the SpecialCase table.
A SpecialCase is a case-mapping table for a specific language. It is mainly needed for a few languages whose rules differ from the default mapping, such as Turkish (unicode.TurkishCase).
unicode.TurkishCase.ToLower('İ') // returns 'i'
ToUpper(r rune) rune
Converts rune r to uppercase.
unicode.ToUpper('h') // returns 'H'
func (special SpecialCase) ToUpper(r rune) rune
Converts rune r to uppercase, giving priority to the mappings in the SpecialCase table.
A SpecialCase is a case-mapping table for a specific language. It is mainly needed for a few languages whose rules differ from the default mapping, such as Turkish (unicode.TurkishCase).
unicode.TurkishCase.ToUpper('i') // returns 'İ'
ToTitle(r rune) rune
Converts rune r to title case.
unicode.ToTitle('h') // returns 'H'
func (special SpecialCase) ToTitle(r rune) rune
Converts rune r to title case, giving priority to the mappings in the SpecialCase table.
A SpecialCase is a case-mapping table for a specific language. It is mainly needed for a few languages whose rules differ from the default mapping, such as Turkish (unicode.TurkishCase).
unicode.TurkishCase.ToTitle('i') // returns 'İ'
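To show how the default and language-specific mappings differ on a whole string, the following sketch applies them with strings.Map. The helpers upperTurkish and lowerTurkish are hypothetical names introduced only for this example.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// upperTurkish is a hypothetical helper that uppercases s with Turkish rules.
// Runes without a special mapping fall back to the default mapping.
func upperTurkish(s string) string {
	return strings.Map(unicode.TurkishCase.ToUpper, s)
}

// lowerTurkish is a hypothetical helper that lowercases s with Turkish rules.
func lowerTurkish(s string) string {
	return strings.Map(unicode.TurkishCase.ToLower, s)
}

func main() {
	fmt.Println(strings.Map(unicode.ToUpper, "istanbul")) // default mapping: ISTANBUL
	fmt.Println(upperTurkish("istanbul"))                 // Turkish mapping: İSTANBUL
	fmt.Println(lowerTurkish("ISTANBUL"))                 // Turkish mapping: ıstanbul
}
```

In Turkish, 'i' and 'I' are not a case pair: 'i' pairs with 'İ' (U+0130) and 'I' pairs with 'ı' (U+0131), which is why the results differ from the default mapping.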
SimpleFold(r rune) rune
Iterates over the code points that are equivalent to rune r under Unicode simple case folding, that is, the different written forms of the same character. It returns the smallest equivalent rune greater than r if one exists; otherwise it wraps around and returns the smallest equivalent rune. If r has no equivalents, it returns r itself.
unicode.SimpleFold('H') // returns 'h'
unicode.SimpleFold('Φ') // returns 'φ'
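Because SimpleFold cycles through the equivalent runes, calling it repeatedly until it returns to the starting rune enumerates the whole set. A minimal sketch, with foldOrbit as a hypothetical helper:

```go
package main

import (
	"fmt"
	"unicode"
)

// foldOrbit is a hypothetical helper that collects every rune equivalent
// to r under simple case folding, starting with r itself.
func foldOrbit(r rune) []rune {
	orbit := []rune{r}
	for next := unicode.SimpleFold(r); next != r; next = unicode.SimpleFold(next) {
		orbit = append(orbit, next)
	}
	return orbit
}

func main() {
	// 'K' folds to 'k', then to the Kelvin sign U+212A, then back to 'K'.
	fmt.Printf("%q\n", foldOrbit('K'))
	// A rune with no case variants is its own orbit.
	fmt.Printf("%q\n", foldOrbit('1'))
}
```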
4 The unicode/utf8 package
DecodeLastRune(p []byte) (r rune, size int)
Decodes the last UTF-8 encoded rune in p and returns the rune and its width in bytes. If p is empty it returns (utf8.RuneError, 0); if the encoding is invalid it returns (utf8.RuneError, 1).
utf8.DecodeLastRune([]byte("小韩说课")) // returns 35838, 3
// 35838 is the Unicode code point of 课
DecodeLastRuneInString(s string) (r rune, size int)
Like DecodeLastRune, but takes a string.
utf8.DecodeLastRuneInString("小韩说课") // returns 35838, 3
// 35838 is the Unicode code point of 课
DecodeRune(p []byte) (r rune, size int)
Decodes the first UTF-8 encoded rune in p and returns the rune and its width in bytes. If p is empty it returns (utf8.RuneError, 0); if the encoding is invalid it returns (utf8.RuneError, 1).
utf8.DecodeRune([]byte("小韩说课")) // returns 23567, 3
// 23567 is the Unicode code point of 小
DecodeRuneInString(s string) (r rune, size int)
Like DecodeRune, but takes a string.
utf8.DecodeRuneInString("小韩说课") // returns 23567, 3
// 23567 is the Unicode code point of 小
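The decoding functions are typically used in a loop that consumes one rune at a time. The sketch below walks a string from the front and from the back; runesForward and reverseRunes are hypothetical helper names.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// runesForward is a hypothetical helper that decodes s rune by rune
// from the front.
func runesForward(s string) []rune {
	var out []rune
	for len(s) > 0 {
		r, size := utf8.DecodeRuneInString(s)
		out = append(out, r)
		s = s[size:] // size is at least 1 for non-empty input, so the loop ends
	}
	return out
}

// reverseRunes is a hypothetical helper that reverses s rune by rune
// by decoding from the back.
func reverseRunes(s string) string {
	var b strings.Builder
	for len(s) > 0 {
		r, size := utf8.DecodeLastRuneInString(s)
		b.WriteRune(r)
		s = s[:len(s)-size]
	}
	return b.String()
}

func main() {
	fmt.Println(runesForward("小韩说课")) // [23567 38889 35828 35838]
	fmt.Println(reverseRunes("小韩说课")) // 课说韩小
}
```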
EncodeRune(p []byte, r rune) int
Writes the UTF-8 encoding of rune r into p and returns the number of bytes written. p must be large enough to hold the encoding.
buf := make([]byte, 3)
n := utf8.EncodeRune(buf, '康')
fmt.Println(buf, n) // prints [229 186 183] 3
FullRune(p []byte) bool
Reports whether p begins with a complete UTF-8 encoding of a rune.
buf := []byte{229, 186, 183} // 康
utf8.FullRune(buf) // returns true
utf8.FullRune(buf[:2]) // returns false
FullRuneInString(s string) bool
Like FullRune, but takes a string.
buf := "康"
utf8.FullRuneInString(buf) // returns true
utf8.FullRuneInString(buf[:2]) // returns false
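One place FullRune is useful is when data arrives in chunks and a chunk may end in the middle of a character. The following sketch assumes such a scenario; splitComplete is a hypothetical helper that separates the fully received runes from the trailing partial bytes.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// splitComplete is a hypothetical helper. It returns the prefix of p that
// consists of fully received runes and the trailing bytes that are still
// waiting for more input.
func splitComplete(p []byte) (complete, pending []byte) {
	i := 0
	for i < len(p) && utf8.FullRune(p[i:]) {
		_, size := utf8.DecodeRune(p[i:])
		i += size
	}
	return p[:i], p[i:]
}

func main() {
	data := []byte("小韩说课")
	chunk := data[:8] // two whole runes plus 2 of the 3 bytes of the third
	complete, pending := splitComplete(chunk)
	fmt.Printf("complete=%q pending=%v\n", complete, pending)
}
```

Note that FullRune also returns true for an invalid byte, because such a byte decodes as a one-byte error rune; validity has to be checked separately.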
RuneCount(p []byte) int
Returns the number of runes in p. Invalid or truncated encodings are counted as single runes of width 1 byte.
buf := []byte("小韩说课")
len(buf) // returns 12
utf8.RuneCount(buf) // returns 4
RuneCountInString(s string) (n int)
Like RuneCount, but takes a string.
buf := "小韩说课"
len(buf) // returns 12
utf8.RuneCountInString(buf) // returns 4
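The distinction between byte length and rune count matters whenever a limit is expressed in characters. A small sketch, with lengths as a hypothetical helper:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// lengths is a hypothetical helper that returns the byte length and the
// rune count of s.
func lengths(s string) (bytes, runes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	for _, s := range []string{"hello", "小韩说课", "\xff\xfe"} {
		b, r := lengths(s)
		fmt.Printf("%q: %d bytes, %d runes\n", s, b, r)
	}
}
```

The last sample contains two invalid bytes; each is counted as one rune.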
RuneLen(r rune) int
Returns the number of bytes required to encode rune r in UTF-8, or -1 if r is not a valid value to encode.
utf8.RuneLen('康') // returns 3
utf8.RuneLen('H') // returns 1
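RuneLen pairs naturally with EncodeRune when a buffer of exactly the right size is wanted. A minimal sketch, with encodeExact as a hypothetical helper that returns nil for runes that cannot be encoded:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodeExact is a hypothetical helper that encodes r into a buffer of
// exactly the required size. It returns nil if r cannot be encoded.
func encodeExact(r rune) []byte {
	n := utf8.RuneLen(r)
	if n < 0 {
		return nil // surrogate half or out-of-range value
	}
	buf := make([]byte, n)
	utf8.EncodeRune(buf, r)
	return buf
}

func main() {
	fmt.Println(encodeExact('康'))    // [229 186 183]
	fmt.Println(encodeExact('H'))    // [72]
	fmt.Println(encodeExact(0xD800)) // [] (surrogate half, cannot be encoded)
}
```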
RuneStart(b byte) bool
Reports whether byte b could be the first byte of an encoded rune, that is, whether it is not a continuation byte.
buf := "小韩说课"
utf8.RuneStart(buf[0]) // returns true
utf8.RuneStart(buf[1]) // returns false
utf8.RuneStart(buf[3]) // returns true
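RuneStart makes it possible to cut a string at a byte limit without splitting a character. The sketch below assumes the input is valid UTF-8; truncateBytes is a hypothetical helper.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateBytes is a hypothetical helper that returns the longest prefix of s
// that is at most max bytes long and does not end in the middle of a rune.
// It assumes s is valid UTF-8.
func truncateBytes(s string, max int) string {
	if max < 0 {
		max = 0
	}
	if len(s) <= max {
		return s
	}
	// Step back until the cut position is the first byte of a rune.
	for max > 0 && !utf8.RuneStart(s[max]) {
		max--
	}
	return s[:max]
}

func main() {
	s := "小韩说课" // 3 bytes per character
	fmt.Println(truncateBytes(s, 4))  // 小
	fmt.Println(truncateBytes(s, 6))  // 小韩
	fmt.Println(truncateBytes(s, 20)) // 小韩说课
}
```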
Valid(p []byte) bool
Reports whether p consists entirely of valid UTF-8 encoded runes.
valid := []byte("小韩说课")
invalid := []byte{0xff, 0xfe, 0xfd}
utf8.Valid(valid) // returns true
utf8.Valid(invalid) // returns false
ValidRune(r rune) bool
Reports whether rune r can be legally encoded as UTF-8. Code points that are out of range or that are surrogate halves are illegal.
valid := 'a'
invalid := rune(0xfffffff)
fmt.Println(utf8.ValidRune(valid)) // prints true
fmt.Println(utf8.ValidRune(invalid)) // prints false
ValidString(s string) bool
Reports whether string s consists entirely of valid UTF-8 encoded runes.
valid := "小韩说课"
invalid := string([]byte{0xff, 0xfe, 0xfd})
fmt.Println(utf8.ValidString(valid)) // prints true
fmt.Println(utf8.ValidString(invalid)) // prints false
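A common use of the validation functions is to check untrusted input before processing it. The sketch below is one possible approach, not the only one: sanitize is a hypothetical helper that leaves valid strings untouched and otherwise uses strings.ToValidUTF8 to replace each run of invalid bytes with U+FFFD.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// sanitize is a hypothetical helper. Valid input is returned unchanged;
// otherwise each run of invalid bytes is replaced with U+FFFD.
func sanitize(s string) string {
	if utf8.ValidString(s) {
		return s
	}
	return strings.ToValidUTF8(s, "\uFFFD")
}

func main() {
	fmt.Printf("%q\n", sanitize("小韩说课"))
	fmt.Printf("%q\n", sanitize("a\xff\xfeb"))
}
```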