Handling Multi-Byte Characters in Go: A Detailed Guide

1 Overview

Go source code is UTF-8, and Go strings conventionally hold UTF-8 encoded text. UTF-8 is one of the encodings defined for Unicode. This article covers the relationship between UTF-8 and Unicode, and the use of the unicode and unicode/utf8 packages provided by Go.


2 The relationship between UTF-8 and Unicode

Unicode is a character set designed to cover the letters and symbols of every culture on the planet. It is maintained by the Unicode Consortium and kept synchronized with ISO/IEC 10646, the Universal Coded Character Set (UCS). Unicode assigns each character a unique number called its code point. For example, the code point of 康 is 24247, or 5EB7 in hexadecimal (written U+5EB7).

The Unicode character set only defines the correspondence between characters and code points; it does not define how a code point is encoded (stored), which leads to several problems. For example, because code points differ in magnitude, characters need different amounts of storage, and without an agreed rule a computer cannot tell how many bytes the next character occupies. Storing every code point at a fixed width avoids that ambiguity but wastes space, because ASCII characters need only one byte.

UTF-8 is an encoding rule that solves the problem of how to store Unicode code points; it is one of the implementations of Unicode. It is a variable-length encoding that uses 1 to 4 bytes per character, depending on the code point. UTF-8 has two encoding rules:

For single-byte symbols, the first bit of the byte is 0 and the remaining 7 bits hold the Unicode code point of the symbol. For ASCII characters, therefore, the UTF-8 encoding is identical to the ASCII code.
For n-byte symbols (n from 2 to 4), the first n bits of the first byte are set to 1 and bit n + 1 is set to 0; every following byte begins with the two bits 10. All remaining bits hold the Unicode code point of the symbol.

The encoding rules are summarized below:

Unicode code point range (hex) | UTF-8 byte sequence (binary)
---------------------------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
---------------------------------------------------------
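As a rough illustration of the table above, the bit layout can be applied by hand and compared with the standard library. This is only a sketch: encodeUTF8 is a hypothetical helper introduced here, and it assumes its input is a valid code point (it does no surrogate or range checks).

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodeUTF8 is a hypothetical helper that applies the bit layout from the
// table by hand. It assumes r is a valid code point (no validation is done).
func encodeUTF8(r rune) []byte {
	switch {
	case r <= 0x7F: // 0xxxxxxx
		return []byte{byte(r)}
	case r <= 0x7FF: // 110xxxxx 10xxxxxx
		return []byte{0xC0 | byte(r>>6), 0x80 | byte(r)&0x3F}
	case r <= 0xFFFF: // 1110xxxx 10xxxxxx 10xxxxxx
		return []byte{0xE0 | byte(r>>12), 0x80 | byte(r>>6)&0x3F, 0x80 | byte(r)&0x3F}
	default: // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
		return []byte{0xF0 | byte(r>>18), 0x80 | byte(r>>12)&0x3F,
			0x80 | byte(r>>6)&0x3F, 0x80 | byte(r)&0x3F}
	}
}

func main() {
	r := '康' // U+5EB7, falls in the three-byte row of the table
	manual := encodeUTF8(r)

	std := make([]byte, utf8.UTFMax)
	n := utf8.EncodeRune(std, r)

	// Both encodings should agree byte for byte.
	fmt.Printf("U+%04X manual=% x std=% x\n", r, manual, std[:n])
	// prints: U+5EB7 manual=e5 ba b7 std=e5 ba b7
}
```

In real code utf8.EncodeRune is the better choice, since it also handles invalid runes.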

In Go, Unicode and UTF-8 support is implemented by the unicode and unicode/utf8 packages. The sections below summarize and describe their APIs.

3 The unicode package

Go provides the unicode package for Unicode-related operations. Its functions are listed below:

Is(rangeTab *RangeTable, r rune) bool

Reports whether rune r is within the range specified by rangeTab.

rangeTab is a set of Unicode code points, usually one of the tables defined in the unicode package.

Determine whether a character belongs to the Han (Chinese character) script:

unicode.Is(unicode.Scripts["Han"], 'k')
// returns false
unicode.Is(unicode.Scripts["Han"], '康')
// returns true

In(r rune, ranges ...*RangeTable) bool

Reports whether rune r is within any of the ranges given by one or more range tables.

Each range table is a set of Unicode code points, usually one of the tables defined in the unicode package.

unicode.In('康', unicode.Scripts["Han"], unicode.Scripts["Latin"])
// returns true
unicode.In('k', unicode.Scripts["Han"], unicode.Scripts["Latin"])
// returns true

IsOneOf(ranges []*RangeTable, r rune) bool

Reports whether rune r is within any of the range tables in ranges. It behaves like In; In has the nicer signature and is the recommended choice.

IsSpace(r rune) bool

Reports whether rune r is a white space character. In the Latin-1 range, the white space characters are:

'\t', '\n', '\v', '\f', '\r', ' ', U+0085 (NEL), U+00A0 (NBSP)

Other white space characters are defined by category Z and the property Pattern_White_Space.
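As an illustration, IsSpace can serve as the separator test when splitting text that mixes ASCII and non-ASCII spaces. The helper name splitWords is made up for this sketch; note that strings.Fields already splits on the same definition of white space.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// splitWords is a hypothetical helper that splits s around any run of
// Unicode white space, as defined by unicode.IsSpace.
func splitWords(s string) []string {
	return strings.FieldsFunc(s, unicode.IsSpace)
}

func main() {
	// Separators here: NBSP (U+00A0), ideographic space (U+3000), and a tab.
	words := splitWords("go\u00a0语言\u3000unicode\tutf8")
	fmt.Println(len(words), words) // prints 4 [go 语言 unicode utf8]
}
```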

IsDigit(r rune) bool

Reports whether rune r is a decimal digit.

unicode.IsDigit('9')
// returns true
unicode.IsDigit('k')
// returns false

IsNumber(r rune) bool

Reports whether rune r is a Unicode number character (category N). This set is wider than the decimal digits accepted by IsDigit.
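The difference between IsDigit and IsNumber is easiest to see on a few sample runes. The sketch below uses a hypothetical classify helper; the sample characters are chosen for illustration only.

```go
package main

import (
	"fmt"
	"unicode"
)

// classify is a hypothetical helper that labels r as a decimal digit,
// as some other kind of number character, or as neither.
func classify(r rune) string {
	switch {
	case unicode.IsDigit(r):
		return "digit"
	case unicode.IsNumber(r):
		return "number"
	default:
		return "other"
	}
}

func main() {
	// '9', ARABIC-INDIC DIGIT THREE, VULGAR FRACTION ONE HALF,
	// ROMAN NUMERAL EIGHT, and a plain letter.
	for _, r := range []rune{'9', '\u0663', '\u00bd', '\u2167', 'k'} {
		fmt.Printf("%U %c: %s\n", r, r, classify(r))
	}
}
```

Every decimal digit is also a number character, which is why the IsDigit case is checked first.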

IsLetter(r rune) bool

Reports whether rune r is a letter (category L).

unicode.IsLetter('9')
// returns false
unicode.IsLetter('k')
// returns true

IsGraphic(r rune) bool

Reports whether rune r is a Unicode graphic character. Graphic characters include letters, marks, numbers, symbols, punctuation, and spaces (categories L, M, N, P, S, and Zs).

unicode.IsGraphic('9')
// returns true
unicode.IsGraphic(',')
// returns true

IsControl(r rune) bool

Reports whether rune r is a Unicode control character.

IsMark(r rune) bool

Reports whether rune r is a mark character (category M).

IsPrint(r rune) bool

Reports whether rune r is a printable character. The definition is almost the same as for graphic characters, except that the only space character counted as printable is the ASCII space, U+0020.
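To make the difference between IsGraphic and IsPrint concrete, the sketch below checks a few space-like runes. The helper graphicButNotPrintable is a name invented for this example.

```go
package main

import (
	"fmt"
	"unicode"
)

// graphicButNotPrintable is a hypothetical helper: it reports whether r
// is accepted by IsGraphic but rejected by IsPrint.
func graphicButNotPrintable(r rune) bool {
	return unicode.IsGraphic(r) && !unicode.IsPrint(r)
}

func main() {
	// A letter, ASCII space, NBSP, ideographic space, and a tab.
	for _, r := range []rune{'a', ' ', '\u00a0', '\u3000', '\t'} {
		fmt.Printf("%U graphic=%t print=%t\n",
			r, unicode.IsGraphic(r), unicode.IsPrint(r))
	}
}
```

Non-ASCII spaces such as U+00A0 and U+3000 are graphic but not printable, while the tab is a control character and is neither.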

IsPunct(r rune) bool

Reports whether rune r is a Unicode punctuation character (category P).

unicode.IsPunct('9')
// returns false
unicode.IsPunct(',')
// returns true

IsSymbol(r rune) bool

Reports whether rune r is a Unicode symbol character (category S).

IsLower(r rune) bool

Reports whether rune r is a lowercase letter.

unicode.IsLower('h')
// returns true
unicode.IsLower('H')
// returns false

IsUpper(r rune) bool

Reports whether rune r is an uppercase letter.

unicode.IsUpper('h')
// returns false
unicode.IsUpper('H')
// returns true

IsTitle(r rune) bool

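Reports whether rune r is a title case letter (category Lt). For most letters the title case form is simply the uppercase form, which belongs to the uppercase category, so IsTitle reports true only for a small set of special characters such as ᾏ, ᾟ, and ᾯ.

unicode.IsTitle('ᾯ')
// returns true
unicode.IsTitle('h')
// returns false
unicode.IsTitle('H')
// returns false: 'H' is an uppercase letter, not a title case letter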

To(_case int, r rune) rune

Converts rune r to the specified case. _case can be unicode.UpperCase, unicode.LowerCase, or unicode.TitleCase.

unicode.To(unicode.UpperCase, 'h')
// returns 'H'

ToLower(r rune) rune

Converts rune r to lower case.

unicode.ToLower('H')
// returns 'h'

func (special SpecialCase) ToLower(r rune) rune

Converts rune r to lower case. Mappings in the SpecialCase table take priority over the standard mappings.

SpecialCase is a case mapping table for a specific locale. It is mainly needed for languages whose case rules differ from the default mappings, such as Turkish (unicode.TurkishCase).

unicode.TurkishCase.ToLower('İ')
// returns 'i'

ToUpper(r rune) rune

Converts rune r to upper case.

unicode.ToUpper('h')
// returns 'H'

func (special SpecialCase) ToUpper(r rune) rune

Converts rune r to upper case. Mappings in the SpecialCase table take priority over the standard mappings.

SpecialCase is a case mapping table for a specific locale. It is mainly needed for languages whose case rules differ from the default mappings, such as Turkish (unicode.TurkishCase).

unicode.TurkishCase.ToUpper('i')
// returns 'İ'

ToTitle(r rune) rune

Converts rune r to title case.

unicode.ToTitle('h')
// returns 'H'

func (special SpecialCase) ToTitle(r rune) rune

Converts rune r to title case. Mappings in the SpecialCase table take priority over the standard mappings.

SpecialCase is a case mapping table for a specific locale. It is mainly needed for languages whose case rules differ from the default mappings, such as Turkish (unicode.TurkishCase).

unicode.TurkishCase.ToTitle('i')
// returns 'İ'
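The SpecialCase methods work on single runes, so converting a whole string needs a loop or strings.Map. The following is a minimal sketch under that assumption; upperTurkish and lowerTurkish are hypothetical helper names.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// upperTurkish is a hypothetical helper that upper-cases s using the
// Turkish mappings, falling back to the standard ones for other runes.
func upperTurkish(s string) string {
	return strings.Map(unicode.TurkishCase.ToUpper, s)
}

// lowerTurkish is the lower-case counterpart of upperTurkish.
func lowerTurkish(s string) string {
	return strings.Map(unicode.TurkishCase.ToLower, s)
}

func main() {
	fmt.Println(strings.ToUpper("istanbul")) // standard mapping: ISTANBUL
	fmt.Println(upperTurkish("istanbul"))    // Turkish mapping: İSTANBUL
	fmt.Println(lowerTurkish("ISTANBUL"))    // Turkish mapping: ıstanbul
}
```

The strings package also offers ToUpperSpecial and ToLowerSpecial, which take a SpecialCase value directly.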

SimpleFold(r rune) rune

Iterates over the Unicode code points that are equivalent to rune r under simple case folding, that is, the different written forms of the same character. Among those code points, SimpleFold returns the smallest one greater than r; if there is none, it wraps around to the smallest one overall.

unicode.SimpleFold('H')
// returns 'h'
unicode.SimpleFold('Φ')
// returns 'φ'
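Because SimpleFold cycles through the equivalent code points, calling it repeatedly until it returns to the starting rune collects every case variant. The sketch below does that with a hypothetical helper named foldOrbit.

```go
package main

import (
	"fmt"
	"unicode"
)

// foldOrbit is a hypothetical helper: it collects r and every rune that
// SimpleFold reaches from r before cycling back to r.
func foldOrbit(r rune) []rune {
	orbit := []rune{r}
	for next := unicode.SimpleFold(r); next != r; next = unicode.SimpleFold(next) {
		orbit = append(orbit, next)
	}
	return orbit
}

func main() {
	// 'K' folds to 'k', then to the Kelvin sign U+212A, then back to 'K'.
	fmt.Printf("%U\n", foldOrbit('K'))
	// A rune without case variants folds to itself.
	fmt.Printf("%U\n", foldOrbit('1'))
}
```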

4 The unicode/utf8 package

DecodeLastRune(p []byte) (r rune, size int)

Decodes the last UTF-8 sequence in []byte p and returns the rune and its length in bytes.

utf8.DecodeLastRune([]byte("小韩说课"))
// returns 35838 3
// 35838 is the Unicode code point of 课

DecodeLastRuneInString(s string) (r rune, size int)

Decodes the last UTF-8 sequence in string s and returns the rune and its length in bytes.

utf8.DecodeLastRuneInString("小韩说课")
// returns 35838 3
// 35838 is the Unicode code point of 课

DecodeRune(p []byte) (r rune, size int)

Decodes the first UTF-8 sequence in []byte p and returns the rune and its length in bytes.

utf8.DecodeRune([]byte("小韩说课"))
// returns 23567 3
// 23567 is the Unicode code point of 小

DecodeRuneInString(s string) (r rune, size int)

Decodes the first UTF-8 sequence in string s and returns the rune and its length in bytes.

utf8.DecodeRuneInString("小韩说课")
// returns 23567 3
// 23567 is the Unicode code point of 小
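The decode functions are typically called in a loop, advancing by the returned size each time. The sketch below walks a string forward with DecodeRuneInString and reverses it with DecodeLastRuneInString; the helper name reverse and the sample string 小韩说课 are choices made for this illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// reverse is a hypothetical helper that reverses s rune by rune,
// peeling one encoded rune off the end on each iteration.
func reverse(s string) string {
	out := make([]byte, 0, len(s))
	for len(s) > 0 {
		_, size := utf8.DecodeLastRuneInString(s)
		out = append(out, s[len(s)-size:]...) // copy the rune's bytes as they are
		s = s[:len(s)-size]
	}
	return string(out)
}

func main() {
	// Forward walk: decode the first rune, then skip past its bytes.
	for s := "小韩说课"; len(s) > 0; {
		r, size := utf8.DecodeRuneInString(s)
		fmt.Printf("%c code point=%d size=%d\n", r, r, size)
		s = s[size:]
	}

	fmt.Println(reverse("小韩说课")) // prints 课说韩小
}
```

A plain for ... range loop over a string performs the same forward decoding implicitly.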

EncodeRune(p []byte, r rune) int

Writes the UTF-8 encoding of rune r into []byte p and returns the number of bytes written. p must be long enough to hold the encoding (at most utf8.UTFMax, which is 4 bytes).

buf := make([]byte, 3)
n := utf8.EncodeRune(buf, '康')
fmt.Println(buf, n)
// Output: [229 186 183] 3

FullRune(p []byte) bool

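Reports whether []byte p begins with a full UTF-8 encoding of a rune. An invalid encoding also counts as full, because it decodes as an error rune of width 1.

buf := []byte{229, 186, 183} // 康
utf8.FullRune(buf)
// returns true
utf8.FullRune(buf[:2])
// returns false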

FullRuneInString(s string) bool

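Reports whether string s begins with a full UTF-8 encoding of a rune. An invalid encoding also counts as full, because it decodes as an error rune of width 1.

buf := "康"
utf8.FullRuneInString(buf)
// returns true
utf8.FullRuneInString(buf[:2])
// returns false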
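FullRune is mostly useful when bytes arrive in chunks and a multi-byte character may be cut in half. The following sketch assumes such a chunked input; splitComplete is a hypothetical helper that separates the fully received runes from the trailing partial one.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// splitComplete is a hypothetical helper: it returns the longest prefix
// of buf made of fully received runes, plus the leftover bytes of a rune
// whose remaining bytes have not arrived yet.
func splitComplete(buf []byte) (complete, pending []byte) {
	i := 0
	for i < len(buf) && utf8.FullRune(buf[i:]) {
		_, size := utf8.DecodeRune(buf[i:])
		i += size
	}
	return buf[:i], buf[i:]
}

func main() {
	data := []byte("小韩说课")
	// Simulate a chunk that ends two bytes into the third character.
	complete, pending := splitComplete(data[:8])
	fmt.Printf("%q %d\n", complete, len(pending)) // prints "小韩" 2
}
```

The pending bytes would be kept and prepended to the next chunk before decoding continues.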

RuneCount(p []byte) int

Returns the number of runes in []byte p. Compare this with len, which returns the number of bytes.

buf := []byte("小韩说课")
len(buf)
// returns 12
utf8.RuneCount(buf)
// returns 4

RuneCountInString(s string) (n int)

Returns the number of runes in string s. Compare this with len, which returns the number of bytes.

buf := "小韩说课"
len(buf)
// returns 12
utf8.RuneCountInString(buf)
// returns 4
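The gap between byte length and rune count matters when truncating text: slicing at a byte offset can cut a character in half. A minimal sketch, assuming a non-negative limit, with truncateRunes as a hypothetical helper:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateRunes is a hypothetical helper that keeps at most max runes of
// s, cutting only at rune boundaries. It assumes max is not negative.
func truncateRunes(s string, max int) string {
	if utf8.RuneCountInString(s) <= max {
		return s
	}
	n := 0
	for i := range s { // i is the byte offset where each rune starts
		if n == max {
			return s[:i]
		}
		n++
	}
	return s
}

func main() {
	s := "小韩说课"
	fmt.Println(len(s), utf8.RuneCountInString(s)) // prints 12 4
	fmt.Println(truncateRunes(s, 2))               // prints 小韩
}
```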

RuneLen(r rune) int

Returns the number of bytes needed to encode rune r, or -1 if r is not a valid value to encode in UTF-8.

utf8.RuneLen('康')
// returns 3
utf8.RuneLen('H')
// returns 1

RuneStart(b byte) bool

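Reports whether byte b could be the first byte of an encoded rune. Second and subsequent bytes of an encoding always start with the bits 10, so they are never rune starts.

buf := "小韩说课"
utf8.RuneStart(buf[0])
// returns true
utf8.RuneStart(buf[1])
// returns false
utf8.RuneStart(buf[3])
// returns true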
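One use for RuneStart is to move an arbitrary byte offset back to the nearest rune boundary before slicing. The helper runeBoundary below is hypothetical and assumes the string holds valid UTF-8.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeBoundary is a hypothetical helper: starting from byte offset i, it
// steps backwards until it reaches the first byte of a rune. Offsets at
// 0 or at len(s) are already boundaries and are returned unchanged.
func runeBoundary(s string, i int) int {
	for i > 0 && i < len(s) && !utf8.RuneStart(s[i]) {
		i--
	}
	return i
}

func main() {
	s := "小韩说课"
	// Offset 4 is inside the second character, so the boundary is 3.
	fmt.Println(runeBoundary(s, 4))     // prints 3
	fmt.Println(s[:runeBoundary(s, 4)]) // prints 小
}
```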

Valid(p []byte) bool

Reports whether []byte p consists entirely of valid UTF-8 encoded runes.

valid := []byte("小韩说课")
invalid := []byte{0xff, 0xfe, 0xfd}
utf8.Valid(valid)
// returns true
utf8.Valid(invalid)
// returns false

ValidRune(r rune) bool

Reports whether rune r can be legally encoded as UTF-8. Code points that are out of range or that are surrogate halves are illegal.

valid := 'a'
invalid := rune(0xfffffff)
fmt.Println(utf8.ValidRune(valid))
// prints true
fmt.Println(utf8.ValidRune(invalid))
// prints false

ValidString(s string) bool

Reports whether string s consists entirely of valid UTF-8 encoded runes.

valid := "小韩说课"
invalid := string([]byte{0xff, 0xfe, 0xfd})
fmt.Println(utf8.ValidString(valid))
// prints true
fmt.Println(utf8.ValidString(invalid))
// prints false

