The use and difference between rune and byte in Golang

In Go, `rune` and `byte` are both types used to represent individual characters, but they have some key differences.

byte type

`byte` is an alias for `uint8`, an 8-bit unsigned integer. It represents a single byte, with a range of 0 to 255.

`byte` is used to represent the individual bytes of UTF-8 encoded data, and is suitable for handling byte streams and ASCII characters.

The number of bytes occupied by characters:

ASCII characters (0-127) take up 1 byte.
Basic Latin letters, digits, and common punctuation marks all fall in the ASCII range, so they take up 1 byte.
Non-ASCII characters take up 2 to 4 bytes; most Chinese characters take up 3 bytes.
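
The byte counts above can be checked with the standard `unicode/utf8` package. The following is a minimal sketch; `encodedLen` is a hypothetical helper name introduced only for this illustration, and the sample characters are arbitrary choices.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodedLen reports how many bytes UTF-8 needs to encode the rune r.
// (Hypothetical helper name, used only for this illustration.)
func encodedLen(r rune) int {
	return utf8.RuneLen(r)
}

func main() {
	// One sample character for each encoded width from 1 to 4 bytes.
	for _, r := range []rune{'A', 'é', '你', '😀'} {
		fmt.Printf("%c (%U): %d byte(s)\n", r, r, encodedLen(r))
	}
}
```

The four samples need 1, 2, 3, and 4 bytes respectively, which covers every width that UTF-8 uses.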

`byte` representation: the UTF-8 encoding of the string "你" in Go is 0xE4, 0xBD, 0xA0 (hexadecimal).

s := "你"
// Index the string directly to read one byte at a time.
for i := 0; i < len(s); i++ {
    fmt.Printf("byte at index %d: %d\n", i, s[i])
}

Output:

byte at index 0: 228
byte at index 1: 189
byte at index 2: 160

rune type

`rune` is an alias for `int32`, a 32-bit signed integer, and is used to represent a Unicode code point. Any character in Go, ASCII or not, can be represented as a `rune`; valid code points range from 0 to 0x10FFFF.

`rune` holds the code point of a Unicode character, which makes it suitable for character-level operations, especially those involving non-ASCII characters (such as Chinese or emoji).

`rune` representation:

s := "你"
// range decodes the string one character (rune) at a time.
for _, c := range s {
    fmt.Printf("rune: %c, rune value: %d\n", c, c)
}

Output:

rune: 你, rune value: 20320

This means the Unicode code point of "你" (20320, i.e. 0x4F60) is stored as a `rune`.
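
Because a `rune` is just an integer, it can be compared and formatted like one. The sketch below assumes nothing beyond the standard `fmt` package; `codePoint` is a hypothetical helper name used for illustration.

```go
package main

import "fmt"

// codePoint formats r in the conventional U+XXXX notation.
// (Hypothetical helper name, used only for this illustration.)
func codePoint(r rune) string {
	return fmt.Sprintf("%U", r)
}

func main() {
	r := '你' // a rune literal; r has type rune (int32)

	fmt.Println(r == 0x4F60)  // true
	fmt.Println(codePoint(r)) // U+4F60
	fmt.Printf("%T\n", r)     // int32
}
```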

UTF-8 and Unicode relationship

Unicode is a character set, and UTF-8 is one of the ways to encode it. Unicode defines a code point for every character, but it does not specify how characters are stored or transmitted. UTF-8 fills that gap for cross-platform and cross-language compatibility: it defines how to convert Unicode code points into byte sequences. Besides UTF-8, there are also UTF-16 and UTF-32.
How they relate:
- Unicode assigns a code point (a number) to each character.
- UTF-8 encodes these code points as byte sequences of varying length, so that they can be stored in files, transmitted over the network, displayed on screen, and so on.
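
The mapping between a code point and its UTF-8 bytes can be sketched with `utf8.EncodeRune` and `utf8.DecodeRune` from the standard library. The helper names `encode` and `decodeFirst` are assumptions made for this example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encode returns the UTF-8 byte sequence for a single code point.
func encode(r rune) []byte {
	buf := make([]byte, utf8.UTFMax) // UTFMax is 4, the longest possible encoding
	n := utf8.EncodeRune(buf, r)
	return buf[:n]
}

// decodeFirst returns the first code point in b and its encoded width in bytes.
func decodeFirst(b []byte) (rune, int) {
	return utf8.DecodeRune(b)
}

func main() {
	b := encode(0x4F60)
	fmt.Printf("% X\n", b) // E4 BD A0

	r, size := decodeFirst(b)
	fmt.Printf("%U %d\n", r, size) // U+4F60 3
}
```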

The main differences between byte and rune

Characteristic	`byte`	`rune`
Type	`uint8` (8-bit unsigned int)	`int32` (32-bit signed int)
Use	Processing ASCII or raw byte data	Handling Unicode characters
Value range	0 to 255	0 to 0x10FFFF (valid Unicode code points)
Common applications	Byte streams, ASCII characters	Unicode characters (including multi-byte characters)
Storage size	1 byte	4 bytes
Character support	A single value can hold only an ASCII character	A single value can hold any Unicode character
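
The storage-size row can be made concrete by comparing the element storage of the `[]byte` and `[]rune` forms of the same string. This is only a sketch: `storageBytes` is a hypothetical helper, and it counts element storage only, not slice headers.

```go
package main

import (
	"fmt"
	"unsafe"
)

// storageBytes returns the element storage, in bytes, needed to hold s
// as a []byte and as a []rune. Slice headers are not counted.
func storageBytes(s string) (asBytes, asRunes int) {
	asBytes = len([]byte(s)) * int(unsafe.Sizeof(byte(0))) // 1 byte per element
	asRunes = len([]rune(s)) * int(unsafe.Sizeof(rune(0))) // 4 bytes per element
	return asBytes, asRunes
}

func main() {
	b, r := storageBytes("a你")
	fmt.Println(b, r) // 4 8
}
```

For text that is mostly ASCII, the `[]rune` form costs roughly four times the memory, which is one reason strings are stored as bytes.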

Go's default encoding method

Go strings are UTF-8 encoded by default (Go source files are UTF-8, so string literals normally are too). Each character in a string is therefore stored as a sequence of one or more bytes.

Specifically, a Go string (the `string` type) is made up of a UTF-8 encoded byte sequence. Therefore:

A Go string consists of a sequence of bytes (`byte`), and each byte is part of the UTF-8 encoding of a character.
Because these bytes follow UTF-8, a Go string can contain both ASCII characters (1 byte each in UTF-8) and multi-byte Unicode characters (such as Chinese characters, which usually take up 3 bytes in UTF-8).

s := "a"
fmt.Print("Number of bytes occupied: ", len(s))
fmt.Printf("; Type: %T ", s[0])
fmt.Println()
s1 := "你"
fmt.Print("Number of bytes occupied: ", len(s1))
fmt.Printf("; Type: %T ", s1[0])

Output:

Number of bytes occupied: 1; Type: uint8
Number of bytes occupied: 3; Type: uint8

Traversal methods

Traversing by byte

`bytes := []byte(s)` converts the string directly to a byte slice. You can also traverse the string byte by byte:

Use `for i := 0; i < len(s); i++`; each iteration accesses one byte of the string.
`len(s)` returns the byte length of the string, i.e. the total number of bytes it contains, not the number of characters. For a string containing multi-byte characters (such as Chinese characters), `len(s)` is therefore larger than the number of characters.

package main

import "fmt"

func main() {
	s := "你" // Contains a Chinese character
	// Traverse the string by byte
	fmt.Println("Traverse the string by byte:")
	for i := 0; i < len(s); i++ {
		fmt.Printf("s[%d] = %v (type: %T)\n", i, s[i], s[i]) // Print the value of each byte
	}
}

Output:

Traverse the string by byte:
s[0] = 228 (type: uint8)
s[1] = 189 (type: uint8)
s[2] = 160 (type: uint8)

Traversing by rune

`runes := []rune(s)` converts the string directly to a rune slice. You can also traverse the string rune by rune:

When you traverse a string with `for _, c := range s`, Go automatically decodes each character in `s` into a `rune`, so multi-byte characters are handled correctly.
When traversing a string, `range` iterates by character (rune). Each iteration returns a Unicode code point (rune) and the byte index at which that character starts in the string. For a multi-byte character, `range` consumes all of its bytes in one iteration, so the index jumps ahead by the character's width.

package main

import "fmt"

func main() {
	s := "你"

	// len(s) returns the number of bytes
	fmt.Println("len(s) =", len(s)) // Output: 3, because "你" is encoded as 3 bytes
	// Use range to traverse the string by character (rune)
	fmt.Println("Use range to traverse the string by character (rune):")
	for i, r := range s {
		fmt.Printf("i = %d, r = %v (type: %T)\n", i, r, r)
	}
}

Output:

len(s) = 3
Use range to traverse the string by character (rune):
i = 0, r = 20320 (type: int32)

Supplement

With `for i := range s`, `s[i]` is actually a `byte`, which causes problems when handling Chinese text.

Using `for i := range s` on an English string usually causes no trouble, because English characters (ASCII characters) are encoded as a single byte in UTF-8, so each character corresponds to exactly one byte.
But if the string contains non-English characters (such as Chinese or emoji), they usually take up multiple bytes, and `for i := range s` then gives surprising results. `range` iterates by character (rune), so the number of iterations is the number of runes (only 1 in the example below), not the number of bytes (one Chinese character corresponds to 3 bytes). As a result, `s[i]` only ever shows the first byte of each character.

package main

import "fmt"

func main() {
	s := "你" // The string contains a Chinese character
	// Use range to traverse the string
	fmt.Println("Use range to traverse the string:")
	for i := range s {
		fmt.Printf("s[%d] = %v (type: %T)\n", i, s[i], s[i]) // Prints only the first byte of each character
	}
}

Output:

Use range to traverse the string:
s[0] = 228 (type: uint8)
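
If only the index is available, the full character can still be recovered by decoding from that position. The following is a sketch under the assumption that `i` is a character boundary (as any index produced by `range` is); `charAt` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// charAt decodes the full character that starts at byte index i of s.
// It assumes i is a character boundary, such as an index produced by range.
func charAt(s string, i int) string {
	_, size := utf8.DecodeRuneInString(s[i:])
	return s[i : i+size]
}

func main() {
	s := "你好"
	for i := range s {
		// s[i] is just the first byte; charAt returns the whole character.
		fmt.Printf("s[%d] = %d, full character = %s\n", i, s[i], charAt(s, i))
	}
}
```

In practice, `for i, r := range s` is simpler, since it hands back the decoded rune directly.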

Restoring strings

To restore the original string from a `byte` sequence or a `rune` sequence, you can do the following:

From a `byte` sequence: use `string(byteSlice)` directly.
From a `rune` sequence: use `string(runeSlice)` directly.

Restore string from byte sequence

package main

import "fmt"

func main() {
	s := "你好" // The string "你好"
	// Convert the string to a byte slice
	bytes := []byte(s)

	fmt.Println("bytes:", bytes)
	// Convert the byte slice back to a string
	s1 := string(bytes)
	fmt.Println("Restored string:", s1)
}

Output:

bytes: [228 189 160 229 165 189]
Restored string: 你好

Restore string from rune sequence

package main

import "fmt"

func main() {
	s := "你好" // The string "你好"
	// Convert the string to a rune slice
	runes := []rune(s)

	fmt.Println("runes encoding:", runes)
	// Convert the rune slice back to a string
	s1 := string(runes)
	fmt.Println("Restored string:", s1)
}

Output:

runes encoding: [20320 22909]
Restored string: 你好
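
The choice between the two conversions matters when only part of the sequence is restored: cutting a byte slice can split a character, while cutting a rune slice cannot. The sketch below illustrates this; `firstN` is a hypothetical helper, not a standard library function.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstN returns the first n characters of s, or all of s if it is shorter.
func firstN(s string, n int) string {
	runes := []rune(s)
	if n < 0 {
		n = 0
	}
	if n > len(runes) {
		n = len(runes)
	}
	return string(runes[:n])
}

func main() {
	s := "你好"

	// Cutting after 4 bytes splits "好" in the middle of its encoding.
	cut := string([]byte(s)[:4])
	fmt.Println(utf8.ValidString(cut)) // false

	// Cutting by rune always ends on a character boundary.
	fmt.Println(firstN(s, 1)) // 你
}
```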
