Go language model: Detailed explanation of the underlying data structure and efficient operation of string

Golang's string type has a simple underlying data structure, and is essentially a structure instance, and is const immutable.

The underlying data structure of string

From the following example:

package main
import (
	"fmt"
	"unsafe"
)
// from: Double-click shift in GoLand IDE to find it quicklytype stringStruct struct {
	array  // Point to an array of [len]byte	length int    // length}
func main() {
	test := "hello"
	p := (*str)((&amp;test))
	(&amp;p, p) // 0xc420070018 &amp;{0xa3f71 5}
	c := make([]byte, )
	for i := 0; i &lt; ; i++ {
		tmp := uintptr(())   // Pointer type conversion is passed through unsafe package		c[i] = *(*byte)((tmp + uintptr(i))) // Pointer operation can only be passed through uintptr	}
	(c)   // [104 101 108 108 111]
	(string(c)) // [byte] --&gt; string, "hello"
	test2 := test + " world" // The string is an immutable type and will generate a new string instance	p2 := (*str)((&amp;test2))
	(&amp;p2, p2) // 0xc420028030 &amp;{0xc42000a2e5 11}
	(test2) // hello, world
}

Stitching and modification of string

+Operation

The string type is an immutable type, so any modification to the string will generate a new instance of the string. If it is a scenario that considers efficiency, you should carefully consider how to modify it. Let’s first talk about the longest + operation. For the same example above, take a look at the disassembly of the + operation splicing string:

25		test2 := test + " world"
 0x00000000004824d7 &lt;+1127&gt;:	lea 0x105a2(%rip),%rax  # 0x492a80
 0x00000000004824de &lt;+1134&gt;:	mov %rax,(%rsp)
 0x00000000004824e2 &lt;+1138&gt;:	callq 0x40dda0 &lt;&gt; # Call the newobject function 0x00000000004824e7 &lt;+1143&gt;:	mov 0x8(%rsp),%rax
 0x00000000004824ec &lt;+1148&gt;:	mov %rax,0xa0(%rsp)
 0x00000000004824f4 &lt;+1156&gt;:	mov 0xa8(%rsp),%rax
 0x00000000004824fc &lt;+1164&gt;:	mov 0x8(%rax),%rcx
 0x0000000000482500 &lt;+1168&gt;:	mov (%rax),%rax
 0x0000000000482503 &lt;+1171&gt;:	mov %rax,0x8(%rsp)
 0x0000000000482508 &lt;+1176&gt;:	mov %rcx,0x10(%rsp)
 0x000000000048250d &lt;+1181&gt;:	movq $0x0,(%rsp)
 0x0000000000482515 &lt;+1189&gt;:	lea 0x30060(%rip),%rax  # 0x4b257c
 0x000000000048251c &lt;+1196&gt;:	mov %rax,0x18(%rsp)
 0x0000000000482521 &lt;+1201&gt;:	movq $0x6,0x20(%rsp)
 0x000000000048252a &lt;+1210&gt;:	callq 0x43cc00 &lt;runtime.concatstring2&gt; # Call the concatstring2 function

Because the current go[2018.11 version: go1.11] does not follow the default x86 calling convention to pass parameters in registers, but pass parameters through stack, go's disassembly is not as easy to understand as c's, but it is still fine to understand the operation behind +. Take a look at the splicing function of the runtime source code:

func concatstring2(buf *tmpBuf, a [2]string) string {
 return concatstrings(buf, a[:])
}
// concatstrings implements a Go string concatenation x+y+z+...
// The operands are passed in the slice a.
// If buf != nil, the compiler has determined that the result does not
// escape the calling function, so the string data can be stored in buf
// if small enough.
func concatstrings(buf *tmpBuf, a []string) string {
 idx := 0
 l := 0
 count := 0
 for i, x := range a {
  n := len(x)
  if n == 0 {
   continue
  }
  if l+n &lt; l {
   throw("string concatenation too long")
  }
  l += n
  count++
  idx = i
 }
 if count == 0 {
  return ""
 }
 // If there is just one string and either it is not on the stack
 // or our result does not escape the calling frame (buf != nil),
 // then we can return that string directly.
 if count == 1 &amp;&amp; (buf != nil || !stringDataOnStack(a[idx])) {
  return a[idx]
 }
 s, b := rawstringtmp(buf, l)
 for _, x := range a {
  copy(b, x) // The most critical copy operation  b = b[len(x):]
 }
 return s
}

Analyzing the implementation of runtime's concatstrings, you can see that + finally new buf is applied, copy the original string to buf, and finally return the new instance. Then every + operation involves a new application buf, and then the corresponding copy. If + is used repeatedly, there will inevitably be a large number of applications for memory operations, and the performance will be affected by a large number of splicing.

By looking at the source code, the memory is increased by 2 times when growing the buffer, which can effectively avoid frequent memory application. From an example:

func main() {
 var buf 
 for i := 0; i < 10; i++ {
  ("hi ")
 }
 (())
}

The corresponding byte package library function source code

// @file: 
func (b *Buffer) WriteString(s string) (n int, err error) {
  = opInvalid
 m, ok := (len(s))
 if !ok {
  m = (len(s)) // Efficient growth strategy -> let capacity get twice as large }
 return copy([m:], s), nil
}
// @file: 
// let capacity get twice as large !!!
func (b *Buffer) grow(n int) int {
 m := ()
 // If buffer is empty, reset to recover space.
 if m == 0 &amp;&amp;  != 0 {
  ()
 }
 // Try to grow by means of a reslice.
 if i, ok := (n); ok {
  return i
 }
 // Check if we can make use of bootstrap array.
 if  == nil &amp;&amp; n &lt;= len() {
   = [:n]
  return 0
 }
 c := cap()
 if n &lt;= c/2-m {
  // We can slide things down instead of allocating a new
  // slice. We only need m+n &lt;= c to slide, but
  // we instead let capacity get twice as large so we
  // don't spend all our time copying.
  copy(, [:])
 } else if c &gt; maxInt-c-n {
  panic(ErrTooLarge)
 } else {
  // Not enough space anywhere, we need to allocate.
  buf := makeSlice(2*c + n)
  copy(buf, [:])
   = buf
 }
 // Restore  and len().
  = 0
  = [:m+n]
 return m
}

This function can apply the size of the final string at once, but use all strings in advance. This scenario is also efficient, an example:

func main() {
 var strs []string
 for i := 0; i < 10; i++ {
 strs = append(strs, "hi")
 }
 ((strs, " "))
}

The source code of the corresponding library:

// Join concatenates the elements of a to create a single string. The separator string
// sep is placed between elements in the resulting string.
func Join(a []string, sep string) string {
 switch len(a) {
 case 0:
  return ""
 case 1:
  return a[0]
 case 2:
  // Special case for common small values.
  // Remove if /issue/6714 is fixed
  return a[0] + sep + a[1]
 case 3:
  // Special case for common small values.
  // Remove if /issue/6714 is fixed
  return a[0] + sep + a[1] + sep + a[2]
 }
 
 // Calculate the final string size n := len(sep) * (len(a) - 1) //
 for i := 0; i &lt; len(a); i++ {
  n += len(a[i])
 }
 b := make([]byte, n)
 bp := copy(b, a[0])
 for _, s := range a[1:] {
  bp += copy(b[bp:], sep)
  bp += copy(b[bp:], s)
 }
 return string(b)
}

(go1.10+)

When I saw this name, I thought of Java library. Haha, this Builder is the most convenient to use, but it was introduced after 1.10. Its efficiency is also reflected in the memory growth of 2 times. The WriteString function uses the 2 times the speed of the append function corresponding to the slice type.

An example:

func main() {
 var s 
 for i := 0; i < 10; i++ {
  ("hi ")
 }
 (())
}

The source code of the corresponding library

@file: 
// WriteString appends the contents of s to b's buffer.
// It returns the length of s and a nil error.
func (b *Builder) WriteString(s string) (int, error) {
 ()
  = append(, s...)
 return len(s), nil
}

Summarize

Golang's string processing is quite convenient, with garbage collection and some built-in language-level writing support, making complex string operations less complicated and much more efficient than C/C++.

Supplement: The internal implementation of go string

go string internal implementation

Exploration of this string

Let's have an example

func boo(a int, b int)(int, string){
 return a + b, "abcd"
}

81079 000000000044dfa0 <>:
81080 44dfa0:>------48 c7 44 24 18 00 00 >--movq $0x0,0x18(%rsp)
81081 44dfa7:>------00 00- 
81082 44dfa9:>------0f 57 c0    >--xorps %xmm0,%xmm0
81083 44dfac:>------0f 11 44 24 20  >--movups %xmm0,0x20(%rsp)
81084 44dfb1:>------48 8b 44 24 08  >--mov 0x8(%rsp),%rax
81085 44dfb6:>------48 03 44 24 10  >--add 0x10(%rsp),%rax
81086 44dfbb:>------48 89 44 24 18  >--mov %rax,0x18(%rsp)
81087 44dfc0:>------48 8d 05 d4 eb 01 00 >--lea 0x1ebd4(%rip),%rax  # 46cb9b <.*+0xbb>
81088 44dfc7:>------48 89 44 24 20  >--mov %rax,0x20(%rsp)
81089 44dfcc:>------48 c7 44 24 28 04 00 >--movq $0x4,0x28(%rsp)
81090 44dfd3:>------00 00- 
81091 44dfd5:>------c3     >--retq---

81087 44dfc0:&gt;------48 8d 05 d4 eb 01 00 &gt;--lea 0x1ebd4(%rip),%rax  # 46cb9b &lt;.*+0xbb&gt;
81088 44dfc7:&gt;------48 89 44 24 20  &gt;--mov %rax,0x20(%rsp)
81089 44dfcc:&gt;------48 c7 44 24 28 04 00 &gt;--movq $0x4,0x28(%rsp)
81090 44dfd3:&gt;------00 00- 
81091 44dfd5:&gt;------c3     &gt;--retq---
lea 0x1ebd4(%rip),%raxgetchar*, mov %rax,0x20(%rsp)Copy to return value, movq $0x4,0x28(%rsp)Fill in the length as well,

In fact, you can see that string is a combination of char* and len in c

The above is personal experience. I hope you can give you a reference and I hope you can support me more. If there are any mistakes or no complete considerations, I would like to give you advice.