
Summary of common performance optimization methods in golang slice

This article will not discuss cache hit rates or SIMD. Both are related to slice performance, but the former is something any competent developer should already understand (there are plenty of excellent tutorials online), and the latter shouldn't even be considered unless your bottleneck really is raw data throughput, which is especially rare in Go. So neither topic is covered here.

One reminder before we start: performance bottlenecks must be located with tests and profiling, and the benefits and costs of any optimization must likewise be measured with performance tests. Don't apply these techniques blindly.

Preallocate memory when creating a slice

Preallocating memory is the most common optimization. I'll explain it in two parts: at creation time and before use.

Allocating enough memory up front for the slice you are about to create eliminates the performance loss from capacity growth when elements are appended later.

The specific methods are as follows:

s1 := make([]T, 0, preallocatedElementCount)

// Another, less common pre-allocation method; here the element count must be a constant
var arr [elementCount]T
s2 := arr[:]

The code is simple enough that I won't benchmark it.

As mentioned, growth during append hurts in two ways. First, the new cap has to be recalculated; since 1.19, a gentler growth strategy makes that calculation a bit more expensive. Second, memory may have to be reallocated: if the slice can't grow in place, a new block must be allocated, the data copied over, and the original memory freed. The more elements you append, the higher the chance of hitting this, and it is quite expensive.

In addition, the growth strategy slices use sometimes wastes memory, as in the following example:

func main() {
	var a []int
	for i := 0; i < 2048; i++ {
		a = append(a, i)
	}
	fmt.Println(cap(a)) // go1.22: 2560
}

As you can see, we appended 2048 elements, but Go ended up allocating room for 2560, wasting nearly 500 elements' worth of memory.

However, preallocation is not a panacea; it applies in limited scenarios:

Applicable scenarios:

  • Scenarios where you know exactly how many elements the slice will hold
  • Scenarios where the element count is uncertain but falls roughly within an interval [x, y]: preallocate y+N elements (N depends on your error tolerance; triggering a growth after preallocating a large block is very expensive, so wasting a small margin beats growing again). Of course, x and y can't be too far apart: 1 and 1000, say, obviously shouldn't be preallocated. The main criterion is the worst-case memory waste rate. A sketch of the second case follows this list.
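As a sketch of the second case, suppose we estimate a filter will keep roughly 800 to 1000 elements (the numbers, the margin N, and the helper function are made up for illustration):

func filter(input []int, keep func(int) bool) []int {
	// Estimated output is in [800, 1000], so preallocate y+N
	// with y=1000 and a small safety margin N=24.
	const y, n = 1000, 24
	results := make([]int, 0, y+n)
	for _, v := range input {
		if keep(v) {
			results = append(results, v)
		}
	}
	return results
}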

Outside these two cases I don't recommend preallocation, because allocating memory itself costs performance. In other scenarios preallocation tends to waste a lot of memory, and the cost of that waste can easily exceed the cost of growth.

Preallocation has another benefit: if the allocation size is a constant or a constant expression, escape analysis may determine that the slice can be allocated on the stack, improving performance further. This is implemented in the compiler; the relevant code is below:

// /golang/go/blob/master/src/cmd/compile/internal/walk/#L412

// walkMakeSlice walks an OMAKESLICE node.
func walkMakeSlice(n *ir.MakeExpr, init *ir.Nodes) ir.Node {
	l := n.Len
	r := n.Cap
	if r == nil {
		r = safeExpr(l, init)
		l = r
	}
	t := n.Type()
	if t.Elem().NotInHeap() {
		base.Errorf("%v can't be allocated in Go; it is incomplete (or unallocatable)", t.Elem())
	}
	if n.Esc() == ir.EscNone {
		if why := escape.HeapAllocReason(n); why != "" {
			base.Fatalf("%v has EscNone, but %v", n, why)
		}
		// Check whether the cap is a constant
		i := typecheck.IndexConst(r)
		if i < 0 {
			base.Fatalf("walkExpr: invalid index %v", r)
		}

		// After the check passes, create a temporary slice variable allocated on the stack
	}

	// The slice escapes: code is generated here that
	// uses mallocgc to allocate the memory from the heap
}

Allocation on the stack is faster and puts less pressure on the GC. However, we can't control where an object is allocated; all we can do is create the opportunity for it to be allocated on the stack.
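You can observe this yourself with go build -gcflags='-m'. A minimal sketch (the function names are made up, and the exact diagnostics vary by Go version):

func sumConst() int {
	s := make([]int, 8) // constant size: -m typically reports "make([]int, 8) does not escape"
	for i := range s {
		s[i] = i
	}
	return s[0] + s[7]
}

func sumDyn(n int) int {
	s := make([]int, n) // variable size: -m typically reports "make([]int, n) escapes to heap"
	sum := 0
	for i := range s {
		s[i] = i
		sum += s[i]
	}
	return sum
}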

Preallocate memory before operating on a slice

Since the slices package entered the standard library in Go 1.21, memory can also be preallocated when operating on an existing slice, using slices.Grow.

It was possible before too, but you had to take a detour; if you're interested, you can look up how it was done.

Let's see the effect through a simple test:

func BenchmarkAppend(b *testing.B) {
	for i := 0; i < b.N; i++ {
		s := []int{1, 2, 3, 4, 5}
		for j := 0; j < 1024; j++ {
			s = append(s, j)
		}
	}
}

func BenchmarkAppendWithGrow(b *testing.B) {
	for i := 0; i < b.N; i++ {
		s := []int{1, 2, 3, 4, 5}
		s = slices.Grow(s, 1024)
		for j := 0; j < 1024; j++ {
			s = append(s, j)
		}
	}
}

Here is the result, compared with benchstat:

goos: windows
goarch: amd64
cpu: Intel(R) Core(TM) i5-10200H CPU @ 2.40GHz
         │     old     │                new                 │
         │   sec/op    │   sec/op     vs base               │
Append-8   4.149µ ± 3%   1.922µ ± 5%  -53.69% (p=0.000 n=10)

         │      old      │                 new                  │
         │     B/op      │     B/op      vs base                │
Append-8   19.547Ki ± 0%   9.250Ki ± 0%  -52.68% (p=0.000 n=10)

         │    old     │               new                 │
         │ allocs/op  │ allocs/op   vs base               │
Append-8   8.000 ± 0%   1.000 ± 0%  -87.50% (p=0.000 n=10)

Not only is it more than twice as fast, memory use is also cut by half; compared with the version without Grow, the optimized code needs only a single allocation.

The reason for performance improvement is exactly the same as in the previous section: it avoids the overhead caused by multiple expansions.

The memory-saving benefit also holds, as in the previous section:

func main() {
	s1 := make([]int, 10, 50) // Note that some capacity is already preallocated
	for i := 0; i < 1024; i++ {
		s1 = append(s1, i)
	}
	fmt.Println(cap(s1)) // 1280

	s2 := make([]int, 10, 50)
	s2 = slices.Grow(s2, 1024)
	for i := 0; i < 1024; i++ {
		s2 = append(s2, i)
	}
	fmt.Println(cap(s2)) // 1184
}

As the example shows, the former's memory utilization is 80% while the latter's is 86.5%. Although Grow also uses append's growth mechanism internally, it makes fuller use of the allocated memory and avoids waste.

Like the previous section, there are only two applicable scenarios for pre-allocation before use:

  • Scenarios where you know exactly how many elements you will add to the slice
  • Scenarios where the number of added elements is uncertain but falls roughly within an interval [x, y]: preallocate y+N (as above, N depends on your error tolerance).

In addition, when splicing multiple slices together, it is best to use slices.Concat, because it preallocates enough memory internally with Grow, which is faster than appending directly. This is also a live example of the optimization described in this section.
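For example, a quick sketch with slices.Concat (available since Go 1.22):

func main() {
	a := []int{1, 2}
	b := []int{3, 4}
	c := []int{5}
	// Concat sums the lengths and grows once internally,
	// instead of potentially reallocating on every append.
	merged := slices.Concat(a, b, c)
	fmt.Println(merged) // [1 2 3 4 5]
}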

Set the cap value reasonably in slice expressions

In newer versions of Go, a slice expression can take a third parameter, the cap value, in a form like slice[start:end:capEnd].

Note that I wrote capEnd rather than cap, because this parameter is not the cap itself: it means the new slice can access elements of the original array or slice up to index capEnd-1. For example, slice[1:2:3] creates a new slice of length 2-1 = 1 that can access the original slice's elements up to index 3-1 = 2; the new slice can thus reach index 1 and index 2, two elements in total, so its cap is 2.

Why add this parameter? Because it limits the range the new slice can access, avoiding unexpected modifications to the data.

And what is the cap when there is no third parameter? It simply defaults to cap(old slice) - start.
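A minimal, runnable illustration of both rules:

func main() {
	s := []int{0, 1, 2, 3, 4, 5}

	a := s[1:2:3]
	fmt.Println(len(a), cap(a)) // 1 2

	b := s[1:3] // no third parameter
	fmt.Println(len(b), cap(b)) // 2 5, since cap defaults to cap(s) - 1
}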

What does this have to do with performance optimization? See an example:

func noop(s []int) int {
	return s[1] + s[2]
}

func BenchmarkSlice(b *testing.B) {
	slice := []int{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
	for i := 0; i < b.N; i++ {
		noop(slice[1:5])
	}
}

func BenchmarkSliceWithEqualCap(b *testing.B) {
	slice := []int{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
	for i := 0; i < b.N; i++ {
		noop(slice[1:5:5])
	}
}

Test results:

goos: windows
goarch: amd64
cpu: Intel(R) Core(TM) i5-10200H CPU @ 2.40GHz
BenchmarkSlice-8                1000000000               0.3263 ns/op          0 B/op          0 allocs/op
BenchmarkSliceWithEqualCap-8    1000000000               0.3015 ns/op          0 B/op          0 allocs/op

Comparing with benchstat, the code using slice[1:5:5] is about 3% faster on average.

There is in fact a small optimization in Go here: when the second and third parameters of a slice expression are identical, the previously computed length can be reused directly as the cap, saving one memory access and one subtraction.

If you don't believe it, you can check the compiler's code:

// slice computes the slice v[i:j:k] and returns ptr, len, and cap of result.
// i,j,k may be nil, in which case they are set to their default value.
// v may be a slice, string or pointer to an array.
func (s *state) slice(v, i, j, k *ssa.Value, bounded bool) (p, l, c *ssa.Value) {
	t := v.Type
	var ptr, len, cap *ssa.Value
	switch {
	case t.IsSlice():
		ptr = s.newValue1(ssa.OpSlicePtr, types.NewPtr(t.Elem()), v)
		// Compute the len and cap of the slice
		len = s.newValue1(ssa.OpSliceLen, types.Types[types.TINT], v)
		cap = s.newValue1(ssa.OpSliceCap, types.Types[types.TINT], v)
	case t.IsString():
		// Omitted, not relevant here
	case t.IsPtr():
		// Also omitted
	default:
		s.Fatalf("bad type in slice %v\n", t)
	}

	// For s[:j:k], i defaults to 0
	if i == nil {
		i = s.constInt(types.Types[types.TINT], 0)
	}
	// For s[i:], j defaults to len(s)
	if j == nil {
		j = len
	}
	three := true
	// For s[i:j], k defaults to cap(s)
	if k == nil {
		three = false
		k = cap
	}

	// Bounds checks on i, j and k omitted

	// Think of these as ordinary arithmetic operators for now
	subOp := s.ssaOp(ir.OSUB, types.Types[types.TINT])
	mulOp := s.ssaOp(ir.OMUL, types.Types[types.TINT])
	andOp := s.ssaOp(ir.OAND, types.Types[types.TINT])

	// Calculate the length (rlen) and capacity (rcap) of the new slice.
	// For strings the capacity of the result is unimportant. However,
	// we use rcap to test if we've generated a zero-length slice.
	// Use length of strings for that.
	rlen := s.newValue2(subOp, types.Types[types.TINT], j, i)
	rcap := rlen
	if j != k && !t.IsString() {
		rcap = s.newValue2(subOp, types.Types[types.TINT], k, i)
	}

	// Computation of the new slice's data pointer (rptr) omitted

	return rptr, rlen, rcap
}

Overall, nothing difficult. All slice expressions eventually reach this function, which produces the corresponding ops; those go through some relatively simple optimizations, and the compiler then generates the actual executable from them.

The key is the line if j != k && !t.IsString(). The statement in that branch, rcap = s.newValue2(subOp, types.Types[types.TINT], k, i), translated into ordinary Go is rcap = k - i; how k is computed is described in the preceding comments. This means that when the second and third parameters of a slice expression have the same value and the operand isn't a string, the length is reused directly as the cap without extra calculation. As an aside, although I say "calculation", rcap and rlen here are only expressions; the real results are computed at runtime. If you're interested, you can dig into the Go compiler yourself.

It is precisely this small optimization that brings the slight performance improvement.

Of course, these are just code-generation details, and I wouldn't usually recommend the approach for this reason alone.

What matters more is the safety mentioned above: limiting the range a slice can access and avoiding accidental changes to the data. On top of that there is no performance loss but a slight gain, which is the icing on the cake.
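A small demonstration of that safety benefit (the values are made up):

func main() {
	s := []int{1, 2, 3, 4, 5}
	a := s[1:3] // cap defaults to 4, so append writes into s's backing array
	a = append(a, 100)
	fmt.Println(s) // [1 2 3 100 5]: s[3] was silently overwritten

	s = []int{1, 2, 3, 4, 5}
	b := s[1:3:3] // cap is limited to 2, so append must allocate a new array
	b = append(b, 100)
	fmt.Println(s) // [1 2 3 4 5]: s is untouched
}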

Applicable scenario: when the new slice's len and cap should in theory be equal, it is best to set both explicitly, e.g. slice[i : j+2 : j+2].

That scenario probably accounts for about half of all cases; plenty of scenarios don't fit it, so don't apply this blindly, and let performance tests decide.

For details, you can see how this PR does it: /golang/go/pull/64835

Optimize adding multiple zero-value elements to a slice

There are also tricks for appending "zeros" to a slice. Take a look at this test:

func BenchmarkAppendZeros1(b *testing.B) {
	for i := 0; i < b.N; i++ {
		slice := []int{}
		slice = append(slice, []int{0, 0, 0, 0, 0}...)
	}
}

// Optimized version
func BenchmarkAppendZeros2(b *testing.B) {
	for i := 0; i < b.N; i++ {
		slice := []int{}
		slice = append(slice, make([]int, 5)...)
	}
}

Test results:

goos: windows
goarch: amd64
cpu: Intel(R) Core(TM) i5-10200H CPU @ 2.40GHz
              │     old     │                new                │
              │   sec/op    │   sec/op     vs base              │
AppendZeros-8   31.79n ± 2%   30.04n ± 2%  -5.50% (p=0.000 n=10)

              │    old     │            new             │
              │    B/op    │    B/op     vs base        │
AppendZeros-8   48.00 ± 0%   48.00 ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal

              │    old     │            new             │
              │ allocs/op  │ allocs/op   vs base        │
AppendZeros-8   1.000 ± 0%   1.000 ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal

One line of code, and performance improves by 5% with memory usage unchanged.

The secret is still in the compiler.

Whether it's append(s1, s2...) or append(s1, make([]T, length)...), the compiler gives it special treatment.

The process of the former is as follows:

  • Create s2 (if s2 is a slice literal)
  • Check s1's cap; grow it if insufficient
  • Copy the contents of s2 into s1

The process when using make is as follows:

  • Check s1's cap; grow it if insufficient
  • memclr the next length elements of s1's free space (set that memory to all zeros)

The code is here: /golang/go/blob/master/src/cmd/compile/internal/walk/#L647

The secret to the improvement: no temporary slice needs to be created, and memclr does less work than copy, so it's simpler and faster.

And append(s1, make([]T, length)...) is obviously more readable as well; two birds with one stone.

Applicable scenario: when you need to append a run of zero values to a slice.

Loop unrolling

Processing slice data in loops is also a common need. Compared with the for-range covered in the next section, an ordinary for loop is more flexible and is not affected by the change to range semantics in 1.22.

When it comes to loop optimizations, loop unrolling is an unavoidable topic. As the name suggests, it turns a loop of n iterations into one that processes m elements per iteration, cutting the total iteration count to roughly n/m + 1.

Why is this faster? First, there are far fewer loop jumps, boundary-condition comparisons, and counter updates. Second, modern CPUs have instruction pipelines that can run multiple instructions at once; if there are no data dependencies between them (a later computation needing an earlier one's result as input), unrolling gives some instructions a chance to run in parallel, improving throughput.

Usually, though, this isn't something programmers should worry about: how to unroll, and when it pays off and when it doesn't (unrolling can affect whether the enclosing function is inlinable, among other things), are things a compiler with a decent optimizer should handle.

What about Go? It doesn't do this. Torn between runtime performance and compilation speed, Go chose compilation speed. It really does compile fast, but it loses out on optimization.

So I can only write it myself:

func loop(s []int) int {
	sum := 0
	for i := 0; i < len(s); i++ {
		sum += s[i]
	}
	return sum
}

func unroll4(s []int) int {
	sum := 0
	for i := 0; i < len(s); i += 4 {
		sum += s[i]
		sum += s[i+1]
		sum += s[i+2]
		sum += s[i+3]
	}
	return sum
}

// unroll8 is analogous, processing 8 elements per iteration

func BenchmarkLoop(b *testing.B) {
	s := []int{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 35, 26, 27, 28, 29, 30, 31}
	for i := 0; i < b.N; i++ {
		loop(s)
	}
}

func BenchmarkUnroll4(b *testing.B) {
	s := []int{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 35, 26, 27, 28, 29, 30, 31}
	for i := 0; i < b.N; i++ {
		unroll4(s)
	}
}

func BenchmarkUnroll8(b *testing.B) {
	s := []int{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 35, 26, 27, 28, 29, 30, 31}
	for i := 0; i < b.N; i++ {
		unroll8(s)
	}
}

The test uses a slice of 32 ints, first comparing the plain loop against processing four elements per iteration:

goos: windows
goarch: amd64
cpu: Intel(R) Core(TM) i5-10200H CPU @ 2.40GHz
         │     old     │                new                 │
         │   sec/op    │   sec/op     vs base               │
Unroll-8   9.718n ± 3%   3.196n ± 2%  -67.11% (p=0.000 n=10)

         │    old     │            new             │
         │    B/op    │    B/op     vs base        │
Unroll-8   0.000 ± 0%   0.000 ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal

         │    old     │            new             │
         │ allocs/op  │ allocs/op   vs base        │
Unroll-8   0.000 ± 0%   0.000 ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal

An improvement of nearly 67%, which is substantial. Now compare with processing 8 elements per iteration:

goos: windows
goarch: amd64
cpu: Intel(R) Core(TM) i5-10200H CPU @ 2.40GHz
         │     old     │                new                 │
         │   sec/op    │   sec/op     vs base               │
Unroll-8   9.718n ± 3%   2.104n ± 1%  -78.34% (p=0.000 n=10)

This time the improvement is 78%, about 30% faster than processing only four at a time.

For convenience, I only handled the case where the total element count is an integer multiple of the per-iteration batch size. Otherwise you'd want something like Duff's device for the remainder, which is awkward to implement in Go, so I was lazy (a plain tail loop also works; see the sketch below). Still, given the large gains from unrolling, if you've determined that a slice-processing loop is a performance bottleneck, it's worth a try.
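A minimal sketch of the tail-loop approach (unroll4Any is my own naming):

func unroll4Any(s []int) int {
	sum := 0
	i := 0
	// Process blocks of 4
	for ; i+4 <= len(s); i += 4 {
		sum += s[i]
		sum += s[i+1]
		sum += s[i+2]
		sum += s[i+3]
	}
	// Tail loop for the remaining 0-3 elements
	for ; i < len(s); i++ {
		sum += s[i]
	}
	return sum
}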

Applicable scenarios: the slice's length stays at a fixed value, and that length is an integer multiple of the number of elements processed per iteration.

Scenarios requiring careful benchmarking: if the loop body contains a lot of code, unrolling may help little or even backfire, because too much code can prevent the enclosing function (and the functions it calls) from being inlined and can change the escape analysis of local variables. The former amplifies function-call overhead and interferes with branch prediction and pipelining, degrading performance; the latter causes unnecessary escapes, hurting performance and increasing heap memory usage.

Also, there's no need to stick to multiples of 2 or 4 per iteration: in theory any batch size brings a noticeable improvement, and actual tests agree; batches of 3, 5 or 7 perform about the same as 4 or 8. Generally, the larger the batch, the more pronounced the effect, but unroll too far and you land in the careful-benchmarking scenario above. I suggest keeping the batch size at 8 or below.

Avoid the losses caused by for-range copying data

Ordinary for loops offer flexible access, but for traversing a slice, most people's first choice is probably for-range.

This section is less about performance optimization than about avoiding the performance traps of for-range.

Let’s talk about where the trap is first.

There are actually two traps: one can basically always be avoided, the other depends on the situation. Let's start with the fully avoidable one.

Avoid copying

The first trap: when range traverses a slice, it copies each element into the loop variable. Moreover, starting with 1.22, every iteration of a range loop creates a new instance of the variable; if you don't notice this, not only does performance drop, memory pressure also spikes. What we must do is avoid the overhead of unnecessary copying.

As an example, we fill a slice with structs containing eight int64 fields and one string field, then compare performance with and without the copy:

type Data struct {
	a, b, c, d, e, f, g, h int64
	text                   string
}

func generateData(n int) []Data {
	ret := make([]Data, 0, n)
	for i := range int64(n) {
		ret = append(ret, Data{
			a:    i,
			b:    i + 1,
			c:    i + 2,
			d:    i + 3,
			e:    i + 4,
			f:    i + 5,
			g:    i + 6,
			h:    i + 7,
			text: "test",
		})
	}
	return ret
}

// Example that copies data on every iteration
func BenchmarkRanges1(b *testing.B) {
	data := generateData(100)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		tmp := int64(0)
		for _, v := range data { // The element is copied into the loop variable v
			tmp -= v.a - v.h
		}
	}
}

// Example that avoids the copy
func BenchmarkRanges2(b *testing.B) {
	data := generateData(100)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		tmp := int64(0)
		for i := range data { // Note these two lines
			v := &data[i]
			tmp -= v.a - v.h
		}
	}
}

Result:

goos: windows
goarch: amd64
cpu: Intel(R) Core(TM) i5-10200H CPU @ 2.40GHz
         │     old     │                new                │
         │   sec/op    │   sec/op     vs base              │
Ranges-8   33.51n ± 2%   32.63n ± 1%  -2.41% (p=0.000 n=10)

The copy can be avoided by using a pointer or indexing directly; as the results show, the larger the struct, the bigger the performance gap. In addition, newer Go versions changed range semantics from reusing the loop variable to creating a new one per iteration, which makes some for-range loops with copying overhead even slower.

Applicable scenarios: when you need to visit every element, the elements are fairly large, and the data clearly doesn't need to be copied into the loop variable.

Avoid the conversion overhead when traversing strings

Strings might seem a bit off topic, but this point is at least tangentially related to slices.

The trap: when range traverses a string, it decodes the string's content into runes, which costs time, especially when the string contains only ASCII characters.

A simple example shows how big the loss is:

func checkByte(s string) bool {
	for _, b := range []byte(s) {
		if b == '\n' {
			return true
		}
	}
	return false
}

func checkRune(s string) bool {
	for _, r := range s {
		if r == '\n' {
			return true
		}
	}
	return false
}

func BenchmarkRanges1(b *testing.B) {
	s := "abcdefghijklmnopqrstuvwxyz1234567890."
	for i := 0; i < b.N; i++ {
		checkRune(s)
	}
}

func BenchmarkRanges2(b *testing.B) {
	s := "abcdefghijklmnopqrstuvwxyz1234567890."
	for i := 0; i < b.N; i++ {
		checkByte(s)
	}
}

Here is the result:

goos: windows
goarch: amd64
cpu: Intel(R) Core(TM) i5-10200H CPU @ 2.40GHz
         │     old     │                new                 │
         │   sec/op    │   sec/op     vs base               │
Ranges-8   36.07n ± 2%   23.95n ± 1%  -33.61% (p=0.000 n=10)

Converting the string to []byte and traversing that is a third faster. In other words, if you don't notice this trap, you throw away that much performance for nothing.

And converting the string to []byte here doesn't require allocating new memory: the string's underlying data can be reused directly, provided the resulting slice is never modified. Here the slice is handed straight to range, which won't modify it, so the conversion cost disappears.

This optimization exists since 1.6. If you're interested, check the compiler's code: /golang/go/blob/master/src/cmd/compile/internal/walk/#L316 (looking at the code, there are other optimizations for this conversion too, for example the []byte gets allocated on the stack when the string is short).

Of course, if you need to handle characters beyond ASCII, such as Chinese, this optimization doesn't apply.

Applicable scenarios: the strings to traverse contain only characters in the ASCII range, e.g. strings with only half-width English letters, digits, punctuation, and line breaks.

BCE: bounds check elimination

Bounds checking means verifying that an index doesn't exceed its limit and that memory won't be accessed out of bounds, in scenarios such as indexing slice elements, using slice expressions, and creating slices. The compiler inserts these checks at the relevant locations based on information available at compile time, and the inserted code runs at runtime.

This feature is very important for the safety of the program.

So is a bounds check emitted wherever such expressions appear? The answer is no, because a bounds check has to load the slice's len or cap, compare against it, and panic on failure; that takes time and is unfriendly to branch prediction. Overall, attaching a check to every expression that accesses slice elements would drag performance down.

Hence bounds check elimination: in some scenarios an out-of-bounds access is plainly impossible, so the check is simply unnecessary.

How do you see where the compiler inserted checks? Use this command: go build -gcflags='-d=ssa/check_bce'

Take unroll4 from the previous section as an example:

$ go build -gcflags='-d=ssa/check_bce' 

# command-line-arguments
./:8:11: Found IsInBounds
./:9:11: Found IsInBounds
./:10:11: Found IsInBounds
./:11:11: Found IsInBounds

Two kinds of messages can appear: IsInBounds and IsSliceInBounds. Both indicate inserted bounds checks; the checks are similar with only slight differences. If you're interested, you can see how SSA generates them: /golang/go/blob/master/src/cmd/compile/internal/ssa/#L25798

So how are these checks eliminated? It breaks down into several cases, and they will certainly change as the compiler evolves, so I won't enumerate them all.

Even without an enumeration there is a rough common rule: if the compiler can prove, from the expressions evaluated before an indexed access, that the index cannot be out of bounds, the check is eliminated.

To give a few examples:

s1 := make([]T, 10)
_ = s1[9]   // A constant index can be checked at compile time, so no runtime check is inserted
_ = s1[i&6] // The index can only be 0-6, so the check is eliminated

var s2 []int
_ = s2[:i]   // Checked
_ = s2[:i]   // Repeated access, the bounds check is eliminated
_ = s2[:i+1] // Checked
_ = s2[:i+1] // Repeated expression, already checked, so the check is eliminated

func f(s []int) int {
	if len(s) < 3 {
		panic("error")
	}

	return s[1] + s[2] // The preceding if guarantees these accesses can't be out of bounds, so the checks are eliminated
}

// A common way to avoid repeated bounds checks via a temporary variable
func f2(s []int) int {
	tmp := s[:4:4] // Bounds-checked here; this also sets the cap in the slice expression as described earlier, avoiding extra overhead
	a := tmp[2]    // The check on tmp guarantees this can't be out of bounds, so no further check
	b := tmp[3]    // Same as above
	return a + b
}

I haven't listed every case; if you're interested, you can find more examples here.

Of course, there are some less obvious cases where the check cannot be eliminated:

func f(s []int, i int) {
	if i < len(s) {
		fmt.Println(s[i]) // Can't be eliminated: i is a signed integer and may be less than 0
	}
}

func f(s []int, i int) {
	if 0 < i && i < len(s) {
		fmt.Println(s[i+2]) // Can't be eliminated: i is signed, and if i+2 overflows, wraparound makes the index negative
	}
}

With this knowledge, we can see that our earlier unroll4 performs four bounds checks per iteration when it doesn't actually need that many, so it can be rewritten like this:

func unroll4(s []int) int {
	sum := 0
	for i := 0; i < len(s); i += 4 {
		tmp := s[i : i+4 : i+4] // Only one bounds check happens here
		sum += tmp[0]
		sum += tmp[1]
		sum += tmp[2]
		sum += tmp[3]
	}
	return sum
}

That still checks once per iteration. Can the checks be eliminated entirely?

func unroll4(s []int) int {
	sum := 0
	for len(s) >= 4 {
		sum += s[0]
		sum += s[1]
		sum += s[2]
		sum += s[3]
		s = s[4:] // Skip past the four processed elements; since len(s) >= 4, no check is needed here
	}
	return sum
}

Now the checks are completely gone, at the cost of an extra slice assignment.

However, my example is too simple: in benchmarks, the bounds check elimination brings no measurable improvement here, and the fully check-free version is slightly slower than the one-check version because of the extra slice assignment.

For more pronounced examples, see this blog post.

Applicable scenarios: wherever the result of len(slice) can be exploited effectively, BCE is worth trying.

Elsewhere, benchmark to determine whether there's an improvement and how big it is. Unlike setting the cap in a slice expression, this brings no safety benefit; unlike batch-appending zeros with make, it brings no readability benefit. So I personally think that unless it's a genuine bottleneck with no other optimization available, an improvement under 5% doesn't justify this kind of change.

Processing slices in parallel

Going one step further than loop unrolling, we arrive at parallel processing. Parallelism here doesn't mean SIMD; it's parallelism built on goroutines.

The prerequisite is that processing one element doesn't depend on another; for example, processing s[1] must not depend on the result of processing s[0].

Once you've confirmed the processing can be parallelized, you can write some parallel code, such as a parallel sum:

func Sum(s []int64) int64 {
	// Assume the length of s is 4000
	var sum atomic.Int64
	var wg sync.WaitGroup
	// Each goroutine handles 800 elements
	for i := 0; i < len(s); i += 800 {
		wg.Add(1)
		go func(ss []int64) {
			defer wg.Done()
			var ret int64
			for j := range ss {
				ret += ss[j]
			}
			sum.Add(ret)
		}(s[i : i+800])
	}
	wg.Wait()
	return sum.Load()
}

Very simple code. As with loop unrolling, it doesn't handle leftover elements when the length isn't an exact multiple of the chunk size; a sketch that does follows.
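One way to handle arbitrary lengths is to clamp the final chunk (the chunk size here is arbitrary; requires the sync and sync/atomic packages, and the min builtin from Go 1.21):

func SumAny(s []int64) int64 {
	const chunk = 800
	var sum atomic.Int64
	var wg sync.WaitGroup
	for i := 0; i < len(s); i += chunk {
		end := min(i+chunk, len(s)) // clamp the final chunk
		wg.Add(1)
		go func(ss []int64) {
			defer wg.Done()
			var ret int64
			for _, v := range ss {
				ret += v
			}
			sum.Add(ret)
		}(s[i:end:end])
	}
	wg.Wait()
	return sum.Load()
}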

Also, goroutine creation, destruction, and data synchronization all take time; if the slice has few elements, parallel processing isn't worth the cost.

Applicable scenarios: the slice has many elements, the per-element processing can run in parallel without interference, and, importantly, the Go program can use more than one CPU core so that the code actually runs "in parallel".

Reuse

Reusing slices is a common routine, and reusing []byte is the most common of all.

You can use sync.Pool, or do it like this:

buf := make([]byte, 1024)
for {
	read(buf)
	...
	// reuse
	buf = buf[:0]
}

buf = buf[:0] keeps the slice's cap while resetting its length to zero, so its memory can be reused. When using sync.Pool, you also need to reset the slice length to zero this way.

When using sync.Pool, also take care that the slices can't grow too large, or they'll burden the GC instead. As a rule of thumb, don't pool slices above 1MB; the exact limit must of course be decided by project requirements and performance testing.
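A minimal sync.Pool sketch for reusing []byte buffers (the sizes and the 1MB cutoff are illustrative):

var bufPool = sync.Pool{
	New: func() any {
		b := make([]byte, 0, 4096)
		return &b // store a pointer so Put doesn't allocate
	},
}

func handle(data []byte) {
	bp := bufPool.Get().(*[]byte)
	buf := (*bp)[:0] // keep the cap, reset the length

	buf = append(buf, data...)
	// ... process buf ...

	if cap(buf) <= 1<<20 { // don't pool oversized buffers, they burden the GC
		*bp = buf
		bufPool.Put(bp)
	}
}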

Applicable scenarios: the slice's memory can be reused (ideally reused directly; if it must be cleared first, the gains shrink a little) and creating slices repeatedly has genuinely become a performance bottleneck.

Efficiently delete multiple elements

Deleting elements is also a common requirement, and we will discuss it in three situations here.

All three cases are covered by the standard library's slices package, so I recommend using it rather than hand-rolling your own; a small example follows. Accordingly, this section has no overall "applicable scenarios"; instead, each subsection gives suggestions for its particular cases.
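For instance, removing a middle range with the standard library (a minimal sketch; slices.Delete is available since Go 1.21):

func main() {
	s := []int{1, 2, 3, 4, 5, 6}
	s = slices.Delete(s, 1, 3) // remove s[1:3], i.e. the elements 2 and 3
	fmt.Println(s)             // [1 4 5 6]
}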

Delete all elements

If you don't plan to reuse the slice after deleting the elements, just set it to nil.

If you want to reuse the memory, use s = s[:0] as described in the reuse section. But that alone isn't enough: to prevent memory leaks, all deleted elements must also be zeroed. Before 1.21 we could only do this:

func deleteSlice[T any, S ~[]T](s S) S {
	var zero T
	for i := range s {
		s[i] = zero
	}
	return s[:0]
}

Since 1.21 we have the clear builtin, and the code can be greatly simplified:

func deleteSlice[T any, S ~[]T](s S) S {
	clear(s)
	return s[:0]
}

Both versions perform the same, because Go specially optimizes for-range loops that write zero values to the same effect as calling clear directly:

func BenchmarkClear(b *testing.B) {
	for i := 0; i < b.N; i++ {
		var a = [...]uint64{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1}
		clear(a[:])
	}
}

func BenchmarkForRange(b *testing.B) {
	for i := 0; i < b.N; i++ {
		var a = [...]uint64{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1}
		for j := range a {
			a[j] = 0
		}
	}
}

goos: windows
goarch: amd64
cpu: Intel(R) Core(TM) i5-10200H CPU @ 2.40GHz
BenchmarkClear-8          1000000000             0.2588 ns/op           0 B/op           0 allocs/op
BenchmarkForRange-8       1000000000             0.2608 ns/op           0 B/op           0 allocs/op

But if the loop isn't in for-range form, it doesn't get this optimization:

func BenchmarkClear(b *testing.B) {
	for i := 0; i < b.N; i++ {
		var a = [...]uint64{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1}
		clear(a[:])
	}
}

func BenchmarkForLoop(b *testing.B) {
	for i := 0; i < b.N; i++ {
		var a = [...]uint64{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1}
		for j := 0; j < 20; j++ {
			a[j] = 0
		}
	}
}

goos: windows
goarch: amd64
cpu: Intel(R) Core(TM) i5-10200H CPU @ 2.40GHz
BenchmarkClear-8         1000000000             0.2613 ns/op           0 B/op           0 allocs/op
BenchmarkForLoop-8       173418799             7.088 ns/op           0 B/op           0 allocs/op

The speed differs by an order of magnitude. If you're curious how the "zeroing loop" optimization is implemented, see arrayclear. The optimization also works for maps.

We can also quickly compare emptying via nil versus clear:

func BenchmarkDeleteWithClear(b *testing.B) {
	for i := 0; i < b.N; i++ {
		a := []uint64{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1}
		clear(a)
		a = a[:0]
	}
}

func BenchmarkDeleteWithSetNil(b *testing.B) {
	for i := 0; i < b.N; i++ {
		a := []uint64{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1}
		a = a[:] // Keep the compiler from treating a as unused
		a = nil
	}
}

Judging from the results, for pure deletion there isn't much difference:

BenchmarkDeleteWithClear-8      1000000000               0.2592 ns/op          0 B/op          0 allocs/op
BenchmarkDeleteWithSetNil-8     1000000000               0.2631 ns/op          0 B/op          0 allocs/op

So which method you choose mainly depends on whether you will reuse the slice's memory in the future. If you need to reuse, use clear, otherwise set it to nil directly.

Delete elements in the head or tail

Deleting tail elements is the easiest, and s = s[:index] is essentially the only way. Just don't forget to zero the deleted part with clear.

Its only drawback is that the deleted part's memory isn't freed. That usually does no harm, since the memory is reused when new elements are appended. But if you won't reuse the memory and are sensitive to the waste, the only option is to allocate a new slice and copy the surviving elements over; note that this is much slower and consumes more memory during the deletion (the old and new slices exist at the same time).

For deleting head elements there are more choices; two common ones (assuming the relative order of elements must be preserved) are s = s[index+1:] and s = append(s[:0], s[index+1:]...)

The former creates a new slice whose underlying array starts at element index+1 of the original. Note that although the underlying array is reused, the cap shrinks, and the deleted part's memory has no chance of being reused again. This method requires zeroing the deleted elements first.

The latter doesn't create a new slice; it shifts the elements from index+1 onward to the head, deleting the head elements by overwriting them. With this approach you don't need to zero anything yourself; if it worries you, you can clear the vacated space left at the tail after the shift, though that's generally unnecessary.

In theory, the former wastes memory but performs better. But performance always has to be proven by benchmarks:

func BenchmarkClearWithReSlice(b *testing.B) {
	for i := 0; i < b.N; i++ {
		a := []uint64{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1}
		// Delete the 7 head elements
		clear(a[:7])
		a = a[7:]
	}
}

func BenchmarkClearWithAppend(b *testing.B) {
	for i := 0; i < b.N; i++ {
		a := []uint64{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1}
		a = append(a[:0], a[7:]...)
	}
}

The test results show that the first method is indeed faster:

BenchmarkClearWithReSlice-8     1000000000               0.2636 ns/op          0 B/op          0 allocs/op
BenchmarkClearWithAppend-8      100000000               10.82 ns/op            0 B/op          0 allocs/op

append is an order of magnitude slower; even though memmove is heavily optimized, moving data around in memory is still slow.

In practical applications, appropriate solutions should be selected based on memory utilization efficiency and operating speed.

Delete elements in the middle position

Deleting elements from the middle while keeping relative order leaves only one approach: shift the elements after the deleted region forward to cover it:

s := append(s[:index], s[index+n:]...)

This method needs no manual zeroing, since everything to delete is overwritten. Besides avoiding the poorly optimized for loop mentioned earlier, using append also has two advantages: the code is more concise, and it can take advantage of memmove.

Since there's no alternative to compare against, there's no benchmark here.

Reduce GC scanning pressure

Simply put: avoid storing large numbers of pointers, or structs containing pointers, in slices. The more pointers there are, the more work the GC must do when scanning objects, which eventually degrades performance.

More detailed explanations and benchmarks can be found in this article.

Applicable scenarios: when there are no special requirements and the elements aren't particularly large, prefer storing values over pointers.

The trade-off: if you store values, watch out for the overhead of the extra copying they cause.
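A sketch of the two layouts (the Item type is made up):

type Item struct {
	ID   int64
	Done bool
}

// Pointer-free elements: the GC can skip scanning the slice's contents entirely.
var byValue []Item

// The GC must follow every element's pointer while marking.
var byPointer []*Item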

Summary

In my experience, the most frequently used optimizations are preallocating memory, avoiding the for-range traps, reusing slices, and loop unrolling. They also deliver the most visible improvements.

Where the compiler's optimizations fall short, we have to apply these techniques ourselves.

Sometimes the escape analysis rules can also be exploited for optimization, but, as this article states, in most cases you shouldn't be thinking about escape analysis at all.

There is one more way: contributing code to the Go compiler itself so it produces better output. Despite the likely resistance, I believe someone capable will do it. That's also why I included the compiler code above; consider it a modest attempt to spark interest.

One more important point: whether locating performance problems or optimizing them, always rely on performance testing; never act purely on "experience" or unfounded "inference".

Finally, I hope this article becomes a handy tool for your performance work, rather than rote material memorized for interviews.

The above is the detailed content of the summary of common performance optimization methods in golang slice. For more information about go slice performance optimization, please pay attention to my other related articles!