SoFunction
Updated on 2025-03-02

A detailed introduction to an internal optimization in Go's GC

Today I'll talk about an optimization performed by the gc compiler (that is, the official Go compiler and runtime) during the scan/mark phase of garbage collection.

The description of this optimization that impressed me most is a comment in bigcache. Roughly, it says that if a map's keys and values contain no pointers, then no matter how big the map is, the GC will not scan the data stored inside it in depth; it only checks whether the map itself needs to be reclaimed.

The advantage is obvious: it greatly speeds up GC scanning and thereby reduces the performance cost of GC.
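Before digging in, here is a quick way to observe the effect bigcache describes. This is a sketch of my own, not bigcache code: it times a manual GC while a large pointer-free map is live, and again while a map full of pointer values is live. On a typical machine the second duration is dramatically larger.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// timeGC triggers one full GC cycle and returns its wall-clock duration.
func timeGC() time.Duration {
	start := time.Now()
	runtime.GC()
	return time.Since(start)
}

// gcCostNoPtr times a GC while a pointer-free map with n entries is live.
func gcCostNoPtr(n int64) time.Duration {
	m := make(map[int64]int64, n)
	for i := int64(0); i < n; i++ {
		m[i] = i
	}
	d := timeGC()
	runtime.KeepAlive(m)
	return d
}

// gcCostPtr does the same with pointer values, which the GC must follow.
func gcCostPtr(n int64) time.Duration {
	m := make(map[int64]*int64, n)
	for i := int64(0); i < n; i++ {
		v := i
		m[i] = &v
	}
	d := timeGC()
	runtime.KeepAlive(m)
	return d
}

func main() {
	const n = 2_000_000
	fmt.Println("map[int64]int64 :", gcCostNoPtr(n))
	fmt.Println("map[int64]*int64:", gcCostPtr(n))
}
```

Exact numbers vary by machine and Go version, so treat the output as a qualitative comparison rather than a benchmark.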

Reducing the number of pointers is a common optimization in its own right, but what made me curious was the "skip" mentioned in that comment. On what basis is the scan skipped, and is this kind of skipping limited to maps?

So I did a thorough search, and found nothing useful beyond restatements of that passage from bigcache.

So this article was born.

What does skip scanning mean?

Some prerequisite knowledge is indispensable.

Simply put, when the GC checks whether an object is alive, in addition to the object itself, it also needs to check whether the object's sub-objects reference other objects. Specifically:

  • The sub-objects of an array or slice are the elements stored in it.
  • The sub-objects of a map are the keys and values stored in it.
  • The sub-objects of a struct are its fields.

To check whether these sub-objects reference other objects (which determines whether those referenced objects can be reclaimed), the GC needs to scan them in depth. The more sub-objects there are, the more there is to scan. And the process is recursive, because sub-objects have sub-objects of their own; imagine nested arrays or maps.

Skipping the scan naturally means skipping the scan of these sub-objects and only checking the object itself.
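To make this concrete, here is a toy model of the mark phase. This is my own sketch, not runtime code: each toy object records whether its payload may contain pointers, and the marker only recurses into children when it does.

```go
package main

import "fmt"

// obj is a toy heap object used to illustrate the mark phase.
type obj struct {
	name     string
	noscan   bool   // true when the object's payload cannot contain pointers
	children []*obj // sub-objects: elements, keys/values, fields...
}

// mark records an object as live; it only recurses into the children
// when the noscan flag is unset.
func mark(o *obj, marked map[*obj]bool, scanned *int) {
	if o == nil || marked[o] {
		return
	}
	marked[o] = true // the object itself is still checked for liveness
	if o.noscan {
		return // skip: live, but nothing inside can reference other objects
	}
	*scanned++ // this object's sub-objects must be walked
	for _, c := range o.children {
		mark(c, marked, scanned)
	}
}

func main() {
	leaf := &obj{name: "big pointer-free buffer", noscan: true}
	root := &obj{name: "root", children: []*obj{leaf}}

	marked := map[*obj]bool{}
	scanned := 0
	mark(root, marked, &scanned)
	fmt.Println(len(marked), "objects marked,", scanned, "deep-scanned") // 2 objects marked, 1 deep-scanned
}
```

Both objects are marked live, but only the root is scanned in depth; the pointer-free leaf is never walked, which is exactly the saving discussed below.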

What kind of object can be skipped?

This was my first question: what decides whether the scan is skipped, and what controls the process?

bigcache tells us that a map whose key-value pairs contain no pointers can be skipped, so what exactly is the rule?

When no useful information can be found, the only option is to read the code. The code below is from Go 1.22.1.

The first thing that should come to mind is to start from the GC code, and I quickly found something:

	// runtime/mgcmark.go

	// The function responsible for GC scanning; its sibling gcDrainN is similar,
	// so reading this covers most of the scanning code.
	func gcDrain(gcw *gcWork, flags gcDrainFlags) {
		...
		// First mark all root objects; liveness checks start from these roots
		if work.markrootNext < work.markrootJobs {
			// The loop condition checks whether we need to be preempted;
			// if so, the function jumps to the wrap-up work and returns
			for !(gp.preempt && (preemptible || sched.gcwaiting.Load() || pp.runSafePointFn != 0)) {
				...
				markroot(gcw, job, flushBgCredit)
				...
			}
		}

		// Take objects to be scanned from the work queue and process them
		for !(gp.preempt && (preemptible || sched.gcwaiting.Load() || pp.runSafePointFn != 0)) {
			b := gcw.tryGetFast() // get an object from the work queue
			...
			scanobject(b, gcw)
			...
		}
		...
	}

If we ignore preemption, statistics and verification, the process is quite simple: first mark the roots as the starting points of the scan, then keep taking objects out of the work queue (that is, gcw) and processing them until the queue is empty.

markroot is also very simple. Depending on the type of the root object it calls scanblock or markrootSpans, and scanblock in turn calls greyobject to mark pending objects. So let's look at markrootSpans.

markrootSpans handles the memory that stores objects with finalizers:

// runtime/mgcmark.go
func markrootSpans(gcw *gcWork, shard int) {
	...
	for i := range specialsbits {
		...
		for j := uint(0); j < 8; j++ {
			// Find the span to process (just think of it as "a piece of memory")
			s := ha.spans[arenaPage+uint(i)*8+j]
			...
			lock(&s.speciallock)
			for sp := s.specials; sp != nil; sp = sp.next {
				if sp.kind != _KindSpecialFinalizer {
					continue
				}
				// don't mark finalized object, but scan it so we
				// retain everything it points to.

				// spf is the finalizer itself
				spf := (*specialfinalizer)(unsafe.Pointer(sp))
				// A finalizer can be set for an inner byte of an object, find object beginning.
				p := s.base() + uintptr(spf.special.offset)/s.elemsize*s.elemsize

				// p is the object with the finalizer set.
				// Here we check whether the memory occupied by this object is
				// flagged to skip scan marking; if it is set, the object's own
				// sub-objects are not scanned.
				if !s.spanclass.noscan() {
					scanobject(p, gcw)
				}

				// The span itself is a root object, so the rest is handled
				// directly with scanblock
				scanblock(uintptr(unsafe.Pointer(&spf.fn)), goarch.PtrSize, &oneptrmask[0], gcw, nil)
			}
			unlock(&s.speciallock)
		}
	}
}

It's actually very simple: it still boils down to finding all the objects and processing them. However, we saw something interesting: s.spanclass.noscan().

It seems that this has something to do with skipping the scan.

But let's not dig into this method for now. Why? Because finalizers are handled specially; before we have read scanobject and greyobject, we cannot assert that this method is what controls the scanning of objects. (The comments above already give it away, but if you were tracing the code yourself you wouldn't know that on a first read.)

So let's continue with scanobject, the function that scans an object's sub-objects:

	// runtime/mgcmark.go

	func scanobject(b uintptr, gcw *gcWork) {
		// Find the span containing the object
		s := spanOfUnchecked(b)
		n := s.elemsize

		// n is the object's size; it must not be 0 when this function is reached
		if n == 0 {
			throw("scanobject n == 0")
		}

		if s.spanclass.noscan() {
			// If the noscan flag is set on this memory, it is an error to get here
			throw("scanobject of a noscan object")
		}

		var tp typePointers
		if n > maxObletBytes {
			// Large objects are split into chunks and put on the work queue,
			// so they can be processed in parallel
			if b == s.base() {
				// Enqueue the chunks after splitting
				for oblet := b + maxObletBytes; oblet < s.base()+s.elemsize; oblet += maxObletBytes {
					if !gcw.putFast(oblet) {
						gcw.put(oblet)
					}
				}
			}
			...
			// Get type information
		} else {
			// Not important here
		}

		var scanSize uintptr
		for {
			var addr uintptr

			// Get the next sub-object.
			// The loop exits when next no longer returns a sub-object
			// (nothing is left to scan)
			if tp, addr = tp.nextFast(); addr == 0 {
				if tp, addr = tp.next(b + n); addr == 0 {
					break
				}
			}

			// Load the object to be processed
			scanSize = addr - b + goarch.PtrSize
			obj := *(*uintptr)(unsafe.Pointer(addr))

			// Exclude nil and pointers into the current object itself.
			// The latter is a reclaimable circular reference: whether the current
			// object can be reclaimed is not affected by such a pointer.
			// If the current object is unreachable, its fields are naturally
			// unreachable too; both are unreachable from the roots.
			// If such a pointer is reachable from elsewhere, then the current
			// object's field is referenced and the object won't be reclaimed.
			// So pointer fields into the current object itself need no processing.
			if obj != 0 && obj-b >= n {
				if obj, span, objIndex := findObject(obj, b, addr-b); obj != 0 {
					greyobject(obj, b, addr-b, span, gcw, objIndex)
				}
			}
		}
		...
	}

This function is long but well organized, and the structure is clear:

  • First, if the object is too large, its memory is split into small chunks that are handed to other workers via the work queue for parallel processing.
  • Then all sub-objects are scanned, and greyobject is used to mark them.

Because this function is itself the scan, there is no "skip" logic in it. Moreover, as you have seen, calling this function on an object that does not require sub-object scanning triggers a throw, which makes the program report an error and exit.

So the secret must be inside greyobject. Check out the code:

	// runtime/mgcmark.go

	func greyobject(obj, base, off uintptr, span *mspan, gcw *gcWork, objIndex uintptr) {
		...
		if useCheckmark {
			if setCheckmark(obj, base, off, mbits) {
				// Already marked.
				return
			}
		} else {
			...
			// If marked we have nothing to do.
			if mbits.isMarked() {
				return
			}
			mbits.setMarked()
			...
			// If the memory is flagged as needing no further scanning, the rest
			// of the process is skipped (the rest would put the object on the GC
			// work queue, waiting to be taken out and scanned)
			if span.spanclass.noscan() {
				...
				return
			}
		}

		// The object is put on the work queue, waiting to be scanned
	}

This function first checks whether the object has already been processed, then marks it, then checks whether the noscan flag is set and, if so, returns to the caller. Only when there is no instruction to skip further scanning is the object put on the work queue, waiting for gcDrain or its siblings to deal with it.

Now we can draw a conclusion: whether the scan is skipped is entirely controlled by the noscan flag set on the memory. Once it is set, the scan is skipped.

As for whether that memory holds a map, a slice or a struct, it doesn't matter.

The specific process of skipping the scan

After reading the code above, you may still be confused: where exactly does the skipping happen in the process?

It doesn't matter; two examples will make it clear.

The first example is a top-level global scannable object A. We haven't yet discussed under what circumstances noscan is set, so ignore A's concrete type for now; just know that it can skip the scan.

The scanning process of A is as follows:

  • GC starts running and marks the root objects first.
  • A is one of the roots, so it is handled either by scanblock or by markrootSpans.
  • Assume A has a finalizer set. Because A can skip scanning its sub-objects, markrootSpans will not call scanobject on it and goes straight to scanblock.
  • scanblock calls greyobject to process the objects in that memory.
  • Because A can skip the scan, greyobject returns right after marking, and A does not enter the work queue.
  • A's scan ends; there is no scanobject call for it.

A's example is relatively simple. Now assume there is an object B that is not a root object. B itself cannot skip scanning, but B has a sub-object C that can. Let's look at C's scanning process:

  • B is not a root and cannot be skipped; it is a sub-object of some root, so by now it must be in the GC work queue.
  • gcDrain takes B from the queue and hands it to scanobject.
  • Assume B is not very large and will not be split (it works out the same if it is).
  • scanobject processes each of B's sub-objects with greyobject, and C is no exception.
  • Because C can skip scanning, greyobject returns right after marking, and C does not enter the work queue.
  • C's scan is over; there is no scanobject call for C in the entire process.

This basically covers all cases. The ones I didn't describe separately, such as "a skippable object E that is a sub-object of a non-skippable root object D", are really no different from case 2.

Now we know how the scanning of an object's sub-objects is skipped, and only one question remains: how is the noscan flag set?

How the noscan flag is set

Before going deeper, let's take a brief look at how Go allocates memory. A complete explanation wouldn't fit in five long articles, so I'll simplify the concepts.

In Go, the mspan is the basic unit of memory allocation. One mspan can allocate multiple objects of similar size that fall into the same class (for example, 13-byte and 14-byte objects belong to the same class and are allocated from mspans that store objects of at most 16 bytes). This "class" is called the sizeclass. A simple mental model is to treat an mspan as a list of similarly sized objects.
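The rounding idea can be sketched in a few lines. Note this is an illustrative model of my own; the size-class values below match the small end of the runtime's generated table, but the real table is much longer.

```go
package main

import "fmt"

// classSizes is a small illustrative subset of span size classes; the
// runtime's generated table continues far beyond this.
var classSizes = []uintptr{8, 16, 24, 32, 48, 64, 80, 96, 112, 128}

// roundToClass returns the object size actually used for a request:
// the smallest class that fits it.
func roundToClass(size uintptr) uintptr {
	for _, c := range classSizes {
		if size <= c {
			return c
		}
	}
	return size // beyond the table, a large object gets its own span
}

func main() {
	fmt.Println(roundToClass(13)) // 16: 13- and 14-byte objects share a class
	fmt.Println(roundToClass(14)) // 16
	fmt.Println(roundToClass(33)) // 48
}
```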

To speed up memory allocation, Go preallocates a chunk of memory for each thread and divides it into portions by sizeclass, one mspan per sizeclass. This structure is called the mcache.

Of course, there are always objects that exceed the range covered by the mcache's sizeclasses. In that case Go requests a large chunk of memory from the system and hands it to an mspan.

The structure that stores a span's information, such as its sizeclass and the noscan flag, is called spanClass. It is stored as a field in the mspan's control structure.

Knowing this, we can understand s.spanclass.noscan(): it checks whether the mspan's spanClass carries the flag saying sub-objects need not be scanned.

Only one function creates a spanClass, makeSpanClass:

	// runtime/mheap.go

	type spanClass uint8

	func makeSpanClass(sizeclass uint8, noscan bool) spanClass {
		return spanClass(sizeclass<<1) | spanClass(bool2int(noscan))
	}

Now the question is simple: we just need to track who calls this function. We also know something extra: those callers must also obtain an mspan from the mcache or from system memory, which narrows the range.

Following that idea, we quickly find one of the entry points through which Go allocates objects: mallocgc.

	// runtime/malloc.go

	func mallocgc(size uintptr, typ *_type, needzero bool) unsafe.Pointer {
		...
		// size is the size of the type.
		// typ is the type information of the object being created; when only
		// raw memory is allocated, typ is nil.
		// noscan: typ is nil, or typ contains no pointers
		noscan := typ == nil || !typ.Pointers()

		if size is small enough to be allocated from an mspan {
			if size can be served by the tiny allocator {
				...
			} else {
				// Compute the sizeclass (which class of span it belongs to)
				spc := makeSpanClass(sizeclass, noscan) // noscan is passed in here
				span = c.alloc[spc]     // get an mspan from the mcache
				v := nextFreeFast(span) // actually take free memory from the mspan
				// below: zeroing the memory, maintaining GC information, etc.
			}
		} else {
			// Large-object allocation.
			// This also calls makeSpanClass(0, noscan), then uses the span's
			// information to request memory from the system
			span = c.allocLarge(size, noscan) // noscan is passed in here
			// below: zeroing the memory, maintaining GC information, etc.
		}
		...
	}

Even when the sizeclass is the same, different noscan values produce different spanClass values. For a large object that can be skipped, the memory allocated for it is marked noscan; a small object that can be skipped is placed directly in an mcache-preallocated memory region that requires no deep scanning.

Then who calls mallocgc? Too many to count, because both new and make end up using it. Let's take slice and map as examples.

First, slices. This is very simple; the entry point for creating a slice is makeslice:

	// runtime/slice.go

	func makeslice(et *_type, len, cap int) unsafe.Pointer {
		mem, overflow := math.MulUintptr(et.Size_, uintptr(cap))
		if overflow || mem > maxAlloc || len < 0 || len > cap {
			// NOTE: Produce a 'len out of range' error instead of a
			// 'cap out of range' error when someone does make([]T, bignumber).
			// 'cap out of range' is true too, but since the cap is only being
			// supplied implicitly, saying len is clearer.
			// See golang.org/issue/4085.
			mem, overflow := math.MulUintptr(et.Size_, uintptr(len))
			if overflow || mem > maxAlloc || len < 0 {
				panicmakeslicelen()
			}
			panicmakeslicecap()
		}

		return mallocgc(mem, et, true)
	}

The element type information of the slice is passed to mallocgc. If the slice's elements contain no pointers, the slice can skip scanning.

Maps are a bit special: what skips scanning is their buckets, which are invisible to the outside world:

	// runtime/map.go

	// Call chain: makemap -> makeBucketArray -> newarray -> mallocgc
	func makeBucketArray(t *maptype, b uint8, dirtyalloc unsafe.Pointer) (buckets unsafe.Pointer, nextOverflow *bmap) {
		base := bucketShift(b)
		nbuckets := base
		// For small b, overflow buckets are unlikely.
		// Avoid the overhead of the calculation.
		if b >= 4 {
			// Add on the estimated number of overflow buckets
			// required to insert the median number of elements
			// used with this value of b.
			nbuckets += bucketShift(b - 4)
			sz := t.Bucket.Size_ * nbuckets
			up := roundupsize(sz, !t.Bucket.Pointers())
			if up != sz {
				nbuckets = up / t.Bucket.Size_
			}
		}

		if dirtyalloc == nil {
			// t.Bucket's type information records whether the key-value pairs contain pointers
			buckets = newarray(t.Bucket, int(nbuckets))
		} else {
			// dirtyalloc was previously generated by
			// the above newarray(t.Bucket, int(nbuckets))
			// but may not be empty.
			buckets = dirtyalloc
			size := t.Bucket.Size_ * nbuckets
			if t.Bucket.Pointers() {
				memclrHasPointers(buckets, size)
			} else {
				memclrNoHeapPointers(buckets, size)
			}
		}

		if base != nbuckets {
			// We preallocated some overflow buckets.
			// To keep the overhead of tracking these overflow buckets to a minimum,
			// we use the convention that if a preallocated overflow bucket's overflow
			// pointer is nil, then there are more available by bumping the pointer.
			// We need a safe non-nil pointer for the last overflow bucket; just use buckets.
			nextOverflow = (*bmap)(add(buckets, base*uintptr(t.BucketSize)))
			last := (*bmap)(add(buckets, (nbuckets-1)*uintptr(t.BucketSize)))
			last.setoverflow(t, (*bmap)(buckets))
		}
		return buckets, nextOverflow
	}

	// runtime/malloc.go
	func newarray(typ *_type, n int) unsafe.Pointer {
		if n == 1 {
			return mallocgc(typ.Size_, typ, true)
		}
		mem, overflow := math.MulUintptr(typ.Size_, uintptr(n))
		if overflow || mem > maxAlloc || n < 0 {
			panic(plainError("runtime: allocation size out of range"))
		}
		return mallocgc(mem, typ, true)
	}

You can see that if the key-value pairs contain no pointers, the map's buckets can be skipped.

So, to summarize: noscan is set as long as the object being created contains no pointers (for example, array/slice elements of pointer-free types, map key-value pairs without pointers, structs whose fields all contain no pointers), or the allocation is just a block of raw memory (makeslicecopy, which allocates memory and then copies data into it, checks whether the element type contains pointers and passes nil as the type to mallocgc when it does not).

Now all the questions are answered: noscan is set according to type information at allocation time, and it is not only maps that can skip scanning; slices, maps and structs that meet the condition all qualify.

Improvements brought by optimization

Having said all that, how much improvement does this optimization actually bring?

Let's look at an example:

	var a int64 = 1000

	func generateIntSlice(n int64) []int64 {
		ret := make([]int64, 0, n)
		for i := int64(0); i < n; i++ {
			ret = append(ret, a)
		}
		return ret
	}

	func generatePtrSlice(n int64) []*int64 {
		ret := make([]*int64, 0, n)
		for i := int64(0); i < n; i++ {
			ret = append(ret, &a)
		}
		return ret
	}

	func BenchmarkGCScan1(b *testing.B) {
		// Disable automatic GC for the duration of the test
		defer debug.SetGCPercent(debug.SetGCPercent(-1))
		for i := 0; i < b.N; i++ {
			for j := 0; j < 20; j++ {
				generatePtrSlice(10000)
			}
			runtime.GC()
		}
	}

	func BenchmarkGCScan2(b *testing.B) {
		defer debug.SetGCPercent(debug.SetGCPercent(-1))
		for i := 0; i < b.N; i++ {
			for j := 0; j < 20; j++ {
				generateIntSlice(10000)
			}
			runtime.GC()
		}
	}

We create 20 slices, each holding 10,000 int64 or *int64 values (both element types are 8 bytes on x64), then manually trigger one GC. To make the results more accurate, we disable automatic GC before the test starts; the slices have the same length and their elements the same size, so overall the result should be close to the real cost of GC reclaiming this memory.

Here is the result:

	goos: windows
	goarch: amd64
	cpu: Intel(R) Core(TM) i5-10200H CPU @ 2.40GHz

	         │   sec/op    │   sec/op    vs base                │
	GCScan-8   379.0µ ± 2%   298.0µ ± 2%  -21.51% (p=0.000 n=10)

	         │    B/op     │    B/op     vs base           │
	GCScan-8   1.563Mi ± 0%  1.563Mi ± 0%  ~ (p=0.438 n=10)

	         │  allocs/op  │  allocs/op  vs base              │
	GCScan-8   20.00 ± 0%    20.00 ± 0%   ~ (p=1.000 n=10) ¹

	¹ all samples are equal

Memory usage is identical, but the pointer version (the base column) is over 20% slower, and the gap grows as the slice grows. The improvement from skipping the scan is clearly substantial.

In addition, using fewer pointers also improves data locality, which benefits more than just GC scanning.

How to use this optimization

Finally, let's take a look at how to take advantage of this optimization.

Everyone knows that using fewer pointers relieves GC pressure, but there are times when you seemingly "have to" use pointers.

Take a local cache as an example:

	type Cache[K comparable, V any] struct {
		m map[K]*V
	}

	func (c *Cache[K, V]) Get(key K) *V {
		return c.m[key]
	}

	func (c *Cache[K, V]) Set(key K, value *V) {
		c.m[key] = value
	}

There are two reasons to use pointers for the values. First, map elements are not addressable: if we want the cached data to be freely usable, we have to copy it through temporary variables, which makes updating a value a chore. Second, if the value is large, the cost of copying is high; a cache exists to improve performance, so what if it ends up reducing it?

But doing so means every key-value pair in the map will be scanned, and with many key-value pairs the performance will be dismal.

This looks like a "must use pointers" scenario, but is it really? Considering that a cache is already a space-for-time trade, we might as well spend a little more space:

	type index = int

	type Cache[K comparable, V any] struct {
		buf []V
		m   map[K]index
	}

	func (c *Cache[K, V]) Get(key K) *V {
		idx, ok := c.m[key]
		if !ok {
			return nil
		}
		// Taking the address of data stored in a slice is allowed
		return &c.buf[idx]
	}

	func (c *Cache[K, V]) Set(key K, value V) {
		idx, ok := c.m[key]
		if !ok {
			// Add a new entry
			c.m[key] = len(c.buf)
			c.buf = append(c.buf, value)
			return
		}
		// Overwrite an existing entry
		c.buf[idx] = value
	}

We use a slice to store all the values and map each key to an index in the slice. Slice elements are addressable, so we can simply hand out a pointer to the value, and updates can go through the pointer obtained from Get; it stays simple and convenient.

Now look at buf and m: as long as K and V contain no pointers, neither does either structure, so no matter how much we store in the cache, the GC only needs to check whether the outer Cache object itself is alive.

But there are costs:

  • Get is slightly slower: besides the extra checks, it has to touch two different data structures, which is not cache-friendly.
  • A copy is unavoidable when storing data into the Cache.
  • The pointer returned by Get is not stable; it becomes invalid after the underlying buf grows.
  • Deleting elements is slow, because we use a slice and must maintain the mapping in the map. There are many solutions: for example, swap the element to be deleted with the last element of the slice, so no other elements need to move and only one entry needs fixing; or waste a bit more memory and use tombstones to simulate deletion; or simply don't provide deletion at all (if it's hard to do well, just don't do it; a very golang approach 👍).
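The swap trick in the last bullet needs a way to find the key of the element it moves, so this sketch (my own extension of the Cache above, adding a parallel keys slice that is not in the original) keeps keys alongside values:

```go
package main

import "fmt"

// Cache is the index-based cache from above, extended with a parallel keys
// slice so Delete can fix up the map entry of the element it moves.
type Cache[K comparable, V any] struct {
	keys []K // keys[i] is the key whose value lives at buf[i]
	buf  []V
	m    map[K]int
}

func NewCache[K comparable, V any]() *Cache[K, V] {
	return &Cache[K, V]{m: make(map[K]int)}
}

func (c *Cache[K, V]) Set(key K, value V) {
	if idx, ok := c.m[key]; ok {
		c.buf[idx] = value
		return
	}
	c.m[key] = len(c.buf)
	c.keys = append(c.keys, key)
	c.buf = append(c.buf, value)
}

func (c *Cache[K, V]) Get(key K) *V {
	idx, ok := c.m[key]
	if !ok {
		return nil
	}
	return &c.buf[idx]
}

// Delete swaps the victim with the last element, so only one entry moves
// and only that entry's index in the map needs updating.
func (c *Cache[K, V]) Delete(key K) {
	idx, ok := c.m[key]
	if !ok {
		return
	}
	last := len(c.buf) - 1
	c.buf[idx] = c.buf[last]   // move the tail into the vacated slot
	c.keys[idx] = c.keys[last] // remember which key it belongs to
	c.m[c.keys[idx]] = idx     // fix that key's index
	c.buf = c.buf[:last]
	c.keys = c.keys[:last]
	delete(c.m, key)
}

func main() {
	c := NewCache[string, int]()
	c.Set("a", 1)
	c.Set("b", 2)
	c.Set("c", 3)
	c.Delete("a")
	fmt.Println(c.Get("a") == nil, *c.Get("b"), *c.Get("c")) // true 2 3
}
```

One caveat: if K itself contains pointers (string keys, for instance), the keys slice becomes scannable again, so the full benefit only holds for pointer-free key and value types.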

By the way, the bigcache mentioned above uses a similar approach to reduce GC scanning pressure.

So I suggest first using benchmarks and traces to confirm that GC really is the performance bottleneck; otherwise the optimization above may fail to improve anything while adding plenty of extra trouble.

Summary

Now we know that for objects containing no pointers, the GC skips scanning their internal sub-objects, and this optimization applies to more than just maps.

Although an interface does not look like a pointer, it holds one internally, so interfaces do need to be scanned in depth.

Also, it should be emphasized that this is just an optimization in the official Go implementation. There is no guarantee that other implementations such as gccgo or tinygo have similar optimizations. But using fewer pointers to relieve GC pressure is a consensus across most languages, and that advice is never wrong.

Finally, the old saying: premature optimization is the root of all evil; but using that line as an excuse for a low-performance design is even more wrong. Performance questions are settled by data, so do more benchmarking and less daydreaming.

That's the end of this detailed introduction to Go GC's internal optimization. For more on the topic, please search my earlier articles or continue browsing the related articles below. I hope you'll keep supporting me!