
Detailed explanation of common misuses of errgroup in Golang

I believe every Go programmer with a little experience has heard of errgroup, and many real projects use it. Like sync.WaitGroup, it can start a group of goroutines and wait until they have all finished. In addition, errgroup cancels the associated context as soon as one goroutine returns an error, and it can also limit how many goroutines run.
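
For readers who have not used it yet, here is a minimal sketch of typical errgroup usage. The tasks and their count are made up purely for illustration; only the shape of the API matters:

package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/errgroup"
)

func main() {
	// The group's context is cancelled as soon as any task returns an error.
	group, errCtx := errgroup.WithContext(context.Background())
	for i := 0; i < 3; i++ {
		i := i
		group.Go(func() error {
			// Real work should also watch errCtx.Done() to stop early.
			fmt.Println("task", i, "started, ctx err:", errCtx.Err())
			return nil
		})
	}
	// Wait blocks until every task has returned, then yields the first error.
	if err := group.Wait(); err != nil {
		fmt.Println("failed:", err)
	}
}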

But during day-to-day code reviews I keep running into a few recurring problems. Some of them cost at most a bit of performance, while others lead to resource leaks or even deadlock panics.

Here is a record of these typical misuses.

Extra context nesting

Let me start with a misuse that is not very common, but that I have still run into two or three times.

We know that when a goroutine returns an error, the context created together with the group is cancelled. This lets the other goroutines in the same group know that an error has occurred and that they should stop as soon as possible.

Therefore, the context used by the errgroup should be a new context derived from the current one, so that any cancellation does not affect anything outside the errgroup.

And that is exactly where the first common misuse appears:

func DoWork(ctx context.Context) {
    errCtx, cancel := context.WithCancel(ctx)
    defer cancel()
    group, errCtx := errgroup.WithContext(errCtx)
    ...
}

Where is the misuse? errgroup.WithContext already derives a new context for us, so apart from setting a timeout there is usually no need to wrap the context again. See the source code:

// /golang/sync/blob/master/errgroup/

// WithContext returns a new Group and an associated Context derived from ctx.
//
// The derived Context is canceled the first time a function passed to Go
// returns a non-nil error or the first time Wait returns, whichever occurs
// first.
func WithContext(ctx context.Context) (*Group, context.Context) {
	ctx, cancel := withCancelCause(ctx)
	return &Group{cancel: cancel}, ctx
}

// /golang/sync/blob/master/errgroup/

func withCancelCause(parent context.Context) (context.Context, func(error)) {
	return context.WithCancelCause(parent)
}

The extra nesting wastes memory and hurts performance, especially when you need to read values out of the context: a Value lookup walks the chain of nested contexts recursively, so the more layers there are, the slower the lookup becomes.
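
For comparison, here is a corrected sketch of the DoWork function above with the redundant wrapping removed. The body is only a placeholder, and the imports ("context", "golang.org/x/sync/errgroup") are assumed as in the other snippets:

func DoWork(ctx context.Context) error {
	// errgroup.WithContext already derives a cancellable context from ctx,
	// so there is no extra context.WithCancel layer here.
	group, errCtx := errgroup.WithContext(ctx)
	group.Go(func() error {
		// placeholder work; real code should watch errCtx
		return errCtx.Err()
	})
	return group.Wait()
}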

However, as mentioned above, there is one case where wrapping is acceptable: setting a timeout for all goroutines in the errgroup:

func DoWork(ctx context.Context) {
    errCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
    defer cancel()
    group, errCtx := errgroup.WithContext(errCtx)
    ...
}

At the moment this is the only way to set a timeout, so it counts as a special case.

When Wait returns

The second misuse is more common than the first. It mostly comes from a misunderstanding of how errgroup behaves.

The misunderstanding usually goes like this: if a goroutine returns an error, or the ctx times out, Wait will return immediately.

This is not true.

Let's first look at what the documentation of Wait says:

Wait blocks until all function calls from the Go method have returned, then returns the first non-nil error (if any) from them.

Wait does not return until all goroutines have returned. Even if the timeout has fired and the context has been cancelled, you still have to wait for every goroutine to exit first. Let's look at the code:

// /golang/sync/blob/master/errgroup/

func (g *Group) Wait() error {
	g.wg.Wait()
	if g.cancel != nil {
		g.cancel(g.err)
	}
	return g.err
}

You can see that it really does wait for all goroutines to return first. If you are observant, you may notice that errgroup wraps the goroutines it starts. Is there anything in that wrapper that aborts a goroutine early? Let's look at the code:

// /golang/sync/blob/master/errgroup/

func (g *Group) Go(f func() error) {
	// The code that checks the goroutine limit is omitted for now
	g.wg.Add(1)
	go func() {
		defer g.done() // The point is here
		if err := f(); err != nil {
			g.errOnce.Do(func() {
				g.err = err
				if g.cancel != nil {
					g.cancel(g.err)
				}
			})
		}
	}()
}

Note the defer: done only executes after the wrapped function has finished, that is, after your own function f has returned, the error has been recorded, and ctx has been cancelled.

If your function does not check for timeouts or context cancellation, leaks and hangs will come knocking, as in the following example:

func main() {
    errCtx, cancel := context.WithTimeout(context.Background(), 1*time.Second)
    defer cancel()
    group, errCtx := errgroup.WithContext(errCtx)
    group.Go(func() error {
        time.Sleep(10 * time.Second)
        fmt.Println("running")
        return nil
    })
    group.Go(func() error {
        return errors.New("error")
    })
    fmt.Println(group.Wait())
}

Guess the output and the running time. The answer is running\nerror\n, and it takes a bit more than 10 seconds to finish.

This misuse is easy to spot: if the function passed to Go does not handle errCtx properly, there is probably a problem.
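
A sketch of how the slow task above could respect errCtx so that Wait returns quickly. It replaces the unconditional sleep with a select on errCtx.Done(); the names follow the example above:

group.Go(func() error {
	select {
	case <-time.After(10 * time.Second):
		fmt.Println("running")
		return nil
	case <-errCtx.Done():
		// another task failed or the 1-second timeout fired; stop early
		return errCtx.Err()
	}
})

With this change, Wait returns almost immediately after the second task reports its error, instead of after 10 seconds.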

To be fair, though, the signature of Go does not follow the usual convention of accepting a context, and Wait behaves differently from languages that can cancel thread execution automatically, which invites misuse. Not all the blame lies with the programmers; language and interface design share it.

SetLimit and Deadlock

This one is even more common, especially when errgroup is used as a general-purpose goroutine pool.

Let's start with my favorite guessing game: what happens when the following code runs?

func main() {
    group, _ := errgroup.WithContext(context.Background())
    group.SetLimit(2) // Idea: only 2 goroutines may run at the same time, while more tasks are submitted to the "pool"
    group.Go(func() error {
        fmt.Println("running 1")
        // Run subtasks
        group.Go(func() error {
            fmt.Println("sub running 1")
            return nil
        })
        group.Go(func() error {
            fmt.Println("sub running 2")
            return nil
        })
        return nil
    })
    group.Go(func() error {
        fmt.Println("running 2")
        // Run subtasks
        group.Go(func() error {
            fmt.Println("sub running 3")
            return nil
        })
        group.Go(func() error {
            fmt.Println("sub running 4")
            return nil
        })
        return nil
    })
    fmt.Println(group.Wait())
}

The answer: it panics with a deadlock, and it is triggered 100% of the time.

I will explain why in detail, but first an important point:

What SetLimit limits is not the number of goroutines running at the same time, but the maximum number of goroutines the errgroup can hold. A goroutine held by the errgroup may be running or waiting to run.

If each "running" task spawned only one "sub running" task, there would be a small chance of not deadlocking, which is why I deliberately made each one spawn two. The reason is not complicated; after the explanation below you should be able to reason it out yourself.

Let's work through it. First look at the code of SetLimit, where everything starts:

// /golang/sync/blob/master/errgroup/

// SetLimit limits the number of active goroutines in this group to at most n.
// A negative value indicates no limit.
//
// Any subsequent call to the Go method will block until it can add an active
// goroutine without exceeding the configured limit.
//
// The limit must not be modified while any goroutines in the group are active.
func (g *Group) SetLimit(n int) {
	if n < 0 {
		g.sem = nil
		return
	}
	if len(g.sem) != 0 {
		panic(fmt.Errorf("errgroup: modify limit while %v goroutines in the group are still active", len(g.sem)))
	}
	g.sem = make(chan token, n)
}

sem is a chan token, where token is struct{}. What SetLimit does is simple: if n is not negative, it creates a channel with capacity n; if n is negative, it clears the limit. If you have some experience you can already tell that this is a simple ticket pool, a pattern also used in grpc.

The ticket pool pattern works like this: create a channel with a fixed capacity n, write a value into it right before a goroutine starts, and read a value back out when a goroutine finishes (you may read out a value written by someone else, but as long as every write is paired with a later read, that is fine). If the write blocks, it means n goroutines are already in flight and the new one must wait until some goroutine finishes and reads a value out; under normal circumstances the read never blocks. This is one of the most common ways to limit the number of goroutines. Depending on whether the write happens inside the goroutine or in the caller that starts it, the pattern limits either the "maximum number of goroutines running simultaneously" or the "total number of goroutines", where total number of goroutines = goroutines running + goroutines waiting to run.
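
Here is a minimal hand-rolled sketch of the pattern, not errgroup's code: the ticket is taken inside the goroutine, so it limits only how many run at the same time, not how many are created. Imports ("sync", "fmt") are assumed and the numbers are arbitrary:

// At most 3 goroutines do work at the same time; 10 goroutines are created.
sem := make(chan struct{}, 3)
var wg sync.WaitGroup
for i := 0; i < 10; i++ {
	wg.Add(1)
	go func(id int) {
		defer wg.Done()
		sem <- struct{}{}        // take a ticket before working
		defer func() { <-sem }() // return the ticket when done
		fmt.Println("working", id)
	}(i)
}
wg.Wait()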

errgroup is the latter kind. Remember the part of Go I omitted earlier? Now we can read it:

// /golang/sync/blob/master/errgroup/

func (g *Group) Go(f func() error) {
	if g.sem != nil {
		g.sem <- token{} // token is struct{}
	}

	g.wg.Add(1)
	go func() {
		defer g.done()

		if err := f(); err != nil {
			// set the error value
		}
	}()
}

func (g *Group) done() {
	if g.sem != nil {
		<-g.sem // read back from the ticket pool
	}
	g.wg.Done()
}

When you enter Go, sem is checked before the goroutine is started. If a limit has been set, a value must be written into the ticket pool first; only when the write succeeds is the goroutine created, and the value is read back out after the goroutine finishes. This is what limits the maximum number of goroutines the errgroup can hold: once the limit is reached, Go blocks instead of creating new goroutines.

Before Go finishes writing to sem and executes the go statement, the errgroup does not yet "hold" the goroutine; after the goroutine finishes and its value is read back out of sem, the group no longer "holds" it either.

The problem lies in where the write happens. Suppose the scheduler runs our quiz code like this:

  • First the running 1 goroutine starts; sem has 2 free slots, so it runs normally. The value it wrote will only be read out after running 1 finishes.
  • Then running 2 starts; sem still has one free slot, so no problem. Its value will only be read out after running 2 finishes.
  • running 2 happens to execute first and gets ready to create a goroutine for sub running 3.
  • At this point sem has no free slots, so the Go call creating sub running 3 blocks.
  • The scheduler sees that running 2 is blocked and lets running 1 execute (on a multi-core processor they were likely running at the same time anyway).
  • After printing running 1, it gets ready to create a goroutine for sub running 1.
  • sem is still full, so Go blocks again.
  • The scheduler sees that running 1 and running 2 are both blocked, so it can only let the main goroutine execute (the runtime's own goroutines are ignored here because they do not affect deadlock detection).
  • main itself is blocked in Wait, waiting for all the other goroutines to finish.
  • No goroutine can make progress; all of them are blocked (note: blocked, not sleeping). Deadlock detection notices this and panics.

The actual execution order will certainly differ, but the cause of the deadlock is the same: the outer goroutines do not give up their tickets, the subtasks must write into the pool before they can run, and the outer goroutines holding the tickets cannot finish (and release them) until their nested Go calls return, which in turn requires a free ticket. This is a classic deadlock caused by a circular dependency, and the trigger is nested use of the same errgroup.

What leads people into this trap? Most likely the word "active" in the documentation. The word is vague: it could mean goroutines that are actually running, or everything that has been submitted whether it is running or not. Thanks to the second paragraph of the doc comment you can infer from context that "active" means every goroutine held by the group, running or not. But if you only read the first sentence and use it with bold confidence, the pit is waiting for you. Even native speakers find such wording ambiguous without enough context, let alone those of us for whom English is a second or even third language.

There is a reason errgroup chooses to limit the total number of goroutines: limiting only the number running simultaneously does nothing to cap the total. Goroutines are light, but they still occupy memory and cost CPU to schedule, and letting them grow unchecked can be catastrophic. For example, accidentally creating a million goroutines in a loop puts severe pressure on memory and the scheduler; capping the total avoids that.

Fortunately, this misuse is also easy to spot: any nested use of the same errgroup is a red flag.

Luckily, if there are no nested calls, then whatever number SetLimit is given, the top-level goroutines are limited (or left unlimited) as expected. What it cannot handle are the nested calls made by goroutines derived from the top-level ones. As long as the same group is not used in a nested way, there is no problem.

The first two misuses should simply be avoided, but nesting errgroups, while rare, is genuinely useful, so I will also offer some simple solutions for reference.

The first is to set a large enough limit. Sharp readers will have noticed that if the limit is set to the total number of goroutines that may exist in the group at the same time (top level plus everything derived from it), the problem goes away. That is true, but I don't recommend it, for a few reasons (a sketch of this workaround follows the list below):

  • Once you set the total, you can no longer bound the number of goroutines running simultaneously. Controlling concurrent execution precisely is awkward in Go; the limit can usually only act as an upper bound, and if that bound is too large, trouble follows. Say your system can only afford to run 3 goroutines at once and another task already occupies one; to avoid deadlock you set the limit to 4, and now resource contention and scheduling latency rise sharply. Your system is one step away from falling over.
  • Calculating the number is a pain. In the example above you can work it out as 4, but what if I add another layer of nesting, a few more tasks, or a conditional branch that may or may not call Go? Set the limit too high and it no longer bounds anything; set it too low and you deadlock.
  • The limit is usually a hard-coded constant or an outright magic number. The next time the goroutine logic changes, the number most likely has to change with it; miscalculate it or forget to update it and you are in trouble, with the deadlock buried like a landmine.
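
As promised, here is a sketch of this first workaround applied to the quiz code. The structure is unchanged; only the limit is raised to 4, the value worked out above, which is enough for the two top-level tasks plus the subtasks they are submitting at any moment (imports are assumed as in the other snippets):

func main() {
    group, _ := errgroup.WithContext(context.Background())
    // Workaround 1: make the limit large enough that nested Go calls
    // on the same group can no longer block each other into a deadlock.
    group.SetLimit(4)
    group.Go(func() error {
        fmt.Println("running 1")
        group.Go(func() error { fmt.Println("sub running 1"); return nil })
        group.Go(func() error { fmt.Println("sub running 2"); return nil })
        return nil
    })
    group.Go(func() error {
        fmt.Println("running 2")
        group.Go(func() error { fmt.Println("sub running 3"); return nil })
        group.Go(func() error { fmt.Println("sub running 4"); return nil })
        return nil
    })
    fmt.Println(group.Wait())
}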

In summary, you should prefer the second method: never nest calls on the same errgroup. If you really need nesting, use a new errgroup instance; that avoids the deadlock and also matches the semantics of the task at hand:

func main() {
    group, errCtx := errgroup.WithContext(context.Background())
    group.SetLimit(1) // Limit how many top-level tasks the group holds at once
    group.Go(func() error {
        fmt.Println("running 1")
        // Run subtasks
        // Create a new errgroup, reusing the outer group's context
        subGroup, _ := errgroup.WithContext(errCtx)
        subGroup.SetLimit(1)
        subGroup.Go(func() error {
            fmt.Println("sub running 1")
            return nil
        })
        subGroup.Go(func() error {
            fmt.Println("sub running 2")
            return nil
        })
        fmt.Println(subGroup.Wait())
        return nil
    })
    group.Go(func() error {
        fmt.Println("running 2")
        // Run subtasks
        subGroup, _ := errgroup.WithContext(errCtx)
        subGroup.SetLimit(1)
        subGroup.Go(func() error {
            fmt.Println("sub running 3")
            return nil
        })
        subGroup.Go(func() error {
            fmt.Println("sub running 4")
            return nil
        })
        fmt.Println(subGroup.Wait())
        return nil
    })
    fmt.Println(group.Wait())
}

Yes, even with every limit set to 1 there is no deadlock, because without nested calls on the same group there is no circular dependency on the tickets.

Of course, there is also the ultimate solution: stop treating errgroup as a goroutine pool. If your needs are complex enough to require one, use a real, full-featured goroutine pool such as ants.

By the way, you might ask what happens if you pass 0 to SetLimit. A deadlock, of course. That actually matches the semantics: the group is not allowed to hold any goroutines, so calling Go on it is simply wrong, and a deadlock panic is the appropriate outcome. Passing 0 causes a deadlock, but that is neither a pitfall nor a misuse.
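
A tiny sketch of that edge case, for completeness (imports assumed as above):

func main() {
    group, _ := errgroup.WithContext(context.Background())
    group.SetLimit(0) // the group may hold zero goroutines
    group.Go(func() error { // blocks forever writing to a zero-capacity sem
        return nil
    })
    fmt.Println(group.Wait()) // never reached: the runtime reports the deadlock
}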

Summary

To summarize the three misuses above:

  • Passing a context with redundant nesting to the errgroup
  • Not handling context cancellation and timeouts properly in the goroutines added to the errgroup
  • Nesting calls on the same errgroup

Existing static analysis tools are not very good at catching these problems; you either write a checker yourself or rely on code review.

The conventional wisdom is that Go is simple and easy to use, but that is not always the case. As the saying goes, "simple is not easy", and Go users have to pay the corresponding price for that simplicity.

That concludes this detailed look at common misuses of errgroup in Golang. For more on Go's errgroup, please check out my other related articles!