MatrixOne is a new generation of hyperconverged heterogeneous databases, committed to building a minimalist big data engine that handles multiple loads such as TP, AP, stream computing, etc. in a single architecture. MatrixOne is developed by Go and was opened in October 2021. It has now released to version 0.3. In MatrixOne's performance report, it is not inferior to Clickhouse, the industry-leading OLAP database. As a database implemented in Go language, it can achieve the same performance as database implemented in C++. One of the important optimizations is to use the assembly capabilities of Go language to accelerate hardware by calling SIMD instructions. This article will give a detailed introduction to Go assembly and its application on MatrixOne.
What is a MatrixOne database?
MatrixOne is a new generation of hyperconverged heterogeneous databases, committed to building a minimalist big data engine that handles multiple loads such as TP, AP, stream computing, etc. in a single architecture. MatrixOne is developed by Go and was opened in October 2021. It has now released to version 0.3. In MatrixOne's performance report, it is not inferior to Clickhouse, the industry-leading OLAP database. As a database implemented in Go language, it can achieve the same performance as database implemented in C++. One of the important optimizations is to use the assembly capabilities of Go language to accelerate hardware by calling SIMD instructions. This article will give a detailed introduction to Go assembly and its application on MatrixOne.
Github address:/matrixorigin/matrixoneInterested readers welcome star and fork.
Go compilation introduction
Go is a newer high-level language that provides exciting features such as coroutines, fast compilation, etc. But in the database engine, using pure Go language is unavailable. For example, vectorization is a common acceleration method used by database computing engines, and Go cannot maximize the performance of vectorized code by calling SIMD instructions. For example, in security-related code, Go language cannot call cryptography-related instructions provided by the CPU. In the world of C/C++/Rust, solving this type of problem can be done by calling CPU architecture-related intrinsics functions. The solution provided by Go language is Go assembly. This article will introduce the syntax characteristics of Go assembly and demonstrate its usage through several specific scenarios.
This article assumes that readers already have a basic understanding of computer architecture and assembly language, so common nouns (such as "registers") are not explained. If you lack relevant preliminary knowledge, you can seek online resources for learning, for examplehere。
Without special instructions, the assembly language referred to in this article is all aimed at the x86 (amd64) architecture. About the x86 instruction set,IntelandAMDThe official provides a complete instruction set reference document. If you want to check it quickly, you can also use itThis list. Intel'sintrinsics documentationIt can also be used as a reference.
Why use Go assembly?
Wikipedia summarizes the reasons for using assembly language into three categories:
- Directly operate the hardware
- Use special CPU instructions
- Solve performance issues
The reasons why Go programmers use assembly are nothing more than these three categories. If the problem you are facing is in these 3 categories and there are no ready-made libraries available, you can consider using Go assembly.
Why not use CGO?
- Huge function call overhead
- Memory management issues
- Breaking the goroutine semantics If a CGO function is run in a coroutine, it will occupy a separate thread and cannot be scheduled normally by Go running.
- Poor portability Cross-compilation requires a complete set of toolchains for the target platform. Deploying on different platforms requires installing more dependency libraries.
If the above points are unacceptable in your scenario, you might as well try Go assembly.
Features of Go assembly grammar
According to Rob Pike's The Design of the Go Assembler, the assembly language used by Go does not strictly correspond to CPU instructions one by one, but is a "pseudo assembly" called Plan 9 assembly.
The most important thing to know about Go's assembler is that it is not a direct representation of the underlying machine. Some of the details map precisely to the machine, but some do not. This is because the compiler suite needs no assembler pass in the usual pipeline. Instead, the compiler operates on a kind of semi-abstract instruction set, and instruction selection occurs partly after code generation. The assembler works on the semi-abstract form, so when you see an instruction like MOV what the toolchain actually generates for that operation might not be a move instruction at all, perhaps a clear or load. Or it might correspond exactly to the machine instruction with that name. In general, machine-specific operations tend to appear as themselves, while more general concepts like memory move and subroutine call and return are more abstract. The details vary with architecture, and we apologize for the imprecision; the situation is not well-defined.
We don’t have to care about the correspondence between Plan 9 assembly and machine instructions, we only need to understand the syntax characteristics of Plan 9 assembly. There are some available documents on the Internet, such ashereandhere。
One example is better than a thousand words. Let’s take the simplest 64-bit integer addition as an example to look at the characteristics of Go assembly grammar from different aspects.
// func Add(x, y int64) int64
//add_amd64.s #include "" TEXT ·Add(SB), NOSPLIT, $0-24 MOVQ x+0(FP), AX MOVQ y+8(FP), CX ADDQ AX, CX MOVQ CX, ret+16(FP) RET
The order these four assembly codes do is:
- The first operand x is put into register AX
- Put the second operand y into the register
- CXCX plus AX, and the result is put back to CX
- CX puts the stack address where the return value is located
Operand order
There are two most commonly used syntaxes for x86 assembly, AT&T syntax and Intel syntax. The result number of AT&T syntax is placed at the end and other operands are placed at the beginning. The number of Intel syntax results is placed at the front, and other operands are placed at the back.
Go's compilation is close to AT&T syntax in this regard, and the results are finally put.
An example of a mistake that is easy to write is the CMP instruction. From the perspective of effect, CMP is similar to SUB instruction that only modifies the EFLAGS flag bit and does not modify the operand. In Go assembly, CMP sets the flag bits by subtracting the second operand (opposite to SUB).
Register width identification
Some instructions support different register widths. Taking ADD of 64-bit operands as an example, according to the AT&T syntax, the instruction name should be added with a width suffix to become ADDQ, and the register should also be added with a width prefix to become RAX and RCX. According to Intel syntax, the instruction name remains unchanged and only prefixes the registers.
As can be seen from the above example, Go assembly is different from both: the instruction name needs to be suffixed with width, and the register remains unchanged.
Function Call Convention
The way programming languages pass parameters in function calls is called function calling convention. The mainstream C/C++ compilers on the x86-64 architecture all use register-based methods by default: the caller puts parameters intoSpecific registersPass to the called function. Go's calling convention, simply put, on the latest Go 1.18, Go's own runtime library uses a register-based method on the amd64, arm64 and ppc64 architectures, and the rest (other CPU architectures, as well as non-runtime libraries and user-written libraries) uses a stack-based method: the caller presses the parameters one by one, the caller accesses the stack through the passed offset, and then presses the return value after the execution is completed.
In the above code, FP is a virtual register pointing to the address of the first parameter in the stack. Multiple parameters and return values are stored in order, so the addresses of x, y, and return values in the stack are FP plus offsets 0, 8, and 16 respectively.
Tools that are helpful for writing Go assembly code
avo
Readers who are familiar with assembly language should know that handwritten assembly language will have cumbersome and error-prone steps such as selecting registers and calculating offsets. The avo library is built to solve such problems. For details on how avo is used, please refer to the following in its repoSample。
text/template
This is a library that comes with Go. It can be helpful when writing a lot of repetitive code, such as implementing the same basic operator for different types in vectorized code. For specific usage, please refer to the official document, which does not take up space here.
Using macros in Go assembly code
Go assembly code supports macros similar to C language, and can also be used in scenarios where code is repeated a lot. There are many examples in the internal library, such ashere。
Go language assembly application in MatrixOne database
Basic vector operation acceleration
In the OLAP database computing engine, vectorization is an indispensable acceleration method. Through vectorization, unnecessary overhead caused by a large number of simple function calls is eliminated. In order to achieve maximum vectorization performance, using SIMD instructions is a very natural choice.
Let's take 8-bit integer vectorized addition as an example. Add the elements of the two arrays in pairs and put the result into the third array. Such operations can be automatically optimized to versions using SIMD instructions in some C/C++ compilers. However, the Go compiler, which is known for its compilation speed, will not do such optimizations. This is also the active choice made by Go to ensure compilation speed. In this example, we introduce how to implement int8 type vector addition with AVX2 instruction set using Go assembly (assuming the array has been filled in 32 bytes).
Since AVX2 has 16 256-bit registers in total, we want to use them all in loop expansion. If it is completely handwritten, repeating the list of registers is very cumbersome and prone to errors. So we use avo to simplify some work. Avo's vector addition code is as follows:
package main import ( . "/mmcloughlin/avo/build" . "/mmcloughlin/avo/operand" . "/mmcloughlin/avo/reg" ) var unroll = 16 var regWidth = 32 func main() { TEXT("int8AddAvx2Asm", NOSPLIT, "func(x []int8, y []int8, r []int8)") x := Mem{Base: Load(Param("x").Base(), GP64())} y := Mem{Base: Load(Param("y").Base(), GP64())} r := Mem{Base: Load(Param("r").Base(), GP64())} n := Load(Param("x").Len(), GP64()) blocksize := regWidth * unroll blockitems := blocksize / 1 regitems := regWidth / 1 Label("int8AddBlockLoop") CMPQ(n, U32(blockitems)) JL(LabelRef("int8AddTailLoop")) xs := make([]VecVirtual, unroll) for i := 0; i < unroll; i++ { xs[i] = YMM() VMOVDQU((regWidth*i), xs[i]) } VPADDB((regWidth*i), xs[i], xs[i]) VMOVDQU(xs[i], (regWidth*i)) ADDQ(U32(blocksize), ) ADDQ(U32(blocksize), ) ADDQ(U32(blocksize), ) SUBQ(U32(blockitems), n) JMP(LabelRef("int8AddBlockLoop")) Label("int8AddTailLoop") CMPQ(n, U32(regitems)) JL(LabelRef("int8AddDone")) VMOVDQU(x, xs[0]) VPADDB(y, xs[0], xs[0]) VMOVDQU(xs[0], r) ADDQ(U32(regWidth), ) ADDQ(U32(regWidth), ) ADDQ(U32(regWidth), ) SUBQ(U32(regitems), n) JMP(LabelRef("int8AddTailLoop")) Label("int8AddDone") RET() }
Run the command
go run -out
The assembly code generated afterwards is as follows:
// Code generated by command: go run -out . DO NOT EDIT. #include "" // func int8AddAvx2Asm(x []int8, y []int8, r []int8) // Requires: AVX, AVX2 TEXT ·int8AddAvx2Asm(SB), NOSPLIT, $0-72 MOVQ x_base+0(FP), AX MOVQ y_base+24(FP), CX MOVQ r_base+48(FP), DX MOVQ x_len+8(FP), BX int8AddBlockLoop: CMPQ BX, $0x00000200 JL int8AddTailLoop VMOVDQU (AX), Y0 VMOVDQU 32(AX), Y1 VMOVDQU 64(AX), Y2 VMOVDQU 96(AX), Y3 VMOVDQU 128(AX), Y4 VMOVDQU 160(AX), Y5 VMOVDQU 192(AX), Y6 VMOVDQU 224(AX), Y7 VMOVDQU 256(AX), Y8 VMOVDQU 288(AX), Y9 VMOVDQU 320(AX), Y10 VMOVDQU 352(AX), Y11 VMOVDQU 384(AX), Y12 VMOVDQU 416(AX), Y13 VMOVDQU 448(AX), Y14 VMOVDQU 480(AX), Y15 VPADDB (CX), Y0, Y0 VPADDB 32(CX), Y1, Y1 VPADDB 64(CX), Y2, Y2 VPADDB 96(CX), Y3, Y3 VPADDB 128(CX), Y4, Y4 VPADDB 160(CX), Y5, Y5 VPADDB 192(CX), Y6, Y6 VPADDB 224(CX), Y7, Y7 VPADDB 256(CX), Y8, Y8 VPADDB 288(CX), Y9, Y9 VPADDB 320(CX), Y10, Y10 VPADDB 352(CX), Y11, Y11 VPADDB 384(CX), Y12, Y12 VPADDB 416(CX), Y13, Y13 VPADDB 448(CX), Y14, Y14 VPADDB 480(CX), Y15, Y15 VMOVDQU Y0, (DX) VMOVDQU Y1, 32(DX) VMOVDQU Y2, 64(DX) VMOVDQU Y3, 96(DX) VMOVDQU Y4, 128(DX) VMOVDQU Y5, 160(DX) VMOVDQU Y6, 192(DX) VMOVDQU Y7, 224(DX) VMOVDQU Y8, 256(DX) VMOVDQU Y9, 288(DX) VMOVDQU Y10, 320(DX) VMOVDQU Y11, 352(DX) VMOVDQU Y12, 384(DX) VMOVDQU Y13, 416(DX) VMOVDQU Y14, 448(DX) VMOVDQU Y15, 480(DX) ADDQ $0x00000200, AX ADDQ $0x00000200, CX ADDQ $0x00000200, DX SUBQ $0x00000200, BX JMP int8AddBlockLoop int8AddTailLoop: CMPQ BX, $0x00000020 JL int8AddDone ADDQ $0x00000020, AX ADDQ $0x00000020, CX ADDQ $0x00000020, DX SUBQ $0x00000020, BX JMP int8AddTailLoop int8AddDone: RET
As you can see, in the avo code, we only need to specify the register type for the variable, and when generating assembly, we will automatically bind the available registers of the corresponding type. This can indeed bring convenience in many scenarios. However, avo currently only supports x86 architecture, and it is unavailable to write assembly to the arm CPU.
Directives that cannot be called directly in Go language
In addition to SIMD, there are many CPU instructions that cannot be used in the Go language itself, such as cryptography-related instructions. If you are using C/C++, you can use the intrinsics function built-in by the compiler (both provided by gcc and clang), which is quite convenient. Unfortunately, Go does not provide intrinsics functions. In such a scenario, assembly is the only solution. Go language's own crypto official library has a large amount of assembly code.
Here we take the CRC32C instruction as an example. In MatrixOne's hash table implementation, the hash function of integer key only uses one CRC32 instruction, achieving the highest theoretical performance. The code is as follows:
TEXT ·Crc32Int64Hash(SB), NOSPLIT, $0-16 MOVQ -1, SI CRC32Q data+0(FP), SI MOVQ SI, ret+8(FP) RET
In actual code, in order to eliminate the instruction jump overhead caused by assembly function calls and parameter entry and exit overhead, batch versions are used. In order to save space, we will use the simplified version as an example.
Special optimization effects that the compiler cannot achieve
Here is a part of the algorithm used by MatrixOne to intersect two ordered 64-bit integer arrays:
... loop: CMPQ DX, DI JE done CMPQ R11, R8 JE done MOVQ (DX), R10 MOVQ R10, (SI) CMPQ R10, (R11) SETLE AL SETGE BL SETEQ CL SHLB $0x03, AL SHLB $0x03, BL SHLB $0x03, CL ADDQ AX, DX ADDQ BX, R11 ADDQ CX, SI JMP loop done: ...
CMPQ R10, (R11)
This line compares the elements that compare the current pointer position of the two arrays. The following lines move the pointers of the corresponding operand array and result array based on the results of this comparison. The text explanation is not as clear as comparing the following equivalent C language code:
while (true) { if (a == a_end) break; if (b == b_end) break; *c = *a; if (*a <= *b) ++a; if (*a >= *b) ++b; if (*a == *b) ++c; }
In assembly code, the loop only performs a comparison operation once, and no branch jumps. High-level language compilers cannot achieve such optimization effects because any high-level language does not provide "modify 3 different numbers according to 3 different results of a comparison operation" so that the semantics related to the CPU instruction set are directly related to the semantics.
This example is a demonstration of the power of assembly language. Programming languages are constantly developing, and the abstraction level is getting higher and higher, but in the scenario of maximizing performance, assembly languages that directly deal with CPU instructions are still needed.
This is the article about briefly analyzing the introduction of Go assembly grammar and MatrixOne usage. For more related content on Go assembly MatrixOne usage, please search for my previous articles or continue browsing the related articles below. I hope everyone will support me in the future!