Updated on 2025-03-03

Introduction to the Go compiler

cmd/compile contains the packages that make up the main Go compiler. The compiler can be logically divided into four stages. This article briefly describes each stage and lists the packages that contain the corresponding code.
When talking about compilers, you may sometimes hear the terms front-end and back-end. Roughly speaking, these correspond to the first two and the last two stages listed here. A third term, middle-end, usually refers to much of the work performed in the second stage.
Please note that the go/* family of packages, such as go/parser and go/types, has nothing to do with the compiler. Since the compiler was originally written in C, the go/* packages were developed to make it possible to write tools that work with Go code, such as gofmt and vet.
It should be clarified that the name "gc" stands for "Go compiler" and has nothing to do with uppercase GC, which stands for garbage collection.

1. Parsing

  • cmd/compile/internal/syntax (lexer, parser, syntax tree)

In the first stage of compilation, the source code is tokenized (lexical analysis) and parsed (syntax analysis), and a syntax tree is constructed for each source file.
(Translator's note: a token is a predefined, recognizable string, usually made up of a name and a value, where the name is a lexical category such as identifier, keyword, separator, operator, literal, or comment. The syntax tree, like the abstract syntax tree (AST) mentioned below, uses a tree to express the syntactic structure of the program. Usually the leaf nodes are operands and the other nodes are operators.)
Each syntax tree is an exact representation of the corresponding source file, with nodes corresponding to the various elements of the source, such as expressions, declarations, and statements. The syntax tree also includes position information, which is used for error reporting and for generating debugging information.
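The syntax package used by the compiler is internal and cannot be imported from ordinary programs, so the following is only a rough sketch of the same idea using the standard go/parser and go/ast packages instead. These are the tooling packages, not the compiler's own parser, and the helper name describe and the sample source are made up for illustration. It parses one file and reports the position of a few node kinds, which shows how nodes carry location information.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// src is a tiny sample file (hypothetical, for illustration only).
const src = `package demo

func add(a, b int) int {
	return a + b
}
`

// describe parses src and returns one line per function declaration
// and binary expression, each prefixed with its source position.
func describe(src string) ([]string, error) {
	fset := token.NewFileSet() // maps node positions back to file:line:column
	file, err := parser.ParseFile(fset, "demo.go", src, 0)
	if err != nil {
		return nil, err
	}
	var lines []string
	ast.Inspect(file, func(n ast.Node) bool {
		switch n := n.(type) {
		case *ast.FuncDecl:
			lines = append(lines, fmt.Sprintf("%s FuncDecl %s", fset.Position(n.Pos()), n.Name.Name))
		case *ast.BinaryExpr:
			lines = append(lines, fmt.Sprintf("%s BinaryExpr %s", fset.Position(n.Pos()), n.Op))
		}
		return true // keep walking into child nodes
	})
	return lines, nil
}

func main() {
	lines, err := describe(src)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, l := range lines {
		fmt.Println(l)
	}
}
```

Each reported line pairs a position with a node kind, which is the same information an error message or a debugger needs.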

2. Type checking and AST transformation

  • cmd/compile/internal/gc (create compiler AST, type-checking, AST transformation)

The gc package contains an AST definition carried over from the time when the compiler was written in C. All of its code is written in terms of that AST, so the first thing the gc package must do is convert the syntax tree produced by the syntax package into the compiler's own AST representation. This extra step may be refactored away in the future.
The AST is then type-checked. The first steps are name resolution and type inference, which determine which object each identifier refers to and which type each expression has. Type checking also includes certain additional checks, such as "declared but not used" and determining whether a function terminates.
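To get a feel for what such a check reports, here is a minimal sketch that uses the standard go/types package, which is the tooling type checker rather than the compiler's internal one. The helper typeErrors and the sample source are assumptions made for this example. The sample function declares a variable that is never used, which the type checker is expected to flag.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"go/types"
)

// src declares x but never uses it (hypothetical sample input).
const src = `package demo

func f() int {
	x := 1
	return 2
}
`

// typeErrors parses and type-checks src and returns every error the
// type checker reports, rather than stopping at the first one.
func typeErrors(src string) ([]string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "demo.go", src, 0)
	if err != nil {
		return nil, err
	}
	var msgs []string
	conf := types.Config{
		// Error is called for each error found during checking.
		Error: func(err error) { msgs = append(msgs, err.Error()) },
	}
	// The sample imports nothing, so no Importer is configured.
	// The returned error repeats the first callback error, so it is ignored.
	_, _ = conf.Check("demo", fset, []*ast.File{file}, nil)
	return msgs, nil
}

func main() {
	msgs, err := typeErrors(src)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, m := range msgs {
		fmt.Println(m)
	}
}
```

The exact wording of the message varies between Go releases, so the sketch prints whatever the checker returns.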
Certain transformations are also performed on the AST. Some nodes are refined based on type information, for example string addition is split off from the node type used for arithmetic addition. Other examples are dead code elimination, function call inlining, and escape analysis.
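Escape analysis is probably the easiest of these transformations to observe from the outside. The small program below is an illustrative sketch with made-up names (point, sumLocal, newPoint). Building it with go build -gcflags=-m asks the compiler to print its inlining and escape decisions. The outcome described in the comments is what one would normally expect, not a guarantee, since inlining can change the result at a given call site.

```go
package main

import "fmt"

type point struct{ x, y int }

// sumLocal never lets the address of p leave the function, so p is
// expected to stay on the stack.
func sumLocal() int {
	p := point{1, 2}
	return p.x + p.y
}

// newPoint returns the address of a local variable, so p has to
// outlive the call and is expected to be allocated on the heap.
func newPoint() *point {
	p := point{3, 4}
	return &p
}

func main() {
	fmt.Println(sumLocal())
	fmt.Println(newPoint().x)
}
```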

3. Generic SSA

  • cmd/compile/internal/gc (convert to SSA)
  • cmd/compile/internal/ssa (SSA passes and rules)

(Translator's note: compilers for many common high-level languages cannot complete all of their work in a single scan of the source code or AST. Instead they scan several times, complete part of the work each time, and use the output of one scan as the input to the next, until the target code is finally generated. Each such scan is called a pass. The results produced by every pass before the last one are called intermediate representations; in this article, the AST and SSA are both intermediate representations. SSA, or static single assignment form, is a property of an intermediate representation: it requires each variable to be assigned exactly once and to be defined before it is used.)
In this stage, the AST is converted into Static Single Assignment (SSA) form, a lower-level intermediate representation with specific properties that make it easier to implement optimizations and, eventually, to generate machine code from it.
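The "assigned exactly once" property can be illustrated with a toy renamer. The sketch below is not how cmd/compile builds SSA: it only handles straight-line code, with no branches and therefore no phi functions, and the assign type and toSSA function are invented for this example. It gives each assignment a fresh version of its destination and rewrites later reads to use the latest version.

```go
package main

import (
	"fmt"
	"strings"
)

// assign is a toy three-address statement: dst = a op b.
// An empty op means a plain copy of a.
type assign struct {
	dst, a, op, b string
}

// toSSA renames variables in straight-line code so that every name is
// assigned exactly once. Control flow is deliberately not handled.
func toSSA(prog []assign) []string {
	version := map[string]int{}
	// use returns the current SSA name of an operand. Literals and
	// names that were never assigned are returned unchanged.
	use := func(name string) string {
		if v, ok := version[name]; ok {
			return fmt.Sprintf("%s%d", name, v)
		}
		return name
	}
	var out []string
	for _, s := range prog {
		a, b := use(s.a), use(s.b) // read operands before bumping dst
		version[s.dst]++
		dst := use(s.dst)
		if s.op == "" {
			out = append(out, fmt.Sprintf("%s = %s", dst, a))
		} else {
			out = append(out, fmt.Sprintf("%s = %s %s %s", dst, a, s.op, b))
		}
	}
	return out
}

func main() {
	// x = 1; x = x + 2; y = x * x
	prog := []assign{
		{"x", "1", "", ""},
		{"x", "x", "+", "2"},
		{"y", "x", "*", "x"},
	}
	fmt.Println(strings.Join(toSSA(prog), "\n"))
}
```

After renaming, the second assignment to x becomes a new name, so every later pass can tell at a glance which definition a use refers to.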
During this conversion, function intrinsics are applied. These are special functions that the compiler has been taught to replace with heavily optimized code on a case-by-case basis. (Translator's note: an intrinsic is a function the compiler knows about directly. Instead of emitting a call instruction to the function, the compiler usually substitutes the sequence of instructions that implements it, which is somewhat similar to inlining.)
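As a hedged illustration: the functions in math/bits are, to my understanding, among those the compiler treats as intrinsics on several architectures, although which ones qualify depends on the target and the Go version. The sketch below compares a hand-written bit count (popcountLoop, a name invented here) with bits.OnesCount64. Both return the same result; the difference would only show up in the generated code.

```go
package main

import (
	"fmt"
	"math/bits"
)

// popcountLoop counts set bits by repeatedly clearing the lowest one.
func popcountLoop(x uint64) int {
	n := 0
	for x != 0 {
		x &= x - 1 // clears the lowest set bit
		n++
	}
	return n
}

// popcountIntrinsic defers to math/bits, which the compiler may replace
// with a dedicated instruction where the target supports one.
func popcountIntrinsic(x uint64) int {
	return bits.OnesCount64(x)
}

func main() {
	const v = 0xF0F0
	fmt.Println(popcountLoop(v), popcountIntrinsic(v))
}
```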
During the conversion from AST to SSA, certain nodes are also lowered into simpler components so that the remaining compilation phases can work with them. For example, the copy builtin is replaced with memory moves, and range loops are rewritten as for loops. For historical reasons, some of this currently happens before the conversion to SSA, but the long-term plan is to move all of it into this stage.
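The range rewrite can be approximated by hand. The two functions below (sumRange and sumFor, names chosen for this example) should behave identically for slices; the second spells out roughly what the first is reduced to, though the compiler's actual rewritten form differs in detail.

```go
package main

import "fmt"

// sumRange uses a range loop over the slice.
func sumRange(xs []int) int {
	total := 0
	for _, v := range xs {
		total += v
	}
	return total
}

// sumFor is a hand-written approximation of the lowered loop: the
// slice and its length are evaluated once, then indexed explicitly.
func sumFor(xs []int) int {
	total := 0
	s := xs
	for i, n := 0, len(s); i < n; i++ {
		v := s[i]
		total += v
	}
	return total
}

func main() {
	xs := []int{1, 2, 3}
	fmt.Println(sumRange(xs), sumFor(xs))
}
```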
Then a series of machine-independent passes and rules is applied. These do not take any particular computer architecture into account, so they run for all GOARCH variants.
Examples of these generic passes include dead code elimination, removal of unneeded nil checks, and removal of unused branches. The generic rewrite rules mainly concern expressions, such as replacing some expressions with constant values and optimizing multiplications and floating-point operations.
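To make the idea of expression rewrite rules concrete, here is a toy simplifier. It is only a sketch: the expr type, the helper constructors, and the handful of rules are invented for illustration and are not taken from the compiler's rule files. It folds constant subexpressions and removes a few integer identities, working bottom-up.

```go
package main

import "fmt"

// expr is a toy integer expression: a constant, a variable, or a
// binary "+" or "*" node.
type expr struct {
	op   string // "const", "var", "+", or "*"
	val  int    // value when op == "const"
	name string // name when op == "var"
	l, r *expr
}

func konst(v int) *expr               { return &expr{op: "const", val: v} }
func variable(n string) *expr         { return &expr{op: "var", name: n} }
func bin(op string, l, r *expr) *expr { return &expr{op: op, l: l, r: r} }

// simplify applies a few rewrite rules bottom-up:
//
//	const op const -> const   (constant folding)
//	x * 1 -> x, x + 0 -> x    (identities)
//	x * 0 -> 0                (valid for integers only)
func simplify(e *expr) *expr {
	if e.op == "const" || e.op == "var" {
		return e
	}
	l, r := simplify(e.l), simplify(e.r)
	if l.op == "const" && r.op == "const" {
		if e.op == "+" {
			return konst(l.val + r.val)
		}
		return konst(l.val * r.val)
	}
	switch {
	case e.op == "*" && r.op == "const" && r.val == 1:
		return l
	case e.op == "+" && r.op == "const" && r.val == 0:
		return l
	case e.op == "*" && r.op == "const" && r.val == 0:
		return konst(0)
	}
	return bin(e.op, l, r)
}

// String renders the expression with explicit parentheses.
func (e *expr) String() string {
	switch e.op {
	case "const":
		return fmt.Sprint(e.val)
	case "var":
		return e.name
	}
	return "(" + e.l.String() + " " + e.op + " " + e.r.String() + ")"
}

func main() {
	// (x * (2 + 3)) + (y * 0)
	e := bin("+",
		bin("*", variable("x"), bin("+", konst(2), konst(3))),
		bin("*", variable("y"), konst(0)))
	fmt.Println(e, "=>", simplify(e))
}
```

The real rules are far more numerous and are applied repeatedly until nothing changes, but the shape is the same: match a pattern, replace it with something cheaper.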

4. Generate machine code

  • cmd/compile/internal/ssa (SSA lowering and architecture-specific passes)
  • cmd/internal/obj (machine code generation)

The machine-dependent phase of the compiler begins with the "lower" pass, which rewrites generic values into their machine-specific variants. For example, on the amd64 architecture instructions can take memory operands, so many load-store operations can be combined.
Note that the lower pass runs all of the machine-specific rewrite rules, so it currently applies many optimizations as well.
Once the SSA has been "lowered" and is more specific to the target architecture, the final code optimization passes are run. These include another dead code elimination pass, moving values closer to where they are used, removing local variables that are never read, and register allocation.
Other important work done in this step includes stack frame layout, which assigns stack offsets to local variables, and pointer liveness analysis, which computes which on-stack pointers are live at each garbage collection safe point.
At the end of the SSA generation phase, Go functions have been converted into a series of instructions. These are passed to the assembler (cmd/internal/obj), which converts them into machine code and writes out the final object file. The object file also contains reflection data, export data, and debugging information.
