Implement a small compiler based on JS

The content of this article is based onWarehouse source codeCarry out learning

Recently, I was studying some construction-side things and saw this passage when I was browsing the official website of Babel:

For an awesome tutorial on compilers, check out the-super-tiny-compiler, which also explains how Babel itself works on a high level.

Out of a learning attitude, I searched for the source code of this warehouse. The following are some of the author’s learning records. I hope that after reading it, it will be helpful for you to understand the working principle of babel~

Preface

the-super-tiny-compilerIt is a super small compiler with less than 200 lines of code, but through this compiler, you can learn the most basic compile principle, including babel, which is also developed based on this principle.

An example of the repository itself is to compile a set of lisp-style function syntax into C-style function syntax, for example:

LISP style	C Style
2+2	(add 2 2)	add(2,2)
4-2	(subtract 4 2)	substract(4, 2)
2 + (4 - 2)	(add 2 (substract 4 2))	add(2, (substract(4, 2)))

This is probably what compiler needs to complete this time. It may not seem that the syntax is very complete, but it can also demonstrate the main ideas of modern compilers.

Most Compilers divide the compilation process into three main processes: parse, transform and code generate:

parse mainly converts the source code into a more abstract code expression
transform is to perform any operation that compiler wants to perform in the abstract expression above.
code generates a new code after transform processing

Parse

parse is mainly divided into two steps: lexical analysis and grammatical analysis.

Lexical analysis divides the source code into one tokens according to the expression. Tokens is a group of objects used to describe a separate syntax, which can be numbers, labels, punctuation, operators, etc.
Grammatical analysis is to re-arrange the tokens generated by lexical analysis to obtain a structure called Abstract Syntax Tree (AST). AST is an easy-to-use nested tree structure that can display complete information.

For example, the aforementionedadd 2 (subtract 4 2)The tokens generated by expressions after being processed by lexical analysis are probably:

[
  { type: 'paren',  value: '('        },
  { type: 'name',   value: 'add'      },
  { type: 'number', value: '2'        },
  { type: 'paren',  value: '('        },
  { type: 'name',   value: 'subtract' },
  { type: 'number', value: '4'        },
  { type: 'number', value: '2'        },
  { type: 'paren',  value: ')'        },
  { type: 'paren',  value: ')'        }
]

The AST structure processed by syntax analysis is:

{
  type: 'Program',
    body: [
      {
        type: 'CallExpression',
        name: 'add',
        params: [
          {
          type: 'NumberLiteral',
          value: '2',
          },
         {
           type: 'CallExpression',
           name: 'subtract',
           params: [
            {
              type: 'NumberLiteral',
              value: '4',
            }, 
            {
              type: 'NumberLiteral',
              value: '2',
            }
           ]
         }]
      }]
}

Transform

Transform mainly takes the abstract syntax tree obtained by parse and makes some modifications based on this. The process of transform can be used to modify ast based on the current language style or to use a new language style.

The following is based on the previous ast structure to show the workflow of transform process.

As you can see, the elements in ast all look very similar. These elements form the sub-nodes of ast. The data structure types of these sub-nodes describe a separate part of the code (such as variables, declaration statements, expressions, etc.).

For example, the type mentioned above isNumberLiteralNode:

{
  type: 'NumberLiteral',
  value: '2',
}

Or a more complex child node type:

{
    type: 'CallExpression',
    name: 'substract',
    params: [...nested nodes here ...],
}

In the transfrom process, we can manipulate abstract syntax tree nodes by adding/deleting/modifying, or we can directly create a new one based on the current abstract syntax tree.

Since our goal here is to convert the input code into a new language-style code (Lisp -> C), a brand new AST for the new language will be created here.

Therefore, we need to clarify the operation of modifying AST:

Traversal

In order to handle all nodes, we can traverse them with depth-first search:

{
  type: 'Program',
  body: [{
    type: 'CallExpression',
    name: 'add',
    params: [{
      type: 'NumberLiteral',
      value: '2'
    }, {
      type: 'CallExpression',
      name: 'subtract',
      params: [{
        type: 'NumberLiteral',
        value: '4'
      }, {
        type: 'NumberLiteral',
        value: '2'
      }]
    }]
  }]
}

The traversal process is as follows:

Program - Start from the top node of AST
CallExpression (add) - The first child element of Program
NumberLiteral (2) - The first child element of CallExpression (add)
CallExpression (subtract) - The second child element of CallExpression (add)
NumberLiteral (4) - The first child element of CallExpression (subtract)
NumberLiteral (2) - The second child element of CallExpression (subtract)

If you operate directly inside ast instead of producing a new ast, you may need to introduce all kinds of abstractions. But at present, the method of accessing all nodes is enough.

Visiting represents a pattern of operating on elements within an object structure.

Visitors

Here we can create a visitor object, which includes some methods for receiving different nodes.

For example:

var visitor = {
  NumberLiteral() {},
  CallExpression() {}
};

Therefore, when we traverse ast, if we match the node of the corresponding type, we can call the method in visitor to handle it.

Code generate

The last stage of Compiler is generate. What is done in this stage may overlap with transformation, but the most important part of code generation is to output the code based on AST.

Generate works several different ways. Some Compilers will reuse the tokens generated before, and some will create independent code representations to facilitate linear output of code, but next we will focus on using the previously generated ASTs.

Our generator needs to know how to print all types of nodes in AST and then call itself recursively, knowing that all code is printed into a very long string.

Code implementation

The above is all the parts of Compiler, but not all Compilers are like this. Different compilers have different purposes, so different steps may also be required.

Next, start writing the code:

Lexical analyzer (tokenizer)

According to the previous theoretical analysis, we will first carry out the tokenizer in the parser stage.

This function takes a string and then divides it into an array of tokens:

ex:

(add 2 (substract 4 2)) => [{ type: 'paren', value: '('}, ...]

Therefore, a function like this can be written:

const tokenizer = (input) =&gt; {
  const tokens = [];
  let current = 0;
  while (current &lt; ) {
    let char = input[current];
    if (char === '(') {
      ({
        type: 'paren',
        value: '(',
      })

      current++;
      continue;
    }

    if (char === ')') {
      ({
        type: 'paren',
        value: ')',
      })
      current ++;
      continue;
    }

    // Spaces do not need to be processed at present    if (/\s/.test(char)) {
      current++;
      continue;
    }

    // Process numbers    if (/[0-9]/.test(char)) {
      let value = '';
      // Iterate until it encounters non-numeric characters      while (/[0-9]/.test(char)) {
        value += char;
        char = input[++current];
      }
      ({
        type: 'number',
        value,
      })
      continue;
    }

    // Process strings    if(/[a-z]/(char)) {
      let value = '';

      while(/[a-z]/(char)) {
        value += char;
        char = input[++current];
      }
      ({
        type: 'name',
        value,
      });

      continue;
    }

    // If there is a token that cannot match, it will throw the wrong one    throw new Error(`Unknown token: ${char}`);
  }
  return tokens;
}

Parse parser

The lexer takes the token array obtained by the syntax analysis and converts it into an AST structure.

For example:

[{ type: 'paren', value: '(' }, ...] => { type: 'Program', body: [...] }

const parser = (tokens) =&gt; {
  let current = 0;
  
  const walk = () =&gt; {
    const token = tokens[current];
    // If it is a node of type number, return a new ast node    if ( === 'number') {
      current++;
      return {
        type: 'NumberLiteral',
        value: ,
      }
    }

    // Next check the CallExpression type, starting with the left brackets    if (
       === 'paren' &amp;&amp;
       === '('
    ) {
      // Skip the left bracket      token = tokens[++current];
      // First, a root node of CallExpression will be created, and then name will be set to the current one      // Because the token after the left bracket must be the function name      const node = {
        type: 'CallExpression',
        name: ,
        params: [],
      };

      // The function name is skipped here again      token = tokens[++current];

      // loop through each next token until the right parentheses are encountered      // These tokens will become `params` of `CallExpression`
      // In lisp style code: (add 2 (substract 4 2))      /**
        * There will be many brackets in the token
        * [
           { type: 'paren', value: '(' },
           { type: 'name', value: 'add' },
           { type: 'number', value: '2' },
           { type: 'paren', value: '(' },
           { type: 'name', value: 'subtract' },
           { type: 'number', value: '4' },
           { type: 'number', value: '2' },
           { type: 'paren', value: ')' }, <<< Right parentheses
           { type: 'paren', value: ')' } <<< Right parentheses
         ]
         When nested CallExpressions are encountered, the walk function will be processed.
        *
        */
      while (
         !== 'paren' ||
        ( !== ')' &amp;&amp;  === 'paren')
      ) {
        (walk());
        token = tokens[current];
      }
      current++;
      return node;
    }
    throw new Error(`Unknown token type: ${}`);
  }

  const ast = {
    type: 'Program',
    body: [],
  }

  /**
    * Use recursive functions to process nodes into
    */
  while (current &lt; ) {
    (walk());
  }

  return ast;
}

Visitors

After obtaining ast through syntax analysis, a traverser (visitors) is needed to traverse the nodes. Then when encountering a node of a certain type, you can call the corresponding type processing function in visitors:

traverse(ast, {
  Program(node, parent) {
    // ...
  },
  
  CallExpression(node, parent) {
    // ...
  },
  
  NumberLiteral(node, parent) {
    // ...
  },
});

Therefore our code can be written like this:

const traverser = (ast, visitor) =&gt; {

  // Used to call the traverseNode function on each element in the array  const traverseArray = (array, parent) =&gt; {
    ((child) =&gt; {
      traverseNode(child, parent);
    });
  }

  // The function receives a node and its parent node as a parameter  // This node will be passed into the corresponding processing function in the visitor  const traverseNode = (node, parent) =&gt; {
    const method = visitor[];
    if (method) {
      method(node, parent);
    }
    // Handle different nodes separately    switch () {
      case 'Program':
        traverseArray(, node);
        break;

      case 'CallExpression':
        traverseArray(, node);
        break;
        
      // In this case, there are no child nodes, just break out      case 'NumberLiteral':
        break;
      
      default:
        throw new Error(`Unknown node type: ${}`);
    }
  }

  traverseNode(ast, null);
}

Converter (transformer)

The converter is used with the traverser above. It receives the previously built ast and passes it into the traverser with the visitor, thus obtaining a brand new AST.

The original AST structure is (add 2 (subtract 4 2) ):

{
  type: 'Program',
  body: [{
    type: 'CallExpression',
    name: 'add',
    params: [{
      type: 'NumberLiteral',
      value: '2'
    }, {
      type: 'CallExpression',
      name: 'subtract',
      params: [{
        type: 'NumberLiteral',
        value: '4'
      }, {
        type: 'NumberLiteral',
        value: '2'
      }]
    }]
  }]
}

The AST structure generated after the conversion is (add(2, subtract(4, 2)) ):

{
  type: 'Program',
  body: [{
    type: 'ExpressionStatement',
    expression: {
      type: 'CallExpression',
      callee: {
        type: 'Identifier',
        name: 'add',
      },
      arguments: [{
        type: 'NumberLiteral',
        value: '2',
      }, {
        type: 'CallExpression',
        callee: {
          type: 'Identifier',
          name: 'subtract',
        },
        arguments: [{
          type: 'NumberLiteral',
          value: '4',
        }, {
          type: 'NumberLiteral',
          value: '2',
        }]
      }]
    }
  }
}

Next we can write the corresponding converter code like this:

const transformer = (ast) =&gt; {
  const newAst = {
    type: 'Program',
    body: [],
  }

  ast._context = ;

  traverser(ast, {
    // Handle NumberLiteral type    NumberLiteral: (node, parent) =&gt; {
      parent._context.push({
        type: 'NumberLiteral',
        value: ,
      });
    },

    // Handle CallExpression type    CallExpression: (node, parent) =&gt; {

      // Initialize a new node of CallExpression      // Put a nested Identifier node inside      const expression = {
        type: 'CallExpression',
        callee: {
          type: 'Identifier',
          name: 
        },
        arguments: [],
      };

      node._context = ;

      // Check whether the parent node is CallExpression      if ( !== 'CallExpression') {

        // If not, wrap the CallExpression node in a statement node called `ExpressionStatement`        // This is done because the CallExpression of top level can also be regarded as a declaration statement in JavaScript        expression = {
          type: 'ExpressionStatement',
          expression,
        };
      }

      // Finally, we put CallExpression into the context of the parent node      parent._context.push(expression);
    }
  });

  return newAst;
}

Code generator

The code generator is also a recursive function, and will eventually print each node in the AST into a large string:

const codeGenerator = (node) =&gt; {
  switch () {
    // If it is a Program, it will go through each node in the 'body' property    // Recursively codeGenerator on these nodes, and then print the result into a new line    case 'Program':
      return (codeGenerator).join('\n');
    
    // For ExpressionStatement, make a recursive call to the expression property and add a semicolon    case 'ExpressionStatement':
      return `${codeGenerator()};`;
    
    // Recursively call the callee property for CallExpression, then add (brackets)    // Then make a recursive call to the arguments property and add) brackets    case 'CallExpression':
      return `${codeGenerator()}(${(codeGenerator).join(', ')})`;

    // For Identifier, return name directly    case 'Identifier':
      return ;
    
    // For NumberLiteral, return value directly    case 'NumberLiteral':
      return ;
    
    default:
      throw new Error(`Unknown node type: ${}`);
  }
}

Compiler

At this point, basically all the processes have been completed. We can create a compiler function, and the entire compiler work can be completed by calling the above function:

input => tokenizer => tokens
tokens => parser => ast
ast => transformer => newAst
newAst => generator => output

The code only requires the following simple steps:

const compiler = (input) => {
  const tokens = tokenizer(input);
  const ast = parser(tokens);
  const newAst = transformer(ast);
  
  return codeGenerator(newAst);
}

We can enter the previous set of test examples to ensure that the results are correct.

Summarize

At this point, a tiny-compiler was created, and the compilation process of babel is also completed based on this, because babel itself is a compiler of source to source, and the entire process is also divided into three steps: parser -> transform -> code generate.refer to

The above is the detailed content of implementing a small compiler based on JS. For more information about JS compiler, please pay attention to my other related articles!