SoFunction
Updated on 2025-04-09

An in-depth look at parsing expressions (LPeg) in Lua

Using a pattern

This example shows a very simple but complete program that builds and uses a pattern:

The code is as follows:
local lpeg = require "lpeg"

-- matches a word followed by end-of-string
p = lpeg.R"az"^1 * -1

print(p:match("hello"))        --> 6
print(lpeg.match(p, "hello"))  --> 6
print(p:match("1 hello"))      --> nil

The pattern is simply a sequence of one or more lowercase letters followed by the end of the string (-1). The program calls match both as a method and as a function. In both successful cases, match returns the index of the first character after the match, that is, the length of the string plus one.
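
To make the return value more concrete, here is a small sketch that compares the pattern with and without the end-of-string check. The names word and whole_word are only illustrative; the optional third argument of match is the starting position.

```lua
local lpeg = require "lpeg"

local word = lpeg.R("az")^1      -- one or more lowercase letters
local whole_word = word * -1     -- same, but must reach the end of the subject

-- Without the end-of-string check, matching a prefix is enough.
print(word:match("hello world"))        --> 6
-- With it, the trailing " world" makes the match fail.
print(whole_word:match("hello world"))  --> nil
-- Starting at position 3 skips the leading "1 ".
print(whole_word:match("1 hello", 3))   --> 8
```

Because matching is anchored, the pattern is always tried at the starting position only; it never scans forward on its own.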

Name-value lists

This example parses a list of name-value pairs and returns a table with those pairs:

The code is as follows:
lpeg.locale(lpeg)   -- adds locale entries into 'lpeg' table

local space = lpeg.space^0
local name = lpeg.C(lpeg.alpha^1) * space
local sep = lpeg.S(",;") * space
local pair = lpeg.Cg(name * "=" * space * name) * sep^-1
local list = lpeg.Cf(lpeg.Ct("") * pair^0, rawset)
t = list:match("a=b, c = hi; next = pi")  --> { a = "b", c = "hi", next = "pi" }

Each pair has the format name = name followed by an optional separator (a comma or a semicolon). The pair pattern encloses the pair in a group pattern, so that the two names become the values of a single capture. The list pattern then folds these captures. It starts with an empty table, created by a table capture matching an empty string; then, for each capture (a pair of names), it applies rawset over the accumulator (the table) and the capture values (the pair of names). rawset returns the table itself, so the accumulator is always the table.
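
The folding step can also be spelled out by hand. The sketch below is not the method used above: it collects every pair into a small table and then applies rawset in an ordinary loop, which mirrors what the fold capture does internally. The names pairs_list and parse are assumptions made for this illustration.

```lua
local lpeg = require "lpeg"
lpeg.locale(lpeg)

local space = lpeg.space^0
local name = lpeg.C(lpeg.alpha^1) * space
local sep = lpeg.S(",;") * space
local pair = lpeg.Cg(name * "=" * space * name) * sep^-1

-- Each pair becomes a two-element table {name, value}.
local pairs_list = lpeg.Ct(lpeg.Ct(pair)^0)

local function parse (s)
  local acc = {}                       -- plays the role of lpeg.Ct("")
  for _, kv in ipairs(pairs_list:match(s)) do
    acc = rawset(acc, kv[1], kv[2])    -- rawset returns the table itself
  end
  return acc
end

local t = parse("a=b, c = hi; next = pi")
print(t.a, t.c, t.next)   --> b   hi   pi
```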


Splitting a string

The following code builds a pattern that splits a string using a given pattern sep as the separator:

The code is as follows:
function split (s, sep)
  sep = lpeg.P(sep)
  local elem = lpeg.C((1 - sep)^0)
  local p = elem * (sep * elem)^0
  return lpeg.match(p, s)
end

First, the function ensures that sep is a proper pattern. The pattern elem is a repetition of zero or more arbitrary characters, as long as there is no match against the separator. It also captures its match. The pattern p matches a list of elements separated by sep.
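
A short usage sketch of this multiple-return version follows. The ws pattern in the last call is an assumption added for illustration; it shows that the separator may be any pattern, not only a literal string.

```lua
local lpeg = require "lpeg"

local function split (s, sep)
  sep = lpeg.P(sep)
  local elem = lpeg.C((1 - sep)^0)
  local p = elem * (sep * elem)^0
  return lpeg.match(p, s)
end

-- An empty field between two separators is returned as an empty string.
print(select("#", split("a,b,,c", ",")))   --> 4

-- A separator made of a comma with optional surrounding spaces.
local ws = lpeg.S(" ")^0
print(split("x , y,z", ws * "," * ws))     --> x   y   z
```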

If the split results in too many values, it may overflow the maximum number of values that a Lua function can return. In this case, we can collect these values in a table:

The code is as follows:
function split (s, sep)
  sep = lpeg.P(sep)
  local elem = lpeg.C((1 - sep)^0)
  local p = lpeg.Ct(elem * (sep * elem)^0)   -- make a table capture
  return lpeg.match(p, s)
end
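
As a rough illustration of the table version, the sketch below splits a subject with 1000 fields. The subject is generated only for this example, and 1000 is an arbitrary size chosen to keep the example small.

```lua
local lpeg = require "lpeg"

local function split (s, sep)
  sep = lpeg.P(sep)
  local elem = lpeg.C((1 - sep)^0)
  local p = lpeg.Ct(elem * (sep * elem)^0)   -- table capture
  return lpeg.match(p, s)
end

-- Build the subject "1,2,...,1000".
local fields = {}
for i = 1, 1000 do fields[i] = tostring(i) end
local subject = table.concat(fields, ",")

local parts = split(subject, ",")
print(#parts, parts[1], parts[1000])   --> 1000   1   1000
```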

Searching for a pattern

The primitive match function works only in anchored mode. If we want to find a pattern anywhere in a string, we must write a pattern that matches anywhere.

Because patterns are composable, we can write a function that, given an arbitrary pattern p, returns a new pattern that searches for p anywhere in a string. There are several ways to do the search. One way is as follows:

The code is as follows:
function anywhere (p)
  return lpeg.P{ p + 1 * lpeg.V(1) }
end

This grammar has a straightforward reading: it matches p, or skips one character and tries again.

If we want to know where the pattern is in the string (instead of knowing only that it is there somewhere), we can add position captures to the pattern:

The code is as follows:
local I = lpeg.Cp()
function anywhere (p)
  return lpeg.P{ I * p * I + 1 * lpeg.V(1) }
end

print(anywhere("world"):match("hello world!"))   --> 7   12

Another option for the search is as follows:

The code is as follows:
local I = lpeg.Cp()
function anywhere (p)
  return (1 - lpeg.P(p))^0 * I * p * I
end

Again, the pattern has a straightforward reading: it skips as many characters as possible while not matching p, and then matches p (plus the appropriate position captures).
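
The two search strategies can be compared side by side. In this sketch the function names anywhere_grammar and anywhere_loop are made up to keep both variants in one file; they correspond to the recursive grammar and to the skip loop, respectively.

```lua
local lpeg = require "lpeg"
local I = lpeg.Cp()

-- Recursive version: match p here, or skip one character and retry.
local function anywhere_grammar (p)
  return lpeg.P{ I * p * I + 1 * lpeg.V(1) }
end

-- Loop version: skip characters while p does not match, then match p.
local function anywhere_loop (p)
  return (1 - lpeg.P(p))^0 * I * p * I
end

print(anywhere_grammar("world"):match("hello world!"))  --> 7   12
print(anywhere_loop("world"):match("hello world!"))     --> 7   12
print(anywhere_loop("xyz"):match("hello world!"))       --> nil
```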

If we want to look for a pattern only at word boundaries, we can use the following transformer:

The code is as follows:
local t = lpeg.locale()

function atwordboundary (p)
  return lpeg.P{
    [1] = p + t.alpha^0 * (1 - t.alpha)^1 * lpeg.V(1)
  }
end
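
The effect of the word-boundary restriction shows up when the target also occurs inside a longer word. The sketch below is only an illustration; the subject string and the variable names plain and bounded are assumptions.

```lua
local lpeg = require "lpeg"
local t = lpeg.locale()

local function atwordboundary (p)
  return lpeg.P{
    [1] = p + t.alpha^0 * (1 - t.alpha)^1 * lpeg.V(1)
  }
end

local target = lpeg.Cp() * "cat"

-- A plain search finds "cat" inside "concat".
local plain = (1 - lpeg.P("cat"))^0 * target
print(plain:match("concat cat"))     --> 4

-- The word-boundary search skips whole words and finds the second one.
local bounded = atwordboundary(target)
print(bounded:match("concat cat"))   --> 8
```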

Balanced parentheses

The following pattern matches only strings with balanced parentheses:

The code is as follows:
b = lpeg.P{ "(" * ((1 - lpeg.S"()") + lpeg.V(1))^0 * ")" }

Reading the first (and only) rule of the grammar, a balanced string is an opening parenthesis, followed by zero or more repetitions of either a non-parenthesis character or a balanced string (lpeg.V(1)), followed by the matching closing parenthesis.
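
A few sample subjects show how the grammar behaves. This is only a sketch; the last line adds -1 to require that the balanced string covers the whole subject.

```lua
local lpeg = require "lpeg"

local b = lpeg.P{ "(" * ((1 - lpeg.S"()") + lpeg.V(1))^0 * ")" }

print(b:match("(a(b)c)"))         --> 8
print(b:match("(a(b)c) tail"))    --> 8
print(b:match("(a(b c"))          --> nil
print((b * -1):match("(a)(b)"))   --> nil
```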

Global substitution

The next example does a job somewhat similar to string.gsub. It receives a subject string, a pattern, and a replacement value, and substitutes the replacement value for all occurrences of the pattern in the given string:

The code is as follows:
function gsub (s, patt, repl)
  patt = lpeg.P(patt)
  patt = lpeg.Cs((patt / repl + 1)^0)
  return lpeg.match(patt, s)
end

As in string.gsub, the replacement value can be a string, a function, or a table.
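
The three kinds of replacement value can be tried as follows. This is a minimal sketch; note that, unlike string.gsub, this version returns only the resulting string and not the number of substitutions.

```lua
local lpeg = require "lpeg"

local function gsub (s, patt, repl)
  patt = lpeg.P(patt)
  patt = lpeg.Cs((patt / repl + 1)^0)
  return lpeg.match(patt, s)
end

-- String replacement.
print(gsub("hello world", "o", "0"))                      --> hell0 w0rld
-- Function replacement: called with the matched text.
print(gsub("hello world", lpeg.R("az")^1, string.upper))  --> HELLO WORLD
-- Table replacement: indexed by the matched text.
print(gsub("a-b", lpeg.S("ab"), { a = "1", b = "2" }))    --> 1-2
```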

Comma-separated values (CSV)

This example breaks a string into comma-separated values, returning all fields:

The code is as follows:
local field = '"' * lpeg.Cs(((lpeg.P(1) - '"') + lpeg.P'""' / '"')^0) * '"' +
                    lpeg.C((1 - lpeg.S',\n"')^0)

local record = field * (',' * field)^0 * (lpeg.P'\n' + -1)

function csv (s)
  return lpeg.match(record, s)
end

A field is either a quoted field (which may contain any character except a lone quote; a quote inside the field is written as two quotes, which are replaced by one) or an unquoted field (which cannot contain commas, newlines, or quotes). A record is a list of fields separated by commas, ending with a newline or the end of the string.

As it is, the previous pattern returns each field as an independent result. If we add a table capture to the definition of record, the pattern will instead return a single table containing all fields:

The code is as follows:
local record = lpeg.Ct(field * (',' * field)^0) * (lpeg.P'\n' + -1)
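
Both variants can be exercised on a record that mixes quoted and unquoted fields. In this sketch the name record_t is an assumption used to keep the table version next to the multiple-return version.

```lua
local lpeg = require "lpeg"

local field = '"' * lpeg.Cs(((lpeg.P(1) - '"') + lpeg.P'""' / '"')^0) * '"' +
                    lpeg.C((1 - lpeg.S',\n"')^0)

local record = field * (',' * field)^0 * (lpeg.P'\n' + -1)
local record_t = lpeg.Ct(field * (',' * field)^0) * (lpeg.P'\n' + -1)

local line = 'a,"b,c","say ""hi"""'

-- Multiple results: one value per field.
local f1, f2, f3 = record:match(line)
print(f1, f2, f3)        --> a   b,c   say "hi"

-- Table capture: a single table with all fields.
local fields = record_t:match(line)
print(#fields)           --> 3
```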


UTF-8 and Latin 1

Using LPeg to convert a string from UTF-8 encoding to Latin 1 (ISO 8859-1) is not difficult:

The code is as follows:
-- convert a two-byte UTF-8 sequence to a Latin 1 character
local function f2 (s)
  local c1, c2 = string.byte(s, 1, 2)
  return string.char(c1 * 64 + c2 - 12416)
end

local utf8 = lpeg.R("\0\127")
           + lpeg.R("\194\195") * lpeg.R("\128\191") / f2

local decode_pattern = lpeg.Cs(utf8^0) * -1

In this code, the definition of UTF-8 is already restricted to the Latin 1 range (from 0 to 255). Any encoding outside this range (as well as any invalid encoding) will not match the pattern.

Because the definition of decode_pattern requires the pattern to match the whole input (because of the -1 at its end), any invalid string will simply fail to match, without any useful information about the problem. We can improve this by redefining decode_pattern as follows:

The code is as follows:
local function er (_, i) error("invalid encoding at position " .. i) end

local decode_pattern = lpeg.Cs(utf8^0) * (-1 + lpeg.P(er))

Now, if the pattern utf8^0 stops before the end of the string, an appropriate error function is called.
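
The sketch below puts the pieces together and runs the converter on one valid and one invalid input. The sample strings are assumptions chosen for illustration; pcall is used so that the error raised by er can be inspected instead of aborting the program.

```lua
local lpeg = require "lpeg"

-- convert a two-byte UTF-8 sequence to a Latin 1 character
local function f2 (s)
  local c1, c2 = string.byte(s, 1, 2)
  return string.char(c1 * 64 + c2 - 12416)
end

local utf8 = lpeg.R("\0\127")
           + lpeg.R("\194\195") * lpeg.R("\128\191") / f2

local function er (_, i) error("invalid encoding at position " .. i) end
local decode_pattern = lpeg.Cs(utf8^0) * (-1 + lpeg.P(er))

-- "caf" followed by U+00E9, encoded in UTF-8 as the bytes 195 169.
local latin1 = decode_pattern:match("caf\195\169")
print(#latin1, latin1:byte(4))   --> 4   233

-- Byte 255 never appears in valid UTF-8, so er is called at position 3.
local ok, msg = pcall(lpeg.match, decode_pattern, "ab\255cd")
print(ok)                        --> false
```

The message in msg ends with "invalid encoding at position 3"; Lua may prefix it with the source location.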

UTF-8 and Unicode

We can extend the previous patterns to handle all Unicode code points. Of course, we cannot translate them to Latin 1 or any other one-byte encoding. Instead, the translation produces a sequence of numbers, one for each code point. Here is the complete code:
 

The code is as follows:
-- decode a two-byte UTF-8 sequence
local function f2 (s)
  local c1, c2 = string.byte(s, 1, 2)
  return c1 * 64 + c2 - 12416
end

-- decode a three-byte UTF-8 sequence
local function f3 (s)
  local c1, c2, c3 = string.byte(s, 1, 3)
  return (c1 * 64 + c2) * 64 + c3 - 925824
end

-- decode a four-byte UTF-8 sequence
local function f4 (s)
  local c1, c2, c3, c4 = string.byte(s, 1, 4)
  return ((c1 * 64 + c2) * 64 + c3) * 64 + c4 - 63447168
end

local cont = lpeg.R("\128\191")   -- continuation byte

-- the three- and four-byte alternatives use the f3 and f4 helpers above
local utf8 = lpeg.R("\0\127") / string.byte
           + lpeg.R("\194\223") * cont / f2
           + lpeg.R("\224\239") * cont * cont / f3
           + lpeg.R("\240\244") * cont * cont * cont / f4

local decode_pattern = lpeg.Ct(utf8^0) * -1
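
As a check on the arithmetic in the helper functions, the following sketch decodes one sequence of each length. The helpers are written compactly here, and the three- and four-byte alternatives are an assumption that follows from the f3 and f4 helpers; the sample bytes encode U+0061, U+00E9, U+20AC, and U+1F600.

```lua
local lpeg = require "lpeg"

local function f2 (s)
  local c1, c2 = string.byte(s, 1, 2)
  return c1 * 64 + c2 - 12416
end
local function f3 (s)
  local c1, c2, c3 = string.byte(s, 1, 3)
  return (c1 * 64 + c2) * 64 + c3 - 925824
end
local function f4 (s)
  local c1, c2, c3, c4 = string.byte(s, 1, 4)
  return ((c1 * 64 + c2) * 64 + c3) * 64 + c4 - 63447168
end

local cont = lpeg.R("\128\191")   -- continuation byte

local utf8 = lpeg.R("\0\127") / string.byte
           + lpeg.R("\194\223") * cont / f2
           + lpeg.R("\224\239") * cont * cont / f3
           + lpeg.R("\240\244") * cont * cont * cont / f4

local decode_pattern = lpeg.Ct(utf8^0) * -1

local cps = decode_pattern:match("a\195\169\226\130\172\240\159\152\128")
print(cps[1], cps[2], cps[3], cps[4])   --> 97   233   8364   128512
```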

 

Lua's long strings

A long string in Lua starts with the pattern [=*[ and ends at the first occurrence of ]=*] with exactly the same number of equal signs. If the opening bracket is followed by a newline, this newline is discarded (that is, it is not part of the string).

To match a long string in Lua, the pattern must capture the first repetition of equal signs and then, whenever it finds a candidate for closing the string, check whether it has the same number of equal signs.

The code is as follows:
equals = lpeg.P"="^0
open = "[" * lpeg.Cg(equals, "init") * "[" * lpeg.P"\n"^-1
close = "]" * lpeg.C(equals) * "]"
closeeq = lpeg.Cmt(close * lpeg.Cb("init"), function (s, i, a, b) return a == b end)
-- note: this global assignment shadows Lua's standard string library
string = open * lpeg.C((lpeg.P(1) - closeeq)^0) * close / 1

The open pattern matches [=*[, capturing the repetitions of equal signs in a group named init; it also discards an optional newline, if present. The close pattern matches ]=*], also capturing the repetitions of equal signs. The closeeq pattern first matches close; then it uses a back capture to recover the capture made by the previous open, which is named init; finally, it uses a match-time capture to check whether both captures are equal. The string pattern starts with open, then goes as far as possible until it matches closeeq, and then matches the final close. The final numbered capture simply discards the capture made by close.
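
The behaviour described above can be tried on a few subjects. In this sketch all patterns are local and the final pattern is named longstring (a name chosen here, so that Lua's string library is not shadowed).

```lua
local lpeg = require "lpeg"

local equals = lpeg.P"="^0
local open = "[" * lpeg.Cg(equals, "init") * "[" * lpeg.P"\n"^-1
local close = "]" * lpeg.C(equals) * "]"
local closeeq = lpeg.Cmt(close * lpeg.Cb("init"),
                         function (s, i, a, b) return a == b end)
local longstring = open * lpeg.C((lpeg.P(1) - closeeq)^0) * close / 1

-- Closers with the wrong number of equal signs are part of the content.
print(longstring:match("[==[\nhello ]] ]=] world]==]"))  --> hello ]] ]=] world
print(longstring:match("[[plain]]"))                      --> plain
-- No closer with exactly one equal sign: the match fails.
print(longstring:match("[=[unterminated]]"))              --> nil
```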

Arithmetic expressions

This example is a complete parser and evaluator for simple arithmetic expressions. We write it in two styles.

The first approach first builds a syntax tree and then traverses this tree to compute the value of the expression:

The code is as follows:
-- Lexical elements
local Space = lpeg.S(" \n\t")^0
local Number = lpeg.C(lpeg.P"-"^-1 * lpeg.R("09")^1) * Space
local TermOp = lpeg.C(lpeg.S("+-")) * Space
local FactorOp = lpeg.C(lpeg.S("*/")) * Space
local Open = "(" * Space
local Close = ")" * Space

-- Grammar
local Exp, Term, Factor = lpeg.V"Exp", lpeg.V"Term", lpeg.V"Factor"
G = lpeg.P{ Exp,
  Exp = lpeg.Ct(Term * (TermOp * Term)^0);
  Term = lpeg.Ct(Factor * (FactorOp * Factor)^0);
  Factor = Number + Open * Exp * Close;
}

G = Space * G * -1

-- Evaluator
function eval (x)
  if type(x) == "string" then
    return tonumber(x)
  else
    local op1 = eval(x[1])
    for i = 2, #x, 2 do
      local op = x[i]
      local op2 = eval(x[i + 1])
      if (op == "+") then op1 = op1 + op2
      elseif (op == "-") then op1 = op1 - op2
      elseif (op == "*") then op1 = op1 * op2
      elseif (op == "/") then op1 = op1 / op2
      end
    end
    return op1
  end
end

-- parse/evaluate
function evalExp (s)
  local t = lpeg.match(G, s)
  if not t then error("syntax error", 2) end
  return eval(t)
end

-- Usage example
print(evalExp"3 + 5*9 / (1+1) - 12")   --> 13.5
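
To see what the evaluator receives, it can help to print the syntax tree itself. The sketch below repeats the grammar and adds a hypothetical helper, show, that renders the nested tables as text; show is not part of the example above.

```lua
local lpeg = require "lpeg"

local Space = lpeg.S(" \n\t")^0
local Number = lpeg.C(lpeg.P"-"^-1 * lpeg.R("09")^1) * Space
local TermOp = lpeg.C(lpeg.S("+-")) * Space
local FactorOp = lpeg.C(lpeg.S("*/")) * Space
local Open = "(" * Space
local Close = ")" * Space

local Exp, Term, Factor = lpeg.V"Exp", lpeg.V"Term", lpeg.V"Factor"
local G = lpeg.P{ Exp,
  Exp = lpeg.Ct(Term * (TermOp * Term)^0);
  Term = lpeg.Ct(Factor * (FactorOp * Factor)^0);
  Factor = Number + Open * Exp * Close;
}
G = Space * G * -1

-- Hypothetical helper: render a tree of tables and strings as text.
local function show (x)
  if type(x) == "string" then return x end
  local parts = {}
  for i, v in ipairs(x) do parts[i] = show(v) end
  return "{" .. table.concat(parts, " ") .. "}"
end

print(show(G:match("1 + 2*3")))     --> {{1} + {2 * 3}}
print(show(G:match("(1+2) * 3")))   --> {{{{1} + {2}} * 3}}
```

Every Exp and every Term produces its own table, even when it holds a single operand, which is why a lone number appears as {1}.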

The second style computes the value of the expression on the fly, without building a syntax tree. The following grammar takes this approach (we assume the same lexical elements defined previously):

The code is as follows:
-- Helper function
function eval (v1, op, v2)
  if (op == "+") then return v1 + v2
  elseif (op == "-") then return v1 - v2
  elseif (op == "*") then return v1 * v2
  elseif (op == "/") then return v1 / v2
  end
end

-- Grammar
local V = lpeg.V
G = lpeg.P{ "Exp",
  Exp = lpeg.Cf(V"Term" * lpeg.Cg(TermOp * V"Term")^0, eval);
  Term = lpeg.Cf(V"Factor" * lpeg.Cg(FactorOp * V"Factor")^0, eval);
  Factor = Number / tonumber + Open * V"Exp" * Close;
}

-- Usage example
print(lpeg.match(G, "3 + 5*9 / (1+1) - 12"))   --> 13.5

Note the use of the fold (accumulator) capture. To compute the value of an expression, the accumulator starts with the value of the first term, and then, for each repetition, applies eval to the accumulator, the operator, and the new term.
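
The order in which the fold capture calls its function can be made visible with a much smaller pattern. This sketch handles only addition and subtraction of integers; the names step and trace are assumptions introduced to record each call.

```lua
local lpeg = require "lpeg"

local number = lpeg.R("09")^1 / tonumber
local op = lpeg.C(lpeg.S("+-"))

local trace = {}

-- Called once per repetition with the accumulator, operator, and operand.
local function step (acc, o, v)
  trace[#trace + 1] = acc .. o .. v
  if o == "+" then return acc + v else return acc - v end
end

-- The group capture packs each (operator, operand) pair into one capture.
local expr = lpeg.Cf(number * lpeg.Cg(op * number)^0, step)

local result = expr:match("10-3+2")
print(result)                        --> 9
print(table.concat(trace, " ; "))    --> 10-3 ; 7+2
```

The first capture (10) seeds the accumulator, and each later group is folded in from left to right, which gives the usual left-associative evaluation.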