Using a Pattern
This example shows a very simple but complete program that builds and uses a pattern:
local lpeg = require "lpeg"
-- matches a word followed by end-of-string
p = lpeg.R"az"^1 * -1
print(p:match("hello")) --> 6
print(lpeg.match(p, "hello")) --> 6
print(p:match("1 hello")) --> nil
The pattern is simply a sequence of one or more lowercase letters followed by the end of the string (-1). The program calls match both as a method and as a function. In both successful cases, match returns the index of the first character after the match, which is the string length plus one.
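To make the return value concrete, the sketch below applies the same kind of pattern to a longer subject. It assumes the lpeg module is installed and loadable with require; the names word and whole are illustrative only.

```lua
local lpeg = require "lpeg"

-- one or more lowercase letters, not anchored at the end
local word = lpeg.R"az"^1
-- the same pattern, but the match must reach the end of the subject
local whole = word * -1

print(word:match("hello world"))     --> 6 (position right after "hello")
print(whole:match("hello world"))    --> nil (a space follows "hello")
print(word:match("hello world", 7))  --> 12 (matching starts at index 7)
```

The optional second argument to match is the position where matching starts.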
This example parses a list of name-value pairs and returns a table with those pairs:
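lpeg.locale(lpeg) -- adds locale entries into the 'lpeg' table
local space = lpeg.space^0
local name = lpeg.C(lpeg.alpha^1) * space
local sep = lpeg.S(",;") * space
local pair = lpeg.Cg(name * "=" * space * name) * sep^-1
local list = lpeg.Cf(lpeg.Ct("") * pair^0, rawset)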
t = list:match("a=b, c = hi; next = pi") --> { a = "b", c = "hi", next = "pi" }
Each pair has the format name = name followed by an optional separator (a comma or a semicolon). The pair pattern encloses the pair in a group capture, so that the names become the values of a single capture. The list pattern then folds these captures. It starts with an empty table, created by a table capture matching an empty string; then, for each capture (a pair of names), it applies rawset over the accumulator (the table) and the capture values (the pair of names). rawset returns the table itself, so the accumulator is always the table.
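The folding step may be easier to see in isolation with a numeric accumulator. The following is a minimal sketch, assuming the lpeg module is installed; number, numbers, add, and sum are illustrative names.

```lua
local lpeg = require "lpeg"

-- a run of digits, converted to a number by tonumber
local number = lpeg.R"09"^1 / tonumber
-- a comma-separated list of numbers
local numbers = number * ("," * number)^0

-- folding function: receives the accumulator and the next capture
local function add (acc, value) return acc + value end

-- the first capture seeds the accumulator; add is applied to each later one
local sum = lpeg.Cf(numbers, add)
print(sum:match("10,30,43"))  --> 83
```

In the name-value example, the table capture plays the role of the first number and rawset plays the role of add.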
The following code builds a pattern that splits a string using a given pattern sep as a separator:
function split (s, sep)
  sep = lpeg.P(sep)
  local elem = lpeg.C((1 - sep)^0)
  local p = elem * (sep * elem)^0
  return lpeg.match(p, s)
end
First the function ensures that sep is a proper pattern. The pattern elem is a repetition of zero or more arbitrary characters, as long as there is no match against the separator. It also captures its match. The pattern p matches a list of elements separated by sep.
If the split results in too many values, it may overflow the maximum number of values that a Lua function can return. In this case, we can collect these values in a table:
function split (s, sep)
  sep = lpeg.P(sep)
  local elem = lpeg.C((1 - sep)^0)
  local p = lpeg.Ct(elem * (sep * elem)^0) -- make a table capture
  return lpeg.match(p, s)
end
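As a usage sketch (assuming the lpeg module is installed), the table-returning version can be called as follows. Note that two adjacent separators produce an empty element.

```lua
local lpeg = require "lpeg"

local function split (s, sep)
  sep = lpeg.P(sep)
  local elem = lpeg.C((1 - sep)^0)
  local p = lpeg.Ct(elem * (sep * elem)^0)
  return lpeg.match(p, s)
end

local parts = split("a,b,,c", ",")
print(#parts, table.concat(parts, "|"))  --> 4   a|b||c
```

Because sep is passed through lpeg.P, it may be a plain string (as here) or any LPeg pattern.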
Searching for a pattern
The primitive match works only in anchored mode. If we want to find a pattern anywhere in a string, we must write a pattern that matches anywhere.
Because patterns are composable, we can write a function that, given an arbitrary pattern p, returns a new pattern that searches for p anywhere in a string. There are several ways to do the search. One way is like this:
function anywhere (p)
  return lpeg.P{ p + 1 * lpeg.V(1) }
end
This grammar has a straightforward reading: it matches p, or it skips one character and tries again.
If we want to know where the pattern is in the string (instead of knowing only that it is there somewhere), we can add position captures to the pattern:
local I = lpeg.Cp()
function anywhere (p)
  return lpeg.P{ I * p * I + 1 * lpeg.V(1) }
end
print(anywhere("world"):match("hello world!")) --> 7 12
Another option for the search is like this:
local I = lpeg.Cp()
function anywhere (p)
  return (1 - lpeg.P(p))^0 * I * p * I
end
Again the pattern has a straightforward reading: it skips as many characters as possible while not matching p, and then matches p (plus the appropriate position captures).
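Either version of anywhere is used the same way. The sketch below uses the second one and assumes the lpeg module is installed; first and last are illustrative names.

```lua
local lpeg = require "lpeg"

local I = lpeg.Cp()

local function anywhere (p)
  return (1 - lpeg.P(p))^0 * I * p * I
end

-- the two position captures delimit the match: "world" spans indices 7..11
local first, last = anywhere("world"):match("hello world!")
print(first, last)                             --> 7   12
print(anywhere("moon"):match("hello world!"))  --> nil
```

The second position is the index just past the match, so the matched text is the substring from first to last - 1.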
If we want to look for a pattern only at word boundaries, we can use the following transformer:
local t = lpeg.locale()
function atwordboundary (p)
  return lpeg.P{
    [1] = p + t.alpha^0 * (1 - t.alpha)^1 * lpeg.V(1)
  }
end
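A short sketch of the transformer in use, assuming the lpeg module is installed and a locale in which alpha covers the ASCII letters; the name cat is illustrative.

```lua
local lpeg = require "lpeg"

local t = lpeg.locale()

local function atwordboundary (p)
  return lpeg.P{
    [1] = p + t.alpha^0 * (1 - t.alpha)^1 * lpeg.V(1)
  }
end

-- a position capture followed by the literal "cat"
local cat = atwordboundary(lpeg.Cp() * "cat")
print(cat:match("concat cat"))  --> 8 (the "cat" inside "concat" is skipped)
print(cat:match("concat"))      --> nil
```

When p fails, the grammar skips the rest of the current word and the non-letters after it, so p is only ever tried at the start of a word.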
Balanced parentheses
The following pattern matches only strings with balanced parentheses:
b = lpeg.P{ "(" * ((1 - lpeg.S"()") + lpeg.V(1))^0 * ")" }
Reading the first (and only) rule of the given grammar, a balanced string is an open parenthesis, followed by zero or more repetitions of either a non-parenthesis character or a balanced string (lpeg.V(1)), followed by a closing parenthesis.
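The grammar described above can be exercised as in the following sketch, which assumes the lpeg module is installed and writes the rule out directly from that description.

```lua
local lpeg = require "lpeg"

-- "(" , then any mix of non-parenthesis characters and nested balanced
-- strings, then ")"
local b = lpeg.P{ "(" * ((1 - lpeg.S"()") + lpeg.V(1))^0 * ")" }

print(b:match("(a(b)c)"))  --> 8
print(b:match("(a)b"))     --> 4 (anchored at the start, need not reach the end)
print(b:match("(a(b)"))    --> nil
```

To require that the whole subject is balanced, append -1 to the pattern.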
Global substitution
The next example does a job somewhat similar to string.gsub. It receives a subject string, a pattern, and a replacement value, and substitutes the replacement value for all occurrences of the pattern in the given string:
function gsub (s, patt, repl)
  patt = lpeg.P(patt)
  patt = lpeg.Cs((patt / repl + 1)^0)
  return lpeg.match(patt, s)
end
As in string.gsub, the replacement value can be a string, a function, or a table.
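The three kinds of replacement value can be tried as in this sketch, which assumes the lpeg module is installed. Unlike string.gsub, this function returns only the resulting string, not a substitution count.

```lua
local lpeg = require "lpeg"

local function gsub (s, patt, repl)
  patt = lpeg.P(patt)
  patt = lpeg.Cs((patt / repl + 1)^0)
  return lpeg.match(patt, s)
end

-- string replacement
print(gsub("hello world", "o", "0"))                  --> hell0 w0rld
-- function replacement: called with the matched text
print(gsub("hello", lpeg.S"aeiou", string.upper))     --> hEllO
-- table replacement: indexed by the matched text
print(gsub("a-b", lpeg.S"ab", { a = "1", b = "2" }))  --> 1-2
```

The pattern argument may be a plain string or any LPeg pattern, since it is first converted with lpeg.P.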
Comma-separated values (CSV)
This example breaks a string into comma-separated values, returning all fields:
local field = '"' * lpeg.Cs(((lpeg.P(1) - '"') + lpeg.P'""' / '"')^0) * '"' +
              lpeg.C((1 - lpeg.S',\n"')^0)
local record = field * (',' * field)^0 * (lpeg.P'\n' + -1)
function csv (s)
  return lpeg.match(record, s)
end
A field is either a quoted field (which may contain any character except an individual quote, which may be written as two quotes that are replaced by one) or an unquoted field (which cannot contain commas, newlines, or quotes). A record is a list of fields separated by commas, ending with a newline or the end of the string (-1).
As it is, the previous pattern returns each field as a separate result. If we wrap the list of fields in the definition of record in a table capture (lpeg.Ct), the pattern instead returns a single table containing all fields.
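A minimal sketch of the table-returning variant, assuming the lpeg module is installed. The quoted-field alternative is written out from the description above, and fields is an illustrative name.

```lua
local lpeg = require "lpeg"

-- quoted field (with "" standing for one quote) or unquoted field
local field = '"' * lpeg.Cs(((lpeg.P(1) - '"') + lpeg.P'""' / '"')^0) * '"' +
              lpeg.C((1 - lpeg.S',\n"')^0)

-- table capture around the list of fields
local record = lpeg.Ct(field * (',' * field)^0) * (lpeg.P'\n' + -1)

local function csv (s)
  return lpeg.match(record, s)
end

local fields = csv('a,"b ""q"" c",d')
print(#fields, fields[2])  --> 3   b "q" c
```

The table capture covers only the fields, so the record terminator is still matched outside the table.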
UTF-8 and Latin 1
It is not difficult to use LPeg to convert a string from UTF-8 encoding to Latin 1 (ISO 8859-1):
-- convert a two-byte UTF-8 sequence to a Latin 1 character
local function f2 (s)
  local c1, c2 = string.byte(s, 1, 2)
  return string.char(c1 * 64 + c2 - 12416)
end
local utf8 = lpeg.R("\0\127")
           + lpeg.R("\194\195") * lpeg.R("\128\191") / f2
local decode_pattern = lpeg.Cs(utf8^0) * -1
In this code, the definition of UTF-8 is already restricted to the Latin 1 range (from 0 to 255). Any encoding outside this range (as well as any invalid encoding) will not match that pattern.
As the definition of decode_pattern demands that the pattern matches the whole input (because of the -1 at its end), any invalid string will simply fail to match, without any useful information about the problem. We can improve this situation by redefining decode_pattern as follows:
local function er (_, i) error("invalid encoding at position " .. i) end
local decode_pattern = lpeg.Cs(utf8^0) * (-1 + lpeg.P(er))
Now, if the pattern utf8^0 stops before the end of the string, an appropriate error function is called.
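The following sketch runs the converter on one valid and one invalid input. It assumes the lpeg module is installed; the error message text and the names ok and msg are illustrative.

```lua
local lpeg = require "lpeg"

-- convert a two-byte UTF-8 sequence to a Latin 1 character
local function f2 (s)
  local c1, c2 = string.byte(s, 1, 2)
  return string.char(c1 * 64 + c2 - 12416)
end

local utf8 = lpeg.R("\0\127")
           + lpeg.R("\194\195") * lpeg.R("\128\191") / f2

-- called as a match-time function with the subject and current position
local function er (_, i) error("invalid encoding at position " .. i) end

local decode_pattern = lpeg.Cs(utf8^0) * (-1 + lpeg.P(er))

-- "\195\169" is U+00E9 in UTF-8; "\233" is the same character in Latin 1
print(decode_pattern:match("caf\195\169") == "caf\233")  --> true

-- the byte "\255" never appears in valid UTF-8, so er is called
local ok, msg = pcall(lpeg.match, decode_pattern, "ab\255")
print(ok)  --> false
```

The offset 12416 is 194 * 64 + 128 - 128, that is, the value that maps the first valid two-byte sequence "\194\128" to code 128.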
UTF-8 and Unicode
We can extend the previous patterns to handle all Unicode code points. Of course, we cannot translate them to Latin 1 or any other one-byte encoding. Instead, our translation results in an array with the code points represented as numbers. The full code is here:
-- decode a two-byte UTF-8 sequence
local function f2 (s)
  local c1, c2 = string.byte(s, 1, 2)
  return c1 * 64 + c2 - 12416
end
-- decode a three-byte UTF-8 sequence
local function f3 (s)
  local c1, c2, c3 = string.byte(s, 1, 3)
  return (c1 * 64 + c2) * 64 + c3 - 925824
end
-- decode a four-byte UTF-8 sequence
local function f4 (s)
  local c1, c2, c3, c4 = string.byte(s, 1, 4)
  return ((c1 * 64 + c2) * 64 + c3) * 64 + c4 - 63447168
end
local cont = lpeg.R("\128\191") -- continuation byte
local utf8 = lpeg.R("\0\127") / string.byte
           + lpeg.R("\194\223") * cont / f2
           + lpeg.R("\224\239") * cont * cont / f3
           + lpeg.R("\240\244") * cont * cont * cont / f4
local decode_pattern = lpeg.Ct(utf8^0) * -1
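As a check on the arithmetic, the sketch below decodes one character of each encoded length. It assumes the lpeg module is installed; the lead-byte ranges for three-byte and four-byte sequences are the standard UTF-8 ranges, and cps is an illustrative name.

```lua
local lpeg = require "lpeg"

local function f2 (s)
  local c1, c2 = string.byte(s, 1, 2)
  return c1 * 64 + c2 - 12416
end

local function f3 (s)
  local c1, c2, c3 = string.byte(s, 1, 3)
  return (c1 * 64 + c2) * 64 + c3 - 925824
end

local function f4 (s)
  local c1, c2, c3, c4 = string.byte(s, 1, 4)
  return ((c1 * 64 + c2) * 64 + c3) * 64 + c4 - 63447168
end

local cont = lpeg.R("\128\191")   -- continuation byte

local utf8 = lpeg.R("\0\127") / string.byte
           + lpeg.R("\194\223") * cont / f2
           + lpeg.R("\224\239") * cont * cont / f3
           + lpeg.R("\240\244") * cont * cont * cont / f4

local decode_pattern = lpeg.Ct(utf8^0) * -1

-- U+0041, U+00E9, U+20AC, U+1F600 encoded in 1, 2, 3, and 4 bytes
local cps = decode_pattern:match("A\195\169\226\130\172\240\159\152\128")
print(table.concat(cps, " "))  --> 65 233 8364 128512
```

Each subtracted constant removes the marker bits of the lead and continuation bytes in one step, so the functions need no bit operations.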
Lua's long strings
A long string in Lua starts with the pattern [=*[ and ends at the first occurrence of ]=*] with exactly the same number of equal signs. If the opening brackets are followed by a newline, this newline is discarded (that is, it is not part of the string).
To match a long string in Lua, the pattern must capture the first repetition of equal signs and then, whenever it finds a candidate for closing the string, check whether it has the same number of equal signs.
equals = lpeg.P"="^0
open = "[" * lpeg.Cg(equals, "init") * "[" * lpeg.P"\n"^-1
close = "]" * lpeg.C(equals) * "]"
closeeq = lpeg.Cmt(close * lpeg.Cb("init"), function (s, i, a, b) return a == b end)
string = open * lpeg.C((lpeg.P(1) - closeeq)^0) * close / 1
The open pattern matches [=*[, capturing the repetitions of equal signs in a group named init; it also discards an optional newline, if present. The close pattern matches ]=*], also capturing the repetitions of equal signs. The closeeq pattern first matches close; then it uses a back capture to recover the capture made by the previous open, which is named init; finally it uses a match-time capture to check whether both captures are equal. The string pattern starts with an open, then it goes as far as possible until matching closeeq, and then matches the final close. The final numbered capture simply discards the capture made by close.
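The sketch below runs these patterns on two inputs. It assumes the lpeg module is installed. The final pattern is named longstring here (an illustrative name) so that it does not shadow Lua's string library.

```lua
local lpeg = require "lpeg"

local equals = lpeg.P"="^0
local open = "[" * lpeg.Cg(equals, "init") * "[" * lpeg.P"\n"^-1
local close = "]" * lpeg.C(equals) * "]"
-- succeeds only when the closing run of "=" equals the opening one
local closeeq = lpeg.Cmt(close * lpeg.Cb("init"),
                         function (s, i, a, b) return a == b end)
local longstring = open * lpeg.C((lpeg.P(1) - closeeq)^0) * close / 1

-- the inner "]]" has the wrong level and is kept as content
print(longstring:match("[==[\nhello ]] world]==]"))  --> hello ]] world
-- no closing bracket with one "=" exists, so the match fails
print(longstring:match("[=[unterminated]]"))          --> nil
```

The leading newline after the opening brackets is consumed by open, so it does not appear in the captured content.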
Arithmetic expressions
This example is a complete parser and evaluator for simple arithmetic expressions. We write it in two styles.
The first approach first builds a syntax tree and then traverses this tree to compute the expression value:
-- Lexical elements
local Space = lpeg.S(" \n\t")^0
local Number = lpeg.C(lpeg.P"-"^-1 * lpeg.R("09")^1) * Space
local TermOp = lpeg.C(lpeg.S("+-")) * Space
local FactorOp = lpeg.C(lpeg.S("*/")) * Space
local Open = "(" * Space
local Close = ")" * Space
-- Grammar
local Exp, Term, Factor = lpeg.V"Exp", lpeg.V"Term", lpeg.V"Factor"
G = lpeg.P{ Exp,
  Exp = lpeg.Ct(Term * (TermOp * Term)^0);
  Term = lpeg.Ct(Factor * (FactorOp * Factor)^0);
  Factor = Number + Open * Exp * Close;
}
G = Space * G * -1
-- Evaluator
function eval (x)
if type(x) == "string" then
return tonumber(x)
else
local op1 = eval(x[1])
for i = 2, #x, 2 do
local op = x[i]
local op2 = eval(x[i + 1])
if (op == "+") then op1 = op1 + op2
elseif (op == "-") then op1 = op1 - op2
elseif (op == "*") then op1 = op1 * op2
elseif (op == "/") then op1 = op1 / op2
end
end
return op1
end
end
-- parse/evaluate
function evalExp (s)
local t = lpeg.match(G, s)
if not t then error("syntax error", 2) end
return eval(t)
end
-- Use examples
print(evalExp"3 + 5*9 / (1+1) - 12") --> 13.5
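To see what the evaluator traverses, the sketch below prints part of the syntax tree for a small expression. It assumes the lpeg module is installed and uses the rule name "Exp" as the initial rule; tree is an illustrative name.

```lua
local lpeg = require "lpeg"

local Space = lpeg.S(" \n\t")^0
local Number = lpeg.C(lpeg.P"-"^-1 * lpeg.R("09")^1) * Space
local TermOp = lpeg.C(lpeg.S("+-")) * Space
local FactorOp = lpeg.C(lpeg.S("*/")) * Space
local Open = "(" * Space
local Close = ")" * Space

local V = lpeg.V
local G = lpeg.P{ "Exp",
  Exp = lpeg.Ct(V"Term" * (TermOp * V"Term")^0);
  Term = lpeg.Ct(V"Factor" * (FactorOp * V"Factor")^0);
  Factor = Number + Open * V"Exp" * Close;
}
G = Space * G * -1

-- "1 + 2*3" parses to { {"1"}, "+", {"2", "*", "3"} }
local tree = lpeg.match(G, "1 + 2*3")
print(#tree, tree[2], table.concat(tree[3], " "))  --> 3   +   2 * 3
```

Operands sit at odd indices and operators at even indices, which is why the evaluator's loop steps through the table two entries at a time.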
The second style computes the expression value on the fly, without building the syntax tree. The following grammar takes this approach (we assume the same lexical elements as before):
function eval (v1, op, v2)
if (op == "+") then return v1 + v2
elseif (op == "-") then return v1 - v2
elseif (op == "*") then return v1 * v2
elseif (op == "/") then return v1 / v2
end
end
-- grammar
local V = lpeg.V
G = lpeg.P{ "Exp",
  Exp = lpeg.Cf(V"Term" * lpeg.Cg(TermOp * V"Term")^0, eval);
  Term = lpeg.Cf(V"Factor" * lpeg.Cg(FactorOp * V"Factor")^0, eval);
  Factor = Number / tonumber + Open * V"Exp" * Close;
}
-- small example
print(lpeg.match(G, "3 + 5*9 / (1+1) - 12")) --> 13.5
Note the use of the fold (accumulator) capture. To compute the value of an expression, the accumulator starts with the value of the first term, and then, for each repetition, applies eval over the accumulator, the operator, and the new term.
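The order in which the fold applies its function can be checked with a reduced grammar that handles only addition and subtraction. This is a minimal sketch, assuming the lpeg module is installed; Op, step, and Sum are illustrative names.

```lua
local lpeg = require "lpeg"

local Space = lpeg.S(" \n\t")^0
local Number = lpeg.C(lpeg.R("09")^1) / tonumber * Space
local Op = lpeg.C(lpeg.S("+-")) * Space

-- called once per group with the accumulator, the operator, and the operand
local function step (acc, op, value)
  if op == "+" then return acc + value else return acc - value end
end

-- each (operator, operand) pair is grouped so that it reaches step together
local Sum = lpeg.Cf(Number * lpeg.Cg(Op * Number)^0, step) * -1

print(Sum:match("10 - 3 - 2"))  --> 5, that is (10 - 3) - 2
```

Because the accumulator is updated from left to right, operators of equal precedence come out left-associative without any extra work in the grammar.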