mirror of https://github.com/yhirose/cpp-peglib.git synced 2026-07-07 19:52:49 -04:00

cpp-peglib

C++17 header-only PEG (Parsing Expression Grammars) library. You can start using it right away just by including peglib.h in your project.

Since this library only supports C++17 compilers, please make sure that the compiler option -std=c++17 is enabled. (/std:c++17 /Zc:__cplusplus for MSVC)

You can also try the online version, PEG Playground at https://yhirose.github.io/cpp-peglib.

The PEG syntax is well described on page 2 in the document by Bryan Ford. cpp-peglib also supports the following additional syntax for now:

'...'i (Case-insensitive literal operator)
[...]i (Case-insensitive character class operator)
[^...] (Negated character class operator)
[^...]i (Case-insensitive negated character class operator)
{2,5} (Regex-like repetition operator)
< ... > (Token boundary operator)
~ (Ignore operator)
\x20 (Hex number char)
\u10FFFF (Unicode char)
[\d\w\s], [\D\W\S] (Predefined character classes in a character class, ASCII semantics)
[[:alpha:]], [[:^alpha:]] (POSIX character classes in a character class, ASCII semantics: alnum, alpha, ascii, blank, cntrl, digit, graph, lower, print, punct, space, upper, word, xdigit)
%whitespace (Automatic whitespace skipping)
%word (Word expression)
$name( ... ) (Capture scope operator)
$name< ... > (Named capture operator)
$name (Backreference operator)
| (Dictionary operator)
↑ (Cut operator)
MACRO_NAME( ... ) (Parameterized rule or Macro)
{ precedence L - + L / * } (Parsing infix expression)
Left recursive rules (direct, indirect, and mutual left recursion)
%recovery( ... ) (Error recovery operator)
exp⇑label or exp^label (Syntax sugar for (exp / %recover(label)))
label { error_message "..." } (Error message instruction)
{ no_ast_opt } (No AST node optimization instruction)
{ no_whitespace } (Disable %whitespace skipping inside the rule)
{ ast_name: NodeTag } (AST node name override instruction)

The 'End of Input' check is performed by default. To disable it, call disable_eoi_check.

This library supports the linear-time parsing known as the Packrat parsing. It also supports left recursive grammars (direct, indirect, and mutual) via a seed-growing algorithm, allowing natural expression of left-associative operators.

IMPORTANT NOTE for some Linux distributions such as Ubuntu and CentOS: the -pthread option is needed when linking. See #23, #46 and #62.

I am sure that you will enjoy this excellent "Practical parsing with PEG and cpp-peglib" article by bert hubert!

How to use

This is a simple calculator sample. It shows how to define grammar, associate semantic actions to the grammar, and handle semantic values.

// (1) Include the header file
#include <peglib.h>
#include <assert.h>
#include <iostream>

using namespace peg;
using namespace std;

int main(void) {
  // (2) Make a parser
  parser parser(R"(
    # Grammar for Calculator...
    Additive    <- Multiplicative '+' Additive / Multiplicative
    Multiplicative   <- Primary '*' Multiplicative / Primary
    Primary     <- '(' Additive ')' / Number
    Number      <- < [0-9]+ >
    %whitespace <- [ \t]*
  )");

  assert(static_cast<bool>(parser) == true);

  // (3) Setup actions
  parser["Additive"] = [](const SemanticValues &vs) {
    switch (vs.choice()) {
    case 0: // "Multiplicative '+' Additive"
      return any_cast<int>(vs[0]) + any_cast<int>(vs[1]);
    default: // "Multiplicative"
      return any_cast<int>(vs[0]);
    }
  };

  parser["Multiplicative"] = [](const SemanticValues &vs) {
    switch (vs.choice()) {
    case 0: // "Primary '*' Multiplicative"
      return any_cast<int>(vs[0]) * any_cast<int>(vs[1]);
    default: // "Primary"
      return any_cast<int>(vs[0]);
    }
  };

  parser["Number"] = [](const SemanticValues &vs) {
    return vs.token_to_number<int>();
  };

  // (4) Parse
  parser.enable_packrat_parsing(); // Enable packrat parsing.

  int val;
  parser.parse(" (1 + 2) * 3 ", val);

  assert(val == 9);
}

To show syntax errors in grammar text:

auto grammar = R"(
  # Grammar for Calculator...
  Additive    <- Multiplicative '+' Additive / Multiplicative
  Multiplicative   <- Primary '*' Multiplicative / Primary
  Primary     <- '(' Additive ')' / Number
  Number      <- < [0-9]+ >
  %whitespace <- [ \t]*
)";

parser parser;

parser.set_logger([](size_t line, size_t col, const string& msg, const string &rule) {
  cerr << line << ":" << col << ": " << msg << "\n";
});

auto ok = parser.load_grammar(grammar);
assert(ok);

There are four semantic actions available:

[](const SemanticValues& vs, any& dt)
[](const SemanticValues& vs)
[](SemanticValues& vs, any& dt)
[](SemanticValues& vs)

SemanticValues value contains the following information:

Semantic values
Matched string information
Token information if the rule is literal or uses a token boundary operator
Choice number when the rule is 'prioritized choice'

any& dt is read-write context data that can be used for any purpose. The initial context data is set in the peg::parser::parse method.

A semantic action can return a value of arbitrary data type, which will be wrapped by peg::any. If a user returns nothing in a semantic action, the first semantic value in the const SemanticValues& vs argument will be returned. (Yacc parser has the same behavior.)

The SemanticValues structure is shown below:

struct SemanticValues : protected std::vector<any>
{
  // Input text
  const char* path;
  const char* ss;

  // Matched string
  std::string_view sv() const { return sv_; }

  // Line number and column at which the matched string is
  std::pair<size_t, size_t> line_info() const;

  // Tokens
  std::vector<std::string_view> tokens;
  std::string_view token(size_t id = 0) const;

  // Token conversion
  std::string token_to_string(size_t id = 0) const;
  template <typename T> T token_to_number() const;

  // Choice number (0 based index)
  size_t choice() const;

  // Transform the semantic value vector to another vector
  template <typename T> std::vector<T> transform(size_t beg = 0, size_t end = -1) const;
};

The following example uses the < ... > operator, which is the token boundary operator.

peg::parser parser(R"(
  ROOT  <- _ TOKEN (',' _ TOKEN)*
  TOKEN <- < [a-z0-9]+ > _
  _     <- [ \t\r\n]*
)");

parser["TOKEN"] = [](const SemanticValues& vs) {
  // 'token' doesn't include trailing whitespace
  auto token = vs.token();
};

auto ret = parser.parse(" token1, token2 ");

We can ignore unnecessary semantic values from the list by using ~ operator.

peg::parser parser(R"(
  ROOT  <-  _ ITEM (',' _ ITEM _)*
  ITEM  <-  ([a-z0-9])+
  ~_    <-  [ \t]*
)");

parser["ROOT"] = [&](const SemanticValues& vs) {
  assert(vs.size() == 2); // should be 2 instead of 5.
};

auto ret = parser.parse(" item1, item2 ");

The following grammar is the same as the above.

peg::parser parser(R"(
  ROOT  <-  ~_ ITEM (',' ~_ ITEM ~_)*
  ITEM  <-  ([a-z0-9])+
  _     <-  [ \t]*
)");

Semantic predicate support is available with a predicate action.

peg::parser parser("NUMBER  <-  [0-9]+");

parser["NUMBER"] = [](const SemanticValues &vs) {
  return vs.token_to_number<long>();
};

parser["NUMBER"].predicate = [](const SemanticValues &vs,
                                const std::any & /*dt*/, std::string &msg) {
  if (vs.token_to_number<long>() != 100) {
    msg = "value error!!";
    return false;
  }
  return true;
};

long val;
auto ret = parser.parse("100", val);
assert(ret == true);
assert(val == 100);

ret = parser.parse("200", val);
assert(ret == false);

The predicate can pass data to the action via predicate_data to avoid redundant computation:

peg::parser parser("NUMBER  <-  < [0-9]+ >");

parser["NUMBER"].predicate = [](const SemanticValues &vs,
                                const std::any & /*dt*/, std::string &msg,
                                std::any &predicate_data) {
  int value;
  auto [ptr, err] = std::from_chars(
      vs.token().data(), vs.token().data() + vs.token().size(), value);
  if (err != std::errc()) {
    msg = "Number out of range.";
    return false;
  }
  predicate_data = value;
  return true;
};

parser["NUMBER"] = [](const SemanticValues & /*vs*/, std::any & /*dt*/,
                      const std::any &predicate_data) {
  return std::any_cast<int>(predicate_data);
};

enter and leave actions are also available.

parser["RULE"].enter = [](const Context &c, const char* s, size_t n, any& dt) {
  std::cout << "enter" << std::endl;
};

parser["RULE"] = [](const SemanticValues& vs, any& dt) {
  std::cout << "action!" << std::endl;
};

parser["RULE"].leave = [](const Context &c, const char* s, size_t n, size_t matchlen, any& value, any& dt) {
  std::cout << "leave" << std::endl;
};

You can receive error information via a logger:

parser.set_logger([](size_t line, size_t col, const string& msg) {
  ...
});

parser.set_logger([](size_t line, size_t col, const string& msg, const string &rule) {
  ...
});

Ignoring Whitespaces

As the first example shows, whitespace between tokens can be skipped automatically with the %whitespace rule.

The %whitespace rule applies in the following three situations:

trailing spaces on tokens
leading spaces on text
trailing spaces on literal strings in rules

These are valid tokens:

KEYWORD   <- 'keyword'
KEYWORDI  <- 'case_insensitive_keyword'i
WORD      <-  < [a-zA-Z0-9] [a-zA-Z0-9-_]* >    # token boundary operator is used.
IDENT     <-  < IDENT_START_CHAR IDENT_CHAR* >  # token boundary operator is used.

The following grammar accepts one, "two three", four.

ROOT         <- ITEM (',' ITEM)*
ITEM         <- WORD / PHRASE
WORD         <- < [a-z]+ >
PHRASE       <- < '"' (!'"' .)* '"' >

%whitespace  <-  [ \t\r\n]*

How `%whitespace` works exactly

Whitespace is skipped at exactly three points, and nowhere else:

once at the beginning of the input text
right after every matched literal string ('...', "...")
right after every closed token boundary (< ... >)

Inside a token boundary, whitespace skipping is completely disabled. Character classes ([...]) and . never skip whitespace by themselves. Knowing these rules explains most surprises with %whitespace:

A rule made only of character classes doesn't skip trailing whitespace. (#327) With the grammar below, main() parses but main () doesn't, because NAME is not a token — nothing skips the space after it. Wrap lexical rules in a token boundary.

DECL    <- NAME ARGS?    # `main ()` fails: whitespace after NAME is not skipped
NAME    <- [a-zA-Z_][a-zA-Z0-9_]*
ARGS    <- '(' ')'

NAME    <- < [a-zA-Z_][a-zA-Z0-9_]* >   # OK: token boundary skips trailing whitespace

A literal skips whitespace before a following predicate sees the input. (#319) In KEYWORD <- "create" !IDCHAR, the literal "create" eats the whitespace after it, so !IDCHAR tests the character of the next word and the keyword check misfires on input like create a. Put the whole thing in a token boundary, or mark the rule with { no_whitespace }, to keep the predicate right next to the literal (peglint warns about this pattern):

KEYWORD <- < "create" !IDCHAR >
KEYWORD <- "create" !IDCHAR  { no_whitespace }

To preserve whitespace locally (e.g. inside string literals), mark the rule with the { no_whitespace } instruction: whitespace skipping is disabled inside the rule and resumes after it, like a token boundary without the token capture. (#44)

StrQuot   <- '"' (StrEscape / StrChars)* '"'  { no_whitespace }
StrEscape <- '\\' .
StrChars  <- (!'"' !'\\' .)+

The same can be written with nested token boundaries — the outer < ... > disables whitespace skipping for everything inside, and the inner one selects the text for token():

StrQuot   <- < '"' < (StrEscape / StrChars)* > '"' >

Rules whose name starts with _ are hidden from error messages. If you define %whitespace in terms of sub-rules (e.g. to support comments), name them with a leading _, otherwise syntax errors report expecting <SPACE> instead of what the user actually needs to fix. (#292)

%whitespace <- (_SPACE / _COMMENT)*
_SPACE      <- [ \t\r\n]
_COMMENT    <- '#' (!'\n' .)*

Keyword-like operators need %word. In a scannerless parser, 'and' happily matches the first three letters of android. Declare %word so literals that look like words are checked against a word boundary. (#328) See the next section.

Operators in precedence instructions must be literal tokens. When using the infix-expression precedence feature, write the operators as plain literals (optionally wrapped in a token boundary in the OPERATOR rule) rather than as rules that manage whitespace themselves. (#325)
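
The literal-then-predicate surprise from the list above (#319) can be reproduced with a few lines of plain C++. This is only a sketch and does not use cpp-peglib: keyword_create and is_id_char are hypothetical helpers, and IDCHAR is assumed to be [a-zA-Z0-9_], which the text above does not define.

```cpp
#include <cctype>
#include <cstddef>
#include <string_view>

// Assumed definition of IDCHAR: [a-zA-Z0-9_].
inline bool is_id_char(char c) {
  return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
}

// Models KEYWORD <- "create" !IDCHAR at the start of `input`.
// skip_after_literal = true  : whitespace is skipped right after the literal
// skip_after_literal = false : as inside < ... > or with { no_whitespace }
inline bool keyword_create(std::string_view input, bool skip_after_literal) {
  constexpr std::string_view literal = "create";
  if (input.substr(0, literal.size()) != literal) return false;
  std::size_t pos = literal.size();
  if (skip_after_literal) {
    while (pos < input.size() && (input[pos] == ' ' || input[pos] == '\t')) ++pos;
  }
  // !IDCHAR succeeds only when the next character is not an identifier character.
  return pos >= input.size() || !is_id_char(input[pos]);
}
```

With skipping enabled, keyword_create("create a", true) is false because the predicate ends up looking at the a of the next word. With skipping disabled, the predicate sees the space and the keyword is accepted, while created is still rejected.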

Word expression

peg::parser parser(R"(
  ROOT         <-  'hello' 'world'
  %whitespace  <-  [ \t\r\n]*
  %word        <-  [a-z]+
)");

parser.parse("hello world"); // OK
parser.parse("helloworld");  // NG
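
The effect of %word on a literal can be pictured with a standalone sketch. It does not use cpp-peglib and is not how the library implements the check: match_word_literal and is_word_char are hypothetical helpers, and the word pattern is assumed to be [a-z]+ as in the grammar above.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical stand-in for one character of the %word pattern [a-z]+.
inline bool is_word_char(char c) { return c >= 'a' && c <= 'z'; }

// Returns the number of bytes consumed, or 0 when the literal does not match
// or when it is immediately followed by another word character.
inline std::size_t match_word_literal(std::string_view input, std::size_t pos,
                                      std::string_view literal) {
  if (pos > input.size()) return 0;
  if (input.substr(pos, literal.size()) != literal) return 0;
  std::size_t end = pos + literal.size();
  if (end < input.size() && is_word_char(input[end])) return 0;
  return literal.size();
}
```

In this sketch 'hello' matches at the start of hello world, but not at the start of helloworld, because the character right after the literal still belongs to a word.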

Capture/Backreference

peg::parser parser(R"(
  ROOT      <- CONTENT
  CONTENT   <- (ELEMENT / TEXT)*
  ELEMENT   <- $(STAG CONTENT ETAG)
  STAG      <- '<' $tag< TAG_NAME > '>'
  ETAG      <- '</' $tag '>'
  TAG_NAME  <- 'b' / 'u'
  TEXT      <- TEXT_DATA
  TEXT_DATA <- ![<] .
)");

parser.parse("This is <b>a <u>test</u> text</b>."); // OK
parser.parse("This is <b>a <u>test</b> text</u>."); // NG
parser.parse("This is <b>a <u>test text</b>.");     // NG

Dictionary

The | operator builds a word dictionary for fast lookup, using a trie internally. The order of the words does not matter.

START <- 'This month is ' MONTH '.'
MONTH <- 'Jan' | 'January' | 'Feb' | 'February' | '...'

We are able to find which item is matched with choice().

parser["MONTH"] = [](const SemanticValues &vs) {
  auto id = vs.choice();
};

It supports the case-insensitive mode.

START <- 'This month is ' MONTH '.'
MONTH <- 'Jan'i | 'January'i | 'Feb'i | 'February'i | '...'i
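
Why the order of words does not matter can be seen in a small trie sketch. It is written against the standard library only, and it assumes longest-match semantics; TrieNode, Dictionary, and longest_match are hypothetical names, not the library's internals.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <string_view>
#include <vector>

// Each node remembers which dictionary entry (if any) ends there.
struct TrieNode {
  std::map<char, std::unique_ptr<TrieNode>> children;
  int word_id = -1; // index into the word list, -1 when no word ends here
};

class Dictionary {
public:
  explicit Dictionary(const std::vector<std::string> &words) {
    for (std::size_t i = 0; i < words.size(); ++i) {
      TrieNode *node = &root_;
      for (char c : words[i]) {
        auto &child = node->children[c];
        if (!child) child = std::make_unique<TrieNode>();
        node = child.get();
      }
      node->word_id = static_cast<int>(i);
    }
  }

  // Returns the index of the longest word that is a prefix of `input`, or -1.
  int longest_match(std::string_view input) const {
    const TrieNode *node = &root_;
    int best = -1;
    for (char c : input) {
      auto it = node->children.find(c);
      if (it == node->children.end()) break;
      node = it->second.get();
      if (node->word_id >= 0) best = node->word_id;
    }
    return best;
  }

private:
  TrieNode root_;
};
```

With the words Jan, January, Feb, February in that order, the input January. yields index 1 even though Jan is listed first. A plain prioritized choice 'Jan' / 'January' would stop at Jan instead.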

Cut operator

The ↑ operator can mitigate the performance cost of backtracking, but it risks changing the meaning of the grammar.

S <- '(' ↑ P ')' / '"' ↑ P '"' / P
P <- 'a' / 'b' / 'c'

When we parse (z with the above grammar, we don't have to backtrack in S after ( is matched, because a cut operator is inserted there.

Parameterized Rule or Macro

# Syntax
Start      ← _ Expr
Expr       ← Sum
Sum        ← List(Product, SumOpe)
Product    ← List(Value, ProOpe)
Value      ← Number / T('(') Expr T(')')

# Token
SumOpe     ← T('+' / '-')
ProOpe     ← T('*' / '/')
Number     ← T([0-9]+)
~_         ← [ \t\r\n]*

# Macro
List(I, D) ← I (D I)*
T(x)       ← < x > _

Parsing infix expression by Precedence climbing

Regarding the precedence climbing algorithm, please see this article.

parser parser(R"(
  EXPRESSION             <-  INFIX_EXPRESSION(ATOM, OPERATOR)
  ATOM                   <-  NUMBER / '(' EXPRESSION ')'
  OPERATOR               <-  < [-+/*] >
  NUMBER                 <-  < '-'? [0-9]+ >
  %whitespace            <-  [ \t]*

  # Declare order of precedence
  INFIX_EXPRESSION(A, O) <-  A (O A)* {
    precedence
      L + -
      L * /
  }
)");

parser["INFIX_EXPRESSION"] = [](const SemanticValues& vs) -> long {
  auto result = any_cast<long>(vs[0]);
  if (vs.size() > 1) {
    auto ope = any_cast<char>(vs[1]);
    auto num = any_cast<long>(vs[2]);
    switch (ope) {
      case '+': result += num; break;
      case '-': result -= num; break;
      case '*': result *= num; break;
      case '/': result /= num; break;
    }
  }
  return result;
};
parser["OPERATOR"] = [](const SemanticValues& vs) { return *vs.sv(); };
parser["NUMBER"] = [](const SemanticValues& vs) { return vs.token_to_number<long>(); };

long val;
parser.parse(" -1 + (1 + 2) * 3 - -1", val);
assert(val == 9);

precedence instruction can be applied only to the following 'list' style rule.

Rule <- Atom (Operator Atom)* {
  precedence
    L - +
    L / *
    R ^
}

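The precedence instruction contains precedence entries. Each entry starts with an associativity, 'L' (left) or 'R' (right), followed by operator literal tokens. Entries are listed from lowest to highest precedence, so operators in later entries bind more tightly (in the calculator example above, * and / bind more tightly than + and -, which is why the result is 9).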
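
For readers who have not met precedence climbing before, here is a minimal standalone sketch of the algorithm. It is not the library's implementation: Climber, OpInfo, and op_table are hypothetical names, the operator table is hard-coded to mirror the L - + / L / * / R ^ entries above, and parentheses, unary minus, and error handling (including division by zero) are left out.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <string_view>

struct OpInfo {
  int level;        // a larger level binds more tightly
  bool right_assoc; // true for 'R' entries, false for 'L'
};

// Hard-coded table mirroring:  L + -  then  L * /  then  R ^
inline const std::map<char, OpInfo> &op_table() {
  static const std::map<char, OpInfo> table = {
      {'+', {1, false}}, {'-', {1, false}},
      {'*', {2, false}}, {'/', {2, false}},
      {'^', {3, true}}};
  return table;
}

class Climber {
public:
  explicit Climber(std::string_view src) : src_(src) {}
  long parse() { return expression(1); }

private:
  void skip_spaces() {
    while (pos_ < src_.size() && src_[pos_] == ' ') ++pos_;
  }

  // Atom: an unsigned decimal integer.
  long atom() {
    skip_spaces();
    long value = 0;
    while (pos_ < src_.size() &&
           std::isdigit(static_cast<unsigned char>(src_[pos_]))) {
      value = value * 10 + (src_[pos_] - '0');
      ++pos_;
    }
    return value;
  }

  static long apply(char op, long lhs, long rhs) {
    switch (op) {
    case '+': return lhs + rhs;
    case '-': return lhs - rhs;
    case '*': return lhs * rhs;
    case '/': return lhs / rhs;
    default: { // '^': integer power by repeated multiplication
      long result = 1;
      for (long i = 0; i < rhs; ++i) result *= lhs;
      return result;
    }
    }
  }

  // Consumes operators whose level is at least min_level.
  long expression(int min_level) {
    long lhs = atom();
    for (;;) {
      skip_spaces();
      if (pos_ >= src_.size()) break;
      auto it = op_table().find(src_[pos_]);
      if (it == op_table().end() || it->second.level < min_level) break;
      char op = it->first;
      ++pos_;
      // Left-associative operators require a strictly higher level on the right.
      int next_min = it->second.right_assoc ? it->second.level
                                            : it->second.level + 1;
      long rhs = expression(next_min);
      lhs = apply(op, lhs, rhs);
    }
    return lhs;
  }

  std::string_view src_;
  std::size_t pos_ = 0;
};
```

The only difference between 'L' and 'R' entries is the minimum level passed to the recursive call, which is what makes 8 / 4 / 2 group to the left and 2 ^ 3 ^ 2 group to the right.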

Left Recursive Grammars

cpp-peglib supports left recursive rules, which are commonly used in expression grammars to achieve left-associative operators naturally. Left recursion is automatically detected at grammar compile time and handled via a seed-growing algorithm at parse time.

parser parser(R"(
  Expr   <- Expr '+' Term / Expr '-' Term / Term
  Term   <- Term '*' Factor / Term '/' Factor / Factor
  Factor <- '(' Expr ')' / Number
  Number <- < [0-9]+ >
  %whitespace <- [ \t]*
)");

parser["Expr"] = [](const SemanticValues &vs) {
  switch (vs.choice()) {
  case 0: return any_cast<long>(vs[0]) + any_cast<long>(vs[1]);
  case 1: return any_cast<long>(vs[0]) - any_cast<long>(vs[1]);
  default: return any_cast<long>(vs[0]);
  }
};

parser["Term"] = [](const SemanticValues &vs) {
  switch (vs.choice()) {
  case 0: return any_cast<long>(vs[0]) * any_cast<long>(vs[1]);
  case 1: return any_cast<long>(vs[0]) / any_cast<long>(vs[1]);
  default: return any_cast<long>(vs[0]);
  }
};

parser["Number"] = [](const SemanticValues &vs) {
  return vs.token_to_number<long>();
};

long val;
parser.parse("1 - 2 - 3", val);
assert(val == -4);  // Left-associative: (1-2)-3 = -4

parser.parse("8 / 4 / 2", val);
assert(val == 1);   // Left-associative: (8/4)/2 = 1
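
The seed-growing idea can be sketched without the library for the single rule Expr <- Expr '-' Num / Num. This is an illustration under stated assumptions, not cpp-peglib's actual code: SeedGrower, Result, and the position-keyed memo table are hypothetical, and whitespace is not handled.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <optional>
#include <string_view>

struct Result {
  std::size_t end; // position just past the match
  long value;      // evaluated value of the match
};

class SeedGrower {
public:
  explicit SeedGrower(std::string_view src) : src_(src) {}

  // Expr <- Expr '-' Num / Num
  std::optional<Result> expr(std::size_t pos) {
    auto it = memo_.find(pos);
    if (it != memo_.end()) return it->second; // recursive call sees the seed
    memo_[pos] = std::nullopt;                // initial seed: failure
    for (;;) {
      auto attempt = expr_body(pos);
      auto &seed = memo_[pos];
      // Stop once an attempt fails or no longer consumes more input.
      if (!attempt || (seed && attempt->end <= seed->end)) break;
      seed = attempt; // grow the seed and try again
    }
    return memo_[pos];
  }

private:
  std::optional<Result> expr_body(std::size_t pos) {
    // First alternative: Expr '-' Num
    if (auto lhs = expr(pos)) {
      std::size_t p = lhs->end;
      if (p < src_.size() && src_[p] == '-') {
        if (auto rhs = num(p + 1)) {
          return Result{rhs->end, lhs->value - rhs->value};
        }
      }
    }
    // Second alternative: Num
    return num(pos);
  }

  // Num <- [0-9]+
  std::optional<Result> num(std::size_t pos) const {
    std::size_t p = pos;
    long value = 0;
    while (p < src_.size() &&
           std::isdigit(static_cast<unsigned char>(src_[p]))) {
      value = value * 10 + (src_[p] - '0');
      ++p;
    }
    if (p == pos) return std::nullopt;
    return Result{p, value};
  }

  std::string_view src_;
  std::map<std::size_t, std::optional<Result>> memo_;
};
```

For the input 1-2-3 the seed at position 0 grows from 1 to 1-2 to 1-2-3, and each step reuses the previous seed as the left operand, which is what produces the left-associative value -4.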

Direct, indirect, and mutual left recursion are all supported. For example, indirect left recursion works as expected:

A <- B 'a'
B <- A 'b' / 'b'

Left recursion support is enabled by default and adds zero overhead to non-left-recursive grammars. To disable it (reverting to the traditional error on left-recursive rules), call enable_left_recursion(false) before loading the grammar:

peg::parser parser;
parser.enable_left_recursion(false);
parser.load_grammar(grammar);

AST generation

cpp-peglib is able to generate an AST (Abstract Syntax Tree) when parsing. enable_ast method on peg::parser class enables the feature.

NOTE: An AST node holds its corresponding token as a std::string_view for performance and lower memory usage. It is the user's responsibility to keep the original source text alive along with the generated AST.

peg::parser parser(R"(
  ...
  definition1 <- ... { no_ast_opt }
  definition2 <- ... { no_ast_opt }
  ...
)");

parser.enable_ast();

shared_ptr<peg::Ast> ast;
if (parser.parse("...", ast)) {
  cout << peg::ast_to_s(ast);

  ast = parser.optimize_ast(ast);
  cout << peg::ast_to_s(ast);
}

optimize_ast removes redundant nodes to simplify the AST. To disable this behavior for particular rules, use the no_ast_opt instruction.

By default an AST node carries the name of the rule that produced it. A rule can override that tag with the { ast_name: NodeTag } instruction, so several rules can emit nodes under a shared tag.

Multiple instructions can be combined in a single { ... } block by separating them with ;, e.g. { no_ast_opt; ast_name: NodeTag }.

optimize_ast internally calls peg::AstOptimizer to do the job. You can write your own AST optimizers to fit your needs.

Each AST node exposes the following fields:

const std::string      name;     // rule (or ast_name) that produced the node
const unsigned int     tag;      // str2tag(name) — for fast switch dispatch
std::string_view       token;    // matched text (valid when is_token is true)
bool                   is_token;
size_t                 choice;   // which alternative of a prioritized choice matched
size_t                 line, column, position, length;
std::vector<std::shared_ptr<Ast>> nodes;  // child nodes
std::weak_ptr<Ast>     parent;

To walk the tree, switching on tag avoids string comparison. The _ literal in peg::udl turns a node name into the same compile-time tag value:

using namespace peg::udl;

void traverse(const std::shared_ptr<peg::Ast> &ast) {
  switch (ast->tag) {
  case "Additive"_: /* ... */ break;
  case "Number"_:   /* ... */ break;
  default:
    for (auto &node : ast->nodes) { traverse(node); }
    break;
  }
}
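
How a string can serve as a switch label is easy to show with a self-contained sketch. The hash below is FNV-1a and the suffix is _tag; both are assumptions chosen for illustration and are not claimed to match str2tag or the _ literal in peg::udl.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical hash usable in constant expressions (32-bit FNV-1a).
constexpr unsigned int name_to_tag(std::string_view name) {
  unsigned int hash = 2166136261u;
  for (char c : name) {
    hash ^= static_cast<unsigned char>(c);
    hash *= 16777619u;
  }
  return hash;
}

// Hypothetical literal suffix, playing the role of the `_` literal above.
constexpr unsigned int operator""_tag(const char *s, std::size_t n) {
  return name_to_tag(std::string_view(s, n));
}

// The case labels are evaluated at compile time; only the switch operand
// is hashed at run time.
inline const char *describe(std::string_view node_name) {
  switch (name_to_tag(node_name)) {
  case "Additive"_tag: return "sum";
  case "Number"_tag: return "number";
  default: return "other";
  }
}
```

Because both sides go through the same constexpr function, a name hashed once when the node is created can be compared against any number of case labels without a string comparison.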

See actual usages in the AST calculator example and PL/0 language example.

Make a parser with parser combinators

Instead of making a parser by parsing PEG syntax text, we can also construct a parser by hand with parser combinators. Here is an example:

using namespace peg;
using namespace std;

vector<string> tags;

Definition ROOT, TAG_NAME, _;
ROOT     <= seq(_, zom(seq(chr('['), TAG_NAME, chr(']'), _)));
TAG_NAME <= oom(seq(npd(chr(']')), dot())), [&](const SemanticValues& vs) {
              tags.push_back(vs.token_to_string());
            };
_        <= zom(cls(" \t"));

auto ret = ROOT.parse(" [tag1] [tag:2] [tag-3] ");

The following are available operators:

Operator	Description	Operator	Description
seq	Sequence	cho	Prioritized Choice
zom	Zero or More	oom	One or More
opt	Optional	apd	And predicate
npd	Not predicate	lit	Literal string
liti	Case-insensitive Literal string	cls	Character class
ncls	Negated Character class	chr	Character
dot	Any character	tok	Token boundary
ign	Ignore semantic value	csc	Capture scope
cap	Capture	bkr	Back reference
dic	Dictionary	pre	Infix expression
rec	Recovery operator	usr	User defined parser
rep	Repetition

Adjust definitions

It's possible to add/override definitions.

auto syntax = R"(
  ROOT <- _ 'Hello' _ NAME '!' _
)";

Rules additional_rules = {
  {
    "NAME", usr([](const char* s, size_t n, SemanticValues& vs, any& dt) -> size_t {
      static vector<string> names = { "PEG", "BNF" };
      for (const auto& name: names) {
        if (name.size() <= n && !name.compare(0, name.size(), s, name.size())) {
          return name.size(); // processed length
        }
      }
      return static_cast<size_t>(-1); // parse error
    })
  },
  {
    "~_", zom(cls(" \t\r\n"))
  }
};

auto g = parser(syntax, additional_rules);

assert(g.parse(" Hello BNF! "));

Unicode support

cpp-peglib accepts UTF-8 text. . matches a Unicode code point. It also supports \u????.
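
Matching a code point rather than a byte means stepping over a whole UTF-8 sequence at a time. The sketch below shows that step using only the standard library. utf8_sequence_length and count_codepoints are hypothetical helpers, not library functions, and the check is deliberately shallow (it does not reject overlong encodings or bad continuation bytes).

```cpp
#include <cstddef>
#include <string_view>

// Byte length of the UTF-8 sequence introduced by `lead`, or 0 if `lead`
// cannot start a sequence (for example a continuation byte).
inline std::size_t utf8_sequence_length(unsigned char lead) {
  if (lead < 0x80) return 1;           // 0xxxxxxx: ASCII
  if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
  if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
  if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
  return 0;
}

// Counts code points by stepping over whole sequences.
inline std::size_t count_codepoints(std::string_view text) {
  std::size_t count = 0;
  std::size_t i = 0;
  while (i < text.size()) {
    std::size_t len = utf8_sequence_length(static_cast<unsigned char>(text[i]));
    if (len == 0 || i + len > text.size()) break; // stop on malformed input
    i += len;
    ++count;
  }
  return count;
}
```

A six-byte string holding two three-byte characters therefore counts as two code points, which is the unit a . in the grammar consumes.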

Error report and recovery

cpp-peglib supports the furthest-failure error position report as described in Bryan Ford's original document.

For better error report and recovery, cpp-peglib supports 'recovery' operator with label which can be associated with a recovery expression and a custom error message. This idea comes from the fantastic "Syntax Error Recovery in Parsing Expression Grammars" paper by Sergio Medeiros and Fabio Mascarenhas.

The custom message supports %t which is a placeholder for the unexpected token, and %c for the unexpected Unicode char. It can also reference a named capture with %{name}, which expands to the text captured by $name<...> earlier in the parse (an unknown name expands to an empty string):

Enum       <- 'enum' $name<NAME> '{' NAME+^enum_count '}'
enum_count <- '' { error_message "enum '%{name}' must contain at least one member" }

Here is an example of Java-like grammar:

# java.peg
Prog        ← 'public' 'class' NAME '{' 'public' 'static' 'void' 'main' '(' 'String' '[' ']' NAME ')' BlockStmt '}'
BlockStmt   ← '{' (!'}' Stmt^stmtb)* '}' # Annotated with `stmtb`
Stmt        ← IfStmt / WhileStmt / PrintStmt / DecStmt / AssignStmt / BlockStmt
IfStmt      ← 'if' '(' Exp ')' Stmt ('else' Stmt)?
WhileStmt   ← 'while' '(' Exp^condw ')' Stmt # Annotated with `condw`
DecStmt     ← 'int' NAME ('=' Exp)? ';'
AssignStmt  ← NAME '=' Exp ';'^semia # Annotated with `semia`
PrintStmt   ← 'System.out.println' '(' Exp ')' ';'
Exp         ← RelExp ('==' RelExp)*
RelExp      ← AddExp ('<' AddExp)*
AddExp      ← MulExp (('+' / '-') MulExp)*
MulExp      ← AtomExp (('*' / '/') AtomExp)*
AtomExp     ← '(' Exp ')' / NUMBER / NAME

NUMBER      ← < [0-9]+ >
NAME        ← < [a-zA-Z_][a-zA-Z_0-9]* >

%whitespace ← [ \t\n]*
%word       ← NAME

# Recovery operator labels
semia       ← '' { error_message "missing semicolon in assignment." }
stmtb       ← (!(Stmt / 'else' / '}') .)* { error_message "invalid statement" }
condw       ← &'==' ('==' RelExp)* / &'<' ('<' AddExp)* / (!')' .)*

For instance, ';'^semia is syntactic sugar for (';' / %recovery(semia)). The recovery operator tries to recover from the error at ';' by skipping input text with the recovery expression semia. The label semia is also associated with the custom message "missing semicolon in assignment."

Here is the result:

> cat sample.java
public class Example {
  public static void main(String[] args) {
    int n = 5;
    int f = 1;
    while( < n) {
      f = f * n;
      n = n - 1
    };
    System.out.println(f);
  }
}

> peglint java.peg sample.java
sample.java:5:12: syntax error, unexpected '<', expecting '(', <NUMBER>, <NAME>.
sample.java:8:5: missing semicolon in assignment.
sample.java:8:6: invalid statement

As you can see, it can now show more than one error, and provide more meaningful error messages than the default messages.

Custom error message for definitions

We can associate custom error messages to definitions.

# custom_message.peg
START       <- CODE (',' CODE)*
CODE        <- < '0x' [a-fA-F0-9]+ > { error_message 'code format error...' }
%whitespace <- [ \t]*

> cat custom_message.txt
0x1234,0x@@@@,0xABCD

> peglint custom_message.peg custom_message.txt
custom_message.txt:1:8: code format error...

NOTE: If there is more than one element with an error message instruction in a prioritized choice, this feature may not work as you expect.

Structured error reports

set_logger receives errors as formatted strings. To build tooling on top of the parser — error codes, localized messages, IDE diagnostics — use set_error_reporter instead, which receives the same errors as structured data before they are flattened into a display string:

parser.set_error_reporter([](const peg::ErrorReport &r) {
  // r.line, r.col          : 1-based error position
  // r.position             : byte offset in the input
  // r.unexpected_token     : the token found at the error position
  // r.expected_literals    : e.g. {"}", ";"}
  // r.expected_rules       : e.g. {"NAME"} (rules starting with '_' excluded)
  // r.message              : custom error_message if any (placeholders resolved)
  // r.label                : rule name or recovery label the error belongs to
});

Both callbacks can be set at the same time; each error is delivered to both. With error recovery, the reporter is called once per recovered error, so a single parse can produce multiple reports. Mapping r.label to an application-defined error enum is the intended way to get typed errors.
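
The label-to-enum mapping could look like the following sketch. It is application-side code with no dependency on the library: the SyntaxError enum and classify_label are hypothetical, and the labels are the ones from the Java-like grammar earlier (semia, stmtb, condw). Inside a reporter callback, the argument would be r.label.

```cpp
#include <string_view>

// Hypothetical application-defined error codes.
enum class SyntaxError {
  MissingSemicolon,
  InvalidStatement,
  BadCondition,
  Unknown
};

// Maps a rule name or recovery label to a typed error code.
inline SyntaxError classify_label(std::string_view label) {
  if (label == "semia") return SyntaxError::MissingSemicolon;
  if (label == "stmtb") return SyntaxError::InvalidStatement;
  if (label == "condw") return SyntaxError::BadCondition;
  return SyntaxError::Unknown; // labels the application does not know about
}
```

Keeping the mapping in one function means the grammar can gain new labels without breaking callers: unknown labels simply fall through to Unknown until the application assigns them a code.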

Change the Start Definition Rule

We can change the start definition rule as below.

auto grammar = R"(
  Start       <- A
  A           <- B (',' B)*
  B           <- '[one]' / '[two]'
  %whitespace <- [ \t\n]*
)";

peg::parser parser(grammar, "A"); // Start Rule is "A"

  or

peg::parser parser;
parser.load_grammar(grammar, "A"); // Start Rule is "A"

parser.parse(" [one] , [two] "); // OK

Tracing the parser

To see how the parser proceeds, peg::enable_tracing prints a trace of every rule the parser enters and leaves to the given output stream.

peg::parser parser(grammar);

peg::enable_tracing(parser, std::cout);

parser.parse(" [one] , [two] ");

This is what peglint --trace uses internally. For full control over the trace output, call parser.enable_trace(enter, leave) with your own callbacks, and parser.set_verbose_trace(true) to trace the intermediate operators inside each rule rather than just the rule boundaries.

Similarly, peg::enable_profiling(parser, std::cout) reports how often each rule is invoked and how much time it takes — the counterpart of peglint --profile.

peglint - PEG syntax lint utility

Build peglint

> cd lint
> mkdir build
> cd build
> cmake ..
> make
> ./peglint
usage: grammar_file_path [source_file_path]

  options:
    --source: source text
    --packrat: enable packrat memoise
    --ast: show AST tree
    --opt, --opt-all: optimize all AST nodes except nodes selected with `no_ast_opt` instruction
    --opt-only: optimize only AST nodes selected with `no_ast_opt` instruction
    --trace: show concise trace messages
    --profile: show profile report
    --verbose: verbose output for trace and profile

Grammar check

> cat a.peg
Additive    <- Multiplicative '+' Additive / Multiplicative
Multiplicative   <- Primary '*' Multiplicative / Primary
Primary     <- '(' Additive ')' / Number
%whitespace <- [ \t\r\n]*

> peglint a.peg
[commandline]:3:35: 'Number' is not defined.

Source check

> cat a.peg
Additive    <- Multiplicative '+' Additive / Multiplicative
Multiplicative   <- Primary '*' Multiplicative / Primary
Primary     <- '(' Additive ')' / Number
Number      <- < [0-9]+ >
%whitespace <- [ \t\r\n]*

> peglint --source "1 + a * 3" a.peg
[commandline]:1:3: syntax error

Serialize a grammar for fast startup

A compiled grammar can be serialized to a portable byte blob and later restored without re-running the meta-parse. Deserializing a blob is roughly 40x faster than load_grammar, so an application can embed a prebuilt blob and skip the grammar parsing on startup.

peg::parser parser(R"(
  ROOT <- _ TOKEN (',' _ TOKEN)*
  ...
)");

// Serialize the loaded grammar to a byte blob.
std::vector<uint8_t> blob = parser.serialize_grammar();

// ...store/embed the blob, then later:

peg::parser parser2;
parser2.load_blob(blob); // skips the meta-parse

// Re-apply semantic actions / enable_ast() etc. as needed:
parser2["TOKEN"] = [](const SemanticValues& vs) { /* ... */ };

Notes:

Only the grammar structure is serialized. Semantic actions, enable_ast(), and other callbacks are not included and must be re-applied after load_blob.
First-sets are recomputed on load, and references are resolved by name.
Grammars that use the precedence instruction, a capture / back-reference, or a User operator are not serializable (serialize_grammar() throws; load_blob() returns false on a bad or incompatible blob).

peglint can emit a blob with the --blob option:

> peglint --blob a.peg > a.blob

AST

> cat a.txt
1 + 2 * 3

> peglint --ast a.peg a.txt
+ Additive
  + Multiplicative
    + Primary
      - Number (1)
  + Additive
    + Multiplicative
      + Primary
        - Number (2)
      + Multiplicative
        + Primary
          - Number (3)

AST optimization

> peglint --ast --opt --source "1 + 2 * 3" a.peg
+ Additive
  - Multiplicative[Number] (1)
  + Additive[Multiplicative]
    - Primary[Number] (2)
    - Multiplicative[Number] (3)

Adjust AST optimization with `no_ast_opt` instruction

> cat a.peg
Additive    <- Multiplicative '+' Additive / Multiplicative
Multiplicative   <- Primary '*' Multiplicative / Primary
Primary     <- '(' Additive ')' / Number          { no_ast_opt }
Number      <- < [0-9]+ >
%whitespace <- [ \t\r\n]*

> peglint --ast --opt --source "1 + 2 * 3" a.peg
+ Additive/0
  + Multiplicative/1[Primary]
    - Number (1)
  + Additive/1[Multiplicative]
    + Primary/1
      - Number (2)
    + Multiplicative/1[Primary]
      - Number (3)

> peglint --ast --opt-only --source "1 + 2 * 3" a.peg
+ Additive/0
  + Multiplicative/1
    - Primary/1[Number] (1)
  + Additive/1
    + Multiplicative/0
      - Primary/1[Number] (2)
      + Multiplicative/1
        - Primary/1[Number] (3)

Override an AST node's tag with `ast_name` instruction

By default, an AST node carries the name of the rule that produced it. A rule annotated { ast_name: NodeTag } emits its node under NodeTag instead. This lets two parallel productions converge onto a single AST tag, so a tree-walker needs only one case per logical nonterminal:

> cat a.peg
Atom        <- Number / String
Number      <- < [0-9]+ >          { ast_name: Literal }
String      <- '"' < [^"]* > '"'   { ast_name: Literal }
%whitespace <- [ \t\r\n]*

> peglint --ast --source "123" a.peg
+ Atom/0
  - Literal (123)

> peglint --ast --source "\"hi\"" a.peg
+ Atom/1
  - Literal (hi)

The override also works on parameterized rules (macros): each instantiation emits the same tag. ast_name composes with no_ast_opt — the AST optimizer's keep-list honors the overridden name.
