14 KiB
cpp-peglib
C++11 header-only PEG (Parsing Expression Grammars) library. You can start using it right away just by including peglib.h
in your project.
You can also try the online version at https://yhirose.github.io/cpp-peglib.
The PEG syntax is well described on page 2 in the document. cpp-peglib also supports the following additional syntax for now:
<
...>
(Token boundary operator)~
(Ignore operator)\x20
(Hex number char)$name<
...>
(Named capture operator)$name
(Backreference operator)%whitespace
(Automatic whitespace skipping)%word
(Word expression)$name(
...)
(Capture scope operator)$name<
...>
(Named capture operator)$name
(Backreference operator)MACRO_NAME(
...)
(Parameterized rule or Macro)
This library also supports the linear-time parsing known as the Packrat parsing.
If you need a Go language version, please see go-peg.
How to use
This is a simple calculator sample. It shows how to define grammar, associate samantic actions to the grammar, and handle semantic values.
// (1) Include the header file
#include <peglib.h>
#include <assert.h>
#include <iostream>
using namespace peg;
using namespace std;
int main(void) {
// (2) Make a parser
auto grammar = R"(
# Grammar for Calculator...
Additive <- Multitive '+' Additive / Multitive
Multitive <- Primary '*' Multitive / Primary
Primary <- '(' Additive ')' / Number
Number <- < [0-9]+ >
%whitespace <- [ \t]*
)";
parser parser;
parser.log = [](size_t line, size_t col, const string& msg) {
cerr << line << ":" << col << ": " << msg << "\n";
};
auto ok = parser.load_grammar(grammar);
assert(ok);
// (3) Setup actions
parser["Additive"] = [](const SemanticValues& sv) {
switch (sv.choice()) {
case 0: // "Multitive '+' Additive"
return sv[0].get<int>() + sv[1].get<int>();
default: // "Multitive"
return sv[0].get<int>();
}
};
parser["Multitive"] = [](const SemanticValues& sv) {
switch (sv.choice()) {
case 0: // "Primary '*' Multitive"
return sv[0].get<int>() * sv[1].get<int>();
default: // "Primary"
return sv[0].get<int>();
}
};
parser["Number"] = [](const SemanticValues& sv) {
return stoi(sv.token(), nullptr, 10);
};
// (4) Parse
parser.enable_packrat_parsing(); // Enable packrat parsing.
int val;
parser.parse(" (1 + 2) * 3 ", val);
assert(val == 9);
}
There are four semantic actions available:
[](const SemanticValues& sv, any& dt)
[](const SemanticValues& sv)
[](SemanticValues& sv, any& dt)
[](SemanticValues& sv)
SemanticValues
value contains the following information:
- Semantic values
- Matched string information
- Token information if the rule is literal or uses a token boundary operator
- Choice number when the rule is 'prioritized choise'
any& dt
is a 'read-write' context data which can be used for whatever purposes. The initial context data is set in peg::parser::parse
method.
peg::any
is a simpler implementatin of boost::any. It can wrap arbitrary data type.
A semantic action can return a value of arbitrary data type, which will be wrapped by peg::any
. If a user returns nothing in a semantic action, the first semantic value in the const SemanticValues& sv
argument will be returned. (Yacc parser has the same behavior.)
Here shows the SemanticValues
structure:
struct SemanticValues : protected std::vector<any>
{
// Input text
const char* path;
const char* ss;
// Matched string
std::string str() const; // Matched string
const char* c_str() const; // Matched string start
size_t length() const; // Matched string length
// Line number and column at which the matched string is
std::pair<size_t, size_t> line_info() const;
// Tokens
std::vector<
std::pair<
const char*, // Token start
size_t>> // Token length
tokens;
std::string token(size_t id = 0) const;
// Choice number (0 based index)
size_t choice() const;
// Transform the semantic value vector to another vector
template <typename T> vector<T> transform(size_t beg = 0, size_t end = -1) const;
}
The following example uses <
... >
operator, which is token boundary operator.
auto syntax = R"(
ROOT <- _ TOKEN (',' _ TOKEN)*
TOKEN <- < [a-z0-9]+ > _
_ <- [ \t\r\n]*
)";
peg pg(syntax);
pg["TOKEN"] = [](const SemanticValues& sv) {
// 'token' doesn't include trailing whitespaces
auto token = sv.token();
};
auto ret = pg.parse(" token1, token2 ");
We can ignore unnecessary semantic values from the list by using ~
operator.
peg::pegparser parser(
" ROOT <- _ ITEM (',' _ ITEM _)* "
" ITEM <- ([a-z])+ "
" ~_ <- [ \t]* "
);
parser["ROOT"] = [&](const SemanticValues& sv) {
assert(sv.size() == 2); // should be 2 instead of 5.
};
auto ret = parser.parse(" item1, item2 ");
The following grammar is same as the above.
peg::parser parser(
" ROOT <- ~_ ITEM (',' ~_ ITEM ~_)* "
" ITEM <- ([a-z])+ "
" _ <- [ \t]* "
);
Semantic predicate support is available. We can do it by throwing a peg::parse_error
exception in a semantic action.
peg::parser parser("NUMBER <- [0-9]+");
parser["NUMBER"] = [](const SemanticValues& sv) {
auto val = stol(sv.str(), nullptr, 10);
if (val != 100) {
throw peg::parse_error("value error!!");
}
return val;
};
long val;
auto ret = parser.parse("100", val);
assert(ret == true);
assert(val == 100);
ret = parser.parse("200", val);
assert(ret == false);
enter and leave actions are also avalable.
parser["RULE"].enter = [](const char* s, size_t n, any& dt) {
std::cout << "enter" << std::endl;
};
parser["RULE"] = [](const SemanticValues& sv, any& dt) {
std::cout << "action!" << std::endl;
};
parser["RULE"].leave = [](const char* s, size_t n, size_t matchlen, any& value, any& dt) {
std::cout << "leave" << std::endl;
};
Ignoring Whitespaces
As you can see in the first example, we can ignore whitespaces between tokens automatically with %whitespace
rule.
%whitespace
rule can be applied to the following three conditions:
- trailing spaces on tokens
- leading spaces on text
- trailing spaces on literal strings in rules
These are valid tokens:
KEYWORD <- 'keyword'
WORD <- < [a-zA-Z0-9] [a-zA-Z0-9-_]* > # token boundary operator is used.
IDNET <- < IDENT_START_CHAR IDENT_CHAR* > # token boundary operator is used.
The following grammar accepts one, "two three", four
.
ROOT <- ITEM (',' ITEM)*
ITEM <- WORD / PHRASE
WORD <- < [a-z]+ >
PHRASE <- < '"' (!'"' .)* '"' >
%whitespace <- [ \t\r\n]*
Word expression
peg::parser parser(R"(
ROOT <- 'hello' 'world'
%whitespace <- [ \t\r\n]*
%word <- [a-z]+
)");
parser.parse("hello world"); // OK
parser.parse("helloworld"); // NG
Capture/Backreference
peg::parser parser(R"(
ROOT <- CONTENT
CONTENT <- (ELEMENT / TEXT)*
ELEMENT <- $(STAG CONTENT ETAG)
STAG <- '<' $tag< TAG_NAME > '>'
ETAG <- '</' $tag '>'
TAG_NAME <- 'b' / 'u'
TEXT <- TEXT_DATA
TEXT_DATA <- ![<] .
)");
parser.parse("This is <b>a <u>test</u> text</b>."); // OK
parser.parse("This is <b>a <u>test</b> text</u>."); // NG
parser.parse("This is <b>a <u>test text</b>."); // NG
Parameterized Rule or Macro
# Syntax
Start ← _ Expr
Expr ← Sum
Sum ← List(Product, SumOpe)
Product ← List(Value, ProOpe)
Value ← Number / T('(') Expr T(')')
# Token
SumOpe ← T('+' / '-')
ProOpe ← T('*' / '/')
Number ← T([0-9]+)
~_ ← [ \t\r\n]*
# Macro
List(I, D) ← I (D I)*
T(x) ← < x > _
AST generation
cpp-peglib is able to generate an AST (Abstract Syntax Tree) when parsing. enable_ast
method on peg::parser
class enables the feature.
peg::parser parser("...");
parser.enable_ast();
shared_ptr<peg::Ast> ast;
if (parser.parse("...", ast)) {
cout << peg::ast_to_s(ast);
ast = peg::AstOptimizer(true).optimize(ast);
cout << peg::ast_to_s(ast);
}
peg::AstOptimizer
removes redundant nodes to make a AST simpler. You can make your own AST optimizers to fit your needs.
See actual usages in the AST calculator example and PL/0 language example.
Make a parser with parser combinators
Instead of makeing a parser by parsing PEG syntax text, we can also construct a parser by hand with parser combinatorss. Here is an example:
using namespace peg;
using namespace std;
vector<string> tags;
Definition ROOT, TAG_NAME, _;
ROOT <= seq(_, zom(seq(chr('['), TAG_NAME, chr(']'), _)));
TAG_NAME <= oom(seq(npd(chr(']')), dot())), [&](const SemanticValues& sv) {
tags.push_back(sv.str());
};
_ <= zom(cls(" \t"));
auto ret = ROOT.parse(" [tag1] [tag:2] [tag-3] ");
The following are available operators:
Operator | Description |
---|---|
seq | Sequence |
cho | Prioritized Choice |
zom | Zero or More |
oom | One or More |
opt | Optional |
apd | And predicate |
npd | Not predicate |
lit | Literal string |
cls | Character class |
chr | Character |
dot | Any character |
tok | Token boundary |
ign | Ignore semantic value |
csc | Capture scope |
cap | Capture |
bkr | Back reference |
usr | User defined parser |
Adjust definitions
It's possible to add/override definitions.
auto syntax = R"(
ROOT <- _ 'Hello' _ NAME '!' _
)";
Rules additional_rules = {
{
"NAME", usr([](const char* s, size_t n, SemanticValues& sv, any& dt) -> size_t {
static vector<string> names = { "PEG", "BNF" };
for (const auto& name: names) {
if (name.size() <= n && !name.compare(0, name.size(), s, name.size())) {
return name.size(); // processed length
}
}
return -1; // parse error
})
},
{
"~_", zom(cls(" \t\r\n"))
}
};
auto g = parser(syntax, additional_rules);
assert(g.parse(" Hello BNF! "));
Unicode support
cpp-peglib accepts UTF8 text. .
matches a Unicode codepoint. Also, it supports \u????
.
peglint - PEG syntax lint utility
Build peglint
> cd lint
> mkdir build
> cd build
> cmake ..
> make
> ./peglint
usage: peglint [--ast] [--optimize_ast_nodes|--opt] [--source text] [--trace] [grammar file path] [source file path]
Lint grammar
> cat a.peg
A <- 'hello' ^ 'world'
> peglint a.peg
a.peg:1:14: syntax error
> cat a.peg
A <- B
> peglint a.peg
a.peg:1:6: 'B' is not defined.
> cat a.peg
A <- B / C
B <- 'b'
C <- A
> peglint a.peg
a.peg:1:10: 'C' is left recursive.
a.peg:3:6: 'A' is left recursive.
Lint source text
> cat a.peg
Additive <- Multitive '+' Additive / Multitive
Multitive <- Primary '*' Multitive / Primary
Primary <- '(' Additive ')' / Number
Number <- < [0-9]+ >
%whitespace <- [ \t\r\n]*
> peglint --source "1 + a * 3" a.peg
[commendline]:1:3: syntax error
> cat a.txt
1 + 2 * 3
> peglint --ast a.peg a.txt
+ Additive
+ Multitive
+ Primary
- Number (1)
+ Additive
+ Multitive
+ Primary
- Number (2)
+ Multitive
+ Primary
- Number (3)
> peglint --ast --opt --source "1 + 2 * 3" a.peg
+ Additive
- Multitive[Number] (1)
+ Additive[Multitive]
- Primary[Number] (2)
- Multitive[Number] (3)
Sample codes
- Calculator
- Calculator (with parser operators)
- Calculator (AST version)
- PL/0 language example
- A tiny PL/0 JIT compiler in less than 700 LOC with LLVM and PEG parser
PEG debug
A debug viewer for Parsing Expression Grammars using cpp-peglib by mqnc. Please see his gihub project page for the detail. You can see a parse result of PL/0 code here.
Tested compilers
- Visual Studio 2017
- Visual Studio 2015
- Visual Studio 2013 with update 5
- Clang++ 5.0.1
- Clang++ 5.0
- Clang++ 4.0
- Clang++ 3.5
- G++ 5.4 on Ubuntu 16.04
IMPORTANT NOTE for Ubuntu: Need -pthread
option when linking. See #23 and #46.
TODO
- Advanced Unicode support (Unicode regular expressoin)
License
MIT license (© 2018 Yuji Hirose)