cpp-peglib
==========

[![](https://github.com/yhirose/cpp-peglib/workflows/CMake/badge.svg)](https://github.com/yhirose/cpp-peglib/actions)
[![Build Status](https://ci.appveyor.com/api/projects/status/github/yhirose/cpp-peglib?branch=master&svg=true)](https://ci.appveyor.com/project/yhirose/cpp-peglib)

C++17 header-only [PEG](http://en.wikipedia.org/wiki/Parsing_expression_grammar) (Parsing Expression Grammars) library. You can start using it right away just by including `peglib.h` in your project.

Since this library only supports C++17 compilers, please make sure that the compiler option `-std=c++17` is enabled (`/std:c++17 /Zc:__cplusplus` for MSVC).

You can also try the online version, PEG Playground, at https://yhirose.github.io/cpp-peglib.

The PEG syntax is well described on page 2 of the [document](http://www.brynosaurus.com/pub/lang/peg.pdf) by Bryan Ford. *cpp-peglib* also supports the following additional syntax for now:

* `'...'i` (Case-insensitive literal operator)
* `[...]i` (Case-insensitive character class operator)
* `[^...]` (Negated character class operator)
* `[^...]i` (Case-insensitive negated character class operator)
* `{2,5}` (Regex-like repetition operator)
* `<` ... `>` (Token boundary operator)
* `~` (Ignore operator)
* `\x20` (Hex number char)
* `\u10FFFF` (Unicode char)
* `%whitespace` (Automatic whitespace skipping)
* `%word` (Word expression)
* `$name(` ... `)` (Capture scope operator)
* `$name<` ... `>` (Named capture operator)
* `$name` (Backreference operator)
* `|` (Dictionary operator)
* `↑` (Cut operator)
* `MACRO_NAME(` ... `)` (Parameterized rule or Macro)
* `{ precedence L - + L / * }` (Parsing infix expression)
* `%recovery(` ... `)` (Error recovery operator)
* `exp⇑label` or `exp^label` (Syntax sugar for `(exp / %recovery(label))`)
* `label { error_message "..." }` (Error message instruction)
* `{ no_ast_opt }` (No AST node optimization instruction)

The 'End of Input' check is done by default. To disable the check, call `disable_eoi_check`.

This library supports the linear-time parsing known as [*Packrat*](http://pdos.csail.mit.edu/~baford/packrat/thesis/thesis.pdf) parsing.
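
To illustrate where the linear-time behavior comes from, here is a minimal, self-contained sketch of the packrat idea: every (rule, position) pair is evaluated at most once and the result is cached. It does not use cpp-peglib and is not how the library is implemented. `PackratSketch`, `apply` and `packrat_accepts` are hypothetical names, the grammar is a small calculator-style one, and a number is simplified to a single digit with no whitespace handling.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <optional>
#include <string_view>
#include <utility>

// End position of a successful match, or nullopt on failure.
using Result = std::optional<std::size_t>;

// Hypothetical hand-written recognizer, for illustration only.
struct PackratSketch {
  enum Rule { Additive, Multitive, Primary };

  std::string_view input;
  // Memo table keyed by (rule, start position): each pair is evaluated once.
  std::map<std::pair<int, std::size_t>, Result> memo;

  explicit PackratSketch(std::string_view text) : input(text) {}

  Result apply(Rule rule, std::size_t pos) {
    assert(pos <= input.size());
    const auto key = std::make_pair(static_cast<int>(rule), pos);
    if (auto it = memo.find(key); it != memo.end()) return it->second;
    Result r;
    switch (rule) {
      case Additive: r = additive(pos); break;
      case Multitive: r = multitive(pos); break;
      case Primary: r = primary(pos); break;
    }
    memo.emplace(key, r);
    return r;
  }

  Result lit(char c, std::size_t pos) const {
    if (pos < input.size() && input[pos] == c) return pos + 1;
    return std::nullopt;
  }

  // Additive <- Multitive '+' Additive / Multitive
  Result additive(std::size_t pos) {
    if (auto m = apply(Multitive, pos)) {
      if (auto p = lit('+', *m)) {
        if (auto a = apply(Additive, *p)) return a;
      }
    }
    return apply(Multitive, pos);  // backtracking reuses the cached result
  }

  // Multitive <- Primary '*' Multitive / Primary
  Result multitive(std::size_t pos) {
    if (auto f = apply(Primary, pos)) {
      if (auto s = lit('*', *f)) {
        if (auto m = apply(Multitive, *s)) return m;
      }
    }
    return apply(Primary, pos);
  }

  // Primary <- '(' Additive ')' / [0-9]
  Result primary(std::size_t pos) {
    if (auto l = lit('(', pos)) {
      if (auto a = apply(Additive, *l)) {
        if (auto r = lit(')', *a)) return r;
      }
    }
    if (pos < input.size() && input[pos] >= '0' && input[pos] <= '9') return pos + 1;
    return std::nullopt;
  }
};

// True when the whole text matches Additive.
inline bool packrat_accepts(std::string_view text) {
  PackratSketch p(text);
  const Result end = p.apply(PackratSketch::Additive, 0);
  return end && *end == text.size();
}
```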

IMPORTANT NOTE for some Linux distributions such as Ubuntu and CentOS: the `-pthread` option is needed when linking. See [#23](https://github.com/yhirose/cpp-peglib/issues/23#issuecomment-261126127), [#46](https://github.com/yhirose/cpp-peglib/issues/46#issuecomment-417870473) and [#62](https://github.com/yhirose/cpp-peglib/issues/62#issuecomment-492032680).

How to use
----------

This is a simple calculator sample. It shows how to define a grammar, associate semantic actions with the grammar, and handle semantic values.

```cpp
// (1) Include the header file
#include <peglib.h>
#include <assert.h>
#include <iostream>

using namespace peg;
using namespace std;

int main(void) {
  // (2) Make a parser
  parser parser(R"(
    # Grammar for Calculator...
    Additive    <- Multitive '+' Additive / Multitive
    Multitive   <- Primary '*' Multitive / Primary
    Primary     <- '(' Additive ')' / Number
    Number      <- < [0-9]+ >
    %whitespace <- [ \t]*
  )");

  assert(static_cast<bool>(parser) == true);

  // (3) Setup actions
  parser["Additive"] = [](const SemanticValues &vs) {
    switch (vs.choice()) {
    case 0: // "Multitive '+' Additive"
      return any_cast<int>(vs[0]) + any_cast<int>(vs[1]);
    default: // "Multitive"
      return any_cast<int>(vs[0]);
    }
  };

  parser["Multitive"] = [](const SemanticValues &vs) {
    switch (vs.choice()) {
    case 0: // "Primary '*' Multitive"
      return any_cast<int>(vs[0]) * any_cast<int>(vs[1]);
    default: // "Primary"
      return any_cast<int>(vs[0]);
    }
  };

  parser["Number"] = [](const SemanticValues &vs) {
    return vs.token_to_number<int>();
  };

  // (4) Parse
  parser.enable_packrat_parsing(); // Enable packrat parsing.

  int val;
  parser.parse(" (1 + 2) * 3 ", val);

  assert(val == 9);
}
```

To show syntax errors in grammar text:

```cpp
auto grammar = R"(
  # Grammar for Calculator...
  Additive    <- Multitive '+' Additive / Multitive
  Multitive   <- Primary '*' Multitive / Primary
  Primary     <- '(' Additive ')' / Number
  Number      <- < [0-9]+ >
  %whitespace <- [ \t]*
)";

parser parser;

parser.set_logger([](size_t line, size_t col, const string& msg, const string &rule) {
  cerr << line << ":" << col << ": " << msg << "\n";
});

auto ok = parser.load_grammar(grammar);
assert(ok);
```

There are four semantic actions available:

```cpp
[](const SemanticValues& vs, any& dt)
[](const SemanticValues& vs)
[](SemanticValues& vs, any& dt)
[](SemanticValues& vs)
```

A `SemanticValues` value contains the following information:

- Semantic values
- Matched string information
- Token information if the rule is literal or uses a token boundary operator
- Choice number when the rule is 'prioritized choice'

`any& dt` is 'read-write' context data which can be used for any purpose. The initial context data is set in the `peg::parser::parse` method.

A semantic action can return a value of arbitrary data type, which will be wrapped by `peg::any`. If a user returns nothing in a semantic action, the first semantic value in the `const SemanticValues& vs` argument will be returned. (The Yacc parser has the same behavior.)

Here is the `SemanticValues` structure:

```cpp
struct SemanticValues : protected std::vector<any>
{
  // Input text
  const char* path;
  const char* ss;

  // Matched string
  std::string_view sv() const { return sv_; }

  // Line number and column at which the matched string is
  std::pair<size_t, size_t> line_info() const;

  // Tokens
  std::vector<std::string_view> tokens;
  std::string_view token(size_t id = 0) const;

  // Token conversion
  std::string token_to_string(size_t id = 0) const;
  template <typename T> T token_to_number() const;

  // Choice number (0 based index)
  size_t choice() const;

  // Transform the semantic value vector to another vector
  template <typename T> vector<T> transform(size_t beg = 0, size_t end = -1) const;
};
```
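
As a rough picture of what a line/column pair such as the one returned by `line_info()` means, the sketch below derives it from a byte offset into the input text. This is not the library's implementation. The helper name `line_and_column` is hypothetical, and the sketch assumes 1-based numbering with columns counted in bytes rather than Unicode codepoints.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <utility>

// Hypothetical helper: returns {line, column} of `offset` inside `text`.
// Assumptions: both values are 1-based, and a column is one byte wide.
inline std::pair<std::size_t, std::size_t> line_and_column(std::string_view text,
                                                           std::size_t offset) {
  assert(offset <= text.size());
  std::size_t line = 1;
  std::size_t column = 1;
  for (std::size_t i = 0; i < offset; ++i) {
    if (text[i] == '\n') {
      ++line;      // a newline starts the next line
      column = 1;  // and resets the column
    } else {
      ++column;
    }
  }
  return {line, column};
}
```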

The following example uses the `<` ... `>` operator, which is the *token boundary* operator.

```cpp
peg::parser parser(R"(
  ROOT  <- _ TOKEN (',' _ TOKEN)*
  TOKEN <- < [a-z0-9]+ > _
  _     <- [ \t\r\n]*
)");

parser["TOKEN"] = [](const SemanticValues& vs) {
  // 'token' doesn't include trailing whitespaces
  auto token = vs.token();
};

auto ret = parser.parse(" token1, token2 ");
```

We can ignore unnecessary semantic values from the list by using the `~` operator.

```cpp
peg::parser parser(R"(
  ROOT  <-  _ ITEM (',' _ ITEM _)*
  ITEM  <-  ([a-z0-9])+
  ~_    <-  [ \t]*
)");

parser["ROOT"] = [&](const SemanticValues& vs) {
  assert(vs.size() == 2); // should be 2 instead of 5.
};

auto ret = parser.parse(" item1, item2 ");
```

The following grammar is the same as the above.

```cpp
peg::parser parser(R"(
  ROOT  <- ~_ ITEM (',' ~_ ITEM ~_)*
  ITEM  <- ([a-z0-9])+
  _     <- [ \t]*
)");
```

*Semantic predicate* support is available with a *predicate* action.

```cpp
peg::parser parser("NUMBER  <-  [0-9]+");

parser["NUMBER"] = [](const SemanticValues &vs) {
  return vs.token_to_number<long>();
};

parser["NUMBER"].predicate = [](const SemanticValues &vs,
                                const std::any & /*dt*/, std::string &msg) {
  if (vs.token_to_number<long>() != 100) {
    msg = "value error!!";
    return false;
  }
  return true;
};

long val;
auto ret = parser.parse("100", val);
assert(ret == true);
assert(val == 100);

ret = parser.parse("200", val);
assert(ret == false);
```

*enter* and *leave* actions are also available.

```cpp
parser["RULE"].enter = [](const char* s, size_t n, any& dt) {
  std::cout << "enter" << std::endl;
};

parser["RULE"] = [](const SemanticValues& vs, any& dt) {
  std::cout << "action!" << std::endl;
};

parser["RULE"].leave = [](const char* s, size_t n, size_t matchlen, any& value, any& dt) {
  std::cout << "leave" << std::endl;
};
```

You can receive error information via a logger:

```cpp
parser.set_logger([](size_t line, size_t col, const string& msg) {
  ...
});

parser.set_logger([](size_t line, size_t col, const string& msg, const string &rule) {
  ...
});
```

Ignoring Whitespaces
--------------------

As you can see in the first example, we can ignore whitespaces between tokens automatically with the `%whitespace` rule.

The `%whitespace` rule can be applied to the following three conditions:

* trailing spaces on tokens
* leading spaces on text
* trailing spaces on literal strings in rules

These are valid tokens:

```
KEYWORD  <- 'keyword'
KEYWORDI <- 'case_insensitive_keyword'
WORD     <- < [a-zA-Z0-9] [a-zA-Z0-9-_]* >    # token boundary operator is used.
IDENT    <- < IDENT_START_CHAR IDENT_CHAR* >  # token boundary operator is used.
```

The following grammar accepts ` one, "two three", four `.

```
ROOT         <- ITEM (',' ITEM)*
ITEM         <- WORD / PHRASE
WORD         <- < [a-z]+ >
PHRASE       <- < '"' (!'"' .)* '"' >

%whitespace  <-  [ \t\r\n]*
```

Word expression
---------------

```cpp
peg::parser parser(R"(
  ROOT         <-  'hello' 'world'
  %whitespace  <-  [ \t\r\n]*
  %word        <-  [a-z]+
)");

parser.parse("hello world"); // OK
parser.parse("helloworld");  // NG
```
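
One way to picture why `helloworld` is rejected: a literal only counts as matched when it is not immediately followed by another word character. The sketch below shows that check in plain C++. It is an illustration under that assumption, not the library's code. `is_word_char` and `match_keyword` are hypothetical names, and the word character set is hard-coded to `[a-z]` to mirror the `%word` rule above.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Assumed word character set for this sketch: [a-z], as in the %word rule above.
inline bool is_word_char(char c) { return c >= 'a' && c <= 'z'; }

// Matches `keyword` at `pos` and advances `pos` on success.
// The match is rejected when a word character follows the keyword.
inline bool match_keyword(std::string_view input, std::size_t& pos,
                          std::string_view keyword) {
  assert(pos <= input.size());
  if (input.substr(pos, keyword.size()) != keyword) return false;
  const std::size_t end = pos + keyword.size();
  if (end < input.size() && is_word_char(input[end])) return false;
  pos = end;
  return true;
}
```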

Capture/Backreference
---------------------

```cpp
peg::parser parser(R"(
  ROOT      <- CONTENT
  CONTENT   <- (ELEMENT / TEXT)*
  ELEMENT   <- $(STAG CONTENT ETAG)
  STAG      <- '<' $tag< TAG_NAME > '>'
  ETAG      <- '</' $tag '>'
  TAG_NAME  <- 'b' / 'u'
  TEXT      <- TEXT_DATA
  TEXT_DATA <- ![<] .
)");

parser.parse("This is <b>a <u>test</u> text</b>."); // OK
parser.parse("This is <b>a <u>test</b> text</u>."); // NG
parser.parse("This is <b>a <u>test text</b>.");     // NG
```

Dictionary
----------

The `|` operator allows us to make a word dictionary for fast lookup by using a Trie structure internally. We don't have to worry about the order of words.

```peg
START <- 'This month is ' MONTH '.'
MONTH <- 'Jan' | 'January' | 'Feb' | 'February' | '...'
```
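
The sketch below shows why word order stops mattering once the words live in a trie: the lookup walks the input once and remembers the longest word seen, so `January` is found even though `Jan` is listed first. It is a standalone illustration, not the library's internal class. `TrieSketch` and `longest_match` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical minimal trie, for illustration only.
class TrieSketch {
  struct Node {
    bool is_word = false;
    std::map<char, std::unique_ptr<Node>> next;
  };

 public:
  explicit TrieSketch(const std::vector<std::string>& words) {
    for (const auto& word : words) insert(word);
  }

  // Length of the longest dictionary word that is a prefix of `text`,
  // or 0 when no word matches.
  std::size_t longest_match(std::string_view text) const {
    const Node* node = &root_;
    std::size_t best = 0;
    for (std::size_t i = 0; i < text.size(); ++i) {
      const auto it = node->next.find(text[i]);
      if (it == node->next.end()) break;
      node = it->second.get();
      if (node->is_word) best = i + 1;
    }
    assert(best <= text.size());
    return best;
  }

 private:
  void insert(const std::string& word) {
    Node* node = &root_;
    for (char c : word) {
      auto& child = node->next[c];
      if (!child) child = std::make_unique<Node>();
      node = child.get();
    }
    node->is_word = true;
  }

  Node root_;
};
```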

Cut operator
------------

The `↑` operator can mitigate the backtracking performance problem, but it risks changing the meaning of the grammar.

```peg
S <- '(' ↑ P ')' / '"' ↑ P '"' / P
P <- 'a' / 'b' / 'c'
```

When we parse `(z` with the above grammar, we don't have to backtrack in `S` after `(` is matched, because a cut operator is inserted there.

Parameterized Rule or Macro
---------------------------
```peg
# Syntax
Start ← _ Expr
Expr ← Sum
Sum ← List(Product, SumOpe)
Product ← List(Value, ProOpe)
Value ← Number / T('(') Expr T(')')
# Token
SumOpe ← T('+' / '-')
ProOpe ← T('*' / '/')
Number ← T([0-9]+)
~_ ← [ \t\r\n]*
# Macro
List(I, D) ← I (D I)*
T(x) ← < x > _
```

Parsing infix expression by Precedence climbing
-----------------------------------------------

Regarding the *precedence climbing algorithm*, please see [this article](https://eli.thegreenplace.net/2012/08/02/parsing-expressions-by-precedence-climbing).

```cpp
parser parser(R"(
  EXPRESSION               <-  INFIX_EXPRESSION(ATOM, OPERATOR)
  ATOM                     <-  NUMBER / '(' EXPRESSION ')'
  OPERATOR                 <-  < [-+/*] >
  NUMBER                   <-  < '-'? [0-9]+ >
  %whitespace              <-  [ \t]*

  # Declare order of precedence
  INFIX_EXPRESSION(A, O)   <-  A (O A)* {
    precedence
      L + -
      L * /
  }
)");

parser["INFIX_EXPRESSION"] = [](const SemanticValues& vs) -> long {
  auto result = any_cast<long>(vs[0]);
  if (vs.size() > 1) {
    auto ope = any_cast<char>(vs[1]);
    auto num = any_cast<long>(vs[2]);
    switch (ope) {
      case '+': result += num; break;
      case '-': result -= num; break;
      case '*': result *= num; break;
      case '/': result /= num; break;
    }
  }
  return result;
};
// Return the operator character (the first character of the matched text).
parser["OPERATOR"] = [](const SemanticValues& vs) { return vs.sv().front(); };
parser["NUMBER"] = [](const SemanticValues& vs) { return vs.token_to_number<long>(); };

long val;
parser.parse(" -1 + (1 + 2) * 3 - -1", val);
assert(val == 9);
```

The *precedence* instruction can be applied only to the following 'list' style rule.

```
Rule <- Atom (Operator Atom)* {
  precedence
    L - +
    L / *
    R ^
}
```

The *precedence* instruction contains precedence info entries. Each entry starts with an *associativity*, which is 'L' (left) or 'R' (right), followed by operator *literal* tokens. The first entry has the highest order level.
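
For readers who want to see the algorithm itself, here is a minimal standalone sketch of precedence climbing over the same four operators. It does not use cpp-peglib and is not the library's implementation. `OpInfo`, `op_info`, `Climber` and `evaluate` are hypothetical names. The table treats `* /` as binding more tightly than `+ -`, which is inferred from the calculator example above evaluating to `9`; malformed input is only guarded by `assert`.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string_view>

// Binding level and associativity of an operator (higher level binds tighter).
struct OpInfo {
  int level;
  bool right_assoc;
};

// Hypothetical table mirroring "L + -" followed by "L * /".
inline bool op_info(char op, OpInfo& out) {
  switch (op) {
    case '+': case '-': out = {1, false}; return true;
    case '*': case '/': out = {2, false}; return true;
    default: return false;
  }
}

struct Climber {
  std::string_view s;
  std::size_t pos = 0;

  explicit Climber(std::string_view text) : s(text) {}

  void skip_ws() {
    while (pos < s.size() && (s[pos] == ' ' || s[pos] == '\t')) ++pos;
  }

  // ATOM <- NUMBER / '(' EXPRESSION ')'
  long atom() {
    skip_ws();
    if (pos < s.size() && s[pos] == '(') {
      ++pos;
      const long v = expr(1);
      skip_ws();
      assert(pos < s.size() && s[pos] == ')');
      ++pos;
      return v;
    }
    bool negative = false;
    if (pos < s.size() && s[pos] == '-') { negative = true; ++pos; }
    assert(pos < s.size() && std::isdigit(static_cast<unsigned char>(s[pos])));
    long v = 0;
    while (pos < s.size() && std::isdigit(static_cast<unsigned char>(s[pos]))) {
      v = v * 10 + (s[pos++] - '0');
    }
    return negative ? -v : v;
  }

  // Consumes operators whose level is at least `min_level`.
  long expr(int min_level) {
    long lhs = atom();
    for (;;) {
      skip_ws();
      OpInfo info{};
      if (pos >= s.size() || !op_info(s[pos], info) || info.level < min_level) break;
      const char op = s[pos++];
      // Left-associative operators require a strictly higher level on the right.
      const long rhs = expr(info.right_assoc ? info.level : info.level + 1);
      switch (op) {
        case '+': lhs += rhs; break;
        case '-': lhs -= rhs; break;
        case '*': lhs *= rhs; break;
        case '/': lhs /= rhs; break;
      }
    }
    return lhs;
  }
};

inline long evaluate(std::string_view text) {
  Climber c(text);
  return c.expr(1);
}
```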

AST generation
--------------

*cpp-peglib* is able to generate an AST (Abstract Syntax Tree) when parsing. The `enable_ast` method on the `peg::parser` class enables the feature.

NOTE: An AST node holds a corresponding token as `std::string_view` for performance and less memory usage. It is the user's responsibility to keep the original source text along with the generated AST tree.

```
peg::parser parser(R"(
  ...
  definition1 <- ... { no_ast_opt }
  definition2 <- ... { no_ast_opt }
  ...
)");

parser.enable_ast();

shared_ptr<peg::Ast> ast;
if (parser.parse("...", ast)) {
  cout << peg::ast_to_s(ast);

  ast = parser.optimize_ast(ast);
  cout << peg::ast_to_s(ast);
}
```

`optimize_ast` removes redundant nodes to make an AST simpler. If you want to disable this behavior for particular rules, the `no_ast_opt` instruction can be used.

It internally calls `peg::AstOptimizer` to do the job. You can make your own AST optimizers to fit your needs.

See actual usages in the [AST calculator example](https://github.com/yhirose/cpp-peglib/blob/master/example/calc3.cc) and [PL/0 language example](https://github.com/yhirose/cpp-peglib/blob/master/pl0/pl0.cc).
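
Regarding the note above about `std::string_view`, one possible way to keep the source text alive is to store it next to the views in a single owner. The sketch below shows the pattern with a plain token list instead of a real AST. `TokenizedText`, `split_words` and `token_offset` are hypothetical names, and holding the text in a `std::shared_ptr` is just one design choice: it keeps the character buffer at a stable address even when the owner object is moved or copied.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical owner that keeps the source text together with views into it.
struct TokenizedText {
  std::shared_ptr<const std::string> source;  // owns the characters
  std::vector<std::string_view> tokens;       // views into *source
};

// Splits on spaces; every token is a view into the stored source text.
inline TokenizedText split_words(std::string text) {
  TokenizedText out;
  out.source = std::make_shared<const std::string>(std::move(text));
  const std::string& s = *out.source;
  std::size_t i = 0;
  while (i < s.size()) {
    while (i < s.size() && s[i] == ' ') ++i;
    const std::size_t start = i;
    while (i < s.size() && s[i] != ' ') ++i;
    if (i > start) out.tokens.push_back(std::string_view(s).substr(start, i - start));
  }
  return out;
}

// Byte offset of token `i` inside the owned source text.
inline std::size_t token_offset(const TokenizedText& t, std::size_t i) {
  assert(i < t.tokens.size());
  return static_cast<std::size_t>(t.tokens[i].data() - t.source->data());
}
```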

Make a parser with parser combinators
-------------------------------------

Instead of making a parser by parsing PEG syntax text, we can also construct a parser by hand with *parser combinators*. Here is an example:

```cpp
using namespace peg;
using namespace std;

vector<string> tags;

Definition ROOT, TAG_NAME, _;
ROOT     <= seq(_, zom(seq(chr('['), TAG_NAME, chr(']'), _)));
TAG_NAME <= oom(seq(npd(chr(']')), dot())), [&](const SemanticValues& vs) {
              tags.push_back(vs.token_to_string());
            };
_        <= zom(cls(" \t"));

auto ret = ROOT.parse(" [tag1] [tag:2] [tag-3] ");
```

The following are available operators:

| Operator | Description                     | Operator | Description          |
| :------- | :------------------------------ | :------- | :------------------- |
| seq      | Sequence                        | cho      | Prioritized Choice   |
| zom      | Zero or More                    | oom      | One or More          |
| opt      | Optional                        | apd      | And predicate        |
| npd      | Not predicate                   | lit      | Literal string       |
| liti     | Case-insensitive Literal string | cls      | Character class      |
| ncls     | Negated Character class         | chr      | Character            |
| dot      | Any character                   | tok      | Token boundary       |
| ign      | Ignore semantic value           | csc      | Capture scope        |
| cap      | Capture                         | bkr      | Back reference       |
| dic      | Dictionary                      | pre      | Infix expression     |
| rec      | Error recovery                  | usr      | User defined parser  |
| rep      | Repetition                      |          |                      |

Adjust definitions
------------------

It's possible to add/override definitions.

```cpp
auto syntax = R"(
  ROOT <- _ 'Hello' _ NAME '!' _
)";

Rules additional_rules = {
  {
    "NAME", usr([](const char* s, size_t n, SemanticValues& vs, any& dt) -> size_t {
      static vector<string> names = { "PEG", "BNF" };
      for (const auto& name: names) {
        if (name.size() <= n && !name.compare(0, name.size(), s, name.size())) {
          return name.size(); // processed length
        }
      }
      return -1; // parse error
    })
  },
  {
    "~_", zom(cls(" \t\r\n"))
  }
};

auto g = parser(syntax, additional_rules);

assert(g.parse(" Hello BNF! "));
```
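
The matching logic inside the `usr` callback above is plain C++, so it can be pulled out and checked on its own. The sketch below does that with a hypothetical free function `match_name` that takes only the input pointer and remaining length. It returns the processed length on success and `static_cast<std::size_t>(-1)` on failure, which is the value `return -1` converts to under the lambda's `-> size_t` return type.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical standalone version of the matching logic shown above.
// `s` points at the current input position and `n` is the remaining length.
inline std::size_t match_name(const char* s, std::size_t n) {
  assert(s != nullptr);
  static const std::vector<std::string> names = {"PEG", "BNF"};
  for (const auto& name : names) {
    if (name.size() <= n && name.compare(0, name.size(), s, name.size()) == 0) {
      return name.size();  // processed length
    }
  }
  return static_cast<std::size_t>(-1);  // parse error
}
```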

Unicode support
---------------

cpp-peglib accepts UTF-8 text. `.` matches a Unicode codepoint. Also, it supports `\u????`.
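
To make "`.` matches a Unicode codepoint" concrete: in UTF-8 a single codepoint occupies one to four bytes, and the lead byte tells how many. The standalone sketch below counts codepoints that way. It is not taken from the library. `utf8_sequence_length` and `count_codepoints` are hypothetical names, and continuation bytes are not validated.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Number of bytes in the UTF-8 sequence that starts with `lead`,
// or 0 when `lead` cannot start a sequence.
inline std::size_t utf8_sequence_length(unsigned char lead) {
  if (lead < 0x80) return 1;            // 0xxxxxxx (ASCII)
  if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
  if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
  if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
  return 0;                             // continuation or invalid byte
}

// Counts codepoints in `text`, stopping at the first malformed or
// truncated sequence. Continuation bytes are not checked in this sketch.
inline std::size_t count_codepoints(std::string_view text) {
  std::size_t count = 0;
  std::size_t i = 0;
  while (i < text.size()) {
    const std::size_t len = utf8_sequence_length(static_cast<unsigned char>(text[i]));
    if (len == 0 || i + len > text.size()) break;
    i += len;
    ++count;
  }
  assert(i <= text.size());
  return count;
}
```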

Error report and recovery
-------------------------

cpp-peglib supports the furthest failure error position report as described in Bryan Ford's original document.
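
The idea behind a furthest failure report can be sketched in a few lines: every failed match records its position, and the error is reported at the largest position seen, even after the parser has backtracked to an earlier one. The code below is a hand-written illustration for the toy grammar `S <- 'a' 'b' 'c' / 'a' 'd'`, not the library's implementation. `FailureLog`, `literal` and `parse_s` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string_view>

// Remembers the furthest input position at which any match failed.
struct FailureLog {
  std::size_t furthest = 0;
};

// Matches `lit` at `pos`; on failure, records `pos` in the log.
inline bool literal(std::string_view in, std::size_t& pos, std::string_view lit,
                    FailureLog& log) {
  assert(pos <= in.size());
  if (in.substr(pos, lit.size()) == lit) {
    pos += lit.size();
    return true;
  }
  log.furthest = std::max(log.furthest, pos);
  return false;
}

// S <- 'a' 'b' 'c' / 'a' 'd'
inline bool parse_s(std::string_view in, FailureLog& log) {
  std::size_t pos = 0;
  if (literal(in, pos, "a", log) && literal(in, pos, "b", log) &&
      literal(in, pos, "c", log)) {
    return true;
  }
  pos = 0;  // backtrack to try the second alternative
  return literal(in, pos, "a", log) && literal(in, pos, "d", log);
}
```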

For better error reporting and recovery, cpp-peglib supports a 'recovery' operator with a label, which can be associated with a recovery expression and a custom error message. This idea comes from the fantastic ["Syntax Error Recovery in Parsing Expression Grammars"](https://arxiv.org/pdf/1806.11150.pdf) paper by Sergio Medeiros and Fabio Mascarenhas.

The custom message supports `%t`, which is a placeholder for the unexpected token, and `%c` for the unexpected Unicode char.

Here is an example of a Java-like grammar:

```peg
# java.peg
Prog       ← 'public' 'class' NAME '{' 'public' 'static' 'void' 'main' '(' 'String' '[' ']' NAME ')' BlockStmt '}'
BlockStmt  ← '{' (!'}' Stmt^stmtb)* '}' # Annotated with `stmtb`
Stmt       ← IfStmt / WhileStmt / PrintStmt / DecStmt / AssignStmt / BlockStmt
IfStmt     ← 'if' '(' Exp ')' Stmt ('else' Stmt)?
WhileStmt  ← 'while' '(' Exp^condw ')' Stmt # Annotated with `condw`
DecStmt    ← 'int' NAME ('=' Exp)? ';'
AssignStmt ← NAME '=' Exp ';'^semia # Annotated with `semia`
PrintStmt  ← 'System.out.println' '(' Exp ')' ';'
Exp        ← RelExp ('==' RelExp)*
RelExp     ← AddExp ('<' AddExp)*
AddExp     ← MulExp (('+' / '-') MulExp)*
MulExp     ← AtomExp (('*' / '/') AtomExp)*
AtomExp    ← '(' Exp ')' / NUMBER / NAME

NUMBER     ← < [0-9]+ >
NAME       ← < [a-zA-Z_][a-zA-Z_0-9]* >

%whitespace ← [ \t\n]*
%word       ← NAME

# Recovery operator labels
semia      ← '' { error_message "missing semicolon in assignment." }
stmtb      ← (!(Stmt / 'else' / '}') .)* { error_message "invalid statement" }
condw      ← &'==' ('==' RelExp)* / &'<' ('<' AddExp)* / (!')' .)*
```

For instance, `';'^semia` is syntactic sugar for `(';' / %recovery(semia))`. The `%recovery` operator tries to recover from the error at ';' by skipping input text with the recovery expression `semia`. Also, `semia` is associated with the custom message "missing semicolon in assignment.".

Here is the result:

```java
> cat sample.java
public class Example {
  public static void main(String[] args) {
    int n = 5;
    int f = 1;
    while( < n) {
      f = f * n;
      n = n - 1
    };
    System.out.println(f);
  }
}

> peglint java.peg sample.java
sample.java:5:12: syntax error, unexpected '<', expecting '(', <NUMBER>, <NAME>.
sample.java:8:5: missing semicolon in assignment.
sample.java:8:6: invalid statement
```

As you can see, it can now show more than one error and provide more meaningful error messages than the default messages.

### Custom error message for definitions

We can associate custom error messages with definitions.

```peg
# custom_message.peg
START       <- CODE (',' CODE)*
CODE        <- < '0x' [a-fA-F0-9]+ > { error_message 'code format error...' }
%whitespace <- [ \t]*
```

```
> cat custom_message.txt
0x1234,0x@@@@,0xABCD

> peglint custom_message.peg custom_message.txt
custom_message.txt:1:8: code format error...
```

NOTE: If there is more than one element with an error message instruction in a prioritized choice, this feature may not work as you expect.

peglint - PEG syntax lint utility
---------------------------------

### Build peglint

```
> cd lint
> mkdir build
> cd build
> cmake ..
> make
> ./peglint
usage: grammar_file_path [source_file_path]

  options:
    --source: source text
    --packrat: enable packrat memoise
    --ast: show AST tree
    --opt, --opt-all: optimize all AST nodes except nodes selected with `no_ast_opt` instruction
    --opt-only: optimize only AST nodes selected with `no_ast_opt` instruction
    --trace: show concise trace messages
    --profile: show profile report
    --verbose: verbose output for trace and profile
```

### Grammar check

```
> cat a.peg
Additive    <- Multitive '+' Additive / Multitive
Multitive   <- Primary '*' Multitive / Primary
Primary     <- '(' Additive ')' / Number
%whitespace <- [ \t\r\n]*

> peglint a.peg
[commendline]:3:35: 'Number' is not defined.
```

### Source check

```
> cat a.peg
Additive    <- Multitive '+' Additive / Multitive
Multitive   <- Primary '*' Multitive / Primary
Primary     <- '(' Additive ')' / Number
Number      <- < [0-9]+ >
%whitespace <- [ \t\r\n]*

> peglint --source "1 + a * 3" a.peg
[commendline]:1:3: syntax error
```

### AST

```
> cat a.txt
1 + 2 * 3

> peglint --ast a.peg a.txt
+ Additive
    + Multitive
        + Primary
            - Number (1)
    + Additive
        + Multitive
            + Primary
                - Number (2)
            + Multitive
                + Primary
                    - Number (3)
```

### AST optimization

```
> peglint --ast --opt --source "1 + 2 * 3" a.peg
+ Additive
    - Multitive[Number] (1)
    + Additive[Multitive]
        - Primary[Number] (2)
        - Multitive[Number] (3)
```

### Adjust AST optimization with `no_ast_opt` instruction

```
> cat a.peg
Additive    <- Multitive '+' Additive / Multitive
Multitive   <- Primary '*' Multitive / Primary
Primary     <- '(' Additive ')' / Number          { no_ast_opt }
Number      <- < [0-9]+ >
%whitespace <- [ \t\r\n]*

> peglint --ast --opt --source "1 + 2 * 3" a.peg
+ Additive/0
    + Multitive/1[Primary]
        - Number (1)
    + Additive/1[Multitive]
        + Primary/1
            - Number (2)
        + Multitive/1[Primary]
            - Number (3)

> peglint --ast --opt-only --source "1 + 2 * 3" a.peg
+ Additive/0
    + Multitive/1
        - Primary/1[Number] (1)
    + Additive/1
        + Multitive/0
            - Primary/1[Number] (2)
            + Multitive/1
                - Primary/1[Number] (3)
```

Sample codes
------------

* [Calculator](https://github.com/yhirose/cpp-peglib/blob/master/example/calc.cc)
* [Calculator (with parser operators)](https://github.com/yhirose/cpp-peglib/blob/master/example/calc2.cc)
* [Calculator (AST version)](https://github.com/yhirose/cpp-peglib/blob/master/example/calc3.cc)
* [Calculator (parsing expressions by precedence climbing)](https://github.com/yhirose/cpp-peglib/blob/master/example/calc4.cc)
* [Calculator (AST version and parsing expressions by precedence climbing)](https://github.com/yhirose/cpp-peglib/blob/master/example/calc5.cc)
* [A tiny PL/0 JIT compiler in less than 900 LOC with LLVM and PEG parser](https://github.com/yhirose/pl0-jit-compiler)
* [A Programming Language just for writing Fizz Buzz program. :)](https://github.com/yhirose/fizzbuzzlang)

License
-------

MIT license (© 2022 Yuji Hirose)