cpp-peglib
==========

[![Build Status](https://travis-ci.org/yhirose/cpp-peglib.svg?branch=master)](https://travis-ci.org/yhirose/cpp-peglib)
[![Build Status](https://ci.appveyor.com/api/projects/status/github/yhirose/cpp-peglib?branch=master&svg=true)](https://ci.appveyor.com/project/yhirose/cpp-peglib)

C++11 header-only [PEG](http://en.wikipedia.org/wiki/Parsing_expression_grammar) (Parsing Expression Grammars) library.

*cpp-peglib* aims to provide a more expressive parsing experience in a simple way. The library consists of a single header file, so you can start using it right away just by including `peglib.h` in your project.

The PEG syntax is well described on page 2 of the [document](http://www.brynosaurus.com/pub/lang/peg.pdf). *cpp-peglib* also supports the following additional syntax:

* `<` ... `>` (Token boundary operator)
* `~` (Ignore operator)
* `\x20` (Hex number char)
* `$name<` ... `>` (Named capture operator)
* `$name` (Backreference operator)
* `%whitespace` (Automatic whitespace skipping)
* `%word` (Word expression)
* `$name(` ... `)` (Capture scope operator)
* `MACRO_NAME(` ... `)` (Parameterized rule or Macro)

This library also supports linear-time parsing, known as [*Packrat*](http://pdos.csail.mit.edu/~baford/packrat/thesis/thesis.pdf) parsing.

If you need a Go language version, please see [*go-peg*](https://github.com/yhirose/go-peg).

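To give a feel for what packrat parsing means, here is a minimal, self-contained sketch of the memoization idea. It does not use cpp-peglib and is not how the library is implemented; `PackratCalc`, `apply` and `eval` are hypothetical names, and the grammar is a small whitespace-free calculator grammar. Each (rule, position) pair is evaluated at most once and then served from a cache, which is what keeps the total work linear in the input length.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Hypothetical illustration only; none of these names come from cpp-peglib.
// Grammar: Additive  <- Multitive '+' Additive / Multitive
//          Multitive <- Primary '*' Multitive / Primary
//          Primary   <- '(' Additive ')' / [0-9]+
struct PackratCalc {
    enum Rule { ADDITIVE, MULTITIVE, PRIMARY };
    struct Result { bool ok; std::size_t end; long value; };

    std::string input;
    std::map<std::pair<int, std::size_t>, Result> memo; // (rule, position) -> result

    explicit PackratCalc(std::string s) : input(std::move(s)) {}

    // Look up the memo table first; evaluate the rule body only on a miss.
    Result apply(Rule rule, std::size_t pos) {
        auto key = std::make_pair(static_cast<int>(rule), pos);
        auto it = memo.find(key);
        if (it != memo.end()) return it->second;
        Result r = eval(rule, pos);
        memo.emplace(key, r);
        return r;
    }

    Result eval(Rule rule, std::size_t pos) {
        switch (rule) {
        case ADDITIVE: {
            Result l = apply(MULTITIVE, pos);
            if (!l.ok) return l;
            if (l.end < input.size() && input[l.end] == '+') {
                Result r = apply(ADDITIVE, l.end + 1);
                if (r.ok) return {true, r.end, l.value + r.value};
            }
            return apply(MULTITIVE, pos); // second alternative: a memo hit
        }
        case MULTITIVE: {
            Result l = apply(PRIMARY, pos);
            if (!l.ok) return l;
            if (l.end < input.size() && input[l.end] == '*') {
                Result r = apply(MULTITIVE, l.end + 1);
                if (r.ok) return {true, r.end, l.value * r.value};
            }
            return apply(PRIMARY, pos); // second alternative: a memo hit
        }
        case PRIMARY: {
            if (pos < input.size() && input[pos] == '(') {
                Result a = apply(ADDITIVE, pos + 1);
                if (a.ok && a.end < input.size() && input[a.end] == ')')
                    return {true, a.end + 1, a.value};
            }
            std::size_t p = pos;
            long v = 0;
            while (p < input.size() && input[p] >= '0' && input[p] <= '9') {
                v = v * 10 + (input[p] - '0');
                ++p;
            }
            if (p == pos) return {false, pos, 0};
            return {true, p, v};
        }
        }
        return {false, pos, 0};
    }

    // Succeeds only when the whole input is consumed.
    bool parse(long& out) {
        Result r = apply(ADDITIVE, 0);
        if (!r.ok || r.end != input.size()) return false;
        out = r.value;
        return true;
    }
};
```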
How to use
----------

This is a simple calculator sample. It shows how to define a grammar, associate semantic actions with the grammar, and handle semantic values.

```cpp
// (1) Include the header file
#include <peglib.h>
#include <assert.h>
#include <iostream>

using namespace peg;
using namespace std;

int main(void) {
    // (2) Make a parser
    auto grammar = R"(
        # Grammar for Calculator...
        Additive    <- Multitive '+' Additive / Multitive
        Multitive   <- Primary '*' Multitive / Primary
        Primary     <- '(' Additive ')' / Number
        Number      <- < [0-9]+ >
        %whitespace <- [ \t]*
    )";

    parser parser;

    // Report grammar errors with their line and column
    parser.log = [](size_t line, size_t col, const string& msg) {
        cerr << line << ":" << col << ": " << msg << "\n";
    };

    auto ok = parser.load_grammar(grammar);
    assert(ok);

    // (3) Setup actions
    parser["Additive"] = [](const SemanticValues& sv) {
        switch (sv.choice()) {
        case 0:  // "Multitive '+' Additive"
            return sv[0].get<int>() + sv[1].get<int>();
        default: // "Multitive"
            return sv[0].get<int>();
        }
    };

    parser["Multitive"] = [](const SemanticValues& sv) {
        switch (sv.choice()) {
        case 0:  // "Primary '*' Multitive"
            return sv[0].get<int>() * sv[1].get<int>();
        default: // "Primary"
            return sv[0].get<int>();
        }
    };

    parser["Number"] = [](const SemanticValues& sv) {
        return stoi(sv.token(), nullptr, 10);
    };

    // (4) Parse
    parser.enable_packrat_parsing(); // Enable packrat parsing.

    int val;
    parser.parse(" (1 + 2) * 3 ", val);

    assert(val == 9);
}
```

Two forms of semantic action are available:

```cpp
[](const SemanticValues& sv, any& dt)
[](const SemanticValues& sv)
```

`const SemanticValues& sv` contains the following information:

- Semantic values
- Matched string information
- Token information if the rule is a literal or uses a token boundary operator
- Choice number when the rule is a 'prioritized choice'

`any& dt` is read-write context data that can be used for any purpose. The initial context data is set in the `peg::parser::parse` method.

`peg::any` is a simpler implementation of [boost::any](http://www.boost.org/doc/libs/1_57_0/doc/html/any.html). It can wrap arbitrary data types.

A semantic action can return a value of an arbitrary data type, which will be wrapped by `peg::any`. If a user returns nothing in a semantic action, the first semantic value in the `const SemanticValues& sv` argument will be returned. (Yacc parsers have the same behavior.)

The `SemanticValues` structure is shown below:

```cpp
struct SemanticValues : protected std::vector<any>
{
    // Input text
    const char* path;
    const char* ss;

    // Matched string
    std::string str() const;    // Matched string
    const char* c_str() const;  // Matched string start
    size_t      length() const; // Matched string length

    // Line number and column at which the matched string is
    std::pair<size_t, size_t> line_info() const;

    // Tokens
    std::vector<
        std::pair<
            const char*, // Token start
            size_t>>     // Token length
        tokens;

    std::string token(size_t id = 0) const;

    // Choice number (0 based index)
    size_t      choice() const;

    // Transform the semantic value vector to another vector
    template <typename T> std::vector<T> transform(size_t beg = 0, size_t end = -1) const;
};
```
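
To illustrate the idea of type-erased semantic values and the `transform` member listed above, here is a small sketch that uses C++17 `std::any` as a stand-in for `peg::any`. It is not the library's implementation: `Values` and `transform_values` are hypothetical names, and the sketch assumes every element in the requested range really holds a `T`.

```cpp
#include <any>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a list of semantic values; std::any plays the
// role that peg::any plays in the library.
using Values = std::vector<std::any>;

// Convert the values in [beg, end) to a typed vector.
// std::any_cast throws std::bad_any_cast if an element does not hold a T.
template <typename T>
std::vector<T> transform_values(const Values& vs, std::size_t beg = 0,
                                std::size_t end = static_cast<std::size_t>(-1)) {
    std::vector<T> out;
    if (end > vs.size()) end = vs.size(); // clamp the default "to the end" value
    for (std::size_t i = beg; i < end; ++i) {
        out.push_back(std::any_cast<T>(vs[i]));
    }
    return out;
}
```

With three wrapped integers in a `Values`, `transform_values<int>(values)` yields a plain `std::vector<int>`, and `transform_values<int>(values, 1)` skips the first element.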

The following example uses the `<` ... `>` operator, which is the *token boundary* operator.

```cpp
auto syntax = R"(
    ROOT  <- _ TOKEN (',' _ TOKEN)*
    TOKEN <- < [a-z0-9]+ > _
    _     <- [ \t\r\n]*
)";

peg::parser pg(syntax);

pg["TOKEN"] = [](const SemanticValues& sv) {
    // 'token' doesn't include trailing whitespaces
    auto token = sv.token();
};

auto ret = pg.parse(" token1, token2 ");
```

We can ignore unnecessary semantic values from the list by using the `~` operator.

```cpp
peg::parser parser(
    "  ROOT  <-  _ ITEM (',' _ ITEM _)* "
    "  ITEM  <-  ([a-z0-9])+  " // digits are included so that "item1" matches
    "  ~_    <-  [ \t]*    "
);

parser["ROOT"] = [&](const SemanticValues& sv) {
    assert(sv.size() == 2); // should be 2 instead of 5.
};

auto ret = parser.parse(" item1, item2 ");
```

The following grammar is the same as the above.

```cpp
peg::parser parser(
    "  ROOT  <-  ~_ ITEM (',' ~_ ITEM ~_)* "
    "  ITEM  <-  ([a-z0-9])+  "
    "  _     <-  [ \t]*    "
);
```

*Semantic predicate* support is available. We can reject a match by throwing a `peg::parse_error` exception in a semantic action.

```cpp
peg::parser parser("NUMBER <- [0-9]+");

parser["NUMBER"] = [](const SemanticValues& sv) {
    auto val = stol(sv.str(), nullptr, 10);
    if (val != 100) {
        throw peg::parse_error("value error!!");
    }
    return val;
};

long val;
auto ret = parser.parse("100", val);
assert(ret == true);
assert(val == 100);

ret = parser.parse("200", val);
assert(ret == false);
```

*enter* and *leave* actions are also available.

```cpp
parser["RULE"].enter = [](any& dt) {
    std::cout << "enter" << std::endl;
};

parser["RULE"] = [](const SemanticValues& sv, any& dt) {
    std::cout << "action!" << std::endl;
};

parser["RULE"].leave = [](any& dt) {
    std::cout << "leave" << std::endl;
};
```
Ignoring Whitespaces
--------------------
As you can see in the first example, we can ignore whitespaces between tokens automatically with the `%whitespace` rule.

The `%whitespace` rule applies in the following three cases:

* trailing spaces on tokens
* leading spaces on text
* trailing spaces on literal strings in rules

These are valid tokens:

```
KEYWORD  <- 'keyword'
WORD     <-  < [a-zA-Z0-9] [a-zA-Z0-9-_]* >    # token boundary operator is used.
IDENT    <-  < IDENT_START_CHAR IDENT_CHAR* >  # token boundary operator is used.
```
The following grammar accepts ` one, "two three", four `.
```
ROOT         <- ITEM (',' ITEM)*
ITEM         <- WORD / PHRASE
WORD         <- < [a-z]+ >
PHRASE       <- < '"' (!'"' .)* '"' >

%whitespace  <-  [ \t\r\n]*
```
Word expression
---------------
```cpp
peg::parser parser(R"(
    ROOT         <-  'hello' 'world'
    %whitespace  <-  [ \t\r\n]*
    %word        <-  [a-z]+
)");

parser.parse("hello world"); // OK
parser.parse("helloworld");  // NG
```
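
One rough way to picture the effect of `%word` in the example above, assuming it prevents a keyword literal from matching when more word characters follow immediately: the hand-written sketch below does not use cpp-peglib, and `match_keyword`, `skip_whitespace` and `parse_hello_world` are hypothetical helpers that hard-code the `[a-z]+` word class and the `[ \t\r\n]*` whitespace rule.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Skip the characters covered by the whitespace rule [ \t\r\n]*.
void skip_whitespace(const std::string& input, std::size_t& pos) {
    while (pos < input.size() &&
           (input[pos] == ' ' || input[pos] == '\t' ||
            input[pos] == '\r' || input[pos] == '\n')) {
        ++pos;
    }
}

// Match `keyword` at `pos`, but reject the match when the keyword is
// immediately followed by another word character ([a-z] in this sketch).
bool match_keyword(const std::string& input, std::size_t& pos,
                   const std::string& keyword) {
    if (input.compare(pos, keyword.size(), keyword) != 0) return false;
    std::size_t end = pos + keyword.size();
    if (end < input.size() && input[end] >= 'a' && input[end] <= 'z') return false;
    pos = end;
    return true;
}

// ROOT <- 'hello' 'world', with whitespace skipped around each literal.
bool parse_hello_world(const std::string& input) {
    std::size_t pos = 0;
    skip_whitespace(input, pos);
    if (!match_keyword(input, pos, "hello")) return false;
    skip_whitespace(input, pos);
    if (!match_keyword(input, pos, "world")) return false;
    skip_whitespace(input, pos);
    return pos == input.size();
}
```

In this sketch `parse_hello_world("hello world")` succeeds, while `parse_hello_world("helloworld")` fails because `hello` is directly followed by a word character.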
Capture/Backreference
---------------------
```cpp
peg::parser parser(R"(
    ROOT      <- CONTENT
    CONTENT   <- (ELEMENT / TEXT)*
    ELEMENT   <- $(STAG CONTENT ETAG)
    STAG      <- '<' $tag< TAG_NAME > '>'
    ETAG      <- '</' $tag '>'
    TAG_NAME  <- 'b' / 'u'
    TEXT      <- TEXT_DATA
    TEXT_DATA <- ![<] .
)");

parser.parse("This is <b>a <u>test</u> text</b>."); // OK
parser.parse("This is <b>a <u>test</b> text</u>."); // NG
parser.parse("This is <b>a <u>test text</b>.");     // NG
```
Parameterized Rule or Macro
---------------------------
```peg
# Syntax
Start ← _ Expr
Expr ← Sum
Sum ← List(Product, SumOpe)
Product ← List(Value, ProOpe)
Value ← Number / T('(') Expr T(')')
# Token
SumOpe ← T('+' / '-')
ProOpe ← T('*' / '/')
Number ← T([0-9]+)
~_ ← [ \t\r\n]*
# Macro
List(I, D) ← I (D I)*
T(x) ← < x > _
```
AST generation
--------------
*cpp-peglib* can generate an AST (Abstract Syntax Tree) while parsing. The `enable_ast` method on the `peg::parser` class enables this feature.
```
peg::parser parser("...");

parser.enable_ast();

shared_ptr<peg::Ast> ast;
if (parser.parse("...", ast)) {
    cout << peg::ast_to_s(ast);

    ast = peg::AstOptimizer(true).optimize(ast);
    cout << peg::ast_to_s(ast);
}
```

`peg::AstOptimizer` removes redundant nodes to make an AST simpler. You can write your own AST optimizers to fit your needs.

See actual usages in the [AST calculator example](https://github.com/yhirose/cpp-peglib/blob/master/example/calc3.cc) and [PL/0 language example](https://github.com/yhirose/cpp-peglib/blob/master/pl0/pl0.cc).

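As a rough picture of what "removing redundant nodes" can mean, the sketch below collapses every chain of single-child nodes into one node. It is only an illustration under stated assumptions, not the actual `peg::AstOptimizer`: `Node`, `make_node` and `collapse` are hypothetical, and the `Outer[Inner]` labelling simply imitates the optimized output format that `peglint --ast --opt` prints.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical node type for illustration; this is not peg::Ast.
struct Node {
    std::string label; // rule name, e.g. "Additive"
    std::string token; // matched text for leaf nodes
    std::vector<std::shared_ptr<Node>> children;
};

// Convenience constructor used to build small trees by hand.
std::shared_ptr<Node> make_node(std::string label, std::string token,
                                std::vector<std::shared_ptr<Node>> children = {}) {
    auto n = std::make_shared<Node>();
    n->label = std::move(label);
    n->token = std::move(token);
    n->children = std::move(children);
    return n;
}

// Collapse each chain of single-child nodes into one node labelled
// "Outer[Inner]", then recurse into the children of the surviving node.
std::shared_ptr<Node> collapse(const std::shared_ptr<Node>& node) {
    std::shared_ptr<Node> cur = node;
    while (cur->children.size() == 1) cur = cur->children[0];

    auto out = std::make_shared<Node>();
    out->label = (cur == node) ? node->label
                               : node->label + "[" + cur->label + "]";
    out->token = cur->token;
    for (const auto& child : cur->children) {
        out->children.push_back(collapse(child));
    }
    return out;
}
```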
Make a parser with parser combinators
-------------------------------------

Instead of making a parser by parsing PEG syntax text, we can also construct a parser by hand with *parser combinators*. Here is an example:

```cpp
using namespace peg;
using namespace std;

vector<string> tags;

Definition ROOT, TAG_NAME, _;
ROOT     <= seq(_, zom(seq(chr('['), TAG_NAME, chr(']'), _)));
TAG_NAME <= oom(seq(npd(chr(']')), dot())), [&](const SemanticValues& sv) {
                tags.push_back(sv.str());
            };
_        <= zom(cls(" \t"));

auto ret = ROOT.parse(" [tag1] [tag:2] [tag-3] ");
```

The following operators are available:

| Operator | Description           |
| :------- | :-------------------- |
| seq      | Sequence              |
| cho      | Prioritized Choice    |
| zom      | Zero or More          |
| oom      | One or More           |
| opt      | Optional              |
| apd      | And predicate         |
| npd      | Not predicate         |
| lit      | Literal string        |
| cls      | Character class       |
| chr      | Character             |
| dot      | Any character         |
| tok      | Token boundary        |
| ign      | Ignore semantic value |
| csc      | Capture scope         |
| cap      | Capture               |
| bkr      | Back reference        |

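To show how such operators compose, here is a minimal, self-contained sketch of a few combinators built on `std::function`. It borrows the operator names from the table above, but it is not cpp-peglib's implementation: `Parser`, `act` and `collect_tags` are hypothetical, and actions fire as soon as their sub-parser matches, even if an enclosing sequence later backtracks.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// A parser returns true and advances `pos` on success;
// on failure it returns false and leaves `pos` unchanged.
using Parser = std::function<bool(const std::string&, std::size_t&)>;

Parser chr(char c) { // a single given character
    return [c](const std::string& s, std::size_t& pos) {
        if (pos < s.size() && s[pos] == c) { ++pos; return true; }
        return false;
    };
}

Parser dot() { // any single character
    return [](const std::string& s, std::size_t& pos) {
        if (pos < s.size()) { ++pos; return true; }
        return false;
    };
}

Parser seq(std::vector<Parser> ps) { // all parsers in order
    return [ps](const std::string& s, std::size_t& pos) {
        std::size_t start = pos;
        for (const auto& p : ps) {
            if (!p(s, pos)) { pos = start; return false; }
        }
        return true;
    };
}

Parser zom(Parser p) { // zero or more; stops on failure or on no progress
    return [p](const std::string& s, std::size_t& pos) {
        while (true) {
            std::size_t before = pos;
            if (!p(s, pos) || pos == before) break;
        }
        return true;
    };
}

Parser oom(Parser p) { return seq({p, zom(p)}); } // one or more

Parser npd(Parser p) { // not predicate: succeeds without consuming input
    return [p](const std::string& s, std::size_t& pos) {
        std::size_t probe = pos;
        return !p(s, probe);
    };
}

// Run `p` and, on success, hand the matched text to `action`.
Parser act(Parser p, std::function<void(const std::string&)> action) {
    return [p, action](const std::string& s, std::size_t& pos) {
        std::size_t start = pos;
        if (!p(s, pos)) return false;
        action(s.substr(start, pos - start));
        return true;
    };
}

// Collect the names inside "[...]" groups; returns an empty list
// when the input is not fully matched.
std::vector<std::string> collect_tags(const std::string& input) {
    std::vector<std::string> tags;
    Parser ws = zom(chr(' '));
    Parser tag_name = act(oom(seq({npd(chr(']')), dot()})),
                          [&tags](const std::string& m) { tags.push_back(m); });
    Parser root = seq({ws, zom(seq({chr('['), tag_name, chr(']'), ws}))});
    std::size_t pos = 0;
    if (!root(input, pos) || pos != input.size()) tags.clear();
    return tags;
}
```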
Unicode support
---------------

Since cpp-peglib only accepts 8-bit characters, it probably accepts UTF-8 text. However, `.` matches only a single byte, not a Unicode character. Also, it doesn't support `\u????`.

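To see why matching one byte is not the same as matching one Unicode character, the following sketch counts UTF-8 code points by looking at lead bytes. It is independent of cpp-peglib; `utf8_sequence_length` and `count_code_points` are hypothetical helpers, and the sketch assumes mostly well-formed UTF-8 (stray bytes are simply counted one at a time).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Number of bytes in the UTF-8 sequence that starts with `lead`,
// or 0 if `lead` is a continuation byte or an invalid lead byte.
std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;           // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
    return 0;
}

// Count code points by stepping over whole sequences instead of bytes.
std::size_t count_code_points(const std::string& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++count) {
        std::size_t n = utf8_sequence_length(static_cast<unsigned char>(s[i]));
        i += (n == 0) ? 1 : n; // skip a stray byte one at a time
    }
    return count;
}
```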
peglint - PEG syntax lint utility
---------------------------------
### Build peglint
```
> cd lint
> mkdir build
> cd build
> cmake ..
> make
> ./peglint
usage: peglint [--ast] [--optimize_ast_nodes|--opt] [--source text] [--server [PORT]] [--trace] [grammar file path] [source file path]
```
### Lint grammar
```
> cat a.peg
A <- 'hello' ^ 'world'
> peglint a.peg
a.peg:1:14: syntax error
```
```
> cat a.peg
A <- B
> peglint a.peg
a.peg:1:6: 'B' is not defined.
```
```
> cat a.peg
A <- B / C
B <- 'b'
C <- A
> peglint a.peg
a.peg:1:10: 'C' is left recursive.
a.peg:3:6: 'A' is left recursive.
```
### Lint source text
```
> cat a.peg
Additive    <- Multitive '+' Additive / Multitive
Multitive   <- Primary '*' Multitive / Primary
Primary     <- '(' Additive ')' / Number
Number      <- < [0-9]+ >
%whitespace <- [ \t\r\n]*
> peglint --source "1 + a * 3" a.peg
[commendline]:1:3: syntax error
```
```
> cat a.txt
1 + 2 * 3
> peglint --ast a.peg a.txt
+ Additive
  + Multitive
    + Primary
      - Number (1)
  + Additive
    + Multitive
      + Primary
        - Number (2)
      + Multitive
        + Primary
          - Number (3)
```
```
> peglint --ast --opt --source "1 + 2 * 3" a.peg
+ Additive
  - Multitive[Number] (1)
  + Additive[Multitive]
    - Primary[Number] (2)
    - Multitive[Number] (3)
```
Sample codes
------------
* [Calculator](https://github.com/yhirose/cpp-peglib/blob/master/example/calc.cc)
* [Calculator (with parser operators)](https://github.com/yhirose/cpp-peglib/blob/master/example/calc2.cc)
* [Calculator (AST version)](https://github.com/yhirose/cpp-peglib/blob/master/example/calc3.cc)
* [PL/0 language example](https://github.com/yhirose/cpp-peglib/blob/master/pl0/pl0.cc)
* [A tiny PL/0 JIT compiler in less than 700 LOC with LLVM and PEG parser](https://github.com/yhirose/pl0-jit-compiler)

PEG debug
---------

A debug viewer for Parsing Expression Grammars using cpp-peglib, by [mqnc](https://github.com/mqnc). Please see [his GitHub project page](https://github.com/mqnc/pegdebug) for details. You can see a parse result of PL/0 code [here](https://mqnc.github.io/pegdebug/example/output.html).

Tested compilers
----------------

* Visual Studio 2017
* Visual Studio 2015
* Visual Studio 2013 with update 5
* Clang++ 5.0.1
* Clang++ 5.0
* Clang++ 4.0
* Clang++ 3.5
* G++ 5.4 on Ubuntu 16.04

IMPORTANT NOTE for Ubuntu: The `-pthread` option is needed when linking. See [#23](https://github.com/yhirose/cpp-peglib/issues/23#issuecomment-261126127) and [#46](https://github.com/yhirose/cpp-peglib/issues/46#issuecomment-417870473).

TODO
----

* Unicode support (`.` matches a Unicode char. `\u????`, `\p{L}`)

License
-------

MIT license (© 2018 Yuji Hirose)