cpp-peglib
==========
A C++11 header-only [PEG](http://en.wikipedia.org/wiki/Parsing_expression_grammar) (Parsing Expression Grammars) library.

*cpp-peglib* aims to provide a more expressive parsing experience in a simple way. The library is a single header file, so you can start using it right away just by including `peglib.h` in your project.

The PEG syntax is well described on page 2 of this [document](http://pdos.csail.mit.edu/papers/parsing:popl04.pdf). *cpp-peglib* also supports the following additional syntax for now:
* `<` ... `>` (Anchor operator)
* `$<` ... `>` (Capture operator)
* `~` (Ignore operator)

How to use
----------
This is a simple calculator sample. It shows how to define a grammar, associate semantic actions with the grammar, and handle semantic values.
```c++
#include <assert.h>

// (1) Include the header file
#include <peglib.h>

using namespace peglib;
using namespace std;

int main(void) {
    // (2) Make a parser
    auto syntax = R"(
        # Grammar for Calculator...
        Additive  <- Multitive '+' Additive / Multitive
        Multitive <- Primary '*' Multitive / Primary
        Primary   <- '(' Additive ')' / Number
        Number    <- [0-9]+
    )";

    peg parser(syntax);

    // (3) Setup an action
    parser["Additive"] = {
        nullptr,                                      // Default action
        [](const vector<any>& v) {
            return v[0].get<int>() + v[1].get<int>(); // 1st choice
        },
        [](const vector<any>& v) { return v[0]; }     // 2nd choice
    };

    parser["Multitive"] = {
        nullptr,                                      // Default action
        [](const vector<any>& v) {
            return v[0].get<int>() * v[1].get<int>(); // 1st choice
        },
        [](const vector<any>& v) { return v[0]; }     // 2nd choice
    };

    /* This action is not necessary.
    parser["Primary"] = [](const vector<any>& v) {
        return v[0];
    };
    */

    parser["Number"] = [](const char* s, size_t l) {
        return stoi(string(s, l), nullptr, 10);
    };

    // (4) Parse
    int val;
    parser.parse("(1+2)*3", val);

    assert(val == 9);
}
```
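To make the roles of the four grammar rules and their actions more concrete, the sketch below evaluates the same kind of expression with a hand-written recursive-descent parser that uses only the standard library. It is an illustration under assumptions, not how *cpp-peglib* works internally: `CalcSketch` and `evaluate` are hypothetical names, and the shared prefix of each choice is factored out instead of being re-parsed after backtracking.

```c++
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical stand-alone evaluator; it does not use peglib.
// Each private member function mirrors one rule of the calculator grammar.
class CalcSketch {
public:
    explicit CalcSketch(const std::string& input) : s_(input), pos_(0) {}

    // Evaluates the whole input; throws if anything is left over.
    int run() {
        int value = additive();
        if (pos_ != s_.size()) throw std::runtime_error("unexpected input");
        return value;
    }

private:
    // Additive <- Multitive '+' Additive / Multitive
    int additive() {
        int lhs = multitive();
        if (accept('+')) return lhs + additive(); // 1st choice
        return lhs;                               // 2nd choice
    }

    // Multitive <- Primary '*' Multitive / Primary
    int multitive() {
        int lhs = primary();
        if (accept('*')) return lhs * multitive(); // 1st choice
        return lhs;                                // 2nd choice
    }

    // Primary <- '(' Additive ')' / Number
    int primary() {
        if (accept('(')) {
            int value = additive();
            if (!accept(')')) throw std::runtime_error("expected ')'");
            return value;
        }
        return number();
    }

    // Number <- [0-9]+
    int number() {
        std::size_t start = pos_;
        while (pos_ < s_.size() &&
               std::isdigit(static_cast<unsigned char>(s_[pos_]))) {
            ++pos_;
        }
        if (start == pos_) throw std::runtime_error("expected a number");
        return std::stoi(s_.substr(start, pos_ - start));
    }

    // Consumes the character c if it is next in the input.
    bool accept(char c) {
        if (pos_ < s_.size() && s_[pos_] == c) {
            ++pos_;
            return true;
        }
        return false;
    }

    std::string s_;
    std::size_t pos_;
};

int evaluate(const std::string& input) { return CalcSketch(input).run(); }
```

With these definitions, `evaluate("(1+2)*3")` returns `9`, the same value the parser above stores in `val`.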
Here is a complete list of available actions:
```c++
[](const char* s, size_t l, const std::vector<peglib::any>& v, any& c)
[](const char* s, size_t l, const std::vector<peglib::any>& v)
[](const char* s, size_t l)
[](const std::vector<peglib::any>& v, any& c)
[](const std::vector<peglib::any>& v)
[]()
[](const SemanticValues& v, any& c)
[](const SemanticValues& v)
```
`const char* s, size_t l` gives a pointer to the matched string and its length.

`const std::vector<peglib::any>& v` contains semantic values. The `peglib::any` class is very similar to [boost::any](http://www.boost.org/doc/libs/1_57_0/doc/html/any.html). You can obtain a value by casting it to the actual type. To determine the actual type, check the return type of the child action that produced the semantic value.

`any& c` is context data that the user can use for any purpose.

`const SemanticValues&` is also available. The `SemanticValues` structure contains all of the above information as well as a vector of the definition names of the semantic values.
```c++
struct SemanticValues
{
    std::vector<any>         values; // Semantic value
    std::vector<std::string> names;  // Definition name
    const char*              s;      // Token start
    size_t                   l;      // Token length
};
```
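The "cast to the actual type" step described above can be tried in isolation. The sketch below uses C++17 `std::any` as a stand-in for `peglib::any` (an assumption made only so the example compiles without the library; `peglib::any` itself is read with `get<T>()`, as in the calculator sample). `add_values` and `holds` are hypothetical helper names.

```c++
#include <any>
#include <string>
#include <typeinfo>
#include <vector>

// Stand-in for an action that receives semantic values.
// Uses C++17 std::any, not peglib::any.
int add_values(const std::vector<std::any>& v) {
    // Both child actions are assumed to have returned int,
    // so both values are cast back to int.
    return std::any_cast<int>(v[0]) + std::any_cast<int>(v[1]);
}

// Returns true when the stored value really has type T.
template <typename T>
bool holds(const std::any& a) {
    return a.type() == typeid(T);
}
```

Casting to a type other than the one the child action returned fails (`std::any_cast` throws `std::bad_any_cast`), which is why the return type of each child action has to be known.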
The following example uses the `<` ... `>` operators. They are the *anchor* operators. Each anchor operator creates a semantic value that contains the `const char*` of its position. This can be useful for eliminating unnecessary characters, such as trailing whitespace.
```c++
auto syntax = R"(
    ROOT  <- _ TOKEN (',' _ TOKEN)*
    TOKEN <- < [a-z0-9]+ > _
    _     <- [ \t\r\n]*
)";

peg pg(syntax);

pg["TOKEN"] = [](const char* s, size_t l, const vector<any>& v) {
    // 'token' doesn't include trailing whitespace
    auto token = string(s, l);
};

auto ret = pg.parse(" token1, token2 ");
```
We can ignore unnecessary semantic values from the list by using the `~` operator.
```c++
peglib::peg parser(
    " ROOT <- _ ITEM (',' _ ITEM _)* "
    " ITEM <- ([a-z0-9])+ "
    " ~_ <- [ \t]* "
);

parser["ROOT"] = [&](const vector<any>& v) {
    assert(v.size() == 2); // should be 2 instead of 5.
};

auto ret = parser.parse(" item1, item2 ");
```
Simple interface
----------------
*cpp-peglib* provides a simple `std::regex`-like interface for trivial tasks.

`peglib::peg_match` tries to capture the strings matched inside the `$< ... >` operator and stores them in a `peglib::match` object.
```c++
peglib::match m;

auto ret = peglib::peg_match(
    R"(
        ROOT     <- _ ('[' $< TAG_NAME > ']' _)*
        TAG_NAME <- (!']' .)+
        _        <- [ \t]*
    )",
    " [tag1] [tag:2] [tag-3] ",
    m);

assert(ret == true);
assert(m.size() == 4);
assert(m.str(1) == "tag1");
assert(m.str(2) == "tag:2");
assert(m.str(3) == "tag-3");
```
There are several ways to *search* for a PEG pattern in a document.
```c++
using namespace peglib;

auto syntax = R"(
    ROOT <- '[' $< [a-z0-9]+ > ']'
)";

auto s = " [tag1] [tag2] [tag3] ";

// peglib::peg_search
peg pg(syntax);
size_t pos = 0;
auto l = strlen(s);
match m;
while (peg_search(pg, s + pos, l - pos, m)) {
    cout << m.str()  << endl; // entire match
    cout << m.str(1) << endl; // submatch #1
    pos += m.length();
}

// peglib::peg_token_iterator
peg_token_iterator it(syntax, s);
while (it != peg_token_iterator()) {
    cout << it->str()  << endl; // entire match
    cout << it->str(1) << endl; // submatch #1
    ++it;
}

// peglib::peg_token_range
for (auto& m: peg_token_range(syntax, s)) {
    cout << m.str()  << endl; // entire match
    cout << m.str(1) << endl; // submatch #1
}
```
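Since this interface is modeled on `std::regex`, it may help to see the standard-library counterpart of the same search. The sketch below uses `std::sregex_iterator` with the pattern `\[([a-z0-9]+)\]`, which plays the role of `'[' $< [a-z0-9]+ > ']'` above. It is plain standard C++ and uses no *cpp-peglib* code; `find_tags` is a hypothetical helper name.

```c++
#include <regex>
#include <string>
#include <vector>

// Collects submatch #1 of every "[tag]" occurrence using std::regex.
// This is the standard-library analogue of the search above, not peglib code.
std::vector<std::string> find_tags(const std::string& text) {
    static const std::regex pattern(R"(\[([a-z0-9]+)\])");
    std::vector<std::string> tags;
    for (std::sregex_iterator it(text.begin(), text.end(), pattern), end;
         it != end; ++it) {
        // it->str() is the entire match, it->str(1) is submatch #1.
        tags.push_back(it->str(1));
    }
    return tags;
}
```

Calling `find_tags(" [tag1] [tag2] [tag3] ")` yields the three tag names in order, which corresponds to the `str(1)` values in the loops above.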
Make a parser with parser operators
-----------------------------------
Instead of making a parser by parsing PEG syntax text, we can also construct a parser by hand with *parser operators*. Here is an example:
```c++
using namespace peglib;
using namespace std;

vector<string> tags;

Definition ROOT, TAG_NAME, _;
ROOT     <= seq(_, zom(seq(chr('['), TAG_NAME, chr(']'), _)));
TAG_NAME <= oom(seq(npd(chr(']')), dot())), [&](const char* s, size_t l) {
                tags.push_back(string(s, l));
            };
_        <= zom(cls(" \t"));

auto ret = ROOT.parse(" [tag1] [tag:2] [tag-3] ");
```
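To give a feel for what operators such as `seq`, `zom` and `npd` mean, here is a minimal stand-alone sketch of a few of them. It is not the library's implementation: the `sketch` namespace and the `Parser` type are assumptions made for illustration, a parser here only returns the number of characters matched (or `-1` on failure), and there is no support for semantic actions or recursive definitions.

```c++
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

namespace sketch {

// Hypothetical parser type: takes (input, length) and returns the number
// of characters matched, or -1 on failure. Not peglib's own type.
using Parser = std::function<long(const char*, std::size_t)>;

// Character: matches exactly one given character.
Parser chr(char c) {
    return [c](const char* s, std::size_t n) -> long {
        return (n > 0 && s[0] == c) ? 1 : -1;
    };
}

// Any character: matches a single character of any value.
Parser dot() {
    return [](const char*, std::size_t n) -> long { return n > 0 ? 1 : -1; };
}

// Sequence: every parser must match, one after another.
Parser seq(std::vector<Parser> ps) {
    return [ps](const char* s, std::size_t n) -> long {
        std::size_t pos = 0;
        for (const auto& p : ps) {
            long len = p(s + pos, n - pos);
            if (len < 0) return -1;
            pos += static_cast<std::size_t>(len);
        }
        return static_cast<long>(pos);
    };
}

// Prioritized choice: the first alternative that matches wins.
Parser cho(std::vector<Parser> ps) {
    return [ps](const char* s, std::size_t n) -> long {
        for (const auto& p : ps) {
            long len = p(s, n);
            if (len >= 0) return len;
        }
        return -1;
    };
}

// Zero or more: greedy repetition that never fails.
Parser zom(Parser p) {
    return [p](const char* s, std::size_t n) -> long {
        std::size_t pos = 0;
        for (;;) {
            long len = p(s + pos, n - pos);
            if (len <= 0) break; // stop on failure or on an empty match
            pos += static_cast<std::size_t>(len);
        }
        return static_cast<long>(pos);
    };
}

// One or more: one mandatory match followed by zero or more.
Parser oom(Parser p) { return seq({p, zom(p)}); }

// Not predicate: succeeds without consuming input when p fails.
Parser npd(Parser p) {
    return [p](const char* s, std::size_t n) -> long {
        return p(s, n) < 0 ? 0 : -1;
    };
}

} // namespace sketch
```

With these pieces, the tag grammar above can be rebuilt as `seq({ws, zom(seq({chr('['), tag, chr(']'), ws}))})`, where `tag` is `oom(seq({npd(chr(']')), dot()}))` and `ws` is `zom(cho({chr(' '), chr('\t')}))`.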
The following operators are available:

| Operator | Description |
| :------- | :------------------ |
| seq | Sequence |
| cho | Prioritized Choice |
| zom | Zero or More |
| oom | One or More |
| opt | Optional |
| apd | And predicate |
| npd | Not predicate |
| lit | Literal string |
| cls | Character class |
| chr | Character |
| dot | Any character |
| anc | Anchor operator |
| cap | Capture operator |
| usr | User defined parser |

Adjust definitions
------------------

It's possible to add and override definitions with parser operators.
```c++
auto syntax = R"(
    ROOT <- _ 'Hello' _ NAME '!' _
)";

Rules rules = {
    {
        "NAME", usr([](const char* s, size_t l, SemanticValues& v, any& c) {
            static vector<string> names = { "PEG", "BNF" };
            for (const auto& n: names) {
                if (n.size() <= l && !n.compare(0, n.size(), s, n.size())) {
                    return success(n.size());
                }
            }
            return fail(s);
        })
    },
    {
        "~_", zom(cls(" \t\r\n"))
    }
};

peg g = peg(syntax, rules);

assert(g.parse(" Hello BNF! "));
```
Sample code
------------
* [Calculator](https://github.com/yhirose/cpp-peglib/blob/master/example/calc.cc)
* [Calculator with parser operators](https://github.com/yhirose/cpp-peglib/blob/master/example/calc2.cc)
* [PEG syntax Lint utility](https://github.com/yhirose/cpp-peglib/blob/master/lint/peglint.cc)

Tested Compilers
----------------
* Visual Studio 2013
* Clang 3.5

TODO
----
* Linear-time parsing (Packrat parsing)
* Optimization of grammars
* Unicode support

Other C++ PEG parser libraries
------------------------------
Thanks to the authors of the libraries that inspired *cpp-peglib*.

* [Boost Spirit X3](https://github.com/djowel/spirit_x3) - A set of C++ libraries for parsing and output generation implemented as Domain Specific Embedded Languages (DSEL) using Expression templates and Template Meta-Programming
* [PEGTL](https://github.com/ColinH/PEGTL) - Parsing Expression Grammar Template Library
* [lars::Parser](https://github.com/TheLartians/Parser) - A header-only linear-time C++ parsing expression grammar (PEG) parser generator supporting left-recursion and grammar ambiguity

License
-------
MIT license (© 2015 Yuji Hirose)