cpp-peglib
==========
C++11 header-only [PEG](http://en.wikipedia.org/wiki/Parsing_expression_grammar) (Parsing Expression Grammars) library.

*cpp-peglib* tries to provide a more expressive parsing experience in a simple way. The library is contained in a single header file, so you can start using it right away just by including `peglib.h` in your project.

The PEG syntax is well described on page 2 of the [document](http://www.brynosaurus.com/pub/lang/peg.pdf). *cpp-peglib* also supports the following additional syntax for now:

* `<` ... `>` (Token boundary operator)
* `~` (Ignore operator)
* `\x20` (Hex number char)
* `$<` ... `>` (Capture operator)
* `$name<` ... `>` (Named capture operator)

This library also supports linear-time parsing, known as [*Packrat*](http://pdos.csail.mit.edu/~baford/packrat/thesis/thesis.pdf) parsing.

How to use
----------

This is a simple calculator sample. It shows how to define a grammar, associate semantic actions with the grammar, and handle semantic values.
```cpp
// (1) Include the header file
#include <peglib.h>
#include <assert.h>

using namespace peg;
using namespace std;

int main(void) {
    // (2) Make a parser
    auto syntax = R"(
        # Grammar for Calculator...
        Additive  <- Multitive '+' Additive / Multitive
        Multitive <- Primary '*' Multitive / Primary
        Primary   <- '(' Additive ')' / Number
        Number    <- [0-9]+
    )";

    parser parser(syntax);

    // (3) Setup an action
    parser["Additive"] = [](const SemanticValues& sv) {
        switch (sv.choice) {
        case 0:  // "Multitive '+' Additive"
            return sv[0].get<int>() + sv[1].get<int>();
        default: // "Multitive"
            return sv[0].get<int>();
        }
    };

    parser["Multitive"] = [](const SemanticValues& sv) {
        switch (sv.choice) {
        case 0:  // "Primary '*' Multitive"
            return sv[0].get<int>() * sv[1].get<int>();
        default: // "Primary"
            return sv[0].get<int>();
        }
    };

    parser["Number"] = [](const SemanticValues& sv) {
        return stoi(sv.str(), nullptr, 10);
    };

    // (4) Parse
    parser.packrat_parsing(); // Enable packrat parsing.

    int val;
    parser.parse("(1+2)*3", val);

    assert(val == 9);
}
```
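The `packrat_parsing()` call above turns on memoization. As a rough illustration of the idea, the sketch below hand-codes the same calculator grammar and caches the result of every (rule, position) pair, so that when the first alternative of `Additive` or `Multitive` fails, the second alternative reuses the already-parsed operand instead of parsing it again. It shows the general packrat technique under simplifying assumptions and is not how cpp-peglib is implemented; `Calc`, `Result` and `memo_hits` are hypothetical names. For example, `Calc("(1+2)*3").parse(v)` returns `true` and sets `v` to 9.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Hypothetical names throughout; this is not cpp-peglib code.
struct Result {
    bool        ok;    // did the rule match at this position?
    std::size_t end;   // offset just past the match
    int         value; // semantic value of the match
};

class Calc {
public:
    explicit Calc(const std::string& src) : src_(src), hits_(0) {}

    // Parses the whole input as Additive and stores the value in `out`.
    bool parse(int& out) {
        Result r = additive(0);
        if (!r.ok || r.end != src_.size()) return false;
        out = r.value;
        return true;
    }

    // Number of times a cached result was reused.
    std::size_t memo_hits() const { return hits_; }

private:
    enum Rule { ADDITIVE, MULTITIVE, PRIMARY, NUMBER };

    bool at(std::size_t pos, char c) const {
        return pos < src_.size() && src_[pos] == c;
    }

    // Packrat core: evaluate each (rule, position) pair at most once.
    template <typename Body>
    Result memoized(Rule rule, std::size_t pos, Body body) {
        auto key = std::make_pair(rule, pos);
        auto it = memo_.find(key);
        if (it != memo_.end()) { ++hits_; return it->second; }
        Result r = body();
        memo_.insert(std::make_pair(key, r));
        return r;
    }

    // Additive <- Multitive '+' Additive / Multitive
    Result additive(std::size_t pos) {
        return memoized(ADDITIVE, pos, [&]() -> Result {
            Result l = multitive(pos);
            if (l.ok && at(l.end, '+')) {
                Result r = additive(l.end + 1);
                if (r.ok) return Result{true, r.end, l.value + r.value};
            }
            return multitive(pos); // second alternative, served from the memo
        });
    }

    // Multitive <- Primary '*' Multitive / Primary
    Result multitive(std::size_t pos) {
        return memoized(MULTITIVE, pos, [&]() -> Result {
            Result l = primary(pos);
            if (l.ok && at(l.end, '*')) {
                Result r = multitive(l.end + 1);
                if (r.ok) return Result{true, r.end, l.value * r.value};
            }
            return primary(pos); // second alternative, served from the memo
        });
    }

    // Primary <- '(' Additive ')' / Number
    Result primary(std::size_t pos) {
        return memoized(PRIMARY, pos, [&]() -> Result {
            if (at(pos, '(')) {
                Result inner = additive(pos + 1);
                if (inner.ok && at(inner.end, ')')) {
                    return Result{true, inner.end + 1, inner.value};
                }
            }
            return number(pos);
        });
    }

    // Number <- [0-9]+
    Result number(std::size_t pos) {
        return memoized(NUMBER, pos, [&]() -> Result {
            std::size_t end = pos;
            int value = 0;
            while (end < src_.size() &&
                   std::isdigit(static_cast<unsigned char>(src_[end]))) {
                value = value * 10 + (src_[end] - '0');
                ++end;
            }
            if (end == pos) return Result{false, pos, 0};
            return Result{true, end, value};
        });
    }

    std::string src_;
    std::map<std::pair<Rule, std::size_t>, Result> memo_;
    std::size_t hits_;
};
```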
The following action signatures are available:
```cpp
[](const SemanticValues& sv, any& dt)
[](const SemanticValues& sv)
```
`const SemanticValues& sv` contains the semantic values. The `SemanticValues` structure is defined as follows.
```cpp
struct SemanticValue {
    any         val;  // Semantic value
    const char* name; // Definition name for the semantic value
    const char* s;    // Token start for the semantic value
    size_t      n;    // Token length for the semantic value

    // Cast semantic value
    template <typename T> T& get();
    template <typename T> const T& get() const;

    // Get token
    std::string str() const;
};

struct SemanticValues : protected std::vector<SemanticValue>
{
    const char* s;      // Token start
    size_t      n;      // Token length
    size_t      choice; // Choice number (0 based index)

    // Get token
    std::string str() const;

    // Transform the semantic value vector to another vector
    template <typename T> std::vector<T> transform(size_t beg = 0, size_t end = -1) const;
};
```
The `peg::any` class is very similar to [boost::any](http://www.boost.org/doc/libs/1_57_0/doc/html/any.html). You can obtain a value by casting it to the actual type. To determine the actual type, check the return type of the child action that produced the semantic value.
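
As a small illustration of why the child action's return type matters, the sketch below uses C++17 `std::any` as a stand-in for `peg::any`. That substitution is an assumption made to keep the snippet self-contained; the two classes are not identical. A child action that returns `long` produces a value that cannot be read back as `int`. `number_action` and `holds` are hypothetical helpers.

```cpp
#include <any>
#include <cassert>
#include <string>

// Stand-in for a child action: it returns long, so the stored
// semantic value holds a long, not an int.
std::any number_action(const std::string& token) {
    return std::stol(token, nullptr, 10);
}

// Returns true when `value` holds exactly type T.
template <typename T>
bool holds(const std::any& value) {
    return std::any_cast<T>(&value) != nullptr;
}

// The parent action must therefore cast to long:
//   long n = std::any_cast<long>(number_action("100"));
```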

`const char* s` and `size_t n` give a pointer to, and the length of, the matched string. They are the same as `sv.s` and `sv.n`.

`any& dt` is a data object that the user can use for any purpose.

The following example uses the `<` ... `>` operators. They are the *token boundary* operators. Each token boundary operator creates a semantic value that contains a `const char*` to the position. This can be useful for eliminating unnecessary characters.
```cpp
auto syntax = R"(
    ROOT  <- _ TOKEN (',' _ TOKEN)*
    TOKEN <- < [a-z0-9]+ > _
    _     <- [ \t\r\n]*
)";

peg::parser pg(syntax);

pg["TOKEN"] = [](const SemanticValues& sv) {
    // 'token' doesn't include trailing whitespaces
    auto token = sv.str();
};

auto ret = pg.parse(" token1, token2 ");
```
We can ignore unnecessary semantic values from the list by using the `~` operator.
```cpp
peg::parser parser(
    "  ROOT  <-  _ ITEM (',' _ ITEM _)*  "
    "  ITEM  <-  ([a-z0-9])+             "
    "  ~_    <-  [ \t]*                  "
);

parser["ROOT"] = [&](const SemanticValues& sv) {
    assert(sv.size() == 2); // should be 2 instead of 5.
};

auto ret = parser.parse(" item1, item2 ");
```
The following grammar is the same as the one above.
```cpp
peg::parser parser(
    "  ROOT  <-  ~_ ITEM (',' ~_ ITEM ~_)*  "
    "  ITEM  <-  ([a-z0-9])+                "
    "  _     <-  [ \t]*                     "
);
```
*Semantic predicate* support is available. We can implement one by throwing a `peg::parse_error` exception in a semantic action.
```cpp
peg::parser parser("NUMBER  <-  [0-9]+");

parser["NUMBER"] = [](const SemanticValues& sv) {
    auto val = stol(sv.str(), nullptr, 10);
    if (val != 100) {
        throw peg::parse_error("value error!!");
    }
    return val;
};

long val;
auto ret = parser.parse("100", val);
assert(ret == true);
assert(val == 100);

ret = parser.parse("200", val);
assert(ret == false);
```
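To show the control flow this relies on, here is a minimal standalone sketch in which an exception thrown by an action is caught by the caller and reported as an ordinary parse failure. The local `parse_error`, `parse_number` and `only_100` are hypothetical stand-ins written for this illustration; they are not the library's types or its internal mechanism.

```cpp
#include <cassert>
#include <cctype>
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for peg::parse_error.
struct parse_error : std::runtime_error {
    explicit parse_error(const std::string& msg) : std::runtime_error(msg) {}
};

// Matches NUMBER <- [0-9]+ against the whole input, then runs `action`
// on the matched text. If the action throws parse_error, the result is
// the same as if the syntax had not matched.
bool parse_number(const std::string& input,
                  const std::function<long(const std::string&)>& action,
                  long& out) {
    if (input.empty()) return false;
    for (char c : input) {
        if (!std::isdigit(static_cast<unsigned char>(c))) return false;
    }
    try {
        out = action(input);
        return true;
    } catch (const parse_error&) {
        return false; // the semantic predicate rejected the input
    }
}

// Action that accepts only the value 100.
long only_100(const std::string& token) {
    long val = std::stol(token, nullptr, 10);
    if (val != 100) throw parse_error("value error!!");
    return val;
}
```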
*before* and *after* actions are also available.
```cpp
parser["RULE"].before = [](any& dt) {
    std::cout << "before" << std::endl;
};

parser["RULE"] = [](const SemanticValues& sv, any& dt) {
    std::cout << "action!" << std::endl;
};

parser["RULE"].after = [](any& dt) {
    std::cout << "after" << std::endl;
};
```
Simple interface
----------------

*cpp-peglib* provides a simple `std::regex`-like interface for trivial tasks.

`peg::peg_match` tries to capture strings with the `$< ... >` operator and stores them in a `peg::match` object.
```cpp
peg::match m;

auto ret = peg::peg_match(
    R"(
        ROOT      <-  _ ('[' $< TAG_NAME > ']' _)*
        TAG_NAME  <-  (!']' .)+
        _         <-  [ \t]*
    )",
    " [tag1] [tag:2] [tag-3] ",
    m);

assert(ret == true);
assert(m.size() == 4);
assert(m.str(1) == "tag1");
assert(m.str(2) == "tag:2");
assert(m.str(3) == "tag-3");
```
It also supports named capture with the `$name<` ... `>` operator.
```cpp
peg::match m;

auto ret = peg::peg_match(
    R"(
        ROOT      <-  _ ('[' $test< TAG_NAME > ']' _)*
        TAG_NAME  <-  (!']' .)+
        _         <-  [ \t]*
    )",
    " [tag1] [tag:2] [tag-3] ",
    m);

auto cap = m.named_capture("test");

assert(ret == true);
assert(m.size() == 4);
assert(cap.size() == 3);
assert(m.str(cap[2]) == "tag-3");
```
There are several ways to *search* for a PEG pattern in a document.
```cpp
using namespace peg;
using namespace std;

auto syntax = R"(
    ROOT <- '[' $< [a-z0-9]+ > ']'
)";

auto s = " [tag1] [tag2] [tag3] ";

// peg::peg_search
parser pg(syntax);
size_t pos = 0;
auto n = strlen(s);
match m;
while (peg_search(pg, s + pos, n - pos, m)) {
    cout << m.str()  << endl; // entire match
    cout << m.str(1) << endl; // submatch #1
    pos += m.length();
}

// peg::peg_token_iterator
peg_token_iterator it(syntax, s);
while (it != peg_token_iterator()) {
    cout << it->str()  << endl; // entire match
    cout << it->str(1) << endl; // submatch #1
    ++it;
}

// peg::peg_token_range
for (auto& m: peg_token_range(syntax, s)) {
    cout << m.str()  << endl; // entire match
    cout << m.str(1) << endl; // submatch #1
}
```
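Conceptually, searching can be built from an anchored matcher by trying successive start offsets and resuming just past each hit. The sketch below does this by hand for the `ROOT` pattern above. It is only an illustration of that idea, with hypothetical helpers `match_tag` and `find_tags`, and makes no claim about how `peg_search` is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// True for characters in the class [a-z0-9].
bool is_tag_char(char c) {
    return (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9');
}

// Anchored matcher for ROOT <- '[' [a-z0-9]+ ']'.
// Returns the matched length, or 0 when there is no match at `s`.
std::size_t match_tag(const char* s, std::size_t n) {
    if (n < 3 || s[0] != '[') return 0;
    std::size_t i = 1;
    while (i < n && is_tag_char(s[i])) ++i;
    if (i == 1 || i >= n || s[i] != ']') return 0;
    return i + 1;
}

// Search = repeated anchored matching: slide the start offset forward
// until the matcher succeeds, then resume just past the match.
std::vector<std::string> find_tags(const std::string& text) {
    std::vector<std::string> tags;
    std::size_t pos = 0;
    while (pos < text.size()) {
        std::size_t len = match_tag(text.data() + pos, text.size() - pos);
        if (len == 0) { ++pos; continue; }
        tags.push_back(text.substr(pos + 1, len - 2)); // text between the brackets
        pos += len;
    }
    return tags;
}
```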
Make a parser with parser combinators
-------------------------------------

Instead of making a parser by parsing PEG syntax text, we can also construct a parser by hand with *parser combinators*. Here is an example:
```cpp
using namespace peg;
using namespace std;

vector<string> tags;

Definition ROOT, TAG_NAME, _;
ROOT     <= seq(_, zom(seq(chr('['), TAG_NAME, chr(']'), _)));
TAG_NAME <= oom(seq(npd(chr(']')), dot())), [&](const SemanticValues& sv) {
                tags.push_back(sv.str());
            };
_        <= zom(cls(" \t"));

auto ret = ROOT.parse(" [tag1] [tag:2] [tag-3] ");
```
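For readers new to combinators, the following standalone sketch shows how a handful of such operators can be written as plain functions that return parsers. It borrows the operator names for readability, but everything here (`Parser`, `kFail`, `act`, `parse_tags`, and the `seq` that takes a vector) is a hypothetical toy written for this illustration, not the library's implementation. Unlike a real library, the toy runs actions eagerly, even on a branch that later fails.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// A parser takes (input, length) and returns the number of bytes
// consumed, or kFail when it does not match.
using Parser = std::function<std::size_t(const char*, std::size_t)>;
const std::size_t kFail = static_cast<std::size_t>(-1);

Parser chr(char c) { // a single character
    return [c](const char* s, std::size_t n) -> std::size_t {
        return (n > 0 && s[0] == c) ? 1 : kFail;
    };
}

Parser cls(const std::string& chars) { // any character from a set
    return [chars](const char* s, std::size_t n) -> std::size_t {
        return (n > 0 && chars.find(s[0]) != std::string::npos) ? 1 : kFail;
    };
}

Parser dot() { // any single byte
    return [](const char*, std::size_t n) -> std::size_t { return n > 0 ? 1 : kFail; };
}

Parser npd(Parser p) { // not-predicate: succeeds without consuming when p fails
    return [p](const char* s, std::size_t n) -> std::size_t {
        return p(s, n) == kFail ? 0 : kFail;
    };
}

Parser seq(std::vector<Parser> ps) { // all parsers, one after another
    return [ps](const char* s, std::size_t n) -> std::size_t {
        std::size_t i = 0;
        for (const auto& p : ps) {
            std::size_t len = p(s + i, n - i);
            if (len == kFail) return kFail;
            i += len;
        }
        return i;
    };
}

Parser zom(Parser p) { // zero or more
    return [p](const char* s, std::size_t n) -> std::size_t {
        std::size_t i = 0;
        for (;;) {
            std::size_t len = p(s + i, n - i);
            if (len == kFail || len == 0) break; // stop on failure or empty match
            i += len;
        }
        return i;
    };
}

Parser oom(Parser p) { // one or more
    return [p](const char* s, std::size_t n) -> std::size_t {
        std::size_t first = p(s, n);
        if (first == kFail) return kFail;
        return first + zom(p)(s + first, n - first);
    };
}

// Runs `action` on the matched text whenever p succeeds.
Parser act(Parser p, std::function<void(const std::string&)> action) {
    return [p, action](const char* s, std::size_t n) -> std::size_t {
        std::size_t len = p(s, n);
        if (len != kFail) action(std::string(s, len));
        return len;
    };
}

// Collects the tag names; returns an empty vector unless the whole input matches.
std::vector<std::string> parse_tags(const std::string& input) {
    std::vector<std::string> tags;
    Parser ws       = zom(cls(" \t"));
    Parser tag_name = act(oom(seq({npd(chr(']')), dot()})),
                          [&tags](const std::string& t) { tags.push_back(t); });
    Parser root     = seq({ws, zom(seq({chr('['), tag_name, chr(']'), ws}))});
    if (root(input.data(), input.size()) != input.size()) tags.clear();
    return tags;
}
```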
The following operators are available:

| Operator | Description           |
| :------- | :-------------------- |
| seq      | Sequence              |
| cho      | Prioritized Choice    |
| zom      | Zero or More          |
| oom      | One or More           |
| opt      | Optional              |
| apd      | And predicate         |
| npd      | Not predicate         |
| lit      | Literal string        |
| cls      | Character class       |
| chr      | Character             |
| dot      | Any character         |
| tok      | Token boundary        |
| ign      | Ignore semantic value |
| cap      | Capture character     |
| usr      | User defined parser   |

Adjust definitions
------------------

It's possible to add/override definitions.
```cpp
auto syntax = R"(
    ROOT <- _ 'Hello' _ NAME '!' _
)";

Rules additional_rules = {
    {
        "NAME", usr([](const char* s, size_t n, SemanticValues& sv, any& c) -> size_t {
            static vector<string> names = { "PEG", "BNF" };
            for (const auto& name: names) {
                if (name.size() <= n && !name.compare(0, name.size(), s, name.size())) {
                    return name.size(); // processed length
                }
            }
            return -1; // parse error
        })
    },
    {
        "~_", zom(cls(" \t\r\n"))
    }
};

auto g = parser(syntax, additional_rules);

assert(g.parse(" Hello BNF! "));
```
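The user-defined parser above follows a simple contract: return the number of bytes consumed, or `-1` for a parse error. Because the return type is `size_t`, that `-1` is converted to the largest `size_t` value. The standalone function below isolates the same matching logic so the contract can be checked without the library; `match_name` and `kNoMatch` are hypothetical names introduced for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// What `return -1` becomes in a function that returns size_t.
const std::size_t kNoMatch = static_cast<std::size_t>(-1);

// Returns the number of bytes consumed from the start of `s`,
// or kNoMatch when none of the known names is a prefix of the input.
std::size_t match_name(const char* s, std::size_t n) {
    static const std::vector<std::string> names = { "PEG", "BNF" };
    for (const auto& name : names) {
        if (name.size() <= n &&
            name.compare(0, name.size(), s, name.size()) == 0) {
            return name.size(); // processed length
        }
    }
    return kNoMatch; // parse error
}
```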
Unicode support
---------------

Since cpp-peglib only accepts 8-bit characters, it probably accepts UTF-8 text. However, `.` matches only a single byte, not a Unicode character. Also, it doesn't support `\u????`.
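
To make the byte versus character distinction concrete, the sketch below inspects UTF-8 input directly. A character such as `é` occupies two bytes, so a byte-oriented `.` would have to match twice to consume it. The helpers `utf8_sequence_length` and `count_code_points` are hypothetical and are not part of cpp-peglib; the second one assumes well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Length in bytes of the UTF-8 sequence that starts with `lead`,
// or 0 if `lead` cannot start a sequence (e.g. a continuation byte).
std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;           // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
    return 0;
}

// Counts code points by counting every byte that is not a
// continuation byte (10xxxxxx). Assumes well-formed UTF-8.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) ++count;
    }
    return count;
}
```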

Sample codes
------------

* [Calculator](https://github.com/yhirose/cpp-peglib/blob/master/example/calc.cc)
* [Calculator (with parser operators)](https://github.com/yhirose/cpp-peglib/blob/master/example/calc2.cc)
* [Calculator (AST version)](https://github.com/yhirose/cpp-peglib/blob/master/example/calc3.cc)
* [PEG syntax Lint utility](https://github.com/yhirose/cpp-peglib/blob/master/lint/cmdline/peglint.cc)
* [PL/0 Interpreter](https://github.com/yhirose/cpp-peglib/blob/master/language/pl0/pl0.cc)

Tested compilers
----------------

* Visual Studio 2015
* Clang 3.5

TODO
----

* Semantic predicate (`&{ expr }` and `!{ expr }`)
* Unicode support (`.` matches a Unicode char. `\u????`, `\p{L}`)
* Ignore white spaces after string literals and tokens
* Allow `←` and `ε`

License
-------

MIT license (© 2015 Yuji Hirose)