cpp-peglib
==========
C++11 header-only [PEG](http://en.wikipedia.org/wiki/Parsing_expression_grammar) (Parsing Expression Grammars) library.
*cpp-peglib* aims to provide a more expressive parsing experience in a simple way. The library consists of a single header file, so you can start using it right away by including `peglib.h` in your project.

The PEG syntax is well described on page 2 of this [document](http://www.brynosaurus.com/pub/lang/peg.pdf). *cpp-peglib* also supports the following additional syntax for now:

* `<` ... `>` (Token boundary operator)
* `~` (Ignore operator)
* `\x20` (Hex number char)
* `$<` ... `>` (Capture operator)
* `$name<` ... `>` (Named capture operator)
This library also supports linear-time parsing, known as [*Packrat*](http://pdos.csail.mit.edu/~baford/packrat/thesis/thesis.pdf) parsing.
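
To illustrate what packrat parsing buys, here is a minimal hand-written sketch of a memoizing recognizer for a small calculator grammar. It is not cpp-peglib's implementation: the `Recognizer` class, the `kFail` constant, and the `calls` counter are hypothetical names used only to show how caching each (rule, position) result avoids re-parsing after backtracking. On an input such as `((((1))))`, the memoized recognizer evaluates each rule once per position, while the plain one repeats the nested work every time an alternative is retried.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Hypothetical failure marker; not a cpp-peglib name.
const std::size_t kFail = std::string::npos;

// Hand-written recognizer for:
//   Additive  <- Multitive '+' Additive / Multitive
//   Multitive <- Primary '*' Multitive / Primary
//   Primary   <- '(' Additive ')' / [0-9]+
class Recognizer {
public:
    Recognizer(std::string input, bool memoize)
        : input_(std::move(input)), memoize_(memoize) {}

    // True if the whole input matches Additive.
    bool match() { return additive(0) == input_.size(); }

    std::size_t calls = 0; // number of rule bodies actually evaluated

private:
    enum Rule { kAdditive, kMultitive, kPrimary };
    using Key = std::pair<Rule, std::size_t>;

    // Evaluate `body` at `pos`, caching the end position per (rule, pos).
    template <typename Body>
    std::size_t apply(Rule rule, std::size_t pos, Body body) {
        assert(pos <= input_.size());
        const Key key(rule, pos);
        if (memoize_) {
            auto hit = memo_.find(key);
            if (hit != memo_.end()) return hit->second;
        }
        ++calls;
        const std::size_t end = body();
        if (memoize_) memo_[key] = end;
        return end;
    }

    std::size_t additive(std::size_t pos) {
        return apply(kAdditive, pos, [&]() -> std::size_t {
            std::size_t p = multitive(pos);
            if (p != kFail && p < input_.size() && input_[p] == '+') {
                std::size_t q = additive(p + 1);
                if (q != kFail) return q;
            }
            return multitive(pos); // second alternative retries Multitive
        });
    }

    std::size_t multitive(std::size_t pos) {
        return apply(kMultitive, pos, [&]() -> std::size_t {
            std::size_t p = primary(pos);
            if (p != kFail && p < input_.size() && input_[p] == '*') {
                std::size_t q = multitive(p + 1);
                if (q != kFail) return q;
            }
            return primary(pos); // second alternative retries Primary
        });
    }

    std::size_t primary(std::size_t pos) {
        return apply(kPrimary, pos, [&]() -> std::size_t {
            if (pos < input_.size() && input_[pos] == '(') {
                std::size_t p = additive(pos + 1);
                if (p != kFail && p < input_.size() && input_[p] == ')') return p + 1;
            }
            std::size_t p = pos;
            while (p < input_.size() && input_[p] >= '0' && input_[p] <= '9') ++p;
            return p > pos ? p : kFail;
        });
    }

    std::string input_;
    bool memoize_;
    std::map<Key, std::size_t> memo_;
};
```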
How to use
----------
This is a simple calculator sample. It shows how to define a grammar, associate semantic actions with it, and handle semantic values.
```cpp
// (1) Include the header file
#include <peglib.h>
#include <assert.h>

using namespace peg;
using namespace std;

int main(void) {
    // (2) Make a parser
    auto syntax = R"(
        # Grammar for Calculator...
        Additive  <- Multitive '+' Additive / Multitive
        Multitive <- Primary '*' Multitive / Primary
        Primary   <- '(' Additive ')' / Number
        Number    <- [0-9]+
    )";

    parser parser(syntax);

    // (3) Setup an action
    parser["Additive"] = [](const SemanticValues& sv) {
        switch (sv.choice) {
        case 0:  // "Multitive '+' Additive"
            return sv[0].get<int>() + sv[1].get<int>();
        default: // "Multitive"
            return sv[0].get<int>();
        }
    };

    parser["Multitive"] = [](const SemanticValues& sv) {
        switch (sv.choice) {
        case 0:  // "Primary '*' Multitive"
            return sv[0].get<int>() * sv[1].get<int>();
        default: // "Primary"
            return sv[0].get<int>();
        }
    };

    parser["Number"] = [](const SemanticValues& sv) {
        return stoi(sv.str(), nullptr, 10);
    };

    // (4) Parse
    parser.packrat_parsing(); // Enable packrat parsing.

    int val;
    parser.parse("(1+2)*3", val);

    assert(val == 9);
}
```
The following action signatures are available:
```cpp
[](const SemanticValues& sv, any& dt)
[](const SemanticValues& sv)
```
`const SemanticValues& sv` contains the semantic values. The `SemanticValues` structure is defined as follows.
```cpp
struct SemanticValue {
    any val;          // Semantic value
    const char* name; // Definition name for the semantic value
    const char* s;    // Token start for the semantic value
    size_t n;         // Token length for the semantic value

    // Cast semantic value
    template <typename T> T& get();
    template <typename T> const T& get() const;

    // Get token
    std::string str() const;
};

struct SemanticValues : protected std::vector<SemanticValue>
{
    const char* s; // Token start
    size_t n;      // Token length
    size_t choice; // Choice number (0 based index)

    // Get token
    std::string str() const;

    // Transform the semantic value vector to another vector
    template <typename T> std::vector<T> transform(size_t beg = 0, size_t end = -1) const;
};
```
The `peg::any` class is very similar to [boost::any](http://www.boost.org/doc/libs/1_57_0/doc/html/any.html). You obtain a value by casting it to its actual type. To determine the actual type, check the return type of the child action that produced the semantic value.

`const char* s, size_t n` give a pointer to the matched string and its length. They are the same as `sv.s` and `sv.n`.

`any& dt` is a user data object that can be used for any purpose.

The following example uses the `<` ... `>` operators, which are the *token boundary* operators. Each token boundary operator creates a semantic value that contains a `const char*` pointing to the position of the token. This is useful for eliminating unnecessary characters.
```cpp
auto syntax = R"(
    ROOT  <- _ TOKEN (',' _ TOKEN)*
    TOKEN <- < [a-z0-9]+ > _
    _     <- [ \t\r\n]*
)";

peg::parser pg(syntax);

pg["TOKEN"] = [](const SemanticValues& sv) {
    // 'token' doesn't include trailing whitespace
    auto token = sv.str();
};

auto ret = pg.parse(" token1, token2 ");
```
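The effect of the token boundary can be sketched without the library. The hypothetical `split_tokens` helper below hand-codes the grammar above, assuming tokens are `[a-z0-9]+` separated by commas and optional whitespace. It is an illustration rather than cpp-peglib code: the position where `<` would sit marks the token start, the position where `>` would sit marks its end, and the trailing whitespace is consumed but never stored. For the input `" token1, token2 "` it yields `token1` and `token2`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper that hand-codes
//   ROOT  <- _ TOKEN (',' _ TOKEN)*
//   TOKEN <- < [a-z0-9]+ > _
//   _     <- [ \t\r\n]*
// Returns the text inside each token boundary, or an empty vector if the
// input does not match the grammar.
std::vector<std::string> split_tokens(const std::string& s) {
    std::vector<std::string> tokens;
    std::size_t pos = 0;

    auto skip_ws = [&]() {
        while (pos < s.size() && (s[pos] == ' ' || s[pos] == '\t' ||
                                  s[pos] == '\r' || s[pos] == '\n')) {
            ++pos;
        }
    };

    auto token = [&]() -> bool {
        const std::size_t begin = pos; // where '<' sits in the grammar
        while (pos < s.size() && ((s[pos] >= 'a' && s[pos] <= 'z') ||
                                  (s[pos] >= '0' && s[pos] <= '9'))) {
            ++pos;
        }
        if (pos == begin) return false;
        tokens.emplace_back(s, begin, pos - begin); // where '>' sits
        skip_ws(); // matched by the rule, but outside the token
        return true;
    };

    skip_ws();
    if (!token()) return {};
    while (pos < s.size() && s[pos] == ',') {
        ++pos;
        skip_ws();
        if (!token()) return {};
    }
    if (pos != s.size()) return {};
    return tokens;
}
```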
We can exclude unnecessary semantic values from the list by using the `~` operator.
```cpp
peg::parser parser(
    "  ROOT  <-  _ ITEM (',' _ ITEM _)*  "
    "  ITEM  <-  ([a-z0-9])+  " // digits are needed to match "item1"
    "  ~_    <-  [ \t]*  "
);

parser["ROOT"] = [&](const SemanticValues& sv) {
    assert(sv.size() == 2); // should be 2 instead of 5.
};

auto ret = parser.parse(" item1, item2 ");
```
The following grammar is equivalent to the one above.
```cpp
peg::parser parser(
    "  ROOT  <-  ~_ ITEM (',' ~_ ITEM ~_)*  "
    "  ITEM  <-  ([a-z0-9])+  "
    "  _     <-  [ \t]*  "
);
```
*Semantic predicate* support is available. To reject a match, throw a `peg::parse_error` exception in a semantic action.
```cpp
peg::parser parser("NUMBER <- [0-9]+");

parser["NUMBER"] = [](const SemanticValues& sv) {
    auto val = stol(sv.str(), nullptr, 10);
    if (val != 100) {
        throw peg::parse_error("value error!!");
    }
    return val;
};

long val;
auto ret = parser.parse("100", val);
assert(ret == true);
assert(val == 100);

ret = parser.parse("200", val);
assert(ret == false);
```
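The control flow behind this pattern can be sketched with the standard library alone. In the sketch below, `value_error`, `number_action`, and `parse_number` are hypothetical stand-ins for `peg::parse_error`, the `NUMBER` action, and `parser.parse`. It only illustrates the idea that an exception thrown by the action turns a syntactic match into a failed parse, and it makes no claim about how cpp-peglib handles the exception internally.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for peg::parse_error.
struct value_error : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Plays the role of the NUMBER action: converts the token, then vetoes
// every value other than 100.
long number_action(const std::string& token) {
    const long val = std::stol(token, nullptr, 10);
    if (val != 100) {
        throw value_error("value error!!");
    }
    return val;
}

// Plays the role of parser.parse(): matches [0-9]+ over the whole input,
// runs the action, and reports failure if the action throws value_error.
// `out` is only written on success.
bool parse_number(const std::string& input, long& out) {
    if (input.empty()) return false;
    for (char c : input) {
        if (c < '0' || c > '9') return false; // syntactic failure
    }
    try {
        out = number_action(input);
        return true;
    } catch (const value_error&) {
        return false; // semantic failure
    }
}
```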
*before* and *after* actions are also available.
```cpp
parser["RULE"].before = [](any& dt) {
    std::cout << "before" << std::endl;
};

parser["RULE"] = [](const SemanticValues& sv, any& dt) {
    std::cout << "action!" << std::endl;
};

parser["RULE"].after = [](any& dt) {
    std::cout << "after" << std::endl;
};
```
Simple interface
----------------
*cpp-peglib* provides a simple, `std::regex`-like interface for trivial tasks.

`peg::peg_match` captures the strings matched inside the `$< ... >` operator and stores them in a `peg::match` object.
```cpp
peg::match m;

auto ret = peg::peg_match(
    R"(
        ROOT      <-  _ ('[' $< TAG_NAME > ']' _)*
        TAG_NAME  <-  (!']' .)+
        _         <-  [ \t]*
    )",
    " [tag1] [tag:2] [tag-3] ",
    m);

assert(ret == true);
assert(m.size() == 4);
assert(m.str(1) == "tag1");
assert(m.str(2) == "tag:2");
assert(m.str(3) == "tag-3");
```
It also supports named captures with the `$name<` ... `>` operator.
```cpp
peg::match m;

auto ret = peg::peg_match(
    R"(
        ROOT      <-  _ ('[' $test< TAG_NAME > ']' _)*
        TAG_NAME  <-  (!']' .)+
        _         <-  [ \t]*
    )",
    " [tag1] [tag:2] [tag-3] ",
    m);

auto cap = m.named_capture("test");

assert(ret == true);
assert(m.size() == 4);
assert(cap.size() == 3);
assert(m.str(cap[2]) == "tag-3");
```
There are several ways to *search* for a PEG pattern in a document.
```cpp
using namespace peg;

auto syntax = R"(
    ROOT <- '[' $< [a-z0-9]+ > ']'
)";

auto s = " [tag1] [tag2] [tag3] ";

// peg::peg_search
parser pg(syntax);
size_t pos = 0;
auto n = strlen(s);
match m;
while (peg_search(pg, s + pos, n - pos, m)) {
    cout << m.str()  << endl; // entire match
    cout << m.str(1) << endl; // submatch #1
    pos += m.length();
}

// peg::peg_token_iterator
peg_token_iterator it(syntax, s);
while (it != peg_token_iterator()) {
    cout << it->str()  << endl; // entire match
    cout << it->str(1) << endl; // submatch #1
    ++it;
}

// peg::peg_token_range
for (auto& m: peg_token_range(syntax, s)) {
    cout << m.str()  << endl; // entire match
    cout << m.str(1) << endl; // submatch #1
}
```
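For comparison, here is roughly the same search written with the standard `<regex>` header, which this interface is modeled on. The `find_tags` helper is a hypothetical name, and the regular expression is an assumed equivalent of the `ROOT` rule above. Each iterator step plays the role of one `peg_search` call, and capture group 1 corresponds to `m.str(1)`.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects capture group 1 of every match of
// "[" [a-z0-9]+ "]" in `text`, in order of appearance.
std::vector<std::string> find_tags(const std::string& text) {
    static const std::regex pattern(R"(\[([a-z0-9]+)\])");
    std::vector<std::string> tags;
    const std::sregex_iterator end;
    for (std::sregex_iterator it(text.begin(), text.end(), pattern); it != end; ++it) {
        tags.push_back((*it)[1].str()); // submatch #1
    }
    return tags;
}
```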
Make a parser with parser combinators
-------------------------------------
Instead of making a parser by parsing PEG syntax text, we can also construct a parser by hand with *parser combinators*. Here is an example:
```cpp
using namespace peg;
using namespace std;

vector<string> tags;

Definition ROOT, TAG_NAME, _;
ROOT     <= seq(_, zom(seq(chr('['), TAG_NAME, chr(']'), _)));
TAG_NAME <= oom(seq(npd(chr(']')), dot())), [&](const SemanticValues& sv) {
                tags.push_back(sv.str());
            };
_        <= zom(cls(" \t"));

auto ret = ROOT.parse(" [tag1] [tag:2] [tag-3] ");
```
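To give a feel for how such combinators compose, here is a minimal stand-alone sketch built on `std::function`. Everything in the `sketch` namespace is hypothetical: the names mirror a few of the operators used above, but the code is a simplified re-implementation for illustration only. `seq` is binary, there are no semantic values, and a parser returns the end position of its match or `sketch::npos` on failure.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

namespace sketch {

const std::size_t npos = std::string::npos;

// A parser maps (input, start position) to the end position of the match,
// or npos on failure.
using Parser = std::function<std::size_t(const std::string&, std::size_t)>;

Parser chr(char c) { // a single given character
    return [c](const std::string& s, std::size_t i) -> std::size_t {
        return i < s.size() && s[i] == c ? i + 1 : npos;
    };
}

Parser cls(std::string chars) { // any one character from `chars`
    return [chars](const std::string& s, std::size_t i) -> std::size_t {
        return i < s.size() && chars.find(s[i]) != std::string::npos ? i + 1 : npos;
    };
}

Parser dot() { // any single character (here: any single byte)
    return [](const std::string& s, std::size_t i) -> std::size_t {
        return i < s.size() ? i + 1 : npos;
    };
}

Parser seq(Parser a, Parser b) { // a followed by b
    return [a, b](const std::string& s, std::size_t i) -> std::size_t {
        const std::size_t m = a(s, i);
        return m == npos ? npos : b(s, m);
    };
}

Parser npd(Parser p) { // not predicate: succeeds without consuming input
    return [p](const std::string& s, std::size_t i) -> std::size_t {
        return p(s, i) == npos ? i : npos;
    };
}

Parser zom(Parser p) { // zero or more, greedy
    return [p](const std::string& s, std::size_t i) -> std::size_t {
        for (;;) {
            const std::size_t m = p(s, i);
            if (m == npos || m == i) return i; // stop on failure or empty match
            i = m;
        }
    };
}

Parser oom(Parser p) { return seq(p, zom(p)); } // one or more

// ROOT <- _ ('[' TAG_NAME ']' _)*
Parser make_root() {
    Parser ws = zom(cls(" \t"));
    Parser tag_name = oom(seq(npd(chr(']')), dot()));
    Parser item = seq(chr('['), seq(tag_name, seq(chr(']'), ws)));
    return seq(ws, zom(item));
}

} // namespace sketch
```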
The following operators are available:

| Operator | Description           |
| :------- | :-------------------- |
| seq      | Sequence              |
| cho      | Prioritized Choice    |
| zom      | Zero or More          |
| oom      | One or More           |
| opt      | Optional              |
| apd      | And predicate         |
| npd      | Not predicate         |
| lit      | Literal string        |
| cls      | Character class       |
| chr      | Character             |
| dot      | Any character         |
| tok      | Token boundary        |
| ign      | Ignore semantic value |
| cap      | Capture character     |
| usr      | User-defined parser   |

Adjust definitions
------------------
It's possible to add/override definitions.
```cpp
auto syntax = R"(
    ROOT <- _ 'Hello' _ NAME '!' _
)";

Rules additional_rules = {
    {
        "NAME", usr([](const char* s, size_t n, SemanticValues& sv, any& c) -> size_t {
            static vector<string> names = { "PEG", "BNF" };
            for (const auto& name: names) {
                if (name.size() <= n && !name.compare(0, name.size(), s, name.size())) {
                    return name.size(); // processed length
                }
            }
            return -1; // parse error
        })
    },
    {
        "~_", zom(cls(" \t\r\n"))
    }
};

auto g = parser(syntax, additional_rules);

assert(g.parse(" Hello BNF! "));
```
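The contract of the `usr` callback is easiest to see in isolation: it receives a pointer to the remaining input and its length, and returns the number of characters it consumed, or `-1` converted to `size_t` on failure. The free function below repeats the matching logic of the `NAME` rule without the `SemanticValues` and `any` parameters, so it compiles without `peglib.h`. `match_name` is a hypothetical name used only for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-alone version of the "NAME" callback above.
// Returns the processed length, or static_cast<size_t>(-1) on failure.
std::size_t match_name(const char* s, std::size_t n) {
    static const std::vector<std::string> names = { "PEG", "BNF" };
    for (const auto& name : names) {
        // Compare only when enough input is left to hold the whole name.
        if (name.size() <= n && name.compare(0, name.size(), s, name.size()) == 0) {
            return name.size(); // processed length
        }
    }
    return static_cast<std::size_t>(-1); // parse error
}
```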
Unicode support
---------------
Since cpp-peglib only accepts 8-bit characters, it probably accepts UTF-8 text. However, `.` matches a single byte, not a Unicode character. Also, it doesn't support `\u????`.
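
The difference between bytes and Unicode characters can be made concrete with a short standard-library sketch. The helpers `utf8_sequence_length` and `count_code_points` are hypothetical and assume well-formed UTF-8 input. They show that a character such as "é", encoded as the bytes `0xC3 0xA9`, occupies two bytes, so a byte-wise `.` has to match twice to consume it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Length in bytes of the UTF-8 sequence that starts with `lead`, or 0 if
// `lead` cannot start a sequence (continuation byte or invalid byte).
std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;           // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
    return 0;
}

// Counts code points by stepping from lead byte to lead byte.
// Assumes well-formed UTF-8; stray bytes are counted one at a time.
std::size_t count_code_points(const std::string& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size();) {
        const std::size_t len = utf8_sequence_length(static_cast<unsigned char>(s[i]));
        i += len ? len : 1;
        ++count;
    }
    return count;
}
```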
Sample code
-----------
* [Calculator](https://github.com/yhirose/cpp-peglib/blob/master/example/calc.cc)
* [Calculator (with parser operators)](https://github.com/yhirose/cpp-peglib/blob/master/example/calc2.cc)
* [Calculator (AST version)](https://github.com/yhirose/cpp-peglib/blob/master/example/calc3.cc)
* [PEG syntax Lint utility](https://github.com/yhirose/cpp-peglib/blob/master/lint/cmdline/peglint.cc)
* [PL/0 Interpreter](https://github.com/yhirose/cpp-peglib/blob/master/language/pl0/pl0.cc)

Tested compilers
----------------
* Visual Studio 2015
* Visual Studio 2013 with Update 5
* Clang 3.5

TODO
----
* Semantic predicate (`&{ expr }` and `!{ expr }`)
* Unicode support (`.` matches a Unicode char. `\u????`, `\p{L}`)
* Ignore white spaces after string literals and tokens
* Allow `←` and `ε`

License
-------
MIT license (© 2015 Yuji Hirose)