cpp-peglib/README.md
2015-02-21 19:38:30 -05:00

cpp-peglib

C++11 header-only PEG (Parsing Expression Grammars) library.

cpp-peglib aims to provide a more expressive parsing experience in a simple way. The library consists of a single header file, so you can start using it right away by including peglib.h in your project.

The PEG syntax is well described on page 2 in the document. cpp-peglib also supports the following additional syntax for now:

  • < ... > (Anchor operator)
  • $< ... > (Capture operator)
  • ~ (Ignore operator)

How to use

This is a simple calculator sample. It shows how to define a grammar, associate semantic actions with the grammar, and handle semantic values.

#include <assert.h>
#include <string>

// (1) Include the header file
#include <peglib.h>

using namespace peglib;
using namespace std;

int main(void) {

    // (2) Make a parser
    auto syntax = R"(
        # Grammar for Calculator...
        Additive  <- Multitive '+' Additive / Multitive
        Multitive <- Primary '*' Multitive / Primary
        Primary   <- '(' Additive ')' / Number
        Number    <- [0-9]+
    )";

    peg parser(syntax);

    // (3) Setup an action
    parser["Additive"] = {
        nullptr,                                                // Default action
        [](const SemanticValues& sv) {
            return sv[0].val.get<int>() + sv[1].val.get<int>(); // 1st choice
        },
        [](const SemanticValues& sv) { return sv[0]; }          // 2nd choice
    };

    parser["Multitive"] = {
        nullptr,                                                // Default action
        [](const SemanticValues& sv) {
            return sv[0].val.get<int>() * sv[1].val.get<int>(); // 1st choice
        },
        [](const SemanticValues& sv) { return sv[0]; }          // 2nd choice
    };

    /* This action is not necessary.
    parser["Primary"] = [](const SemanticValues& sv) {
        return sv[0];
    };
    */

    parser["Number"] = [](const char* s, size_t l) {
        return stoi(string(s, l), nullptr, 10);
    };

    // (4) Parse
    int val;
    parser.parse("(1+2)*3", val);

    assert(val == 9);
}
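To make it concrete what the grammar and its actions compute, here is a minimal hand-written sketch of the same four rules in plain C++. It does not use peglib: `CalcParser` and `evaluate` are hypothetical names introduced for illustration. Because each rule's two choices share a prefix, the sketch factors the prefix out instead of backtracking.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical recursive-descent evaluator mirroring the calculator grammar.
struct CalcParser {
    const std::string& src;
    std::size_t pos;
    bool ok;

    // Number <- [0-9]+
    int number() {
        std::size_t start = pos;
        while (pos < src.size() && src[pos] >= '0' && src[pos] <= '9') ++pos;
        if (pos == start) { ok = false; return 0; }
        return std::stoi(src.substr(start, pos - start));
    }

    // Primary <- '(' Additive ')' / Number
    int primary() {
        if (pos < src.size() && src[pos] == '(') {
            ++pos;
            int v = additive();
            if (ok && pos < src.size() && src[pos] == ')') { ++pos; return v; }
            ok = false;
            return 0;
        }
        return number();
    }

    // Multitive <- Primary '*' Multitive / Primary
    int multitive() {
        int lhs = primary();
        if (ok && pos < src.size() && src[pos] == '*') {
            ++pos;
            return lhs * multitive();
        }
        return lhs;
    }

    // Additive <- Multitive '+' Additive / Multitive
    int additive() {
        int lhs = multitive();
        if (ok && pos < src.size() && src[pos] == '+') {
            ++pos;
            return lhs + additive();
        }
        return lhs;
    }
};

// Returns true and stores the result only if the whole input was consumed.
bool evaluate(const std::string& text, int& out) {
    CalcParser p{text, 0, true};
    int v = p.additive();
    if (!p.ok || p.pos != text.size()) return false;
    out = v;
    return true;
}
```

Each member function plays the role of one rule plus its action: `number` converts the matched digits, while `multitive` and `additive` combine the values of their children, much as the lambdas above combine `sv[0]` and `sv[1]`.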

Here is a complete list of available actions:

[](const SemanticValues& sv, any& dt)
[](const SemanticValues& sv)
[](const char* s, size_t l)
[]()

const SemanticValues& sv contains the semantic values. The SemanticValues structure is defined as follows:

struct SemanticValue {
    peglib::any val;  // Semantic value
    std::string name; // Definition name for the semantic value
    const char* s;    // Token start for the semantic value
    size_t      l;    // Token length for the semantic value
};

struct SemanticValues : protected std::vector<SemanticValue>
{
    const char* s; // Token start
    size_t      l; // Token length
};

The peglib::any class is very similar to boost::any. You can obtain a value by casting it to the actual type. To determine the actual type, check the return type of the child action that produced the semantic value.
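The idea of storing a value of any type and casting it back can be sketched with a small type-erasing wrapper. This is not peglib's implementation: `tiny_any` is a hypothetical stand-in written for illustration, and only its `get<T>()` member is modeled on the `val.get<int>()` calls shown earlier.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <typeinfo>

// Hypothetical stand-in for an any-like class, reduced to the essentials.
class tiny_any {
public:
    tiny_any() = default;

    // Store a copy of the value behind a type-erased base pointer.
    template <typename T>
    tiny_any(const T& value) : held_(std::make_shared<holder<T>>(value)) {}

    bool empty() const { return !held_; }

    // Cast back to the stored type; throws std::bad_cast on a mismatch
    // or when nothing is stored.
    template <typename T>
    const T& get() const {
        auto* p = dynamic_cast<const holder<T>*>(held_.get());
        if (!p) throw std::bad_cast();
        return p->value;
    }

private:
    struct base {
        virtual ~base() = default;
    };

    template <typename T>
    struct holder : base {
        explicit holder(const T& v) : value(v) {}
        T value;
    };

    std::shared_ptr<base> held_;
};
```

The caller has to name the right type in `get<T>()`. That is why an action that reads a child's value needs to know what type the child action returned.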

const char* s, size_t l give a pointer to the matched string and its length. They are the same as sv.s and sv.l.

any& dt is a user data object that you can use for any purpose.

The following example uses the < ... > anchor operator. Each anchor creates a semantic value that holds a const char* to the position of the enclosed match. This is useful for excluding unnecessary characters, such as trailing whitespace, from a token.

auto syntax = R"(
    ROOT  <- _ TOKEN (',' _ TOKEN)*
    TOKEN <- < [a-z0-9]+ > _
    _     <- [ \t\r\n]*
)";

peg pg(syntax);

pg["TOKEN"] = [](const char* s, size_t l) {
    // 'token' doesn't include trailing whitespaces
    auto token = string(s, l);
};

auto ret = pg.parse(" token1, token2 ");
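The effect of the anchor can be sketched by hand: the span passed to the action starts and ends where the anchor opens and closes, while the trailing `_` is consumed outside it. The function below is a hypothetical illustration in plain C++ (no peglib) and, as a simplification, does not report malformed input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical hand-written equivalent of:
//   ROOT  <- _ TOKEN (',' _ TOKEN)*
//   TOKEN <- < [a-z0-9]+ > _
//   _     <- [ \t\r\n]*
std::vector<std::string> anchored_tokens(const std::string& input) {
    std::vector<std::string> tokens;
    const char* p = input.data();
    const char* end = p + input.size();

    auto skip_ws = [&] {
        while (p < end && (*p == ' ' || *p == '\t' || *p == '\r' || *p == '\n')) ++p;
    };
    auto is_token_char = [](char c) {
        return (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9');
    };

    skip_ws();
    for (;;) {
        const char* s = p;                        // anchor opens here
        while (p < end && is_token_char(*p)) ++p;
        std::size_t l = static_cast<std::size_t>(p - s);  // anchor closes here
        if (l == 0) break;
        tokens.emplace_back(s, l);                // what the action would see
        skip_ws();                                // trailing `_`, outside the anchor
        if (p < end && *p == ',') {
            ++p;
            skip_ws();
        } else {
            break;
        }
    }
    return tokens;
}
```

For the input used above, the collected spans contain only the token characters, with no surrounding spaces.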

We can exclude unnecessary semantic values from the list by using the ~ operator.

peglib::peg parser(
    "  ROOT  <-  _ ITEM (',' _ ITEM _)* "
    "  ITEM  <-  ([a-z])+  "
    "  ~_    <-  [ \t]*    "
);

parser["ROOT"] = [&](const SemanticValues& sv) {
    assert(sv.size() == 2); // should be 2 instead of 5.
};

auto ret = parser.parse(" item1, item2 ");

Simple interface

cpp-peglib provides a simple std::regex-like interface for trivial tasks.

peglib::peg_match tries to capture the strings matched inside the $< ... > operator and stores them in a peglib::match object.

peglib::match m;
auto ret = peglib::peg_match(
    R"(
        ROOT      <-  _ ('[' $< TAG_NAME > ']' _)*
        TAG_NAME  <-  (!']' .)+
        _         <-  [ \t]*
    )",
    " [tag1] [tag:2] [tag-3] ",
    m);

assert(ret == true);
assert(m.size() == 4);
assert(m.str(1) == "tag1");
assert(m.str(2) == "tag:2");
assert(m.str(3) == "tag-3");

There are several ways to search for a PEG pattern in a document.

using namespace peglib;

auto syntax = R"(
ROOT <- '[' $< [a-z0-9]+ > ']'
)";

auto s = " [tag1] [tag2] [tag3] ";

// peglib::peg_search
peg pg(syntax);
size_t pos = 0;
auto l = strlen(s);
match m;
while (peg_search(pg, s + pos, l - pos, m)) {
  cout << m.str()  << endl; // entire match
  cout << m.str(1) << endl; // submatch #1
  pos += m.length();
}

// peglib::peg_token_iterator
peg_token_iterator it(syntax, s);
while (it != peg_token_iterator()) {
  cout << it->str()  << endl; // entire match
  cout << it->str(1) << endl; // submatch #1
  ++it;
}

// peglib::peg_token_range
for (auto& m: peg_token_range(syntax, s)) {
  cout << m.str()  << endl; // entire match
  cout << m.str(1) << endl; // submatch #1
}
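For readers who know std::regex, the same search pattern can be written with the standard library for comparison. This sketch uses only std::regex facilities, not peglib, and the function names are hypothetical. The regular expression is an assumed translation of the ROOT rule above.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Loop with std::regex_search, resuming after each match
// (compare with the peg_search loop).
std::vector<std::string> find_tags_search(const std::string& text) {
    static const std::regex re(R"(\[([a-z0-9]+)\])");
    std::vector<std::string> out;
    auto first = text.cbegin();
    std::smatch m;
    while (std::regex_search(first, text.cend(), m, re)) {
        out.push_back(m.str(1));  // submatch #1
        first = m[0].second;      // continue right after the entire match
    }
    return out;
}

// Iterator form (compare with peg_token_iterator and peg_token_range).
std::vector<std::string> find_tags_iter(const std::string& text) {
    static const std::regex re(R"(\[([a-z0-9]+)\])");
    std::vector<std::string> out;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it) {
        out.push_back(it->str(1));
    }
    return out;
}
```

Note that the search loop resumes from the end of the previous match, which accounts for any text skipped before the match started.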

Make a parser with parser operators

Instead of making a parser by parsing PEG syntax text, we can also construct a parser by hand with parser operators. Here is an example:

using namespace peglib;
using namespace std;

vector<string> tags;

Definition ROOT, TAG_NAME, _;
ROOT     <= seq(_, zom(seq(chr('['), TAG_NAME, chr(']'), _)));
TAG_NAME <= oom(seq(npd(chr(']')), dot())), [&](const char* s, size_t l) {
                tags.push_back(string(s, l));
            };
_        <= zom(cls(" \t"));

auto ret = ROOT.parse(" [tag1] [tag:2] [tag-3] ");

The following are available operators:

Operator  Description
seq       Sequence
cho       Prioritized Choice
zom       Zero or More
oom       One or More
opt       Optional
apd       And predicate
npd       Not predicate
lit       Literal string
cls       Character class
chr       Character
dot       Any character
anc       Anchor character
cap       Capture character
usr       User defined parser
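To illustrate how operators like these compose, here is a minimal combinator sketch in plain C++. It borrows the operator names from the list above, but the `Parser` type, the `act` helper, and all implementations are assumptions made for illustration and are not peglib's own code. A parser here is a function that returns the number of characters consumed, or -1 on failure.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical parser type: (start, remaining length) -> consumed, or -1.
using Parser = std::function<long(const char*, std::size_t)>;

Parser chr(char c) {
    return [c](const char* s, std::size_t n) -> long {
        return (n > 0 && *s == c) ? 1 : -1;
    };
}

Parser cls(const std::string& set) {
    return [set](const char* s, std::size_t n) -> long {
        return (n > 0 && set.find(*s) != std::string::npos) ? 1 : -1;
    };
}

Parser dot() {
    return [](const char*, std::size_t n) -> long { return n > 0 ? 1 : -1; };
}

// Not predicate: succeeds without consuming when p fails.
Parser npd(Parser p) {
    return [p](const char* s, std::size_t n) -> long {
        return p(s, n) < 0 ? 0 : -1;
    };
}

Parser seq(std::vector<Parser> ps) {
    return [ps](const char* s, std::size_t n) -> long {
        std::size_t used = 0;
        for (const auto& p : ps) {
            long r = p(s + used, n - used);
            if (r < 0) return -1;
            used += static_cast<std::size_t>(r);
        }
        return static_cast<long>(used);
    };
}

Parser zom(Parser p) {
    return [p](const char* s, std::size_t n) -> long {
        std::size_t used = 0;
        for (;;) {
            long r = p(s + used, n - used);
            if (r <= 0) break;  // stop on failure or on an empty match
            used += static_cast<std::size_t>(r);
        }
        return static_cast<long>(used);
    };
}

Parser oom(Parser p) {
    Parser rest = zom(p);
    return [p, rest](const char* s, std::size_t n) -> long {
        long first = p(s, n);
        if (first < 0) return -1;
        return first + rest(s + first, n - static_cast<std::size_t>(first));
    };
}

// Hypothetical helper: run a callback on the span matched by p.
Parser act(Parser p, std::function<void(const char*, std::size_t)> f) {
    return [p, f](const char* s, std::size_t n) -> long {
        long r = p(s, n);
        if (r >= 0) f(s, static_cast<std::size_t>(r));
        return r;
    };
}

// Same shape as the ROOT / TAG_NAME / _ definitions above.
std::vector<std::string> parse_tags(const std::string& input) {
    std::vector<std::string> tags;
    Parser ws = zom(cls(" \t"));
    Parser tag_name = act(oom(seq({npd(chr(']')), dot()})),
                          [&tags](const char* s, std::size_t l) { tags.emplace_back(s, l); });
    Parser root = seq({ws, zom(seq({chr('['), tag_name, chr(']'), ws}))});
    root(input.data(), input.size());
    return tags;
}
```

One simplification to keep in mind: the callback in this sketch fires as soon as its sub-parser matches, so a grammar that backtracks past a match would still see the side effect.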

Adjust definitions

It's possible to add and override definitions with parser operators.

auto syntax = R"(
    ROOT <- _ 'Hello' _ NAME '!' _
)";

Rules rules = {
    {
        "NAME", usr([](const char* s, size_t l, SemanticValues& sv, any& c) {
            static vector<string> names = { "PEG", "BNF" };
            for (const auto& n: names) {
                if (n.size() <= l && !n.compare(0, n.size(), s, n.size())) {
                    return success(n.size());
                }
            }
            return fail(s);
        })
    },
    {
        "~_", zom(cls(" \t\r\n"))
    }
};

peg g = peg(syntax, rules);

assert(g.parse(" Hello BNF! "));
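The matching logic inside the user-defined parser can be tried on its own. The sketch below extracts it into a standalone function. `match_name` and its return convention (matched length, or -1 on failure) are hypothetical, standing in for the `success` and `fail` calls used above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical standalone version of the NAME matcher: returns the length
// of the first known name that prefixes the input, or -1 if none does.
long match_name(const char* s, std::size_t l) {
    static const std::vector<std::string> names = {"PEG", "BNF"};
    for (const auto& n : names) {
        // Compare the whole name against the first n.size() input characters.
        if (n.size() <= l && n.compare(0, n.size(), s, n.size()) == 0) {
            return static_cast<long>(n.size());
        }
    }
    return -1;
}
```

The length check comes first so that the comparison never reads past the end of the remaining input.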

Sample codes

Tested Compilers

  • Visual Studio 2013
  • Clang 3.5

TODO

  • Linear-time parsing (Packrat parsing)
  • Optimization of grammars
  • Unicode support

Other C++ PEG parser libraries

Thanks to the authors of the libraries that inspired cpp-peglib.

  • Boost Spirit X3 - A set of C++ libraries for parsing and output generation implemented as Domain Specific Embedded Languages (DSEL) using Expression templates and Template Meta-Programming
  • PEGTL - Parsing Expression Grammar Template Library
  • lars::Parser - A header-only linear-time c++ parsing expression grammar (PEG) parser generator supporting left-recursion and grammar ambiguity

License

MIT license (© 2015 Yuji Hirose)