cpp-peglib/README.md
2015-02-08 09:43:49 -05:00

cpp-peglib

C++11 header-only PEG (Parsing Expression Grammars) library.

cpp-peglib aims to offer a more expressive parsing experience than common regular expression libraries such as std::regex, while remaining easy to start using.

The PEG syntax that cpp-peglib understands is described on page 2 in the document.

How to use

What if we want to extract only the tag names in brackets from [tag1] [tag2] [tag3] [tag4]...? This is a bit hard to do with std::regex: it doesn't support repeated captures, so we have to write a loop. PEG can handle it pretty easily.
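For comparison, the std::regex loop mentioned above might look like the following minimal sketch. The function name extract_tags_with_regex and the pattern are illustrative choices and are not part of cpp-peglib.

```cpp
#include <regex>
#include <string>
#include <vector>

// Collect the text found between each pair of square brackets.
// Note: unlike a grammar, this loop does not check that the input
// as a whole is well formed; it simply skips whatever does not match.
std::vector<std::string> extract_tags_with_regex(const std::string& text) {
    static const std::regex tag_pattern(R"(\[([^\]]+)\])");
    std::vector<std::string> tags;
    // One regex match yields only one value per capture group,
    // so every bracketed tag has to be visited by an explicit loop.
    for (std::sregex_iterator it(text.begin(), text.end(), tag_pattern), end;
         it != end; ++it) {
        tags.push_back((*it)[1].str());
    }
    return tags;
}
```

Calling extract_tags_with_regex(" [tag1] [tag:2] [tag-3] ") collects the three tag names, but the iteration logic lives outside the pattern itself.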

PEG grammar for this task could be like this:

ROOT      <-  _ ('[' TAG_NAME ']' _)*
TAG_NAME  <-  (!']' .)+
_         <-  [ \t]*

Here is how to parse text with the PEG syntax and retrieve the tag names:

// (1) Include the header file (plus the standard headers used below)
#include <cassert>
#include <string>
#include <vector>
#include "peglib.h"

// (2) Make a parser
auto parser = peglib::make_parser(R"(
    ROOT      <-  _ ('[' TAG_NAME ']' _)*
    TAG_NAME  <-  (!']' .)+
    _         <-  [ \t]*
)");

// (3) Setup an action
std::vector<std::string> tags;
parser["TAG_NAME"] = [&](const char* s, size_t l) {
    tags.push_back(std::string(s, l));
};

// (4) Parse
auto ret = parser.parse(" [tag1] [tag:2] [tag-3] ");

assert(ret     == true);
assert(tags[0] == "tag1");
assert(tags[1] == "tag:2");
assert(tags[2] == "tag-3");

You may have a question regarding '(3) Setup an action'. When the parser recognizes the definition 'TAG_NAME', it calls the action [&](const char* s, size_t l), where s and l are the start and the length of the matched string, so that the user can use the string for something else.
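To make the callback idea concrete, here is a hand-written matcher for the same three rules that invokes an action with a pointer and a length whenever a tag name is recognized. It is only a sketch of the behavior described above, not how cpp-peglib is implemented; parse_tags_by_hand, skip_blanks and TagAction are hypothetical names, and treating success as "the whole input was consumed" is an assumption of this sketch.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Action type: receives a pointer into the input and the match length.
// The slice is NOT null-terminated, hence std::string(s, l) in the caller.
using TagAction = std::function<void(const char* s, std::size_t l)>;

// _ <- [ \t]*
static std::size_t skip_blanks(const std::string& in, std::size_t pos) {
    while (pos < in.size() && (in[pos] == ' ' || in[pos] == '\t')) ++pos;
    return pos;
}

// ROOT     <- _ ('[' TAG_NAME ']' _)*
// TAG_NAME <- (!']' .)+
// Returns true when the whole input matches ROOT (sketch assumption).
bool parse_tags_by_hand(const std::string& in, const TagAction& on_tag_name) {
    std::size_t pos = skip_blanks(in, 0);
    while (pos < in.size() && in[pos] == '[') {
        std::size_t start = pos + 1;
        std::size_t end = start;
        while (end < in.size() && in[end] != ']') ++end;  // (!']' .)+
        if (end == start || end == in.size()) break;      // empty name or missing ']'
        on_tag_name(in.data() + start, end - start);      // TAG_NAME recognized
        pos = skip_blanks(in, end + 1);
    }
    return pos == in.size();
}
```

Passing a lambda that does tags.push_back(std::string(s, l)) reproduces the result of the example above for the input " [tag1] [tag:2] [tag-3] ".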

We can do more with actions. A more complex example is here:

// Calculator example
using namespace peglib;
using namespace std;

auto parser = make_parser(R"(
    # Grammar for Calculator...
    EXPRESSION       <-  TERM (TERM_OPERATOR TERM)*
    TERM             <-  FACTOR (FACTOR_OPERATOR FACTOR)*
    FACTOR           <-  NUMBER / '(' EXPRESSION ')'
    TERM_OPERATOR    <-  [-+]
    FACTOR_OPERATOR  <-  [/*]
    NUMBER           <-  [0-9]+
)");

auto reduce = [](const vector<Any>& v) -> long {
    long ret = v[0].get<long>();
    for (auto i = 1u; i < v.size(); i += 2) {
        auto num = v[i + 1].get<long>();
        switch (v[i].get<char>()) {
            case '+': ret += num; break;
            case '-': ret -= num; break;
            case '*': ret *= num; break;
            case '/': ret /= num; break;
        }
    }
    return ret;
};

parser["EXPRESSION"]      = reduce;
parser["TERM"]            = reduce;
parser["TERM_OPERATOR"]   = [](const char* s, size_t l) { return (char)*s; };
parser["FACTOR_OPERATOR"] = [](const char* s, size_t l) { return (char)*s; };
parser["NUMBER"]          = [](const char* s, size_t l) { return stol(string(s, l), nullptr, 10); };

long val;
auto ret = parser.parse("1+2*3*(4-5+6)/7-8", val);

assert(ret == true);
assert(val == -3);

It may be helpful to keep in mind that the action behavior is similar to the YACC semantic action model ($$, $1, $2, ...).

In this example, the actions return values. These semantic values are pushed up to the parent definition, where they can be referred to in the parent action [](const vector<Any>& v). In other words, when a definition has been accepted, we can find all the semantic values associated with its child definitions in const vector<Any>& v. The values are wrapped in the peglib::Any class, which is like boost::any. We can retrieve a value with the get<T> method, where T is the actual type of the value. If an action returns no value, an undefined Any is pushed up to the parent. Finally, the resulting value of the root definition is received in the out parameter of the parser's parse method; long val is the resulting value in this case.
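The flow of semantic values can be tried without the library. The sketch below uses a small type-erased holder as a stand-in for peglib::Any; SketchAny, is_undefined and reduce_values are hypothetical names, and the real class may behave differently (for example in how it reports a wrong T).

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <typeinfo>
#include <vector>

// Hypothetical stand-in for peglib::Any: stores one value of any copyable type.
class SketchAny {
    struct Base { virtual ~Base() {} };
    template <typename T> struct Holder : Base {
        explicit Holder(const T& v) : value(v) {}
        T value;
    };
    std::shared_ptr<Base> held_;

public:
    SketchAny() {}  // "undefined": what a parent sees when an action returns nothing
    template <typename T>
    SketchAny(const T& v) : held_(std::make_shared<Holder<T>>(v)) {}

    bool is_undefined() const { return !held_; }

    // T must be the exact stored type; this sketch throws std::bad_cast otherwise.
    template <typename T> const T& get() const {
        auto holder = dynamic_cast<const Holder<T>*>(held_.get());
        if (!holder) throw std::bad_cast();
        return holder->value;
    }
};

// Fold child values laid out as: number (operator number)*
long reduce_values(const std::vector<SketchAny>& v) {
    assert(!v.empty());
    long ret = v[0].get<long>();
    for (std::size_t i = 1; i + 1 < v.size(); i += 2) {
        long num = v[i + 1].get<long>();
        switch (v[i].get<char>()) {
            case '+': ret += num; break;
            case '-': ret -= num; break;
            case '*': ret *= num; break;
            case '/': ret /= num; break;
        }
    }
    return ret;
}
```

For the input 1+2*3*(4-5+6)/7-8, the TERM 2*3*(4-5+6)/7 would receive the child values 2, '*', 3, '*', 5, '/', 7 and reduce them to 4 by integer division; EXPRESSION would then receive 1, '+', 4, '-', 8 and produce -3, matching the assertion in the example.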

Here are available user actions:

[](const char* s, size_t l, const std::vector<peglib::Any>& v, const std::vector<std::string>& n)
[](const char* s, size_t l, const std::vector<peglib::Any>& v)
[](const char* s, size_t l)
[](const std::vector<peglib::Any>& v, const std::vector<std::string>& n)
[](const std::vector<peglib::Any>& v)
[]()

const std::vector<std::string>& n holds the names of the child definitions, which can be helpful when we want to check which child definitions were actually matched.
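As a small illustration of using the names, the helper below finds the positions of the children that came from a given definition. It assumes that n[i] is the name of the child that produced v[i], which the text above suggests but does not state outright; children_named is a hypothetical helper, not a library function.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Assumption: n[i] names the child definition that produced v[i].
// Returns the indices of all children whose definition name equals `wanted`.
std::vector<std::size_t> children_named(const std::vector<std::string>& n,
                                        const std::string& wanted) {
    std::vector<std::size_t> indices;
    for (std::size_t i = 0; i < n.size(); ++i) {
        if (n[i] == wanted) indices.push_back(i);
    }
    return indices;
}
```

Inside an action with the (v, n) signature, such indices could be used to pick out only the operator values, for instance, instead of relying on a fixed position.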

Make a parser with parser operators and simple actions

Instead of making a parser by parsing PEG syntax text, we can also construct a parser by hand with parser operators and use the simple action method rather than the semantic action method. Here is an example:

using namespace peglib;
using namespace std;

Definition ROOT, TAG_NAME, _;
ROOT     = seq(_, zom(seq(chr('['), TAG_NAME, chr(']'), _)));
TAG_NAME = oom(seq(npd(chr(']')), any()));
_        = zom(cls(" \t"));

vector<string> tags;
TAG_NAME.match = [&](const char* s, size_t l) {
    tags.push_back(string(s, l));
};

auto ret = ROOT.parse(" [tag1] [tag:2] [tag-3] ");

In fact, the PEG parser generator itself is built with these operators. You can see the code in the make_peg_grammar function in peglib.h.

The following are available operators:

Description         Operator
Sequence            seq
Prioritized Choice  cho
Grouping            grp
Zero or More        zom
One or More         oom
Optional            opt
And predicate       apd
Not predicate       npd
Literal string      lit
Character class     cls
Character           chr
Any character       any
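To give a feel for how such operators compose, here is a self-contained sketch of a few of them built on std::function. It borrows the operator names from the table but is not the code in peglib.h; the Matcher type, the on_match helper, the sketch namespace and the "-1 means failure" convention are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

namespace sketch {

// A matcher returns how many characters it consumed at `s`, or -1 on failure.
using Matcher = std::function<long(const char* s, std::size_t n)>;

Matcher chr(char c) {
    return [c](const char* s, std::size_t n) -> long { return (n > 0 && *s == c) ? 1 : -1; };
}
Matcher any() {
    return [](const char*, std::size_t n) -> long { return n > 0 ? 1 : -1; };
}
Matcher cls(const std::string& chars) {
    return [chars](const char* s, std::size_t n) -> long {
        return (n > 0 && chars.find(*s) != std::string::npos) ? 1 : -1;
    };
}
Matcher seq(std::vector<Matcher> parts) {
    return [parts](const char* s, std::size_t n) -> long {
        long used = 0;
        for (const auto& p : parts) {
            long len = p(s + used, n - static_cast<std::size_t>(used));
            if (len < 0) return -1;  // one part failed, so the sequence fails
            used += len;
        }
        return used;
    };
}
Matcher zom(Matcher p) {
    return [p](const char* s, std::size_t n) -> long {
        long used = 0;
        for (;;) {
            long len = p(s + used, n - static_cast<std::size_t>(used));
            if (len <= 0) break;  // stop on failure, and on empty matches to avoid spinning
            used += len;
        }
        return used;
    };
}
Matcher oom(Matcher p) {
    Matcher rest = zom(p);
    return [p, rest](const char* s, std::size_t n) -> long {
        long first = p(s, n);
        if (first < 0) return -1;
        return first + rest(s + first, n - static_cast<std::size_t>(first));
    };
}
// Not predicate: succeeds without consuming input when p fails.
Matcher npd(Matcher p) {
    return [p](const char* s, std::size_t n) -> long { return p(s, n) < 0 ? 0 : -1; };
}
// Hypothetical helper: run `action` on the matched slice. It fires as soon as p
// matches, even if an enclosing sequence later fails.
Matcher on_match(Matcher p, std::function<void(const char*, std::size_t)> action) {
    return [p, action](const char* s, std::size_t n) -> long {
        long len = p(s, n);
        if (len >= 0) action(s, static_cast<std::size_t>(len));
        return len;
    };
}

// The tag grammar from above, assembled from the sketch operators.
bool parse_tags(const std::string& text, std::vector<std::string>& tags) {
    Matcher ws = zom(cls(" \t"));
    Matcher tag_name = on_match(oom(seq({npd(chr(']')), any()})),
        [&tags](const char* s, std::size_t l) { tags.emplace_back(s, l); });
    Matcher root = seq({ws, zom(seq({chr('['), tag_name, chr(']'), ws}))});
    long used = root(text.data(), text.size());
    return used >= 0 && static_cast<std::size_t>(used) == text.size();
}

}  // namespace sketch
```

Each operator is just a function that returns another matcher, which is why a grammar can be written as nested calls. The sketch leaves out recursive definitions, cho, grp, opt, apd and lit.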

Sample codes

Tested Compilers

  • Visual Studio 2013
  • Clang 3.5

TODO

  • Linear-time parsing (Packrat parsing)
  • Optimization of grammars
  • Unicode support

Other C++ PEG parser libraries

Thanks to the authors of the libraries that inspired cpp-peglib.

  • PEGTL - Parsing Expression Grammar Template Library
  • lars::Parser - A header-only linear-time c++ parsing expression grammar (PEG) parser generator supporting left-recursion and grammar ambiguity

License

MIT license (© 2015 Yuji Hirose)