cpp-peglib/README.md
2015-02-09 17:12:59 -05:00

6.9 KiB

cpp-peglib

C++11 header-only PEG (Parsing Expression Grammars) library.

cpp-peglib tries to provide more expressive parsing experience than common regular expression libraries such as std::regex. It also keeps it in mind that users can easily start using it.

The PEG syntax that cpp-peglib understands is described on page 2 in the document.

How to use

What if we want to extract only tag names in brackets from [tag1] [tag2] [tag3] [tag4]...? It's a bit hard to do it with std::regex. We have to write a loop logic, since it doesn't support Repeated Captures. PEG can handle it pretty easily.

PEG grammar for this task could be like this:

ROOT      <-  _ ('[' TAG_NAME ']' _)*
TAG_NAME  <-  (!']' .)+
_         <-  [ \t]*

Here is how to parse text with the PEG syntax and retreive tag names:

// (1) Include the header file
#include "peglib.h"

// (2) Make a parser
auto parser = peglib::make_parser(R"(
    ROOT      <-  _ ('[' TAG_NAME ']' _)*
    TAG_NAME  <-  (!']' .)+
    _         <-  [ \t]*
)");

// (3) Setup an action
std::vector<std::string> tags;
parser["TAG_NAME"] = [&](const char* s, size_t l) {
    tags.push_back(std::string(s, l));
};

// (4) Parse
auto ret = parser.parse(" [tag1] [tag:2] [tag-3] ");

assert(ret     == true);
assert(tags[0] == "tag1");
assert(tags[1] == "tag:2");
assert(tags[2] == "tag-3");

You may have a question regarding '(3) Setup an action'. When the parser recognizes the definition 'TAG_NAME', it calls back the action [&](const char* s, size_t l) where const char* s, size_t l refers to the matched string, so that the user could use the string for something else.

We can do more with actions. A more complex example is here:

// Calculator example
using namespace peglib;
using namespace std;

auto parser = make_parser(R"(
    # Grammar for Calculator...
    EXPRESSION       <-  TERM (TERM_OPERATOR TERM)*
    TERM             <-  FACTOR (FACTOR_OPERATOR FACTOR)*
    FACTOR           <-  NUMBER / '(' EXPRESSION ')'
    TERM_OPERATOR    <-  [-+]
    FACTOR_OPERATOR  <-  [/*]
    NUMBER           <-  [0-9]+
)");

auto reduce = [](const vector<Any>& v) -> long {
    long ret = v[0].get<long>();
    for (auto i = 1u; i < v.size(); i += 2) {
        auto num = v[i + 1].get<long>();
        switch (v[i].get<char>()) {
            case '+': ret += num; break;
            case '-': ret -= num; break;
            case '*': ret *= num; break;
            case '/': ret /= num; break;
        }
    }
    return ret;
};

parser["EXPRESSION"]      = reduce;
parser["TERM"]            = reduce;
parser["TERM_OPERATOR"]   = [](const char* s, size_t l) { return (char)*s; };
parser["FACTOR_OPERATOR"] = [](const char* s, size_t l) { return (char)*s; };
parser["NUMBER"]          = [](const char* s, size_t l) { return stol(string(s, l), nullptr, 10); };

long val;
auto ret = parser.parse("1+2*3*(4-5+6)/7-8", val);

assert(ret == true);
assert(val == -3);

It may be helpful to keep in mind that the action behavior is similar to the YACC semantic action model (, $1, $2, ...).

In this example, the actions return values. These samentic values will be pushed up to the parent definition which can be referred to in the parent action [](const vector<Any>& v). In other words, when a certain definition has been accepted, we can find all semantic values which are associated with the child definitions in const vector<Any>& v. The values are wrapped by peblib::Any class which is like boost::any. We can retrieve the value by using get<T> method where T is the actual type of the value. If no value is returned in an action, an undefined Any will be pushed up to the parent. Finally, the resulting value of the root definition is received in the out parameter of parse method in the parser. long val is the resulting value in this case.

Here are available user actions:

[](const char* s, size_t l, const std::vector<peglib::Any>& v, const std::vector<std::string>& n)
[](const char* s, size_t l, const std::vector<peglib::Any>& v)
[](const char* s, size_t l)
[](const std::vector<peglib::Any>& v, const std::vector<std::string>& n)
[](const std::vector<peglib::Any>& v)
[]()

const std::vector<std::string>& n holds names of child definitions that could be helpful when we want to check what are the actual child definitions.

Make a parser with parser operators

Instead of makeing a parser by parsing PEG syntax text, we can also construct a parser by hand with parser operators. Here is an example:

using namespace peglib;
using namespace std;

vector<string> tags;

Definition ROOT, TAG_NAME, _;
ROOT     = seq(_, zom(seq(chr('['), TAG_NAME, chr(']'), _)));
TAG_NAME = oom(seq(npd(chr(']')), any())), [&](const char* s, size_t l) { tags.push_back(string(s, l)); };
_        = zom(cls(" \t"));

auto ret = ROOT.parse(" [tag1] [tag:2] [tag-3] ");

It is also possible to specify a string match action with a grp operator. The string match action doesn't affect the resular semantic action behavior.

ROOT     = seq(_, zom(seq(chr('['), grp(TAG_NAME, [&](const char* s, size_t l) { tags.push_back(string(s, l)); }), chr(']'), _)));
TAG_NAME = oom(seq(npd(chr(']')), any()));
_        = zom(cls(" \t"));

In fact, the PEG parser generator is made with the parser operators. You can see the code at make_peg_grammar function in peglib.h.

The following are available operators:

Operator Description
seq Sequence
cho Prioritized Choice
grp Grouping
zom Zero or More
oom One or More
opt Optional
apd And predicate
npd Not predicate
lit Literal string
cls Character class
chr Character
any Any character

Sample codes

Tested Compilers

  • Visual Studio 2013
  • Clang 3.5

TODO

  • Linear-time parsing (Packrat parsing)
  • Optimization of grammars
  • Unicode support

Other C++ PEG parser libraries

Thanks to the authors of the libraries that inspired cpp-peglib.

  • PEGTL - Parsing Expression Grammar Template Library
  • lars::Parser - A header-only linear-time c++ parsing expression grammar (PEG) parser generator supporting left-recursion and grammar ambiguity

License

MIT license (© 2015 Yuji Hirose)