mirror of https://github.com/yhirose/cpp-peglib.git synced 2025-01-22 13:25:30 +00:00

yhirose a3cfd1b8ad Handled UTF-8 codes from 0x80 as valid identifier codes.		2015-08-08 20:30:05 -04:00
example	Changed to use C++ raw string.	2015-08-06 23:09:37 -04:00
grammar	Updated PL/0 grammar file.	2015-07-30 17:28:13 -04:00
language	Code cleanup.	2015-08-06 18:27:38 -04:00
lint	Fixed backtrack problem.	2015-08-05 22:52:08 -04:00
test	Handled UTF-8 codes from 0x80 as valid identifier codes.	2015-08-08 20:30:05 -04:00
tutorial	Updated tutorial.	2015-08-08 19:17:43 -04:00
.gitignore	Updated .gitignore.	2015-07-23 22:19:07 -04:00
CMakeLists.txt	Moved files.	2015-08-05 10:28:07 -04:00
LICENSE	Initial commit	2015-02-07 16:10:11 -05:00
peg.vim	Added line comment syntax highlight.	2015-07-08 10:26:29 -04:00
peglib.h	Handled UTF-8 codes from 0x80 as valid identifier codes.	2015-08-08 20:30:05 -04:00
README.md	Updated README.	2015-08-06 07:56:31 -04:00

README.md

cpp-peglib

C++11 header-only PEG (Parsing Expression Grammars) library.

cpp-peglib aims to make parsing more expressive while staying simple. The library consists of a single header file, so you can start using it right away just by including peglib.h in your project.

The PEG syntax is well described on page 2 in the document. cpp-peglib also supports the following additional syntax for now:

< ... > (Anchor operator)
$< ... > (Capture operator)
$name< ... > (Named capture operator)
~ (Ignore operator)
\x20 (Hex number char)

This library also supports linear-time parsing, known as packrat parsing.
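To illustrate why packrat parsing is linear-time, here is a minimal sketch of the idea: cache the result of every (rule, position) pair so that backtracking never re-evaluates a rule at the same position. It is a hand-written recognizer for the calculator grammar used in this README, not cpp-peglib's implementation; the class name `PackratCalc` and all of its members are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Sentinel end position meaning "no match".
const std::size_t kFail = static_cast<std::size_t>(-1);

// Toy packrat recognizer (hypothetical, not part of cpp-peglib) for:
//   Additive  <- Multitive '+' Additive / Multitive
//   Multitive <- Primary '*' Multitive / Primary
//   Primary   <- '(' Additive ')' / Number
//   Number    <- [0-9]+
// Each rule returns the end position of its match, or kFail.
class PackratCalc {
public:
    explicit PackratCalc(const std::string& input) : in_(input), evals_(0) {}

    // True when the whole input matches Additive.
    bool match() { return additive(0) == in_.size(); }

    // Number of rule bodies actually evaluated (cache misses).
    std::size_t evaluations() const { return evals_; }

private:
    enum Rule { ADDITIVE, MULTITIVE, PRIMARY, NUMBER };

    // Consult the (rule, position) cache before evaluating the rule body.
    template <typename F>
    std::size_t memo(Rule r, std::size_t pos, F body) {
        std::pair<int, std::size_t> key(static_cast<int>(r), pos);
        auto it = cache_.find(key);
        if (it != cache_.end()) return it->second;
        ++evals_;
        std::size_t end = body();
        cache_[key] = end;
        return end;
    }

    std::size_t additive(std::size_t pos) {
        return memo(ADDITIVE, pos, [this, pos]() -> std::size_t {
            // Alternative 1: Multitive '+' Additive
            std::size_t p = multitive(pos);
            if (p != kFail && p < in_.size() && in_[p] == '+') {
                std::size_t q = additive(p + 1);
                if (q != kFail) return q;
            }
            // Alternative 2: Multitive (answered from the cache)
            return multitive(pos);
        });
    }

    std::size_t multitive(std::size_t pos) {
        return memo(MULTITIVE, pos, [this, pos]() -> std::size_t {
            // Alternative 1: Primary '*' Multitive
            std::size_t p = primary(pos);
            if (p != kFail && p < in_.size() && in_[p] == '*') {
                std::size_t q = multitive(p + 1);
                if (q != kFail) return q;
            }
            // Alternative 2: Primary (answered from the cache)
            return primary(pos);
        });
    }

    std::size_t primary(std::size_t pos) {
        return memo(PRIMARY, pos, [this, pos]() -> std::size_t {
            // Alternative 1: '(' Additive ')'
            if (pos < in_.size() && in_[pos] == '(') {
                std::size_t p = additive(pos + 1);
                if (p != kFail && p < in_.size() && in_[p] == ')') return p + 1;
            }
            // Alternative 2: Number
            return number(pos);
        });
    }

    std::size_t number(std::size_t pos) {
        return memo(NUMBER, pos, [this, pos]() -> std::size_t {
            std::size_t p = pos;
            while (p < in_.size() && in_[p] >= '0' && in_[p] <= '9') ++p;
            return p > pos ? p : kFail;
        });
    }

    std::string in_;
    std::map<std::pair<int, std::size_t>, std::size_t> cache_;
    std::size_t evals_;
};
```

Because each of the four rules is evaluated at most once per input position, the number of rule evaluations is bounded by 4 × (input length + 1), at the cost of the memory used by the cache.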

How to use

This is a simple calculator sample. It shows how to define a grammar, associate semantic actions with the grammar, and handle semantic values.

// (1) Include the header file
#include <peglib.h>
#include <assert.h>

using namespace peglib;
using namespace std;

int main(void) {
    // (2) Make a parser
    auto syntax = R"(
        # Grammar for Calculator...
        Additive  <- Multitive '+' Additive / Multitive
        Multitive <- Primary '*' Multitive / Primary
        Primary   <- '(' Additive ')' / Number
        Number    <- [0-9]+
    )";

    peg parser(syntax);

    // (3) Setup an action
    parser["Additive"] = [](const SemanticValues& sv) {
        switch (sv.choice) {
        case 0:  // "Multitive '+' Additive"
            return sv[0].get<int>() + sv[1].get<int>();
        default: // "Multitive"
            return sv[0].get<int>();
        }
    };

    parser["Multitive"] = [](const SemanticValues& sv) {
        switch (sv.choice) {
        case 0:  // "Primary '*' Multitive"
            return sv[0].get<int>() * sv[1].get<int>();
        default: // "Primary"
            return sv[0].get<int>();
        }
    };

    parser["Number"] = [](const SemanticValues& sv) {
        return stoi(sv.str(), nullptr, 10);
    };

    // (4) Parse
    parser.packrat_parsing(); // Enable packrat parsing.

    int val;
    parser.parse("(1+2)*3", val);

    assert(val == 9);
}

The following action signatures are available:

[](const SemanticValues& sv, any& dt)
[](const SemanticValues& sv)

const SemanticValues& sv contains the semantic values. The SemanticValues structure is defined as follows.

struct SemanticValue {
    any         val;  // Semantic value
    const char* name; // Definition name for the semantic value
    const char* s;    // Token start for the semantic value
    size_t      n;    // Token length for the semantic value

    // Cast semantic value
    template <typename T> T& get();
    template <typename T> const T& get() const;

    // Get token
    std::string str() const;
};

struct SemanticValues : protected std::vector<SemanticValue>
{
    const char* s;      // Token start
    size_t      n;      // Token length
    size_t      choice; // Choice number (0 based index)

    // Get token
    std::string str() const;

    // Transform the semantic value vector to another vector
    template <typename T> std::vector<T> transform(size_t beg = 0, size_t end = -1) const;
};

The peglib::any class is very similar to boost::any. You can obtain a value by casting it to the actual type. To determine the actual type, check the return type of the child action that produced the semantic value.
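As a rough illustration of the type-erasure idea behind a boost::any-style class, the sketch below stores a value of any copyable type and hands it back only when asked for the matching type. `ToyAny` and `transform_all` are hypothetical names, not peglib's code, and throwing `std::bad_cast` on a type mismatch is an assumption of this sketch rather than documented library behavior.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <typeinfo>
#include <vector>

// Hypothetical boost::any-like holder (not peglib::any itself).
class ToyAny {
public:
    ToyAny() {}

    // Store a copy of any copyable value.
    template <typename T>
    ToyAny(const T& v) : held_(std::make_shared<Holder<T>>(v)) {}

    bool empty() const { return !held_; }

    // Cast back to the stored type. This sketch throws std::bad_cast
    // when T is not the type that was stored.
    template <typename T>
    const T& get() const {
        const Holder<T>* p = dynamic_cast<const Holder<T>*>(held_.get());
        if (!p) throw std::bad_cast();
        return p->value;
    }

private:
    struct HolderBase {
        virtual ~HolderBase() {}
    };

    template <typename T>
    struct Holder : HolderBase {
        explicit Holder(const T& v) : value(v) {}
        T value;
    };

    std::shared_ptr<HolderBase> held_;
};

// Convert a vector of type-erased values to a typed vector, in the
// spirit of SemanticValues::transform<T>() (assumed semantics).
template <typename T>
std::vector<T> transform_all(const std::vector<ToyAny>& values) {
    std::vector<T> out;
    for (const auto& v : values) {
        out.push_back(v.get<T>());
    }
    return out;
}
```

This is why an action has to know what its child actions return: asking for `int` from a value that was stored as `std::string` cannot succeed.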

const char* s, size_t n give a pointer to the matched string and its length. These are the same as sv.s and sv.n.

any& dt is a user data object that can be used for any purpose.

The following example uses the < ... > operator, the anchor operator. Each anchor operator creates a semantic value that holds a const char* to the anchored position, which is useful for eliminating unnecessary characters.

auto syntax = R"(
    ROOT  <- _ TOKEN (',' _ TOKEN)*
    TOKEN <- < [a-z0-9]+ > _
    _     <- [ \t\r\n]*
)";

peg pg(syntax);

pg["TOKEN"] = [](const SemanticValues& sv) {
    // 'token' doesn't include trailing whitespace
    auto token = sv.str();
};

auto ret = pg.parse(" token1, token2 ");
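To make the effect of the anchor concrete, here is a hand-written sketch of the rule `TOKEN <- < [a-z0-9]+ > _`. It is not how cpp-peglib implements anchors; the `TokenMatch` struct and `match_token` function are hypothetical. The point is that the rule as a whole consumes the trailing whitespace, while the anchored span (pointer plus length) covers only the token characters.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Result of matching TOKEN at the start of the input (hypothetical type).
struct TokenMatch {
    bool        ok;      // did the rule match?
    std::size_t length;  // characters consumed by the whole rule
    const char* s;       // start of the anchored part
    std::size_t n;       // length of the anchored part

    // Text inside the anchor, without trailing whitespace.
    std::string str() const { return std::string(s, n); }
};

// Hand-written equivalent of: TOKEN <- < [a-z0-9]+ > _
// where _ <- [ \t\r\n]*
inline TokenMatch match_token(const char* input, std::size_t len) {
    TokenMatch m = { false, 0, input, 0 };

    // Anchored part: [a-z0-9]+
    std::size_t i = 0;
    while (i < len && ((input[i] >= 'a' && input[i] <= 'z') ||
                       (input[i] >= '0' && input[i] <= '9'))) {
        ++i;
    }
    if (i == 0) return m;  // at least one character is required
    m.n = i;

    // Unanchored part: trailing whitespace is consumed but not reported.
    while (i < len && (input[i] == ' ' || input[i] == '\t' ||
                       input[i] == '\r' || input[i] == '\n')) {
        ++i;
    }
    m.ok = true;
    m.length = i;
    return m;
}
```

For the input `"token1  ,"` the rule consumes eight characters, but the anchored text is just `token1`.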

We can drop unnecessary semantic values from the list by using the ~ operator.

peglib::peg parser(
    "  ROOT  <-  _ ITEM (',' _ ITEM _)*  "
    "  ITEM  <-  ([a-z])+                "
    "  ~_    <-  [ \t]*                  "
);

parser["ROOT"] = [&](const SemanticValues& sv) {
    assert(sv.size() == 2); // should be 2 instead of 5.
};

auto ret = parser.parse(" item1, item2 ");

The following grammar is the same as the one above.

peglib::peg parser(
    "  ROOT  <-  ~_ ITEM (',' ~_ ITEM ~_)*  "
    "  ITEM  <-  ([a-z])+                   "
    "  _     <-  [ \t]*                     "
);

Semantic predicates are supported. To reject a match, throw a peglib::parse_error exception from a semantic action.

peglib::peg parser("NUMBER  <-  [0-9]+");

parser["NUMBER"] = [](const SemanticValues& sv) {
    auto val = stol(sv.str(), nullptr, 10);
    if (val != 100) {
        throw peglib::parse_error("value error!!");
    }
    return val;
};

long val;
auto ret = parser.parse("100", val);
assert(ret == true);
assert(val == 100);

ret = parser.parse("200", val);
assert(ret == false);
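The control flow of an exception-based semantic predicate can be sketched without the library. In the hypothetical code below, `toy_parse_error` stands in for peglib::parse_error and `parse_number` stands in for a parser with the single rule `NUMBER <- [0-9]+`. That the driver catches the exception and reports an ordinary failure is an assumption consistent with `ret == false` in the example above.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for peglib::parse_error.
struct toy_parse_error : std::runtime_error {
    explicit toy_parse_error(const std::string& msg)
        : std::runtime_error(msg) {}
};

// Matches [0-9]+ over the whole input, then runs the action.
// A toy_parse_error thrown by the action becomes a normal match failure,
// and `out` is left untouched.
inline bool parse_number(const std::string& input,
                         const std::function<long(const std::string&)>& action,
                         long& out) {
    if (input.empty()) return false;
    for (char c : input) {
        if (c < '0' || c > '9') return false;  // syntactic failure
    }
    try {
        out = action(input);
        return true;
    } catch (const toy_parse_error&) {
        return false;                          // semantic failure
    }
}

// Action that only accepts the value 100, like the example above.
inline long only_100(const std::string& text) {
    long val = std::stol(text, nullptr, 10);
    if (val != 100) {
        throw toy_parse_error("value error!!");
    }
    return val;
}
```

With this shape, syntactic and semantic rejections look the same to the caller: both make the parse return false.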

Simple interface

cpp-peglib provides a simple std::regex-like interface for trivial tasks.

peglib::peg_match tries to capture the strings matched by the $< ... > operator and stores them in a peglib::match object.

peglib::match m;

auto ret = peglib::peg_match(
    R"(
        ROOT      <-  _ ('[' $< TAG_NAME > ']' _)*
        TAG_NAME  <-  (!']' .)+
        _         <-  [ \t]*
    )",
    " [tag1] [tag:2] [tag-3] ",
    m);

assert(ret == true);
assert(m.size() == 4);
assert(m.str(1) == "tag1");
assert(m.str(2) == "tag:2");
assert(m.str(3) == "tag-3");

It also supports named capture with the $name< ... > operator.

peglib::match m;

auto ret = peglib::peg_match(
    R"(
        ROOT      <-  _ ('[' $test< TAG_NAME > ']' _)*
        TAG_NAME  <-  (!']' .)+
        _         <-  [ \t]*
    )",
    " [tag1] [tag:2] [tag-3] ",
    m);

auto cap = m.named_capture("test");

assert(ret == true);
assert(m.size() == 4);
assert(cap.size() == 3);
assert(m.str(cap[2]) == "tag-3");

There are several ways to search for a PEG pattern in a document.

using namespace peglib;

auto syntax = R"(
    ROOT <- '[' $< [a-z0-9]+ > ']'
)";

auto s = " [tag1] [tag2] [tag3] ";

// peglib::peg_search
peg pg(syntax);
size_t pos = 0;
auto n = strlen(s);
match m;
while (peg_search(pg, s + pos, n - pos, m)) {
    cout << m.str()  << endl; // entire match
    cout << m.str(1) << endl; // submatch #1
    pos += m.length();
}

// peglib::peg_token_iterator
peg_token_iterator it(syntax, s);
while (it != peg_token_iterator()) {
    cout << it->str()  << endl; // entire match
    cout << it->str(1) << endl; // submatch #1
    ++it;
}

// peglib::peg_token_range
for (auto& m: peg_token_range(syntax, s)) {
    cout << m.str()  << endl; // entire match
    cout << m.str(1) << endl; // submatch #1
}
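All three interfaces above find successive, non-overlapping matches in the input. As a library-free sketch of that search loop, the hypothetical functions below hard-code the pattern `'[' $< [a-z0-9]+ > ']'`: try to match at the current position, skip one character on failure, and jump past the match on success. How cpp-peglib implements its search internally is not shown in this README, so treat this only as an illustration of the behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Try to match '[' [a-z0-9]+ ']' exactly at `pos`.
// Returns the match length, or 0 when there is no match at `pos`.
inline std::size_t match_tag_at(const std::string& s, std::size_t pos) {
    if (pos >= s.size() || s[pos] != '[') return 0;
    std::size_t i = pos + 1;
    while (i < s.size() && ((s[i] >= 'a' && s[i] <= 'z') ||
                            (s[i] >= '0' && s[i] <= '9'))) {
        ++i;
    }
    if (i == pos + 1) return 0;                  // empty tag name
    if (i >= s.size() || s[i] != ']') return 0;  // missing ']'
    return i + 1 - pos;
}

// Collect the captured text (submatch #1) of every non-overlapping match.
inline std::vector<std::string> find_tags(const std::string& s) {
    std::vector<std::string> out;
    std::size_t pos = 0;
    while (pos < s.size()) {
        std::size_t len = match_tag_at(s, pos);
        if (len == 0) {
            ++pos;  // no match here, try the next character
            continue;
        }
        out.push_back(s.substr(pos + 1, len - 2));  // text between brackets
        pos += len;                                 // continue after the match
    }
    return out;
}
```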

Make a parser with parser operators

Instead of making a parser by parsing PEG syntax text, we can also construct a parser by hand with parser operators. Here is an example:

using namespace peglib;
using namespace std;

vector<string> tags;

Definition ROOT, TAG_NAME, _;
ROOT     <= seq(_, zom(seq(chr('['), TAG_NAME, chr(']'), _)));
TAG_NAME <= oom(seq(npd(chr(']')), dot())), [&](const SemanticValues& sv) {
                tags.push_back(sv.str());
            };
_        <= zom(cls(" \t"));

auto ret = ROOT.parse(" [tag1] [tag:2] [tag-3] ");

The following are available operators:

Operator	Description
seq	Sequence
cho	Prioritized Choice
zom	Zero or More
oom	One or More
opt	Optional
apd	And predicate
npd	Not predicate
lit	Literal string
cls	Character class
chr	Character
dot	Any character
anc	Anchor character
ign	Ignore semantic value
cap	Capture character
usr	User defined parser
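To give a feel for how operators like these compose, here is a toy re-implementation of a few of them as plain functions. It borrows the operator names from the table but is not cpp-peglib's code: everything lives in a hypothetical `toy` namespace, `seq` and `cho` take exactly two arguments here, `cls` accepts only a list of characters, and there are no semantic values. A parser is modeled as a function that returns the number of characters consumed, or `toy::fail`.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

namespace toy {

// Returned by a parser that does not match.
const std::size_t fail = static_cast<std::size_t>(-1);

// A parser maps (input, length) to the number of characters consumed.
typedef std::function<std::size_t(const char*, std::size_t)> Parser;

// chr: a single specific character.
inline Parser chr(char c) {
    return [c](const char* s, std::size_t n) -> std::size_t {
        return (n > 0 && s[0] == c) ? 1 : fail;
    };
}

// dot: any single character.
inline Parser dot() {
    return [](const char*, std::size_t n) -> std::size_t {
        return n > 0 ? 1 : fail;
    };
}

// cls: any one of the listed characters (no ranges in this sketch).
inline Parser cls(const std::string& chars) {
    return [chars](const char* s, std::size_t n) -> std::size_t {
        return (n > 0 && chars.find(s[0]) != std::string::npos) ? 1 : fail;
    };
}

// seq: a followed by b.
inline Parser seq(Parser a, Parser b) {
    return [a, b](const char* s, std::size_t n) -> std::size_t {
        std::size_t la = a(s, n);
        if (la == fail) return fail;
        std::size_t lb = b(s + la, n - la);
        return lb == fail ? fail : la + lb;
    };
}

// cho: prioritized choice, b is tried only when a fails.
inline Parser cho(Parser a, Parser b) {
    return [a, b](const char* s, std::size_t n) -> std::size_t {
        std::size_t la = a(s, n);
        return la != fail ? la : b(s, n);
    };
}

// zom: zero or more repetitions; never fails.
inline Parser zom(Parser p) {
    return [p](const char* s, std::size_t n) -> std::size_t {
        std::size_t i = 0;
        for (;;) {
            std::size_t l = p(s + i, n - i);
            if (l == fail || l == 0) break;  // stop on failure or empty match
            i += l;
        }
        return i;
    };
}

// oom: one or more repetitions.
inline Parser oom(Parser p) { return seq(p, zom(p)); }

// npd: not predicate; succeeds without consuming when p fails.
inline Parser npd(Parser p) {
    return [p](const char* s, std::size_t n) -> std::size_t {
        return p(s, n) == fail ? 0 : fail;
    };
}

// The tag grammar from the example above, built from binary seq calls.
inline Parser make_root() {
    Parser ws       = zom(cls(" \t"));
    Parser tag_name = oom(seq(npd(chr(']')), dot()));
    Parser item     = seq(chr('['), seq(tag_name, seq(chr(']'), ws)));
    return seq(ws, zom(item));
}

} // namespace toy
```

Calling `toy::make_root()` on `" [tag1] [tag:2] [tag-3] "` consumes the whole string, while an unterminated tag such as `" [tag1"` only consumes the leading space because the `zom` around the item simply stops repeating.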

Adjust definitions

It's possible to add/override definitions.

auto syntax = R"(
    ROOT <- _ 'Hello' _ NAME '!' _
)";

Rules additional_rules = {
    {
        "NAME", usr([](const char* s, size_t n, SemanticValues& sv, any& c) -> size_t {
            static vector<string> names = { "PEG", "BNF" };
            for (const auto& name: names) {
                if (name.size() <= n && !name.compare(0, name.size(), s, name.size())) {
                    return name.size(); // processed length
                }
            }
            return -1; // parse error
        })
    },
    {
        "~_", zom(cls(" \t\r\n"))
    }
};

peg g = peg(syntax, additional_rules);

assert(g.parse(" Hello BNF! "));
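The contract of the user defined parser above is easier to see in isolation: it receives the remaining input as a pointer and length, and returns either the number of characters it processed or -1 (converted to size_t) for a parse error. The standalone function below repeats the name-matching logic without the SemanticValues and any parameters so it can be exercised directly; `match_name` is a hypothetical name used only for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns the length of the name found at the start of [s, s + n),
// or static_cast<std::size_t>(-1) when no known name matches.
inline std::size_t match_name(const char* s, std::size_t n) {
    static const std::vector<std::string> names = { "PEG", "BNF" };
    for (const auto& name : names) {
        // Compare only when enough input remains for the whole name.
        if (name.size() <= n &&
            name.compare(0, name.size(), s, name.size()) == 0) {
            return name.size();  // processed length
        }
    }
    return static_cast<std::size_t>(-1);  // parse error
}
```

Note that the input does not have to end after the name: for `"BNF!"` the function reports 3 processed characters and leaves `!` for the rest of the grammar.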
