Performance Improvement (#336)

* Add more tests

* Enhance profiling capabilities and improve Trie performance metrics

* Implement ISpan optimization for ASCII CharacterClass repetition in parsing

* Implement LPeg-style snapshot/rollback for SemanticValues and CaptureScope

Replace push()/pop()/append()/shift_capture_values() with snapshot()
(record sizes) and rollback() (truncate on failure). Operators now write
directly to the parent SemanticValues and truncate on failure, eliminating
child scope allocation and copy-on-success overhead.

Key changes:
- PrioritizedChoice: snapshot/rollback instead of push/pop/append
- Sequence: direct write to parent (no child scope)
- AndPredicate/NotPredicate: always rollback (side-effect isolation)
- CaptureScope: flatten vector<map> to vector<pair> with reverse search
- Remove push()/pop()/append()/shift_capture_values() entirely

Benchmark (A/B, big.sql ~1.2MB): 105.4ms -> 99.2ms (-5.9%)
Small inputs benefit more (TPC-H Q1: -22.7%) where per-rule allocation
overhead is proportionally higher.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add Keyword Guard optimization for identifier parsing

* Implement skip_whitespace function to streamline whitespace handling in parsing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add whitespace skip optimization and selective packrat memoization for improved parsing performance

* Add support for optimized SQL grammar and enable packrat stats collection in benchmarks

* Optimize argument handling and enhance range-based for loops for improved readability and performance

* Rename SQL grammar files for consistency and clarity

* Add first-reference analysis and warning for left-factoring in PEG grammar

* Consolidate to_lower implementations and optimize LiteralString handling for improved parsing performance

* Refactor Ope::Visitor subclasses to inherit from TraversalVisitor for improved code reuse and maintainability

* Update native.wasm binary to latest version

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
yhirose, 2026-03-11 00:02:19 -04:00, committed by GitHub
parent 3b6f47f7fc
commit 60a77c634e
GPG Key ID: B5690EEEBB952194
14 changed files with 2243 additions and 401 deletions


@@ -11,8 +11,8 @@ A [DuckDB blog post](https://duckdb.org/2024/11/22/runtime-extensible-parsers.ht
All test data comes from the [peg-parser-experiments](https://github.com/hannes/peg-parser-experiments) repository:
| File | Description | Size |
| --- | --- | --- |
| `sql.peg` | PEG grammar for SQL (covers TPC-H and TPC-DS) | 3.9 KB |
| `q1.sql` | Single TPC-H query (Q1) | 544 B |
| `all-tpch.sql` | All 22 TPC-H queries | 14 KB |
| `big.sql` | TPC-H + TPC-DS queries repeated 6x | 1.2 MB |
@@ -57,7 +57,7 @@ Measured on Apple M2 Max, macOS, AppleClang 17, `-O3` (Release build), 10 iterat
cpp-peglib is approximately **7-10x slower** than the YACC parser, consistent with the findings reported in the DuckDB article.
| Benchmark | PEG/YACC |
| --- | --- |
| TPC-H Q1 (544 B) | 9.9x slower |
| all TPC-H (14 KB) | 7.8x slower |
| big.sql (1.2 MB) | 7.3x slower |
@@ -67,7 +67,7 @@ cpp-peglib is approximately **7-10x slower** than the YACC parser, consistent
The First-Set optimization precomputes the set of possible first bytes for each `PrioritizedChoice` alternative at grammar compilation time. At parse time, alternatives whose First-Set does not include the current input byte are skipped without attempting them.
| Benchmark | PEG/YACC |
| --- | --- |
| TPC-H Q1 (544 B) | 5.9x slower |
| all TPC-H (14 KB) | 4.6x slower |
| big.sql (1.2 MB) | 4.6x slower |
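The idea can be sketched as follows (illustrative types, not cpp-peglib's actual internals): each alternative carries a 256-entry bitset of possible first bytes, precomputed once, and the choice loop tests the current input byte before attempting a match.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string_view>
#include <vector>

constexpr std::size_t npos = static_cast<std::size_t>(-1); // "no match"

// One alternative of a PrioritizedChoice: a matcher plus the set of
// bytes its match can possibly start with (built at compile time).
struct Alternative {
  std::bitset<256> first_set;
  std::function<std::size_t(std::string_view)> parse; // consumed bytes or npos
};

// At parse time, alternatives whose First-Set excludes the current
// input byte are skipped without attempting them at all.
std::size_t parse_choice(const std::vector<Alternative> &alts,
                         std::string_view sv) {
  for (const auto &alt : alts) {
    if (!sv.empty() &&
        !alt.first_set.test(static_cast<unsigned char>(sv[0])))
      continue; // cannot possibly match here, skip the whole attempt
    auto len = alt.parse(sv);
    if (len != npos) return len;
  }
  return npos;
}
```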
@@ -77,7 +77,7 @@ The First-Set optimization precomputes the set of possible first bytes for each
`Holder::parse_core` previously used 23 `dynamic_cast` calls per rule match to check whether the inner operator is a `TokenBoundary`, `PrioritizedChoice`, or `Dictionary`. These RTTI lookups accounted for ~27% of parse time in profiling. Replacing them with boolean flags (`is_token_boundary`, `is_choice_like`) on the `Ope` base class eliminates the RTTI overhead entirely.
| Benchmark | PEG/YACC |
| --- | --- |
| TPC-H Q1 (544 B) | 4.2x slower |
| all TPC-H (14 KB) | 3.4x slower |
| big.sql (1.2 MB) | 3.4x slower |
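The flag replacement can be sketched like this (an illustrative subset of the class hierarchy, not the real one): each subclass sets its flag once at construction, so the hot path pays one boolean test instead of an RTTI lookup.

```cpp
#include <cassert>

// Before: the hot path did dynamic_cast<TokenBoundary*>, dynamic_cast<
// PrioritizedChoice*>, etc. per rule match. After: flags set once.
struct Ope {
  bool is_token_boundary = false;
  bool is_choice_like = false;
  virtual ~Ope() = default;
};

struct TokenBoundary : Ope {
  TokenBoundary() { is_token_boundary = true; }
};

struct PrioritizedChoice : Ope {
  PrioritizedChoice() { is_choice_like = true; }
};

// Hot path: a plain flag test, no RTTI.
bool needs_token_handling(const Ope &ope) { return ope.is_token_boundary; }
```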
@@ -87,33 +87,160 @@ The First-Set optimization precomputes the set of possible first bytes for each
Left recursion support adds `DetectLeftRecursion` and seed-growing logic at parse time. For non-left-recursive grammars (such as SQL), this adds zero overhead — only a single `bool` check per rule invocation.
| Benchmark | PEG/YACC |
| --- | --- |
| TPC-H Q1 (544 B) | 4.3x slower |
| all TPC-H (14 KB) | 3.7x slower |
| big.sql (1.2 MB) | 3.4x slower |
No regression compared to the previous configuration.
## ISpan Optimization (Repetition + CharacterClass Fusion)
At grammar compilation time, `Repetition` nodes whose child is an ASCII-only `CharacterClass` are detected. At parse time, these use a tight bitset-test loop instead of the full operator dispatch chain (vtable call, `push`/`pop`, `decode_codepoint`, `scope_exit`, etc.). This is equivalent to LPeg's `ISpan` instruction.
A/B comparison (same session, alternating builds):
| Benchmark | Baseline | ISpan | Improvement |
| --- | --- | --- | --- |
| TPC-H Q1 (544 B) | 0.088 ms | 0.077 ms | -12.5% |
| all TPC-H (14 KB) | 1.489 ms | 1.409 ms | -5.4% |
| big.sql (1.2 MB) | 126.0 ms | 114.6 ms | -9.1% |
| Benchmark | PEG/YACC |
| --- | --- |
| TPC-H Q1 (544 B) | 5.1x slower |
| all TPC-H (14 KB) | 3.7x slower |
| big.sql (1.2 MB) | 3.7x slower |
Note: Grammar load time increases slightly (~0.8 ms) due to bitset construction, but this is a one-time cost at grammar compilation.
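The fused matcher reduces to a tight loop over a precomputed bitset, roughly like this sketch (illustrative names, not the actual implementation):

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <string_view>

// Fused `CharacterClass*` matcher (LPeg's ISpan): one bitset test per
// byte instead of a full operator-dispatch round trip per character.
struct Span {
  std::bitset<256> chars; // built once at grammar compilation time

  // Greedily consume every byte in the class. A repetition with
  // min = 0 cannot fail, so this always returns a length.
  std::size_t parse(std::string_view sv) const {
    std::size_t i = 0;
    while (i < sv.size() && chars.test(static_cast<unsigned char>(sv[i]))) i++;
    return i;
  }
};
```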
## Snapshot/Rollback (Phase 2)
Replaced the `push()`/`pop()`/`append()` pattern with LPeg-style snapshot/rollback. Instead of allocating a child `SemanticValues` scope and copying results on success, operators now write directly to the parent and truncate on failure. `CaptureScope` was flattened from `vector<map>` to a flat `vector<pair>` with reverse linear search.
Key changes:
- `PrioritizedChoice`: snapshot before each alternative, rollback on failure, no-op on success
- `Sequence`: direct write to parent (no child scope)
- `Repetition`: snapshot only when `max` is bounded
- `AndPredicate`/`NotPredicate`: always rollback (side-effect isolation)
- `CaptureScope`: flat `vector<pair<string_view, string>>` instead of scoped `vector<map>`
A/B comparison (same session, alternating builds):
| Benchmark | Baseline | Snapshot/Rollback | Improvement |
| --- | --- | --- | --- |
| TPC-H Q1 (544 B) | 0.075 ms | 0.058 ms | -22.7% |
| all TPC-H (14 KB) | 1.286 ms | 1.161 ms | -9.7% |
| big.sql (1.2 MB) | 105.4 ms | 99.2 ms | -5.9% |
| Benchmark | PEG/YACC |
| --- | --- |
| TPC-H Q1 (544 B) | 4.1x slower |
| all TPC-H (14 KB) | 3.2x slower |
| big.sql (1.2 MB) | 3.4x slower |
The improvement is most pronounced on small inputs (Q1: -22.7%) where per-rule allocation overhead dominates, and smaller on large inputs (big.sql: -5.9%) where the grammar structure itself is the bottleneck.
## Keyword Guard (Phase 3)
At grammar compilation time, the pattern `!ReservedKeyword <[a-z_]i[a-z0-9_]i*>` is detected. At parse time, instead of running the full NotPredicate → Holder → PrioritizedChoice chain for each keyword alternative, the fast path scans the identifier using a bitset, then checks the result against a precomputed keyword table. Identifiers whose length falls outside the keyword length range skip the lookup entirely.
Key techniques:
- Bitset-based identifier scanning (same as ISpan)
- Stack buffer for case-folding (heap fallback for identifiers > 64 chars)
- Length-range early-out (`min_keyword_len` / `max_keyword_len`)
- Compound keywords (e.g., `GROUP BY`) fall back to the normal path
A/B comparison (same session, alternating builds):
| Benchmark | Baseline | Keyword Guard | Improvement |
| --- | --- | --- | --- |
| TPC-H Q1 (544 B) | 0.058 ms | 0.055 ms | -5.2% |
| all TPC-H (14 KB) | 1.117 ms | 1.109 ms | -0.7% |
| big.sql (1.2 MB) | 99.2 ms | 92.4 ms | -6.8% |
| Benchmark | PEG/YACC |
| --- | --- |
| TPC-H Q1 (544 B) | 3.7x slower |
| all TPC-H (14 KB) | 3.0x slower |
| big.sql (1.2 MB) | 3.1x slower |
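A minimal sketch of the fast path (names are made up; the guarded pattern is `!ReservedKeyword <[a-z_]i[a-z0-9_]i*>`, which additionally rejects a leading digit): scan the whole identifier in one pass, apply the length-range early-out, and only then consult the case-folded keyword table.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <limits>
#include <set>
#include <string>
#include <string_view>

struct KeywordGuard {
  std::set<std::string> keywords; // stored case-folded
  std::size_t min_len = std::numeric_limits<std::size_t>::max();
  std::size_t max_len = 0;

  void add(std::string kw) {
    for (auto &c : kw) c = (char)std::tolower((unsigned char)c);
    if (kw.size() < min_len) min_len = kw.size();
    if (kw.size() > max_len) max_len = kw.size();
    keywords.insert(std::move(kw));
  }

  // Scan the identifier in one pass, then decide with a single lookup.
  // Returns the identifier length, or 0 if the word is a reserved keyword.
  std::size_t parse_identifier(std::string_view sv) const {
    std::size_t i = 0;
    while (i < sv.size() &&
           (std::isalnum((unsigned char)sv[i]) || sv[i] == '_'))
      i++;
    if (i == 0) return 0;
    // Length-range early-out: cannot be a keyword, skip the lookup.
    if (i < min_len || i > max_len) return i;
    std::string folded(sv.substr(0, i)); // real code uses a stack buffer
    for (auto &c : folded) c = (char)std::tolower((unsigned char)c);
    return keywords.count(folded) ? 0 : i;
  }
};
```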
## Whitespace Skip Optimization
At grammar compilation time, `Sequence` nodes with whitespace operators between elements are detected. At parse time, instead of dispatching through the full operator chain for each whitespace consumption, a fast inline function scans whitespace using a precomputed bitset. This eliminates vtable calls, scope management, and SemanticValues bookkeeping for one of the most frequently invoked operations.
A/B comparison (same session, alternating builds):
| Benchmark | Baseline | Whitespace Skip | Improvement |
| --- | --- | --- | --- |
| big.sql (1.2 MB) | 92.4 ms | 93.0 ms | ~neutral |
| Benchmark | PEG/YACC |
| --- | --- |
| big.sql (1.2 MB) | 3.0x slower |
The improvement was within noise range on big.sql. The optimization primarily benefits grammars with heavy whitespace-separated sequences.
## Selective Packrat Memoization
At grammar compilation time, static analysis identifies which rules actually benefit from packrat memoization. A rule benefits only if it is reachable from 2+ alternatives of the same `PrioritizedChoice` (i.e., backtracking will re-visit it at the same position). Rules that don't benefit use a lightweight bitvector-only re-entry guard instead of the full `std::map`-based cache.
Empirical profiling of the SQL grammar showed that only 2 of 53 rules benefit from packrat (Identifier: 50.3% hit rate, ColumnReference: 45.1%). The remaining 51 rules had 0% hit rate with ~295K wasted map insertions.
A/B comparison (same session, alternating builds):
| Benchmark | Baseline | Selective Packrat | Improvement |
| --- | --- | --- | --- |
| big.sql (1.2 MB) | 93.0 ms | 88.3 ms | -5.1% |
| Benchmark | PEG/YACC |
| --- | --- |
| TPC-H Q1 (544 B) | 3.8x slower |
| all TPC-H (14 KB) | 2.8x slower |
| big.sql (1.2 MB) | 2.8x slower |
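The two per-rule strategies can be sketched like this (illustrative shapes, not the real types): rules the analysis marks as beneficial keep a position-keyed cache; all others keep only a cheap per-position re-entry guard.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <vector>

struct Rule {
  bool use_packrat;              // decided by static analysis
  std::map<std::size_t, bool> cache; // pos -> result (memoized rules only)
  std::vector<char> visiting;        // re-entry guard (all other rules)

  bool parse_at(std::size_t pos, const std::function<bool(std::size_t)> &body) {
    if (use_packrat) {
      auto it = cache.find(pos);
      if (it != cache.end()) return it->second; // cache hit: skip the parse
      bool r = body(pos);
      cache.emplace(pos, r);
      return r;
    }
    // Lightweight path: no map insertions, just guard against re-entry
    // at the same position.
    if (visiting[pos]) return false;
    visiting[pos] = 1;
    bool r = body(pos);
    visiting[pos] = 0;
    return r;
  }
};
```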
## Micro-optimizations (to_lower consolidation, LiteralString move fix)
Consolidated multiple `to_lower` implementations (Trie member function, inline loop, lambda) into a single `peg::to_lower` free function. Pre-computed lowercase literals (`lower_lit_`) for case-insensitive `LiteralString` matching, eliminating per-character `tolower` calls on the literal side during parsing. Also fixed a missing `std::move` in the `LiteralString` rvalue constructor.
| Benchmark | Baseline | After | Improvement |
| --- | --- | --- | --- |
| big.sql (1.2 MB) | 88.3 ms | 82.5 ms | -6.6% |
| Benchmark | PEG/YACC |
| --- | --- |
| TPC-H Q1 (544 B) | 3.5x slower |
| all TPC-H (14 KB) | 2.9x slower |
| big.sql (1.2 MB) | 2.8x slower |
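The precomputed-lowercase idea can be sketched as follows (a simplified stand-in for the real `LiteralString`): `tolower` is applied to the input side only; the literal side was folded once at construction.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>

struct LiteralString {
  std::string lit;
  std::string lower_lit; // folded once at construction (lower_lit_)

  explicit LiteralString(std::string s) : lit(std::move(s)), lower_lit(lit) {
    for (auto &c : lower_lit) c = (char)std::tolower((unsigned char)c);
  }

  // Case-insensitive prefix match: one tolower per input byte, none
  // on the literal side.
  bool match_icase(std::string_view sv) const {
    if (sv.size() < lower_lit.size()) return false;
    for (std::size_t i = 0; i < lower_lit.size(); i++)
      if (std::tolower((unsigned char)sv[i]) != (unsigned char)lower_lit[i])
        return false;
    return true;
  }
};
```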
## Summary (big.sql, ~1.2 MB)
All optimizations measured on Apple M2 Max, macOS, AppleClang 17, `-O3` (Release build).
| Configuration | Median | PEG/YACC |
| --- | --- | --- |
| YACC (libpg_query) | 29.6 ms | 1.0x |
| PEG (no optimizations) | 228.4 ms | 7.4x |
| PEG + Devirt | 190.9 ms | 6.2x |
| PEG + First-Set | 135.8 ms | 4.6x |
| PEG (all optimizations) | 105.1 ms | 3.4x |
| PEG (all opts + LR support) | 130.3 ms | 3.6x |
| PEG + First-Set + Devirt + LR | 107.4 ms | 3.4x |
| PEG (all opts + Snapshot/Rollback) | 99.2 ms | 3.4x |
| PEG (all opts + Keyword Guard) | 92.4 ms | 3.1x |
| PEG (all opts + Selective Packrat) | 88.3 ms | 3.0x |
| PEG (all opts + micro-opts) | 82.5 ms | 2.8x |
```
YACC |████ 29.6 ms (1.0x)
PEG (all opts + micro) |██████████ 82.5 ms (2.8x)
PEG (all opts + Sel. Pack) |███████████ 88.3 ms (3.0x)
PEG (all opts + KW Guard) |████████████ 92.4 ms (3.1x)
PEG (all opts + S/R) |█████████████ 99.2 ms (3.4x)
PEG + First-Set + Devirt |██████████████ 107.4 ms (3.4x)
PEG + First-Set |█████████████████ 135.8 ms (4.6x)
PEG + Devirt |████████████████████████ 190.9 ms (6.2x)
PEG (no optimizations) |█████████████████████████████ 228.4 ms (7.4x)
```
With all optimizations, the gap to YACC is **2.8x** on big.sql — a **2.6x improvement** from the original 7.4x baseline.


@@ -1,7 +1,10 @@
#include <algorithm>
#include <chrono>
#include <cstring>
#include <fstream>
#include <iomanip>
#include <iostream>
#include <map>
#include <numeric>
#include <string>
#include <vector>
@@ -52,6 +55,57 @@ struct BenchResult {
}
};
static void print_bar_chart(const vector<BenchResult> &results) {
// Collect big.sql results for the bar chart
struct Entry {
string name;
double ms;
};
vector<Entry> entries;
for (const auto &r : results) {
if (r.name.find("big.sql") != string::npos) {
entries.push_back({r.name, r.median()});
}
}
if (entries.empty()) return;
// Sort by time (fastest first)
sort(entries.begin(), entries.end(),
[](const Entry &a, const Entry &b) { return a.ms < b.ms; });
double baseline = entries.front().ms; // fastest as 1.0x
double max_ms = entries.back().ms;
const int max_bar_width = 30;
const int name_width = 28;
cout << endl << " Bar chart (big.sql):" << endl << endl;
for (const auto &e : entries) {
double ratio = baseline > 0 ? e.ms / baseline : 0;
int bar_len =
max_ms > 0 ? static_cast<int>(e.ms / max_ms * max_bar_width) : 0;
if (bar_len < 1) bar_len = 1;
// Build bar with Unicode block character U+2588
string bar_str;
for (int i = 0; i < bar_len; i++) {
bar_str += "\xe2\x96\x88";
}
// Pad with spaces to align the numbers
string padding(max_bar_width - bar_len, ' ');
char line[256];
snprintf(line, sizeof(line), "%6.1f ms (%.1fx)", e.ms, ratio);
// Print: name | bar + padding + numbers
cout << " " << left << setw(name_width) << e.name << " |" << bar_str
<< padding << " " << line << endl;
}
}
static void print_result(const BenchResult &r) {
cout << " " << left << setw(30) << r.name << right << " median: " << fixed
<< setprecision(3) << setw(10) << r.median()
@@ -120,7 +174,241 @@ static BenchResult bench_yacc_parse(const string &name, const string &sql_input,
}
#endif
// Profile subcommand: per-rule self-time profiling
struct RuleStats {
string name;
size_t success = 0;
size_t fail = 0;
double self_ns = 0; // exclusive (self) time in nanoseconds
};
struct ProfileData {
vector<RuleStats> rules;
map<string, size_t> index;
vector<chrono::steady_clock::time_point> enter_times;
vector<double> child_ns; // accumulated child time at each stack level
chrono::steady_clock::time_point start;
};
static int run_profile(const string &data_dir, int argc, char *argv[]) {
// Determine which input to profile
string input_name = "big.sql";
if (argc > 0) { input_name = argv[0]; }
string input_file;
if (input_name == "q1") {
input_file = data_dir + "/q1.sql";
} else if (input_name == "tpch") {
input_file = data_dir + "/all-tpch.sql";
} else {
input_file = data_dir + "/big.sql";
}
// Grammar: default or "optimized"
string grammar_file = data_dir + "/sql.peg";
for (int i = 0; i < argc; i++) {
string arg = argv[i];
if (arg == "optimized" || arg == "opt") {
grammar_file = data_dir + "/sql-optimized.peg";
}
}
auto sql_grammar = read_file(grammar_file);
auto sql_input = read_file(input_file);
parser pg(sql_grammar);
if (!pg) {
cerr << "Error: failed to parse SQL grammar" << endl;
return 1;
}
pg.enable_packrat_parsing();
ProfileData *profile_result = nullptr;
pg.enable_trace(
// enter
[](auto &ope, auto, auto, auto &, auto &, auto &, std::any &trace_data) {
auto holder = dynamic_cast<const peg::Holder *>(&ope);
if (!holder) return;
auto &pd = *std::any_cast<ProfileData *>(trace_data);
auto &name = holder->name();
if (pd.index.find(name) == pd.index.end()) {
pd.index[name] = pd.rules.size();
pd.rules.push_back({name, 0, 0, 0});
}
pd.enter_times.push_back(chrono::steady_clock::now());
pd.child_ns.push_back(0);
},
// leave
[](auto &ope, auto, auto, auto &, auto &, auto &, auto len,
std::any &trace_data) {
auto holder = dynamic_cast<const peg::Holder *>(&ope);
if (!holder) return;
auto &pd = *std::any_cast<ProfileData *>(trace_data);
auto now = chrono::steady_clock::now();
auto elapsed =
chrono::duration<double, nano>(now - pd.enter_times.back()).count();
auto child_time = pd.child_ns.back();
auto self_time = elapsed - child_time;
pd.enter_times.pop_back();
pd.child_ns.pop_back();
// Add elapsed to parent's child accumulator
if (!pd.child_ns.empty()) { pd.child_ns.back() += elapsed; }
auto &name = holder->name();
auto idx = pd.index[name];
auto &stat = pd.rules[idx];
stat.self_ns += self_time;
if (len != static_cast<size_t>(-1)) {
stat.success++;
} else {
stat.fail++;
}
},
// start
[&profile_result](auto &trace_data) {
auto pd = new ProfileData{};
pd->start = chrono::steady_clock::now();
trace_data = pd;
profile_result = pd;
},
// end
[](auto & /*trace_data*/) {});
// Enable packrat stats collection
pg["Statements"].collect_packrat_stats = true;
cout << "Profiling parse of " << input_file << " (" << sql_input.size()
<< " bytes)..." << endl;
auto t0 = chrono::steady_clock::now();
pg.parse(sql_input);
auto t1 = chrono::steady_clock::now();
auto total_ms =
chrono::duration_cast<chrono::microseconds>(t1 - t0).count() / 1000.0;
// Output results
auto &pd = *profile_result;
auto &rules = pd.rules;
vector<size_t> order(rules.size());
iota(order.begin(), order.end(), 0);
sort(order.begin(), order.end(),
[&](size_t a, size_t b) { return rules[a].self_ns > rules[b].self_ns; });
size_t total_calls = 0;
double total_self_ns = 0;
for (auto &r : rules) {
total_calls += r.success + r.fail;
total_self_ns += r.self_ns;
}
cout << endl;
cout << "Profile: " << input_file << " (" << sql_input.size() << " bytes)"
<< endl;
cout << "Total time: " << fixed << setprecision(3) << total_ms << " ms"
<< endl;
cout << "Total rule calls: " << total_calls << endl;
cout << endl;
char buf[256];
snprintf(buf, sizeof(buf), "%4s %-30s %10s %6s %10s %10s %6s %8s",
"rank", "rule", "self(ms)", "%", "success", "fail", "fail%",
"avg(ns)");
cout << buf << endl;
cout << string(100, '-') << endl;
size_t rank = 1;
for (auto i : order) {
auto &r = rules[i];
auto total = r.success + r.fail;
if (total == 0) continue;
auto self_ms = r.self_ns / 1e6;
auto pct = r.self_ns / total_self_ns * 100.0;
auto fail_pct = total > 0 ? r.fail * 100.0 / total : 0.0;
auto avg_ns = r.self_ns / total;
snprintf(buf, sizeof(buf),
"%4zu %-30s %10.3f %5.1f%% %10zu %10zu %5.1f%% %8.1f", rank,
r.name.c_str(), self_ms, pct, r.success, r.fail, fail_pct, avg_ns);
cout << buf << endl;
rank++;
}
// Packrat stats
auto &pkstats = pg["Statements"].packrat_stats_;
if (!pkstats.empty()) {
cout << endl;
cout << "Packrat cache stats per rule:" << endl;
snprintf(buf, sizeof(buf), "%4s %-30s %10s %10s %10s %6s", "rank",
"rule", "hits", "misses", "total", "hit%");
cout << buf << endl;
cout << string(80, '-') << endl;
// Build def_id → name map from ProfileData
map<size_t, string> defid_to_name;
for (auto &[name, idx] : pd.index) {
try {
auto &rule = pg[name.c_str()];
defid_to_name[rule.id] = name;
} catch (...) {}
}
struct PkEntry {
string name;
size_t hits, misses;
};
vector<PkEntry> pk_entries;
size_t total_hits = 0, total_misses = 0;
for (size_t i = 0; i < pkstats.size(); i++) {
auto &st = pkstats[i];
if (st.hits + st.misses == 0) continue;
auto it = defid_to_name.find(i);
string name =
it != defid_to_name.end() ? it->second : "id=" + to_string(i);
pk_entries.push_back({name, st.hits, st.misses});
total_hits += st.hits;
total_misses += st.misses;
}
sort(pk_entries.begin(), pk_entries.end(), [](auto &a, auto &b) {
return a.hits + a.misses > b.hits + b.misses;
});
rank = 1;
for (auto &e : pk_entries) {
auto total = e.hits + e.misses;
auto hit_pct = total > 0 ? e.hits * 100.0 / total : 0.0;
snprintf(buf, sizeof(buf), "%4zu %-30s %10zu %10zu %10zu %5.1f%%",
rank, e.name.c_str(), e.hits, e.misses, total, hit_pct);
cout << buf << endl;
rank++;
}
cout << endl;
snprintf(buf, sizeof(buf),
"Total: hits=%zu misses=%zu total=%zu hit%%=%.1f%%", total_hits,
total_misses, total_hits + total_misses,
(total_hits + total_misses) > 0
? total_hits * 100.0 / (total_hits + total_misses)
: 0.0);
cout << buf << endl;
}
delete profile_result;
return 0;
}
int main(int argc, char *argv[]) {
string data_dir = BENCHMARK_DATA_DIR;
// Check for subcommands
if (argc > 1 && strcmp(argv[1], "profile") == 0) {
return run_profile(data_dir, argc - 2, argv + 2);
}
int iterations = 10;
if (argc > 1) {
iterations = atoi(argv[1]);
@@ -130,70 +418,73 @@ int main(int argc, char *argv[]) {
}
}
auto sql_grammar = read_file(data_dir + "/sql.peg");
auto q1_sql = read_file(data_dir + "/q1.sql");
auto tpch_sql = read_file(data_dir + "/all-tpch.sql");
auto big_sql = read_file(data_dir + "/big.sql");
cout << "cpp-peglib SQL benchmark (" << iterations << " iterations)";
#ifdef HAS_PG_QUERY
cout << "(with libpg_query YACC comparison)";
#endif
cout << endl;
cout << string(80, '=') << endl;
vector<BenchResult> results;
int test_num = 1;
// PEG benchmarks
cout << "--- cpp-peglib (PEG) ---" << endl;
cout << "[" << test_num++ << "] PEG: grammar load (" << sql_grammar.size()
<< " bytes)" << endl;
results.push_back(bench_sql_grammar_load(sql_grammar, iterations));
cout << "[" << test_num++ << "] PEG: TPC-H Q1 (" << q1_sql.size() << " bytes)"
<< endl;
results.push_back(
bench_sql_parse("PEG: TPC-H Q1", sql_grammar, q1_sql, iterations));
cout << "[" << test_num++ << "] PEG: all TPC-H (" << tpch_sql.size()
<< " bytes)" << endl;
results.push_back(
bench_sql_parse("PEG: all TPC-H", sql_grammar, tpch_sql, iterations));
cout << "[" << test_num++ << "] PEG: big.sql (" << big_sql.size() << " bytes)"
<< endl;
results.push_back(
bench_sql_parse("PEG: big.sql (~1MB)", sql_grammar, big_sql, iterations));
// Optimized grammar benchmarks
{
auto opt_grammar = read_file(data_dir + "/sql-optimized.peg");
cout << endl << "--- cpp-peglib (PEG, optimized grammar) ---" << endl;
cout << "[" << test_num++ << "] PEG-opt: big.sql (" << big_sql.size()
<< " bytes)" << endl;
results.push_back(bench_sql_parse("PEG-opt: big.sql (~1MB)", opt_grammar,
big_sql, iterations));
}
// YACC benchmarks
#ifdef HAS_PG_QUERY
cout << endl << "--- PostgreSQL YACC (libpg_query) ---" << endl;
cout << "[" << test_num++ << "] YACC: TPC-H Q1 (" << q1_sql.size()
<< " bytes)" << endl;
results.push_back(bench_yacc_parse("YACC: TPC-H Q1", q1_sql, iterations));
cout << "[" << test_num++ << "] YACC: all TPC-H (" << tpch_sql.size()
<< " bytes)" << endl;
results.push_back(bench_yacc_parse("YACC: all TPC-H", tpch_sql, iterations));
cout << "[" << test_num++ << "] YACC: big.sql (" << big_sql.size()
<< " bytes)" << endl;
results.push_back(
bench_yacc_parse("YACC: big.sql (~1MB)", big_sql, iterations));
#endif
cout << endl << string(80, '=') << endl;
cout << "Summary:" << endl;
for (const auto &r : results) {
print_result(r);
@@ -217,5 +508,7 @@ int main(int argc, char *argv[]) {
}
#endif
print_bar_chart(results);
return 0;
}


@@ -0,0 +1,119 @@
Statements <- (SingleStatement (';' SingleStatement )* ';'*)
SingleStatement <- SelectStatement
SelectStatement <- SimpleSelect (SetopClause SimpleSelect)*
SetopClause <- ('UNION'i / 'EXCEPT'i / 'INTERSECT'i) 'ALL'i?
SimpleSelect <- WithClause? SelectClause FromClause? WhereClause? GroupByClause? HavingClause? OrderByClause? LimitClause?
WithStatement <- Identifier 'AS'i SubqueryReference
WithClause <- 'WITH'i List(WithStatement)
SelectClause <- 'SELECT'i ('*' / List(AliasedExpression))
ColumnAliases <- Parens(List(Identifier))
TableReference <-
(SubqueryReference 'AS'i? Identifier ColumnAliases?) /
(Identifier ('AS'i? Identifier)?)
ExplicitJoin <- ('LEFT'i / 'FULL'i)? 'OUTER'i? 'JOIN'i TableReference 'ON'i Expression
FromClause <- 'FROM'i TableReference ((',' TableReference) / ExplicitJoin)*
WhereClause <- 'WHERE'i Expression
GroupByClause <- 'GROUP'i 'BY'i List(Expression)
HavingClause <- 'HAVING'i Expression
SubqueryReference <- Parens(SelectStatement)
OrderByExpression <- Expression ('DESC'i / 'ASC'i)? ('NULLS'i 'FIRST'i / 'LAST'i)?
OrderByClause <- 'ORDER'i 'BY'i List(OrderByExpression)
LimitClause <- 'LIMIT'i NumberLiteral
ReservedKeyword <-
'SELECT'i /
'FROM'i /
'WHERE'i /
'GROUP'i 'BY'i /
'HAVING'i /
'UNION'i /
'ORDER'i 'BY'i /
'WHEN'i /
'JOIN'i /
'ON'i /
'INTERSECT'i # TODO expand on this
PlainIdentifier <- !ReservedKeyword <[a-z_]i[a-z0-9_]i*> # unquoted identifier can't be a top-level keyword
QuotedIdentifier <- '"' [^"]* '"'
Identifier <- QuotedIdentifier / PlainIdentifier
NumberLiteral <- < [+-]?[0-9]*([.][0-9]*)? >
StringLiteral <- '\'' [^\']* '\''
TypeSpecifier <- Identifier (Parens(List(NumberLiteral)))?
# Optimization: Merge ColumnReference, FunctionExpression, and IsNull
# into a single rule to avoid redundant identifier parsing.
# Old: IsNullExpression / FunctionExpression / ColumnReference (3 separate rules)
# New: ColumnOrFuncRef handles all three patterns in one pass.
ColumnOrFuncRef <- (Identifier '.')? Identifier (Parens(List(Expression)))?
ParenthesisExpression <- Parens(Expression)
LiteralExpression <- StringLiteral / NumberLiteral
CastExpression <- 'CAST'i Parens(Expression 'AS'i TypeSpecifier)
ExtractExpression <- 'EXTRACT'i Parens(ColumnReference 'FROM'i Expression)
ColumnReference <- (Identifier '.')? Identifier
CountStarExpression <- 'COUNT'i Parens('*')
SubqueryExpression <- 'NOT'i? 'EXISTS'i? SubqueryReference
CaseExpression <- 'CASE'i ColumnReference? 'WHEN'i Expression 'THEN'i Expression ('ELSE'i Expression)? 'END'i # TODO strict
DateExpression <- 'DATE'i Expression
DistinctExpression <- 'DISTINCT'i Expression
SubstringExpression <- 'SUBSTRING'i Parens(Expression 'FROM'i NumberLiteral 'FOR'i NumberLiteral)
LiteralListExpression <- Parens(List(Expression))
FrameClause <- 'ROWS'i 'BETWEEN'i (('UNBOUNDED'i 'PRECEDING'i)) 'AND'i (('CURRENT'i 'ROW'i))
WindowExpression <- Parens(('PARTITION'i 'BY'i List(Expression))? OrderByClause? FrameClause?)
# Optimization: Removed IsNullExpression, FunctionExpression, ColumnReference
# from SingleExpression. Replaced with ColumnOrFuncRef.
SingleExpression <-
SubqueryExpression /
LiteralListExpression /
ParenthesisExpression /
DateExpression /
DistinctExpression /
SubstringExpression /
CaseExpression /
CountStarExpression /
CastExpression /
ExtractExpression /
WindowExpression /
ColumnOrFuncRef /
LiteralExpression
ArithmeticOperator <- '+' / '-' / '*' / '/'
LikeOperator <- 'NOT'i? 'LIKE'i
InOperator <- 'NOT'i? 'IN'i !'T'i # special handling to not match INTERSECT
BooleanOperator <- ('OR'i !'D'i) / 'AND'i # special handling to not match ORDER BY
ComparisionOperator <- '=' / '<=' / '>=' / '<' / '>'
WindowOperator <- 'OVER'i
BetweenOperator <- 'BETWEEN'i
# Optimization: IS NULL as postfix operator instead of standalone expression
IsNullOperator <- 'IS'i 'NOT'i? 'NULL'i
Operator <-
ArithmeticOperator /
ComparisionOperator /
BooleanOperator /
LikeOperator /
InOperator /
WindowOperator /
BetweenOperator
# Optimization: IS NULL as postfix on each operand
Expression <- SingleExpression IsNullOperator? (Operator SingleExpression IsNullOperator?)*
AliasedExpression <- Expression ('AS'i? Identifier)?
# internal definitions
%whitespace <- [ \t\n\r]*
List(D) <- D (',' D)*
Parens(D) <- '(' D ')'

File diff suppressed because one or more lines are too long

Binary file not shown.


@@ -6,6 +6,7 @@
//
#include <fstream>
#include <iostream>
#include <peglib.h>
#include <sstream>
@@ -111,6 +112,133 @@ int main(int argc, const char **argv) {
if (!parser.load_grammar(syntax.data(), syntax.size())) { return -1; }
{
using namespace peg;
auto &grammar = parser.get_grammar();
const char *source_start = syntax.data();
// Get the first Reference name from an Ope tree
struct GetFirstRef : public Ope::Visitor {
using Ope::Visitor::visit;
string name; // empty if not found
void visit(Reference &ope) override {
if (name.empty() && ope.rule_ && !ope.is_macro_) { name = ope.name_; }
}
void visit(Sequence &ope) override {
if (name.empty() && !ope.opes_.empty()) { ope.opes_[0]->accept(*this); }
}
void visit(Repetition &ope) override {
if (name.empty()) { ope.ope_->accept(*this); }
}
void visit(CaptureScope &ope) override {
if (name.empty()) { ope.ope_->accept(*this); }
}
void visit(Capture &ope) override {
if (name.empty()) { ope.ope_->accept(*this); }
}
void visit(TokenBoundary &ope) override {
if (name.empty()) { ope.ope_->accept(*this); }
}
void visit(Ignore &ope) override {
if (name.empty()) { ope.ope_->accept(*this); }
}
};
// Build first-reference chain: all rules reachable by following
// the first Reference of each definition recursively.
map<string, set<string>> chain_cache;
auto build_chain = [&](const string &start) -> const set<string> & {
auto cached = chain_cache.find(start);
if (cached != chain_cache.end()) { return cached->second; }
auto &chain = chain_cache[start];
string cur = start;
while (!cur.empty() && !chain.count(cur)) {
chain.insert(cur);
auto it = grammar.find(cur);
if (it == grammar.end()) break;
GetFirstRef vis;
it->second.get_core_operator()->accept(vis);
cur = vis.name;
}
return chain;
};
auto warn = [&](const Definition &def, const string &msg) {
auto pos = line_info(source_start, def.s_);
cerr << syntax_path << ":" << pos.first << ":" << pos.second << ": "
<< msg << endl;
};
for (auto &[rule_name, def] : grammar) {
auto core = def.get_core_operator();
auto *choice = dynamic_cast<PrioritizedChoice *>(core.get());
if (!choice) continue;
// Collect first-ref info for each alternative
struct AltInfo {
string direct_ref;
const set<string> *chain = nullptr;
};
vector<AltInfo> alts;
for (auto &ope : choice->opes_) {
GetFirstRef vis;
ope->accept(vis);
AltInfo info;
if (!vis.name.empty()) {
info.direct_ref = vis.name;
info.chain = &build_chain(vis.name);
}
alts.push_back(std::move(info));
}
// Direct common prefix
map<string, vector<size_t>> direct;
for (size_t i = 0; i < alts.size(); i++) {
if (!alts[i].direct_ref.empty()) {
direct[alts[i].direct_ref].push_back(i);
}
}
for (auto &[prefix, idx] : direct) {
if (idx.size() < 2) continue;
warn(def, "'" + rule_name + "' has " + to_string(idx.size()) +
" alternatives starting with '" + prefix +
"'; consider left-factoring.");
}
// Indirect common prefix
map<string, vector<size_t>> indirect;
for (size_t i = 0; i < alts.size(); i++) {
if (!alts[i].chain) continue;
for (auto &ref : *alts[i].chain) {
indirect[ref].push_back(i);
}
}
for (auto &[ref, idx] : indirect) {
if (idx.size() < 2) continue;
if (direct.count(ref) && direct[ref].size() >= 2) continue;
vector<string> names;
for (auto i : idx) {
if (!alts[i].direct_ref.empty()) {
names.push_back(alts[i].direct_ref);
}
}
sort(names.begin(), names.end());
names.erase(unique(names.begin(), names.end()), names.end());
if (names.size() < 2) continue;
string affected;
for (auto &n : names) {
if (!affected.empty()) affected += ", ";
affected += n;
}
warn(def, "'" + rule_name + "' alternatives {" + affected +
"} share indirect prefix '" + ref +
"'; consider left-factoring.");
}
}
}
if (path_list.size() < 2 && !opt_source) { return 0; }
// Check source

peglib.h

File diff suppressed because it is too large

@ -23,6 +23,11 @@ add_executable(peglib-test-main
test_left_recursive.cc
test_first_set.cc
test_integration.cc
test_trace.cc
test_combinators.cc
test_definition_api.cc
test_utf8.cc
test_snapshot.cc
)
target_include_directories(peglib-test-main PRIVATE ..)

test/test_combinators.cc Normal file

@ -0,0 +1,256 @@
#include <gtest/gtest.h>
#include <peglib.h>

#include <cstring> // strlen
#include <limits>  // std::numeric_limits

using namespace peg;
// Helper: Definition::parse returns Result, extract .ret for EXPECT
static bool def_parse(const Definition &def, const char *s) {
return def.parse(s, strlen(s)).ret;
}
// --- opt() ---
TEST(CombinatorTest, Opt_matches_present) {
Definition ROOT;
ROOT <= seq(lit("hello"), opt(lit(" world")));
EXPECT_TRUE(def_parse(ROOT, "hello world"));
}
TEST(CombinatorTest, Opt_matches_absent) {
Definition ROOT;
ROOT <= seq(lit("hello"), opt(lit(" world")));
EXPECT_TRUE(def_parse(ROOT, "hello"));
}
// --- rep() ---
TEST(CombinatorTest, Rep_exact_range) {
Definition ROOT;
ROOT <= rep(chr('a'), 2, 4);
EXPECT_FALSE(def_parse(ROOT, "a"));
EXPECT_TRUE(def_parse(ROOT, "aa"));
EXPECT_TRUE(def_parse(ROOT, "aaa"));
EXPECT_TRUE(def_parse(ROOT, "aaaa"));
}
TEST(CombinatorTest, Rep_min_only) {
Definition ROOT;
ROOT <= rep(chr('x'), 1, std::numeric_limits<size_t>::max());
EXPECT_FALSE(def_parse(ROOT, ""));
EXPECT_TRUE(def_parse(ROOT, "x"));
EXPECT_TRUE(def_parse(ROOT, "xxx"));
}
// --- apd() (And Predicate) ---
TEST(CombinatorTest, Apd_lookahead_success) {
Definition ROOT;
ROOT <= seq(apd(seq(chr('a'), chr('b'))), chr('a'), chr('b'));
EXPECT_TRUE(def_parse(ROOT, "ab"));
}
TEST(CombinatorTest, Apd_lookahead_failure) {
Definition ROOT;
ROOT <= seq(apd(seq(chr('a'), chr('b'))), chr('a'), chr('c'));
EXPECT_FALSE(def_parse(ROOT, "ac"));
}
// --- npd() (Not Predicate) ---
TEST(CombinatorTest, Npd_negative_lookahead_success) {
Definition ROOT;
ROOT <= seq(npd(seq(chr('a'), chr('b'))), chr('a'), chr('c'));
EXPECT_TRUE(def_parse(ROOT, "ac"));
}
TEST(CombinatorTest, Npd_negative_lookahead_failure) {
Definition ROOT;
ROOT <= seq(npd(seq(chr('a'), chr('b'))), chr('a'), chr('b'));
EXPECT_FALSE(def_parse(ROOT, "ab"));
}
// --- dic() (Dictionary) ---
TEST(CombinatorTest, Dic_matches_keyword) {
Definition ROOT;
ROOT <= dic({"apple", "banana", "cherry"}, false);
EXPECT_TRUE(def_parse(ROOT, "apple"));
EXPECT_TRUE(def_parse(ROOT, "banana"));
EXPECT_TRUE(def_parse(ROOT, "cherry"));
EXPECT_FALSE(def_parse(ROOT, "grape"));
}
TEST(CombinatorTest, Dic_case_insensitive) {
Definition ROOT;
ROOT <= dic({"hello", "world"}, true);
EXPECT_TRUE(def_parse(ROOT, "HELLO"));
EXPECT_TRUE(def_parse(ROOT, "World"));
}
// --- lit() / liti() ---
TEST(CombinatorTest, Lit_exact_match) {
Definition ROOT;
ROOT <= lit("hello");
EXPECT_TRUE(def_parse(ROOT, "hello"));
EXPECT_FALSE(def_parse(ROOT, "Hello"));
EXPECT_FALSE(def_parse(ROOT, "world"));
}
TEST(CombinatorTest, Liti_case_insensitive) {
Definition ROOT;
ROOT <= liti("hello");
EXPECT_TRUE(def_parse(ROOT, "hello"));
EXPECT_TRUE(def_parse(ROOT, "HELLO"));
EXPECT_TRUE(def_parse(ROOT, "HeLLo"));
EXPECT_FALSE(def_parse(ROOT, "world"));
}
// --- ncls() (Negated Character Class) ---
TEST(CombinatorTest, Ncls_negated_class_string) {
Definition ROOT;
ROOT <= oom(ncls("abc"));
EXPECT_TRUE(def_parse(ROOT, "xyz"));
EXPECT_FALSE(def_parse(ROOT, "abc"));
EXPECT_FALSE(def_parse(ROOT, "a"));
}
TEST(CombinatorTest, Ncls_negated_class_ranges) {
Definition ROOT;
std::vector<std::pair<char32_t, char32_t>> ranges = {{'0', '9'}};
ROOT <= oom(ncls(ranges));
EXPECT_TRUE(def_parse(ROOT, "abc"));
EXPECT_FALSE(def_parse(ROOT, "123"));
}
// --- csc() / cap() (Capture Scope / Capture) ---
TEST(CombinatorTest, Cap_capture_match) {
Definition ROOT;
std::string captured;
ROOT <= cap(oom(cls("a-z")), [&](const char *s, size_t n, Context &) {
captured = std::string(s, n);
});
EXPECT_TRUE(def_parse(ROOT, "hello"));
EXPECT_EQ(captured, "hello");
}
TEST(CombinatorTest, Csc_capture_scope_with_backreference) {
// Test capture scope with backreference: opening quote must match closing
parser parser(R"(
ROOT <- QUOTED
QUOTED <- $q< ['"'] > [a-z]+ $q
)");
ASSERT_TRUE(parser);
EXPECT_TRUE(parser.parse("'hello'"));
EXPECT_TRUE(parser.parse("\"hello\""));
EXPECT_FALSE(parser.parse("'hello\""));
}
// --- tok() (Token Boundary) ---
TEST(CombinatorTest, Tok_token_boundary) {
Definition ROOT;
ROOT <= seq(tok(oom(cls("a-z"))), tok(oom(cls("a-z"))));
ROOT.whitespaceOpe = zom(cls(" \t"));
std::string val;
ROOT = [&](const SemanticValues &vs) {
val = std::string(vs.token_to_string(0)) + " " +
std::string(vs.token_to_string(1));
};
EXPECT_TRUE(def_parse(ROOT, "hello world"));
EXPECT_EQ(val, "hello world");
}
// --- ign() (Ignore) ---
TEST(CombinatorTest, Ign_ignore_semantic_value) {
Definition ROOT;
ROOT <= seq(lit("hello"), ign(lit(" ")), lit("world"));
EXPECT_TRUE(def_parse(ROOT, "hello world"));
}
// --- bkr() (Back Reference) ---
TEST(CombinatorTest, Bkr_back_reference) {
parser parser(R"(
ROOT <- $tag< WORD > ' ' WORD ' ' $tag
WORD <- [a-z]+
)");
ASSERT_TRUE(parser);
EXPECT_TRUE(parser.parse("hello world hello"));
EXPECT_FALSE(parser.parse("hello world goodbye"));
}
// --- pre() (Precedence Climbing) ---
TEST(CombinatorTest, Pre_precedence_climbing) {
parser parser(R"(
EXPRESSION <- ATOM (BINOP ATOM)* {
precedence
L + -
L * /
}
ATOM <- NUMBER
BINOP <- < '+' / '-' / '*' / '/' >
NUMBER <- < [0-9]+ >
%whitespace <- [ \t]*
)");
ASSERT_TRUE(parser);
parser["EXPRESSION"] = [](const SemanticValues &vs) -> int {
if (vs.size() == 1) { return std::any_cast<int>(vs[0]); }
auto left = std::any_cast<int>(vs[0]);
auto right = std::any_cast<int>(vs[2]);
auto op = std::any_cast<std::string_view>(vs[1]);
if (op == "+") return left + right;
if (op == "-") return left - right;
if (op == "*") return left * right;
if (op == "/") return left / right;
return 0;
};
parser["BINOP"] = [](const SemanticValues &vs) { return vs.token(); };
parser["NUMBER"] = [](const SemanticValues &vs) {
return std::stoi(std::string(vs.token()));
};
int result;
EXPECT_TRUE(parser.parse("2+3*4", result));
EXPECT_EQ(result, 14); // 2 + (3 * 4) = 14
}
// --- rec() (Recovery) ---
TEST(CombinatorTest, Rec_recovery) {
parser parser(R"(
PROGRAM <- STMT+
STMT <- EXPR ';'
EXPR <- 'ok' / %recover([^;]*)
%whitespace <- [ \t\n]*
)");
ASSERT_TRUE(parser);
std::vector<std::string> errors;
parser.set_logger([&](size_t, size_t, const std::string &msg,
const std::string &) { errors.push_back(msg); });
auto result = parser.parse("ok; bad; ok;");
EXPECT_FALSE(result);
EXPECT_FALSE(errors.empty());
}
// --- cut() ---
TEST(CombinatorTest, Cut_operator) {
parser parser(R"(
ROOT <- A / B
A <- 'a' ↑ 'b'
B <- 'a' 'c'
)");
ASSERT_TRUE(parser);
EXPECT_TRUE(parser.parse("ab"));
// The cut commits to A once 'a' matches, so "ac" cannot fall back to B
EXPECT_FALSE(parser.parse("ac"));
}

test/test_definition_api.cc Normal file

@ -0,0 +1,136 @@
#include <gtest/gtest.h>
#include <peglib.h>

#include <functional> // std::function

using namespace peg;
// --- Definition::error_message ---
TEST(DefinitionApiTest, Error_message_custom) {
parser parser(R"(
ROOT <- GREETING
GREETING <- 'hello' ' ' NAME
NAME <- [a-z]+ { error_message "expected a lowercase name" }
)");
ASSERT_TRUE(parser);
std::string error_msg;
parser.set_logger([&](size_t, size_t, const std::string &msg,
const std::string &) { error_msg = msg; });
EXPECT_FALSE(parser.parse("hello 123"));
EXPECT_EQ(error_msg, "expected a lowercase name");
}
TEST(DefinitionApiTest, Error_message_via_definition) {
parser parser(R"(
ROOT <- 'hi ' NAME
NAME <- [a-z]+
)");
ASSERT_TRUE(parser);
parser["NAME"].error_message = "name must be lowercase letters";
std::string error_msg;
parser.set_logger([&](size_t, size_t, const std::string &msg,
const std::string &) { error_msg = msg; });
EXPECT_FALSE(parser.parse("hi 123"));
EXPECT_EQ(error_msg, "name must be lowercase letters");
}
// --- Definition::no_ast_opt ---
TEST(DefinitionApiTest, No_ast_opt_preserves_node) {
// Without no_ast_opt
{
parser pg(R"(
ROOT <- ITEM+
ITEM <- [a-z]+
)");
ASSERT_TRUE(pg);
pg.enable_ast();
std::shared_ptr<Ast> ast;
EXPECT_TRUE(pg.parse("hello", ast));
auto optimized = pg.optimize_ast(ast);
bool found_item = false;
std::function<void(const Ast &)> walk = [&](const Ast &node) {
if (node.name == "ITEM") found_item = true;
for (auto &child : node.nodes) {
walk(*child);
}
};
walk(*optimized);
// May or may not have ITEM depending on optimization
(void)found_item;
}
// With no_ast_opt on ITEM
{
parser pg2(R"(
ROOT <- ITEM+
ITEM <- [a-z]+
)");
ASSERT_TRUE(pg2);
pg2["ITEM"].no_ast_opt = true;
pg2.enable_ast();
std::shared_ptr<Ast> ast2;
EXPECT_TRUE(pg2.parse("hello", ast2));
auto optimized2 = pg2.optimize_ast(ast2);
bool found_item2 = false;
std::function<void(const Ast &)> walk2 = [&](const Ast &node) {
if (node.name == "ITEM") found_item2 = true;
for (auto &child : node.nodes) {
walk2(*child);
}
};
walk2(*optimized2);
EXPECT_TRUE(found_item2);
}
}
// --- Definition::disable_action ---
TEST(DefinitionApiTest, Disable_action_skips_semantic_action) {
parser parser(R"(
ROOT <- NUMBER
NUMBER <- [0-9]+
)");
ASSERT_TRUE(parser);
bool action_called = false;
parser["NUMBER"] = [&](const SemanticValues &) { action_called = true; };
EXPECT_TRUE(parser.parse("123"));
EXPECT_TRUE(action_called);
// Now disable the action
action_called = false;
parser["NUMBER"].disable_action = true;
EXPECT_TRUE(parser.parse("456"));
EXPECT_FALSE(action_called);
}
// --- set_logger (lambda version without rule parameter) ---
TEST(DefinitionApiTest, Set_logger_simple_lambda) {
parser parser(R"(
ROOT <- 'hello'
)");
ASSERT_TRUE(parser);
size_t error_line = 0;
size_t error_col = 0;
std::string error_msg;
parser.set_logger([&](size_t line, size_t col, const std::string &msg) {
error_line = line;
error_col = col;
error_msg = msg;
});
EXPECT_FALSE(parser.parse("world"));
EXPECT_GT(error_line, 0u);
EXPECT_GT(error_col, 0u);
EXPECT_FALSE(error_msg.empty());
}

test/test_snapshot.cc Normal file

@ -0,0 +1,110 @@
#include <gtest/gtest.h>
#include <peglib.h>
using namespace peg;
// =============================================================================
// Phase 2 Snapshot/Rollback Validation Tests
//
// These tests verify behaviors that Phase 2 must preserve:
// - Nested choice rollback of choice_count_/choice_
// - Sequence direct write (append removal)
// - Predicate side-effect isolation
// - CaptureScope isolation
// - Repetition partial rollback
// - Capture rollback on choice failure
// =============================================================================
TEST(SnapshotTest, Nested_choice_rollback) {
parser pg(R"(
S <- A / B
A <- 'x' INNER 'y'
B <- 'x' 'z'
INNER <- 'a' / 'b' / 'c'
)");
EXPECT_TRUE(pg);
// A fails ('y' not found) -> fallback to B -> choice=1
pg["S"] = [](const SemanticValues &vs) {
EXPECT_EQ(2u, vs.choice_count()); // S has 2 alternatives
EXPECT_EQ(1u, vs.choice()); // B (0-indexed)
};
EXPECT_TRUE(pg.parse("xz"));
}
TEST(SnapshotTest, Sequence_direct_write) {
parser pg(R"(
S <- A B C
A <- 'aaa'
B <- 'bbb'
C <- 'ccc'
)");
EXPECT_TRUE(pg);
pg["A"] = [](const SemanticValues & /*vs*/) { return 1; };
pg["B"] = [](const SemanticValues & /*vs*/) { return 2; };
pg["C"] = [](const SemanticValues & /*vs*/) { return 3; };
pg["S"] = [](const SemanticValues &vs) {
EXPECT_EQ(3u, vs.size());
EXPECT_EQ(1, std::any_cast<int>(vs[0]));
EXPECT_EQ(2, std::any_cast<int>(vs[1]));
EXPECT_EQ(3, std::any_cast<int>(vs[2]));
};
EXPECT_TRUE(pg.parse("aaabbbccc"));
}
TEST(SnapshotTest, Predicate_no_side_effect) {
parser pg(R"(
S <- &(AB) CD
AB <- [a-z]+ [0-9]+
CD <- [a-z]+ [0-9]+
)");
EXPECT_TRUE(pg);
pg["AB"] = [](const SemanticValues & /*vs*/) { return std::string("AB"); };
pg["CD"] = [](const SemanticValues & /*vs*/) { return std::string("CD"); };
// &(AB) succeeds but its result must not leak into S's semantic values
pg["S"] = [](const SemanticValues &vs) {
EXPECT_EQ(1u, vs.size()); // Only CD's result
EXPECT_EQ("CD", std::any_cast<std::string>(vs[0]));
};
EXPECT_TRUE(pg.parse("abc123"));
}
TEST(SnapshotTest, CaptureScope_isolation) {
parser pg(R"(
S <- $ref<'hello'> $(ISOLATED) $ref
ISOLATED <- $ref<'world'>
)");
EXPECT_TRUE(pg);
// $ref='world' inside $(ISOLATED) is isolated by CaptureScope
// The final $ref should match 'hello', not 'world'
EXPECT_TRUE(pg.parse("helloworldhello"));
EXPECT_FALSE(pg.parse("helloworldworld"));
}
TEST(SnapshotTest, Repetition_partial_rollback) {
parser pg(R"(
S <- ITEM+
ITEM <- < [a-z]+ > ' '?
)");
EXPECT_TRUE(pg);
pg["ITEM"] = [](const SemanticValues &vs) { return vs.token_to_string(); };
pg["S"] = [](const SemanticValues &vs) {
EXPECT_EQ(3u, vs.size());
EXPECT_EQ("foo", std::any_cast<std::string>(vs[0]));
EXPECT_EQ("bar", std::any_cast<std::string>(vs[1]));
EXPECT_EQ("baz", std::any_cast<std::string>(vs[2]));
};
EXPECT_TRUE(pg.parse("foo bar baz"));
}
TEST(SnapshotTest, Capture_rollback_on_choice_failure) {
parser pg(R"(
S <- A / B
A <- $ref<'xx'> 'FAIL'
B <- $ref<'yy'> $ref
)");
EXPECT_TRUE(pg);
// A: sets ref='xx' -> fails on 'FAIL' -> rollback discards ref='xx'
// B: sets ref='yy' -> $ref expects 'yy'
EXPECT_TRUE(pg.parse("yyyy"));
EXPECT_FALSE(pg.parse("yyxx")); // Would wrongly succeed if ref='xx' leaked
}

test/test_trace.cc Normal file

@ -0,0 +1,267 @@
#include <gtest/gtest.h>
#include <peglib.h>
#include <sstream>
using namespace peg;
TEST(TraceTest, Enable_trace_enter_leave_callbacks) {
parser parser(R"(
ROOT <- 'hello' ' ' 'world'
)");
ASSERT_TRUE(parser);
int enter_count = 0;
int leave_count = 0;
parser.enable_trace([&](auto &, auto, auto, auto &, auto &, auto &,
auto &) { enter_count++; },
[&](auto &, auto, auto, auto &, auto &, auto &, auto,
auto &) { leave_count++; });
EXPECT_TRUE(parser.parse("hello world"));
EXPECT_GT(enter_count, 0);
EXPECT_GT(leave_count, 0);
EXPECT_EQ(enter_count, leave_count);
}
TEST(TraceTest, Enable_trace_with_start_end_callbacks) {
parser parser(R"(
ROOT <- 'a' / 'b'
)");
ASSERT_TRUE(parser);
bool start_called = false;
bool end_called = false;
parser.enable_trace(
[&](auto &, auto, auto, auto &, auto &, auto &, auto &) {},
[&](auto &, auto, auto, auto &, auto &, auto &, auto, auto &) {},
[&](auto &) { start_called = true; }, [&](auto &) { end_called = true; });
EXPECT_TRUE(parser.parse("a"));
EXPECT_TRUE(start_called);
EXPECT_TRUE(end_called);
}
TEST(TraceTest, Trace_data_passing) {
parser parser(R"(
ROOT <- 'test'
)");
ASSERT_TRUE(parser);
// trace_data is initialized by tracer_start, then copied into Context.
// tracer_enter/leave modify Context's copy; tracer_end sees the original.
int enter_count = 0;
bool start_called = false;
bool end_called = false;
parser.enable_trace(
[&](auto &, auto, auto, auto &, auto &, auto &, std::any &trace_data) {
// Verify trace_data was initialized (copied from tracer_start's value)
auto val = std::any_cast<int>(trace_data);
trace_data = val + 1;
enter_count++;
},
[&](auto &, auto, auto, auto &, auto &, auto &, auto, std::any &) {},
[&](std::any &trace_data) {
trace_data = 42;
start_called = true;
},
[&](std::any &trace_data) {
// tracer_end sees the original trace_data (not Context's copy)
EXPECT_EQ(std::any_cast<int>(trace_data), 42);
end_called = true;
});
EXPECT_TRUE(parser.parse("test"));
EXPECT_TRUE(start_called);
EXPECT_TRUE(end_called);
EXPECT_GT(enter_count, 0);
}
TEST(TraceTest, Trace_on_parse_failure) {
parser parser(R"(
ROOT <- 'hello'
)");
ASSERT_TRUE(parser);
int enter_count = 0;
int leave_success = 0;
int leave_fail = 0;
parser.enable_trace(
[&](auto &, auto, auto, auto &, auto &, auto &, auto &) {
enter_count++;
},
[&](auto &, auto, auto, auto &, auto &, auto &, auto len, auto &) {
if (len != static_cast<size_t>(-1)) {
leave_success++;
} else {
leave_fail++;
}
});
EXPECT_FALSE(parser.parse("world"));
EXPECT_GT(enter_count, 0);
EXPECT_GT(leave_fail, 0);
}
TEST(TraceTest, Trace_position_in_callback) {
parser parser(R"(
ROOT <- 'ab' 'cd'
)");
ASSERT_TRUE(parser);
std::vector<size_t> positions;
parser.enable_trace(
[&](auto &, auto s, auto, auto &, auto &c, auto &, auto &) {
positions.push_back(static_cast<size_t>(s - c.s));
},
[&](auto &, auto, auto, auto &, auto &, auto &, auto, auto &) {});
EXPECT_TRUE(parser.parse("abcd"));
EXPECT_FALSE(positions.empty());
// First position should be 0 (start of input)
EXPECT_EQ(positions[0], 0u);
}
TEST(TraceTest, Set_verbose_trace) {
parser parser(R"(
ROOT <- 'hello'
)");
ASSERT_TRUE(parser);
// Just verify it doesn't crash — verbose_trace affects is_traceable
parser.set_verbose_trace(true);
int enter_count = 0;
parser.enable_trace(
[&](auto &, auto, auto, auto &, auto &, auto &, auto &) {
enter_count++;
},
[&](auto &, auto, auto, auto &, auto &, auto &, auto, auto &) {});
EXPECT_TRUE(parser.parse("hello"));
// With verbose trace, more operations should be traced
int verbose_count = enter_count;
enter_count = 0;
parser.set_verbose_trace(false);
EXPECT_TRUE(parser.parse("hello"));
int non_verbose_count = enter_count;
EXPECT_GE(verbose_count, non_verbose_count);
}
TEST(TraceTest, Enable_tracing_helper) {
parser parser(R"(
ROOT <- GREETING
GREETING <- 'hello' ' ' 'world'
)");
ASSERT_TRUE(parser);
std::ostringstream os;
enable_tracing(parser, os);
EXPECT_TRUE(parser.parse("hello world"));
auto output = os.str();
EXPECT_FALSE(output.empty());
// Should contain Enter and Leave markers
EXPECT_NE(output.find("E "), std::string::npos);
EXPECT_NE(output.find("L "), std::string::npos);
}
TEST(TraceTest, Enable_tracing_on_failure) {
parser parser(R"(
ROOT <- 'hello'
)");
ASSERT_TRUE(parser);
std::ostringstream os;
enable_tracing(parser, os);
EXPECT_FALSE(parser.parse("world"));
auto output = os.str();
EXPECT_FALSE(output.empty());
}
TEST(TraceTest, Enable_profiling_helper) {
parser parser(R"(
ROOT <- NUMBER ('+' NUMBER)*
NUMBER <- [0-9]+
)");
ASSERT_TRUE(parser);
std::ostringstream os;
enable_profiling(parser, os);
EXPECT_TRUE(parser.parse("1+2+3"));
auto output = os.str();
EXPECT_FALSE(output.empty());
// Should contain duration info
EXPECT_NE(output.find("duration:"), std::string::npos);
// Should contain rule names
EXPECT_NE(output.find("ROOT"), std::string::npos);
EXPECT_NE(output.find("NUMBER"), std::string::npos);
}
TEST(TraceTest, Enable_profiling_shows_success_fail_counts) {
parser parser(R"(
ROOT <- ITEM+
ITEM <- 'a' / 'b'
)");
ASSERT_TRUE(parser);
std::ostringstream os;
enable_profiling(parser, os);
EXPECT_TRUE(parser.parse("ab"));
auto output = os.str();
// Should contain success/fail summary
EXPECT_NE(output.find("success"), std::string::npos);
EXPECT_NE(output.find("fail"), std::string::npos);
}
TEST(TraceTest, Trace_with_packrat) {
parser parser(R"(
ROOT <- A / B
A <- 'x' 'y'
B <- 'x' 'z'
)");
ASSERT_TRUE(parser);
parser.enable_packrat_parsing();
int enter_count = 0;
parser.enable_trace(
[&](auto &, auto, auto, auto &, auto &, auto &, auto &) {
enter_count++;
},
[&](auto &, auto, auto, auto &, auto &, auto &, auto, auto &) {});
EXPECT_TRUE(parser.parse("xz"));
EXPECT_GT(enter_count, 0);
}
TEST(TraceTest, Trace_with_left_recursion) {
parser parser(R"(
E <- E '+' T / T
T <- [0-9]+
)");
ASSERT_TRUE(parser);
int enter_count = 0;
parser.enable_trace(
[&](auto &, auto, auto, auto &, auto &, auto &, auto &) {
enter_count++;
},
[&](auto &, auto, auto, auto &, auto &, auto &, auto, auto &) {});
EXPECT_TRUE(parser.parse("1+2"));
EXPECT_GT(enter_count, 0);
}

test/test_utf8.cc Normal file

@ -0,0 +1,188 @@
#include <gtest/gtest.h>
#include <peglib.h>
using namespace peg;
// --- codepoint_length ---
TEST(Utf8Test, Codepoint_length_ascii) {
const char *s = "A";
EXPECT_EQ(codepoint_length(s, 1), 1u);
}
TEST(Utf8Test, Codepoint_length_2byte) {
// U+00E9 (é) = 0xC3 0xA9
const char s[] = "\xC3\xA9";
EXPECT_EQ(codepoint_length(s, 2), 2u);
}
TEST(Utf8Test, Codepoint_length_3byte) {
// U+3042 (あ) = 0xE3 0x81 0x82
const char s[] = "\xE3\x81\x82";
EXPECT_EQ(codepoint_length(s, 3), 3u);
}
TEST(Utf8Test, Codepoint_length_4byte) {
// U+1F600 (😀) = 0xF0 0x9F 0x98 0x80
const char s[] = "\xF0\x9F\x98\x80";
EXPECT_EQ(codepoint_length(s, 4), 4u);
}
TEST(Utf8Test, Codepoint_length_empty) {
EXPECT_EQ(codepoint_length("", 0), 0u);
}
TEST(Utf8Test, Codepoint_length_truncated) {
// 3-byte sequence but only 2 bytes available
const char s[] = "\xE3\x81";
EXPECT_EQ(codepoint_length(s, 2), 0u);
}
// --- codepoint_count ---
TEST(Utf8Test, Codepoint_count_ascii) {
const char *s = "hello";
EXPECT_EQ(codepoint_count(s, 5), 5u);
}
TEST(Utf8Test, Codepoint_count_mixed) {
// "aあb" = 'a' + 3-byte + 'b' = 5 bytes, 3 codepoints
const char s[] = "a\xE3\x81\x82"
"b";
EXPECT_EQ(codepoint_count(s, 5), 3u);
}
TEST(Utf8Test, Codepoint_count_empty) { EXPECT_EQ(codepoint_count("", 0), 0u); }
TEST(Utf8Test, Codepoint_count_emoji) {
// "😀😀" = 2 x 4-byte = 8 bytes, 2 codepoints
const char s[] = "\xF0\x9F\x98\x80\xF0\x9F\x98\x80";
EXPECT_EQ(codepoint_count(s, 8), 2u);
}
// --- encode_codepoint ---
TEST(Utf8Test, Encode_codepoint_ascii) {
char buff[4];
auto len = encode_codepoint(U'A', buff);
EXPECT_EQ(len, 1u);
EXPECT_EQ(buff[0], 'A');
}
TEST(Utf8Test, Encode_codepoint_2byte) {
auto s = encode_codepoint(U'\u00E9'); // é
EXPECT_EQ(s.size(), 2u);
EXPECT_EQ(s, "\xC3\xA9");
}
TEST(Utf8Test, Encode_codepoint_3byte) {
auto s = encode_codepoint(U'\u3042'); // あ
EXPECT_EQ(s.size(), 3u);
EXPECT_EQ(s, "\xE3\x81\x82");
}
TEST(Utf8Test, Encode_codepoint_4byte) {
auto s = encode_codepoint(U'\U0001F600'); // 😀
EXPECT_EQ(s.size(), 4u);
EXPECT_EQ(s, "\xF0\x9F\x98\x80");
}
TEST(Utf8Test, Encode_codepoint_surrogate_returns_zero) {
// Surrogates (U+D800-U+DFFF) are invalid
char buff[4];
auto len = encode_codepoint(0xD800, buff);
EXPECT_EQ(len, 0u);
}
TEST(Utf8Test, Encode_codepoint_beyond_unicode_returns_zero) {
char buff[4];
auto len = encode_codepoint(0x110000, buff);
EXPECT_EQ(len, 0u);
}
// --- decode_codepoint ---
TEST(Utf8Test, Decode_codepoint_ascii) {
const char *s = "A";
size_t bytes;
char32_t cp;
EXPECT_TRUE(decode_codepoint(s, 1, bytes, cp));
EXPECT_EQ(bytes, 1u);
EXPECT_EQ(cp, U'A');
}
TEST(Utf8Test, Decode_codepoint_2byte) {
const char s[] = "\xC3\xA9";
size_t bytes;
char32_t cp;
EXPECT_TRUE(decode_codepoint(s, 2, bytes, cp));
EXPECT_EQ(bytes, 2u);
EXPECT_EQ(cp, U'\u00E9');
}
TEST(Utf8Test, Decode_codepoint_3byte) {
const char s[] = "\xE3\x81\x82";
size_t bytes;
char32_t cp;
EXPECT_TRUE(decode_codepoint(s, 3, bytes, cp));
EXPECT_EQ(bytes, 3u);
EXPECT_EQ(cp, U'\u3042');
}
TEST(Utf8Test, Decode_codepoint_4byte) {
const char s[] = "\xF0\x9F\x98\x80";
size_t bytes;
char32_t cp;
EXPECT_TRUE(decode_codepoint(s, 4, bytes, cp));
EXPECT_EQ(bytes, 4u);
EXPECT_EQ(cp, U'\U0001F600');
}
TEST(Utf8Test, Decode_codepoint_empty) {
size_t bytes;
char32_t cp;
EXPECT_FALSE(decode_codepoint("", 0, bytes, cp));
}
TEST(Utf8Test, Decode_codepoint_convenience_with_bytes) {
const char s[] = "\xE3\x81\x82";
char32_t cp;
auto bytes = decode_codepoint(s, 3, cp);
EXPECT_EQ(bytes, 3u);
EXPECT_EQ(cp, U'\u3042');
}
TEST(Utf8Test, Decode_codepoint_convenience_simple) {
const char s[] = "\xE3\x81\x82";
auto cp = decode_codepoint(s, 3);
EXPECT_EQ(cp, U'\u3042');
}
// --- decode (full string) ---
TEST(Utf8Test, Decode_full_string) {
const char s[] = "a\xE3\x81\x82"
"b";
auto u32 = decode(s, 5);
EXPECT_EQ(u32.size(), 3u);
EXPECT_EQ(u32[0], U'a');
EXPECT_EQ(u32[1], U'\u3042');
EXPECT_EQ(u32[2], U'b');
}
TEST(Utf8Test, Decode_empty_string) {
auto u32 = decode("", 0);
EXPECT_EQ(u32.size(), 0u);
}
// --- roundtrip ---
TEST(Utf8Test, Encode_decode_roundtrip) {
std::vector<char32_t> codepoints = {U'A', U'\u00E9', U'\u3042',
U'\U0001F600'};
for (auto cp : codepoints) {
auto encoded = encode_codepoint(cp);
auto decoded = decode_codepoint(encoded.c_str(), encoded.size());
EXPECT_EQ(decoded, cp);
}
}