mirror of
https://github.com/yhirose/cpp-peglib.git
synced 2026-03-12 12:48:08 -04:00
Performance Improvement (#336)
* Add more tests
* Enhance profiling capabilities and improve Trie performance metrics
* Implement ISpan optimization for ASCII CharacterClass repetition in parsing
* Implement LPeg-style snapshot/rollback for SemanticValues and CaptureScope

  Replace push()/pop()/append()/shift_capture_values() with snapshot() (record sizes)
  and rollback() (truncate on failure). Operators now write directly to the parent
  SemanticValues and truncate on failure, eliminating child scope allocation and
  copy-on-success overhead.

  Key changes:
  - PrioritizedChoice: snapshot/rollback instead of push/pop/append
  - Sequence: direct write to parent (no child scope)
  - AndPredicate/NotPredicate: always rollback (side-effect isolation)
  - CaptureScope: flatten vector<map> to vector<pair> with reverse search
  - Remove push()/pop()/append()/shift_capture_values() entirely

  Benchmark (A/B, big.sql ~1.2MB): 105.4ms -> 99.2ms (-5.9%)
  Small inputs benefit more (TPC-H Q1: -22.7%) where per-rule allocation overhead
  is proportionally higher.

* Add Keyword Guard optimization for identifier parsing
* Implement skip_whitespace function to streamline whitespace handling in parsing
* Add whitespace skip optimization and selective packrat memoization for improved parsing performance
* Add support for optimized SQL grammar and enable packrat stats collection in benchmarks
* Optimize argument handling and enhance range-based for loops for improved readability and performance
* Rename SQL grammar files for consistency and clarity
* Add first-reference analysis and warning for left-factoring in PEG grammar
* Consolidate to_lower implementations and optimize LiteralString handling for improved parsing performance
* Refactor Ope::Visitor subclasses to inherit from TraversalVisitor for improved code reuse and maintainability
* Update native.wasm binary to latest version

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
parent
3b6f47f7fc
commit
60a77c634e
@ -11,8 +11,8 @@ A [DuckDB blog post](https://duckdb.org/2024/11/22/runtime-extensible-parsers.ht
All test data comes from the [peg-parser-experiments](https://github.com/hannes/peg-parser-experiments) repository:

| File | Description | Size |
| --- | --- | --- |
| `sql.peg` | PEG grammar for SQL (covers TPC-H and TPC-DS) | 3.9 KB |
| `q1.sql` | Single TPC-H query (Q1) | 544 B |
| `all-tpch.sql` | All 22 TPC-H queries | 14 KB |
| `big.sql` | TPC-H + TPC-DS queries repeated 6x | 1.2 MB |
@ -57,7 +57,7 @@ Measured on Apple M2 Max, macOS, AppleClang 17, `-O3` (Release build), 10 iterat
cpp-peglib is approximately **7–10x slower** than the YACC parser, consistent with the findings reported in the DuckDB article.

| Benchmark | PEG/YACC |
| --- | --- |
| TPC-H Q1 (544 B) | 9.9x slower |
| all TPC-H (14 KB) | 7.8x slower |
| big.sql (1.2 MB) | 7.3x slower |
@ -67,7 +67,7 @@ cpp-peglib is approximately **7–10x slower** than the YACC parser, consistent
The First-Set optimization precomputes the set of possible first bytes for each `PrioritizedChoice` alternative at grammar compilation time. At parse time, alternatives whose First-Set does not include the current input byte are skipped without attempting them.

| Benchmark | PEG/YACC |
| --- | --- |
| TPC-H Q1 (544 B) | 5.9x slower |
| all TPC-H (14 KB) | 4.6x slower |
| big.sql (1.2 MB) | 4.6x slower |
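The idea can be sketched as follows. This is an illustrative reduction, not cpp-peglib's actual API: `Alternative` and `parse_choice` are hypothetical names, and the real implementation attaches the sets to `PrioritizedChoice` nodes.

```cpp
#include <bitset>
#include <cassert>
#include <functional>
#include <string_view>
#include <vector>

// Each alternative carries a 256-bit set of bytes it can possibly start with,
// computed once at grammar compilation time.
struct Alternative {
  std::bitset<256> first_set;
  std::function<bool(std::string_view)> parse;
};

// Try alternatives in order, but skip any whose First-Set excludes the
// current input byte -- that alternative cannot possibly match.
static bool parse_choice(const std::vector<Alternative> &alts,
                         std::string_view input) {
  if (input.empty()) return false;
  auto b = static_cast<unsigned char>(input[0]);
  for (const auto &alt : alts) {
    if (!alt.first_set.test(b)) continue; // reject without attempting
    if (alt.parse(input)) return true;
  }
  return false;
}
```

The win comes from turning N failed parse attempts into N bit tests before the one alternative that can actually match is tried.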
@ -77,7 +77,7 @@ The First-Set optimization precomputes the set of possible first bytes for each
`Holder::parse_core` previously used 2–3 `dynamic_cast` calls per rule match to check whether the inner operator is a `TokenBoundary`, `PrioritizedChoice`, or `Dictionary`. These RTTI lookups accounted for ~27% of parse time in profiling. Replacing them with boolean flags (`is_token_boundary`, `is_choice_like`) on the `Ope` base class eliminates the RTTI overhead entirely.

| Benchmark | PEG/YACC |
| --- | --- |
| TPC-H Q1 (544 B) | 4.2x slower |
| all TPC-H (14 KB) | 3.4x slower |
| big.sql (1.2 MB) | 3.4x slower |
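A minimal sketch of the flag approach (simplified class shapes; the real `Ope` hierarchy in cpp-peglib is larger):

```cpp
#include <cassert>

// Flags set once at construction replace per-match dynamic_cast checks.
struct Ope {
  bool is_token_boundary = false;
  bool is_choice_like = false;
  virtual ~Ope() = default;
};

struct TokenBoundary : Ope {
  TokenBoundary() { is_token_boundary = true; }
};

struct PrioritizedChoice : Ope {
  PrioritizedChoice() { is_choice_like = true; }
};

// Before: dynamic_cast<const TokenBoundary *>(&ope) per rule match (an RTTI
// walk of the class hierarchy). After: a plain bool load that the optimizer
// can hoist and the branch predictor handles well.
static bool needs_token_handling(const Ope &ope) {
  return ope.is_token_boundary;
}
```

The trade-off is one extra byte per flag on every `Ope`, which is negligible against the per-match RTTI cost.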
@ -87,33 +87,160 @@ The First-Set optimization precomputes the set of possible first bytes for each
Left recursion support adds `DetectLeftRecursion` and seed-growing logic at parse time. For non-left-recursive grammars (such as SQL), this adds zero overhead — only a single `bool` check per rule invocation.

| Benchmark | PEG/YACC |
| --- | --- |
| TPC-H Q1 (544 B) | 4.3x slower |
| all TPC-H (14 KB) | 3.7x slower |
| big.sql (1.2 MB) | 3.4x slower |

No regression compared to the previous configuration.
## ISpan Optimization (Repetition + CharacterClass Fusion)
At grammar compilation time, `Repetition` nodes whose child is an ASCII-only `CharacterClass` are detected. At parse time, these use a tight bitset-test loop instead of the full operator dispatch chain (vtable call, `push`/`pop`, `decode_codepoint`, `scope_exit`, etc.). This is equivalent to LPeg's `ISpan` instruction.

A/B comparison (same session, alternating builds):

| Benchmark | Baseline | ISpan | Improvement |
| --- | --- | --- | --- |
| TPC-H Q1 (544 B) | 0.088 ms | 0.077 ms | -12.5% |
| all TPC-H (14 KB) | 1.489 ms | 1.409 ms | -5.4% |
| big.sql (1.2 MB) | 126.0 ms | 114.6 ms | -9.1% |

| Benchmark | PEG/YACC |
| --- | --- |
| TPC-H Q1 (544 B) | 5.1x slower |
| all TPC-H (14 KB) | 3.7x slower |
| big.sql (1.2 MB) | 3.7x slower |

Note: Grammar load time increases slightly (~0.8 ms) due to bitset construction, but this is a one-time cost at grammar compilation.
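The fast path boils down to a single loop over a precomputed bitset. A hypothetical sketch (the function name `span` mirrors LPeg's instruction; the real code lives inside `Repetition::parse_core`):

```cpp
#include <bitset>
#include <cassert>
#include <string_view>

// Consume as many characters as possible that belong to an ASCII-only
// character class, with no vtable dispatch or per-character scope management.
// Returns the number of characters consumed starting at pos.
static size_t span(std::string_view s, size_t pos,
                   const std::bitset<256> &klass) {
  size_t i = pos;
  while (i < s.size() && klass.test(static_cast<unsigned char>(s[i]))) {
    ++i;
  }
  return i - pos;
}
```

Because repetitions like `[a-z0-9_]*` dominate identifier and whitespace scanning, this single loop replaces what would otherwise be one full operator-dispatch round trip per character.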
## Snapshot/Rollback (Phase 2)
Replaced the `push()`/`pop()`/`append()` pattern with LPeg-style snapshot/rollback. Instead of allocating a child `SemanticValues` scope and copying results on success, operators now write directly to the parent and truncate on failure. `CaptureScope` was flattened from `vector<map>` to a flat `vector<pair>` with reverse linear search.

Key changes:

- `PrioritizedChoice`: snapshot before each alternative, rollback on failure, no-op on success
- `Sequence`: direct write to parent (no child scope)
- `Repetition`: snapshot only when `max` is bounded
- `AndPredicate`/`NotPredicate`: always rollback (side-effect isolation)
- `CaptureScope`: flat `vector<pair<string_view, string>>` instead of scoped `vector<map>`

A/B comparison (same session, alternating builds):

| Benchmark | Baseline | Snapshot/Rollback | Improvement |
| --- | --- | --- | --- |
| TPC-H Q1 (544 B) | 0.075 ms | 0.058 ms | -22.7% |
| all TPC-H (14 KB) | 1.286 ms | 1.161 ms | -9.7% |
| big.sql (1.2 MB) | 105.4 ms | 99.2 ms | -5.9% |

| Benchmark | PEG/YACC |
| --- | --- |
| TPC-H Q1 (544 B) | 4.1x slower |
| all TPC-H (14 KB) | 3.2x slower |
| big.sql (1.2 MB) | 3.4x slower |

The improvement is most pronounced on small inputs (Q1: -22.7%) where per-rule allocation overhead dominates, and smaller on large inputs (big.sql: -5.9%) where the grammar structure itself is the bottleneck.
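The mechanism can be sketched in a few lines. This is a simplified model, not the library's real `SemanticValues` class: a snapshot is just the recorded vector size, and rollback is a truncation.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Operators append directly to the parent value stack and remember the old
// size; on failure they truncate back to it. No child scope is allocated and
// nothing is copied on success.
struct SemanticValues {
  std::vector<std::string> vals;

  size_t snapshot() const { return vals.size(); }   // record size
  void rollback(size_t snap) { vals.resize(snap); } // truncate on failure
};

// One PrioritizedChoice alternative: try it, undo side effects on failure.
template <typename F>
static bool try_alternative(SemanticValues &sv, F &&alt) {
  auto snap = sv.snapshot();
  if (alt(sv)) return true; // success: values stay in place, zero copies
  sv.rollback(snap);        // failure: as if the alternative never ran
  return false;
}
```

Success is the common case in a well-factored grammar, so making success free and pushing all cost onto the failure path is the right trade.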
## Keyword Guard (Phase 3)
At grammar compilation time, the pattern `!ReservedKeyword <[a-z_]i[a-z0-9_]i*>` is detected. At parse time, instead of running the full NotPredicate → Holder → PrioritizedChoice chain for each keyword alternative, the fast path scans the identifier using a bitset, then checks the result against a precomputed keyword table. Identifiers whose length falls outside the keyword length range skip the lookup entirely.

Key techniques:

- Bitset-based identifier scanning (same as ISpan)
- Stack buffer for case-folding (heap fallback for identifiers > 64 chars)
- Length-range early-out (`min_keyword_len` / `max_keyword_len`)
- Compound keywords (e.g., `GROUP BY`) fall back to the normal path

A/B comparison (same session, alternating builds):

| Benchmark | Baseline | Keyword Guard | Improvement |
| --- | --- | --- | --- |
| TPC-H Q1 (544 B) | 0.058 ms | 0.055 ms | -5.2% |
| all TPC-H (14 KB) | 1.117 ms | 1.109 ms | -0.7% |
| big.sql (1.2 MB) | 99.2 ms | 92.4 ms | -6.8% |

| Benchmark | PEG/YACC |
| --- | --- |
| TPC-H Q1 (544 B) | 3.7x slower |
| all TPC-H (14 KB) | 3.0x slower |
| big.sql (1.2 MB) | 3.1x slower |
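A simplified model of the guard (illustrative names; the real implementation uses a stack buffer for folding and is wired into the compiled grammar, and it distinguishes "no identifier" from "matched a keyword" rather than overloading the return value as this sketch does):

```cpp
#include <algorithm>
#include <bitset>
#include <cassert>
#include <cctype>
#include <cstdint>
#include <string>
#include <string_view>
#include <unordered_set>

struct KeywordGuard {
  std::bitset<256> ident_chars;             // bytes allowed in identifiers
  std::unordered_set<std::string> keywords; // stored lowercase
  size_t min_len = SIZE_MAX, max_len = 0;

  void add_keyword(std::string kw) {
    std::transform(kw.begin(), kw.end(), kw.begin(),
                   [](unsigned char c) { return std::tolower(c); });
    min_len = std::min(min_len, kw.size());
    max_len = std::max(max_len, kw.size());
    keywords.insert(std::move(kw));
  }

  // Scan an identifier with the bitset; return its length, or 0 if the
  // scanned word turns out to be a reserved keyword.
  size_t match_identifier(std::string_view s) const {
    size_t n = 0;
    while (n < s.size() && ident_chars.test(static_cast<unsigned char>(s[n])))
      ++n;
    if (n == 0) return 0;
    // Length-range early-out: cannot be a keyword, skip the table lookup.
    if (n < min_len || n > max_len) return n;
    std::string folded(s.substr(0, n));
    std::transform(folded.begin(), folded.end(), folded.begin(),
                   [](unsigned char c) { return std::tolower(c); });
    return keywords.count(folded) ? 0 : n;
  }
};
```

One bitset scan plus at most one hash lookup replaces a choice over every keyword literal.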
## Whitespace Skip Optimization
At grammar compilation time, `Sequence` nodes with whitespace operators between elements are detected. At parse time, instead of dispatching through the full operator chain for each whitespace consumption, a fast inline function scans whitespace using a precomputed bitset. This eliminates vtable calls, scope management, and SemanticValues bookkeeping for one of the most frequently invoked operations.

A/B comparison (same session, alternating builds):

| Benchmark | Baseline | Whitespace Skip | Improvement |
| --- | --- | --- | --- |
| big.sql (1.2 MB) | 92.4 ms | 93.0 ms | ~neutral |

| Benchmark | PEG/YACC |
| --- | --- |
| big.sql (1.2 MB) | 3.0x slower |

The improvement was within noise range on big.sql. The optimization primarily benefits grammars with heavy whitespace-separated sequences.
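The inline function reduces to the same span pattern as ISpan, specialized to the `%whitespace` class. A minimal sketch (assuming the SQL grammar's `[ \t\n\r]*` whitespace rule; the real bitset is built from whatever `%whitespace` the grammar declares):

```cpp
#include <bitset>
#include <cassert>
#include <string_view>

// Skip whitespace with one precomputed bitset and a tight loop, instead of a
// full operator dispatch per skip. Returns the position after the run.
static size_t skip_whitespace(std::string_view s, size_t pos) {
  static const std::bitset<256> ws = [] {
    std::bitset<256> b;
    b.set(' ');
    b.set('\t');
    b.set('\n');
    b.set('\r');
    return b;
  }();
  while (pos < s.size() && ws.test(static_cast<unsigned char>(s[pos]))) ++pos;
  return pos;
}
```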
## Selective Packrat Memoization
At grammar compilation time, static analysis identifies which rules actually benefit from packrat memoization. A rule benefits only if it is reachable from 2+ alternatives of the same `PrioritizedChoice` (i.e., backtracking will re-visit it at the same position). Rules that don't benefit use a lightweight bitvector-only re-entry guard instead of the full `std::map`-based cache.

Empirical profiling of the SQL grammar showed that only 2 of 53 rules benefit from packrat (Identifier: 50.3% hit rate, ColumnReference: 45.1%). The remaining 51 rules had 0% hit rate with ~295K wasted map insertions.

A/B comparison (same session, alternating builds):

| Benchmark | Baseline | Selective Packrat | Improvement |
| --- | --- | --- | --- |
| big.sql (1.2 MB) | 93.0 ms | 88.3 ms | -5.1% |

| Benchmark | PEG/YACC |
| --- | --- |
| TPC-H Q1 (544 B) | 3.8x slower |
| all TPC-H (14 KB) | 2.8x slower |
| big.sql (1.2 MB) | 2.8x slower |
## Micro-optimizations (to_lower consolidation, LiteralString move fix)
Consolidated multiple `to_lower` implementations (Trie member function, inline loop, lambda) into a single `peg::to_lower` free function. Pre-computed lowercase literals (`lower_lit_`) for case-insensitive `LiteralString` matching, eliminating per-character `tolower` calls on the literal side during parsing. Also fixed a missing `std::move` in the `LiteralString` rvalue constructor.

| Benchmark | Baseline | After | Improvement |
| --- | --- | --- | --- |
| big.sql (1.2 MB) | 88.3 ms | 82.5 ms | -6.6% |

| Benchmark | PEG/YACC |
| --- | --- |
| TPC-H Q1 (544 B) | 3.5x slower |
| all TPC-H (14 KB) | 2.9x slower |
| big.sql (1.2 MB) | 2.8x slower |
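The precomputed-literal idea looks roughly like this (a simplified stand-in for the library's `LiteralString`; the member names echo the description above but the class here is illustrative):

```cpp
#include <cassert>
#include <string>
#include <string_view>

// One shared ASCII to_lower, replacing the scattered implementations.
static char to_lower(char c) {
  return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}

struct LiteralString {
  std::string lit;
  std::string lower_lit; // folded once at grammar-compile time

  explicit LiteralString(std::string s) : lit(std::move(s)) {
    lower_lit.reserve(lit.size());
    for (char c : lit) lower_lit += to_lower(c);
  }

  // Case-insensitive prefix match: only the input side is folded per
  // character; the literal side was folded once up front.
  bool imatch(std::string_view input) const {
    if (input.size() < lower_lit.size()) return false;
    for (size_t i = 0; i < lower_lit.size(); i++) {
      if (to_lower(input[i]) != lower_lit[i]) return false;
    }
    return true;
  }
};
```

For a grammar full of case-insensitive keywords (`'SELECT'i`, `'FROM'i`, ...) this halves the folding work on every literal comparison.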
## Summary (big.sql, ~1.2 MB)
All optimizations measured on Apple M2 Max, macOS, AppleClang 17, `-O3` (Release build).

| Configuration | Median | PEG/YACC |
| --- | --- | --- |
| YACC (libpg_query) | 29.6 ms | 1.0x |
| PEG (no optimizations) | 228.4 ms | 7.4x |
| PEG + Devirt | 190.9 ms | 6.2x |
| PEG + First-Set | 135.8 ms | 4.6x |
| PEG + First-Set + Devirt + LR | 107.4 ms | 3.4x |
| PEG (all opts + Snapshot/Rollback) | 99.2 ms | 3.4x |
| PEG (all opts + Keyword Guard) | 92.4 ms | 3.1x |
| PEG (all opts + Selective Packrat) | 88.3 ms | 3.0x |
| PEG (all opts + micro-opts) | 82.5 ms | 2.8x |

```
YACC                        |████ 29.6 ms (1.0x)
PEG (all opts + micro)      |██████████ 82.5 ms (2.8x)
PEG (all opts + Sel. Pack)  |███████████ 88.3 ms (3.0x)
PEG (all opts + KW Guard)   |████████████ 92.4 ms (3.1x)
PEG (all opts + S/R)        |█████████████ 99.2 ms (3.4x)
PEG + First-Set + Devirt    |██████████████ 107.4 ms (3.4x)
PEG + First-Set             |█████████████████ 135.8 ms (4.6x)
PEG + Devirt                |████████████████████████ 190.9 ms (6.2x)
PEG (no optimizations)      |█████████████████████████████ 228.4 ms (7.4x)
```

With all optimizations, the gap to YACC is **2.8x** on big.sql — a **2.6x improvement** from the original 7.4x baseline.
@ -1,7 +1,10 @@
#include <algorithm>
#include <chrono>
#include <cstring>
#include <fstream>
#include <iomanip>
#include <iostream>
#include <map>
#include <numeric>
#include <string>
#include <vector>
@ -52,6 +55,57 @@ struct BenchResult {
  }
};

static void print_bar_chart(const vector<BenchResult> &results) {
  // Collect big.sql results for the bar chart
  struct Entry {
    string name;
    double ms;
  };
  vector<Entry> entries;

  for (const auto &r : results) {
    if (r.name.find("big.sql") != string::npos) {
      entries.push_back({r.name, r.median()});
    }
  }

  if (entries.empty()) return;

  // Sort by time (fastest first)
  sort(entries.begin(), entries.end(),
       [](const Entry &a, const Entry &b) { return a.ms < b.ms; });

  double baseline = entries.front().ms; // fastest as 1.0x
  double max_ms = entries.back().ms;

  const int max_bar_width = 30;
  const int name_width = 28;

  cout << endl << " Bar chart (big.sql):" << endl << endl;

  for (const auto &e : entries) {
    double ratio = baseline > 0 ? e.ms / baseline : 0;
    int bar_len =
        max_ms > 0 ? static_cast<int>(e.ms / max_ms * max_bar_width) : 0;
    if (bar_len < 1) bar_len = 1;

    // Build bar with Unicode block character U+2588
    string bar_str;
    for (int i = 0; i < bar_len; i++) {
      bar_str += "\xe2\x96\x88";
    }
    // Pad with spaces to align the numbers
    string padding(max_bar_width - bar_len, ' ');

    char line[256];
    snprintf(line, sizeof(line), "%6.1f ms (%.1fx)", e.ms, ratio);

    // Print: name | bar + padding + numbers
    cout << " " << left << setw(name_width) << e.name << " |" << bar_str
         << padding << " " << line << endl;
  }
}

static void print_result(const BenchResult &r) {
  cout << " " << left << setw(30) << r.name << right << " median: " << fixed
       << setprecision(3) << setw(10) << r.median()
@ -120,7 +174,241 @@ static BenchResult bench_yacc_parse(const string &name, const string &sql_input,
  }
#endif

// Profile subcommand: per-rule self-time profiling
struct RuleStats {
  string name;
  size_t success = 0;
  size_t fail = 0;
  double self_ns = 0; // exclusive (self) time in nanoseconds
};

struct ProfileData {
  vector<RuleStats> rules;
  map<string, size_t> index;
  vector<chrono::steady_clock::time_point> enter_times;
  vector<double> child_ns; // accumulated child time at each stack level
  chrono::steady_clock::time_point start;
};

static int run_profile(const string &data_dir, int argc, char *argv[]) {
  // Determine which input to profile
  string input_name = "big.sql";
  if (argc > 0) { input_name = argv[0]; }

  string input_file;
  if (input_name == "q1") {
    input_file = data_dir + "/q1.sql";
  } else if (input_name == "tpch") {
    input_file = data_dir + "/all-tpch.sql";
  } else {
    input_file = data_dir + "/big.sql";
  }

  // Grammar: default or "optimized"
  string grammar_file = data_dir + "/sql.peg";
  for (int i = 0; i < argc; i++) {
    string arg = argv[i];
    if (arg == "optimized" || arg == "opt") {
      grammar_file = data_dir + "/sql-optimized.peg";
    }
  }
  auto sql_grammar = read_file(grammar_file);
  auto sql_input = read_file(input_file);

  parser pg(sql_grammar);
  if (!pg) {
    cerr << "Error: failed to parse SQL grammar" << endl;
    return 1;
  }
  pg.enable_packrat_parsing();

  ProfileData *profile_result = nullptr;

  pg.enable_trace(
      // enter
      [](auto &ope, auto, auto, auto &, auto &, auto &, std::any &trace_data) {
        auto holder = dynamic_cast<const peg::Holder *>(&ope);
        if (!holder) return;

        auto &pd = *std::any_cast<ProfileData *>(trace_data);
        auto &name = holder->name();
        if (pd.index.find(name) == pd.index.end()) {
          pd.index[name] = pd.rules.size();
          pd.rules.push_back({name, 0, 0, 0});
        }

        pd.enter_times.push_back(chrono::steady_clock::now());
        pd.child_ns.push_back(0);
      },
      // leave
      [](auto &ope, auto, auto, auto &, auto &, auto &, auto len,
         std::any &trace_data) {
        auto holder = dynamic_cast<const peg::Holder *>(&ope);
        if (!holder) return;

        auto &pd = *std::any_cast<ProfileData *>(trace_data);
        auto now = chrono::steady_clock::now();
        auto elapsed =
            chrono::duration<double, nano>(now - pd.enter_times.back()).count();
        auto child_time = pd.child_ns.back();
        auto self_time = elapsed - child_time;

        pd.enter_times.pop_back();
        pd.child_ns.pop_back();

        // Add elapsed to parent's child accumulator
        if (!pd.child_ns.empty()) { pd.child_ns.back() += elapsed; }

        auto &name = holder->name();
        auto idx = pd.index[name];
        auto &stat = pd.rules[idx];
        stat.self_ns += self_time;
        if (len != static_cast<size_t>(-1)) {
          stat.success++;
        } else {
          stat.fail++;
        }
      },
      // start
      [&profile_result](auto &trace_data) {
        auto pd = new ProfileData{};
        pd->start = chrono::steady_clock::now();
        trace_data = pd;
        profile_result = pd;
      },
      // end
      [](auto & /*trace_data*/) {});

  // Enable packrat stats collection
  pg["Statements"].collect_packrat_stats = true;

  cout << "Profiling parse of " << input_file << " (" << sql_input.size()
       << " bytes)..." << endl;

  auto t0 = chrono::steady_clock::now();
  pg.parse(sql_input);
  auto t1 = chrono::steady_clock::now();
  auto total_ms =
      chrono::duration_cast<chrono::microseconds>(t1 - t0).count() / 1000.0;

  // Output results
  auto &pd = *profile_result;
  auto &rules = pd.rules;

  vector<size_t> order(rules.size());
  iota(order.begin(), order.end(), 0);
  sort(order.begin(), order.end(),
       [&](size_t a, size_t b) { return rules[a].self_ns > rules[b].self_ns; });

  size_t total_calls = 0;
  double total_self_ns = 0;
  for (auto &r : rules) {
    total_calls += r.success + r.fail;
    total_self_ns += r.self_ns;
  }

  cout << endl;
  cout << "Profile: " << input_file << " (" << sql_input.size() << " bytes)"
       << endl;
  cout << "Total time: " << fixed << setprecision(3) << total_ms << " ms"
       << endl;
  cout << "Total rule calls: " << total_calls << endl;
  cout << endl;

  char buf[256];
  snprintf(buf, sizeof(buf), "%4s %-30s %10s %6s %10s %10s %6s %8s",
           "rank", "rule", "self(ms)", "%", "success", "fail", "fail%",
           "avg(ns)");
  cout << buf << endl;
  cout << string(100, '-') << endl;

  size_t rank = 1;
  for (auto i : order) {
    auto &r = rules[i];
    auto total = r.success + r.fail;
    if (total == 0) continue;
    auto self_ms = r.self_ns / 1e6;
    auto pct = r.self_ns / total_self_ns * 100.0;
    auto fail_pct = total > 0 ? r.fail * 100.0 / total : 0.0;
    auto avg_ns = r.self_ns / total;
    snprintf(buf, sizeof(buf),
             "%4zu %-30s %10.3f %5.1f%% %10zu %10zu %5.1f%% %8.1f", rank,
             r.name.c_str(), self_ms, pct, r.success, r.fail, fail_pct, avg_ns);
    cout << buf << endl;
    rank++;
  }

  // Packrat stats
  auto &pkstats = pg["Statements"].packrat_stats_;
  if (!pkstats.empty()) {
    cout << endl;
    cout << "Packrat cache stats per rule:" << endl;
    snprintf(buf, sizeof(buf), "%4s %-30s %10s %10s %10s %6s", "rank",
             "rule", "hits", "misses", "total", "hit%");
    cout << buf << endl;
    cout << string(80, '-') << endl;

    // Build def_id → name map from ProfileData
    map<size_t, string> defid_to_name;
    for (auto &[name, idx] : pd.index) {
      try {
        auto &rule = pg[name.c_str()];
        defid_to_name[rule.id] = name;
      } catch (...) {}
    }

    struct PkEntry {
      string name;
      size_t hits, misses;
    };
    vector<PkEntry> pk_entries;
    size_t total_hits = 0, total_misses = 0;
    for (size_t i = 0; i < pkstats.size(); i++) {
      auto &st = pkstats[i];
      if (st.hits + st.misses == 0) continue;
      auto it = defid_to_name.find(i);
      string name =
          it != defid_to_name.end() ? it->second : "id=" + to_string(i);
      pk_entries.push_back({name, st.hits, st.misses});
      total_hits += st.hits;
      total_misses += st.misses;
    }

    sort(pk_entries.begin(), pk_entries.end(), [](auto &a, auto &b) {
      return a.hits + a.misses > b.hits + b.misses;
    });

    rank = 1;
    for (auto &e : pk_entries) {
      auto total = e.hits + e.misses;
      auto hit_pct = total > 0 ? e.hits * 100.0 / total : 0.0;
      snprintf(buf, sizeof(buf), "%4zu %-30s %10zu %10zu %10zu %5.1f%%",
               rank, e.name.c_str(), e.hits, e.misses, total, hit_pct);
      cout << buf << endl;
      rank++;
    }
    cout << endl;
    snprintf(buf, sizeof(buf),
             "Total: hits=%zu misses=%zu total=%zu hit%%=%.1f%%", total_hits,
             total_misses, total_hits + total_misses,
             (total_hits + total_misses) > 0
                 ? total_hits * 100.0 / (total_hits + total_misses)
                 : 0.0);
    cout << buf << endl;
  }

  delete profile_result;
  return 0;
}

int main(int argc, char *argv[]) {
  string data_dir = BENCHMARK_DATA_DIR;

  // Check for subcommands
  if (argc > 1 && strcmp(argv[1], "profile") == 0) {
    return run_profile(data_dir, argc - 2, argv + 2);
  }

  int iterations = 10;
  if (argc > 1) {
    iterations = atoi(argv[1]);
@ -130,70 +418,73 @@ int main(int argc, char *argv[]) {
    }
  }

  auto sql_grammar = read_file(data_dir + "/sql.peg");
  auto q1_sql = read_file(data_dir + "/q1.sql");
  auto tpch_sql = read_file(data_dir + "/all-tpch.sql");
  auto big_sql = read_file(data_dir + "/big.sql");

  cout << "cpp-peglib SQL benchmark (" << iterations << " iterations)";
#ifdef HAS_PG_QUERY
  cout << "(with libpg_query YACC comparison)";
#endif
  cout << endl;
  cout << string(80, '=') << endl;

  vector<BenchResult> results;
  int test_num = 1;

  // PEG benchmarks
  cout << "--- cpp-peglib (PEG) ---" << endl;

  cout << "[" << test_num++ << "] PEG: grammar load (" << sql_grammar.size()
       << " bytes)" << endl;
  results.push_back(bench_sql_grammar_load(sql_grammar, iterations));

  cout << "[" << test_num++ << "] PEG: TPC-H Q1 (" << q1_sql.size() << " bytes)"
       << endl;
  results.push_back(
      bench_sql_parse("PEG: TPC-H Q1", sql_grammar, q1_sql, iterations));

  cout << "[" << test_num++ << "] PEG: all TPC-H (" << tpch_sql.size()
       << " bytes)" << endl;
  results.push_back(
      bench_sql_parse("PEG: all TPC-H", sql_grammar, tpch_sql, iterations));

  cout << "[" << test_num++ << "] PEG: big.sql (" << big_sql.size() << " bytes)"
       << endl;
  results.push_back(
      bench_sql_parse("PEG: big.sql (~1MB)", sql_grammar, big_sql, iterations));

  // Optimized grammar benchmarks
  {
    auto opt_grammar = read_file(data_dir + "/sql-optimized.peg");
    cout << endl << "--- cpp-peglib (PEG, optimized grammar) ---" << endl;

    cout << "[" << test_num++ << "] PEG-opt: big.sql (" << big_sql.size()
         << " bytes)" << endl;
    results.push_back(bench_sql_parse("PEG-opt: big.sql (~1MB)", opt_grammar,
                                      big_sql, iterations));
  }

  // YACC benchmarks
#ifdef HAS_PG_QUERY
  cout << endl << "--- PostgreSQL YACC (libpg_query) ---" << endl;

  cout << "[" << test_num++ << "] YACC: TPC-H Q1 (" << q1_sql.size()
       << " bytes)" << endl;
  results.push_back(bench_yacc_parse("YACC: TPC-H Q1", q1_sql, iterations));

  cout << "[" << test_num++ << "] YACC: all TPC-H (" << tpch_sql.size()
       << " bytes)" << endl;
  results.push_back(bench_yacc_parse("YACC: all TPC-H", tpch_sql, iterations));

  cout << "[" << test_num++ << "] YACC: big.sql (" << big_sql.size()
       << " bytes)" << endl;
  results.push_back(
      bench_yacc_parse("YACC: big.sql (~1MB)", big_sql, iterations));
#endif

  cout << endl << string(80, '=') << endl;
  cout << "Summary:" << endl;
  for (const auto &r : results) {
    print_result(r);
@ -217,5 +508,7 @@ int main(int argc, char *argv[]) {
  }
#endif

  print_bar_chart(results);

  return 0;
}
119
benchmark/data/sql-optimized.peg
Normal file
@ -0,0 +1,119 @@
Statements <- (SingleStatement (';' SingleStatement )* ';'*)

SingleStatement <- SelectStatement
SelectStatement <- SimpleSelect (SetopClause SimpleSelect)*

SetopClause <- ('UNION'i / 'EXCEPT'i / 'INTERSECT'i) 'ALL'i?

SimpleSelect <- WithClause? SelectClause FromClause? WhereClause? GroupByClause? HavingClause? OrderByClause? LimitClause?

WithStatement <- Identifier 'AS'i SubqueryReference
WithClause <- 'WITH'i List(WithStatement)
SelectClause <- 'SELECT'i ('*' / List(AliasedExpression))
ColumnAliases <- Parens(List(Identifier))

TableReference <-
    (SubqueryReference 'AS'i? Identifier ColumnAliases?) /
    (Identifier ('AS'i? Identifier)?)

ExplicitJoin <- ('LEFT'i / 'FULL'i)? 'OUTER'i? 'JOIN'i TableReference 'ON'i Expression

FromClause <- 'FROM'i TableReference ((',' TableReference) / ExplicitJoin)*
WhereClause <- 'WHERE'i Expression
GroupByClause <- 'GROUP'i 'BY'i List(Expression)
HavingClause <- 'HAVING'i Expression

SubqueryReference <- Parens(SelectStatement)

OrderByExpression <- Expression ('DESC'i / 'ASC'i)? ('NULLS'i 'FIRST'i / 'LAST'i)?
OrderByClause <- 'ORDER'i 'BY'i List(OrderByExpression)

LimitClause <- 'LIMIT'i NumberLiteral

ReservedKeyword <-
    'SELECT'i /
    'FROM'i /
    'WHERE'i /
    'GROUP'i 'BY'i /
    'HAVING'i /
    'UNION'i /
    'ORDER'i 'BY'i /
    'WHEN'i /
    'JOIN'i /
    'ON'i /
    'INTERSECT'i # TODO expand on this

PlainIdentifier <- !ReservedKeyword <[a-z_]i[a-z0-9_]i*> # unquoted identifier can't be a top-level keyword
QuotedIdentifier <- '"' [^"]* '"'
Identifier <- QuotedIdentifier / PlainIdentifier
NumberLiteral <- < [+-]?[0-9]*([.][0-9]*)? >
StringLiteral <- '\'' [^\']* '\''
TypeSpecifier <- Identifier (Parens(List(NumberLiteral)))?

# Optimization: Merge ColumnReference, FunctionExpression, and IsNull
# into a single rule to avoid redundant identifier parsing.
# Old: IsNullExpression / FunctionExpression / ColumnReference (3 separate rules)
# New: ColumnOrFuncRef handles all three patterns in one pass.
ColumnOrFuncRef <- (Identifier '.')? Identifier (Parens(List(Expression)))?

ParenthesisExpression <- Parens(Expression)
LiteralExpression <- StringLiteral / NumberLiteral
CastExpression <- 'CAST'i Parens(Expression 'AS'i TypeSpecifier)
ExtractExpression <- 'EXTRACT'i Parens(ColumnReference 'FROM'i Expression)
ColumnReference <- (Identifier '.')? Identifier
CountStarExpression <- 'COUNT'i Parens('*')
SubqueryExpression <- 'NOT'i? 'EXISTS'i? SubqueryReference
CaseExpression <- 'CASE'i ColumnReference? 'WHEN'i Expression 'THEN'i Expression ('ELSE'i Expression)? 'END'i # TODO strict
DateExpression <- 'DATE'i Expression
DistinctExpression <- 'DISTINCT'i Expression
SubstringExpression <- 'SUBSTRING'i Parens(Expression 'FROM'i NumberLiteral 'FOR'i NumberLiteral)
LiteralListExpression <- Parens(List(Expression))
FrameClause <- 'ROWS'i 'BETWEEN'i (('UNBOUNDED'i 'PRECEDING'i)) 'AND' (('CURRENT'i 'ROW'i))
WindowExpression <- Parens(('PARTITION'i 'BY'i List(Expression))? OrderByClause? FrameClause?)

# Optimization: Removed IsNullExpression, FunctionExpression, ColumnReference
# from SingleExpression. Replaced with ColumnOrFuncRef.
SingleExpression <-
    SubqueryExpression /
    LiteralListExpression /
    ParenthesisExpression /
    DateExpression /
    DistinctExpression /
    SubstringExpression /
    CaseExpression /
    CountStarExpression /
    CastExpression /
    ExtractExpression /
    WindowExpression /
    ColumnOrFuncRef /
    LiteralExpression

ArithmeticOperator <- '+' / '-' / '*' / '/'
LikeOperator <- 'NOT'i? 'LIKE'i
InOperator <- 'NOT'i? 'IN'i !'T'i # special handling to not match INTERSECT
BooleanOperator <- ('OR'i !'D'i) / 'AND'i # special handling to not match ORDER BY
ComparisionOperator <- '=' / '<=' / '>=' / '<' / '>'
WindowOperator <- 'OVER'i
BetweenOperator <- 'BETWEEN'i
# Optimization: IS NULL as postfix operator instead of standalone expression
IsNullOperator <- 'IS'i 'NOT'i? 'NULL'i

Operator <-
    ArithmeticOperator /
    ComparisionOperator /
    BooleanOperator /
    LikeOperator /
    InOperator /
    WindowOperator /
    BetweenOperator

# Optimization: IS NULL as postfix on each operand
Expression <- SingleExpression IsNullOperator? (Operator SingleExpression IsNullOperator?)*
AliasedExpression <- Expression ('AS'i? Identifier)?

# internal definitions
%whitespace <- [ \t\n\r]*
List(D) <- D (',' D)*
Parens(D) <- '(' D ')'
File diff suppressed because one or more lines are too long
BIN
docs/native.wasm
Binary file not shown.
128
lint/peglint.cc
@@ -6,6 +6,7 @@
//

#include <fstream>
#include <iostream>
#include <peglib.h>
#include <sstream>

@@ -111,6 +112,133 @@ int main(int argc, const char **argv) {

  if (!parser.load_grammar(syntax.data(), syntax.size())) { return -1; }

  {
    using namespace peg;
    auto &grammar = parser.get_grammar();
    const char *source_start = syntax.data();

    // Get the first Reference name from an Ope tree
    struct GetFirstRef : public Ope::Visitor {
      using Ope::Visitor::visit;
      string name; // empty if not found
      void visit(Reference &ope) override {
        if (name.empty() && ope.rule_ && !ope.is_macro_) { name = ope.name_; }
      }
      void visit(Sequence &ope) override {
        if (name.empty() && !ope.opes_.empty()) { ope.opes_[0]->accept(*this); }
      }
      void visit(Repetition &ope) override {
        if (name.empty()) { ope.ope_->accept(*this); }
      }
      void visit(CaptureScope &ope) override {
        if (name.empty()) { ope.ope_->accept(*this); }
      }
      void visit(Capture &ope) override {
        if (name.empty()) { ope.ope_->accept(*this); }
      }
      void visit(TokenBoundary &ope) override {
        if (name.empty()) { ope.ope_->accept(*this); }
      }
      void visit(Ignore &ope) override {
        if (name.empty()) { ope.ope_->accept(*this); }
      }
    };

    // Build first-reference chain: all rules reachable by following
    // the first Reference of each definition recursively.
    map<string, set<string>> chain_cache;
    auto build_chain = [&](const string &start) -> const set<string> & {
      auto cached = chain_cache.find(start);
      if (cached != chain_cache.end()) { return cached->second; }
      auto &chain = chain_cache[start];
      string cur = start;
      while (!cur.empty() && !chain.count(cur)) {
        chain.insert(cur);
        auto it = grammar.find(cur);
        if (it == grammar.end()) break;
        GetFirstRef vis;
        it->second.get_core_operator()->accept(vis);
        cur = vis.name;
      }
      return chain;
    };

    auto warn = [&](const Definition &def, const string &msg) {
      auto pos = line_info(source_start, def.s_);
      cerr << syntax_path << ":" << pos.first << ":" << pos.second << ": "
           << msg << endl;
    };

    for (auto &[rule_name, def] : grammar) {
      auto core = def.get_core_operator();
      auto *choice = dynamic_cast<PrioritizedChoice *>(core.get());
      if (!choice) continue;

      // Collect first-ref info for each alternative
      struct AltInfo {
        string direct_ref;
        const set<string> *chain = nullptr;
      };
      vector<AltInfo> alts;
      for (auto &ope : choice->opes_) {
        GetFirstRef vis;
        ope->accept(vis);
        AltInfo info;
        if (!vis.name.empty()) {
          info.direct_ref = vis.name;
          info.chain = &build_chain(vis.name);
        }
        alts.push_back(std::move(info));
      }

      // Direct common prefix
      map<string, vector<size_t>> direct;
      for (size_t i = 0; i < alts.size(); i++) {
        if (!alts[i].direct_ref.empty()) {
          direct[alts[i].direct_ref].push_back(i);
        }
      }
      for (auto &[prefix, idx] : direct) {
        if (idx.size() < 2) continue;

        warn(def, "'" + rule_name + "' has " + to_string(idx.size()) +
                      " alternatives starting with '" + prefix +
                      "'; consider left-factoring.");
      }

      // Indirect common prefix
      map<string, vector<size_t>> indirect;
      for (size_t i = 0; i < alts.size(); i++) {
        if (!alts[i].chain) continue;
        for (auto &ref : *alts[i].chain) {
          indirect[ref].push_back(i);
        }
      }
      for (auto &[ref, idx] : indirect) {
        if (idx.size() < 2) continue;
        if (direct.count(ref) && direct[ref].size() >= 2) continue;
        vector<string> names;
        for (auto i : idx) {
          if (!alts[i].direct_ref.empty()) {
            names.push_back(alts[i].direct_ref);
          }
        }
        sort(names.begin(), names.end());
        names.erase(unique(names.begin(), names.end()), names.end());
        if (names.size() < 2) continue;

        string affected;
        for (auto &n : names) {
          if (!affected.empty()) affected += ", ";
          affected += n;
        }
        warn(def, "'" + rule_name + "' alternatives {" + affected +
                      "} share indirect prefix '" + ref +
                      "'; consider left-factoring.");
      }
    }
  }

  if (path_list.size() < 2 && !opt_source) { return 0; }

  // Check source

@@ -23,6 +23,11 @@ add_executable(peglib-test-main
  test_left_recursive.cc
  test_first_set.cc
  test_integration.cc
  test_trace.cc
  test_combinators.cc
  test_definition_api.cc
  test_utf8.cc
  test_snapshot.cc
)

target_include_directories(peglib-test-main PRIVATE ..)

256
test/test_combinators.cc
Normal file
@@ -0,0 +1,256 @@
#include <gtest/gtest.h>
#include <peglib.h>

using namespace peg;

// Helper: Definition::parse returns Result, extract .ret for EXPECT
static bool def_parse(const Definition &def, const char *s) {
  return def.parse(s, strlen(s)).ret;
}

// --- opt() ---

TEST(CombinatorTest, Opt_matches_present) {
  Definition ROOT;
  ROOT <= seq(lit("hello"), opt(lit(" world")));
  EXPECT_TRUE(def_parse(ROOT, "hello world"));
}

TEST(CombinatorTest, Opt_matches_absent) {
  Definition ROOT;
  ROOT <= seq(lit("hello"), opt(lit(" world")));
  EXPECT_TRUE(def_parse(ROOT, "hello"));
}

// --- rep() ---

TEST(CombinatorTest, Rep_exact_range) {
  Definition ROOT;
  ROOT <= rep(chr('a'), 2, 4);
  EXPECT_FALSE(def_parse(ROOT, "a"));
  EXPECT_TRUE(def_parse(ROOT, "aa"));
  EXPECT_TRUE(def_parse(ROOT, "aaa"));
  EXPECT_TRUE(def_parse(ROOT, "aaaa"));
}

TEST(CombinatorTest, Rep_min_only) {
  Definition ROOT;
  ROOT <= rep(chr('x'), 1, std::numeric_limits<size_t>::max());
  EXPECT_FALSE(def_parse(ROOT, ""));
  EXPECT_TRUE(def_parse(ROOT, "x"));
  EXPECT_TRUE(def_parse(ROOT, "xxx"));
}

// --- apd() (And Predicate) ---

TEST(CombinatorTest, Apd_lookahead_success) {
  Definition ROOT;
  ROOT <= seq(apd(seq(chr('a'), chr('b'))), chr('a'), chr('b'));
  EXPECT_TRUE(def_parse(ROOT, "ab"));
}

TEST(CombinatorTest, Apd_lookahead_failure) {
  Definition ROOT;
  ROOT <= seq(apd(seq(chr('a'), chr('b'))), chr('a'), chr('c'));
  EXPECT_FALSE(def_parse(ROOT, "ac"));
}

// --- npd() (Not Predicate) ---

TEST(CombinatorTest, Npd_negative_lookahead_success) {
  Definition ROOT;
  ROOT <= seq(npd(seq(chr('a'), chr('b'))), chr('a'), chr('c'));
  EXPECT_TRUE(def_parse(ROOT, "ac"));
}

TEST(CombinatorTest, Npd_negative_lookahead_failure) {
  Definition ROOT;
  ROOT <= seq(npd(seq(chr('a'), chr('b'))), chr('a'), chr('b'));
  EXPECT_FALSE(def_parse(ROOT, "ab"));
}

// --- dic() (Dictionary) ---

TEST(CombinatorTest, Dic_matches_keyword) {
  Definition ROOT;
  ROOT <= dic({"apple", "banana", "cherry"}, false);
  EXPECT_TRUE(def_parse(ROOT, "apple"));
  EXPECT_TRUE(def_parse(ROOT, "banana"));
  EXPECT_TRUE(def_parse(ROOT, "cherry"));
  EXPECT_FALSE(def_parse(ROOT, "grape"));
}

TEST(CombinatorTest, Dic_case_insensitive) {
  Definition ROOT;
  ROOT <= dic({"hello", "world"}, true);
  EXPECT_TRUE(def_parse(ROOT, "HELLO"));
  EXPECT_TRUE(def_parse(ROOT, "World"));
}

// --- lit() / liti() ---

TEST(CombinatorTest, Lit_exact_match) {
  Definition ROOT;
  ROOT <= lit("hello");
  EXPECT_TRUE(def_parse(ROOT, "hello"));
  EXPECT_FALSE(def_parse(ROOT, "Hello"));
  EXPECT_FALSE(def_parse(ROOT, "world"));
}

TEST(CombinatorTest, Liti_case_insensitive) {
  Definition ROOT;
  ROOT <= liti("hello");
  EXPECT_TRUE(def_parse(ROOT, "hello"));
  EXPECT_TRUE(def_parse(ROOT, "HELLO"));
  EXPECT_TRUE(def_parse(ROOT, "HeLLo"));
  EXPECT_FALSE(def_parse(ROOT, "world"));
}

// --- ncls() (Negated Character Class) ---

TEST(CombinatorTest, Ncls_negated_class_string) {
  Definition ROOT;
  ROOT <= oom(ncls("abc"));
  EXPECT_TRUE(def_parse(ROOT, "xyz"));
  EXPECT_FALSE(def_parse(ROOT, "abc"));
  EXPECT_FALSE(def_parse(ROOT, "a"));
}

TEST(CombinatorTest, Ncls_negated_class_ranges) {
  Definition ROOT;
  std::vector<std::pair<char32_t, char32_t>> ranges = {{'0', '9'}};
  ROOT <= oom(ncls(ranges));
  EXPECT_TRUE(def_parse(ROOT, "abc"));
  EXPECT_FALSE(def_parse(ROOT, "123"));
}

// --- csc() / cap() (Capture Scope / Capture) ---

TEST(CombinatorTest, Cap_capture_match) {
  Definition ROOT;
  std::string captured;
  ROOT <= cap(oom(cls("a-z")), [&](const char *s, size_t n, Context &) {
    captured = std::string(s, n);
  });
  EXPECT_TRUE(def_parse(ROOT, "hello"));
  EXPECT_EQ(captured, "hello");
}

TEST(CombinatorTest, Csc_capture_scope_with_backreference) {
  // Test capture scope with backreference: opening quote must match closing
  parser parser(R"(
    ROOT <- QUOTED
    QUOTED <- $q< ['"'] > [a-z]+ $q
  )");
  ASSERT_TRUE(parser);
  EXPECT_TRUE(parser.parse("'hello'"));
  EXPECT_TRUE(parser.parse("\"hello\""));
  EXPECT_FALSE(parser.parse("'hello\""));
}

// --- tok() (Token Boundary) ---

TEST(CombinatorTest, Tok_token_boundary) {
  Definition ROOT;
  ROOT <= seq(tok(oom(cls("a-z"))), tok(oom(cls("a-z"))));
  ROOT.whitespaceOpe = zom(cls(" \t"));

  std::string val;
  ROOT = [&](const SemanticValues &vs) {
    val = std::string(vs.token_to_string(0)) + " " +
          std::string(vs.token_to_string(1));
  };
  EXPECT_TRUE(def_parse(ROOT, "hello world"));
  EXPECT_EQ(val, "hello world");
}

// --- ign() (Ignore) ---

TEST(CombinatorTest, Ign_ignore_semantic_value) {
  Definition ROOT;
  ROOT <= seq(lit("hello"), ign(lit(" ")), lit("world"));
  EXPECT_TRUE(def_parse(ROOT, "hello world"));
}

// --- bkr() (Back Reference) ---

TEST(CombinatorTest, Bkr_back_reference) {
  parser parser(R"(
    ROOT <- $tag< WORD > ' ' WORD ' ' $tag
    WORD <- [a-z]+
  )");
  ASSERT_TRUE(parser);
  EXPECT_TRUE(parser.parse("hello world hello"));
  EXPECT_FALSE(parser.parse("hello world goodbye"));
}

// --- pre() (Precedence Climbing) ---

TEST(CombinatorTest, Pre_precedence_climbing) {
  parser parser(R"(
    EXPRESSION <- ATOM (BINOP ATOM)* {
      precedence
        L + -
        L * /
    }
    ATOM <- NUMBER
    BINOP <- < '+' / '-' / '*' / '/' >
    NUMBER <- < [0-9]+ >
    %whitespace <- [ \t]*
  )");
  ASSERT_TRUE(parser);

  parser["EXPRESSION"] = [](const SemanticValues &vs) -> int {
    if (vs.size() == 1) { return std::any_cast<int>(vs[0]); }
    auto left = std::any_cast<int>(vs[0]);
    auto right = std::any_cast<int>(vs[2]);
    auto op = std::any_cast<std::string_view>(vs[1]);
    if (op == "+") return left + right;
    if (op == "-") return left - right;
    if (op == "*") return left * right;
    if (op == "/") return left / right;
    return 0;
  };
  parser["BINOP"] = [](const SemanticValues &vs) { return vs.token(); };
  parser["NUMBER"] = [](const SemanticValues &vs) {
    return std::stoi(std::string(vs.token()));
  };

  int result;
  EXPECT_TRUE(parser.parse("2+3*4", result));
  EXPECT_EQ(result, 14); // 2 + (3 * 4) = 14
}

// --- rec() (Recovery) ---

TEST(CombinatorTest, Rec_recovery) {
  parser parser(R"(
    PROGRAM <- STMT+
    STMT <- EXPR ';'
    EXPR <- 'ok' / %recover([^;]*)
    %whitespace <- [ \t\n]*
  )");
  ASSERT_TRUE(parser);

  std::vector<std::string> errors;
  parser.set_logger([&](size_t, size_t, const std::string &msg,
                        const std::string &) { errors.push_back(msg); });

  auto result = parser.parse("ok; bad; ok;");
  EXPECT_FALSE(result);
  EXPECT_FALSE(errors.empty());
}

// --- cut() ---

TEST(CombinatorTest, Cut_operator) {
  parser parser(R"(
    ROOT <- A / B
    A <- 'a' ↑ 'b'
    B <- 'a' 'c'
  )");
  ASSERT_TRUE(parser);

  EXPECT_TRUE(parser.parse("ab"));
  EXPECT_FALSE(parser.parse("ac"));
}
136
test/test_definition_api.cc
Normal file
@@ -0,0 +1,136 @@
#include <gtest/gtest.h>
#include <peglib.h>

using namespace peg;

// --- Definition::error_message ---

TEST(DefinitionApiTest, Error_message_custom) {
  parser parser(R"(
    ROOT <- GREETING
    GREETING <- 'hello' ' ' NAME
    NAME <- [a-z]+ { error_message "expected a lowercase name" }
  )");
  ASSERT_TRUE(parser);

  std::string error_msg;
  parser.set_logger([&](size_t, size_t, const std::string &msg,
                        const std::string &) { error_msg = msg; });

  EXPECT_FALSE(parser.parse("hello 123"));
  EXPECT_EQ(error_msg, "expected a lowercase name");
}

TEST(DefinitionApiTest, Error_message_via_definition) {
  parser parser(R"(
    ROOT <- 'hi ' NAME
    NAME <- [a-z]+
  )");
  ASSERT_TRUE(parser);

  parser["NAME"].error_message = "name must be lowercase letters";

  std::string error_msg;
  parser.set_logger([&](size_t, size_t, const std::string &msg,
                        const std::string &) { error_msg = msg; });

  EXPECT_FALSE(parser.parse("hi 123"));
  EXPECT_EQ(error_msg, "name must be lowercase letters");
}

// --- Definition::no_ast_opt ---

TEST(DefinitionApiTest, No_ast_opt_preserves_node) {
  // Without no_ast_opt
  {
    parser pg(R"(
      ROOT <- ITEM+
      ITEM <- [a-z]+
    )");
    ASSERT_TRUE(pg);
    pg.enable_ast();

    std::shared_ptr<Ast> ast;
    EXPECT_TRUE(pg.parse("hello", ast));
    auto optimized = pg.optimize_ast(ast);

    bool found_item = false;
    std::function<void(const Ast &)> walk = [&](const Ast &node) {
      if (node.name == "ITEM") found_item = true;
      for (auto &child : node.nodes) {
        walk(*child);
      }
    };
    walk(*optimized);
    // May or may not have ITEM depending on optimization
    (void)found_item;

    // With no_ast_opt on ITEM
    parser pg2(R"(
      ROOT <- ITEM+
      ITEM <- [a-z]+
    )");
    ASSERT_TRUE(pg2);
    pg2["ITEM"].no_ast_opt = true;
    pg2.enable_ast();

    std::shared_ptr<Ast> ast2;
    EXPECT_TRUE(pg2.parse("hello", ast2));
    auto optimized2 = pg2.optimize_ast(ast2);

    bool found_item2 = false;
    std::function<void(const Ast &)> walk2 = [&](const Ast &node) {
      if (node.name == "ITEM") found_item2 = true;
      for (auto &child : node.nodes) {
        walk2(*child);
      }
    };
    walk2(*optimized2);
    EXPECT_TRUE(found_item2);
  }
}

// --- Definition::disable_action ---

TEST(DefinitionApiTest, Disable_action_skips_semantic_action) {
  parser parser(R"(
    ROOT <- NUMBER
    NUMBER <- [0-9]+
  )");
  ASSERT_TRUE(parser);

  bool action_called = false;
  parser["NUMBER"] = [&](const SemanticValues &) { action_called = true; };

  EXPECT_TRUE(parser.parse("123"));
  EXPECT_TRUE(action_called);

  // Now disable the action
  action_called = false;
  parser["NUMBER"].disable_action = true;
  EXPECT_TRUE(parser.parse("456"));
  EXPECT_FALSE(action_called);
}

// --- set_logger (lambda version without rule parameter) ---

TEST(DefinitionApiTest, Set_logger_simple_lambda) {
  parser parser(R"(
    ROOT <- 'hello'
  )");
  ASSERT_TRUE(parser);

  size_t error_line = 0;
  size_t error_col = 0;
  std::string error_msg;
  parser.set_logger([&](size_t line, size_t col, const std::string &msg) {
    error_line = line;
    error_col = col;
    error_msg = msg;
  });

  EXPECT_FALSE(parser.parse("world"));
  EXPECT_GT(error_line, 0u);
  EXPECT_GT(error_col, 0u);
  EXPECT_FALSE(error_msg.empty());
}
110
test/test_snapshot.cc
Normal file
@@ -0,0 +1,110 @@
#include <gtest/gtest.h>
#include <peglib.h>

using namespace peg;

// =============================================================================
// Phase 2 Snapshot/Rollback Validation Tests
//
// These tests verify behaviors that Phase 2 must preserve:
// - Nested choice rollback of choice_count_/choice_
// - Sequence direct write (append removal)
// - Predicate side-effect isolation
// - CaptureScope isolation
// - Repetition partial rollback
// - Capture rollback on choice failure
// =============================================================================

TEST(SnapshotTest, Nested_choice_rollback) {
  parser pg(R"(
    S <- A / B
    A <- 'x' INNER 'y'
    B <- 'x' 'z'
    INNER <- 'a' / 'b' / 'c'
  )");
  EXPECT_TRUE(pg);
  // A fails ('y' not found) -> fallback to B -> choice=1
  pg["S"] = [](const SemanticValues &vs) {
    EXPECT_EQ(2u, vs.choice_count()); // S has 2 alternatives
    EXPECT_EQ(1u, vs.choice());       // B (0-indexed)
  };
  EXPECT_TRUE(pg.parse("xz"));
}

TEST(SnapshotTest, Sequence_direct_write) {
  parser pg(R"(
    S <- A B C
    A <- 'aaa'
    B <- 'bbb'
    C <- 'ccc'
  )");
  EXPECT_TRUE(pg);
  pg["A"] = [](const SemanticValues & /*vs*/) { return 1; };
  pg["B"] = [](const SemanticValues & /*vs*/) { return 2; };
  pg["C"] = [](const SemanticValues & /*vs*/) { return 3; };
  pg["S"] = [](const SemanticValues &vs) {
    EXPECT_EQ(3u, vs.size());
    EXPECT_EQ(1, std::any_cast<int>(vs[0]));
    EXPECT_EQ(2, std::any_cast<int>(vs[1]));
    EXPECT_EQ(3, std::any_cast<int>(vs[2]));
  };
  EXPECT_TRUE(pg.parse("aaabbbccc"));
}

TEST(SnapshotTest, Predicate_no_side_effect) {
  parser pg(R"(
    S <- &(AB) CD
    AB <- [a-z]+ [0-9]+
    CD <- [a-z]+ [0-9]+
  )");
  EXPECT_TRUE(pg);
  pg["AB"] = [](const SemanticValues & /*vs*/) { return std::string("AB"); };
  pg["CD"] = [](const SemanticValues & /*vs*/) { return std::string("CD"); };
  // &(AB) succeeds but its result must not leak into S's semantic values
  pg["S"] = [](const SemanticValues &vs) {
    EXPECT_EQ(1u, vs.size()); // Only CD's result
    EXPECT_EQ("CD", std::any_cast<std::string>(vs[0]));
  };
  EXPECT_TRUE(pg.parse("abc123"));
}

TEST(SnapshotTest, CaptureScope_isolation) {
  parser pg(R"(
    S <- $ref<'hello'> $(ISOLATED) $ref
    ISOLATED <- $ref<'world'>
  )");
  EXPECT_TRUE(pg);
  // $ref='world' inside $(ISOLATED) is isolated by CaptureScope
  // The final $ref should match 'hello', not 'world'
  EXPECT_TRUE(pg.parse("helloworldhello"));
  EXPECT_FALSE(pg.parse("helloworldworld"));
}

TEST(SnapshotTest, Repetition_partial_rollback) {
  parser pg(R"(
    S <- ITEM+
    ITEM <- < [a-z]+ > ' '?
  )");
  EXPECT_TRUE(pg);
  pg["ITEM"] = [](const SemanticValues &vs) { return vs.token_to_string(); };
  pg["S"] = [](const SemanticValues &vs) {
    EXPECT_EQ(3u, vs.size());
    EXPECT_EQ("foo", std::any_cast<std::string>(vs[0]));
    EXPECT_EQ("bar", std::any_cast<std::string>(vs[1]));
    EXPECT_EQ("baz", std::any_cast<std::string>(vs[2]));
  };
  EXPECT_TRUE(pg.parse("foo bar baz"));
}

TEST(SnapshotTest, Capture_rollback_on_choice_failure) {
  parser pg(R"(
    S <- A / B
    A <- $ref<'xx'> 'FAIL'
    B <- $ref<'yy'> $ref
  )");
  EXPECT_TRUE(pg);
  // A: sets ref='xx' -> fails on 'FAIL' -> rollback discards ref='xx'
  // B: sets ref='yy' -> $ref expects 'yy'
  EXPECT_TRUE(pg.parse("yyyy"));
  EXPECT_FALSE(pg.parse("yyxx")); // Would wrongly succeed if ref='xx' leaked
}
267
test/test_trace.cc
Normal file
@@ -0,0 +1,267 @@
#include <gtest/gtest.h>
#include <peglib.h>

#include <sstream>

using namespace peg;

TEST(TraceTest, Enable_trace_enter_leave_callbacks) {
  parser parser(R"(
    ROOT <- 'hello' ' ' 'world'
  )");
  ASSERT_TRUE(parser);

  int enter_count = 0;
  int leave_count = 0;

  parser.enable_trace([&](auto &, auto, auto, auto &, auto &, auto &,
                          auto &) { enter_count++; },
                      [&](auto &, auto, auto, auto &, auto &, auto &, auto,
                          auto &) { leave_count++; });

  EXPECT_TRUE(parser.parse("hello world"));
  EXPECT_GT(enter_count, 0);
  EXPECT_GT(leave_count, 0);
  EXPECT_EQ(enter_count, leave_count);
}

TEST(TraceTest, Enable_trace_with_start_end_callbacks) {
  parser parser(R"(
    ROOT <- 'a' / 'b'
  )");
  ASSERT_TRUE(parser);

  bool start_called = false;
  bool end_called = false;

  parser.enable_trace(
      [&](auto &, auto, auto, auto &, auto &, auto &, auto &) {},
      [&](auto &, auto, auto, auto &, auto &, auto &, auto, auto &) {},
      [&](auto &) { start_called = true; }, [&](auto &) { end_called = true; });

  EXPECT_TRUE(parser.parse("a"));
  EXPECT_TRUE(start_called);
  EXPECT_TRUE(end_called);
}

TEST(TraceTest, Trace_data_passing) {
  parser parser(R"(
    ROOT <- 'test'
  )");
  ASSERT_TRUE(parser);

  // trace_data is initialized by tracer_start, then copied into Context.
  // tracer_enter/leave modify Context's copy; tracer_end sees the original.
  int enter_count = 0;
  bool start_called = false;
  bool end_called = false;

  parser.enable_trace(
      [&](auto &, auto, auto, auto &, auto &, auto &, std::any &trace_data) {
        // Verify trace_data was initialized (copied from tracer_start's value)
        auto val = std::any_cast<int>(trace_data);
        trace_data = val + 1;
        enter_count++;
      },
      [&](auto &, auto, auto, auto &, auto &, auto &, auto, std::any &) {},
      [&](std::any &trace_data) {
        trace_data = 42;
        start_called = true;
      },
      [&](std::any &trace_data) {
        // tracer_end sees the original trace_data (not Context's copy)
        EXPECT_EQ(std::any_cast<int>(trace_data), 42);
        end_called = true;
      });

  EXPECT_TRUE(parser.parse("test"));
  EXPECT_TRUE(start_called);
  EXPECT_TRUE(end_called);
  EXPECT_GT(enter_count, 0);
}

TEST(TraceTest, Trace_on_parse_failure) {
  parser parser(R"(
    ROOT <- 'hello'
  )");
  ASSERT_TRUE(parser);

  int enter_count = 0;
  int leave_success = 0;
  int leave_fail = 0;

  parser.enable_trace(
      [&](auto &, auto, auto, auto &, auto &, auto &, auto &) {
        enter_count++;
      },
      [&](auto &, auto, auto, auto &, auto &, auto &, auto len, auto &) {
        if (len != static_cast<size_t>(-1)) {
          leave_success++;
        } else {
          leave_fail++;
        }
      });

  EXPECT_FALSE(parser.parse("world"));
  EXPECT_GT(enter_count, 0);
  EXPECT_GT(leave_fail, 0);
}

TEST(TraceTest, Trace_position_in_callback) {
  parser parser(R"(
    ROOT <- 'ab' 'cd'
  )");
  ASSERT_TRUE(parser);

  std::vector<size_t> positions;

  parser.enable_trace(
      [&](auto &, auto s, auto, auto &, auto &c, auto &, auto &) {
        positions.push_back(static_cast<size_t>(s - c.s));
      },
      [&](auto &, auto, auto, auto &, auto &, auto &, auto, auto &) {});

  EXPECT_TRUE(parser.parse("abcd"));
  EXPECT_FALSE(positions.empty());
  // First position should be 0 (start of input)
  EXPECT_EQ(positions[0], 0u);
}

TEST(TraceTest, Set_verbose_trace) {
  parser parser(R"(
    ROOT <- 'hello'
  )");
  ASSERT_TRUE(parser);

  // Just verify it doesn't crash; verbose_trace affects is_traceable
  parser.set_verbose_trace(true);

  int enter_count = 0;
  parser.enable_trace(
      [&](auto &, auto, auto, auto &, auto &, auto &, auto &) {
        enter_count++;
      },
      [&](auto &, auto, auto, auto &, auto &, auto &, auto, auto &) {});

  EXPECT_TRUE(parser.parse("hello"));
  // With verbose trace, more operations should be traced
  int verbose_count = enter_count;

  enter_count = 0;
  parser.set_verbose_trace(false);
  EXPECT_TRUE(parser.parse("hello"));
  int non_verbose_count = enter_count;

  EXPECT_GE(verbose_count, non_verbose_count);
}

TEST(TraceTest, Enable_tracing_helper) {
  parser parser(R"(
    ROOT <- GREETING
    GREETING <- 'hello' ' ' 'world'
  )");
  ASSERT_TRUE(parser);

  std::ostringstream os;
  enable_tracing(parser, os);

  EXPECT_TRUE(parser.parse("hello world"));

  auto output = os.str();
  EXPECT_FALSE(output.empty());
  // Should contain Enter and Leave markers
  EXPECT_NE(output.find("E "), std::string::npos);
  EXPECT_NE(output.find("L "), std::string::npos);
}

TEST(TraceTest, Enable_tracing_on_failure) {
  parser parser(R"(
    ROOT <- 'hello'
  )");
  ASSERT_TRUE(parser);

  std::ostringstream os;
  enable_tracing(parser, os);

  EXPECT_FALSE(parser.parse("world"));

  auto output = os.str();
  EXPECT_FALSE(output.empty());
}

TEST(TraceTest, Enable_profiling_helper) {
  parser parser(R"(
    ROOT <- NUMBER ('+' NUMBER)*
    NUMBER <- [0-9]+
  )");
  ASSERT_TRUE(parser);

  std::ostringstream os;
  enable_profiling(parser, os);

  EXPECT_TRUE(parser.parse("1+2+3"));

  auto output = os.str();
  EXPECT_FALSE(output.empty());
  // Should contain duration info
  EXPECT_NE(output.find("duration:"), std::string::npos);
  // Should contain rule names
  EXPECT_NE(output.find("ROOT"), std::string::npos);
  EXPECT_NE(output.find("NUMBER"), std::string::npos);
}

TEST(TraceTest, Enable_profiling_shows_success_fail_counts) {
  parser parser(R"(
    ROOT <- ITEM+
    ITEM <- 'a' / 'b'
  )");
  ASSERT_TRUE(parser);

  std::ostringstream os;
  enable_profiling(parser, os);

  EXPECT_TRUE(parser.parse("ab"));

  auto output = os.str();
  // Should contain success/fail summary
  EXPECT_NE(output.find("success"), std::string::npos);
  EXPECT_NE(output.find("fail"), std::string::npos);
}

TEST(TraceTest, Trace_with_packrat) {
  parser parser(R"(
    ROOT <- A / B
    A <- 'x' 'y'
    B <- 'x' 'z'
  )");
  ASSERT_TRUE(parser);
  parser.enable_packrat_parsing();

  int enter_count = 0;
  parser.enable_trace(
      [&](auto &, auto, auto, auto &, auto &, auto &, auto &) {
        enter_count++;
      },
      [&](auto &, auto, auto, auto &, auto &, auto &, auto, auto &) {});

  EXPECT_TRUE(parser.parse("xz"));
  EXPECT_GT(enter_count, 0);
}

TEST(TraceTest, Trace_with_left_recursion) {
  parser parser(R"(
    E <- E '+' T / T
    T <- [0-9]+
  )");
  ASSERT_TRUE(parser);

  int enter_count = 0;
  parser.enable_trace(
      [&](auto &, auto, auto, auto &, auto &, auto &, auto &) {
        enter_count++;
      },
      [&](auto &, auto, auto, auto &, auto &, auto &, auto, auto &) {});

  EXPECT_TRUE(parser.parse("1+2"));
  EXPECT_GT(enter_count, 0);
}
188
test/test_utf8.cc
Normal file
188
test/test_utf8.cc
Normal file
@ -0,0 +1,188 @@
#include <gtest/gtest.h>
#include <peglib.h>

using namespace peg;

// --- codepoint_length ---

TEST(Utf8Test, Codepoint_length_ascii) {
  const char *s = "A";
  EXPECT_EQ(codepoint_length(s, 1), 1u);
}

TEST(Utf8Test, Codepoint_length_2byte) {
  // U+00E9 (é) = 0xC3 0xA9
  const char s[] = "\xC3\xA9";
  EXPECT_EQ(codepoint_length(s, 2), 2u);
}

TEST(Utf8Test, Codepoint_length_3byte) {
  // U+3042 (あ) = 0xE3 0x81 0x82
  const char s[] = "\xE3\x81\x82";
  EXPECT_EQ(codepoint_length(s, 3), 3u);
}

TEST(Utf8Test, Codepoint_length_4byte) {
  // U+1F600 (😀) = 0xF0 0x9F 0x98 0x80
  const char s[] = "\xF0\x9F\x98\x80";
  EXPECT_EQ(codepoint_length(s, 4), 4u);
}

TEST(Utf8Test, Codepoint_length_empty) {
  EXPECT_EQ(codepoint_length("", 0), 0u);
}

TEST(Utf8Test, Codepoint_length_truncated) {
  // 3-byte sequence but only 2 bytes available
  const char s[] = "\xE3\x81";
  EXPECT_EQ(codepoint_length(s, 2), 0u);
}

// --- codepoint_count ---

TEST(Utf8Test, Codepoint_count_ascii) {
  const char *s = "hello";
  EXPECT_EQ(codepoint_count(s, 5), 5u);
}

TEST(Utf8Test, Codepoint_count_mixed) {
  // "aあb" = 'a' + 3-byte + 'b' = 5 bytes, 3 codepoints
  const char s[] = "a\xE3\x81\x82"
                   "b";
  EXPECT_EQ(codepoint_count(s, 5), 3u);
}

TEST(Utf8Test, Codepoint_count_empty) { EXPECT_EQ(codepoint_count("", 0), 0u); }

TEST(Utf8Test, Codepoint_count_emoji) {
  // "😀😀" = 2 x 4-byte = 8 bytes, 2 codepoints
  const char s[] = "\xF0\x9F\x98\x80\xF0\x9F\x98\x80";
  EXPECT_EQ(codepoint_count(s, 8), 2u);
}

// --- encode_codepoint ---

TEST(Utf8Test, Encode_codepoint_ascii) {
  char buff[4];
  auto len = encode_codepoint(U'A', buff);
  EXPECT_EQ(len, 1u);
  EXPECT_EQ(buff[0], 'A');
}

TEST(Utf8Test, Encode_codepoint_2byte) {
  auto s = encode_codepoint(U'\u00E9'); // é
  EXPECT_EQ(s.size(), 2u);
  EXPECT_EQ(s, "\xC3\xA9");
}

TEST(Utf8Test, Encode_codepoint_3byte) {
  auto s = encode_codepoint(U'\u3042'); // あ
  EXPECT_EQ(s.size(), 3u);
  EXPECT_EQ(s, "\xE3\x81\x82");
}

TEST(Utf8Test, Encode_codepoint_4byte) {
  auto s = encode_codepoint(U'\U0001F600'); // 😀
  EXPECT_EQ(s.size(), 4u);
  EXPECT_EQ(s, "\xF0\x9F\x98\x80");
}

TEST(Utf8Test, Encode_codepoint_surrogate_returns_zero) {
  // Surrogates (U+D800-U+DFFF) are invalid
  char buff[4];
  auto len = encode_codepoint(0xD800, buff);
  EXPECT_EQ(len, 0u);
}

TEST(Utf8Test, Encode_codepoint_beyond_unicode_returns_zero) {
  char buff[4];
  auto len = encode_codepoint(0x110000, buff);
  EXPECT_EQ(len, 0u);
}

// --- decode_codepoint ---

TEST(Utf8Test, Decode_codepoint_ascii) {
  const char *s = "A";
  size_t bytes;
  char32_t cp;
  EXPECT_TRUE(decode_codepoint(s, 1, bytes, cp));
  EXPECT_EQ(bytes, 1u);
  EXPECT_EQ(cp, U'A');
}

TEST(Utf8Test, Decode_codepoint_2byte) {
  const char s[] = "\xC3\xA9";
  size_t bytes;
  char32_t cp;
  EXPECT_TRUE(decode_codepoint(s, 2, bytes, cp));
  EXPECT_EQ(bytes, 2u);
  EXPECT_EQ(cp, U'\u00E9');
}

TEST(Utf8Test, Decode_codepoint_3byte) {
  const char s[] = "\xE3\x81\x82";
  size_t bytes;
  char32_t cp;
  EXPECT_TRUE(decode_codepoint(s, 3, bytes, cp));
  EXPECT_EQ(bytes, 3u);
  EXPECT_EQ(cp, U'\u3042');
}

TEST(Utf8Test, Decode_codepoint_4byte) {
  const char s[] = "\xF0\x9F\x98\x80";
  size_t bytes;
  char32_t cp;
  EXPECT_TRUE(decode_codepoint(s, 4, bytes, cp));
  EXPECT_EQ(bytes, 4u);
  EXPECT_EQ(cp, U'\U0001F600');
}

TEST(Utf8Test, Decode_codepoint_empty) {
  size_t bytes;
  char32_t cp;
  EXPECT_FALSE(decode_codepoint("", 0, bytes, cp));
}

TEST(Utf8Test, Decode_codepoint_convenience_with_bytes) {
  const char s[] = "\xE3\x81\x82";
  char32_t cp;
  auto bytes = decode_codepoint(s, 3, cp);
  EXPECT_EQ(bytes, 3u);
  EXPECT_EQ(cp, U'\u3042');
}

TEST(Utf8Test, Decode_codepoint_convenience_simple) {
  const char s[] = "\xE3\x81\x82";
  auto cp = decode_codepoint(s, 3);
  EXPECT_EQ(cp, U'\u3042');
}

// --- decode (full string) ---

TEST(Utf8Test, Decode_full_string) {
  const char s[] = "a\xE3\x81\x82"
                   "b";
  auto u32 = decode(s, 5);
  EXPECT_EQ(u32.size(), 3u);
  EXPECT_EQ(u32[0], U'a');
  EXPECT_EQ(u32[1], U'\u3042');
  EXPECT_EQ(u32[2], U'b');
}

TEST(Utf8Test, Decode_empty_string) {
  auto u32 = decode("", 0);
  EXPECT_EQ(u32.size(), 0u);
}

// --- roundtrip ---

TEST(Utf8Test, Encode_decode_roundtrip) {
  std::vector<char32_t> codepoints = {U'A', U'\u00E9', U'\u3042',
                                      U'\U0001F600'};
  for (auto cp : codepoints) {
    auto encoded = encode_codepoint(cp);
    auto decoded = decode_codepoint(encoded.c_str(), encoded.size());
    EXPECT_EQ(decoded, cp);
  }
}