
## Plan Runner

This crate is a snapshot executor for conjunctive-query plans.
It reads a JSON plan (a DAG of scan and join nodes plus the input facts),
walks the DAG using the operators from [`query-ops`](../query-ops),
and prints the binding relation produced at the root node.

The wire format mirrors `Geolog.DB.Plan.PlanGraph` from the
[`geolog`](../../external/geolog) submodule, but the JSON shape is the contract:
any frontend that emits this format can drive the runner.
The mapping from `PlanEvalAtom` / `PlanJoin` to `scan_atom` / `semijoin` / `natural_join`,
and the full IR spec, are documented as module-level rustdoc in
[`src/lib.rs`](src/lib.rs).

### Pipeline

End-to-end, scenarios become runner output through three stages:

```text
tools/exporter/examples/*.scenario.json
  └── (Haskell exporter; runs Geolog.DB.Plan.planConjunction
       and Geolog.DB.InMemory.evalConjunctionPlanned as a self-check)
        └── crates/plan-runner/fixtures/*.json    (JSON IR; checked in)
             └── (plan-runner; this crate)
                  └── stdout JSON, with row-for-row oracle check
```

The exporter (`tools/exporter`) is the only producer of runner IR today;
it plans each scenario's atoms and rejects any that fall outside the supported subset.
Fixtures are regenerated with `make export-fixtures`, and the full loop is `make examples`.

The diagram below shows what happens inside the runner once a JSON plan arrives:

<div align="center">
  <picture>
    <img alt="Workflow" src="docs/diagrams/workflow.svg" height="90%" width="90%">
  </picture>
</div>

### Backends

The CLI takes a `--backend` flag.
The `memory` backend is the pure in-memory path;
every other backend routes facts through the [`Storage`](../storage) trait
via `build_tables_via_storage`, then scans tables back out before executing.

| Backend          | Storage                                        | Location              |
|------------------|------------------------------------------------|-----------------------|
| `memory`         | none (direct from `plan.facts`)                | n/a                   |
| `memory-storage` | `MemoryStorage`                                | in-process            |
| `lmdb`           | `LmdbStorage` (heed-backed mmap B-tree)        | fresh tempdir per run |
| `redb`           | `RedbStorage` (single-file B-tree)             | fresh tempdir per run |
| `fjall`          | `FjallStorage` (LSM tree)                      | fresh tempdir per run |
| `sqlite`         | `SqliteStorage` (rusqlite, bundled libsqlite3) | fresh tempdir per run |
| `geomerge`       | `GeomergeStorage` (CRDT; alpha)                | in-process            |

All seven produce byte-identical output for every checked-in fixture.
The point of the abstraction is not performance comparison
(the snapshot evaluator is bulk-materialized either way),
but to validate that the storage layer is genuinely backend-neutral
and that adding a new adapter is a constructor swap.
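
To make the "constructor swap" claim concrete, here is a minimal sketch of the two table-building paths.
The trait `SketchStorage`, the adapter `VecStorage`, and both function signatures are assumptions made for illustration;
the real `Storage` trait and `build_tables_via_storage` may be shaped differently.

```rust
use std::collections::BTreeMap;

/// A fact row; plain strings keep the sketch small.
type Row = Vec<String>;
type Tables = BTreeMap<String, Vec<Row>>;

/// Hypothetical stand-in for the `Storage` trait (not the real definition).
trait SketchStorage {
    fn insert(&mut self, relation: &str, row: Row);
    fn scan(&self, relation: &str) -> Vec<Row>;
}

/// Hypothetical in-process adapter backed by a map of vectors.
#[derive(Default)]
struct VecStorage {
    tables: Tables,
}

impl SketchStorage for VecStorage {
    fn insert(&mut self, relation: &str, row: Row) {
        self.tables.entry(relation.to_string()).or_default().push(row);
    }

    fn scan(&self, relation: &str) -> Vec<Row> {
        self.tables.get(relation).cloned().unwrap_or_default()
    }
}

/// Pure path: tables come straight from the plan's facts.
fn tables_direct(facts: &Tables) -> Tables {
    facts.clone()
}

/// Storage path: write every fact through the trait, then scan each relation back out.
fn tables_via_storage<S: SketchStorage>(storage: &mut S, facts: &Tables) -> Tables {
    for (relation, rows) in facts {
        for row in rows {
            storage.insert(relation, row.clone());
        }
    }
    facts
        .keys()
        .map(|relation| (relation.clone(), storage.scan(relation)))
        .collect()
}

fn main() {
    let mut facts = Tables::new();
    facts.insert(
        "edge".to_string(),
        vec![vec!["node:1".to_string(), "node:2".to_string()]],
    );
    // Swapping the adapter only changes the constructor on this line.
    let mut storage = VecStorage::default();
    assert_eq!(tables_direct(&facts), tables_via_storage(&mut storage, &facts));
}
```

In this sketch everything downstream of the table-building step sees the same `Tables` value, which is why both paths can be expected to agree.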

Note on `geomerge`:
the runner's JSON IR is untyped (only arity per relation),
but geomerge requires a typed theory upfront.
The CLI infers column types from the first fact row per relation
and synthesizes a theory of `PrimInt` and `PrimString` columns via
[`GeomergeStorage::with_relations`](../storage/src/adapters/geomerge.rs).
Columns with no sample facts default to `PrimString`.
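
The inference rule can be sketched as follows.
The `Value` and `ColumnType` enums and the `infer_column_types` function are hypothetical names chosen for this example;
only the `PrimInt` / `PrimString` split and the default come from the description above.

```rust
/// Hypothetical untyped cell, as it might appear in a fact row of the JSON IR.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i64),
    Str(String),
}

/// Illustrative column types, named after the two primitives mentioned above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ColumnType {
    PrimInt,
    PrimString,
}

/// Infer one type per column from the first row only.
/// With no rows at all, every column falls back to `PrimString`.
fn infer_column_types(arity: usize, rows: &[Vec<Value>]) -> Vec<ColumnType> {
    match rows.first() {
        Some(first) => (0..arity)
            .map(|i| match first.get(i) {
                Some(Value::Int(_)) => ColumnType::PrimInt,
                // Strings, and positions missing from the sample row, become PrimString.
                _ => ColumnType::PrimString,
            })
            .collect(),
        None => vec![ColumnType::PrimString; arity],
    }
}

fn main() {
    let rows = vec![vec![Value::Int(1), Value::Str("node:1".to_string())]];
    println!("{:?}", infer_column_types(2, &rows));
    println!("{:?}", infer_column_types(2, &[]));
}
```

Because this sketch samples only the first row, a later row with a different type in the same column would go unnoticed at inference time.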

### Run It

```sh
# Run one fixture through the default in-memory path:
cargo run -p plan-runner -- crates/plan-runner/fixtures/two_atom_join.json

# Same plan, routed through different backends:
cargo run -p plan-runner -- --backend memory-storage crates/plan-runner/fixtures/two_atom_join.json
cargo run -p plan-runner -- --backend lmdb           crates/plan-runner/fixtures/two_atom_join.json
cargo run -p plan-runner -- --backend redb           crates/plan-runner/fixtures/two_atom_join.json
cargo run -p plan-runner -- --backend fjall          crates/plan-runner/fixtures/two_atom_join.json
cargo run -p plan-runner -- --backend sqlite         crates/plan-runner/fixtures/two_atom_join.json
cargo run -p plan-runner -- --backend geomerge       crates/plan-runner/fixtures/two_atom_join.json

# Regenerate every fixture from its scenario and run the oracle test:
make examples
```

A sample run:

```sh
$ plan-run crates/plan-runner/fixtures/two_atom_join.json
{"columns":["a","b","_w0_2"],"rows":[["node:1","node:2","edge:1"],["node:2","node:1","edge:2"]]}
```

The `_w<atomIdx>_<pos>` columns are wildcards the exporter named so the runner can bind them.
The scenario's `expected_bindings` block names only the variables the test cares about,
and `verify` projects the runner output to that subset before comparing as a multiset.
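
A minimal sketch of that project-then-compare step, assuming rows are plain string vectors.
The names `project`, `multiset`, and `verify_projected` are hypothetical and are not the crate's actual `verify` signature.

```rust
use std::collections::HashMap;

type Row = Vec<String>;

/// Project `rows` onto the `wanted` columns, in that order.
/// Returns `None` if a wanted column is absent from the output.
/// Assumes every row has one cell per column.
fn project(columns: &[&str], rows: &[Row], wanted: &[&str]) -> Option<Vec<Row>> {
    let indices = wanted
        .iter()
        .map(|w| columns.iter().position(|c| c == w))
        .collect::<Option<Vec<usize>>>()?;
    Some(
        rows.iter()
            .map(|row| indices.iter().map(|&i| row[i].clone()).collect())
            .collect(),
    )
}

/// Count occurrences of each row, so duplicates matter but order does not.
fn multiset(rows: &[Row]) -> HashMap<&Row, usize> {
    let mut counts = HashMap::new();
    for row in rows {
        *counts.entry(row).or_insert(0) += 1;
    }
    counts
}

/// Compare the projected output against the expected bindings as multisets.
fn verify_projected(
    columns: &[&str],
    rows: &[Row],
    expected_columns: &[&str],
    expected_rows: &[Row],
) -> bool {
    match project(columns, rows, expected_columns) {
        Some(projected) => multiset(&projected) == multiset(expected_rows),
        None => false,
    }
}

/// Small helper to build a row from string literals.
fn row(cells: &[&str]) -> Row {
    cells.iter().map(|c| c.to_string()).collect()
}

fn main() {
    // Output from the sample run above; the expected rows are listed in a different order.
    let columns = ["a", "b", "_w0_2"];
    let rows = vec![
        row(&["node:1", "node:2", "edge:1"]),
        row(&["node:2", "node:1", "edge:2"]),
    ];
    let expected = vec![row(&["node:2", "node:1"]), row(&["node:1", "node:2"])];
    assert!(verify_projected(&columns, &rows, &["a", "b"], &expected));
}
```

Projecting first means the wildcard column `_w0_2` never has to appear in the oracle, and counting rows rather than collecting them into a set keeps duplicate bindings significant.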

### Run the Tests

```sh
cargo test -p plan-runner
```

The two integration test files exercise complementary properties:

- `tests/examples.rs` walks every fixture and checks it against its `expected_bindings` oracle.
- `tests/storage_roundtrip.rs` cross-checks the pure path against the storage-backed path,
  to keep `build_tables` and `build_tables_via_storage` in lockstep.

### Notes

- **IR contract.**
  The runner is backend-agnostic and frontend-agnostic:
  it consumes JSON in the shape documented in `src/lib.rs` and produces a binding relation.
  Anything that emits the same JSON can drive it.
- **No optimizer.**
  Plans are executed as written.
  Node ordering, join shape, and antijoin scheduling are all the producer's responsibility.
  This crate's job ends at faithful execution of the IR.
- **Wildcard columns survive.**
  `scan_atom` keeps every distinct variable that appears in the pattern,
  including the exporter's synthetic `_w<atomIdx>_<pos>` names.
  The runner does not project them out;
  oracle verification handles that on the comparison side.
- **Bulk, not streaming.**
  Each node materializes its full output as a `Relation`.
  This matches `query-ops`' execution model;
  the runner itself is not designed for incremental or maintained-view workloads.
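
The notes above can be illustrated with a small evaluator sketch.
The `Node` and `Relation` types, the function names, and the "last node is the root" convention are all assumptions made for this example;
the actual IR and operators are specified in `src/lib.rs` and `query-ops`.

```rust
use std::collections::HashMap;

type Row = Vec<String>;

/// A fully materialized binding relation: named columns plus rows.
#[derive(Debug, Clone, PartialEq)]
struct Relation {
    columns: Vec<String>,
    rows: Vec<Row>,
}

/// Hypothetical plan node; `Join` inputs refer to earlier nodes by index.
enum Node {
    Scan { relation: String, vars: Vec<String> },
    Join { left: usize, right: usize },
}

/// Bind each fact to the pattern, keeping every distinct variable
/// (wildcard names included). A repeated variable must agree with itself.
fn scan(facts: &[Row], vars: &[String]) -> Relation {
    let mut columns: Vec<String> = Vec::new();
    for v in vars {
        if !columns.contains(v) {
            columns.push(v.clone());
        }
    }
    let mut rows: Vec<Row> = Vec::new();
    'fact: for fact in facts {
        if fact.len() != vars.len() {
            continue; // arity mismatch: skipped in this sketch
        }
        let mut bound: HashMap<&String, &String> = HashMap::new();
        for (v, cell) in vars.iter().zip(fact) {
            if matches!(bound.get(v), Some(prev) if *prev != cell) {
                continue 'fact;
            }
            bound.insert(v, cell);
        }
        rows.push(columns.iter().map(|c| bound[c].clone()).collect());
    }
    Relation { columns, rows }
}

/// Natural join: match on shared column names, append right-only columns.
fn natural_join(left: &Relation, right: &Relation) -> Relation {
    // (left index, right index) for every shared column name.
    let shared: Vec<(usize, usize)> = left.columns.iter().enumerate()
        .filter_map(|(i, c)| right.columns.iter().position(|r| r == c).map(|j| (i, j)))
        .collect();
    let extra: Vec<usize> = (0..right.columns.len())
        .filter(|j| shared.iter().all(|(_, sj)| sj != j))
        .collect();
    let mut columns = left.columns.clone();
    columns.extend(extra.iter().map(|&j| right.columns[j].clone()));
    let mut rows = Vec::new();
    for l in &left.rows {
        for r in right.rows.iter().filter(|r| shared.iter().all(|&(i, j)| l[i] == r[j])) {
            let mut row = l.clone();
            row.extend(extra.iter().map(|&j| r[j].clone()));
            rows.push(row);
        }
    }
    Relation { columns, rows }
}

/// Evaluate nodes in the order given (no reordering); the last node is the root.
fn run(nodes: &[Node], facts: &HashMap<String, Vec<Row>>) -> Option<Relation> {
    let empty: Vec<Row> = Vec::new();
    let mut outputs: Vec<Relation> = Vec::new();
    for node in nodes {
        let out = match node {
            Node::Scan { relation, vars } => scan(facts.get(relation).unwrap_or(&empty), vars),
            Node::Join { left, right } => natural_join(outputs.get(*left)?, outputs.get(*right)?),
        };
        outputs.push(out); // every node's full output is kept in memory
    }
    outputs.pop()
}

/// Build a row (or a variable list) from string literals.
fn s(cells: &[&str]) -> Vec<String> {
    cells.iter().map(|c| c.to_string()).collect()
}

fn main() {
    let mut facts = HashMap::new();
    let edges = vec![s(&["node:1", "node:2", "edge:1"]), s(&["node:2", "node:3", "edge:2"])];
    facts.insert("edge".to_string(), edges);
    let nodes = vec![
        Node::Scan { relation: "edge".into(), vars: s(&["a", "b", "_w0_2"]) },
        Node::Scan { relation: "edge".into(), vars: s(&["b", "c", "_w1_2"]) },
        Node::Join { left: 0, right: 1 },
    ];
    let root = run(&nodes, &facts).expect("plan has a root");
    println!("{:?} {:?}", root.columns, root.rows);
}
```

With these illustrative facts the join keeps a single row, and both wildcard columns (`_w0_2`, `_w1_2`) are still present in the root relation, since nothing in the sketch projects them away.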