Hassan Abedi 1ec1ab9adc Add a notefile for Geomerge DBSP integration

2026-05-08 09:21:50 +02:00

14 KiB

Raw Permalink Blame History

Geomerge and DBSP Integration

A design note on integrating Leo Stewen's DBSP/Datalog work with Vincent Liu's Geomerge storage engine for storage-level constraint checking in Geolog.

Short Answer

Leo's work can be integrated with Geomerge, but the first integration should be narrow.

The useful target is incremental validation for the Datalog-shaped subset of Geolog laws. Geomerge already stores compiled theories, tables, rows, and laws. Leo's work provides a model for turning relational rules into an incremental DBSP circuit. The bridge would translate selected Geomerge laws into violation queries maintained by DBSP.

This would not yet be a full Geolog chase engine. It would not automatically generate existential witnesses, branch over disjunctions, or perform equality saturation. It would make Geomerge's current validation path incremental for laws that can be expressed as relational constraints.

Current Geomerge Behavior

Geomerge currently behaves like a storage engine plus validator.

It loads a compiled FlatTheory, creates tables from the theory, accepts row additions, validates primary keys, checks laws, and either commits or rejects the transaction. The transaction pattern is:

Clone the current store into a preview store.
Apply proposed row additions to the preview.
Check schemas, primary keys, and laws.
Commit the preview only if validation succeeds.
Leave the original store unchanged on failure.

This is the right semantic shape for storage-level constraint checking. The store is not allowed to enter an invalid state.

The limitation is that law checking is currently direct and non-incremental. The law validator binds antecedent variables by scanning tables and then checks whether the consequent holds for each binding. This is acceptable for small examples but will become expensive as facts and laws grow.

Current Leo DBSP Behavior

Leo's implementation behaves like an incremental Datalog query engine.

It parses a custom Datalog dialect, analyzes predicate dependencies, translates rules into a relational intermediate representation, builds a DBSP circuit, accepts input changes, and emits output changes.

The important part for Geolog is not the CRDT examples themselves. The important part is the pipeline:

rules over relations
-> relational operators
-> DBSP circuit
-> maintained output deltas

This is the missing performance model for Geomerge's law checking. Instead of rescanning all relevant tables after each transaction, the system could feed transaction deltas into a maintained constraint-checking circuit.

Integration Goal

The first integration goal should be:

incremental violation detection for selected Geomerge laws

The goal should not initially be:

full Geolog evaluation
full chase
automatic witness generation
disjunctive model search
replacement of Geomerge's storage engine
replacement of Geolog's compiler or IR

This scope matters. A small successful integration would already answer the main architectural question: can compiled Geolog tables and laws be maintained as incremental relational checks?

Key Idea

Every supported law can be compiled into a violation relation.

Suppose a law has this shape:

antecedent(x) |- consequent(x)

The DBSP checker can maintain:

required_consequent(x) :- antecedent(x).
violation(x) :- required_consequent(x), not consequent(x).

Then the storage engine only needs to ask:

is the violation relation empty after this transaction?

If it is empty, the transaction is valid for that law subset. If it is non-empty, the transaction is rejected, and the violation rows can be shown to the user.

This turns law checking into ordinary incremental view maintenance.

Example: Foreign-Key Laws

Foreign-key-style laws are the easiest first target.

In Geomerge's compiled theories, a row may imply that referenced rows must exist. For example, an edge row may require its source and target vertex rows to exist.

The law shape is:

G.E(graph, src, dst) |- G.V(graph, src), G.V(graph, dst)

The violation query can be split into two checks:

required_src(graph, src) :- G.E(graph, src, dst).
missing_src(graph, src) :- required_src(graph, src), not G.V(graph, src).

required_dst(graph, dst) :- G.E(graph, src, dst).
missing_dst(graph, dst) :- required_dst(graph, dst), not G.V(graph, dst).

The combined violation relation is the union of missing_src and missing_dst.

This is exactly the kind of work DBSP should be good at: maintaining joins, projections, antijoins, and differences as rows are inserted.

Example: Totality Laws

Totality laws are also plausible, but they need careful interpretation.

A function table with a primary key says that inputs determine an output. A totality law says that when the required input rows exist, a corresponding function row must exist.

In validation mode, missing rows are violations:

required_f(x) :- input_condition(x).
missing_f(x) :- required_f(x), not f(x, _).

In chase mode, missing rows might instead cause witness generation:

input_condition(x) |- exists y. f(x, y)

The incremental DBSP checker can handle the validation interpretation. It can maintain missing_f. It should not initially be responsible for generating y.

This distinction is central. The first integration should treat totality as a constraint, not as an automatic construction rule.

Example: Equality Laws

Equality laws can also be handled at the validation level.

If a law concludes:

x = y

then the checker can maintain a violation whenever the antecedent holds and the bound values differ.

In Datalog-style terms:

equality_obligation(x, y) :- antecedent(x, y).
equality_violation(x, y) :- equality_obligation(x, y), x != y.

This is not equality saturation. It does not merge terms, rewrite facts, or maintain equivalence classes. It only reports that the current store violates a required equality.

That is still useful at the storage-engine level. It matches Geomerge's current validator behavior, where unsatisfied equality is a law violation.

Supported First Subset

The first integration should support laws whose compiled form uses:

atoms
conjunction
equality
variables
literals
entity references
primitive values
projections from table columns into rule variables

The corresponding relational operators are:

projection
selection
equijoin
antijoin
union
difference
distinct

Positive recursion can be added after the non-recursive path works, especially for transitive-closure-shaped tables.

Excluded First Subset

The first integration should explicitly exclude:

disjunctive consequents
existential witness generation
automatic insertion of missing consequent rows
equality-class maintenance
tuple columns
record construction
term projection if it cannot be lowered to simple column access
branch or world tracking
deletion semantics, unless the storage layer adds row removal

These features are not unimportant. They are just the wrong first target because they require Geolog semantics beyond ordinary Datalog-style validation.

Proposed Architecture

The integration can be organized around a new constraint-checking component:

Geomerge Store
  owns schemas, rows, transactions, and persistence

Constraint Compiler
  translates supported FlatTheory laws into DBSP plans

DBSP Constraint Runtime
  owns input handles, output handles, and maintained violation relations

Transaction Adapter
  converts Geomerge row additions into DBSP input deltas

The write path becomes:

begin transaction
-> validate row shapes and primary keys in Geomerge
-> apply row additions to preview store
-> feed row deltas to preview constraint runtime
-> step DBSP circuit
-> inspect violation output
-> commit store and checker if no violations
-> otherwise reject transaction

The preview aspect is important. Geomerge currently gets atomicity by checking a cloned preview store. An incremental checker needs a matching preview or rollback story.

State Management Options

There are three plausible ways to handle transaction atomicity.

Rebuild on Preview

For each transaction, clone or rebuild a checker from the preview store and run it.

This is simplest but defeats much of the point of incrementality. It may still be useful for early testing.

Clone Checker State

Keep the DBSP runtime state cloneable, then apply deltas to a preview checker and commit the preview on success.

This matches Geomerge's current preview-store model, but it depends on whether the DBSP circuit state can be cloned or snapshotted cleanly.

Compensating Deltas

Apply transaction deltas to the live checker, inspect violations, and if validation fails, apply inverse deltas to roll back.

This may fit DBSP's weighted-change model, but it requires careful discipline. The storage transaction and checker transaction must remain in sync.

The first prototype should probably use compensating deltas or a separate throwaway checker, depending on how easy DBSP state management is in practice.

Input Representation

Each Geomerge table can become one DBSP input relation.

The row representation needs a stable mapping:

Geomerge table path -> DBSP relation name
Geomerge row id -> scalar id value
Geomerge cell values -> tuple fields

Entity references are the main point to watch. Geomerge currently stores entity references as generated row ids. That is fine inside one store, but it may not be enough for replicated or versioned storage. If Geomerge later moves to stable globally meaningful ids, the DBSP adapter should follow that representation.

For a first local prototype, row ids can be encoded as unsigned integers in DBSP tuples.

Output Representation

The checker should expose one combined violation relation at first.

A row should include enough information to explain the violation:

law_id
violation_kind
table_or_consequent
bound_variable_values

This avoids Leo's current one-output limitation becoming a blocker. Instead of needing one DBSP output per law, every compiled law can emit rows into a shared violations relation.

Later, the system can split outputs or attach richer provenance.

Incremental vs Snapshot Checking

Geomerge currently validates the whole store after a write. That gives a simple correctness story.

An incremental checker should preserve the same external behavior:

if the store would be valid under full checking, incremental checking should accept
if the store would be invalid under full checking, incremental checking should reject

The first integration should therefore run both checkers side by side in tests:

Geomerge existing check_laws result == DBSP violation-empty result

This makes the incremental checker a performance optimization first, not a semantic change.

Test Strategy

The first tests should use Geomerge's existing fixture theories.

For the paths fixture:

valid graph, vertex, and edge insertions should produce no violations
inserting a graph without required G0 and G1 rows should produce violations
inserting an edge with a missing vertex should produce violations
persisted and reloaded data should hydrate the checker to the same state

For the ssa fixture:

start with laws whose consequents are atoms or equalities
compare existing validator results against DBSP violation outputs
defer unsupported laws with a clear unsupported-feature report

The test oracle should initially be Geomerge's existing validator.

Main Risks

The first risk is semantic drift. If the adapter silently changes the meaning of a Geolog law, the storage engine may accept invalid data or reject valid data.

The second risk is unsupported features. The adapter must explicitly report unsupported law forms rather than partially compiling them.

The third risk is transaction rollback. Store state and DBSP state must not diverge after a failed transaction.

The fourth risk is output explainability. A boolean "invalid" result is not enough. The user needs to know which law failed and which facts caused it.

The fifth risk is overfitting to Leo's Datalog syntax. The better integration point may be Leo's relational IR or DBSP circuit-building layer, not Datalog source strings.

Current Observation

The strongest current observation is that the two systems meet at the relational layer.

Geomerge has:

tables
typed columns
row ids
laws as atoms, equalities, and conjunctions
a current validator

Leo's implementation has:

input relations
tuple values
joins
antijoins
selections
projections
fixed-point iteration
output deltas

That overlap is enough to justify a prototype.

The gap is full Geolog semantics. DBSP can help maintain relational consequences, but it does not by itself decide how to handle existential witnesses, disjunctions, equality saturation, or model branching.

Bottom Line

Integrating Leo's work into Geomerge is plausible and useful if framed as incremental storage-level validation.

The first milestone should be:

compile supported Geomerge laws into one maintained DBSP violation relation

The success criterion should be:

for supported laws, DBSP validation agrees with Geomerge's existing check_laws

After that works, the next question is whether the integration should grow toward a richer chase runtime or remain a fast validator for the storage engine.

Changelog

May 8, 2026 -- First version created from local inspection of Geomerge and Leo's DBSP/Datalog implementation.

14 KiB Raw Permalink Blame History