14 KiB
Geomerge and DBSP Integration
A design note on integrating Leo Stewen's DBSP/Datalog work with Vincent Liu's Geomerge storage engine for storage-level constraint checking in Geolog.
Short Answer
Leo's work can be integrated with Geomerge, but the first integration should be narrow.
The useful target is incremental validation for the Datalog-shaped subset of Geolog laws. Geomerge already stores compiled theories, tables, rows, and laws. Leo's work provides a model for turning relational rules into an incremental DBSP circuit. The bridge would translate selected Geomerge laws into violation queries maintained by DBSP.
This would not yet be a full Geolog chase engine. It would not automatically generate existential witnesses, branch over disjunctions, or perform equality saturation. It would make Geomerge's current validation path incremental for laws that can be expressed as relational constraints.
Current Geomerge Behavior
Geomerge currently behaves like a storage engine plus validator.
It loads a compiled FlatTheory, creates tables from the theory, accepts row additions, validates primary keys, checks laws, and either commits or rejects the transaction. The transaction pattern is:
- Clone the current store into a preview store.
- Apply proposed row additions to the preview.
- Check schemas, primary keys, and laws.
- Commit the preview only if validation succeeds.
- Leave the original store unchanged on failure.
This is the right semantic shape for storage-level constraint checking. The store is not allowed to enter an invalid state.
The limitation is that law checking is currently direct and non-incremental. The law validator binds antecedent variables by scanning tables and then checks whether the consequent holds for each binding. This is acceptable for small examples but will become expensive as facts and laws grow.
Current Leo DBSP Behavior
Leo's implementation behaves like an incremental Datalog query engine.
It parses a custom Datalog dialect, analyzes predicate dependencies, translates rules into a relational intermediate representation, builds a DBSP circuit, accepts input changes, and emits output changes.
The important part for Geolog is not the CRDT examples themselves. The important part is the pipeline:
rules over relations
-> relational operators
-> DBSP circuit
-> maintained output deltas
This is the missing performance model for Geomerge's law checking. Instead of rescanning all relevant tables after each transaction, the system could feed transaction deltas into a maintained constraint-checking circuit.
Integration Goal
The first integration goal should be:
incremental violation detection for selected Geomerge laws
The goal should not initially be:
- full Geolog evaluation
- full chase
- automatic witness generation
- disjunctive model search
- replacement of Geomerge's storage engine
- replacement of Geolog's compiler or IR
This scope matters. A small successful integration would already answer the main architectural question: can compiled Geolog tables and laws be maintained as incremental relational checks?
Key Idea
Every supported law can be compiled into a violation relation.
Suppose a law has this shape:
antecedent(x) |- consequent(x)
The DBSP checker can maintain:
required_consequent(x) :- antecedent(x).
violation(x) :- required_consequent(x), not consequent(x).
Then the storage engine only needs to ask:
is the violation relation empty after this transaction?
If it is empty, the transaction is valid for that law subset. If it is non-empty, the transaction is rejected, and the violation rows can be shown to the user.
This turns law checking into ordinary incremental view maintenance.
Example: Foreign-Key Laws
Foreign-key-style laws are the easiest first target.
In Geomerge's compiled theories, a row may imply that referenced rows must exist. For example, an edge row may require its source and target vertex rows to exist.
The law shape is:
G.E(graph, src, dst) |- G.V(graph, src), G.V(graph, dst)
The violation query can be split into two checks:
required_src(graph, src) :- G.E(graph, src, dst).
missing_src(graph, src) :- required_src(graph, src), not G.V(graph, src).
required_dst(graph, dst) :- G.E(graph, src, dst).
missing_dst(graph, dst) :- required_dst(graph, dst), not G.V(graph, dst).
The combined violation relation is the union of missing_src and missing_dst.
This is exactly the kind of work DBSP should be good at: maintaining joins, projections, antijoins, and differences as rows are inserted.
Example: Totality Laws
Totality laws are also plausible, but they need careful interpretation.
A function table with a primary key says that inputs determine an output. A totality law says that when the required input rows exist, a corresponding function row must exist.
In validation mode, missing rows are violations:
required_f(x) :- input_condition(x).
missing_f(x) :- required_f(x), not f(x, _).
In chase mode, missing rows might instead cause witness generation:
input_condition(x) |- exists y. f(x, y)
The incremental DBSP checker can handle the validation interpretation. It can maintain missing_f. It should not initially be responsible for generating y.
This distinction is central. The first integration should treat totality as a constraint, not as an automatic construction rule.
Example: Equality Laws
Equality laws can also be handled at the validation level.
If a law concludes:
x = y
then the checker can maintain a violation whenever the antecedent holds and the bound values differ.
In Datalog-style terms:
equality_obligation(x, y) :- antecedent(x, y).
equality_violation(x, y) :- equality_obligation(x, y), x != y.
This is not equality saturation. It does not merge terms, rewrite facts, or maintain equivalence classes. It only reports that the current store violates a required equality.
That is still useful at the storage-engine level. It matches Geomerge's current validator behavior, where unsatisfied equality is a law violation.
Supported First Subset
The first integration should support laws whose compiled form uses:
- atoms
- conjunction
- equality
- variables
- literals
- entity references
- primitive values
- projections from table columns into rule variables
The corresponding relational operators are:
- projection
- selection
- equijoin
- antijoin
- union
- difference
- distinct
Positive recursion can be added after the non-recursive path works, especially for transitive-closure-shaped tables.
Excluded First Subset
The first integration should explicitly exclude:
- disjunctive consequents
- existential witness generation
- automatic insertion of missing consequent rows
- equality-class maintenance
- tuple columns
- record construction
- term projection if it cannot be lowered to simple column access
- branch or world tracking
- deletion semantics, unless the storage layer adds row removal
These features are not unimportant. They are just the wrong first target because they require Geolog semantics beyond ordinary Datalog-style validation.
Proposed Architecture
The integration can be organized around a new constraint-checking component:
Geomerge Store
owns schemas, rows, transactions, and persistence
Constraint Compiler
translates supported FlatTheory laws into DBSP plans
DBSP Constraint Runtime
owns input handles, output handles, and maintained violation relations
Transaction Adapter
converts Geomerge row additions into DBSP input deltas
The write path becomes:
begin transaction
-> validate row shapes and primary keys in Geomerge
-> apply row additions to preview store
-> feed row deltas to preview constraint runtime
-> step DBSP circuit
-> inspect violation output
-> commit store and checker if no violations
-> otherwise reject transaction
The preview aspect is important. Geomerge currently gets atomicity by checking a cloned preview store. An incremental checker needs a matching preview or rollback story.
State Management Options
There are three plausible ways to handle transaction atomicity.
Rebuild on Preview
For each transaction, clone or rebuild a checker from the preview store and run it.
This is simplest but defeats much of the point of incrementality. It may still be useful for early testing.
Clone Checker State
Keep the DBSP runtime state cloneable, then apply deltas to a preview checker and commit the preview on success.
This matches Geomerge's current preview-store model, but it depends on whether the DBSP circuit state can be cloned or snapshotted cleanly.
Compensating Deltas
Apply transaction deltas to the live checker, inspect violations, and if validation fails, apply inverse deltas to roll back.
This may fit DBSP's weighted-change model, but it requires careful discipline. The storage transaction and checker transaction must remain in sync.
The first prototype should probably use compensating deltas or a separate throwaway checker, depending on how easy DBSP state management is in practice.
Input Representation
Each Geomerge table can become one DBSP input relation.
The row representation needs a stable mapping:
Geomerge table path -> DBSP relation name
Geomerge row id -> scalar id value
Geomerge cell values -> tuple fields
Entity references are the main point to watch. Geomerge currently stores entity references as generated row ids. That is fine inside one store, but it may not be enough for replicated or versioned storage. If Geomerge later moves to stable globally meaningful ids, the DBSP adapter should follow that representation.
For a first local prototype, row ids can be encoded as unsigned integers in DBSP tuples.
Output Representation
The checker should expose one combined violation relation at first.
A row should include enough information to explain the violation:
law_id
violation_kind
table_or_consequent
bound_variable_values
This avoids Leo's current one-output limitation becoming a blocker. Instead of needing one DBSP output per law, every compiled law can emit rows into a shared violations relation.
Later, the system can split outputs or attach richer provenance.
Incremental vs Snapshot Checking
Geomerge currently validates the whole store after a write. That gives a simple correctness story.
An incremental checker should preserve the same external behavior:
- if the store would be valid under full checking, incremental checking should accept
- if the store would be invalid under full checking, incremental checking should reject
The first integration should therefore run both checkers side by side in tests:
Geomerge existing check_laws result == DBSP violation-empty result
This makes the incremental checker a performance optimization first, not a semantic change.
Test Strategy
The first tests should use Geomerge's existing fixture theories.
For the paths fixture:
- valid graph, vertex, and edge insertions should produce no violations
- inserting a graph without required
G0andG1rows should produce violations - inserting an edge with a missing vertex should produce violations
- persisted and reloaded data should hydrate the checker to the same state
For the ssa fixture:
- start with laws whose consequents are atoms or equalities
- compare existing validator results against DBSP violation outputs
- defer unsupported laws with a clear unsupported-feature report
The test oracle should initially be Geomerge's existing validator.
Main Risks
The first risk is semantic drift. If the adapter silently changes the meaning of a Geolog law, the storage engine may accept invalid data or reject valid data.
The second risk is unsupported features. The adapter must explicitly report unsupported law forms rather than partially compiling them.
The third risk is transaction rollback. Store state and DBSP state must not diverge after a failed transaction.
The fourth risk is output explainability. A boolean "invalid" result is not enough. The user needs to know which law failed and which facts caused it.
The fifth risk is overfitting to Leo's Datalog syntax. The better integration point may be Leo's relational IR or DBSP circuit-building layer, not Datalog source strings.
Current Observation
The strongest current observation is that the two systems meet at the relational layer.
Geomerge has:
- tables
- typed columns
- row ids
- laws as atoms, equalities, and conjunctions
- a current validator
Leo's implementation has:
- input relations
- tuple values
- joins
- antijoins
- selections
- projections
- fixed-point iteration
- output deltas
That overlap is enough to justify a prototype.
The gap is full Geolog semantics. DBSP can help maintain relational consequences, but it does not by itself decide how to handle existential witnesses, disjunctions, equality saturation, or model branching.
Bottom Line
Integrating Leo's work into Geomerge is plausible and useful if framed as incremental storage-level validation.
The first milestone should be:
compile supported Geomerge laws into one maintained DBSP violation relation
The success criterion should be:
for supported laws, DBSP validation agrees with Geomerge's existing check_laws
After that works, the next question is whether the integration should grow toward a richer chase runtime or remain a fast validator for the storage engine.
Changelog
- May 8, 2026 -- First version created from local inspection of Geomerge and Leo's DBSP/Datalog implementation.