storage-engine-playground/notes/adapter/02-geolog-lang-findings.md
2026-05-29 11:28:39 +02:00

9.3 KiB

Geolog Lang Findings

Source inspected: local reference checkout at tmp/geolog/geolog-lang.

Summary

geolog-lang is a Haskell prototype for a Geolog source language. It parses .glog theory files, elaborates them into a typed core language, lowers a selected Main theory into a flat relational representation, and includes an in-memory relation engine with law checking and a conjunctive-query planner.

For this repository, the useful reference is not the Haskell implementation itself. The useful reference is the shape of the pipeline:

Geolog theory
-> typed core elaboration
-> flat relational theory
-> tables, generated laws, and validation queries
-> in-memory execution and query planning experiments

Package Shape

The package exposes modules for:

  • core syntax, values, evaluation, quoting, and equality checking
  • notation, lexer, parser configuration, and diagnostics
  • elaboration from parsed notation into typed core entries
  • lowering from the typed Main theory into a flat relational theory
  • relational IR for tables, atoms, propositions, and laws
  • an in-memory database and a conjunctive-query planner

The test suite combines golden tests for .glog elaboration and lowering with unit and property tests for the in-memory database and planner.

Source Language Shape

The examples use declarations such as:

theory Main := sig
  Node : Set
  Edge : Node -> Node -> Set
end

The notation supports theory, def, set, let, open, import, Set, Prop, Int, String, Inductive, pure, init, ->, *->, and =>.

The source language is closer to a dependent typed theory language than to ordinary Datalog. Relations, functions, records, equality, and inductive constructions are represented in the source and then selectively lowered.

Lowered Relational IR

The lowered IR has these main concepts:

  • Path: dotted table or law names represented as path components
  • Table: column types and an optional primary key
  • Atom: table reference, optional row identity term, and positional value terms
  • Prop: atom, equality, conjunction, or disjunction
  • Law: universally quantified variables, an antecedent proposition, and a consequent proposition
  • FlatTheory: maps of tables and laws

For the graph example above, lowering produces a Node table, an Edge table, and an Edge.foreignKeys law. The law says that every edge row implies the existence of both endpoint node rows.

This is directly relevant to Geomerge-style validation experiments. A high-level declaration can compile into maintained violation checks by turning generated laws into queries:

required_consequent(x) :- antecedent(x).
violation(x) :- required_consequent(x), not consequent(x).

Generated Law Patterns

Two generated law shapes matter for this playground:

  • foreign-key laws: rows in one table require referenced rows in another table
  • totality laws: functional relations require an output for each valid input

Foreign-key laws are generated from dependent arguments that reference prior sets or relations. Totality laws are generated for functional table shapes.

The current lowering appears intentionally narrow. Unsupported shapes are reported or panic in prototype code rather than being treated as complete semantics. That is useful for this repo because early adapters should also make supported and unsupported cases explicit.

In-Memory Execution

The in-memory database stores:

  • ground values: integers, text, and entity identities
  • relations: set-valued tuples with secondary indexes by column and value
  • bindings: rows of variable assignments

Query evaluation supports atoms and conjunctions. Conjunctions are evaluated with natural join. Law checking evaluates a law antecedent to produce witness bindings, then checks the consequent for each witness. Violations carry the law path and witness values.

This gives a simple snapshot oracle shape for validation experiments:

facts + generated law
-> antecedent witnesses
-> consequent checks
-> violation rows

Storage Engine

geolog-lang uses a custom in-memory storage engine, not SQLite, Postgres, or another external database. The storage module is Geolog.DB.InMemory.

The core storage structures are:

  • DB: schema, relations, and next entity IDs
  • Relation: fixed-arity tuple set plus a secondary index
  • Val: stored values, including integers, text, and entity identities
  • Bindings: query result rows as variable-to-value maps

The relation storage shape is:

Relation {
  arity: tuple width,
  members: set of tuples,
  index: map from (column position, value) to matching tuples
}

Inserts add a tuple to the members set and update the secondary index for every (column, value) pair in the tuple. Atom evaluation uses literal terms to probe those secondary indexes, intersects candidate tuple sets when multiple literal constraints are present, and then produces variable bindings.

Conjunction evaluation has two paths:

  • naive evaluation: evaluate each atom, then natural-join the resulting bindings
  • planned evaluation: use Geolog.DB.Plan to build a semijoin-based join plan, then evaluate that plan against the same in-memory relations

The checked insert path inserts into a temporary updated database and checks generated laws such as foreignKeys. If the law check finds violations, the updated database is not returned. This gives rollback-like behavior at the API boundary, but it is not a transactional storage engine with durable logs or multi-step rollback.

Storage API

geolog-lang does not define a storage abstraction or backend trait. It directly uses the concrete in-memory API in Geolog.DB.InMemory.

The minimal storage API it needs is:

schema registration:
  fromTheory :: FlatTheory -> DB

entity allocation:
  freshEntity :: Path -> DB -> (Val, DB)

writes:
  insertRow :: Path -> [Val] -> DB -> DB
  checkedInsertRow :: FlatTheory -> Path -> [Val] -> DB -> Either InsertError DB

reads:
  evalAtom :: DB -> QAtom -> Either DBError Bindings
  evalConjunction :: DB -> [QAtom] -> Either DBError Bindings

validation:
  checkLaw :: DB -> Path -> Law -> Either DBError [LawViolation]

The planner also depends on planned conjunction evaluation:

planConjunction :: [QAtom] -> JoinPlan
evalRootedPlan :: DB -> JoinPlan -> Either DBError Bindings
evalConjunctionPlanned :: DB -> [QAtom] -> Either DBError Bindings

For this repository, an adapter-facing storage interface should probably provide:

  1. flat schema registration from lowered tables,
  2. fixed-arity tuple insertion into named relations,
  3. stable entity ID allocation per table path,
  4. tuple lookup by table and positional constraints,
  5. variable binding output for query atoms,
  6. natural join and semijoin support, or enough scans and index probes for the planner to implement them,
  7. generated-law checking with structured violations, and
  8. checked insert behavior that rejects or rolls back invalid writes.

This API is snapshot-oriented. It is not yet a durable transaction API, an incremental DBSP API, or a general SQL-like storage layer.

Query Planning

The planner targets conjunctive queries over the in-memory database. It:

  1. evaluates each atom independently,
  2. builds an atom intersection graph weighted by shared variables,
  3. computes a maximum spanning forest,
  4. performs bottom-up and top-down semijoin reduction, and
  5. performs full joins along the forest.

The tests compare planned conjunction evaluation against naive conjunction evaluation on generated graph workloads. This is a useful test pattern for this repository: every planned execution path should have a simple snapshot or naive oracle while the planner is still experimental.

Adapter Takeaways

The practical adapter direction is:

  1. Model a small FlatTheory-like IR in Rust before implementing a full Geolog frontend.
  2. Start with generated foreign-key and totality laws because they have clear violation semantics.
  3. Represent violation outputs with law identity, violation kind, relation or consequent identity, and bound variable values.
  4. Keep source elaboration, relational lowering, planning, and execution as separate modules.
  5. Add textual plan output early so planner behavior is inspectable.
  6. Use a naive snapshot evaluator as the oracle for planned or maintained execution.

Open Questions

  • How much of Geolog source syntax should this repo parse directly versus using hand-built fixtures first?
  • Should equality in consequents be supported before disjunction, existential witnesses, or recursive laws?
  • Should totality validation use explicit missing-output violations or a generated required-output relation?
  • How should entity identities map onto this repo's storage transaction and rollback model?
  • Which law subset is enough to test Geomerge-style maintained violation relations without building a full chase engine?

Next Experiment Shape

A small vertical slice for this repository could be:

input:
  table node(id)
  table edge(src, dst, id)
  law edge_foreign_keys:
    edge(src, dst, id) -> node(src) and node(dst)

snapshot output:
  violation rows for missing src or dst nodes

planned output:
  textual antijoin plan for edge against node(src) and node(dst)

That would test catalog construction, generated-law representation, antijoin planning, deterministic violation rows, and a snapshot oracle without requiring the rest of Geolog.