This document is copied from https://git.sgai.uk/creators/geolog/-/wikis/Geolog%20storage%20engine%20meeting%202026-04-21 on May 7, 2026.
## Discussion points

- Materialising the current state for efficient constraint checking and efficient querying
- Special-casing certain types of theories, such as hash consing for syntax trees, or arrays for linear orders
- Exposing an API that looks like the theory definitions, not like the IR
- Exposing to various programming languages via FFI
- DBSP for incremental constraint checking? Get Leo Stewen involved?
- Hooking into Soufflé for general querying / testing program analysis use cases?

## IR for initial models + storage layer for it (egraph cache)

### Initial models

- Initial models are often infinite, and not data. How should they show up in the op log?
- Use an SSA or stack-based program where each instruction is a term constructor. This is both what you extract from an egraph and something that can be interpreted back into an egraph.
- Example: imagine we're working in the free ring whose generators are all strings. We have a couple of term constructors: given a string, we can construct an element of the ring; and then we have 0, 1, multiplication, and addition. There's a term constructor for terms in the free ring. If a term tree has common subtrees, we want a DAG representation. Simplest approach: topologically sort it and store it in an array with backwards indices. That's essentially what SSA is: a sequence of instructions that may reference previous values.
- Fundamental operation: given two EIDs, give me the EID of the term obtained by applying this term constructor to those EIDs. Given an egraph and an SSA-style program, you can interpret the SSA-style program into the egraph.
- Each term constructor may have a different number of arguments. A separate table per term constructor would mean a lot of tables. Maybe one table per arity?
- Think of this as a serialised form of an egraph? It stores many different terms concurrently; a term is just an index into a giant table.
- Could have a *defined* term constructor. E.g. in a ring, we could define a function of 4 arguments to be a big complicated polynomial. It is more efficient to store this as a single operation than as a collection of operations.
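The SSA-with-hash-consing idea above can be sketched concretely. This is illustrative Python, not the engine's actual API; the class and method names are made up for the example:

```python
# A hash-consed, SSA-style term store. Each instruction applies a term
# constructor to the EIDs of previously stored terms; identical
# applications are deduplicated, so shared subtrees are stored once.

class TermStore:
    def __init__(self):
        self.instructions = []   # index into this list = the term's EID
        self.memo = {}           # (constructor, arg EIDs) -> existing EID

    def apply(self, constructor, *arg_eids):
        """The fundamental operation: given a constructor and argument
        EIDs, return the EID of the resulting term, reusing an existing
        EID if the same application was seen before (hash consing)."""
        key = (constructor, arg_eids)
        if key not in self.memo:
            self.memo[key] = len(self.instructions)
            self.instructions.append(key)
        return self.memo[key]

# Terms in the free ring over strings: generators are tagged with their
# string; "mul" and "add" take EIDs of previously built terms.
store = TermStore()
x = store.apply(("str", "x"))
y = store.apply(("str", "y"))
xy = store.apply("mul", x, y)
expr = store.apply("add", xy, xy)       # (x*y) + (x*y): x*y stored once

assert store.apply("mul", x, y) == xy   # deduplicated, no new instruction
assert len(store.instructions) == 4
```

Interpreting such a program into an egraph is then a single forward pass: each instruction's argument EIDs point backwards, as in SSA.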
- From the perspective of commits and storage, it could be done either way. It might make sense to sometimes put derived operations in the oplog, if they are compact to store.
- In the type theory, initial models are in a general context. E.g. transitive closure for a graph: for all graphs, it gives you an initial model, which is the transitive closure. In the IR, have a lazy monomorphisation: declare a bunch of tables in the IR; some of the constraints on those tables mention certain initial models (the transitive closure of this graph, or the term trees on this set of variables). One might imagine emitting auxiliary declarations in terms of the main tables of the schema.
- Initial models would not use geometric sequents, as those are too general; they are restricted to a conjunction implying another conjunction. This fragment could be given to a Datalog/egglog engine, containing only relation constructors but not term constructors.
- The IR is still a set of tables that are declared to be an initial model, plus term constructors and what they are supposed to act upon.
- James's current IR is fully relational. Would we want to preserve functions in the IR? Functions have types.
- James: several tables sharing a single rowId space?

## Transitive closure/DAG example

```
theory Graph := sig
  V: Set
  E: V -> V -> Set
end
```
```
def TransClosure (G*: Graph) :=
  (init sig
    r: G.V -> G.V -> Prop
    _: r ?a ?a
    _: r ?a ?b -> G.E ?b ?c -> r ?a ?c
  end).r
```
```
theory DAG := sig
  G: Graph
  tc := TransClosure G
  _: tc ?a ?b -> tc ?b ?a -> ?a = ?b
end
```

translates into IR:

```
V: TABLE []
E: TABLE [V, V]
r: FREETABLE [V, V]
refl: RULE [(v: V) -> r v v]
trans: RULE [...]
```

- You can't insert into a FREETABLE; it can only be computed using rules. Every element added to a FREETABLE has a provenance: produced by refl, or produced by trans, or produced by multiple rules and later shown to be equal.
- Say we want to check that an insertion preserves the acyclicity of the graph. At any given moment, the FREETABLE may not yet be fully computed. Either run refl and trans until completion (a fixed point), or turn it into a graph traversal.
- Props don't need to be stored. If we had r: G.V -> G.V -> Set instead of r: G.V -> G.V -> Prop, that would represent paths through the graph.
- While you can't insert into a free table, other tables could refer to elements of a free table. In that case, the derived free table would have to be materialised as a commit. This is where we could use an SSA-like construction: each element of the table can be derived through a finite list of invocations of rules. These might have IDs that are just local to the commit, not globally unique? But if we want to refer to IDs in another commit, just make them global. Use hash consing.
- Concurrently created terms could use different IDs to refer to the same term. We could imagine a merging process where we extract all the terms we care about, run an egraph over them, and then re-encode them as a sequence of operations with fresh IDs.
- How do you compare elements of the free model across commits, if they have different IDs? We could walk backwards through the SSA tables of both commits, pull both out into a shared egraph, and then compare them in the egraph. But when you extract a term, you get out the term you originally put in.
- The storage engine should concern itself only with hash consing, not the egraph.
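The "run refl and trans until completion" option can be sketched as a naive fixpoint loop. This is illustrative Python, not the planned engine; a real implementation would use semi-naive evaluation so that only newly derived facts are re-joined on each round:

```python
# Computing the FREETABLE r from the TransClosure example by iterating
# the two rules to a fixed point.

def transitive_closure(vertices, edges):
    """edges is a set of (a, b) pairs; returns the relation r."""
    r = {(v, v) for v in vertices}            # refl: r ?a ?a
    changed = True
    while changed:
        changed = False
        # trans: r ?a ?b -> E ?b ?c -> r ?a ?c
        for (a, b) in list(r):
            for (b2, c) in edges:
                if b == b2 and (a, c) not in r:
                    r.add((a, c))
                    changed = True
    return r

r = transitive_closure({1, 2, 3}, {(1, 2), (2, 3)})
assert (1, 3) in r and (3, 1) not in r   # acyclic: no reverse path derived
```

The DAG axiom `tc ?a ?b -> tc ?b ?a -> ?a = ?b` would then be checked against the materialised `r` after the fixed point is reached.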
- The storage engine may, however, need to deal with proofs of equality extracted from the egraph, because those proofs might be expensive to discover, so we'd want to store them. The witness of such a proof would be a bunch of rule invocations, transitivity, congruence, etc. (A truncated version of alifib?)
- There might be laws in the IR that use equality. Making equality just another stored relation is perhaps not a good idea; make it a built-in concept of the storage engine that performs congruence and transitivity out of the box?
- You can refer to an equality via the set of rewrite rules that you have applied to the database. We have efficient representations of sets (e.g. Merkle trees). If you want to record an equality between two terms, you have a hash that refers to a set of rule invocations; whenever you need to materialise it, you run those rules in your egraph copy.
- You could have a table where one of the columns is an equality proof? If Geolog is used for storing witnesses of certain properties, e.g. "this branch never gets called", store in the database the sequence of rewrite rules needed to prove that the branch condition equals false.
- Concept of computational depth: a complement to algorithmic complexity. Complexity is how many bits you need to generate a thing, no matter how long it takes. Computational depth is: given that you are constructing it from the least information possible, how long does it take? Even if things do not contain a lot of information, we might want to store them if they have a lot of computational depth. In the context of Geolog, where deciding equality is a semi-decision procedure (not guaranteed to terminate), this means the witnesses for equality need to be stored.
- We could make a content-addressed construction for term IDs, but this would take a lot more space.

## Query execution

- From a type theory point of view: if T is a theory, then a query is just a term of type q : (M : T) -> Set (given a model M of the theory, give me a set).
- Conjunctive queries are the easiest. For example, G : Graph -> [v : G.V, e : G.E v v] returns all the vertices with self-loops.
- Discussion of DBSP. Owen had heard of Differential Dataflow but not DBSP.
- In Geolog-zeta, Davidad compiled every Geolog sequent into two queries, one for the antecedent and one for the consequent, computed a set for each, and checked that one set is included in the other.
- DBSP needs to deal with positive queries (conjunctive and finitely disjunctive).
- How does this interact with the chase? Mark rules as chased or not-chased. With chased rules, compute the derived facts. With non-chased rules, it's an error if the rule is not satisfied (checked after the chase is complete).
- We want some way of special-casing certain patterns. There are more or less efficient algorithms for initial models; some of these could be inlined into DBSP.
- This is analogous to the part of a SQL database that produces a query plan, e.g. the choice between a hash join and a merge join. Similarly, we want to be able to see if an axiom has a certain shape and compile it to an appropriate DBSP circuit.
- We could do an initial DBSP integration without initial models (no transitive closure, just e.g. unions of conjunctive queries). A program logic (a deep embedding into Geolog) could probably be checked using this fragment? The axioms check the well-formedness of the application of every rule, and the well-foundedness of the proof tree; well-foundedness is the only non-local thing being checked. This would be a good use case to try: Hoare logic over imperative programs. Allow agents to suggest proof steps and get feedback on which axioms are violated.
- Hooking up to Soufflé is also a possibility. So far we've only looked at Geolog features that are also Datalog features, but soon we might go beyond that. Still, we could try exporting to textual Datalog and running it through Soufflé as a benchmarking baseline.
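To make the DBSP point concrete, here is a minimal sketch of the kind of incrementality it provides, using the standard bilinearity of the join: for a batch of new edges dE, the update to E ⋈ E is (dE ⋈ E) ∪ (E ⋈ dE) ∪ (dE ⋈ dE), so only deltas are scanned against the existing state, never the whole relation against itself. This is plain Python for illustration, not DBSP's actual API, and it ignores deletions (negative weights), which DBSP also handles:

```python
def join(r, s):
    """Compose two binary relations: {(a, c) | (a, b) in r, (b, c) in s}."""
    return {(a, c) for (a, b) in r for (b2, c) in s if b == b2}

def delta_paths2(E, dE):
    """Incremental update to the length-2 path relation E ⋈ E."""
    return join(dE, E) | join(E, dE) | join(dE, dE)

E = {(1, 2), (2, 3)}
paths2 = join(E, E)               # initial result: {(1, 3)}
dE = {(3, 4)}                     # a batch of newly inserted edges
paths2 |= delta_paths2(E, dE)     # only the delta is joined against E
E |= dE
assert paths2 == join(E, E)       # matches full recomputation
```

A DBSP circuit does the same algebra systematically for a whole query plan, which is what would make incremental constraint checking cheap after each commit.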
- Alternatively, hook into Soufflé at the ABI level.

## API

- Besides the IR, another output of the compiler needs to be FFI bindings, e.g. for JavaScript or Rust. Do Haskell first (using the compiler as a library)? Owen thinks Haskell is harder; it's easier to generate code. TypeScript is the best target for now.
- Alex suggests we use an LLM to generate ORM-like bindings from the theory definition. Owen: it's less like an ORM, more like a thin wrapper around prepared statements. Davidad: wants an interface more like ODBC, less like an ORM.
- What makes this different from a database: every node has Geolog in-process.
- Owen wants to think about annotations for which concepts to expose via FFI. Perhaps have a separate language that has an "FFI" to the Geolog language, but is restricted in itself.
- TA1.3 interface: the Geolog compiler as a "dev dependency", e.g. as a vite plugin. Run it over the query definitions and output bindings. The Geolog engine gets compiled to Wasm and reads what the bindings give it. Argument about whether to ship the compiler to the browser: since we want the compiler to output TypeScript bindings, we then need to recompile the code that uses those bindings. For people experimenting with theories this will happen frequently; in "production" use the theory will be fixed at compile time.
- Alex suggests we want a web app that will allow others to experiment with writing theories and using instances. Obsidian already have a basic version of this. Make it available as an npm library and implement the web interface as a simple vite app, to check whether we've library-ified it sufficiently.
- Another goal for TA1.3: dispense with the interchange format for Petrinaut and use Geolog instead.

Possible format for inserting data into a Geolog instance:

```
theory Graph := sig
  V : Set
  E : V -> V -> Set
end
```
```
theory Main := sig
  Graphs : Set
  G : Graphs -> Graph
  g0 : Graphs
end
```
Syntax for the data of an instance:
```
[ g0 : Graphs
  _ : [ open G g0  # brings (G g0).V and (G g0).E into scope as V and E
        v0 : V
        v1 : V
        e : E v0 v1 ] ]
```

- Make a Godbolt-style app to allow people to play with Geolog theories.

## Data structures

- Fast analyses: egraphs, making conjunctive queries fast. Versus Automerge: one oplog, multiple materialised check-outs.
- Eventually, various optimised data structures would be good. As a first pass, do something generic. Davidad: "tensors of ordered semirings".
- How often do we expect to have dense multidimensional tensors? Only really in neural networks. But a sparse tensor can implement the same trait. Build up a library of tensor implementations and let the optimiser pick one. Represent updates to the tensors via binary space partitioning. Davidad thinks nobody has done this.
- For example, represent a graph as an adjacency matrix, even if the underlying representation is actually a set of edges stored as tuples. Davidad thinks the tensor representation is good for GPUs to operate on, streaming a dense representation. E.g. transitive closure has a matrix representation, and conjunctive queries are binary tensor contractions.
- This representation can't handle strings: they have to be interned into a string store, e.g. a big append-only blob, with string values represented as offsets into that blob.
- Anything that is a B-tree index in an ordinary database (where we want to be able to do range queries on it) could be a dtype.
- Davidad argues that functions should be stored differently from tables. For example, with a graph, instead of E: V -> V -> Set there would be functions head: E -> V and tail: E -> V.

## Geolog breaking down into documents

- We need a concept of a namespace: one can define a variable "foo is an instance of theory bar"; whoever owns the theory can update the binding, but the instances and theories themselves are immutable.
- We want to be able to point at something in someone else's namespace; that would be Geolog's analogue of Automerge documents.
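One way to picture the namespace idea is mutable bindings over immutable content. This is purely a sketch of the discussion above, not a settled design; all names are hypothetical:

```python
# A namespace maps variable names to content hashes of immutable
# instances/theories; the only mutation is a version bump of a binding,
# so the namespace's history is just the list of bumps.

import hashlib
import json

def content_id(value):
    """Content-address an immutable instance or theory."""
    blob = json.dumps(value, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

class Namespace:
    def __init__(self):
        self.bindings = {}   # variable name -> content hash
        self.history = []    # (name, content hash) version bumps

    def bind(self, name, value):
        cid = content_id(value)
        self.bindings[name] = cid          # the version bump
        self.history.append((name, cid))
        return cid

ns = Namespace()
v1 = ns.bind("foo", {"theory": "Graph", "edges": [[1, 2]]})
v2 = ns.bind("foo", {"theory": "Graph", "edges": [[1, 2], [2, 3]]})
assert ns.bindings["foo"] == v2 and len(ns.history) == 2
```

Pointing into someone else's namespace is then just referencing one of their content hashes, which stays valid even as their bindings move on.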
- A commit is located in a namespace and modifies things only within that namespace. Namespaces reference each other. We want something like cargo imports?
- Each variable is its own document. You might have a collection of documents, but their contents might not be disjoint: say you have a graph, copy it a couple of times, and then use it in different instances. Logically, each variable has a completely separate history.
- A namespace has a very simple version history: each step is just a version bump to one of the variables it contains. This recognises that mutation is needed somewhere; at the namespace level there are just variables.
- Sedimentree doesn't work well for bushy histories. Our workload might have long parallel branches (work from independent agents) but probably not very wide branching.
- Partial merging might be needed: given two instances of a theory that are individually valid, we need a merge that produces the maximal subset that is still valid. Such a merge always exists (in the worst case, throw away one of the branches), but it is not necessarily unique or deterministically defined. Perhaps a cherry-pick/rebase. Perhaps an explicit merge commit that indicates which rows are excluded or added in the merged result. In general, doing a merge might be just as much work as doing the work on the individual branch, although many practical use cases (e.g. program logics) would allow easy merges (two individually proved statements can easily be combined).

## Program logic example

- String diagram for (a+b)c; we want to prove that if all the inputs are >= 0 then the output is >= 0. We have a proof rule for multiplication: if both inputs are >= 0 then the output is >= 0.
- Layers: SSA target (references to variables in the program); logical assertions about those variables; logical formulae; sequents; derivations (applications of inference rules). All of these are encoded as Geolog theories.
- What if assertions and atoms were meta-level props? That would be more in line with traditional Datalog analyses.
- But if >= were an actual term constructor in the Geolog language, that would be less flexible/extensible. Geolog formulas are not first-class objects in Geolog: if we have a proof tree that points at objects that are not themselves in the theory, that gets messy. Davidad thinks it's cleaner to do a deep embedding than a shallow one.
- Why do you need a proof tree? If you're using the Geolog metalogic (propositions in Geolog), keeping track of propositions is just part of the native logic of Geolog. Say ">= 0" were a built-in of type Wire -> Prop. The problem is that this puts the burden of proof search on the Geolog computation engine? Not necessarily: there's nothing stopping you from writing down the proof rules applied; in fact, that's what initial models are.
- There are a bunch of term constructors that produce proofs, and we can just serialise them. This is far more compact than the whole proof tree, which stores the context at each node. We are still constrained to geometric logic, though, which arbitrary proof trees would not be.
- We want a concept of non-strict axioms: they allow data violating the axioms to be committed, but with a big red flag saying that the proof is incomplete. Allow sorry as a constructor, and a no-sorries predicate that's a non-strict axiom.
- What kinds of logics do we want to use that don't embed nicely into geometric logic? James suggested a separation logic that came out a few years ago, which has some substructure.
- The proof tree approach might not work for binders. Once you have quantifiers in the embedded language, do you need to switch to De Bruijn indices or suchlike? No: represent free variables as explicit references to their binding sites. Performing substitution is still a bunch of work, though.
- Problem: if you prove something about one term, and then you rewrite (e.g. substitute) to obtain a similar term, it's a new term and the proofs don't apply to it. You can copy the proof trees, but that will get huge.
- A lot of the juice we get out of geometric logic is that conjunctive queries are fast to check, but a deep embedding throws much of that away.
- Owen thinks it would be better to extend the Geolog logic to do e.g. interesting things with real numbers, rather than try to deeply embed a theory that is as annoying as pen and paper. Davidad thinks we should try it out: it's a selling point that it's sufficient to do a deep embedding of any logic, even if it's easier and more efficient to do a shallow embedding. Importing SMT certificates needs that flexibility, for example.
- Geolog might be best for things that are combinatorial. More natively combinatorial use cases would allow it to shine.
- There's Geolog the type theory, and there's the massively scalable infrastructure for solving problems. On the latter, can we beat Soufflé at their own game? They must have some good benchmarks.
- Try cryptography as a use case? E.g. can we prove an elliptic curve implementation correct? See Martin's X25519 tutorial and the verified F* implementation in ValeCrypt.

## Actions

- Alex, George: Geolog compiler on npm (GHC-compiled Wasm) → get the Godbolt-style web interface going
- Owen: design a language for people to write queries and insert data, putting it through lowering
- James: new version of lowering that can handle inductive things
- Someone should write a compiler that turns a simple toy imperative language into a Geolog instance
- It would be nice to have one frontend – should Obsidian write TypeScript? Cale is happy to write bits of Haskell that generate TypeScript
- Vincent, Alex, Martin: physical data storage, disk and network wire formats
- Martin, Leo: DBSP experiments