This document is copied from https://git.sgai.uk/creators/geolog/-/wikis/Geolog%20storage%20engine%20meeting%202026-04-21 on May 7, 2026.
## Discussion points

- Materialising the current state for efficient constraint checking and efficient querying
- Special-casing certain types of theories, such as hash consing for syntax trees, or arrays for linear orders
- Exposing an API that looks like the theory definitions, not like the IR
- Exposing to various programming languages via FFI
- DBSP for incremental constraint checking? Get Leo Stewen involved?
- Hooking into Soufflé for general querying / testing program analysis use cases?

## IR for initial models + storage layer for it (egraph cache)

- Initial models are often infinite, and not data. How should they show up in the op log?
- Use an SSA or stack-based program where each instruction is a term constructor: both what you extract from an egraph, and something that can be interpreted into an egraph.
- Imagine we're working in the free ring whose generators are all strings. We have a couple of term constructors: given a string we can construct an element of the ring, and then we have 0, 1, multiplication, and addition. There's a term construction for terms in the free ring. If a term tree has common subtrees, we want a DAG representation. Simplest: topologically sort it and store it in an array with backwards indices. That's essentially what SSA is: a store of instructions that may reference previous values.
- Fundamental operation: given two EIDs, give me the EID of the term obtained by applying this term constructor to those EIDs. Given an egraph and an SSA-style program, you can interpret the SSA-style program into the egraph.
- Each term constructor may have a different number of arguments. A separate table per term constructor would mean a lot of tables; maybe one table per arity?
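The SSA-style term table with hash consing can be sketched as follows. This is an illustrative sketch, not the real Geolog engine; `TermStore`, `intern`, and the constructor names are invented here.

```python
# Illustrative sketch: an SSA-style term store with hash consing. Each row
# applies a term constructor to the IDs of previously stored terms, so
# shared subtrees are stored once and a term is just an index into one big
# table.

class TermStore:
    def __init__(self):
        self.rows = []          # index -> (constructor, arg_ids)
        self.memo = {}          # (constructor, arg_ids) -> index

    def intern(self, ctor, *args):
        """The fundamental operation: given argument EIDs and a term
        constructor, return the EID of the resulting term."""
        key = (ctor, args)
        if key not in self.memo:
            self.memo[key] = len(self.rows)
            self.rows.append(key)
        return self.memo[key]

# Terms in a free ring over strings: (x + x) * (x + x) shares the
# subterm x + x, so only three rows are stored.
s = TermStore()
x = s.intern("gen", "x")          # a generator, constructed from a string
sum_ = s.intern("add", x, x)
prod = s.intern("mul", sum_, sum_)
assert len(s.rows) == 3           # the common subtree x + x is stored once
```

Storing `(constructor, arg_ids)` rows keyed by content is what makes concurrently interned duplicates collapse to the same EID.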
- Think of this as a serialised form of an egraph? It stores many different terms concurrently; a term is just an index into a giant table.
- Could have a defined term constructor. E.g. in a ring, one could define a function of 4 arguments to be a big complicated polynomial. It is more efficient to store it as a single operation than as a collection of operations. From the perspective of commits and storage, it could be done either way.
- It might make sense to sometimes put derived operations in the oplog, if they are compact to store.
- In the type theory, initial models are in a general context. E.g. transitive closure for a graph: for all graphs, give you an initial model, which is the transitive closure. In the IR, have a lazy monomorphisation: declare a bunch of tables in the IR; some of the constraints on those tables mention certain initial models (the transitive closure of this graph, or the term trees on this set of variables). Might imagine emitting auxiliary declarations in terms of the main tables of the schema.
- Initial models would not use geometric sequents, as they are too general. Restricted to a conjunction implying another conjunction. Could give this to a datalog/egglog engine, containing only relation constructors but not term constructors.
- The IR is still a set of tables that are declared to be an initial model; term constructors and what they are supposed to act upon.
- James's current IR is fully relational. Would we want to preserve functions in the IR? Functions have types.
- James: Several tables sharing a single rowId space?

## Transitive closure/DAG example

```
theory Graph := sig
  V: Set
  E: V -> V -> Set
end
```
```
def TransClosure (G*: Graph) := (init sig
  r: G.V -> G.V -> Prop
  _: r ?a ?a
  _: r ?a ?b -> G.E ?b ?c -> r ?a ?c
end).r
```
```
theory DAG := sig
  G: Graph
  tc := TransClosure G
  _: tc ?a ?b -> tc ?b ?a -> ?a = ?b
end
```

This translates into the IR:

```
V: TABLE []
E: TABLE [V, V]
r: FREETABLE [V, V]
refl: RULE[(v: V) -> r v v]
trans: RULE[...]
```

- You can't insert into a FREETABLE; it can only be computed using rules. Every element added to a FREETABLE has a provenance: produced by refl, produced by trans, or produced by multiple rules and later shown to be equal.
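Running refl and trans to a fixed point while recording a provenance for each derived fact can be sketched as follows. The rule names mirror the IR above; the data structures are invented for illustration.

```python
# Illustrative sketch: computing the FREETABLE r to a fixed point with the
# refl and trans rules, recording a provenance for every derived fact.

def transitive_closure(vertices, edges):
    r = {}                                    # (a, c) -> provenance
    for v in vertices:
        r[(v, v)] = ("refl", v)               # refl: r v v
    changed = True
    while changed:                            # run rules until fixed point
        changed = False
        for (a, b) in list(r):
            for (b2, c) in edges:
                # trans: r a b, E b c |- r a c
                if b == b2 and (a, c) not in r:
                    r[(a, c)] = ("trans", (a, b), (b2, c))
                    changed = True
    return r

r = transitive_closure({1, 2, 3}, {(1, 2), (2, 3)})
assert (1, 3) in r and r[(1, 3)][0] == "trans"
```

A production engine would use semi-naive evaluation rather than re-scanning the whole table on every pass; the point here is only that each row carries the rule invocation that produced it.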
- Say we want to ensure that an insertion preserves the acyclicity of the graph. At any given moment, the FREETABLE may not yet be fully computed. Either run refl and trans until completion (a fixed point), or turn the check into a graph traversal.
- Props don't need to be stored. If we had r: G.V -> G.V -> Set instead of r: G.V -> G.V -> Prop, that would represent paths through the graph.
- While you can't insert into a free table, other tables could refer to elements of a free table. In that case, the derived free table would have to be materialised as a commit. This is where we could use an SSA-like construction: each element of the table can be derived through a finite list of invocations of rules. These might have IDs that are just local to the commit, not globally unique? But if we want to refer to IDs in another commit, just make them global. Use hash consing.
- Concurrently created terms could use different IDs to refer to the same term. Could imagine a merging process where we extract all the terms we care about, run an egraph over them, and then re-encode them as a sequence of operations with fresh IDs.
- How do you compare elements of the free model across commits, if they have different IDs? Could walk backwards through the SSA tables for both commits, pull both out into a shared egraph, and then compare them in the egraph. But when you extract a term, you get out the term you originally put in.
- The storage engine should concern itself only with hash consing, not the egraph. But the storage engine may need to deal with proofs of equality extracted from the egraph, because those proofs might be expensive to discover, so we'd want to store them. The witness of such a proof would be a bunch of rule invocations, transitivity, congruence etc. (A truncated version of alifib?)
- There might be laws in the IR that use equality. Making equality just another stored relation is perhaps not a good idea – make it a built-in concept of the storage engine that performs congruence and transitivity out of the box?
- You can refer to equality via a set of rewrite rules that you have applied to the database. We have efficient representations of sets (e.g. Merkle trees). If you want to record an equality between two terms, you have a hash that refers to a set of rule invocations; whenever you need to materialise it, you run those rules in your egraph copy.
- You could have a table where one of the columns is an equality proof. If Geolog is used for storing witnesses of certain properties, e.g. "this branch never gets called", store in the database the sequence of rewrite rules you need in order to prove that the branch condition equals false.
- Concept of computational depth: a complement to algorithmic complexity. Complexity is how many bits you need to generate a thing, no matter how long it takes. Computational depth is: given that you are constructing it from the least amount possible, how long does it take? Even if things do not contain a lot of information, we might want to store them if they contain a lot of computational depth. In the context of Geolog, where deciding equality is a semi-decision procedure (not guaranteed to terminate), this means the witnesses for equality need to be stored.
- Could make a content-addressed construction for term IDs, but this would take a lot more space.
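The idea of built-in equality with stored witnesses can be sketched with a union-find that records a reason for every merge, so an equality can later be explained as a chain of rule invocations. The classes and the witness format are invented here, not part of any Geolog design.

```python
# Illustrative sketch: a union-find that records a witness for every merge,
# so an equality can be replayed/explained later instead of rediscovered.

class ExplainedUnionFind:
    def __init__(self):
        self.parent = {}
        self.reason = {}                  # merged root -> (new parent, witness)

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            x = self.parent[x]
        return x

    def union(self, a, b, witness):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb
            self.reason[ra] = (rb, witness)

    def explain(self, x):
        """Chain of witnesses from x up to its representative."""
        steps = []
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            steps.append(self.reason[x][1])
            x = self.parent[x]
        return steps

uf = ExplainedUnionFind()
uf.union("a", "b", "rule1@row3")          # witnesses are opaque labels here
uf.union("b", "c", "rule2@row7")
assert uf.find("a") == uf.find("c")
assert uf.explain("a") == ["rule1@row3", "rule2@row7"]
```

Path compression is deliberately omitted: the uncompressed parent chain *is* the stored explanation. A real proof-producing congruence closure keeps both.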
## Query execution

- From a type theory point of view: if T is a theory, then a query is just a term of type q : (M : T) -> Set (given a model M of the theory, give me a set). Conjunctive queries are the easiest. For example, G : Graph -> [v : G.V, e : G.E v v] returns all the vertices with self-loops.
- Discussion of DBSP. Owen had heard of Differential Dataflow but not DBSP.
- In Geolog-zeta, Davidad compiled every Geolog sequent into two queries, one for the antecedent and one for the consequent. Compute a set for each, and check that one set is included in the other.
- DBSP would need to deal with positive queries (conjunctive and finitely disjunctive).
- How does this interact with the chase? Mark rules as chased or not-chased. For chased rules, compute the derived facts. For non-chased rules, it's an error if the rule is not satisfied (checked after the chase is complete).
- Want some way of special-casing certain patterns. There are more and less efficient algorithms for initial models; some of these could be inlined into DBSP.
- This is analogous to the part of a SQL database that produces a query plan – for example, the choice between a hash join and a merge join. Similarly, we want to be able to see that an axiom has a certain shape and compile it to an appropriate DBSP circuit.
- Could do an initial DBSP integration without initial models (no transitive closure, just e.g. unions of conjunctive queries). A program logic (deep embedding into Geolog) could probably be checked using this fragment? The axioms check the well-formedness of the application of every rule, and the well-foundedness of the proof tree. Well-foundedness is the only non-local thing being checked here. This would be a good use case to try: Hoare logic over imperative programs. Allow agents to suggest proof steps and get feedback on which axioms are violated.
- Hooking up to Soufflé is also a possibility. So far we've only looked at Geolog features that are also in Datalog, but soon we might go beyond that. We could try exporting to textual Datalog and running it through Soufflé as a benchmarking baseline, or hook into Soufflé at the ABI level.
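The antecedent/consequent scheme described above can be sketched as a set-inclusion check. The relation encoding and the example sequent (the DAG antisymmetry axiom) are invented for illustration; Geolog-zeta's actual compilation is not shown here.

```python
# Illustrative sketch: a sequent compiled into two queries, one for the
# antecedent and one for the consequent; the sequent holds iff every
# antecedent match is also a consequent match.

def holds(antecedent, consequent, db):
    return antecedent(db) <= consequent(db)

# Sequent: E(x, y) /\ E(y, x) |- x = y  (antisymmetry, as in the DAG theory)
antecedent = lambda db: {(x, y) for (x, y) in db["E"]
                         for (y2, x2) in db["E"]
                         if y == y2 and x == x2}
consequent = lambda db: {(x, y) for (x, y) in antecedent(db) if x == y}

assert holds(antecedent, consequent, {"E": {(1, 2), (2, 3), (4, 4)}})
assert not holds(antecedent, consequent, {"E": {(1, 2), (2, 1)}})
```

A DBSP-style version would maintain both sets incrementally and flag only the delta that breaks the inclusion, instead of recomputing the sets per check.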
## API

- Besides the IR, another output of the compiler needs to be FFI bindings, e.g. for JavaScript or Rust. Do Haskell first (using the compiler as a library)? Owen thinks Haskell is harder – it's easier to generate code. TypeScript is the best target for now.
- Alex suggests we use an LLM to generate ORM-like bindings from the theory definition. Owen: it's less like an ORM, more like a thin wrapper around prepared statements. Davidad: wants an interface more like ODBC, less like an ORM.
- What makes this different from a database: every node has Geolog in-process.
- Owen wants to think about annotations saying which concepts to expose via FFI. Perhaps have a separate language that has an "FFI" to the Geolog language, but is restricted in itself.
- TA1.3 interface: the Geolog compiler as a "dev dependency", e.g. as a Vite plugin. Run it over the query definitions and output bindings. The Geolog engine gets compiled to Wasm and reads what the bindings give it. Argument about whether to ship the compiler to the browser: since we want the compiler to output TypeScript bindings, we then need to recompile the code that uses those bindings. For people experimenting with theories this will happen frequently; in a "production" use the theory will be fixed at compile time.
- Alex suggests we want a web app that will allow others to experiment with writing theories and using instances. Obsidian already have a basic version of this. Make it available as an npm library and implement the web interface as a simple Vite app, to check whether we've libraryified it sufficiently.
- Another goal for TA1.3: dispense with the interchange format for Petrinaut; use Geolog instead.
- Possible format for inserting data into a Geolog instance:

```
theory Graph := sig
  V : Set
  E : V -> V -> Set
end

theory Main := sig
  Graphs : Set
  G : Graphs -> Graph
  g0 : Graphs
end

# Syntax for data of an instance

[
  g0 : Graphs
  _ : [
    open G g0  # brings (G g0).V and (G g0).E into scope as V and E
    v0 : V
    v1 : V
    e : E v0 v1
  ]
]
```

- Make a Godbolt-style app to allow people to play with Geolog theories.
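What a generated "thin wrapper around prepared statements" binding might feel like can be sketched as follows, mirroring the instance data above. The `Instance` class and its methods are entirely hypothetical; no such Geolog API exists yet, and real bindings would be generated TypeScript, not hand-written Python.

```python
# Illustrative sketch of ODBC-style (not ORM-style) generated bindings:
# a thin wrapper that inserts rows and hands back element ids.

class Instance:
    def __init__(self):
        self.rows = {}                      # table name -> list of tuples

    def insert(self, table, *args):
        """Insert a row and return its index as a fresh element id."""
        rows = self.rows.setdefault(table, [])
        rows.append(args)
        return len(rows) - 1

# The same data as in the block above: one graph g0 with an edge v0 -> v1.
db = Instance()
g0 = db.insert("Graphs")
v0 = db.insert("V", g0)                     # an element of (G g0).V
v1 = db.insert("V", g0)
e = db.insert("E", g0, v0, v1)              # (G g0).E v0 v1
assert db.rows["E"] == [(g0, v0, v1)]
```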
## Data structures

- Fast analyses: egraphs, making conjunctive queries fast.
- vs. Automerge: one oplog, multiple materialised check-outs.
- Eventually, various optimised data structures would be good. As a first pass, do something generic. Davidad: "tensors of ordered semirings".
- How often do we expect to have dense multidimensional tensors? Only really in neural networks. But a sparse tensor can implement the same trait. Build up a library of tensor implementations and let the optimiser pick one. Updates to the tensors represented via binary space partitioning. Davidad thinks nobody has done this.
- For example, representing a graph as an adjacency matrix, even if the underlying representation is actually a set of edges represented as tuples. Davidad thinks the tensor representation is good for GPUs to operate on, streaming a dense representation – e.g. transitive closure in matrix representation. Conjunctive queries are binary tensor contractions.
- This representation can't handle strings – we have to intern them into a string store, e.g. a big append-only blob, with string values represented as offsets into that blob.
- Anything that is a B-tree index in an ordinary database (where we want to be able to do range queries on it) could be a dtype.
- Davidad argues that functions should be stored differently from tables. For example, with a graph, instead of E: V -> V -> Set there would be functions head: E -> V and tail: E -> V.
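The "sparse tensor implements the same trait" point can be sketched with two graph representations behind one interface, with a conjunctive query that is agnostic to the physical layout. The interface and class names are invented here.

```python
# Illustrative sketch: a dense boolean adjacency matrix and a sparse set of
# edge tuples implement the same minimal interface, so an optimiser could
# swap one representation for the other.

class DenseGraph:
    def __init__(self, n, edges):
        self.m = [[(i, j) in edges for j in range(n)] for i in range(n)]

    def has_edge(self, i, j):
        return self.m[i][j]

class SparseGraph:
    def __init__(self, n, edges):
        self.edges = set(edges)

    def has_edge(self, i, j):
        return (i, j) in self.edges

edges = {(0, 1), (1, 2)}
for g in (DenseGraph(3, edges), SparseGraph(3, edges)):
    # One conjunctive query E(i, j) /\ E(j, k), two physical layouts.
    # As a tensor contraction this is a boolean matrix product.
    two_paths = {(i, k) for i in range(3) for j in range(3) for k in range(3)
                 if g.has_edge(i, j) and g.has_edge(j, k)}
    assert two_paths == {(0, 2)}
```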
## Geolog breaking down into documents

- We need a concept of a namespace: you can define a variable "foo is an instance of theory bar"; whoever owns the theory can update it, but the instances and theories are immutable.
- Want to be able to point at something in someone else's namespace; that would be Geolog's analogue of Automerge documents. A commit is located in a namespace and modifies things only within that namespace. Namespaces reference each other. Want something like cargo import?
- Each variable is its own document. You might have a collection of documents; their contents might not be disjoint, though. Say you have a graph, copy it a couple of times, and then use it in different instances. Logically, each variable has a completely separate history.
- A namespace has a very simple version history – it's just a version bump to one of the variables it contains. This recognises that mutation is needed somewhere: at the namespace level there are just variables.
- Sedimentree doesn't work well for bushy histories. Our workload might have long parallel branches (work from independent agents) but probably not very wide branching.
- Partial merging might be needed: given two instances of a theory that are individually valid, we need a merge that produces the maximal subset that is still valid. This always exists (in the worst case, throw away one of the branches), but it is not necessarily unique or deterministically defined. Perhaps a cherry-pick/rebase. Perhaps an explicit merge commit that indicates which rows are excluded or added in the merged result. In general, doing a merge might be just as much work as doing the work on the individual branch, although many practical use cases (e.g. program logics) would allow easy merges (two individually proved statements can easily be combined).
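Partial merging can be sketched concretely for the acyclicity constraint: union the rows, then drop rows from one side until validity is restored. Greedy dropping is just one arbitrary strategy; as noted above, the maximal valid merge is not unique.

```python
# Illustrative sketch: merge two individually valid edge sets, dropping
# rows from "theirs" until the result is valid again (validity = acyclic).

def is_acyclic(edges):
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
    visiting, done = set(), set()
    def dfs(v):                          # DFS-based cycle detection
        if v in visiting:
            return False
        if v in done:
            return True
        visiting.add(v)
        ok = all(dfs(w) for w in graph.get(v, []))
        visiting.discard(v)
        done.add(v)
        return ok
    return all(dfs(v) for v in list(graph))

def partial_merge(ours, theirs):
    merged = set(ours) | set(theirs)
    for row in sorted(theirs):           # worst case: drop all of theirs
        if is_acyclic(merged):
            break
        merged.discard(row)
    return merged

ours, theirs = {(1, 2), (2, 3)}, {(3, 1)}   # union has a cycle 1->2->3->1
merged = partial_merge(ours, theirs)
assert is_acyclic(merged) and merged == {(1, 2), (2, 3)}
```

An explicit merge commit would record exactly the dropped rows, so both sides can see what was excluded and why.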
## Program logic example

- String diagram for (a+b)*c; we want to prove that if all the inputs are >=0 then the output is >=0. We have a proof rule for multiplication: if both inputs are >=0 then the output is >=0.
- Layers: the SSA target (references to variables in the program); logical assertions about those variables; logical formulae; sequents; derivations (applications of inference rules). All of these are encoded as Geolog theories.
- What if assertions and atoms were meta-level props? That would be more in line with traditional Datalog analyses. But if >= were an actual term constructor in the Geolog language, that would be less flexible/extensible. Geolog formulas are not first-class objects in Geolog: if we have a proof tree that points at objects that are not themselves in the theory, that gets messy. Davidad thinks it's cleaner to do a deep embedding than a shallow one.
- Why do you need a proof tree at all? If you're using the Geolog metalogic (propositions in Geolog), keeping track of propositions is just part of the native logic of Geolog. Say ">=0" were a built-in of type Wire -> Prop. Doesn't that put the burden of proof search on the Geolog computation engine? Not necessarily: there's nothing stopping you from writing down the proof rules applied – in fact, that's what initial models are.
- There are a bunch of term constructors that produce proofs, and we can just serialise them. That is far more compact than the whole proof tree, which stores the context at each node. Still constrained to geometric logic, though, which arbitrary proof trees would not be.
- Want a concept of non-strict axioms: allow data violating the axioms to be committed, but with a big red flag saying that the proof is incomplete. Allow sorry as a constructor, and a no-sorries predicate that's a non-strict axiom.
- What kinds of logics do we want to use that don't embed nicely into geometric logic? James suggested a separation logic that came out a few years ago, which has some substructure.
- The proof tree approach might not work for binders. Once you have quantifiers in the inner language, do you need to switch to De Bruijn indices or suchlike? No – represent free variables as explicit references to the variable binding. Performing substitution is still a bunch of work, though.
- Problem: if you prove something about one term, and then you rewrite (e.g. substitute) into a similar term, it's a new term and the proofs don't apply to it. You can copy the proof trees, but that will get huge. A lot of the juice we can get out of geometric logic is that conjunctive queries are fast to check, but deep embedding throws a lot of that away.
- Owen thinks it would be better to extend the Geolog logic to do e.g. interesting things with real numbers, rather than try to deeply embed a theory that is as annoying as pen and paper. Davidad thinks we should try it out; it's a selling point that it's sufficient to do a deep embedding of any logic, even if it's easier and more efficient to do a shallow embedding. Importing SMT certificates needs that flexibility, for example.
- Geolog might be best for things that are combinatorial. More natively combinatorial use cases would allow it to shine.
- There's Geolog the type theory, and there's the massively scalable infrastructure for solving problems. On the latter, can we beat Soufflé at their own game? They must have some good benchmarks.
- Try cryptography as a use case? E.g. can we prove an elliptic curve implementation correct? See Martin's X25519 tutorial and the verified F* implementation in ValeCrypt.
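The "term constructors that produce proofs" idea can be sketched for the (a+b)*c example: a proof is just a serialised tree of constructor applications, with no context stored at the nodes. All names here are invented for illustration.

```python
# Illustrative sketch of proof-producing term constructors for the >=0
# rules. A proof is a plain nested tuple of constructor applications,
# far more compact than a proof tree carrying the context at each node.

def input_nonneg(wire):
    return ("input_nonneg", wire)           # axiom: this input is >= 0

def add_nonneg(p, q):
    return ("add_nonneg", p, q)             # rule: x>=0, y>=0 |- x+y>=0

def mul_nonneg(p, q):
    return ("mul_nonneg", p, q)             # rule: x>=0, y>=0 |- x*y>=0

# Proof that (a+b)*c >= 0, given a, b, c >= 0:
pa, pb, pc = input_nonneg("a"), input_nonneg("b"), input_nonneg("c")
proof = mul_nonneg(add_nonneg(pa, pb), pc)
assert proof == ("mul_nonneg",
                 ("add_nonneg", ("input_nonneg", "a"), ("input_nonneg", "b")),
                 ("input_nonneg", "c"))
```

A `sorry` constructor would slot in as one more producer of "proofs", and the no-sorries predicate would be a scan of the serialised tree.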
## Actions

- Alex, George: Geolog compiler on npm (GHC-compiled Wasm) → get the Godbolt-style web interface going
- Owen: design a language for people to write queries and insert data, and put it through lowering
- James: new version of lowering that can handle inductive things
- Someone should write a compiler that turns a simple toy imperative language into a Geolog instance
- It would be nice to have one frontend – should Obsidian write TypeScript? Cale is happy to write bits of Haskell that generate TypeScript
- Vincent, Alex, Martin – physical data storage, disk and network wire formats
- Martin, Leo – DBSP experiments
|