Add a suplamantay note file on CRDTs and Datalog

2026-05-07 10:22:45 +02:00 · 2026-05-07 10:22:45 +02:00 · 01a9aa167e
commit 01a9aa167e
parent 35b6e8f43f
2 changed files with 479 additions and 0 deletions
--- a/dbsp/002-why-crdts-as-queries.md
+++ b/dbsp/002-why-crdts-as-queries.md
@ -0,0 +1,317 @@
+# Why CRDTs as Queries
+
+A coherent reading note on the idea of defining replicated data structures as deterministic queries over immutable operations.
+
+---
+
+## Starting Point
+
+The basic problem behind CRDTs is easy to state and hard to implement well. Several replicas hold copies of the same logical data. Each replica should
+be able to accept writes locally, including while offline or disconnected from the other replicas. Later, when replicas exchange information, they
+should converge to the same state.
+
+This is not the same problem as ordinary database replication. In a conventional primary-replica database, a single authority can decide the order of
+writes. If two users write at the same time, the system can serialize those writes through a leader, a lock, or a consensus protocol. That gives the
+database a single history. CRDTs are designed for environments where that coordination is unavailable, undesirable, or too expensive. A user should
+still be able to write locally, even if there is no reachable leader.
+
+The price is that the system must handle concurrency after the fact. Two replicas may both accept writes that neither knew about at the time. When
+those writes later meet, the system must define what the merged state means. A CRDT is a data structure whose merge behavior is designed so that all
+replicas eventually compute the same state.
+
+---
+
+## The Traditional Burden
+
+In a hand-written CRDT, the implementer writes an algorithm whose operations are safe under concurrency. For an operation-based CRDT, that often means
+concurrent operations must commute: applying operation `a` and then operation `b` must lead to the same logical state as applying `b` and then `a`, at
+least when `a` and `b` are concurrent.
+
+This is manageable for simple structures such as grow-only sets. It becomes more subtle for registers, maps, lists, trees, undo and redo, and nested
+documents. Ordered lists are a good example. A list CRDT must not only decide whether an element exists, but also where it appears. Concurrent
+insertions at the same position must be ordered deterministically. Deletions must not destroy information that later or concurrent insertions might
+reference. These details produce the familiar machinery of operation identifiers, tombstones, causal dependencies, and tie-breaking rules.
+
+The programmer therefore has two jobs. First, they must design the logical behavior of the data type. Second, they must implement it in a way that
+preserves convergence under every possible delivery order. The second job is where many mistakes hide.
+
+---
+
+## Query-Based Turn
+
+The query-based approach reframes the problem. Instead of treating the CRDT as mutable state plus a merge algorithm, it treats the CRDT as a derived
+view over an immutable operation log.
+
+The replica stores operations as facts:
+
+```text
+set(replica_id, counter, key, value)
+pred(from_replica_id, from_counter, to_replica_id, to_counter)
+insert(replica_id, counter, parent_replica_id, parent_counter, value)
+remove(replica_id, counter)
+```
+
+The visible state is not the primary stored object. The visible state is the result of a query over those facts.
+
+This gives a clean convergence story. If two replicas have the same operation facts, and they evaluate the same deterministic query, they must compute
+the same result. The query does not depend on the order in which operations arrived. Arrival order is an implementation detail. The logical input is a
+set or multiset of immutable facts.
+
+That shift is powerful because it moves convergence reasoning into the structure of the computation. The developer still has to define the intended
+semantics, but they are no longer hand-coding every step of the merge algorithm.
+
+---
+
+## Why Datalog Fits
+
+Datalog is a natural language for this style because it is built around facts and derived facts. A Datalog program says which new facts follow from
+existing facts. The execution model is close to the mental model of derived views.
+
+For example, a multi-value register key-value store can be described by storing `set` operations and causal predecessor edges. A value is visible if
+its set operation has not been overwritten by a causally later operation. In Datalog-like notation:
+
+```text
+overwritten(RepId, Ctr) :-
+    pred(RepId, Ctr, _, _).
+
+mvrStore(Key, Value) :-
+    set(RepId, Ctr, Key, Value),
+    not overwritten(RepId, Ctr).
+```
+
+This is small, but it captures an important semantic choice. The query does not pick one winner among concurrent values. It filters out values that
+have been causally superseded. If two values are concurrent, neither overwrites the other, so both remain visible.
+
+Datalog also handles recursion directly. That matters because causal histories and list structures are graph-shaped. Asking whether an operation is
+reachable from a root, whether a dependency chain is complete, or what the next visible list element is can require recursive rules.
+
+The query language is restricted enough to make evaluation well-defined, but expressive enough to describe useful replicated structures.
+
+---
+
+## Causality as Data
+
+Causality is the difference between "this write came after that write" and "these writes were independent." CRDTs need this distinction because
+overwriting should usually remove only causally prior values, not concurrent values.
+
+The operation identifier is usually a pair:
+
+```text
+(replica_id, counter)
+```
+
+The replica id identifies where the operation came from. The counter is a local logical clock. Together, they make operation identifiers unique. They
+also provide a deterministic tie-breaker when the data type needs a total order among concurrent operations.
+
+Causal dependencies can be represented as edges:
+
+```text
+pred(from_replica_id, from_counter, to_replica_id, to_counter)
+```
+
+This says the `to` operation depends on the `from` operation. The dependency graph is then ordinary relational data. A query can derive roots, leaves,
+overwritten operations, causally ready operations, and visible values.
+
+There is an important design choice here. If the network or runtime guarantees causal delivery, the query can be simpler. If operations may arrive out
+of order, the query must avoid exposing operations whose dependencies have not arrived. That means causal readiness becomes part of the query.
+
+---
+
+## Out-of-Order Delivery
+
+Out-of-order delivery is common in distributed systems. A replica might receive an operation before receiving the operation it depends on. If the
+system exposes the later operation too early, it can show a state that is not valid under the intended causal semantics.
+
+A causal-readiness query guards against this. It derives which operations can be safely considered visible because their dependency chain is present.
+Conceptually, the query walks the causal graph from roots toward leaves and marks operations as ready only when the necessary predecessors are
+available.
+
+This improves correctness in less controlled networks, but it adds cost. Graph traversal is recursive. Recursive incremental computation is possible,
+but it is not free. If the query repeatedly walks long causal chains, performance may grow with the depth of the history.
+
+This is one of the central engineering lessons. Declarative correctness is not the same as automatic efficiency. The query may be compact and
+semantically clear, while still requiring careful optimization.
+
+---
+
+## Lists Are the Hard Case
+
+A key-value register demonstrates the idea, but an ordered list shows why the idea is interesting.
+
+In a collaborative text editor, users insert and delete characters. If two users insert at the same position concurrently, the final document must
+place both insertions somewhere, and all replicas must choose the same order. The data type cannot rely on local array indexes because indexes shift
+as edits arrive.
+
+A common CRDT solution is to give every inserted element a stable identifier. An insertion does not say "put this character at index 12." It says "put
+this character after element `(r, c)`." The insert operations form a tree:
+
+```text
+insert(replica_id, counter, parent_replica_id, parent_counter, value)
+```
+
+The sentinel root represents the beginning of the list. Children of a node are insertions that targeted that node as their parent. Concurrent
+insertions after the same parent become siblings. Siblings are ordered deterministically by operation identifier. The visible list is then obtained by
+traversing the tree in a deterministic order.
+
+Deletion is also subtle. If an element is deleted, it often cannot be physically removed from the structural history, because later or concurrent
+operations may refer to it as a parent. The system keeps a tombstone: the element remains as a reference point, but the visible list skips its value.
+
+In query terms, list behavior becomes a set of derived relations: first child, next sibling, next element, visible element, and next visible element.
+Datalog can express those relations directly. The query is longer than the register example, but it is still a declarative description of the list
+semantics.
+
+---
+
+## Incremental View Maintenance
+
+If CRDT state is a query over all operations, the obvious worry is cost. The operation set only grows. A naive implementation would recompute the
+entire query result every time a new operation arrives.
+
+Incremental view maintenance is the response. The engine maintains the result of a query as inputs change. When a new operation arrives, the engine
+computes the change to the output rather than recomputing the whole output.
+
+For a replicated application, the desired runtime shape is:
+
+```text
+new operation facts
+-> incremental query update
+-> visible state changes
+-> application update
+```
+
+This shape is especially attractive for user interfaces. If the query engine emits deltas, the application can update only the affected views. It is
+also attractive for local-first systems because the same machinery can process local writes, remote writes, and startup replay.
+
+DBSP is relevant here because it provides a formal and practical model for incremental computation over changing relations. Relational operators are
+lifted into a streaming setting where inputs and outputs evolve over time.
+
+---
+
+## Hydration and Warm Updates
+
+There are two different performance situations to keep separate.
+
+Hydration is startup. The application already has a stored operation history, but the query engine must rebuild its internal operator state. It may
+need to parse the query, build the execution plan, feed in the existing facts, and produce the current state. Hydration measures how long it takes
+before the application can show the document or database contents after opening.
+
+Warm update processing is the normal running mode after hydration. The query engine already has internal state. A small batch of new operations
+arrives. The engine only needs to update the maintained result.
+
+A design can perform acceptably in warm updates but poorly during hydration, or the other way around. For CRDT-backed applications, both matter. A
+collaborative editor must feel responsive while editing, but it must also open large documents without a long pause.
+
+This distinction also suggests possible hybrid strategies. A system might use a batch-oriented computation for startup and then switch to incremental
+maintenance. Or it might persist internal operator state so startup does not require replaying the entire operation history.
+
+---
+
+## Relational Intermediate Representation
+
+A Datalog program is convenient for users, but the execution engine usually wants a lower-level representation. A common design is to translate
+Datalog into a relational intermediate representation.
+
+The relational IR can include operators such as:
+
+- projection
+- selection
+- join
+- antijoin
+- union
+- difference
+- distinct
+- fixed-point iteration
+
+This is useful for two reasons. First, relational algebra is a good target for optimization. The engine can push down filters, remove unused fields,
+combine projections, and choose join strategies. Second, the IR separates the frontend language from the execution backend. Datalog is one possible
+frontend. DBSP is one possible incremental backend.
+
+That separation matters for research and engineering. If the backend changes from DBSP to another incremental framework, the Datalog frontend does not
+have to be redesigned. If a SQL-like frontend is added later, it can target the same IR.
+
+---
+
+## Where the Costs Hide
+
+The query-based approach simplifies some reasoning, but it does not erase hard systems problems.
+
+Negation is one source of care. Datalog with arbitrary negation can have unclear or unstable semantics. Stratified negation restricts programs so
+negative dependencies do not form problematic cycles. This keeps evaluation understandable, but it limits what can be expressed directly.
+
+Recursion is another source of cost. Recursive rules are needed for graph reachability, transitive closure, causal readiness, and list traversal.
+Incremental recursion can still be expensive when each update affects a long chain or a large region of the dependency graph.
+
+Join planning matters as well. Datalog rules often translate into joins. Bad join order can create large intermediate relations. In a continuously
+maintained query, changing the plan later may be harder than changing it for a one-shot query because the operators hold state.
+
+Storage growth is also unresolved. The clean convergence story assumes a monotonically growing operation set. Real applications cannot always keep
+every operation forever, especially on small devices. Compaction must preserve enough information for future queries and future synchronization.
+
+These are not arguments against the approach. They are the places where the approach becomes a database systems problem rather than only a
+programming-language idea.
+
+---
+
+## What This Approach Buys
+
+The first benefit is conceptual. The CRDT is specified as a query. The implementation has a clearer boundary between logical behavior and physical
+execution. That is the same separation that made relational databases powerful: the user specifies what result should exist, and the engine decides
+how to maintain it.
+
+The second benefit is extensibility. A fixed CRDT library exposes a fixed set of data types. A query-based system could let application developers
+define custom replicated structures, provided they stay within the safe fragment of the language.
+
+The third benefit is a shared interface. Ordinary application state, derived views, and replicated state can all look like queries. This could reduce
+the number of special-purpose layers in local-first applications.
+
+The fourth benefit is optimization headroom. If CRDTs are expressed through a query plan, improvements to the query engine can improve many CRDT
+definitions without changing their logical definitions.
+
+---
+
+## What Remains Open
+
+Several questions remain open before this style can be treated as a production design.
+
+Can enough useful CRDTs be expressed in Datalog with stratified negation? Registers and lists are promising examples, but nested documents, moves in
+trees, undo and redo, and rich JSON-like structures are harder.
+
+Can incremental evaluation make the performance competitive with hand-written CRDTs? A hand-written CRDT can exploit structure-specific shortcuts. A
+query engine needs optimization to avoid paying too much for generality.
+
+Can operation histories be compacted safely? Append-only facts are clean, but unbounded growth is not acceptable for every application.
+
+Can the system provide good error messages and type checking? A query language for application developers needs more than a working parser and
+runtime.
+
+Can causal readiness be optimized around the common case? Most new operations in a live application are likely close to current causal heads, but a
+naive recursive query may still traverse from roots.
+
+These questions define the practical research agenda.
+
+---
+
+## Reading Frame
+
+The best way to read this line of work is as a bridge between three areas.
+
+From CRDTs, it takes the goal of coordination-free replicated state.
+
+From Datalog, it takes declarative rules, recursion, and deterministic derivation over facts.
+
+From incremental query engines, it takes the ability to maintain derived state as input changes.
+
+The slogan is:
+
+```text
+replicated data structure = materialized query over immutable operations
+```
+
+That slogan is not the whole system, but it is the core mental model. The data structure is no longer only an object with methods. It is a maintained
+view. The operation log is the base data. The query is the semantics. The incremental engine is the execution strategy.
+
+---
+
+## Changelog
+
+* **May 7, 2026** -- First version created.
--- a/external/001_query_engine.md
+++ b/external/001_query_engine.md
@ -0,0 +1,162 @@
+This document is copied from https://git.sgai.uk/creators/geolog/-/wikis/Geolog%20storage%20engine%20meeting%202026-04-21 on May 7, 2026.
+
+Discussion pointsMaterialising the current state for efficient constraint checking and efficient queryingSpecial-casing certain types of theories,
+such as hash consing for syntax trees, or arrays for linear ordersExposing an API that looks like the theory definitions, not like the IRExposing to
+various programming languages via FFIDBSP for incremental constraint checking? Get Leo Stewen involved?Hooking into Soufflé for general querying /
+testing program analysis use cases?IR for initial models + storage layer for it (egraph cache)Initial modelsoften infinite, and not data. how should
+they show up in the op log?use an SSA or stack-based program where each instruction is a term constructor. both what you extract from an egraph and
+can be interpreted into an egraph.imagine we're working in the free ring with generators all strings. we have a couple of term constructors: given a
+string we can construct an element of the ring; and then we have 0, 1, multiplication, and addition. there's a term construction for terms in the free
+ring. if a term tree has common subtrees, we want a DAG representation. simplest: topologically sort it, store it in an array with backwards indices.
+that's essentially what SSA is: store instructions that may reference previous values.fundamental operation: given two EIDs, give me the EID of the
+term that is given by applying this term constructor to those EIDs. given an egraph and an SSA-style program you can interpret the SSA-style
+program.each term constructor may have a different number of arguments. separate table per term constructor would mean a lot of tables. maybe one
+table per arity?think of this as a serialised form of an egraph? stores many different terms concurrently. a term is just an index into a giant
+table.could have a defined term constructor. e.g. in a ring, could define a function of 4 arguments to be a big complicated polynomial. more efficient
+to store it as a single operation than as a collection of operations. from the perspective of commits and storage, could do either way.might make
+sense to sometimes put derived operations in the oplog, if they are compact to store.in the type theory, initial models are in general context. e.g.
+transitive closure for a graph: for all graphs, give you an initial model, which is the transitive closure. In the IR, have a lazy monomorphisation:
+declare a bunch of tables in the IR; some of the constraints on those tables mention certain initial models. those initial models are the transitive
+closure of this graph, or the term trees on this set of variables. might imagine emitting auxiliary declarations in terms of the main tables of the
+schema.initial models would not use geometric sequents, as they are too general. restricted to be a conjunction implying another conjunction. could
+give this to a datalog/egglog engine, containing only relation constructors but not term constructors.IR is still a set of tables that are declared to
+be an initial model; term constructors and what they are supposed to act upon.James's current IR is fully relational. Would want to preserve functions
+in the IR? Functions have types.James: Several tables sharing a single rowId space?Transitive closure/DAG exampletheory Graph := sig
+V: Set
+E: V -> V -> Set
+end
+
+def TransClosure (G*: Graph) := (init sig
+r: G.V -> G.V -> Prop
+_: r ?a ?a
+_: r ?a ?b -> G.E ?b ?c -> r ?a ?c
+end).r
+
+theory DAG := sig
+G: Graph
+tc := TransClosure G
+_: tc ?a ?b -> tc ?b ?a -> ?a = ?b
+endtranslates into IR:V: TABLE []
+E: TABLE [V, V]
+r: FREETABLE [V, V]
+refl: RULE[(v: V) -> r V V]
+trans: RULE[...]You can't insert into a FREETABLE, it can only be computed using rules. Every element added to FREETABLE has a provenance: produced by
+refl, or produced by trans, or produced by multiple rules and later shown to be equal.Say we want to preserve that an insertion preserves the
+acyclicity of the graph.At any given moment, FREETABLE may not yet be fully computed. Either run refl and trans until completion (fixed point), or
+turn it into a graph traversal.Props don't need to be stored. If we had r: G.V -> G.V -> Set instead of r: G.V -> G.V -> Prop that would represent
+paths through the graph.While you can't insert into a free table, other tables could refer to elements of a free table. In that case, the derived free
+table would have to materialised as a commit. This is where we could use an SSA-like construction: each element of the table can be derived through a
+finite list of invocations of rules. These might have IDs that are just local to the commit, not globally unique? But if we want to refer to IDs in
+another commit, just make them global. Use hash consing.Concurrently created terms could use different IDs to refer to the same term. Could imagine a
+merging process where we extract all the terms we care about, run egraph over them, and then re-encode them as a sequence of operations with fresh
+IDs.How do you compare elements of the free model across commits, if they have different IDs? Could walk backwards through the SSA tables for both
+commits, pull out both into a shared egraph, and then compare them in the egraph. But when you extract a term, you get out the term you originally put
+in.Storage engine should concern itself only with hash consing, not the egraph. But storage engine may need to deal with proofs of equality extracted
+from the egraph, because those proofs might be expensive to discover, so we'd want to store it. The witness of that proof would be a bunch of rule
+invocations, transitivity, congruence etc. (A truncated version of alifib?)There might be laws in the IR that use equality. Making equality just
+another stored relation is perhaps not a good idea – make it a built-in concept of the storage engine that performs congruence and transitivity out of
+the box?You can refer to equality via a set of rewrite rules that you have applied to the database. We have efficient representations of sets (e.g.
+Merkle trees). If you want to record an equality between two terms, you have a hash that refers to a set of rule invocations; whenever you need to
+materialise it, you run those rules in your egraph copy.You could have a table where one of the columns in the table is an equality proof? If Geolog
+is used for storing witnesses of certain properties, e.g. this branch never gets called. Store this in the database as the sequence of rewrite rules
+you need in order to prove that the branch condition equals false.Concept of computational depth: a complement to algorithmic complexity. Complexity
+is how many bits you need to generate a thing, no matter how long it takes. Computational depth is, given you are constructing it from the least
+amount possible, how long does it take? Even if things do not contain a lot of information, might want to store them if they contain a lot of
+computational depth. In the context of Geolog, where the process of equality is a semi-decision procedure (not guaranteed to terminate), this means
+the witnesses for equality need to be stored.Could make a content-addressed construction for term IDs, but this would take a lot more space.Query
+executionFrom a type theory point of view: if T is a theory, then a query is just a term of type q : (M : T) -> Set (given a model M of a theory, give
+me a set). Conjunctive queries are the easiest. for example, G : Graph -> [v : G.V, e : G.E v v] returns all the vertices with self-loops.Discussion
+of DBSP. Owen had heard of Differential Dataflow but not DBSP.In Geolog-zeta, Davidad compiled every Geolog sequent into two queries, one for the
+antecedent, one for the consequent. Compute a set for each, and check that one set is included in the other.DBSP needs to deal with positive queries (
+conjunctive and finitely disjunctive)How does this interact with chase? Mark rules as chased or not-chased. With chased, compute the derived facts.
+With non-chased, it's an error if the result is not satisfied (checked after the chase is complete).Want some way of special-casing certain patterns.
+There are more or less efficient algorithms for initial models. Some of these could be inlined into DBSP.This is analogous to the part of a SQL
+database that produces a query plan. For example, choice between hash join or merge join. Similarly, we want to be able to see if an axiom has a
+certain shape and compile it to an appropriate DBSP circuit.Could do an initial DBSP integration without initial models (no transitive closure, just
+e.g. union of conjunctive queries). A program logic (deep embedding into Geolog) could probably be checked using this fragment? The axioms are
+checking the well-formedness of the application of every rule, and the well-foundedness of the proof tree. Well-foundedness is the only non-local
+thing being checked here. This would be a good use case to try: Hoare logic over imperative programs. Allow agents to suggest proof steps and get
+feedback on which axioms are violated.Hooking up to Souffle is also a possibility. So far we've only looked at Geolog features that are also Datalog,
+but soon we might go beyond that. But we could try exporting to textual Datalog and run it through Souffle as a benchmarking baseline. Or hook into
+Souffle at ABI level.APIBesides the IR, another output of the compiler needs to be FFI bindings e.g. for JavaScript or Rust. Do Haskell first (using
+compiler as a library)? Owen thinks Haskell is harder – easier to generate code. TypeScript is the best target for now.Alex suggests we use LLM to
+generate ORM-like bindings from theory definition. Owen: it's less like ORM, more like a thin wrapper around prepared statements. Davidad: want an
+interface more like ODBC, less like an ORM.What makes this different from a database: every node has Geolog in-process.Owen wants to think about
+annotations of which concepts to expose via FFI. perhaps have a separate language that has an "FFI" to the Geolog language, but is restricted in
+itself.TA1.3 interface. Geolog compiler as "dev dependency", e.g. as a vite plugin. Run over query definition and output bindings. Geolog engine gets
+compiled to wasm, reads the stuff the bindings are giving it. Argument about whether to ship the compiler to the browser. Since we want the compiler
+to output TypeScript bindings, need to then recompile the code that uses those bindings. For people experimenting with theories this will happen
+frequently; in a "production" use the theory will be fixed at compile time.Alex suggests we want a web app that will allow others to experiment with
+writing theories and using instances. Obsidian already have a basic version of this. Make this available as npm library and implement the web
+interface as a simply vite app to check if we've libraryified it sufficiently.Another goal for TA1.3: dispense with the interchange format for
+Petrinaut, use Geolog insteadPossible format for inserting data into a Geolog instance:theory Graph := sig
+V : Set
+E : V -> V -> Set
+end
+
+theory Main := sig
+Graphs : Set
+G : Graphs -> Graph
+g0 : Graphs
+end
+
+# Syntax for data of an instance
+
+[
+g0 : Graphs
+_ : [
+open G g0 # brings (G g0).V and (G g0).E into scope as V and E
+v0 : V
+v1 : V
+e : E v0 v1
+]
+]Make a Godbolt-style app to allow people to play with Geolog theoriesData structuresFast analyses: egraphs, making conjunctive queries fastvs.
+Automerge: one oplog, multiple materialised check-outs.Eventually, various optimised data structures would be good. As a first pass do something
+generic. Davidad: "tensors of ordered semirings"How often do we expect to have dense multidimensional tensors? Only really in neural networks. But a
+sparse tensor can implement the same trait. Build up a library of tensor implementations and the optimiser picks one. Updates to the tensors
+represented via binary space partitioning. Davidad thinks nobody has done this.For example, representing a graph as an adjacency matrix, even if the
+underlying representation is actually a set of edges represented as tuples. Davidad thinks the tensor representation is good for GPUs to operate on,
+streaming a dense representation. e.g. transitive closure as matrix representation. conjunctive queries are binary tensor contractions.This
+representation can't handle strings – have to intern them into a string store, e.g. a big append-only blob, and string values are represented as an
+offset into that big blob.Anything that is a B-tree index in an ordinary database (want to be able to do range queries on it) could be a dtype.Davidad
+argues that functions should be stored differently from tables. For example with a graph, instead of E: V -> V -> Set there would be functions head:
+E -> V and tail: E -> V.Geolog breaking down into documentsWe need a concept of a namespace: can define a variable "foo is an instance of theory bar";
+whoever owns the theory can update it, but the instances and theories are immutable.Want to be able to point at something in someone else's namespace;
+that would be Geolog's analogy to Automerge documents. A commit is located in a namespace and modifies things only within that namespace. Namespaces
+reference each other. Want something like cargo import?Each variable is its own document. You might have a collection of documents. Their content
+might not be disjoint though. Say you have a graph, copy it a couple of times, and then use it in different instances. Logically, each variable has a
+completely separate history.A namespace has a very simple version history – it's just a version bump to one of the variables it contains. Recognises
+that mutation is needed somewhere: at the namespace there are just variables.Sedimentree doesn't work well for bushy histories. Our workload might
+have long parallel branches (work from independent agents) but probably not very wide branchingPartial merging might be needed: given two instances of
+a theory that are individually valid, need a merge that produces the maximal subset that is still valid. This always exists (in the worst case, throw
+away one of the branches), but not necessarily unique or deterministically defined. Perhaps a cherry-pick/rebase. Perhaps an explicit merge commit
+that indicates which rows are excluded or added in the merged result. In general, doing a merge might be just as much work as doing the work on the
+individual branch, although many practical use cases (e.g. program logics) would allow easy merges (two individually proved statements can easily be
+combined).Program logic exampleString diagram for (a+b)*c, want to prove that if all the inputs are >=0 then the output is >=0. Have a proof rule for
+multiplication: if both inputs are >=0 then the output is >=0.Layers: SSA target (references to variables in the program); logical assertions about
+those variables; logical formulae; sequents; derivations (applications of inference rules). All of these are encoded as Geolog theories.What if
+assertions and atoms were meta-level props? More in line with traditional Datalog analyses. But if >= was an actual term constructor in the Geolog
+language, that would be less flexible/extensible. Geolog formulas are not first-class objects in Geolog: if we have a proof tree that points at
+objects that are not themselves in the theory, that gets messy. Davidad thinks it's cleaner to do a deep embedding than a shallow one.Why do you need
+a proof tree? If you're using the Geolog metalogic (propositions in Geolog), keeping track of propositions is just part of the native logic of Geolog.
+Say ">=0" was a built-in of type Wire -> Prop. The problem is that this puts the burden of proof search on the Geolog computation engine? Not
+necessarily. There's nothing stopping you from writing down the proof rules applied – in fact, that's what initial models are.There are bunch of term
+constructors that produce proofs, and we can just serialise them. Far more compact than the whole proof tree that stores the context at each node.
+Still constrained to geometric logic though, which arbitrary proof trees would not be.Want a concept of a non-strict axioms: allows data violating the
+axioms to be committed, but with a big red flag saying that the proof is incomplete. Allow sorry as a constructor, and a no-sorries predicate that's a
+non-strict axiom.What kinds of logics do we want to use that don't embed nicely into geometric logic? James suggested a separation logic that came out
+a few years ago, which has some substructure.The proof tree approach might not work for binders. Once you have quantifiers in the language inside, you
+need to switch to De Bruijn indices or suchlike? No, represent free variables as explicit references to the variable binding. Performing substitution
+is still a bunch of work.Problem: if you prove something about one term, and then you rewrite (e.g. substitute) into a similar term, it's a new term
+and the proofs don't apply to it. You can copy the proof trees but that will get huge. A lot of the juice we can get out of geometric logic is
+conjunctive is fast to check, but deep embedding throws a lot of that away.Owen thinks it would be better to extend the Geolog logic to do e.g.
+interesting things with real numbers, rather than try to deeply embed a theory that is as annoying as pen and paper. Davidad thinks we should try it
+out; it's a selling point that it's sufficient to do a deep embedding of any logic, even if it's easier and more efficient to do a shallow embedding.
+Importing SMT certificates needs that flexibility, for example.Geolog might be best for things that are combinatorial. More natively combinatorial use
+cases would allow it to shine.There's Geolog the type theory, and the massively scalable infrastructure for solving problems. On the latter, can we
+beat Soufflé at their own game? They must have some good benchmarks.Try cryptography as a use case? e.g. can we prove an elliptic curve implementation
+correct? See Martin's X25519 tutorial, verified F* implementation in ValeCryptActionsAlex, George: Geolog compiler on npm (GHC-compiled Wasm) → get
+Godbolt-style web interface goingOwen: design a language for people to write queries and insert data, putting it through loweringJames: new version of
+lowering that can handle inductive thingssomeone should write a compiler that turns a simple toy imperative language into a Geolog instanceit would be
+nice to have one frontend – Obsidian should write TypeScript? Cale happy to write bits of Haskell that generate TypeScriptVincent, Alex, Martin –
+physical data storage, disk and network wire formatsMartin, Leo – DBSP experiments