From 01a9aa167e1ec8490c72e36fc1bce4ac008ede74 Mon Sep 17 00:00:00 2001 From: Hassan Abedi Date: Thu, 7 May 2026 10:22:45 +0200 Subject: [PATCH] Add a suplamantay note file on CRDTs and Datalog --- dbsp/002-why-crdts-as-queries.md | 317 +++++++++++++++++++++++++++++++ external/001_query_engine.md | 162 ++++++++++++++++ 2 files changed, 479 insertions(+) create mode 100644 dbsp/002-why-crdts-as-queries.md create mode 100644 external/001_query_engine.md diff --git a/dbsp/002-why-crdts-as-queries.md b/dbsp/002-why-crdts-as-queries.md new file mode 100644 index 0000000..bd0ff5c --- /dev/null +++ b/dbsp/002-why-crdts-as-queries.md @@ -0,0 +1,317 @@ +# Why CRDTs as Queries + +A coherent reading note on the idea of defining replicated data structures as deterministic queries over immutable operations. + +--- + +## Starting Point + +The basic problem behind CRDTs is easy to state and hard to implement well. Several replicas hold copies of the same logical data. Each replica should +be able to accept writes locally, including while offline or disconnected from the other replicas. Later, when replicas exchange information, they +should converge to the same state. + +This is not the same problem as ordinary database replication. In a conventional primary-replica database, a single authority can decide the order of +writes. If two users write at the same time, the system can serialize those writes through a leader, a lock, or a consensus protocol. That gives the +database a single history. CRDTs are designed for environments where that coordination is unavailable, undesirable, or too expensive. A user should +still be able to write locally, even if there is no reachable leader. + +The price is that the system must handle concurrency after the fact. Two replicas may both accept writes that neither knew about at the time. When +those writes later meet, the system must define what the merged state means. A CRDT is a data structure whose merge behavior is designed so that all +replicas eventually compute the same state. + +--- + +## The Traditional Burden + +In a hand-written CRDT, the implementer writes an algorithm whose operations are safe under concurrency. For an operation-based CRDT, that often means +concurrent operations must commute: applying operation `a` and then operation `b` must lead to the same logical state as applying `b` and then `a`, at +least when `a` and `b` are concurrent. + +This is manageable for simple structures such as grow-only sets. It becomes more subtle for registers, maps, lists, trees, undo and redo, and nested +documents. Ordered lists are a good example. A list CRDT must not only decide whether an element exists, but also where it appears. Concurrent +insertions at the same position must be ordered deterministically. Deletions must not destroy information that later or concurrent insertions might +reference. These details produce the familiar machinery of operation identifiers, tombstones, causal dependencies, and tie-breaking rules. + +The programmer therefore has two jobs. First, they must design the logical behavior of the data type. Second, they must implement it in a way that +preserves convergence under every possible delivery order. The second job is where many mistakes hide. + +--- + +## Query-Based Turn + +The query-based approach reframes the problem. Instead of treating the CRDT as mutable state plus a merge algorithm, it treats the CRDT as a derived +view over an immutable operation log. + +The replica stores operations as facts: + +```text +set(replica_id, counter, key, value) +pred(from_replica_id, from_counter, to_replica_id, to_counter) +insert(replica_id, counter, parent_replica_id, parent_counter, value) +remove(replica_id, counter) +``` + +The visible state is not the primary stored object. The visible state is the result of a query over those facts. + +This gives a clean convergence story. If two replicas have the same operation facts, and they evaluate the same deterministic query, they must compute +the same result. The query does not depend on the order in which operations arrived. Arrival order is an implementation detail. The logical input is a +set or multiset of immutable facts. + +That shift is powerful because it moves convergence reasoning into the structure of the computation. The developer still has to define the intended +semantics, but they are no longer hand-coding every step of the merge algorithm. + +--- + +## Why Datalog Fits + +Datalog is a natural language for this style because it is built around facts and derived facts. A Datalog program says which new facts follow from +existing facts. The execution model is close to the mental model of derived views. + +For example, a multi-value register key-value store can be described by storing `set` operations and causal predecessor edges. A value is visible if +its set operation has not been overwritten by a causally later operation. In Datalog-like notation: + +```text +overwritten(RepId, Ctr) :- + pred(RepId, Ctr, _, _). + +mvrStore(Key, Value) :- + set(RepId, Ctr, Key, Value), + not overwritten(RepId, Ctr). +``` + +This is small, but it captures an important semantic choice. The query does not pick one winner among concurrent values. It filters out values that +have been causally superseded. If two values are concurrent, neither overwrites the other, so both remain visible. + +Datalog also handles recursion directly. That matters because causal histories and list structures are graph-shaped. Asking whether an operation is +reachable from a root, whether a dependency chain is complete, or what the next visible list element is can require recursive rules. + +The query language is restricted enough to make evaluation well-defined, but expressive enough to describe useful replicated structures. + +--- + +## Causality as Data + +Causality is the difference between "this write came after that write" and "these writes were independent." CRDTs need this distinction because +overwriting should usually remove only causally prior values, not concurrent values. + +The operation identifier is usually a pair: + +```text +(replica_id, counter) +``` + +The replica id identifies where the operation came from. The counter is a local logical clock. Together, they make operation identifiers unique. They +also provide a deterministic tie-breaker when the data type needs a total order among concurrent operations. + +Causal dependencies can be represented as edges: + +```text +pred(from_replica_id, from_counter, to_replica_id, to_counter) +``` + +This says the `to` operation depends on the `from` operation. The dependency graph is then ordinary relational data. A query can derive roots, leaves, +overwritten operations, causally ready operations, and visible values. + +There is an important design choice here. If the network or runtime guarantees causal delivery, the query can be simpler. If operations may arrive out +of order, the query must avoid exposing operations whose dependencies have not arrived. That means causal readiness becomes part of the query. + +--- + +## Out-of-Order Delivery + +Out-of-order delivery is common in distributed systems. A replica might receive an operation before receiving the operation it depends on. If the +system exposes the later operation too early, it can show a state that is not valid under the intended causal semantics. + +A causal-readiness query guards against this. It derives which operations can be safely considered visible because their dependency chain is present. +Conceptually, the query walks the causal graph from roots toward leaves and marks operations as ready only when the necessary predecessors are +available. + +This improves correctness in less controlled networks, but it adds cost. Graph traversal is recursive. Recursive incremental computation is possible, +but it is not free. If the query repeatedly walks long causal chains, performance may grow with the depth of the history. + +This is one of the central engineering lessons. Declarative correctness is not the same as automatic efficiency. The query may be compact and +semantically clear, while still requiring careful optimization. + +--- + +## Lists Are the Hard Case + +A key-value register demonstrates the idea, but an ordered list shows why the idea is interesting. + +In a collaborative text editor, users insert and delete characters. If two users insert at the same position concurrently, the final document must +place both insertions somewhere, and all replicas must choose the same order. The data type cannot rely on local array indexes because indexes shift +as edits arrive. + +A common CRDT solution is to give every inserted element a stable identifier. An insertion does not say "put this character at index 12." It says "put +this character after element `(r, c)`." The insert operations form a tree: + +```text +insert(replica_id, counter, parent_replica_id, parent_counter, value) +``` + +The sentinel root represents the beginning of the list. Children of a node are insertions that targeted that node as their parent. Concurrent +insertions after the same parent become siblings. Siblings are ordered deterministically by operation identifier. The visible list is then obtained by +traversing the tree in a deterministic order. + +Deletion is also subtle. If an element is deleted, it often cannot be physically removed from the structural history, because later or concurrent +operations may refer to it as a parent. The system keeps a tombstone: the element remains as a reference point, but the visible list skips its value. + +In query terms, list behavior becomes a set of derived relations: first child, next sibling, next element, visible element, and next visible element. +Datalog can express those relations directly. The query is longer than the register example, but it is still a declarative description of the list +semantics. + +--- + +## Incremental View Maintenance + +If CRDT state is a query over all operations, the obvious worry is cost. The operation set only grows. A naive implementation would recompute the +entire query result every time a new operation arrives. + +Incremental view maintenance is the response. The engine maintains the result of a query as inputs change. When a new operation arrives, the engine +computes the change to the output rather than recomputing the whole output. + +For a replicated application, the desired runtime shape is: + +```text +new operation facts +-> incremental query update +-> visible state changes +-> application update +``` + +This shape is especially attractive for user interfaces. If the query engine emits deltas, the application can update only the affected views. It is +also attractive for local-first systems because the same machinery can process local writes, remote writes, and startup replay. + +DBSP is relevant here because it provides a formal and practical model for incremental computation over changing relations. Relational operators are +lifted into a streaming setting where inputs and outputs evolve over time. + +--- + +## Hydration and Warm Updates + +There are two different performance situations to keep separate. + +Hydration is startup. The application already has a stored operation history, but the query engine must rebuild its internal operator state. It may +need to parse the query, build the execution plan, feed in the existing facts, and produce the current state. Hydration measures how long it takes +before the application can show the document or database contents after opening. + +Warm update processing is the normal running mode after hydration. The query engine already has internal state. A small batch of new operations +arrives. The engine only needs to update the maintained result. + +A design can perform acceptably in warm updates but poorly during hydration, or the other way around. For CRDT-backed applications, both matter. A +collaborative editor must feel responsive while editing, but it must also open large documents without a long pause. + +This distinction also suggests possible hybrid strategies. A system might use a batch-oriented computation for startup and then switch to incremental +maintenance. Or it might persist internal operator state so startup does not require replaying the entire operation history. + +--- + +## Relational Intermediate Representation + +A Datalog program is convenient for users, but the execution engine usually wants a lower-level representation. A common design is to translate +Datalog into a relational intermediate representation. + +The relational IR can include operators such as: + +- projection +- selection +- join +- antijoin +- union +- difference +- distinct +- fixed-point iteration + +This is useful for two reasons. First, relational algebra is a good target for optimization. The engine can push down filters, remove unused fields, +combine projections, and choose join strategies. Second, the IR separates the frontend language from the execution backend. Datalog is one possible +frontend. DBSP is one possible incremental backend. + +That separation matters for research and engineering. If the backend changes from DBSP to another incremental framework, the Datalog frontend does not +have to be redesigned. If a SQL-like frontend is added later, it can target the same IR. + +--- + +## Where the Costs Hide + +The query-based approach simplifies some reasoning, but it does not erase hard systems problems. + +Negation is one source of care. Datalog with arbitrary negation can have unclear or unstable semantics. Stratified negation restricts programs so +negative dependencies do not form problematic cycles. This keeps evaluation understandable, but it limits what can be expressed directly. + +Recursion is another source of cost. Recursive rules are needed for graph reachability, transitive closure, causal readiness, and list traversal. +Incremental recursion can still be expensive when each update affects a long chain or a large region of the dependency graph. + +Join planning matters as well. Datalog rules often translate into joins. Bad join order can create large intermediate relations. In a continuously +maintained query, changing the plan later may be harder than changing it for a one-shot query because the operators hold state. + +Storage growth is also unresolved. The clean convergence story assumes a monotonically growing operation set. Real applications cannot always keep +every operation forever, especially on small devices. Compaction must preserve enough information for future queries and future synchronization. + +These are not arguments against the approach. They are the places where the approach becomes a database systems problem rather than only a +programming-language idea. + +--- + +## What This Approach Buys + +The first benefit is conceptual. The CRDT is specified as a query. The implementation has a clearer boundary between logical behavior and physical +execution. That is the same separation that made relational databases powerful: the user specifies what result should exist, and the engine decides +how to maintain it. + +The second benefit is extensibility. A fixed CRDT library exposes a fixed set of data types. A query-based system could let application developers +define custom replicated structures, provided they stay within the safe fragment of the language. + +The third benefit is a shared interface. Ordinary application state, derived views, and replicated state can all look like queries. This could reduce +the number of special-purpose layers in local-first applications. + +The fourth benefit is optimization headroom. If CRDTs are expressed through a query plan, improvements to the query engine can improve many CRDT +definitions without changing their logical definitions. + +--- + +## What Remains Open + +Several questions remain open before this style can be treated as a production design. + +Can enough useful CRDTs be expressed in Datalog with stratified negation? Registers and lists are promising examples, but nested documents, moves in +trees, undo and redo, and rich JSON-like structures are harder. + +Can incremental evaluation make the performance competitive with hand-written CRDTs? A hand-written CRDT can exploit structure-specific shortcuts. A +query engine needs optimization to avoid paying too much for generality. + +Can operation histories be compacted safely? Append-only facts are clean, but unbounded growth is not acceptable for every application. + +Can the system provide good error messages and type checking? A query language for application developers needs more than a working parser and +runtime. + +Can causal readiness be optimized around the common case? Most new operations in a live application are likely close to current causal heads, but a +naive recursive query may still traverse from roots. + +These questions define the practical research agenda. + +--- + +## Reading Frame + +The best way to read this line of work is as a bridge between three areas. + +From CRDTs, it takes the goal of coordination-free replicated state. + +From Datalog, it takes declarative rules, recursion, and deterministic derivation over facts. + +From incremental query engines, it takes the ability to maintain derived state as input changes. + +The slogan is: + +```text +replicated data structure = materialized query over immutable operations +``` + +That slogan is not the whole system, but it is the core mental model. The data structure is no longer only an object with methods. It is a maintained +view. The operation log is the base data. The query is the semantics. The incremental engine is the execution strategy. + +--- + +## Changelog + +* **May 7, 2026** -- First version created. diff --git a/external/001_query_engine.md b/external/001_query_engine.md new file mode 100644 index 0000000..138522d --- /dev/null +++ b/external/001_query_engine.md @@ -0,0 +1,162 @@ +This document is copied from https://git.sgai.uk/creators/geolog/-/wikis/Geolog%20storage%20engine%20meeting%202026-04-21 on May 7, 2026. + +Discussion pointsMaterialising the current state for efficient constraint checking and efficient queryingSpecial-casing certain types of theories, +such as hash consing for syntax trees, or arrays for linear ordersExposing an API that looks like the theory definitions, not like the IRExposing to +various programming languages via FFIDBSP for incremental constraint checking? Get Leo Stewen involved?Hooking into Soufflé for general querying / +testing program analysis use cases?IR for initial models + storage layer for it (egraph cache)Initial modelsoften infinite, and not data. how should +they show up in the op log?use an SSA or stack-based program where each instruction is a term constructor. both what you extract from an egraph and +can be interpreted into an egraph.imagine we're working in the free ring with generators all strings. we have a couple of term constructors: given a +string we can construct an element of the ring; and then we have 0, 1, multiplication, and addition. there's a term construction for terms in the free +ring. if a term tree has common subtrees, we want a DAG representation. simplest: topologically sort it, store it in an array with backwards indices. +that's essentially what SSA is: store instructions that may reference previous values.fundamental operation: given two EIDs, give me the EID of the +term that is given by applying this term constructor to those EIDs. given an egraph and an SSA-style program you can interpret the SSA-style +program.each term constructor may have a different number of arguments. separate table per term constructor would mean a lot of tables. maybe one +table per arity?think of this as a serialised form of an egraph? stores many different terms concurrently. a term is just an index into a giant +table.could have a defined term constructor. e.g. in a ring, could define a function of 4 arguments to be a big complicated polynomial. more efficient +to store it as a single operation than as a collection of operations. from the perspective of commits and storage, could do either way.might make +sense to sometimes put derived operations in the oplog, if they are compact to store.in the type theory, initial models are in general context. e.g. +transitive closure for a graph: for all graphs, give you an initial model, which is the transitive closure. In the IR, have a lazy monomorphisation: +declare a bunch of tables in the IR; some of the constraints on those tables mention certain initial models. those initial models are the transitive +closure of this graph, or the term trees on this set of variables. might imagine emitting auxiliary declarations in terms of the main tables of the +schema.initial models would not use geometric sequents, as they are too general. restricted to be a conjunction implying another conjunction. could +give this to a datalog/egglog engine, containing only relation constructors but not term constructors.IR is still a set of tables that are declared to +be an initial model; term constructors and what they are supposed to act upon.James's current IR is fully relational. Would want to preserve functions +in the IR? Functions have types.James: Several tables sharing a single rowId space?Transitive closure/DAG exampletheory Graph := sig +V: Set +E: V -> V -> Set +end + +def TransClosure (G*: Graph) := (init sig +r: G.V -> G.V -> Prop +_: r ?a ?a +_: r ?a ?b -> G.E ?b ?c -> r ?a ?c +end).r + +theory DAG := sig +G: Graph +tc := TransClosure G +_: tc ?a ?b -> tc ?b ?a -> ?a = ?b +endtranslates into IR:V: TABLE [] +E: TABLE [V, V] +r: FREETABLE [V, V] +refl: RULE[(v: V) -> r V V] +trans: RULE[...]You can't insert into a FREETABLE, it can only be computed using rules. Every element added to FREETABLE has a provenance: produced by +refl, or produced by trans, or produced by multiple rules and later shown to be equal.Say we want to preserve that an insertion preserves the +acyclicity of the graph.At any given moment, FREETABLE may not yet be fully computed. Either run refl and trans until completion (fixed point), or +turn it into a graph traversal.Props don't need to be stored. If we had r: G.V -> G.V -> Set instead of r: G.V -> G.V -> Prop that would represent +paths through the graph.While you can't insert into a free table, other tables could refer to elements of a free table. In that case, the derived free +table would have to materialised as a commit. This is where we could use an SSA-like construction: each element of the table can be derived through a +finite list of invocations of rules. These might have IDs that are just local to the commit, not globally unique? But if we want to refer to IDs in +another commit, just make them global. Use hash consing.Concurrently created terms could use different IDs to refer to the same term. Could imagine a +merging process where we extract all the terms we care about, run egraph over them, and then re-encode them as a sequence of operations with fresh +IDs.How do you compare elements of the free model across commits, if they have different IDs? Could walk backwards through the SSA tables for both +commits, pull out both into a shared egraph, and then compare them in the egraph. But when you extract a term, you get out the term you originally put +in.Storage engine should concern itself only with hash consing, not the egraph. But storage engine may need to deal with proofs of equality extracted +from the egraph, because those proofs might be expensive to discover, so we'd want to store it. The witness of that proof would be a bunch of rule +invocations, transitivity, congruence etc. (A truncated version of alifib?)There might be laws in the IR that use equality. Making equality just +another stored relation is perhaps not a good idea – make it a built-in concept of the storage engine that performs congruence and transitivity out of +the box?You can refer to equality via a set of rewrite rules that you have applied to the database. We have efficient representations of sets (e.g. +Merkle trees). If you want to record an equality between two terms, you have a hash that refers to a set of rule invocations; whenever you need to +materialise it, you run those rules in your egraph copy.You could have a table where one of the columns in the table is an equality proof? If Geolog +is used for storing witnesses of certain properties, e.g. this branch never gets called. Store this in the database as the sequence of rewrite rules +you need in order to prove that the branch condition equals false.Concept of computational depth: a complement to algorithmic complexity. Complexity +is how many bits you need to generate a thing, no matter how long it takes. Computational depth is, given you are constructing it from the least +amount possible, how long does it take? Even if things do not contain a lot of information, might want to store them if they contain a lot of +computational depth. In the context of Geolog, where the process of equality is a semi-decision procedure (not guaranteed to terminate), this means +the witnesses for equality need to be stored.Could make a content-addressed construction for term IDs, but this would take a lot more space.Query +executionFrom a type theory point of view: if T is a theory, then a query is just a term of type q : (M : T) -> Set (given a model M of a theory, give +me a set). Conjunctive queries are the easiest. for example, G : Graph -> [v : G.V, e : G.E v v] returns all the vertices with self-loops.Discussion +of DBSP. Owen had heard of Differential Dataflow but not DBSP.In Geolog-zeta, Davidad compiled every Geolog sequent into two queries, one for the +antecedent, one for the consequent. Compute a set for each, and check that one set is included in the other.DBSP needs to deal with positive queries ( +conjunctive and finitely disjunctive)How does this interact with chase? Mark rules as chased or not-chased. With chased, compute the derived facts. +With non-chased, it's an error if the result is not satisfied (checked after the chase is complete).Want some way of special-casing certain patterns. +There are more or less efficient algorithms for initial models. Some of these could be inlined into DBSP.This is analogous to the part of a SQL +database that produces a query plan. For example, choice between hash join or merge join. Similarly, we want to be able to see if an axiom has a +certain shape and compile it to an appropriate DBSP circuit.Could do an initial DBSP integration without initial models (no transitive closure, just +e.g. union of conjunctive queries). A program logic (deep embedding into Geolog) could probably be checked using this fragment? The axioms are +checking the well-formedness of the application of every rule, and the well-foundedness of the proof tree. Well-foundedness is the only non-local +thing being checked here. This would be a good use case to try: Hoare logic over imperative programs. Allow agents to suggest proof steps and get +feedback on which axioms are violated.Hooking up to Souffle is also a possibility. So far we've only looked at Geolog features that are also Datalog, +but soon we might go beyond that. But we could try exporting to textual Datalog and run it through Souffle as a benchmarking baseline. Or hook into +Souffle at ABI level.APIBesides the IR, another output of the compiler needs to be FFI bindings e.g. for JavaScript or Rust. Do Haskell first (using +compiler as a library)? Owen thinks Haskell is harder – easier to generate code. TypeScript is the best target for now.Alex suggests we use LLM to +generate ORM-like bindings from theory definition. Owen: it's less like ORM, more like a thin wrapper around prepared statements. Davidad: want an +interface more like ODBC, less like an ORM.What makes this different from a database: every node has Geolog in-process.Owen wants to think about +annotations of which concepts to expose via FFI. perhaps have a separate language that has an "FFI" to the Geolog language, but is restricted in +itself.TA1.3 interface. Geolog compiler as "dev dependency", e.g. as a vite plugin. Run over query definition and output bindings. Geolog engine gets +compiled to wasm, reads the stuff the bindings are giving it. Argument about whether to ship the compiler to the browser. Since we want the compiler +to output TypeScript bindings, need to then recompile the code that uses those bindings. For people experimenting with theories this will happen +frequently; in a "production" use the theory will be fixed at compile time.Alex suggests we want a web app that will allow others to experiment with +writing theories and using instances. Obsidian already have a basic version of this. Make this available as npm library and implement the web +interface as a simply vite app to check if we've libraryified it sufficiently.Another goal for TA1.3: dispense with the interchange format for +Petrinaut, use Geolog insteadPossible format for inserting data into a Geolog instance:theory Graph := sig +V : Set +E : V -> V -> Set +end + +theory Main := sig +Graphs : Set +G : Graphs -> Graph +g0 : Graphs +end + +# Syntax for data of an instance + +[ +g0 : Graphs +_ : [ +open G g0 # brings (G g0).V and (G g0).E into scope as V and E +v0 : V +v1 : V +e : E v0 v1 +] +]Make a Godbolt-style app to allow people to play with Geolog theoriesData structuresFast analyses: egraphs, making conjunctive queries fastvs. +Automerge: one oplog, multiple materialised check-outs.Eventually, various optimised data structures would be good. As a first pass do something +generic. Davidad: "tensors of ordered semirings"How often do we expect to have dense multidimensional tensors? Only really in neural networks. But a +sparse tensor can implement the same trait. Build up a library of tensor implementations and the optimiser picks one. Updates to the tensors +represented via binary space partitioning. Davidad thinks nobody has done this.For example, representing a graph as an adjacency matrix, even if the +underlying representation is actually a set of edges represented as tuples. Davidad thinks the tensor representation is good for GPUs to operate on, +streaming a dense representation. e.g. transitive closure as matrix representation. conjunctive queries are binary tensor contractions.This +representation can't handle strings – have to intern them into a string store, e.g. a big append-only blob, and string values are represented as an +offset into that big blob.Anything that is a B-tree index in an ordinary database (want to be able to do range queries on it) could be a dtype.Davidad +argues that functions should be stored differently from tables. For example with a graph, instead of E: V -> V -> Set there would be functions head: +E -> V and tail: E -> V.Geolog breaking down into documentsWe need a concept of a namespace: can define a variable "foo is an instance of theory bar"; +whoever owns the theory can update it, but the instances and theories are immutable.Want to be able to point at something in someone else's namespace; +that would be Geolog's analogy to Automerge documents. A commit is located in a namespace and modifies things only within that namespace. Namespaces +reference each other. Want something like cargo import?Each variable is its own document. You might have a collection of documents. Their content +might not be disjoint though. Say you have a graph, copy it a couple of times, and then use it in different instances. Logically, each variable has a +completely separate history.A namespace has a very simple version history – it's just a version bump to one of the variables it contains. Recognises +that mutation is needed somewhere: at the namespace there are just variables.Sedimentree doesn't work well for bushy histories. Our workload might +have long parallel branches (work from independent agents) but probably not very wide branchingPartial merging might be needed: given two instances of +a theory that are individually valid, need a merge that produces the maximal subset that is still valid. This always exists (in the worst case, throw +away one of the branches), but not necessarily unique or deterministically defined. Perhaps a cherry-pick/rebase. Perhaps an explicit merge commit +that indicates which rows are excluded or added in the merged result. In general, doing a merge might be just as much work as doing the work on the +individual branch, although many practical use cases (e.g. program logics) would allow easy merges (two individually proved statements can easily be +combined).Program logic exampleString diagram for (a+b)*c, want to prove that if all the inputs are >=0 then the output is >=0. Have a proof rule for +multiplication: if both inputs are >=0 then the output is >=0.Layers: SSA target (references to variables in the program); logical assertions about +those variables; logical formulae; sequents; derivations (applications of inference rules). All of these are encoded as Geolog theories.What if +assertions and atoms were meta-level props? More in line with traditional Datalog analyses. But if >= was an actual term constructor in the Geolog +language, that would be less flexible/extensible. Geolog formulas are not first-class objects in Geolog: if we have a proof tree that points at +objects that are not themselves in the theory, that gets messy. Davidad thinks it's cleaner to do a deep embedding than a shallow one.Why do you need +a proof tree? If you're using the Geolog metalogic (propositions in Geolog), keeping track of propositions is just part of the native logic of Geolog. +Say ">=0" was a built-in of type Wire -> Prop. The problem is that this puts the burden of proof search on the Geolog computation engine? Not +necessarily. There's nothing stopping you from writing down the proof rules applied – in fact, that's what initial models are.There are bunch of term +constructors that produce proofs, and we can just serialise them. Far more compact than the whole proof tree that stores the context at each node. +Still constrained to geometric logic though, which arbitrary proof trees would not be.Want a concept of a non-strict axioms: allows data violating the +axioms to be committed, but with a big red flag saying that the proof is incomplete. Allow sorry as a constructor, and a no-sorries predicate that's a +non-strict axiom.What kinds of logics do we want to use that don't embed nicely into geometric logic? James suggested a separation logic that came out +a few years ago, which has some substructure.The proof tree approach might not work for binders. Once you have quantifiers in the language inside, you +need to switch to De Bruijn indices or suchlike? No, represent free variables as explicit references to the variable binding. Performing substitution +is still a bunch of work.Problem: if you prove something about one term, and then you rewrite (e.g. substitute) into a similar term, it's a new term +and the proofs don't apply to it. You can copy the proof trees but that will get huge. A lot of the juice we can get out of geometric logic is +conjunctive is fast to check, but deep embedding throws a lot of that away.Owen thinks it would be better to extend the Geolog logic to do e.g. +interesting things with real numbers, rather than try to deeply embed a theory that is as annoying as pen and paper. Davidad thinks we should try it +out; it's a selling point that it's sufficient to do a deep embedding of any logic, even if it's easier and more efficient to do a shallow embedding. +Importing SMT certificates needs that flexibility, for example.Geolog might be best for things that are combinatorial. More natively combinatorial use +cases would allow it to shine.There's Geolog the type theory, and the massively scalable infrastructure for solving problems. On the latter, can we +beat Soufflé at their own game? They must have some good benchmarks.Try cryptography as a use case? e.g. can we prove an elliptic curve implementation +correct? See Martin's X25519 tutorial, verified F* implementation in ValeCryptActionsAlex, George: Geolog compiler on npm (GHC-compiled Wasm) → get +Godbolt-style web interface goingOwen: design a language for people to write queries and insert data, putting it through loweringJames: new version of +lowering that can handle inductive thingssomeone should write a compiler that turns a simple toy imperative language into a Geolog instanceit would be +nice to have one frontend – Obsidian should write TypeScript? Cale happy to write bits of Haskell that generate TypeScriptVincent, Alex, Martin – +physical data storage, disk and network wire formatsMartin, Leo – DBSP experiments