Add a supplementary note file on CRDTs and Datalog

Hassan Abedi 2026-05-07 10:22:45 +02:00
parent 35b6e8f43f
commit 01a9aa167e
2 changed files with 479 additions and 0 deletions

@ -0,0 +1,317 @@
# Why CRDTs as Queries
A reading note on the idea of defining replicated data structures as deterministic queries over immutable operations.
---
## Starting Point
The basic problem behind CRDTs is easy to state and hard to implement well. Several replicas hold copies of the same logical data. Each replica should
be able to accept writes locally, including while offline or disconnected from the other replicas. Later, when replicas exchange information, they
should converge to the same state.
This is not the same problem as ordinary database replication. In a conventional primary-replica database, a single authority can decide the order of
writes. If two users write at the same time, the system can serialize those writes through a leader, a lock, or a consensus protocol. That gives the
database a single history. CRDTs are designed for environments where that coordination is unavailable, undesirable, or too expensive. A user should
still be able to write locally, even if there is no reachable leader.
The price is that the system must handle concurrency after the fact. Two replicas may both accept writes that neither knew about at the time. When
those writes later meet, the system must define what the merged state means. A CRDT is a data structure whose merge behavior is designed so that all
replicas eventually compute the same state.
---
## The Traditional Burden
In a hand-written CRDT, the implementer writes an algorithm whose operations are safe under concurrency. For an operation-based CRDT, that often means
concurrent operations must commute: applying operation `a` and then operation `b` must lead to the same logical state as applying `b` and then `a`, at
least when `a` and `b` are concurrent.
This is manageable for simple structures such as grow-only sets. It becomes more subtle for registers, maps, lists, trees, undo and redo, and nested
documents. Ordered lists are a good example. A list CRDT must not only decide whether an element exists, but also where it appears. Concurrent
insertions at the same position must be ordered deterministically. Deletions must not destroy information that later or concurrent insertions might
reference. These details produce the familiar machinery of operation identifiers, tombstones, causal dependencies, and tie-breaking rules.
The programmer therefore has two jobs. First, they must design the logical behavior of the data type. Second, they must implement it in a way that
preserves convergence under every possible delivery order. The second job is where many mistakes hide.
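The commutativity requirement can be checked by brute force on tiny examples. The sketch below is only an illustration: the helper names (`apply_gset`, `apply_naive_register`, `converges`) are hypothetical, and the "naive register" is a deliberately broken design rather than a real CRDT.

```python
from itertools import permutations


def apply_gset(state, op):
    """Grow-only set: the only operation is ("add", element)."""
    kind, element = op
    assert kind == "add"
    return state | {element}


def apply_naive_register(state, op):
    """Naive register: ("write", value) simply replaces the state."""
    kind, value = op
    assert kind == "write"
    return value


def converges(apply, initial, ops):
    """True if every delivery order of `ops` yields the same final state."""
    finals = set()
    for order in permutations(ops):
        state = initial
        for op in order:
            state = apply(state, op)
        finals.add(state)
    return len(finals) == 1


# Adds commute, so every delivery order gives the same set.
gset_ok = converges(
    apply_gset, frozenset(), [("add", "x"), ("add", "y"), ("add", "z")]
)

# Blind overwrites do not commute: the last delivered write wins.
register_ok = converges(
    apply_naive_register, None, [("write", "a"), ("write", "b")]
)
```

Enumerating permutations only works for a handful of operations, which is part of why convergence bugs in hand-written CRDTs are hard to find by testing alone.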
---
## Query-Based Turn
The query-based approach reframes the problem. Instead of treating the CRDT as mutable state plus a merge algorithm, it treats the CRDT as a derived
view over an immutable operation log.
The replica stores operations as facts:
```text
set(replica_id, counter, key, value)
pred(from_replica_id, from_counter, to_replica_id, to_counter)
insert(replica_id, counter, parent_replica_id, parent_counter, value)
remove(replica_id, counter)
```
The visible state is not the primary stored object. The visible state is the result of a query over those facts.
This gives a clean convergence story. If two replicas have the same operation facts, and they evaluate the same deterministic query, they must compute
the same result. The query does not depend on the order in which operations arrived. Arrival order is an implementation detail. The logical input is a
set or multiset of immutable facts.
That shift is powerful because it moves convergence reasoning into the structure of the computation. The developer still has to define the intended
semantics, but they are no longer hand-coding every step of the merge algorithm.
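A small Python sketch of this convergence argument follows. It assumes a fact shape of `("set", replica_id, counter, key, value)` and uses a last-writer-wins query purely as a stand-in for "some deterministic query"; the `Replica` class and `query` function are hypothetical names, not part of any library.

```python
# Assumed fact shape for this sketch: ("set", replica_id, counter, key, value).


def query(facts):
    """Deterministic view over a set of facts: per key, keep the value
    written by the operation with the greatest (counter, replica_id)."""
    winners = {}
    for _tag, replica_id, counter, key, value in facts:
        op_order = (counter, replica_id)
        if key not in winners or op_order > winners[key][0]:
            winners[key] = (op_order, value)
    return {key: value for key, (_order, value) in winners.items()}


class Replica:
    """Stores operations as an unordered set of immutable facts."""

    def __init__(self):
        self.facts = set()

    def receive(self, fact):
        # Adding to a set is idempotent, so duplicate delivery is harmless.
        self.facts.add(fact)

    def state(self):
        return query(frozenset(self.facts))


ops = [
    ("set", "A", 1, "title", "draft"),
    ("set", "B", 1, "title", "final"),
    ("set", "A", 2, "owner", "alice"),
]

left, right = Replica(), Replica()
for fact in ops:
    left.receive(fact)
for fact in reversed(ops + ops[:1]):  # different order, one duplicate
    right.receive(fact)
```

Both replicas end up with the same fact set, so `left.state()` and `right.state()` agree even though the delivery order differed and one fact arrived twice.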
---
## Why Datalog Fits
Datalog is a natural language for this style because it is built around facts and derived facts. A Datalog program says which new facts follow from
existing facts. The execution model is close to the mental model of derived views.
For example, a multi-value register key-value store can be described by storing `set` operations and causal predecessor edges. A value is visible if
its set operation has not been overwritten by a causally later operation. In Datalog-like notation:
```text
overwritten(RepId, Ctr) :-
    pred(RepId, Ctr, _, _).
mvrStore(Key, Value) :-
    set(RepId, Ctr, Key, Value),
    not overwritten(RepId, Ctr).
```
This is small, but it captures an important semantic choice. The query does not pick one winner among concurrent values. It filters out values that
have been causally superseded. If two values are concurrent, neither overwrites the other, so both remain visible.
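The same two rules can be mirrored with Python set comprehensions. This is a sketch that assumes `set` facts are tuples `(replica_id, counter, key, value)` and `pred` facts are tuples `(from_replica_id, from_counter, to_replica_id, to_counter)`; the function name `mvr_store` is illustrative.

```python
def mvr_store(set_ops, pred_edges):
    """Evaluate the two rules above over in-memory relations."""
    # overwritten(RepId, Ctr) :- pred(RepId, Ctr, _, _).
    overwritten = {(rep, ctr) for (rep, ctr, _to_rep, _to_ctr) in pred_edges}
    # mvrStore(Key, Value) :- set(RepId, Ctr, Key, Value),
    #                         not overwritten(RepId, Ctr).
    return {
        (key, value)
        for (rep, ctr, key, value) in set_ops
        if (rep, ctr) not in overwritten
    }


# Replica A writes x = "red". B and C both see that write and then
# concurrently overwrite it with different values.
set_ops = {
    ("A", 1, "x", "red"),
    ("B", 1, "x", "green"),
    ("C", 1, "x", "blue"),
}
pred_edges = {
    ("A", 1, "B", 1),  # B's write depends on A's write
    ("A", 1, "C", 1),  # C's write depends on A's write
}

visible = mvr_store(set_ops, pred_edges)
```

Here `"red"` is filtered out because it has a causal successor, while `"green"` and `"blue"` both survive because neither depends on the other.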
Datalog also handles recursion directly. That matters because causal histories and list structures are graph-shaped. Asking whether an operation is
reachable from a root, whether a dependency chain is complete, or what the next visible list element is can require recursive rules.
The query language is restricted enough to make evaluation well-defined, but expressive enough to describe useful replicated structures.
---
## Causality as Data
Causality is the difference between "this write came after that write" and "these writes were independent." CRDTs need this distinction because
overwriting should usually remove only causally prior values, not concurrent values.
The operation identifier is usually a pair:
```text
(replica_id, counter)
```
The replica id identifies where the operation came from. The counter is a local logical clock. Together, they make operation identifiers unique. They
also provide a deterministic tie-breaker when the data type needs a total order among concurrent operations.
Causal dependencies can be represented as edges:
```text
pred(from_replica_id, from_counter, to_replica_id, to_counter)
```
This says the `to` operation depends on the `from` operation. The dependency graph is then ordinary relational data. A query can derive roots, leaves,
overwritten operations, causally ready operations, and visible values.
There is an important design choice here. If the network or runtime guarantees causal delivery, the query can be simpler. If operations may arrive out
of order, the query must avoid exposing operations whose dependencies have not arrived. That means causal readiness becomes part of the query.
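Treating the dependency graph as ordinary data means roots, leaves, and concurrency are short derivations. The sketch below assumes operation ids are `(replica_id, counter)` tuples and `pred` edges are 4-tuples as above; the helper names are hypothetical.

```python
def roots_and_leaves(ops, pred):
    """Roots have no predecessor edge; leaves have no successor edge."""
    has_predecessor = {(to_rep, to_ctr) for (_fr, _fc, to_rep, to_ctr) in pred}
    has_successor = {(fr_rep, fr_ctr) for (fr_rep, fr_ctr, _tr, _tc) in pred}
    return ops - has_predecessor, ops - has_successor


def ancestors(op, pred):
    """All operations that `op` transitively depends on."""
    parents = {}
    for fr_rep, fr_ctr, to_rep, to_ctr in pred:
        parents.setdefault((to_rep, to_ctr), set()).add((fr_rep, fr_ctr))
    seen, stack = set(), [op]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def concurrent(a, b, pred):
    """Two distinct operations are concurrent if neither depends on the other."""
    return a != b and a not in ancestors(b, pred) and b not in ancestors(a, pred)


ops = {("A", 1), ("B", 1), ("C", 1), ("B", 2)}
pred = {
    ("A", 1, "B", 1),
    ("A", 1, "C", 1),
    ("B", 1, "B", 2),
}
roots, leaves = roots_and_leaves(ops, pred)
```

In this history `("B", 1)` and `("C", 1)` are concurrent, while `("A", 1)` happened before everything else.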
---
## Out-of-Order Delivery
Out-of-order delivery is common in distributed systems. A replica might receive an operation before receiving the operation it depends on. If the
system exposes the later operation too early, it can show a state that is not valid under the intended causal semantics.
A causal-readiness query guards against this. It derives which operations can be safely considered visible because their dependency chain is present.
Conceptually, the query walks the causal graph from roots toward leaves and marks operations as ready only when the necessary predecessors are
available.
This improves correctness in less controlled networks, but it adds cost. Graph traversal is recursive. Recursive incremental computation is possible,
but it is not free. If the query repeatedly walks long causal chains, the cost of each update may grow with the depth of the history.
This is one of the central engineering lessons. Declarative correctness is not the same as automatic efficiency. The query may be compact and
semantically clear, while still requiring careful optimization.
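A naive version of the readiness check can be written as a fixed-point loop. This sketch assumes that `pred` edges travel with the dependent operation, so an edge may name a predecessor that has not arrived yet; `causally_ready` is a hypothetical name, and the loop is deliberately unoptimized.

```python
def causally_ready(arrived, pred):
    """Return the arrived operations whose whole dependency chain is present.

    arrived: set of (replica_id, counter) ids received so far.
    pred:    set of (from_rep, from_ctr, to_rep, to_ctr) dependency edges.
    """
    deps = {}
    for fr_rep, fr_ctr, to_rep, to_ctr in pred:
        deps.setdefault((to_rep, to_ctr), set()).add((fr_rep, fr_ctr))

    ready = set()
    changed = True
    while changed:  # iterate until no new operation becomes ready
        changed = False
        for op in arrived - ready:
            # An operation is ready once all of its predecessors are ready.
            if deps.get(op, set()) <= ready:
                ready.add(op)
                changed = True
    return ready


pred = {("A", 1, "A", 2), ("A", 2, "A", 3)}

# ("A", 3) arrived before ("A", 2): it must stay hidden for now.
before = causally_ready({("A", 1), ("A", 3)}, pred)

# Once the missing predecessor arrives, the whole chain becomes ready.
after = causally_ready({("A", 1), ("A", 2), ("A", 3)}, pred)
```

Each call recomputes readiness from the roots, which is exactly the cost pattern the paragraph above warns about.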
---
## Lists Are the Hard Case
A key-value register demonstrates the idea, but an ordered list shows why the idea is interesting.
In a collaborative text editor, users insert and delete characters. If two users insert at the same position concurrently, the final document must
place both insertions somewhere, and all replicas must choose the same order. The data type cannot rely on local array indexes because indexes shift
as edits arrive.
A common CRDT solution is to give every inserted element a stable identifier. An insertion does not say "put this character at index 12." It says "put
this character after element `(r, c)`." The insert operations form a tree:
```text
insert(replica_id, counter, parent_replica_id, parent_counter, value)
```
The sentinel root represents the beginning of the list. Children of a node are insertions that targeted that node as their parent. Concurrent
insertions after the same parent become siblings. Siblings are ordered deterministically by operation identifier. The visible list is then obtained by
traversing the tree in a deterministic order.
Deletion is also subtle. If an element is deleted, it often cannot be physically removed from the structural history, because later or concurrent
operations may refer to it as a parent. The system keeps a tombstone: the element remains as a reference point, but the visible list skips its value.
In query terms, list behavior becomes a set of derived relations: first child, next sibling, next element, visible element, and next visible element.
Datalog can express those relations directly. The query is longer than the register example, but it is still a declarative description of the list
semantics.
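The tree traversal can be sketched imperatively in Python. Several details here are assumptions rather than claims about any particular list CRDT: the sentinel is written `("root", 0)`, a `remove` fact is taken to name the removed element's id, and siblings are visited with the greatest `(counter, replica_id)` first.

```python
ROOT = ("root", 0)  # assumed sentinel id for the beginning of the list


def visible_list(inserts, removes):
    """Flatten the insertion tree into the visible sequence of values.

    inserts: set of (replica_id, counter, parent_rep, parent_ctr, value).
    removes: set of (replica_id, counter) ids of removed elements.
    """
    values = {(rep, ctr): value for rep, ctr, _pr, _pc, value in inserts}
    children = {}
    for rep, ctr, parent_rep, parent_ctr, _value in inserts:
        children.setdefault((parent_rep, parent_ctr), []).append((rep, ctr))

    result = []
    stack = [ROOT]
    while stack:  # depth-first, pre-order traversal
        node = stack.pop()
        # Tombstoned elements stay in the tree but contribute no value.
        if node != ROOT and node not in removes:
            result.append(values[node])
        # Push siblings in ascending (counter, replica_id) order so the
        # greatest id is popped, and therefore visited, first.
        siblings = sorted(children.get(node, []), key=lambda op: (op[1], op[0]))
        stack.extend(siblings)
    return result


inserts = {
    ("A", 1, "root", 0, "a"),
    ("A", 2, "A", 1, "c"),  # concurrent with B's insert after ("A", 1)
    ("B", 2, "A", 1, "b"),
}
```

With these facts, `visible_list(inserts, set())` places both concurrent insertions after `"a"` in a fixed order. If `("A", 2)` is later removed, an element inserted after it still finds its parent, because the tombstone remains in the tree.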
---
## Incremental View Maintenance
If CRDT state is a query over all operations, the obvious worry is cost. The operation set only grows. A naive implementation would recompute the
entire query result every time a new operation arrives.
Incremental view maintenance is the response. The engine maintains the result of a query as inputs change. When a new operation arrives, the engine
computes the change to the output rather than recomputing the whole output.
For a replicated application, the desired runtime shape is:
```text
new operation facts
-> incremental query update
-> visible state changes
-> application update
```
This shape is especially attractive for user interfaces. If the query engine emits deltas, the application can update only the affected views. It is
also attractive for local-first systems because the same machinery can process local writes, remote writes, and startup replay.
DBSP is relevant here because it provides a formal and practical model for incremental computation over changing relations. Relational operators are
lifted into a streaming setting where inputs and outputs evolve over time.
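To make the idea of "computing the change to the output" concrete, here is a hand-written incremental maintainer for a multi-value register view (a value is visible unless some other operation depends on its `set` operation). It is only a sketch: the class `IncrementalMvr` and its methods are hypothetical, the delta rules are written by hand for this one query, and none of this is DBSP's API. Output changes are reported as `(row, weight)` pairs with weight `+1` or `-1`.

```python
from collections import Counter


class IncrementalMvr:
    """Maintains the visible (key, value) rows as facts arrive one by one."""

    def __init__(self):
        self.sets = {}            # op id -> (key, value)
        self.overwritten = set()  # op ids that have a causal successor
        self.support = Counter()  # (key, value) -> number of live set ops

    def _adjust(self, row, change):
        before = self.support[row]
        after = before + change
        if after:
            self.support[row] = after
        else:
            self.support.pop(row, None)
        if before == 0 and after > 0:
            return [(row, +1)]  # row became visible
        if before > 0 and after == 0:
            return [(row, -1)]  # row stopped being visible
        return []

    def add_set(self, rep, ctr, key, value):
        op = (rep, ctr)
        if op in self.sets:
            return []  # duplicate delivery
        self.sets[op] = (key, value)
        if op in self.overwritten:
            return []  # its successor arrived first
        return self._adjust((key, value), +1)

    def add_pred(self, from_rep, from_ctr, to_rep, to_ctr):
        op = (from_rep, from_ctr)
        if op in self.overwritten:
            return []
        self.overwritten.add(op)
        if op not in self.sets:
            return []
        return self._adjust(self.sets[op], -1)


def apply_deltas(state, deltas):
    """Fold output deltas into a materialized set of visible rows."""
    for row, weight in deltas:
        if weight > 0:
            state.add(row)
        else:
            state.discard(row)
    return state
```

An application would call `add_set` or `add_pred` for each incoming fact and pass the returned deltas to its own update logic, here represented by `apply_deltas`.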
---
## Hydration and Warm Updates
There are two different performance situations to keep separate.
Hydration is startup. The application already has a stored operation history, but the query engine must rebuild its internal operator state. It may
need to parse the query, build the execution plan, feed in the existing facts, and produce the current state. Hydration measures how long it takes
before the application can show the document or database contents after opening.
Warm update processing is the normal running mode after hydration. The query engine already has internal state. A small batch of new operations
arrives. The engine only needs to update the maintained result.
A design can perform acceptably in warm updates but poorly during hydration, or the other way around. For CRDT-backed applications, both matter. A
collaborative editor must feel responsive while editing, but it must also open large documents without a long pause.
This distinction also suggests possible hybrid strategies. A system might use a batch-oriented computation for startup and then switch to incremental
maintenance. Or it might persist internal operator state so startup does not require replaying the entire operation history.
---
## Relational Intermediate Representation
A Datalog program is convenient for users, but the execution engine usually wants a lower-level representation. A common design is to translate
Datalog into a relational intermediate representation.
The relational IR can include operators such as:
- projection
- selection
- join
- antijoin
- union
- difference
- distinct
- fixed-point iteration
This is useful for two reasons. First, relational algebra is a good target for optimization. The engine can push down filters, remove unused fields,
combine projections, and choose join strategies. Second, the IR separates the frontend language from the execution backend. Datalog is one possible
frontend. DBSP is one possible incremental backend.
That separation matters for research and engineering. If the backend changes from DBSP to another incremental framework, the Datalog frontend does not
have to be redesigned. If a SQL-like frontend is added later, it can target the same IR.
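A toy version of such an IR fits in a few lines of Python. The node classes below (`Scan`, `Project`, `Antijoin`) and the `evaluate` function are hypothetical, cover only three of the operators listed above, and use a naive one-shot evaluator; the plan shown is one plausible translation of a multi-value register query, not the output of any real compiler.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Scan:
    relation: str  # name of a base relation in the database


@dataclass(frozen=True)
class Project:
    source: object
    columns: Tuple[int, ...]  # column positions to keep, in order


@dataclass(frozen=True)
class Antijoin:
    left: object
    right: object
    left_cols: Tuple[int, ...]   # key columns on the left
    right_cols: Tuple[int, ...]  # key columns on the right


def evaluate(node, db):
    """Naive batch evaluator: recompute each operator from scratch."""
    if isinstance(node, Scan):
        return set(db[node.relation])
    if isinstance(node, Project):
        rows = evaluate(node.source, db)
        return {tuple(row[i] for i in node.columns) for row in rows}
    if isinstance(node, Antijoin):
        blocked = {
            tuple(row[i] for i in node.right_cols)
            for row in evaluate(node.right, db)
        }
        return {
            row
            for row in evaluate(node.left, db)
            if tuple(row[i] for i in node.left_cols) not in blocked
        }
    raise TypeError(f"unknown IR node: {node!r}")


# Keep set operations with no causal successor, then project (key, value).
mvr_plan = Project(
    Antijoin(Scan("set"), Scan("pred"), left_cols=(0, 1), right_cols=(0, 1)),
    columns=(2, 3),
)

db = {
    "set": {("A", 1, "x", "red"), ("B", 1, "x", "green"), ("C", 1, "x", "blue")},
    "pred": {("A", 1, "B", 1), ("A", 1, "C", 1)},
}
```

Because the plan is plain data, a different backend could walk the same tree and build incremental operators instead of calling `evaluate`.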
---
## Where the Costs Hide
The query-based approach simplifies some reasoning, but it does not erase hard systems problems.
Negation is one area that needs care. Datalog with arbitrary negation can have unclear or unstable semantics. Stratified negation restricts programs so
negative dependencies do not form problematic cycles. This keeps evaluation understandable, but it limits what can be expressed directly.
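The stratification condition can be checked mechanically on the predicate dependency graph: a program is stratifiable when no negated dependency lies on a cycle. The sketch below assumes each rule has been summarized as `(head, body_predicate, negated)` triples; `is_stratifiable` is a hypothetical helper, not a function of any Datalog engine.

```python
def is_stratifiable(dependencies):
    """dependencies: iterable of (head, body_predicate, negated) triples."""
    dependencies = list(dependencies)
    graph = {}
    for head, body, _negated in dependencies:
        graph.setdefault(head, set()).add(body)

    def depends_on(start, target):
        """True if `start` is `target` or transitively depends on it."""
        seen, stack = set(), [start]
        while stack:
            current = stack.pop()
            if current == target:
                return True
            if current in seen:
                continue
            seen.add(current)
            stack.extend(graph.get(current, ()))
        return False

    # A negated edge head -> body sits on a cycle exactly when the body
    # predicate depends back on the head predicate.
    return not any(
        negated and depends_on(body, head)
        for head, body, negated in dependencies
    )


# Dependencies of a multi-value register query with one negation.
mvr_program = [
    ("overwritten", "pred", False),
    ("mvrStore", "set", False),
    ("mvrStore", "overwritten", True),
]

# win(X) :- move(X, Y), not win(Y).  Negation through recursion.
win_program = [
    ("win", "move", False),
    ("win", "win", True),
]
```

The register program passes the check because `overwritten` never depends on `mvrStore`; the `win` program fails because `win` is negated inside its own definition.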
Recursion is another source of cost. Recursive rules are needed for graph reachability, transitive closure, causal readiness, and list traversal.
Incremental recursion can still be expensive when each update affects a long chain or a large region of the dependency graph.
Join planning matters as well. Datalog rules often translate into joins. Bad join order can create large intermediate relations. In a continuously
maintained query, changing the plan later may be harder than changing it for a one-shot query because the operators hold state.
Storage growth is also unresolved. The clean convergence story assumes a monotonically growing operation set. Real applications cannot always keep
every operation forever, especially on small devices. Compaction must preserve enough information for future queries and future synchronization.
These are not arguments against the approach. They are the places where the approach becomes a database systems problem rather than only a
programming-language idea.
---
## What This Approach Buys
The first benefit is conceptual. The CRDT is specified as a query. The implementation has a clearer boundary between logical behavior and physical
execution. That is the same separation that made relational databases powerful: the user specifies what result should exist, and the engine decides
how to maintain it.
The second benefit is extensibility. A fixed CRDT library exposes a fixed set of data types. A query-based system could let application developers
define custom replicated structures, provided they stay within the safe fragment of the language.
The third benefit is a shared interface. Ordinary application state, derived views, and replicated state can all look like queries. This could reduce
the number of special-purpose layers in local-first applications.
The fourth benefit is optimization headroom. If CRDTs are expressed through a query plan, improvements to the query engine can improve many CRDT
definitions without changing their logical definitions.
---
## What Remains Open
Several questions remain open before this style can be treated as a production design.
Can enough useful CRDTs be expressed in Datalog with stratified negation? Registers and lists are promising examples, but nested documents, moves in
trees, undo and redo, and rich JSON-like structures are harder.
Can incremental evaluation make the performance competitive with hand-written CRDTs? A hand-written CRDT can exploit structure-specific shortcuts. A
query engine needs optimization to avoid paying too much for generality.
Can operation histories be compacted safely? Append-only facts are clean, but unbounded growth is not acceptable for every application.
Can the system provide good error messages and type checking? A query language for application developers needs more than a working parser and
runtime.
Can causal readiness be optimized around the common case? Most new operations in a live application are likely close to current causal heads, but a
naive recursive query may still traverse from roots.
These questions define the practical research agenda.
---
## Reading Frame
The best way to read this line of work is as a bridge between three areas.
From CRDTs, it takes the goal of coordination-free replicated state.
From Datalog, it takes declarative rules, recursion, and deterministic derivation over facts.
From incremental query engines, it takes the ability to maintain derived state as input changes.
The slogan is:
```text
replicated data structure = materialized query over immutable operations
```
That slogan is not the whole system, but it is the core mental model. The data structure is no longer only an object with methods. It is a maintained
view. The operation log is the base data. The query is the semantics. The incremental engine is the execution strategy.
---
## Changelog
* **May 7, 2026** -- First version created.

external/001_query_engine.md

@ -0,0 +1,162 @@
This document is copied from https://git.sgai.uk/creators/geolog/-/wikis/Geolog%20storage%20engine%20meeting%202026-04-21 on May 7, 2026.

## Discussion points
- Materialising the current state for efficient constraint checking and efficient querying
- Special-casing certain types of theories, such as hash consing for syntax trees, or arrays for linear orders
- Exposing an API that looks like the theory definitions, not like the IR
- Exposing to various programming languages via FFI
- DBSP for incremental constraint checking? Get Leo Stewen involved?
- Hooking into Soufflé for general querying / testing program analysis use cases?
- IR for initial models + storage layer for it (egraph cache)

## Initial models
- often infinite, and not data. how should they show up in the op log?
- use an SSA or stack-based program where each instruction is a term constructor. both what you extract from an egraph and can be
  interpreted into an egraph.
- imagine we're working in the free ring with generators all strings. we have a couple of term constructors: given a string we can
  construct an element of the ring; and then we have 0, 1, multiplication, and addition. there's a term construction for terms in the
  free ring. if a term tree has common subtrees, we want a DAG representation. simplest: topologically sort it, store it in an array
  with backwards indices. that's essentially what SSA is: store instructions that may reference previous values.
- fundamental operation: given two EIDs, give me the EID of the term that is given by applying this term constructor to those EIDs.
  given an egraph and an SSA-style program you can interpret the SSA-style program.
- each term constructor may have a different number of arguments. separate table per term constructor would mean a lot of tables.
  maybe one table per arity?
- think of this as a serialised form of an egraph? stores many different terms concurrently. a term is just an index into a giant
  table.
- could have a defined term constructor. e.g. in a ring, could define a function of 4 arguments to be a big complicated polynomial.
  more efficient to store it as a single operation than as a collection of operations. from the perspective of commits and storage,
  could do either way.
- might make sense to sometimes put derived operations in the oplog, if they are compact to store.
- in the type theory, initial models are in general context. e.g. transitive closure for a graph: for all graphs, give you an initial
  model, which is the transitive closure. In the IR, have a lazy monomorphisation: declare a bunch of tables in the IR; some of the
  constraints on those tables mention certain initial models. those initial models are the transitive closure of this graph, or the
  term trees on this set of variables. might imagine emitting auxiliary declarations in terms of the main tables of the schema.
- initial models would not use geometric sequents, as they are too general. restricted to be a conjunction implying another
  conjunction. could give this to a datalog/egglog engine, containing only relation constructors but not term constructors.
- IR is still a set of tables that are declared to be an initial model; term constructors and what they are supposed to act upon.
- James's current IR is fully relational. Would want to preserve functions in the IR? Functions have types.
- James: Several tables sharing a single rowId space?

## Transitive closure/DAG example

    theory Graph := sig
      V: Set
      E: V -> V -> Set
    end
    def TransClosure (G*: Graph) := (init sig
      r: G.V -> G.V -> Prop
      _: r ?a ?a
      _: r ?a ?b -> G.E ?b ?c -> r ?a ?c
    end).r
    theory DAG := sig
      G: Graph
      tc := TransClosure G
      _: tc ?a ?b -> tc ?b ?a -> ?a = ?b
    end

translates into IR:

    V: TABLE []
    E: TABLE [V, V]
    r: FREETABLE [V, V]
    refl: RULE[(v: V) -> r V V]
    trans: RULE[...]

- You can't insert into a FREETABLE, it can only be computed using rules. Every element added to FREETABLE has a provenance: produced
  by refl, or produced by trans, or produced by multiple rules and later shown to be equal.
- Say we want to ensure that an insertion preserves the acyclicity of the graph.
- At any given moment, FREETABLE may not yet be fully computed. Either run refl and trans until completion (fixed point), or turn it
  into a graph traversal.
- Props don't need to be stored. If we had r: G.V -> G.V -> Set instead of r: G.V -> G.V -> Prop that would represent paths through
  the graph.
- While you can't insert into a free table, other tables could refer to elements of a free table. In that case, the derived free
  table would have to be materialised as a commit. This is where we could use an SSA-like construction: each element of the table can
  be derived through a finite list of invocations of rules. These might have IDs that are just local to the commit, not globally
  unique? But if we want to refer to IDs in another commit, just make them global. Use hash consing.
- Concurrently created terms could use different IDs to refer to the same term. Could imagine a merging process where we extract all
  the terms we care about, run egraph over them, and then re-encode them as a sequence of operations with fresh IDs.
- How do you compare elements of the free model across commits, if they have different IDs? Could walk backwards through the SSA
  tables for both commits, pull out both into a shared egraph, and then compare them in the egraph. But when you extract a term, you
  get out the term you originally put in.
- Storage engine should concern itself only with hash consing, not the egraph. But storage engine may need to deal with proofs of
  equality extracted from the egraph, because those proofs might be expensive to discover, so we'd want to store them. The witness of
  that proof would be a bunch of rule invocations, transitivity, congruence etc. (A truncated version of alifib?)
- There might be laws in the IR that use equality. Making equality just another stored relation is perhaps not a good idea; make it a
  built-in concept of the storage engine that performs congruence and transitivity out of the box?
- You can refer to equality via a set of rewrite rules that you have applied to the database. We have efficient representations of
  sets (e.g. Merkle trees). If you want to record an equality between two terms, you have a hash that refers to a set of rule
  invocations; whenever you need to materialise it, you run those rules in your egraph copy.
- You could have a table where one of the columns in the table is an equality proof? If Geolog is used for storing witnesses of
  certain properties, e.g. this branch never gets called. Store this in the database as the sequence of rewrite rules you need in
  order to prove that the branch condition equals false.
- Concept of computational depth: a complement to algorithmic complexity. Complexity is how many bits you need to generate a thing,
  no matter how long it takes. Computational depth is, given you are constructing it from the least amount possible, how long does it
  take? Even if things do not contain a lot of information, might want to store them if they contain a lot of computational depth. In
  the context of Geolog, where the process of equality is a semi-decision procedure (not guaranteed to terminate), this means the
  witnesses for equality need to be stored.
- Could make a content-addressed construction for term IDs, but this would take a lot more space.

## Query execution
- From a type theory point of view: if T is a theory, then a query is just a term of type q : (M : T) -> Set (given a model M of a
  theory, give me a set). Conjunctive queries are the easiest. for example, G : Graph -> [v : G.V, e : G.E v v] returns all the
  vertices with self-loops.
- Discussion of DBSP. Owen had heard of Differential Dataflow but not DBSP.
- In Geolog-zeta, Davidad compiled every Geolog sequent into two queries, one for the antecedent, one for the consequent. Compute a
  set for each, and check that one set is included in the other.
- DBSP needs to deal with positive queries (conjunctive and finitely disjunctive)
- How does this interact with chase? Mark rules as chased or not-chased. With chased, compute the derived facts. With non-chased,
  it's an error if the result is not satisfied (checked after the chase is complete).
- Want some way of special-casing certain patterns.
  There are more or less efficient algorithms for initial models. Some of these could be inlined into DBSP.
- This is analogous to the part of a SQL database that produces a query plan. For example, choice between hash join or merge join.
  Similarly, we want to be able to see if an axiom has a certain shape and compile it to an appropriate DBSP circuit.
- Could do an initial DBSP integration without initial models (no transitive closure, just e.g. union of conjunctive queries). A
  program logic (deep embedding into Geolog) could probably be checked using this fragment? The axioms are checking the
  well-formedness of the application of every rule, and the well-foundedness of the proof tree. Well-foundedness is the only
  non-local thing being checked here. This would be a good use case to try: Hoare logic over imperative programs. Allow agents to
  suggest proof steps and get feedback on which axioms are violated.
- Hooking up to Souffle is also a possibility. So far we've only looked at Geolog features that are also Datalog, but soon we might
  go beyond that. But we could try exporting to textual Datalog and run it through Souffle as a benchmarking baseline. Or hook into
  Souffle at ABI level.

## API
- Besides the IR, another output of the compiler needs to be FFI bindings e.g. for JavaScript or Rust. Do Haskell first (using
  compiler as a library)? Owen thinks Haskell is harder easier to generate code. TypeScript is the best target for now.
- Alex suggests we use LLM to generate ORM-like bindings from theory definition. Owen: it's less like ORM, more like a thin wrapper
  around prepared statements. Davidad: want an interface more like ODBC, less like an ORM.
- What makes this different from a database: every node has Geolog in-process.
- Owen wants to think about annotations of which concepts to expose via FFI. perhaps have a separate language that has an "FFI" to
  the Geolog language, but is restricted in itself.
- TA1.3 interface. Geolog compiler as "dev dependency", e.g. as a vite plugin. Run over query definition and output bindings. Geolog
  engine gets compiled to wasm, reads the stuff the bindings are giving it. Argument about whether to ship the compiler to the
  browser. Since we want the compiler to output TypeScript bindings, need to then recompile the code that uses those bindings. For
  people experimenting with theories this will happen frequently; in a "production" use the theory will be fixed at compile time.
- Alex suggests we want a web app that will allow others to experiment with writing theories and using instances. Obsidian already
  have a basic version of this. Make this available as npm library and implement the web interface as a simple vite app to check if
  we've libraryified it sufficiently.
- Another goal for TA1.3: dispense with the interchange format for Petrinaut, use Geolog instead

Possible format for inserting data into a Geolog instance:

    theory Graph := sig
      V : Set
      E : V -> V -> Set
    end
    theory Main := sig
      Graphs : Set
      G : Graphs -> Graph
      g0 : Graphs
    end
    # Syntax for data of an instance
    [
      g0 : Graphs
      _ : [
        open G g0 # brings (G g0).V and (G g0).E into scope as V and E
        v0 : V
        v1 : V
        e : E v0 v1
      ]
    ]

- Make a Godbolt-style app to allow people to play with Geolog theories

## Data structures
- Fast analyses: egraphs, making conjunctive queries fast
- vs. Automerge: one oplog, multiple materialised check-outs.
- Eventually, various optimised data structures would be good. As a first pass do something generic. Davidad: "tensors of ordered
  semirings"
- How often do we expect to have dense multidimensional tensors? Only really in neural networks. But a sparse tensor can implement
  the same trait. Build up a library of tensor implementations and the optimiser picks one. Updates to the tensors represented via
  binary space partitioning. Davidad thinks nobody has done this.
- For example, representing a graph as an adjacency matrix, even if the underlying representation is actually a set of edges
  represented as tuples. Davidad thinks the tensor representation is good for GPUs to operate on, streaming a dense representation.
  e.g. transitive closure as matrix representation. conjunctive queries are binary tensor contractions.
- This representation can't handle strings; have to intern them into a string store, e.g. a big append-only blob, and string values
  are represented as an offset into that big blob.
- Anything that is a B-tree index in an ordinary database (want to be able to do range queries on it) could be a dtype.
- Davidad argues that functions should be stored differently from tables. For example with a graph, instead of E: V -> V -> Set
  there would be functions head: E -> V and tail: E -> V.

## Geolog breaking down into documents
- We need a concept of a namespace: can define a variable "foo is an instance of theory bar"; whoever owns the theory can update it,
  but the instances and theories are immutable.
- Want to be able to point at something in someone else's namespace; that would be Geolog's analogy to Automerge documents. A commit
  is located in a namespace and modifies things only within that namespace. Namespaces reference each other. Want something like
  cargo import?
- Each variable is its own document. You might have a collection of documents. Their content might not be disjoint though. Say you
  have a graph, copy it a couple of times, and then use it in different instances. Logically, each variable has a completely
  separate history.
- A namespace has a very simple version history: it's just a version bump to one of the variables it contains. Recognises that
  mutation is needed somewhere: at the namespace there are just variables.
- Sedimentree doesn't work well for bushy histories. Our workload might have long parallel branches (work from independent agents)
  but probably not very wide branching.
- Partial merging might be needed: given two instances of a theory that are individually valid, need a merge that produces the
  maximal subset that is still valid. This always exists (in the worst case, throw away one of the branches), but not necessarily
  unique or deterministically defined. Perhaps a cherry-pick/rebase. Perhaps an explicit merge commit that indicates which rows are
  excluded or added in the merged result. In general, doing a merge might be just as much work as doing the work on the individual
  branch, although many practical use cases (e.g. program logics) would allow easy merges (two individually proved statements can
  easily be combined).

## Program logic example
- String diagram for (a+b)*c, want to prove that if all the inputs are >=0 then the output is >=0. Have a proof rule for
  multiplication: if both inputs are >=0 then the output is >=0.
- Layers: SSA target (references to variables in the program); logical assertions about those variables; logical formulae; sequents;
  derivations (applications of inference rules). All of these are encoded as Geolog theories.
- What if assertions and atoms were meta-level props? More in line with traditional Datalog analyses. But if >= was an actual term
  constructor in the Geolog language, that would be less flexible/extensible. Geolog formulas are not first-class objects in Geolog:
  if we have a proof tree that points at objects that are not themselves in the theory, that gets messy. Davidad thinks it's cleaner
  to do a deep embedding than a shallow one.
- Why do you need a proof tree? If you're using the Geolog metalogic (propositions in Geolog), keeping track of propositions is just
  part of the native logic of Geolog. Say ">=0" was a built-in of type Wire -> Prop. The problem is that this puts the burden of
  proof search on the Geolog computation engine? Not necessarily. There's nothing stopping you from writing down the proof rules
  applied; in fact, that's what initial models are.
- There are a bunch of term constructors that produce proofs, and we can just serialise them. Far more compact than the whole proof
  tree that stores the context at each node.
  Still constrained to geometric logic though, which arbitrary proof trees would not be.
- Want a concept of non-strict axioms: allows data violating the axioms to be committed, but with a big red flag saying that the
  proof is incomplete. Allow sorry as a constructor, and a no-sorries predicate that's a non-strict axiom.
- What kinds of logics do we want to use that don't embed nicely into geometric logic? James suggested a separation logic that came
  out a few years ago, which has some substructure.
- The proof tree approach might not work for binders. Once you have quantifiers in the language inside, you need to switch to De
  Bruijn indices or suchlike? No, represent free variables as explicit references to the variable binding. Performing substitution
  is still a bunch of work.
- Problem: if you prove something about one term, and then you rewrite (e.g. substitute) into a similar term, it's a new term and the
  proofs don't apply to it. You can copy the proof trees but that will get huge. A lot of the juice we can get out of geometric
  logic is that conjunctive is fast to check, but deep embedding throws a lot of that away.
- Owen thinks it would be better to extend the Geolog logic to do e.g. interesting things with real numbers, rather than try to
  deeply embed a theory that is as annoying as pen and paper. Davidad thinks we should try it out; it's a selling point that it's
  sufficient to do a deep embedding of any logic, even if it's easier and more efficient to do a shallow embedding. Importing SMT
  certificates needs that flexibility, for example.
- Geolog might be best for things that are combinatorial. More natively combinatorial use cases would allow it to shine.
- There's Geolog the type theory, and the massively scalable infrastructure for solving problems. On the latter, can we beat Soufflé
  at their own game? They must have some good benchmarks.
- Try cryptography as a use case? e.g. can we prove an elliptic curve implementation correct? See Martin's X25519 tutorial, verified
  F* implementation in ValeCrypt

## Actions
- Alex, George: Geolog compiler on npm (GHC-compiled Wasm) → get Godbolt-style web interface going
- Owen: design a language for people to write queries and insert data, putting it through lowering
- James: new version of lowering that can handle inductive things
- someone should write a compiler that turns a simple toy imperative language into a Geolog instance
- it would be nice to have one frontend; Obsidian should write TypeScript? Cale happy to write bits of Haskell that generate
  TypeScript
- Vincent, Alex, Martin: physical data storage, disk and network wire formats
- Martin, Leo: DBSP experiments