Add a suplamantay note file on CRDTs and Datalog
This commit is contained in:
parent
35b6e8f43f
commit
01a9aa167e
317
dbsp/002-why-crdts-as-queries.md
Normal file
317
dbsp/002-why-crdts-as-queries.md
Normal file
@ -0,0 +1,317 @@
|
||||
# Why CRDTs as Queries
|
||||
|
||||
A coherent reading note on the idea of defining replicated data structures as deterministic queries over immutable operations.
|
||||
|
||||
---
|
||||
|
||||
## Starting Point
|
||||
|
||||
The basic problem behind CRDTs is easy to state and hard to implement well. Several replicas hold copies of the same logical data. Each replica should
|
||||
be able to accept writes locally, including while offline or disconnected from the other replicas. Later, when replicas exchange information, they
|
||||
should converge to the same state.
|
||||
|
||||
This is not the same problem as ordinary database replication. In a conventional primary-replica database, a single authority can decide the order of
|
||||
writes. If two users write at the same time, the system can serialize those writes through a leader, a lock, or a consensus protocol. That gives the
|
||||
database a single history. CRDTs are designed for environments where that coordination is unavailable, undesirable, or too expensive. A user should
|
||||
still be able to write locally, even if there is no reachable leader.
|
||||
|
||||
The price is that the system must handle concurrency after the fact. Two replicas may both accept writes that neither knew about at the time. When
|
||||
those writes later meet, the system must define what the merged state means. A CRDT is a data structure whose merge behavior is designed so that all
|
||||
replicas eventually compute the same state.
|
||||
|
||||
---
|
||||
|
||||
## The Traditional Burden
|
||||
|
||||
In a hand-written CRDT, the implementer writes an algorithm whose operations are safe under concurrency. For an operation-based CRDT, that often means
|
||||
concurrent operations must commute: applying operation `a` and then operation `b` must lead to the same logical state as applying `b` and then `a`, at
|
||||
least when `a` and `b` are concurrent.
|
||||
|
||||
This is manageable for simple structures such as grow-only sets. It becomes more subtle for registers, maps, lists, trees, undo and redo, and nested
|
||||
documents. Ordered lists are a good example. A list CRDT must not only decide whether an element exists, but also where it appears. Concurrent
|
||||
insertions at the same position must be ordered deterministically. Deletions must not destroy information that later or concurrent insertions might
|
||||
reference. These details produce the familiar machinery of operation identifiers, tombstones, causal dependencies, and tie-breaking rules.
|
||||
|
||||
The programmer therefore has two jobs. First, they must design the logical behavior of the data type. Second, they must implement it in a way that
|
||||
preserves convergence under every possible delivery order. The second job is where many mistakes hide.
|
||||
|
||||
---
|
||||
|
||||
## Query-Based Turn
|
||||
|
||||
The query-based approach reframes the problem. Instead of treating the CRDT as mutable state plus a merge algorithm, it treats the CRDT as a derived
|
||||
view over an immutable operation log.
|
||||
|
||||
The replica stores operations as facts:
|
||||
|
||||
```text
|
||||
set(replica_id, counter, key, value)
|
||||
pred(from_replica_id, from_counter, to_replica_id, to_counter)
|
||||
insert(replica_id, counter, parent_replica_id, parent_counter, value)
|
||||
remove(replica_id, counter)
|
||||
```
|
||||
|
||||
The visible state is not the primary stored object. The visible state is the result of a query over those facts.
|
||||
|
||||
This gives a clean convergence story. If two replicas have the same operation facts, and they evaluate the same deterministic query, they must compute
|
||||
the same result. The query does not depend on the order in which operations arrived. Arrival order is an implementation detail. The logical input is a
|
||||
set or multiset of immutable facts.
|
||||
|
||||
That shift is powerful because it moves convergence reasoning into the structure of the computation. The developer still has to define the intended
|
||||
semantics, but they are no longer hand-coding every step of the merge algorithm.
|
||||
|
||||
---
|
||||
|
||||
## Why Datalog Fits
|
||||
|
||||
Datalog is a natural language for this style because it is built around facts and derived facts. A Datalog program says which new facts follow from
|
||||
existing facts. The execution model is close to the mental model of derived views.
|
||||
|
||||
For example, a multi-value register key-value store can be described by storing `set` operations and causal predecessor edges. A value is visible if
|
||||
its set operation has not been overwritten by a causally later operation. In Datalog-like notation:
|
||||
|
||||
```text
|
||||
overwritten(RepId, Ctr) :-
|
||||
pred(RepId, Ctr, _, _).
|
||||
|
||||
mvrStore(Key, Value) :-
|
||||
set(RepId, Ctr, Key, Value),
|
||||
not overwritten(RepId, Ctr).
|
||||
```
|
||||
|
||||
This is small, but it captures an important semantic choice. The query does not pick one winner among concurrent values. It filters out values that
|
||||
have been causally superseded. If two values are concurrent, neither overwrites the other, so both remain visible.
|
||||
|
||||
Datalog also handles recursion directly. That matters because causal histories and list structures are graph-shaped. Asking whether an operation is
|
||||
reachable from a root, whether a dependency chain is complete, or what the next visible list element is can require recursive rules.
|
||||
|
||||
The query language is restricted enough to make evaluation well-defined, but expressive enough to describe useful replicated structures.
|
||||
|
||||
---
|
||||
|
||||
## Causality as Data
|
||||
|
||||
Causality is the difference between "this write came after that write" and "these writes were independent." CRDTs need this distinction because
|
||||
overwriting should usually remove only causally prior values, not concurrent values.
|
||||
|
||||
The operation identifier is usually a pair:
|
||||
|
||||
```text
|
||||
(replica_id, counter)
|
||||
```
|
||||
|
||||
The replica id identifies where the operation came from. The counter is a local logical clock. Together, they make operation identifiers unique. They
|
||||
also provide a deterministic tie-breaker when the data type needs a total order among concurrent operations.
|
||||
|
||||
Causal dependencies can be represented as edges:
|
||||
|
||||
```text
|
||||
pred(from_replica_id, from_counter, to_replica_id, to_counter)
|
||||
```
|
||||
|
||||
This says the `to` operation depends on the `from` operation. The dependency graph is then ordinary relational data. A query can derive roots, leaves,
|
||||
overwritten operations, causally ready operations, and visible values.
|
||||
|
||||
There is an important design choice here. If the network or runtime guarantees causal delivery, the query can be simpler. If operations may arrive out
|
||||
of order, the query must avoid exposing operations whose dependencies have not arrived. That means causal readiness becomes part of the query.
|
||||
|
||||
---
|
||||
|
||||
## Out-of-Order Delivery
|
||||
|
||||
Out-of-order delivery is common in distributed systems. A replica might receive an operation before receiving the operation it depends on. If the
|
||||
system exposes the later operation too early, it can show a state that is not valid under the intended causal semantics.
|
||||
|
||||
A causal-readiness query guards against this. It derives which operations can be safely considered visible because their dependency chain is present.
|
||||
Conceptually, the query walks the causal graph from roots toward leaves and marks operations as ready only when the necessary predecessors are
|
||||
available.
|
||||
|
||||
This improves correctness in less controlled networks, but it adds cost. Graph traversal is recursive. Recursive incremental computation is possible,
|
||||
but it is not free. If the query repeatedly walks long causal chains, performance may grow with the depth of the history.
|
||||
|
||||
This is one of the central engineering lessons. Declarative correctness is not the same as automatic efficiency. The query may be compact and
|
||||
semantically clear, while still requiring careful optimization.
|
||||
|
||||
---
|
||||
|
||||
## Lists Are the Hard Case
|
||||
|
||||
A key-value register demonstrates the idea, but an ordered list shows why the idea is interesting.
|
||||
|
||||
In a collaborative text editor, users insert and delete characters. If two users insert at the same position concurrently, the final document must
|
||||
place both insertions somewhere, and all replicas must choose the same order. The data type cannot rely on local array indexes because indexes shift
|
||||
as edits arrive.
|
||||
|
||||
A common CRDT solution is to give every inserted element a stable identifier. An insertion does not say "put this character at index 12." It says "put
|
||||
this character after element `(r, c)`." The insert operations form a tree:
|
||||
|
||||
```text
|
||||
insert(replica_id, counter, parent_replica_id, parent_counter, value)
|
||||
```
|
||||
|
||||
The sentinel root represents the beginning of the list. Children of a node are insertions that targeted that node as their parent. Concurrent
|
||||
insertions after the same parent become siblings. Siblings are ordered deterministically by operation identifier. The visible list is then obtained by
|
||||
traversing the tree in a deterministic order.
|
||||
|
||||
Deletion is also subtle. If an element is deleted, it often cannot be physically removed from the structural history, because later or concurrent
|
||||
operations may refer to it as a parent. The system keeps a tombstone: the element remains as a reference point, but the visible list skips its value.
|
||||
|
||||
In query terms, list behavior becomes a set of derived relations: first child, next sibling, next element, visible element, and next visible element.
|
||||
Datalog can express those relations directly. The query is longer than the register example, but it is still a declarative description of the list
|
||||
semantics.
|
||||
|
||||
---
|
||||
|
||||
## Incremental View Maintenance
|
||||
|
||||
If CRDT state is a query over all operations, the obvious worry is cost. The operation set only grows. A naive implementation would recompute the
|
||||
entire query result every time a new operation arrives.
|
||||
|
||||
Incremental view maintenance is the response. The engine maintains the result of a query as inputs change. When a new operation arrives, the engine
|
||||
computes the change to the output rather than recomputing the whole output.
|
||||
|
||||
For a replicated application, the desired runtime shape is:
|
||||
|
||||
```text
|
||||
new operation facts
|
||||
-> incremental query update
|
||||
-> visible state changes
|
||||
-> application update
|
||||
```
|
||||
|
||||
This shape is especially attractive for user interfaces. If the query engine emits deltas, the application can update only the affected views. It is
|
||||
also attractive for local-first systems because the same machinery can process local writes, remote writes, and startup replay.
|
||||
|
||||
DBSP is relevant here because it provides a formal and practical model for incremental computation over changing relations. Relational operators are
|
||||
lifted into a streaming setting where inputs and outputs evolve over time.
|
||||
|
||||
---
|
||||
|
||||
## Hydration and Warm Updates
|
||||
|
||||
There are two different performance situations to keep separate.
|
||||
|
||||
Hydration is startup. The application already has a stored operation history, but the query engine must rebuild its internal operator state. It may
|
||||
need to parse the query, build the execution plan, feed in the existing facts, and produce the current state. Hydration measures how long it takes
|
||||
before the application can show the document or database contents after opening.
|
||||
|
||||
Warm update processing is the normal running mode after hydration. The query engine already has internal state. A small batch of new operations
|
||||
arrives. The engine only needs to update the maintained result.
|
||||
|
||||
A design can perform acceptably in warm updates but poorly during hydration, or the other way around. For CRDT-backed applications, both matter. A
|
||||
collaborative editor must feel responsive while editing, but it must also open large documents without a long pause.
|
||||
|
||||
This distinction also suggests possible hybrid strategies. A system might use a batch-oriented computation for startup and then switch to incremental
|
||||
maintenance. Or it might persist internal operator state so startup does not require replaying the entire operation history.
|
||||
|
||||
---
|
||||
|
||||
## Relational Intermediate Representation
|
||||
|
||||
A Datalog program is convenient for users, but the execution engine usually wants a lower-level representation. A common design is to translate
|
||||
Datalog into a relational intermediate representation.
|
||||
|
||||
The relational IR can include operators such as:
|
||||
|
||||
- projection
|
||||
- selection
|
||||
- join
|
||||
- antijoin
|
||||
- union
|
||||
- difference
|
||||
- distinct
|
||||
- fixed-point iteration
|
||||
|
||||
This is useful for two reasons. First, relational algebra is a good target for optimization. The engine can push down filters, remove unused fields,
|
||||
combine projections, and choose join strategies. Second, the IR separates the frontend language from the execution backend. Datalog is one possible
|
||||
frontend. DBSP is one possible incremental backend.
|
||||
|
||||
That separation matters for research and engineering. If the backend changes from DBSP to another incremental framework, the Datalog frontend does not
|
||||
have to be redesigned. If a SQL-like frontend is added later, it can target the same IR.
|
||||
|
||||
---
|
||||
|
||||
## Where the Costs Hide
|
||||
|
||||
The query-based approach simplifies some reasoning, but it does not erase hard systems problems.
|
||||
|
||||
Negation is one source of care. Datalog with arbitrary negation can have unclear or unstable semantics. Stratified negation restricts programs so
|
||||
negative dependencies do not form problematic cycles. This keeps evaluation understandable, but it limits what can be expressed directly.
|
||||
|
||||
Recursion is another source of cost. Recursive rules are needed for graph reachability, transitive closure, causal readiness, and list traversal.
|
||||
Incremental recursion can still be expensive when each update affects a long chain or a large region of the dependency graph.
|
||||
|
||||
Join planning matters as well. Datalog rules often translate into joins. Bad join order can create large intermediate relations. In a continuously
|
||||
maintained query, changing the plan later may be harder than changing it for a one-shot query because the operators hold state.
|
||||
|
||||
Storage growth is also unresolved. The clean convergence story assumes a monotonically growing operation set. Real applications cannot always keep
|
||||
every operation forever, especially on small devices. Compaction must preserve enough information for future queries and future synchronization.
|
||||
|
||||
These are not arguments against the approach. They are the places where the approach becomes a database systems problem rather than only a
|
||||
programming-language idea.
|
||||
|
||||
---
|
||||
|
||||
## What This Approach Buys
|
||||
|
||||
The first benefit is conceptual. The CRDT is specified as a query. The implementation has a clearer boundary between logical behavior and physical
|
||||
execution. That is the same separation that made relational databases powerful: the user specifies what result should exist, and the engine decides
|
||||
how to maintain it.
|
||||
|
||||
The second benefit is extensibility. A fixed CRDT library exposes a fixed set of data types. A query-based system could let application developers
|
||||
define custom replicated structures, provided they stay within the safe fragment of the language.
|
||||
|
||||
The third benefit is a shared interface. Ordinary application state, derived views, and replicated state can all look like queries. This could reduce
|
||||
the number of special-purpose layers in local-first applications.
|
||||
|
||||
The fourth benefit is optimization headroom. If CRDTs are expressed through a query plan, improvements to the query engine can improve many CRDT
|
||||
definitions without changing their logical definitions.
|
||||
|
||||
---
|
||||
|
||||
## What Remains Open
|
||||
|
||||
Several questions remain open before this style can be treated as a production design.
|
||||
|
||||
Can enough useful CRDTs be expressed in Datalog with stratified negation? Registers and lists are promising examples, but nested documents, moves in
|
||||
trees, undo and redo, and rich JSON-like structures are harder.
|
||||
|
||||
Can incremental evaluation make the performance competitive with hand-written CRDTs? A hand-written CRDT can exploit structure-specific shortcuts. A
|
||||
query engine needs optimization to avoid paying too much for generality.
|
||||
|
||||
Can operation histories be compacted safely? Append-only facts are clean, but unbounded growth is not acceptable for every application.
|
||||
|
||||
Can the system provide good error messages and type checking? A query language for application developers needs more than a working parser and
|
||||
runtime.
|
||||
|
||||
Can causal readiness be optimized around the common case? Most new operations in a live application are likely close to current causal heads, but a
|
||||
naive recursive query may still traverse from roots.
|
||||
|
||||
These questions define the practical research agenda.
|
||||
|
||||
---
|
||||
|
||||
## Reading Frame
|
||||
|
||||
The best way to read this line of work is as a bridge between three areas.
|
||||
|
||||
From CRDTs, it takes the goal of coordination-free replicated state.
|
||||
|
||||
From Datalog, it takes declarative rules, recursion, and deterministic derivation over facts.
|
||||
|
||||
From incremental query engines, it takes the ability to maintain derived state as input changes.
|
||||
|
||||
The slogan is:
|
||||
|
||||
```text
|
||||
replicated data structure = materialized query over immutable operations
|
||||
```
|
||||
|
||||
That slogan is not the whole system, but it is the core mental model. The data structure is no longer only an object with methods. It is a maintained
|
||||
view. The operation log is the base data. The query is the semantics. The incremental engine is the execution strategy.
|
||||
|
||||
---
|
||||
|
||||
## Changelog
|
||||
|
||||
* **May 7, 2026** -- First version created.
|
||||
162
external/001_query_engine.md
vendored
Normal file
162
external/001_query_engine.md
vendored
Normal file
@ -0,0 +1,162 @@
|
||||
This document is copied from https://git.sgai.uk/creators/geolog/-/wikis/Geolog%20storage%20engine%20meeting%202026-04-21 on May 7, 2026.
|
||||
|
||||
Discussion pointsMaterialising the current state for efficient constraint checking and efficient queryingSpecial-casing certain types of theories,
|
||||
such as hash consing for syntax trees, or arrays for linear ordersExposing an API that looks like the theory definitions, not like the IRExposing to
|
||||
various programming languages via FFIDBSP for incremental constraint checking? Get Leo Stewen involved?Hooking into Soufflé for general querying /
|
||||
testing program analysis use cases?IR for initial models + storage layer for it (egraph cache)Initial modelsoften infinite, and not data. how should
|
||||
they show up in the op log?use an SSA or stack-based program where each instruction is a term constructor. both what you extract from an egraph and
|
||||
can be interpreted into an egraph.imagine we're working in the free ring with generators all strings. we have a couple of term constructors: given a
|
||||
string we can construct an element of the ring; and then we have 0, 1, multiplication, and addition. there's a term construction for terms in the free
|
||||
ring. if a term tree has common subtrees, we want a DAG representation. simplest: topologically sort it, store it in an array with backwards indices.
|
||||
that's essentially what SSA is: store instructions that may reference previous values.fundamental operation: given two EIDs, give me the EID of the
|
||||
term that is given by applying this term constructor to those EIDs. given an egraph and an SSA-style program you can interpret the SSA-style
|
||||
program.each term constructor may have a different number of arguments. separate table per term constructor would mean a lot of tables. maybe one
|
||||
table per arity?think of this as a serialised form of an egraph? stores many different terms concurrently. a term is just an index into a giant
|
||||
table.could have a defined term constructor. e.g. in a ring, could define a function of 4 arguments to be a big complicated polynomial. more efficient
|
||||
to store it as a single operation than as a collection of operations. from the perspective of commits and storage, could do either way.might make
|
||||
sense to sometimes put derived operations in the oplog, if they are compact to store.in the type theory, initial models are in general context. e.g.
|
||||
transitive closure for a graph: for all graphs, give you an initial model, which is the transitive closure. In the IR, have a lazy monomorphisation:
|
||||
declare a bunch of tables in the IR; some of the constraints on those tables mention certain initial models. those initial models are the transitive
|
||||
closure of this graph, or the term trees on this set of variables. might imagine emitting auxiliary declarations in terms of the main tables of the
|
||||
schema.initial models would not use geometric sequents, as they are too general. restricted to be a conjunction implying another conjunction. could
|
||||
give this to a datalog/egglog engine, containing only relation constructors but not term constructors.IR is still a set of tables that are declared to
|
||||
be an initial model; term constructors and what they are supposed to act upon.James's current IR is fully relational. Would want to preserve functions
|
||||
in the IR? Functions have types.James: Several tables sharing a single rowId space?Transitive closure/DAG exampletheory Graph := sig
|
||||
V: Set
|
||||
E: V -> V -> Set
|
||||
end
|
||||
|
||||
def TransClosure (G*: Graph) := (init sig
|
||||
r: G.V -> G.V -> Prop
|
||||
_: r ?a ?a
|
||||
_: r ?a ?b -> G.E ?b ?c -> r ?a ?c
|
||||
end).r
|
||||
|
||||
theory DAG := sig
|
||||
G: Graph
|
||||
tc := TransClosure G
|
||||
_: tc ?a ?b -> tc ?b ?a -> ?a = ?b
|
||||
endtranslates into IR:V: TABLE []
|
||||
E: TABLE [V, V]
|
||||
r: FREETABLE [V, V]
|
||||
refl: RULE[(v: V) -> r V V]
|
||||
trans: RULE[...]You can't insert into a FREETABLE, it can only be computed using rules. Every element added to FREETABLE has a provenance: produced by
|
||||
refl, or produced by trans, or produced by multiple rules and later shown to be equal.Say we want to preserve that an insertion preserves the
|
||||
acyclicity of the graph.At any given moment, FREETABLE may not yet be fully computed. Either run refl and trans until completion (fixed point), or
|
||||
turn it into a graph traversal.Props don't need to be stored. If we had r: G.V -> G.V -> Set instead of r: G.V -> G.V -> Prop that would represent
|
||||
paths through the graph.While you can't insert into a free table, other tables could refer to elements of a free table. In that case, the derived free
|
||||
table would have to materialised as a commit. This is where we could use an SSA-like construction: each element of the table can be derived through a
|
||||
finite list of invocations of rules. These might have IDs that are just local to the commit, not globally unique? But if we want to refer to IDs in
|
||||
another commit, just make them global. Use hash consing.Concurrently created terms could use different IDs to refer to the same term. Could imagine a
|
||||
merging process where we extract all the terms we care about, run egraph over them, and then re-encode them as a sequence of operations with fresh
|
||||
IDs.How do you compare elements of the free model across commits, if they have different IDs? Could walk backwards through the SSA tables for both
|
||||
commits, pull out both into a shared egraph, and then compare them in the egraph. But when you extract a term, you get out the term you originally put
|
||||
in.Storage engine should concern itself only with hash consing, not the egraph. But storage engine may need to deal with proofs of equality extracted
|
||||
from the egraph, because those proofs might be expensive to discover, so we'd want to store it. The witness of that proof would be a bunch of rule
|
||||
invocations, transitivity, congruence etc. (A truncated version of alifib?)There might be laws in the IR that use equality. Making equality just
|
||||
another stored relation is perhaps not a good idea – make it a built-in concept of the storage engine that performs congruence and transitivity out of
|
||||
the box?You can refer to equality via a set of rewrite rules that you have applied to the database. We have efficient representations of sets (e.g.
|
||||
Merkle trees). If you want to record an equality between two terms, you have a hash that refers to a set of rule invocations; whenever you need to
|
||||
materialise it, you run those rules in your egraph copy.You could have a table where one of the columns in the table is an equality proof? If Geolog
|
||||
is used for storing witnesses of certain properties, e.g. this branch never gets called. Store this in the database as the sequence of rewrite rules
|
||||
you need in order to prove that the branch condition equals false.Concept of computational depth: a complement to algorithmic complexity. Complexity
|
||||
is how many bits you need to generate a thing, no matter how long it takes. Computational depth is, given you are constructing it from the least
|
||||
amount possible, how long does it take? Even if things do not contain a lot of information, might want to store them if they contain a lot of
|
||||
computational depth. In the context of Geolog, where the process of equality is a semi-decision procedure (not guaranteed to terminate), this means
|
||||
the witnesses for equality need to be stored.Could make a content-addressed construction for term IDs, but this would take a lot more space.Query
|
||||
executionFrom a type theory point of view: if T is a theory, then a query is just a term of type q : (M : T) -> Set (given a model M of a theory, give
|
||||
me a set). Conjunctive queries are the easiest. for example, G : Graph -> [v : G.V, e : G.E v v] returns all the vertices with self-loops.Discussion
|
||||
of DBSP. Owen had heard of Differential Dataflow but not DBSP.In Geolog-zeta, Davidad compiled every Geolog sequent into two queries, one for the
|
||||
antecedent, one for the consequent. Compute a set for each, and check that one set is included in the other.DBSP needs to deal with positive queries (
|
||||
conjunctive and finitely disjunctive)How does this interact with chase? Mark rules as chased or not-chased. With chased, compute the derived facts.
|
||||
With non-chased, it's an error if the result is not satisfied (checked after the chase is complete).Want some way of special-casing certain patterns.
|
||||
There are more or less efficient algorithms for initial models. Some of these could be inlined into DBSP.This is analogous to the part of a SQL
|
||||
database that produces a query plan. For example, choice between hash join or merge join. Similarly, we want to be able to see if an axiom has a
|
||||
certain shape and compile it to an appropriate DBSP circuit.Could do an initial DBSP integration without initial models (no transitive closure, just
|
||||
e.g. union of conjunctive queries). A program logic (deep embedding into Geolog) could probably be checked using this fragment? The axioms are
|
||||
checking the well-formedness of the application of every rule, and the well-foundedness of the proof tree. Well-foundedness is the only non-local
|
||||
thing being checked here. This would be a good use case to try: Hoare logic over imperative programs. Allow agents to suggest proof steps and get
|
||||
feedback on which axioms are violated.Hooking up to Souffle is also a possibility. So far we've only looked at Geolog features that are also Datalog,
|
||||
but soon we might go beyond that. But we could try exporting to textual Datalog and run it through Souffle as a benchmarking baseline. Or hook into
|
||||
Souffle at ABI level.APIBesides the IR, another output of the compiler needs to be FFI bindings e.g. for JavaScript or Rust. Do Haskell first (using
|
||||
compiler as a library)? Owen thinks Haskell is harder – easier to generate code. TypeScript is the best target for now.Alex suggests we use LLM to
|
||||
generate ORM-like bindings from theory definition. Owen: it's less like ORM, more like a thin wrapper around prepared statements. Davidad: want an
|
||||
interface more like ODBC, less like an ORM.What makes this different from a database: every node has Geolog in-process.Owen wants to think about
|
||||
annotations of which concepts to expose via FFI. perhaps have a separate language that has an "FFI" to the Geolog language, but is restricted in
|
||||
itself.TA1.3 interface. Geolog compiler as "dev dependency", e.g. as a vite plugin. Run over query definition and output bindings. Geolog engine gets
|
||||
compiled to wasm, reads the stuff the bindings are giving it. Argument about whether to ship the compiler to the browser. Since we want the compiler
|
||||
to output TypeScript bindings, need to then recompile the code that uses those bindings. For people experimenting with theories this will happen
|
||||
frequently; in a "production" use the theory will be fixed at compile time.Alex suggests we want a web app that will allow others to experiment with
|
||||
writing theories and using instances. Obsidian already have a basic version of this. Make this available as npm library and implement the web
|
||||
interface as a simply vite app to check if we've libraryified it sufficiently.Another goal for TA1.3: dispense with the interchange format for
|
||||
Petrinaut, use Geolog insteadPossible format for inserting data into a Geolog instance:theory Graph := sig
|
||||
V : Set
|
||||
E : V -> V -> Set
|
||||
end
|
||||
|
||||
theory Main := sig
|
||||
Graphs : Set
|
||||
G : Graphs -> Graph
|
||||
g0 : Graphs
|
||||
end
|
||||
|
||||
# Syntax for data of an instance
|
||||
|
||||
[
|
||||
g0 : Graphs
|
||||
_ : [
|
||||
open G g0 # brings (G g0).V and (G g0).E into scope as V and E
|
||||
v0 : V
|
||||
v1 : V
|
||||
e : E v0 v1
|
||||
]
|
||||
]Make a Godbolt-style app to allow people to play with Geolog theoriesData structuresFast analyses: egraphs, making conjunctive queries fastvs.
|
||||
Automerge: one oplog, multiple materialised check-outs.Eventually, various optimised data structures would be good. As a first pass do something
|
||||
generic. Davidad: "tensors of ordered semirings"How often do we expect to have dense multidimensional tensors? Only really in neural networks. But a
|
||||
sparse tensor can implement the same trait. Build up a library of tensor implementations and the optimiser picks one. Updates to the tensors
|
||||
represented via binary space partitioning. Davidad thinks nobody has done this.For example, representing a graph as an adjacency matrix, even if the
|
||||
underlying representation is actually a set of edges represented as tuples. Davidad thinks the tensor representation is good for GPUs to operate on,
|
||||
streaming a dense representation. e.g. transitive closure as matrix representation. conjunctive queries are binary tensor contractions.This
|
||||
representation can't handle strings – have to intern them into a string store, e.g. a big append-only blob, and string values are represented as an
|
||||
offset into that big blob.Anything that is a B-tree index in an ordinary database (want to be able to do range queries on it) could be a dtype.Davidad
|
||||
argues that functions should be stored differently from tables. For example with a graph, instead of E: V -> V -> Set there would be functions head:
|
||||
E -> V and tail: E -> V.Geolog breaking down into documentsWe need a concept of a namespace: can define a variable "foo is an instance of theory bar";
|
||||
whoever owns the theory can update it, but the instances and theories are immutable.Want to be able to point at something in someone else's namespace;
|
||||
that would be Geolog's analogy to Automerge documents. A commit is located in a namespace and modifies things only within that namespace. Namespaces
|
||||
reference each other. Want something like cargo import?Each variable is its own document. You might have a collection of documents. Their content
|
||||
might not be disjoint though. Say you have a graph, copy it a couple of times, and then use it in different instances. Logically, each variable has a
|
||||
completely separate history.A namespace has a very simple version history – it's just a version bump to one of the variables it contains. Recognises
|
||||
that mutation is needed somewhere: at the namespace there are just variables.Sedimentree doesn't work well for bushy histories. Our workload might
|
||||
have long parallel branches (work from independent agents) but probably not very wide branchingPartial merging might be needed: given two instances of
|
||||
a theory that are individually valid, need a merge that produces the maximal subset that is still valid. This always exists (in the worst case, throw
|
||||
away one of the branches), but not necessarily unique or deterministically defined. Perhaps a cherry-pick/rebase. Perhaps an explicit merge commit
|
||||
that indicates which rows are excluded or added in the merged result. In general, doing a merge might be just as much work as doing the work on the
|
||||
individual branch, although many practical use cases (e.g. program logics) would allow easy merges (two individually proved statements can easily be
|
||||
combined).Program logic exampleString diagram for (a+b)*c, want to prove that if all the inputs are >=0 then the output is >=0. Have a proof rule for
|
||||
multiplication: if both inputs are >=0 then the output is >=0.Layers: SSA target (references to variables in the program); logical assertions about
|
||||
those variables; logical formulae; sequents; derivations (applications of inference rules). All of these are encoded as Geolog theories.What if
|
||||
assertions and atoms were meta-level props? More in line with traditional Datalog analyses. But if >= was an actual term constructor in the Geolog
|
||||
language, that would be less flexible/extensible. Geolog formulas are not first-class objects in Geolog: if we have a proof tree that points at
|
||||
objects that are not themselves in the theory, that gets messy. Davidad thinks it's cleaner to do a deep embedding than a shallow one.Why do you need
|
||||
a proof tree? If you're using the Geolog metalogic (propositions in Geolog), keeping track of propositions is just part of the native logic of Geolog.
|
||||
Say ">=0" was a built-in of type Wire -> Prop. The problem is that this puts the burden of proof search on the Geolog computation engine? Not
|
||||
necessarily. There's nothing stopping you from writing down the proof rules applied – in fact, that's what initial models are.There are bunch of term
|
||||
constructors that produce proofs, and we can just serialise them. Far more compact than the whole proof tree that stores the context at each node.
|
||||
Still constrained to geometric logic though, which arbitrary proof trees would not be.Want a concept of a non-strict axioms: allows data violating the
|
||||
axioms to be committed, but with a big red flag saying that the proof is incomplete. Allow sorry as a constructor, and a no-sorries predicate that's a
|
||||
non-strict axiom.What kinds of logics do we want to use that don't embed nicely into geometric logic? James suggested a separation logic that came out
|
||||
a few years ago, which has some substructure.The proof tree approach might not work for binders. Once you have quantifiers in the language inside, you
|
||||
need to switch to De Bruijn indices or suchlike? No, represent free variables as explicit references to the variable binding. Performing substitution
|
||||
is still a bunch of work.Problem: if you prove something about one term, and then you rewrite (e.g. substitute) into a similar term, it's a new term
|
||||
and the proofs don't apply to it. You can copy the proof trees but that will get huge. A lot of the juice we can get out of geometric logic is
|
||||
conjunctive is fast to check, but deep embedding throws a lot of that away.Owen thinks it would be better to extend the Geolog logic to do e.g.
|
||||
interesting things with real numbers, rather than try to deeply embed a theory that is as annoying as pen and paper. Davidad thinks we should try it
|
||||
out; it's a selling point that it's sufficient to do a deep embedding of any logic, even if it's easier and more efficient to do a shallow embedding.
|
||||
Importing SMT certificates needs that flexibility, for example.Geolog might be best for things that are combinatorial. More natively combinatorial use
|
||||
cases would allow it to shine.There's Geolog the type theory, and the massively scalable infrastructure for solving problems. On the latter, can we
|
||||
beat Soufflé at their own game? They must have some good benchmarks.Try cryptography as a use case? e.g. can we prove an elliptic curve implementation
|
||||
correct? See Martin's X25519 tutorial, verified F* implementation in ValeCryptActionsAlex, George: Geolog compiler on npm (GHC-compiled Wasm) → get
|
||||
Godbolt-style web interface goingOwen: design a language for people to write queries and insert data, putting it through loweringJames: new version of
|
||||
lowering that can handle inductive thingssomeone should write a compiler that turns a simple toy imperative language into a Geolog instanceit would be
|
||||
nice to have one frontend – Obsidian should write TypeScript? Cale happy to write bits of Haskell that generate TypeScriptVincent, Alex, Martin –
|
||||
physical data storage, disk and network wire formatsMartin, Leo – DBSP experiments
|
||||
Loading…
x
Reference in New Issue
Block a user