318 lines
16 KiB
Markdown
318 lines
16 KiB
Markdown
# Why CRDTs as Queries
|
|
|
|
A coherent reading note on the idea of defining replicated data structures as deterministic queries over immutable operations.
|
|
|
|
---
|
|
|
|
## Starting Point
|
|
|
|
The basic problem behind CRDTs is easy to state and hard to implement well. Several replicas hold copies of the same logical data. Each replica should
|
|
be able to accept writes locally, including while offline or disconnected from the other replicas. Later, when replicas exchange information, they
|
|
should converge to the same state.
|
|
|
|
This is not the same problem as ordinary database replication. In a conventional primary-replica database, a single authority can decide the order of
|
|
writes. If two users write at the same time, the system can serialize those writes through a leader, a lock, or a consensus protocol. That gives the
|
|
database a single history. CRDTs are designed for environments where that coordination is unavailable, undesirable, or too expensive. A user should
|
|
still be able to write locally, even if there is no reachable leader.
|
|
|
|
The price is that the system must handle concurrency after the fact. Two replicas may both accept writes that neither knew about at the time. When
|
|
those writes later meet, the system must define what the merged state means. A CRDT is a data structure whose merge behavior is designed so that all
|
|
replicas eventually compute the same state.
|
|
|
|
---
|
|
|
|
## The Traditional Burden
|
|
|
|
In a hand-written CRDT, the implementer writes an algorithm whose operations are safe under concurrency. For an operation-based CRDT, that often means
|
|
concurrent operations must commute: applying operation `a` and then operation `b` must lead to the same logical state as applying `b` and then `a`, at
|
|
least when `a` and `b` are concurrent.
|
|
|
|
This is manageable for simple structures such as grow-only sets. It becomes more subtle for registers, maps, lists, trees, undo and redo, and nested
|
|
documents. Ordered lists are a good example. A list CRDT must not only decide whether an element exists, but also where it appears. Concurrent
|
|
insertions at the same position must be ordered deterministically. Deletions must not destroy information that later or concurrent insertions might
|
|
reference. These details produce the familiar machinery of operation identifiers, tombstones, causal dependencies, and tie-breaking rules.
|
|
|
|
The programmer therefore has two jobs. First, they must design the logical behavior of the data type. Second, they must implement it in a way that
|
|
preserves convergence under every possible delivery order. The second job is where many mistakes hide.
|
|
|
|
---
|
|
|
|
## Query-Based Turn
|
|
|
|
The query-based approach reframes the problem. Instead of treating the CRDT as mutable state plus a merge algorithm, it treats the CRDT as a derived
|
|
view over an immutable operation log.
|
|
|
|
The replica stores operations as facts:
|
|
|
|
```text
|
|
set(replica_id, counter, key, value)
|
|
pred(from_replica_id, from_counter, to_replica_id, to_counter)
|
|
insert(replica_id, counter, parent_replica_id, parent_counter, value)
|
|
remove(replica_id, counter)
|
|
```
|
|
|
|
The visible state is not the primary stored object. The visible state is the result of a query over those facts.
|
|
|
|
This gives a clean convergence story. If two replicas have the same operation facts, and they evaluate the same deterministic query, they must compute
|
|
the same result. The query does not depend on the order in which operations arrived. Arrival order is an implementation detail. The logical input is a
|
|
set or multiset of immutable facts.
|
|
|
|
That shift is powerful because it moves convergence reasoning into the structure of the computation. The developer still has to define the intended
|
|
semantics, but they are no longer hand-coding every step of the merge algorithm.
|
|
|
|
---
|
|
|
|
## Why Datalog Fits
|
|
|
|
Datalog is a natural language for this style because it is built around facts and derived facts. A Datalog program says which new facts follow from
|
|
existing facts. The execution model is close to the mental model of derived views.
|
|
|
|
For example, a multi-value register key-value store can be described by storing `set` operations and causal predecessor edges. A value is visible if
|
|
its set operation has not been overwritten by a causally later operation. In Datalog-like notation:
|
|
|
|
```text
|
|
overwritten(RepId, Ctr) :-
|
|
pred(RepId, Ctr, _, _).
|
|
|
|
mvrStore(Key, Value) :-
|
|
set(RepId, Ctr, Key, Value),
|
|
not overwritten(RepId, Ctr).
|
|
```
|
|
|
|
This is small, but it captures an important semantic choice. The query does not pick one winner among concurrent values. It filters out values that
|
|
have been causally superseded. If two values are concurrent, neither overwrites the other, so both remain visible.
|
|
|
|
Datalog also handles recursion directly. That matters because causal histories and list structures are graph-shaped. Asking whether an operation is
|
|
reachable from a root, whether a dependency chain is complete, or what the next visible list element is can require recursive rules.
|
|
|
|
The query language is restricted enough to make evaluation well-defined, but expressive enough to describe useful replicated structures.
|
|
|
|
---
|
|
|
|
## Causality as Data
|
|
|
|
Causality is the difference between "this write came after that write" and "these writes were independent." CRDTs need this distinction because
|
|
overwriting should usually remove only causally prior values, not concurrent values.
|
|
|
|
The operation identifier is usually a pair:
|
|
|
|
```text
|
|
(replica_id, counter)
|
|
```
|
|
|
|
The replica id identifies where the operation came from. The counter is a local logical clock. Together, they make operation identifiers unique. They
|
|
also provide a deterministic tie-breaker when the data type needs a total order among concurrent operations.
|
|
|
|
Causal dependencies can be represented as edges:
|
|
|
|
```text
|
|
pred(from_replica_id, from_counter, to_replica_id, to_counter)
|
|
```
|
|
|
|
This says the `to` operation depends on the `from` operation. The dependency graph is then ordinary relational data. A query can derive roots, leaves,
|
|
overwritten operations, causally ready operations, and visible values.
|
|
|
|
There is an important design choice here. If the network or runtime guarantees causal delivery, the query can be simpler. If operations may arrive out
|
|
of order, the query must avoid exposing operations whose dependencies have not arrived. That means causal readiness becomes part of the query.
|
|
|
|
---
|
|
|
|
## Out-of-Order Delivery
|
|
|
|
Out-of-order delivery is common in distributed systems. A replica might receive an operation before receiving the operation it depends on. If the
|
|
system exposes the later operation too early, it can show a state that is not valid under the intended causal semantics.
|
|
|
|
A causal-readiness query guards against this. It derives which operations can be safely considered visible because their dependency chain is present.
|
|
Conceptually, the query walks the causal graph from roots toward leaves and marks operations as ready only when the necessary predecessors are
|
|
available.
|
|
|
|
This improves correctness in less controlled networks, but it adds cost. Graph traversal is recursive. Recursive incremental computation is possible,
|
|
but it is not free. If the query repeatedly walks long causal chains, performance may grow with the depth of the history.
|
|
|
|
This is one of the central engineering lessons. Declarative correctness is not the same as automatic efficiency. The query may be compact and
|
|
semantically clear, while still requiring careful optimization.
|
|
|
|
---
|
|
|
|
## Lists Are the Hard Case
|
|
|
|
A key-value register demonstrates the idea, but an ordered list shows why the idea is interesting.
|
|
|
|
In a collaborative text editor, users insert and delete characters. If two users insert at the same position concurrently, the final document must
|
|
place both insertions somewhere, and all replicas must choose the same order. The data type cannot rely on local array indexes because indexes shift
|
|
as edits arrive.
|
|
|
|
A common CRDT solution is to give every inserted element a stable identifier. An insertion does not say "put this character at index 12." It says "put
|
|
this character after element `(r, c)`." The insert operations form a tree:
|
|
|
|
```text
|
|
insert(replica_id, counter, parent_replica_id, parent_counter, value)
|
|
```
|
|
|
|
The sentinel root represents the beginning of the list. Children of a node are insertions that targeted that node as their parent. Concurrent
|
|
insertions after the same parent become siblings. Siblings are ordered deterministically by operation identifier. The visible list is then obtained by
|
|
traversing the tree in a deterministic order.
|
|
|
|
Deletion is also subtle. If an element is deleted, it often cannot be physically removed from the structural history, because later or concurrent
|
|
operations may refer to it as a parent. The system keeps a tombstone: the element remains as a reference point, but the visible list skips its value.
|
|
|
|
In query terms, list behavior becomes a set of derived relations: first child, next sibling, next element, visible element, and next visible element.
|
|
Datalog can express those relations directly. The query is longer than the register example, but it is still a declarative description of the list
|
|
semantics.
|
|
|
|
---
|
|
|
|
## Incremental View Maintenance
|
|
|
|
If CRDT state is a query over all operations, the obvious worry is cost. The operation set only grows. A naive implementation would recompute the
|
|
entire query result every time a new operation arrives.
|
|
|
|
Incremental view maintenance is the response. The engine maintains the result of a query as inputs change. When a new operation arrives, the engine
|
|
computes the change to the output rather than recomputing the whole output.
|
|
|
|
For a replicated application, the desired runtime shape is:
|
|
|
|
```text
|
|
new operation facts
|
|
-> incremental query update
|
|
-> visible state changes
|
|
-> application update
|
|
```
|
|
|
|
This shape is especially attractive for user interfaces. If the query engine emits deltas, the application can update only the affected views. It is
|
|
also attractive for local-first systems because the same machinery can process local writes, remote writes, and startup replay.
|
|
|
|
DBSP is relevant here because it provides a formal and practical model for incremental computation over changing relations. Relational operators are
|
|
lifted into a streaming setting where inputs and outputs evolve over time.
|
|
|
|
---
|
|
|
|
## Hydration and Warm Updates
|
|
|
|
There are two different performance situations to keep separate.
|
|
|
|
Hydration is startup. The application already has a stored operation history, but the query engine must rebuild its internal operator state. It may
|
|
need to parse the query, build the execution plan, feed in the existing facts, and produce the current state. Hydration measures how long it takes
|
|
before the application can show the document or database contents after opening.
|
|
|
|
Warm update processing is the normal running mode after hydration. The query engine already has internal state. A small batch of new operations
|
|
arrives. The engine only needs to update the maintained result.
|
|
|
|
A design can perform acceptably in warm updates but poorly during hydration, or the other way around. For CRDT-backed applications, both matter. A
|
|
collaborative editor must feel responsive while editing, but it must also open large documents without a long pause.
|
|
|
|
This distinction also suggests possible hybrid strategies. A system might use a batch-oriented computation for startup and then switch to incremental
|
|
maintenance. Or it might persist internal operator state so startup does not require replaying the entire operation history.
|
|
|
|
---
|
|
|
|
## Relational Intermediate Representation
|
|
|
|
A Datalog program is convenient for users, but the execution engine usually wants a lower-level representation. A common design is to translate
|
|
Datalog into a relational intermediate representation.
|
|
|
|
The relational IR can include operators such as:
|
|
|
|
- projection
|
|
- selection
|
|
- join
|
|
- antijoin
|
|
- union
|
|
- difference
|
|
- distinct
|
|
- fixed-point iteration
|
|
|
|
This is useful for two reasons. First, relational algebra is a good target for optimization. The engine can push down filters, remove unused fields,
|
|
combine projections, and choose join strategies. Second, the IR separates the frontend language from the execution backend. Datalog is one possible
|
|
frontend. DBSP is one possible incremental backend.
|
|
|
|
That separation matters for research and engineering. If the backend changes from DBSP to another incremental framework, the Datalog frontend does not
|
|
have to be redesigned. If a SQL-like frontend is added later, it can target the same IR.
|
|
|
|
---
|
|
|
|
## Where the Costs Hide
|
|
|
|
The query-based approach simplifies some reasoning, but it does not erase hard systems problems.
|
|
|
|
Negation is one source of care. Datalog with arbitrary negation can have unclear or unstable semantics. Stratified negation restricts programs so
|
|
negative dependencies do not form problematic cycles. This keeps evaluation understandable, but it limits what can be expressed directly.
|
|
|
|
Recursion is another source of cost. Recursive rules are needed for graph reachability, transitive closure, causal readiness, and list traversal.
|
|
Incremental recursion can still be expensive when each update affects a long chain or a large region of the dependency graph.
|
|
|
|
Join planning matters as well. Datalog rules often translate into joins. Bad join order can create large intermediate relations. In a continuously
|
|
maintained query, changing the plan later may be harder than changing it for a one-shot query because the operators hold state.
|
|
|
|
Storage growth is also unresolved. The clean convergence story assumes a monotonically growing operation set. Real applications cannot always keep
|
|
every operation forever, especially on small devices. Compaction must preserve enough information for future queries and future synchronization.
|
|
|
|
These are not arguments against the approach. They are the places where the approach becomes a database systems problem rather than only a
|
|
programming-language idea.
|
|
|
|
---
|
|
|
|
## What This Approach Buys
|
|
|
|
The first benefit is conceptual. The CRDT is specified as a query. The implementation has a clearer boundary between logical behavior and physical
|
|
execution. That is the same separation that made relational databases powerful: the user specifies what result should exist, and the engine decides
|
|
how to maintain it.
|
|
|
|
The second benefit is extensibility. A fixed CRDT library exposes a fixed set of data types. A query-based system could let application developers
|
|
define custom replicated structures, provided they stay within the safe fragment of the language.
|
|
|
|
The third benefit is a shared interface. Ordinary application state, derived views, and replicated state can all look like queries. This could reduce
|
|
the number of special-purpose layers in local-first applications.
|
|
|
|
The fourth benefit is optimization headroom. If CRDTs are expressed through a query plan, improvements to the query engine can improve many CRDT
|
|
definitions without changing their logical definitions.
|
|
|
|
---
|
|
|
|
## What Remains Open
|
|
|
|
Several questions remain open before this style can be treated as a production design.
|
|
|
|
Can enough useful CRDTs be expressed in Datalog with stratified negation? Registers and lists are promising examples, but nested documents, moves in
|
|
trees, undo and redo, and rich JSON-like structures are harder.
|
|
|
|
Can incremental evaluation make the performance competitive with hand-written CRDTs? A hand-written CRDT can exploit structure-specific shortcuts. A
|
|
query engine needs optimization to avoid paying too much for generality.
|
|
|
|
Can operation histories be compacted safely? Append-only facts are clean, but unbounded growth is not acceptable for every application.
|
|
|
|
Can the system provide good error messages and type checking? A query language for application developers needs more than a working parser and
|
|
runtime.
|
|
|
|
Can causal readiness be optimized around the common case? Most new operations in a live application are likely close to current causal heads, but a
|
|
naive recursive query may still traverse from roots.
|
|
|
|
These questions define the practical research agenda.
|
|
|
|
---
|
|
|
|
## Reading Frame
|
|
|
|
The best way to read this line of work is as a bridge between three areas.
|
|
|
|
From CRDTs, it takes the goal of coordination-free replicated state.
|
|
|
|
From Datalog, it takes declarative rules, recursion, and deterministic derivation over facts.
|
|
|
|
From incremental query engines, it takes the ability to maintain derived state as input changes.
|
|
|
|
The slogan is:
|
|
|
|
```text
|
|
replicated data structure = materialized query over immutable operations
|
|
```
|
|
|
|
That slogan is not the whole system, but it is the core mental model. The data structure is no longer only an object with methods. It is a maintained
|
|
view. The operation log is the base data. The query is the semantics. The incremental engine is the execution strategy.
|
|
|
|
---
|
|
|
|
## Changelog
|
|
|
|
* **May 7, 2026** -- First version created.
|