useful-notes/dbsp/004-dbsp-spec-reading-notes.md

253 lines
7.3 KiB
Markdown
Raw Normal View History

2026-05-11 15:06:04 +02:00
# DBSP Spec Reading Notes
A reading note on the DBSP specification and its model for incremental view maintenance.
---
## Short Answer
DBSP treats view maintenance as a streaming computation over changes.
The central definition is:
```text
Q_delta = D . lift(Q) . I
```
Read this as:
```text
input changes
-> integrate into current input snapshots
-> run the ordinary query on each snapshot
-> differentiate the output snapshots into output changes
```
This definition is correct but naive if executed literally. It would rebuild full input snapshots and recompute the full query after every update. The
main contribution of DBSP is the circuit algebra that rewrites this definition into an implementation that works directly on deltas and maintained
operator state.
---
## Streams
A stream is a function from discrete time to values:
```text
stream: Nat -> A
```
Time is not wall-clock time. For database maintenance, time usually counts input transactions or update batches. If `DB[t]` is the database snapshot
after transaction `t`, then a database is a stream of snapshots.
A stream operator consumes one or more streams and produces another stream. DBSP programs are drawn as circuits:
```text
stream inputs -> operator boxes -> stream outputs
```
Ordinary scalar functions can be lifted to streams. If `Q` is a query on one database snapshot, then `lift(Q)` applies `Q` independently to every
snapshot in a stream.
---
## Integration and Differentiation
DBSP uses two basic stream operators:
- `D`: differentiation
- `I`: integration
Differentiation turns snapshots into changes:
```text
D(s)[t] = s[t] - s[t - 1]
```
Integration turns changes into snapshots:
```text
I(s)[t] = sum of s[i] for i <= t
```
They are inverses:
```text
D(I(s)) = s
I(D(s)) = s
```
This is the mathematical basis for incremental view maintenance. If a query can be interpreted as a stream computation, then DBSP can define its
incremental version by placing `I` before it and `D` after it.
---
## Z-Sets
DBSP represents relations as Z-sets.
A Z-set is a finite map from values to integer weights:
```text
{ row1 -> 1, row2 -> 1, row3 -> -1 }
```
The integer weight is the row multiplicity. Positive weights represent presence. Negative weights represent deletion or compensation. A normal set is
a Z-set where every present row has weight `1`.
This representation matters because Z-sets form an abelian group. They support zero, addition, negation, and subtraction. Those operations are what
make `D` and `I` well-defined for relations.
In practical terms:
- an insertion is a singleton Z-set with weight `+1`
- a deletion is a singleton Z-set with weight `-1`
- a batch update is a Z-set containing many weighted rows
- applying a batch means adding its weights to the current relation
---
## Relational Operators
Relational algebra operators become functions over Z-sets.
Projection sums weights for rows that collapse to the same projected value. Filtering keeps or removes weighted rows according to a predicate. Union
is addition followed by `distinct` when set semantics are required. Difference uses subtraction followed by `distinct`.
Joins are the important non-linear case. A join combines row weights by multiplication:
```text
(R join S)[(r, s)] = R[r] * S[s]
```
When an input changes, the output change depends on both the new delta and the maintained state of the other side. Conceptually:
```text
dR join S
R join dS
dR join dS
```
This is why an efficient DBSP runtime maintains indexed state for joins, aggregations, distinct, and related operators.
---
## Incrementalization
The spec gives a mechanical path for relational queries:
1. Query translation into a circuit of relational operators.
2. Circuit optimizations.
3. Circuit lifting to streams.
4. Circuit bracketing with `I` and `D`.
5. Algebraic rewriting so operators consume deltas directly.
The useful rule is the chain rule:
```text
(Q1 . Q2)_delta = Q1_delta . Q2_delta
```
This lets DBSP incrementalize a whole query by incrementalizing its parts and composing the results.
Linear time-invariant operators are especially simple:
```text
Q_delta = Q
```
Projection, filtering, addition, and negation fall into this easy category. Joins and other non-linear operators need more structure because they
combine current state with incoming deltas.
---
## Recursive Queries
Recursive queries are represented as circuits with feedback.
The feedback edge passes through a delay operator:
```text
z^-1
```
The delay means the next recursive step depends on the previous value, not on its own instantaneous output. This makes the circuit well-defined.
Datalog recursion then becomes a fixed-point computation. A recursive rule such as transitive closure can be compiled into a circuit that repeatedly
derives new facts until no more facts appear.
DBSP extends the incrementalization story to recursive circuits by using nested streams. At the outer level, input updates arrive over transaction
time. At the inner level, each update may trigger an iterative fixed-point adjustment. The maintained result changes by a stream of corrections rather
than a full recomputation from scratch.
---
## Datalog Compilation
The spec also describes how Differential Datalog can be compiled to DBSP circuits.
The pipeline is:
```text
Datalog relations and rules
-> valuations as Z-sets
-> relational operator circuits
-> recursive fixed-point circuits when needed
-> incremental DBSP circuits
```
Rule bodies become relational operations over valuations. Repeated rule heads become union. Negation becomes set difference or antijoin, subject to
stratification. Grouping and aggregation become grouped Z-set operators.
This is relevant for Geolog-shaped work because it separates the source language from the incremental execution model. A Datalog-like or relational
intermediate representation can be lowered into DBSP without making DBSP responsible for the source language's full semantics.
---
## Runtime Shape
DBSP is a view-maintenance engine, not a database.
It maintains the state required to produce output changes. It does not automatically provide arbitrary reads of every intermediate relation. If a
system needs to query a maintained view directly, the runtime must keep that view in integrated form and expose an API for membership or enumeration.
The runtime state includes the contents of delay nodes and operator state such as indexed Z-sets. Checkpointing, restore, and transaction rollback
therefore need to account for DBSP state as well as storage state.
For an application, the intended shape is:
```text
input relation deltas
-> DBSP circuit step
-> output relation deltas
-> integrated view or application update
```
---
## Practical Mental Model
DBSP is best understood as:
```text
relational query plan
+ streams of Z-set deltas
+ integration and differentiation algebra
+ circuit rewriting
+ maintained operator state
```
The specification's main result is not just that incremental maintenance is possible. It gives a uniform way to define the incremental version of a
query, then optimize that definition into a practical circuit.
For Geolog or Geomerge integration, the useful boundary is likely:
```text
compiled relational laws
-> violation queries
-> DBSP-maintained violation deltas
```
That makes DBSP a performance layer for supported relational checks. It does not by itself solve witness generation, disjunction, equality
saturation, or chase search.