253 lines
7.3 KiB
Markdown
253 lines
7.3 KiB
Markdown
|
|
# DBSP Spec Reading Notes
|
||
|
|
|
||
|
|
A reading note on the DBSP specification and its model for incremental view maintenance.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Short Answer
|
||
|
|
|
||
|
|
DBSP treats view maintenance as a streaming computation over changes.
|
||
|
|
|
||
|
|
The central definition is:
|
||
|
|
|
||
|
|
```text
|
||
|
|
Q_delta = D . lift(Q) . I
|
||
|
|
```
|
||
|
|
|
||
|
|
Read this as:
|
||
|
|
|
||
|
|
```text
|
||
|
|
input changes
|
||
|
|
-> integrate into current input snapshots
|
||
|
|
-> run the ordinary query on each snapshot
|
||
|
|
-> differentiate the output snapshots into output changes
|
||
|
|
```
|
||
|
|
|
||
|
|
This definition is correct but naive if executed literally. It would rebuild full input snapshots and recompute the full query after every update. The
|
||
|
|
main contribution of DBSP is the circuit algebra that rewrites this definition into an implementation that works directly on deltas and maintained
|
||
|
|
operator state.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Streams
|
||
|
|
|
||
|
|
A stream is a function from discrete time to values:
|
||
|
|
|
||
|
|
```text
|
||
|
|
stream: Nat -> A
|
||
|
|
```
|
||
|
|
|
||
|
|
Time is not wall-clock time. For database maintenance, time usually counts input transactions or update batches. If `DB[t]` is the database snapshot
|
||
|
|
after transaction `t`, then a database is a stream of snapshots.
|
||
|
|
|
||
|
|
A stream operator consumes one or more streams and produces another stream. DBSP programs are drawn as circuits:
|
||
|
|
|
||
|
|
```text
|
||
|
|
stream inputs -> operator boxes -> stream outputs
|
||
|
|
```
|
||
|
|
|
||
|
|
Ordinary scalar functions can be lifted to streams. If `Q` is a query on one database snapshot, then `lift(Q)` applies `Q` independently to every
|
||
|
|
snapshot in a stream.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Integration and Differentiation
|
||
|
|
|
||
|
|
DBSP uses two basic stream operators:
|
||
|
|
|
||
|
|
- `D`: differentiation
|
||
|
|
- `I`: integration
|
||
|
|
|
||
|
|
Differentiation turns snapshots into changes:
|
||
|
|
|
||
|
|
```text
|
||
|
|
D(s)[t] = s[t] - s[t - 1]
|
||
|
|
```
|
||
|
|
|
||
|
|
Integration turns changes into snapshots:
|
||
|
|
|
||
|
|
```text
|
||
|
|
I(s)[t] = sum of s[i] for i <= t
|
||
|
|
```
|
||
|
|
|
||
|
|
They are inverses:
|
||
|
|
|
||
|
|
```text
|
||
|
|
D(I(s)) = s
|
||
|
|
I(D(s)) = s
|
||
|
|
```
|
||
|
|
|
||
|
|
This is the mathematical basis for incremental view maintenance. If a query can be interpreted as a stream computation, then DBSP can define its
|
||
|
|
incremental version by placing `I` before it and `D` after it.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Z-Sets
|
||
|
|
|
||
|
|
DBSP represents relations as Z-sets.
|
||
|
|
|
||
|
|
A Z-set is a finite map from values to integer weights:
|
||
|
|
|
||
|
|
```text
|
||
|
|
{ row1 -> 1, row2 -> 1, row3 -> -1 }
|
||
|
|
```
|
||
|
|
|
||
|
|
The integer weight is the row multiplicity. Positive weights represent presence. Negative weights represent deletion or compensation. A normal set is
|
||
|
|
a Z-set where every present row has weight `1`.
|
||
|
|
|
||
|
|
This representation matters because Z-sets form an abelian group. They support zero, addition, negation, and subtraction. Those operations are what
|
||
|
|
make `D` and `I` well-defined for relations.
|
||
|
|
|
||
|
|
In practical terms:
|
||
|
|
|
||
|
|
- an insertion is a singleton Z-set with weight `+1`
|
||
|
|
- a deletion is a singleton Z-set with weight `-1`
|
||
|
|
- a batch update is a Z-set containing many weighted rows
|
||
|
|
- applying a batch means adding its weights to the current relation
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Relational Operators
|
||
|
|
|
||
|
|
Relational algebra operators become functions over Z-sets.
|
||
|
|
|
||
|
|
Projection sums weights for rows that collapse to the same projected value. Filtering keeps or removes weighted rows according to a predicate. Union
|
||
|
|
is addition followed by `distinct` when set semantics are required. Difference uses subtraction followed by `distinct`.
|
||
|
|
|
||
|
|
Joins are the important non-linear case. A join combines row weights by multiplication:
|
||
|
|
|
||
|
|
```text
|
||
|
|
(R join S)[(r, s)] = R[r] * S[s]
|
||
|
|
```
|
||
|
|
|
||
|
|
When an input changes, the output change depends on both the new delta and the maintained state of the other side. Conceptually:
|
||
|
|
|
||
|
|
```text
|
||
|
|
dR join S
|
||
|
|
R join dS
|
||
|
|
dR join dS
|
||
|
|
```
|
||
|
|
|
||
|
|
This is why an efficient DBSP runtime maintains indexed state for joins, aggregations, distinct, and related operators.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Incrementalization
|
||
|
|
|
||
|
|
The spec gives a mechanical path for relational queries:
|
||
|
|
|
||
|
|
1. Query translation into a circuit of relational operators.
|
||
|
|
2. Circuit optimizations.
|
||
|
|
3. Circuit lifting to streams.
|
||
|
|
4. Circuit bracketing with `I` and `D`.
|
||
|
|
5. Algebraic rewriting so operators consume deltas directly.
|
||
|
|
|
||
|
|
The useful rule is the chain rule:
|
||
|
|
|
||
|
|
```text
|
||
|
|
(Q1 . Q2)_delta = Q1_delta . Q2_delta
|
||
|
|
```
|
||
|
|
|
||
|
|
This lets DBSP incrementalize a whole query by incrementalizing its parts and composing the results.
|
||
|
|
|
||
|
|
Linear time-invariant operators are especially simple:
|
||
|
|
|
||
|
|
```text
|
||
|
|
Q_delta = Q
|
||
|
|
```
|
||
|
|
|
||
|
|
Projection, filtering, addition, and negation fall into this easy category. Joins and other non-linear operators need more structure because they
|
||
|
|
combine current state with incoming deltas.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Recursive Queries
|
||
|
|
|
||
|
|
Recursive queries are represented as circuits with feedback.
|
||
|
|
|
||
|
|
The feedback edge passes through a delay operator:
|
||
|
|
|
||
|
|
```text
|
||
|
|
z^-1
|
||
|
|
```
|
||
|
|
|
||
|
|
The delay means the next recursive step depends on the previous value, not on its own instantaneous output. This makes the circuit well-defined.
|
||
|
|
|
||
|
|
Datalog recursion then becomes a fixed-point computation. A recursive rule such as transitive closure can be compiled into a circuit that repeatedly
|
||
|
|
derives new facts until no more facts appear.
|
||
|
|
|
||
|
|
DBSP extends the incrementalization story to recursive circuits by using nested streams. At the outer level, input updates arrive over transaction
|
||
|
|
time. At the inner level, each update may trigger an iterative fixed-point adjustment. The maintained result changes by a stream of corrections rather
|
||
|
|
than a full recomputation from scratch.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Datalog Compilation
|
||
|
|
|
||
|
|
The spec also describes how Differential Datalog can be compiled to DBSP circuits.
|
||
|
|
|
||
|
|
The pipeline is:
|
||
|
|
|
||
|
|
```text
|
||
|
|
Datalog relations and rules
|
||
|
|
-> valuations as Z-sets
|
||
|
|
-> relational operator circuits
|
||
|
|
-> recursive fixed-point circuits when needed
|
||
|
|
-> incremental DBSP circuits
|
||
|
|
```
|
||
|
|
|
||
|
|
Rule bodies become relational operations over valuations. Repeated rule heads become union. Negation becomes set difference or antijoin, subject to
|
||
|
|
stratification. Grouping and aggregation become grouped Z-set operators.
|
||
|
|
|
||
|
|
This is relevant for Geolog-shaped work because it separates the source language from the incremental execution model. A Datalog-like or relational
|
||
|
|
intermediate representation can be lowered into DBSP without making DBSP responsible for the source language's full semantics.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Runtime Shape
|
||
|
|
|
||
|
|
DBSP is a view-maintenance engine, not a database.
|
||
|
|
|
||
|
|
It maintains the state required to produce output changes. It does not automatically provide arbitrary reads of every intermediate relation. If a
|
||
|
|
system needs to query a maintained view directly, the runtime must keep that view in integrated form and expose an API for membership or enumeration.
|
||
|
|
|
||
|
|
The runtime state includes the contents of delay nodes and operator state such as indexed Z-sets. Checkpointing, restore, and transaction rollback
|
||
|
|
therefore need to account for DBSP state as well as storage state.
|
||
|
|
|
||
|
|
For an application, the intended shape is:
|
||
|
|
|
||
|
|
```text
|
||
|
|
input relation deltas
|
||
|
|
-> DBSP circuit step
|
||
|
|
-> output relation deltas
|
||
|
|
-> integrated view or application update
|
||
|
|
```
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Practical Mental Model
|
||
|
|
|
||
|
|
DBSP is best understood as:
|
||
|
|
|
||
|
|
```text
|
||
|
|
relational query plan
|
||
|
|
+ streams of Z-set deltas
|
||
|
|
+ integration and differentiation algebra
|
||
|
|
+ circuit rewriting
|
||
|
|
+ maintained operator state
|
||
|
|
```
|
||
|
|
|
||
|
|
The specification's main result is not just that incremental maintenance is possible. It gives a uniform way to define the incremental version of a
|
||
|
|
query, then optimize that definition into a practical circuit.
|
||
|
|
|
||
|
|
For Geolog or Geomerge integration, the useful boundary is likely:
|
||
|
|
|
||
|
|
```text
|
||
|
|
compiled relational laws
|
||
|
|
-> violation queries
|
||
|
|
-> DBSP-maintained violation deltas
|
||
|
|
```
|
||
|
|
|
||
|
|
That makes DBSP a performance layer for supported relational checks. It does not by itself solve witness generation, disjunction, equality
|
||
|
|
saturation, or chase search.
|