Add a note file related to DBSP specs
This commit is contained in:
parent
1ec1ab9adc
commit
08ee0992e6
252
dbsp/004-dbsp-spec-reading-notes.md
Normal file
252
dbsp/004-dbsp-spec-reading-notes.md
Normal file
@ -0,0 +1,252 @@
|
||||
# DBSP Spec Reading Notes
|
||||
|
||||
A reading note on the DBSP specification and its model for incremental view maintenance.
|
||||
|
||||
---
|
||||
|
||||
## Short Answer
|
||||
|
||||
DBSP treats view maintenance as a streaming computation over changes.
|
||||
|
||||
The central definition is:
|
||||
|
||||
```text
|
||||
Q_delta = D . lift(Q) . I
|
||||
```
|
||||
|
||||
Read this as:
|
||||
|
||||
```text
|
||||
input changes
|
||||
-> integrate into current input snapshots
|
||||
-> run the ordinary query on each snapshot
|
||||
-> differentiate the output snapshots into output changes
|
||||
```
|
||||
|
||||
This definition is correct but naive if executed literally. It would rebuild full input snapshots and recompute the full query after every update. The
|
||||
main contribution of DBSP is the circuit algebra that rewrites this definition into an implementation that works directly on deltas and maintained
|
||||
operator state.
|
||||
|
||||
---
|
||||
|
||||
## Streams
|
||||
|
||||
A stream is a function from discrete time to values:
|
||||
|
||||
```text
|
||||
stream: Nat -> A
|
||||
```
|
||||
|
||||
Time is not wall-clock time. For database maintenance, time usually counts input transactions or update batches. If `DB[t]` is the database snapshot
|
||||
after transaction `t`, then a database is a stream of snapshots.
|
||||
|
||||
A stream operator consumes one or more streams and produces another stream. DBSP programs are drawn as circuits:
|
||||
|
||||
```text
|
||||
stream inputs -> operator boxes -> stream outputs
|
||||
```
|
||||
|
||||
Ordinary scalar functions can be lifted to streams. If `Q` is a query on one database snapshot, then `lift(Q)` applies `Q` independently to every
|
||||
snapshot in a stream.
|
||||
|
||||
---
|
||||
|
||||
## Integration and Differentiation
|
||||
|
||||
DBSP uses two basic stream operators:
|
||||
|
||||
- `D`: differentiation
|
||||
- `I`: integration
|
||||
|
||||
Differentiation turns snapshots into changes:
|
||||
|
||||
```text
|
||||
D(s)[t] = s[t] - s[t - 1]
|
||||
```
|
||||
|
||||
Integration turns changes into snapshots:
|
||||
|
||||
```text
|
||||
I(s)[t] = sum of s[i] for i <= t
|
||||
```
|
||||
|
||||
They are inverses:
|
||||
|
||||
```text
|
||||
D(I(s)) = s
|
||||
I(D(s)) = s
|
||||
```
|
||||
|
||||
This is the mathematical basis for incremental view maintenance. If a query can be interpreted as a stream computation, then DBSP can define its
|
||||
incremental version by placing `I` before it and `D` after it.
|
||||
|
||||
---
|
||||
|
||||
## Z-Sets
|
||||
|
||||
DBSP represents relations as Z-sets.
|
||||
|
||||
A Z-set is a finite map from values to integer weights:
|
||||
|
||||
```text
|
||||
{ row1 -> 1, row2 -> 1, row3 -> -1 }
|
||||
```
|
||||
|
||||
The integer weight is the row multiplicity. Positive weights represent presence. Negative weights represent deletion or compensation. A normal set is
|
||||
a Z-set where every present row has weight `1`.
|
||||
|
||||
This representation matters because Z-sets form an abelian group. They support zero, addition, negation, and subtraction. Those operations are what
|
||||
make `D` and `I` well-defined for relations.
|
||||
|
||||
In practical terms:
|
||||
|
||||
- an insertion is a singleton Z-set with weight `+1`
|
||||
- a deletion is a singleton Z-set with weight `-1`
|
||||
- a batch update is a Z-set containing many weighted rows
|
||||
- applying a batch means adding its weights to the current relation
|
||||
|
||||
---
|
||||
|
||||
## Relational Operators
|
||||
|
||||
Relational algebra operators become functions over Z-sets.
|
||||
|
||||
Projection sums weights for rows that collapse to the same projected value. Filtering keeps or removes weighted rows according to a predicate. Union
|
||||
is addition followed by `distinct` when set semantics are required. Difference uses subtraction followed by `distinct`.
|
||||
|
||||
Joins are the important non-linear case. A join combines row weights by multiplication:
|
||||
|
||||
```text
|
||||
(R join S)[(r, s)] = R[r] * S[s]
|
||||
```
|
||||
|
||||
When an input changes, the output change depends on both the new delta and the maintained state of the other side. Conceptually:
|
||||
|
||||
```text
|
||||
dR join S
|
||||
R join dS
|
||||
dR join dS
|
||||
```
|
||||
|
||||
This is why an efficient DBSP runtime maintains indexed state for joins, aggregations, distinct, and related operators.
|
||||
|
||||
---
|
||||
|
||||
## Incrementalization
|
||||
|
||||
The spec gives a mechanical path for relational queries:
|
||||
|
||||
1. Query translation into a circuit of relational operators.
|
||||
2. Circuit optimizations.
|
||||
3. Circuit lifting to streams.
|
||||
4. Circuit bracketing with `I` and `D`.
|
||||
5. Algebraic rewriting so operators consume deltas directly.
|
||||
|
||||
The useful rule is the chain rule:
|
||||
|
||||
```text
|
||||
(Q1 . Q2)_delta = Q1_delta . Q2_delta
|
||||
```
|
||||
|
||||
This lets DBSP incrementalize a whole query by incrementalizing its parts and composing the results.
|
||||
|
||||
Linear time-invariant operators are especially simple:
|
||||
|
||||
```text
|
||||
Q_delta = Q
|
||||
```
|
||||
|
||||
Projection, filtering, addition, and negation fall into this easy category. Joins and other non-linear operators need more structure because they
|
||||
combine current state with incoming deltas.
|
||||
|
||||
---
|
||||
|
||||
## Recursive Queries
|
||||
|
||||
Recursive queries are represented as circuits with feedback.
|
||||
|
||||
The feedback edge passes through a delay operator:
|
||||
|
||||
```text
|
||||
z^-1
|
||||
```
|
||||
|
||||
The delay means the next recursive step depends on the previous value, not on its own instantaneous output. This makes the circuit well-defined.
|
||||
|
||||
Datalog recursion then becomes a fixed-point computation. A recursive rule such as transitive closure can be compiled into a circuit that repeatedly
|
||||
derives new facts until no more facts appear.
|
||||
|
||||
DBSP extends the incrementalization story to recursive circuits by using nested streams. At the outer level, input updates arrive over transaction
|
||||
time. At the inner level, each update may trigger an iterative fixed-point adjustment. The maintained result changes by a stream of corrections rather
|
||||
than a full recomputation from scratch.
|
||||
|
||||
---
|
||||
|
||||
## Datalog Compilation
|
||||
|
||||
The spec also describes how Differential Datalog can be compiled to DBSP circuits.
|
||||
|
||||
The pipeline is:
|
||||
|
||||
```text
|
||||
Datalog relations and rules
|
||||
-> valuations as Z-sets
|
||||
-> relational operator circuits
|
||||
-> recursive fixed-point circuits when needed
|
||||
-> incremental DBSP circuits
|
||||
```
|
||||
|
||||
Rule bodies become relational operations over valuations. Repeated rule heads become union. Negation becomes set difference or antijoin, subject to
|
||||
stratification. Grouping and aggregation become grouped Z-set operators.
|
||||
|
||||
This is relevant for Geolog-shaped work because it separates the source language from the incremental execution model. A Datalog-like or relational
|
||||
intermediate representation can be lowered into DBSP without making DBSP responsible for the source language's full semantics.
|
||||
|
||||
---
|
||||
|
||||
## Runtime Shape
|
||||
|
||||
DBSP is a view-maintenance engine, not a database.
|
||||
|
||||
It maintains the state required to produce output changes. It does not automatically provide arbitrary reads of every intermediate relation. If a
|
||||
system needs to query a maintained view directly, the runtime must keep that view in integrated form and expose an API for membership or enumeration.
|
||||
|
||||
The runtime state includes the contents of delay nodes and operator state such as indexed Z-sets. Checkpointing, restore, and transaction rollback
|
||||
therefore need to account for DBSP state as well as storage state.
|
||||
|
||||
For an application, the intended shape is:
|
||||
|
||||
```text
|
||||
input relation deltas
|
||||
-> DBSP circuit step
|
||||
-> output relation deltas
|
||||
-> integrated view or application update
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Practical Mental Model
|
||||
|
||||
DBSP is best understood as:
|
||||
|
||||
```text
|
||||
relational query plan
|
||||
+ streams of Z-set deltas
|
||||
+ integration and differentiation algebra
|
||||
+ circuit rewriting
|
||||
+ maintained operator state
|
||||
```
|
||||
|
||||
The specification's main result is not just that incremental maintenance is possible. It gives a uniform way to define the incremental version of a
|
||||
query, then optimize that definition into a practical circuit.
|
||||
|
||||
For Geolog or Geomerge integration, the useful boundary is likely:
|
||||
|
||||
```text
|
||||
compiled relational laws
|
||||
-> violation queries
|
||||
-> DBSP-maintained violation deltas
|
||||
```
|
||||
|
||||
That makes DBSP a performance layer for supported relational checks. It does not by itself solve witness generation, disjunction, equality
|
||||
saturation, or chase search.
|
||||
Loading…
x
Reference in New Issue
Block a user