# FlowLog and DBSP Synergy

A note on how FlowLog's Datalog planning ideas could support the DBSP, CRDT, and Geomerge work.

---

## Short Answer

FlowLog and the DBSP notes meet at the boundary between Datalog rules and incremental execution.

DBSP answers:

```text
How can a relational query be maintained over changing inputs?
```

FlowLog helps answer:

```text
What relational plan should the incremental backend maintain?
```

The synergy is not that FlowLog should replace DBSP. The useful split is:

```text
FlowLog-like frontend and optimizer
  -> backend-neutral relational IR
  -> DBSP-maintained circuit
```

FlowLog is a useful blueprint for the compiler and optimizer that should sit in front of DBSP.

---
## Existing DBSP Direction

The DBSP notes are organized around three related goals.

First, CRDTs can be described as deterministic Datalog queries over immutable operation facts:

```text
operation facts
  -> Datalog rules
  -> visible CRDT state
```

Second, DBSP can maintain those query results incrementally:

```text
input relation deltas
  -> DBSP circuit step
  -> output relation deltas
```
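As a minimal sketch of this delta-in, delta-out step, the Python below models a Z-set as a dictionary from row to integer weight and maintains a projection by processing only each incoming delta. The helper names `zset_add` and `project`, and the sample `set` rows, are illustrative assumptions and not DBSP's actual API.

```python
def zset_add(a, b):
    """Add two Z-sets (row -> integer weight) and drop rows whose weight is zero."""
    out = dict(a)
    for row, weight in b.items():
        out[row] = out.get(row, 0) + weight
    return {row: w for row, w in out.items() if w != 0}


def project(zset, cols):
    """Linear operator: keep the given column positions and sum the weights."""
    out = {}
    for row, weight in zset.items():
        key = tuple(row[c] for c in cols)
        out[key] = out.get(key, 0) + weight
    return {row: w for row, w in out.items() if w != 0}


# Hypothetical deltas for a relation set(RepId, Ctr, Key, Value).
deltas = [
    {("a", 1, "k", "x"): +1},   # insert
    {("b", 1, "k", "y"): +1},   # insert
    {("a", 1, "k", "x"): -1},   # retract
]

state, output = {}, {}
for delta in deltas:
    state = zset_add(state, delta)
    # Projection is linear, so the output delta is just the projected input delta.
    output = zset_add(output, project(delta, (2, 3)))
    # The maintained output always equals a full recomputation.
    assert output == project(state, (2, 3))
```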
Third, Geomerge laws can be compiled into maintained violation relations:

```text
compiled relational laws
  -> violation queries
  -> maintained violation deltas
```

The missing layer is query planning. A Datalog rule can be semantically correct but still produce a poor physical plan.

---
## FlowLog's Useful Layer

FlowLog has a pipeline that is useful independently of its Differential Dataflow backend:

```text
Datalog program
  -> parser
  -> strata
  -> rule catalog
  -> per-rule relational plan
  -> optimizer
  -> incremental dataflow backend
```
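The dependency-analysis and strata steps of this pipeline can be sketched in a few lines. This is an illustrative Python version under simple assumptions: each rule is reduced to a head predicate plus body predicates with a negation flag. The function name `stratify` and the rule encoding are hypothetical and are not FlowLog's actual data structures.

```python
def stratify(rules):
    """Assign a stratum to every predicate.

    rules: list of (head, [(body_pred, negated), ...]).
    A head must sit at least as high as each positive body predicate and
    strictly higher than each negated one.
    """
    preds = {head for head, _ in rules} | {p for _, body in rules for p, _ in body}
    stratum = {p: 0 for p in preds}
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            for pred, negated in body:
                need = stratum[pred] + (1 if negated else 0)
                if need > stratum[head]:
                    if need > len(preds):
                        # Strata can only grow without bound on a negative cycle.
                        raise ValueError("negation through recursion: not stratifiable")
                    stratum[head] = need
                    changed = True
    return stratum


# Predicate names taken from the CRDT rules discussed in this note.
rules = [
    ("overwritten", [("pred", False)]),
    ("mvrStore", [("set", False), ("overwritten", True)]),
    ("isCausallyReady", [("isRoot", False)]),
    ("isCausallyReady", [("isCausallyReady", False), ("pred", False)]),
]
strata = stratify(rules)
```

Here `mvrStore` lands one stratum above `overwritten` because it negates it, while the positively recursive `isCausallyReady` stays in the lowest stratum.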
The reusable parts are:

- dependency analysis
- stratification
- rule catalog construction
- join graph extraction
- structural join planning
- sideways information passing
- physical key and value shape selection

These are exactly the parts a DBSP-backed Datalog or Geolog compiler needs before lowering rules to DBSP.

---
## CRDT Synergy

The CRDT notes identify three representative query shapes.

The multi-value register is mostly projection plus antijoin:

```text
overwritten(RepId, Ctr) :-
  pred(RepId, Ctr, _, _).

mvrStore(Key, Value) :-
  set(RepId, Ctr, Key, Value),
  not overwritten(RepId, Ctr).
```

This is a favorable DBSP workload. The backend can maintain the antijoin state and process small updates without rescanning the full operation history.

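A rough sketch of why small updates stay cheap for these two rules, assuming insert-only operation facts and at most one `set` fact per operation id. The class `MvrView` and its method names are hypothetical; a real DBSP circuit would maintain equivalent indexed state internally.

```python
class MvrView:
    """Maintains mvrStore(Key, Value) under single-fact insertions."""

    def __init__(self):
        self.sets = {}        # (rep, ctr) -> (key, value)
        self.succ_count = {}  # (rep, ctr) -> number of pred facts leaving this op

    def add_set(self, rep, ctr, key, value):
        """Insert set(rep, ctr, key, value) and return the mvrStore delta."""
        self.sets[(rep, ctr)] = (key, value)
        if self.succ_count.get((rep, ctr), 0) == 0:
            return {(key, value): +1}
        return {}  # already overwritten, so the value never becomes visible

    def add_pred(self, from_rep, from_ctr, to_rep, to_ctr):
        """Insert pred(from, to); the 'from' operation becomes overwritten."""
        op = (from_rep, from_ctr)
        previous = self.succ_count.get(op, 0)
        self.succ_count[op] = previous + 1
        if previous == 0 and op in self.sets:
            return {self.sets[op]: -1}
        return {}


view = MvrView()
d1 = view.add_set("a", 1, "k", "x")   # value "x" becomes visible
d2 = view.add_set("b", 1, "k", "y")   # concurrent value "y" becomes visible
d3 = view.add_pred("a", 1, "b", 1)    # op (a, 1) is overwritten, "x" is retracted
```

Each update touches a constant number of dictionary entries, regardless of how long the operation history is.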
The causal-readiness query is harder:

```text
isCausallyReady(RepId, Ctr) :-
  isRoot(RepId, Ctr).

isCausallyReady(RepId, Ctr) :-
  isCausallyReady(FromRepId, FromCtr),
  pred(FromRepId, FromCtr, RepId, Ctr).
```

This is recursive graph traversal. The DBSP CRDT notes report that the cost of maintaining this query can still depend on causal-history depth, even during warm updates.

FlowLog's planning ideas are relevant here. Sideways information passing suggests prefiltering recursive traversal through known relevant bindings. For CRDTs, that could mean using current heads, leaves, or newly arrived operations as a frontier instead of repeatedly deriving readiness from roots through the whole causal graph.

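One way to picture the frontier idea, as a hedged sketch rather than FlowLog's or DBSP's implementation: keep the set of ready operations and, when a fact arrives, expand only from operations that newly became ready. `ReadyView` and its methods are hypothetical names, and the sketch assumes insert-only facts (retractions are not handled).

```python
from collections import defaultdict, deque


class ReadyView:
    """Maintains isCausallyReady under insertions of isRoot and pred facts."""

    def __init__(self):
        self.succ = defaultdict(set)  # op -> operations that directly follow it
        self.ready = set()

    def _propagate(self, frontier):
        """Expand only from newly ready operations; return them in visit order."""
        newly_ready = []
        queue = deque(frontier)
        while queue:
            op = queue.popleft()
            if op in self.ready:
                continue  # already derived, nothing new to explore from here
            self.ready.add(op)
            newly_ready.append(op)
            queue.extend(self.succ[op])
        return newly_ready

    def add_root(self, op):
        return self._propagate([op])

    def add_pred(self, src, dst):
        self.succ[src].add(dst)
        # Only an edge leaving a ready operation can derive anything new.
        return self._propagate([dst]) if src in self.ready else []


view = ReadyView()
view.add_pred(("a", 1), ("a", 2))          # nothing is ready yet
first = view.add_root(("a", 1))            # readiness flows along the stored edge
later = view.add_pred(("a", 2), ("b", 1))  # work is limited to the new operation
```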
The list CRDT query is also planning-sensitive. Relations such as `firstChild`, `nextSibling`, `nextSiblingAnc`, `nextElem`, and `nextVisible` create several joins, antijoins, and recursive steps. FlowLog-style rule catalogs and join planning would help choose better intermediate shapes before DBSP builds maintained operator state.

---
## Geomerge Synergy

The Geomerge integration note proposes maintaining one combined violation relation.

Simple foreign-key laws are straightforward:

```text
required_src(graph, src) :- G.E(graph, src, dst).
missing_src(graph, src) :- required_src(graph, src), not G.V(graph, src).
```

For this subset, DBSP can maintain projections and antijoins directly.

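A small sketch of that maintenance, with hypothetical names (`MissingSrcView`, `update_edge`, `update_vertex`). Because `required_src` is a projection, several edges can support the same `(graph, src)` pair, so the sketch keeps a support count rather than a plain set. It assumes set semantics on the inputs: callers never insert the same edge or vertex twice, and weights are `+1` or `-1`.

```python
class MissingSrcView:
    """Maintains missing_src(graph, src) and reports output deltas."""

    def __init__(self):
        self.required = {}     # (graph, src) -> number of supporting edges
        self.vertices = set()  # (graph, v)

    def _is_missing(self, key):
        return self.required.get(key, 0) > 0 and key not in self.vertices

    def _apply(self, key, mutate):
        """Run one state change and return the resulting violation delta."""
        before = self._is_missing(key)
        mutate()
        after = self._is_missing(key)
        return {key: int(after) - int(before)} if after != before else {}

    def update_edge(self, graph, src, dst, weight):
        key = (graph, src)

        def mutate():
            self.required[key] = self.required.get(key, 0) + weight

        return self._apply(key, mutate)

    def update_vertex(self, graph, v, weight):
        key = (graph, v)

        def mutate():
            if weight > 0:
                self.vertices.add(key)
            else:
                self.vertices.discard(key)

        return self._apply(key, mutate)


view = MissingSrcView()
e1 = view.update_edge("g", "a", "b", +1)   # first edge from a missing vertex
e2 = view.update_edge("g", "a", "c", +1)   # second edge adds no new violation
v1 = view.update_vertex("g", "a", +1)      # declaring the vertex clears it
```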
The need for FlowLog-style planning grows when laws contain several atoms:

```text
violation(x, y, z) :-
  A(x, y),
  B(y, z),
  C(z),
  not D(x, z).
```

At that point, the compiler must decide:

- which atoms join first
- which variables form keys
- where filters and antijoins should be applied
- which intermediate fields must be retained
- whether several laws share subplans

FlowLog's catalog and structural planning model is a good guide for this compiler layer.

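To make the first and third decisions concrete, here is a deliberately simple greedy heuristic in Python. It is an assumption-laden sketch, not FlowLog's planner: it orders positive atoms by variable overlap with what is already bound and places each negated atom as soon as all of its variables are bound. The function name `plan_rule` and the atom encoding are hypothetical.

```python
def plan_rule(atoms):
    """Order the atoms of one rule body.

    atoms: list of (name, variables, negated). Returns a list of plan steps.
    """
    positive = [a for a in atoms if not a[2]]
    negative = [a for a in atoms if a[2]]
    bound, plan = set(), []
    while positive:
        # Prefer the atom sharing the most bound variables, then the one
        # introducing the fewest new variables (narrower intermediates).
        best = max(
            positive,
            key=lambda a: (len(bound & set(a[1])), -len(set(a[1]) - bound)),
        )
        positive.remove(best)
        plan.append(("join" if plan else "scan", best[0]))
        bound |= set(best[1])
        # Schedule every antijoin whose variables have all become bound.
        for neg in [n for n in negative if set(n[1]) <= bound]:
            negative.remove(neg)
            plan.append(("antijoin", neg[0]))
    if negative:
        raise ValueError("unsafe rule: a negated atom has unbound variables")
    return plan


rule = [
    ("A", ("x", "y"), False),
    ("B", ("y", "z"), False),
    ("C", ("z",), False),
    ("D", ("x", "z"), True),
]
plan = plan_rule(rule)
```

For the rule above, this heuristic starts from the unary atom `C`, joins `B` and then `A`, and applies the antijoin on `D` only once both `x` and `z` are bound. A production planner would also weigh key choice and shared subplans.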
The resulting Geomerge architecture could be:

```text
FlatTheory laws
  -> supported Datalog-shaped rules
  -> FlowLog-like catalog and optimizer
  -> relational violation plan
  -> DBSP circuit
  -> maintained violations relation
```

This keeps DBSP as a performance backend while giving Geomerge a real planning layer.

---
## IR Boundary

The strongest architectural lesson is to avoid binding the system too tightly to either source syntax or backend syntax.

The durable boundary should be a relational intermediate representation:

```text
source rules
  -> rule catalog
  -> relational IR
  -> backend-specific lowering
```
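As an illustration of what such an IR might look like, the sketch below defines a handful of plan nodes as Python dataclasses and encodes the multi-value register rules with them. The node names (`Scan`, `Join`, `AntiJoin`, `Project`), the column names `ToRepId` and `ToCtr`, and the helper `output_columns` are all assumptions made for this example; neither FlowLog nor DBSP prescribes this shape.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Scan:
    relation: str
    columns: Tuple[str, ...]


@dataclass(frozen=True)
class Join:
    left: "Plan"
    right: "Plan"
    keys: Tuple[str, ...]


@dataclass(frozen=True)
class AntiJoin:
    left: "Plan"
    right: "Plan"
    keys: Tuple[str, ...]


@dataclass(frozen=True)
class Project:
    source: "Plan"
    columns: Tuple[str, ...]


Plan = Union[Scan, Join, AntiJoin, Project]


def output_columns(plan):
    """Columns produced by a plan node, independent of any backend."""
    if isinstance(plan, (Scan, Project)):
        return plan.columns
    if isinstance(plan, AntiJoin):
        return output_columns(plan.left)  # an antijoin only filters the left side
    left, right = output_columns(plan.left), output_columns(plan.right)
    return left + tuple(c for c in right if c not in left)


# mvrStore(Key, Value) :- set(RepId, Ctr, Key, Value), not overwritten(RepId, Ctr).
overwritten = Project(
    Scan("pred", ("RepId", "Ctr", "ToRepId", "ToCtr")), ("RepId", "Ctr")
)
mvr_store = Project(
    AntiJoin(
        Scan("set", ("RepId", "Ctr", "Key", "Value")),
        overwritten,
        keys=("RepId", "Ctr"),
    ),
    ("Key", "Value"),
)
```

A plan value like `mvr_store` carries no backend detail, so the same tree could in principle be lowered to different engines.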
For CRDTs, the source may be a Datalog dialect.

For Geomerge, the source may be compiled Geolog laws.

For execution, the backend may be DBSP, Differential Dataflow, or a non-incremental batch engine.

This matches the existing DBSP notes: DBSP should not own full source-language semantics. It should receive a relational plan that has already been checked, stratified, and optimized.

---
## Optimization Transfer

FlowLog suggests several optimizations that transfer well to DBSP-backed work.

**Structural Planning**: Choose join trees from variable overlap and intermediate width, especially for recursive rules where cardinality estimates are weak.

**Sideways Information Passing**: Add semijoin-style filters so later joins and recursive steps see fewer irrelevant tuples.

**Antijoin Scheduling**: Apply negated atoms as soon as all of their variables are bound. This matches the DBSP CRDT note's antijoin-pushdown agenda.

**Subplan Sharing**: Reuse common derived relations across laws or CRDT views when multiple outputs need the same intermediate facts.

**Physical Key Choice**: Pick key fields deliberately before lowering to the backend. DBSP joins also need maintained state, so bad key and arrangement choices become ongoing runtime costs.

---
## Backend Comparison

FlowLog targets Differential Dataflow. The DBSP notes target DBSP.

The models differ:

- Differential Dataflow uses collections with logical time and differences.
- DBSP uses streams, Z-sets, integration, differentiation, and circuits.

The shared lesson is more important than the difference:

```text
incremental backends maintain operator state
```

That means bad plans become persistent state, not just one bad query execution. A poor join order or unnecessary intermediate relation can increase memory use and update cost for the lifetime of the maintained query.

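A toy illustration of the point, using made-up data for a three-atom join `A(x, y), B(y, z), C(z)`. Both join orders give the same answer, but the intermediate relation that an incremental backend would have to keep differs by four orders of magnitude. The relation contents are assumptions chosen to exaggerate the effect.

```python
# Hypothetical inputs: A and B share a single hot y value, C is very selective.
A = {(x, 0) for x in range(100)}   # A(x, y)
B = {(0, z) for z in range(100)}   # B(y, z)
C = {(7,)}                         # C(z)

# Order 1: (A join B) first, then filter by C.
ab = {(x, y, z) for (x, y) in A for (y2, z) in B if y == y2}
ab_c = {(x, y, z) for (x, y, z) in ab if (z,) in C}

# Order 2: (B join C) first, then join with A.
bc = {(y, z) for (y, z) in B if (z,) in C}
a_bc = {(x, y, z) for (x, y) in A for (y2, z) in bc if y == y2}

# Same result, very different intermediate state to maintain.
assert ab_c == a_bc
intermediate_sizes = {"A join B": len(ab), "B join C": len(bc)}
```

In a batch engine the large intermediate is paid for once. In a maintained query it would stay resident and be updated on every relevant input change.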
FlowLog is useful because it treats planning as a first-class layer before execution.

---
## Proposed Synergy Path

The practical path can be taken in stages.

First, use FlowLog as a reading reference for a DBSP frontend:

```text
parser
  -> dependency graph
  -> strata
  -> rule catalog
  -> relational plan
```

Second, add a small optimizer:

```text
join ordering
antijoin pushdown
simple SIP for bound variables
```

Third, lower the optimized plan to DBSP operators:

```text
projection
selection
join
antijoin
union
distinct
fixed point
```
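For the join operator in particular, the usual incremental rule for a bilinear operator is that the output delta equals `dA join B + A join dB + dA join dB`, where `A` and `B` are the previously accumulated inputs. The Python below checks that rule on a tiny example. The helper names (`zadd`, `zjoin`, `join_delta`) are hypothetical, and the nested-loop join stands in for the indexed state a real backend would use.

```python
def zadd(a, b):
    """Add two Z-sets (row -> integer weight), dropping zero weights."""
    out = dict(a)
    for row, w in b.items():
        out[row] = out.get(row, 0) + w
    return {row: w for row, w in out.items() if w != 0}


def zjoin(a, b):
    """Join Z-sets of (key, value) rows on key; weights multiply."""
    out = {}
    for (ka, va), wa in a.items():
        for (kb, vb), wb in b.items():
            if ka == kb:
                row = (ka, va, vb)
                out[row] = out.get(row, 0) + wa * wb
    return {row: w for row, w in out.items() if w != 0}


def join_delta(a, b, da, db):
    """Output delta of a join, given accumulated inputs a, b and deltas da, db."""
    return zadd(zadd(zjoin(da, b), zjoin(a, db)), zjoin(da, db))


a = {("k", 1): 1}
b = {("k", "x"): 1}
da = {("k", 1): -1, ("k", 2): +1}   # replace value 1 with value 2
db = {("k", "y"): +1}               # add a second right-hand row

maintained = zadd(zjoin(a, b), join_delta(a, b, da, db))
recomputed = zjoin(zadd(a, da), zadd(b, db))
assert maintained == recomputed
```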
Fourth, test against the current direct implementations:

```text
snapshot result == DBSP-maintained result
```
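A hedged sketch of what such an equivalence test could look like, using the `missing_src` law as the query. `MaintainedMissing` is a hand-written stand-in for a maintained view; a real test would drive the actual DBSP circuit instead. All names here, including `snapshot_missing` and `run_differential_test`, are hypothetical.

```python
import random


def snapshot_missing(edges, vertices):
    """Batch evaluation of missing_src over the full input relations."""
    return {(g, s) for (g, s, _dst) in edges} - vertices


class MaintainedMissing:
    """Stand-in for an incrementally maintained missing_src view."""

    def __init__(self):
        self.support = {}      # (graph, src) -> number of supporting edges
        self.vertices = set()
        self.missing = set()

    def step(self, kind, row, weight):
        if kind == "edge":
            key = row[:2]
            self.support[key] = self.support.get(key, 0) + weight
        else:
            key = row
            if weight > 0:
                self.vertices.add(key)
            else:
                self.vertices.discard(key)
        # Only the touched key can change its violation status.
        if self.support.get(key, 0) > 0 and key not in self.vertices:
            self.missing.add(key)
        else:
            self.missing.discard(key)


def run_differential_test(steps=500, seed=0):
    rng = random.Random(seed)
    edges, vertices, view = set(), set(), MaintainedMissing()
    for _ in range(steps):
        if rng.random() < 0.5:
            row, target, kind = ("g", rng.randrange(5), rng.randrange(5)), edges, "edge"
        else:
            row, target, kind = ("g", rng.randrange(5)), vertices, "vertex"
        weight = -1 if row in target else +1   # toggling preserves set semantics
        if weight > 0:
            target.add(row)
        else:
            target.remove(row)
        view.step(kind, row, weight)
        assert view.missing == snapshot_missing(edges, vertices)
    return True
```

Random toggling exercises both insertions and retractions, which is where maintained state most easily drifts from the snapshot.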
For Geomerge, the first target should stay the same as the DBSP integration note: supported laws compiled into one maintained `violations` relation.

For CRDTs, the first target should be causal readiness and list traversal, since those are where the existing DBSP notes identify performance risk.

---
## Open Questions

- Can FlowLog-style SIP be adapted to causal-readiness frontiers?
- Can DBSP expose enough physical planning control for key choice and subplan sharing?
- Should the Datalog frontend target a FlowLog-like IR before DBSP lowering?
- Can Geomerge laws use the same catalog structure as Datalog rules?
- Which recursive CRDT queries benefit from structural planning?
- Is hydration better handled by DBSP, a batch engine, or persisted operator state?
- Can one optimizer target both DBSP and Differential Dataflow backends?

---
## Bottom Line

FlowLog should be treated as an optimizer and compiler blueprint for the DBSP work.

The DBSP notes already have the right execution target:

```text
maintained relational deltas
```

FlowLog adds the missing planning discipline:

```text
rule catalog
+ join graph
+ recursive strata
+ robust plan choice
+ SIP-style prefiltering
```

Together, the systems suggest a stronger architecture:

```text
Datalog or Geolog rules
  -> FlowLog-like planning layer
  -> DBSP incremental backend
  -> maintained CRDT views or violation relations
```

---
## Changelog

* **May 20, 2026** -- First version created from the DBSP and FlowLog notes.