# FlowLog and DBSP Synergy
A note on how FlowLog's Datalog planning ideas could support the DBSP, CRDT, and Geomerge work.
---
## Short Answer
FlowLog and the DBSP notes meet at the boundary between Datalog rules and incremental execution.
DBSP answers:
```text
How can a relational query be maintained over changing inputs?
```
FlowLog helps answer:
```text
What relational plan should the incremental backend maintain?
```
The synergy is not that FlowLog should replace DBSP. The useful split is:
```text
FlowLog-like frontend and optimizer
-> backend-neutral relational IR
-> DBSP-maintained circuit
```
FlowLog is a useful blueprint for the compiler and optimizer that should sit in front of DBSP.
---
## Existing DBSP Direction
The DBSP notes are organized around three related goals.
First, CRDTs can be described as deterministic Datalog queries over immutable operation facts:
```text
operation facts
-> Datalog rules
-> visible CRDT state
```
Second, DBSP can maintain those query results incrementally:
```text
input relation deltas
-> DBSP circuit step
-> output relation deltas
```
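To make the delta step concrete, here is a minimal Python sketch of one incrementally maintained join over Z-sets (maps from tuples to integer weights). It illustrates the idea only and is not the DBSP library API: the names `zadd`, `zjoin`, and `IncrementalJoin` are hypothetical, and a real backend would index its state instead of scanning it.

```python
from collections import defaultdict


def zadd(a, b):
    """Add two Z-sets (dict: tuple -> integer weight), dropping zero weights."""
    out = defaultdict(int)
    for zset in (a, b):
        for tup, weight in zset.items():
            out[tup] += weight
    return {t: w for t, w in out.items() if w != 0}


def zjoin(left, right):
    """Join Z-sets of (key, value) pairs on key; weights multiply."""
    out = defaultdict(int)
    for (k1, v1), w1 in left.items():
        for (k2, v2), w2 in right.items():
            if k1 == k2:
                out[(k1, v1, v2)] += w1 * w2
    return {t: w for t, w in out.items() if w != 0}


class IncrementalJoin:
    """Hypothetical operator: keeps both inputs integrated as state."""

    def __init__(self):
        self.left = {}
        self.right = {}

    def step(self, d_left, d_right):
        # delta(L join R) = dL join R + L join dR + dL join dR
        d_out = zadd(
            zadd(zjoin(d_left, self.right), zjoin(self.left, d_right)),
            zjoin(d_left, d_right),
        )
        # Integrate the input deltas into the maintained state.
        self.left = zadd(self.left, d_left)
        self.right = zadd(self.right, d_right)
        return d_out
```

Each call to `step` consumes input deltas and returns only the output delta. A retraction is an input with weight `-1`, and it produces negatively weighted output tuples.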
Third, Geomerge laws can be compiled into maintained violation relations:
```text
compiled relational laws
-> violation queries
-> maintained violation deltas
```
The missing layer is query planning. A Datalog rule can be semantically correct but still produce a poor physical plan.
---
## FlowLog's Useful Layer
FlowLog has a pipeline that is useful independently of its Differential Dataflow backend:
```text
Datalog program
-> parser
-> strata
-> rule catalog
-> per-rule relational plan
-> optimizer
-> incremental dataflow backend
```
The reusable parts are:
- dependency analysis
- stratification
- rule catalog construction
- join graph extraction
- structural join planning
- sideways information passing
- physical key and value shape selection
These are exactly the parts a DBSP-backed Datalog or Geolog compiler needs before lowering rules to DBSP.
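As an illustration of the stratification step, the following Python sketch assigns a stratum to each predicate from a list of rules. The rule encoding, a head name plus `(body_predicate, is_negated)` pairs, is an assumption made for this example and is not FlowLog's actual data structure.

```python
def stratify(rules):
    """Assign a stratum number to every predicate.

    rules: list of (head, [(body_pred, is_negated), ...]).
    A head must sit at least as high as each positive body predicate
    and strictly higher than each negated one.
    """
    preds = {head for head, _ in rules}
    preds |= {pred for _, body in rules for pred, _ in body}
    stratum = {pred: 0 for pred in preds}

    changed = True
    while changed:
        changed = False
        for head, body in rules:
            for pred, negated in body:
                need = stratum[pred] + (1 if negated else 0)
                if need > stratum[head]:
                    # A stratifiable program never needs more strata
                    # than it has predicates.
                    if need >= len(preds):
                        raise ValueError("negation through recursion")
                    stratum[head] = need
                    changed = True
    return stratum


# The multi-value register rules from the CRDT notes, in this encoding.
mvr_rules = [
    ("overwritten", [("pred", False)]),
    ("mvrStore", [("set", False), ("overwritten", True)]),
]
mvr_strata = stratify(mvr_rules)
```

For these rules `mvrStore` lands one stratum above `overwritten`, so a backend can fully compute the negated relation before evaluating the antijoin.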
---
## CRDT Synergy
The CRDT notes identify three representative query shapes.
The multi-value register is mostly projection plus antijoin:
```text
overwritten(RepId, Ctr) :-
pred(RepId, Ctr, _, _).
mvrStore(Key, Value) :-
set(RepId, Ctr, Key, Value),
not overwritten(RepId, Ctr).
```
This is a favorable DBSP workload. The backend can maintain the antijoin state and process small updates without rescanning the full operation
history.
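A minimal Python sketch of that maintained antijoin follows. It assumes operation facts are only ever inserted (they are immutable) and that each `set` operation arrives once. The class and method names are hypothetical, and the output uses signed weights in the style of relation deltas.

```python
class MvrView:
    """Maintains mvrStore(Key, Value) under insert-only operation facts."""

    def __init__(self):
        self.sets = {}            # (rep_id, ctr) -> (key, value)
        self.overwritten = set()  # (rep_id, ctr) seen as a predecessor

    def insert_set(self, rep_id, ctr, key, value):
        """Process set(RepId, Ctr, Key, Value); return the output delta."""
        op = (rep_id, ctr)
        self.sets[op] = (key, value)
        if op in self.overwritten:
            return []  # already overwritten, never becomes visible
        return [((key, value), +1)]

    def insert_pred(self, from_rep, from_ctr, to_rep, to_ctr):
        """Process pred(FromRepId, FromCtr, RepId, Ctr); return the delta."""
        op = (from_rep, from_ctr)
        if op in self.overwritten:
            return []  # no change: the antijoin already excluded this op
        self.overwritten.add(op)
        if op in self.sets:
            return [(self.sets[op], -1)]  # retract the now-hidden value
        return []
```

Each update touches one dictionary entry and one set entry, which is the behavior the paragraph above describes: work proportional to the update, not to the operation history.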
The causal-readiness query is harder:
```text
isCausallyReady(RepId, Ctr) :-
isRoot(RepId, Ctr).
isCausallyReady(RepId, Ctr) :-
isCausallyReady(FromRepId, FromCtr),
pred(FromRepId, FromCtr, RepId, Ctr).
```
This is recursive graph traversal. The DBSP CRDT notes report that the cost of maintaining this query can remain dependent on causal-history depth, even during warm updates.
FlowLog's planning ideas are relevant here. Sideways information passing suggests prefiltering recursive traversal through known relevant bindings.
For CRDTs, that could mean using current heads, leaves, or newly arrived operations as a frontier instead of repeatedly deriving readiness from roots
through the whole causal graph.
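The frontier idea can be sketched in Python as below. This is not FlowLog's SIP implementation; it only shows readiness being extended from newly arrived facts instead of being rederived from the roots. It follows the rule exactly as written above (one ready predecessor suffices), handles insertions only, and uses hypothetical names throughout.

```python
from collections import defaultdict


class ReadinessView:
    """Maintains isCausallyReady under insert-only isRoot and pred facts."""

    def __init__(self):
        self.ready = set()
        self.succ = defaultdict(set)  # op -> ops that name it as predecessor

    def _expand(self, seeds):
        """Walk forward from the seeds, visiting only not-yet-ready ops."""
        added = []
        stack = list(seeds)
        while stack:
            op = stack.pop()
            if op in self.ready:
                continue
            self.ready.add(op)
            added.append(op)
            stack.extend(self.succ.get(op, ()))
        return added

    def add_root(self, op):
        """Process isRoot(op); return the newly ready operations."""
        return self._expand([op])

    def add_pred(self, src, dst):
        """Process pred(src, dst); return the newly ready operations."""
        self.succ[src].add(dst)
        # Only a ready source can make anything new ready.
        return self._expand([dst]) if src in self.ready else []
```

With this shape, appending an operation to a long causal chain visits only the new operation, because everything before it is already in `ready`.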
The list CRDT query is also planning-sensitive. Relations such as `firstChild`, `nextSibling`, `nextSiblingAnc`, `nextElem`, and `nextVisible` create
several joins, antijoins, and recursive steps. FlowLog-style rule catalogs and join planning would help choose better intermediate shapes before DBSP
builds maintained operator state.
---
## Geomerge Synergy
The Geomerge integration note proposes maintaining one combined violation relation.
Simple foreign-key laws are straightforward:
```text
required_src(graph, src) :- G.E(graph, src, dst).
missing_src(graph, src) :- required_src(graph, src), not G.V(graph, src).
```
For this subset, DBSP can maintain projections and antijoins directly.
The need for FlowLog grows when laws contain several atoms:
```text
violation(x, y, z) :-
A(x, y),
B(y, z),
C(z),
not D(x, z).
```
At that point, the compiler must decide:
- which atoms join first
- which variables form keys
- where filters and antijoins should be applied
- which intermediate fields must be retained
- whether several laws share subplans
FlowLog's catalog and structural planning model is a good guide for this compiler layer.
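A small Python sketch of these decisions for the rule above follows. The greedy heuristic (join next the atom sharing the most bound variables, then apply any negated atom whose variables are all bound) is an assumption chosen for illustration, not FlowLog's actual planner, and `plan_rule` is a hypothetical name.

```python
def plan_rule(positive, negated):
    """Order the atoms of one rule body.

    positive, negated: lists of (relation_name, variables_tuple).
    Returns steps of the form (operator, relation_name, key_variables).
    """
    remaining = list(positive)
    pending = list(negated)
    first = remaining.pop(0)
    bound = set(first[1])
    steps = [("scan", first[0], ())]

    def flush_antijoins():
        # Apply each negated atom as soon as all its variables are bound.
        for atom in list(pending):
            if set(atom[1]) <= bound:
                steps.append(("antijoin", atom[0], tuple(sorted(atom[1]))))
                pending.remove(atom)

    flush_antijoins()
    while remaining:
        # Prefer the most shared variables, then the fewest new ones.
        best = max(
            remaining,
            key=lambda a: (len(bound & set(a[1])), -len(set(a[1]) - bound)),
        )
        remaining.remove(best)
        steps.append(("join", best[0], tuple(sorted(bound & set(best[1])))))
        bound |= set(best[1])
        flush_antijoins()

    if pending:
        raise ValueError("negated atom uses a variable no positive atom binds")
    return steps


violation_plan = plan_rule(
    positive=[("A", ("x", "y")), ("B", ("y", "z")), ("C", ("z",))],
    negated=[("D", ("x", "z"))],
)
```

For this rule the sketch joins `A` with `B` on `y`, applies `not D(x, z)` as soon as `x` and `z` are both bound, and only then joins `C` on `z`. A real planner would also weigh intermediate width and shared subplans.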
The resulting Geomerge architecture could be:
```text
FlatTheory laws
-> supported Datalog-shaped rules
-> FlowLog-like catalog and optimizer
-> relational violation plan
-> DBSP circuit
-> maintained violations relation
```
This keeps DBSP as a performance backend while giving Geomerge a real planning layer.
---
## IR Boundary
The strongest architectural lesson is to avoid binding the system too tightly to either source syntax or backend syntax.
The durable boundary should be a relational intermediate representation:
```text
source rules
-> rule catalog
-> relational IR
-> backend-specific lowering
```
For CRDTs, the source may be a Datalog dialect.
For Geomerge, the source may be compiled Geolog laws.
For execution, the backend may be DBSP, Differential Dataflow, or a non-incremental batch engine.
This matches the existing DBSP notes: DBSP should not own full source-language semantics. It should receive a relational plan that has already been
checked, stratified, and optimized.
---
## Optimization Transfer
FlowLog suggests several optimizations that transfer well to DBSP-backed work.
**Structural Planning**: Choose join trees from variable overlap and intermediate width, especially for recursive rules where cardinality estimates
are weak.
**Sideways Information Passing**: Add semijoin-style filters so later joins and recursive steps see fewer irrelevant tuples.
**Antijoin Scheduling**: Apply negated atoms as soon as their variables are available. This matches the DBSP CRDT note's antijoin-pushdown agenda.
**Subplan Sharing**: Reuse common derived relations across laws or CRDT views when multiple outputs need the same intermediate facts.
**Physical Key Choice**: Pick key fields deliberately before lowering to the backend. DBSP joins also need maintained state, so bad key and
arrangement choices will become runtime costs.
---
## Backend Comparison
FlowLog uses Differential Dataflow. The DBSP notes use DBSP.
The models differ:
- Differential Dataflow uses collections with logical time and differences.
- DBSP uses streams, Z-sets, integration, differentiation, and circuits.
The shared lesson is more important than the difference:
```text
incremental backends maintain operator state
```
That means bad plans become persistent state, not just one bad query execution. A poor join order or unnecessary intermediate relation can increase
memory use and update cost for the lifetime of the maintained query.
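The effect of join order on retained state can be shown with a toy Python example. The data is synthetic and chosen to exaggerate the difference; the `join` helper is a plain hash join written for this sketch, not a backend operator.

```python
def join(left, right, l_idx, r_idx):
    """Hash join of two sets of tuples on left[l_idx] == right[r_idx]."""
    index = {}
    for r in right:
        index.setdefault(r[r_idx], []).append(r)
    return {l + r for l in left for r in index.get(l[l_idx], [])}


# Synthetic relations for A(x, y), B(y, z), C(z).
A = {(i, 0) for i in range(100)}
B = {(0, j) for j in range(100)}
C = {(0,)}

# Plan 1: (A join B) join C. The intermediate A-B result is wide.
ab = join(A, B, 1, 0)
plan1 = join(ab, C, 3, 0)

# Plan 2: A join (B join C). C filters B before A is touched.
bc = join(B, C, 1, 0)
plan2 = join(A, bc, 1, 0)

intermediate_sizes = {"A-B first": len(ab), "B-C first": len(bc)}
```

Both plans produce the same 100 result tuples, but the first materializes 10,000 intermediate tuples and the second materializes 1. In an incremental backend that intermediate is maintained state, so the difference persists for as long as the query does.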
FlowLog is useful because it treats planning as a first-class layer before execution.
---
## Proposed Synergy Path
The practical path is incremental.
First, use FlowLog as a reading reference for a DBSP frontend:
```text
parser
-> dependency graph
-> strata
-> rule catalog
-> relational plan
```
Second, add a small optimizer:
```text
join ordering
antijoin pushdown
simple SIP for bound variables
```
Third, lower the optimized plan to DBSP operators:
```text
projection
selection
join
antijoin
union
distinct
fixed point
```
Fourth, test against the current direct implementations:
```text
snapshot result == DBSP-maintained result
```
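One way to run this check is a randomized differential test, sketched here in Python for the `missing_src` law from the Geomerge section. The hand-written `MaintainedMissingSrc` class stands in for a DBSP circuit and all names are hypothetical; the point is the test shape, where a snapshot evaluator is compared with the maintained result after every update.

```python
import random
from collections import Counter


def snapshot(edges, vertices):
    """Batch evaluation of missing_src over full relations."""
    return {(g, s) for (g, s, _d) in edges if (g, s) not in vertices}


class MaintainedMissingSrc:
    """Stand-in for the incremental backend; accepts +1 / -1 updates."""

    def __init__(self):
        self.support = Counter()  # (graph, src) -> number of edges needing it
        self.vertices = set()
        self.missing = set()

    def _refresh(self, key):
        if self.support[key] > 0 and key not in self.vertices:
            self.missing.add(key)
        else:
            self.missing.discard(key)

    def edge(self, g, s, d, weight):
        self.support[(g, s)] += weight
        self._refresh((g, s))

    def vertex(self, g, v, weight):
        if weight > 0:
            self.vertices.add((g, v))
        else:
            self.vertices.discard((g, v))
        self._refresh((g, v))


def check(steps=500, seed=0):
    """Apply random inserts and deletes; compare after every step."""
    rng = random.Random(seed)
    edges, vertices = set(), set()
    view = MaintainedMissingSrc()
    for _ in range(steps):
        g, s, d = "g", rng.randrange(4), rng.randrange(4)
        if rng.random() < 0.5:
            present = (g, s, d) in edges
            (edges.discard if present else edges.add)((g, s, d))
            view.edge(g, s, d, -1 if present else +1)
        else:
            present = (g, s) in vertices
            (vertices.discard if present else vertices.add)((g, s))
            view.vertex(g, s, -1 if present else +1)
        assert view.missing == snapshot(edges, vertices)
    return True
```

Calling `check()` raises an `AssertionError` at the first update where the two results diverge. The same harness shape applies when the maintained side is a real DBSP circuit and the snapshot side is the current direct implementation.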
For Geomerge, the first target should stay the same as the DBSP integration note: supported laws compiled into one maintained `violations` relation.
For CRDTs, the first target should be causal readiness and list traversal, since those are where the existing DBSP notes identify performance risk.
---
## Open Questions
- Can FlowLog-style SIP be adapted to causal-readiness frontiers?
- Can DBSP expose enough physical planning control for key choice and subplan sharing?
- Should the Datalog frontend target a FlowLog-like IR before DBSP lowering?
- Can Geomerge laws use the same catalog structure as Datalog rules?
- Which recursive CRDT queries benefit from structural planning?
- Is hydration better handled by DBSP, a batch engine, or persisted operator state?
- Can one optimizer target both DBSP and Differential Dataflow backends?
---
## Bottom Line
FlowLog should be treated as an optimizer and compiler blueprint for the DBSP work.
The DBSP notes already have the right execution target:
```text
maintained relational deltas
```
FlowLog adds the missing planning discipline:
```text
rule catalog
+ join graph
+ recursive strata
+ robust plan choice
+ SIP-style prefiltering
```
Together, the systems suggest a stronger architecture:
```text
Datalog or Geolog rules
-> FlowLog-like planning layer
-> DBSP incremental backend
-> maintained CRDT views or violation relations
```
---
## Changelog
* **May 20, 2026** -- First version created from the DBSP and FlowLog notes.