Add a note file for how we could use FlowLog and DBSP together
parent 3d67b4994e
commit 2d3c02315a

flowlog/003-flowlog-and-dbsp-synergy.md (new file, 330 lines added)

# FlowLog and DBSP Synergy

A note on how FlowLog's Datalog planning ideas could support the DBSP, CRDT, and Geomerge work.

---

## Short Answer

FlowLog and the DBSP notes meet at the boundary between Datalog rules and incremental execution.

DBSP answers:

```text
How can a relational query be maintained over changing inputs?
```

FlowLog helps answer:

```text
What relational plan should the incremental backend maintain?
```

The synergy is not that FlowLog should replace DBSP. The useful split is:

```text
FlowLog-like frontend and optimizer
  -> backend-neutral relational IR
  -> DBSP-maintained circuit
```

FlowLog is a useful blueprint for the compiler and optimizer that should sit in front of DBSP.

---

## Existing DBSP Direction

The DBSP notes are organized around three related goals.

First, CRDTs can be described as deterministic Datalog queries over immutable operation facts:

```text
operation facts
  -> Datalog rules
  -> visible CRDT state
```

Second, DBSP can maintain those query results incrementally:

```text
input relation deltas
  -> DBSP circuit step
  -> output relation deltas
```

Third, Geomerge laws can be compiled into maintained violation relations:

```text
compiled relational laws
  -> violation queries
  -> maintained violation deltas
```

The missing layer is query planning. A Datalog rule can be semantically correct but still produce a poor physical plan.

---

## FlowLog's Useful Layer

FlowLog has a pipeline that is useful independently of its Differential Dataflow backend:

```text
Datalog program
  -> parser
  -> strata
  -> rule catalog
  -> per-rule relational plan
  -> optimizer
  -> incremental dataflow backend
```
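
The dependency-analysis and strata stages of such a pipeline can be sketched in a few lines. This is a minimal illustration, not FlowLog's implementation: the `Rule` tuple and the `stratify` function are hypothetical names, and the example rules mirror the multi-value register and causal-readiness rules from the CRDT section of this note.

```python
from collections import namedtuple

# Hypothetical rule shape for illustration: head relation, positive body
# relations, and negated body relations.
Rule = namedtuple("Rule", ["head", "pos", "neg"])


def stratify(rules):
    """Assign each relation a stratum number.

    A relation sits at or above every relation it uses positively and
    strictly above every relation it uses under negation. Raises ValueError
    when negation occurs inside a recursive cycle.
    """
    relations = set()
    for r in rules:
        relations.add(r.head)
        relations.update(r.pos)
        relations.update(r.neg)
    stratum = {rel: 0 for rel in relations}
    changed = True
    while changed:
        changed = False
        for r in rules:
            need = max(
                [stratum[p] for p in r.pos] + [stratum[n] + 1 for n in r.neg],
                default=0,
            )
            if need > stratum[r.head]:
                # A stratifiable program never needs more strata than relations.
                if need >= len(relations):
                    raise ValueError("negation through recursion: " + r.head)
                stratum[r.head] = need
                changed = True
    return stratum


rules = [
    Rule("overwritten", ["pred"], []),
    Rule("mvrStore", ["set"], ["overwritten"]),
    Rule("isCausallyReady", ["isRoot"], []),
    Rule("isCausallyReady", ["isCausallyReady", "pred"], []),
]
strata = stratify(rules)
```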

The reusable parts are:

- dependency analysis
- stratification
- rule catalog construction
- join graph extraction
- structural join planning
- sideways information passing
- physical key and value shape selection

These are exactly the parts a DBSP-backed Datalog or Geolog compiler needs before lowering rules to DBSP.

---

## CRDT Synergy

The CRDT notes identify three representative query shapes.

The multi-value register is mostly projection plus antijoin:

```text
overwritten(RepId, Ctr) :-
    pred(RepId, Ctr, _, _).

mvrStore(Key, Value) :-
    set(RepId, Ctr, Key, Value),
    not overwritten(RepId, Ctr).
```

This is a favorable DBSP workload. The backend can maintain the antijoin state and process small updates without rescanning the full operation history.
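
To illustrate what "maintaining the antijoin state" could look like, here is a hand-written sketch of the two register rules as an incremental view. It is not DBSP code and does not use any DBSP API. The `MvrView` class and its fields are hypothetical, and it assumes facts are only ever inserted, each at most once, which matches the immutable-operation-facts model above.

```python
from collections import defaultdict


class MvrView:
    """Maintains mvrStore under inserted `set` and `pred` facts.

    State kept between steps (the operator state of the antijoin):
      sets        (rep, ctr) -> list of (key, value) written by that op
      overwritten (rep, ctr) -> number of pred facts naming it as predecessor
      support     (key, value) -> number of live set ops deriving that row
    """

    def __init__(self):
        self.sets = defaultdict(list)
        self.overwritten = defaultdict(int)
        self.support = defaultdict(int)

    def _adjust(self, row, weight, delta):
        before = self.support[row]
        self.support[row] = before + weight
        # Emit a change only when the row appears or disappears (distinct).
        if before == 0 and self.support[row] > 0:
            delta[row] += 1
        elif before > 0 and self.support[row] == 0:
            delta[row] -= 1

    def step(self, new_sets=(), new_preds=()):
        """Apply one batch of inserted facts; return the mvrStore delta."""
        delta = defaultdict(int)
        for rep, ctr, key, value in new_sets:
            self.sets[(rep, ctr)].append((key, value))
            if self.overwritten[(rep, ctr)] == 0:
                self._adjust((key, value), +1, delta)
        for rep, ctr, _to_rep, _to_ctr in new_preds:
            self.overwritten[(rep, ctr)] += 1
            if self.overwritten[(rep, ctr)] == 1:
                # First overwrite: retract rows derived from this op.
                for row in self.sets[(rep, ctr)]:
                    self._adjust(row, -1, delta)
        return {row: w for row, w in delta.items() if w != 0}


view = MvrView()
d1 = view.step(new_sets=[("a", 1, "x", "red")])
d2 = view.step(new_sets=[("b", 1, "x", "blue")], new_preds=[("a", 1, "b", 1)])
```

Each step touches only the operations named in the batch, so the cost follows the size of the update, not the length of the history.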

The causal-readiness query is harder:

```text
isCausallyReady(RepId, Ctr) :-
    isRoot(RepId, Ctr).

isCausallyReady(RepId, Ctr) :-
    isCausallyReady(FromRepId, FromCtr),
    pred(FromRepId, FromCtr, RepId, Ctr).
```

This is recursive graph traversal. The DBSP CRDT notes report that this query can remain dependent on causal-history depth, even during warm updates.

FlowLog's planning ideas are relevant here. Sideways information passing suggests prefiltering recursive traversal through known relevant bindings. For CRDTs, that could mean using current heads, leaves, or newly arrived operations as a frontier instead of repeatedly deriving readiness from roots through the whole causal graph.
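
The frontier idea can be sketched as a hand-written, insert-only view over the two readiness rules. This is an illustration of the intended behaviour, not FlowLog's SIP rewriting and not a DBSP circuit. `ReadinessView` and its fields are hypothetical, and retractions are deliberately out of scope.

```python
from collections import defaultdict


class ReadinessView:
    """Incrementally maintains isCausallyReady under inserted facts only."""

    def __init__(self):
        self.ready = set()
        self.succ = defaultdict(set)  # op -> ops that name it as predecessor

    def step(self, new_roots=(), new_preds=()):
        """Apply one batch; return the set of newly ready operations."""
        frontier = []
        for op in new_roots:
            if op not in self.ready:
                self.ready.add(op)
                frontier.append(op)
        for from_rep, from_ctr, rep, ctr in new_preds:
            src, dst = (from_rep, from_ctr), (rep, ctr)
            self.succ[src].add(dst)
            # Only edges leaving an already-ready op can derive something new.
            if src in self.ready and dst not in self.ready:
                self.ready.add(dst)
                frontier.append(dst)
        newly_ready = list(frontier)
        # Propagate from the frontier only, never from the roots again.
        while frontier:
            op = frontier.pop()
            for nxt in self.succ.get(op, ()):
                if nxt not in self.ready:
                    self.ready.add(nxt)
                    frontier.append(nxt)
                    newly_ready.append(nxt)
        return set(newly_ready)


view = ReadinessView()
first = view.step(new_roots=[("a", 1)], new_preds=[("a", 1, "a", 2)])
late = view.step(new_preds=[("a", 3, "a", 4)])  # predecessor not ready yet
fill = view.step(new_preds=[("a", 2, "a", 3)])  # closes the gap
```

In this sketch the work per step is bounded by the newly ready operations and their outgoing edges. Whether a DBSP lowering can reach the same bound is the open question this note raises.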

The list CRDT query is also planning-sensitive. Relations such as `firstChild`, `nextSibling`, `nextSiblingAnc`, `nextElem`, and `nextVisible` create several joins, antijoins, and recursive steps. FlowLog-style rule catalogs and join planning would help choose better intermediate shapes before DBSP builds maintained operator state.

---

## Geomerge Synergy

The Geomerge integration note proposes maintaining one combined violation relation.

Simple foreign-key laws are straightforward:

```text
required_src(graph, src) :- G.E(graph, src, dst).
missing_src(graph, src) :- required_src(graph, src), not G.V(graph, src).
```

For this subset, DBSP can maintain projections and antijoins directly.

The need for FlowLog grows when laws contain several atoms:

```text
violation(x, y, z) :-
    A(x, y),
    B(y, z),
    C(z),
    not D(x, z).
```

At that point, the compiler must decide:

- which atoms join first
- which variables form keys
- where filters and antijoins should be applied
- which intermediate fields must be retained
- whether several laws share subplans

FlowLog's catalog and structural planning model is a good guide for this compiler layer.
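
As a rough illustration of the first three decisions, the sketch below orders the atoms of the `violation` rule greedily by variable overlap and schedules each antijoin as soon as its variables are bound. This is a toy heuristic written for this note, not FlowLog's planner. `plan_rule` and the step names `scan`, `join`, and `antijoin` are assumptions.

```python
def plan_rule(positive, negative):
    """Order atoms greedily by variable overlap.

    `positive` and `negative` are lists of (name, variables) pairs.
    Returns a list of (step, name, key_variables) triples, where step is
    "scan", "join", or "antijoin".
    """
    remaining = list(positive)
    pending_neg = list(negative)
    # Start from the narrowest atom; ties keep source order.
    first = min(remaining, key=lambda atom: len(atom[1]))
    remaining.remove(first)
    bound = set(first[1])
    plan = [("scan", first[0], ())]
    while remaining or pending_neg:
        # Schedule any antijoin whose variables are all bound.
        ready = [a for a in pending_neg if set(a[1]) <= bound]
        if ready:
            for name, vars_ in ready:
                plan.append(("antijoin", name, tuple(vars_)))
                pending_neg.remove((name, vars_))
            continue
        if not remaining:
            raise ValueError("unsafe rule: negated atom has unbound variables")
        # Prefer the atom sharing the most bound variables, then the one
        # adding the fewest new variables (a proxy for intermediate width).
        best = max(
            remaining,
            key=lambda atom: (len(bound & set(atom[1])), -len(set(atom[1]) - bound)),
        )
        remaining.remove(best)
        key = tuple(v for v in best[1] if v in bound)
        plan.append(("join", best[0], key))
        bound |= set(best[1])
    return plan


positive = [("A", ("x", "y")), ("B", ("y", "z")), ("C", ("z",))]
negative = [("D", ("x", "z"))]
plan = plan_rule(positive, negative)
```

For this rule the heuristic starts from `C(z)`, joins `B` on `z`, joins `A` on `y`, and applies `not D(x, z)` last, because `x` is only bound after `A` is joined.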

The resulting Geomerge architecture could be:

```text
FlatTheory laws
  -> supported Datalog-shaped rules
  -> FlowLog-like catalog and optimizer
  -> relational violation plan
  -> DBSP circuit
  -> maintained violations relation
```

This keeps DBSP as a performance backend while giving Geomerge a real planning layer.

---

## IR Boundary

The strongest architectural lesson is to avoid binding the system too tightly to either source syntax or backend syntax.

The durable boundary should be a relational intermediate representation:

```text
source rules
  -> rule catalog
  -> relational IR
  -> backend-specific lowering
```
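
One way to picture such a boundary is a small set of plan nodes that any backend can lower. The sketch below is an assumption made for illustration: the node types `Scan`, `Project`, and `Antijoin` and the `eval_batch` function are hypothetical, and the only lowering shown is a snapshot evaluator, standing in for a non-incremental batch engine. It encodes the `missing_src` law from the Geomerge section.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Scan:
    relation: str
    columns: Tuple[str, ...]


@dataclass(frozen=True)
class Project:
    source: "Plan"
    columns: Tuple[str, ...]


@dataclass(frozen=True)
class Antijoin:
    left: "Plan"
    right: "Plan"
    on: Tuple[str, ...]


Plan = Union[Scan, Project, Antijoin]


def columns_of(plan):
    """Output column names of a plan node."""
    if isinstance(plan, (Scan, Project)):
        return plan.columns
    return columns_of(plan.left)


def eval_batch(plan, db):
    """One possible lowering: evaluate the plan over a snapshot `db`."""
    if isinstance(plan, Scan):
        return set(db[plan.relation])
    if isinstance(plan, Project):
        src_cols = columns_of(plan.source)
        idx = [src_cols.index(c) for c in plan.columns]
        return {tuple(row[i] for i in idx) for row in eval_batch(plan.source, db)}
    if isinstance(plan, Antijoin):
        l_cols, r_cols = columns_of(plan.left), columns_of(plan.right)
        l_idx = [l_cols.index(c) for c in plan.on]
        r_idx = [r_cols.index(c) for c in plan.on]
        seen = {tuple(r[i] for i in r_idx) for r in eval_batch(plan.right, db)}
        return {
            row for row in eval_batch(plan.left, db)
            if tuple(row[i] for i in l_idx) not in seen
        }
    raise TypeError(plan)


# missing_src(graph, src) :- G.E(graph, src, dst), not G.V(graph, src).
missing_src = Antijoin(
    left=Project(Scan("G.E", ("graph", "src", "dst")), ("graph", "src")),
    right=Scan("G.V", ("graph", "src")),
    on=("graph", "src"),
)
db = {"G.E": {("g", 1, 2), ("g", 3, 2)}, "G.V": {("g", 1), ("g", 2)}}
violations = eval_batch(missing_src, db)
```

A DBSP lowering would walk the same `missing_src` value and emit circuit operators instead of sets, which is the point of keeping the plan backend-neutral.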

For CRDTs, the source may be a Datalog dialect.

For Geomerge, the source may be compiled Geolog laws.

For execution, the backend may be DBSP, Differential Dataflow, or a non-incremental batch engine.

This matches the existing DBSP notes: DBSP should not own full source-language semantics. It should receive a relational plan that has already been checked, stratified, and optimized.

---

## Optimization Transfer

FlowLog suggests several optimizations that transfer well to DBSP-backed work.

**Structural Planning**: Choose join trees from variable overlap and intermediate width, especially for recursive rules where cardinality estimates are weak.

**Sideways Information Passing**: Add semijoin-style filters so later joins and recursive steps see fewer irrelevant tuples.

**Antijoin Scheduling**: Apply negated atoms as soon as their variables are available. This matches the DBSP CRDT note's antijoin-pushdown agenda.

**Subplan Sharing**: Reuse common derived relations across laws or CRDT views when multiple outputs need the same intermediate facts.

**Physical Key Choice**: Pick key fields deliberately before lowering to the backend. DBSP joins also need maintained state, so bad key and arrangement choices will become runtime costs.

---

## Backend Comparison

FlowLog uses Differential Dataflow. The DBSP notes use DBSP.

The models differ:

- Differential Dataflow uses collections with logical time and differences.
- DBSP uses streams, Z-sets, integration, differentiation, and circuits.

The shared lesson is more important than the difference:

```text
incremental backends maintain operator state
```

That means bad plans become persistent state, not just one bad query execution. A poor join order or unnecessary intermediate relation can increase memory use and update cost for the lifetime of the maintained query.
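
A small, made-up dataset makes the state argument concrete. The sketch below evaluates the same three-way join in two orders and compares the size of the intermediate relation that an incremental backend would have to keep. The `join` helper and the relations `A`, `B`, and `C` are invented for the example and say nothing about either backend's real cost model.

```python
def join(left, right, l_key, r_key):
    """Hash join of two lists of tuples on single-column keys."""
    index = {}
    for r in right:
        index.setdefault(r[r_key], []).append(r)
    return [l + r for l in left for r in index.get(l[l_key], [])]


# A(x, y): many x values fan in to one hub y. B(y, z): the hub fans out.
# C(z): only one z survives the final filter.
A = [(x, 0) for x in range(100)]
B = [(0, z) for z in range(100)]
C = [(7,)]

# Order 1: (A join B) join C keeps the full A-B intermediate as state.
ab = join(A, B, 1, 0)        # columns: x, y, y, z
order1 = join(ab, C, 3, 0)   # columns: x, y, y, z, z

# Order 2: (B join C) join A filters B first.
bc = join(B, C, 1, 0)        # columns: y, z, z
order2 = join(A, bc, 1, 0)   # columns: x, y, y, z, z

same_answer = {(r[0], r[3]) for r in order1} == {(r[0], r[3]) for r in order2}
state_sizes = (len(ab), len(bc))
```

Both orders return the same answer, but the first carries an intermediate of 10,000 tuples and the second an intermediate of one.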

FlowLog is useful because it treats planning as a first-class layer before execution.

---

## Proposed Synergy Path

The practical path is incremental.

First, use FlowLog as a reading reference for a DBSP frontend:

```text
parser
  -> dependency graph
  -> strata
  -> rule catalog
  -> relational plan
```

Second, add a small optimizer:

```text
join ordering
antijoin pushdown
simple SIP for bound variables
```

Third, lower the optimized plan to DBSP operators:

```text
projection
selection
join
antijoin
union
distinct
fixed point
```

Fourth, test against the current direct implementations:

```text
snapshot result == DBSP-maintained result
```
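
The equality check lends itself to a randomized differential test. The harness below is a sketch under stated assumptions: `MaintainedMissingSrc` is a toy stand-in for a DBSP-maintained view of the `missing_src` law, `snapshot_missing_src` plays the role of the direct implementation, and all names are hypothetical.

```python
import random
from collections import defaultdict


def snapshot_missing_src(edges, vertices):
    """Recompute missing_src from scratch."""
    return {(g, s) for (g, s, _d) in edges if (g, s) not in vertices}


class MaintainedMissingSrc:
    """Toy incremental version driven by weighted (Z-set style) deltas."""

    def __init__(self):
        self.required = defaultdict(int)  # (graph, src) -> number of edges
        self.present = defaultdict(int)   # (graph, src) -> vertex weight
        self.output = set()

    def step(self, edge_delta, vertex_delta):
        touched = set()
        for (g, s, _d), w in edge_delta:
            self.required[(g, s)] += w
            touched.add((g, s))
        for (g, s), w in vertex_delta:
            self.present[(g, s)] += w
            touched.add((g, s))
        # Re-decide only the keys named in this batch.
        for key in touched:
            if self.required[key] > 0 and self.present[key] == 0:
                self.output.add(key)
            else:
                self.output.discard(key)


def toggle(current, item):
    """Insert `item` if absent, delete it if present; return the weight."""
    if item in current:
        current.remove(item)
        return -1
    current.add(item)
    return 1


def check(seed, steps=200):
    """Drive both implementations with the same random updates."""
    rng = random.Random(seed)
    edges, vertices = set(), set()
    view = MaintainedMissingSrc()
    for _ in range(steps):
        e = ("g", rng.randrange(5), rng.randrange(5))
        v = ("g", rng.randrange(5))
        # Toggling mixes insertions and deletions in one run.
        view.step([(e, toggle(edges, e))], [(v, toggle(vertices, v))])
        assert view.output == snapshot_missing_src(edges, vertices)
    return True


ok = check(seed=0)
```

Small value ranges are deliberate: they force collisions, so the run exercises repeated insertions and deletions of the same keys.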

For Geomerge, the first target should stay the same as the DBSP integration note: supported laws compiled into one maintained `violations` relation.

For CRDTs, the first target should be causal readiness and list traversal, since those are where the existing DBSP notes identify performance risk.

---

## Open Questions

- Can FlowLog-style SIP be adapted to causal-readiness frontiers?
- Can DBSP expose enough physical planning control for key choice and subplan sharing?
- Should the Datalog frontend target a FlowLog-like IR before DBSP lowering?
- Can Geomerge laws use the same catalog structure as Datalog rules?
- Which recursive CRDT queries benefit from structural planning?
- Is hydration better handled by DBSP, a batch engine, or persisted operator state?
- Can one optimizer target both DBSP and Differential Dataflow backends?

---

## Bottom Line

FlowLog should be treated as an optimizer and compiler blueprint for the DBSP work.

The DBSP notes already have the right execution target:

```text
maintained relational deltas
```

FlowLog adds the missing planning discipline:

```text
rule catalog
+ join graph
+ recursive strata
+ robust plan choice
+ SIP-style prefiltering
```

Together, the systems suggest a stronger architecture:

```text
Datalog or Geolog rules
  -> FlowLog-like planning layer
  -> DBSP incremental backend
  -> maintained CRDT views or violation relations
```

---

## Changelog

* **May 20, 2026** -- First version created from the DBSP and FlowLog notes.