Add a note file on how we could use FlowLog and DBSP together

2026-05-20 10:32:46 +02:00 · 2026-05-20 10:32:46 +02:00 · 2d3c02315a
commit 2d3c02315a
parent 3d67b4994e
1 changed files with 330 additions and 0 deletions
--- a/flowlog/003-flowlog-and-dbsp-synergy.md
+++ b/flowlog/003-flowlog-and-dbsp-synergy.md
@@ -0,0 +1,330 @@
+# FlowLog and DBSP Synergy
+
+A note on how FlowLog's Datalog planning ideas could support the DBSP, CRDT, and Geomerge work.
+
+---
+
+## Short Answer
+
+FlowLog and the DBSP notes meet at the boundary between Datalog rules and incremental execution.
+
+DBSP answers:
+
+```text
+How can a relational query be maintained over changing inputs?
+```
+
+FlowLog helps answer:
+
+```text
+What relational plan should the incremental backend maintain?
+```
+
+The synergy is not that FlowLog should replace DBSP. The useful split is:
+
+```text
+FlowLog-like frontend and optimizer
+-> backend-neutral relational IR
+-> DBSP-maintained circuit
+```
+
+FlowLog is a useful blueprint for the compiler and optimizer that should sit in front of DBSP.
+
+---
+
+## Existing DBSP Direction
+
+The DBSP notes are organized around three related goals.
+
+First, CRDTs can be described as deterministic Datalog queries over immutable operation facts:
+
+```text
+operation facts
+-> Datalog rules
+-> visible CRDT state
+```
+
+Second, DBSP can maintain those query results incrementally:
+
+```text
+input relation deltas
+-> DBSP circuit step
+-> output relation deltas
+```
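As a rough illustration of the delta-in, delta-out step above, the sketch below maintains a binary join over weighted (Z-set-style) deltas in plain Python. It is not the DBSP API: `IncrementalJoin`, `step`, and the relation shapes `A(x, y)` and `B(y, z)` are hypothetical names chosen for this example. It assumes the usual bilinear update rule for joins, where the output delta is `dA ⋈ B_old` plus `(A_old + dA) ⋈ dB`.

```python
from collections import Counter, defaultdict


class IncrementalJoin:
    """Maintains A(x, y) JOIN B(y, z) under weighted deltas (hypothetical sketch)."""

    def __init__(self):
        # Integrated inputs, indexed by join key y: y -> Counter(tuple -> weight).
        self.a_by_y = defaultdict(Counter)
        self.b_by_y = defaultdict(Counter)

    def step(self, delta_a, delta_b):
        """Apply input deltas {tuple: weight} and return the output delta."""
        out = Counter()
        # dA joined with the old contents of B.
        for (x, y), w in delta_a.items():
            for (_, z), v in self.b_by_y[y].items():
                out[(x, y, z)] += w * v
        # Fold dA into the state so the next loop sees A_old + dA.
        for (x, y), w in delta_a.items():
            self.a_by_y[y][(x, y)] += w
        # (A_old + dA) joined with dB.
        for (y, z), v in delta_b.items():
            for (x, _), w in self.a_by_y[y].items():
                out[(x, y, z)] += w * v
        for (y, z), v in delta_b.items():
            self.b_by_y[y][(y, z)] += v
        return {t: w for t, w in out.items() if w != 0}


join = IncrementalJoin()
first = join.step({("a", 1): 1}, {(1, "p"): 1})
# Replacing B(1, "p") with B(1, "q") retracts one output row and adds another.
second = join.step({}, {(1, "p"): -1, (1, "q"): 1})
```

Each step touches only the index entries for the keys present in the deltas, which is the property the maintained circuit is meant to provide. The two indexes are also the persistent state that a poor key choice would inflate.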
+
+Third, Geomerge laws can be compiled into maintained violation relations:
+
+```text
+compiled relational laws
+-> violation queries
+-> maintained violation deltas
+```
+
+The missing layer is query planning. A Datalog rule can be semantically correct but still produce a poor physical plan.
+
+---
+
+## FlowLog's Useful Layer
+
+FlowLog has a pipeline that is useful independently of its Differential Dataflow backend:
+
+```text
+Datalog program
+-> parser
+-> strata
+-> rule catalog
+-> per-rule relational plan
+-> optimizer
+-> incremental dataflow backend
+```
+
+The reusable parts are:
+
+- dependency analysis
+- stratification
+- rule catalog construction
+- join graph extraction
+- structural join planning
+- sideways information passing
+- physical key and value shape selection
+
+These are exactly the parts a DBSP-backed Datalog or Geolog compiler needs before lowering rules to DBSP.
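To make the dependency-analysis and stratification steps concrete, here is a minimal Python sketch of the textbook stratum-assignment loop. It is not FlowLog's implementation. The rule encoding `(head, positive_body, negated_body)` and the function name `stratify` are assumptions made for this example.

```python
def stratify(rules):
    """Assign a stratum to every predicate.

    rules: list of (head, positive_body_predicates, negated_body_predicates).
    Raises ValueError when negation occurs inside a recursive cycle.
    """
    preds = set()
    for head, pos, neg in rules:
        preds.add(head)
        preds.update(pos)
        preds.update(neg)
    stratum = {p: 0 for p in preds}
    changed = True
    while changed:
        changed = False
        for head, pos, neg in rules:
            # A head sits at least as high as its positive dependencies
            # and strictly above its negated dependencies.
            need = max(
                [stratum[head]]
                + [stratum[p] for p in pos]
                + [stratum[p] + 1 for p in neg]
            )
            if need > stratum[head]:
                if need >= len(preds):
                    raise ValueError(f"negation through recursion at {head!r}")
                stratum[head] = need
                changed = True
    return stratum


# The multi-value register program: mvrStore negates overwritten.
mvr_rules = [
    ("overwritten", ["pred"], []),
    ("mvrStore", ["set"], ["overwritten"]),
]
mvr_strata = stratify(mvr_rules)
```

For the register program this places `overwritten` in stratum 0 and `mvrStore` in stratum 1, so the antijoin input is complete before the negation is evaluated.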
+
+---
+
+## CRDT Synergy
+
+The CRDT notes identify three representative query shapes.
+
+The multi-value register is mostly projection plus antijoin:
+
+```text
+overwritten(RepId, Ctr) :-
+    pred(RepId, Ctr, _, _).
+
+mvrStore(Key, Value) :-
+    set(RepId, Ctr, Key, Value),
+    not overwritten(RepId, Ctr).
+```
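The two rules above can be read directly as a projection followed by an antijoin. The following batch-style Python sketch shows that reading. The function name `mvr_store` and the tuple layouts are assumptions that mirror the rule arguments, with the first two fields of `pred` taken to identify the overwritten operation.

```python
def mvr_store(set_ops, pred_edges):
    """Evaluate the multi-value register rules over plain Python sets.

    set_ops:    {(rep_id, ctr, key, value)}
    pred_edges: {(rep_id, ctr, succ_rep_id, succ_ctr)}
    """
    # overwritten(RepId, Ctr) :- pred(RepId, Ctr, _, _).   (projection)
    overwritten = {(rep, ctr) for rep, ctr, _, _ in pred_edges}
    # mvrStore(Key, Value) :- set(...), not overwritten(RepId, Ctr).   (antijoin)
    return {
        (key, value)
        for rep, ctr, key, value in set_ops
        if (rep, ctr) not in overwritten
    }


# Replicas A and B concurrently overwrite A's first write to key "k".
set_ops = {("A", 1, "k", "x"), ("A", 2, "k", "y"), ("B", 1, "k", "z")}
pred_edges = {("A", 1, "A", 2), ("A", 1, "B", 1)}
visible = mvr_store(set_ops, pred_edges)
```

Both concurrent values survive, which is the expected multi-value behavior. A maintained version would keep `overwritten` as indexed state instead of rebuilding it on every call.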
+
+This is a favorable DBSP workload. The backend can maintain the antijoin state and process small updates without rescanning the full operation history.
+
+The causal-readiness query is harder:
+
+```text
+isCausallyReady(RepId, Ctr) :-
+    isRoot(RepId, Ctr).
+
+isCausallyReady(RepId, Ctr) :-
+    isCausallyReady(FromRepId, FromCtr),
+    pred(FromRepId, FromCtr, RepId, Ctr).
+```
+
+This is recursive graph traversal. The DBSP CRDT notes report that the cost of this query can remain dependent on causal-history depth, even during warm updates.
+
+FlowLog's planning ideas are relevant here. Sideways information passing suggests prefiltering recursive traversal through known relevant bindings. For CRDTs, that could mean using current heads, leaves, or newly arrived operations as a frontier instead of repeatedly deriving readiness from roots through the whole causal graph.
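The frontier idea can be sketched as follows. This is an insert-only illustration in plain Python, not FlowLog's SIP rewrite and not a DBSP operator. `ReadinessIndex` and its method names are hypothetical, and operations are assumed to be `(rep_id, ctr)` pairs. It follows the rule as written above, where an operation becomes ready once any predecessor is ready.

```python
from collections import defaultdict


class ReadinessIndex:
    """Extends isCausallyReady from newly arrived facts only (insert-only sketch)."""

    def __init__(self):
        self.ready = set()
        self.successors = defaultdict(set)  # op -> ops that list it as predecessor

    def insert(self, new_roots=(), new_pred_edges=()):
        """Add facts and return the operations that became ready in this step."""
        frontier = []
        for op in new_roots:
            if op not in self.ready:
                self.ready.add(op)
                frontier.append(op)
        for src, dst in new_pred_edges:
            self.successors[src].add(dst)
            if src in self.ready and dst not in self.ready:
                self.ready.add(dst)
                frontier.append(dst)
        newly_ready = set(frontier)
        # Propagate from the frontier; the already-ready history is never revisited.
        while frontier:
            op = frontier.pop()
            for nxt in self.successors.get(op, ()):
                if nxt not in self.ready:
                    self.ready.add(nxt)
                    newly_ready.add(nxt)
                    frontier.append(nxt)
        return newly_ready


index = ReadinessIndex()
a1, a2, b1, b2 = ("A", 1), ("A", 2), ("B", 1), ("B", 2)
step1 = index.insert(new_roots=[a1], new_pred_edges=[(a1, a2)])
step2 = index.insert(new_pred_edges=[(b1, b2)])   # b1 is not ready yet
step3 = index.insert(new_pred_edges=[(a2, b1)])   # unlocks b1, then b2
```

The work per step is proportional to the operations that become ready, not to the depth of the causal history. Handling retractions would need more state than this sketch keeps.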
+
+The list CRDT query is also planning-sensitive. Relations such as `firstChild`, `nextSibling`, `nextSiblingAnc`, `nextElem`, and `nextVisible` create several joins, antijoins, and recursive steps. FlowLog-style rule catalogs and join planning would help choose better intermediate shapes before DBSP builds maintained operator state.
+
+---
+
+## Geomerge Synergy
+
+The Geomerge integration note proposes maintaining one combined violation relation.
+
+Simple foreign-key laws are straightforward:
+
+```text
+required_src(graph, src) :- G.E(graph, src, dst).
+missing_src(graph, src) :- required_src(graph, src), not G.V(graph, src).
+```
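As an illustration of what maintaining this law could look like, the sketch below keeps reference counts for required sources and vertices and emits only the changes to `missing_src`. It is plain Python rather than a DBSP circuit. `MissingSrcView` and the delta encoding `{tuple: weight}` are assumptions made for this example.

```python
from collections import Counter


class MissingSrcView:
    """Maintains missing_src(graph, src) under edge and vertex deltas (illustrative)."""

    def __init__(self):
        self.edge_count = Counter()  # (graph, src) -> number of edges using it
        self.vertices = Counter()    # (graph, vertex) -> multiplicity in G.V

    def _missing(self, key):
        return self.edge_count[key] > 0 and self.vertices[key] <= 0

    def step(self, edge_delta, vertex_delta):
        """edge_delta: {(graph, src, dst): weight}; vertex_delta: {(graph, v): weight}.

        Returns {(graph, src): +1 or -1} describing changes to missing_src.
        """
        touched = {(g, s) for g, s, _ in edge_delta} | set(vertex_delta)
        before = {key: self._missing(key) for key in touched}
        for (g, s, _), w in edge_delta.items():
            self.edge_count[(g, s)] += w
        for key, w in vertex_delta.items():
            self.vertices[key] += w
        out = {}
        for key in touched:
            after = self._missing(key)
            if after != before[key]:
                out[key] = 1 if after else -1
        return out


view = MissingSrcView()
d1 = view.step({("g", "a", "b"): 1}, {})   # edge from an unknown vertex
d2 = view.step({}, {("g", "a"): 1})        # the vertex arrives
d3 = view.step({("g", "a", "c"): 1}, {})   # another edge, no new violation
```

Only the keys named in a delta are re-examined, so the cost of a step does not depend on the size of the graph.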
+
+For this subset, DBSP can maintain projections and antijoins directly.
+
+The need for FlowLog-style planning grows when laws contain several atoms:
+
+```text
+violation(x, y, z) :-
+    A(x, y),
+    B(y, z),
+    C(z),
+    not D(x, z).
+```
+
+At that point, the compiler must decide:
+
+- which atoms join first
+- which variables form keys
+- where filters and antijoins should be applied
+- which intermediate fields must be retained
+- whether several laws share subplans
+
+FlowLog's catalog and structural planning model is a good guide for this compiler layer.
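The first and third decisions can be sketched with a small greedy planner. This is a generic heuristic written for illustration, not FlowLog's planning algorithm: it picks the next positive atom by variable overlap with what is already bound and schedules each negated atom as soon as its variables are available. The name `plan_order` and the `(name, variables)` atom encoding are assumptions.

```python
def plan_order(positive, negated):
    """Return plan steps for a rule body.

    positive, negated: lists of (relation_name, tuple_of_variables).
    """
    remaining, pending = list(positive), list(negated)
    bound, plan = set(), []
    while remaining:
        # Prefer the most overlap with bound variables, then the fewest new ones.
        best = max(
            remaining,
            key=lambda a: (len(bound & set(a[1])), -len(set(a[1]) - bound)),
        )
        remaining.remove(best)
        plan.append(("join" if plan else "scan", best[0]))
        bound |= set(best[1])
        # Apply each antijoin as soon as all of its variables are bound.
        for atom in list(pending):
            if set(atom[1]) <= bound:
                plan.append(("antijoin", atom[0]))
                pending.remove(atom)
    if pending:
        raise ValueError("unsafe rule: negated atom uses unbound variables")
    return plan


# violation(x, y, z) :- A(x, y), B(y, z), C(z), not D(x, z).
plan = plan_order(
    positive=[("A", ("x", "y")), ("B", ("y", "z")), ("C", ("z",))],
    negated=[("D", ("x", "z"))],
)
```

For the example law this starts from the narrow atom `C`, joins `B` and then `A`, and places the antijoin on `D` last because `x` is not bound earlier. A real planner would also weigh intermediate width and shared subplans, which this sketch ignores.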
+
+The resulting Geomerge architecture could be:
+
+```text
+FlatTheory laws
+-> supported Datalog-shaped rules
+-> FlowLog-like catalog and optimizer
+-> relational violation plan
+-> DBSP circuit
+-> maintained violations relation
+```
+
+This keeps DBSP as a performance backend while giving Geomerge a real planning layer.
+
+---
+
+## IR Boundary
+
+The strongest architectural lesson is to avoid binding the system too tightly to either source syntax or backend syntax.
+
+The durable boundary should be a relational intermediate representation:
+
+```text
+source rules
+-> rule catalog
+-> relational IR
+-> backend-specific lowering
+```
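One possible shape for such a relational IR is sketched below, together with a deliberately naive batch evaluator that stands in for a backend-specific lowering. All node names (`Scan`, `Join`, `AntiJoin`, `Project`) and the natural-join-on-shared-column-names convention are assumptions for this example. They are not taken from FlowLog or DBSP.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Scan:
    relation: str
    columns: Tuple[str, ...]


@dataclass(frozen=True)
class Join:
    left: "Plan"
    right: "Plan"


@dataclass(frozen=True)
class AntiJoin:
    left: "Plan"
    right: "Plan"


@dataclass(frozen=True)
class Project:
    source: "Plan"
    columns: Tuple[str, ...]


Plan = Union[Scan, Join, AntiJoin, Project]


def evaluate(plan, db):
    """Reference batch backend: returns (columns, set_of_rows)."""
    if isinstance(plan, Scan):
        return plan.columns, set(db[plan.relation])
    if isinstance(plan, Project):
        cols, rows = evaluate(plan.source, db)
        idx = [cols.index(c) for c in plan.columns]
        return plan.columns, {tuple(r[i] for i in idx) for r in rows}
    lcols, lrows = evaluate(plan.left, db)
    rcols, rrows = evaluate(plan.right, db)
    shared = [c for c in lcols if c in rcols]
    lkey = [lcols.index(c) for c in shared]
    rkey = [rcols.index(c) for c in shared]
    if isinstance(plan, AntiJoin):
        seen = {tuple(r[i] for i in rkey) for r in rrows}
        return lcols, {l for l in lrows if tuple(l[i] for i in lkey) not in seen}
    extra = [i for i, c in enumerate(rcols) if c not in shared]
    out_cols = lcols + tuple(rcols[i] for i in extra)
    out_rows = {
        l + tuple(r[i] for i in extra)
        for l in lrows
        for r in rrows
        if tuple(l[i] for i in lkey) == tuple(r[i] for i in rkey)
    }
    return out_cols, out_rows


# missing_src as an IR tree: project edges to (graph, src), then antijoin with V.
missing_src = AntiJoin(
    Project(Scan("E", ("graph", "src", "dst")), ("graph", "src")),
    Scan("V", ("graph", "src")),
)
db = {"E": {("g", "a", "b"), ("g", "c", "a")}, "V": {("g", "a"), ("g", "b")}}
cols, rows = evaluate(missing_src, db)
```

The same tree could be handed to an incremental lowering instead of `evaluate`, which is the point of keeping the boundary at the IR and not at the source or backend syntax.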
+
+For CRDTs, the source may be a Datalog dialect.
+
+For Geomerge, the source may be compiled Geolog laws.
+
+For execution, the backend may be DBSP, Differential Dataflow, or a non-incremental batch engine.
+
+This matches the existing DBSP notes: DBSP should not own full source-language semantics. It should receive a relational plan that has already been checked, stratified, and optimized.
+
+---
+
+## Optimization Transfer
+
+FlowLog suggests several optimizations that transfer well to DBSP-backed work.
+
+**Structural Planning**: Choose join trees from variable overlap and intermediate width, especially for recursive rules where cardinality estimates are weak.
+
+**Sideways Information Passing**: Add semijoin-style filters so later joins and recursive steps see fewer irrelevant tuples.
+
+**Antijoin Scheduling**: Apply negated atoms as soon as their variables are available. This matches the DBSP CRDT note's antijoin-pushdown agenda.
+
+**Subplan Sharing**: Reuse common derived relations across laws or CRDT views when multiple outputs need the same intermediate facts.
+
+**Physical Key Choice**: Pick key fields deliberately before lowering to the backend. DBSP joins also need maintained state, so bad key and arrangement choices will become runtime costs.
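The sideways-information-passing item above can be illustrated with a semijoin prefilter on a three-atom body. This is a batch sketch in plain Python under assumed relation shapes `A(x, y)`, `B(y, z)`, `C(z)`. The function name `join_with_sip` is hypothetical, and the second return value exists only to show how much of `B` survives the filter.

```python
def join_with_sip(a_rows, b_rows, c_rows):
    """Evaluate A(x, y), B(y, z), C(z) with a semijoin prefilter on B."""
    c_keys = {z for (z,) in c_rows}
    # SIP step: drop B tuples whose z can never match C before indexing B.
    b_filtered = [(y, z) for (y, z) in b_rows if z in c_keys]
    b_by_y = {}
    for y, z in b_filtered:
        b_by_y.setdefault(y, []).append(z)
    result = {(x, y, z) for (x, y) in a_rows for z in b_by_y.get(y, ())}
    return result, len(b_filtered)


a_rows = {(1, 10), (2, 20)}
b_rows = {(10, 100), (10, 101), (20, 200)}
c_rows = {(100,)}
result, kept = join_with_sip(a_rows, b_rows, c_rows)
```

In a maintained setting the benefit is larger than in a batch run, because the filtered index is the state that persists between updates.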
+
+---
+
+## Backend Comparison
+
+FlowLog uses Differential Dataflow. The DBSP notes target DBSP.
+
+The models differ:
+
+- Differential Dataflow uses collections with logical time and differences.
+- DBSP uses streams, Z-sets, integration, differentiation, and circuits.
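For readers less familiar with the DBSP vocabulary, the sketch below models a Z-set as a mapping from tuples to integer weights and shows integration and differentiation as inverse stream operations. It uses plain Python dictionaries and hypothetical helper names. It is not the DBSP library API.

```python
from collections import Counter


def zset_add(a, b):
    """Pointwise sum of two Z-sets, dropping zero weights."""
    out = Counter(a)
    for t, w in b.items():
        out[t] += w
    return {t: w for t, w in out.items() if w != 0}


def zset_neg(a):
    return {t: -w for t, w in a.items()}


def integrate(deltas):
    """Turn a stream of deltas into a stream of snapshots (running sum)."""
    total, out = {}, []
    for d in deltas:
        total = zset_add(total, d)
        out.append(total)
    return out


def differentiate(snapshots):
    """Turn a stream of snapshots back into a stream of deltas."""
    prev, out = {}, []
    for s in snapshots:
        out.append(zset_add(s, zset_neg(prev)))
        prev = s
    return out


deltas = [{"a": 1, "b": 1}, {"a": -1, "c": 1}, {}]
snapshots = integrate(deltas)
round_trip = differentiate(snapshots)
```

Negative weights represent retractions, so the second delta removes `a` and adds `c`. Differentiating the integrated stream returns the original deltas.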
+
+The shared lesson is more important than the difference:
+
+```text
+incremental backends maintain operator state
+```
+
+That means bad plans become persistent state, not just one bad query execution. A poor join order or unnecessary intermediate relation can increase memory use and update cost for the lifetime of the maintained query.
+
+FlowLog is useful because it treats planning as a first-class layer before execution.
+
+---
+
+## Proposed Synergy Path
+
+The practical path is incremental.
+
+First, use FlowLog as a reading reference for a DBSP frontend:
+
+```text
+parser
+-> dependency graph
+-> strata
+-> rule catalog
+-> relational plan
+```
+
+Second, add a small optimizer:
+
+```text
+join ordering
+antijoin pushdown
+simple SIP for bound variables
+```
+
+Third, lower the optimized plan to DBSP operators:
+
+```text
+projection
+selection
+join
+antijoin
+union
+distinct
+fixed point
+```
+
+Fourth, test against the current direct implementations:
+
+```text
+snapshot result == DBSP-maintained result
+```
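A differential test of this kind can be quite small. The sketch below checks a toy maintained view against a batch recomputation after every random update. The view (`MaintainedSources`, distinct edge sources), the batch reference (`snapshot_sources`), and the driver (`run_differential_test`) are all hypothetical stand-ins for a real DBSP-lowered plan and its direct implementation.

```python
import random
from collections import Counter


def snapshot_sources(edges):
    """Batch reference: distinct sources of a set of (src, dst) edges."""
    return {src for src, _ in edges}


class MaintainedSources:
    """Incremental version: reference counts per source, emits membership deltas."""

    def __init__(self):
        self.count = Counter()

    def step(self, delta):
        out = {}
        for (src, _), w in delta.items():
            before = self.count[src] > 0
            self.count[src] += w
            after = self.count[src] > 0
            if before != after:
                out[src] = out.get(src, 0) + (1 if after else -1)
        return {src: w for src, w in out.items() if w != 0}


def run_differential_test(steps=200, seed=0):
    """Toggle random edges and compare the maintained result with the snapshot."""
    rng = random.Random(seed)
    edges, maintained, view = set(), set(), MaintainedSources()
    for _ in range(steps):
        edge = (rng.randrange(5), rng.randrange(5))
        delta = {edge: -1} if edge in edges else {edge: 1}
        edges ^= {edge}  # insert the edge if absent, delete it if present
        for src, w in view.step(delta).items():
            if w > 0:
                maintained.add(src)
            else:
                maintained.discard(src)
        assert maintained == snapshot_sources(edges), (edge, maintained)
    return True


ok = run_differential_test()
```

A small value domain forces frequent insert and delete collisions, which is where maintained state tends to diverge from the snapshot if the delta logic is wrong.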
+
+For Geomerge, the first target should stay the same as the DBSP integration note: supported laws compiled into one maintained `violations` relation.
+
+For CRDTs, the first target should be causal readiness and list traversal, since those are where the existing DBSP notes identify performance risk.
+
+---
+
+## Open Questions
+
+- Can FlowLog-style SIP be adapted to causal-readiness frontiers?
+- Can DBSP expose enough physical planning control for key choice and subplan sharing?
+- Should the Datalog frontend target a FlowLog-like IR before DBSP lowering?
+- Can Geomerge laws use the same catalog structure as Datalog rules?
+- Which recursive CRDT queries benefit from structural planning?
+- Is hydration better handled by DBSP, a batch engine, or persisted operator state?
+- Can one optimizer target both DBSP and Differential Dataflow backends?
+
+---
+
+## Bottom Line
+
+FlowLog should be treated as an optimizer and compiler blueprint for the DBSP work.
+
+The DBSP notes already have the right execution target:
+
+```text
+maintained relational deltas
+```
+
+FlowLog adds the missing planning discipline:
+
+```text
+rule catalog
+ join graph
+ recursive strata
+ robust plan choice
+ SIP-style prefiltering
+```
+
+Together, the systems suggest a stronger architecture:
+
+```text
+Datalog or Geolog rules
+-> FlowLog-like planning layer
+-> DBSP incremental backend
+-> maintained CRDT views or violation relations
+```
+
+---
+
+## Changelog
+
+* **May 20, 2026** -- First version created from the DBSP and FlowLog notes.