diff --git a/flowlog/003-flowlog-and-dbsp-synergy.md b/flowlog/003-flowlog-and-dbsp-synergy.md
new file mode 100644
index 0000000..4d713d1
--- /dev/null
+++ b/flowlog/003-flowlog-and-dbsp-synergy.md
@@ -0,0 +1,330 @@
+# FlowLog and DBSP Synergy
+
+A note on how FlowLog's Datalog planning ideas could support the DBSP, CRDT, and Geomerge work.
+
+---
+
+## Short Answer
+
+FlowLog and the DBSP notes meet at the boundary between Datalog rules and incremental execution.
+
+DBSP answers:
+
+```text
+How can a relational query be maintained over changing inputs?
+```
+
+FlowLog helps answer:
+
+```text
+What relational plan should the incremental backend maintain?
+```
+
+The synergy is not that FlowLog should replace DBSP. The useful split is:
+
+```text
+FlowLog-like frontend and optimizer
+-> backend-neutral relational IR
+-> DBSP-maintained circuit
+```
+
+FlowLog is a useful blueprint for the compiler and optimizer that should sit in front of DBSP.
+
+---
+
+## Existing DBSP Direction
+
+The DBSP notes are organized around three related goals.
+
+First, CRDTs can be described as deterministic Datalog queries over immutable operation facts:
+
+```text
+operation facts
+-> Datalog rules
+-> visible CRDT state
+```
+
+Second, DBSP can maintain those query results incrementally:
+
+```text
+input relation deltas
+-> DBSP circuit step
+-> output relation deltas
+```
+
+Third, Geomerge laws can be compiled into maintained violation relations:
+
+```text
+compiled relational laws
+-> violation queries
+-> maintained violation deltas
+```
+
+The missing layer is query planning. A Datalog rule can be semantically correct but still produce a poor physical plan.
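
The gap between a correct rule and a good plan can be illustrated with a small sketch. It assumes a hypothetical rule `q(x, z) :- A(x, y), B(y, z), C(z)` and toy data, neither of which comes from the DBSP or FlowLog notes. Both functions compute the same answer, but they materialize intermediates of very different sizes.

```python
# Hypothetical rule (an assumption, not from the notes):
#   q(x, z) :- A(x, y), B(y, z), C(z).

def join_ab_then_c(A, B, C):
    """Plan 1: join A and B on y first, then filter by C(z)."""
    ab = [(x, z) for (x, y) in A for (y2, z) in B if y == y2]
    result = {(x, z) for (x, z) in ab if z in C}
    return result, len(ab)

def join_bc_then_a(A, B, C):
    """Plan 2: restrict B by C(z) first, then join with A on y."""
    bc = [(y, z) for (y, z) in B if z in C]
    result = {(x, z) for (x, y) in A for (y2, z) in bc if y == y2}
    return result, len(bc)

# Toy data: every tuple shares y = 0, so A join B degenerates to a cross product.
A = [(x, 0) for x in range(100)]
B = [(0, z) for z in range(100)]
C = {7}

r1, size1 = join_ab_then_c(A, B, C)
r2, size2 = join_bc_then_a(A, B, C)

assert r1 == r2        # same query result under both plans
print(size1, size2)    # intermediate tuple counts for plan 1 and plan 2
```

On this data the first plan builds 10,000 intermediate tuples and the second builds one. In an incremental backend that intermediate would typically be maintained state rather than a one-off cost, which is why the plan choice matters before lowering.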
+
+---
+
+## FlowLog's Useful Layer
+
+FlowLog has a pipeline that is useful independently of its Differential Dataflow backend:
+
+```text
+Datalog program
+-> parser
+-> strata
+-> rule catalog
+-> per-rule relational plan
+-> optimizer
+-> incremental dataflow backend
+```
+
+The reusable parts are:
+
+- dependency analysis
+- stratification
+- rule catalog construction
+- join graph extraction
+- structural join planning
+- sideways information passing
+- physical key and value shape selection
+
+These are exactly the parts a DBSP-backed Datalog or Geolog compiler needs before lowering rules to DBSP.
+
+---
+
+## CRDT Synergy
+
+The CRDT notes identify three representative query shapes.
+
+The multi-value register is mostly projection plus antijoin:
+
+```text
+overwritten(RepId, Ctr) :-
+    pred(RepId, Ctr, _, _).
+
+mvrStore(Key, Value) :-
+    set(RepId, Ctr, Key, Value),
+    not overwritten(RepId, Ctr).
+```
+
+This is a favorable DBSP workload. The backend can maintain the antijoin state and process small updates without rescanning the full operation history.
+
+The causal-readiness query is harder:
+
+```text
+isCausallyReady(RepId, Ctr) :-
+    isRoot(RepId, Ctr).
+
+isCausallyReady(RepId, Ctr) :-
+    isCausallyReady(FromRepId, FromCtr),
+    pred(FromRepId, FromCtr, RepId, Ctr).
+```
+
+This is recursive graph traversal. The DBSP CRDT notes report that this query can remain dependent on causal-history depth, even during warm updates.
+
+FlowLog's planning ideas are relevant here. Sideways information passing suggests prefiltering recursive traversal through known relevant bindings. For CRDTs, that could mean using current heads, leaves, or newly arrived operations as a frontier instead of repeatedly deriving readiness from roots through the whole causal graph.
+
+The list CRDT query is also planning-sensitive. Relations such as `firstChild`, `nextSibling`, `nextSiblingAnc`, `nextElem`, and `nextVisible` create several joins, antijoins, and recursive steps. FlowLog-style rule catalogs and join planning would help choose better intermediate shapes before DBSP builds maintained operator state.
+
+---
+
+## Geomerge Synergy
+
+The Geomerge integration note proposes maintaining one combined violation relation.
+
+Simple foreign-key laws are straightforward:
+
+```text
+required_src(graph, src) :- G.E(graph, src, dst).
+missing_src(graph, src) :- required_src(graph, src), not G.V(graph, src).
+```
+
+For this subset, DBSP can maintain projections and antijoins directly.
+
+The need for FlowLog grows when laws contain several atoms:
+
+```text
+violation(x, y, z) :-
+    A(x, y),
+    B(y, z),
+    C(z),
+    not D(x, z).
+```
+
+At that point, the compiler must decide:
+
+- which atoms join first
+- which variables form keys
+- where filters and antijoins should be applied
+- which intermediate fields must be retained
+- whether several laws share subplans
+
+FlowLog's catalog and structural planning model is a good guide for this compiler layer.
+
+The resulting Geomerge architecture could be:
+
+```text
+FlatTheory laws
+-> supported Datalog-shaped rules
+-> FlowLog-like catalog and optimizer
+-> relational violation plan
+-> DBSP circuit
+-> maintained violations relation
+```
+
+This keeps DBSP as a performance backend while giving Geomerge a real planning layer.
+
+---
+
+## IR Boundary
+
+The strongest architectural lesson is to avoid binding the system too tightly to either source syntax or backend syntax.
+
+The durable boundary should be a relational intermediate representation:
+
+```text
+source rules
+-> rule catalog
+-> relational IR
+-> backend-specific lowering
+```
+
+For CRDTs, the source may be a Datalog dialect.
+
+For Geomerge, the source may be compiled Geolog laws.
+
+For execution, the backend may be DBSP, Differential Dataflow, or a non-incremental batch engine.
+
+This matches the existing DBSP notes: DBSP should not own full source-language semantics. It should receive a relational plan that has already been checked, stratified, and optimized.
+
+---
+
+## Optimization Transfer
+
+FlowLog suggests several optimizations that transfer well to DBSP-backed work.
+
+**Structural Planning**: Choose join trees from variable overlap and intermediate width, especially for recursive rules where cardinality estimates are weak.
+
+**Sideways Information Passing**: Add semijoin-style filters so later joins and recursive steps see fewer irrelevant tuples.
+
+**Antijoin Scheduling**: Apply negated atoms as soon as their variables are available. This matches the DBSP CRDT note's antijoin-pushdown agenda.
+
+**Subplan Sharing**: Reuse common derived relations across laws or CRDT views when multiple outputs need the same intermediate facts.
+
+**Physical Key Choice**: Pick key fields deliberately before lowering to the backend. DBSP joins also need maintained state, so bad key and arrangement choices will become runtime costs.
+
+---
+
+## Backend Comparison
+
+FlowLog uses Differential Dataflow. The DBSP notes use DBSP.
+
+The models differ:
+
+- Differential Dataflow uses collections with logical time and differences.
+- DBSP uses streams, Z-sets, integration, differentiation, and circuits.
+
+The shared lesson is more important than the difference:
+
+```text
+incremental backends maintain operator state
+```
+
+That means bad plans become persistent state, not just one bad query execution. A poor join order or unnecessary intermediate relation can increase memory use and update cost for the lifetime of the maintained query.
+
+FlowLog is useful because it treats planning as a first-class layer before execution.
+
+---
+
+## Proposed Synergy Path
+
+The practical path is incremental.
+
+First, use FlowLog as a reading reference for a DBSP frontend:
+
+```text
+parser
+-> dependency graph
+-> strata
+-> rule catalog
+-> relational plan
+```
+
+Second, add a small optimizer:
+
+```text
+join ordering
+antijoin pushdown
+simple SIP for bound variables
+```
+
+Third, lower the optimized plan to DBSP operators:
+
+```text
+projection
+selection
+join
+antijoin
+union
+distinct
+fixed point
+```
+
+Fourth, test against the current direct implementations:
+
+```text
+snapshot result == DBSP-maintained result
+```
+
+For Geomerge, the first target should stay the same as the DBSP integration note: supported laws compiled into one maintained `violations` relation.
+
+For CRDTs, the first target should be causal readiness and list traversal, since those are where the existing DBSP notes identify performance risk.
+
+---
+
+## Open Questions
+
+- Can FlowLog-style SIP be adapted to causal-readiness frontiers?
+- Can DBSP expose enough physical planning control for key choice and subplan sharing?
+- Should the Datalog frontend target a FlowLog-like IR before DBSP lowering?
+- Can Geomerge laws use the same catalog structure as Datalog rules?
+- Which recursive CRDT queries benefit from structural planning?
+- Is hydration better handled by DBSP, a batch engine, or persisted operator state?
+- Can one optimizer target both DBSP and Differential Dataflow backends?
+
+---
+
+## Bottom Line
+
+FlowLog should be treated as an optimizer and compiler blueprint for the DBSP work.
+
+The DBSP notes already have the right execution target:
+
+```text
+maintained relational deltas
+```
+
+FlowLog adds the missing planning discipline:
+
+```text
+rule catalog
++ join graph
++ recursive strata
++ robust plan choice
++ SIP-style prefiltering
+```
+
+Together, the systems suggest a stronger architecture:
+
+```text
+Datalog or Geolog rules
+-> FlowLog-like planning layer
+-> DBSP incremental backend
+-> maintained CRDT views or violation relations
+```
+
+---
+
+## Changelog
+
+* **May 20, 2026** -- First version created from the DBSP and FlowLog notes.