From cf4c522ff35a02a87aea2574ec6d909b940f66b4 Mon Sep 17 00:00:00 2001
From: Hassan Abedi
Date: Thu, 21 May 2026 12:18:06 +0200
Subject: [PATCH] Add a summary note file for the notes taken so far

---
 flowlog/006-flowlog-synthesis.md | 426 +++++++++++++++++++++++++++++++
 1 file changed, 426 insertions(+)
 create mode 100644 flowlog/006-flowlog-synthesis.md

diff --git a/flowlog/006-flowlog-synthesis.md b/flowlog/006-flowlog-synthesis.md
new file mode 100644
index 0000000..42f4b28
--- /dev/null
+++ b/flowlog/006-flowlog-synthesis.md
@@ -0,0 +1,426 @@
+# FlowLog Synthesis
+
+A unifying note for the FlowLog primer, implementation notes, DBSP synergy notes, technical planning notes, and usage plan.
+
+---
+
+## Short Answer
+
+The five FlowLog notes make one argument:
+
+```text
+FlowLog is most useful here as a model for the Datalog planning layer that should sit before an incremental backend such as DBSP.
+```
+
+The core architecture is:
+
+```text
+Datalog or Geolog-shaped rules
+-> dependency analysis and strata
+-> rule catalog
+-> join graph and relational plan
+-> FlowLog-style optimization
+-> DBSP or Differential Dataflow backend
+-> maintained outputs
+```
+
+FlowLog is not only an engine to run. It is a concrete example of how to keep rule semantics, planning, optimization, and backend execution separated.
+
+```mermaid
+flowchart LR
+    Source["Datalog or Geolog Rules"] --> Strata["Dependency Analysis and Strata"]
+    Strata --> Catalog["Rule Catalog"]
+    Catalog --> Plan["Relational Plan"]
+    Plan --> Optimize["FlowLog-Style Optimization"]
+    Optimize --> IR["Backend-Neutral IR"]
+    IR --> DBSP["DBSP Backend"]
+    IR --> DD["Differential Dataflow Backend"]
+    DBSP --> Outputs["Maintained Outputs"]
+    DD --> Outputs
+```
+
+---
+
+## How the Notes Fit Together
+
+The first note, `001-flowlog-primer.md`, explains the concept. FlowLog is a Datalog engine that uses Differential Dataflow as its execution backend,
+while keeping Datalog-specific planning visible before lowering to dataflow operators.
+
+The second note, `002-flowlog-implementation.md`, explains the artifact structure. The useful implementation shape is:
+
+```text
+parsing
+-> strata
+-> catalog
+-> planning
+-> optimizing
+-> executing
+```
+
+The third note, `003-flowlog-and-dbsp-synergy.md`, maps FlowLog to the DBSP notes. DBSP answers how to maintain relational results over changing
+inputs. FlowLog helps answer what relational plan should be maintained.
+
+The fourth note, `004-flowlog-technical-planning-notes.md`, zooms into the planning details: rule catalogs, collection shapes, transformation flows,
+join graphs, antijoin timing, SIP, recursive strata, subplan sharing, and physical key choice.
+
+The fifth note, `005-using-flowlog-ideas.md`, turns the ideas into a practical adoption path: planning-only prototype, DBSP lowering prototype,
+backend comparison, test corpus, data model decisions, and evaluation plan.
+
+Together, the notes move from:
+
+```text
+what FlowLog is
+-> how it is built
+-> why it matters for DBSP
+-> which technical pieces transfer
+-> how to use those pieces
+```
+
+---
+
+## Unified Mental Model
+
+The shared mental model is that Datalog execution has three separate layers.
+
+The source layer owns user-facing or system-facing rules:
+
+```text
+Datalog programs
+Geolog laws
+CRDT definitions
+```
+
+The planning layer owns the logical and physical shape of evaluation:
+
+```text
+strata
+rule catalogs
+join graphs
+antijoin placement
+SIP filters
+physical keys
+shared subplans
+```
+
+The backend layer owns maintained computation:
+
+```text
+DBSP circuits
+Differential Dataflow dataflows
+batch evaluators
+```
+
+The main design rule is:
+
+```text
+Backend execution should not rediscover rule semantics.
+```
+
+The backend should receive a checked, stratified, and optimized relational plan.
+
+---
+
+## FlowLog's Transferable Pieces
+
+The most transferable pieces are not tied to Differential Dataflow.
+
+**Rule Catalog**: A structured summary of each rule's atoms, variables, constants, comparisons, negations, and output projection.
+
+**Stratification**: A dependency order for non-recursive and recursive rule groups, with negation restrictions kept explicit.
+
+**Join Graph**: A graph or hypergraph of atoms connected by shared variables.
+
+**Structural Planning**: A robust join-ordering strategy based on variable overlap, intermediate width, and join connectivity.
+
+**Sideways Information Passing**: Semijoin-style filtering that uses known bindings to reduce later joins.
+
+**Antijoin Scheduling**: Placement of negated atoms as soon as their variables are bound.
+
+**Physical Key Choice**: Deliberate selection of keys and payload fields for maintained joins and arrangements.
+
+**Subplan Sharing**: Reuse of common antecedents or intermediate relations across rules.
+
+```mermaid
+flowchart TB
+    Catalog["Rule Catalog"] --> JoinGraph["Join Graph"]
+    Catalog --> Negation["Negation and Filters"]
+    JoinGraph --> Structural["Structural Planning"]
+    Negation --> Antijoin["Antijoin Scheduling"]
+    Catalog --> SIP["Sideways Information Passing"]
+    Structural --> Keys["Physical Key Choice"]
+    SIP --> Keys
+    Antijoin --> Keys
+    Keys --> IR["Optimized Relational IR"]
+```
+
+---
+
+## DBSP Connection
+
+The DBSP notes focus on incremental maintenance:
+
+```text
+input deltas
+-> maintained operator state
+-> output deltas
+```
+
+DBSP gives the algebra and runtime model for maintained relational computation. It does not by itself solve source-language compilation,
+Datalog-specific optimization, Geolog law translation, CRDT-specific planning, or user-facing diagnostics.
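
The delta flow above (input deltas, maintained operator state, output deltas) can be illustrated for a single binary join. This is a minimal plain-Python sketch, not the DBSP API: `IncrementalJoin` and `step` are hypothetical names, and the signed integer weights (+1 for insert, -1 for delete) are an assumption standing in for weighted-set semantics.

```python
from collections import defaultdict


class IncrementalJoin:
    """Maintains A(x, y) JOIN B(y, z) under weighted deltas (hypothetical sketch)."""

    def __init__(self):
        # Maintained operator state: each input indexed by the join key y.
        self.left = defaultdict(lambda: defaultdict(int))   # y -> {x: weight}
        self.right = defaultdict(lambda: defaultdict(int))  # y -> {z: weight}

    def step(self, delta_left, delta_right):
        """Apply one batch of input deltas and return the output delta."""
        out = defaultdict(int)
        # dA joined with the old B state; then fold dA into the A state.
        for (x, y), w in delta_left.items():
            for z, wz in self.right[y].items():
                out[(x, y, z)] += w * wz
            self.left[y][x] += w
        # dB joined with the already-updated A state, which covers
        # both the (A x dB) and the (dA x dB) terms.
        for (y, z), w in delta_right.items():
            for x, wx in self.left[y].items():
                out[(x, y, z)] += wx * w
            self.right[y][z] += w
        # Drop rows whose contributions cancelled out.
        return {row: w for row, w in out.items() if w != 0}


join = IncrementalJoin()
d1 = join.step({("a", 1): 1}, {(1, "p"): 1})   # first facts on both sides
d2 = join.step({("b", 1): 1}, {})              # new left row meets stored right row
d3 = join.step({}, {(1, "p"): -1})             # retraction removes both outputs
print(d1, d2, d3)
```

The `left` and `right` indexes persist for as long as the operator exists, so every field stored in them is carried for the operator's whole lifetime.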
+
+FlowLog's planning layer fits before DBSP:
+
+```text
+rules
+-> FlowLog-like planner
+-> DBSP circuit
+```
+
+This division is important because a poor plan is not just a bad one-shot query. In an incremental system, a poor plan becomes persistent maintained
+state. Bad join order, unnecessary intermediate fields, and late antijoins can increase memory and update cost for the life of the circuit.
+
+---
+
+## CRDT Connection
+
+The CRDT notes use Datalog to define visible state over immutable operation facts.
+
+Simple register queries are already a good fit:
+
+```text
+set + pred
+-> overwritten
+-> visible values
+```
+
+The harder cases are recursive and structural:
+
+```text
+causal readiness
+list traversal
+tombstone skipping
+move-like tree operations
+```
+
+FlowLog helps by making the expensive parts explicit:
+
+- causal-readiness recursion should be planned around frontiers when possible
+- list traversal should avoid carrying unnecessary fields through every intermediate
+- antijoins for tombstones should run as soon as their keys are available
+- repeated list subqueries should share intermediate relations where possible
+
+The practical CRDT target is:
+
+```text
+same CRDT rules
+-> naive plan
+-> FlowLog-style plan
+-> DBSP-maintained result
+-> hydration and warm-update comparison
+```
+
+---
+
+## Geomerge Connection
+
+The Geomerge notes propose compiling supported laws into maintained violation relations.
+
+The simplest useful form is:
+
+```text
+required_consequent(x) :- antecedent(x).
+violation(x) :- required_consequent(x), not consequent(x).
+```
+
+FlowLog helps once antecedents become multi-atom joins:
+
+```text
+violation(x, y, z) :-
+    A(x, y),
+    B(y, z),
+    C(z),
+    not D(x, z).
+```
+
+At that point, a compiler needs the same machinery FlowLog demonstrates:
+
+- variable occurrence maps
+- join graph extraction
+- antijoin scheduling
+- projection minimization
+- shared antecedent detection
+- violation-row construction
+
+The practical Geomerge target is:
+
+```text
+FlatTheory laws
+-> supported relational subset
+-> rule catalog
+-> planned violation query
+-> DBSP-maintained violations relation
+```
+
+---
+
+## Recommended Architecture
+
+The recommended architecture has a backend-neutral middle layer.
+
+```mermaid
+flowchart TB
+    subgraph Sources["Source Layers"]
+        Datalog["Datalog CRDT Rules"]
+        Geolog["Compiled Geolog Laws"]
+    end
+
+    subgraph Planner["FlowLog-Inspired Planner"]
+        Parse["Parse or Translate"]
+        Strata["Stratify"]
+        Catalog["Catalog Rules"]
+        Graph["Join Graph Construction"]
+        Optimize["Plan Joins, Antijoins, and SIP"]
+        IR["Relational IR with Physical Keys"]
+    end
+
+    subgraph Backends["Execution Backends"]
+        DBSP["DBSP"]
+        DD["Differential Dataflow"]
+        Batch["Snapshot Evaluator"]
+    end
+
+    Datalog --> Parse
+    Geolog --> Parse
+    Parse --> Strata --> Catalog --> Graph --> Optimize --> IR
+    IR --> DBSP
+    IR --> DD
+    IR --> Batch
+```
+
+This architecture keeps the core questions separate:
+
+- source language
+- rule semantics
+- relational planning
+- physical execution backend
+- application integration
+
+That separation makes experiments easier. If a query is slow, it becomes possible to ask whether the problem is the rule semantics, the plan, the
+backend, or the storage boundary.
+
+---
+
+## Practical Path
+
+The practical path should be staged.
+
+**Stage 1: Planning-Only Prototype**
+
+```text
+Datalog-like rules
+-> dependency graph
+-> strata
+-> rule catalog
+-> join graph
+-> textual plan
+```
+
+This validates whether the compiler understands rule shape.
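
As a rough illustration of what a planning-only prototype could print, the sketch below builds a join graph and a textual plan for the multi-atom violation rule from the Geomerge section. It assumes rules are already parsed into hypothetical `Atom` records, and the greedy "largest overlap with bound variables" ordering is a stand-in heuristic chosen for brevity, not FlowLog's actual planner.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    """One body atom of a rule (hypothetical catalog entry)."""
    relation: str
    variables: tuple
    negated: bool = False


def join_graph(atoms):
    """Edges between positive atoms that share at least one variable."""
    positive = [a for a in atoms if not a.negated]
    edges = []
    for i, a in enumerate(positive):
        for b in positive[i + 1:]:
            shared = sorted(set(a.variables) & set(b.variables))
            if shared:
                edges.append((a.relation, b.relation, shared))
    return edges


def textual_plan(atoms):
    """Greedy join order; each antijoin is placed as soon as its variables are bound."""
    positive = [a for a in atoms if not a.negated]
    pending_negated = [a for a in atoms if a.negated]
    bound, steps = set(), []
    while positive:
        # Prefer the atom sharing the most already-bound variables (ties: source order).
        nxt = max(positive, key=lambda a: len(bound & set(a.variables)))
        positive.remove(nxt)
        op = "join" if steps else "scan"
        steps.append(f"{op} {nxt.relation}({', '.join(nxt.variables)})")
        bound |= set(nxt.variables)
        for neg in list(pending_negated):
            if set(neg.variables) <= bound:
                steps.append(f"antijoin {neg.relation}({', '.join(neg.variables)})")
                pending_negated.remove(neg)
    if pending_negated:
        raise ValueError("negated atom uses variables never bound by a positive atom")
    return steps


# violation(x, y, z) :- A(x, y), B(y, z), C(z), not D(x, z).
body = [
    Atom("A", ("x", "y")),
    Atom("B", ("y", "z")),
    Atom("C", ("z",)),
    Atom("D", ("x", "z"), negated=True),
]
edges = join_graph(body)
plan = textual_plan(body)
print(edges)
print("\n".join(plan))
```

In this sketch the antijoin on `D` is emitted right after `B` binds `z`, before the join with `C`, which matches the "as soon as their variables are bound" placement described for antijoin scheduling.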
+
+**Stage 2: Narrow DBSP Lowering**
+
+```text
+planned rules
+-> projection, selection, join, antijoin, union, distinct, recursion
+-> DBSP circuit
+```
+
+This validates maintained outputs against a snapshot evaluator.
+
+**Stage 3: Workload Comparison**
+
+```text
+same rules
+same facts
+same outputs
+-> DBSP backend
+-> Differential Dataflow backend
+-> snapshot backend
+```
+
+This identifies whether bottlenecks come from planning or backend behavior.
+
+**Stage 4: Geomerge Integration**
+
+```text
+supported FlatTheory laws
+-> planned violation queries
+-> maintained violations relation
+-> agreement with current validator
+```
+
+This makes the DBSP checker a performance optimization first, not a semantic change.
+
+---
+
+## Test Workloads
+
+The shared test corpus should include:
+
+- transitive closure
+- reachability
+- connected components
+- antijoin checks
+- multi-value register
+- causal readiness
+- list next-element traversal
+- tombstone skipping
+- missing foreign-key violations
+- multi-atom Geomerge antecedents
+
+Each workload should have:
+
+- input schemas
+- base facts
+- update facts
+- expected snapshot output
+- expected output deltas
+- recursion and negation classification
+- accepted or rejected status
+
+---
+
+## Evaluation Questions
+
+The main evaluation questions are:
+
+- Does planning reduce hydration time?
+- Does planning reduce warm-update time?
+- Does causal-readiness update cost still grow with history depth?
+- Does antijoin scheduling reduce intermediate relation size?
+- Does SIP help frontier-shaped recursive queries?
+- Does physical key choice reduce maintained state?
+- Does the DBSP result match a snapshot evaluator?
+- Does Geomerge validation agree with the existing validator?
+- Is backend state rollback or preview execution tractable?
+
+---
+
+## Bottom Line
+
+The unified conclusion is:
+
+```text
+FlowLog is the planning blueprint.
+DBSP is the target incremental backend.
+CRDTs and Geomerge laws are the motivating rule sources.
+```
+
+The next durable artifact should not be a full engine. It should be a small planner that can explain rule structure, join graphs, antijoin placement,
+and physical key choices. Once that explanation is correct, DBSP lowering becomes a narrower engineering problem.
+
+---
+
+## Changelog
+
+* **May 21, 2026** -- First version created to unify the first five FlowLog notes.