Add a summary note file for the notes taken so far

2026-05-21 12:18:06 +02:00 · 2026-05-21 12:18:06 +02:00 · cf4c522ff3
commit cf4c522ff3
parent 2bfcb7e818
1 changed file with 426 additions and 0 deletions
--- a/flowlog/006-flowlog-synthesis.md
+++ b/flowlog/006-flowlog-synthesis.md
@@ -0,0 +1,426 @@
 # FlowLog Synthesis
 A unifying note for the FlowLog primer, implementation notes, DBSP synergy notes, technical planning notes, and usage plan.
 ---
 ## Short Answer
 The five FlowLog notes make one argument:
 ```text
 FlowLog is most useful here as a model for the Datalog planning layer that should sit before an incremental backend such as DBSP.
 ```
 The core architecture is:
 ```text
 Datalog or Geolog-shaped rules
 -> dependency analysis and strata
 -> rule catalog
 -> join graph and relational plan
 -> FlowLog-style optimization
 -> DBSP or Differential Dataflow backend
 -> maintained outputs
 ```
 FlowLog is more than an engine to run: it is a concrete example of how to keep rule semantics, planning, optimization, and backend execution separated.
 ```mermaid
 flowchart LR
    Source["Datalog or Geolog Rules"] --> Strata["Dependency Analysis and Strata"]
    Strata --> Catalog["Rule Catalog"]
    Catalog --> Plan["Relational Plan"]
    Plan --> Optimize["FlowLog-Style Optimization"]
    Optimize --> IR["Backend-Neutral IR"]
    IR --> DBSP["DBSP Backend"]
    IR --> DD["Differential Dataflow Backend"]
    DBSP --> Outputs["Maintained Outputs"]
    DD --> Outputs
 ```
 ---
 ## How the Notes Fit Together
 The first note, `001-flowlog-primer.md`, explains the concept. FlowLog is a Datalog engine that uses Differential Dataflow as its execution backend,
 while keeping Datalog-specific planning visible before lowering to dataflow operators.
 The second note, `002-flowlog-implementation.md`, explains the artifact structure. The useful implementation shape is:
 ```text
 parsing
 -> strata
 -> catalog
 -> planning
 -> optimizing
 -> executing
 ```
 The third note, `003-flowlog-and-dbsp-synergy.md`, maps FlowLog to the DBSP notes. DBSP answers how to maintain relational results over changing
 inputs. FlowLog helps answer what relational plan should be maintained.
 The fourth note, `004-flowlog-technical-planning-notes.md`, zooms into the planning details: rule catalogs, collection shapes, transformation flows,
 join graphs, antijoin timing, SIP, recursive strata, subplan sharing, and physical key choice.
 The fifth note, `005-using-flowlog-ideas.md`, turns the ideas into a practical adoption path: planning-only prototype, DBSP lowering prototype,
 backend comparison, test corpus, data model decisions, and evaluation plan.
 Together, the notes move from:
 ```text
 what FlowLog is
 -> how it is built
 -> why it matters for DBSP
 -> which technical pieces transfer
 -> how to use those pieces
 ```
 ---
 ## Unified Mental Model
 The shared mental model is that Datalog execution has three separate layers.
 The source layer owns user-facing or system-facing rules:
 ```text
 Datalog programs
 Geolog laws
 CRDT definitions
 ```
 The planning layer owns the logical and physical shape of evaluation:
 ```text
 strata
 rule catalogs
 join graphs
 antijoin placement
 SIP filters
 physical keys
 shared subplans
 ```
 The backend layer owns maintained computation:
 ```text
 DBSP circuits
 Differential Dataflow dataflows
 batch evaluators
 ```
 The main design rule is:
 ```text
 Backend execution should not rediscover rule semantics.
 ```
 The backend should receive a checked, stratified, and optimized relational plan.
 ---
 ## FlowLog's Transferable Pieces
 The most transferable pieces are not tied to Differential Dataflow.
 **Rule Catalog**: A structured summary of each rule's atoms, variables, constants, comparisons, negations, and output projection.
 **Stratification**: A dependency order for non-recursive and recursive rule groups, with negation restrictions kept explicit.
 **Join Graph**: A graph or hypergraph of atoms connected by shared variables.
 **Structural Planning**: A robust join-ordering strategy based on variable overlap, intermediate width, and join connectivity.
 **Sideways Information Passing**: Semijoin-style filtering that uses known bindings to reduce later joins.
 **Antijoin Scheduling**: Placement of negated atoms as soon as their variables are bound.
 **Physical Key Choice**: Deliberate selection of keys and payload fields for maintained joins and arrangements.
 **Subplan Sharing**: Reuse of common antecedents or intermediate relations across rules.
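
The first and third pieces can be sketched in a few lines of Python. This is a minimal illustration, not FlowLog's actual data structures: the `Atom` and `CatalogEntry` classes and the `catalog_rule` and `join_graph` helpers are hypothetical names, constants and comparisons are left out, and the example rule is the multi-atom violation rule from the Geomerge section.

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass(frozen=True)
class Atom:
    relation: str
    terms: tuple  # variable names only; constants are omitted in this sketch
    negated: bool = False


@dataclass
class CatalogEntry:
    head: Atom
    positive: list
    negative: list
    occurrences: dict = field(default_factory=dict)  # variable -> positive atom indexes


def catalog_rule(head, body):
    """Summarise one rule: split atoms by polarity and record variable occurrences."""
    positive = [a for a in body if not a.negated]
    negative = [a for a in body if a.negated]
    occurrences = {}
    for index, atom in enumerate(positive):
        for var in dict.fromkeys(atom.terms):  # de-duplicate repeated variables
            occurrences.setdefault(var, []).append(index)
    return CatalogEntry(head, positive, negative, occurrences)


def join_graph(entry):
    """Edges between positive atoms that share at least one variable."""
    edges = {}
    for i, j in combinations(range(len(entry.positive)), 2):
        shared = set(entry.positive[i].terms) & set(entry.positive[j].terms)
        if shared:
            edges[(i, j)] = shared
    return edges


# violation(x, y, z) :- A(x, y), B(y, z), C(z), not D(x, z).
rule = catalog_rule(
    Atom("violation", ("x", "y", "z")),
    [
        Atom("A", ("x", "y")),
        Atom("B", ("y", "z")),
        Atom("C", ("z",)),
        Atom("D", ("x", "z"), negated=True),
    ],
)
graph = join_graph(rule)
```

Here `graph` links `A` to `B` through `y` and `B` to `C` through `z`, which is the information a structural planner would need before choosing a join order.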
 ```mermaid
 flowchart TB
    Catalog["Rule Catalog"] --> JoinGraph["Join Graph"]
    Catalog --> Negation["Negation and Filters"]
    JoinGraph --> Structural["Structural Planning"]
    Negation --> Antijoin["Antijoin Scheduling"]
    Catalog --> SIP["Sideways Information Passing"]
    Structural --> Keys["Physical Key Choice"]
    SIP --> Keys
    Antijoin --> Keys
    Keys --> IR["Optimized Relational IR"]
 ```
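
Sideways information passing is easiest to see on a tiny example. The sketch below assumes a hypothetical query `start(x), edge(x, y), label(y, name)` over plain Python lists; the `semijoin` and `join` helpers are illustrative and do not correspond to any FlowLog or DBSP API.

```python
def semijoin(rows, column, allowed):
    """Keep only rows whose value at `column` appears in `allowed`."""
    return [row for row in rows if row[column] in allowed]


def join(left, left_col, right, right_col):
    """Hash join; each output row is the left row followed by the right row."""
    index = {}
    for row in right:
        index.setdefault(row[right_col], []).append(row)
    return [l + r for l in left for r in index.get(l[left_col], [])]


# Hypothetical query: start(x), edge(x, y), label(y, name)
start = [("a",)]
edge = [("a", "b"), ("c", "d"), ("e", "f")]
label = [("b", "blue"), ("d", "red"), ("f", "green")]
bound_x = {row[0] for row in start}

# Without SIP: join edge with label first, then restrict by start.
wide = join(edge, 1, label, 0)
naive = [row for row in wide if row[0] in bound_x]

# With SIP: pass the known bindings of x sideways into edge before joining.
filtered_edge = semijoin(edge, 0, bound_x)
sip = join(filtered_edge, 1, label, 0)
```

Both plans produce the same rows, but the SIP variant feeds one `edge` row into the join instead of three. In a maintained circuit that difference would show up as smaller operator state, not just a faster single run.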
 ---
 ## DBSP Connection
 The DBSP notes focus on incremental maintenance:
 ```text
 input deltas
 -> maintained operator state
 -> output deltas
 ```
 DBSP gives the algebra and runtime model for maintained relational computation. It does not by itself solve source-language compilation,
 Datalog-specific optimization, Geolog law translation, CRDT-specific planning, or user-facing diagnostics.
 FlowLog's planning layer fits before DBSP:
 ```text
 rules
 -> FlowLog-like planner
 -> DBSP circuit
 ```
 This division matters because a poor plan costs more than one slow query. In an incremental system, a poor plan becomes persistent maintained
 state: bad join order, unnecessary intermediate fields, and late antijoins can increase memory and update cost for the life of the circuit.
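
The delta pipeline can be made concrete for a single join. The sketch below models relations as weighted multisets (row to integer weight) and uses the standard bilinear identity for a join delta. It is plain Python under those assumptions, not the DBSP library API, and the helper names `zjoin`, `zadd`, and `delta_join` are made up for illustration.

```python
from collections import Counter


def zjoin(left, right):
    """Join weighted relations on left[1] == right[0]; weights multiply."""
    out = Counter()
    for (x, y), wl in left.items():
        for (y2, z), wr in right.items():
            if y == y2:
                out[(x, y, z)] += wl * wr
    return out


def zadd(*parts):
    """Sum weighted relations, dropping rows whose weights cancel to zero."""
    total = Counter()
    for part in parts:
        for row, weight in part.items():
            total[row] += weight
    return Counter({row: w for row, w in total.items() if w != 0})


def delta_join(a, b, da, db):
    """Output delta of a join: dA*B + A*dB + dA*dB, with A and B the old state."""
    return zadd(zjoin(da, b), zjoin(a, db), zjoin(da, db))


# Old state, then one update that replaces a row in A and adds a row to B.
a = Counter({("x1", "k"): 1})
b = Counter({("k", "z1"): 1})
da = Counter({("x2", "k"): 1, ("x1", "k"): -1})
db = Counter({("k", "z2"): 1})

old_output = zjoin(a, b)
output_delta = delta_join(a, b, da, db)
maintained = zadd(old_output, output_delta)
recomputed = zadd(zjoin(zadd(a, da), zadd(b, db)))
```

The maintained result equals the from-scratch result, but computing it requires keeping `a` and `b` around as operator state. That retained state is what a poor plan inflates.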
 ---
 ## CRDT Connection
 The CRDT notes use Datalog to define visible state over immutable operation facts.
 Simple register queries are already a good fit:
 ```text
 set + pred
 -> overwritten
 -> visible values
 ```
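
As a rough illustration of that pipeline, the sketch below evaluates it over Python sets. The fact shapes are assumptions made for this example: `set_ops` rows are `(op, key, value)` and `pred` rows are `(op, overwritten_op)`; the CRDT notes may use different schemas.

```python
def visible_values(set_ops, pred):
    """visible(key, value) :- set(op, key, value), not overwritten(op)."""
    # overwritten(old) :- pred(_, old).
    overwritten = {old for (_new, old) in pred}
    return {(key, value) for (op, key, value) in set_ops if op not in overwritten}


set_ops = {
    ("op1", "title", "draft"),
    ("op2", "title", "final"),
    ("op3", "title", "alt"),
}
# op2 and op3 were both written on top of op1, concurrently with each other.
pred = {("op2", "op1"), ("op3", "op1")}

visible = visible_values(set_ops, pred)
```

Because neither `op2` nor `op3` overwrites the other, both values stay visible, which is the multi-value register behaviour.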
 The harder cases are recursive and structural:
 ```text
 causal readiness
 list traversal
 tombstone skipping
 move-like tree operations
 ```
 FlowLog helps by making the expensive parts explicit:
 - causal-readiness recursion should be planned around frontiers when possible
 - list traversal should avoid carrying unnecessary fields through every intermediate
 - antijoins for tombstones should run as soon as their keys are available
 - repeated list subqueries should share intermediate relations where possible
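
One possible reading of "planned around frontiers" for causal readiness is a worklist that only revisits the dependents of operations that have just become ready. The sketch below follows that reading; the `deps` and `delivered` shapes and the `causally_ready` helper are assumptions made for illustration, not something taken from the notes.

```python
from collections import defaultdict, deque


def causally_ready(deps, delivered):
    """Return delivered ops whose transitive dependencies are all delivered.

    deps: op -> set of ops it depends on; delivered: ops present locally.
    """
    dependents = defaultdict(set)
    missing = {}
    for op in delivered:
        op_deps = deps.get(op, set())
        missing[op] = len(op_deps)
        for dep in op_deps:
            dependents[dep].add(op)

    # The frontier holds ops that just became ready; only their dependents
    # are re-examined, so work is proportional to the edges touched.
    frontier = deque(sorted(op for op, count in missing.items() if count == 0))
    ready = set()
    while frontier:
        op = frontier.popleft()
        ready.add(op)
        for child in sorted(dependents[op]):
            missing[child] -= 1
            if missing[child] == 0:
                frontier.append(child)
    return ready


deps = {"b": {"a"}, "c": {"b"}, "e": {"d"}}
delivered = {"a", "b", "c", "e"}
ready = causally_ready(deps, delivered)
```

Operation `e` stays blocked because its dependency `d` has not been delivered. Delivering `d` later would only need to touch `d` and `e`, not the whole history.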
 The practical CRDT target is:
 ```text
 same CRDT rules
 -> naive plan
 -> FlowLog-style plan
 -> DBSP-maintained result
 -> hydration and warm-update comparison
 ```
 ---
 ## Geomerge Connection
 The Geomerge notes propose compiling supported laws into maintained violation relations.
 The simplest useful form is:
 ```text
 required_consequent(x) :- antecedent(x).
 violation(x) :- required_consequent(x), not consequent(x).
 ```
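
Over plain sets, the two rules reduce to a set difference. The sketch below uses a hypothetical missing-foreign-key check as the law; the relation contents are invented for the example.

```python
def violations(antecedent, consequent):
    """violation(x) :- required_consequent(x), not consequent(x)."""
    required = set(antecedent)          # required_consequent(x) :- antecedent(x).
    return required - set(consequent)   # antijoin against the facts that exist


referenced_customers = {"c1", "c2", "c3"}   # ids mentioned by orders
known_customers = {"c1", "c3"}              # ids that actually exist

before = violations(referenced_customers, known_customers)
after = violations(referenced_customers, known_customers | {"c2"})
```

Adding the missing customer retracts the violation. A maintained version would emit that retraction as an output delta instead of recomputing the difference.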
 FlowLog helps once antecedents become multi-atom joins:
 ```text
 violation(x, y, z) :-
    A(x, y),
    B(y, z),
    C(z),
    not D(x, z).
 ```
 At that point, a compiler needs the same machinery FlowLog demonstrates:
 - variable occurrence maps
 - join graph extraction
 - antijoin scheduling
 - projection minimization
 - shared antecedent detection
 - violation-row construction
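
Antijoin scheduling from the list above can be sketched as a pass over a fixed positive join order. The `schedule` helper below is hypothetical and deliberately simple: it takes the positive order as given and emits each negated atom right after the first step that binds all of its variables.

```python
def schedule(positive, negative):
    """Interleave negated atoms into a fixed positive join order.

    Each atom is (relation, variables). The first "join" step is really a
    scan; later steps join against the rows accumulated so far.
    """
    bound = set()
    pending = list(negative)
    steps = []
    for relation, variables in positive:
        steps.append(("join", relation))
        bound |= set(variables)
        still_pending = []
        for neg_relation, neg_vars in pending:
            if set(neg_vars) <= bound:
                steps.append(("antijoin", neg_relation))
            else:
                still_pending.append((neg_relation, neg_vars))
        pending = still_pending
    if pending:
        # A negated atom with unbound variables makes the rule unsafe.
        raise ValueError(f"unsafe negation, unbound variables in: {pending}")
    return steps


# violation(x, y, z) :- A(x, y), B(y, z), C(z), not D(x, z).
steps = schedule(
    positive=[("A", ("x", "y")), ("B", ("y", "z")), ("C", ("z",))],
    negative=[("D", ("x", "z"))],
)
```

For this rule the antijoin on `D` lands before the join with `C`, because `x` and `z` are both bound once `A` and `B` have been joined.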
 The practical Geomerge target is:
 ```text
 FlatTheory laws
 -> supported relational subset
 -> rule catalog
 -> planned violation query
 -> DBSP-maintained violations relation
 ```
 ---
 ## Recommended Architecture
 The recommended architecture has a backend-neutral middle layer.
 ```mermaid
 flowchart TB
    subgraph Sources["Source Layers"]
        Datalog["Datalog CRDT Rules"]
        Geolog["Compiled Geolog Laws"]
    end
    subgraph Planner["FlowLog-Inspired Planner"]
        Parse["Parse or Translate"]
        Strata["Stratify"]
        Catalog["Catalog Rules"]
        Graph["Join Graph Construction"]
        Optimize["Plan Joins, Antijoins, and SIP"]
        IR["Relational IR with Physical Keys"]
    end
    subgraph Backends["Execution Backends"]
        DBSP["DBSP"]
        DD["Differential Dataflow"]
        Batch["Snapshot Evaluator"]
    end
    Datalog --> Parse
    Geolog --> Parse
    Parse --> Strata --> Catalog --> Graph --> Optimize --> IR
    IR --> DBSP
    IR --> DD
    IR --> Batch
 ```
 This architecture keeps the core questions separate:
 - source language
 - rule semantics
 - relational planning
 - physical execution backend
 - application integration
 That separation makes experiments easier. If a query is slow, it becomes possible to ask whether the problem is the rule semantics, the plan, the
 backend, or the storage boundary.
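
To make the backend-neutral middle layer less abstract, here is a minimal sketch of a relational IR with a snapshot evaluator as its simplest backend. The node types (`Scan`, `Join`, `AntiJoin`, `Project`) and the `evaluate` function are assumptions for illustration; a real IR would also carry schemas, key choices, and recursion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    relation: str


@dataclass(frozen=True)
class Join:
    left: object
    right: object
    left_key: int   # column position in the left rows
    right_key: int  # column position in the right rows


@dataclass(frozen=True)
class AntiJoin:
    left: object
    right: object
    left_key: tuple  # left columns that must not match a full right row


@dataclass(frozen=True)
class Project:
    source: object
    columns: tuple


def evaluate(node, facts):
    """Snapshot backend: interpret the plan over plain Python sets."""
    if isinstance(node, Scan):
        return set(facts[node.relation])
    if isinstance(node, Join):
        left, right = evaluate(node.left, facts), evaluate(node.right, facts)
        return {l + r for l in left for r in right
                if l[node.left_key] == r[node.right_key]}
    if isinstance(node, AntiJoin):
        left, right = evaluate(node.left, facts), evaluate(node.right, facts)
        return {l for l in left
                if tuple(l[i] for i in node.left_key) not in right}
    if isinstance(node, Project):
        return {tuple(row[i] for i in node.columns)
                for row in evaluate(node.source, facts)}
    raise TypeError(f"unknown plan node: {node!r}")


# violation(x, y, z) :- A(x, y), B(y, z), C(z), not D(x, z).
plan = Project(
    Join(
        AntiJoin(Join(Scan("A"), Scan("B"), 1, 0), Scan("D"), (0, 3)),
        Scan("C"), 3, 0,
    ),
    (0, 1, 3),
)
facts = {"A": {(1, 2), (5, 2)}, "B": {(2, 3)}, "C": {(3,)}, "D": {(5, 3)}}
result = evaluate(plan, facts)
```

The same `plan` value could be handed to a different lowering function for DBSP or Differential Dataflow, which is the point of keeping the IR independent of any one backend.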
 ---
 ## Practical Path
 The practical path should be staged.
 **Stage 1: Planning-Only Prototype**
 ```text
 Datalog-like rules
 -> dependency graph
 -> strata
 -> rule catalog
 -> join graph
 -> textual plan
 ```
 This checks that the compiler understands rule shape before any backend is involved.
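
The dependency-graph and strata steps of this stage fit in a short function. The sketch below is one standard way to compute strata by fixpoint iteration; the rule encoding (head name plus a list of `(relation, negated)` pairs) and the `stratify` helper are assumptions for this example.

```python
def stratify(rules):
    """Assign a stratum to every relation, or raise on negation in a cycle.

    rules: list of (head, [(body_relation, negated), ...]).
    """
    relations = {head for head, _ in rules}
    relations |= {relation for _, body in rules for relation, _ in body}
    stratum = {relation: 0 for relation in relations}

    changed = True
    while changed:
        changed = False
        for head, body in rules:
            for relation, negated in body:
                # A negated dependency must be fully computed one stratum earlier.
                needed = stratum[relation] + (1 if negated else 0)
                if stratum[head] < needed:
                    if needed >= len(relations):
                        # More strata than relations means a negative cycle.
                        raise ValueError(f"negation inside recursion at {head}")
                    stratum[head] = needed
                    changed = True
    return stratum


register_rules = [
    ("overwritten", [("pred", False)]),
    ("visible", [("set", False), ("overwritten", True)]),
]
strata = stratify(register_rules)
```

A textual plan could then list rules stratum by stratum. Programs that negate through recursion are rejected here rather than inside the backend.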
 **Stage 2: Narrow DBSP Lowering**
 ```text
 planned rules
 -> projection, selection, join, antijoin, union, distinct, recursion
 -> DBSP circuit
 ```
 This validates maintained outputs against a snapshot evaluator.
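
The validation idea can be tried without DBSP at all. The sketch below compares a from-scratch transitive closure with an insert-only, semi-naive maintained version after every update. It is a stand-in for the real comparison under stated limits: deletions are not handled, and the function names are hypothetical.

```python
def snapshot_closure(edges):
    """Snapshot evaluator: recompute path(x, z) from scratch by naive fixpoint."""
    paths = set(edges)
    while True:
        derived = {(x, z) for (x, y) in paths for (y2, z) in edges if y == y2}
        if derived <= paths:
            return paths
        paths |= derived


def insert_edge(paths, edges, new_edge):
    """Maintained version: extend the closure in place from new rows only."""
    edges.add(new_edge)
    delta = {new_edge} - paths
    while delta:
        paths |= delta
        # Combine the new rows with everything known, on either side.
        step = {(x, z) for (x, y) in delta for (y2, z) in paths if y == y2}
        step |= {(x, z) for (x, y) in paths for (y2, z) in delta if y == y2}
        delta = step - paths
    return paths


def agrees_after_each_insert(sequence):
    """Check maintained output against the snapshot evaluator after every update."""
    edges, paths = set(), set()
    for edge in sequence:
        insert_edge(paths, edges, edge)
        if paths != snapshot_closure(edges):
            return False
    return True


ok = agrees_after_each_insert([("a", "b"), ("c", "d"), ("b", "c"), ("d", "a")])
```

The same harness shape would apply to a DBSP circuit: feed identical updates to both sides and compare the accumulated outputs at every step, not only at the end.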
 **Stage 3: Workload Comparison**
 ```text
 same rules
 same facts
 same outputs
 -> DBSP backend
 -> Differential Dataflow backend
 -> snapshot backend
 ```
 This identifies whether bottlenecks come from planning or backend behavior.
 **Stage 4: Geomerge Integration**
 ```text
 supported FlatTheory laws
 -> planned violation queries
 -> maintained violations relation
 -> agreement with current validator
 ```
 This makes the DBSP checker a performance optimization first, not a semantic change.
 ---
 ## Test Workloads
 The shared test corpus should include:
 - transitive closure
 - reachability
 - connected components
 - antijoin checks
 - multi-value register
 - causal readiness
 - list next-element traversal
 - tombstone skipping
 - missing foreign-key violations
 - multi-atom Geomerge antecedents
 Each workload should have:
 - input schemas
 - base facts
 - update facts
 - expected snapshot output
 - expected output deltas
 - recursion and negation classification
 - accepted or rejected status
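
One way to keep those fields together is a small record per workload. The `Workload` dataclass and the `check_arity` helper below are hypothetical; the field names mirror the list above, and the example instance is a transitive-closure workload with invented facts.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    schemas: dict            # relation -> tuple of column names
    base_facts: dict         # relation -> set of rows
    updates: list            # (relation, row, weight) with weight +1 or -1
    expected_snapshot: dict  # relation -> set of rows after the updates
    expected_deltas: list    # (relation, row, weight) emitted by the updates
    recursive: bool
    uses_negation: bool
    accepted: bool           # False for programs the planner should reject


def check_arity(workload):
    """Return base-fact rows whose width disagrees with the declared schema."""
    problems = []
    for relation, rows in workload.base_facts.items():
        width = len(workload.schemas[relation])
        problems += [(relation, row) for row in rows if len(row) != width]
    return problems


transitive_closure = Workload(
    name="transitive closure",
    schemas={"edge": ("src", "dst"), "path": ("src", "dst")},
    base_facts={"edge": {("a", "b"), ("b", "c")}},
    updates=[("edge", ("c", "d"), 1)],
    expected_snapshot={
        "path": {("a", "b"), ("b", "c"), ("a", "c"),
                 ("c", "d"), ("b", "d"), ("a", "d")},
    },
    expected_deltas=[
        ("path", ("c", "d"), 1),
        ("path", ("b", "d"), 1),
        ("path", ("a", "d"), 1),
    ],
    recursive=True,
    uses_negation=False,
    accepted=True,
)
arity_problems = check_arity(transitive_closure)
```

Keeping expected deltas next to the expected snapshot lets the same record drive both the snapshot evaluator and an incremental backend.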
 ---
 ## Evaluation Questions
 The main evaluation questions are:
 - Does planning reduce hydration time?
 - Does planning reduce warm-update time?
 - Does causal-readiness update cost still grow with history depth?
 - Does antijoin scheduling reduce intermediate relation size?
 - Does SIP help frontier-shaped recursive queries?
 - Does physical key choice reduce maintained state?
 - Does the DBSP result match a snapshot evaluator?
 - Does Geomerge validation agree with the existing validator?
 - Is backend state rollback or preview execution tractable?
 ---
 ## Bottom Line
 The unified conclusion is:
 ```text
 FlowLog is the planning blueprint.
 DBSP is the target incremental backend.
 CRDTs and Geomerge laws are the motivating rule sources.
 ```
 The next durable artifact should not be a full engine. It should be a small planner that can explain rule structure, join graphs, antijoin placement,
 and physical key choices. Once that explanation is correct, DBSP lowering becomes a narrower engineering problem.
 ---
 ## Changelog
 * **May 21, 2026** -- First version created to unify the first five FlowLog notes.