useful-notes/flowlog/006-flowlog-synthesis.md

427 lines
11 KiB
Markdown
Raw Permalink Normal View History

# FlowLog Synthesis
A unifying note for the FlowLog primer, implementation notes, DBSP synergy notes, technical planning notes, and usage plan.
---
## Short Answer
The five FlowLog notes make one argument:
```text
FlowLog is most useful here as a model for the Datalog planning layer that should sit before an incremental backend such as DBSP.
```
The core architecture is:
```text
Datalog or Geolog-shaped rules
-> dependency analysis and strata
-> rule catalog
-> join graph and relational plan
-> FlowLog-style optimization
-> DBSP or Differential Dataflow backend
-> maintained outputs
```
FlowLog is not only an engine to run. It is a concrete example of how to keep rule semantics, planning, optimization, and backend execution separated.
```mermaid
flowchart LR
Source["Datalog or Geolog Rules"] --> Strata["Dependency Analysis and Strata"]
Strata --> Catalog["Rule Catalog"]
Catalog --> Plan["Relational Plan"]
Plan --> Optimize["FlowLog-Style Optimization"]
Optimize --> IR["Backend-Neutral IR"]
IR --> DBSP["DBSP Backend"]
IR --> DD["Differential Dataflow Backend"]
DBSP --> Outputs["Maintained Outputs"]
DD --> Outputs
```
---
## How the Notes Fit Together
The first note, `001-flowlog-primer.md`, explains the concept. FlowLog is a Datalog engine that uses Differential Dataflow as its execution backend,
while keeping Datalog-specific planning visible before lowering to dataflow operators.
The second note, `002-flowlog-implementation.md`, explains the artifact structure. The useful implementation shape is:
```text
parsing
-> strata
-> catalog
-> planning
-> optimizing
-> executing
```
The third note, `003-flowlog-and-dbsp-synergy.md`, maps FlowLog to the DBSP notes. DBSP answers how to maintain relational results over changing
inputs. FlowLog helps answer what relational plan should be maintained.
The fourth note, `004-flowlog-technical-planning-notes.md`, zooms into the planning details: rule catalogs, collection shapes, transformation flows,
join graphs, antijoin timing, SIP, recursive strata, subplan sharing, and physical key choice.
The fifth note, `005-using-flowlog-ideas.md`, turns the ideas into a practical adoption path: planning-only prototype, DBSP lowering prototype,
backend comparison, test corpus, data model decisions, and evaluation plan.
Together, the notes move from:
```text
what FlowLog is
-> how it is built
-> why it matters for DBSP
-> which technical pieces transfer
-> how to use those pieces
```
---
## Unified Mental Model
The shared mental model is that Datalog execution has three separate layers.
The source layer owns user-facing or system-facing rules:
```text
Datalog programs
Geolog laws
CRDT definitions
```
The planning layer owns the logical and physical shape of evaluation:
```text
strata
rule catalogs
join graphs
antijoin placement
SIP filters
physical keys
shared subplans
```
The backend layer owns maintained computation:
```text
DBSP circuits
Differential Dataflow dataflows
batch evaluators
```
The main design rule is:
```text
Backend execution should not rediscover rule semantics.
```
The backend should receive a checked, stratified, and optimized relational plan.
---
## FlowLog's Transferable Pieces
The most transferable pieces are not tied to Differential Dataflow.
**Rule Catalog**: A structured summary of each rule's atoms, variables, constants, comparisons, negations, and output projection.
**Stratification**: A dependency order for non-recursive and recursive rule groups, with negation restrictions kept explicit.
**Join Graph**: A graph or hypergraph of atoms connected by shared variables.
**Structural Planning**: A robust join-ordering strategy based on variable overlap, intermediate width, and join connectivity.
**Sideways Information Passing**: Semijoin-style filtering that uses known bindings to reduce later joins.
**Antijoin Scheduling**: Placement of negated atoms as soon as their variables are bound.
**Physical Key Choice**: Deliberate selection of keys and payload fields for maintained joins and arrangements.
**Subplan Sharing**: Reuse of common antecedents or intermediate relations across rules.
```mermaid
flowchart TB
Catalog["Rule Catalog"] --> JoinGraph["Join Graph"]
Catalog --> Negation["Negation and Filters"]
JoinGraph --> Structural["Structural Planning"]
Negation --> Antijoin["Antijoin Scheduling"]
Catalog --> SIP["Sideways Information Passing"]
Structural --> Keys["Physical Key Choice"]
SIP --> Keys
Antijoin --> Keys
Keys --> IR["Optimized Relational IR"]
```
---
## DBSP Connection
The DBSP notes focus on incremental maintenance:
```text
input deltas
-> maintained operator state
-> output deltas
```
DBSP gives the algebra and runtime model for maintained relational computation. It does not by itself solve source-language compilation,
Datalog-specific optimization, Geolog law translation, CRDT-specific planning, or user-facing diagnostics.
FlowLog's planning layer fits before DBSP:
```text
rules
-> FlowLog-like planner
-> DBSP circuit
```
This division is important because a poor plan is not just a bad one-shot query. In an incremental system, a poor plan becomes persistent maintained
state. Bad join order, unnecessary intermediate fields, and late antijoins can increase memory and update cost for the life of the circuit.
---
## CRDT Connection
The CRDT notes use Datalog to define visible state over immutable operation facts.
Simple register queries are already a good fit:
```text
set + pred
-> overwritten
-> visible values
```
The harder cases are recursive and structural:
```text
causal readiness
list traversal
tombstone skipping
move-like tree operations
```
FlowLog helps by making the expensive parts explicit:
- causal-readiness recursion should be planned around frontiers when possible
- list traversal should avoid carrying unnecessary fields through every intermediate
- antijoins for tombstones should run as soon as their keys are available
- repeated list subqueries should share intermediate relations where possible
The practical CRDT target is:
```text
same CRDT rules
-> naive plan
-> FlowLog-style plan
-> DBSP-maintained result
-> hydration and warm-update comparison
```
---
## Geomerge Connection
The Geomerge notes propose compiling supported laws into maintained violation relations.
The simplest useful form is:
```text
required_consequent(x) :- antecedent(x).
violation(x) :- required_consequent(x), not consequent(x).
```
FlowLog helps once antecedents become multi-atom joins:
```text
violation(x, y, z) :-
A(x, y),
B(y, z),
C(z),
not D(x, z).
```
At that point, a compiler needs the same machinery FlowLog demonstrates:
- variable occurrence maps
- join graph extraction
- antijoin scheduling
- projection minimization
- shared antecedent detection
- violation-row construction
The practical Geomerge target is:
```text
FlatTheory laws
-> supported relational subset
-> rule catalog
-> planned violation query
-> DBSP-maintained violations relation
```
---
## Recommended Architecture
The recommended architecture has a backend-neutral middle layer.
```mermaid
flowchart TB
subgraph Sources["Source Layers"]
Datalog["Datalog CRDT Rules"]
Geolog["Compiled Geolog Laws"]
end
subgraph Planner["FlowLog-Inspired Planner"]
Parse["Parse or Translate"]
Strata["Stratify"]
Catalog["Catalog Rules"]
Graph["Join Graph Construction"]
Optimize["Plan Joins, Antijoins, and SIP"]
IR["Relational IR with Physical Keys"]
end
subgraph Backends["Execution Backends"]
DBSP["DBSP"]
DD["Differential Dataflow"]
Batch["Snapshot Evaluator"]
end
Datalog --> Parse
Geolog --> Parse
Parse --> Strata --> Catalog --> Graph --> Optimize --> IR
IR --> DBSP
IR --> DD
IR --> Batch
```
This architecture keeps the core questions separate:
- source language
- rule semantics
- relational planning
- physical execution backend
- application integration
That separation makes experiments easier. If a query is slow, it becomes possible to ask whether the problem is the rule semantics, the plan, the
backend, or the storage boundary.
---
## Practical Path
The practical path should be staged.
**Stage 1: Planning-Only Prototype**
```text
Datalog-like rules
-> dependency graph
-> strata
-> rule catalog
-> join graph
-> textual plan
```
This validates whether the compiler understands rule shape.
**Stage 2: Narrow DBSP Lowering**
```text
planned rules
-> projection, selection, join, antijoin, union, distinct, recursion
-> DBSP circuit
```
This validates maintained outputs against a snapshot evaluator.
**Stage 3: Workload Comparison**
```text
same rules
same facts
same outputs
-> DBSP backend
-> Differential Dataflow backend
-> snapshot backend
```
This identifies whether bottlenecks come from planning or backend behavior.
**Stage 4: Geomerge Integration**
```text
supported FlatTheory laws
-> planned violation queries
-> maintained violations relation
-> agreement with current validator
```
This makes the DBSP checker a performance optimization first, not a semantic change.
---
## Test Workloads
The shared test corpus should include:
- transitive closure
- reachability
- connected components
- antijoin checks
- multi-value register
- causal readiness
- list next-element traversal
- tombstone skipping
- missing foreign-key violations
- multi-atom Geomerge antecedents
Each workload should have:
- input schemas
- base facts
- update facts
- expected snapshot output
- expected output deltas
- recursion and negation classification
- accepted or rejected status
---
## Evaluation Questions
The main evaluation questions are:
- Does planning reduce hydration time?
- Does planning reduce warm-update time?
- Does causal-readiness update cost still grow with history depth?
- Does antijoin scheduling reduce intermediate relation size?
- Does SIP help frontier-shaped recursive queries?
- Does physical key choice reduce maintained state?
- Does the DBSP result match a snapshot evaluator?
- Does Geomerge validation agree with the existing validator?
- Is backend state rollback or preview execution tractable?
---
## Bottom Line
The unified conclusion is:
```text
FlowLog is the planning blueprint.
DBSP is the target incremental backend.
CRDTs and Geomerge laws are the motivating rule sources.
```
The next durable artifact should not be a full engine. It should be a small planner that can explain rule structure, join graphs, antijoin placement,
and physical key choices. Once that explanation is correct, DBSP lowering becomes a narrower engineering problem.
---
## Changelog
* **May 21, 2026** -- First version created to unify the first five FlowLog notes.