Hassan Abedi cf4c522ff3 Add a summary note file for the notes taken so far

2026-05-21 12:18:06 +02:00

FlowLog Synthesis

A unifying note for the FlowLog primer, implementation notes, DBSP synergy notes, technical planning notes, and usage plan.

Short Answer

The five FlowLog notes make one argument:

FlowLog is most useful here as a model for the Datalog planning layer that should sit before an incremental backend such as DBSP.

The core architecture is:

Datalog or Geolog-shaped rules
-> dependency analysis and strata
-> rule catalog
-> join graph and relational plan
-> FlowLog-style optimization
-> DBSP or Differential Dataflow backend
-> maintained outputs

FlowLog is more than an engine to run: it is a concrete example of how to keep rule semantics, planning, optimization, and backend execution separated.

flowchart LR
    Source["Datalog or Geolog Rules"] --> Strata["Dependency Analysis and Strata"]
    Strata --> Catalog["Rule Catalog"]
    Catalog --> Plan["Relational Plan"]
    Plan --> Optimize["FlowLog-Style Optimization"]
    Optimize --> IR["Backend-Neutral IR"]
    IR --> DBSP["DBSP Backend"]
    IR --> DD["Differential Dataflow Backend"]
    DBSP --> Outputs["Maintained Outputs"]
    DD --> Outputs

How the Notes Fit Together

The first note, 001-flowlog-primer.md, explains the concept. FlowLog is a Datalog engine that uses Differential Dataflow as its execution backend, while keeping Datalog-specific planning visible before lowering to dataflow operators.

The second note, 002-flowlog-implementation.md, explains the artifact structure. The useful implementation shape is:

parsing
-> strata
-> catalog
-> planning
-> optimizing
-> executing

The third note, 003-flowlog-and-dbsp-synergy.md, maps FlowLog to the DBSP notes. DBSP answers how to maintain relational results over changing inputs. FlowLog helps answer what relational plan should be maintained.

The fourth note, 004-flowlog-technical-planning-notes.md, zooms into the planning details: rule catalogs, collection shapes, transformation flows, join graphs, antijoin timing, SIP, recursive strata, subplan sharing, and physical key choice.

The fifth note, 005-using-flowlog-ideas.md, turns the ideas into a practical adoption path: planning-only prototype, DBSP lowering prototype, backend comparison, test corpus, data model decisions, and evaluation plan.

Together, the notes move from:

what FlowLog is
-> how it is built
-> why it matters for DBSP
-> which technical pieces transfer
-> how to use those pieces

Unified Mental Model

The shared mental model is that Datalog execution has three separate layers.

The source layer owns user-facing or system-facing rules:

Datalog programs
Geolog laws
CRDT definitions

The planning layer owns the logical and physical shape of evaluation:

strata
rule catalogs
join graphs
antijoin placement
SIP filters
physical keys
shared subplans

The backend layer owns maintained computation:

DBSP circuits
Differential Dataflow dataflows
batch evaluators

The main design rule is:

Backend execution should not rediscover rule semantics.

The backend should receive a checked, stratified, and optimized relational plan.

FlowLog's Transferable Pieces

The most transferable pieces are not tied to Differential Dataflow.

Rule Catalog: A structured summary of each rule's atoms, variables, constants, comparisons, negations, and output projection.

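Stratification: A dependency order for non-recursive and recursive rule groups, with the negation restriction kept explicit: a negated relation must be fully computed in an earlier stratum and cannot take part in the recursion that depends on it.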

Join Graph: A graph or hypergraph of atoms connected by shared variables.

Structural Planning: A robust join-ordering strategy based on variable overlap, intermediate width, and join connectivity.

Sideways Information Passing: Semijoin-style filtering that uses known bindings to reduce later joins.

Antijoin Scheduling: Placement of each negated atom at the earliest point in the plan where all of its variables are bound by positive atoms.

Physical Key Choice: Deliberate selection of keys and payload fields for maintained joins and arrangements.

Subplan Sharing: Reuse of common antecedents or intermediate relations across rules.

flowchart TB
    Catalog["Rule Catalog"] --> JoinGraph["Join Graph"]
    Catalog --> Negation["Negation and Filters"]
    JoinGraph --> Structural["Structural Planning"]
    Negation --> Antijoin["Antijoin Scheduling"]
    Catalog --> SIP["Sideways Information Passing"]
    Structural --> Keys["Physical Key Choice"]
    SIP --> Keys
    Antijoin --> Keys
    Keys --> IR["Optimized Relational IR"]
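
As a minimal sketch of the first two pieces, the snippet below catalogs one rule and derives its join graph from shared variables. The `Atom` and `RuleCatalog` names, and the convention that string arguments are variables while any other value is a constant, are assumptions made for illustration and are not taken from FlowLog's code. The rule is the multi-atom violation rule used in the Geomerge section.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Atom:
    relation: str
    args: tuple            # assumed convention: str = variable, other values = constants
    negated: bool = False

    def variables(self):
        return {t for t in self.args if isinstance(t, str)}


@dataclass
class RuleCatalog:
    head: Atom
    body: list

    def positive_atoms(self):
        return [a for a in self.body if not a.negated]

    def negated_atoms(self):
        return [a for a in self.body if a.negated]

    def occurrences(self):
        """Map each variable to the indices of the positive atoms that mention it."""
        occ = {}
        for i, atom in enumerate(self.positive_atoms()):
            for v in atom.variables():
                occ.setdefault(v, []).append(i)
        return occ

    def join_graph(self):
        """Edges between positive atoms, labeled with their shared variables."""
        pos = self.positive_atoms()
        edges = {}
        for i, j in combinations(range(len(pos)), 2):
            shared = pos[i].variables() & pos[j].variables()
            if shared:
                edges[(i, j)] = shared
        return edges


# violation(x, y, z) :- A(x, y), B(y, z), C(z), not D(x, z).
rule = RuleCatalog(
    head=Atom("violation", ("x", "y", "z")),
    body=[
        Atom("A", ("x", "y")),
        Atom("B", ("y", "z")),
        Atom("C", ("z",)),
        Atom("D", ("x", "z"), negated=True),
    ],
)
```

Here `rule.join_graph()` links A to B through `y` and B to C through `z`, while the negated atom D stays out of the join graph and is left for antijoin scheduling.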

DBSP Connection

The DBSP notes focus on incremental maintenance:

input deltas
-> maintained operator state
-> output deltas

DBSP gives the algebra and runtime model for maintained relational computation. It does not by itself solve source-language compilation, Datalog-specific optimization, Geolog law translation, CRDT-specific planning, or user-facing diagnostics.
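
The delta pipeline above can be illustrated with weighted sets (Z-sets), where an insertion carries weight +1 and a deletion carries weight -1. The sketch below is plain Python and not the API of any DBSP library; `zadd`, `zjoin`, and `IncrementalJoin` are hypothetical names. It applies the standard bilinear rule for a join: the output delta is (dL join R) + (L join dR) + (dL join dR).

```python
from collections import defaultdict


def zadd(a, b):
    """Add two Z-sets (dicts from row to integer weight), dropping zero weights."""
    out = defaultdict(int)
    for zset in (a, b):
        for row, weight in zset.items():
            out[row] += weight
    return {row: w for row, w in out.items() if w != 0}


def zjoin(left, right):
    """Join Z-sets of (key, value) rows on key; weights multiply."""
    out = defaultdict(int)
    for (k1, v1), w1 in left.items():
        for (k2, v2), w2 in right.items():
            if k1 == k2:
                out[(k1, v1, v2)] += w1 * w2
    return {row: w for row, w in out.items() if w != 0}


class IncrementalJoin:
    """Maintained operator state: the accumulated left and right inputs."""

    def __init__(self):
        self.left = {}
        self.right = {}

    def step(self, d_left, d_right):
        # Output delta uses the state *before* this step's deltas are applied.
        delta = zadd(
            zadd(zjoin(d_left, self.right), zjoin(self.left, d_right)),
            zjoin(d_left, d_right),
        )
        self.left = zadd(self.left, d_left)
        self.right = zadd(self.right, d_right)
        return delta


op = IncrementalJoin()
first = op.step({(1, "a"): 1}, {(1, "x"): 1})   # insert one row on each side
second = op.step({}, {(1, "y"): 1})             # insert a second right row
third = op.step({(1, "a"): -1}, {})             # delete the left row
```

The third step retracts both joined rows, which is the behavior a maintained join needs when an input fact is deleted.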

FlowLog's planning layer fits before DBSP:

rules
-> FlowLog-like planner
-> DBSP circuit

This division matters because, in an incremental system, a poor plan is not a one-time cost the way a bad one-shot query is: it becomes persistent maintained state. Bad join order, unnecessary intermediate fields, and late antijoins can increase memory and update cost for the life of the circuit.
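
A small illustration of the join-order point, using made-up data: the query R(x, y), S(y, z), T(z) is evaluated with two plans that return the same rows but materialize very different intermediates. The relation contents and function names are hypothetical, and the sizes are only meant to show the effect, not to model any real workload.

```python
def join_on_y(R, S):
    """R(x, y) joined with S(y, z) on y."""
    return {(x, y, z) for (x, y) in R for (y2, z) in S if y == y2}


R = {(x, 0) for x in range(100)}      # 100 rows, all with y = 0
S = {(0, z) for z in range(100)}      # 100 rows, all with y = 0
T = {(7,)}                            # one selective row

# Plan A: join R and S first, then filter by T.
rs = join_on_y(R, S)
result_a = {(x, y, z) for (x, y, z) in rs if (z,) in T}

# Plan B: filter S by T first, then join with R.
st = {(y, z) for (y, z) in S if (z,) in T}
result_b = join_on_y(R, st)

intermediate_a = len(rs)
intermediate_b = len(st)
```

Both plans produce the same 100 result rows, but Plan A holds 10,000 intermediate rows where Plan B holds one. In a maintained circuit that intermediate is stored and updated for as long as the query runs.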

CRDT Connection

The CRDT notes use Datalog to define visible state over immutable operation facts.

Simple register queries are already a good fit:

set + pred
-> overwritten
-> visible values
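
A minimal sketch of that register query, assuming a hypothetical schema in which `set_ops` holds (operation id, value) facts and `pred` holds (operation id, overwritten operation id) facts. The actual schema in the CRDT notes may differ.

```python
def visible_values(set_ops, pred):
    # overwritten(old) :- pred(_, old).
    overwritten = {old for (_new, old) in pred}
    # visible(value) :- set(op, value), not overwritten(op).
    return {value for (op, value) in set_ops if op not in overwritten}


set_ops = {("a1", "red"), ("b1", "blue"), ("a2", "green")}
pred = {("a2", "a1")}                  # a2 overwrites a1; b1 is concurrent
values = visible_values(set_ops, pred)
```

With these facts the register shows both `blue` and `green`, because neither `b1` nor `a2` has been overwritten.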

The harder cases are recursive and structural:

causal readiness
list traversal
tombstone skipping
move-like tree operations

FlowLog helps by making the expensive parts explicit:

causal-readiness recursion should be planned around frontiers when possible
list traversal should avoid carrying unnecessary fields through every intermediate
antijoins for tombstones should run as soon as their keys are available
repeated list subqueries should share intermediate relations where possible

The practical CRDT target is:

same CRDT rules
-> naive plan
-> FlowLog-style plan
-> DBSP-maintained result
-> hydration and warm-update comparison

Geomerge Connection

The Geomerge notes propose compiling supported laws into maintained violation relations.

The simplest useful form is:

required_consequent(x) :- antecedent(x).
violation(x) :- required_consequent(x), not consequent(x).
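
Read as set operations, the two rules amount to a set difference. The sketch below is a snapshot evaluation over plain Python sets, with hypothetical input values, and says nothing about how a maintained backend would compute the same relation.

```python
def violations(antecedent, consequent):
    # required_consequent(x) :- antecedent(x).
    required_consequent = set(antecedent)
    # violation(x) :- required_consequent(x), not consequent(x).
    return required_consequent - set(consequent)


found = violations(antecedent={1, 2, 3}, consequent={2})
```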

FlowLog helps once antecedents become multi-atom joins:

violation(x, y, z) :-
    A(x, y),
    B(y, z),
    C(z),
    not D(x, z).

At that point, a compiler needs the same machinery FlowLog demonstrates:

variable occurrence maps
join graph extraction
antijoin scheduling
projection minimization
shared antecedent detection
violation-row construction
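
Antijoin scheduling from this list can be sketched as a single pass over an already chosen order of positive atoms: each negated atom is emitted right after the first step at which all of its variables are bound. The `schedule` function and the step tuples are hypothetical, and the positive atom order is taken as given rather than optimized.

```python
def schedule(positive, negated):
    """positive, negated: lists of (relation, variables). Returns ordered plan steps."""
    plan = []
    bound = set()
    pending = list(negated)
    for relation, variables in positive:
        plan.append(("scan" if not plan else "join", relation, variables))
        bound |= set(variables)
        # Emit every negated atom whose variables are now all bound.
        for atom in [n for n in pending if set(n[1]) <= bound]:
            plan.append(("antijoin", atom[0], atom[1]))
            pending.remove(atom)
    if pending:
        # A negated atom with a variable bound by no positive atom is unsafe.
        raise ValueError(f"unsafe negation: {pending}")
    return plan


# violation(x, y, z) :- A(x, y), B(y, z), C(z), not D(x, z).
plan = schedule(
    positive=[("A", ("x", "y")), ("B", ("y", "z")), ("C", ("z",))],
    negated=[("D", ("x", "z"))],
)
```

For this rule, D is scheduled after B and before C, since `x` and `z` are both bound once A and B have been joined.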

The practical Geomerge target is:

FlatTheory laws
-> supported relational subset
-> rule catalog
-> planned violation query
-> DBSP-maintained violations relation

Recommended Architecture

The recommended architecture has a backend-neutral middle layer.

flowchart TB
    subgraph Sources["Source Layers"]
        Datalog["Datalog CRDT Rules"]
        Geolog["Compiled Geolog Laws"]
    end

    subgraph Planner["FlowLog-Inspired Planner"]
        Parse["Parse or Translate"]
        Strata["Stratify"]
        Catalog["Catalog Rules"]
        Graph["Join Graph Construction"]
        Optimize["Plan Joins, Antijoins, and SIP"]
        IR["Relational IR with Physical Keys"]
    end

    subgraph Backends["Execution Backends"]
        DBSP["DBSP"]
        DD["Differential Dataflow"]
        Batch["Snapshot Evaluator"]
    end

    Datalog --> Parse
    Geolog --> Parse
    Parse --> Strata --> Catalog --> Graph --> Optimize --> IR
    IR --> DBSP
    IR --> DD
    IR --> Batch

This architecture keeps the core questions separate:

source language
rule semantics
relational planning
physical execution backend
application integration

That separation makes experiments easier. If a query is slow, each layer can be examined on its own: the problem may lie in the rule semantics, the plan, the backend, or the storage boundary.
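
One way to picture the backend-neutral middle layer is a small operator tree that any backend can interpret. The sketch below defines a hypothetical IR (`Scan`, `Join`, `Antijoin`, `Project`) and a snapshot evaluator for it; none of these names come from FlowLog or DBSP, and the evaluator assumes no variable is repeated within one atom. Rows are represented as frozensets of (variable, value) pairs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    relation: str
    vars: tuple


@dataclass(frozen=True)
class Join:
    left: object
    right: object


@dataclass(frozen=True)
class Antijoin:
    left: object
    right: object


@dataclass(frozen=True)
class Project:
    child: object
    vars: tuple


def evaluate(node, db):
    """Snapshot evaluation; the backend only sees operators, never rule syntax."""
    if isinstance(node, Scan):
        return {frozenset(zip(node.vars, row)) for row in db[node.relation]}
    if isinstance(node, Join):
        out = set()
        right = evaluate(node.right, db)
        for l in evaluate(node.left, db):
            bindings = dict(l)
            for r in right:
                # Rows join when they agree on every shared variable.
                if all(bindings.get(v, val) == val for v, val in r):
                    out.add(l | r)
        return out
    if isinstance(node, Antijoin):
        right = evaluate(node.right, db)
        return {l for l in evaluate(node.left, db) if not any(r <= l for r in right)}
    if isinstance(node, Project):
        keep = set(node.vars)
        return {frozenset(p for p in row if p[0] in keep) for row in evaluate(node.child, db)}
    raise TypeError(f"unknown operator: {node!r}")


# violation(x, y, z) :- A(x, y), B(y, z), C(z), not D(x, z).
plan = Project(
    Join(
        Antijoin(Join(Scan("A", ("x", "y")), Scan("B", ("y", "z"))), Scan("D", ("x", "z"))),
        Scan("C", ("z",)),
    ),
    ("x", "y", "z"),
)
db = {"A": {(1, 2), (5, 6)}, "B": {(2, 3), (6, 7)}, "C": {(3,), (7,)}, "D": {(1, 3)}}
rows = evaluate(plan, db)
```

A DBSP or Differential Dataflow lowering would walk the same tree and emit maintained operators instead of sets, which is what keeps the comparison between backends fair.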

Practical Path

The practical path should be staged.

Stage 1: Planning-Only Prototype

Datalog-like rules
-> dependency graph
-> strata
-> rule catalog
-> join graph
-> textual plan

This validates whether the compiler understands rule shape.
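
The dependency-graph and strata steps of this stage can be sketched with the classic iterative stratum assignment: a head sits at least as high as each positive dependency and strictly higher than each negated one. The rule encoding below (head name plus a list of (predicate, negated) pairs) is an assumption for illustration. The function assigns stratum numbers only; grouping mutually recursive predicates would additionally need strongly connected components.

```python
def stratify(rules):
    """rules: list of (head, [(body_predicate, negated), ...]). Returns {predicate: stratum}."""
    preds = {head for head, _ in rules} | {p for _, body in rules for p, _ in body}
    stratum = {p: 0 for p in preds}
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            for pred, negated in body:
                need = stratum[pred] + (1 if negated else 0)
                if stratum[head] < need:
                    if need >= len(preds):
                        # Strata can only grow this far when negation sits on a cycle.
                        raise ValueError(f"not stratifiable: negation in recursion at {head}")
                    stratum[head] = need
                    changed = True
    return stratum


rules = [
    ("reach", [("edge", False)]),
    ("reach", [("reach", False), ("edge", False)]),
    ("unreachable", [("node", False), ("reach", True)]),
]
strata = stratify(rules)
```

Here `reach` stays in stratum 0 with its inputs, and `unreachable` moves to stratum 1 because it negates `reach`. A program such as `p :- not q. q :- not p.` is rejected.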

Stage 2: Narrow DBSP Lowering

planned rules
-> projection, selection, join, antijoin, union, distinct, recursion
-> DBSP circuit

This validates maintained outputs against a snapshot evaluator.
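
The validation idea can be sketched for transitive closure: a maintained relation is updated from deltas, and after every batch it is compared with a from-scratch snapshot. The maintained side below is a plain Python stand-in that handles insertions only, using semi-naive evaluation over the rule reach(a, d) :- reach(a, b), reach(b, d). It is not a DBSP circuit, and deletions would need weighted rows.

```python
def closure(edges):
    """Snapshot evaluator: naive fixpoint over the full edge set."""
    reach = set(edges)
    while True:
        new = {(a, d) for (a, b) in reach for (c, d) in edges if b == c} - reach
        if not new:
            return reach
        reach |= new


class MaintainedClosure:
    """Insert-only maintenance; each round joins only the newest facts."""

    def __init__(self):
        self.edges = set()
        self.reach = set()

    def insert(self, new_edges):
        self.edges |= set(new_edges)
        delta = set(new_edges) - self.reach
        added = set()
        while delta:
            self.reach |= delta
            added |= delta
            # New paths must use at least one new fact, on the left or on the right.
            nxt = {(a, d) for (a, b) in delta for (c, d) in self.reach if b == c}
            nxt |= {(a, d) for (a, b) in self.reach for (c, d) in delta if b == c}
            delta = nxt - self.reach
        return added


maintained = MaintainedClosure()
agreement = []
for batch in [[(1, 2), (2, 3)], [(3, 4)], [(4, 1)]]:
    maintained.insert(batch)
    agreement.append(maintained.reach == closure(maintained.edges))
```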

Stage 3: Workload Comparison

same rules
same facts
same outputs
-> DBSP backend
-> Differential Dataflow backend
-> snapshot backend

This identifies whether bottlenecks come from planning or backend behavior.

Stage 4: Geomerge Integration

supported FlatTheory laws
-> planned violation queries
-> maintained violations relation
-> agreement with current validator

Requiring agreement with the current validator makes the DBSP checker a performance optimization first, not a semantic change.

Test Workloads

The shared test corpus should include:

transitive closure
reachability
connected components
antijoin checks
multi-value register
causal readiness
list next-element traversal
tombstone skipping
missing foreign-key violations
multi-atom Geomerge antecedents

Each workload should have:

input schemas
base facts
update facts
expected snapshot output
expected output deltas
recursion and negation classification
accepted or rejected status
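
A workload record with these fields might look like the sketch below. The `Workload` class, its field names, and the `check` helper are hypothetical, and updates are modeled as insertions only. The example is a missing foreign-key violation, where inserting a parent row removes an existing violation, so removed rows are tracked alongside added ones.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    base_facts: dict          # relation name -> set of rows
    update_facts: dict        # rows to insert per relation
    expected_snapshot: set    # output after the update
    expected_added: set       # output delta: new rows
    expected_removed: set     # output delta: retracted rows
    recursive: bool = False
    uses_negation: bool = False
    accepted: bool = True     # whether the planner is expected to accept the rules


def check(workload, evaluate):
    """Run a snapshot evaluator before and after the update and compare all three expectations."""
    before = evaluate(workload.base_facts)
    merged = {rel: set(rows) for rel, rows in workload.base_facts.items()}
    for rel, rows in workload.update_facts.items():
        merged.setdefault(rel, set()).update(rows)
    after = evaluate(merged)
    return (
        after == workload.expected_snapshot
        and after - before == workload.expected_added
        and before - after == workload.expected_removed
    )


def fk_violations(db):
    # violation(c, p) :- child(c, p), not parent(p).
    parents = {p for (p,) in db["parent"]}
    return {(c, p) for (c, p) in db["child"] if p not in parents}


workload = Workload(
    name="missing foreign-key violations",
    base_facts={"child": {(1, 10), (2, 20)}, "parent": {(10,)}},
    update_facts={"child": {(3, 30)}, "parent": {(20,)}},
    expected_snapshot={(3, 30)},
    expected_added={(3, 30)},
    expected_removed={(2, 20)},
    uses_negation=True,
)
ok = check(workload, fk_violations)
```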

Evaluation Questions

The main evaluation questions are:

Does planning reduce hydration time?
Does planning reduce warm-update time?
Does causal-readiness update cost still grow with history depth?
Does antijoin scheduling reduce intermediate relation size?
Does SIP help frontier-shaped recursive queries?
Does physical key choice reduce maintained state?
Does the DBSP result match a snapshot evaluator?
Does Geomerge validation agree with the existing validator?
Is backend state rollback or preview execution tractable?

Bottom Line

The unified conclusion is:

FlowLog is the planning blueprint.
DBSP is the target incremental backend.
CRDTs and Geomerge laws are the motivating rule sources.

The next durable artifact should not be a full engine. It should be a small planner that can explain rule structure, join graphs, antijoin placement, and physical key choices. Once that explanation is correct, DBSP lowering becomes a narrower engineering problem.

Changelog

May 21, 2026 -- First version created to unify the first five FlowLog notes.
