Add two note files for FlowLog (primer and implementation)

2026-05-19 15:20:43 +02:00
commit 3d67b4994e
parent ee52b850e4
2 changed files with 616 additions and 0 deletions
--- a/flowlog/001-flowlog-primer.md
+++ b/flowlog/001-flowlog-primer.md
@@ -0,0 +1,262 @@
+# FlowLog Primer
+
+A primer on FlowLog as a Datalog engine built on Differential Dataflow.
+
+---
+
+## Short Answer
+
+FlowLog is a Datalog engine for recursive queries. It parses Datalog programs, stratifies rules, builds a relational intermediate representation,
+optimizes rule plans, and executes them with Differential Dataflow.
+
+The main idea is:
+
+```text
+Datalog rules
+-> relational rule plans
+-> Differential Dataflow operators
+-> maintained derived relations
+```
+
+FlowLog is not only a parser for Datalog. It is a query engine design that keeps Datalog-specific optimization visible before the program is lowered
+to a streaming dataflow backend.
+
+---
+
+## Why It Exists
+
+Datalog is useful for recursive computations:
+
+- graph reachability
+- transitive closure
+- program analysis, including static analysis
+- network and distributed-system rules
+- recursive data-cleaning or constraint logic
+
+The hard part is execution. Recursive Datalog can spend most of its time and memory on joins inside fixed-point loops. Bad join orders can create
+large intermediate relations, and the best order can vary by workload and iteration.
+
+FlowLog tries to keep three properties together:
+
+- Datalog-level expressiveness
+- incremental and parallel execution
+- query planning control before execution
+
+The design uses Differential Dataflow as the physical backend, but it does not translate Datalog directly into low-level dataflow code. It first
+creates an intermediate representation where Datalog-aware rewrites can happen.
+
+---
+
+## Datalog Model
+
+A Datalog program contains facts and rules.
+
+Input relations are extensional database predicates:
+
+```datalog
+.in
+.decl Arc(x: number, y: number)
+.input Arc.csv
+```
+
+Derived relations are intensional database predicates:
+
+```datalog
+.printsize
+.decl Tc(x: number, y: number)
+```
+
+Rules derive output facts from input and already-derived facts:
+
+```datalog
+.rule
+Tc(x, y) :- Arc(x, y).
+Tc(x, y) :- Arc(z, y), Tc(x, z).
+```
+
+This example computes transitive closure. The first rule copies direct edges into `Tc`. The second rule recursively extends paths.
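+
+As a small worked trace (the three-edge input is made up for illustration), suppose `Arc.csv` holds the chain 1 -> 2 -> 3 -> 4. Each round applies the
+rules to the facts found so far, and evaluation stops when a round adds nothing new:
+
+```text
+Arc:      (1,2) (2,3) (3,4)
+
+round 1:  Tc gains (1,2) (2,3) (3,4)    first rule copies every Arc fact
+round 2:  Tc gains (1,3) (2,4)          Tc(1,2) with Arc(2,3), Tc(2,3) with Arc(3,4)
+round 3:  Tc gains (1,4)                Tc(1,3) with Arc(3,4)
+round 4:  nothing new                   fixed point reached, Tc holds 6 facts
+```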
+
+---
+
+## Language Features
+
+FlowLog supports a practical Datalog dialect with:
+
+- relation declarations
+- CSV-style input and output directives
+- recursive rules
+- stratified negation
+- comparisons
+- arithmetic expressions
+- placeholder arguments with `_`
+- aggregation with `count`, `sum`, `min`, and `max`
+- optimization directives such as `.plan`, `.sip`, and `.optimize`
+
+Negation is written with `!`:
+
+```datalog
+indirect_only(x, z) :- edge(x, y), edge(y, z), !edge(x, z).
+```
+
+Aggregation appears in the head:
+
+```datalog
+count_paths(x, z, count(y)) :- edge(x, y), edge(y, z).
+```
+
+The implementation has limits. Aggregation support is constrained, arithmetic in rule heads is not fully stable in the artifact version, and compile
+times can be high because the backend depends on Differential Dataflow and Timely Dataflow.
+
+---
+
+## Execution Modes
+
+FlowLog has two execution modes.
+
+Batch mode is the default. It is intended for static Datalog evaluation: the input facts are loaded once and the derived relations are computed to a fixed point.
+
+Incremental mode uses integer differences so changes can be tracked as insertions and retractions. This fits incremental view maintenance, where input
+updates should produce output updates.
+
+The important distinction is:
+
+```text
+batch mode:
+  compute the fixed point for an input dataset
+
+incremental mode:
+  maintain derived results as facts change
+```
+
+The paper's benchmarks focus on batch execution, but the architecture is designed around incrementality.
+
+---
+
+## Differential Dataflow Role
+
+Differential Dataflow represents a collection as a stream of update records, each carrying data, a logical time, and a difference:
+
+```text
+(data, time, diff)
+```
+
+The `diff` field records multiplicity changes. Positive differences insert facts. Negative differences retract facts.
+
+Operators such as `map`, `filter`, `join`, `concat`, `distinct`, and `iterate` maintain output changes as input changes arrive. Joins use maintained
+indexes called arrangements.
+
+This makes Differential Dataflow a useful backend for Datalog because:
+
+- Datalog rules are relational queries.
+- Recursive rules need fixed-point iteration.
+- Semi-naive evaluation naturally works with deltas.
+- Maintained arrangements can avoid repeated full scans.
+
+FlowLog's job is to turn Datalog rules into a form that uses these backend operators efficiently.
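+
+To make the update triples concrete, here is an illustrative trace. It is not actual engine output. It assumes the transitive-closure rules from the
+Datalog Model section, an existing input of `Arc(1,2)`, `Arc(2,3)`, `Arc(3,4)`, and a made-up logical time `1` at which one edge is retracted:
+
+```text
+input change:
+  (Arc(2,3), 1, -1)
+
+output changes:
+  (Tc(2,3), 1, -1)
+  (Tc(1,3), 1, -1)
+  (Tc(2,4), 1, -1)
+  (Tc(1,4), 1, -1)
+```
+
+`Tc(1,2)` and `Tc(3,4)` do not appear because their derivations do not use the retracted edge. Only the differences flow through the operators.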
+
+---
+
+## Stratification
+
+FlowLog groups rules into strata using the dependency graph of the program.
+
+A rule depends on another rule if its body mentions the relation derived by that other rule. Mutually recursive rules form cycles in this graph, so
+they end up in the same strongly connected component. The engine evaluates strata in dependency order.
+
+The usual shape is:
+
+```text
+non-recursive strata
+-> recursive strata
+-> later strata that depend on earlier outputs
+```
+
+This matters for negation and recursion. Negation must be stratified so that no relation depends negatively on itself through a cycle. Recursive
+strata need fixed-point evaluation.
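+
+A small sketch of a program that needs two strata, written in the section layout used earlier in this note. The relation names and input files are
+made up for illustration. Listing several declarations under `.in` and declaring every derived relation under `.printsize` are assumptions:
+
+```datalog
+.in
+.decl Source(x: number)
+.input Source.csv
+.decl Node(x: number)
+.input Node.csv
+.decl Arc(x: number, y: number)
+.input Arc.csv
+
+.printsize
+.decl Reach(x: number)
+.decl Unreached(x: number)
+
+.rule
+Reach(x) :- Source(x).
+Reach(y) :- Reach(x), Arc(x, y).
+Unreached(x) :- Node(x), !Reach(x).
+```
+
+`Reach` depends on itself, so its two rules form a recursive stratum. `Unreached` depends negatively on `Reach`, so it must sit in a later stratum
+that is evaluated only after `Reach` has reached its fixed point.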
+
+---
+
+## Optimization Focus
+
+FlowLog's main contribution is not a new Datalog syntax. It is the optimization boundary between Datalog and Differential Dataflow.
+
+The system uses a relational intermediate representation per rule. That lets the optimizer reason about joins, filters, subplans, and recursive
+execution before lowering to physical dataflow operators.
+
+Two important optimizations are:
+
+- structural planning
+- sideways information passing
+
+Structural planning chooses join plans intended to avoid large intermediate results. It is robustness-oriented: avoid bad plans rather than assume
+perfect cardinality estimates.
+
+Sideways information passing uses semijoin-style prefiltering. It pushes known bindings or reachable values sideways through a rule so later joins see
+less irrelevant input.
+
+These two optimizations are complementary. Planning improves join shape. SIP reduces input size before the joins happen.
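+
+The SIP idea can be sketched as a hand-written rewrite. The relations `A`, `B`, `C`, `R` and the helper names `B_f` and `C_f` are hypothetical, and
+the rules FlowLog actually generates may look different. Starting from this rule:
+
+```datalog
+R(x, w) :- A(x, y), B(y, z), C(z, w).
+```
+
+an equivalent program with semijoin-style prefilters is:
+
+```datalog
+B_f(y, z) :- B(y, z), A(_, y).
+C_f(z, w) :- C(z, w), B_f(_, z).
+R(x, w) :- A(x, y), B_f(y, z), C_f(z, w).
+```
+
+`B_f` keeps only the `B` tuples whose `y` occurs in `A`, and `C_f` keeps only the `C` tuples whose `z` survives in `B_f`. The final rule derives the
+same `R` facts while joining smaller inputs.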
+
+---
+
+## Comparison with DBSP
+
+FlowLog and DBSP live in the same design neighborhood:
+
+```text
+relational rules
+-> incremental computation
+-> maintained output relations
+```
+
+The backend model differs.
+
+DBSP describes incremental view maintenance through streams, Z-sets, integration, differentiation, and circuit rewriting.
+
+Differential Dataflow describes incremental collections with logical times and differences, and it maintains arrangements for efficient joins over
+time.
+
+For the CRDT and Geomerge notes, FlowLog is useful because it emphasizes a lesson that also applies to DBSP: recursive Datalog performance depends
+heavily on physical planning. A declarative rule can be correct and still produce expensive intermediate state.
+
+---
+
+## When FlowLog Is Relevant
+
+FlowLog is relevant when the problem has:
+
+- recursive relational logic
+- large graph-shaped inputs
+- repeated joins inside fixed-point loops
+- a need for incremental maintenance
+- sensitivity to join order and memory use
+
+It is less directly relevant when the problem is mostly point lookups, simple filters, or small non-recursive validation queries. Those can be handled
+by simpler relational engines.
+
+The strongest use case is a Datalog workload where rule-level optimization and incremental execution both matter.
+
+---
+
+## Practical Mental Model
+
+FlowLog is best understood as:
+
+```text
+Datalog frontend
+ per-rule relational IR
+ recursive strata planning
+ robust join optimization
+ Differential Dataflow execution
+```
+
+Its central architectural choice is the split between logical Datalog planning and physical dataflow execution.
+
+That split is what makes the system useful to study. It shows how a Datalog engine can reuse a general incremental backend without giving up
+Datalog-specific optimization.
+
+---
+
+## Changelog
+
+* **May 19, 2026** -- First version created from the FlowLog paper and artifact.
--- a/flowlog/002-flowlog-implementation.md
+++ b/flowlog/002-flowlog-implementation.md
@@ -0,0 +1,354 @@
+# FlowLog Implementation
+
+A reading note on the implementation shape of FlowLog's Rust artifact.
+
+---
+
+## Short Answer
+
+FlowLog is implemented as a Rust workspace with separate crates for parsing, stratification, catalog construction, logical planning, optimization,
+input reading, execution, and code-generation macros.
+
+The implementation path is:
+
+```text
+.dl file
+-> parser
+-> strata
+-> catalog
+-> program query plan
+-> grouped strata plans
+-> Differential Dataflow dataflow
+-> output relation sizes or CSVs
+```
+
+The most important implementation idea is that FlowLog does not treat Differential Dataflow as a direct code-generation target from raw Datalog. It
+first builds logical rule plans with explicit collection signatures and transformation flows.
+
+---
+
+## Workspace Shape
+
+The artifact is organized as a Rust workspace with crates that line up with the execution pipeline.
+
+`parsing` parses the Datalog dialect. It uses a grammar with declarations, input directives, output directives, rules, negation, comparisons,
+arithmetic, and aggregation.
+
+`strata` builds dependency information and groups rules into strata. Recursive rules are identified through the dependency graph.
+
+`catalog` turns parsed rules into metadata. This includes atoms, head arguments, filters, comparisons, arithmetic expressions, aggregation heads, and
+rule structure.
+
+`planning` creates logical query plans from catalogs and strata. This is where rule bodies become transformation chains and where join structure is
+represented.
+
+`optimizing` chooses structural join plans. It reasons over the variable overlap among rule atoms and selects plan trees.
+
+`reading` loads input relations from files and represents rows, relation sessions, semiring-like weights, and arrangements.
+
+`executing` builds and runs the Differential Dataflow graph. It owns command-line handling, dataflow construction, operators, collectors, and output
+inspection.
+
+`macros` provides Rust macros that generate specialized operator code for different key and value arities.
+
+---
+
+## Frontend
+
+The frontend grammar supports sections like:
+
+```datalog
+.in
+.decl Arc(x: number, y: number)
+.input Arc.csv
+
+.printsize
+.decl Tc(x: number, y: number)
+
+.rule
+Tc(x, y) :- Arc(x, y).
+Tc(x, y) :- Arc(z, y), Tc(x, z).
+```
+
+The parser distinguishes:
+
+- extensional declarations
+- intensional declarations
+- rule heads
+- positive atoms
+- negated atoms
+- comparisons
+- constants
+- placeholders
+- aggregate heads
+
+This is more schema-driven than the small Datalog examples used for teaching. Relation declarations give the engine names and arities before planning.
+
+---
+
+## Strata and Program Plans
+
+After parsing, FlowLog builds strata from rule dependencies.
+
+The `ProgramQueryPlan` is created from strata. It iterates through each stratum, builds a `Catalog` for each rule, decides whether SIP and structural
+planning apply, expands SIP rules when needed, and converts each catalog into a `RuleQueryPlan`.
+
+The optimizer is activated only for rules with more than two core atoms. This is a pragmatic choice: a rule with one or two atoms has at most one
+join, so there is no join order to choose.
+
+The planning level also tracks whether a stratum is recursive. Recursive and non-recursive groups are executed differently later.
+
+The key shape is:
+
+```text
+Strata
+-> Catalog per rule
+-> RuleQueryPlan per catalog
+-> GroupStrataQueryPlan
+-> ProgramQueryPlan
+```
+
+---
+
+## Catalog Role
+
+The catalog is the bridge between syntax and planning.
+
+It records which rule atoms are core relational inputs, which terms are filters or constraints, which variables occur where, and how the rule head
+relates to the body.
+
+Planning needs this metadata to answer questions such as:
+
+- Which atoms should participate in joins?
+- Which atom arguments are shared variables?
+- Which comparisons can be applied locally?
+- Which projected fields are needed in the output?
+- Which atoms are negated?
+- Which rules derive the same output relation?
+
+Without this catalog layer, the executor would have to rediscover semantic information from syntax during physical dataflow construction.
+
+---
+
+## Collection Signatures
+
+FlowLog lowers relations into collection signatures that distinguish row, key, and value shapes.
+
+Differential Dataflow joins are easiest to express over key-value collections. FlowLog therefore maps relation tuples into several physical forms:
+
+- row collections for plain tuples
+- key-only collections for semijoin and antijoin support
+- key-value collections for joins
+
+The executor keeps maps for these forms:
+
+```text
+row_map
+kv_map
+k_map
+```
+
+This is a central implementation detail. A Datalog relation looks like one logical predicate, but execution may maintain several arranged physical
+views of it.
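+
+As an illustration, consider the recursive rule `Tc(x, y) :- Arc(z, y), Tc(x, z).` from the frontend example. The join variable is `z`, so one
+plausible physical layout is shown below. The exact layout FlowLog picks is an assumption here:
+
+```text
+Arc(z, y)  as key-value:  key = (z)   value = (y)
+Tc(x, z)   as key-value:  key = (z)   value = (x)
+
+join on key (z)
+-> output row (x, y) for Tc
+```
+
+`Tc` may also be needed as a plain row collection for output, so one logical predicate can end up with more than one physical view.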
+
+---
+
+## Transformation Flow
+
+The planning layer represents operations with transformation flows.
+
+A unary transformation maps one collection to another:
+
+```text
+KVToKV
+```
+
+This covers projection, filtering, local constraints, and reshaping a tuple into key-value form.
+
+A binary transformation represents a join-like step:
+
+```text
+JnToKV
+```
+
+This maps joined key-value inputs into a new output key and value shape.
+
+Each transformation flow tracks:
+
+- output key arguments
+- output value arguments
+- local constraints
+- comparison expressions
+- how input fields flow to output fields
+
+This lets the executor generate a specific Differential Dataflow operator while the planner remains backend-independent enough to reason about rule
+structure.
+
+---
+
+## Structural Planning
+
+The optimizer builds a plan tree for the core atoms of a rule.
+
+The default plan is essentially a chain following the rule's atom order. The optimized plan searches for a better tree by looking at variable overlap
+among atoms.
+
+The optimizer uses a maximum-spanning-tree-style search over atom overlaps. It then evaluates candidate trees with a width measure and a depth
+tie-breaker.
+
+The goal is not perfect cardinality estimation. The goal is robust plan shape:
+
+- cross-product avoidance when possible
+- smaller intermediate relation width
+- earlier joins between atoms that share variables
+- lower chance of large maintained join state
+
+This fits recursive Datalog because reliable static cardinality estimates are hard. A robustness-oriented heuristic is often more useful than a
+fragile cost model.
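+
+A small example of why atom order matters, using hypothetical relations `A`, `B`, `C`, and `P`. The rule is written so that its first two atoms
+share no variable:
+
+```datalog
+P(x, w) :- A(x, y), C(z, w), B(y, z).
+```
+
+The variable overlaps and the two candidate join shapes are:
+
+```text
+variable overlap between atoms:
+  A - B : y
+  B - C : z
+  A - C : none
+
+chain in written order:  (A x C) join B       cross product first
+overlap-guided tree:     (A join B) join C    every step joins on a shared variable
+```
+
+A maximum spanning tree over these overlaps keeps the A-B and B-C edges and drops A-C. This is a sketch of the reasoning, not the optimizer's actual
+output or scoring.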
+
+---
+
+## Sideways Information Passing
+
+Sideways information passing is a rule transformation that creates semijoin-style filters.
+
+The practical goal is:
+
+```text
+known useful bindings
+-> prefilter later atoms
+-> smaller join inputs
+-> less intermediate state
+```
+
+In the implementation, enabling SIP can expand a catalog into multiple catalogs. For non-recursive strata, this may split one group into several
+cascading groups so the generated filters can feed later steps.
+
+This is why planning and stratification interact. SIP is not just a local operator rewrite. It can change the shape of the stratum plan.
+
+---
+
+## Executor Shape
+
+The executor creates a Timely dataflow and then builds Differential Dataflow collections inside it.
+
+At startup, it creates input sessions for every extensional relation. Those sessions are used to load facts from files.
+
+For each stratum group, execution branches by recursion:
+
+- non-recursive groups are built as straight-line transformations
+- recursive groups are built inside an iterative scope
+
+Non-recursive execution walks each transformation and constructs the matching dataflow operator. Outputs are stored back into `row_map`, `kv_map`, or
+`k_map` depending on their physical shape.
+
+Recursive execution creates iterative variables for intensional relations and repeatedly applies the recursive transformations until convergence.
+
+Collectors merge rule outputs for the same intensional predicate. Inspectors print relation sizes or emit outputs.
+
+---
+
+## Physical Operators
+
+The executor has operators corresponding to the physical collection shapes:
+
+- row to row
+- row to key
+- row to key-value
+- key-value join key-value
+- key-value join key
+- key join key
+- cartesian product
+- key-value antijoin key
+- key antijoin key
+
+The implementation arranges collections when needed. Arrangements are Differential Dataflow's indexed representation for joins and repeated access.
+
+This is why FlowLog cares about key and value arity. The physical shape determines which macro-generated operator can be used and whether the runtime
+needs a fallback representation.
+
+---
+
+## Arity Strategy
+
+FlowLog uses specialized fixed-size representations for common arities and a fallback mode for wider tuples.
+
+The program plan can compute maximal key and value arity pairs. If a query exceeds the arities covered by the fixed-size representations, the fallback, called fat mode, is required.
+
+This is a performance engineering detail: Datalog workloads can produce wide intermediate tuples, but specializing small tuples can reduce allocation
+and dynamic dispatch overhead.
+
+---
+
+## Batch and Incremental Weights
+
+FlowLog has two build modes for weights.
+
+Batch mode uses a presence-style difference type. This is suited for static Datalog workloads where a fact is either present or absent.
+
+Incremental mode uses signed integer differences. This can represent insertions, deletions, and multiplicities.
+
+At the implementation level, this means the same logical engine can target:
+
+```text
+static fixed-point computation
+incremental maintenance over changing inputs
+```
+
+The paper's artifact focuses on batch benchmarks, but the backend model is compatible with incremental updates.
+
+---
+
+## Important Limitations
+
+The artifact has several important limitations:
+
+- release builds can be slow because of large Timely and Differential Dataflow dependencies
+- aggregation support is constrained
+- arithmetic in rule heads is unstable in the artifact version
+- some optimization paths are controlled by flags or rule directives
+- SIP currently has implementation-specific handling in stratum grouping
+- output support is more oriented around relation sizes and CSV dumps than an embedded application API
+
+These are acceptable for a research artifact, but they matter if comparing FlowLog to an embedded query engine for an application.
+
+---
+
+## Lessons for Other Engines
+
+FlowLog's implementation suggests several reusable lessons.
+
+A Datalog engine benefits from an explicit rule catalog. It gives optimization and execution a shared view of variables, atoms, filters, and heads.
+
+Recursive evaluation should not hide join planning. The rules inside a fixed-point loop are where bad plans become expensive.
+
+Physical arrangements are part of the query plan. If the backend needs key-value indexes, the logical planner should expose key choices explicitly.
+
+Optimization can be robustness-first. Recursive workloads may not have stable enough statistics for a conventional cost model.
+
+The frontend and backend should stay separated. Datalog syntax, relational rule planning, and Differential Dataflow execution are different concerns.
+
+---
+
+## Practical Mental Model
+
+FlowLog's implementation can be read as:
+
+```text
+parser and schema loader
+ dependency and strata analyzer
+ rule catalog
+ relational transformation planner
+ robust join planner
+ SIP expander
+ Differential Dataflow executor
+```
+
+The implementation is valuable because it shows the concrete machinery needed between a compact Datalog rule and an efficient incremental dataflow
+program.
+
+---
+
+## Changelog
+
+* **May 19, 2026** -- First version created from the FlowLog paper and artifact.