From 3d67b4994ebf1a16f658bbc6846ca18b246cb0ef Mon Sep 17 00:00:00 2001
From: Hassan Abedi
Date: Tue, 19 May 2026 15:20:43 +0200
Subject: [PATCH] Add two note files for FlowLog (primer and implementation)

---
 flowlog/001-flowlog-primer.md         | 262 +++++++++++++++++++
 flowlog/002-flowlog-implementation.md | 354 ++++++++++++++++++++++++++
 2 files changed, 616 insertions(+)
 create mode 100644 flowlog/001-flowlog-primer.md
 create mode 100644 flowlog/002-flowlog-implementation.md

diff --git a/flowlog/001-flowlog-primer.md b/flowlog/001-flowlog-primer.md
new file mode 100644
index 0000000..1ef043b
--- /dev/null
+++ b/flowlog/001-flowlog-primer.md
@@ -0,0 +1,262 @@
+# FlowLog Primer
+
+A primer on FlowLog as a Datalog engine built on Differential Dataflow.
+
+---
+
+## Short Answer
+
+FlowLog is a Datalog engine for recursive queries. It parses Datalog programs, stratifies rules, builds a relational intermediate representation,
+optimizes rule plans, and executes them with Differential Dataflow.
+
+The main idea is:
+
+```text
+Datalog rules
+-> relational rule plans
+-> Differential Dataflow operators
+-> maintained derived relations
+```
+
+FlowLog is not only a parser for Datalog. It is a query engine design that keeps Datalog-specific optimization visible before the program is lowered
+to a streaming dataflow backend.
+
+---
+
+## Why It Exists
+
+Datalog is useful for recursive computations:
+
+- graph reachability
+- transitive closure
+- program analysis
+- static analysis
+- network and distributed-system rules
+- recursive data-cleaning or constraint logic
+
+The hard part is execution. Recursive Datalog can spend most of its time and memory on joins inside fixed-point loops. Bad join orders can create
+large intermediate relations, and the best order can vary by workload and iteration.
+
+FlowLog tries to keep three properties together:
+
+- Datalog-level expressiveness
+- incremental and parallel execution
+- query planning control before execution
+
+The design uses Differential Dataflow as the physical backend, but it does not translate Datalog directly into low-level dataflow code. It first
+creates an intermediate representation where Datalog-aware rewrites can happen.
+
+---
+
+## Datalog Model
+
+A Datalog program contains facts and rules.
+
+Input relations are extensional database predicates:
+
+```datalog
+.in
+.decl Arc(x: number, y: number)
+.input Arc.csv
+```
+
+Derived relations are intensional database predicates:
+
+```datalog
+.printsize
+.decl Tc(x: number, y: number)
+```
+
+Rules derive output facts from input and already-derived facts:
+
+```datalog
+.rule
+Tc(x, y) :- Arc(x, y).
+Tc(x, y) :- Arc(z, y), Tc(x, z).
+```
+
+This example computes transitive closure. The first rule copies direct edges into `Tc`. The second rule recursively extends paths.
+
+---
+
+## Language Features
+
+FlowLog supports a practical Datalog dialect with:
+
+- relation declarations
+- CSV-style input and output directives
+- recursive rules
+- stratified negation
+- comparisons
+- arithmetic expressions
+- placeholder arguments with `_`
+- aggregation with `count`, `sum`, `min`, and `max`
+- optimization directives such as `.plan`, `.sip`, and `.optimize`
+
+Negation is written with `!`:
+
+```datalog
+indirect_only(x, z) :- edge(x, y), edge(y, z), !edge(x, z).
+```
+
+Aggregation appears in the head:
+
+```datalog
+count_paths(x, z, count(y)) :- edge(x, y), edge(y, z).
+```
+
+The implementation has limits. Aggregation support is constrained, arithmetic in rule heads is not fully stable in the artifact version, and compile
+times can be high because the backend depends on Differential Dataflow and Timely Dataflow.
+
+---
+
+## Execution Modes
+
+FlowLog has two execution modes.
+
+Batch mode is the default. It is intended for static Datalog evaluation where the input facts are loaded and the derived relations are computed.
+
+Incremental mode uses integer differences so changes can be tracked as insertions and retractions. This fits incremental view maintenance, where input
+updates should produce output updates.
+
+The important distinction is:
+
+```text
+batch mode:
+  compute the fixed point for an input dataset
+
+incremental mode:
+  maintain derived results as facts change
+```
+
+The paper benchmarks focus on batch execution, but the architecture is designed around incrementality.
+
+---
+
+## Differential Dataflow Role
+
+Differential Dataflow represents a collection as a stream of update records, each carrying data, a logical time, and a difference:
+
+```text
+(data, time, diff)
+```
+
+The `diff` field records multiplicity changes. Positive differences insert facts. Negative differences retract facts.
+
+Operators such as `map`, `filter`, `join`, `concat`, `distinct`, and `iterate` maintain output changes as input changes arrive. Joins use maintained
+indexes called arrangements.
+
+This makes Differential Dataflow a useful backend for Datalog because:
+
+- Datalog rules are relational queries.
+- Recursive rules need fixed-point iteration.
+- Semi-naive evaluation naturally works with deltas.
+- Maintained arrangements can avoid repeated full scans.
+
+FlowLog's job is to turn Datalog rules into a form that uses these backend operators efficiently.
+
+---
+
+## Stratification
+
+FlowLog groups rules into strata using the dependency graph of the program.
+
+A rule depends on another rule if its body mentions the relation derived by that other rule. Recursive rules appear in strongly connected components.
+The engine evaluates strata in dependency order.
+
+The usual shape is:
+
+```text
+non-recursive strata
+-> recursive strata
+-> later strata that depend on earlier outputs
+```
+
+This matters for negation and recursion. Negation must be stratified so that no relation depends negatively on itself through a cycle. Recursive strata
+need fixed-point evaluation.
+
+---
+
+## Optimization Focus
+
+FlowLog's main contribution is not a new Datalog syntax. It is the optimization boundary between Datalog and Differential Dataflow.
+
+The system uses a relational intermediate representation per rule. That lets the optimizer reason about joins, filters, subplans, and recursive
+execution before lowering to physical dataflow operators.
+
+Two important optimizations are:
+
+- structural planning
+- sideways information passing
+
+Structural planning chooses join plans intended to avoid large intermediate results. It is robustness-oriented: avoid bad plans rather than assume
+perfect cardinality estimates.
+
+Sideways information passing uses semijoin-style prefiltering. It pushes known bindings or reachable values sideways through a rule so later joins see
+less irrelevant input.
+
+These two optimizations are complementary. Planning improves join shape. SIP reduces input size before the joins happen.
+
+---
+
+## Comparison with DBSP
+
+FlowLog and DBSP live in the same design neighborhood:
+
+```text
+relational rules
+-> incremental computation
+-> maintained output relations
+```
+
+The backend model differs.
+
+DBSP describes incremental view maintenance through streams, Z-sets, integration, differentiation, and circuit rewriting.
+
+Differential Dataflow describes incremental collections with logical times and differences, and it maintains arrangements for efficient joins over
+time.
+
+For the CRDT and Geomerge notes, FlowLog is useful because it emphasizes a lesson that also applies to DBSP: recursive Datalog performance depends
+heavily on physical planning. A declarative rule can be correct and still produce expensive intermediate state.
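The point about expensive intermediate state can be made concrete with a small sketch. It is plain Rust, not FlowLog or Differential Dataflow code, and every name in it (`join_on_middle`, `plan_chain`, `plan_cross`, `sample_relations`) is hypothetical. It evaluates the body of an assumed rule `Out(a, d) :- R(a, b), S(b, c), T(c, d).` under two join orders and reports how many intermediate tuples each order materializes.

```rust
use std::collections::HashSet;

type Pair = (u32, u32);

/// Joins `left` (x, k) with `right` (k, y) on the shared value `k`.
/// A nested loop keeps the sketch short; a real engine would use an index.
fn join_on_middle(left: &[Pair], right: &[Pair]) -> Vec<(u32, u32, u32)> {
    let mut out = Vec::new();
    for &(x, k1) in left {
        for &(k2, y) in right {
            if k1 == k2 {
                out.push((x, k1, y));
            }
        }
    }
    out
}

/// Plan A: (R join S) join T. Returns (intermediate size, final result).
fn plan_chain(r: &[Pair], s: &[Pair], t: &[Pair]) -> (usize, HashSet<Pair>) {
    let rs = join_on_middle(r, s); // tuples (a, b, c)
    let mut out = HashSet::new();
    for &(a, _b, c) in &rs {
        for &(c2, d) in t {
            if c == c2 {
                out.insert((a, d));
            }
        }
    }
    (rs.len(), out)
}

/// Plan B: (R x T) filtered by S. R and T share no variable, so the
/// first step is a cross product.
fn plan_cross(r: &[Pair], s: &[Pair], t: &[Pair]) -> (usize, HashSet<Pair>) {
    let mut cross = Vec::new(); // tuples (a, b, c, d)
    for &(a, b) in r {
        for &(c, d) in t {
            cross.push((a, b, c, d));
        }
    }
    let s_set: HashSet<Pair> = s.iter().copied().collect();
    let out: HashSet<Pair> = cross
        .iter()
        .filter(|&&(_, b, c, _)| s_set.contains(&(b, c)))
        .map(|&(a, _, _, d)| (a, d))
        .collect();
    (cross.len(), out)
}

/// Assumed toy data: R and T are wide, S is highly selective.
fn sample_relations() -> (Vec<Pair>, Vec<Pair>, Vec<Pair>) {
    let r: Vec<Pair> = (0..50).map(|i| (i, i)).collect();
    let s: Vec<Pair> = vec![(0, 0)];
    let t: Vec<Pair> = (0..50).map(|j| (j, j)).collect();
    (r, s, t)
}

fn main() {
    let (r, s, t) = sample_relations();
    let (chain_mid, chain_out) = plan_chain(&r, &s, &t);
    let (cross_mid, cross_out) = plan_cross(&r, &s, &t);
    // Both plans are correct: they derive the same output facts.
    assert_eq!(chain_out, cross_out);
    println!("chain plan intermediate tuples: {chain_mid}");
    println!("cross plan intermediate tuples: {cross_mid}");
}
```

Both plans derive the same facts, but the cross-product plan materializes every pairing of `R` and `T` before `S` filters anything. Inside a fixed-point loop that cost would recur on each iteration.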
+
+---
+
+## When FlowLog Is Relevant
+
+FlowLog is relevant when the problem has:
+
+- recursive relational logic
+- large graph-shaped inputs
+- repeated joins inside fixed-point loops
+- a need for incremental maintenance
+- sensitivity to join order and memory use
+
+It is less directly relevant when the problem is mostly point lookups, simple filters, or small non-recursive validation queries. Those can be handled
+by simpler relational engines.
+
+The strongest use case is a Datalog workload where rule-level optimization and incremental execution both matter.
+
+---
+
+## Practical Mental Model
+
+FlowLog is best understood as:
+
+```text
+Datalog frontend
++ per-rule relational IR
++ recursive strata planning
++ robust join optimization
++ Differential Dataflow execution
+```
+
+Its central architectural choice is the split between logical Datalog planning and physical dataflow execution.
+
+That split is what makes the system useful to study. It shows how a Datalog engine can reuse a general incremental backend without giving up
+Datalog-specific optimization.
+
+---
+
+## Changelog
+
+* **May 19, 2026** -- First version created from the FlowLog paper and artifact.
diff --git a/flowlog/002-flowlog-implementation.md b/flowlog/002-flowlog-implementation.md
new file mode 100644
index 0000000..8133067
--- /dev/null
+++ b/flowlog/002-flowlog-implementation.md
@@ -0,0 +1,354 @@
+# FlowLog Implementation
+
+A reading note on the implementation shape of FlowLog's Rust artifact.
+
+---
+
+## Short Answer
+
+FlowLog is implemented as a Rust workspace with separate crates for parsing, stratification, catalog construction, logical planning, optimization,
+input reading, execution, and code-generation macros.
+
+The implementation path is:
+
+```text
+.dl file
+-> parser
+-> strata
+-> catalog
+-> program query plan
+-> grouped strata plans
+-> Differential Dataflow dataflow
+-> output relation sizes or CSVs
+```
+
+The most important implementation idea is that FlowLog does not treat Differential Dataflow as a direct code-generation target from raw Datalog. It
+first builds logical rule plans with explicit collection signatures and transformation flows.
+
+---
+
+## Workspace Shape
+
+The artifact is organized as a Rust workspace with crates that line up with the execution pipeline.
+
+`parsing` parses the Datalog dialect. It uses a grammar with declarations, input directives, output directives, rules, negation, comparisons,
+arithmetic, and aggregation.
+
+`strata` builds dependency information and groups rules into strata. Recursive rules are identified through the dependency graph.
+
+`catalog` turns parsed rules into metadata. This includes atoms, head arguments, filters, comparisons, arithmetic expressions, aggregation heads, and
+rule structure.
+
+`planning` creates logical query plans from catalogs and strata. This is where rule bodies become transformation chains and where join structure is
+represented.
+
+`optimizing` chooses structural join plans. It reasons over the variable overlap among rule atoms and selects plan trees.
+
+`reading` loads input relations from files and represents rows, relation sessions, semiring-like weights, and arrangements.
+
+`executing` builds and runs the Differential Dataflow graph. It owns command-line handling, dataflow construction, operators, collectors, and output
+inspection.
+
+`macros` provides Rust macros that generate specialized operator code for different key and value arities.
+
+---
+
+## Frontend
+
+The frontend grammar supports sections like:
+
+```datalog
+.in
+.decl Arc(x: number, y: number)
+.input Arc.csv
+
+.printsize
+.decl Tc(x: number, y: number)
+
+.rule
+Tc(x, y) :- Arc(x, y).
+Tc(x, y) :- Arc(z, y), Tc(x, z).
+```
+
+The parser distinguishes:
+
+- extensional declarations
+- intensional declarations
+- rule heads
+- positive atoms
+- negated atoms
+- comparisons
+- constants
+- placeholders
+- aggregate heads
+
+This is more schema-driven than small teaching Datalog examples. Relation declarations give the engine names and arities before planning.
+
+---
+
+## Strata and Program Plans
+
+After parsing, FlowLog builds strata from rule dependencies.
+
+The `ProgramQueryPlan` is created from strata. It iterates through each stratum, builds a `Catalog` for each rule, decides whether SIP and structural
+planning apply, expands SIP rules when needed, and converts each catalog into a `RuleQueryPlan`.
+
+The optimizer is activated only for rules with more than two core atoms. This is a pragmatic choice: a one-atom rule has no join to order and a
+two-atom rule has only one, so optimizing them has little value.
+
+The planning level also tracks whether a stratum is recursive. Recursive and non-recursive groups are executed differently later.
+
+The key shape is:
+
+```text
+Strata
+-> Catalog per rule
+-> RuleQueryPlan per catalog
+-> GroupStrataQueryPlan
+-> ProgramQueryPlan
+```
+
+---
+
+## Catalog Role
+
+The catalog is the bridge between syntax and planning.
+
+It records which rule atoms are core relational inputs, which terms are filters or constraints, which variables occur where, and how the rule head
+relates to the body.
+
+Planning needs this metadata to answer questions such as:
+
+- Which atoms should participate in joins?
+- Which atom arguments are shared variables?
+- Which comparisons can be applied locally?
+- Which projected fields are needed in the output?
+- Which atoms are negated?
+- Which rules derive the same output relation?
+
+Without this catalog layer, the executor would have to rediscover semantic information from syntax during physical dataflow construction.
+
+---
+
+## Collection Signatures
+
+FlowLog lowers relations into collection signatures that distinguish row, key, and value shapes.
+
+Differential Dataflow joins are easiest to express over key-value collections. FlowLog therefore maps relation tuples into several physical forms:
+
+- row collections for plain tuples
+- key-only collections for semijoin and antijoin support
+- key-value collections for joins
+
+The executor keeps maps for these forms:
+
+```text
+row_map
+kv_map
+k_map
+```
+
+This is a central implementation detail. A Datalog relation looks like one logical predicate, but execution may maintain several arranged physical
+views of it.
+
+---
+
+## Transformation Flow
+
+The planning layer represents operations with transformation flows.
+
+A unary transformation maps one collection to another:
+
+```text
+KVToKV
+```
+
+This covers projection, filtering, local constraints, and reshaping a tuple into key-value form.
+
+A binary transformation represents a join-like step:
+
+```text
+JnToKV
+```
+
+This maps joined key-value inputs into a new output key and value shape.
+
+Each transformation flow tracks:
+
+- output key arguments
+- output value arguments
+- local constraints
+- comparison expressions
+- how input fields flow to output fields
+
+This lets the executor generate a specific Differential Dataflow operator while the planner remains backend-independent enough to reason about rule
+structure.
+
+---
+
+## Structural Planning
+
+The optimizer builds a plan tree for the core atoms of a rule.
+
+The default plan is essentially a chain following the rule's atom order. The optimized plan searches for a better tree by looking at variable overlap
+among atoms.
+
+The optimizer uses a maximum-spanning-tree-style search over atom overlaps. Then it evaluates candidate trees with a width measure and a depth
+tie-breaker.
+
+The goal is not perfect cardinality estimation. The goal is robust plan shape:
+
+- cross-product avoidance when possible
+- smaller intermediate relation width
+- earlier joins between atoms that share variables
+- lower chance of large maintained join state
+
+This fits recursive Datalog because reliable static cardinality estimates are hard. A robustness-oriented heuristic is often more useful than a
+fragile cost model.
+
+---
+
+## Sideways Information Passing
+
+Sideways information passing is a rule transformation that creates semijoin-style filters.
+
+The practical goal is:
+
+```text
+known useful bindings
+-> prefilter later atoms
+-> smaller join inputs
+-> less intermediate state
+```
+
+In the implementation, enabling SIP can expand a catalog into multiple catalogs. For non-recursive strata, this may split one group into several
+cascading groups so the generated filters can feed later steps.
+
+This is why planning and stratification interact. SIP is not just a local operator rewrite. It can change the shape of the stratum plan.
+
+---
+
+## Executor Shape
+
+The executor creates a Timely dataflow and then builds Differential Dataflow collections inside it.
+
+At startup, it creates input sessions for every extensional relation. Those sessions are used to load facts from files.
+
+For each stratum group, execution branches by recursion:
+
+- non-recursive groups are built as straight-line transformations
+- recursive groups are built inside an iterative scope
+
+Non-recursive execution walks each transformation and constructs the matching dataflow operator. Outputs are stored back into `row_map`, `kv_map`, or
+`k_map` depending on their physical shape.
+
+Recursive execution creates iterative variables for intensional relations and repeatedly applies the recursive transformations until convergence.
+
+Collectors merge rule outputs for the same intensional predicate. Inspectors print relation sizes or emit outputs.
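The "apply until convergence" loop can be sketched without any dataflow library. The following is an assumption-laden illustration in plain Rust, not the executor's code: it runs semi-naive evaluation of the `Tc` rules from the frontend example over hash sets, where the hypothetical `arc_by_src` index loosely stands in for an arranged key-value view of `Arc` and `delta` stands in for the changes an iterative scope would circulate.

```rust
use std::collections::{HashMap, HashSet};

type Edge = (u32, u32);

/// Semi-naive evaluation of:
///   Tc(x, y) :- Arc(x, y).
///   Tc(x, y) :- Arc(z, y), Tc(x, z).
fn transitive_closure(arc: &[Edge]) -> HashSet<Edge> {
    // Index Arc by its first column so a delta fact Tc(x, z) can look up
    // every matching Arc(z, y) without scanning the whole relation.
    let mut arc_by_src: HashMap<u32, Vec<u32>> = HashMap::new();
    for &(z, y) in arc {
        arc_by_src.entry(z).or_default().push(y);
    }

    // Non-recursive rule: every Arc fact is a Tc fact.
    let mut tc: HashSet<Edge> = arc.iter().copied().collect();
    // The first delta is everything derived so far.
    let mut delta: Vec<Edge> = tc.iter().copied().collect();

    // Recursive rule: join only the newly derived facts against Arc,
    // and stop once an iteration derives nothing new.
    while !delta.is_empty() {
        let mut next = Vec::new();
        for &(x, z) in &delta {
            if let Some(ys) = arc_by_src.get(&z) {
                for &y in ys {
                    // `insert` returns true only for facts not seen before.
                    if tc.insert((x, y)) {
                        next.push((x, y));
                    }
                }
            }
        }
        delta = next;
    }
    tc
}

fn main() {
    let arc = [(1, 2), (2, 3), (3, 4)];
    let tc = transitive_closure(&arc);
    println!("|Tc| = {}", tc.len());
}
```

Joining only `delta` against `Arc` is what keeps each iteration proportional to the new facts. A Differential Dataflow backend gets a comparable effect from its difference-based updates instead of an explicit loop variable.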
+
+---
+
+## Physical Operators
+
+The executor has operators corresponding to the physical collection shapes:
+
+- row to row
+- row to key
+- row to key-value
+- key-value join key-value
+- key-value join key
+- key join key
+- cartesian product
+- key-value antijoin key
+- key antijoin key
+
+The implementation arranges collections when needed. Arrangements are Differential Dataflow's indexed representation for joins and repeated access.
+
+This is why FlowLog cares about key and value arity. The physical shape determines which macro-generated operator can be used and whether the runtime
+needs a fallback representation.
+
+---
+
+## Arity Strategy
+
+FlowLog uses specialized fixed-size representations for common arities and a fallback mode for wider tuples.
+
+The program plan can compute maximal key and value arity pairs. If a query exceeds the fixed-size threshold, the fallback fat mode is required.
+
+This is a performance engineering detail: Datalog workloads can produce wide intermediate tuples, but specializing small tuples can reduce allocation
+and dynamic dispatch overhead.
+
+---
+
+## Batch and Incremental Weights
+
+FlowLog has two build modes for weights.
+
+Batch mode uses a presence-style difference type. This is suited for static Datalog workloads where a fact is either present or absent.
+
+Incremental mode uses signed integer differences. This can represent insertions, deletions, and multiplicities.
+
+At the implementation level, this means the same logical engine can target:
+
+```text
+static fixed-point computation
+incremental maintenance over changing inputs
+```
+
+The paper's artifact focuses on batch benchmarks, but the backend model is compatible with incremental updates.
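The difference between the two weight styles can be illustrated with a small sketch. This is plain Rust under stated assumptions, not FlowLog's weight types: `consolidate` and `present` are hypothetical helpers. The first sums signed differences per fact, as an incremental build would need; the second collapses the result to a present-or-absent view, which is all a batch build has to track.

```rust
use std::collections::{BTreeMap, BTreeSet};

type Fact = (u32, u32);

/// Sums signed differences per fact and drops facts whose net weight is
/// zero, so an insertion followed by a retraction cancels out.
fn consolidate(updates: &[(Fact, i64)]) -> BTreeMap<Fact, i64> {
    let mut acc: BTreeMap<Fact, i64> = BTreeMap::new();
    for &(fact, diff) in updates {
        *acc.entry(fact).or_insert(0) += diff;
    }
    acc.retain(|_, weight| *weight != 0);
    acc
}

/// Presence-style view: keep only whether a fact has positive net weight.
fn present(weights: &BTreeMap<Fact, i64>) -> BTreeSet<Fact> {
    weights
        .iter()
        .filter(|(_, &weight)| weight > 0)
        .map(|(&fact, _)| fact)
        .collect()
}

fn main() {
    let updates = [
        ((1, 2), 1),  // insert Arc(1, 2)
        ((2, 3), 1),  // insert Arc(2, 3)
        ((1, 2), -1), // retract Arc(1, 2)
        ((2, 3), 1),  // insert Arc(2, 3) again
    ];
    let net = consolidate(&updates);
    println!("signed weights: {:?}", net);
    println!("presence view:  {:?}", present(&net));
}
```

The signed form keeps enough information to undo a derivation when an input is retracted. The presence form is cheaper but can only describe a static result.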
+
+---
+
+## Important Limitations
+
+The artifact has several important limitations:
+
+- release builds can be slow because of large Timely and Differential Dataflow dependencies
+- aggregation support is constrained
+- arithmetic in rule heads is unstable in the artifact version
+- some optimization paths are controlled by flags or rule directives
+- SIP currently has implementation-specific handling in stratum grouping
+- output support is more oriented around relation sizes and CSV dumps than an embedded application API
+
+These are acceptable for a research artifact, but they matter if comparing FlowLog to an embedded query engine for an application.
+
+---
+
+## Lessons for Other Engines
+
+FlowLog's implementation suggests several reusable lessons.
+
+A Datalog engine benefits from an explicit rule catalog. It gives optimization and execution a shared view of variables, atoms, filters, and heads.
+
+Recursive evaluation should not hide join planning. The rules inside a fixed-point loop are where bad plans become expensive.
+
+Physical arrangements are part of the query plan. If the backend needs key-value indexes, the logical planner should expose key choices explicitly.
+
+Optimization can be robustness-first. Recursive workloads may not have stable enough statistics for a conventional cost model.
+
+The frontend and backend should stay separated. Datalog syntax, relational rule planning, and Differential Dataflow execution are different concerns.
+
+---
+
+## Practical Mental Model
+
+FlowLog's implementation can be read as:
+
+```text
+parser and schema loader
++ dependency and strata analyzer
++ rule catalog
++ relational transformation planner
++ robust join planner
++ SIP expander
++ Differential Dataflow executor
+```
+
+The implementation is valuable because it shows the concrete machinery needed between a compact Datalog rule and an efficient incremental dataflow
+program.
+
+---
+
+## Changelog
+
+* **May 19, 2026** -- First version created from the FlowLog paper and artifact.