Add two note files for FlowLog (primer and implementation)

This commit is contained in:
Hassan Abedi 2026-05-19 15:20:43 +02:00
parent ee52b850e4
commit 3d67b4994e
2 changed files with 616 additions and 0 deletions

View File

@ -0,0 +1,262 @@
# FlowLog Primer
A primer on FlowLog as a Datalog engine built on Differential Dataflow.
---
## Short Answer
FlowLog is a Datalog engine for recursive queries. It parses Datalog programs, stratifies rules, builds a relational intermediate representation,
optimizes rule plans, and executes them with Differential Dataflow.
The main idea is:
```text
Datalog rules
-> relational rule plans
-> Differential Dataflow operators
-> maintained derived relations
```
FlowLog is not only a parser for Datalog. It is a query engine design that keeps Datalog-specific optimization visible before the program is lowered
to a streaming dataflow backend.
---
## Why It Exists
Datalog is useful for recursive computations:
- graph reachability
- transitive closure
- program analysis
- static analysis
- network and distributed-system rules
- recursive data-cleaning or constraint logic
The hard part is execution. Recursive Datalog can spend most of its time and memory on joins inside fixed-point loops. Bad join orders can create
large intermediate relations, and the best order can vary by workload and iteration.
FlowLog tries to keep three properties together:
- Datalog-level expressiveness
- incremental and parallel execution
- query planning control before execution
The design uses Differential Dataflow as the physical backend, but it does not translate Datalog directly into low-level dataflow code. It first
creates an intermediate representation where Datalog-aware rewrites can happen.
---
## Datalog Model
A Datalog program contains facts and rules.
Input relations are extensional database predicates:
```datalog
.in
.decl Arc(x: number, y: number)
.input Arc.csv
```
Derived relations are intensional database predicates:
```datalog
.printsize
.decl Tc(x: number, y: number)
```
Rules derive output facts from input and already-derived facts:
```datalog
.rule
Tc(x, y) :- Arc(x, y).
Tc(x, y) :- Arc(z, y), Tc(x, z).
```
This example computes transitive closure. The first rule copies direct edges into `Tc`. The second rule recursively extends paths.
---
## Language Features
FlowLog supports a practical Datalog dialect with:
- relation declarations
- CSV-style input and output directives
- recursive rules
- stratified negation
- comparisons
- arithmetic expressions
- placeholder arguments with `_`
- aggregation with `count`, `sum`, `min`, and `max`
- optimization directives such as `.plan`, `.sip`, and `.optimize`
Negation is written with `!`:
```datalog
indirect_only(x, z) :- edge(x, y), edge(y, z), !edge(x, z).
```
Aggregation appears in the head:
```datalog
count_paths(x, z, count(y)) :- edge(x, y), edge(y, z).
```
The implementation has limits. Aggregation support is constrained, arithmetic in rule heads is not fully stable in the artifact version, and compile
times can be high because the backend depends on Differential Dataflow and Timely Dataflow.
---
## Execution Modes
FlowLog has two execution modes.
Batch mode is the default. It is intended for static Datalog evaluation where the input facts are loaded and the derived relations are computed.
Incremental mode uses integer differences so changes can be tracked as insertions and retractions. This fits incremental view maintenance, where input
updates should produce output updates.
The important distinction is:
```text
batch mode:
compute the fixed point for an input dataset
incremental mode:
maintain derived results as facts change
```
The paper benchmarks focus on batch execution, but the architecture is designed around incrementality.
---
## Differential Dataflow Role
Differential Dataflow represents collections as records with data, logical time, and a difference:
```text
(data, time, diff)
```
The `diff` field records multiplicity changes. Positive differences insert facts. Negative differences retract facts.
Operators such as `map`, `filter`, `join`, `concat`, `distinct`, and `iterate` maintain output changes as input changes arrive. Joins use maintained
indexes called arrangements.
This makes Differential Dataflow a useful backend for Datalog because:
- Datalog rules are relational queries.
- Recursive rules need fixed-point iteration.
- Semi-naive evaluation naturally works with deltas.
- Maintained arrangements can avoid repeated full scans.
FlowLog's job is to turn Datalog rules into a form that uses these backend operators efficiently.
---
## Stratification
FlowLog groups rules into strata using the dependency graph of the program.
A rule depends on another rule if its body mentions the relation derived by that other rule. Recursive rules appear in strongly connected components.
The engine evaluates strata in dependency order.
The usual shape is:
```text
non-recursive strata
-> recursive strata
-> later strata that depend on earlier outputs
```
This matters for negation and recursion. Negation must be stratified so a rule does not negatively depend on itself through a cycle. Recursive strata
need fixed-point evaluation.
---
## Optimization Focus
FlowLog's main contribution is not a new Datalog syntax. It is the optimization boundary between Datalog and Differential Dataflow.
The system uses a relational intermediate representation per rule. That lets the optimizer reason about joins, filters, subplans, and recursive
execution before lowering to physical dataflow operators.
Two important optimizations are:
- structural planning
- sideways information passing
Structural planning chooses join plans intended to avoid large intermediate results. It is robustness-oriented: avoid bad plans rather than assume
perfect cardinality estimates.
Sideways information passing uses semijoin-style prefiltering. It pushes known bindings or reachable values sideways through a rule so later joins see
less irrelevant input.
These two optimizations are complementary. Planning improves join shape. SIP reduces input size before the joins happen.
---
## Comparison with DBSP
FlowLog and DBSP live in the same design neighborhood:
```text
relational rules
-> incremental computation
-> maintained output relations
```
The backend model differs.
DBSP describes incremental view maintenance through streams, Z-sets, integration, differentiation, and circuit rewriting.
Differential Dataflow describes incremental collections with logical times and differences, and it maintains arrangements for efficient joins over
time.
For the CRDT and Geomerge notes, FlowLog is useful because it emphasizes a lesson that also applies to DBSP: recursive Datalog performance depends
heavily on physical planning. A declarative rule can be correct and still produce expensive intermediate state.
---
## When FlowLog Is Relevant
FlowLog is relevant when the problem has:
- recursive relational logic
- large graph-shaped inputs
- repeated joins inside fixed-point loops
- a need for incremental maintenance
- sensitivity to join order and memory use
It is less directly relevant when the problem is mostly point lookups, simple filters, or small non-recursive validation queries. Those can be handled
by simpler relational engines.
The strongest use case is a Datalog workload where rule-level optimization and incremental execution both matter.
---
## Practical Mental Model
FlowLog is best understood as:
```text
Datalog frontend
+ per-rule relational IR
+ recursive strata planning
+ robust join optimization
+ Differential Dataflow execution
```
Its central architectural choice is the split between logical Datalog planning and physical dataflow execution.
That split is what makes the system useful to study. It shows how a Datalog engine can reuse a general incremental backend without giving up
Datalog-specific optimization.
---
## Changelog
* **May 19, 2026** -- First version created from the FlowLog paper and artifact.

View File

@ -0,0 +1,354 @@
# FlowLog Implementation
A reading note on the implementation shape of FlowLog's Rust artifact.
---
## Short Answer
FlowLog is implemented as a Rust workspace with separate crates for parsing, stratification, catalog construction, logical planning, optimization,
input reading, execution, and code-generation macros.
The implementation path is:
```text
.dl file
-> parser
-> strata
-> catalog
-> program query plan
-> grouped strata plans
-> Differential Dataflow dataflow
-> output relation sizes or CSVs
```
The most important implementation idea is that FlowLog does not treat Differential Dataflow as a direct code-generation target from raw Datalog. It
first builds logical rule plans with explicit collection signatures and transformation flows.
---
## Workspace Shape
The artifact is organized as a Rust workspace with crates that line up with the execution pipeline.
`parsing` parses the Datalog dialect. It uses a grammar with declarations, input directives, output directives, rules, negation, comparisons,
arithmetic, and aggregation.
`strata` builds dependency information and groups rules into strata. Recursive rules are identified through the dependency graph.
`catalog` turns parsed rules into metadata. This includes atoms, head arguments, filters, comparisons, arithmetic expressions, aggregation heads, and
rule structure.
`planning` creates logical query plans from catalogs and strata. This is where rule bodies become transformation chains and where join structure is
represented.
`optimizing` chooses structural join plans. It reasons over the variable overlap among rule atoms and selects plan trees.
`reading` loads input relations from files and represents rows, relation sessions, semiring-like weights, and arrangements.
`executing` builds and runs the Differential Dataflow graph. It owns command-line handling, dataflow construction, operators, collectors, and output
inspection.
`macros` provides Rust macros that generate specialized operator code for different key and value arities.
---
## Frontend
The frontend grammar supports sections like:
```datalog
.in
.decl Arc(x: number, y: number)
.input Arc.csv
.printsize
.decl Tc(x: number, y: number)
.rule
Tc(x, y) :- Arc(x, y).
Tc(x, y) :- Arc(z, y), Tc(x, z).
```
The parser distinguishes:
- extensional declarations
- intensional declarations
- rule heads
- positive atoms
- negated atoms
- comparisons
- constants
- placeholders
- aggregate heads
This is more schema-driven than small teaching Datalog examples. Relation declarations give the engine names and arities before planning.
---
## Strata and Program Plans
After parsing, FlowLog builds strata from rule dependencies.
The `ProgramQueryPlan` is created from strata. It iterates through each stratum, builds a `Catalog` for each rule, decides whether SIP and structural
planning apply, expands SIP rules when needed, and converts each catalog into a `RuleQueryPlan`.
The optimizer is only activated for rules with more than two core atoms. This is a pragmatic choice: optimizing one-atom or two-atom rules has little
value.
The planning level also tracks whether a stratum is recursive. Recursive and non-recursive groups are executed differently later.
The key shape is:
```text
Strata
-> Catalog per rule
-> RuleQueryPlan per catalog
-> GroupStrataQueryPlan
-> ProgramQueryPlan
```
---
## Catalog Role
The catalog is the bridge between syntax and planning.
It records which rule atoms are core relational inputs, which terms are filters or constraints, which variables occur where, and how the rule head
relates to the body.
Planning needs this metadata to answer questions such as:
- Which atoms should participate in joins?
- Which atom arguments are shared variables?
- Which comparisons can be applied locally?
- Which projected fields are needed in the output?
- Which atoms are negated?
- Which rules derive the same output relation?
Without this catalog layer, the executor would have to rediscover semantic information from syntax during physical dataflow construction.
---
## Collection Signatures
FlowLog lowers relations into collection signatures that distinguish row, key, and value shapes.
Differential Dataflow joins are easiest to express over key-value collections. FlowLog therefore maps relation tuples into several physical forms:
- row collections for plain tuples
- key-only collections for semijoin and antijoin support
- key-value collections for joins
The executor keeps maps for these forms:
```text
row_map
kv_map
k_map
```
This is a central implementation detail. A Datalog relation looks like one logical predicate, but execution may maintain several arranged physical
views of it.
---
## Transformation Flow
The planning layer represents operations with transformation flows.
A unary transformation maps one collection to another:
```text
KVToKV
```
This covers projection, filtering, local constraints, and reshaping a tuple into key-value form.
A binary transformation represents a join-like step:
```text
JnToKV
```
This maps joined key-value inputs into a new output key and value shape.
Each transformation flow tracks:
- output key arguments
- output value arguments
- local constraints
- comparison expressions
- how input fields flow to output fields
This lets the executor generate a specific Differential Dataflow operator while the planner remains backend-independent enough to reason about rule
structure.
---
## Structural Planning
The optimizer builds a plan tree for the core atoms of a rule.
The default plan is essentially a chain following the rule's atom order. The optimized plan searches for a better tree by looking at variable overlap
among atoms.
The optimizer uses a maximum-spanning-tree-style search over atom overlaps. Then it evaluates candidate trees with a width measure and depth
tie-breaker.
The goal is not perfect cardinality estimation. The goal is robust plan shape:
- cross-product avoidance when possible
- smaller intermediate relation width
- earlier joins between atoms that share variables
- lower chance of large maintained join state
This fits recursive Datalog because reliable static cardinality estimates are hard. A robustness-oriented heuristic is often more useful than a
fragile cost model.
---
## Sideways Information Passing
Sideways information passing is a rule transformation that creates semijoin-style filters.
The practical goal is:
```text
known useful bindings
-> prefilter later atoms
-> smaller join inputs
-> less intermediate state
```
In the implementation, enabling SIP can expand a catalog into multiple catalogs. For non-recursive strata, this may split one group into several
cascading groups so the generated filters can feed later steps.
This is why planning and stratification interact. SIP is not just a local operator rewrite. It can change the shape of the stratum plan.
---
## Executor Shape
The executor creates a Timely dataflow and then builds Differential Dataflow collections inside it.
At startup, it creates input sessions for every extensional relation. Those sessions are used to load facts from files.
For each stratum group, execution branches by recursion:
- non-recursive groups are built as straight-line transformations
- recursive groups are built inside an iterative scope
Non-recursive execution walks each transformation and constructs the matching dataflow operator. Outputs are stored back into `row_map`, `kv_map`, or
`k_map` depending on their physical shape.
Recursive execution creates iterative variables for intensional relations and repeatedly applies the recursive transformations until convergence.
Collectors merge rule outputs for the same intensional predicate. Inspectors print relation sizes or emit outputs.
---
## Physical Operators
The executor has operators corresponding to the physical collection shapes:
- row to row
- row to key
- row to key-value
- key-value join key-value
- key-value join key
- key join key
- cartesian product
- key-value antijoin key
- key antijoin key
The implementation arranges collections when needed. Arrangements are Differential Dataflow's indexed representation for joins and repeated access.
This is why FlowLog cares about key and value arity. The physical shape determines which macro-generated operator can be used and whether the runtime
needs a fallback representation.
---
## Arity Strategy
FlowLog uses specialized fixed-size representations for common arities and a fallback mode for wider tuples.
The program plan can compute maximal key and value arity pairs. If a query exceeds the fixed-size fallback threshold, fat mode is required.
This is a performance engineering detail: Datalog workloads can produce wide intermediate tuples, but specializing small tuples can reduce allocation
and dynamic dispatch overhead.
---
## Batch and Incremental Weights
FlowLog has two build modes for weights.
Batch mode uses a presence-style difference type. This is suited for static Datalog workloads where a fact is either present or absent.
Incremental mode uses signed integer differences. This can represent insertions, deletions, and multiplicities.
At the implementation level, this means the same logical engine can target:
```text
static fixed-point computation
incremental maintenance over changing inputs
```
The paper's artifact focuses on batch benchmarks, but the backend model is compatible with incremental updates.
---
## Important Limitations
The artifact has several important limitations:
- release builds can be slow because of large Timely and Differential Dataflow dependencies
- aggregation support is constrained
- arithmetic in rule heads is unstable in the artifact version
- some optimization paths are controlled by flags or rule directives
- SIP currently has implementation-specific handling in stratum grouping
- output support is more oriented around relation sizes and CSV dumps than an embedded application API
These are acceptable for a research artifact, but they matter if comparing FlowLog to an embedded query engine for an application.
---
## Lessons for Other Engines
FlowLog's implementation suggests several reusable lessons.
A Datalog engine benefits from an explicit rule catalog. It gives optimization and execution a shared view of variables, atoms, filters, and heads.
Recursive evaluation should not hide join planning. The rules inside a fixed-point loop are where bad plans become expensive.
Physical arrangements are part of the query plan. If the backend needs key-value indexes, the logical planner should expose key choices explicitly.
Optimization can be robustness-first. Recursive workloads may not have stable enough statistics for a conventional cost model.
The frontend and backend should stay separated. Datalog syntax, relational rule planning, and Differential Dataflow execution are different concerns.
---
## Practical Mental Model
FlowLog's implementation can be read as:
```text
parser and schema loader
+ dependency and strata analyzer
+ rule catalog
+ relational transformation planner
+ robust join planner
+ SIP expander
+ Differential Dataflow executor
```
The implementation is valuable because it shows the concrete machinery needed between a compact Datalog rule and an efficient incremental dataflow
program.
---
## Changelog
* **May 19, 2026** -- First version created from the FlowLog paper and artifact.