useful-notes/flowlog/002-flowlog-implementation.md

355 lines
11 KiB
Markdown
Raw Permalink Normal View History

# FlowLog Implementation
A reading note on the implementation shape of FlowLog's Rust artifact.
---
## Short Answer
FlowLog is implemented as a Rust workspace with separate crates for parsing, stratification, catalog construction, logical planning, optimization,
input reading, execution, and code-generation macros.
The implementation path is:
```text
.dl file
-> parser
-> strata
-> catalog
-> program query plan
-> grouped strata plans
-> Differential Dataflow dataflow
-> output relation sizes or CSVs
```
The most important implementation idea is that FlowLog does not treat Differential Dataflow as a direct code-generation target from raw Datalog. It
first builds logical rule plans with explicit collection signatures and transformation flows.
---
## Workspace Shape
The artifact is organized as a Rust workspace with crates that line up with the execution pipeline.
`parsing` parses the Datalog dialect. It uses a grammar with declarations, input directives, output directives, rules, negation, comparisons,
arithmetic, and aggregation.
`strata` builds dependency information and groups rules into strata. Recursive rules are identified through the dependency graph.
`catalog` turns parsed rules into metadata. This includes atoms, head arguments, filters, comparisons, arithmetic expressions, aggregation heads, and
rule structure.
`planning` creates logical query plans from catalogs and strata. This is where rule bodies become transformation chains and where join structure is
represented.
`optimizing` chooses structural join plans. It reasons over the variable overlap among rule atoms and selects plan trees.
`reading` loads input relations from files and represents rows, relation sessions, semiring-like weights, and arrangements.
`executing` builds and runs the Differential Dataflow graph. It owns command-line handling, dataflow construction, operators, collectors, and output
inspection.
`macros` provides Rust macros that generate specialized operator code for different key and value arities.
---
## Frontend
The frontend grammar supports sections like:
```datalog
.in
.decl Arc(x: number, y: number)
.input Arc.csv
.printsize
.decl Tc(x: number, y: number)
.rule
Tc(x, y) :- Arc(x, y).
Tc(x, y) :- Arc(z, y), Tc(x, z).
```
The parser distinguishes:
- extensional declarations
- intensional declarations
- rule heads
- positive atoms
- negated atoms
- comparisons
- constants
- placeholders
- aggregate heads
This is more schema-driven than small teaching Datalog examples. Relation declarations give the engine names and arities before planning.
---
## Strata and Program Plans
After parsing, FlowLog builds strata from rule dependencies.
The `ProgramQueryPlan` is created from strata. It iterates through each stratum, builds a `Catalog` for each rule, decides whether SIP and structural
planning apply, expands SIP rules when needed, and converts each catalog into a `RuleQueryPlan`.
The optimizer is only activated for rules with more than two core atoms. This is a pragmatic choice: optimizing one-atom or two-atom rules has little
value.
The planning level also tracks whether a stratum is recursive. Recursive and non-recursive groups are executed differently later.
The key shape is:
```text
Strata
-> Catalog per rule
-> RuleQueryPlan per catalog
-> GroupStrataQueryPlan
-> ProgramQueryPlan
```
---
## Catalog Role
The catalog is the bridge between syntax and planning.
It records which rule atoms are core relational inputs, which terms are filters or constraints, which variables occur where, and how the rule head
relates to the body.
Planning needs this metadata to answer questions such as:
- Which atoms should participate in joins?
- Which atom arguments are shared variables?
- Which comparisons can be applied locally?
- Which projected fields are needed in the output?
- Which atoms are negated?
- Which rules derive the same output relation?
Without this catalog layer, the executor would have to rediscover semantic information from syntax during physical dataflow construction.
---
## Collection Signatures
FlowLog lowers relations into collection signatures that distinguish row, key, and value shapes.
Differential Dataflow joins are easiest to express over key-value collections. FlowLog therefore maps relation tuples into several physical forms:
- row collections for plain tuples
- key-only collections for semijoin and antijoin support
- key-value collections for joins
The executor keeps maps for these forms:
```text
row_map
kv_map
k_map
```
This is a central implementation detail. A Datalog relation looks like one logical predicate, but execution may maintain several arranged physical
views of it.
---
## Transformation Flow
The planning layer represents operations with transformation flows.
A unary transformation maps one collection to another:
```text
KVToKV
```
This covers projection, filtering, local constraints, and reshaping a tuple into key-value form.
A binary transformation represents a join-like step:
```text
JnToKV
```
This maps joined key-value inputs into a new output key and value shape.
Each transformation flow tracks:
- output key arguments
- output value arguments
- local constraints
- comparison expressions
- how input fields flow to output fields
This lets the executor generate a specific Differential Dataflow operator while the planner remains backend-independent enough to reason about rule
structure.
---
## Structural Planning
The optimizer builds a plan tree for the core atoms of a rule.
The default plan is essentially a chain following the rule's atom order. The optimized plan searches for a better tree by looking at variable overlap
among atoms.
The optimizer uses a maximum-spanning-tree-style search over atom overlaps. Then it evaluates candidate trees with a width measure and depth
tie-breaker.
The goal is not perfect cardinality estimation. The goal is robust plan shape:
- cross-product avoidance when possible
- smaller intermediate relation width
- earlier joins between atoms that share variables
- lower chance of large maintained join state
This fits recursive Datalog because reliable static cardinality estimates are hard. A robustness-oriented heuristic is often more useful than a
fragile cost model.
---
## Sideways Information Passing
Sideways information passing is a rule transformation that creates semijoin-style filters.
The practical goal is:
```text
known useful bindings
-> prefilter later atoms
-> smaller join inputs
-> less intermediate state
```
In the implementation, enabling SIP can expand a catalog into multiple catalogs. For non-recursive strata, this may split one group into several
cascading groups so the generated filters can feed later steps.
This is why planning and stratification interact. SIP is not just a local operator rewrite. It can change the shape of the stratum plan.
---
## Executor Shape
The executor creates a Timely dataflow and then builds Differential Dataflow collections inside it.
At startup, it creates input sessions for every extensional relation. Those sessions are used to load facts from files.
For each stratum group, execution branches by recursion:
- non-recursive groups are built as straight-line transformations
- recursive groups are built inside an iterative scope
Non-recursive execution walks each transformation and constructs the matching dataflow operator. Outputs are stored back into `row_map`, `kv_map`, or
`k_map` depending on their physical shape.
Recursive execution creates iterative variables for intensional relations and repeatedly applies the recursive transformations until convergence.
Collectors merge rule outputs for the same intensional predicate. Inspectors print relation sizes or emit outputs.
---
## Physical Operators
The executor has operators corresponding to the physical collection shapes:
- row to row
- row to key
- row to key-value
- key-value join key-value
- key-value join key
- key join key
- cartesian product
- key-value antijoin key
- key antijoin key
The implementation arranges collections when needed. Arrangements are Differential Dataflow's indexed representation for joins and repeated access.
This is why FlowLog cares about key and value arity. The physical shape determines which macro-generated operator can be used and whether the runtime
needs a fallback representation.
---
## Arity Strategy
FlowLog uses specialized fixed-size representations for common arities and a fallback mode for wider tuples.
The program plan can compute maximal key and value arity pairs. If a query exceeds the fixed-size fallback threshold, fat mode is required.
This is a performance engineering detail: Datalog workloads can produce wide intermediate tuples, but specializing small tuples can reduce allocation
and dynamic dispatch overhead.
---
## Batch and Incremental Weights
FlowLog has two build modes for weights.
Batch mode uses a presence-style difference type. This is suited for static Datalog workloads where a fact is either present or absent.
Incremental mode uses signed integer differences. This can represent insertions, deletions, and multiplicities.
At the implementation level, this means the same logical engine can target:
```text
static fixed-point computation
incremental maintenance over changing inputs
```
The paper's artifact focuses on batch benchmarks, but the backend model is compatible with incremental updates.
---
## Important Limitations
The artifact has several important limitations:
- release builds can be slow because of large Timely and Differential Dataflow dependencies
- aggregation support is constrained
- arithmetic in rule heads is unstable in the artifact version
- some optimization paths are controlled by flags or rule directives
- SIP currently has implementation-specific handling in stratum grouping
- output support is more oriented around relation sizes and CSV dumps than an embedded application API
These are acceptable for a research artifact, but they matter if comparing FlowLog to an embedded query engine for an application.
---
## Lessons for Other Engines
FlowLog's implementation suggests several reusable lessons.
A Datalog engine benefits from an explicit rule catalog. It gives optimization and execution a shared view of variables, atoms, filters, and heads.
Recursive evaluation should not hide join planning. The rules inside a fixed-point loop are where bad plans become expensive.
Physical arrangements are part of the query plan. If the backend needs key-value indexes, the logical planner should expose key choices explicitly.
Optimization can be robustness-first. Recursive workloads may not have stable enough statistics for a conventional cost model.
The frontend and backend should stay separated. Datalog syntax, relational rule planning, and Differential Dataflow execution are different concerns.
---
## Practical Mental Model
FlowLog's implementation can be read as:
```text
parser and schema loader
+ dependency and strata analyzer
+ rule catalog
+ relational transformation planner
+ robust join planner
+ SIP expander
+ Differential Dataflow executor
```
The implementation is valuable because it shows the concrete machinery needed between a compact Datalog rule and an efficient incremental dataflow
program.
---
## Changelog
* **May 19, 2026** -- First version created from the FlowLog paper and artifact.