355 lines
11 KiB
Markdown
355 lines
11 KiB
Markdown
# FlowLog Implementation
|
|
|
|
A reading note on the implementation shape of FlowLog's Rust artifact.
|
|
|
|
---
|
|
|
|
## Short Answer
|
|
|
|
FlowLog is implemented as a Rust workspace with separate crates for parsing, stratification, catalog construction, logical planning, optimization,
|
|
input reading, execution, and code-generation macros.
|
|
|
|
The implementation path is:
|
|
|
|
```text
|
|
.dl file
|
|
-> parser
|
|
-> strata
|
|
-> catalog
|
|
-> program query plan
|
|
-> grouped strata plans
|
|
-> Differential Dataflow dataflow
|
|
-> output relation sizes or CSVs
|
|
```
|
|
|
|
The most important implementation idea is that FlowLog does not treat Differential Dataflow as a direct code-generation target from raw Datalog. It
|
|
first builds logical rule plans with explicit collection signatures and transformation flows.
|
|
|
|
---
|
|
|
|
## Workspace Shape
|
|
|
|
The artifact is organized as a Rust workspace with crates that line up with the execution pipeline.
|
|
|
|
`parsing` parses the Datalog dialect. It uses a grammar with declarations, input directives, output directives, rules, negation, comparisons,
|
|
arithmetic, and aggregation.
|
|
|
|
`strata` builds dependency information and groups rules into strata. Recursive rules are identified through the dependency graph.
|
|
|
|
`catalog` turns parsed rules into metadata. This includes atoms, head arguments, filters, comparisons, arithmetic expressions, aggregation heads, and
|
|
rule structure.
|
|
|
|
`planning` creates logical query plans from catalogs and strata. This is where rule bodies become transformation chains and where join structure is
|
|
represented.
|
|
|
|
`optimizing` chooses structural join plans. It reasons over the variable overlap among rule atoms and selects plan trees.
|
|
|
|
`reading` loads input relations from files and represents rows, relation sessions, semiring-like weights, and arrangements.
|
|
|
|
`executing` builds and runs the Differential Dataflow graph. It owns command-line handling, dataflow construction, operators, collectors, and output
|
|
inspection.
|
|
|
|
`macros` provides Rust macros that generate specialized operator code for different key and value arities.
|
|
|
|
---
|
|
|
|
## Frontend
|
|
|
|
The frontend grammar supports sections like:
|
|
|
|
```datalog
|
|
.in
|
|
.decl Arc(x: number, y: number)
|
|
.input Arc.csv
|
|
|
|
.printsize
|
|
.decl Tc(x: number, y: number)
|
|
|
|
.rule
|
|
Tc(x, y) :- Arc(x, y).
|
|
Tc(x, y) :- Arc(z, y), Tc(x, z).
|
|
```
|
|
|
|
The parser distinguishes:
|
|
|
|
- extensional declarations
|
|
- intensional declarations
|
|
- rule heads
|
|
- positive atoms
|
|
- negated atoms
|
|
- comparisons
|
|
- constants
|
|
- placeholders
|
|
- aggregate heads
|
|
|
|
This is more schema-driven than small teaching Datalog examples. Relation declarations give the engine names and arities before planning.
|
|
|
|
---
|
|
|
|
## Strata and Program Plans
|
|
|
|
After parsing, FlowLog builds strata from rule dependencies.
|
|
|
|
The `ProgramQueryPlan` is created from strata. It iterates through each stratum, builds a `Catalog` for each rule, decides whether SIP and structural
|
|
planning apply, expands SIP rules when needed, and converts each catalog into a `RuleQueryPlan`.
|
|
|
|
The optimizer is only activated for rules with more than two core atoms. This is a pragmatic choice: optimizing one-atom or two-atom rules has little
|
|
value.
|
|
|
|
The planning level also tracks whether a stratum is recursive. Recursive and non-recursive groups are executed differently later.
|
|
|
|
The key shape is:
|
|
|
|
```text
|
|
Strata
|
|
-> Catalog per rule
|
|
-> RuleQueryPlan per catalog
|
|
-> GroupStrataQueryPlan
|
|
-> ProgramQueryPlan
|
|
```
|
|
|
|
---
|
|
|
|
## Catalog Role
|
|
|
|
The catalog is the bridge between syntax and planning.
|
|
|
|
It records which rule atoms are core relational inputs, which terms are filters or constraints, which variables occur where, and how the rule head
|
|
relates to the body.
|
|
|
|
Planning needs this metadata to answer questions such as:
|
|
|
|
- Which atoms should participate in joins?
|
|
- Which atom arguments are shared variables?
|
|
- Which comparisons can be applied locally?
|
|
- Which projected fields are needed in the output?
|
|
- Which atoms are negated?
|
|
- Which rules derive the same output relation?
|
|
|
|
Without this catalog layer, the executor would have to rediscover semantic information from syntax during physical dataflow construction.
|
|
|
|
---
|
|
|
|
## Collection Signatures
|
|
|
|
FlowLog lowers relations into collection signatures that distinguish row, key, and value shapes.
|
|
|
|
Differential Dataflow joins are easiest to express over key-value collections. FlowLog therefore maps relation tuples into several physical forms:
|
|
|
|
- row collections for plain tuples
|
|
- key-only collections for semijoin and antijoin support
|
|
- key-value collections for joins
|
|
|
|
The executor keeps maps for these forms:
|
|
|
|
```text
|
|
row_map
|
|
kv_map
|
|
k_map
|
|
```
|
|
|
|
This is a central implementation detail. A Datalog relation looks like one logical predicate, but execution may maintain several arranged physical
|
|
views of it.
|
|
|
|
---
|
|
|
|
## Transformation Flow
|
|
|
|
The planning layer represents operations with transformation flows.
|
|
|
|
A unary transformation maps one collection to another:
|
|
|
|
```text
|
|
KVToKV
|
|
```
|
|
|
|
This covers projection, filtering, local constraints, and reshaping a tuple into key-value form.
|
|
|
|
A binary transformation represents a join-like step:
|
|
|
|
```text
|
|
JnToKV
|
|
```
|
|
|
|
This maps joined key-value inputs into a new output key and value shape.
|
|
|
|
Each transformation flow tracks:
|
|
|
|
- output key arguments
|
|
- output value arguments
|
|
- local constraints
|
|
- comparison expressions
|
|
- how input fields flow to output fields
|
|
|
|
This lets the executor generate a specific Differential Dataflow operator while the planner remains backend-independent enough to reason about rule
|
|
structure.
|
|
|
|
---
|
|
|
|
## Structural Planning
|
|
|
|
The optimizer builds a plan tree for the core atoms of a rule.
|
|
|
|
The default plan is essentially a chain following the rule's atom order. The optimized plan searches for a better tree by looking at variable overlap
|
|
among atoms.
|
|
|
|
The optimizer uses a maximum-spanning-tree-style search over atom overlaps. Then it evaluates candidate trees with a width measure and depth
|
|
tie-breaker.
|
|
|
|
The goal is not perfect cardinality estimation. The goal is robust plan shape:
|
|
|
|
- cross-product avoidance when possible
|
|
- smaller intermediate relation width
|
|
- earlier joins between atoms that share variables
|
|
- lower chance of large maintained join state
|
|
|
|
This fits recursive Datalog because reliable static cardinality estimates are hard. A robustness-oriented heuristic is often more useful than a
|
|
fragile cost model.
|
|
|
|
---
|
|
|
|
## Sideways Information Passing
|
|
|
|
Sideways information passing is a rule transformation that creates semijoin-style filters.
|
|
|
|
The practical goal is:
|
|
|
|
```text
|
|
known useful bindings
|
|
-> prefilter later atoms
|
|
-> smaller join inputs
|
|
-> less intermediate state
|
|
```
|
|
|
|
In the implementation, enabling SIP can expand a catalog into multiple catalogs. For non-recursive strata, this may split one group into several
|
|
cascading groups so the generated filters can feed later steps.
|
|
|
|
This is why planning and stratification interact. SIP is not just a local operator rewrite. It can change the shape of the stratum plan.
|
|
|
|
---
|
|
|
|
## Executor Shape
|
|
|
|
The executor creates a Timely dataflow and then builds Differential Dataflow collections inside it.
|
|
|
|
At startup, it creates input sessions for every extensional relation. Those sessions are used to load facts from files.
|
|
|
|
For each stratum group, execution branches by recursion:
|
|
|
|
- non-recursive groups are built as straight-line transformations
|
|
- recursive groups are built inside an iterative scope
|
|
|
|
Non-recursive execution walks each transformation and constructs the matching dataflow operator. Outputs are stored back into `row_map`, `kv_map`, or
|
|
`k_map` depending on their physical shape.
|
|
|
|
Recursive execution creates iterative variables for intensional relations and repeatedly applies the recursive transformations until convergence.
|
|
|
|
Collectors merge rule outputs for the same intensional predicate. Inspectors print relation sizes or emit outputs.
|
|
|
|
---
|
|
|
|
## Physical Operators
|
|
|
|
The executor has operators corresponding to the physical collection shapes:
|
|
|
|
- row to row
|
|
- row to key
|
|
- row to key-value
|
|
- key-value join key-value
|
|
- key-value join key
|
|
- key join key
|
|
- cartesian product
|
|
- key-value antijoin key
|
|
- key antijoin key
|
|
|
|
The implementation arranges collections when needed. Arrangements are Differential Dataflow's indexed representation for joins and repeated access.
|
|
|
|
This is why FlowLog cares about key and value arity. The physical shape determines which macro-generated operator can be used and whether the runtime
|
|
needs a fallback representation.
|
|
|
|
---
|
|
|
|
## Arity Strategy
|
|
|
|
FlowLog uses specialized fixed-size representations for common arities and a fallback mode for wider tuples.
|
|
|
|
The program plan can compute maximal key and value arity pairs. If a query exceeds the fixed-size fallback threshold, fat mode is required.
|
|
|
|
This is a performance engineering detail: Datalog workloads can produce wide intermediate tuples, but specializing small tuples can reduce allocation
|
|
and dynamic dispatch overhead.
|
|
|
|
---
|
|
|
|
## Batch and Incremental Weights
|
|
|
|
FlowLog has two build modes for weights.
|
|
|
|
Batch mode uses a presence-style difference type. This is suited for static Datalog workloads where a fact is either present or absent.
|
|
|
|
Incremental mode uses signed integer differences. This can represent insertions, deletions, and multiplicities.
|
|
|
|
At the implementation level, this means the same logical engine can target:
|
|
|
|
```text
|
|
static fixed-point computation
|
|
incremental maintenance over changing inputs
|
|
```
|
|
|
|
The paper's artifact focuses on batch benchmarks, but the backend model is compatible with incremental updates.
|
|
|
|
---
|
|
|
|
## Important Limitations
|
|
|
|
The artifact has several important limitations:
|
|
|
|
- release builds can be slow because of large Timely and Differential Dataflow dependencies
|
|
- aggregation support is constrained
|
|
- arithmetic in rule heads is unstable in the artifact version
|
|
- some optimization paths are controlled by flags or rule directives
|
|
- SIP currently has implementation-specific handling in stratum grouping
|
|
- output support is more oriented around relation sizes and CSV dumps than an embedded application API
|
|
|
|
These are acceptable for a research artifact, but they matter if comparing FlowLog to an embedded query engine for an application.
|
|
|
|
---
|
|
|
|
## Lessons for Other Engines
|
|
|
|
FlowLog's implementation suggests several reusable lessons.
|
|
|
|
A Datalog engine benefits from an explicit rule catalog. It gives optimization and execution a shared view of variables, atoms, filters, and heads.
|
|
|
|
Recursive evaluation should not hide join planning. The rules inside a fixed-point loop are where bad plans become expensive.
|
|
|
|
Physical arrangements are part of the query plan. If the backend needs key-value indexes, the logical planner should expose key choices explicitly.
|
|
|
|
Optimization can be robustness-first. Recursive workloads may not have stable enough statistics for a conventional cost model.
|
|
|
|
The frontend and backend should stay separated. Datalog syntax, relational rule planning, and Differential Dataflow execution are different concerns.
|
|
|
|
---
|
|
|
|
## Practical Mental Model
|
|
|
|
FlowLog's implementation can be read as:
|
|
|
|
```text
|
|
parser and schema loader
|
|
+ dependency and strata analyzer
|
|
+ rule catalog
|
|
+ relational transformation planner
|
|
+ robust join planner
|
|
+ SIP expander
|
|
+ Differential Dataflow executor
|
|
```
|
|
|
|
The implementation is valuable because it shows the concrete machinery needed between a compact Datalog rule and an efficient incremental dataflow
|
|
program.
|
|
|
|
---
|
|
|
|
## Changelog
|
|
|
|
* **May 19, 2026** -- First version created from the FlowLog paper and artifact.
|