Add two note files for FlowLog (primer and implementation)
parent
ee52b850e4
commit
3d67b4994e
262
flowlog/001-flowlog-primer.md
Normal file
@@ -0,0 +1,262 @@

# FlowLog Primer

A primer on FlowLog as a Datalog engine built on Differential Dataflow.

---

## Short Answer

FlowLog is a Datalog engine for recursive queries. It parses Datalog programs, stratifies rules, builds a relational intermediate representation,
optimizes rule plans, and executes them with Differential Dataflow.

The main idea is:

```text
Datalog rules
  -> relational rule plans
  -> Differential Dataflow operators
  -> maintained derived relations
```

FlowLog is not only a parser for Datalog. It is a query engine design that keeps Datalog-specific optimization visible before the program is lowered
to a streaming dataflow backend.

---

## Why It Exists

Datalog is useful for recursive computations:

- graph reachability
- transitive closure
- program analysis
- static analysis
- network and distributed-system rules
- recursive data-cleaning or constraint logic

The hard part is execution. Recursive Datalog can spend most of its time and memory on joins inside fixed-point loops. Bad join orders can create
large intermediate relations, and the best order can vary by workload and iteration.

FlowLog tries to keep three properties together:

- Datalog-level expressiveness
- incremental and parallel execution
- query planning control before execution

The design uses Differential Dataflow as the physical backend, but it does not translate Datalog directly into low-level dataflow code. It first
creates an intermediate representation where Datalog-aware rewrites can happen.

---

## Datalog Model

A Datalog program contains facts and rules.

Input relations are extensional database predicates:

```datalog
.in
.decl Arc(x: number, y: number)
.input Arc.csv
```

Derived relations are intensional database predicates:

```datalog
.printsize
.decl Tc(x: number, y: number)
```

Rules derive output facts from input and already-derived facts:

```datalog
.rule
Tc(x, y) :- Arc(x, y).
Tc(x, y) :- Arc(z, y), Tc(x, z).
```

This example computes transitive closure. The first rule copies direct edges into `Tc`. The second rule extends a known path from `x` to `z` by one
arc from `z` to `y`.

---

## Language Features

FlowLog supports a practical Datalog dialect with:

- relation declarations
- CSV-style input and output directives
- recursive rules
- stratified negation
- comparisons
- arithmetic expressions
- placeholder arguments with `_`
- aggregation with `count`, `sum`, `min`, and `max`
- optimization directives such as `.plan`, `.sip`, and `.optimize`

Negation is written with `!`:

```datalog
indirect_only(x, z) :- edge(x, y), edge(y, z), !edge(x, z).
```

Aggregation appears in the head:

```datalog
count_paths(x, z, count(y)) :- edge(x, y), edge(y, z).
```

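To make the two rules above concrete, the sketch below evaluates them by hand over a small in-memory `edge` set. It is plain Python, not FlowLog code. The function names only mirror the rule heads, the sample edges are made up, and the reading of `count(y)` as "distinct `y` bindings per `(x, z)` group" is an assumption of this sketch rather than a documented FlowLog guarantee.

```python
from collections import defaultdict

# Hypothetical sample data; any set of (src, dst) pairs works.
edge = {(1, 2), (2, 3), (1, 3), (3, 4), (2, 4)}

def two_hop(edges):
    """Yield (x, y, z) for every pair of edges x -> y -> z."""
    by_src = defaultdict(set)
    for src, dst in edges:
        by_src[src].add(dst)
    for x, y in edges:
        for z in by_src[y]:
            yield x, y, z

def indirect_only(edges):
    """indirect_only(x, z) :- edge(x, y), edge(y, z), !edge(x, z)."""
    return {(x, z) for x, _, z in two_hop(edges) if (x, z) not in edges}

def count_paths(edges):
    """count_paths(x, z, count(y)) :- edge(x, y), edge(y, z)."""
    middles = defaultdict(set)
    for x, y, z in two_hop(edges):
        middles[(x, z)].add(y)  # assumption: count distinct y per (x, z)
    return {(x, z, len(ys)) for (x, z), ys in middles.items()}
```

Note that the negated atom is only a membership test against a relation that is already complete, which is what stratification is meant to guarantee.
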
The implementation has limits. Aggregation support is constrained, arithmetic in rule heads is not fully stable in the artifact version, and compile
times can be high because the backend depends on Differential Dataflow and Timely Dataflow.

---

## Execution Modes

FlowLog has two execution modes.

Batch mode is the default. It is intended for static Datalog evaluation where the input facts are loaded and the derived relations are computed.

Incremental mode uses integer differences so changes can be tracked as insertions and retractions. This fits incremental view maintenance, where input
updates should produce output updates.

The important distinction is:

```text
batch mode:
  compute the fixed point for an input dataset

incremental mode:
  maintain derived results as facts change
```

The paper's benchmarks focus on batch execution, but the architecture is designed around incrementality.

---

## Differential Dataflow Role

Differential Dataflow represents collections as records with data, logical time, and a difference:

```text
(data, time, diff)
```

The `diff` field records multiplicity changes. Positive differences insert facts. Negative differences retract facts.

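A small Python model can show how a stream of `(data, time, diff)` records determines a collection. This is only an illustration of the idea, not Differential Dataflow's API: the function names are hypothetical, and times are assumed to be plain integers, whereas Differential Dataflow allows partially ordered times.

```python
from collections import Counter

def consolidate(updates):
    """Sum diffs per (data, time) and drop entries that cancel to zero."""
    totals = Counter()
    for data, time, diff in updates:
        totals[(data, time)] += diff
    return {key: d for key, d in totals.items() if d != 0}

def collection_at(updates, time):
    """Accumulate every diff with timestamp <= time into a multiset."""
    acc = Counter()
    for data, t, diff in updates:
        if t <= time:
            acc[data] += diff
    return {data: n for data, n in acc.items() if n != 0}

# Hypothetical update stream: insert two arcs at time 0, retract one at time 1.
updates = [
    (("a", "b"), 0, +1),
    (("b", "c"), 0, +1),
    (("a", "b"), 1, -1),
]
```

Reading the collection at time 0 gives both arcs, while reading it at time 1 gives only `("b", "c")`, because the retraction cancels the earlier insertion.
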
Operators such as `map`, `filter`, `join`, `concat`, `distinct`, and `iterate` maintain output changes as input changes arrive. Joins use maintained
indexes called arrangements.

This makes Differential Dataflow a useful backend for Datalog because:

- Datalog rules are relational queries.
- Recursive rules need fixed-point iteration.
- Semi-naive evaluation naturally works with deltas.
- Maintained arrangements can avoid repeated full scans.

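The link between semi-naive evaluation and deltas can be sketched for the transitive closure rules `Tc(x, y) :- Arc(x, y).` and `Tc(x, y) :- Arc(z, y), Tc(x, z).` The Python below is a hand-written illustration, not what FlowLog generates. The dictionary `arc_by_src` plays the role of a maintained index, and the function name and returned round count are choices made for this sketch.

```python
def transitive_closure(arcs):
    """Semi-naive evaluation of the two Tc rules over a set of arcs."""
    # Index Arc by its first column so the join on z is a dictionary lookup.
    arc_by_src = {}
    for z, y in arcs:
        arc_by_src.setdefault(z, set()).add(y)

    tc = set(arcs)     # base rule: every arc is a path
    delta = set(arcs)  # facts discovered in the previous round
    rounds = 0
    while delta:
        rounds += 1
        # Join only the new Tc facts with Arc, then keep what is genuinely new.
        derived = {(x, y) for x, z in delta for y in arc_by_src.get(z, ())}
        delta = derived - tc
        tc |= delta
    return tc, rounds
```

Each round touches only the facts found in the previous round, so the loop stops as soon as a round produces nothing new.
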
FlowLog's job is to turn Datalog rules into a form that uses these backend operators efficiently.

---

## Stratification

FlowLog groups rules into strata using the dependency graph of the program.

A rule depends on another rule if its body mentions the relation that the other rule derives. Rules that are recursive, directly or mutually, form
strongly connected components of this graph. The engine evaluates strata in dependency order.

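One common way to obtain such an ordering is Tarjan's strongly connected components algorithm, which emits each component only after the components it depends on. The sketch below works on predicates rather than individual rules and uses a made-up encoding of rules as `(head, body_predicates)` pairs; it is not taken from the FlowLog source.

```python
def stratify(rules):
    """Group predicates into strata, dependencies first.

    `rules` is a list of (head, body_predicates) pairs (an assumption of
    this sketch). Returns a list of (predicates, is_recursive) tuples.
    """
    deps = {}
    for head, body in rules:
        deps.setdefault(head, set()).update(body)
        for pred in body:
            deps.setdefault(pred, set())

    index, low, on_stack, stack, strata = {}, {}, set(), [], []

    def visit(node):
        index[node] = low[node] = len(index)
        stack.append(node)
        on_stack.add(node)
        for dep in sorted(deps[node]):
            if dep not in index:
                visit(dep)
                low[node] = min(low[node], low[dep])
            elif dep in on_stack:
                low[node] = min(low[node], index[dep])
        if low[node] == index[node]:  # node is the root of a component
            component = set()
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.add(member)
                if member == node:
                    break
            recursive = len(component) > 1 or node in deps[node]
            strata.append((component, recursive))

    for node in sorted(deps):
        if node not in index:
            visit(node)
    return strata
```

In this encoding, input relations show up as trivial non-recursive strata of their own, and a predicate that mentions itself is flagged as recursive even when its component has a single member.
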
The usual shape is:

```text
non-recursive strata
  -> recursive strata
  -> later strata that depend on earlier outputs
```

This matters for negation and recursion. Negation must be stratified so that no relation depends negatively on itself through a cycle. Recursive
strata need fixed-point evaluation.

---

## Optimization Focus

FlowLog's main contribution is not a new Datalog syntax. It is the optimization boundary between Datalog and Differential Dataflow.

The system uses a relational intermediate representation per rule. That lets the optimizer reason about joins, filters, subplans, and recursive
execution before lowering to physical dataflow operators.

Two important optimizations are:

- structural planning
- sideways information passing

Structural planning chooses join plans intended to avoid large intermediate results. It is robustness-oriented: avoid bad plans rather than assume
perfect cardinality estimates.

Sideways information passing uses semijoin-style prefiltering. It pushes known bindings or reachable values sideways through a rule so later joins see
less irrelevant input.

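The effect of semijoin-style prefiltering can be seen on a hypothetical rule `Q(x, z) :- Src(x), Edge(x, y), Edge(y, z).` The Python sketch below evaluates it twice: once joining the two `Edge` atoms first, and once filtering `Edge` by the `Src` bindings before the join. The rule, the function names, and the intermediate-size measure are illustrative assumptions, not FlowLog's actual rewrite.

```python
def two_hop_plain(src, edge):
    """Join Edge with Edge first, then filter by Src at the end."""
    pairs = [(x, z) for x, y in edge for y2, z in edge if y == y2]
    return {(x, z) for x, z in pairs if x in src}, len(pairs)

def two_hop_sip(src, edge):
    """Semijoin Edge(x, y) with Src(x) first, then pass y bindings sideways."""
    first = [(x, y) for x, y in edge if x in src]      # semijoin filter
    reach = {y for _, y in first}                      # bindings passed on
    second = [(y, z) for y, z in edge if y in reach]   # prefiltered second atom
    pairs = [(x, z) for x, y in first for y2, z in second if y == y2]
    return set(pairs), len(pairs)
```

Both functions return the same answer together with the number of intermediate pairs they built; with a selective `Src`, the prefiltered version builds fewer.
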
These two optimizations are complementary. Planning improves join shape. SIP reduces input size before the joins happen.

---

## Comparison with DBSP

FlowLog and DBSP live in the same design neighborhood:

```text
relational rules
  -> incremental computation
  -> maintained output relations
```

The backend model differs.

DBSP describes incremental view maintenance through streams, Z-sets, integration, differentiation, and circuit rewriting.

Differential Dataflow describes incremental collections with logical times and differences, and it maintains arrangements for efficient joins over
time.

For the CRDT and Geomerge notes, FlowLog is useful because it emphasizes a lesson that also applies to DBSP: recursive Datalog performance depends
heavily on physical planning. A declarative rule can be correct and still produce expensive intermediate state.

---

## When FlowLog Is Relevant

FlowLog is relevant when the problem has:

- recursive relational logic
- large graph-shaped inputs
- repeated joins inside fixed-point loops
- a need for incremental maintenance
- sensitivity to join order and memory use

It is less directly relevant when the problem is mostly point lookups, simple filters, or small non-recursive validation queries. Those can be handled
by simpler relational engines.

The strongest use case is a Datalog workload where rule-level optimization and incremental execution both matter.

---

## Practical Mental Model

FlowLog is best understood as:

```text
Datalog frontend
  + per-rule relational IR
  + recursive strata planning
  + robust join optimization
  + Differential Dataflow execution
```

Its central architectural choice is the split between logical Datalog planning and physical dataflow execution.

That split is what makes the system useful to study. It shows how a Datalog engine can reuse a general incremental backend without giving up
Datalog-specific optimization.

---

## Changelog

* **May 19, 2026** -- First version created from the FlowLog paper and artifact.

354
flowlog/002-flowlog-implementation.md
Normal file
@@ -0,0 +1,354 @@

# FlowLog Implementation

A reading note on the implementation shape of FlowLog's Rust artifact.

---

## Short Answer

FlowLog is implemented as a Rust workspace with separate crates for parsing, stratification, catalog construction, logical planning, optimization,
input reading, execution, and code-generation macros.

The implementation path is:

```text
.dl file
  -> parser
  -> strata
  -> catalog
  -> program query plan
  -> grouped strata plans
  -> Differential Dataflow dataflow
  -> output relation sizes or CSVs
```

The most important implementation idea is that FlowLog does not treat Differential Dataflow as a direct code-generation target from raw Datalog. It
first builds logical rule plans with explicit collection signatures and transformation flows.

---

## Workspace Shape

The artifact is organized as a Rust workspace with crates that line up with the execution pipeline.

`parsing` parses the Datalog dialect. It uses a grammar with declarations, input directives, output directives, rules, negation, comparisons,
arithmetic, and aggregation.

`strata` builds dependency information and groups rules into strata. Recursive rules are identified through the dependency graph.

`catalog` turns parsed rules into metadata. This includes atoms, head arguments, filters, comparisons, arithmetic expressions, aggregation heads, and
rule structure.

`planning` creates logical query plans from catalogs and strata. This is where rule bodies become transformation chains and where join structure is
represented.

`optimizing` chooses structural join plans. It reasons over the variable overlap among rule atoms and selects plan trees.

`reading` loads input relations from files and represents rows, relation sessions, semiring-like weights, and arrangements.

`executing` builds and runs the Differential Dataflow graph. It owns command-line handling, dataflow construction, operators, collectors, and output
inspection.

`macros` provides Rust macros that generate specialized operator code for different key and value arities.

---

## Frontend

The frontend grammar supports sections like:

```datalog
.in
.decl Arc(x: number, y: number)
.input Arc.csv

.printsize
.decl Tc(x: number, y: number)

.rule
Tc(x, y) :- Arc(x, y).
Tc(x, y) :- Arc(z, y), Tc(x, z).
```

The parser distinguishes:

- extensional declarations
- intensional declarations
- rule heads
- positive atoms
- negated atoms
- comparisons
- constants
- placeholders
- aggregate heads

This is more schema-driven than small teaching Datalog examples usually are. Relation declarations give the engine names and arities before planning.

---

## Strata and Program Plans

After parsing, FlowLog builds strata from rule dependencies.

The `ProgramQueryPlan` is created from strata. It iterates through each stratum, builds a `Catalog` for each rule, decides whether SIP and structural
planning apply, expands SIP rules when needed, and converts each catalog into a `RuleQueryPlan`.

The optimizer is activated only for rules with more than two core atoms. This is a pragmatic choice: a one-atom rule has no join to order, and a
two-atom rule has only one join, so optimizing them has little value.

The planning level also tracks whether a stratum is recursive. Recursive and non-recursive groups are executed differently later.

The key shape is:

```text
Strata
  -> Catalog per rule
  -> RuleQueryPlan per catalog
  -> GroupStrataQueryPlan
  -> ProgramQueryPlan
```

---

## Catalog Role

The catalog is the bridge between syntax and planning.

It records which rule atoms are core relational inputs, which terms are filters or constraints, which variables occur where, and how the rule head
relates to the body.

Planning needs this metadata to answer questions such as:

- Which atoms should participate in joins?
- Which atom arguments are shared variables?
- Which comparisons can be applied locally?
- Which projected fields are needed in the output?
- Which atoms are negated?
- Which rules derive the same output relation?

Without this catalog layer, the executor would have to rediscover semantic information from syntax during physical dataflow construction.

---

## Collection Signatures

FlowLog lowers relations into collection signatures that distinguish row, key, and value shapes.

Differential Dataflow joins are easiest to express over key-value collections. FlowLog therefore maps relation tuples into several physical forms:

- row collections for plain tuples
- key-only collections for semijoin and antijoin support
- key-value collections for joins

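The three forms can be pictured as different projections of the same tuples. The Python helpers below are a hypothetical illustration: the function names and the `key_cols` parameter are not from the artifact, which does this reshaping in Rust.

```python
def as_rows(tuples):
    """Row form: the tuple as-is."""
    return [tuple(t) for t in tuples]

def as_keys(tuples, key_cols):
    """Key-only form: project onto the key columns (semijoin/antijoin side)."""
    return [tuple(t[i] for i in key_cols) for t in tuples]

def as_key_values(tuples, key_cols):
    """Key-value form: split each tuple into (key, value) for joins."""
    out = []
    for t in tuples:
        key = tuple(t[i] for i in key_cols)
        value = tuple(v for i, v in enumerate(t) if i not in key_cols)
        out.append((key, value))
    return out
```

For `Arc` tuples `[(1, 2), (2, 3)]`, keying on column 0 gives `[((1,), (2,)), ((2,), (3,))]`, the shape a join on the first argument would consume.
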
The executor keeps maps for these forms:

```text
row_map
kv_map
k_map
```

This is a central implementation detail. A Datalog relation looks like one logical predicate, but execution may maintain several arranged physical
views of it.

---

## Transformation Flow

The planning layer represents operations with transformation flows.

A unary transformation maps one collection to another:

```text
KVToKV
```

This covers projection, filtering, local constraints, and reshaping a tuple into key-value form.

A binary transformation represents a join-like step:

```text
JnToKV
```

This maps joined key-value inputs into a new output key and value shape.

Each transformation flow tracks:

- output key arguments
- output value arguments
- local constraints
- comparison expressions
- how input fields flow to output fields

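As a rough picture of what such a flow carries, the Python sketch below models a unary key-value to key-value step. The class `UnaryFlow` and its fields are hypothetical stand-ins chosen for this note; they are not the artifact's Rust types.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Row = Tuple[int, ...]

@dataclass
class UnaryFlow:
    """Hypothetical stand-in for a unary transformation flow such as KVToKV."""
    out_key: List[int]    # input field positions that form the output key
    out_value: List[int]  # input field positions that form the output value
    constraints: List[Callable[[Row], bool]] = field(default_factory=list)

    def apply(self, collection):
        result = []
        for key, value in collection:
            fields = key + value  # flatten so positions index one tuple
            if all(check(fields) for check in self.constraints):
                result.append((tuple(fields[i] for i in self.out_key),
                               tuple(fields[i] for i in self.out_value)))
        return result
```

A flow with `out_key=[1]`, `out_value=[0]`, and the constraint `fields[0] < fields[1]` swaps key and value while dropping tuples that fail the comparison.
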
This lets the executor generate a specific Differential Dataflow operator while the planner remains backend-independent enough to reason about rule
structure.

---

## Structural Planning

The optimizer builds a plan tree for the core atoms of a rule.

The default plan is essentially a chain following the rule's atom order. The optimized plan searches for a better tree by looking at variable overlap
among atoms.

The optimizer uses a maximum-spanning-tree-style search over atom overlaps. Then it evaluates candidate trees with a width measure and depth
tie-breaker.

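The spanning-tree step can be illustrated with a Kruskal-style pass where the edge weight is the number of variables two atoms share. This Python sketch covers only that step, under made-up atom names; it does not reproduce FlowLog's candidate enumeration, width measure, or depth tie-breaker.

```python
from itertools import combinations

def overlap_tree(atoms):
    """Kruskal-style maximum spanning tree over shared-variable counts.

    `atoms` maps an atom label to its set of variables. Returns the chosen
    edges as (atom_a, atom_b, shared_variable_count).
    """
    edges = []
    for a, b in combinations(sorted(atoms), 2):
        shared = len(atoms[a] & atoms[b])
        if shared:  # zero overlap would be a cross product; skip it
            edges.append((shared, a, b))
    edges.sort(key=lambda e: (-e[0], e[1], e[2]))  # heaviest overlap first

    parent = {a: a for a in atoms}

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path compression
            a = parent[a]
        return a

    tree = []
    for shared, a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # joining two components keeps it a tree
            parent[root_a] = root_b
            tree.append((a, b, shared))
    return tree
```

Atoms that share more variables are connected first, which matches the stated preference for earlier joins between atoms that overlap.
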
The goal is not perfect cardinality estimation. The goal is robust plan shape:

- cross-product avoidance when possible
- smaller intermediate relation width
- earlier joins between atoms that share variables
- lower chance of large maintained join state

This fits recursive Datalog because reliable static cardinality estimates are hard. A robustness-oriented heuristic is often more useful than a
fragile cost model.

---

## Sideways Information Passing

Sideways information passing is a rule transformation that creates semijoin-style filters.

The practical goal is:

```text
known useful bindings
  -> prefilter later atoms
  -> smaller join inputs
  -> less intermediate state
```

In the implementation, enabling SIP can expand a catalog into multiple catalogs. For non-recursive strata, this may split one group into several
cascading groups so the generated filters can feed later steps.

This is why planning and stratification interact. SIP is not just a local operator rewrite. It can change the shape of the stratum plan.

---

## Executor Shape

The executor creates a Timely dataflow and then builds Differential Dataflow collections inside it.

At startup, it creates input sessions for every extensional relation. Those sessions are used to load facts from files.

For each stratum group, execution branches by recursion:

- non-recursive groups are built as straight-line transformations
- recursive groups are built inside an iterative scope

Non-recursive execution walks each transformation and constructs the matching dataflow operator. Outputs are stored back into `row_map`, `kv_map`, or
`k_map` depending on their physical shape.

Recursive execution creates iterative variables for intensional relations and repeatedly applies the recursive transformations until convergence.

Collectors merge rule outputs for the same intensional predicate. Inspectors print relation sizes or emit outputs.

---

## Physical Operators

The executor has operators corresponding to the physical collection shapes:

- row to row
- row to key
- row to key-value
- key-value join key-value
- key-value join key
- key join key
- cartesian product
- key-value antijoin key
- key antijoin key

The implementation arranges collections when needed. Arrangements are Differential Dataflow's indexed representation for joins and repeated access.

This is why FlowLog cares about key and value arity. The physical shape determines which macro-generated operator can be used and whether the runtime
needs a fallback representation.

---

## Arity Strategy

FlowLog uses specialized fixed-size representations for common arities and a fallback mode for wider tuples.

The program plan can compute maximal key and value arity pairs. If a query exceeds the threshold for the fixed-size representations, the fallback
"fat" mode is required.

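The decision can be pictured as a simple check over the key and value arities that appear in a plan. In the Python sketch below, the constant `FIXED_ARITY_LIMIT`, its value, and both function names are hypothetical; the note does not state the artifact's actual threshold.

```python
FIXED_ARITY_LIMIT = 4  # hypothetical threshold; the artifact's value may differ

def max_arities(signatures):
    """Return the largest key arity and value arity over all collections."""
    max_key = max((k for k, _ in signatures), default=0)
    max_value = max((v for _, v in signatures), default=0)
    return max_key, max_value

def needs_fat_mode(signatures, limit=FIXED_ARITY_LIMIT):
    """True when any key or value is wider than the fixed-size limit."""
    max_key, max_value = max_arities(signatures)
    return max_key > limit or max_value > limit
```

Here `signatures` is a list of `(key_arity, value_arity)` pairs, one per collection in the plan.
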
This is a performance engineering detail: Datalog workloads can produce wide intermediate tuples, but specializing small tuples can reduce allocation
and dynamic dispatch overhead.

---

## Batch and Incremental Weights

FlowLog has two build modes for weights.

Batch mode uses a presence-style difference type. This is suited for static Datalog workloads where a fact is either present or absent.

Incremental mode uses signed integer differences. This can represent insertions, deletions, and multiplicities.

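The difference between the two weight styles can be sketched in a few lines of Python. Both functions are illustrative only and their names are made up; the artifact selects its difference type at build time in Rust.

```python
def accumulate_presence(facts):
    """Batch-style weights: a fact is present once it has been seen at all."""
    return set(facts)

def accumulate_signed(updates):
    """Incremental-style weights: sum signed diffs and drop zero counts."""
    counts = {}
    for fact, diff in updates:
        counts[fact] = counts.get(fact, 0) + diff
    return {fact: n for fact, n in counts.items() if n != 0}
```

With presence weights, seeing `"a"` twice still yields one `"a"` and nothing can be retracted. With signed weights, `("b", +1)` followed by `("b", -1)` removes `"b"` again, and `"a"` seen twice keeps a multiplicity of 2.
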
At the implementation level, this means the same logical engine can target:

```text
static fixed-point computation
incremental maintenance over changing inputs
```

The paper's artifact focuses on batch benchmarks, but the backend model is compatible with incremental updates.

---

## Important Limitations

The artifact has several important limitations:

- release builds can be slow because of large Timely and Differential Dataflow dependencies
- aggregation support is constrained
- arithmetic in rule heads is unstable in the artifact version
- some optimization paths are controlled by flags or rule directives
- SIP currently has implementation-specific handling in stratum grouping
- output support is more oriented around relation sizes and CSV dumps than an embedded application API

These are acceptable for a research artifact, but they matter if comparing FlowLog to an embedded query engine for an application.

---

## Lessons for Other Engines

FlowLog's implementation suggests several reusable lessons.

A Datalog engine benefits from an explicit rule catalog. It gives optimization and execution a shared view of variables, atoms, filters, and heads.

Recursive evaluation should not hide join planning. The rules inside a fixed-point loop are where bad plans become expensive.

Physical arrangements are part of the query plan. If the backend needs key-value indexes, the logical planner should expose key choices explicitly.

Optimization can be robustness-first. Recursive workloads may not have stable enough statistics for a conventional cost model.

The frontend and backend should stay separated. Datalog syntax, relational rule planning, and Differential Dataflow execution are different concerns.

---

## Practical Mental Model

FlowLog's implementation can be read as:

```text
parser and schema loader
  + dependency and strata analyzer
  + rule catalog
  + relational transformation planner
  + robust join planner
  + SIP expander
  + Differential Dataflow executor
```

The implementation is valuable because it shows the concrete machinery needed between a compact Datalog rule and an efficient incremental dataflow
program.

---

## Changelog

* **May 19, 2026** -- First version created from the FlowLog paper and artifact.