# FlowLog Primer

A primer on FlowLog as a Datalog engine built on Differential Dataflow.

---

## Short Answer

FlowLog is a Datalog engine for recursive queries. It parses Datalog programs, stratifies rules, builds a relational intermediate representation, optimizes rule plans, and executes them with Differential Dataflow.

The main idea is:

```text
Datalog rules
  -> relational rule plans
  -> Differential Dataflow operators
  -> maintained derived relations
```

FlowLog is not only a parser for Datalog. It is a query engine design that keeps Datalog-specific optimization visible before the program is lowered to a streaming dataflow backend.

---

## Why It Exists

Datalog is useful for recursive computations:

- graph reachability
- transitive closure
- program analysis
- static analysis
- network and distributed-system rules
- recursive data-cleaning or constraint logic

The hard part is execution. Recursive Datalog can spend most of its time and memory on joins inside fixed-point loops. Bad join orders can create large intermediate relations, and the best order can vary by workload and iteration.

FlowLog tries to keep three properties together:

- Datalog-level expressiveness
- incremental and parallel execution
- query planning control before execution

The design uses Differential Dataflow as the physical backend, but it does not translate Datalog directly into low-level dataflow code. It first creates an intermediate representation where Datalog-aware rewrites can happen.

---

## Datalog Model

A Datalog program contains facts and rules.

Input relations are extensional database predicates:

```datalog
.in
.decl Arc(x: number, y: number)
.input Arc.csv
```

Derived relations are intensional database predicates:

```datalog
.printsize
.decl Tc(x: number, y: number)
```

Rules derive output facts from input and already-derived facts:

```datalog
.rule
Tc(x, y) :- Arc(x, y).
Tc(x, y) :- Arc(z, y), Tc(x, z).
```

This example computes transitive closure. The first rule copies direct edges into `Tc`. The second rule extends an existing path from `x` to `z` by one more arc from `z` to `y`, and it applies recursively until no new pairs appear.

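As a rough illustration of what the two `Tc` rules compute, the following Python sketch applies them repeatedly until no new facts appear. It is not how FlowLog evaluates the program; the sample `arc` facts and the `transitive_closure` helper are hypothetical names chosen for this example, and the loop is plain naive evaluation.

```python
# Hypothetical sample input; the primer does not specify the contents of Arc.csv.
arc = {(1, 2), (2, 3), (3, 4)}


def transitive_closure(arc):
    """Apply both Tc rules repeatedly until no new fact appears (naive evaluation)."""
    tc = set()
    while True:
        # Rule 1: Tc(x, y) :- Arc(x, y).
        derived = set(arc)
        # Rule 2: Tc(x, y) :- Arc(z, y), Tc(x, z).
        derived |= {(x, y) for (x, z) in tc for (z2, y) in arc if z == z2}
        if derived <= tc:
            # Fixed point: the rules produce nothing that Tc does not already hold.
            return tc
        tc |= derived


tc = transitive_closure(arc)
print(sorted(tc))
```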
---

## Language Features

FlowLog supports a practical Datalog dialect with:

- relation declarations
- CSV-style input and output directives
- recursive rules
- stratified negation
- comparisons
- arithmetic expressions
- placeholder arguments with `_`
- aggregation with `count`, `sum`, `min`, and `max`
- optimization directives such as `.plan`, `.sip`, and `.optimize`

Negation is written with `!`:

```datalog
indirect_only(x, z) :- edge(x, y), edge(y, z), !edge(x, z).
```

Aggregation appears in the head:

```datalog
count_paths(x, z, count(y)) :- edge(x, y), edge(y, z).
```
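To make the intended meaning of the two rules above concrete, here is a small Python sketch that evaluates them over a hypothetical `edge` relation. It assumes that `count(y)` groups by the other head variables `(x, z)` and counts the matching midpoints `y`; that reading, the sample facts, and the variable names are assumptions for illustration, not a statement about FlowLog's implementation.

```python
from collections import defaultdict

# Hypothetical sample facts for edge(x, y).
edge = {(1, 2), (2, 3), (1, 3), (3, 4), (1, 5), (5, 3)}

# Two-hop join shared by both rules: edge(x, y), edge(y, z).
two_hop = {(x, y, z) for (x, y) in edge for (y2, z) in edge if y == y2}

# indirect_only(x, z) :- edge(x, y), edge(y, z), !edge(x, z).
# The negated atom keeps a pair only when no direct edge exists.
indirect_only = {(x, z) for (x, _, z) in two_hop if (x, z) not in edge}

# count_paths(x, z, count(y)) :- edge(x, y), edge(y, z).
# Group by (x, z) and count the midpoints y (assumed aggregation semantics).
midpoints = defaultdict(set)
for x, y, z in two_hop:
    midpoints[(x, z)].add(y)
count_paths = {(x, z, len(ys)) for (x, z), ys in midpoints.items()}

print(sorted(indirect_only))
print(sorted(count_paths))
```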

The implementation has limits. Aggregation support is constrained, arithmetic in rule heads is not fully stable in the artifact version, and compile times can be high because the backend depends on Differential Dataflow and Timely Dataflow.

---

## Execution Modes

FlowLog has two execution modes.

Batch mode is the default. It is intended for static Datalog evaluation where the input facts are loaded and the derived relations are computed.

Incremental mode uses integer differences so changes can be tracked as insertions and retractions. This fits incremental view maintenance, where input updates should produce output updates.

The important distinction is:

```text
batch mode:
  compute the fixed point for an input dataset

incremental mode:
  maintain derived results as facts change
```

The paper benchmarks focus on batch execution, but the architecture is designed around incrementality.

---

## Differential Dataflow Role

Differential Dataflow represents collections as records with data, logical time, and a difference:

```text
(data, time, diff)
```

The `diff` field records multiplicity changes. Positive differences insert facts. Negative differences retract facts.

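The triple representation can be sketched in a few lines of Python. The example below accumulates `diff` values per record to recover the contents of a collection at a given time. It simplifies logical time to a single integer (Differential Dataflow timestamps are partially ordered in general), and the `updates` list and `collection_at` helper are hypothetical names, not part of any library.

```python
from collections import Counter

# Hypothetical update stream: each entry is (data, time, diff).
updates = [
    (("a", "b"), 0, +1),   # insert fact at time 0
    (("b", "c"), 0, +1),
    (("a", "b"), 1, -1),   # retract the first fact at time 1
    (("c", "d"), 1, +1),
]


def collection_at(updates, time):
    """Sum the diffs with timestamp <= time and keep records with nonzero multiplicity."""
    totals = Counter()
    for data, t, diff in updates:
        if t <= time:
            totals[data] += diff
    return {data: n for data, n in totals.items() if n != 0}


print(collection_at(updates, 0))
print(collection_at(updates, 1))
```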

Operators such as `map`, `filter`, `join`, `concat`, `distinct`, and `iterate` maintain output changes as input changes arrive. Joins use maintained indexes called arrangements.

This makes Differential Dataflow a useful backend for Datalog because:

- Datalog rules are relational queries.
- Recursive rules need fixed-point iteration.
- Semi-naive evaluation naturally works with deltas.
- Maintained arrangements can avoid repeated full scans.

FlowLog's job is to turn Datalog rules into a form that uses these backend operators efficiently.

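The link between semi-naive evaluation and deltas can be shown with a minimal Python sketch of the transitive-closure rules. Each round joins only the facts discovered in the previous round against an index on `Arc`, which plays a role loosely similar to an arrangement. This is plain semi-naive evaluation written for illustration, not FlowLog or Differential Dataflow code, and `semi_naive_tc` and `arc_by_src` are hypothetical names.

```python
from collections import defaultdict


def semi_naive_tc(arc):
    """Transitive closure where each round joins only the newest Tc facts (the delta)."""
    # Index Arc by its first column so each delta fact needs one lookup, not a scan.
    arc_by_src = defaultdict(set)
    for z, y in arc:
        arc_by_src[z].add(y)

    tc = set(arc)      # Tc(x, y) :- Arc(x, y).
    delta = set(arc)   # facts that are new in the current round
    rounds = 0
    while delta:
        rounds += 1
        # Tc(x, y) :- Arc(z, y), Tc(x, z), with Tc restricted to the delta.
        candidates = {(x, y) for (x, z) in delta for y in arc_by_src.get(z, ())}
        delta = candidates - tc
        tc |= delta
    return tc, rounds


closure, rounds = semi_naive_tc({(1, 2), (2, 3), (3, 4)})
print(sorted(closure), rounds)
```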
---

## Stratification

FlowLog groups rules into strata using the dependency graph of the program.

A rule depends on another rule if its body mentions the relation derived by that other rule. Mutually recursive rules fall into the same strongly connected component of that graph. The engine evaluates strata in dependency order.

The usual shape is:

```text
non-recursive strata
  -> recursive strata
  -> later strata that depend on earlier outputs
```

This matters for negation and recursion. Negation must be stratified so a rule does not negatively depend on itself through a cycle. Recursive strata need fixed-point evaluation.

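A minimal sketch of this idea, assuming dependencies are tracked per relation rather than per rule: compute strongly connected components with Tarjan's algorithm, which emits each component only after the components it depends on, and reject any component that contains a negated dependency. The rule encoding and the `stratify` function are hypothetical and are not taken from the FlowLog source.

```python
def stratify(rules):
    """rules: list of (head, [(body_relation, is_negated), ...]).
    Returns strata (sorted lists of relation names) in evaluation order."""
    deps = {}
    for head, body in rules:
        deps.setdefault(head, set())
        for rel, negated in body:
            deps[head].add((rel, negated))
            deps.setdefault(rel, set())

    index, low, on_stack, stack, strata = {}, {}, set(), [], []

    def visit(node):
        index[node] = low[node] = len(index)
        stack.append(node)
        on_stack.add(node)
        for dep, _ in sorted(deps[node]):
            if dep not in index:
                visit(dep)
                low[node] = min(low[node], low[dep])
            elif dep in on_stack:
                low[node] = min(low[node], index[dep])
        if low[node] == index[node]:
            # node is the root of a strongly connected component: pop it off the stack.
            component = []
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.append(member)
                if member == node:
                    break
            strata.append(sorted(component))

    for node in sorted(deps):
        if node not in index:
            visit(node)

    # Stratified negation: no negated dependency may stay inside one component.
    for component in strata:
        members = set(component)
        for rel in component:
            for dep, negated in deps[rel]:
                if negated and dep in members:
                    raise ValueError(f"negation through recursion: {rel} -> !{dep}")
    return strata


# Hypothetical program: Tc is recursive, NoPath negates Tc.
rules = [
    ("Tc", [("Arc", False)]),
    ("Tc", [("Arc", False), ("Tc", False)]),
    ("NoPath", [("Node", False), ("Tc", True)]),
]
print(stratify(rules))
```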
---

## Optimization Focus

FlowLog's main contribution is not a new Datalog syntax. It is the optimization boundary between Datalog and Differential Dataflow.

The system uses a relational intermediate representation per rule. That lets the optimizer reason about joins, filters, subplans, and recursive execution before lowering to physical dataflow operators.

Two important optimizations are:

- structural planning
- sideways information passing

Structural planning chooses join plans intended to avoid large intermediate results. It is robustness-oriented: avoid bad plans rather than assume perfect cardinality estimates.

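The cost of a bad join order is easy to reproduce on toy data. The Python sketch below evaluates the same three-way join with two different orders and compares the size of the intermediate result. It does not implement FlowLog's structural planner; the relations `R`, `S`, `T` and the `join` helper are hypothetical and exist only to show why the order matters.

```python
def join(left, right, lkey, rkey):
    """Hash join on left[lkey] == right[rkey]; concatenates matching tuples."""
    index = {}
    for rt in right:
        index.setdefault(rt[rkey], []).append(rt)
    return [lt + rt for lt in left for rt in index.get(lt[lkey], [])]


R = [(a, 0) for a in range(50)]   # R(a, b): every tuple has b = 0
S = [(0, c) for c in range(50)]   # S(b, c): b = 0 fans out to 50 values of c
T = [(7, "hit")]                  # T(c, d): only c = 7 survives

# Plan 1: (R join S) join T builds a large intermediate before T filters it.
rs = join(R, S, 1, 0)             # tuples (a, b, b, c)
plan1 = join(rs, T, 3, 0)         # tuples (a, b, b, c, c, d)

# Plan 2: (S join T) join R applies the selective relation first.
st = join(S, T, 1, 0)             # tuples (b, c, c, d)
plan2 = join(R, st, 1, 0)         # tuples (a, b, b, c, c, d)

print(len(rs), len(st), sorted(plan1) == sorted(plan2))
```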

Sideways information passing uses semijoin-style prefiltering. It pushes known bindings or reachable values sideways through a rule so later joins see less irrelevant input.

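A semijoin-style prefilter can be sketched as follows for a hypothetical rule `Result(x, z) :- Seed(x), Arc(x, y), Arc(y, z).` The sketch first restricts each `Arc` atom to the values that earlier atoms can bind, then joins the reduced inputs. The rule, the sample data, and the `two_hop` helper are assumptions for illustration and do not describe FlowLog's actual rewrite.

```python
def two_hop(first, second):
    """Join first(x, y) with second(y, z) and project to (x, z)."""
    by_src = {}
    for y, z in second:
        by_src.setdefault(y, []).append(z)
    return {(x, z) for x, y in first for z in by_src.get(y, [])}


arc = {(i, i + 1) for i in range(100)}   # hypothetical chain 0 -> 1 -> ... -> 100
seed = {3}

# Without SIP: join every arc with every arc, then apply the Seed binding.
unfiltered = {(x, z) for (x, z) in two_hop(arc, arc) if x in seed}

# With SIP: pass bindings sideways so each join input is reduced first.
first = {(x, y) for (x, y) in arc if x in seed}        # Arc semijoin Seed on x
reached = {y for (_, y) in first}
second = {(y, z) for (y, z) in arc if y in reached}    # Arc semijoin first on y
filtered = two_hop(first, second)

print(len(arc), len(first), len(second), filtered)
```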

These two optimizations are complementary. Planning improves join shape. SIP reduces input size before the joins happen.

---

## Comparison with DBSP

FlowLog and DBSP live in the same design neighborhood:

```text
relational rules
  -> incremental computation
  -> maintained output relations
```

The backend model differs.

DBSP describes incremental view maintenance through streams, Z-sets, integration, differentiation, and circuit rewriting.

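For readers unfamiliar with these terms, the Python sketch below models a Z-set as a dictionary from record to integer weight and shows differentiation and integration over a stream of snapshots, where integrating the differentiated stream returns the original stream. It is a conceptual toy with hypothetical helper names (`zadd`, `differentiate`, `integrate`), not the DBSP API.

```python
from collections import Counter


def zadd(a, b, sign=1):
    """Pointwise a + sign * b on Z-sets, dropping records whose weight becomes zero."""
    out = Counter(a)
    for rec, w in b.items():
        out[rec] += sign * w
    return {rec: w for rec, w in out.items() if w != 0}


def differentiate(stream):
    """D: each output is the current snapshot minus the previous snapshot."""
    prev, out = {}, []
    for snap in stream:
        out.append(zadd(snap, prev, -1))
        prev = snap
    return out


def integrate(stream):
    """I: each output is the running sum of all deltas seen so far."""
    acc, out = {}, []
    for delta in stream:
        acc = zadd(acc, delta)
        out.append(acc)
    return out


snapshots = [{"a": 1}, {"a": 1, "b": 1}, {"b": 1}]
deltas = differentiate(snapshots)
print(deltas)
print(integrate(deltas) == snapshots)
```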

Differential Dataflow describes incremental collections with logical times and differences, and it maintains arrangements for efficient joins over time.

For the CRDT and Geomerge notes, FlowLog is useful because it emphasizes a lesson that also applies to DBSP: recursive Datalog performance depends heavily on physical planning. A declarative rule can be correct and still produce expensive intermediate state.

---

## When FlowLog Is Relevant

FlowLog is relevant when the problem has:

- recursive relational logic
- large graph-shaped inputs
- repeated joins inside fixed-point loops
- a need for incremental maintenance
- sensitivity to join order and memory use

It is less directly relevant when the problem is mostly point lookups, simple filters, or small non-recursive validation queries. Those can be handled by simpler relational engines.

The strongest use case is a Datalog workload where rule-level optimization and incremental execution both matter.

---

## Practical Mental Model

FlowLog is best understood as:

```text
Datalog frontend
  + per-rule relational IR
  + recursive strata planning
  + robust join optimization
  + Differential Dataflow execution
```

Its central architectural choice is the split between logical Datalog planning and physical dataflow execution.

That split is what makes the system useful to study. It shows how a Datalog engine can reuse a general incremental backend without giving up Datalog-specific optimization.

---

## Changelog

* **May 19, 2026** -- First version created from the FlowLog paper and artifact.