# FlowLog Primer

A primer on FlowLog as a Datalog engine built on Differential Dataflow.

---

## Short Answer

FlowLog is a Datalog engine for recursive queries. It parses Datalog programs, stratifies rules, builds a relational intermediate representation,
optimizes rule plans, and executes them with Differential Dataflow.

The main idea is:

```text
Datalog rules
-> relational rule plans
-> Differential Dataflow operators
-> maintained derived relations
```

FlowLog is not only a parser for Datalog. It is a query engine design that keeps Datalog-specific optimization visible before the program is lowered
to a streaming dataflow backend.

---

## Why It Exists

Datalog is useful for recursive computations:

- graph reachability
- transitive closure
- program analysis
- static analysis
- network and distributed-system rules
- recursive data-cleaning or constraint logic

The hard part is execution. Recursive Datalog can spend most of its time and memory on joins inside fixed-point loops. Bad join orders can create
large intermediate relations, and the best order can vary by workload and iteration.

FlowLog tries to keep three properties together:

- Datalog-level expressiveness
- incremental and parallel execution
- query planning control before execution

The design uses Differential Dataflow as the physical backend, but it does not translate Datalog directly into low-level dataflow code. It first
creates an intermediate representation where Datalog-aware rewrites can happen.

---

## Datalog Model

A Datalog program contains facts and rules.

Input relations are extensional database predicates:

```datalog
.in
.decl Arc(x: number, y: number)
.input Arc.csv
```

Derived relations are intensional database predicates:

```datalog
.printsize
.decl Tc(x: number, y: number)
```

Rules derive output facts from input and already-derived facts:

```datalog
.rule
Tc(x, y) :- Arc(x, y).
Tc(x, y) :- Arc(z, y), Tc(x, z).
```

This example computes transitive closure. The first rule copies direct edges into `Tc`. The second rule recursively extends paths.
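
To make the recursion concrete, here is a minimal Python sketch of semi-naive evaluation for these two rules. It illustrates the textbook technique only and is not FlowLog's executor; the function name `transitive_closure` and the sample `arcs` are made up for this note.

```python
def transitive_closure(arcs):
    """Semi-naive evaluation of:
         Tc(x, y) :- Arc(x, y).
         Tc(x, y) :- Arc(z, y), Tc(x, z).
    """
    # Index Arc by its first column so the join on z becomes a lookup.
    successors = {}
    for z, y in arcs:
        successors.setdefault(z, set()).add(y)

    tc = set(arcs)     # first rule: copy direct edges
    delta = set(arcs)  # facts that were new in the latest round
    while delta:
        # Second rule, applied only to the newest Tc facts.
        derived = {(x, y) for (x, z) in delta for y in successors.get(z, ())}
        delta = derived - tc  # keep only genuinely new facts
        tc |= delta
    return tc


arcs = {(1, 2), (2, 3), (3, 4)}
closure = transitive_closure(arcs)
```

The loop stops when a round derives nothing new, which is the fixed point. Joining only `delta` against `Arc`, rather than all of `Tc`, is what keeps each round proportional to the new facts.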

---

## Language Features

FlowLog supports a practical Datalog dialect with:

- relation declarations
- CSV-style input and output directives
- recursive rules
- stratified negation
- comparisons
- arithmetic expressions
- placeholder arguments with `_`
- aggregation with `count`, `sum`, `min`, and `max`
- optimization directives such as `.plan`, `.sip`, and `.optimize`

Negation is written with `!`:

```datalog
indirect_only(x, z) :- edge(x, y), edge(y, z), !edge(x, z).
```
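
As a rough illustration of what this rule means, the following Python sketch evaluates it over a small, made-up `edge` relation. It assumes `edge` is already complete when the negated literal is checked, which is the guarantee stratification provides.

```python
edge = {(1, 2), (2, 3), (1, 3), (3, 4)}

# Positive part of the body: edge(x, y), edge(y, z), joined on y.
two_hop = {(x, z) for (x, y) in edge for (y2, z) in edge if y == y2}

# Negated literal: keep only pairs that are absent from the complete edge relation.
indirect_only = {(x, z) for (x, z) in two_hop if (x, z) not in edge}
```

Here `(1, 3)` is reachable in two hops but is also a direct edge, so it is filtered out.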

Aggregation appears in the head:

```datalog
count_paths(x, z, count(y)) :- edge(x, y), edge(y, z).
```
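
A small Python sketch of the grouping this rule implies, assuming `count(y)` counts the distinct `y` bindings for each `(x, z)` group. That reading is an assumption of this note, and the sample `edge` data is made up.

```python
from collections import defaultdict

edge = {(1, 2), (1, 3), (2, 4), (3, 4), (4, 5)}

# Group the join results by the non-aggregated head arguments (x, z).
witnesses = defaultdict(set)
for (x, y) in edge:
    for (y2, z) in edge:
        if y == y2:
            witnesses[(x, z)].add(y)

# One output fact per group, with the aggregate in the last position.
count_paths = {(x, z, len(ys)) for (x, z), ys in witnesses.items()}
```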

The implementation has limits. Aggregation support is constrained, arithmetic in rule heads is not fully stable in the artifact version, and compile
times can be high because the backend depends on Differential Dataflow and Timely Dataflow.

---

## Execution Modes

FlowLog has two execution modes.

Batch mode is the default. It is intended for static Datalog evaluation: the input facts are loaded once and the derived relations are computed to a fixed point.

Incremental mode attaches signed integer differences to facts so that changes are tracked as insertions and retractions. This fits incremental view
maintenance, where input updates should produce output updates.

The important distinction is:

```text
batch mode:
  compute the fixed point for an input dataset

incremental mode:
  maintain derived results as facts change
```

The paper's benchmarks focus on batch execution, but the architecture is designed around incrementality.

---

## Differential Dataflow Role

Differential Dataflow represents a collection as a stream of update records, each carrying data, a logical time, and a difference:

```text
(data, time, diff)
```

The `diff` field records multiplicity changes. Positive differences insert facts. Negative differences retract facts.
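
The following Python sketch shows how such update records add up to the contents of a collection. The helper `contents_as_of` and the sample updates are hypothetical, and times are plain integers here for simplicity, whereas Differential Dataflow also supports partially ordered times.

```python
from collections import Counter


def contents_as_of(updates, as_of):
    """Sum the diffs of every update with time <= as_of, per record."""
    totals = Counter()
    for data, time, diff in updates:
        if time <= as_of:
            totals[data] += diff
    # Records whose diffs cancel out are not part of the collection.
    return {data: n for data, n in totals.items() if n != 0}


updates = [
    (("a", "b"), 0, +1),  # insert at time 0
    (("b", "c"), 0, +1),
    (("a", "b"), 1, -1),  # retract at time 1
    (("a", "c"), 1, +1),
]
```

Reading the collection at time 0 yields the two original facts; at time 1 the retraction of `("a", "b")` cancels its insertion.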

Operators such as `map`, `filter`, `join`, `concat`, `distinct`, and `iterate` maintain output changes as input changes arrive. Joins use maintained
indexes called arrangements.
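
To give a feel for what maintaining a join means, here is a toy Python sketch of a two-input join that emits output diffs as input diffs arrive. The class `IncrementalJoin` is hypothetical and heavily simplified: nested dictionaries stand in for arrangements, and logical times are left out.

```python
from collections import defaultdict


class IncrementalJoin:
    """Maintains left JOIN right on a key and emits output diffs per input diff."""

    def __init__(self):
        # key -> value -> multiplicity; a stand-in for an arrangement.
        self.left = defaultdict(lambda: defaultdict(int))
        self.right = defaultdict(lambda: defaultdict(int))

    def update_left(self, key, value, diff):
        # A change on the left pairs with everything currently on the right.
        out = [((key, value, r), diff * m) for r, m in self.right[key].items() if m]
        self.left[key][value] += diff
        return out

    def update_right(self, key, value, diff):
        # A change on the right pairs with everything currently on the left.
        out = [((key, l, value), diff * m) for l, m in self.left[key].items() if m]
        self.right[key][value] += diff
        return out


join = IncrementalJoin()
first = join.update_left("k", "l1", +1)
second = join.update_right("k", "r1", +1)
third = join.update_left("k", "l2", +1)
fourth = join.update_right("k", "r1", -1)
```

Each update touches only the index entries for its own key, so no input is rescanned in full. Retracting `r1` produces negative diffs for every output row it had contributed to.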

This makes Differential Dataflow a useful backend for Datalog because:

- Datalog rules are relational queries.
- Recursive rules need fixed-point iteration.
- Semi-naive evaluation naturally works with deltas.
- Maintained arrangements can avoid repeated full scans.

FlowLog's job is to turn Datalog rules into a form that uses these backend operators efficiently.

---

## Stratification

FlowLog groups rules into strata using the dependency graph of the program.

A rule depends on another rule if its body mentions the relation derived by that other rule. Mutually recursive rules fall into the same strongly
connected component of this graph. The engine evaluates strata in dependency order.

The usual shape is:

```text
non-recursive strata
-> recursive strata
-> later strata that depend on earlier outputs
```

This matters for negation and recursion. Negation must be stratified so a rule does not negatively depend on itself through a cycle. Recursive strata
need fixed-point evaluation.
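
The standard way to compute such strata is to take the strongly connected components of the dependency graph in dependency order. The Python sketch below does this with Tarjan's algorithm, using relation names as graph nodes. It is not FlowLog's implementation; `stratify`, the rule encoding, and the `Unreachable` relation are assumptions made for this note.

```python
def stratify(rules):
    """rules: list of (head, [(body_relation, is_negated), ...]).
    Returns strata (sets of relation names) in evaluation order."""
    deps = {}
    for head, body in rules:
        deps.setdefault(head, set())
        for rel, _ in body:
            deps[head].add(rel)
            deps.setdefault(rel, set())

    # Tarjan's algorithm emits each component after the components it depends on.
    index, low, on_stack, stack, sccs = {}, {}, set(), [], []

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in deps[v]:
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:
            scc = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                scc.add(w)
                if w == v:
                    break
            sccs.append(scc)

    for v in sorted(deps):
        if v not in index:
            visit(v)

    # Negation inside a component means the program is not stratifiable.
    component = {v: i for i, scc in enumerate(sccs) for v in scc}
    for head, body in rules:
        for rel, negated in body:
            if negated and component[rel] == component[head]:
                raise ValueError(f"{head} negatively depends on {rel} within a cycle")
    return sccs


rules = [
    ("Tc", [("Arc", False)]),
    ("Tc", [("Arc", False), ("Tc", False)]),
    ("Unreachable", [("Node", False), ("Tc", True)]),  # hypothetical rule with !Tc
]
strata = stratify(rules)
```

In this example `Tc` forms a recursive stratum of its own, and `Unreachable` lands in a later stratum because it negates `Tc`.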

---

## Optimization Focus

FlowLog's main contribution is not a new Datalog syntax. It is the optimization boundary between Datalog and Differential Dataflow.

The system uses a relational intermediate representation per rule. That lets the optimizer reason about joins, filters, subplans, and recursive
execution before lowering to physical dataflow operators.

Two important optimizations are:

- structural planning
- sideways information passing

Structural planning chooses join plans intended to avoid large intermediate results. It is robustness-oriented: it aims to avoid bad plans rather than
rely on accurate cardinality estimates.
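
The Python sketch below shows why join order matters, using a made-up three-way rule `Q(a, d) :- R(a, b), S(b, c), T(c, d)`. It only demonstrates the effect of plan shape on intermediate size; it does not reproduce FlowLog's planning algorithm, and the helper `join` and the data are hypothetical.

```python
def join(left, right):
    """Join on: last column of left == first column of right."""
    index = {}
    for row in right:
        index.setdefault(row[0], []).append(row)
    return {lrow + rrow[1:] for lrow in left for rrow in index.get(lrow[-1], ())}


r = {(a, 0) for a in range(20)}  # every R tuple shares b = 0
s = {(0, c) for c in range(20)}  # every S tuple shares b = 0
t = {(5, 99)}                    # T is highly selective

# Plan A: (R join S) join T builds a large intermediate first.
plan_a_intermediate = join(r, s)
plan_a = join(plan_a_intermediate, t)

# Plan B: R join (S join T) applies the selective relation early.
plan_b_intermediate = join(s, t)
plan_b = join(r, plan_b_intermediate)
```

Both plans return the same 20 result tuples, but plan A materializes 400 intermediate tuples where plan B materializes one.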

Sideways information passing uses semijoin-style prefiltering. It pushes known bindings or reachable values sideways through a rule so later joins see
less irrelevant input.

These two optimizations are complementary. Planning improves join shape. SIP reduces input size before the joins happen.
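
The semijoin-style prefiltering behind SIP can be sketched in Python as follows, for a made-up rule `Q(x, z) :- R(x, y), S(y, z), T(z)`. The helper `join_on_y` and the data are hypothetical, and the sketch shows only the general idea rather than FlowLog's rewrite.

```python
def join_on_y(r, s):
    """R(x, y) joined with S(y, z) on y."""
    by_y = {}
    for y, z in s:
        by_y.setdefault(y, []).append(z)
    return {(x, y, z) for (x, y) in r for z in by_y.get(y, ())}


r = {(1, 10), (2, 20)}
s = {(10, 100), (10, 101), (20, 200), (30, 300), (40, 400)}
t = {100, 200}

# Without SIP: join first, filter on T afterwards.
unfiltered = join_on_y(r, s)
plain = {(x, z) for (x, y, z) in unfiltered if z in t}

# With SIP: semijoin S against the bindings known from R and T before joining.
bound_y = {y for (_, y) in r}
s_reduced = {(y, z) for (y, z) in s if y in bound_y and z in t}
prefiltered = join_on_y(r, s_reduced)
with_sip = {(x, z) for (x, y, z) in prefiltered}
```

The answers are identical, but the join over `s_reduced` never sees the `S` tuples that could not contribute to the result.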

---

## Comparison with DBSP

FlowLog and DBSP live in the same design neighborhood:

```text
relational rules
-> incremental computation
-> maintained output relations
```

The backend model differs.

DBSP describes incremental view maintenance through streams, Z-sets, integration, differentiation, and circuit rewriting.
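
As a rough illustration of that vocabulary, the Python sketch below models a Z-set as a dictionary from records to integer weights and a stream as a list of Z-sets. The helpers `zadd`, `integrate`, and `differentiate` are hypothetical names for this note, not DBSP's API.

```python
def zadd(a, b, sign=1):
    """Add Z-set b (scaled by sign) to Z-set a, dropping zero weights."""
    out = dict(a)
    for record, weight in b.items():
        out[record] = out.get(record, 0) + sign * weight
    return {record: w for record, w in out.items() if w != 0}


def integrate(stream):
    """Running sum: turns a stream of changes into a stream of snapshots."""
    total, out = {}, []
    for delta in stream:
        total = zadd(total, delta)
        out.append(total)
    return out


def differentiate(stream):
    """Successive differences: turns snapshots back into changes."""
    prev, out = {}, []
    for snapshot in stream:
        out.append(zadd(snapshot, prev, sign=-1))
        prev = snapshot
    return out


deltas = [{"a": 1, "b": 1}, {"a": -1, "c": 1}, {"b": 1}]
snapshots = integrate(deltas)
```

Differentiating the integrated stream returns the original changes, which is the inverse relationship that DBSP's incremental rewrites rely on.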

Differential Dataflow describes incremental collections with logical times and differences, and it maintains arrangements for efficient joins over
time.

For the CRDT and Geomerge notes, FlowLog is useful because it emphasizes a lesson that also applies to DBSP: recursive Datalog performance depends
heavily on physical planning. A declarative rule can be correct and still produce expensive intermediate state.

---

## When FlowLog Is Relevant

FlowLog is relevant when the problem has:

- recursive relational logic
- large graph-shaped inputs
- repeated joins inside fixed-point loops
- a need for incremental maintenance
- sensitivity to join order and memory use

It is less directly relevant when the problem is mostly point lookups, simple filters, or small non-recursive validation queries. Those can be handled
by simpler relational engines.

The strongest use case is a Datalog workload where rule-level optimization and incremental execution both matter.

---

## Practical Mental Model

FlowLog is best understood as:

```text
Datalog frontend
+ per-rule relational IR
+ recursive strata planning
+ robust join optimization
+ Differential Dataflow execution
```

Its central architectural choice is the split between logical Datalog planning and physical dataflow execution.

That split is what makes the system useful to study. It shows how a Datalog engine can reuse a general incremental backend without giving up
Datalog-specific optimization.

---

## Changelog

* **May 19, 2026** -- First version created from the FlowLog paper and artifact.