389 lines
9.0 KiB
Markdown
389 lines
9.0 KiB
Markdown
|
|
# FlowLog Technical Planning Notes
|
||
|
|
|
||
|
|
A technical note on the FlowLog planning layer: catalogs, transformations, key-value shapes, and recursive execution.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Short Answer
|
||
|
|
|
||
|
|
FlowLog's most reusable technical idea is not the Datalog syntax or the Differential Dataflow backend. It is the planner boundary between them.
|
||
|
|
|
||
|
|
The planner turns a rule into a sequence of typed relational transformations:
|
||
|
|
|
||
|
|
```text
|
||
|
|
rule atoms and variables
|
||
|
|
-> catalog metadata
|
||
|
|
-> collection signatures
|
||
|
|
-> transformation flows
|
||
|
|
-> physical operator choices
|
||
|
|
```
|
||
|
|
|
||
|
|
That layer is useful because it records how variables move through projections, filters, joins, antijoins, and recursive strata before the backend starts maintaining state.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Catalog as Rule Metadata
|
||
|
|
|
||
|
|
The catalog is the rule-level semantic summary.
|
||
|
|
|
||
|
|
For each rule, it needs to know:
|
||
|
|
|
||
|
|
- the head relation
|
||
|
|
- the body atoms
|
||
|
|
- which atoms are positive
|
||
|
|
- which atoms are negated
|
||
|
|
- which atoms are core join inputs
|
||
|
|
- where each variable appears
|
||
|
|
- which constants constrain fields
|
||
|
|
- which comparisons constrain tuples
|
||
|
|
- which output fields are projected into the head
|
||
|
|
|
||
|
|
This is the information needed to go from syntax to a relational plan.
|
||
|
|
|
||
|
|
For example:
|
||
|
|
|
||
|
|
```text
|
||
|
|
violation(x, z) :-
|
||
|
|
A(x, y),
|
||
|
|
B(y, z),
|
||
|
|
not C(x, z),
|
||
|
|
x != z.
|
||
|
|
```
|
||
|
|
|
||
|
|
The catalog should make the following facts explicit:
|
||
|
|
|
||
|
|
- `A` and `B` are positive join inputs.
|
||
|
|
- `C` is a negated input.
|
||
|
|
- `y` joins `A` to `B`.
|
||
|
|
- `(x, z)` is needed for the output and for the antijoin against `C`.
|
||
|
|
- `x != z` is a filter after both variables are available.
|
||
|
|
|
||
|
|
Without this catalog, the executor has to rediscover planning information from rule syntax.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Collection Shapes
|
||
|
|
|
||
|
|
FlowLog lowers logical relations into physical collection shapes.
|
||
|
|
|
||
|
|
The main shapes are:
|
||
|
|
|
||
|
|
```text
|
||
|
|
row
|
||
|
|
key
|
||
|
|
key-value
|
||
|
|
```
|
||
|
|
|
||
|
|
A row collection is a plain tuple relation.
|
||
|
|
|
||
|
|
A key-only collection is useful for semijoins and antijoins. It represents membership of keys.
|
||
|
|
|
||
|
|
A key-value collection is useful for joins. The key is the join attribute set, and the value is the payload carried forward.
|
||
|
|
|
||
|
|
The same logical relation may need several physical views:
|
||
|
|
|
||
|
|
```text
|
||
|
|
Arc(x, y)
|
||
|
|
-> row view: (x, y)
|
||
|
|
-> key view: key=(x)
|
||
|
|
-> key-value view: key=(x), value=(y)
|
||
|
|
-> key-value view: key=(y), value=(x)
|
||
|
|
```
|
||
|
|
|
||
|
|
This is a central planning choice. The key determines which arrangement or maintained index the backend can use.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Transformation Types
|
||
|
|
|
||
|
|
FlowLog's transformations separate unary reshaping from binary combination.
|
||
|
|
|
||
|
|
Unary transformations include:
|
||
|
|
|
||
|
|
- row to row
|
||
|
|
- row to key
|
||
|
|
- row to key-value
|
||
|
|
- key-value to key-value
|
||
|
|
- key-value to key
|
||
|
|
|
||
|
|
These cover:
|
||
|
|
|
||
|
|
- projection
|
||
|
|
- filtering
|
||
|
|
- constant checks
|
||
|
|
- equality checks
|
||
|
|
- comparison checks
|
||
|
|
- arranging a relation by a join key
|
||
|
|
- dropping fields that are no longer needed
|
||
|
|
|
||
|
|
Binary transformations include:
|
||
|
|
|
||
|
|
- key join key
|
||
|
|
- key-value join key
|
||
|
|
- key-value join key-value
|
||
|
|
- cartesian product
|
||
|
|
- key antijoin key
|
||
|
|
- key-value antijoin key
|
||
|
|
|
||
|
|
These cover joins and negation. The planner must choose both the inputs and the output shape.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Transformation Flow
|
||
|
|
|
||
|
|
A transformation flow records how input fields become output fields.
|
||
|
|
|
||
|
|
For a unary transformation, the flow answers:
|
||
|
|
|
||
|
|
```text
|
||
|
|
Which input fields form the new key?
|
||
|
|
Which input fields remain as value?
|
||
|
|
Which constants and comparisons filter rows?
|
||
|
|
```
|
||
|
|
|
||
|
|
For a binary transformation, the flow answers:
|
||
|
|
|
||
|
|
```text
|
||
|
|
Which fields came from the left input?
|
||
|
|
Which fields came from the right input?
|
||
|
|
Which joined fields are retained?
|
||
|
|
Which new key should the output use?
|
||
|
|
Which payload fields must continue to the next step?
|
||
|
|
```
|
||
|
|
|
||
|
|
This matters because Datalog variables are logical names, but the backend sees tuple positions.
|
||
|
|
|
||
|
|
The planner's job is to keep those two worlds aligned.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Join Graph
|
||
|
|
|
||
|
|
A rule body induces a join graph.
|
||
|
|
|
||
|
|
Atoms are nodes. Shared variables are edges or hyperedges between atoms.
|
||
|
|
|
||
|
|
Example:
|
||
|
|
|
||
|
|
```text
|
||
|
|
R(a, b, c) :-
|
||
|
|
A(a, x),
|
||
|
|
B(x, y),
|
||
|
|
C(y, c).
|
||
|
|
```
|
||
|
|
|
||
|
|
The join graph is a chain:
|
||
|
|
|
||
|
|
```text
|
||
|
|
A --x-- B --y-- C
|
||
|
|
```
|
||
|
|
|
||
|
|
A rule like this is sensitive to join order:
|
||
|
|
|
||
|
|
```text
|
||
|
|
R(a, d) :-
|
||
|
|
A(a, x),
|
||
|
|
B(x, y),
|
||
|
|
C(y, z),
|
||
|
|
D(z, d).
|
||
|
|
```
|
||
|
|
|
||
|
|
Joining `A` with `D` first is a cross product. Joining adjacent atoms first preserves bindings and reduces intermediate results.
|
||
|
|
|
||
|
|
FlowLog's structural planning uses variable overlap to choose a plan tree that keeps joins connected and intermediate width smaller.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Width-Oriented Planning
|
||
|
|
|
||
|
|
FlowLog's planner is robustness-oriented rather than fully cost-based.
|
||
|
|
|
||
|
|
A conventional cost model needs statistics:
|
||
|
|
|
||
|
|
- relation sizes
|
||
|
|
- distinct counts
|
||
|
|
- skew
|
||
|
|
- selectivity
|
||
|
|
- correlation
|
||
|
|
|
||
|
|
Recursive Datalog makes those estimates unstable because intermediate relations change across fixed-point iterations.
|
||
|
|
|
||
|
|
FlowLog instead uses structural signals:
|
||
|
|
|
||
|
|
- how many variables two atoms share
|
||
|
|
- how many variables an intermediate result must carry
|
||
|
|
- whether a candidate plan creates disconnected joins
|
||
|
|
- how deep the plan tree becomes
|
||
|
|
|
||
|
|
This is not guaranteed to be optimal. It is meant to avoid obviously bad plans.
|
||
|
|
|
||
|
|
That is a good fit for DBSP-backed work too, because a bad plan becomes maintained operator state.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Antijoin Timing
|
||
|
|
|
||
|
|
Negated atoms become antijoins.
|
||
|
|
|
||
|
|
An antijoin can only run after all of its variables are bound by prior positive atoms.
|
||
|
|
|
||
|
|
Example:
|
||
|
|
|
||
|
|
```text
|
||
|
|
missing_src(graph, src) :-
|
||
|
|
edge(graph, src, dst),
|
||
|
|
not vertex(graph, src).
|
||
|
|
```
|
||
|
|
|
||
|
|
The antijoin against `vertex(graph, src)` can run immediately after `edge` because both `graph` and `src` are available.
|
||
|
|
|
||
|
|
In a larger rule:
|
||
|
|
|
||
|
|
```text
|
||
|
|
bad(x, z) :-
|
||
|
|
A(x, y),
|
||
|
|
B(y, z),
|
||
|
|
C(z, w),
|
||
|
|
not D(x, z).
|
||
|
|
```
|
||
|
|
|
||
|
|
The antijoin against `D(x, z)` can run after `A` and `B`; it does not need to wait for `C`. Running it earlier may reduce the input to the later join with `C`.
|
||
|
|
|
||
|
|
This is the same issue as antijoin pushdown in the DBSP CRDT note.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Sideways Information Passing
|
||
|
|
|
||
|
|
Sideways information passing is semijoin-style filtering across a rule.
|
||
|
|
|
||
|
|
The intuition is:
|
||
|
|
|
||
|
|
```text
|
||
|
|
derive useful keys
|
||
|
|
-> filter another relation to those keys
|
||
|
|
-> join less data
|
||
|
|
```
|
||
|
|
|
||
|
|
Example:
|
||
|
|
|
||
|
|
```text
|
||
|
|
Reach(y) :- Reach(x), Arc(x, y).
|
||
|
|
```
|
||
|
|
|
||
|
|
If the current delta contains only a small set of `Reach(x)` values, then `Arc` only needs edges whose source is in that set. A semijoin can prefilter `Arc` before the recursive join.
|
||
|
|
|
||
|
|
For CRDT causal readiness, this suggests a physical plan centered on frontier operations:
|
||
|
|
|
||
|
|
```text
|
||
|
|
new ready operations
|
||
|
|
-> candidate outgoing pred edges
|
||
|
|
-> newly ready operations
|
||
|
|
```
|
||
|
|
|
||
|
|
rather than a plan that repeatedly starts from roots.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Recursive Strata
|
||
|
|
|
||
|
|
Recursive rules require fixed-point execution.
|
||
|
|
|
||
|
|
FlowLog groups recursive rules into recursive strata, then executes them inside an iterative dataflow scope.
|
||
|
|
|
||
|
|
The important design point is that a recursive stratum can contain several rules deriving related IDBs. The planner must know:
|
||
|
|
|
||
|
|
- which IDBs are loop variables
|
||
|
|
- which relations enter the recursive scope from earlier strata
|
||
|
|
- which outputs must be collected after convergence
|
||
|
|
- which intermediate arrangements are useful across iterations
|
||
|
|
|
||
|
|
For DBSP, this maps to recursive circuits with feedback and delay. The frontend still needs the same rule-level information before it can produce a good circuit.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Subplan Sharing
|
||
|
|
|
||
|
|
Multiple rules may derive the same relation, or several relations may reuse the same intermediate computation.
|
||
|
|
|
||
|
|
Example:
|
||
|
|
|
||
|
|
```text
|
||
|
|
required_src(g, s) :- edge(g, s, d).
|
||
|
|
required_dst(g, d) :- edge(g, s, d).
|
||
|
|
```
|
||
|
|
|
||
|
|
Both rules scan or project from `edge`.
|
||
|
|
|
||
|
|
In larger Geomerge theories, many violation rules may share antecedent fragments. A planner should be able to notice common subplans:
|
||
|
|
|
||
|
|
```text
|
||
|
|
common_antecedent(x, y)
|
||
|
|
-> violation_a(x)
|
||
|
|
-> violation_b(y)
|
||
|
|
```
|
||
|
|
|
||
|
|
FlowLog's explicit rule plans and collection signatures are a useful place to represent this sharing.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Physical Key Choice
|
||
|
|
|
||
|
|
Backend performance depends on key choice.
|
||
|
|
|
||
|
|
For a join:
|
||
|
|
|
||
|
|
```text
|
||
|
|
R(x, z) :- A(x, y), B(y, z).
|
||
|
|
```
|
||
|
|
|
||
|
|
both `A` and `B` should be arranged by `y`.
|
||
|
|
|
||
|
|
For a later join:
|
||
|
|
|
||
|
|
```text
|
||
|
|
S(x, w) :- R(x, z), C(z, w).
|
||
|
|
```
|
||
|
|
|
||
|
|
the output of the first join may need to be arranged by `z`, not by `x`.
|
||
|
|
|
||
|
|
That means the planner should choose output keys based on the next operation, not only the current operation.
|
||
|
|
|
||
|
|
This is one reason a simple relational algebra tree is not enough. The physical plan needs key and payload annotations.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Transfer to a DBSP Frontend
|
||
|
|
|
||
|
|
A DBSP frontend inspired by FlowLog should probably have these data structures:
|
||
|
|
|
||
|
|
- relation schemas
|
||
|
|
- rule catalogs
|
||
|
|
- variable occurrence maps
|
||
|
|
- dependency graph
|
||
|
|
- strata
|
||
|
|
- join graph per rule
|
||
|
|
- logical relational plan
|
||
|
|
- physical key annotations
|
||
|
|
- backend lowering rules
|
||
|
|
|
||
|
|
The lowering should treat DBSP as the maintained execution backend:
|
||
|
|
|
||
|
|
```text
|
||
|
|
projection -> DBSP projection
|
||
|
|
selection -> DBSP filter
|
||
|
|
join -> DBSP join with maintained state
|
||
|
|
antijoin -> DBSP antijoin or difference plan
|
||
|
|
union -> DBSP addition or union
|
||
|
|
distinct -> DBSP distinct
|
||
|
|
recursion -> DBSP fixed-point circuit
|
||
|
|
```
|
||
|
|
|
||
|
|
The key point is that DBSP should receive an already planned circuit, not raw Datalog text.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Changelog
|
||
|
|
|
||
|
|
* **May 20, 2026** -- First version created from the FlowLog implementation and DBSP synergy notes.
|