useful-notes/flowlog/004-flowlog-technical-planning-notes.md

540 lines
13 KiB
Markdown
Raw Permalink Normal View History

# FlowLog Technical Planning Notes
A technical note on the FlowLog planning layer: catalogs, transformations, key-value shapes, and recursive execution.
---
## Short Answer
FlowLog's most reusable technical idea is not the Datalog syntax or the Differential Dataflow backend. It is the planner boundary between them.
The planner turns a rule into a sequence of typed relational transformations:
```text
rule atoms and variables
-> catalog metadata
-> collection signatures
-> transformation flows
-> physical operator choices
```
2026-05-20 15:54:55 +02:00
```mermaid
flowchart LR
Rule["Datalog Rule"] --> Catalog["Rule Catalog"]
Catalog --> Signatures["Collection Signatures"]
Signatures --> Flows["Transformation Flows"]
Flows --> Operators["Physical Operators"]
Operators --> Backend["Incremental Backend State"]
```
That layer is useful because it records how variables move through projections, filters, joins, antijoins, and recursive strata before the backend
starts maintaining state.
---
## Catalog as Rule Metadata
The catalog is the rule-level semantic summary.
For each rule, it needs to know:
- the head relation
- the body atoms
- which atoms are positive
- which atoms are negated
- which atoms are core join inputs
- where each variable appears
- which constants constrain fields
- which comparisons constrain tuples
- which output fields are projected into the head
This is the information needed to go from syntax to a relational plan.
For example:
```text
violation(x, z) :-
A(x, y),
B(y, z),
not C(x, z),
x != z.
```
The catalog should make the following facts explicit:
- `A` and `B` are positive join inputs.
- `C` is a negated input.
- `y` joins `A` to `B`.
- `(x, z)` is needed for the output and for the antijoin against `C`.
- `x != z` is a filter after both variables are available.
Without this catalog, the executor has to rediscover planning information from rule syntax.
---
## Collection Shapes
FlowLog lowers logical relations into physical collection shapes.
The main shapes are:
```text
row
key
key-value
```
A row collection is a plain tuple relation.
A key-only collection is useful for semijoins and antijoins. It represents membership of keys.
A key-value collection is useful for joins. The key is the join attribute set, and the value is the payload carried forward.
The same logical relation may need several physical views:
```text
Arc(x, y)
-> row view: (x, y)
-> key view: key=(x)
-> key-value view: key=(x), value=(y)
-> key-value view: key=(y), value=(x)
```
2026-05-20 15:54:55 +02:00
```mermaid
flowchart TB
Arc["Arc(x, y)"]
Arc --> Row["Row View<br/>(x, y)"]
Arc --> KeyX["Key View<br/>key = x"]
Arc --> KvX["Key-Value View<br/>key = x<br/>value = y"]
Arc --> KvY["Key-Value View<br/>key = y<br/>value = x"]
KeyX --> Semi["Semijoin or Antijoin"]
KvX --> Join1["Join on x"]
KvY --> Join2["Join on y"]
```
This is a central planning choice. The key determines which arrangement or maintained index the backend can use.
---
## Transformation Types
FlowLog's transformations separate unary reshaping from binary combination.
2026-05-20 15:54:55 +02:00
```mermaid
flowchart TB
Input["Input Collection"]
Input --> Unary["Unary Transformation"]
Unary --> UnaryOut["Projected, Filtered, or Arranged Collection"]
Left["Left Collection"] --> Binary["Binary Transformation"]
Right["Right Collection"] --> Binary
Binary --> BinaryOut["Joined or Antijoined Collection"]
Unary --> RowToRow["Row to Row"]
Unary --> RowToKey["Row to Key"]
Unary --> RowToKv["Row to Key-Value"]
Binary --> Join["Join"]
Binary --> Anti["Antijoin"]
Binary --> Product["Cartesian Product"]
```
Unary transformations include:
- row to row
- row to key
- row to key-value
- key-value to key-value
- key-value to key
These cover:
- projection
- filtering
- constant checks
- equality checks
- comparison checks
- arranging a relation by a join key
- dropping fields that are no longer needed
Binary transformations include:
- key join key
- key-value join key
- key-value join key-value
- cartesian product
- key antijoin key
- key-value antijoin key
These cover joins and negation. The planner must choose both the inputs and the output shape.
---
## Transformation Flow
A transformation flow records how input fields become output fields.
For a unary transformation, the flow answers:
```text
Which input fields form the new key?
Which input fields remain as value?
Which constants and comparisons filter rows?
```
For a binary transformation, the flow answers:
```text
Which fields came from the left input?
Which fields came from the right input?
Which joined fields are retained?
Which new key should the output use?
Which payload fields must continue to the next step?
```
This matters because Datalog variables are logical names, but the backend sees tuple positions.
The planner's job is to keep those two worlds aligned.
---
## Join Graph
A rule body induces a join graph.
Atoms are nodes. Shared variables are edges or hyperedges between atoms.
Example:
```text
R(a, b, c) :-
A(a, x),
B(x, y),
C(y, c).
```
The join graph is a chain:
```text
A --x-- B --y-- C
```
2026-05-20 15:54:55 +02:00
```mermaid
flowchart LR
A["A(a, x)"] -- "x" --> B["B(x, y)"]
B -- "y" --> C["C(y, c)"]
```
A rule like this is sensitive to join order:
```text
R(a, d) :-
A(a, x),
B(x, y),
C(y, z),
D(z, d).
```
Joining `A` with `D` first is a cross product. Joining adjacent atoms first preserves bindings and reduces intermediate results.
2026-05-20 15:54:55 +02:00
```mermaid
flowchart TB
subgraph Good["Connected Join Order"]
A1["A(a, x)"] --> AB["Join on x"]
B1["B(x, y)"] --> AB
AB --> ABC["Join on y"]
C1["C(y, z)"] --> ABC
ABC --> ABCD["Join on z"]
D1["D(z, d)"] --> ABCD
end
subgraph Bad["Disconnected Join Order"]
A2["A(a, x)"] --> AD["Cross Product"]
D2["D(z, d)"] --> AD
AD --> Later["Later Filters and Joins"]
end
```
FlowLog's structural planning uses variable overlap to choose a plan tree that keeps joins connected and intermediate width smaller.
---
## Width-Oriented Planning
FlowLog's planner is robustness-oriented rather than fully cost-based.
A conventional cost model needs statistics:
- relation sizes
- distinct counts
- skew
- selectivity
- correlation
Recursive Datalog makes those estimates unstable because intermediate relations change across fixed-point iterations.
FlowLog instead uses structural signals:
- how many variables two atoms share
- how many variables an intermediate result must carry
- whether a candidate plan creates disconnected joins
- how deep the plan tree becomes
2026-05-20 15:54:55 +02:00
```mermaid
flowchart LR
RuleBody["Rule Body"] --> Overlap["Variable Overlap"]
RuleBody --> Width["Intermediate Width"]
RuleBody --> Connectivity["Join Connectivity"]
Overlap --> PlanTree["Candidate Plan Tree"]
Width --> PlanTree
Connectivity --> PlanTree
PlanTree --> Choice["Robust Plan Choice"]
```
This is not guaranteed to be optimal. It is meant to avoid obviously bad plans.
That is a good fit for DBSP-backed work too, because a bad plan becomes maintained operator state.
---
## Antijoin Timing
Negated atoms become antijoins.
An antijoin can only run after all of its variables are bound by prior positive atoms.
Example:
```text
missing_src(graph, src) :-
edge(graph, src, dst),
not vertex(graph, src).
```
The antijoin against `vertex(graph, src)` can run immediately after `edge` because both `graph` and `src` are available.
In a larger rule:
```text
bad(x, z) :-
A(x, y),
B(y, z),
C(z, w),
not D(x, z).
```
2026-05-20 15:54:55 +02:00
The antijoin against `D(x, z)` can run after `A` and `B`; it does not need to wait for `C`. Running it earlier may reduce the input to the later join
with `C`.
```mermaid
flowchart LR
A["A(x, y)"] --> AB["Join on y"]
B["B(y, z)"] --> AB
AB --> AntiD["Antijoin D(x, z)"]
D["D(x, z)"] --> AntiD
AntiD --> JoinC["Join C(z, w)"]
C["C(z, w)"] --> JoinC
JoinC --> Out["bad(x, z)"]
```
This is the same issue as antijoin pushdown in the DBSP CRDT note.
---
## Sideways Information Passing
Sideways information passing is semijoin-style filtering across a rule.
The intuition is:
```text
derive useful keys
-> filter another relation to those keys
-> join less data
```
2026-05-20 15:54:55 +02:00
```mermaid
flowchart LR
DeltaReach["Delta Reach(x)"] --> Keys["Useful x Keys"]
Keys --> SemiArc["Semijoin Arc on x"]
Arc["Arc(x, y)"] --> SemiArc
SemiArc --> Join["Join with Delta Reach"]
Join --> NewReach["New Reach(y)"]
```
Example:
```text
Reach(y) :- Reach(x), Arc(x, y).
```
2026-05-20 15:54:55 +02:00
If the current delta contains only a small set of `Reach(x)` values, then `Arc` only needs edges whose source is in that set. A semijoin can prefilter
`Arc` before the recursive join.
For CRDT causal readiness, this suggests a physical plan centered on frontier operations:
```text
new ready operations
-> candidate outgoing pred edges
-> newly ready operations
```
2026-05-20 15:54:55 +02:00
```mermaid
flowchart LR
Frontier["Ready Frontier"] --> CandidatePred["Pred Edges from Frontier"]
Pred["pred(from, to)"] --> CandidatePred
CandidatePred --> Check["Predecessor Checks"]
Check --> NewReady["New Ready Operations"]
NewReady --> Frontier
```
rather than a plan that repeatedly starts from roots.
---
## Recursive Strata
Recursive rules require fixed-point execution.
FlowLog groups recursive rules into recursive strata, then executes them inside an iterative dataflow scope.
2026-05-20 15:54:55 +02:00
```mermaid
flowchart TB
Earlier["Earlier Strata Outputs"] --> Enter["Enter Recursive Scope"]
EDB["Input Relations"] --> Enter
Enter --> Base["Base Rules"]
Base --> LoopVars["IDB Loop Variables"]
LoopVars --> Step["Recursive Step Rules"]
Step --> Delta["New Derived Facts"]
Delta --> LoopVars
LoopVars --> Done{"Fixed Point?"}
Done -- "no" --> Step
Done -- "yes" --> Collect["Collect Recursive Outputs"]
```
The important design point is that a recursive stratum can contain several rules deriving related IDBs. The planner must know:
- which IDBs are loop variables
- which relations enter the recursive scope from earlier strata
- which outputs must be collected after convergence
- which intermediate arrangements are useful across iterations
2026-05-20 15:54:55 +02:00
For DBSP, this maps to recursive circuits with feedback and delay. The frontend still needs the same rule-level information before it can produce a
good circuit.
---
## Subplan Sharing
Multiple rules may derive the same relation, or several relations may reuse the same intermediate computation.
Example:
```text
required_src(g, s) :- edge(g, s, d).
required_dst(g, d) :- edge(g, s, d).
```
Both rules scan or project from `edge`.
In larger Geomerge theories, many violation rules may share antecedent fragments. A planner should be able to notice common subplans:
```text
common_antecedent(x, y)
-> violation_a(x)
-> violation_b(y)
```
2026-05-20 15:54:55 +02:00
```mermaid
flowchart LR
A["A(x, y)"] --> Common["common_antecedent(x, y)"]
B["B(y)"] --> Common
Common --> Va["violation_a(x)"]
Common --> Vb["violation_b(y)"]
Extra["Extra Check"] --> Vb
```
FlowLog's explicit rule plans and collection signatures are a useful place to represent this sharing.
---
## Physical Key Choice
Backend performance depends on key choice.
For a join:
```text
R(x, z) :- A(x, y), B(y, z).
```
both `A` and `B` should be arranged by `y`.
For a later join:
```text
S(x, w) :- R(x, z), C(z, w).
```
the output of the first join may need to be arranged by `z`, not by `x`.
That means the planner should choose output keys based on the next operation, not only the current operation.
2026-05-20 15:54:55 +02:00
```mermaid
flowchart LR
A["A(x, y)<br/>key = y"] --> JoinAB["Join on y"]
B["B(y, z)<br/>key = y"] --> JoinAB
JoinAB --> R["R(x, z)<br/>next key = z"]
R --> JoinRC["Join on z"]
C["C(z, w)<br/>key = z"] --> JoinRC
JoinRC --> S["S(x, w)"]
```
This is one reason a simple relational algebra tree is not enough. The physical plan needs key and payload annotations.
---
## Transfer to a DBSP Frontend
A DBSP frontend inspired by FlowLog should probably have these data structures:
- relation schemas
- rule catalogs
- variable occurrence maps
- dependency graph
- strata
- join graph per rule
- logical relational plan
- physical key annotations
- backend lowering rules
The lowering should treat DBSP as the maintained execution backend:
```text
projection -> DBSP projection
selection -> DBSP filter
join -> DBSP join with maintained state
antijoin -> DBSP antijoin or difference plan
union -> DBSP addition or union
distinct -> DBSP distinct
recursion -> DBSP fixed-point circuit
```
2026-05-20 15:54:55 +02:00
```mermaid
flowchart LR
Source["Datalog or Geolog Rules"] --> Frontend["Frontend Parser or Compiler"]
Frontend --> Catalog["Rule Catalogs"]
Catalog --> Planner["FlowLog-Style Planner"]
Planner --> IR["Relational IR with Keys"]
IR --> Lowering["DBSP Lowering"]
Lowering --> Circuit["DBSP Circuit"]
Circuit --> Deltas["Maintained Output Deltas"]
```
The key point is that DBSP should receive an already planned circuit, not raw Datalog text.
---
## Changelog
* **May 20, 2026** -- First version created from the FlowLog implementation and DBSP synergy notes.