540 lines
13 KiB
Markdown
540 lines
13 KiB
Markdown
# FlowLog Technical Planning Notes
|
|
|
|
A technical note on the FlowLog planning layer: catalogs, transformations, key-value shapes, and recursive execution.
|
|
|
|
---
|
|
|
|
## Short Answer
|
|
|
|
FlowLog's most reusable technical idea is not the Datalog syntax or the Differential Dataflow backend. It is the planner boundary between them.
|
|
|
|
The planner turns a rule into a sequence of typed relational transformations:
|
|
|
|
```text
|
|
rule atoms and variables
|
|
-> catalog metadata
|
|
-> collection signatures
|
|
-> transformation flows
|
|
-> physical operator choices
|
|
```
|
|
|
|
```mermaid
|
|
flowchart LR
|
|
Rule["Datalog Rule"] --> Catalog["Rule Catalog"]
|
|
Catalog --> Signatures["Collection Signatures"]
|
|
Signatures --> Flows["Transformation Flows"]
|
|
Flows --> Operators["Physical Operators"]
|
|
Operators --> Backend["Incremental Backend State"]
|
|
```
|
|
|
|
That layer is useful because it records how variables move through projections, filters, joins, antijoins, and recursive strata before the backend
|
|
starts maintaining state.
|
|
|
|
---
|
|
|
|
## Catalog as Rule Metadata
|
|
|
|
The catalog is the rule-level semantic summary.
|
|
|
|
For each rule, it needs to know:
|
|
|
|
- the head relation
|
|
- the body atoms
|
|
- which atoms are positive
|
|
- which atoms are negated
|
|
- which atoms are core join inputs
|
|
- where each variable appears
|
|
- which constants constrain fields
|
|
- which comparisons constrain tuples
|
|
- which output fields are projected into the head
|
|
|
|
This is the information needed to go from syntax to a relational plan.
|
|
|
|
For example:
|
|
|
|
```text
|
|
violation(x, z) :-
|
|
A(x, y),
|
|
B(y, z),
|
|
not C(x, z),
|
|
x != z.
|
|
```
|
|
|
|
The catalog should make the following facts explicit:
|
|
|
|
- `A` and `B` are positive join inputs.
|
|
- `C` is a negated input.
|
|
- `y` joins `A` to `B`.
|
|
- `(x, z)` is needed for the output and for the antijoin against `C`.
|
|
- `x != z` is a filter after both variables are available.
|
|
|
|
Without this catalog, the executor has to rediscover planning information from rule syntax.
|
|
|
|
---
|
|
|
|
## Collection Shapes
|
|
|
|
FlowLog lowers logical relations into physical collection shapes.
|
|
|
|
The main shapes are:
|
|
|
|
```text
|
|
row
|
|
key
|
|
key-value
|
|
```
|
|
|
|
A row collection is a plain tuple relation.
|
|
|
|
A key-only collection is useful for semijoins and antijoins. It represents membership of keys.
|
|
|
|
A key-value collection is useful for joins. The key is the join attribute set, and the value is the payload carried forward.
|
|
|
|
The same logical relation may need several physical views:
|
|
|
|
```text
|
|
Arc(x, y)
|
|
-> row view: (x, y)
|
|
-> key view: key=(x)
|
|
-> key-value view: key=(x), value=(y)
|
|
-> key-value view: key=(y), value=(x)
|
|
```
|
|
|
|
```mermaid
|
|
flowchart TB
|
|
Arc["Arc(x, y)"]
|
|
Arc --> Row["Row View<br/>(x, y)"]
|
|
Arc --> KeyX["Key View<br/>key = x"]
|
|
Arc --> KvX["Key-Value View<br/>key = x<br/>value = y"]
|
|
Arc --> KvY["Key-Value View<br/>key = y<br/>value = x"]
|
|
KeyX --> Semi["Semijoin or Antijoin"]
|
|
KvX --> Join1["Join on x"]
|
|
KvY --> Join2["Join on y"]
|
|
```
|
|
|
|
This is a central planning choice. The key determines which arrangement or maintained index the backend can use.
|
|
|
|
---
|
|
|
|
## Transformation Types
|
|
|
|
FlowLog's transformations separate unary reshaping from binary combination.
|
|
|
|
```mermaid
|
|
flowchart TB
|
|
Input["Input Collection"]
|
|
Input --> Unary["Unary Transformation"]
|
|
Unary --> UnaryOut["Projected, Filtered, or Arranged Collection"]
|
|
|
|
Left["Left Collection"] --> Binary["Binary Transformation"]
|
|
Right["Right Collection"] --> Binary
|
|
Binary --> BinaryOut["Joined or Antijoined Collection"]
|
|
|
|
Unary --> RowToRow["Row to Row"]
|
|
Unary --> RowToKey["Row to Key"]
|
|
Unary --> RowToKv["Row to Key-Value"]
|
|
Binary --> Join["Join"]
|
|
Binary --> Anti["Antijoin"]
|
|
Binary --> Product["Cartesian Product"]
|
|
```
|
|
|
|
Unary transformations include:
|
|
|
|
- row to row
|
|
- row to key
|
|
- row to key-value
|
|
- key-value to key-value
|
|
- key-value to key
|
|
|
|
These cover:
|
|
|
|
- projection
|
|
- filtering
|
|
- constant checks
|
|
- equality checks
|
|
- comparison checks
|
|
- arranging a relation by a join key
|
|
- dropping fields that are no longer needed
|
|
|
|
Binary transformations include:
|
|
|
|
- key join key
|
|
- key-value join key
|
|
- key-value join key-value
|
|
- cartesian product
|
|
- key antijoin key
|
|
- key-value antijoin key
|
|
|
|
These cover joins and negation. The planner must choose both the inputs and the output shape.
|
|
|
|
---
|
|
|
|
## Transformation Flow
|
|
|
|
A transformation flow records how input fields become output fields.
|
|
|
|
For a unary transformation, the flow answers:
|
|
|
|
```text
|
|
Which input fields form the new key?
|
|
Which input fields remain as value?
|
|
Which constants and comparisons filter rows?
|
|
```
|
|
|
|
For a binary transformation, the flow answers:
|
|
|
|
```text
|
|
Which fields came from the left input?
|
|
Which fields came from the right input?
|
|
Which joined fields are retained?
|
|
Which new key should the output use?
|
|
Which payload fields must continue to the next step?
|
|
```
|
|
|
|
This matters because Datalog variables are logical names, but the backend sees tuple positions.
|
|
|
|
The planner's job is to keep those two worlds aligned.
|
|
|
|
---
|
|
|
|
## Join Graph
|
|
|
|
A rule body induces a join graph.
|
|
|
|
Atoms are nodes. Shared variables are edges or hyperedges between atoms.
|
|
|
|
Example:
|
|
|
|
```text
|
|
R(a, b, c) :-
|
|
A(a, x),
|
|
B(x, y),
|
|
C(y, c).
|
|
```
|
|
|
|
The join graph is a chain:
|
|
|
|
```text
|
|
A --x-- B --y-- C
|
|
```
|
|
|
|
```mermaid
|
|
flowchart LR
|
|
A["A(a, x)"] -- "x" --> B["B(x, y)"]
|
|
B -- "y" --> C["C(y, c)"]
|
|
```
|
|
|
|
A rule like this is sensitive to join order:
|
|
|
|
```text
|
|
R(a, d) :-
|
|
A(a, x),
|
|
B(x, y),
|
|
C(y, z),
|
|
D(z, d).
|
|
```
|
|
|
|
Joining `A` with `D` first is a cross product. Joining adjacent atoms first preserves bindings and reduces intermediate results.
|
|
|
|
```mermaid
|
|
flowchart TB
|
|
subgraph Good["Connected Join Order"]
|
|
A1["A(a, x)"] --> AB["Join on x"]
|
|
B1["B(x, y)"] --> AB
|
|
AB --> ABC["Join on y"]
|
|
C1["C(y, z)"] --> ABC
|
|
ABC --> ABCD["Join on z"]
|
|
D1["D(z, d)"] --> ABCD
|
|
end
|
|
|
|
subgraph Bad["Disconnected Join Order"]
|
|
A2["A(a, x)"] --> AD["Cross Product"]
|
|
D2["D(z, d)"] --> AD
|
|
AD --> Later["Later Filters and Joins"]
|
|
end
|
|
```
|
|
|
|
FlowLog's structural planning uses variable overlap to choose a plan tree that keeps joins connected and intermediate width smaller.
|
|
|
|
---
|
|
|
|
## Width-Oriented Planning
|
|
|
|
FlowLog's planner is robustness-oriented rather than fully cost-based.
|
|
|
|
A conventional cost model needs statistics:
|
|
|
|
- relation sizes
|
|
- distinct counts
|
|
- skew
|
|
- selectivity
|
|
- correlation
|
|
|
|
Recursive Datalog makes those estimates unstable because intermediate relations change across fixed-point iterations.
|
|
|
|
FlowLog instead uses structural signals:
|
|
|
|
- how many variables two atoms share
|
|
- how many variables an intermediate result must carry
|
|
- whether a candidate plan creates disconnected joins
|
|
- how deep the plan tree becomes
|
|
|
|
```mermaid
|
|
flowchart LR
|
|
RuleBody["Rule Body"] --> Overlap["Variable Overlap"]
|
|
RuleBody --> Width["Intermediate Width"]
|
|
RuleBody --> Connectivity["Join Connectivity"]
|
|
Overlap --> PlanTree["Candidate Plan Tree"]
|
|
Width --> PlanTree
|
|
Connectivity --> PlanTree
|
|
PlanTree --> Choice["Robust Plan Choice"]
|
|
```
|
|
|
|
This is not guaranteed to be optimal. It is meant to avoid obviously bad plans.
|
|
|
|
That is a good fit for DBSP-backed work too, because a bad plan becomes maintained operator state.
|
|
|
|
---
|
|
|
|
## Antijoin Timing
|
|
|
|
Negated atoms become antijoins.
|
|
|
|
An antijoin can only run after all of its variables are bound by prior positive atoms.
|
|
|
|
Example:
|
|
|
|
```text
|
|
missing_src(graph, src) :-
|
|
edge(graph, src, dst),
|
|
not vertex(graph, src).
|
|
```
|
|
|
|
The antijoin against `vertex(graph, src)` can run immediately after `edge` because both `graph` and `src` are available.
|
|
|
|
In a larger rule:
|
|
|
|
```text
|
|
bad(x, z) :-
|
|
A(x, y),
|
|
B(y, z),
|
|
C(z, w),
|
|
not D(x, z).
|
|
```
|
|
|
|
The antijoin against `D(x, z)` can run after `A` and `B`; it does not need to wait for `C`. Running it earlier may reduce the input to the later join
|
|
with `C`.
|
|
|
|
```mermaid
|
|
flowchart LR
|
|
A["A(x, y)"] --> AB["Join on y"]
|
|
B["B(y, z)"] --> AB
|
|
AB --> AntiD["Antijoin D(x, z)"]
|
|
D["D(x, z)"] --> AntiD
|
|
AntiD --> JoinC["Join C(z, w)"]
|
|
C["C(z, w)"] --> JoinC
|
|
JoinC --> Out["bad(x, z)"]
|
|
```
|
|
|
|
This is the same issue as antijoin pushdown in the DBSP CRDT note.
|
|
|
|
---
|
|
|
|
## Sideways Information Passing
|
|
|
|
Sideways information passing is semijoin-style filtering across a rule.
|
|
|
|
The intuition is:
|
|
|
|
```text
|
|
derive useful keys
|
|
-> filter another relation to those keys
|
|
-> join less data
|
|
```
|
|
|
|
```mermaid
|
|
flowchart LR
|
|
DeltaReach["Delta Reach(x)"] --> Keys["Useful x Keys"]
|
|
Keys --> SemiArc["Semijoin Arc on x"]
|
|
Arc["Arc(x, y)"] --> SemiArc
|
|
SemiArc --> Join["Join with Delta Reach"]
|
|
Join --> NewReach["New Reach(y)"]
|
|
```
|
|
|
|
Example:
|
|
|
|
```text
|
|
Reach(y) :- Reach(x), Arc(x, y).
|
|
```
|
|
|
|
If the current delta contains only a small set of `Reach(x)` values, then `Arc` only needs edges whose source is in that set. A semijoin can prefilter
|
|
`Arc` before the recursive join.
|
|
|
|
For CRDT causal readiness, this suggests a physical plan centered on frontier operations:
|
|
|
|
```text
|
|
new ready operations
|
|
-> candidate outgoing pred edges
|
|
-> newly ready operations
|
|
```
|
|
|
|
```mermaid
|
|
flowchart LR
|
|
Frontier["Ready Frontier"] --> CandidatePred["Pred Edges from Frontier"]
|
|
Pred["pred(from, to)"] --> CandidatePred
|
|
CandidatePred --> Check["Predecessor Checks"]
|
|
Check --> NewReady["New Ready Operations"]
|
|
NewReady --> Frontier
|
|
```
|
|
|
|
rather than a plan that repeatedly starts from roots.
|
|
|
|
---
|
|
|
|
## Recursive Strata
|
|
|
|
Recursive rules require fixed-point execution.
|
|
|
|
FlowLog groups recursive rules into recursive strata, then executes them inside an iterative dataflow scope.
|
|
|
|
```mermaid
|
|
flowchart TB
|
|
Earlier["Earlier Strata Outputs"] --> Enter["Enter Recursive Scope"]
|
|
EDB["Input Relations"] --> Enter
|
|
Enter --> Base["Base Rules"]
|
|
Base --> LoopVars["IDB Loop Variables"]
|
|
LoopVars --> Step["Recursive Step Rules"]
|
|
Step --> Delta["New Derived Facts"]
|
|
Delta --> LoopVars
|
|
LoopVars --> Done{"Fixed Point?"}
|
|
Done -- "no" --> Step
|
|
Done -- "yes" --> Collect["Collect Recursive Outputs"]
|
|
```
|
|
|
|
The important design point is that a recursive stratum can contain several rules deriving related IDBs. The planner must know:
|
|
|
|
- which IDBs are loop variables
|
|
- which relations enter the recursive scope from earlier strata
|
|
- which outputs must be collected after convergence
|
|
- which intermediate arrangements are useful across iterations
|
|
|
|
For DBSP, this maps to recursive circuits with feedback and delay. The frontend still needs the same rule-level information before it can produce a
|
|
good circuit.
|
|
|
|
---
|
|
|
|
## Subplan Sharing
|
|
|
|
Multiple rules may derive the same relation, or several relations may reuse the same intermediate computation.
|
|
|
|
Example:
|
|
|
|
```text
|
|
required_src(g, s) :- edge(g, s, d).
|
|
required_dst(g, d) :- edge(g, s, d).
|
|
```
|
|
|
|
Both rules scan or project from `edge`.
|
|
|
|
In larger Geomerge theories, many violation rules may share antecedent fragments. A planner should be able to notice common subplans:
|
|
|
|
```text
|
|
common_antecedent(x, y)
|
|
-> violation_a(x)
|
|
-> violation_b(y)
|
|
```
|
|
|
|
```mermaid
|
|
flowchart LR
|
|
A["A(x, y)"] --> Common["common_antecedent(x, y)"]
|
|
B["B(y)"] --> Common
|
|
Common --> Va["violation_a(x)"]
|
|
Common --> Vb["violation_b(y)"]
|
|
Extra["Extra Check"] --> Vb
|
|
```
|
|
|
|
FlowLog's explicit rule plans and collection signatures are a useful place to represent this sharing.
|
|
|
|
---
|
|
|
|
## Physical Key Choice
|
|
|
|
Backend performance depends on key choice.
|
|
|
|
For a join:
|
|
|
|
```text
|
|
R(x, z) :- A(x, y), B(y, z).
|
|
```
|
|
|
|
both `A` and `B` should be arranged by `y`.
|
|
|
|
For a later join:
|
|
|
|
```text
|
|
S(x, w) :- R(x, z), C(z, w).
|
|
```
|
|
|
|
the output of the first join may need to be arranged by `z`, not by `x`.
|
|
|
|
That means the planner should choose output keys based on the next operation, not only the current operation.
|
|
|
|
```mermaid
|
|
flowchart LR
|
|
A["A(x, y)<br/>key = y"] --> JoinAB["Join on y"]
|
|
B["B(y, z)<br/>key = y"] --> JoinAB
|
|
JoinAB --> R["R(x, z)<br/>next key = z"]
|
|
R --> JoinRC["Join on z"]
|
|
C["C(z, w)<br/>key = z"] --> JoinRC
|
|
JoinRC --> S["S(x, w)"]
|
|
```
|
|
|
|
This is one reason a simple relational algebra tree is not enough. The physical plan needs key and payload annotations.
|
|
|
|
---
|
|
|
|
## Transfer to a DBSP Frontend
|
|
|
|
A DBSP frontend inspired by FlowLog should probably have these data structures:
|
|
|
|
- relation schemas
|
|
- rule catalogs
|
|
- variable occurrence maps
|
|
- dependency graph
|
|
- strata
|
|
- join graph per rule
|
|
- logical relational plan
|
|
- physical key annotations
|
|
- backend lowering rules
|
|
|
|
The lowering should treat DBSP as the maintained execution backend:
|
|
|
|
```text
|
|
projection -> DBSP projection
|
|
selection -> DBSP filter
|
|
join -> DBSP join with maintained state
|
|
antijoin -> DBSP antijoin or difference plan
|
|
union -> DBSP addition or union
|
|
distinct -> DBSP distinct
|
|
recursion -> DBSP fixed-point circuit
|
|
```
|
|
|
|
```mermaid
|
|
flowchart LR
|
|
Source["Datalog or Geolog Rules"] --> Frontend["Frontend Parser or Compiler"]
|
|
Frontend --> Catalog["Rule Catalogs"]
|
|
Catalog --> Planner["FlowLog-Style Planner"]
|
|
Planner --> IR["Relational IR with Keys"]
|
|
IR --> Lowering["DBSP Lowering"]
|
|
Lowering --> Circuit["DBSP Circuit"]
|
|
Circuit --> Deltas["Maintained Output Deltas"]
|
|
```
|
|
|
|
The key point is that DBSP should receive an already planned circuit, not raw Datalog text.
|
|
|
|
---
|
|
|
|
## Changelog
|
|
|
|
* **May 20, 2026** -- First version created from the FlowLog implementation and DBSP synergy notes.
|