FlowLog Technical Planning Notes

A technical note on the FlowLog planning layer: catalogs, transformations, key-value shapes, and recursive execution.


Short Answer

FlowLog's most reusable technical idea is not the Datalog syntax or the Differential Dataflow backend. It is the planner boundary between them.

The planner turns a rule into a sequence of typed relational transformations:

rule atoms and variables
-> catalog metadata
-> collection signatures
-> transformation flows
-> physical operator choices

That layer is useful because it records how variables move through projections, filters, joins, antijoins, and recursive strata before the backend starts maintaining state.


Catalog as Rule Metadata

The catalog is the rule-level semantic summary.

For each rule, it needs to know:

  • the head relation
  • the body atoms
  • which atoms are positive
  • which atoms are negated
  • which atoms are core join inputs
  • where each variable appears
  • which constants constrain fields
  • which comparisons constrain tuples
  • which output fields are projected into the head

This is the information needed to go from syntax to a relational plan.

For example:

violation(x, z) :-
    A(x, y),
    B(y, z),
    not C(x, z),
    x != z.

The catalog should make the following facts explicit:

  • A and B are positive join inputs.
  • C is a negated input.
  • y joins A to B.
  • (x, z) is needed for the output and for the antijoin against C.
  • x != z is a filter that can be applied once both x and z are bound.

Without this catalog, the executor has to rediscover planning information from rule syntax.
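
A minimal sketch of such a catalog for the violation rule above. The class and method names (Atom, RuleCatalog, join_variables) are hypothetical and are not taken from FlowLog; constants in atom arguments are left out to keep it short.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Atom:
    relation: str
    args: tuple            # variable names only in this sketch
    negated: bool = False

@dataclass
class RuleCatalog:
    head: Atom
    body: list
    comparisons: list = field(default_factory=list)   # (left, op, right)

    def positive_atoms(self):
        return [a for a in self.body if not a.negated]

    def negated_atoms(self):
        return [a for a in self.body if a.negated]

    def occurrences(self):
        """Map each variable to the (atom index, argument position) pairs where it appears."""
        occ = {}
        for i, atom in enumerate(self.body):
            for pos, var in enumerate(atom.args):
                occ.setdefault(var, []).append((i, pos))
        return occ

    def join_variables(self):
        """Variables shared by at least two positive atoms."""
        counts = {}
        for atom in self.positive_atoms():
            for var in set(atom.args):
                counts[var] = counts.get(var, 0) + 1
        return {v for v, n in counts.items() if n >= 2}

catalog = RuleCatalog(
    head=Atom("violation", ("x", "z")),
    body=[
        Atom("A", ("x", "y")),
        Atom("B", ("y", "z")),
        Atom("C", ("x", "z"), negated=True),
    ],
    comparisons=[("x", "!=", "z")],
)
```

Here catalog.join_variables() yields only y, and catalog.occurrences() records that x appears in both A and the negated C, which is what the planner needs to place the antijoin.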


Collection Shapes

FlowLog lowers logical relations into physical collection shapes.

The main shapes are:

row
key
key-value

A row collection is a plain tuple relation.

A key-only collection is useful for semijoins and antijoins. It represents membership of keys.

A key-value collection is useful for joins. The key is the join attribute set, and the value is the payload carried forward.

The same logical relation may need several physical views:

Arc(x, y)
-> row view:       (x, y)
-> key view:       key=(x)
-> key-value view: key=(x), value=(y)
-> key-value view: key=(y), value=(x)

This is a central planning choice. The key determines which arrangement or maintained index the backend can use.
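
The three shapes can be illustrated with plain Python containers. This is only a sketch: as_key and as_key_value are hypothetical helpers, and a real backend would maintain these views incrementally as arrangements rather than rebuild them.

```python
def as_key(rows, key_cols):
    """Key-only view: the set of key tuples, for semijoins and antijoins."""
    return {tuple(row[i] for i in key_cols) for row in rows}

def as_key_value(rows, key_cols, value_cols):
    """Key-value view: group payload tuples under each join key."""
    index = {}
    for row in rows:
        key = tuple(row[i] for i in key_cols)
        index.setdefault(key, []).append(tuple(row[i] for i in value_cols))
    return index

arc = [(1, 2), (1, 3), (2, 3)]                               # row view of Arc(x, y)
sources = as_key(arc, key_cols=[0])                          # key=(x)
by_src = as_key_value(arc, key_cols=[0], value_cols=[1])     # key=(x), value=(y)
by_dst = as_key_value(arc, key_cols=[1], value_cols=[0])     # key=(y), value=(x)
```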


Transformation Types

FlowLog's transformations separate unary reshaping from binary combination.

Unary transformations include:

  • row to row
  • row to key
  • row to key-value
  • key-value to key-value
  • key-value to key

These cover:

  • projection
  • filtering
  • constant checks
  • equality checks
  • comparison checks
  • arranging a relation by a join key
  • dropping fields that are no longer needed

Binary transformations include:

  • key join key
  • key-value join key
  • key-value join key-value
  • cartesian product
  • key antijoin key
  • key-value antijoin key

These cover joins and negation. The planner must choose both the inputs and the output shape.


Transformation Flow

A transformation flow records how input fields become output fields.

For a unary transformation, the flow answers:

Which input fields form the new key?
Which input fields remain as value?
Which constants and comparisons filter rows?

For a binary transformation, the flow answers:

Which fields came from the left input?
Which fields came from the right input?
Which joined fields are retained?
Which new key should the output use?
Which payload fields must continue to the next step?

This matters because Datalog variables are logical names, but the backend sees tuple positions.

The planner's job is to keep those two worlds aligned.
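
A sketch of a unary flow that resolves variable names to tuple positions once, then reshapes rows by position. UnaryFlow and flow_for are hypothetical names chosen for illustration, not FlowLog's types.

```python
import operator
from dataclasses import dataclass, field

OPS = {"==": operator.eq, "!=": operator.ne, "<": operator.lt, ">": operator.gt}

@dataclass
class UnaryFlow:
    key_cols: list                                       # input positions forming the new key
    value_cols: list                                     # input positions kept as payload
    const_filters: list = field(default_factory=list)    # (col, constant)
    comparisons: list = field(default_factory=list)      # (left col, op, right col)

    def apply(self, rows):
        out = []
        for row in rows:
            if any(row[c] != k for c, k in self.const_filters):
                continue
            if not all(OPS[op](row[l], row[r]) for l, op, r in self.comparisons):
                continue
            out.append((tuple(row[i] for i in self.key_cols),
                        tuple(row[i] for i in self.value_cols)))
        return out

def flow_for(atom_vars, key_vars, value_vars):
    """Resolve logical variable names to tuple positions for one atom."""
    pos = {v: i for i, v in enumerate(atom_vars)}
    return UnaryFlow([pos[v] for v in key_vars], [pos[v] for v in value_vars])

# A(x, y) rearranged as key=(y), value=(x), keeping only rows where x != y.
flow = flow_for(["x", "y"], key_vars=["y"], value_vars=["x"])
flow.comparisons.append((0, "!=", 1))
keyed = flow.apply([(1, 2), (3, 3), (4, 2)])
```

After flow_for runs, nothing downstream refers to x or y by name; the flow carries only positions.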


Join Graph

A rule body induces a join graph.

Atoms are nodes. Shared variables are edges or hyperedges between atoms.

Example:

R(a, c) :-
    A(a, x),
    B(x, y),
    C(y, c).

The join graph is a chain:

A --x-- B --y-- C

A rule like this is sensitive to join order:

R(a, d) :-
    A(a, x),
    B(x, y),
    C(y, z),
    D(z, d).

Joining A with D first produces a cross product, because the two atoms share no variables. Joining adjacent atoms first keeps every join connected by a shared variable and keeps intermediate results smaller.

FlowLog's structural planning uses variable overlap to choose a plan tree that keeps joins connected and intermediate width smaller.
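
The join graph for the four-atom rule can be computed directly from variable overlap. A small sketch, with hypothetical helper names:

```python
from itertools import combinations

def join_graph(atoms):
    """atoms: dict of atom name -> variable tuple. Returns {(a, b): shared variables}."""
    edges = {}
    for (a, a_vars), (b, b_vars) in combinations(atoms.items(), 2):
        shared = set(a_vars) & set(b_vars)
        if shared:
            edges[(a, b)] = shared
    return edges

def is_cross_product(atoms, left, right):
    """True when joining the two atoms directly would share no variable."""
    return not (set(atoms[left]) & set(atoms[right]))

body = {"A": ("a", "x"), "B": ("x", "y"), "C": ("y", "z"), "D": ("z", "d")}
edges = join_graph(body)
```

The result has exactly the three chain edges A-B, B-C, and C-D, and is_cross_product(body, "A", "D") is true.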


Width-Oriented Planning

FlowLog's planner is robustness-oriented rather than fully cost-based.

A conventional cost model needs statistics:

  • relation sizes
  • distinct counts
  • skew
  • selectivity
  • correlation

Recursive Datalog makes those estimates unstable because intermediate relations change across fixed-point iterations.

FlowLog instead uses structural signals:

  • how many variables two atoms share
  • how many variables an intermediate result must carry
  • whether a candidate plan creates disconnected joins
  • how deep the plan tree becomes

This is not guaranteed to be optimal. It is meant to avoid obviously bad plans.

That is a good fit for DBSP-backed work too, because there the intermediate results of a bad plan do not disappear after one query: they persist as maintained operator state.
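
One way to turn these signals into an ordering is a greedy left-deep heuristic. The sketch below is an illustration of the idea, not FlowLog's actual algorithm: it starts from the first listed atom, prefers connected atoms, then larger overlap, then smaller carried width.

```python
def carried_width(bound, name, remaining, head_vars):
    """Number of variables the result must keep after joining `name`."""
    needed_later = set(head_vars)
    for other, other_vars in remaining.items():
        if other != name:
            needed_later |= set(other_vars)
    return len((bound | set(remaining[name])) & needed_later)

def greedy_order(atoms, head_vars):
    """atoms: dict of atom name -> variable tuple. Returns (join order, width after each join)."""
    remaining = dict(atoms)
    first = next(iter(remaining))
    bound = set(remaining.pop(first))
    order, widths = [first], []
    while remaining:
        def score(name):
            shared = len(set(remaining[name]) & bound)
            # Connected atoms first, then more shared variables, then smaller width.
            return (shared == 0, -shared, carried_width(bound, name, remaining, head_vars))
        best = min(remaining, key=score)
        widths.append(carried_width(bound, best, remaining, head_vars))
        bound |= set(remaining.pop(best))
        order.append(best)
    return order, widths

# Atoms listed out of chain order on purpose.
body = {"A": ("a", "x"), "D": ("z", "d"), "C": ("y", "z"), "B": ("x", "y")}
order, widths = greedy_order(body, head_vars=("a", "d"))
```

Even though D is listed second, the heuristic follows the chain A, B, C, D and never carries more than two variables between joins. No statistics are consulted.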


Antijoin Timing

Negated atoms become antijoins.

An antijoin can only run after all of its variables are bound by prior positive atoms.

Example:

missing_src(graph, src) :-
    edge(graph, src, dst),
    not vertex(graph, src).

The antijoin against vertex(graph, src) can run immediately after edge because both graph and src are available.

In a larger rule:

bad(x, z) :-
    A(x, y),
    B(y, z),
    C(z, w),
    not D(x, z).

The antijoin against D(x, z) can run after A and B; it does not need to wait for C. Running it earlier may reduce the input to the later join with C.

This is the same issue as antijoin pushdown in the DBSP CRDT note.
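
The earliest legal position for each antijoin follows from the bound-variable rule. A sketch, assuming a fixed order of positive atoms; antijoin_positions is a hypothetical helper:

```python
def antijoin_positions(positive_order, negated):
    """positive_order: list of (name, vars) in join order.
    negated: list of (name, vars).
    Returns {negated name: index of the positive atom after which it can run}."""
    placement = {}
    for neg_name, neg_vars in negated:
        bound = set()
        for i, (_, atom_vars) in enumerate(positive_order):
            bound |= set(atom_vars)
            if set(neg_vars) <= bound:
                placement[neg_name] = i
                break
        else:
            raise ValueError(f"unsafe negation: {neg_name} has unbound variables")
    return placement

order = [("A", ("x", "y")), ("B", ("y", "z")), ("C", ("z", "w"))]
placement = antijoin_positions(order, [("D", ("x", "z"))])

single = antijoin_positions(
    [("edge", ("graph", "src", "dst"))],
    [("vertex", ("graph", "src"))],
)
```

For the bad rule, D is placed at index 1, that is, after B and before C. For missing_src, vertex is placed at index 0, directly after edge.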


Sideways Information Passing

Sideways information passing is semijoin-style filtering across a rule.

The intuition is:

derive useful keys
-> filter another relation to those keys
-> join less data

Example:

Reach(y) :- Reach(x), Arc(x, y).

If the current delta contains only a small set of Reach(x) values, then Arc only needs edges whose source is in that set. A semijoin can prefilter Arc before the recursive join.

For CRDT causal readiness, this suggests a physical plan centered on frontier operations:

new ready operations
-> candidate outgoing pred edges
-> newly ready operations

rather than a plan that repeatedly starts from roots.
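
A sketch of the frontier-driven version of Reach, using semi-naive evaluation over an Arc index keyed by source. Probing the index only with delta keys has the same effect as semijoining Arc with the delta before the join. The probes counter exists only to show how little of Arc is touched.

```python
def reach(roots, arcs):
    """Semi-naive evaluation of Reach(y) :- Reach(x), Arc(x, y)."""
    by_src = {}
    for x, y in arcs:
        by_src.setdefault(x, set()).add(y)

    reached = set(roots)
    delta = set(roots)
    probes = 0
    while delta:
        new = set()
        for x in delta:                 # only frontier sources probe Arc
            probes += 1
            new |= by_src.get(x, set())
        delta = new - reached           # keep only newly derived facts
        reached |= delta
    return reached, probes

arcs = [(1, 2), (2, 3), (3, 4), (10, 11)]
reached, probes = reach({1}, arcs)
```

The edge (10, 11) is never probed, because 10 never enters the frontier.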


Recursive Strata

Recursive rules require fixed-point execution.

FlowLog groups recursive rules into recursive strata, then executes them inside an iterative dataflow scope.

The important design point is that a recursive stratum can contain several rules deriving related IDBs. The planner must know:

  • which IDBs are loop variables
  • which relations enter the recursive scope from earlier strata
  • which outputs must be collected after convergence
  • which intermediate arrangements are useful across iterations

For DBSP, this maps to recursive circuits with feedback and delay. The frontend still needs the same rule-level information before it can produce a good circuit.
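
Grouping rules into strata can be sketched as strongly connected components of the relation dependency graph. The code below uses Tarjan's algorithm, which is a standard choice and not necessarily what FlowLog does. It ignores the extra check that negation must not occur inside a recursive component.

```python
def strata(rules):
    """rules: list of (head relation, [body relations]).
    Returns components in evaluation order as (set of relations, is_recursive)."""
    graph = {}
    for head, body in rules:
        graph.setdefault(head, set()).update(body)
        for rel in body:
            graph.setdefault(rel, set())

    index, low, on_stack, stack, out = {}, {}, set(), [], []

    def visit(node):
        index[node] = low[node] = len(index)
        stack.append(node)
        on_stack.add(node)
        for dep in sorted(graph[node]):
            if dep not in index:
                visit(dep)
                low[node] = min(low[node], low[dep])
            elif dep in on_stack:
                low[node] = min(low[node], index[dep])
        if low[node] == index[node]:            # node is the root of a component
            component = set()
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.add(member)
                if member == node:
                    break
            recursive = len(component) > 1 or node in graph[node]
            out.append((component, recursive))

    for node in sorted(graph):
        if node not in index:
            visit(node)
    return out

result = strata([
    ("Reach", ["Root"]),
    ("Reach", ["Reach", "Arc"]),
    ("Unreached", ["Node", "Reach"]),
])
```

Because edges point from a head to the relations it depends on, components come out with dependencies first. Only the Reach component is flagged recursive, so only it needs an iterative scope.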


Subplan Sharing

Multiple rules may derive the same relation, or several relations may reuse the same intermediate computation.

Example:

required_src(g, s) :- edge(g, s, d).
required_dst(g, d) :- edge(g, s, d).

Both rules scan edge and differ only in the projection, so the scan can be shared.

In larger Geomerge theories, many violation rules may share antecedent fragments. A planner should be able to notice common subplans:

common_antecedent(x, y) -> violation_a(x)
common_antecedent(x, y) -> violation_b(y)

FlowLog's explicit rule plans and collection signatures are a useful place to represent this sharing.
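
One simple way to detect a shared fragment is to give each body fragment a canonical signature that ignores variable names. This is a sketch under strong assumptions: the signature function is hypothetical, and it only matches fragments whose atoms are listed in the same order.

```python
def signature(atoms):
    """Canonical form of a body fragment: variables renamed by first occurrence."""
    names = {}
    out = []
    for relation, args in atoms:
        out.append((relation, tuple(names.setdefault(v, len(names)) for v in args)))
    return tuple(out)

fragments = {
    "violation_a": [("edge", ("g", "s", "d")), ("vertex", ("g", "s"))],
    "violation_b": [("edge", ("graph", "src", "dst")), ("vertex", ("graph", "src"))],
    "violation_c": [("edge", ("g", "s", "d")), ("vertex", ("g", "d"))],
}

# Group rules by the signature of their antecedent fragment.
shared = {}
for rule, atoms in fragments.items():
    shared.setdefault(signature(atoms), []).append(rule)

groups = sorted(shared.values())
```

violation_a and violation_b land in the same group despite different variable names. violation_c does not, because its vertex atom binds the destination rather than the source.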


Physical Key Choice

Backend performance depends on key choice.

For a join:

R(x, z) :- A(x, y), B(y, z).

both A and B should be arranged by y.

For a later join:

S(x, w) :- R(x, z), C(z, w).

the output of the first join should be arranged by z, the next join key, not by x.

That means the planner should choose output keys based on the next operation, not only the current operation.

This is one reason a simple relational algebra tree is not enough. The physical plan needs key and payload annotations.
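
A sketch of a join that emits its output already keyed for the next step. arrange and join_keyed are hypothetical helpers over plain dictionaries; a real backend would use maintained arrangements.

```python
def arrange(rows, key_col):
    """Index rows by one column."""
    index = {}
    for row in rows:
        index.setdefault(row[key_col], []).append(row)
    return index

def join_keyed(left, right, out_key):
    """Join two arrangements on their shared key.
    out_key(l, r) returns (next key, payload), so the output needs no re-keying."""
    out = {}
    for key, left_rows in left.items():
        for l in left_rows:
            for r in right.get(key, []):
                k, payload = out_key(l, r)
                out.setdefault(k, []).append(payload)
    return out

A = [(1, 10), (2, 10)]
B = [(10, 100), (10, 200)]
C = [(100, "p"), (200, "q")]

# R(x, z) :- A(x, y), B(y, z). Both inputs arranged by y; output keyed by z.
R_by_z = join_keyed(arrange(A, 1), arrange(B, 0), lambda a, b: (b[1], a[0]))

# S(x, w) :- R(x, z), C(z, w). R is already keyed by z, so it probes C directly.
C_by_z = arrange(C, 0)
S = sorted((x, c[1]) for z, xs in R_by_z.items() for x in xs for c in C_by_z.get(z, []))
```

Had the first join emitted R keyed by x, the second join would need an extra re-arrangement step before it could probe C.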


Transfer to a DBSP Frontend

A DBSP frontend inspired by FlowLog should probably have these data structures:

  • relation schemas
  • rule catalogs
  • variable occurrence maps
  • dependency graph
  • strata
  • join graph per rule
  • logical relational plan
  • physical key annotations
  • backend lowering rules

The lowering should treat DBSP as the maintained execution backend:

projection -> DBSP projection
selection -> DBSP filter
join -> DBSP join with maintained state
antijoin -> DBSP antijoin or difference plan
union -> DBSP addition or union
distinct -> DBSP distinct
recursion -> DBSP fixed-point circuit

The key point is that DBSP should receive an already planned circuit, not raw Datalog text.
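
Two rows of the lowering table are easier to see on a toy model of weighted sets. The sketch below models Z-sets as Python Counters with integer weights. It does not use any DBSP API, and all function names are hypothetical. It shows union as addition followed by distinct, and antijoin expressed as a difference plan.

```python
from collections import Counter

def zadd(a, b):
    """Pointwise weight addition, dropping zero-weight entries."""
    out = Counter(a)
    for k, w in b.items():
        out[k] += w
    return Counter({k: w for k, w in out.items() if w != 0})

def zneg(a):
    return Counter({k: -w for k, w in a.items()})

def distinct(a):
    """Clamp every positive weight to 1 and drop the rest."""
    return Counter({k: 1 for k, w in a.items() if w > 0})

def semijoin(a, keys, key_of):
    return Counter({k: w for k, w in a.items() if keys.get(key_of(k), 0) > 0})

def antijoin(a, b, key_of):
    """a minus the part of a whose key appears in b."""
    return zadd(a, zneg(semijoin(a, distinct(b), key_of)))

# Union with set semantics: add, then distinct.
union = distinct(zadd(Counter({"a": 1}), Counter({"a": 1, "b": 1})))

# missing_src(graph, src) :- edge(graph, src, dst), not vertex(graph, src).
edge = Counter({("g", 1, 2): 1, ("g", 3, 4): 1})
vertex = Counter({("g", 1): 1})
missing = antijoin(edge, vertex, key_of=lambda e: (e[0], e[1]))
```

The planner decides which of these forms to emit. The backend only maintains the result.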


Changelog

  • May 20, 2026 -- First version created from the FlowLog implementation and DBSP synergy notes.