Hassan Abedi 99c30190a8 Add two new note files (mianly technical notes on usage)

2026-05-20 15:36:14 +02:00

9.0 KiB

Raw Blame History

FlowLog Technical Planning Notes

A technical note on the FlowLog planning layer: catalogs, transformations, key-value shapes, and recursive execution.

Short Answer

FlowLog's most reusable technical idea is not the Datalog syntax or the Differential Dataflow backend. It is the planner boundary between them.

The planner turns a rule into a sequence of typed relational transformations:

rule atoms and variables
-> catalog metadata
-> collection signatures
-> transformation flows
-> physical operator choices

That layer is useful because it records how variables move through projections, filters, joins, antijoins, and recursive strata before the backend starts maintaining state.

Catalog as Rule Metadata

The catalog is the rule-level semantic summary.

For each rule, it needs to know:

the head relation
the body atoms
which atoms are positive
which atoms are negated
which atoms are core join inputs
where each variable appears
which constants constrain fields
which comparisons constrain tuples
which output fields are projected into the head

This is the information needed to go from syntax to a relational plan.

For example:

violation(x, z) :-
    A(x, y),
    B(y, z),
    not C(x, z),
    x != z.

The catalog should make the following facts explicit:

A and B are positive join inputs.
C is a negated input.
y joins A to B.
(x, z) is needed for the output and for the antijoin against C.
x != z is a filter after both variables are available.

Without this catalog, the executor has to rediscover planning information from rule syntax.

Collection Shapes

FlowLog lowers logical relations into physical collection shapes.

The main shapes are:

row
key
key-value

A row collection is a plain tuple relation.

A key-only collection is useful for semijoins and antijoins. It represents membership of keys.

A key-value collection is useful for joins. The key is the join attribute set, and the value is the payload carried forward.

The same logical relation may need several physical views:

Arc(x, y)
-> row view:       (x, y)
-> key view:       key=(x)
-> key-value view: key=(x), value=(y)
-> key-value view: key=(y), value=(x)

This is a central planning choice. The key determines which arrangement or maintained index the backend can use.

Transformation Types

FlowLog's transformations separate unary reshaping from binary combination.

Unary transformations include:

row to row
row to key
row to key-value
key-value to key-value
key-value to key

These cover:

projection
filtering
constant checks
equality checks
comparison checks
arranging a relation by a join key
dropping fields that are no longer needed

Binary transformations include:

key join key
key-value join key
key-value join key-value
cartesian product
key antijoin key
key-value antijoin key

These cover joins and negation. The planner must choose both the inputs and the output shape.

Transformation Flow

A transformation flow records how input fields become output fields.

For a unary transformation, the flow answers:

Which input fields form the new key?
Which input fields remain as value?
Which constants and comparisons filter rows?

For a binary transformation, the flow answers:

Which fields came from the left input?
Which fields came from the right input?
Which joined fields are retained?
Which new key should the output use?
Which payload fields must continue to the next step?

This matters because Datalog variables are logical names, but the backend sees tuple positions.

The planner's job is to keep those two worlds aligned.

Join Graph

A rule body induces a join graph.

Atoms are nodes. Shared variables are edges or hyperedges between atoms.

Example:

R(a, b, c) :-
    A(a, x),
    B(x, y),
    C(y, c).

The join graph is a chain:

A --x-- B --y-- C

A rule like this is sensitive to join order:

R(a, d) :-
    A(a, x),
    B(x, y),
    C(y, z),
    D(z, d).

Joining A with D first is a cross product. Joining adjacent atoms first preserves bindings and reduces intermediate results.

FlowLog's structural planning uses variable overlap to choose a plan tree that keeps joins connected and intermediate width smaller.

Width-Oriented Planning

FlowLog's planner is robustness-oriented rather than fully cost-based.

A conventional cost model needs statistics:

relation sizes
distinct counts
skew
selectivity
correlation

Recursive Datalog makes those estimates unstable because intermediate relations change across fixed-point iterations.

FlowLog instead uses structural signals:

how many variables two atoms share
how many variables an intermediate result must carry
whether a candidate plan creates disconnected joins
how deep the plan tree becomes

This is not guaranteed to be optimal. It is meant to avoid obviously bad plans.

That is a good fit for DBSP-backed work too, because a bad plan becomes maintained operator state.

Antijoin Timing

Negated atoms become antijoins.

An antijoin can only run after all of its variables are bound by prior positive atoms.

Example:

missing_src(graph, src) :-
    edge(graph, src, dst),
    not vertex(graph, src).

The antijoin against vertex(graph, src) can run immediately after edge because both graph and src are available.

In a larger rule:

bad(x, z) :-
    A(x, y),
    B(y, z),
    C(z, w),
    not D(x, z).

The antijoin against D(x, z) can run after A and B; it does not need to wait for C. Running it earlier may reduce the input to the later join with C.

This is the same issue as antijoin pushdown in the DBSP CRDT note.

Sideways Information Passing

Sideways information passing is semijoin-style filtering across a rule.

The intuition is:

derive useful keys
-> filter another relation to those keys
-> join less data

Example:

Reach(y) :- Reach(x), Arc(x, y).

If the current delta contains only a small set of Reach(x) values, then Arc only needs edges whose source is in that set. A semijoin can prefilter Arc before the recursive join.

For CRDT causal readiness, this suggests a physical plan centered on frontier operations:

new ready operations
-> candidate outgoing pred edges
-> newly ready operations

rather than a plan that repeatedly starts from roots.

Recursive Strata

Recursive rules require fixed-point execution.

FlowLog groups recursive rules into recursive strata, then executes them inside an iterative dataflow scope.

The important design point is that a recursive stratum can contain several rules deriving related IDBs. The planner must know:

which IDBs are loop variables
which relations enter the recursive scope from earlier strata
which outputs must be collected after convergence
which intermediate arrangements are useful across iterations

For DBSP, this maps to recursive circuits with feedback and delay. The frontend still needs the same rule-level information before it can produce a good circuit.

Multiple rules may derive the same relation, or several relations may reuse the same intermediate computation.

Example:

required_src(g, s) :- edge(g, s, d).
required_dst(g, d) :- edge(g, s, d).

Both rules scan or project from edge.

In larger Geomerge theories, many violation rules may share antecedent fragments. A planner should be able to notice common subplans:

common_antecedent(x, y)
-> violation_a(x)
-> violation_b(y)

FlowLog's explicit rule plans and collection signatures are a useful place to represent this sharing.

Physical Key Choice

Backend performance depends on key choice.

For a join:

R(x, z) :- A(x, y), B(y, z).

both A and B should be arranged by y.

For a later join:

S(x, w) :- R(x, z), C(z, w).

the output of the first join may need to be arranged by z, not by x.

That means the planner should choose output keys based on the next operation, not only the current operation.

This is one reason a simple relational algebra tree is not enough. The physical plan needs key and payload annotations.

Transfer to a DBSP Frontend

A DBSP frontend inspired by FlowLog should probably have these data structures:

relation schemas
rule catalogs
variable occurrence maps
dependency graph
strata
join graph per rule
logical relational plan
physical key annotations
backend lowering rules

The lowering should treat DBSP as the maintained execution backend:

projection -> DBSP projection
selection -> DBSP filter
join -> DBSP join with maintained state
antijoin -> DBSP antijoin or difference plan
union -> DBSP addition or union
distinct -> DBSP distinct
recursion -> DBSP fixed-point circuit

The key point is that DBSP should receive an already planned circuit, not raw Datalog text.

Changelog

May 20, 2026 -- First version created from the FlowLog implementation and DBSP synergy notes.

9.0 KiB Raw Blame History

FlowLog Technical Planning Notes

Short Answer

Catalog as Rule Metadata

Collection Shapes

Transformation Types

Transformation Flow

Join Graph

Width-Oriented Planning

Antijoin Timing

Sideways Information Passing

Recursive Strata

Subplan Sharing

Physical Key Choice

Transfer to a DBSP Frontend

Changelog

9.0 KiB

Raw Blame History