habedi-work/useful-notes

Hassan Abedi 2d3c02315a Add anote file for how we could use FlowLog and DBSP together

2026-05-20 10:32:46 +02:00

8.7 KiB

FlowLog and DBSP Synergy

A note on how FlowLog's Datalog planning ideas could support the DBSP, CRDT, and Geomerge work.

Short Answer

FlowLog and the DBSP notes meet at the boundary between Datalog rules and incremental execution.

DBSP answers:

How can a relational query be maintained over changing inputs?

FlowLog helps answer:

What relational plan should the incremental backend maintain?

The synergy is not that FlowLog should replace DBSP. The useful split is:

FlowLog-like frontend and optimizer
-> backend-neutral relational IR
-> DBSP-maintained circuit

FlowLog is a useful blueprint for the compiler and optimizer that should sit in front of DBSP.

Existing DBSP Direction

The DBSP notes are organized around three related goals.

First, CRDTs can be described as deterministic Datalog queries over immutable operation facts:

operation facts
-> Datalog rules
-> visible CRDT state

Second, DBSP can maintain those query results incrementally:

input relation deltas
-> DBSP circuit step
-> output relation deltas
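
The delta-in, delta-out step above can be illustrated with a small Python sketch. It does not use the DBSP Rust API: a relation delta is modelled as a plain dict from tuple to integer weight (a Z-set), and the names `zset_add` and `project_delta` are hypothetical. Projection is chosen because it is linear, so the step needs no stored state.

```python
from collections import Counter

def zset_add(a, b):
    """Add two Z-sets (tuple -> integer weight), dropping zero weights."""
    out = Counter(a)
    for row, w in b.items():
        out[row] += w
    return {row: w for row, w in out.items() if w != 0}

def project_delta(delta, columns):
    """Projection is linear: apply it to the delta alone, with no stored state."""
    out = {}
    for row, w in delta.items():
        key = tuple(row[i] for i in columns)
        out[key] = out.get(key, 0) + w
    return {row: w for row, w in out.items() if w != 0}

# Step 1 inserts two edges, step 2 deletes one of them.
delta1 = {("g", "a", "b"): 1, ("g", "a", "c"): 1}
delta2 = {("g", "a", "b"): -1}

out1 = project_delta(delta1, (0, 1))  # output delta for step 1
out2 = project_delta(delta2, (0, 1))  # output delta for step 2
state = zset_add(out1, out2)          # accumulated output after both steps
```

The projected tuple carries weight 2 after the first step because no `distinct` is applied; set semantics would need an extra, stateful operator.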

Third, Geomerge laws can be compiled into maintained violation relations:

compiled relational laws
-> violation queries
-> maintained violation deltas

The missing layer is query planning. A Datalog rule can be semantically correct but still produce a poor physical plan.

FlowLog's Useful Layer

FlowLog has a pipeline that is useful independently of its Differential Dataflow backend:

Datalog program
-> parser
-> strata
-> rule catalog
-> per-rule relational plan
-> optimizer
-> incremental dataflow backend

The reusable parts are:

dependency analysis
stratification
rule catalog construction
join graph extraction
structural join planning
sideways information passing
physical key and value shape selection

These are exactly the parts a DBSP-backed Datalog or Geolog compiler needs before lowering rules to DBSP.
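
As an illustration of the dependency-analysis and stratification items, the following Python sketch assigns a stratum to each predicate. It is not FlowLog's implementation; the rule encoding (a head name plus a list of `(body_predicate, negated)` pairs) and the function name `stratify` are assumptions made for this note.

```python
def stratify(rules):
    """rules: list of (head, [(body_pred, negated), ...]).

    Returns a dict pred -> stratum such that a head is at least as high as
    every positive body predicate and strictly higher than every negated one.
    """
    preds = {h for h, _ in rules} | {p for _, body in rules for p, _ in body}
    stratum = {p: 0 for p in preds}
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            for pred, negated in body:
                need = stratum[pred] + (1 if negated else 0)
                if stratum[head] < need:
                    # A valid stratum never reaches the number of predicates.
                    if need >= len(preds):
                        raise ValueError("negation inside a recursive cycle")
                    stratum[head] = need
                    changed = True
    return stratum

# The multi-value register and causal-readiness rules from the CRDT notes.
rules = [
    ("overwritten", [("pred", False)]),
    ("mvrStore", [("set", False), ("overwritten", True)]),
    ("isCausallyReady", [("isRoot", False)]),
    ("isCausallyReady", [("isCausallyReady", False), ("pred", False)]),
]
strata = stratify(rules)
```

Here `mvrStore` lands one stratum above `overwritten` because of the negation, while the positive recursion in `isCausallyReady` stays in stratum 0.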

CRDT Synergy

The CRDT notes identify three representative query shapes.

The multi-value register is mostly projection plus antijoin:

overwritten(RepId, Ctr) :-
    pred(RepId, Ctr, _, _).

mvrStore(Key, Value) :-
    set(RepId, Ctr, Key, Value),
    not overwritten(RepId, Ctr).

This is a favorable DBSP workload. The backend can maintain the antijoin state and process small updates without rescanning the full operation history.
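
A minimal Python sketch of that maintained antijoin follows. It assumes insert-only operation facts, matching the immutable operation log described earlier, and it is not DBSP code: the class `MvrView` and its methods are hypothetical. Each call returns the output delta for `mvrStore` as a dict from tuple to weight.

```python
class MvrView:
    """Maintains mvrStore(Key, Value) under insert-only operation facts."""

    def __init__(self):
        self.sets = {}            # (RepId, Ctr) -> (Key, Value)
        self.overwritten = set()  # (RepId, Ctr) pairs that have a successor

    def insert_set(self, rep, ctr, key, value):
        """New set(RepId, Ctr, Key, Value) fact."""
        self.sets[(rep, ctr)] = (key, value)
        if (rep, ctr) in self.overwritten:
            return {}
        return {(key, value): +1}

    def insert_pred(self, from_rep, from_ctr, to_rep, to_ctr):
        """New pred fact: (from_rep, from_ctr) now has a successor."""
        op = (from_rep, from_ctr)
        if op in self.overwritten:
            return {}  # already hidden, nothing changes
        self.overwritten.add(op)
        if op in self.sets:
            return {self.sets[op]: -1}  # retract the now-overwritten value
        return {}

view = MvrView()
d1 = view.insert_set("r1", 1, "x", "a")
d2 = view.insert_set("r2", 1, "x", "b")
d3 = view.insert_pred("r1", 1, "r2", 2)  # r1's write gains a successor
d4 = view.insert_pred("r1", 1, "r3", 1)  # second successor: no further change
```

Each update touches two hash lookups at most, independent of how long the operation history is.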

The causal-readiness query is harder:

isCausallyReady(RepId, Ctr) :-
    isRoot(RepId, Ctr).

isCausallyReady(RepId, Ctr) :-
    isCausallyReady(FromRepId, FromCtr),
    pred(FromRepId, FromCtr, RepId, Ctr).

This is recursive graph traversal. The DBSP CRDT notes report that the cost of maintaining this query can remain dependent on causal-history depth, even during warm updates.

FlowLog's planning ideas are relevant here. Sideways information passing suggests prefiltering recursive traversal through known relevant bindings. For CRDTs, that could mean using current heads, leaves, or newly arrived operations as a frontier instead of repeatedly deriving readiness from roots through the whole causal graph.
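
One way to picture the frontier idea is the following Python sketch. It is an assumption about how such a frontier could work, not something taken from FlowLog or the DBSP notes: the class `ReadyView` is hypothetical, facts are insert-only, and readiness follows the rule exactly as written above (one ready predecessor suffices).

```python
from collections import defaultdict, deque

class ReadyView:
    """Maintains isCausallyReady by propagating only from newly ready ops."""

    def __init__(self):
        self.succ = defaultdict(set)  # op -> ops that name it as predecessor
        self.ready = set()

    def _propagate(self, start):
        """Breadth-first walk from start, stopping at already-ready ops."""
        new = []
        queue = deque([start])
        while queue:
            op = queue.popleft()
            if op in self.ready:
                continue
            self.ready.add(op)
            new.append(op)
            queue.extend(self.succ[op])
        return new

    def insert_root(self, op):
        return self._propagate(op)

    def insert_pred(self, src, dst):
        self.succ[src].add(dst)
        # Only a ready source can make dst ready; otherwise just index the edge.
        return self._propagate(dst) if src in self.ready else []

view = ReadyView()
n1 = view.insert_pred(("a", 1), ("a", 2))  # source not ready yet
n2 = view.insert_pred(("a", 2), ("b", 1))
n3 = view.insert_root(("a", 1))            # readiness flows down the chain
n4 = view.insert_pred(("b", 1), ("b", 2))  # warm update: one new op
```

The last update visits a single operation rather than re-deriving readiness from the root, which is the behavior the frontier approach is after.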

The list CRDT query is also planning-sensitive. Relations such as firstChild, nextSibling, nextSiblingAnc, nextElem, and nextVisible create several joins, antijoins, and recursive steps. FlowLog-style rule catalogs and join planning would help choose better intermediate shapes before DBSP builds maintained operator state.

Geomerge Synergy

The Geomerge integration note proposes maintaining one combined violation relation.

Simple foreign-key laws are straightforward:

required_src(graph, src) :- G.E(graph, src, dst).
missing_src(graph, src) :- required_src(graph, src), not G.V(graph, src).

For this subset, DBSP can maintain projections and antijoins directly.

The need for FlowLog-style planning grows when laws contain several atoms:

violation(x, y, z) :-
    A(x, y),
    B(y, z),
    C(z),
    not D(x, z).

At that point, the compiler must decide:

which atoms join first
which variables form keys
where filters and antijoins should be applied
which intermediate fields must be retained
whether several laws share subplans

FlowLog's catalog and structural planning model is a good guide for this compiler layer.
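
Two of the decisions listed above, join order and antijoin placement, can be sketched in Python. The greedy heuristic here (start from the narrowest atom, then prefer the atom sharing the most already-bound variables) is a simple stand-in chosen for this note, not FlowLog's planner, and `order_joins` and `place_antijoins` are hypothetical names.

```python
def order_joins(atoms):
    """atoms: list of (name, vars). Returns the atoms in a greedy join order."""
    remaining = list(atoms)
    first = min(remaining, key=lambda a: len(a[1]))  # narrowest atom first
    remaining.remove(first)
    order, bound = [first], set(first[1])
    while remaining:
        # Most shared variables first, then fewest newly introduced ones.
        best = max(
            remaining,
            key=lambda a: (len(bound & set(a[1])), -len(set(a[1]) - bound)),
        )
        remaining.remove(best)
        order.append(best)
        bound |= set(best[1])
    return order

def place_antijoins(order, negated):
    """Insert each negated atom right after the step that binds its variables."""
    plan, bound, pending = [], set(), list(negated)
    for i, (name, vars_) in enumerate(order):
        plan.append(("scan" if i == 0 else "join", name))
        bound |= set(vars_)
        for atom in [n for n in pending if set(n[1]) <= bound]:
            plan.append(("antijoin", atom[0]))
            pending.remove(atom)
    if pending:
        raise ValueError("negated atom uses a variable no positive atom binds")
    return plan

positive = [("A", ("x", "y")), ("B", ("y", "z")), ("C", ("z",))]
negated = [("D", ("x", "z"))]

greedy_plan = place_antijoins(order_joins(positive), negated)
source_plan = place_antijoins(positive, negated)
```

With the source order A, B, C the antijoin on D can run before the join with C, because x and z are already bound after B. The greedy order starts from C and only binds x at the last join, so D moves to the end. Which plan is cheaper depends on relation sizes that this sketch does not model.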

The resulting Geomerge architecture could be:

FlatTheory laws
-> supported Datalog-shaped rules
-> FlowLog-like catalog and optimizer
-> relational violation plan
-> DBSP circuit
-> maintained violations relation

This keeps DBSP as a performance backend while giving Geomerge a real planning layer.

IR Boundary

The strongest architectural lesson is to avoid binding the system too tightly to either source syntax or backend syntax.

The durable boundary should be a relational intermediate representation:

source rules
-> rule catalog
-> relational IR
-> backend-specific lowering

For CRDTs, the source may be a Datalog dialect.

For Geomerge, the source may be compiled Geolog laws.

For execution, the backend may be DBSP, Differential Dataflow, or a non-incremental batch engine.

This matches the existing DBSP notes: DBSP should not own full source-language semantics. It should receive a relational plan that has already been checked, stratified, and optimized.
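
To make the boundary concrete, here is a small Python sketch of what a backend-neutral relational IR could look like. The node types `Scan`, `Project`, and `AntiJoin` and the function `evaluate` are hypothetical and cover only the foreign-key law from the Geomerge section. The evaluator plays the role of the non-incremental batch backend; a DBSP lowering would walk the same tree and emit circuit operators instead.

```python
from dataclasses import dataclass
from typing import Tuple, Union

@dataclass(frozen=True)
class Scan:
    relation: str

@dataclass(frozen=True)
class Project:
    source: "Plan"
    columns: Tuple[int, ...]

@dataclass(frozen=True)
class AntiJoin:
    left: "Plan"
    right: "Plan"
    left_key: Tuple[int, ...]
    right_key: Tuple[int, ...]

Plan = Union[Scan, Project, AntiJoin]

def evaluate(plan, db):
    """Batch lowering: evaluate the plan over a dict of relation name -> set."""
    if isinstance(plan, Scan):
        return set(db[plan.relation])
    if isinstance(plan, Project):
        rows = evaluate(plan.source, db)
        return {tuple(r[i] for i in plan.columns) for r in rows}
    if isinstance(plan, AntiJoin):
        right = {tuple(r[i] for i in plan.right_key)
                 for r in evaluate(plan.right, db)}
        return {r for r in evaluate(plan.left, db)
                if tuple(r[i] for i in plan.left_key) not in right}
    raise TypeError(f"unknown plan node: {plan!r}")

# missing_src(graph, src) :- required_src(graph, src), not G.V(graph, src).
missing_src = AntiJoin(Project(Scan("E"), (0, 1)), Scan("V"), (0, 1), (0, 1))

db = {
    "E": {("g", "a", "b"), ("g", "c", "a")},
    "V": {("g", "a"), ("g", "b")},
}
result = evaluate(missing_src, db)
```

Because the plan is plain data, the same `missing_src` value could be handed to a second lowering without touching the source rules.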

Optimization Transfer

FlowLog suggests several optimizations that transfer well to DBSP-backed work.

Structural Planning: Choose join trees from variable overlap and intermediate width, especially for recursive rules where cardinality estimates are weak.

Sideways Information Passing: Add semijoin-style filters so later joins and recursive steps see fewer irrelevant tuples.

Antijoin Scheduling: Apply negated atoms as soon as their variables are available. This matches the DBSP CRDT note's antijoin-pushdown agenda.

Subplan Sharing: Reuse common derived relations across laws or CRDT views when multiple outputs need the same intermediate facts.

Physical Key Choice: Pick key fields deliberately before lowering to the backend. DBSP joins also need maintained state, so bad key and arrangement choices will become runtime costs.
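
The sideways-information-passing item above can be illustrated with a toy Python example. It is a hand-written semijoin over lists, not FlowLog's SIP implementation, and the relations and helper names are made up for this note. The rule shape is A(x, y), B(y, z), C(z): filtering B by the keys of C before joining with A shrinks the intermediate result without changing the answer.

```python
def semijoin(rows, col, keys):
    """Keep only rows whose value in column col appears in keys."""
    return [r for r in rows if r[col] in keys]

def join_ab(a_rows, b_rows):
    """Hash join of A(x, y) with B(y, z) on y."""
    by_y = {}
    for y, z in b_rows:
        by_y.setdefault(y, []).append(z)
    return [(x, y, z) for x, y in a_rows for z in by_y.get(y, [])]

A = [(1, 10), (2, 20), (3, 30)]
B = [(10, 100), (20, 200), (30, 300)]
C = [(100,)]

c_keys = {z for (z,) in C}

naive_mid = join_ab(A, B)                     # join first, filter later
sip_mid = join_ab(A, semijoin(B, 1, c_keys))  # prefilter B by C's keys

naive = [t for t in naive_mid if t[2] in c_keys]
sip = [t for t in sip_mid if t[2] in c_keys]
```

In an incremental backend the intermediate relation is maintained state, so the smaller `sip_mid` would be a saving for the lifetime of the query rather than for one evaluation.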

Backend Comparison

FlowLog targets Differential Dataflow as its backend. The DBSP notes target DBSP.

The models differ:

Differential Dataflow uses collections with logical time and differences.
DBSP uses streams, Z-sets, integration, differentiation, and circuits.
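
The DBSP vocabulary in the second line can be made concrete with a short Python sketch. It models Z-sets as dicts and streams as lists, and it writes the incremental version of a query as differentiate(query(integrate(deltas))). This is the definitional form only: it recomputes the query on every integrated state, whereas a real DBSP circuit replaces it with operators that work on deltas directly. All function names are hypothetical.

```python
def zadd(a, b):
    """Pointwise sum of two Z-sets, dropping zero weights."""
    out = dict(a)
    for row, w in b.items():
        out[row] = out.get(row, 0) + w
    return {row: w for row, w in out.items() if w != 0}

def zneg(a):
    return {row: -w for row, w in a.items()}

def integrate(stream):
    """Running sum: turns a stream of deltas into a stream of full states."""
    total, out = {}, []
    for delta in stream:
        total = zadd(total, delta)
        out.append(total)
    return out

def differentiate(stream):
    """Successive differences: turns a stream of states back into deltas."""
    prev, out = {}, []
    for value in stream:
        out.append(zadd(value, zneg(prev)))
        prev = value
    return out

def distinct(z):
    """Set semantics: weight 1 for every row with positive weight."""
    return {row: 1 for row, w in z.items() if w > 0}

def incremental(query, deltas):
    return differentiate([query(s) for s in integrate(deltas)])

deltas = [{"a": 1}, {"a": 1}, {"a": -2}]
out = incremental(distinct, deltas)
```

The second input delta produces an empty output delta because `a` was already present, and the final deletion of both copies produces a single retraction.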

The shared lesson is more important than the difference:

incremental backends maintain operator state

That means bad plans become persistent state, not just one bad query execution. A poor join order or unnecessary intermediate relation can increase memory use and update cost for the lifetime of the maintained query.

FlowLog is useful because it treats planning as a first-class layer before execution.

Proposed Synergy Path

The practical path is incremental.

First, use FlowLog as a reading reference for a DBSP frontend:

parser
-> dependency graph
-> strata
-> rule catalog
-> relational plan

Second, add a small optimizer:

join ordering
antijoin pushdown
simple SIP for bound variables

Third, lower the optimized plan to DBSP operators:

projection
selection
join
antijoin
union
distinct
fixed point

Fourth, test against the current direct implementations:

snapshot result == DBSP-maintained result
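
The equality check above can be run as a differential test. The Python sketch below uses the foreign-key law from the Geomerge section: `snapshot` recomputes `missing_src` from scratch, and the hypothetical class `Maintained` stands in for the DBSP-maintained result by keeping per-key support counts. In a real test the maintained side would be the DBSP circuit output; everything here is an assumption made for illustration.

```python
import random
from collections import Counter

def snapshot(E, V):
    """Recompute missing_src(graph, src) from the full relations."""
    return {(g, s) for (g, s, _) in E if (g, s) not in V}

class Maintained:
    """Stand-in for the incrementally maintained violation relation."""

    def __init__(self):
        self.req = Counter()  # (graph, src) -> number of supporting edges
        self.V = set()
        self.out = set()

    def _refresh(self, key):
        if self.req[key] > 0 and key not in self.V:
            self.out.add(key)
        else:
            self.out.discard(key)

    def edge(self, g, s, d, weight):
        self.req[(g, s)] += weight
        self._refresh((g, s))

    def vertex(self, g, v, weight):
        (self.V.add if weight > 0 else self.V.discard)((g, v))
        self._refresh((g, v))

rng = random.Random(0)  # fixed seed so failures are reproducible
E, V, m = set(), set(), Maintained()
for _ in range(200):
    if rng.random() < 0.5:
        e = ("g", rng.choice("abc"), rng.choice("abc"))
        weight = -1 if e in E else +1  # toggle: delete if present, else insert
        (E.remove if weight < 0 else E.add)(e)
        m.edge(*e, weight)
    else:
        v = ("g", rng.choice("abc"))
        weight = -1 if v in V else +1
        (V.remove if weight < 0 else V.add)(v)
        m.vertex(*v, weight)
    # The maintained result must match a fresh snapshot after every update.
    assert m.out == snapshot(E, V)
```

Checking after every update, not only at the end, helps locate the first delta at which the two results diverge.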

For Geomerge, the first target should stay the same as the DBSP integration note: supported laws compiled into one maintained violations relation.

For CRDTs, the first target should be causal readiness and list traversal, since those are where the existing DBSP notes identify performance risk.

Open Questions

Can FlowLog-style SIP be adapted to causal-readiness frontiers?
Can DBSP expose enough physical planning control for key choice and subplan sharing?
Should the Datalog frontend target a FlowLog-like IR before DBSP lowering?
Can Geomerge laws use the same catalog structure as Datalog rules?
Which recursive CRDT queries benefit from structural planning?
Is hydration better handled by DBSP, a batch engine, or persisted operator state?
Can one optimizer target both DBSP and Differential Dataflow backends?

Bottom Line

FlowLog should be treated as an optimizer and compiler blueprint for the DBSP work.

The DBSP notes already have the right execution target:

maintained relational deltas

FlowLog adds the missing planning discipline:

rule catalog
+ join graph
+ recursive strata
+ robust plan choice
+ SIP-style prefiltering

Together, the systems suggest a stronger architecture:

Datalog or Geolog rules
-> FlowLog-like planning layer
-> DBSP incremental backend
-> maintained CRDT views or violation relations

Changelog

May 20, 2026 -- First version created from the DBSP and FlowLog notes.
