Using FlowLog Ideas

A practical note on how FlowLog or FlowLog-style planning could be used for the local DBSP, CRDT, and Geomerge work.


Short Answer

The most useful way to use FlowLog is as a planning reference, not as a direct dependency.

There are three possible levels of use:

Level 1: run FlowLog examples to learn workload behavior
Level 2: borrow FlowLog planning ideas for a DBSP frontend
Level 3: compare DBSP and Differential Dataflow backends on the same Datalog programs

The practical near-term path is Level 2: use FlowLog's catalog, join planning, antijoin scheduling, and sideways information passing (SIP) ideas to design a better compiler layer in front of DBSP.


Initial Non-Goals

The first step should not be replacing the DBSP backend with FlowLog.

That would conflate two separate questions:

  • Which incremental backend should maintain deltas?
  • Which frontend planner should produce the backend plan?

The DBSP notes already treat DBSP as a formal view-maintenance backend. FlowLog is more useful as a guide for the missing frontend and optimizer.

Nor should the first step adopt FlowLog's syntax as the durable source language. Geomerge and Geolog already have their own source concepts. Datalog should remain an intermediate or testing language unless an explicit decision makes it the user-facing language.


Use Case 1: Better CRDT Query Planning

The CRDT queries in the DBSP notes include:

  • multi-value register queries
  • causal-readiness queries
  • list traversal queries
  • tombstone-skipping queries

The multi-value register is already simple enough:

set + pred -> overwritten -> visible values
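The pipeline above can be sketched in a few lines of Python. The fact shapes `set_op(op_id, key, value)` and `pred(op_id, overwritten_op_id)` are assumptions made for illustration; the DBSP notes may name or shape these relations differently.

```python
# Hypothetical fact shapes (assumptions, not taken from the DBSP notes):
#   set_ops: set of (op_id, key, value)
#   preds:   set of (op_id, overwritten_op_id)
def visible_values(set_ops, preds):
    # overwritten: every operation that some other operation names as a predecessor.
    overwritten = {pred_id for (_op_id, pred_id) in preds}
    # visible values: antijoin of set against overwritten on op_id.
    return {(key, value) for (op_id, key, value) in set_ops if op_id not in overwritten}

set_ops = {("a1", "title", "draft"), ("a2", "title", "v2"), ("b1", "title", "final")}
preds = {("a2", "a1")}  # a2 overwrote a1; b1 is concurrent with a2
print(sorted(visible_values(set_ops, preds)))
```

Both concurrent values survive, which is the multi-value behavior. There is only one join key and one antijoin here, so a planner has little to decide.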

The planning value is higher for causal readiness:

pred graph
-> roots
-> recursive reachability
-> ready operations

and list traversal:

insert tree
-> first child
-> next sibling
-> ancestor sibling
-> next element
-> next visible element

These queries contain several recursive or join-heavy rules. FlowLog-style planning can help by choosing join keys, pushing antijoins earlier, and adding semijoin filters around the current frontier.

The concrete experiment:

causal-readiness Datalog rules
rule catalog
naive plan
FlowLog-style planned version
DBSP hydration and warm-update comparison

The success criterion is not only lower total runtime. The cost of a small warm update should also depend less on the depth of the causal history.


Use Case 2: Geomerge Violation Planning

The Geomerge integration note proposes compiling supported laws into violation relations.

For simple laws, this is direct:

missing_src(g, s) :-
    edge(g, s, d),
    not vertex(g, s).
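As a reference for what this rule computes, here is a minimal snapshot evaluation in Python. The tuple layouts are assumptions read off the rule's argument order, and this is plain set logic, not a DBSP circuit.

```python
def missing_src(edges, vertices):
    """edges: set of (graph, src, dst); vertices: set of (graph, vertex)."""
    # Antijoin: keep the (graph, src) projection of each edge that has no vertex row.
    return {(g, s) for (g, s, _d) in edges if (g, s) not in vertices}

edges = {("g1", "a", "b"), ("g1", "c", "b"), ("g2", "a", "a")}
vertices = {("g1", "a"), ("g1", "b"), ("g2", "a")}
print(missing_src(edges, vertices))  # {('g1', 'c')}
```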

For larger laws, the compiler needs a planner:

violation(vars) :-
    antecedent_atom_1(...),
    antecedent_atom_2(...),
    antecedent_atom_3(...),
    not consequent_atom(...).

FlowLog-style catalogs would help the compiler answer:

  • which variables are introduced by each atom
  • which atoms join on which variables
  • when each negated consequent can be checked
  • which projected values are needed for the violation row
  • whether two laws share a common antecedent
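A minimal sketch of such a catalog, covering the first four questions for a single rule. The `Atom` type, the `build_catalog` function, and the `color` relation in the example are hypothetical names, not FlowLog's actual data structures. Detecting shared antecedents across laws is left out.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Atom:
    relation: str
    args: tuple
    negated: bool = False

def build_catalog(head_vars, body):
    positives = [a for a in body if not a.negated]
    introduced, bound_after, bound = [], [], set()
    for atom in positives:
        # Variables first bound by this atom, in argument order.
        introduced.append([v for v in dict.fromkeys(atom.args) if v not in bound])
        bound |= set(atom.args)
        bound_after.append(set(bound))
    # Variables shared by each pair of positive atoms, keyed by body position.
    joins = {}
    for i in range(len(positives)):
        for j in range(i + 1, len(positives)):
            shared = sorted(set(positives[i].args) & set(positives[j].args))
            if shared:
                joins[(i, j)] = shared
    # Earliest positive atom after which each negated atom is fully bound.
    negation_ready = {}
    for atom in body:
        if atom.negated:
            negation_ready[atom.relation] = next(
                (i for i, seen in enumerate(bound_after) if set(atom.args) <= seen),
                None,
            )
    return {
        "introduced": introduced,
        "joins": joins,
        "negation_ready": negation_ready,
        "projected": list(head_vars),
        "unbound_head_vars": sorted(set(head_vars) - bound),
    }

# violation(g, s) :- edge(g, s, d), color(g, d, c), not vertex(g, s).
body = [
    Atom("edge", ("g", "s", "d")),
    Atom("color", ("g", "d", "c")),
    Atom("vertex", ("g", "s"), negated=True),
]
catalog = build_catalog(("g", "s"), body)
print(catalog)
```

In this example the negated `vertex` atom is fully bound after the first positive atom, so a planner could schedule the antijoin before the join with `color`.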

The concrete experiment:

one Geomerge fixture theory
Datalog-like rule per supported law
join graph per rule
planned relational tree
comparison with the current direct validator's binding order

This experiment is useful before any DBSP integration exists, because it tests whether the compiler can analyze the shape of each law.


Use Case 3: Backend Comparison

FlowLog can also be used as a comparison point for DBSP.

The fair comparison is not:

FlowLog product vs DBSP product

The useful comparison is:

same Datalog query
same input facts
same output relation
different backend lowering

Candidate workloads:

  • transitive closure
  • causal readiness
  • list next-element traversal
  • missing foreign-key violations
  • multi-atom Geomerge antecedents

The comparison should measure:

  • hydration time
  • warm-update time
  • memory use
  • sensitivity to join order
  • output delta size
  • ease of rollback or preview execution
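A sketch of the harness shape, assuming each backend exposes a `hydrate` and an `insert_edge` method. Both backends below are toy Python stand-ins for the missing-source query, not DBSP or Differential Dataflow, and the method names are hypothetical. The point is the discipline: same query, same facts, same output relation, and an agreement check before any timing is trusted.

```python
import time

class RecomputeBackend:
    """Re-evaluates the whole query on every update and diffs the result."""
    def hydrate(self, edges, vertices):
        self.edges, self.vertices = set(edges), set(vertices)
        self.output = self._evaluate()
        return set(self.output)

    def _evaluate(self):
        return {(g, s) for (g, s, _d) in self.edges if (g, s) not in self.vertices}

    def insert_edge(self, edge):
        self.edges.add(edge)
        new_output = self._evaluate()
        delta = new_output - self.output
        self.output = new_output
        return delta

class IncrementalBackend:
    """Maintains the output and inspects only the inserted edge."""
    def hydrate(self, edges, vertices):
        self.vertices = set(vertices)
        self.output = {(g, s) for (g, s, _d) in edges if (g, s) not in self.vertices}
        return set(self.output)

    def insert_edge(self, edge):
        row = edge[:2]
        if row in self.vertices or row in self.output:
            return set()
        self.output.add(row)
        return {row}

def measure(backend, edges, vertices, new_edge):
    t0 = time.perf_counter()
    output = backend.hydrate(edges, vertices)
    t1 = time.perf_counter()
    delta = backend.insert_edge(new_edge)
    t2 = time.perf_counter()
    return {"hydration_s": t1 - t0, "warm_update_s": t2 - t1,
            "output": output, "delta": delta, "delta_rows": len(delta)}

edges = {("g", i, i + 1) for i in range(1000)}
vertices = {("g", i) for i in range(0, 1000, 2)}
results = [measure(b, edges, vertices, ("g", 5000, 0))
           for b in (RecomputeBackend(), IncrementalBackend())]
# Timings are only comparable if both lowerings agree on the answer.
assert results[0]["output"] == results[1]["output"]
assert results[0]["delta"] == results[1]["delta"]
```

Memory use and join-order sensitivity would need additional instrumentation that this sketch does not attempt.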

These measurements help decide whether DBSP needs FlowLog-like planning, whether Differential Dataflow is better for some recursive workloads, or whether a hybrid batch-plus-incremental strategy is needed.


Use Case 4: Test Corpus for Datalog Lowering

FlowLog's examples suggest a useful test corpus shape.

A local Datalog-to-DBSP frontend should include small programs for:

  • reachability
  • transitive closure
  • connected components
  • antijoin checks
  • aggregation checks
  • CRDT multi-value register
  • CRDT causal readiness
  • CRDT list traversal
  • Geomerge-style violation detection

Each test should define:

  • input schemas
  • input facts
  • expected output facts
  • expected output deltas for at least one update
  • whether recursion or negation is used
  • whether the program should be accepted or rejected
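One possible record shape for such a test, as a sketch. The class name, field names, and the `shape_problems` helper are hypothetical; deltas are written as row-to-weight maps on the assumption that the frontend reports signed changes.

```python
from dataclasses import dataclass

@dataclass
class LoweringTestCase:
    name: str
    program: str
    input_schemas: dict    # relation -> tuple of column names
    input_facts: dict      # relation -> set of rows
    expected_output: dict  # relation -> set of rows
    update: dict           # relation -> {row: weight}
    expected_delta: dict   # relation -> {row: weight}
    uses_recursion: bool = False
    uses_negation: bool = False
    should_be_accepted: bool = True

    def shape_problems(self):
        """Check that every input and update row matches its declared schema."""
        problems = []
        for source in (self.input_facts, self.update):
            for relation, rows in source.items():
                columns = self.input_schemas.get(relation)
                if columns is None:
                    problems.append(f"{relation}: no schema")
                    continue
                for row in rows:
                    if len(row) != len(columns):
                        problems.append(f"{relation}: {row} does not match {columns}")
        return problems

case = LoweringTestCase(
    name="missing_src",
    program="missing_src(g, s) :- edge(g, s, d), not vertex(g, s).",
    input_schemas={"edge": ("graph", "src", "dst"), "vertex": ("graph", "id")},
    input_facts={"edge": {("g1", "a", "b")}, "vertex": {("g1", "b")}},
    expected_output={"missing_src": {("g1", "a")}},
    update={"vertex": {("g1", "a"): +1}},
    expected_delta={"missing_src": {("g1", "a"): -1}},
    uses_negation=True,
)
print(case.shape_problems())  # []
```

Inserting the missing vertex retracts the violation, so the expected delta carries weight -1.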

This gives a better foundation than testing only one CRDT or one Geomerge law.


First Prototype

The first useful prototype should be small.

A planning-only tool:

Datalog-like rule text
-> parsed rules
-> dependency graph
-> strata
-> rule catalog
-> join graph
-> planned relational tree

It does not need to run DBSP at first.
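The dependency-graph and strata steps can be sketched without any backend. The version below uses the textbook relaxation algorithm for stratified negation; the rule encoding and the `blocked` and `op` relations in the example are assumptions for illustration.

```python
def stratify(rules):
    """rules: list of (head, [(body_relation, negated), ...]).

    Returns relation -> stratum number, or raises ValueError when some
    recursion passes through negation.
    """
    relations = {head for head, _ in rules}
    relations |= {rel for _, body in rules for rel, _ in body}
    stratum = {rel: 0 for rel in relations}
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            for rel, negated in body:
                # A head sits at or above positive dependencies,
                # and strictly above negated ones.
                needed = stratum[rel] + (1 if negated else 0)
                if stratum[head] < needed:
                    if needed >= len(relations):
                        raise ValueError(f"not stratifiable: {head} depends on not {rel}")
                    stratum[head] = needed
                    changed = True
    return stratum

rules = [
    ("ready", [("roots", False)]),
    ("ready", [("ready", False), ("pred", False)]),
    ("blocked", [("op", False), ("ready", True)]),
]
print(stratify(rules))
```

Here `ready` is recursive but stays in stratum 0, while `blocked` lands in stratum 1 because it negates `ready`.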

The output can be textual:

rule: missing_src
positive atoms: edge
negative atoms: vertex
join graph: none
plan:
  scan edge
  project (graph, src)
  antijoin vertex on (graph, src)
  emit violation row

For recursive rules, the output can identify the loop:

recursive stratum:
  ready

base:
  roots -> ready

step:
  ready join pred on operation id
  project successor operation id

This prototype would validate the compiler shape before depending on a backend API.
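For reference, the base and step of the recursive stratum shown above correspond to a semi-naive fixpoint like the following. This is a sketch of the loop's semantics in plain Python, reading `pred` as (operation id, successor operation id) pairs, which is an assumption about the relation's layout.

```python
def ready_operations(roots, pred_edges):
    """pred_edges: set of (operation_id, successor_operation_id)."""
    successors = {}
    for op, nxt in pred_edges:
        successors.setdefault(op, set()).add(nxt)
    ready = set(roots)   # base: roots -> ready
    delta = set(roots)
    while delta:
        # step: join only the newly derived facts with pred,
        # then project the successor operation id.
        derived = {nxt for op in delta for nxt in successors.get(op, ())}
        delta = derived - ready
        ready |= delta
    return ready

pred_edges = {("a", "b"), ("b", "c"), ("a", "c"), ("x", "y")}
print(sorted(ready_operations({"a"}, pred_edges)))  # ['a', 'b', 'c']
```

Each round joins only the facts derived in the previous round, which is the property the planned version is meant to preserve.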


Second Prototype

The second prototype should lower a narrow subset to DBSP.

Supported subset:

  • relation declarations
  • positive atoms
  • equality joins
  • constants
  • simple comparisons
  • stratified negation
  • union of repeated rule heads
  • one recursive IDB at a time

Excluded subset:

  • aggregation
  • mutual recursion
  • disjunction
  • existential generation
  • equality saturation
  • custom scalar functions

The target workloads:

  • missing_src and missing_dst
  • multi-value register
  • transitive closure
  • causal readiness

This subset is enough to test the important bridge:

planned rules -> DBSP-maintained outputs


Data Model Decisions

Several decisions should be made explicitly before implementation.

Set or Multiset Semantics: CRDT operation facts are usually set-like. DBSP uses Z-set weights internally. The frontend should define when distinct is applied.
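A small illustration of why the placement of distinct matters, with Z-sets modeled as plain row-to-weight dictionaries. The helper names are hypothetical and this is not the DBSP API. Here distinct keeps rows with positive weight and resets their weight to 1.

```python
def zset_add(left, right):
    """Pointwise sum of weights; rows whose weight reaches zero are dropped."""
    out = dict(left)
    for row, weight in right.items():
        total = out.get(row, 0) + weight
        if total:
            out[row] = total
        else:
            out.pop(row, None)
    return out

def distinct(zset):
    """Set semantics: positive weights become 1, everything else disappears."""
    return {row: 1 for row, weight in zset.items() if weight > 0}

# Two edges share the source ("g1", "a"), so projecting edges to
# (graph, src) yields weight 2 for that row.
projected = {("g1", "a"): 2}
after_one_retraction = zset_add(projected, {("g1", "a"): -1})
after_two_retractions = zset_add(after_one_retraction, {("g1", "a"): -1})

print(distinct(projected))              # {('g1', 'a'): 1}
print(distinct(after_one_retraction))   # {('g1', 'a'): 1}
print(distinct(after_two_retractions))  # {}
```

The row leaves the distinct output only when the last supporting edge is retracted, which is why the frontend needs a rule for where weights are collapsed.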

Operation Identity: CRDT examples use (replica_id, counter). The planner should treat this pair either as two scalar fields or as one logical key with two physical fields.

Violation Rows: Geomerge violations should include enough context for error messages, not just a boolean.

Output Integration: DBSP emits deltas. Applications often need an integrated current view. The runtime boundary should say who owns that integration.

Rollback: Geomerge validation needs preview or rollback behavior. With weighted deltas, applying the inverse delta is a plausible rollback mechanism, but it must stay transactionally coupled to storage so that checker state and stored data cannot diverge.
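The integration and rollback points can be made concrete with a small sketch. `IntegratedView` is a hypothetical class, not a DBSP type; it folds weighted deltas into a current view and implements preview by applying a delta and then its inverse. It does not address the coupling to storage, which is the harder part.

```python
class IntegratedView:
    """Holds the current view as row -> weight and folds deltas into it."""

    def __init__(self):
        self.rows = {}

    def apply(self, delta):
        for row, weight in delta.items():
            total = self.rows.get(row, 0) + weight
            if total:
                self.rows[row] = total
            else:
                self.rows.pop(row, None)

    def preview(self, delta):
        """Return the view as it would look after delta, leaving state unchanged."""
        self.apply(delta)
        try:
            return dict(self.rows)
        finally:
            # Roll back by applying the inverse delta.
            self.apply({row: -weight for row, weight in delta.items()})

view = IntegratedView()
view.apply({("g1", "c"): 1})
would_be = view.preview({("g1", "c"): -1, ("g1", "d"): 1})
print(would_be)   # {('g1', 'd'): 1}
print(view.rows)  # {('g1', 'c'): 1}
```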


Evaluation Plan

The evaluation should separate correctness from performance.

Correctness checks:

planned evaluation == naive snapshot evaluation
DBSP maintained result == snapshot result
failed Geomerge transaction leaves no checker drift
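The second check can be exercised as a randomized differential test. In the sketch below, `MaintainedMissingSrc` is a hand-written stand-in for a maintained DBSP output, used only to show the test shape; the relation layouts and the toggle-style update generator are assumptions.

```python
import random

def snapshot_missing_src(edges, vertices):
    return {(g, s) for (g, s, _d) in edges if (g, s) not in vertices}

class MaintainedMissingSrc:
    """Incrementally maintained missing_src, driven by weighted updates."""

    def __init__(self):
        self.src_weight = {}  # (graph, src) -> number of edges with that source
        self.vertices = set()

    def apply(self, relation, row, weight):
        if relation == "edge":
            key = row[:2]
            total = self.src_weight.get(key, 0) + weight
            if total:
                self.src_weight[key] = total
            else:
                self.src_weight.pop(key, None)
        elif weight > 0:
            self.vertices.add(row)
        else:
            self.vertices.discard(row)

    def output(self):
        return {key for key in self.src_weight if key not in self.vertices}

def run_check(steps, seed):
    rng = random.Random(seed)
    edges, vertices = set(), set()
    maintained = MaintainedMissingSrc()
    for _ in range(steps):
        if rng.random() < 0.5:
            relation, row, current = "edge", ("g", rng.randrange(4), rng.randrange(4)), edges
        else:
            relation, row, current = "vertex", ("g", rng.randrange(4)), vertices
        # Toggle the row so the inputs stay set-like: insert if absent, delete if present.
        weight = -1 if row in current else 1
        if weight > 0:
            current.add(row)
        else:
            current.remove(row)
        maintained.apply(relation, row, weight)
        if maintained.output() != snapshot_missing_src(edges, vertices):
            return False
    return True

print(run_check(steps=500, seed=0))
```

The small value domain is deliberate: it forces frequent collisions, insertions, and retractions on the same rows.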

Performance checks:

hydration time
warm-update time
memory used by maintained state
number of output delta rows
history-depth sensitivity
join-order sensitivity

The most important performance test is causal readiness:

large causal history
+ small new update
-> does update cost grow with history depth?

If the answer is yes, the frontend needs frontier-aware planning or a different physical representation.
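A toy version of this test, using edge visits as a proxy for update cost. It compares full recomputation against a frontier-driven insert on a linear causal chain. Everything here is an illustrative stand-in written in plain Python: it handles insertions only and says nothing about how DBSP itself behaves.

```python
from collections import defaultdict

def recompute_ready(roots, edges):
    """Snapshot evaluation. Returns (ready set, number of edge visits)."""
    succ = defaultdict(list)
    for src, dst in edges:
        succ[src].append(dst)
    ready, frontier, visits = set(roots), list(roots), 0
    while frontier:
        op = frontier.pop()
        for nxt in succ[op]:
            visits += 1
            if nxt not in ready:
                ready.add(nxt)
                frontier.append(nxt)
    return ready, visits

def insert_edge(ready, succ, src, dst):
    """Warm update starting from the new edge only. Mutates ready and succ."""
    succ[src].append(dst)
    visits = 1
    if src not in ready or dst in ready:
        return visits
    ready.add(dst)
    frontier = [dst]
    while frontier:
        op = frontier.pop()
        for nxt in succ[op]:
            visits += 1
            if nxt not in ready:
                ready.add(nxt)
                frontier.append(nxt)
    return visits

def measure(depth):
    """Chain 0 -> 1 -> ... -> depth, then append one more operation."""
    edges = [(i, i + 1) for i in range(depth)]
    new_edge = (depth, depth + 1)
    _, full_visits = recompute_ready({0}, edges + [new_edge])
    ready, _ = recompute_ready({0}, edges)
    succ = defaultdict(list)
    for src, dst in edges:
        succ[src].append(dst)
    warm_visits = insert_edge(ready, succ, *new_edge)
    return full_visits, warm_visits

for depth in (10, 100, 1000):
    print(depth, measure(depth))
```

On this chain, recomputation visits every edge while the frontier-driven insert visits one, independent of depth. A maintained backend whose warm-update cost tracks the first number rather than the second would be showing the history-depth sensitivity described above.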


Decision Points

The main decision points are:

  • whether to implement a Datalog frontend or compile directly from Geolog laws
  • whether the relational IR should be FlowLog-like, DBSP-like, or custom
  • whether recursive planning should support mutual recursion early
  • whether SIP should be automatic, directive-controlled, or both
  • whether hydration should use the same backend as warm updates
  • whether to persist backend operator state
  • whether to compare against Differential Dataflow for recursive workloads

These decisions should stay separate. Choosing DBSP as the backend does not force a particular Datalog syntax. Choosing a FlowLog-like planner does not force Differential Dataflow as the backend.


Practical Recommendation

The first practical step is a planning-only FlowLog-inspired compiler layer.

The next step is lowering a small subset to DBSP.

After that, FlowLog itself can serve as a comparison backend for the same small programs.

The goal should be:

one rule frontend
one relational IR
two possible execution backends

That architecture would make it possible to test whether performance problems come from the query semantics, the planner, or the backend.


Changelog

  • May 20, 2026 -- First version created from FlowLog, DBSP, CRDT, and Geomerge notes.