useful-notes/flowlog/005-using-flowlog-ideas.md

381 lines
9.1 KiB
Markdown
Raw Normal View History

# Using FlowLog Ideas
A practical note on how FlowLog or FlowLog-style planning could be used for the local DBSP, CRDT, and Geomerge work.
---
## Short Answer
The most useful way to use FlowLog is as a planning reference, not as a direct dependency.
There are three possible levels of use:
```text
Level 1: run FlowLog examples to learn workload behavior
Level 2: borrow FlowLog planning ideas for a DBSP frontend
Level 3: compare DBSP and Differential Dataflow backends on the same Datalog programs
```
The practical near-term path is Level 2. Use FlowLog's catalog, join planning, antijoin scheduling, and SIP ideas to design a better compiler layer before DBSP.
---
## Initial Non-Goals
The first step should not be replacing the DBSP backend with FlowLog.
That would conflate two separate questions:
- Which incremental backend should maintain deltas?
- Which frontend planner should produce the backend plan?
The DBSP notes are already about DBSP as a formal view-maintenance backend. FlowLog is more useful as a guide for the missing frontend and optimizer.
The first step should not be adopting FlowLog's syntax as the durable source language either. Geomerge and Geolog already have their own source concepts. Datalog should be an intermediate or testing language unless the user-facing language decision is explicit.
---
## Use Case 1: Better CRDT Query Planning
The CRDT queries in the DBSP notes include:
- multi-value register queries
- causal-readiness queries
- list traversal queries
- tombstone-skipping queries
The multi-value register is already simple enough:
```text
set + pred -> overwritten -> visible values
```
The planning value is higher for causal readiness:
```text
pred graph
-> roots
-> recursive reachability
-> ready operations
```
and list traversal:
```text
insert tree
-> first child
-> next sibling
-> ancestor sibling
-> next element
-> next visible element
```
These queries contain several recursive or join-heavy rules. FlowLog-style planning can help by choosing join keys, pushing antijoins earlier, and adding semijoin filters around the current frontier.
The concrete experiment:
```text
causal-readiness Datalog rules
rule catalog
naive plan
FlowLog-style planned version
DBSP hydration and warm-update comparison
```
The success criterion is not only lower total runtime. It is lower dependence on causal-history depth for small warm updates.
---
## Use Case 2: Geomerge Violation Planning
The Geomerge integration note proposes compiling supported laws into violation relations.
For simple laws, this is direct:
```text
missing_src(g, s) :-
edge(g, s, d),
not vertex(g, s).
```
For larger laws, the compiler needs a planner:
```text
violation(vars) :-
antecedent_atom_1(...),
antecedent_atom_2(...),
antecedent_atom_3(...),
not consequent_atom(...).
```
FlowLog-style catalogs would help the compiler answer:
- which variables are introduced by each atom
- which atoms join on which variables
- when each negated consequent can be checked
- which projected values are needed for the violation row
- whether two laws share a common antecedent
The concrete experiment:
```text
one Geomerge fixture theory
Datalog-like rule per supported law
join graph per rule
planned relational tree
comparison with the current direct validator's binding order
```
This can be useful before any DBSP integration exists, because it tests whether the compiler can understand the law shape.
---
## Use Case 3: Backend Comparison
FlowLog can also be used as a comparison point for DBSP.
The fair comparison is not:
```text
FlowLog product vs DBSP product
```
The useful comparison is:
```text
same Datalog query
same input facts
same output relation
different backend lowering
```
Candidate workloads:
- transitive closure
- causal readiness
- list next-element traversal
- missing foreign-key violations
- multi-atom Geomerge antecedents
The comparison should measure:
- hydration time
- warm-update time
- memory use
- sensitivity to join order
- output delta size
- ease of rollback or preview execution
This helps decide whether DBSP needs FlowLog-like planning, whether Differential Dataflow is better for some recursive workloads, or whether a hybrid batch-plus-incremental strategy is needed.
---
## Use Case 4: Test Corpus for Datalog Lowering
FlowLog's examples suggest a useful test corpus shape.
A local Datalog-to-DBSP frontend should include small programs for:
- reachability
- transitive closure
- connected components
- antijoin checks
- aggregation checks
- CRDT multi-value register
- CRDT causal readiness
- CRDT list traversal
- Geomerge-style violation detection
Each test should define:
- input schemas
- input facts
- expected output facts
- expected output deltas for at least one update
- whether recursion or negation is used
- whether the program should be accepted or rejected
This gives a better foundation than testing only one CRDT or one Geomerge law.
---
## First Prototype
The first useful prototype should be small.
A planning-only tool:
```text
Datalog-like rule text
-> parsed rules
-> dependency graph
-> strata
-> rule catalog
-> join graph
-> planned relational tree
```
It does not need to run DBSP at first.
The output can be textual:
```text
rule: missing_src
positive atoms: edge
negative atoms: vertex
join graph: none
plan:
scan edge
project (graph, src)
antijoin vertex on (graph, src)
emit violation row
```
For recursive rules, the output can identify the loop:
```text
recursive stratum:
ready
base:
roots -> ready
step:
ready join pred on operation id
project successor operation id
```
This prototype would validate the compiler shape before depending on a backend API.
---
## Second Prototype
The second prototype should lower a narrow subset to DBSP.
Supported subset:
- relation declarations
- positive atoms
- equality joins
- constants
- simple comparisons
- stratified negation
- union of repeated rule heads
- one recursive IDB at a time
Excluded subset:
- aggregation
- mutual recursion
- disjunction
- existential generation
- equality saturation
- custom scalar functions
The target workloads:
- `missing_src` and `missing_dst`
- multi-value register
- transitive closure
- causal readiness
This subset is enough to test the important bridge:
```text
planned rules -> DBSP-maintained outputs
```
---
## Data Model Decisions
Several decisions should be made explicitly before implementation.
**Set or Multiset Semantics**: CRDT operation facts are usually set-like. DBSP uses Z-set weights internally. The frontend should define when `distinct` is applied.
**Operation Identity**: CRDT examples use `(replica_id, counter)`. The planner should treat this pair either as two scalar fields or as one logical key with two physical fields.
**Violation Rows**: Geomerge violations should include enough context for error messages, not just a boolean.
**Output Integration**: DBSP emits deltas. Applications often need an integrated current view. The runtime boundary should say who owns that integration.
**Rollback**: Geomerge validation needs preview or rollback behavior. If using weighted deltas, inverse deltas are plausible but must stay transactionally coupled to storage.
---
## Evaluation Plan
The evaluation should separate correctness from performance.
Correctness checks:
```text
planned evaluation == naive snapshot evaluation
DBSP maintained result == snapshot result
failed Geomerge transaction leaves no checker drift
```
Performance checks:
```text
hydration time
warm-update time
memory used by maintained state
number of output delta rows
history-depth sensitivity
join-order sensitivity
```
The most important performance test is causal readiness:
```text
large causal history
+ small new update
-> does update cost grow with history depth?
```
If the answer is yes, the frontend needs frontier-aware planning or a different physical representation.
---
## Decision Points
The main decision points are:
- whether to implement a Datalog frontend or compile directly from Geolog laws
- whether the relational IR should be FlowLog-like, DBSP-like, or custom
- whether recursive planning should support mutual recursion early
- whether SIP should be automatic, directive-controlled, or both
- whether hydration should use the same backend as warm updates
- whether to persist backend operator state
- whether to compare against Differential Dataflow for recursive workloads
These decisions should stay separate. Choosing DBSP as the backend does not force a particular Datalog syntax. Choosing a FlowLog-like planner does not force Differential Dataflow as the backend.
---
## Practical Recommendation
The first practical step is a planning-only FlowLog-inspired compiler layer.
The next step is lowering a small subset to DBSP.
After that, FlowLog itself can serve as a comparison backend for the same small programs.
The goal should be:
```text
one rule frontend
one relational IR
two possible execution backends
```
That architecture would make it possible to test whether performance problems come from the query semantics, the planner, or the backend.
---
## Changelog
* **May 20, 2026** -- First version created from FlowLog, DBSP, CRDT, and Geomerge notes.