Add mermaid diagrams to the note files

Hassan Abedi 2026-05-20 15:54:55 +02:00
parent 99c30190a8
commit 2bfcb7e818
3 changed files with 293 additions and 20 deletions


@ -107,7 +107,8 @@ mvrStore(Key, Value) :-
not overwritten(RepId, Ctr).
```
This is a favorable DBSP workload. The backend can maintain the antijoin state and process small updates without rescanning the full operation history.
This is a favorable DBSP workload. The backend can maintain the antijoin state and process small updates without rescanning the full operation
history.
The causal-readiness query is harder:
@ -122,9 +123,13 @@ isCausallyReady(RepId, Ctr) :-
This is recursive graph traversal. The DBSP CRDT notes report that this query can remain dependent on causal-history depth, even during warm updates.
FlowLog's planning ideas are relevant here. Sideways information passing suggests prefiltering recursive traversal through known relevant bindings. For CRDTs, that could mean using current heads, leaves, or newly arrived operations as a frontier instead of repeatedly deriving readiness from roots through the whole causal graph.
FlowLog's planning ideas are relevant here. Sideways information passing suggests prefiltering recursive traversal through known relevant bindings.
For CRDTs, that could mean using current heads, leaves, or newly arrived operations as a frontier instead of repeatedly deriving readiness from roots
through the whole causal graph.
The list CRDT query is also planning-sensitive. Relations such as `firstChild`, `nextSibling`, `nextSiblingAnc`, `nextElem`, and `nextVisible` create several joins, antijoins, and recursive steps. FlowLog-style rule catalogs and join planning would help choose better intermediate shapes before DBSP builds maintained operator state.
The list CRDT query is also planning-sensitive. Relations such as `firstChild`, `nextSibling`, `nextSiblingAnc`, `nextElem`, and `nextVisible` create
several joins, antijoins, and recursive steps. FlowLog-style rule catalogs and join planning would help choose better intermediate shapes before DBSP
builds maintained operator state.
---
@ -195,7 +200,8 @@ For Geomerge, the source may be compiled Geolog laws.
For execution, the backend may be DBSP, Differential Dataflow, or a non-incremental batch engine.
This matches the existing DBSP notes: DBSP should not own full source-language semantics. It should receive a relational plan that has already been checked, stratified, and optimized.
This matches the existing DBSP notes: DBSP should not own full source-language semantics. It should receive a relational plan that has already been
checked, stratified, and optimized.
---
@ -203,7 +209,8 @@ This matches the existing DBSP notes: DBSP should not own full source-language s
FlowLog suggests several optimizations that transfer well to DBSP-backed work.
**Structural Planning**: Choose join trees from variable overlap and intermediate width, especially for recursive rules where cardinality estimates are weak.
**Structural Planning**: Choose join trees from variable overlap and intermediate width, especially for recursive rules where cardinality estimates
are weak.
**Sideways Information Passing**: Add semijoin-style filters so later joins and recursive steps see fewer irrelevant tuples.
@ -211,7 +218,8 @@ FlowLog suggests several optimizations that transfer well to DBSP-backed work.
**Subplan Sharing**: Reuse common derived relations across laws or CRDT views when multiple outputs need the same intermediate facts.
**Physical Key Choice**: Pick key fields deliberately before lowering to the backend. DBSP joins also need maintained state, so bad key and arrangement choices will become runtime costs.
**Physical Key Choice**: Pick key fields deliberately before lowering to the backend. DBSP joins also need maintained state, so bad key and
arrangement choices will become runtime costs.
---
@ -230,7 +238,8 @@ The shared lesson is more important than the difference:
incremental backends maintain operator state
```
That means bad plans become persistent state, not just one bad query execution. A poor join order or unnecessary intermediate relation can increase memory use and update cost for the lifetime of the maintained query.
That means bad plans become persistent state, not just one bad query execution. A poor join order or unnecessary intermediate relation can increase
memory use and update cost for the lifetime of the maintained query.
FlowLog is useful because it treats planning as a first-class layer before execution.


@ -18,7 +18,17 @@ rule atoms and variables
-> physical operator choices
```
That layer is useful because it records how variables move through projections, filters, joins, antijoins, and recursive strata before the backend starts maintaining state.
```mermaid
flowchart LR
Rule["Datalog Rule"] --> Catalog["Rule Catalog"]
Catalog --> Signatures["Collection Signatures"]
Signatures --> Flows["Transformation Flows"]
Flows --> Operators["Physical Operators"]
Operators --> Backend["Incremental Backend State"]
```
That layer is useful because it records how variables move through projections, filters, joins, antijoins, and recursive strata before the backend
starts maintaining state.
---
@ -90,6 +100,18 @@ Arc(x, y)
-> key-value view: key=(y), value=(x)
```
```mermaid
flowchart TB
Arc["Arc(x, y)"]
Arc --> Row["Row View<br/>(x, y)"]
Arc --> KeyX["Key View<br/>key = x"]
Arc --> KvX["Key-Value View<br/>key = x<br/>value = y"]
Arc --> KvY["Key-Value View<br/>key = y<br/>value = x"]
KeyX --> Semi["Semijoin or Antijoin"]
KvX --> Join1["Join on x"]
KvY --> Join2["Join on y"]
```
This is a central planning choice. The key determines which arrangement or maintained index the backend can use.
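
The views above can be sketched in a few lines of Python. This is only an illustration of the idea, not FlowLog's or DBSP's arrangement API. The `arrange` helper and the sample `arc` rows are hypothetical names introduced here.

```python
from collections import defaultdict

def arrange(rows, key_cols, val_cols):
    """Index rows by the chosen key columns, keeping val_cols as payload.

    Hypothetical helper: stands in for whatever arrangement or
    maintained index the real backend would build.
    """
    index = defaultdict(list)
    for row in rows:
        key = tuple(row[i] for i in key_cols)
        val = tuple(row[i] for i in val_cols)
        index[key].append(val)
    return dict(index)

# Sample Arc(x, y) facts (made up for illustration).
arc = [(1, 2), (1, 3), (2, 3)]

# Key-value view: key=(x), value=(y). Useful for a join on x.
by_source = arrange(arc, key_cols=[0], val_cols=[1])

# Key-value view: key=(y), value=(x). Useful for a join on y.
by_target = arrange(arc, key_cols=[1], val_cols=[0])
```

The same facts produce two different indexes. A join on `x` can probe `by_source` directly, while a join on `y` needs `by_target`. Choosing the wrong one means re-keying the collection later.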
---
@ -98,6 +120,24 @@ This is a central planning choice. The key determines which arrangement or maint
FlowLog's transformations separate unary reshaping from binary combination.
```mermaid
flowchart TB
Input["Input Collection"]
Input --> Unary["Unary Transformation"]
Unary --> UnaryOut["Projected, Filtered, or Arranged Collection"]
Left["Left Collection"] --> Binary["Binary Transformation"]
Right["Right Collection"] --> Binary
Binary --> BinaryOut["Joined or Antijoined Collection"]
Unary --> RowToRow["Row to Row"]
Unary --> RowToKey["Row to Key"]
Unary --> RowToKv["Row to Key-Value"]
Binary --> Join["Join"]
Binary --> Anti["Antijoin"]
Binary --> Product["Cartesian Product"]
```
Unary transformations include:
- row to row
@ -178,6 +218,12 @@ The join graph is a chain:
A --x-- B --y-- C
```
```mermaid
flowchart LR
A["A(a, x)"] -- "x" --> B["B(x, y)"]
B -- "y" --> C["C(y, c)"]
```
A rule like this is sensitive to join order:
```text
@ -190,6 +236,24 @@ R(a, d) :-
Joining `A` with `D` first is a cross product. Joining adjacent atoms first preserves bindings and reduces intermediate results.
```mermaid
flowchart TB
subgraph Good["Connected Join Order"]
A1["A(a, x)"] --> AB["Join on x"]
B1["B(x, y)"] --> AB
AB --> ABC["Join on y"]
C1["C(y, z)"] --> ABC
ABC --> ABCD["Join on z"]
D1["D(z, d)"] --> ABCD
end
subgraph Bad["Disconnected Join Order"]
A2["A(a, x)"] --> AD["Cross Product"]
D2["D(z, d)"] --> AD
AD --> Later["Later Filters and Joins"]
end
```
FlowLog's structural planning uses variable overlap to choose a plan tree that keeps joins connected and intermediate width smaller.
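
A minimal sketch of a variable-overlap heuristic, assuming a simple greedy strategy. It is not FlowLog's actual planning algorithm. The function names `connected_join_order` and `has_cross_product` are hypothetical, and the atoms follow the `A(a, x)`, `B(x, y)`, `C(y, z)`, `D(z, d)` chain from the diagram.

```python
def connected_join_order(atoms):
    """Greedy order: always pick the atom sharing the most bound variables.

    atoms maps an atom name to its set of variables. This is an
    illustrative heuristic, not a guaranteed-optimal planner.
    """
    names = list(atoms)
    order = [names[0]]
    bound = set(atoms[names[0]])
    remaining = names[1:]
    while remaining:
        # Prefer the atom with the largest overlap with already bound variables.
        best = max(remaining, key=lambda n: len(atoms[n] & bound))
        order.append(best)
        bound |= atoms[best]
        remaining.remove(best)
    return order

def has_cross_product(order, atoms):
    """Return True if some step joins an atom sharing no bound variable."""
    bound = set(atoms[order[0]])
    for name in order[1:]:
        if not (atoms[name] & bound):
            return True
        bound |= atoms[name]
    return False

# Atoms listed out of chain order on purpose.
atoms = {
    "A": {"a", "x"},
    "D": {"z", "d"},
    "B": {"x", "y"},
    "C": {"y", "z"},
}

order = connected_join_order(atoms)
```

With these atoms the greedy order follows the chain, while the textual order `A, D, B, C` would begin with a cross product.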
---
@ -215,6 +279,17 @@ FlowLog instead uses structural signals:
- whether a candidate plan creates disconnected joins
- how deep the plan tree becomes
```mermaid
flowchart LR
RuleBody["Rule Body"] --> Overlap["Variable Overlap"]
RuleBody --> Width["Intermediate Width"]
RuleBody --> Connectivity["Join Connectivity"]
Overlap --> PlanTree["Candidate Plan Tree"]
Width --> PlanTree
Connectivity --> PlanTree
PlanTree --> Choice["Robust Plan Choice"]
```
This is not guaranteed to be optimal. It is meant to avoid obviously bad plans.
That is a good fit for DBSP-backed work too, because a bad plan becomes maintained operator state.
@ -247,7 +322,19 @@ bad(x, z) :-
not D(x, z).
```
The antijoin against `D(x, z)` can run after `A` and `B`; it does not need to wait for `C`. Running it earlier may reduce the input to the later join with `C`.
The antijoin against `D(x, z)` can run after `A` and `B`; it does not need to wait for `C`. Running it earlier may reduce the input to the later join
with `C`.
```mermaid
flowchart LR
A["A(x, y)"] --> AB["Join on y"]
B["B(y, z)"] --> AB
AB --> AntiD["Antijoin D(x, z)"]
D["D(x, z)"] --> AntiD
AntiD --> JoinC["Join C(z, w)"]
C["C(z, w)"] --> JoinC
JoinC --> Out["bad(x, z)"]
```
This is the same issue as antijoin pushdown in the DBSP CRDT note.
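
The effect of scheduling the antijoin early can be checked with a small set-based sketch. The relations and function names below are hypothetical, and the code is a snapshot evaluation rather than an incremental one. It only shows that both plans agree while the early plan feeds fewer rows into the join with `C`.

```python
def join_ab(a, b):
    """Join A(x, y) with B(y, z) on y, producing (x, y, z) rows."""
    return {(x, y, z) for (x, y) in a for (y2, z) in b if y == y2}

def plan_antijoin_late(a, b, c, d):
    """Join A, B, C first, then remove rows matched by D(x, z)."""
    ab = join_ab(a, b)
    abc = {(x, z) for (x, _y, z) in ab for (z2, _w) in c if z == z2}
    result = {(x, z) for (x, z) in abc if (x, z) not in d}
    return result, len(ab)  # second value: rows fed into the C join

def plan_antijoin_early(a, b, c, d):
    """Remove rows matched by D(x, z) right after A join B, then join C."""
    ab = join_ab(a, b)
    kept = {(x, y, z) for (x, y, z) in ab if (x, z) not in d}
    result = {(x, z) for (x, _y, z) in kept for (z2, _w) in c if z == z2}
    return result, len(kept)  # second value: rows fed into the C join

# Made-up sample facts.
A = {(1, 10), (2, 10)}
B = {(10, 100)}
C = {(100, 7)}
D = {(1, 100)}

late_result, late_input = plan_antijoin_late(A, B, C, D)
early_result, early_input = plan_antijoin_early(A, B, C, D)
```

Both plans derive the same `bad(x, z)` facts. In an incremental backend the smaller intermediate would also mean less maintained state for the join with `C`.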
@ -265,13 +352,23 @@ derive useful keys
-> join less data
```
```mermaid
flowchart LR
DeltaReach["Delta Reach(x)"] --> Keys["Useful x Keys"]
Keys --> SemiArc["Semijoin Arc on x"]
Arc["Arc(x, y)"] --> SemiArc
SemiArc --> Join["Join with Delta Reach"]
Join --> NewReach["New Reach(y)"]
```
Example:
```text
Reach(y) :- Reach(x), Arc(x, y).
```
If the current delta contains only a small set of `Reach(x)` values, then `Arc` only needs edges whose source is in that set. A semijoin can prefilter `Arc` before the recursive join.
If the current delta contains only a small set of `Reach(x)` values, then `Arc` only needs edges whose source is in that set. A semijoin can prefilter
`Arc` before the recursive join.
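
A minimal sketch of that idea, assuming a semi-naive loop over an in-memory edge index. The `reach` function and its `roots` parameter are hypothetical. The rule in the note does not show a base case, so the starting set is an assumption here.

```python
from collections import defaultdict

def reach(roots, arcs):
    """Semi-naive evaluation of Reach(y) :- Reach(x), Arc(x, y).

    Only arcs whose source is in the current delta are visited,
    which plays the role of the semijoin prefilter on Arc.
    """
    by_source = defaultdict(set)
    for x, y in arcs:
        by_source[x].add(y)

    reached = set(roots)
    delta = set(roots)
    while delta:
        # Semijoin step: probe Arc only with keys from the delta.
        candidates = {y for x in delta for y in by_source.get(x, ())}
        # Keep only facts that are new in this iteration.
        delta = candidates - reached
        reached |= delta
    return reached

# Made-up sample graph: node 4 is not reachable from 1.
arcs = [(1, 2), (2, 3), (4, 5)]
result = reach({1}, arcs)
```

Each iteration touches only edges leaving the newest facts, so the edge `(4, 5)` is never visited.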
For CRDT causal readiness, this suggests a physical plan centered on frontier operations:
@ -281,6 +378,15 @@ new ready operations
-> newly ready operations
```
```mermaid
flowchart LR
Frontier["Ready Frontier"] --> CandidatePred["Pred Edges from Frontier"]
Pred["pred(from, to)"] --> CandidatePred
CandidatePred --> Check["Predecessor Checks"]
Check --> NewReady["New Ready Operations"]
NewReady --> Frontier
```
rather than a plan that repeatedly starts from roots.
---
@ -291,6 +397,20 @@ Recursive rules require fixed-point execution.
FlowLog groups recursive rules into recursive strata, then executes them inside an iterative dataflow scope.
```mermaid
flowchart TB
Earlier["Earlier Strata Outputs"] --> Enter["Enter Recursive Scope"]
EDB["Input Relations"] --> Enter
Enter --> Base["Base Rules"]
Base --> LoopVars["IDB Loop Variables"]
LoopVars --> Step["Recursive Step Rules"]
Step --> Delta["New Derived Facts"]
Delta --> LoopVars
LoopVars --> Done{"Fixed Point?"}
Done -- "no" --> Step
Done -- "yes" --> Collect["Collect Recursive Outputs"]
```
The important design point is that a recursive stratum can contain several rules deriving related IDBs. The planner must know:
- which IDBs are loop variables
@ -298,7 +418,8 @@ The important design point is that a recursive stratum can contain several rules
- which outputs must be collected after convergence
- which intermediate arrangements are useful across iterations
For DBSP, this maps to recursive circuits with feedback and delay. The frontend still needs the same rule-level information before it can produce a good circuit.
For DBSP, this maps to recursive circuits with feedback and delay. The frontend still needs the same rule-level information before it can produce a
good circuit.
---
@ -323,6 +444,15 @@ common_antecedent(x, y)
-> violation_b(y)
```
```mermaid
flowchart LR
A["A(x, y)"] --> Common["common_antecedent(x, y)"]
B["B(y)"] --> Common
Common --> Va["violation_a(x)"]
Common --> Vb["violation_b(y)"]
Extra["Extra Check"] --> Vb
```
FlowLog's explicit rule plans and collection signatures are a useful place to represent this sharing.
---
@ -349,6 +479,16 @@ the output of the first join may need to be arranged by `z`, not by `x`.
That means the planner should choose output keys based on the next operation, not only the current operation.
```mermaid
flowchart LR
A["A(x, y)<br/>key = y"] --> JoinAB["Join on y"]
B["B(y, z)<br/>key = y"] --> JoinAB
JoinAB --> R["R(x, z)<br/>next key = z"]
R --> JoinRC["Join on z"]
C["C(z, w)<br/>key = z"] --> JoinRC
JoinRC --> S["S(x, w)"]
```
This is one reason a simple relational algebra tree is not enough. The physical plan needs key and payload annotations.
---
@ -379,6 +519,17 @@ distinct -> DBSP distinct
recursion -> DBSP fixed-point circuit
```
```mermaid
flowchart LR
Source["Datalog or Geolog Rules"] --> Frontend["Frontend Parser or Compiler"]
Frontend --> Catalog["Rule Catalogs"]
Catalog --> Planner["FlowLog-Style Planner"]
Planner --> IR["Relational IR with Keys"]
IR --> Lowering["DBSP Lowering"]
Lowering --> Circuit["DBSP Circuit"]
Circuit --> Deltas["Maintained Output Deltas"]
```
The key point is that DBSP should receive an already planned circuit, not raw Datalog text.
---


@ -16,7 +16,16 @@ Level 2: borrow FlowLog planning ideas for a DBSP frontend
Level 3: compare DBSP and Differential Dataflow backends on the same Datalog programs
```
The practical near-term path is Level 2. Use FlowLog's catalog, join planning, antijoin scheduling, and SIP ideas to design a better compiler layer before DBSP.
```mermaid
flowchart TB
L1["Level 1<br/>Run FlowLog Examples"] --> L2["Level 2<br/>Borrow Planning Ideas"]
L2 --> L3["Level 3<br/>Backend Comparison"]
L2 --> DBSP["DBSP Frontend Work"]
L3 --> Decision["Backend and Planner Decisions"]
```
The practical near-term path is Level 2. Use FlowLog's catalog, join planning, antijoin scheduling, and SIP ideas to design a better compiler layer
before DBSP.
---
@ -31,7 +40,8 @@ That would conflate two separate questions:
The DBSP notes are already about DBSP as a formal view-maintenance backend. FlowLog is more useful as a guide for the missing frontend and optimizer.
The first step should not be adopting FlowLog's syntax as the durable source language either. Geomerge and Geolog already have their own source concepts. Datalog should be an intermediate or testing language unless the user-facing language decision is explicit.
The first step should not be adopting FlowLog's syntax as the durable source language either. Geomerge and Geolog already have their own source
concepts. Datalog should be an intermediate or testing language unless the user-facing language decision is explicit.
---
@ -70,7 +80,28 @@ insert tree
-> next visible element
```
These queries contain several recursive or join-heavy rules. FlowLog-style planning can help by choosing join keys, pushing antijoins earlier, and adding semijoin filters around the current frontier.
These queries contain several recursive or join-heavy rules. FlowLog-style planning can help by choosing join keys, pushing antijoins earlier, and
adding semijoin filters around the current frontier.
```mermaid
flowchart TB
subgraph Causal["Causal Readiness"]
Pred["pred Graph"] --> Roots["Roots"]
Roots --> Ready["Ready Operations"]
Ready --> Frontier["Frontier"]
Frontier --> NewPred["Outgoing Pred Edges"]
NewPred --> Ready
end
subgraph List["List Traversal"]
Insert["insert Tree"] --> First["firstChild"]
Insert --> Sibling["nextSibling"]
First --> Next["nextElem"]
Sibling --> Next
Remove["remove Tombstones"] --> Visible["nextVisible"]
Next --> Visible
end
```
The concrete experiment:
@ -116,6 +147,16 @@ FlowLog-style catalogs would help the compiler answer:
- which projected values are needed for the violation row
- whether two laws share a common antecedent
```mermaid
flowchart LR
Law["Geomerge Law"] --> Rule["Datalog-Like Rule"]
Rule --> Catalog["Rule Catalog"]
Catalog --> JoinGraph["Join Graph"]
JoinGraph --> Plan["Planned Relational Tree"]
Plan --> Violation["Violation Relation"]
Violation --> DBSP["DBSP Maintained Output"]
```
The concrete experiment:
```text
@ -166,7 +207,20 @@ The comparison should measure:
- output delta size
- ease of rollback or preview execution
This helps decide whether DBSP needs FlowLog-like planning, whether Differential Dataflow is better for some recursive workloads, or whether a hybrid batch-plus-incremental strategy is needed.
```mermaid
flowchart TB
Program["Same Datalog Program"] --> IR["Shared Relational IR"]
Facts["Same Input Facts"] --> IR
IR --> DBSP["DBSP Lowering"]
IR --> DD["Differential Dataflow Lowering"]
DBSP --> DbspMetrics["Hydration<br/>Warm Updates<br/>Memory<br/>Deltas"]
DD --> DdMetrics["Hydration<br/>Warm Updates<br/>Memory<br/>Deltas"]
DbspMetrics --> Compare["Backend Comparison"]
DdMetrics --> Compare
```
This helps decide whether DBSP needs FlowLog-like planning, whether Differential Dataflow is better for some recursive workloads, or whether a hybrid
batch-plus-incremental strategy is needed.
---
@ -215,6 +269,18 @@ Datalog-like rule text
-> planned relational tree
```
```mermaid
flowchart LR
Text["Rule Text"] --> Parse["Parsed Rules"]
Parse --> Deps["Dependency Graph"]
Deps --> Strata["Strata"]
Parse --> Catalog["Rule Catalog"]
Catalog --> JoinGraph["Join Graph"]
Strata --> Plan["Planned Tree"]
JoinGraph --> Plan
Plan --> Explain["Textual Plan Explanation"]
```
It does not need to run DBSP at first.
The output can be textual:
@ -253,6 +319,17 @@ This prototype would validate the compiler shape before depending on a backend A
The second prototype should lower a narrow subset to DBSP.
```mermaid
flowchart TB
Subset["Supported Rule Subset"] --> Planner["Planner"]
Planner --> IR["Relational IR"]
IR --> Lowering["DBSP Lowering"]
Lowering --> Runtime["DBSP Runtime"]
Runtime --> Output["Maintained Outputs"]
Snapshot["Naive Snapshot Evaluator"] --> Oracle["Correctness Oracle"]
Output --> Oracle
```
Supported subset:
- relation declarations
@ -292,15 +369,29 @@ planned rules -> DBSP-maintained outputs
Several decisions should be made explicitly before implementation.
**Set or Multiset Semantics**: CRDT operation facts are usually set-like. DBSP uses Z-set weights internally. The frontend should define when `distinct` is applied.
```mermaid
flowchart TB
Decisions["Data Model Decisions"]
Decisions --> Semantics["Set or Multiset Semantics"]
Decisions --> Identity["Operation Identity"]
Decisions --> Violations["Violation Row Shape"]
Decisions --> Integration["Output Integration"]
Decisions --> Rollback["Rollback or Preview"]
```
**Operation Identity**: CRDT examples use `(replica_id, counter)`. The planner should treat this pair either as two scalar fields or as one logical key with two physical fields.
**Set or Multiset Semantics**: CRDT operation facts are usually set-like. DBSP uses Z-set weights internally. The frontend should define when
`distinct` is applied.
**Operation Identity**: CRDT examples use `(replica_id, counter)`. The planner should treat this pair either as two scalar fields or as one logical
key with two physical fields.
**Violation Rows**: Geomerge violations should include enough context for error messages, not just a boolean.
**Output Integration**: DBSP emits deltas. Applications often need an integrated current view. The runtime boundary should say who owns that integration.
**Output Integration**: DBSP emits deltas. Applications often need an integrated current view. The runtime boundary should say who owns that
integration.
**Rollback**: Geomerge validation needs preview or rollback behavior. If using weighted deltas, inverse deltas are plausible but must stay transactionally coupled to storage.
**Rollback**: Geomerge validation needs preview or rollback behavior. If using weighted deltas, inverse deltas are plausible but must stay
transactionally coupled to storage.
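
The inverse-delta idea can be sketched with plain dictionaries of weights. This is an assumption-laden illustration, not DBSP's Z-set API. The helpers `apply_delta` and `invert` are hypothetical, and the sketch ignores the coupling to storage that the note calls out.

```python
def apply_delta(view, delta):
    """Add weighted rows from delta into view, dropping zero weights."""
    out = dict(view)
    for row, weight in delta.items():
        new_weight = out.get(row, 0) + weight
        if new_weight == 0:
            out.pop(row, None)
        else:
            out[row] = new_weight
    return out

def invert(delta):
    """Negate every weight so the delta can be undone."""
    return {row: -weight for row, weight in delta.items()}

# Made-up current view and proposed change.
view = {("a",): 1}
delta = {("a",): -1, ("b",): 1}

preview = apply_delta(view, delta)               # state if the change is accepted
rolled_back = apply_delta(preview, invert(delta))  # undo the change
```

Applying a delta and then its inverse returns the original view. A real implementation would still need to apply both steps in the same transaction as the storage write.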
---
@ -308,6 +399,19 @@ Several decisions should be made explicitly before implementation.
The evaluation should separate correctness from performance.
```mermaid
flowchart LR
Inputs["Input Facts and Updates"] --> Naive["Naive Snapshot Evaluation"]
Inputs --> Planned["Planned Backend Evaluation"]
Naive --> Correctness["Correctness Check"]
Planned --> Correctness
Planned --> Perf["Performance Metrics"]
Perf --> Hydration["Hydration"]
Perf --> Warm["Warm Updates"]
Perf --> Memory["Memory"]
Perf --> Sensitivity["History and Join Sensitivity"]
```
Correctness checks:
```text
@ -351,7 +455,8 @@ The main decision points are:
- whether to persist backend operator state
- whether to compare against Differential Dataflow for recursive workloads
These decisions should stay separate. Choosing DBSP as the backend does not force a particular Datalog syntax. Choosing a FlowLog-like planner does not force Differential Dataflow as the backend.
These decisions should stay separate. Choosing DBSP as the backend does not force a particular Datalog syntax. Choosing a FlowLog-like planner does
not force Differential Dataflow as the backend.
---
@ -363,6 +468,14 @@ The next step is lowering a small subset to DBSP.
After that, FlowLog itself can serve as a comparison backend for the same small programs.
```mermaid
flowchart LR
P1["Planning-Only Compiler"] --> P2["DBSP Subset Lowering"]
P2 --> P3["FlowLog Backend Comparison"]
P3 --> P4["Shared IR Decision"]
P4 --> P5["Production-Oriented Prototype"]
```
The goal should be:
```text