diff --git a/notes/backend/01-cozo-and-lmdb-findings.md b/notes/backend/01-cozo-and-lmdb-findings.md
new file mode 100644
index 0000000..9f5ad62
--- /dev/null
+++ b/notes/backend/01-cozo-and-lmdb-findings.md
@@ -0,0 +1,135 @@
+## Cozo and LMDB Findings
+
+Sources inspected: the Cozo source tree at `github.com/cozodb/cozo`, the LMDB source tree at `github.com/LMDB/lmdb`, and the `heed` Rust binding at `github.com/meilisearch/heed`.
+File paths in this note are relative to the root of the named project's source tree.
+The aim was to understand how a working Datalog engine (Cozo) implements joins and what a low-level key-value substrate (LMDB) provides that makes those joins cheap.
+This note summarizes the design lessons and the practical implications for the `query-ops` crate in this playground.
+
+### Summary
+
+Cozo is an embedded Datalog database written in Rust.
+It does not have a separate semijoin operator.
+Instead, it has one inner-join operator that picks between two strategies based on how each relation is stored: an index-nested-loop strategy that uses ordered range scans over the substrate, and a fallback that materializes one side into a sorted vector and probes it.
+Semijoin behavior, when needed, emerges from a separate rewrite step called the magic-sets transformation, which converts semijoin-shaped pruning into regular inner joins against derived relations.
+
+LMDB is a memory-mapped, ordered key-value store with a B+ tree on disk.
+It exposes a small set of cursor primitives that support prefix iteration, range iteration, and exact-key lookup.
+These primitives are exactly what an index-nested-loop join needs: seek to a key prefix, then iterate forward while the prefix matches.
+
+The combined lesson is that a good join does not require a clever operator.
+It requires the relation to be stored with the join columns at the front of the key, so that the substrate's ordered iteration can do the join itself.
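
The seek-then-iterate pattern from the summary can be sketched without any database at all.
This is a minimal illustration, not Cozo's or LMDB's code: it assumes a `BTreeMap` as a stand-in for the ordered substrate, and the `Relation` alias, the `prefix_join` function, and the `(u32, u32)` key layout are hypothetical names chosen for the example.

```rust
use std::collections::BTreeMap;

/// Stand-in for an ordered key-value substrate (an assumption for this sketch).
/// The key is (join column, rest of key), so the join column is the key prefix.
type Relation = BTreeMap<(u32, u32), String>;

/// Index-nested-loop join: for each left tuple, seek to the first right key
/// whose join column is >= the left value, then walk forward while it matches.
fn prefix_join(left: &[(u32, &str)], right: &Relation) -> Vec<(u32, String, String)> {
    let mut out = Vec::new();
    for &(k, lval) in left {
        // `range((k, MIN)..)` plays the role of the seek primitive.
        for (&(rk, _), rval) in right.range((k, u32::MIN)..) {
            if rk != k {
                break; // the prefix no longer matches, stop scanning
            }
            out.push((k, lval.to_string(), rval.clone()));
        }
    }
    out
}

fn main() {
    let mut right = Relation::new();
    right.insert((1, 10), "a".to_string());
    right.insert((1, 11), "b".to_string());
    right.insert((2, 20), "c".to_string());

    // Key 1 matches two right tuples; key 3 matches none.
    let left = [(1, "x"), (3, "y")];
    let joined = prefix_join(&left, &right);
    assert_eq!(joined.len(), 2);
    println!("{:?}", joined);
}
```

No hash table appears anywhere in the sketch: the only per-probe work is one ordered seek plus a forward scan, which is the property the note attributes to storing join columns at the front of the key.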
+
+### Cozo
+
+#### What It Is
+
+Cozo is a Datalog database with multiple swappable storage backends, including an in-memory store, SQLite, RocksDB, sled, and TiKV.
+The execution engine speaks a single narrow storage trait whose surface is essentially `get`, `put`, `range_iter`, and `prefix_iter` over byte keys.
+Each backend implements that trait.
+The trait definition lives at `cozo-core/src/storage/mod.rs` in the Cozo source tree.
+
+#### Join Behavior
+
+The relational algebra at `cozo-core/src/query/ra.rs` in the Cozo source tree defines a single join operator named `InnerJoin`.
+At execution time it chooses between two strategies based on a check called `join_is_prefix`:
+
+- prefix join: for each tuple from the left side, the engine builds a byte prefix from the join columns and calls `prefix_iter` on the right relation.
+  The substrate yields all matching tuples in key order.
+  No hash table is built.
+  This path is taken whenever the right side's join columns are stored as the prefix of its key.
+- materialized join: used when the join columns are not a key prefix.
+  The right side is read fully into a sorted, deduplicated vector, reordered so the join columns come first, then walked with a `starts_with(prefix)` check.
+  This is the build-and-probe family, but with a sorted vector instead of a hash map.
+
+The choice is made entirely on whether the join columns sit at the front of the stored key.
+
+#### No Semijoin Operator
+
+A search of the Cozo source for `semijoin` or `semi_join` returns nothing.
+Semijoin behavior comes from the magic-sets transformation at `cozo-core/src/query/magic.rs` in the Cozo source tree.
+This pass rewrites each rule so that body atoms get joined against an auxiliary "magic" relation whose contents encode the binding patterns supplied by the rule's callers.
+The net effect is the same as semijoining body atoms against caller-supplied filters, but the implementation is a logical rewrite, not a runtime operator.
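
The claim that a semijoin can be replaced by an ordinary inner join against a derived relation can be checked on a toy example.
The sketch below is not Cozo's magic-sets pass; it only assumes that the derived relation is a deduplicated set of bound values, and every name in it (`Edge`, `semijoin`, `join_against_derived`, `bound`) is hypothetical.

```rust
use std::collections::BTreeSet;

/// A stored binary relation, kept in key order (source column first).
type Edge = (u32, u32);

/// Direct semijoin: keep the edges whose source appears in `bound`.
fn semijoin(edges: &BTreeSet<Edge>, bound: &BTreeSet<u32>) -> Vec<Edge> {
    edges
        .iter()
        .copied()
        .filter(|(src, _)| bound.contains(src))
        .collect()
}

/// The same result as an inner join: drive from the derived relation
/// (hypothetical stand-in for a "magic" relation) and prefix-scan the
/// stored relation once per derived tuple.
fn join_against_derived(edges: &BTreeSet<Edge>, bound: &BTreeSet<u32>) -> Vec<Edge> {
    let mut out = Vec::new();
    for &src in bound {
        for &(s, dst) in edges.range((src, u32::MIN)..) {
            if s != src {
                break; // left the prefix for this bound value
            }
            out.push((s, dst));
        }
    }
    out
}

fn main() {
    let edges: BTreeSet<Edge> = vec![(1, 2), (1, 3), (2, 4), (5, 6)].into_iter().collect();
    let bound: BTreeSet<u32> = vec![1, 5].into_iter().collect();

    // Both formulations select the same tuples, in the same order.
    assert_eq!(semijoin(&edges, &bound), join_against_derived(&edges, &bound));
    println!("{:?}", join_against_derived(&edges, &bound));
}
```

The equivalence depends on the derived relation being a set: because each bound value appears once, the inner join emits each matching edge exactly once, which is what a semijoin would produce.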
+
+#### No Auto-Maintained Secondary Indexes
+
+Cozo does not maintain secondary indexes automatically.
+If you want to query a relation by a column order different from how it was declared, you declare a second relation with the columns reordered and keep its contents synchronized at insert time.
+A covering index is just another stored relation.
+The decision of which column order to store comes from how you expect to query the data, not from the engine.
+
+### LMDB
+
+#### What It Is
+
+LMDB is a single-file, memory-mapped, ordered key-value store.
+It uses a B+ tree on disk and exposes reads as zero-copy byte slices that point directly into the mmap.
+It supports a single writer at a time and many concurrent readers, and it uses shadow paging for MVCC, which means commits are atomic without a write-ahead log.
+
+#### Cursor Primitives
+
+A cursor in LMDB is a position inside the B+ tree.
+The full set of cursor operations is defined by the `MDB_cursor_op` enum in `libraries/liblmdb/lmdb.h` in the LMDB source tree.
+The operations relevant to join work are:
+
+- `MDB_SET_RANGE`: position at the first key greater than or equal to a given key.
+  This is the seek primitive that makes prefix scans possible.
+- `MDB_NEXT`: advance one step forward in key order.
+  Combined with `MDB_SET_RANGE` and a per-step prefix check, this gives you ordered range iteration.
+- `MDB_SET` and `MDB_SET_KEY`: exact-key positioning, used for point lookups.
+- `MDB_FIRST` and `MDB_LAST`: positional endpoints.
+
+For databases opened with the `MDB_DUPSORT` flag, one key can carry multiple sorted values, and additional operations apply: `MDB_GET_BOTH`, `MDB_NEXT_DUP`, `MDB_FIRST_DUP`.
+This is useful when a relation is encoded as "key = join columns, duplicate values = remaining columns": the sorted duplicates under one key are exactly the tuples that match that join key, so a probe is one seek followed by `MDB_NEXT_DUP` steps.
+
+#### Rust Binding
+
+`heed` is the idiomatic Rust binding for LMDB.
+It wraps the cursor operations as `RoCursor` and `RwCursor` and returns key and value byte slices tied to the transaction lifetime, so reads remain zero-copy.
+Meilisearch uses `heed` in production, so the binding is well exercised.
+
+### LMDB Versus RocksDB
+
+Both LMDB and RocksDB are ordered key-value stores with prefix and range scans, but their internal designs lead to different operational profiles.
+
+LMDB highlights:
+
+- B+ tree on disk, memory mapped
+- Single writer at a time, many concurrent readers
+- Zero-copy reads from the mmap
+- Copy-on-write on-disk format; freed pages are tracked and reused by later write transactions
+- File size grows up to a configured `mapsize`
+- No background compaction
+- The file does not shrink on its own; compact it manually with `mdb_copy -c`
+
+RocksDB highlights:
+
+- Log-structured merge tree
+- Multiple concurrent writers
+- Background compaction
+- Higher write throughput at the cost of write amplification
+- Reads may traverse multiple levels with bloom-filter checks
+- Engine manages its own disk layout
+
+For a read-heavy prototype with batch inserts, LMDB is the closer fit: predictable read costs, cheap range scans, and zero-copy probes.
+RocksDB earns its overhead when sustained write throughput is the bottleneck.
+
+### Practical Implications
+
+The current `query-ops` crate works on in-memory `Vec` values and will implement semijoin and natural join with a transient hash on one side.
+The Cozo design suggests a clear upgrade path once a real substrate is added.
+
+Short term: keep the in-memory operator and build a transient hash on the smaller side.
+This is correct, easy to test, and easy to reason about.
+
+Medium term: when relations move into a substrate like LMDB, encode each relation so that the join columns sit at the prefix of the key, or use a `DUPSORT` database where the duplicate values carry the remaining columns.
+At that point the join operator becomes a cursor pattern (`MDB_SET_RANGE` followed by `MDB_NEXT` while the prefix matches), and the separate hash-building step disappears.
+
+Index discipline: if a relation needs to be joined two different ways, store it twice with different prefix orders.
+There is no clever-indexing shortcut in either Cozo or LMDB, and trying to invent one is unlikely to be worth the cost.
+
+The takeaway is that the operator surface in `query-ops` is fine for an in-memory prototype, but the substrate decision is the load-bearing one for performance.
+We do not need to design around it now, but the natural successor to the current operators is a key-encoding discipline rather than a more elaborate operator implementation.
+
+### Changelog
+
+- **June 2, 2026** -- The first version of this document was made.