## Cozo and LMDB Findings
Sources inspected: the Cozo source tree at `github.com/cozodb/cozo`, the LMDB source tree at `github.com/LMDB/lmdb`, and the `heed` Rust binding at `github.com/meilisearch/heed`.
File paths in this note are relative to the root of the named project's source tree.
The aim was to understand how a working Datalog engine (Cozo) implements joins and what a low-level key-value substrate (LMDB) provides that makes those joins cheap.
This note summarizes the design lessons and the practical implications for the `query-ops` crate in this playground.
### Summary
Cozo is an embedded Datalog database written in Rust.
It does not have a separate semijoin operator.
Instead, it has one inner-join operator that picks between two strategies based on how each relation is stored: an index-nested-loop strategy that uses ordered range scans over the substrate, and a fallback that materializes one side into a sorted vector and probes it.
Semijoin behavior, when needed, emerges from a separate rewrite step called the magic-sets transformation, which converts semijoin-shaped pruning into regular inner joins against derived relations.
LMDB is a memory-mapped, ordered key-value store with a B+ tree on disk.
It exposes a small set of cursor primitives that support prefix iteration, range iteration, and exact-key lookup.
These primitives are exactly what an index-nested-loop join needs: seek to a key prefix, then iterate forward while the prefix matches.
The combined lesson is that a good join does not require a clever operator.
It requires the relation to be stored with the join columns at the front of the key, so that the substrate's ordered iteration can do the join itself.
### Cozo
#### What It Is
Cozo is a Datalog database with multiple swappable storage backends, including an in-memory store, SQLite, RocksDB, sled, and TiKV.
The execution engine speaks a single narrow storage trait whose surface is essentially `get`, `put`, `range_iter`, and `prefix_iter` over byte keys.
Each backend implements that trait.
The trait definition lives at `cozo-core/src/storage/mod.rs` in the Cozo source tree.
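
As a rough illustration of how narrow such a surface can be, the sketch below defines a hypothetical `Store` trait with the four operations named above, plus an in-memory backend built on `BTreeMap`. The trait name, the `Pairs` alias, the method signatures, and `MemStore` are all assumptions made for this note; none of them are copied from Cozo's `storage/mod.rs`.

```rust
use std::collections::BTreeMap;
use std::ops::Bound;

/// Borrowed key/value pairs, yielded in key order.
type Pairs<'a> = Box<dyn Iterator<Item = (&'a [u8], &'a [u8])> + 'a>;

/// Hypothetical narrow storage trait (names follow this note, not Cozo's real API).
pub trait Store {
    fn get(&self, key: &[u8]) -> Option<&[u8]>;
    fn put(&mut self, key: &[u8], value: &[u8]);
    /// Pairs with `lower <= key < upper`.
    fn range_iter<'a>(&'a self, lower: &[u8], upper: &[u8]) -> Pairs<'a>;
    /// Pairs whose key starts with `prefix`.
    fn prefix_iter<'a>(&'a self, prefix: &'a [u8]) -> Pairs<'a>;
}

/// In-memory backend: an ordered map stands in for the real substrate.
#[derive(Default)]
pub struct MemStore {
    map: BTreeMap<Vec<u8>, Vec<u8>>,
}

impl Store for MemStore {
    fn get(&self, key: &[u8]) -> Option<&[u8]> {
        self.map.get(key).map(|v| v.as_slice())
    }

    fn put(&mut self, key: &[u8], value: &[u8]) {
        self.map.insert(key.to_vec(), value.to_vec());
    }

    fn range_iter<'a>(&'a self, lower: &[u8], upper: &[u8]) -> Pairs<'a> {
        if lower >= upper {
            // BTreeMap::range panics on inverted bounds, so return nothing instead.
            return Box::new(std::iter::empty());
        }
        let bounds = (Bound::Included(lower), Bound::Excluded(upper));
        Box::new(
            self.map
                .range::<[u8], _>(bounds)
                .map(|(k, v)| (k.as_slice(), v.as_slice())),
        )
    }

    fn prefix_iter<'a>(&'a self, prefix: &'a [u8]) -> Pairs<'a> {
        // Seek to the first key >= prefix, then stop at the first non-matching key.
        let bounds = (Bound::Included(prefix), Bound::Unbounded);
        Box::new(
            self.map
                .range::<[u8], _>(bounds)
                .take_while(move |(k, _)| k.starts_with(prefix))
                .map(|(k, v)| (k.as_slice(), v.as_slice())),
        )
    }
}
```

Any ordered backend that can seek to a key and iterate forward can implement `prefix_iter` the same way, which is why one execution engine can sit on top of several stores.
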
#### Join Behavior
The relational algebra at `cozo-core/src/query/ra.rs` in the Cozo source tree defines a single join operator named `InnerJoin`.
At execution time it chooses between two strategies based on a check called `join_is_prefix`:
- prefix join: for each tuple from the left side, the engine builds a byte prefix from the join columns and calls `prefix_iter` on the right relation.
  The substrate yields all matching tuples in key order.
  No hash table is built.
  This path is taken whenever the right side's join columns are stored as the prefix of its key.
- materialized join: used when the join columns are not a key prefix.
  The right side is read fully into a sorted, deduplicated vector, reordered so the join columns come first, then walked with a `starts_with(prefix)` check.
  This is the build-and-probe family, but with a sorted vector instead of a hash map.

The choice is made entirely on whether the join columns sit at the front of the stored key.
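
The two strategies can be sketched side by side. This is a simplified, single-column illustration and not Cozo's code: the `Tuple` alias, both function names, and the use of `BTreeSet` as the "stored" relation are assumptions, and tuples are assumed to be non-empty.

```rust
use std::collections::BTreeSet;

type Tuple = Vec<u32>;

/// Prefix join: `right` is stored in key order with its join column first.
fn prefix_join(left: &[Tuple], left_col: usize, right: &BTreeSet<Tuple>) -> Vec<Tuple> {
    let mut out = Vec::new();
    for l in left {
        let key = l[left_col];
        // Seek to the first stored tuple >= [key], then scan while the prefix matches.
        for r in right.range(vec![key]..).take_while(|r| r.starts_with(&[key])) {
            let mut row = l.clone();
            row.extend_from_slice(&r[1..]);
            out.push(row);
        }
    }
    out
}

/// Materialized join: the join column of `right` is not a key prefix, so the
/// relation is copied into a sorted, deduplicated vector with that column first.
fn materialized_join(
    left: &[Tuple],
    left_col: usize,
    right: &[Tuple],
    right_col: usize,
) -> Vec<Tuple> {
    let mut sorted: Vec<Tuple> = right
        .iter()
        .map(|r| {
            let mut t = vec![r[right_col]];
            t.extend(
                r.iter()
                    .enumerate()
                    .filter(|(i, _)| *i != right_col)
                    .map(|(_, v)| *v),
            );
            t
        })
        .collect();
    sorted.sort();
    sorted.dedup();

    let mut out = Vec::new();
    for l in left {
        let key = l[left_col];
        // Binary search for the first candidate, then walk while the prefix matches.
        let start = sorted.partition_point(|r| r[0] < key);
        for r in sorted[start..].iter().take_while(|r| r.starts_with(&[key])) {
            let mut row = l.clone();
            row.extend_from_slice(&r[1..]);
            out.push(row);
        }
    }
    out
}
```

The probe loops are nearly identical. The only real difference is whether the ordered structure already exists in storage or has to be built for the duration of the query.
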
#### No Semijoin Operator
A search of the Cozo source for `semijoin` or `semi_join` returns nothing.
Semijoin behavior comes from the magic-sets transformation at `cozo-core/src/query/magic.rs` in the Cozo source tree.
This pass rewrites each rule so that body atoms get joined against an auxiliary "magic" relation whose contents encode the binding patterns supplied by the rule's callers.
The net effect is the same as semijoining body atoms against caller-supplied filters, but the implementation is a logical rewrite, not a runtime operator.
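
To make the idea concrete, here is a textbook-style sketch of magic sets on a reachability program, written by hand rather than derived from Cozo's `magic.rs`. The rules, the `Edge` alias, and both function names are assumptions for illustration. The query is `reach(seed, Y)`, and the "magic" relation holds every binding of `X` that the query can demand.

```rust
use std::collections::BTreeSet;

type Edge = (u32, u32);

/// magic(seed).
/// magic(Z) :- magic(X), edge(X, Z).
fn magic_set(edges: &BTreeSet<Edge>, seed: u32) -> BTreeSet<u32> {
    let mut magic = BTreeSet::from([seed]);
    let mut frontier = vec![seed];
    while let Some(x) = frontier.pop() {
        // Plain inner join of magic(x) with edge(x, z): an ordered range scan on x.
        for &(_, z) in edges.range((x, u32::MIN)..=(x, u32::MAX)) {
            if magic.insert(z) {
                frontier.push(z);
            }
        }
    }
    magic
}

/// reach(X, Y) :- magic(X), edge(X, Y).
/// reach(X, Y) :- magic(X), edge(X, Z), reach(Z, Y).
fn reach_with_magic(edges: &BTreeSet<Edge>, seed: u32) -> BTreeSet<Edge> {
    let magic = magic_set(edges, seed);
    let mut reach: BTreeSet<Edge> = BTreeSet::new();
    loop {
        let mut derived = Vec::new();
        for &x in &magic {
            for &(_, z) in edges.range((x, u32::MIN)..=(x, u32::MAX)) {
                derived.push((x, z));
                for &(_, y) in reach.range((z, u32::MIN)..=(z, u32::MAX)) {
                    derived.push((x, y));
                }
            }
        }
        let before = reach.len();
        reach.extend(derived);
        if reach.len() == before {
            break; // fixpoint: no new facts were derived in this round
        }
    }
    reach
}
```

Every body atom is evaluated with an ordinary join, yet edges that start at nodes outside the magic set are never touched. That pruning is the semijoin effect, obtained without a semijoin operator.
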
#### No Auto-Maintained Secondary Indexes
Cozo does not maintain secondary indexes automatically.
If you want to query a relation by a column order different from how it was declared, you declare a second relation with the columns reordered and keep its contents synchronized at insert time.
A covering index is just another stored relation.
The decision of which column order to store comes from how you expect to query the data, not from the engine.
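
A minimal sketch of that discipline, assuming a hypothetical `follows(follower, followee)` relation: the data is stored twice, once per column order, and both copies are written on every insert. The struct and method names are invented for this example.

```rust
use std::collections::BTreeSet;

/// Hypothetical relation stored under two key orders.
#[derive(Default)]
struct Follows {
    by_follower: BTreeSet<(u32, u32)>, // (follower, followee)
    by_followee: BTreeSet<(u32, u32)>, // (followee, follower): the hand-made "index"
}

impl Follows {
    /// Both copies are updated together; nothing keeps them in sync automatically.
    fn insert(&mut self, follower: u32, followee: u32) {
        self.by_follower.insert((follower, followee));
        self.by_followee.insert((followee, follower));
    }

    /// Range scan over the copy whose prefix is `follower`.
    fn followees_of(&self, follower: u32) -> Vec<u32> {
        self.by_follower
            .range((follower, u32::MIN)..=(follower, u32::MAX))
            .map(|&(_, b)| b)
            .collect()
    }

    /// Range scan over the copy whose prefix is `followee`.
    fn followers_of(&self, followee: u32) -> Vec<u32> {
        self.by_followee
            .range((followee, u32::MIN)..=(followee, u32::MAX))
            .map(|&(_, a)| a)
            .collect()
    }
}
```

The cost is doubled storage and doubled write work, paid in exchange for both lookups being prefix scans.
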
### LMDB
#### What It Is
LMDB is a single-file, memory-mapped, ordered key-value store.
It uses a B+ tree on disk and exposes reads as zero-copy byte slices that point directly into the mmap.
It supports a single writer at a time and many concurrent readers, and it uses shadow paging for MVCC, which means commits are atomic without a write-ahead log.
#### Cursor Primitives
A cursor in LMDB is a position inside the B+ tree.
The full set of cursor operations is defined by the `MDB_cursor_op` enum in `libraries/liblmdb/lmdb.h` in the LMDB source tree.
The operations relevant to join work are:
- `MDB_SET_RANGE`: position at the first key greater than or equal to a given key.
  This is the seek primitive that makes prefix scans possible.
- `MDB_NEXT`: advance one step forward in key order.
  Combined with `MDB_SET_RANGE` and a per-step prefix check, this gives you ordered range iteration.
- `MDB_SET` and `MDB_SET_KEY`: exact-key positioning, used for point lookups.
- `MDB_FIRST` and `MDB_LAST`: positional endpoints.

For databases opened with the `MDB_DUPSORT` flag, one key can carry multiple sorted values, and additional operations apply: `MDB_GET_BOTH`, `MDB_NEXT_DUP`, `MDB_FIRST_DUP`.
This is useful when a relation is encoded as "key = join columns, duplicate values = remaining columns": the set of duplicates is itself a secondary index over the join key.
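
The seek-then-step pattern can be mimicked without LMDB at all. The sketch below is a stand-in, not a binding: `Cursor`, `set_range`, `next_entry`, and `scan_prefix` are hypothetical names, and a sorted slice plays the role of the B+ tree.

```rust
/// Hypothetical stand-in for an LMDB cursor over entries sorted by key.
struct Cursor<'a> {
    entries: &'a [(Vec<u8>, Vec<u8>)],
    pos: usize,
}

impl<'a> Cursor<'a> {
    fn new(entries: &'a [(Vec<u8>, Vec<u8>)]) -> Self {
        Cursor { entries, pos: entries.len() }
    }

    /// Analog of `MDB_SET_RANGE`: position at the first key >= `key`.
    fn set_range(&mut self, key: &[u8]) -> Option<(&'a [u8], &'a [u8])> {
        self.pos = self.entries.partition_point(|(k, _)| k.as_slice() < key);
        self.current()
    }

    /// Analog of `MDB_NEXT`: advance one entry in key order.
    fn next_entry(&mut self) -> Option<(&'a [u8], &'a [u8])> {
        if self.pos < self.entries.len() {
            self.pos += 1;
        }
        self.current()
    }

    fn current(&self) -> Option<(&'a [u8], &'a [u8])> {
        self.entries
            .get(self.pos)
            .map(|(k, v)| (k.as_slice(), v.as_slice()))
    }
}

/// Prefix scan: seek once, then step while the key still starts with `prefix`.
fn scan_prefix<'a>(cur: &mut Cursor<'a>, prefix: &[u8]) -> Vec<&'a [u8]> {
    let mut out = Vec::new();
    let mut item = cur.set_range(prefix);
    while let Some((k, v)) = item {
        if !k.starts_with(prefix) {
            break;
        }
        out.push(v);
        item = cur.next_entry();
    }
    out
}
```

The returned slices borrow from the underlying entries rather than copying them, which loosely mirrors how LMDB hands out pointers into the memory map.
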
#### Rust Binding
`heed` is the idiomatic Rust binding for LMDB.
It wraps the cursor operations as `RoCursor` and `RwCursor` and returns key and value byte slices tied to the transaction lifetime, so reads remain zero-copy.
Meilisearch uses `heed` in production, so the binding is well exercised.
### LMDB Versus RocksDB
Both LMDB and RocksDB are ordered key-value stores with prefix and range scans, but their internal designs lead to different operational profiles.
LMDB highlights:

- B+ tree on disk, memory mapped
- Single writer at a time, many concurrent readers
- Zero-copy reads from the mmap
- Copy-on-write pages rather than an append-only log; pages freed by deletes and updates are tracked and reused by later write transactions, but the file does not shrink on its own
- File size grows up to a configured `mapsize`
- No background compaction
- Manual reclaim with `mdb_copy -c`, which writes a compacted copy of the environment

RocksDB highlights:

- Log-structured merge tree
- Multiple concurrent writers
- Background compaction
- Higher write throughput at the cost of write amplification
- Reads may traverse multiple levels with bloom-filter checks
- Engine manages its own disk layout

For a read-heavy prototype with batch inserts, LMDB is the closer fit: predictable read costs, cheap range scans, and zero-copy probes.
RocksDB earns its overhead when sustained write throughput is the bottleneck.
### Practical Implications
The current `query-ops` crate works on in-memory `Vec<Row>` values and will implement semijoin and natural join with a transient hash on one side.
The Cozo design suggests a clear upgrade path once a real substrate is added.
Short term: keep the in-memory operator and build a transient hash on the smaller side.
This is correct, easy to test, and easy to reason about.
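
A sketch of what the short-term operators might look like follows. The `Row` alias, the single-column join key, and both function names are assumptions; the real `query-ops` types may differ.

```rust
use std::collections::{HashMap, HashSet};

type Row = Vec<u32>; // hypothetical row type

/// Semijoin: keep the left rows whose join value appears somewhere in `right`.
fn semijoin(left: &[Row], left_col: usize, right: &[Row], right_col: usize) -> Vec<Row> {
    let keys: HashSet<u32> = right.iter().map(|r| r[right_col]).collect();
    left.iter()
        .filter(|l| keys.contains(&l[left_col]))
        .cloned()
        .collect()
}

/// Inner join on one column. The hash table is built on the smaller input and
/// probed with the larger one; output rows are always left columns followed by
/// the right columns minus the join column.
fn hash_join(left: &[Row], left_col: usize, right: &[Row], right_col: usize) -> Vec<Row> {
    let build_left = left.len() <= right.len();
    let (build, build_col, probe, probe_col) = if build_left {
        (left, left_col, right, right_col)
    } else {
        (right, right_col, left, left_col)
    };

    let mut table: HashMap<u32, Vec<&Row>> = HashMap::new();
    for b in build {
        table.entry(b[build_col]).or_default().push(b);
    }

    let mut out = Vec::new();
    for p in probe {
        for b in table.get(&p[probe_col]).into_iter().flatten() {
            let (l, r) = if build_left { (*b, p) } else { (p, *b) };
            let mut row = l.clone();
            row.extend(
                r.iter()
                    .enumerate()
                    .filter(|(i, _)| *i != right_col)
                    .map(|(_, v)| *v),
            );
            out.push(row);
        }
    }
    out
}
```

Output order follows the probe side, so tests that compare results should sort them first.
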
Medium term: when relations move into a substrate like LMDB, encode each relation so that the join columns sit at the prefix of the key, or use a `DUPSORT` database where the duplicate values carry the remaining columns.
At that point the join operator becomes a cursor pattern (`MDB_SET_RANGE` followed by `MDB_NEXT` while the prefix matches), and the separate hash-building step disappears.
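
One detail the key encoding has to get right is byte order: the substrate compares keys as raw bytes, so integers need a fixed-width big-endian encoding for byte order to agree with numeric order. The layout below (a `u32` join column followed by a `u32` remainder) and the function names are assumptions for illustration only.

```rust
/// Hypothetical key layout: join column first, remaining column after it.
fn encode_key(join_col: u32, rest: u32) -> [u8; 8] {
    let mut key = [0u8; 8];
    key[..4].copy_from_slice(&join_col.to_be_bytes());
    key[4..].copy_from_slice(&rest.to_be_bytes());
    key
}

/// The seek target for a prefix scan is just the encoded join column.
fn encode_prefix(join_col: u32) -> [u8; 4] {
    join_col.to_be_bytes()
}

/// Inverse of `encode_key`.
fn decode_key(key: &[u8; 8]) -> (u32, u32) {
    let join_col = u32::from_be_bytes([key[0], key[1], key[2], key[3]]);
    let rest = u32::from_be_bytes([key[4], key[5], key[6], key[7]]);
    (join_col, rest)
}
```

With a little-endian encoding, 256 would sort before 255 as bytes and a forward scan would visit tuples in the wrong order, so the encoding choice is part of the join's correctness and not just a storage detail.
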
Index discipline: if a relation needs to be joined two different ways, store it twice with different prefix orders.
There is no clever-indexing shortcut in either Cozo or LMDB, and trying to invent one is unlikely to be worth the cost.
The takeaway is that the operator surface in `query-ops` is fine for an in-memory prototype, but the substrate decision is the load-bearing one for performance.
We do not need to design around it now, but the natural successor to the current operators is a key-encoding discipline rather than a more elaborate operator implementation.
### Changelog
- **June 2, 2026** -- Initial version of this document.