Add a note file about the findings from CozoDB and LMDB projects
Parent: b1d38eff49
Commit: 6819e3f8b3
notes/backend/01-cozo-and-lmdb-findings.md

## Cozo and LMDB Findings

Sources inspected: the Cozo source tree at `github.com/cozodb/cozo`, the LMDB source tree at `github.com/LMDB/lmdb`, and the `heed` Rust binding at `github.com/meilisearch/heed`.
File paths in this note are relative to the root of the named project's source tree.
The aim was to understand how a working Datalog engine (Cozo) implements joins and what a low-level key-value substrate (LMDB) provides that makes those joins cheap.
This note summarizes the design lessons and the practical implications for the `query-ops` crate in this playground.

### Summary

Cozo is an embedded Datalog database written in Rust.
It does not have a separate semijoin operator.
Instead, it has one inner-join operator that picks between two strategies based on how each relation is stored: an index-nested-loop strategy that uses ordered range scans over the substrate, and a fallback that materializes one side into a sorted vector and probes it.
Semijoin behavior, when needed, emerges from a separate rewrite step called the magic-sets transformation, which converts semijoin-shaped pruning into regular inner joins against derived relations.

LMDB is a memory-mapped, ordered key-value store with a B+ tree on disk.
It exposes a small set of cursor primitives that support prefix iteration, range iteration, and exact-key lookup.
These primitives are exactly what an index-nested-loop join needs: seek to a key prefix, then iterate forward while the prefix matches.

The combined lesson is that a good join does not require a clever operator.
It requires the relation to be stored with the join columns at the front of the key, so that the substrate's ordered iteration can do the join itself.

### Cozo

#### What It Is

Cozo is a Datalog database with multiple swappable storage backends, including an in-memory store, SQLite, RocksDB, sled, and TiKV.
The execution engine speaks a single narrow storage trait whose surface is essentially `get`, `put`, `range_iter`, and `prefix_iter` over byte keys.
Each backend implements that trait.
The trait definition lives at `cozo-core/src/storage/mod.rs` in the Cozo source tree.

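
To make the shape of such a narrow interface concrete, here is a minimal sketch of a four-method storage trait with an in-memory implementation.
The names `Store`, `MemStore`, and `Pairs`, and all signatures, are assumptions made for illustration; they are not copied from Cozo's trait, and a `BTreeMap` stands in for a real backend.

```rust
use std::collections::BTreeMap;
use std::ops::Bound;

/// Boxed iterator over borrowed (key, value) pairs, in key order.
pub type Pairs<'a> = Box<dyn Iterator<Item = (&'a [u8], &'a [u8])> + 'a>;

/// Hypothetical narrow storage interface (illustrative, not Cozo's real trait).
pub trait Store {
    fn get(&self, key: &[u8]) -> Option<Vec<u8>>;
    fn put(&mut self, key: &[u8], value: &[u8]);
    /// All entries with `lower <= key < upper`. Requires `lower <= upper`.
    fn range_iter<'a>(&'a self, lower: &[u8], upper: &[u8]) -> Pairs<'a>;
    /// All entries whose key starts with `prefix`.
    fn prefix_iter<'a>(&'a self, prefix: &[u8]) -> Pairs<'a>;
}

/// In-memory stand-in for a backend: an ordered map over byte keys.
#[derive(Default)]
pub struct MemStore {
    map: BTreeMap<Vec<u8>, Vec<u8>>,
}

impl Store for MemStore {
    fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
        self.map.get(key).cloned()
    }

    fn put(&mut self, key: &[u8], value: &[u8]) {
        self.map.insert(key.to_vec(), value.to_vec());
    }

    fn range_iter<'a>(&'a self, lower: &[u8], upper: &[u8]) -> Pairs<'a> {
        let bounds = (
            Bound::Included(lower.to_vec()),
            Bound::Excluded(upper.to_vec()),
        );
        Box::new(
            self.map
                .range(bounds)
                .map(|(k, v)| (k.as_slice(), v.as_slice())),
        )
    }

    fn prefix_iter<'a>(&'a self, prefix: &[u8]) -> Pairs<'a> {
        let prefix = prefix.to_vec();
        // Seek to the first key >= prefix, then stop at the first key
        // that no longer starts with the prefix.
        Box::new(
            self.map
                .range(prefix.clone()..)
                .take_while(move |(k, _)| k.starts_with(&prefix))
                .map(|(k, v)| (k.as_slice(), v.as_slice())),
        )
    }
}
```

Everything a join needs from the backend in this sketch goes through `prefix_iter`, which is why the backend can be swapped without touching the operator.
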
#### Join Behavior

The relational algebra at `cozo-core/src/query/ra.rs` in the Cozo source tree defines a single join operator named `InnerJoin`.
At execution time it chooses between two strategies based on a check called `join_is_prefix`:

- prefix join: for each tuple from the left side, the engine builds a byte prefix from the join columns and calls `prefix_iter` on the right relation.
  The substrate yields all matching tuples in key order.
  No hash table is built.
  This path is taken whenever the right side's join columns are stored as the prefix of its key.
- materialized join: used when the join columns are not a key prefix.
  The right side is read fully into a sorted, deduplicated vector, reordered so the join columns come first, then walked with a `starts_with(prefix)` check.
  This is the build-and-probe family, but with a sorted vector instead of a hash map.

The choice depends entirely on whether the join columns sit at the front of the stored key.

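
The two strategies above can be sketched over plain Rust collections.
This is an illustration under stated assumptions, not Cozo's code: tuples are `Vec<u32>`, a `BTreeSet` stands in for the ordered substrate, and the function names (including this `join_is_prefix`) and their signatures are hypothetical.
The binary search in the materialized path is an addition of this sketch.

```rust
use std::collections::BTreeSet;

type Tuple = Vec<u32>;
type Pairs = Vec<(Tuple, Tuple)>;

/// Hypothetical analogue of the prefix check: true when the right side's
/// join columns are exactly its leading columns, in order.
fn join_is_prefix(right_join_cols: &[usize]) -> bool {
    right_join_cols.iter().enumerate().all(|(i, &c)| i == c)
}

/// Prefix join: one ordered seek-and-scan on the right side per left tuple.
fn prefix_join(left: &[Tuple], left_cols: &[usize], right: &BTreeSet<Tuple>) -> Pairs {
    let mut out = Vec::new();
    for l in left {
        let prefix: Tuple = left_cols.iter().map(|&c| l[c]).collect();
        let matches = right
            .range(prefix.clone()..)
            .take_while(|r| r.starts_with(&prefix));
        for r in matches {
            out.push((l.clone(), r.clone()));
        }
    }
    out
}

/// Materialized join: reorder the right side so join columns come first,
/// sort and deduplicate it, then probe it with the same prefix test.
/// Right tuples in the output are in the reordered column order.
fn materialized_join(
    left: &[Tuple],
    left_cols: &[usize],
    right: &[Tuple],
    right_cols: &[usize],
) -> Pairs {
    let mut sorted: Vec<Tuple> = right
        .iter()
        .map(|r| {
            let mut t: Tuple = right_cols.iter().map(|&c| r[c]).collect();
            let rest = (0..r.len()).filter(|c| !right_cols.contains(c));
            t.extend(rest.map(|c| r[c]));
            t
        })
        .collect();
    sorted.sort();
    sorted.dedup();

    let mut out = Vec::new();
    for l in left {
        let prefix: Tuple = left_cols.iter().map(|&c| l[c]).collect();
        // Binary search for the first tuple >= prefix, then walk forward.
        let start = sorted.partition_point(|r| r.as_slice() < prefix.as_slice());
        for r in sorted[start..].iter().take_while(|r| r.starts_with(&prefix)) {
            out.push((l.clone(), r.clone()));
        }
    }
    out
}
```

Both paths run the same inner loop; the only difference is whether the sorted order already exists in storage or has to be built first.
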
#### No Semijoin Operator

A search of the Cozo source for `semijoin` or `semi_join` returns nothing.
Semijoin behavior comes from the magic-sets transformation at `cozo-core/src/query/magic.rs` in the Cozo source tree.
This pass rewrites each rule so that body atoms get joined against an auxiliary "magic" relation whose contents encode the binding patterns supplied by the rule's callers.
The net effect is the same as semijoining body atoms against caller-supplied filters, but the implementation is a logical rewrite, not a runtime operator.

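
The end effect described above, pruning through an ordinary inner join against a derived relation of caller bindings, can be illustrated with a small sketch.
It does not implement the magic-sets rewrite itself (there is no rule adornment or rule rewriting here); `magic_relation` and `join_with_magic` are hypothetical names, and the binding is assumed to be on the first column only.

```rust
use std::collections::BTreeSet;

type Tuple = Vec<u32>;

/// Derive the "magic" relation: the distinct values that callers bind
/// for the first column of the body relation.
fn magic_relation(caller_bindings: &[u32]) -> BTreeSet<u32> {
    caller_bindings.iter().copied().collect()
}

/// Plain inner join of the magic relation with `rel` on its first column.
/// Because the magic relation holds each key once, every tuple of `rel`
/// appears at most once in the output, which is the semijoin result.
fn join_with_magic(magic: &BTreeSet<u32>, rel: &BTreeSet<Tuple>) -> Vec<Tuple> {
    let mut out = Vec::new();
    for &key in magic {
        let prefix = vec![key];
        out.extend(
            rel.range(prefix.clone()..)
                .take_while(|t| t.starts_with(&prefix))
                .cloned(),
        );
    }
    out
}
```

No dedicated semijoin operator appears anywhere in the sketch: the pruning comes entirely from what the derived relation contains.
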
#### No Auto-Maintained Secondary Indexes

Cozo does not maintain secondary indexes automatically.
If you want to query a relation by a column order different from how it was declared, you declare a second relation with the columns reordered and keep its contents synchronized at insert time.
A covering index is just another stored relation.
The decision of which column order to store comes from how you expect to query the data, not from the engine.

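
A minimal sketch of "an index is just another stored relation", assuming a binary relation and two ordered sets as stand-ins for two stored relations.
The `Edge` type and its method names are hypothetical.

```rust
use std::collections::BTreeSet;

/// A binary relation stored twice: once ordered by (src, dst) and once by
/// (dst, src). Both copies are written together at insert time.
#[derive(Default)]
struct Edge {
    by_src: BTreeSet<(u32, u32)>,
    by_dst: BTreeSet<(u32, u32)>,
}

impl Edge {
    fn insert(&mut self, src: u32, dst: u32) {
        self.by_src.insert((src, dst));
        self.by_dst.insert((dst, src));
    }

    /// Range scan over the copy whose leading column is `src`.
    fn targets_of(&self, src: u32) -> Vec<u32> {
        self.by_src
            .range((src, u32::MIN)..=(src, u32::MAX))
            .map(|&(_, dst)| dst)
            .collect()
    }

    /// Range scan over the copy whose leading column is `dst`.
    fn sources_of(&self, dst: u32) -> Vec<u32> {
        self.by_dst
            .range((dst, u32::MIN)..=(dst, u32::MAX))
            .map(|&(_, src)| src)
            .collect()
    }
}
```

Each lookup direction is cheap only because a copy with that leading column exists; the cost is paid as a second write per insert.
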
### LMDB

#### What It Is

LMDB is a single-file, memory-mapped, ordered key-value store.
It uses a B+ tree on disk and exposes reads as zero-copy byte slices that point directly into the mmap.
It supports a single writer at a time and many concurrent readers, and it uses shadow paging for MVCC, which means commits are atomic without a write-ahead log.

#### Cursor Primitives

A cursor in LMDB is a position inside the B+ tree.
The full set of cursor operations is defined by the `MDB_cursor_op` enum in `libraries/liblmdb/lmdb.h` in the LMDB source tree.
The operations relevant to join work are:

- `MDB_SET_RANGE`: position at the first key greater than or equal to a given key.
  This is the seek primitive that makes prefix scans possible.
- `MDB_NEXT`: advance one step forward in key order.
  Combined with `MDB_SET_RANGE` and a per-step prefix check, this gives you ordered range iteration.
- `MDB_SET` and `MDB_SET_KEY`: exact-key positioning, used for point lookups.
- `MDB_FIRST` and `MDB_LAST`: positional endpoints.

For databases opened with the `MDB_DUPSORT` flag, one key can carry multiple sorted values, and additional operations apply: `MDB_GET_BOTH`, `MDB_NEXT_DUP`, `MDB_FIRST_DUP`.
This is useful when a relation is encoded as "key = join columns, duplicate values = remaining columns": the duplicates stored under one key are then exactly the tuples that match that join key, already in sorted order.

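
The seek-then-step pattern can be sketched with a toy cursor over an in-memory ordered map.
This is a stand-in, not the LMDB API: `Cursor`, `set_range`, `next_entry`, and `prefix_scan` are hypothetical names that only mimic the roles of `MDB_SET_RANGE` and `MDB_NEXT`, and end-of-data handling is simplified.

```rust
use std::collections::BTreeMap;
use std::ops::Bound;

type Map = BTreeMap<Vec<u8>, Vec<u8>>;
type Entry<'a> = (&'a [u8], &'a [u8]);

/// Toy cursor: remembers the key it currently sits on, if any.
struct Cursor<'a> {
    map: &'a Map,
    pos: Option<&'a Vec<u8>>,
}

impl<'a> Cursor<'a> {
    fn new(map: &'a Map) -> Self {
        Cursor { map, pos: None }
    }

    /// Role of MDB_SET_RANGE: move to the first entry with key >= `key`.
    fn set_range(&mut self, key: &[u8]) -> Option<Entry<'a>> {
        let map = self.map;
        let hit = map
            .range::<[u8], _>((Bound::Included(key), Bound::Unbounded))
            .next();
        self.pos = hit.map(|(k, _)| k);
        hit.map(|(k, v)| (k.as_slice(), v.as_slice()))
    }

    /// Role of MDB_NEXT: move to the entry right after the current one.
    /// Toy simplification: returns None when the cursor is unpositioned.
    fn next_entry(&mut self) -> Option<Entry<'a>> {
        let map = self.map;
        let cur = self.pos?;
        let hit = map
            .range::<[u8], _>((Bound::Excluded(cur.as_slice()), Bound::Unbounded))
            .next();
        self.pos = hit.map(|(k, _)| k);
        hit.map(|(k, v)| (k.as_slice(), v.as_slice()))
    }
}

/// Prefix scan: seek once, then step while the prefix still matches.
fn prefix_scan<'a>(map: &'a Map, prefix: &[u8]) -> Vec<Entry<'a>> {
    let mut cursor = Cursor::new(map);
    let mut out = Vec::new();
    let mut entry = cursor.set_range(prefix);
    while let Some((k, v)) = entry {
        if !k.starts_with(prefix) {
            break;
        }
        out.push((k, v));
        entry = cursor.next_entry();
    }
    out
}
```

The per-step `starts_with` check is what turns an open-ended forward scan into a bounded prefix scan.
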
#### Rust Binding

`heed` is the idiomatic Rust binding for LMDB.
It wraps the cursor operations as `RoCursor` and `RwCursor` and returns key and value byte slices tied to the transaction lifetime, so reads remain zero-copy.
Meilisearch uses `heed` in production, so the binding is well exercised.

### LMDB Versus RocksDB

Both LMDB and RocksDB are ordered key-value stores with prefix and range scans, but their internal designs lead to different operational profiles.

LMDB highlights:

- B+ tree on disk, memory mapped
- Single writer at a time, many concurrent readers
- Zero-copy reads from the mmap
- Copy-on-write on-disk format; pages freed by deletes and updates are tracked and reused by later writes
- File size grows up to a configured `mapsize`
- No background compaction
- Manual reclaim with `mdb_copy -c`, which writes a compacted copy of the database

RocksDB highlights:

- Log-structured merge tree
- Multiple concurrent writers
- Background compaction
- Higher write throughput at the cost of write amplification
- Reads may traverse multiple levels with bloom-filter checks
- Engine manages its own disk layout

For a read-heavy prototype with batch inserts, LMDB is the closer fit: predictable read costs, cheap range scans, and zero-copy probes.
RocksDB earns its overhead when sustained write throughput is the bottleneck.

### Practical Implications

The current `query-ops` crate works on in-memory `Vec<Row>` values and will implement semijoin and natural join with a transient hash on one side.
The Cozo design suggests a clear upgrade path once a real substrate is added.

Short term: keep the in-memory operator and build a transient hash on the smaller side.
This is correct, easy to test, and easy to reason about.

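
A minimal sketch of the short-term approach, assuming `Row` is a vector of `u32` columns and joins are specified by column positions.
The `Row` alias and the function names are hypothetical and do not reflect the actual `query-ops` API.
`hash_join` returns matching row pairs; a natural join would additionally merge each pair and drop the repeated join columns.

```rust
use std::collections::{HashMap, HashSet};

type Row = Vec<u32>; // hypothetical row type for this sketch

/// Project a row onto its join columns.
fn key(row: &Row, cols: &[usize]) -> Vec<u32> {
    cols.iter().map(|&c| row[c]).collect()
}

/// Semijoin: keep the rows of `left` that have at least one match in `right`.
fn semijoin(left: &[Row], lcols: &[usize], right: &[Row], rcols: &[usize]) -> Vec<Row> {
    let keys: HashSet<Vec<u32>> = right.iter().map(|r| key(r, rcols)).collect();
    left.iter()
        .filter(|l| keys.contains(&key(l, lcols)))
        .cloned()
        .collect()
}

/// Equi-join that builds the transient hash on whichever input is smaller
/// and probes it with the other. Output pairs are always (left, right).
fn hash_join(left: &[Row], lcols: &[usize], right: &[Row], rcols: &[usize]) -> Vec<(Row, Row)> {
    let build_left = left.len() <= right.len();
    let (build, bcols, probe, pcols) = if build_left {
        (left, lcols, right, rcols)
    } else {
        (right, rcols, left, lcols)
    };

    let mut table: HashMap<Vec<u32>, Vec<&Row>> = HashMap::new();
    for b in build {
        table.entry(key(b, bcols)).or_default().push(b);
    }

    let mut out = Vec::new();
    for p in probe {
        if let Some(matches) = table.get(&key(p, pcols)) {
            for &b in matches {
                out.push(if build_left {
                    (b.clone(), p.clone())
                } else {
                    (p.clone(), b.clone())
                });
            }
        }
    }
    out
}
```

Output order follows the probe side, so callers that need a stable order should sort the result.
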

Medium term: when relations move into a substrate like LMDB, encode each relation so that the join columns sit at the prefix of the key, or use a `DUPSORT` database where the duplicate values carry the remaining columns.
At that point the join operator becomes a cursor pattern (`MDB_SET_RANGE` followed by `MDB_NEXT` while the prefix matches), and the separate hash-building step disappears.

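
One possible key encoding for this, sketched under the assumption that every column is a `u32` and a `BTreeSet` of byte keys stands in for the substrate.
The names `encode_key`, `encode_prefix`, and `probe` are hypothetical.
Columns are written as fixed-width big-endian bytes so that byte order matches numeric order and a prefix can never end in the middle of a column.

```rust
use std::collections::BTreeSet;

type Row = Vec<u32>; // hypothetical row type for this sketch

/// Encode a row as a byte key: join columns first, then the remaining
/// columns, each as 4 big-endian bytes.
fn encode_key(row: &Row, join_cols: &[usize]) -> Vec<u8> {
    let rest = (0..row.len()).filter(|c| !join_cols.contains(c));
    join_cols
        .iter()
        .copied()
        .chain(rest)
        .flat_map(|c| row[c].to_be_bytes())
        .collect()
}

/// Encode just the join values, producing the prefix to seek to.
fn encode_prefix(join_values: &[u32]) -> Vec<u8> {
    join_values.iter().flat_map(|v| v.to_be_bytes()).collect()
}

/// Probe the stored keys: seek to the prefix, scan while it matches.
fn probe(index: &BTreeSet<Vec<u8>>, join_values: &[u32]) -> Vec<Vec<u8>> {
    let prefix = encode_prefix(join_values);
    index
        .range(prefix.clone()..)
        .take_while(|k| k.starts_with(&prefix))
        .cloned()
        .collect()
}
```

With this layout the probe side of a join only has to encode its join values and seek; no per-query hash table is built.
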

Index discipline: if a relation needs to be joined two different ways, store it twice with different prefix orders.
There is no clever-indexing shortcut in either Cozo or LMDB, and trying to invent one is unlikely to be worth the cost.

The takeaway is that the operator surface in `query-ops` is fine for an in-memory prototype, but the substrate decision is the load-bearing one for performance.
We do not need to design around it now, but the natural successor to the current operators is a key-encoding discipline rather than a more elaborate operator implementation.

### Changelog

- **June 2, 2026** -- Initial version of this document.