storage-engine-playground/01-cozo-and-lmdb-findings.md at 751ef2e47e495be1e4b7679b3aa0007ba3f9f9a1

Hassan Abedi b31aa32747 Add a note file about the findings from CozoDB and LMDB projects

2026-06-03 12:16:26 +02:00

Cozo and LMDB Findings

Sources inspected: the Cozo source tree at github.com/cozodb/cozo, the LMDB source tree at github.com/LMDB/lmdb, and the heed Rust binding at github.com/meilisearch/heed. File paths in this note are relative to the root of the named project's source tree. The aim was to understand how a working Datalog engine (Cozo) implements joins and what a low-level key-value substrate (LMDB) provides that makes those joins cheap. This note summarizes the design lessons and the practical implications for the query-ops crate in this playground.

Summary

Cozo is an embedded Datalog database written in Rust. It does not have a separate semijoin operator. Instead, it has one inner-join operator that picks between two strategies based on how each relation is stored: an index-nested-loop strategy that uses ordered range scans over the substrate, and a fallback that materializes one side into a sorted vector and probes it. Semijoin behavior, when needed, emerges from a separate rewrite step called the magic-sets transformation, which converts semijoin-shaped pruning into regular inner joins against derived relations.

LMDB is a memory-mapped, ordered key-value store with a B+ tree on disk. It exposes a small set of cursor primitives that support prefix iteration, range iteration, and exact-key lookup. These primitives are exactly what an index-nested-loop join needs: seek to a key prefix, then iterate forward while the prefix matches.

The combined lesson is that a good join does not require a clever operator. It requires the relation to be stored with the join columns at the front of the key, so that the substrate's ordered iteration can do the join itself.

Cozo

What It Is

Cozo is a Datalog database with multiple swappable storage backends, including an in-memory store, SQLite, RocksDB, sled, and TiKV. The execution engine speaks a single narrow storage trait whose surface is essentially get, put, range_iter, and prefix_iter over byte keys. Each backend implements that trait. The trait definition lives at cozo-core/src/storage/mod.rs in the Cozo source tree.

Join Behavior

The relational algebra at cozo-core/src/query/ra.rs in the Cozo source tree defines a single join operator named InnerJoin. At execution time it chooses between two strategies based on a check called join_is_prefix:

prefix join: for each tuple from the left side, the engine builds a byte prefix from the join columns and calls prefix_iter on the right relation. The substrate yields all matching tuples in key order. No hash table is built. This path is taken whenever the right side's join columns are stored as the prefix of its key.
materialized join: used when the join columns are not a key prefix. The right side is read fully into a sorted, deduplicated vector, reordered so the join columns come first, then walked with a starts_with(prefix) check. This is the build-and-probe family, but with a sorted vector instead of a hash map.

The choice is made entirely on whether the join columns sit at the front of the stored key.
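The two strategies can be sketched with nothing but the standard library. This is a minimal illustration and not Cozo's code: the `Tuple` alias, the function names, and the use of `BTreeSet<Vec<u32>>` as a stand-in for an ordered substrate are all assumptions made for the example. The materialized path here also uses a binary search to find the first candidate, which is a choice of this sketch.

```rust
use std::collections::BTreeSet;

/// Toy tuple: a row of integer columns. Vec<u32> orders lexicographically,
/// which stands in for the byte order of an encoded key (assumption).
type Tuple = Vec<u32>;

/// Collect the join-column values of `row` into a probe prefix.
fn probe_prefix(row: &Tuple, cols: &[usize]) -> Tuple {
    cols.iter().map(|&c| row[c]).collect()
}

/// Append the non-join tail of `r` to a copy of `l`.
fn concat(l: &Tuple, r: &Tuple, prefix_len: usize) -> Tuple {
    let mut row = l.clone();
    row.extend_from_slice(&r[prefix_len..]);
    row
}

/// Prefix strategy: `right` is already stored with the join columns first,
/// so each probe is a seek followed by a forward scan. No table is built.
fn prefix_join(left: &[Tuple], left_cols: &[usize], right: &BTreeSet<Tuple>) -> Vec<Tuple> {
    let mut out = Vec::new();
    for l in left {
        let prefix = probe_prefix(l, left_cols);
        for r in right.range(prefix.clone()..).take_while(|r| r.starts_with(&prefix)) {
            out.push(concat(l, r, prefix.len()));
        }
    }
    out
}

/// Materialized strategy: the join columns are not a key prefix, so the right
/// side is read fully, reordered, sorted, and deduplicated before probing.
fn materialized_join(
    left: &[Tuple],
    left_cols: &[usize],
    right: &[Tuple],
    right_cols: &[usize],
) -> Vec<Tuple> {
    let mut sorted: Vec<Tuple> = right
        .iter()
        .map(|r| {
            // Join columns first, then the remaining columns in original order.
            let mut t = probe_prefix(r, right_cols);
            t.extend(
                r.iter()
                    .enumerate()
                    .filter(|(i, _)| !right_cols.contains(i))
                    .map(|(_, &v)| v),
            );
            t
        })
        .collect();
    sorted.sort();
    sorted.dedup();

    let mut out = Vec::new();
    for l in left {
        let prefix = probe_prefix(l, left_cols);
        // Binary search for the first candidate, then walk while the prefix matches.
        let start = sorted.partition_point(|t| t < &prefix);
        for r in sorted[start..].iter().take_while(|r| r.starts_with(&prefix)) {
            out.push(concat(l, r, prefix.len()));
        }
    }
    out
}
```

Both functions produce the same rows. The difference is that `materialized_join` pays for a full read and a sort before the first probe, while `prefix_join` starts probing immediately.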

No Semijoin Operator

A search of the Cozo source for semijoin or semi_join returns nothing. Semijoin behavior comes from the magic-sets transformation at cozo-core/src/query/magic.rs in the Cozo source tree. This pass rewrites each rule so that body atoms get joined against an auxiliary "magic" relation whose contents encode the binding patterns supplied by the rule's callers. The net effect is the same as semijoining body atoms against caller-supplied filters, but the implementation is a logical rewrite, not a runtime operator.
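The effect of the rewrite, though not the rewrite itself, can be shown in a few lines. The sketch below is an assumption-laden illustration: `magic_keys` and `restrict_by_magic` are hypothetical names, relations are plain sets of integer pairs, and nothing here mirrors how Cozo represents rules. The point is that the "magic" relation is a deduplicated set of bound keys, so an ordinary inner join against it returns each body tuple at most once, which is the semijoin result.

```rust
use std::collections::BTreeSet;

/// Derived "magic" relation (hypothetical): the distinct key values that the
/// caller can bind. Caller tuples are (anything, key) in this toy layout.
fn magic_keys(caller: &[(u32, u32)]) -> BTreeSet<u32> {
    caller.iter().map(|&(_, key)| key).collect()
}

/// Ordinary inner join of the magic relation against a body atom stored as
/// (key, payload). Because `magic` holds each key once, every atom tuple
/// appears at most once in the output: semijoin behavior from a plain join.
fn restrict_by_magic(magic: &BTreeSet<u32>, atom: &BTreeSet<(u32, u32)>) -> Vec<(u32, u32)> {
    let mut out = Vec::new();
    for &k in magic {
        // All tuples whose first column equals k form one contiguous range.
        out.extend(atom.range((k, u32::MIN)..=(k, u32::MAX)).copied());
    }
    out
}
```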

No Auto-Maintained Secondary Indexes

Cozo does not maintain secondary indexes automatically. If you want to query a relation by a column order different from how it was declared, you declare a second relation with the columns reordered and keep its contents synchronized at insert time. A covering index is just another stored relation. The decision of which column order to store comes from how you expect to query the data, not from the engine.
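A minimal sketch of that pattern, assuming a hypothetical two-column `Edges` relation held in memory: the same tuples are stored twice with the columns swapped, and a single insert path keeps both copies in step.

```rust
use std::collections::BTreeSet;

/// Hypothetical relation stored under two column orders.
#[derive(Default)]
struct Edges {
    by_src: BTreeSet<(u32, u32)>, // key order: (src, dst)
    by_dst: BTreeSet<(u32, u32)>, // key order: (dst, src), the "index"
}

impl Edges {
    /// Every write goes to both orders; this is the synchronization step.
    fn insert(&mut self, src: u32, dst: u32) {
        self.by_src.insert((src, dst));
        self.by_dst.insert((dst, src));
    }

    /// Range scan on the (src, dst) order.
    fn successors(&self, src: u32) -> Vec<u32> {
        self.by_src
            .range((src, u32::MIN)..=(src, u32::MAX))
            .map(|&(_, dst)| dst)
            .collect()
    }

    /// Range scan on the (dst, src) order.
    fn predecessors(&self, dst: u32) -> Vec<u32> {
        self.by_dst
            .range((dst, u32::MIN)..=(dst, u32::MAX))
            .map(|&(_, src)| src)
            .collect()
    }
}
```

The cost is visible in `insert`: each additional column order adds one more write per tuple and one more copy of the data.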

LMDB

What It Is

LMDB is a memory-mapped, ordered key-value store that keeps an environment in a single data file (plus a small lock file). It uses a B+ tree on disk and exposes reads as zero-copy byte slices that point directly into the mmap. It supports a single writer at a time and many concurrent readers, and it uses copy-on-write shadow paging for MVCC, which means commits are atomic without a write-ahead log.

Cursor Primitives

A cursor in LMDB is a position inside the B+ tree. The full set of cursor operations is defined by the MDB_cursor_op enum in libraries/liblmdb/lmdb.h in the LMDB source tree. The operations relevant to join work are:

MDB_SET_RANGE: position at the first key greater than or equal to a given key. This is the seek primitive that makes prefix scans possible.
MDB_NEXT: advance one step forward in key order. Combined with MDB_SET_RANGE and a per-step prefix check, this gives you ordered range iteration.
MDB_SET and MDB_SET_KEY: exact-key positioning, used for point lookups.
MDB_FIRST and MDB_LAST: positional endpoints.
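The seek-then-step pattern can be modeled without LMDB at all. The sketch below is a toy: `Cursor`, `set_range`, `advance`, and `prefix_scan` are hypothetical names over a sorted in-memory slice, chosen to mirror MDB_SET_RANGE and MDB_NEXT. It assumes keys compare as plain byte strings and does not reproduce every LMDB cursor rule.

```rust
/// Toy stand-in for a cursor over entries sorted by key (assumption: the
/// slice is already sorted in lexicographic byte order).
struct Cursor<'a> {
    entries: &'a [(Vec<u8>, Vec<u8>)],
    pos: usize,
}

impl<'a> Cursor<'a> {
    fn new(entries: &'a [(Vec<u8>, Vec<u8>)]) -> Self {
        Cursor { entries, pos: 0 }
    }

    /// Analogue of MDB_SET_RANGE: position at the first key >= `key`.
    fn set_range(&mut self, key: &[u8]) -> Option<&'a (Vec<u8>, Vec<u8>)> {
        self.pos = self.entries.partition_point(|(k, _)| k.as_slice() < key);
        self.entries.get(self.pos)
    }

    /// Analogue of MDB_NEXT: step one entry forward in key order.
    fn advance(&mut self) -> Option<&'a (Vec<u8>, Vec<u8>)> {
        self.pos += 1;
        self.entries.get(self.pos)
    }
}

/// Prefix scan: seek once, then step while the key still starts with `prefix`.
fn prefix_scan<'a>(entries: &'a [(Vec<u8>, Vec<u8>)], prefix: &[u8]) -> Vec<&'a [u8]> {
    let mut cur = Cursor::new(entries);
    let mut out = Vec::new();
    let mut item = cur.set_range(prefix);
    while let Some((k, v)) = item {
        if !k.starts_with(prefix) {
            break; // left the prefix range; everything after is larger
        }
        out.push(v.as_slice());
        item = cur.advance();
    }
    out
}
```

The per-step `starts_with` check is what bounds the scan, since the cursor itself has no notion of where a prefix ends.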

For databases opened with the MDB_DUPSORT flag, one key can carry multiple sorted values, and additional operations apply: MDB_GET_BOTH, MDB_NEXT_DUP, MDB_FIRST_DUP. This is useful when a relation is encoded as "key = join columns, duplicate values = remaining columns": positioning on a key and stepping through its duplicates yields exactly the tuples that match that join key, in sorted order.
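As a mental model only, a DUPSORT database behaves like a sorted map from key to a sorted set of values. The `DupSortDb` type and its methods below are hypothetical and in-memory. The comments name the LMDB operations each method loosely corresponds to, without claiming identical semantics.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical in-memory model of a DUPSORT database:
/// key = join columns, sorted duplicates = remaining columns.
#[derive(Default)]
struct DupSortDb {
    map: BTreeMap<Vec<u8>, BTreeSet<Vec<u8>>>,
}

impl DupSortDb {
    /// Store one (key, value) pair; a repeated pair is kept once.
    fn put(&mut self, key: &[u8], value: &[u8]) {
        self.map.entry(key.to_vec()).or_default().insert(value.to_vec());
    }

    /// Roughly MDB_SET followed by MDB_FIRST_DUP and repeated MDB_NEXT_DUP:
    /// all values stored under `key`, in sorted order.
    fn dups(&self, key: &[u8]) -> Vec<&[u8]> {
        self.map
            .get(key)
            .into_iter()
            .flatten()
            .map(|v| v.as_slice())
            .collect()
    }

    /// Roughly MDB_GET_BOTH: does this exact (key, value) pair exist?
    fn get_both(&self, key: &[u8], value: &[u8]) -> bool {
        self.map.get(key).map_or(false, |vals| vals.contains(value))
    }
}
```

In join terms, `dups` answers "which tuples match this join key" and `get_both` answers "is this exact tuple present".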

Rust Binding

heed is the idiomatic Rust binding for LMDB. It wraps the cursor operations as RoCursor and RwCursor and returns key and value byte slices tied to the transaction lifetime, so reads remain zero-copy. Meilisearch uses heed in production, so the binding is well exercised.

LMDB Versus RocksDB

Both LMDB and RocksDB are ordered key-value stores with prefix and range scans, but their internal designs lead to different operational profiles.

LMDB highlights:

B+ tree on disk, memory mapped
Single writer at a time, many concurrent readers
Zero-copy reads from the mmap
Copy-on-write on-disk format; pages freed by deletes and updates are tracked and reused by later writes
File size grows up to a configured mapsize
No background compaction
Manual reclaim with mdb_copy -c (compacting copy)

RocksDB highlights:

Log-structured merge tree
Multiple concurrent writers
Background compaction
Higher write throughput at the cost of write amplification
Reads may traverse multiple levels with bloom-filter checks
Engine manages its own disk layout

For a read-heavy prototype with batch inserts, LMDB is the closer fit: predictable read costs, cheap range scans, and zero-copy probes. RocksDB earns its overhead when sustained write throughput is the bottleneck.

Practical Implications

The current query-ops crate works on in-memory Vec<Row> values and will implement semijoin and natural join with a transient hash on one side. The Cozo design suggests a clear upgrade path once a real substrate is added.

Short term: keep the in-memory operator and build a transient hash on the smaller side. This is correct, easy to test, and easy to reason about.
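A minimal sketch of the short-term operators, assuming `Row` is a vector of `i64` columns and each join uses a single column on each side. The actual `Row` type and operator signatures in query-ops are not shown in this note, so the names below are illustrative.

```rust
use std::collections::{HashMap, HashSet};

/// Assumed row shape for the sketch; the real crate may differ.
type Row = Vec<i64>;

/// Semijoin: keep the left rows whose key appears anywhere on the right.
fn semijoin(left: &[Row], lcol: usize, right: &[Row], rcol: usize) -> Vec<Row> {
    let keys: HashSet<i64> = right.iter().map(|r| r[rcol]).collect();
    left.iter().filter(|l| keys.contains(&l[lcol])).cloned().collect()
}

/// Equi-join with a transient hash table built on the smaller input.
/// Output rows are a's columns followed by b's columns minus b's join column.
fn hash_join(a: &[Row], acol: usize, b: &[Row], bcol: usize) -> Vec<Row> {
    let build_is_a = a.len() <= b.len();
    let (build, bc, probe, pc) = if build_is_a {
        (a, acol, b, bcol)
    } else {
        (b, bcol, a, acol)
    };

    // Build phase: key -> all rows of the smaller side with that key.
    let mut table: HashMap<i64, Vec<&Row>> = HashMap::new();
    for row in build {
        table.entry(row[bc]).or_default().push(row);
    }

    // Probe phase: stream the larger side through the table.
    let mut out = Vec::new();
    for p in probe {
        for &m in table.get(&p[pc]).into_iter().flatten() {
            let (ra, rb) = if build_is_a { (m, p) } else { (p, m) };
            let mut row: Row = ra.clone();
            row.extend(
                rb.iter()
                    .enumerate()
                    .filter(|(i, _)| *i != bcol)
                    .map(|(_, &v)| v),
            );
            out.push(row);
        }
    }
    out
}
```

Output order follows the probe side, so callers that need a stable order should sort the result.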

Medium term: when relations move into a substrate like LMDB, encode each relation so that the join columns sit at the prefix of the key, or use a DUPSORT database where the duplicate values carry the remaining columns. At that point the join operator becomes a cursor pattern (MDB_SET_RANGE followed by MDB_NEXT while the prefix matches), and the separate hash-building step disappears.
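The key-encoding half of that plan can be sketched independently of LMDB. The example assumes fixed-width `u32` columns and uses a `BTreeSet<Vec<u8>>` as a stand-in for the ordered substrate. `encode_key`, `decode_key`, and `lookup` are hypothetical helpers. Big-endian encoding is the important detail: it makes byte order agree with numeric order, so tuples that share the join columns sit next to each other.

```rust
use std::collections::BTreeSet;

/// Encode columns as fixed-width big-endian bytes, join columns first.
/// Big-endian keeps lexicographic byte order equal to numeric order.
fn encode_key(cols: &[u32]) -> Vec<u8> {
    cols.iter().flat_map(|c| c.to_be_bytes()).collect()
}

/// Inverse of `encode_key` for keys made of whole 4-byte columns.
fn decode_key(key: &[u8]) -> Vec<u32> {
    key.chunks_exact(4)
        .map(|c| u32::from_be_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}

/// The join probe as a scan: seek to the encoded join values, then read
/// forward while the stored key still starts with them.
fn lookup(store: &BTreeSet<Vec<u8>>, join_vals: &[u32]) -> Vec<Vec<u32>> {
    let prefix = encode_key(join_vals);
    store
        .range(prefix.clone()..)
        .take_while(|k| k.starts_with(&prefix))
        .map(|k| decode_key(k))
        .collect()
}
```

Variable-width columns such as strings would need a terminator or length-aware scheme so that one column's bytes cannot be mistaken for a prefix of another. That is outside this sketch.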

Index discipline: if a relation needs to be joined two different ways, store it twice with different prefix orders. There is no clever-indexing shortcut in either Cozo or LMDB, and trying to invent one is unlikely to be worth the cost.

The takeaway is that the operator surface in query-ops is fine for an in-memory prototype, but the substrate decision is the load-bearing one for performance. We do not need to design around it now, but the natural successor to the current operators is a key-encoding discipline rather than a more elaborate operator implementation.

Changelog

June 2, 2026 -- The first version of this document was made.
