How Query Engines Work Part 1
This note covers the foundation chapters from Andy Grove's book How Query Engines Work:
- What Is a Query Engine?
- Apache Arrow
- Choosing a Type System
- Data Sources
The main sources were:
- https://howqueryengineswork.com/
- the local companion repo in tmp/how-query-engines-work
Short answer
Part 1 frames a query engine as a specialized compiler for data work.
A user expresses what data they want in a declarative form such as SQL or a DataFrame API, and the engine is responsible for figuring out how to get it efficiently. The foundational design choices are:
- use columnar, batch-oriented data rather than row-at-a-time execution where possible
- make schemas and types explicit
- abstract over data sources behind a common interface
- treat in-memory representation as a first-class architectural decision, not an implementation detail
Core mental model
The book's basic pipeline is:
- parse the query text or API calls
- build an abstract query representation
- optimize or rewrite that representation
- execute a concrete plan against data sources
This is why query engines feel compiler-like:
- a declarative input gets translated into an executable form
- the engine can validate before execution
- the engine can rewrite for efficiency without changing semantics
The key difference from a general compiler is that the runtime target is data processing rather than machine code.
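The parse → plan → optimize → execute split can be sketched in a few lines. Everything below is illustrative; none of these types or functions are the book's actual API, and "parsing" is faked by constructing the plan a real parser would produce.

```kotlin
// Minimal sketch of the parse -> plan -> optimize -> execute split.
// Every name here is illustrative; none of this is the book's actual API.

sealed interface LogicalPlan
data class Scan(val table: String, val projection: List<String>) : LogicalPlan
data class Filter(val input: LogicalPlan, val predicate: String) : LogicalPlan
data class Project(val input: LogicalPlan, val columns: List<String>) : LogicalPlan

// "Parsing" is faked: we construct the plan a real parser would produce for
// SELECT name FROM users WHERE age > 18.
fun parse(sql: String): LogicalPlan =
    Project(Filter(Scan("users", emptyList()), "age > 18"), listOf("name"))

// One rewrite rule: push the needed columns down into the scan. A real rule
// would compute the filter's column references instead of hard-coding "age".
fun optimize(plan: LogicalPlan): LogicalPlan = when (plan) {
    is Project -> {
        val f = plan.input
        if (f is Filter && f.input is Scan) {
            val scan = (f.input as Scan).copy(projection = plan.columns + "age")
            Project(Filter(scan, f.predicate), plan.columns)
        } else plan
    }
    else -> plan
}

fun main() {
    println(optimize(parse("SELECT name FROM users WHERE age > 18")))
    // After the rewrite, the scan reads only the name and age columns.
}
```

The point of the sketch is the separation itself: the rewrite changes the plan without changing its meaning, which is exactly the freedom the declarative input buys.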
Important terminology
- declarative query: a request that says what result is wanted, not how to compute it
- query engine: software that turns declarative queries into results
- parsing: turning SQL text into a structured representation
- planning: deciding which logical operations are needed
- optimization: rewriting the plan into a more efficient equivalent
- execution: actually running operators over data
- schema: the names, types, and nullability of columns
- field: one named column in a schema
- column vector: one in-memory column of values
- record batch: a group of equal-length column vectors processed together
- data source: a backend that can expose a schema and stream batches
- projection: selecting only the needed columns
- projection pushdown: asking the data source to read only required columns
- row-based execution: processing one record at a time
- columnar execution: storing and often processing values by column
- vectorized execution: applying one operation across many values in a batch
Notes by chapter
What Is a Query Engine?
The book starts from the simplest possible idea: filtering a collection in code is already a tiny query engine. The real difference in production systems is scale, generality, and optimization.
The most important conceptual shift is:
- application code hard-codes the procedure
- query languages describe the desired result
- the engine gets freedom to choose the execution strategy
That separation matters because the same logical request can run:
- against a small file
- against a large local dataset
- against a distributed cluster
without changing the query text itself.
The book describes the core stages as:
- parsing
- planning
- optimization
- execution
That is a useful mental model for our own work too. Even if Geolog does not look like SQL, the same split applies: there is still a front-end language, an internal representation, possible rewrites, and some execution mechanism.
Apache Arrow
The book treats Arrow as the foundation for the engine's in-memory model.
The main reason is not just standardization. It is that query workloads usually touch a subset of columns, not full rows. A columnar layout means:
- better cache locality for the values actually being processed
- better compression characteristics
- better support for SIMD-style vectorized operations
Arrow's core structures are simple but important:
- a data buffer holding values
- a validity bitmap for null tracking
- an offset buffer for variable-width values such as strings
Arrow also introduces the concept that ends up mattering most operationally in the book: the RecordBatch, meaning a schema plus a set of equal-length columns processed as one unit.
The key takeaway is that the unit of execution should not be "the whole dataset" or "one row". It should usually be "one batch".
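To make the three-buffer layout concrete, here is a small sketch of a nullable string column stored Arrow-style. This is an illustration of the idea, not Arrow's actual implementation: real Arrow stores UTF-8 bytes and packs validity into a bitmap, where this sketch uses a StringBuilder and a boolean array.

```kotlin
// Sketch of a variable-width (string) column stored Arrow-style:
// one contiguous value buffer, an offset buffer, and a validity bitmap.
// Illustration only, not the real Arrow implementation.

class StringVector(values: List<String?>) {
    private val data = StringBuilder()             // value buffer (real Arrow: UTF-8 bytes)
    private val offsets = IntArray(values.size + 1) // offset buffer: value i spans offsets[i]..offsets[i+1]
    private val valid = BooleanArray(values.size)   // validity (real Arrow: packed bitmap)

    init {
        values.forEachIndexed { i, v ->
            if (v != null) { data.append(v); valid[i] = true }
            offsets[i + 1] = data.length            // null slots contribute zero width
        }
    }

    val size get() = valid.size
    fun isNull(i: Int) = !valid[i]
    fun get(i: Int): String? =
        if (isNull(i)) null else data.substring(offsets[i], offsets[i + 1])
}

fun main() {
    val col = StringVector(listOf("ab", null, "cde"))
    println(col.get(0)) // ab
    println(col.get(1)) // null
    println(col.get(2)) // cde
}
```

Note that all values live in one contiguous buffer; the offsets are what make variable-width data addressable without per-value allocations.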
Choosing a Type System
The type-system chapter argues that the engine must know more than raw values. It needs enough metadata to:
- reject invalid expressions early
- determine result types
- allocate correctly typed output storage
- skip unnecessary null checks when nullability is known
The design choice in the book is to build on Arrow's types instead of inventing a proprietary type system. That gives:
- a shared type vocabulary
- direct compatibility with Arrow-based formats and tools
- less impedance mismatch between storage and execution
The local companion code reflects this directly: its schema and field types are built on Arrow's type definitions rather than a custom type system.
Two ideas here are especially worth carrying forward:
- the schema is part of the contract all the way through the engine
- batches and vectors are the runtime currency of execution
The chapter also contrasts row-at-a-time iterator execution with batch-oriented vectorized execution. The row-at-a-time model is simpler, but it pays overhead per row. Batch-oriented processing reduces that overhead and opens the door to data-parallel computation.
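The contrast is easy to see side by side. The sketch below computes "price times quantity" both ways; it is illustrative only, not the book's operator implementations.

```kotlin
// Sketch contrasting the two execution styles on "price * qty".
// Illustrative only; not the book's operator implementations.

data class Row(val price: Double, val qty: Int)

// Row-at-a-time: one iteration step and one object per record.
fun totalsByRow(rows: Iterable<Row>): List<Double> =
    rows.map { it.price * it.qty }

// Batch-oriented: one tight loop over whole column vectors, no per-row objects.
// This is the loop shape that compilers can auto-vectorize.
fun totalsByBatch(price: DoubleArray, qty: IntArray): DoubleArray =
    DoubleArray(price.size) { i -> price[i] * qty[i] }

fun main() {
    val byRow = totalsByRow(listOf(Row(2.0, 3), Row(1.5, 2)))
    val byBatch = totalsByBatch(doubleArrayOf(2.0, 1.5), intArrayOf(3, 2))
    println(byRow)            // [6.0, 3.0]
    println(byBatch.toList()) // [6.0, 3.0]
}
```

The results are identical; the difference is per-row overhead (object creation, virtual dispatch) versus a tight primitive loop over the batch.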
Data Sources
The data-source chapter introduces a very clean abstraction: all backends should answer two questions.
- What is your schema?
- Can you scan and stream batches, optionally for only certain columns?
That shows up directly in the local code, where the interface is:

```kotlin
interface DataSource {
    fun schema(): Schema
    fun scan(projection: List<String>): Sequence<RecordBatch>
}
```
That is a small API, but it carries a lot of architectural weight:
- planning depends on schema()
- execution depends on scan(...)
- projection pushdown begins at the source boundary
- streaming is built in through Sequence<RecordBatch>
The CSV and Parquet discussion makes the practical tradeoff clear:
- CSV is simple and widespread but weakly typed and expensive to parse
- Parquet is much more engine-friendly because it is columnar, typed, and supports efficient projection
This suggests a general rule: data-source abstraction should hide source-specific details, but the engine should still exploit source-specific capabilities when they matter for performance.
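A minimal implementation makes the pushdown concrete: the source, not the engine, decides which columns ever get materialized. In this sketch, Schema and RecordBatch are simplified stand-ins, not the companion repo's Arrow-backed types.

```kotlin
// Sketch: an in-memory source honoring projection pushdown through the
// DataSource interface from the text. Schema and RecordBatch are simplified
// stand-ins, not the companion repo's Arrow-backed types.

data class Field(val name: String)
data class Schema(val fields: List<Field>)
data class RecordBatch(val schema: Schema, val columns: List<List<Any?>>)

interface DataSource {
    fun schema(): Schema
    fun scan(projection: List<String>): Sequence<RecordBatch>
}

class InMemorySource(
    private val tableSchema: Schema,
    private val columns: Map<String, List<Any?>>,
) : DataSource {
    override fun schema() = tableSchema

    override fun scan(projection: List<String>): Sequence<RecordBatch> {
        // Treat an empty projection as "all columns".
        val names = if (projection.isEmpty()) tableSchema.fields.map { it.name }
                    else projection
        // Only the requested columns cross the source boundary.
        return sequenceOf(RecordBatch(
            Schema(names.map { Field(it) }),
            names.map { columns.getValue(it) },
        ))
    }
}

fun main() {
    val source = InMemorySource(
        Schema(listOf(Field("id"), Field("name"), Field("age"))),
        mapOf("id" to listOf(1, 2), "name" to listOf("a", "b"), "age" to listOf(30, 40)),
    )
    val batch = source.scan(listOf("name")).first()
    println(batch.schema.fields.map { it.name }) // [name]
}
```

A Parquet-backed source could implement the same scan by passing the projection to the reader, so unneeded columns are never even read from disk; a CSV source must parse every row but can still avoid materializing unneeded columns.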
What seems most important
These are the main durable concepts from Part 1:
- A query engine is best understood as a planner and executor over declarative requests.
- The in-memory data model is a core design choice, not a storage afterthought.
- Columnar batches are a strong default for analytics-oriented execution.
- Schemas and nullability are execution metadata, not just documentation.
- Data sources should be abstracted behind a common interface.
- Pushdown starts at the boundary with the data source.
- A clean separation between logical intent and physical execution strategy is foundational.
Relevance to Geolog
There is not a one-to-one mapping from the book to Geolog, but several ideas transfer cleanly.
First, our existing notes already treat the IR as the contract between front-end work and execution work. The book reinforces that this is the right general shape. A query engine needs a representation that is stable enough for validation, rewriting, and execution.
Second, the book's distinction between declarative intent and execution strategy maps well onto Geolog's likely split between:
- source language and elaboration
- lowered IR
- runtime planning and execution
Third, the data-source abstraction looks very relevant if Geolog will query or check laws over multiple backends. A backend-neutral interface should probably expose:
- what data is available
- how its shape is described
- how to stream or otherwise iterate over it
- what pushdowns or source-side evaluation are possible
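As a purely hypothetical sketch, the four bullets above could take roughly the following shape, adapting the book's "schema plus scan" pattern. None of these names exist in Geolog today; they only illustrate the boundary being proposed.

```kotlin
// Purely hypothetical sketch of a backend-neutral Geolog source interface,
// adapting the book's "schema + scan" shape. None of these names exist in
// Geolog today; they only illustrate the four bullets above.

data class RelationSignature(val name: String, val columns: List<String>)

enum class Capability { PROJECTION, FILTER, SOURCE_SIDE_EVAL }

interface GeologSource {
    // What data is available, and how is its shape described?
    fun relations(): List<RelationSignature>

    // How do we stream or iterate over it? Optionally restricted to columns.
    fun scan(relation: String, projection: List<String>): Sequence<List<Any?>>

    // What can the backend evaluate on its own side (pushdown probing)?
    fun supports(cap: Capability): Boolean
}
```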
Fourth, the book is more relational and analytics-oriented than Geolog's logical/chase concerns, so some adaptation is needed. Columnar record batches are a natural fit for relational scans, filters, projections, aggregates, and joins. They are a less obvious fit for:
- existential witness generation
- branching search
- equality merging
- provenance-heavy chase steps
So the likely lesson is not "copy this architecture wholesale". The lesson is that Geolog will probably still need:
- a stable logical representation
- an execution-oriented runtime representation
- a clear backend boundary
- explicit decisions about the unit of execution
Questions worth keeping open
- Is the first Geolog executor fundamentally row-oriented, set-oriented, or batch-oriented?
- Do we want a strict split between logical planning and physical execution, or a thinner execution layer at first?
- What is the Geolog equivalent of Schema and RecordBatch?
- What source capabilities do we eventually want to push down: projection, filtering, joins, law checks?
- Is the first backend abstraction only about storage, or also about partial execution?
Bottom line
Part 1 is mostly about architectural posture.
Before building planners or optimizers, the book insists on getting four things straight:
- what a query engine is
- how data lives in memory
- how types flow through the engine
- how data enters the system
That is a good sequence. It suggests that for our own work, arguments about optimization or execution strategy should come after we are clear on:
- the contract representation
- the runtime unit of data
- the backend boundary
- the basic execution model
Changelog
- Mar 31, 2026 -- First version created.