How Query Engines Work Part 1
This note covers the foundation chapters from Andy Grove's book How Query Engines Work:
- What Is a Query Engine?
- Apache Arrow
- Choosing a Type System
- Data Sources
The main sources were:
- https://howqueryengineswork.com/
- the local companion repo in tmp/how-query-engines-work
Short answer
Part 1 frames a query engine as a specialized compiler for data work.
A user expresses what data they want in a declarative form such as SQL or a DataFrame API, and the engine is responsible for figuring out how to get it efficiently. The foundational design choices are:
- use columnar, batch-oriented data rather than row-at-a-time execution where possible
- make schemas and types explicit
- abstract over data sources behind a common interface
- treat in-memory representation as a first-class architectural decision, not an implementation detail
Core mental model
The book's basic pipeline is:
- parse the query text or API calls
- build an abstract query representation
- optimize or rewrite that representation
- execute a concrete plan against data sources
This is why query engines feel compiler-like:
- a declarative input gets translated into an executable form
- the engine can validate before execution
- the engine can rewrite for efficiency without changing semantics
The key difference from a general compiler is that the runtime target is data processing rather than machine code.
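The parse → plan → optimize → execute split can be sketched in a few lines. Everything below is illustrative; none of these types or functions are the book's actual API, and "parsing" is faked by constructing the plan a real parser would produce.

```kotlin
// Minimal sketch of the parse -> plan -> optimize -> execute split.
// Every name here is illustrative; none of this is the book's actual API.

sealed interface LogicalPlan
data class Scan(val table: String, val projection: List<String>) : LogicalPlan
data class Filter(val input: LogicalPlan, val predicate: String) : LogicalPlan
data class Project(val input: LogicalPlan, val columns: List<String>) : LogicalPlan

// "Parsing" is faked: we construct the plan a real parser would produce for
// SELECT name FROM users WHERE age > 18.
fun parse(sql: String): LogicalPlan =
    Project(Filter(Scan("users", emptyList()), "age > 18"), listOf("name"))

// One rewrite rule: push the needed columns down into the scan. A real rule
// would compute the filter's column references instead of hard-coding "age".
fun optimize(plan: LogicalPlan): LogicalPlan = when (plan) {
    is Project -> {
        val f = plan.input
        if (f is Filter && f.input is Scan) {
            val scan = (f.input as Scan).copy(projection = plan.columns + "age")
            Project(Filter(scan, f.predicate), plan.columns)
        } else plan
    }
    else -> plan
}

fun main() {
    println(optimize(parse("SELECT name FROM users WHERE age > 18")))
    // After the rewrite, the scan reads only the name and age columns.
}
```

The point of the sketch is the separation itself: the rewrite changes the plan without changing its meaning, which is exactly the freedom the declarative input buys.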
Important terminology
- declarative query: a request that says what result is wanted, not how to compute it
- query engine: software that turns declarative queries into results
- parsing: turning SQL text into a structured representation
- planning: deciding which logical operations are needed
- optimization: rewriting the plan into a more efficient equivalent
- execution: actually running operators over data
- schema: the names, types, and nullability of columns
- field: one named column in a schema
- column vector: one in-memory column of values
- record batch: a group of equal-length column vectors processed together
- data source: a backend that can expose a schema and stream batches
- projection: selecting only the needed columns
- projection pushdown: asking the data source to read only required columns
- row-based execution: processing one record at a time
- columnar execution: storing and often processing values by column
- vectorized execution: applying one operation across many values in a batch
Notes by chapter
What Is a Query Engine?
The book starts from the simplest possible idea: filtering a collection in code is already a tiny query engine. The real difference in production systems is scale, generality, and optimization.
The most important conceptual shift is:
- application code hard-codes the procedure
- query languages describe the desired result
- the engine gets freedom to choose the execution strategy
That separation matters because the same logical request can run:
- against a small file
- against a large local dataset
- against a distributed cluster
without changing the query text itself.
The book describes the core stages as:
- parsing
- planning
- optimization
- execution
That is a useful mental model for our own work too. Even if Geolog does not look like SQL, the same split applies: there is still a front-end language, an internal representation, possible rewrites, and some execution mechanism.
Apache Arrow
The book treats Arrow as the foundation for the engine's in-memory model.
The main reason is not just standardization. It is that query workloads usually touch a subset of columns, not full rows. A columnar layout means:
- better cache locality for the values actually being processed
- better compression characteristics
- better support for SIMD-style vectorized operations
Arrow's core structures are simple but important:
- a data buffer holding values
- a validity bitmap for null tracking
- an offset buffer for variable-width values such as strings
Arrow also introduces the concept that ends up mattering most operationally in the book: the RecordBatch, meaning a schema plus a set of equal-length columns processed as one unit.
The key takeaway is that the unit of execution should not be "the whole dataset" or "one row". It should usually be "one batch".
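To make the three-buffer layout concrete, here is a small sketch of a nullable string column stored Arrow-style. This is an illustration of the idea, not Arrow's actual implementation: real Arrow stores UTF-8 bytes and packs validity into a bitmap, where this sketch uses a StringBuilder and a boolean array.

```kotlin
// Sketch of a variable-width (string) column stored Arrow-style:
// one contiguous value buffer, an offset buffer, and a validity bitmap.
// Illustration only, not the real Arrow implementation.

class StringVector(values: List<String?>) {
    private val data = StringBuilder()             // value buffer (real Arrow: UTF-8 bytes)
    private val offsets = IntArray(values.size + 1) // offset buffer: value i spans offsets[i]..offsets[i+1]
    private val valid = BooleanArray(values.size)   // validity (real Arrow: packed bitmap)

    init {
        values.forEachIndexed { i, v ->
            if (v != null) { data.append(v); valid[i] = true }
            offsets[i + 1] = data.length            // null slots contribute zero width
        }
    }

    val size get() = valid.size
    fun isNull(i: Int) = !valid[i]
    fun get(i: Int): String? =
        if (isNull(i)) null else data.substring(offsets[i], offsets[i + 1])
}

fun main() {
    val col = StringVector(listOf("ab", null, "cde"))
    println(col.get(0)) // ab
    println(col.get(1)) // null
    println(col.get(2)) // cde
}
```

Note that all values live in one contiguous buffer; the offsets are what make variable-width data addressable without per-value allocations.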
Choosing a Type System
The type-system chapter argues that the engine must know more than raw values. It needs enough metadata to:
- reject invalid expressions early
- determine result types
- allocate correctly typed output storage
- skip unnecessary null checks when nullability is known
The design choice in the book is to build on Arrow's types instead of inventing a proprietary type system. That gives:
- a shared type vocabulary
- direct compatibility with Arrow-based formats and tools
- less impedance mismatch between storage and execution
The local companion code reflects this directly: its schema and field types are built on Arrow's type definitions rather than a custom type system.
Two ideas here are especially worth carrying forward:
- the schema is part of the contract all the way through the engine
- batches and vectors are the runtime currency of execution
The chapter also contrasts row-at-a-time iterator execution with batch-oriented vectorized execution. The row-at-a-time model is simpler, but it pays overhead per row. Batch-oriented processing reduces that overhead and opens the door to data-parallel computation.
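The contrast is easy to see side by side. The sketch below computes "price times quantity" both ways; it is illustrative only, not the book's operator implementations.

```kotlin
// Sketch contrasting the two execution styles on "price * qty".
// Illustrative only; not the book's operator implementations.

data class Row(val price: Double, val qty: Int)

// Row-at-a-time: one iteration step and one object per record.
fun totalsByRow(rows: Iterable<Row>): List<Double> =
    rows.map { it.price * it.qty }

// Batch-oriented: one tight loop over whole column vectors, no per-row objects.
// This is the loop shape that compilers can auto-vectorize.
fun totalsByBatch(price: DoubleArray, qty: IntArray): DoubleArray =
    DoubleArray(price.size) { i -> price[i] * qty[i] }

fun main() {
    val byRow = totalsByRow(listOf(Row(2.0, 3), Row(1.5, 2)))
    val byBatch = totalsByBatch(doubleArrayOf(2.0, 1.5), intArrayOf(3, 2))
    println(byRow)            // [6.0, 3.0]
    println(byBatch.toList()) // [6.0, 3.0]
}
```

The results are identical; the difference is per-row overhead (object creation, virtual dispatch) versus a tight primitive loop over the batch.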
Data Sources
The data-source chapter introduces a very clean abstraction: all backends should answer two questions.
- What is your schema?
- Can you scan and stream batches, optionally for only certain columns?
That shows up directly in the local code, where the interface is:

```kotlin
interface DataSource {
    fun schema(): Schema
    fun scan(projection: List<String>): Sequence<RecordBatch>
}
```
That is a small API, but it carries a lot of architectural weight:
- planning depends on schema()
- execution depends on scan(...)
- projection pushdown begins at the source boundary
- streaming is built in through Sequence<RecordBatch>
The CSV and Parquet discussion makes the practical tradeoff clear:
- CSV is simple and widespread but weakly typed and expensive to parse
- Parquet is much more engine-friendly because it is columnar, typed, and supports efficient projection
This suggests a general rule: data-source abstraction should hide source-specific details, but the engine should still exploit source-specific capabilities when they matter for performance.
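A minimal implementation makes the pushdown concrete: the source, not the engine, decides which columns ever get materialized. In this sketch, Schema and RecordBatch are simplified stand-ins, not the companion repo's Arrow-backed types.

```kotlin
// Sketch: an in-memory source honoring projection pushdown through the
// DataSource interface from the text. Schema and RecordBatch are simplified
// stand-ins, not the companion repo's Arrow-backed types.

data class Field(val name: String)
data class Schema(val fields: List<Field>)
data class RecordBatch(val schema: Schema, val columns: List<List<Any?>>)

interface DataSource {
    fun schema(): Schema
    fun scan(projection: List<String>): Sequence<RecordBatch>
}

class InMemorySource(
    private val tableSchema: Schema,
    private val columns: Map<String, List<Any?>>,
) : DataSource {
    override fun schema() = tableSchema

    override fun scan(projection: List<String>): Sequence<RecordBatch> {
        // Treat an empty projection as "all columns".
        val names = if (projection.isEmpty()) tableSchema.fields.map { it.name }
                    else projection
        // Only the requested columns cross the source boundary.
        return sequenceOf(RecordBatch(
            Schema(names.map { Field(it) }),
            names.map { columns.getValue(it) },
        ))
    }
}

fun main() {
    val source = InMemorySource(
        Schema(listOf(Field("id"), Field("name"), Field("age"))),
        mapOf("id" to listOf(1, 2), "name" to listOf("a", "b"), "age" to listOf(30, 40)),
    )
    val batch = source.scan(listOf("name")).first()
    println(batch.schema.fields.map { it.name }) // [name]
}
```

A Parquet-backed source could implement the same scan by passing the projection to the reader, so unneeded columns are never even read from disk; a CSV source must parse every row but can still avoid materializing unneeded columns.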
What seems most important
These are the main durable concepts from Part 1:
- A query engine is best understood as a planner and executor over declarative requests.
- The in-memory data model is a core design choice, not a storage afterthought.
- Columnar batches are a strong default for analytics-oriented execution.
- Schemas and nullability are execution metadata, not just documentation.
- Data sources should be abstracted behind a common interface.
- Pushdown starts at the boundary with the data source.
- A clean separation between logical intent and physical execution strategy is foundational.
Relevance to Geolog
There is not a one-to-one mapping from the book to Geolog, but several ideas transfer cleanly.
First, our existing notes already treat the IR as the contract between front-end work and execution work. The book reinforces that this is the right general shape. A query engine needs a representation that is stable enough for validation, rewriting, and execution.
Second, the book's distinction between declarative intent and execution strategy maps well onto Geolog's likely split between:
- source language and elaboration
- lowered IR
- runtime planning and execution
Third, the data-source abstraction looks very relevant if Geolog will query or check laws over multiple backends. A backend-neutral interface should probably expose:
- what data is available
- how its shape is described
- how to stream or otherwise iterate over it
- what pushdowns or source-side evaluation are possible
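As a purely hypothetical sketch, the four bullets above could take roughly the following shape, adapting the book's "schema plus scan" pattern. None of these names exist in Geolog today; they only illustrate the boundary being proposed.

```kotlin
// Purely hypothetical sketch of a backend-neutral Geolog source interface,
// adapting the book's "schema + scan" shape. None of these names exist in
// Geolog today; they only illustrate the four bullets above.

data class RelationSignature(val name: String, val columns: List<String>)

enum class Capability { PROJECTION, FILTER, SOURCE_SIDE_EVAL }

interface GeologSource {
    // What data is available, and how is its shape described?
    fun relations(): List<RelationSignature>

    // How do we stream or iterate over it? Optionally restricted to columns.
    fun scan(relation: String, projection: List<String>): Sequence<List<Any?>>

    // What can the backend evaluate on its own side (pushdown probing)?
    fun supports(cap: Capability): Boolean
}
```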
Fourth, the book is more relational and analytics-oriented than Geolog's logical/chase concerns, so some adaptation is needed. Columnar record batches are a natural fit for relational scans, filters, projections, aggregates, and joins. They are a less obvious fit for:
- existential witness generation
- branching search
- equality merging
- provenance-heavy chase steps
So the likely lesson is not "copy this architecture wholesale". The lesson is that Geolog will probably still need:
- a stable logical representation
- an execution-oriented runtime representation
- a clear backend boundary
- explicit decisions about the unit of execution
Questions worth keeping open
- Is the first Geolog executor fundamentally row-oriented, set-oriented, or batch-oriented?
- Do we want a strict split between logical planning and physical execution, or a thinner execution layer at first?
- What is the Geolog equivalent of Schema and RecordBatch?
- What source capabilities do we eventually want to push down: projection, filtering, joins, law checks?
- Is the first backend abstraction only about storage, or also about partial execution?
Bottom line
Part 1 is mostly about architectural posture.
Before building planners or optimizers, the book insists on getting four things straight:
- what a query engine is
- how data lives in memory
- how types flow through the engine
- how data enters the system
That is a good sequence. It suggests that for our own work, arguments about optimization or execution strategy should come after we are clear on:
- the contract representation
- the runtime unit of data
- the backend boundary
- the basic execution model
Changelog
- Mar 31, 2026 -- First version created.