# How Query Engines Work Part 1

This note covers the foundation chapters from Andy Grove's book _How Query Engines Work_:

- What Is a Query Engine?
- Apache Arrow
- Choosing a Type System
- Data Sources

The main sources were:

- https://howqueryengineswork.com/
- the local companion repo in `tmp/how-query-engines-work`

## Short answer

Part 1 frames a query engine as a specialized compiler for data work.

A user expresses _what_ data they want in a declarative form such as SQL or a DataFrame API, and the engine is responsible for figuring out _how_ to get it efficiently. The foundational design choices are:

- use columnar, batch-oriented data rather than row-at-a-time execution where possible
- make schemas and types explicit
- abstract over data sources behind a common interface
- treat the in-memory representation as a first-class architectural decision, not an implementation detail

## Core mental model

The book's basic pipeline is:

1. parse the query text or API calls
2. build an abstract query representation
3. optimize or rewrite that representation
4. execute a concrete plan against data sources

This is why query engines feel compiler-like:

- a declarative input gets translated into an executable form
- the engine can validate before execution
- the engine can rewrite for efficiency without changing semantics

The key difference from a general compiler is that the runtime target is data processing rather than machine code.
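
In code, the pipeline is roughly a composition of four functions. The following sketch uses hypothetical names and placeholder types, not the book's API:

```kotlin
// Hypothetical stage boundaries for the parse -> plan -> optimize -> execute pipeline.
// All names and types here are placeholders for illustration.

class Ast          // structured form of the query text
class LogicalPlan  // abstract query representation
class RecordBatch  // unit of data flowing out of execution

fun parse(sql: String): Ast = TODO("turn query text into a syntax tree")
fun plan(ast: Ast): LogicalPlan = TODO("map syntax onto logical operators")
fun optimize(plan: LogicalPlan): LogicalPlan = TODO("rewrite into a cheaper equivalent plan")
fun execute(plan: LogicalPlan): Sequence<RecordBatch> = TODO("stream batches from data sources")

// The engine as a whole is the composition of the stages.
fun run(sql: String): Sequence<RecordBatch> = execute(optimize(plan(parse(sql))))
```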

## Important terminology

- `declarative query`: a request that says what result is wanted, not how to compute it
- `query engine`: software that turns declarative queries into results
- `parsing`: turning SQL text into a structured representation
- `planning`: deciding which logical operations are needed
- `optimization`: rewriting the plan into a more efficient equivalent
- `execution`: actually running operators over data
- `schema`: the names, types, and nullability of columns
- `field`: one named column in a schema
- `column vector`: one in-memory column of values
- `record batch`: a group of equal-length column vectors processed together
- `data source`: a backend that can expose a schema and stream batches
- `projection`: selecting only the needed columns
- `projection pushdown`: asking the data source to read only the required columns
- `row-based execution`: processing one record at a time
- `columnar execution`: storing, and often processing, values by column
- `vectorized execution`: applying one operation across many values in a batch

## Notes by chapter

### What Is a Query Engine?

The book starts from the simplest possible idea: filtering a collection in code is already a tiny query engine. The real difference in production systems is scale, generality, and optimization.
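
For instance (an illustrative snippet, not the book's code), this one-liner already behaves like a hard-coded `WHERE` clause:

```kotlin
data class Employee(val name: String, val state: String)

// A "tiny query engine": the strategy (full scan, row at a time) is fixed in the code.
fun coloradoEmployees(employees: List<Employee>): List<Employee> =
    employees.filter { it.state == "CO" }

// The declarative equivalent leaves the strategy up to the engine:
//   SELECT * FROM employee WHERE state = 'CO'
```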

The most important conceptual shift is:

- application code hard-codes the procedure
- query languages describe the desired result
- the engine gets the freedom to choose the execution strategy

That separation matters because the same logical request can run:

- against a small file
- against a large local dataset
- against a distributed cluster

without changing the query text itself.

The book describes the core stages as:

1. parsing
2. planning
3. optimization
4. execution

That is a useful mental model for our own work too. Even if Geolog does not look like SQL, the same split applies: there is still a front-end language, an internal representation, possible rewrites, and some execution mechanism.

### Apache Arrow

The book treats Arrow as the foundation for the engine's in-memory model.

The main reason is not just standardization. It is that query workloads usually touch a subset of columns, not full rows. A columnar layout means:

- better cache locality for the values actually being processed
- better compression characteristics
- better support for SIMD-style vectorized operations

Arrow's core structures are simple but important:

- a data buffer holding values
- a validity bitmap for null tracking
- an offset buffer for variable-width values such as strings
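
As a concrete illustration (my example, not the book's), the string column `["foo", null, "bar"]` breaks down into the three buffers roughly like this:

```kotlin
// Rough sketch of Arrow-style buffers for the string column ["foo", null, "bar"].
// Real Arrow adds alignment and padding details; the shape is what matters here.
val validity = booleanArrayOf(true, false, true)   // validity bitmap: entry 1 is null
val offsets = intArrayOf(0, 3, 3, 6)               // value i spans offsets[i] until offsets[i + 1]
val data = "foobar".toByteArray(Charsets.UTF_8)    // one contiguous buffer, no per-value objects

fun stringAt(i: Int): String? =
    if (!validity[i]) null
    else String(data, offsets[i], offsets[i + 1] - offsets[i], Charsets.UTF_8)
```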

Arrow also introduces the concept that ends up mattering most operationally in the book: the `RecordBatch`, meaning a schema plus a set of equal-length columns processed as one unit.

The key takeaway is that the unit of execution should not be "the whole dataset" or "one row". It should usually be "one batch".

### Choosing a Type System

The type-system chapter argues that the engine must know more than raw values. It needs enough metadata to:

- reject invalid expressions early
- determine result types
- allocate correctly typed output storage
- skip unnecessary null checks when nullability is known

The design choice in the book is to build on Arrow's types instead of inventing a proprietary type system. That gives:

- a shared type vocabulary
- direct compatibility with Arrow-based formats and tools
- less impedance mismatch between storage and execution

The local companion code reflects this directly:

- [Schema.kt](/home/hassan/Workspace/useful-notes/tmp/how-query-engines-work/datatypes/src/main/kotlin/Schema.kt)
- [ColumnVector.kt](/home/hassan/Workspace/useful-notes/tmp/how-query-engines-work/datatypes/src/main/kotlin/ColumnVector.kt)
- [RecordBatch.kt](/home/hassan/Workspace/useful-notes/tmp/how-query-engines-work/datatypes/src/main/kotlin/RecordBatch.kt)
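
From memory, the shapes in those files look roughly like this (a simplified sketch, so check the linked files for the real definitions; the repo's versions wrap Arrow's own types and vectors):

```kotlin
import org.apache.arrow.vector.types.pojo.ArrowType

// Simplified shapes only; the companion repo's real definitions differ in detail.
data class Field(val name: String, val dataType: ArrowType)

data class Schema(val fields: List<Field>)

interface ColumnVector {
    fun getType(): ArrowType
    fun getValue(i: Int): Any?   // null where the validity bitmap says so
    fun size(): Int
}

class RecordBatch(val schema: Schema, val fields: List<ColumnVector>) {
    fun rowCount(): Int = fields.first().size()   // all vectors share one length
    fun columnCount(): Int = fields.size
}
```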

Two ideas here are especially worth carrying forward:

- the schema is part of the contract all the way through the engine
- batches and vectors are the runtime currency of execution

The chapter also contrasts row-at-a-time iterator execution with batch-oriented vectorized execution. The row-at-a-time model is simpler, but it pays overhead per row. Batch-oriented processing reduces that overhead and opens the door to data-parallel computation.
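
The contrast is easiest to see in code (a schematic comparison of mine, not the book's):

```kotlin
// Row-at-a-time: one iterator call and one branch per value.
fun sumRows(rows: Iterator<Int>): Long {
    var total = 0L
    while (rows.hasNext()) total += rows.next()
    return total
}

// Batch-oriented: per-call overhead is paid once per batch, and the inner
// loop over a plain array is tight enough for the JIT to unroll or vectorize.
fun sumBatches(batches: Sequence<IntArray>): Long {
    var total = 0L
    for (batch in batches) {
        for (v in batch) total += v
    }
    return total
}
```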

### Data Sources

The data-source chapter introduces a very clean abstraction: all backends should answer two questions.

1. What is your schema?
2. Can you scan and stream batches, optionally for only certain columns?

That shows up directly in the local code:

- [DataSource.kt](/home/hassan/Workspace/useful-notes/tmp/how-query-engines-work/datasource/src/main/kotlin/DataSource.kt)

This is the interface:

```kotlin
interface DataSource {
    fun schema(): Schema
    fun scan(projection: List<String>): Sequence<RecordBatch>
}
```

That is a small API, but it carries a lot of architectural weight:

- planning depends on `schema()`
- execution depends on `scan(...)`
- projection pushdown begins at the source boundary
- streaming is built in through `Sequence<RecordBatch>`
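
To make the pushdown point concrete, here is a minimal in-memory implementation of the interface (my sketch, reusing the `Schema`/`RecordBatch` shapes sketched earlier and assuming an empty projection means all columns):

```kotlin
// Minimal in-memory source; illustrative only. The key point is that the
// projection is honored before any batch is built, so unneeded columns are
// never materialized; pushdown starts right here at the source boundary.
class InMemoryDataSource(
    private val tableSchema: Schema,
    private val columns: Map<String, ColumnVector>,   // column name -> full vector
) : DataSource {

    override fun schema(): Schema = tableSchema

    override fun scan(projection: List<String>): Sequence<RecordBatch> {
        // Assumption: an empty projection means "all columns".
        val names = if (projection.isEmpty()) tableSchema.fields.map { it.name } else projection
        val projectedSchema = Schema(tableSchema.fields.filter { it.name in names })
        val projectedColumns = names.map { columns.getValue(it) }
        // A real source would chunk rows into multiple batches; one batch keeps this short.
        return sequenceOf(RecordBatch(projectedSchema, projectedColumns))
    }
}
```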

The CSV and Parquet discussion makes the practical tradeoff clear:

- CSV is simple and widespread but weakly typed and expensive to parse
- Parquet is much more engine-friendly because it is columnar, typed, and supports efficient projection

This suggests a general rule: the data-source abstraction should hide source-specific details, but the engine should still exploit source-specific capabilities when they matter for performance.

## What seems most important

These are the main durable concepts from Part 1:

- A query engine is best understood as a planner and executor over declarative requests.
- The in-memory data model is a core design choice, not a storage afterthought.
- Columnar batches are a strong default for analytics-oriented execution.
- Schemas and nullability are execution metadata, not just documentation.
- Data sources should be abstracted behind a common interface.
- Pushdown starts at the boundary with the data source.
- A clean separation between logical intent and physical execution strategy is foundational.

## Relevance to Geolog

There is not a one-to-one mapping from the book to Geolog, but several ideas transfer cleanly.

First, our existing notes already treat the IR as the contract between front-end work and execution work. The book reinforces that this is the right general shape. A query engine needs a representation that is stable enough for validation, rewriting, and execution.

Second, the book's distinction between declarative intent and execution strategy maps well onto Geolog's likely split between:

- source language and elaboration
- lowered IR
- runtime planning and execution

Third, the data-source abstraction looks very relevant if Geolog will query or check laws over multiple backends. A backend-neutral interface should probably expose:

- what data is available
- how its shape is described
- how to stream or otherwise iterate over it
- what pushdowns or source-side evaluation are possible

Fourth, the book is more relational and analytics-oriented than Geolog's logical/chase concerns, so some adaptation is needed. Columnar record batches are a natural fit for relational scans, filters, projections, aggregates, and joins. They are a less obvious fit for:

- existential witness generation
- branching search
- equality merging
- provenance-heavy chase steps

So the likely lesson is not "copy this architecture wholesale". The lesson is that Geolog will probably still need:

- a stable logical representation
- an execution-oriented runtime representation
- a clear backend boundary
- explicit decisions about the unit of execution

## Questions worth keeping open

- Is the first Geolog executor fundamentally row-oriented, set-oriented, or batch-oriented?
- Do we want a strict split between logical planning and physical execution, or a thinner execution layer at first?
- What is the Geolog equivalent of `Schema` and `RecordBatch`?
- What source capabilities do we eventually want to push down: projection, filtering, joins, law checks?
- Is the first backend abstraction only about storage, or also about partial execution?

## Bottom line

Part 1 is mostly about architectural posture.

Before building planners or optimizers, the book insists on getting four things straight:

- what a query engine is
- how data lives in memory
- how types flow through the engine
- how data enters the system

That is a good sequence. It suggests that for our own work, arguments about optimization or execution strategy should come after we are clear on:

- the contract representation
- the runtime unit of data
- the backend boundary
- the basic execution model

## Changelog

* **Mar 31, 2026** -- First version created.