Add preliminary notes on query engines (architecture and internals)

Hassan Abedi 2026-03-31 11:09:56 +02:00
parent 4082b42ac6
commit 7100a757b3
3 changed files with 559 additions and 0 deletions


@@ -0,0 +1,51 @@
## Query Engine Glossary
* **Query Engine** — A query engine is the component that takes a declarative request for data and produces the result. The user says what they
want, and the engine decides how to compute it.
* **Declarative Query** — A query that describes the desired result rather than the exact procedure. SQL is the standard example: you ask for rows
matching some condition, not a specific loop nest.
* **Schema** — The structural description of data: column names, column types, and usually nullability. The schema is the contract that planning and
execution rely on.
* **Field** — One named column inside a schema. A field has a name and a type.
* **Type System** — The set of value types the engine understands and the rules for how expressions over those values behave. It lets the engine
reject invalid queries early and allocate correctly typed outputs.
* **Nullability** — Whether a column or expression may contain missing values. This matters for both semantics and execution because null handling can
add branching and bookkeeping.
* **Row** — One logical record in a table. In a row-oriented system, processing often happens one record at a time.
* **Column** — All values for one field across many rows. In a columnar engine, values from the same column are stored together in memory.
* **Column Vector** — An in-memory representation of one column's values. It is the basic runtime container used by columnar engines.
* **Record Batch** — A chunk of data consisting of several equal-length column vectors plus a schema. It is a common unit of execution in modern
columnar engines.
* **Apache Arrow** — A standard in-memory columnar format. It gives query engines a shared representation for schemas, arrays, null bitmaps, and
record batches.
* **Data Source** — Anything the engine can read from: CSV files, Parquet files, databases, object stores, or in-memory tables. A clean engine
usually abstracts these behind a common interface.
* **Scan** — The operator that reads data from a source into the engine. It is usually the leaf node of a query plan.
* **Projection** — Selecting only certain columns or computing new expressions from existing ones. In SQL this is mostly the `SELECT` list.
* **Filter / Selection** — Removing rows that do not satisfy a predicate. In SQL this is usually the `WHERE` clause.
* **Predicate** — A boolean expression used to decide whether a row should be kept, such as `age > 18`.
* **Projection Pushdown** — Asking the data source to read only the needed columns instead of materializing everything first.
* **Predicate Pushdown** — Asking the data source to apply filtering as early as possible, ideally before data is moved into the main engine.
* **Logical Plan** — A representation of what operations are required to answer the query, independent of low-level execution details. It describes
intent rather than exact mechanics.
* **Physical Plan** — A concrete executable strategy for the query. It chooses specific operator implementations and an execution order.
* **Optimizer** — The component that rewrites plans into equivalent but more efficient forms. Examples include pruning unused columns, pushing filters
down, and reordering joins.
* **Operator** — One node in a plan, such as scan, projection, filter, join, aggregate, or limit.
* **Expression** — A computation inside an operator, such as `price * quantity`, `a = b`, or `SUM(total)`.
* **Join** — Combining rows from two inputs based on some matching condition. This is one of the most important and expensive relational operators.
* **Aggregate** — Collapsing many rows into summary values such as `COUNT`, `SUM`, `MIN`, `MAX`, or grouped results.
* **Execution Model** — The style in which operators run: row-at-a-time, batch-oriented, vectorized, pipelined, or materializing intermediate
results.
* **Row-Oriented Execution** — Processing one row at a time through a chain of operators. It is simple to understand but often pays overhead per row.
* **Vectorized Execution** — Processing a batch of values together, typically one column at a time. This reduces per-row overhead and works well with
columnar memory layouts.
* **Materialization** — Fully storing an intermediate result before the next operator consumes it. This can simplify execution but costs memory and
latency.
* **Pipeline** — A flow where one operator produces data incrementally and the next consumes it without waiting for the entire input to finish.
* **Backend** — The concrete storage or execution target under the engine, such as an in-memory table layer, Parquet files, Postgres, or a
distributed service.
## Changelog
* **Mar 31, 2026** -- The first version was created.

262
hqew/query-engine-primer.md Normal file

@@ -0,0 +1,262 @@
# Query Engine Primer
A reference for the core ideas behind how a modern query engine works.
---
## Short definition
A query engine takes a declarative request for data and turns it into an executable plan.
The important separation is:
- the user says what result they want
- the engine decides how to compute it
That freedom is what makes optimization possible.
---
## Big picture pipeline
Most query engines follow roughly this shape:
1. accept a query from SQL, a DataFrame API, or some other front-end language
2. parse it into a structured representation
3. build a logical plan describing the required operations
4. optimize or rewrite that plan
5. choose a physical plan with concrete execution operators
6. execute the plan against one or more data sources
7. return the result as rows, batches, or some client-facing format
At a high level, a query engine behaves like a small compiler:
- parsing corresponds to front-end syntax work
- logical planning corresponds to building an internal representation
- optimization corresponds to semantics-preserving rewrites
- physical planning and execution correspond to code generation and runtime
---
## The main pieces
### Parser
The parser turns the query text into a structured form such as an abstract syntax tree. For SQL, this means turning raw text into nodes like
`SELECT`, `FROM`, `WHERE`, `GROUP BY`, and expressions.
### Logical Plan
The logical plan captures the meaning of the query in terms of relational operators. It usually includes operators such as:
- scan
- projection
- filter
- join
- aggregate
- limit
At this stage the plan says what needs to happen, not exactly how each step will run.
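To make this concrete, here is a standalone Kotlin sketch of a logical plan as a small tree of operators. The names and shapes (`LogicalPlan`, `Scan`, `Filter`, `Projection`, output columns as plain strings) are simplifications made up for this note, not any engine's real API.
```kotlin
// Standalone sketch: a logical plan as a tree of operators.
// Output schemas are simplified to lists of column names.
sealed interface LogicalPlan {
    fun outputColumns(): List<String>   // what this node produces
    fun children(): List<LogicalPlan>   // what it reads from
}

data class Scan(val table: String, val columns: List<String>) : LogicalPlan {
    override fun outputColumns() = columns
    override fun children() = emptyList<LogicalPlan>()
}

data class Filter(val input: LogicalPlan, val predicate: String) : LogicalPlan {
    override fun outputColumns() = input.outputColumns()   // filtering keeps the shape
    override fun children() = listOf(input)
}

data class Projection(val input: LogicalPlan, val columns: List<String>) : LogicalPlan {
    override fun outputColumns() = columns                  // output is exactly what is projected
    override fun children() = listOf(input)
}
```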
### Optimizer
The optimizer rewrites the logical plan into an equivalent but cheaper form.
Typical optimizations include:
- removing columns that are never used
- pushing filters closer to the data source
- simplifying expressions
- reordering joins
- replacing generic operators with cheaper specialized ones
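As a sketch of what one such rewrite looks like in code, here is a rule in the spirit of projection pushdown, reusing the illustrative plan types from the sketch above; a real optimizer applies many rules across the whole tree until nothing changes.
```kotlin
// Illustrative rule: when a Projection sits directly on a Scan, ask the scan
// for only the projected columns. The general case (pushing through filters,
// joins, and expressions) needs column-usage analysis and is deliberately
// not handled here.
fun pushProjectionIntoScan(plan: LogicalPlan): LogicalPlan = when (plan) {
    is Projection -> when (val child = plan.input) {
        is Scan -> Projection(child.copy(columns = plan.columns), plan.columns)
        else    -> Projection(pushProjectionIntoScan(child), plan.columns)
    }
    is Filter -> Filter(pushProjectionIntoScan(plan.input), plan.predicate)
    is Scan   -> plan
}
```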
### Physical Plan
The physical plan chooses specific implementations for each operator.
For example, a logical join might become:
- hash join
- sort-merge join
- nested-loop join
This is where the engine commits to an execution strategy.
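To show what committing to one implementation means, here is the hash-join idea in a few lines of plain Kotlin, using simple maps as rows rather than the batch types discussed later; it illustrates the algorithm, not an engine-grade operator.
```kotlin
// Build a hash table on one input's join key, then probe it with the other.
// (SQL semantics would additionally drop null join keys.)
fun hashJoin(
    left: List<Map<String, Any?>>,
    right: List<Map<String, Any?>>,
    key: String,
): List<Map<String, Any?>> {
    val built = left.groupBy { it[key] }                   // build phase
    return right.flatMap { row ->                          // probe phase
        (built[row[key]] ?: emptyList()).map { it + row }  // merge matching rows
    }
}
```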
### Executor
The executor runs the physical plan. It requests data from child operators, processes it, and produces output for parent operators or the client.
---
## Logical plan vs physical plan
This distinction is one of the most important ideas in query processing.
| Layer | Main question answered | Example |
|:--------------|:------------------------------------|:--------|
| Logical plan | What operations are required? | Scan `employees`, filter `age > 18`, project `name` |
| Physical plan | How should those operations run? | Use Parquet scan, push down predicate, run vectorized filter |
Logical plans are about meaning.
Physical plans are about execution mechanics.
Keeping those layers separate gives the engine room to optimize without changing the query's semantics.
---
## The data model inside the engine
Modern analytical engines often use a columnar, batch-oriented model rather than processing one row at a time.
The core concepts are:
- `Schema`: the names and types of columns
- `Field`: one column inside a schema
- `ColumnVector`: the in-memory values for one column
- `RecordBatch`: a schema plus a set of equal-length columns
This matters because analytical workloads often read only a few columns from very large datasets. A columnar layout makes those accesses much more
efficient.
It also works well with vectorized execution, where one operator applies the same computation across many values in a batch.
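A minimal Kotlin sketch of these four pieces, using plain lists instead of Arrow buffers, looks like this; the names follow the list above, but the shapes are simplifications for this note.
```kotlin
// Simplified data model: a real engine would back these with Arrow arrays.
data class Field(val name: String, val type: String)
data class Schema(val fields: List<Field>)

// All values of one column for a batch; null entries are the missing values.
class ColumnVector(val field: Field, val values: List<Any?>) {
    fun size() = values.size
}

// A schema plus equal-length column vectors: the unit the engine executes on.
class RecordBatch(val schema: Schema, val columns: List<ColumnVector>) {
    fun rowCount() = if (columns.isEmpty()) 0 else columns[0].size()
}
```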
---
## Why columnar and batch-oriented execution matter
Compared with row-at-a-time execution, columnar batches have several advantages:
- less overhead per record
- better cache behavior
- better compression
- easier use of SIMD-style operations
- simpler projection of only the needed columns
The tradeoff is that batch-oriented systems can be more complex to build and may be less natural for transactional or record-by-record workloads.
So "best" depends on the workload, but for analytics, columnar batches are a strong default.
---
## The data source boundary
Every engine needs a boundary where external data enters the system.
A good data-source abstraction usually answers two questions:
1. What is the schema?
2. Can you scan the data, ideally with some pushdowns?
This is where the engine starts exploiting source capabilities such as:
- projection pushdown
- predicate pushdown
- partition pruning
- file-format-specific decoding
The abstraction should be simple enough to unify many backends, but rich enough to expose useful performance features.
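A sketch of that boundary, reusing the simplified `Schema` and `RecordBatch` types from the data-model sketch above: the interface mirrors the two questions, and `InMemoryTable` is a backend made up for this note to show projection pushdown in miniature.
```kotlin
interface DataSource {
    fun schema(): Schema
    fun scan(projection: List<String>): Sequence<RecordBatch>
}

class InMemoryTable(
    private val tableSchema: Schema,
    private val batches: List<RecordBatch>,
) : DataSource {
    override fun schema() = tableSchema
    // Projection pushdown in miniature: only the requested columns leave the source.
    override fun scan(projection: List<String>): Sequence<RecordBatch> =
        batches.asSequence().map { batch ->
            val kept = batch.columns.filter { it.field.name in projection }
            RecordBatch(Schema(kept.map { it.field }), kept)
        }
}
```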
---
## Common operators
### Scan
Reads data from a source into the engine.
### Projection
Chooses columns or computes derived expressions.
### Filter
Keeps only rows that satisfy a predicate.
### Join
Combines rows from multiple inputs using matching keys or conditions.
### Aggregate
Computes summary results such as counts, sums, averages, mins, and maxes, often grouped by one or more keys.
### Limit
Stops after producing a fixed number of rows.
These operators are simple individually, but query engines become interesting because they compose and can be reordered or fused in many ways.
---
## A tiny end-to-end example
Take this query:
```sql
SELECT name
FROM employees
WHERE age > 18
```
One reasonable logical plan is:
1. scan `employees`
2. filter rows by `age > 18`
3. project `name`
An optimized plan might notice that only `name` and `age` are needed and ask the data source for only those columns.
A physical plan might then choose:
1. Parquet scan with projection pushdown
2. vectorized filter over `age`
3. projection of `name`
The result is the same, but the execution cost can be much lower.
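Written with the illustrative plan types from the logical-plan sketch earlier in this note (the extra `id` and `dept` columns are made up), the before-and-after looks like:
```kotlin
// Before optimization: scan every column, then filter and project.
val naive = Projection(
    Filter(Scan("employees", listOf("id", "name", "age", "dept")), "age > 18"),
    listOf("name"),
)

// After projection pushdown: the scan only reads the two columns that are used.
val optimized = Projection(
    Filter(Scan("employees", listOf("name", "age")), "age > 18"),
    listOf("name"),
)
```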
---
## What the optimizer is really buying
The optimizer is not magic. It only works because the engine has:
- a stable internal representation
- explicit schemas and types
- clear operator semantics
- enough separation between intent and execution
Without those pieces, the engine has very little room to improve the query.
---
## Practical mental model
If you need to explain a query engine in one sentence, this is a good version:
> A query engine is a planner and executor for declarative data operations.
If you need a slightly longer version:
- parsing turns syntax into structure
- planning turns structure into operators
- optimization rewrites operators into a cheaper form
- execution runs those operators over data
That model is simple, but it is enough to orient most of the important design discussions.
---
## Questions to keep in mind
- What is the engine's internal data model?
- Where is the boundary between logical and physical planning?
- What can be pushed down into the data source?
- Is the workload more row-oriented or analytics-oriented?
- What is the unit of execution: rows, batches, or fully materialized tables?
---
## Changelog
* **Mar 31, 2026** -- The first version was created.


@@ -0,0 +1,246 @@
# How Query Engines Work, Part 1
This note covers the foundation chapters from Andy Grove's book _How Query Engines Work_:
- What Is a Query Engine?
- Apache Arrow
- Choosing a Type System
- Data Sources
The main sources were:
- https://howqueryengineswork.com/
- the local companion repo in `tmp/how-query-engines-work`
## Short answer
Part 1 frames a query engine as a specialized compiler for data work.
A user expresses _what_ data they want in a declarative form such as SQL or a DataFrame API, and the engine is responsible for figuring out _how_ to get it efficiently. The foundational design choices are:
- use columnar, batch-oriented data rather than row-at-a-time execution where possible
- make schemas and types explicit
- abstract over data sources behind a common interface
- treat in-memory representation as a first-class architectural decision, not an implementation detail
## Core mental model
The book's basic pipeline is:
1. parse the query text or API calls
2. build an abstract query representation
3. optimize or rewrite that representation
4. execute a concrete plan against data sources
This is why query engines feel compiler-like:
- a declarative input gets translated into an executable form
- the engine can validate before execution
- the engine can rewrite for efficiency without changing semantics
The key difference from a general compiler is that the runtime target is data processing rather than machine code.
## Important terminology
- `declarative query`: a request that says what result is wanted, not how to compute it
- `query engine`: software that turns declarative queries into results
- `parsing`: turning SQL text into a structured representation
- `planning`: deciding which logical operations are needed
- `optimization`: rewriting the plan into a more efficient equivalent
- `execution`: actually running operators over data
- `schema`: the names, types, and nullability of columns
- `field`: one named column in a schema
- `column vector`: one in-memory column of values
- `record batch`: a group of equal-length column vectors processed together
- `data source`: a backend that can expose a schema and stream batches
- `projection`: selecting only the needed columns
- `projection pushdown`: asking the data source to read only required columns
- `row-based execution`: processing one record at a time
- `columnar execution`: storing and often processing values by column
- `vectorized execution`: applying one operation across many values in a batch
## Notes by chapter
### What Is a Query Engine?
The book starts from the simplest possible idea: filtering a collection in code is already a tiny query engine. The real difference in production systems is scale, generality, and optimization.
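As a reminder of how small that starting point is, here is the "tiny query engine" as ordinary application code; `Employee` and the age threshold are made up for this illustration.
```kotlin
// Hard-coded procedure: the programmer decides to iterate, compare, and collect.
// A declarative query states only the desired result and leaves the rest open.
data class Employee(val name: String, val age: Int)

fun adultNames(employees: List<Employee>): List<String> =
    employees.filter { it.age > 18 }.map { it.name }
```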
The most important conceptual shift is:
- application code hard-codes the procedure
- query languages describe the desired result
- the engine gets freedom to choose the execution strategy
That separation matters because the same logical request can run:
- against a small file
- against a large local dataset
- against a distributed cluster
without changing the query text itself.
The book describes the core stages as:
1. parsing
2. planning
3. optimization
4. execution
That is a useful mental model for our own work too. Even if Geolog does not look like SQL, the same split applies: there is still a front-end language, an internal representation, possible rewrites, and some execution mechanism.
### Apache Arrow
The book treats Arrow as the foundation for the engine's in-memory model.
The main reason is not just standardization. It is that query workloads usually touch a subset of columns, not full rows. A columnar layout means:
- better cache locality for the values actually being processed
- better compression characteristics
- better support for SIMD-style vectorized operations
Arrow's core structures are simple but important:
- a data buffer holding values
- a validity bitmap for null tracking
- an offset buffer for variable-width values such as strings
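As a rough illustration (plain Kotlin arrays, not Arrow's real bit-packed buffers), a nullable string column holding `"hi"`, a null, and `"arrow"` maps onto those three structures like this:
```kotlin
val values   = "hiarrow".toByteArray()            // data buffer: all bytes back to back
val offsets  = intArrayOf(0, 2, 2, 7)             // offsets: value i spans offsets[i] until offsets[i + 1]
val validity = booleanArrayOf(true, false, true)  // validity bitmap: false marks a null
```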
Arrow also introduces the concept that matters most operationally in the book: the `RecordBatch`, a schema plus a set of equal-length columns processed as one unit.
The key takeaway is that the unit of execution should not be "the whole dataset" or "one row". It should usually be "one batch".
### Choosing a Type System
The type-system chapter argues that the engine must know more than raw values. It needs enough metadata to:
- reject invalid expressions early
- determine result types
- allocate correctly typed output storage
- skip unnecessary null checks when nullability is known
The design choice in the book is to build on Arrow's types instead of inventing a proprietary type system. That gives:
- a shared type vocabulary
- direct compatibility with Arrow-based formats and tools
- less impedance mismatch between storage and execution
The local companion code reflects this directly:
- [Schema.kt](/home/hassan/Workspace/useful-notes/tmp/how-query-engines-work/datatypes/src/main/kotlin/Schema.kt)
- [ColumnVector.kt](/home/hassan/Workspace/useful-notes/tmp/how-query-engines-work/datatypes/src/main/kotlin/ColumnVector.kt)
- [RecordBatch.kt](/home/hassan/Workspace/useful-notes/tmp/how-query-engines-work/datatypes/src/main/kotlin/RecordBatch.kt)
Two ideas here are especially worth carrying forward:
- the schema is part of the contract all the way through the engine
- batches and vectors are the runtime currency of execution
The chapter also contrasts row-at-a-time iterator execution with batch-oriented vectorized execution. The row-at-a-time model is simpler, but it pays overhead per row. Batch-oriented processing reduces that overhead and opens the door to data-parallel computation.
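A small Kotlin sketch of that contrast, using plain stand-ins rather than the book's Arrow-backed classes:
```kotlin
// Row-at-a-time: one lookup, one cast, one branch per record.
fun filterRows(rows: List<Map<String, Any?>>): List<Map<String, Any?>> =
    rows.filter { (it["age"] as Int) > 18 }

// Vectorized: one tight pass over a column, producing a selection mask that
// downstream operators can apply to every column in the batch.
fun filterColumn(age: IntArray): BooleanArray =
    BooleanArray(age.size) { i -> age[i] > 18 }
```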
### Data Sources
The data-source chapter introduces a very clean abstraction: all backends should answer two questions.
1. What is your schema?
2. Can you scan and stream batches, optionally for only certain columns?
That shows up directly in the local code:
- [DataSource.kt](/home/hassan/Workspace/useful-notes/tmp/how-query-engines-work/datasource/src/main/kotlin/DataSource.kt)
This is the interface:
```kotlin
interface DataSource {
    fun schema(): Schema
    fun scan(projection: List<String>): Sequence<RecordBatch>
}
```
That is a small API, but it carries a lot of architectural weight:
- planning depends on `schema()`
- execution depends on `scan(...)`
- projection pushdown begins at the source boundary
- streaming is built in through `Sequence<RecordBatch>`
The CSV and Parquet discussion makes the practical tradeoff clear:
- CSV is simple and widespread but weakly typed and expensive to parse
- Parquet is much more engine-friendly because it is columnar, typed, and supports efficient projection
This suggests a general rule: data-source abstraction should hide source-specific details, but the engine should still exploit source-specific capabilities when they matter for performance.
## What seems most important
These are the main durable concepts from Part 1:
- A query engine is best understood as a planner and executor over declarative requests.
- The in-memory data model is a core design choice, not a storage afterthought.
- Columnar batches are a strong default for analytics-oriented execution.
- Schemas and nullability are execution metadata, not just documentation.
- Data sources should be abstracted behind a common interface.
- Pushdown starts at the boundary with the data source.
- A clean separation between logical intent and physical execution strategy is foundational.
## Relevance to Geolog
There is not a one-to-one mapping from the book to Geolog, but several ideas transfer cleanly.
First, our existing notes already treat the IR as the contract between front-end work and execution work. The book reinforces that this is the right general shape. A query engine needs a representation that is stable enough for validation, rewriting, and execution.
Second, the book's distinction between declarative intent and execution strategy maps well onto Geolog's likely split between:
- source language and elaboration
- lowered IR
- runtime planning and execution
Third, the data-source abstraction looks very relevant if Geolog will query or check laws over multiple backends. A backend-neutral interface should probably expose:
- what data is available
- how its shape is described
- how to stream or otherwise iterate over it
- what pushdowns or source-side evaluation are possible
Fourth, the book is more relational and analytics-oriented than Geolog's logical/chase concerns, so some adaptation is needed. Columnar record batches are a natural fit for relational scans, filters, projections, aggregates, and joins. They are a less obvious fit for:
- existential witness generation
- branching search
- equality merging
- provenance-heavy chase steps
So the likely lesson is not "copy this architecture wholesale". The lesson is that Geolog will probably still need:
- a stable logical representation
- an execution-oriented runtime representation
- a clear backend boundary
- explicit decisions about the unit of execution
## Questions worth keeping open
- Is the first Geolog executor fundamentally row-oriented, set-oriented, or batch-oriented?
- Do we want a strict split between logical planning and physical execution, or a thinner execution layer at first?
- What is the Geolog equivalent of `Schema` and `RecordBatch`?
- What source capabilities do we eventually want to push down: projection, filtering, joins, law checks?
- Is the first backend abstraction only about storage, or also about partial execution?
## Bottom line
Part 1 is mostly about architectural posture.
Before building planners or optimizers, the book insists on getting four things straight:
- what a query engine is
- how data lives in memory
- how types flow through the engine
- how data enters the system
That is a good sequence. It suggests that for our own work, arguments about optimization or execution strategy should come after we are clear on:
- the contract representation
- the runtime unit of data
- the backend boundary
- the basic execution model
## Changelog
* **Mar 31, 2026** -- The first version was created.