Add preliminary notes on query engine (architecture and internals)
parent 4082b42ac6, commit 7100a757b3

hqew/query-engine-glossary.md (new file)

## Query Engine Glossary

* **Query Engine** — A query engine is the component that takes a declarative request for data and produces the result. The user says what they want, and the engine decides how to compute it.
* **Declarative Query** — A query that describes the desired result rather than the exact procedure. SQL is the standard example: you ask for rows matching some condition, not a specific loop nest.
* **Schema** — The structural description of data: column names, column types, and usually nullability. The schema is the contract that planning and execution rely on.
* **Field** — One named column inside a schema. A field has a name and a type.
* **Type System** — The set of value types the engine understands and the rules for how expressions over those values behave. It lets the engine reject invalid queries early and allocate correctly typed outputs.
* **Nullability** — Whether a column or expression may contain missing values. This matters for both semantics and execution because null handling can add branching and bookkeeping.
* **Row** — One logical record in a table. In a row-oriented system, processing often happens one record at a time.
* **Column** — All values for one field across many rows. In a columnar engine, values from the same column are stored together in memory.
* **Column Vector** — An in-memory representation of one column's values. It is the basic runtime container used by columnar engines.
* **Record Batch** — A chunk of data consisting of several equal-length column vectors plus a schema. It is a common unit of execution in modern columnar engines.
* **Apache Arrow** — A standard in-memory columnar format. It gives query engines a shared representation for schemas, arrays, null bitmaps, and record batches.
* **Data Source** — Anything the engine can read from: CSV files, Parquet files, databases, object stores, or in-memory tables. A clean engine usually abstracts these behind a common interface.
* **Scan** — The operator that reads data from a source into the engine. It is usually the leaf node of a query plan.
* **Projection** — Selecting only certain columns or computing new expressions from existing ones. In SQL this is mostly the `SELECT` list.
* **Filter / Selection** — Removing rows that do not satisfy a predicate. In SQL this is usually the `WHERE` clause.
* **Predicate** — A boolean expression used to decide whether a row should be kept, such as `age > 18`.
* **Projection Pushdown** — Asking the data source to read only the needed columns instead of materializing everything first.
* **Predicate Pushdown** — Asking the data source to apply filtering as early as possible, ideally before data is moved into the main engine.
* **Logical Plan** — A representation of what operations are required to answer the query, independent of low-level execution details. It describes intent rather than exact mechanics.
* **Physical Plan** — A concrete executable strategy for the query. It chooses specific operator implementations and an execution order.
* **Optimizer** — The component that rewrites plans into equivalent but more efficient forms. Examples include pruning unused columns, pushing filters down, and reordering joins.
* **Operator** — One node in a plan, such as scan, projection, filter, join, aggregate, or limit.
* **Expression** — A computation inside an operator, such as `price * quantity`, `a = b`, or `SUM(total)`.
* **Join** — Combining rows from two inputs based on some matching condition. This is one of the most important and expensive relational operators.
* **Aggregate** — Collapsing many rows into summary values such as `COUNT`, `SUM`, `MIN`, `MAX`, or grouped results.
* **Execution Model** — The style in which operators run: row-at-a-time, batch-oriented, vectorized, pipelined, or materializing intermediate results.
* **Row-Oriented Execution** — Processing one row at a time through a chain of operators. It is simple to understand but often pays overhead per row.
* **Vectorized Execution** — Processing a batch of values together, typically one column at a time. This reduces per-row overhead and works well with columnar memory layouts.
* **Materialization** — Fully storing an intermediate result before the next operator consumes it. This can simplify execution but costs memory and latency.
* **Pipeline** — A flow where one operator produces data incrementally and the next consumes it without waiting for the entire input to finish.
* **Backend** — The concrete storage or execution target under the engine, such as an in-memory table layer, Parquet files, Postgres, or a distributed service.

## Changelog

* **Mar 31, 2026** -- The first version was created.

hqew/query-engine-primer.md (new file)

# Query Engine Primer

A reference for the core ideas behind how a modern query engine works.

---

## Short definition

A query engine takes a declarative request for data and turns it into an executable plan.

The important separation is:

- the user says what result they want
- the engine decides how to compute it

That freedom is what makes optimization possible.

---

## Big picture pipeline

Most query engines follow roughly this shape:

1. accept a query from SQL, a DataFrame API, or some other front-end language
2. parse it into a structured representation
3. build a logical plan describing the required operations
4. optimize or rewrite that plan
5. choose a physical plan with concrete execution operators
6. execute the plan against one or more data sources
7. return the result as rows, batches, or some client-facing format

At a high level, a query engine behaves like a small compiler:

- parsing corresponds to front-end syntax work
- logical planning corresponds to building an internal representation
- optimization corresponds to semantics-preserving rewrites
- physical planning and execution correspond to code generation and runtime

---

## The main pieces

### Parser

The parser turns the query text into a structured form such as an abstract syntax tree. For SQL, this means turning raw text into nodes like `SELECT`, `FROM`, `WHERE`, `GROUP BY`, and expressions.
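
To make that concrete, here is a minimal sketch of what a parsed form of a small query could look like. The type and field names are hypothetical, not taken from any particular parser:

```kotlin
// Hypothetical AST types for a tiny SQL subset (illustrative names only).
sealed class SqlExpr
data class SqlIdentifier(val id: String) : SqlExpr()
data class SqlLong(val value: Long) : SqlExpr()
data class SqlBinaryExpr(val left: SqlExpr, val op: String, val right: SqlExpr) : SqlExpr()

data class SqlSelect(
    val projection: List<SqlExpr>, // the SELECT list
    val table: String,             // the FROM clause
    val selection: SqlExpr?        // the WHERE clause, if any
)

// SELECT name FROM employees WHERE age > 18
val parsed = SqlSelect(
    projection = listOf(SqlIdentifier("name")),
    table = "employees",
    selection = SqlBinaryExpr(SqlIdentifier("age"), ">", SqlLong(18))
)
```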

### Logical Plan

The logical plan captures the meaning of the query in terms of relational operators. It usually includes operators such as:

- scan
- projection
- filter
- join
- aggregate
- limit

At this stage the plan says what needs to happen, not exactly how each step will run.
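
A minimal sketch of one way to represent such a plan as a tree of operators (hypothetical names; a real engine would use expression trees rather than the plain strings used here):

```kotlin
// A logical plan as a tree of operators; a sketch with illustrative names.
sealed class LogicalPlan {
    abstract fun children(): List<LogicalPlan>
}

data class Scan(val table: String, val projection: List<String>) : LogicalPlan() {
    override fun children() = emptyList<LogicalPlan>()
}

data class Filter(val input: LogicalPlan, val predicate: String) : LogicalPlan() {
    override fun children() = listOf(input)
}

data class Projection(val input: LogicalPlan, val exprs: List<String>) : LogicalPlan() {
    override fun children() = listOf(input)
}

// scan -> filter -> project, built bottom-up
val plan: LogicalPlan =
    Projection(
        Filter(Scan("employees", projection = emptyList()), predicate = "age > 18"),
        exprs = listOf("name")
    )
```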

### Optimizer

The optimizer rewrites the logical plan into an equivalent but cheaper form.

Typical optimizations include:

- removing columns that are never used
- pushing filters closer to the data source
- simplifying expressions
- reordering joins
- replacing generic operators with cheaper specialized ones
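
Mechanically, a rewrite is just a plan-to-plan function. Building on the `LogicalPlan` sketch from the previous section, here is a hedged sketch of one rule; a real rule would also check that the predicate only references columns the projection preserves:

```kotlin
// Push a filter beneath a projection so it runs closer to the scan.
// The column-safety check a real optimizer needs is elided here.
fun pushFilterBelowProjection(plan: LogicalPlan): LogicalPlan = when (plan) {
    is Filter -> when (val child = plan.input) {
        is Projection -> Projection(Filter(child.input, plan.predicate), child.exprs)
        else -> plan
    }
    else -> plan
}
```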

### Physical Plan

The physical plan chooses specific implementations for each operator.

For example, a logical join might become:

- hash join
- sort-merge join
- nested-loop join

This is where the engine commits to an execution strategy.

### Executor

The executor runs the physical plan. It requests data from child operators, processes it, and produces output for parent operators or the client.

---

## Logical plan vs physical plan

This distinction is one of the most important ideas in query processing.

| Layer | Main question answered | Example |
|:--------------|:------------------------------------|:--------|
| Logical plan | What operations are required? | Scan `employees`, filter `age > 18`, project `name` |
| Physical plan | How should those operations run? | Use Parquet scan, push down predicate, run vectorized filter |

Logical plans are about meaning.

Physical plans are about execution mechanics.

Keeping those layers separate gives the engine room to optimize without changing the query's semantics.

---

## The data model inside the engine

Modern analytical engines often use a columnar, batch-oriented model rather than processing one row at a time.

The core concepts are:

- `Schema`: the names and types of columns
- `Field`: one column inside a schema
- `ColumnVector`: the in-memory values for one column
- `RecordBatch`: a schema plus a set of equal-length columns

This matters because analytical workloads often read only a few columns from very large datasets. A columnar layout makes those accesses much more efficient.

It also works well with vectorized execution, where one operator applies the same computation across many values in a batch.
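
A simplified sketch of these four types (illustrative only; an Arrow-based engine would back them with real Arrow types and buffers):

```kotlin
// Simplified stand-ins for the core data-model types.
data class Field(val name: String, val type: String) // type kept as a plain string here
data class Schema(val fields: List<Field>)

interface ColumnVector {
    fun getValue(i: Int): Any? // value at row i, or null
    fun size(): Int
}

class RecordBatch(val schema: Schema, val columns: List<ColumnVector>) {
    fun rowCount(): Int = columns.firstOrNull()?.size() ?: 0
    fun columnCount(): Int = columns.size
}
```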

---

## Why columnar and batch-oriented execution matter

Compared with row-at-a-time execution, columnar batches have several advantages:

- less overhead per record
- better cache behavior
- better compression
- easier use of SIMD-style operations
- simpler projection of only the needed columns

The tradeoff is that batch-oriented systems can be more complex to build and may be less natural for transactional or record-by-record workloads.

So "best" depends on the workload, but for analytics, columnar batches are a strong default.
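
As a minimal illustration of the vectorized style (a sketch; a real engine would run this over Arrow vectors with null handling), one tight loop evaluates `age > 18` for a whole column and produces a selection vector:

```kotlin
// Vectorized evaluation of `age > 18`: one pass over a primitive column,
// no per-row operator dispatch.
fun greaterThanScalar(column: IntArray, threshold: Int): BooleanArray {
    val selection = BooleanArray(column.size)
    for (i in column.indices) {
        selection[i] = column[i] > threshold
    }
    return selection
}

fun main() {
    val ages = intArrayOf(12, 25, 18, 40)
    println(greaterThanScalar(ages, 18).toList()) // [false, true, false, true]
}
```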

---

## The data source boundary

Every engine needs a boundary where external data enters the system.

A good data-source abstraction usually answers two questions:

1. What is the schema?
2. Can you scan the data, ideally with some pushdowns?

This is where the engine starts exploiting source capabilities such as:

- projection pushdown
- predicate pushdown
- partition pruning
- file-format-specific decoding

The abstraction should be simple enough to unify many backends, but rich enough to expose useful performance features.
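
In code, that boundary can be as small as one interface. This sketch matches the shape of the companion repo's `DataSource.kt`, quoted in the notes later in this commit; `Schema` and `RecordBatch` are the data-model types sketched above:

```kotlin
// The source boundary: expose a schema, and scan only the projected columns.
interface DataSource {
    fun schema(): Schema
    fun scan(projection: List<String>): Sequence<RecordBatch>
}
```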

---

## Common operators

### Scan

Reads data from a source into the engine.

### Projection

Chooses columns or computes derived expressions.

### Filter

Keeps only rows that satisfy a predicate.

### Join

Combines rows from multiple inputs using matching keys or conditions.

### Aggregate

Computes summary results such as counts, sums, averages, mins, and maxes, often grouped by one or more keys.

### Limit

Stops after producing a fixed number of rows.

These operators are simple individually, but query engines become interesting because they compose and can be reordered or fused in many ways.

---

## A tiny end-to-end example

Take this query:

```sql
SELECT name
FROM employees
WHERE age > 18
```

One reasonable logical plan is:

1. scan `employees`
2. filter rows by `age > 18`
3. project `name`

An optimized plan might notice that only `name` and `age` are needed and ask the data source for only those columns.

A physical plan might then choose:

1. Parquet scan with projection pushdown
2. vectorized filter over `age`
3. projection of `name`

The result is the same, but the execution cost can be much lower.
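
Rendered as plan trees (an illustrative format, not any particular engine's output), the difference is only where column selection happens:

```
Unoptimized:
  Projection: name
    Filter: age > 18
      Scan: employees (all columns)

Optimized:
  Projection: name
    Filter: age > 18
      Scan: employees, projection=[name, age]
```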

---

## What the optimizer is really buying

The optimizer is not magic. It only works because the engine has:

- a stable internal representation
- explicit schemas and types
- clear operator semantics
- enough separation between intent and execution

Without those pieces, the engine has very little room to improve the query.

---

## Practical mental model

If you need to explain a query engine in one sentence, this is a good version:

> A query engine is a planner and executor for declarative data operations.

If you need a slightly longer version:

- parsing turns syntax into structure
- planning turns structure into operators
- optimization rewrites operators into a cheaper form
- execution runs those operators over data

That model is simple, but it is enough to orient most of the important design discussions.

---

## Questions to keep in mind

- What is the engine's internal data model?
- Where is the boundary between logical and physical planning?
- What can be pushed down into the data source?
- Is the workload more row-oriented or analytics-oriented?
- What is the unit of execution: rows, batches, or fully materialized tables?

---

## Changelog

* **Mar 31, 2026** -- The first version was created.

scratches/how-query-engines-work-part-1.md (new file)

# How Query Engines Work Part 1

This note covers the foundation chapters from Andy Grove's book _How Query Engines Work_:

- What Is a Query Engine?
- Apache Arrow
- Choosing a Type System
- Data Sources

The main sources were:

- https://howqueryengineswork.com/
- the local companion repo in `tmp/how-query-engines-work`

## Short answer

Part 1 frames a query engine as a specialized compiler for data work.

A user expresses _what_ data they want in a declarative form such as SQL or a DataFrame API, and the engine is responsible for figuring out _how_ to get it efficiently. The foundational design choices are:

- use columnar, batch-oriented data rather than row-at-a-time execution where possible
- make schemas and types explicit
- abstract over data sources behind a common interface
- treat in-memory representation as a first-class architectural decision, not an implementation detail

## Core mental model

The book's basic pipeline is:

1. parse the query text or API calls
2. build an abstract query representation
3. optimize or rewrite that representation
4. execute a concrete plan against data sources

This is why query engines feel compiler-like:

- a declarative input gets translated into an executable form
- the engine can validate before execution
- the engine can rewrite for efficiency without changing semantics

The key difference from a general compiler is that the runtime target is data processing rather than machine code.

## Important terminology

- `declarative query`: a request that says what result is wanted, not how to compute it
- `query engine`: software that turns declarative queries into results
- `parsing`: turning SQL text into a structured representation
- `planning`: deciding which logical operations are needed
- `optimization`: rewriting the plan into a more efficient equivalent
- `execution`: actually running operators over data
- `schema`: the names, types, and nullability of columns
- `field`: one named column in a schema
- `column vector`: one in-memory column of values
- `record batch`: a group of equal-length column vectors processed together
- `data source`: a backend that can expose a schema and stream batches
- `projection`: selecting only the needed columns
- `projection pushdown`: asking the data source to read only required columns
- `row-based execution`: processing one record at a time
- `columnar execution`: storing and often processing values by column
- `vectorized execution`: applying one operation across many values in a batch

## Notes by chapter

### What Is a Query Engine?

The book starts from the simplest possible idea: filtering a collection in code is already a tiny query engine. The real difference in production systems is scale, generality, and optimization.

The most important conceptual shift is:

- application code hard-codes the procedure
- query languages describe the desired result
- the engine gets freedom to choose the execution strategy

That separation matters because the same logical request can run:

- against a small file
- against a large local dataset
- against a distributed cluster

without changing the query text itself.

The book describes the core stages as:

1. parsing
2. planning
3. optimization
4. execution

That is a useful mental model for our own work too. Even if Geolog does not look like SQL, the same split applies: there is still a front-end language, an internal representation, possible rewrites, and some execution mechanism.

### Apache Arrow

The book treats Arrow as the foundation for the engine's in-memory model.

The main reason is not just standardization. It is that query workloads usually touch a subset of columns, not full rows. A columnar layout means:

- better cache locality for the values actually being processed
- better compression characteristics
- better support for SIMD-style vectorized operations

Arrow's core structures are simple but important:

- a data buffer holding values
- a validity bitmap for null tracking
- an offset buffer for variable-width values such as strings

Arrow also introduces the concept that ends up mattering most operationally in the book: the `RecordBatch`, meaning a schema plus a set of equal-length columns processed as one unit.

The key takeaway is that the unit of execution should not be "the whole dataset" or "one row". It should usually be "one batch".
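
As a concrete illustration of those three buffers, here is how the string column `["ab", null, "xyz"]` could be laid out. This is hand-built layout arithmetic, not real Arrow API code:

```kotlin
// Arrow-style layout of the string column ["ab", null, "xyz"].
val data = "abxyz".toByteArray()                 // all value bytes, back to back
val offsets = intArrayOf(0, 2, 2, 5)             // value i spans offsets[i] until offsets[i + 1]
val validity = booleanArrayOf(true, false, true) // false marks a null slot

fun valueAt(i: Int): String? =
    if (!validity[i]) null
    else String(data, offsets[i], offsets[i + 1] - offsets[i], Charsets.UTF_8)

fun main() {
    println((0..2).map(::valueAt)) // [ab, null, xyz]
}
```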

### Choosing a Type System

The type-system chapter argues that the engine must know more than raw values. It needs enough metadata to:

- reject invalid expressions early
- determine result types
- allocate correctly typed output storage
- skip unnecessary null checks when nullability is known

The design choice in the book is to build on Arrow's types instead of inventing a proprietary type system. That gives:

- a shared type vocabulary
- direct compatibility with Arrow-based formats and tools
- less impedance mismatch between storage and execution

The local companion code reflects this directly:

- [Schema.kt](/home/hassan/Workspace/useful-notes/tmp/how-query-engines-work/datatypes/src/main/kotlin/Schema.kt)
- [ColumnVector.kt](/home/hassan/Workspace/useful-notes/tmp/how-query-engines-work/datatypes/src/main/kotlin/ColumnVector.kt)
- [RecordBatch.kt](/home/hassan/Workspace/useful-notes/tmp/how-query-engines-work/datatypes/src/main/kotlin/RecordBatch.kt)
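
From memory of those files, the shape is approximately the following; exact member names in the repo may differ slightly:

```kotlin
import org.apache.arrow.vector.types.pojo.ArrowType

// Approximate shape of the datatypes module (reconstructed, not copied).
data class Field(val name: String, val dataType: ArrowType)

data class Schema(val fields: List<Field>)

interface ColumnVector {
    fun getType(): ArrowType
    fun getValue(i: Int): Any?
    fun size(): Int
}

class RecordBatch(val schema: Schema, val fields: List<ColumnVector>) {
    fun rowCount(): Int = fields.first().size()
    fun columnCount(): Int = fields.size
    fun field(i: Int): ColumnVector = fields[i]
}
```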

Two ideas here are especially worth carrying forward:

- the schema is part of the contract all the way through the engine
- batches and vectors are the runtime currency of execution

The chapter also contrasts row-at-a-time iterator execution with batch-oriented vectorized execution. The row-at-a-time model is simpler, but it pays overhead per row. Batch-oriented processing reduces that overhead and opens the door to data-parallel computation.

### Data Sources

The data-source chapter introduces a very clean abstraction: all backends should answer two questions.

1. What is your schema?
2. Can you scan and stream batches, optionally for only certain columns?

That shows up directly in the local code:

- [DataSource.kt](/home/hassan/Workspace/useful-notes/tmp/how-query-engines-work/datasource/src/main/kotlin/DataSource.kt)

This is the interface:

```kotlin
interface DataSource {
    fun schema(): Schema
    fun scan(projection: List<String>): Sequence<RecordBatch>
}
```

That is a small API, but it carries a lot of architectural weight:

- planning depends on `schema()`
- execution depends on `scan(...)`
- projection pushdown begins at the source boundary
- streaming is built in through `Sequence<RecordBatch>`

The CSV and Parquet discussion makes the practical tradeoff clear:

- CSV is simple and widespread but weakly typed and expensive to parse
- Parquet is much more engine-friendly because it is columnar, typed, and supports efficient projection

This suggests a general rule: the data-source abstraction should hide source-specific details, but the engine should still exploit source-specific capabilities when they matter for performance.
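
To see how little a backend has to provide, here is a hedged sketch of an in-memory implementation of that interface, with simplified stand-in types so it is self-contained (the companion repo's CSV source is the real reference):

```kotlin
// Stand-in types for the sketch; the companion repo uses its Arrow-backed
// Schema / RecordBatch instead.
data class Field(val name: String, val type: String)
data class Schema(val fields: List<Field>)
class RecordBatch(val schema: Schema, val columns: List<List<Any?>>)

interface DataSource {
    fun schema(): Schema
    fun scan(projection: List<String>): Sequence<RecordBatch>
}

// A whole table held in memory, one list per column. Projection pushdown is
// just selecting the requested columns before any batch is built.
class InMemoryDataSource(
    private val tableSchema: Schema,
    private val columns: Map<String, List<Any?>>
) : DataSource {
    override fun schema(): Schema = tableSchema

    override fun scan(projection: List<String>): Sequence<RecordBatch> {
        val names = projection.ifEmpty { tableSchema.fields.map { it.name } }
        val projected = Schema(tableSchema.fields.filter { it.name in names })
        return sequenceOf(RecordBatch(projected, names.map { columns.getValue(it) }))
    }
}
```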

## What seems most important

These are the main durable concepts from Part 1:

- A query engine is best understood as a planner and executor over declarative requests.
- The in-memory data model is a core design choice, not a storage afterthought.
- Columnar batches are a strong default for analytics-oriented execution.
- Schemas and nullability are execution metadata, not just documentation.
- Data sources should be abstracted behind a common interface.
- Pushdown starts at the boundary with the data source.
- A clean separation between logical intent and physical execution strategy is foundational.

## Relevance to Geolog

There is not a one-to-one mapping from the book to Geolog, but several ideas transfer cleanly.

First, our existing notes already treat the IR as the contract between front-end work and execution work. The book reinforces that this is the right general shape. A query engine needs a representation that is stable enough for validation, rewriting, and execution.

Second, the book's distinction between declarative intent and execution strategy maps well onto Geolog's likely split between:

- source language and elaboration
- lowered IR
- runtime planning and execution

Third, the data-source abstraction looks very relevant if Geolog will query or check laws over multiple backends. A backend-neutral interface should probably expose:

- what data is available
- how its shape is described
- how to stream or otherwise iterate over it
- what pushdowns or source-side evaluation are possible

Fourth, the book is more relational and analytics-oriented than Geolog's logical/chase concerns, so some adaptation is needed. Columnar record batches are a natural fit for relational scans, filters, projections, aggregates, and joins. They are a less obvious fit for:

- existential witness generation
- branching search
- equality merging
- provenance-heavy chase steps

So the likely lesson is not "copy this architecture wholesale". The lesson is that Geolog will probably still need:

- a stable logical representation
- an execution-oriented runtime representation
- a clear backend boundary
- explicit decisions about the unit of execution

## Questions worth keeping open

- Is the first Geolog executor fundamentally row-oriented, set-oriented, or batch-oriented?
- Do we want a strict split between logical planning and physical execution, or a thinner execution layer at first?
- What is the Geolog equivalent of `Schema` and `RecordBatch`?
- What source capabilities do we eventually want to push down: projection, filtering, joins, law checks?
- Is the first backend abstraction only about storage, or also about partial execution?

## Bottom line

Part 1 is mostly about architectural posture.

Before building planners or optimizers, the book insists on getting four things straight:

- what a query engine is
- how data lives in memory
- how types flow through the engine
- how data enters the system

That is a good sequence. It suggests that for our own work, arguments about optimization or execution strategy should come after we are clear on:

- the contract representation
- the runtime unit of data
- the backend boundary
- the basic execution model

## Changelog

* **Mar 31, 2026** -- First version created.