# Query Engine Primer
A reference for the core ideas behind how a modern query engine works.
---
## Short definition
A query engine takes a declarative request for data and turns it into an executable plan.
The important separation is:
- the user says what result they want
- the engine decides how to compute it
That freedom is what makes optimization possible.
---
## Big picture pipeline
Most query engines follow roughly this shape:
1. accept a query written in SQL, a DataFrame API, or some other front-end language
2. parse it into a structured representation
3. build a logical plan describing the required operations
4. optimize or rewrite that plan
5. choose a physical plan with concrete execution operators
6. execute the plan against one or more data sources
7. return the result as rows, batches, or some client-facing format
At a high level, a query engine behaves like a small compiler:
- parsing corresponds to front-end syntax work
- logical planning corresponds to building an internal representation
- optimization corresponds to semantics-preserving rewrites
- physical planning and execution correspond to code generation and runtime
---
## The main pieces
### Parser
The parser turns the query text into a structured form such as an abstract syntax tree. For SQL, this means turning raw text into nodes representing clauses like `SELECT`, `FROM`, `WHERE`, and `GROUP BY`, along with the expressions inside them.
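As a sketch, here is what a parser for a tiny SQL subset might emit. All of the node types and field names below are hypothetical, chosen for illustration rather than taken from any real engine:

```python
from dataclasses import dataclass

# Hypothetical AST node types for a tiny SQL subset.
@dataclass
class Column:
    name: str

@dataclass
class BinaryOp:
    op: str
    left: object
    right: object

@dataclass
class Select:
    projection: list   # expressions after SELECT
    table: str         # FROM clause
    where: object      # predicate expression, or None

# What a parser might produce for:
#   SELECT name FROM employees WHERE age > 18
ast = Select(
    projection=[Column("name")],
    table="employees",
    where=BinaryOp(">", Column("age"), 18),
)
```

The point is only that the output is structured data the rest of the engine can walk, instead of raw text.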
### Logical Plan
The logical plan captures the meaning of the query in terms of relational operators. It usually includes operators such as:
- scan
- projection
- filter
- join
- aggregate
- limit
At this stage the plan says what needs to happen, not exactly how each step will run.
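A logical plan is usually a tree of such operators, each wrapping its input. A minimal sketch, with hypothetical node types:

```python
from dataclasses import dataclass

# Hypothetical logical operator nodes; each holds its input plan.
@dataclass
class Scan:
    table: str

@dataclass
class Filter:
    input: object
    predicate: str  # expression kept as text for brevity

@dataclass
class Projection:
    input: object
    exprs: list

# "What needs to happen" for: SELECT name FROM employees WHERE age > 18
plan = Projection(Filter(Scan("employees"), "age > 18"), ["name"])
```

Note that nothing here says which file format to read, which filter implementation to use, or whether execution is vectorized; those are physical decisions.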
### Optimizer
The optimizer rewrites the logical plan into an equivalent but cheaper form.
Typical optimizations include:
- removing columns that are never used
- pushing filters closer to the data source
- simplifying expressions
- reordering joins
- replacing generic operators with cheaper specialized ones
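To make "simplifying expressions" concrete, here is a toy rewrite pass. It represents expressions as nested tuples (a deliberate simplification; real engines use typed expression trees) and applies two rules: drop additions of zero and fold literal arithmetic:

```python
# A hypothetical expression simplifier: one kind of logical rewrite.
# Expressions are nested tuples: ("+", left, right), ("col", name), ("lit", v).

def simplify(expr):
    if not isinstance(expr, tuple):
        return expr
    op, *args = expr
    args = [simplify(a) for a in args]
    # x + 0  ->  x
    if op == "+" and args[1] == ("lit", 0):
        return args[0]
    # fold literal arithmetic: 2 + 3  ->  5
    if op == "+" and args[0][0] == "lit" and args[1][0] == "lit":
        return ("lit", args[0][1] + args[1][1])
    return (op, *args)

print(simplify(("+", ("col", "age"), ("lit", 0))))   # ('col', 'age')
```

Every rule is semantics-preserving: the rewritten expression must produce the same value as the original for all inputs.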
### Physical Plan
The physical plan chooses specific implementations for each operator.
For example, a logical join might become:
- hash join
- sort-merge join
- nested-loop join
This is where the engine commits to an execution strategy.
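To show what one of those choices looks like in code, here is a minimal hash join sketch: build a hash table on one input, probe it with the other. Rows are plain dicts and the column names are made up for illustration:

```python
# A minimal hash join sketch. Rows are dicts; names are hypothetical.

def hash_join(left, right, key):
    table = {}
    for row in left:                       # build phase: index the left input
        table.setdefault(row[key], []).append(row)
    for row in right:                      # probe phase: look up each right row
        for match in table.get(row[key], []):
            yield {**match, **row}

emp = [{"dept": 1, "name": "Ada"}, {"dept": 2, "name": "Bo"}]
depts = [{"dept": 1, "dept_name": "Eng"}]
print(list(hash_join(emp, depts, "dept")))
# [{'dept': 1, 'name': 'Ada', 'dept_name': 'Eng'}]
```

A sort-merge or nested-loop join would produce the same rows with a different cost profile, which is exactly why the choice belongs in the physical plan.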
### Executor
The executor runs the physical plan. It requests data from child operators, processes it, and produces output for parent operators or the client.
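One classic execution style is the pull-based ("Volcano") model, where each operator pulls from its child. A sketch using Python generators, with rows as dicts and all operator names hypothetical:

```python
# A minimal pull-based executor: each operator is a generator that
# pulls rows from its child operator on demand.

def scan(rows):
    yield from rows

def filter_op(child, predicate):
    for row in child:
        if predicate(row):
            yield row

def project(child, columns):
    for row in child:
        yield {c: row[c] for c in columns}

data = [{"name": "Ada", "age": 36}, {"name": "Bo", "age": 12}]
plan = project(filter_op(scan(data), lambda r: r["age"] > 18), ["name"])
print(list(plan))   # [{'name': 'Ada'}]
```

Real analytical engines typically pull batches of column values rather than single rows, but the composition pattern is the same.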
---
## Logical plan vs physical plan
This distinction is one of the most important ideas in query processing.
| Layer | Main question answered | Example |
|:--------------|:------------------------------------|:--------|
| Logical plan | What operations are required? | Scan `employees`, filter `age > 18`, project `name` |
| Physical plan | How should those operations run? | Use Parquet scan, push down predicate, run vectorized filter |
Logical plans are about meaning.
Physical plans are about execution mechanics.
Keeping those layers separate gives the engine room to optimize without changing the query's semantics.
---
## The data model inside the engine
Modern analytical engines often use a columnar, batch-oriented model rather than processing one row at a time.
The core concepts are:
- `Schema`: the names and types of columns
- `Field`: one column inside a schema
- `ColumnVector`: the in-memory values for one column
- `RecordBatch`: a schema plus a set of equal-length columns
This matters because analytical workloads often read only a few columns from very large datasets. A columnar layout makes those accesses much more
efficient.
It also works well with vectorized execution, where one operator applies the same computation across many values in a batch.
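A minimal sketch of these containers, using plain Python lists in place of real typed buffers (the class and field names mirror the concepts above but are otherwise hypothetical):

```python
from dataclasses import dataclass

# Hypothetical columnar containers mirroring Schema / Field / RecordBatch.
@dataclass
class Field:
    name: str
    dtype: str

@dataclass
class Schema:
    fields: list

@dataclass
class RecordBatch:
    schema: Schema
    columns: list  # one equal-length value list per field

batch = RecordBatch(
    schema=Schema([Field("name", "utf8"), Field("age", "int64")]),
    columns=[["Ada", "Bo", "Cy"], [36, 12, 54]],
)
# Reading one column touches only that column's values.
ages = batch.columns[1]
```

In a production engine the columns would be contiguous typed buffers (as in Apache Arrow) rather than Python lists, but the shape of the abstraction is the same.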
---
## Why columnar and batch-oriented execution matter
Compared with row-at-a-time execution, columnar batches have several advantages:
- less overhead per record
- better cache behavior
- better compression
- easier use of SIMD-style operations
- simpler projection of only the needed columns
The tradeoff is that batch-oriented systems can be more complex to build and may be less natural for transactional or record-by-record workloads.
So "best" depends on the workload, but for analytics, columnar batches are a strong default.
---
## The data source boundary
Every engine needs a boundary where external data enters the system.
A good data-source abstraction usually answers two questions:
1. What is the schema?
2. Can you scan the data, ideally with some pushdowns?
This is where the engine starts exploiting source capabilities such as:
- projection pushdown
- predicate pushdown
- partition pruning
- file-format-specific decoding
The abstraction should be simple enough to unify many backends, but rich enough to expose useful performance features.
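A minimal sketch of such a boundary, assuming an in-memory backend; the interface and all names are hypothetical, but they answer exactly the two questions above:

```python
from typing import Protocol

# A hypothetical minimal data-source interface: expose the schema,
# and scan with optional projection and predicate pushdown.
class DataSource(Protocol):
    def schema(self) -> dict: ...
    def scan(self, columns=None, predicate=None): ...

class InMemorySource:
    def __init__(self, rows):
        self.rows = rows

    def schema(self):
        return {"name": "utf8", "age": "int64"}

    def scan(self, columns=None, predicate=None):
        for row in self.rows:
            if predicate is None or predicate(row):          # predicate pushdown
                yield {c: row[c] for c in (columns or row)}  # projection pushdown

src = InMemorySource([{"name": "Ada", "age": 36}, {"name": "Bo", "age": 12}])
print(list(src.scan(columns=["name"], predicate=lambda r: r["age"] > 18)))
# [{'name': 'Ada'}]
```

A Parquet-backed implementation of the same interface could honor the same `columns` and `predicate` hints far more cheaply, by skipping column chunks and row groups on disk.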
---
## Common operators
### Scan
Reads data from a source into the engine.
### Projection
Chooses columns or computes derived expressions.
### Filter
Keeps only rows that satisfy a predicate.
### Join
Combines rows from multiple inputs using matching keys or conditions.
### Aggregate
Computes summary results such as counts, sums, averages, mins, and maxes, often grouped by one or more keys.
### Limit
Stops after producing a fixed number of rows.
These operators are simple individually, but query engines become interesting because they compose and can be reordered or fused in many ways.
---
## A tiny end-to-end example
Take this query:
```sql
SELECT name
FROM employees
WHERE age > 18
```
One reasonable logical plan is:
1. scan `employees`
2. filter rows by `age > 18`
3. project `name`
An optimized plan might notice that only `name` and `age` are needed and ask the data source for only those columns.
A physical plan might then choose:
1. Parquet scan with projection pushdown
2. vectorized filter over `age`
3. projection of `name`
The result is the same, but the execution cost can be much lower.
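The "only `name` and `age` are needed" step can be sketched as a projection-pushdown pass over a toy plan. Plans here are nested dicts and all keys are hypothetical:

```python
# Projection pushdown sketch: walk the plan top-down, collecting the
# columns each operator needs, and record them on the scan at the bottom.

plan = {
    "op": "project", "exprs": ["name"],
    "input": {"op": "filter", "predicate_cols": ["age"],
              "input": {"op": "scan", "table": "employees", "columns": None}},
}

def push_projection(node, needed=None):
    if node["op"] == "scan":
        node["columns"] = sorted(needed) if needed else None
        return node
    if node["op"] == "project":
        needed = set(node["exprs"])
    elif node["op"] == "filter":
        needed = (needed or set()) | set(node["predicate_cols"])
    node["input"] = push_projection(node["input"], needed)
    return node

optimized = push_projection(plan)
# The scan now requests only the columns the query actually uses.
print(optimized["input"]["input"]["columns"])   # ['age', 'name']
```

The data source can then skip every other column entirely, which is where much of the cost reduction comes from.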
---
## What the optimizer is really buying
The optimizer is not magic. It only works because the engine has:
- a stable internal representation
- explicit schemas and types
- clear operator semantics
- enough separation between intent and execution
Without those pieces, the engine has very little room to improve the query.
---
## Practical mental model
If you need to explain a query engine in one sentence, this is a good version:
> A query engine is a planner and executor for declarative data operations.
If you need a slightly longer version:
- parsing turns syntax into structure
- planning turns structure into operators
- optimization rewrites operators into a cheaper form
- execution runs those operators over data
That model is simple, but it is enough to orient most of the important design discussions.
---
## Questions to keep in mind
- What is the engine's internal data model?
- Where is the boundary between logical and physical planning?
- What can be pushed down into the data source?
- Is the workload more row-oriented or analytics-oriented?
- What is the unit of execution: rows, batches, or fully materialized tables?
---
## Changelog
* **Mar 31, 2026** -- Initial version.