# Query Engine Primer
A reference for the core ideas behind how a modern query engine works.
---
## Short definition
A query engine takes a declarative request for data and turns it into an executable plan.
The important separation is:
- the user says what result they want
- the engine decides how to compute it
That freedom is what makes optimization possible.
---
## Big picture pipeline
Most query engines follow roughly this shape:
1. accept a query written in SQL, a DataFrame API, or some other front-end language
2. parse it into a structured representation
3. build a logical plan describing the required operations
4. optimize or rewrite that plan
5. choose a physical plan with concrete execution operators
6. execute the plan against one or more data sources
7. return the result as rows, batches, or some client-facing format
At a high level, a query engine behaves like a small compiler:
- parsing corresponds to front-end syntax work
- logical planning corresponds to building an internal representation
- optimization corresponds to semantics-preserving rewrites
- physical planning and execution correspond to code generation and runtime
---
## The main pieces
### Parser
The parser turns the query text into a structured form such as an abstract syntax tree. For SQL, this means turning raw text into nodes representing clauses like `SELECT`, `FROM`, `WHERE`, and `GROUP BY`, along with the expressions inside them.
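As a sketch, here is what a parser for a tiny SQL subset might emit. All of the node types and field names below are hypothetical, chosen for illustration rather than taken from any real engine:

```python
from dataclasses import dataclass

# Hypothetical AST node types for a tiny SQL subset.
@dataclass
class Column:
    name: str

@dataclass
class BinaryOp:
    op: str
    left: object
    right: object

@dataclass
class Select:
    projection: list   # expressions after SELECT
    table: str         # FROM clause
    where: object      # predicate expression, or None

# What a parser might produce for:
#   SELECT name FROM employees WHERE age > 18
ast = Select(
    projection=[Column("name")],
    table="employees",
    where=BinaryOp(">", Column("age"), 18),
)
```

The point is only that the output is structured data the rest of the engine can walk, instead of raw text.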
### Logical Plan
The logical plan captures the meaning of the query in terms of relational operators. It usually includes operators such as:
- scan
- projection
- filter
- join
- aggregate
- limit
At this stage the plan says what needs to happen, not exactly how each step will run.
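A logical plan is usually a tree of such operators, each wrapping its input. A minimal sketch, with hypothetical node types:

```python
from dataclasses import dataclass

# Hypothetical logical operator nodes; each holds its input plan.
@dataclass
class Scan:
    table: str

@dataclass
class Filter:
    input: object
    predicate: str  # expression kept as text for brevity

@dataclass
class Projection:
    input: object
    exprs: list

# "What needs to happen" for: SELECT name FROM employees WHERE age > 18
plan = Projection(Filter(Scan("employees"), "age > 18"), ["name"])
```

Note that nothing here says which file format to read, which filter implementation to use, or whether execution is vectorized; those are physical decisions.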
### Optimizer
The optimizer rewrites the logical plan into an equivalent but cheaper form.
Typical optimizations include:
- removing columns that are never used
- pushing filters closer to the data source
- simplifying expressions
- reordering joins
- replacing generic operators with cheaper specialized ones
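To make "simplifying expressions" concrete, here is a toy rewrite pass. It represents expressions as nested tuples (a deliberate simplification; real engines use typed expression trees) and applies two rules: drop additions of zero and fold literal arithmetic:

```python
# A hypothetical expression simplifier: one kind of logical rewrite.
# Expressions are nested tuples: ("+", left, right), ("col", name), ("lit", v).

def simplify(expr):
    if not isinstance(expr, tuple):
        return expr
    op, *args = expr
    args = [simplify(a) for a in args]
    # x + 0  ->  x
    if op == "+" and args[1] == ("lit", 0):
        return args[0]
    # fold literal arithmetic: 2 + 3  ->  5
    if op == "+" and args[0][0] == "lit" and args[1][0] == "lit":
        return ("lit", args[0][1] + args[1][1])
    return (op, *args)

print(simplify(("+", ("col", "age"), ("lit", 0))))   # ('col', 'age')
```

Every rule is semantics-preserving: the rewritten expression must produce the same value as the original for all inputs.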
### Physical Plan
The physical plan chooses specific implementations for each operator.
For example, a logical join might become:
- hash join
- sort-merge join
- nested-loop join
This is where the engine commits to an execution strategy.
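To show what one of those choices looks like in code, here is a minimal hash join sketch: build a hash table on one input, probe it with the other. Rows are plain dicts and the column names are made up for illustration:

```python
# A minimal hash join sketch. Rows are dicts; names are hypothetical.

def hash_join(left, right, key):
    table = {}
    for row in left:                       # build phase: index the left input
        table.setdefault(row[key], []).append(row)
    for row in right:                      # probe phase: look up each right row
        for match in table.get(row[key], []):
            yield {**match, **row}

emp = [{"dept": 1, "name": "Ada"}, {"dept": 2, "name": "Bo"}]
depts = [{"dept": 1, "dept_name": "Eng"}]
print(list(hash_join(emp, depts, "dept")))
# [{'dept': 1, 'name': 'Ada', 'dept_name': 'Eng'}]
```

A sort-merge or nested-loop join would produce the same rows with a different cost profile, which is exactly why the choice belongs in the physical plan.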
### Executor
The executor runs the physical plan. It requests data from child operators, processes it, and produces output for parent operators or the client.
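One classic execution style is the pull-based ("Volcano") model, where each operator pulls from its child. A sketch using Python generators, with rows as dicts and all operator names hypothetical:

```python
# A minimal pull-based executor: each operator is a generator that
# pulls rows from its child operator on demand.

def scan(rows):
    yield from rows

def filter_op(child, predicate):
    for row in child:
        if predicate(row):
            yield row

def project(child, columns):
    for row in child:
        yield {c: row[c] for c in columns}

data = [{"name": "Ada", "age": 36}, {"name": "Bo", "age": 12}]
plan = project(filter_op(scan(data), lambda r: r["age"] > 18), ["name"])
print(list(plan))   # [{'name': 'Ada'}]
```

Real analytical engines typically pull batches of column values rather than single rows, but the composition pattern is the same.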
---
## Logical plan vs physical plan
This distinction is one of the most important ideas in query processing.
| Layer | Main question answered | Example |
|:--------------|:------------------------------------|:--------|
| Logical plan | What operations are required? | Scan `employees`, filter `age > 18`, project `name` |
| Physical plan | How should those operations run? | Use Parquet scan, push down predicate, run vectorized filter |
Logical plans are about meaning.
Physical plans are about execution mechanics.
Keeping those layers separate gives the engine room to optimize without changing the query's semantics.
---
## The data model inside the engine
Modern analytical engines often use a columnar, batch-oriented model rather than processing one row at a time.
The core concepts are:
- `Schema`: the names and types of columns
- `Field`: one column inside a schema
- `ColumnVector`: the in-memory values for one column
- `RecordBatch`: a schema plus a set of equal-length columns
This matters because analytical workloads often read only a few columns from very large datasets. A columnar layout makes those accesses much more
efficient.
It also works well with vectorized execution, where one operator applies the same computation across many values in a batch.
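A minimal sketch of these containers, using plain Python lists in place of real typed buffers (the class and field names mirror the concepts above but are otherwise hypothetical):

```python
from dataclasses import dataclass

# Hypothetical columnar containers mirroring Schema / Field / RecordBatch.
@dataclass
class Field:
    name: str
    dtype: str

@dataclass
class Schema:
    fields: list

@dataclass
class RecordBatch:
    schema: Schema
    columns: list  # one equal-length value list per field

batch = RecordBatch(
    schema=Schema([Field("name", "utf8"), Field("age", "int64")]),
    columns=[["Ada", "Bo", "Cy"], [36, 12, 54]],
)
# Reading one column touches only that column's values.
ages = batch.columns[1]
```

In a production engine the columns would be contiguous typed buffers (as in Apache Arrow) rather than Python lists, but the shape of the abstraction is the same.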
---
## Why columnar and batch-oriented execution matter
Compared with row-at-a-time execution, columnar batches have several advantages:
- less overhead per record
- better cache behavior
- better compression
- easier use of SIMD-style operations
- simpler projection of only the needed columns
The tradeoff is that batch-oriented systems can be more complex to build and may be less natural for transactional or record-by-record workloads.
So "best" depends on the workload, but for analytics, columnar batches are a strong default.
---
## The data source boundary
Every engine needs a boundary where external data enters the system.
A good data-source abstraction usually answers two questions:
1. What is the schema?
2. Can you scan the data, ideally with some pushdowns?
This is where the engine starts exploiting source capabilities such as:
- projection pushdown
- predicate pushdown
- partition pruning
- file-format-specific decoding
The abstraction should be simple enough to unify many backends, but rich enough to expose useful performance features.
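A minimal sketch of such a boundary, assuming an in-memory backend; the interface and all names are hypothetical, but they answer exactly the two questions above:

```python
from typing import Protocol

# A hypothetical minimal data-source interface: expose the schema,
# and scan with optional projection and predicate pushdown.
class DataSource(Protocol):
    def schema(self) -> dict: ...
    def scan(self, columns=None, predicate=None): ...

class InMemorySource:
    def __init__(self, rows):
        self.rows = rows

    def schema(self):
        return {"name": "utf8", "age": "int64"}

    def scan(self, columns=None, predicate=None):
        for row in self.rows:
            if predicate is None or predicate(row):          # predicate pushdown
                yield {c: row[c] for c in (columns or row)}  # projection pushdown

src = InMemorySource([{"name": "Ada", "age": 36}, {"name": "Bo", "age": 12}])
print(list(src.scan(columns=["name"], predicate=lambda r: r["age"] > 18)))
# [{'name': 'Ada'}]
```

A Parquet-backed implementation of the same interface could honor the same `columns` and `predicate` hints far more cheaply, by skipping column chunks and row groups on disk.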
---
## Common operators
### Scan
Reads data from a source into the engine.
### Projection
Chooses columns or computes derived expressions.
### Filter
Keeps only rows that satisfy a predicate.
### Join
Combines rows from multiple inputs using matching keys or conditions.
### Aggregate
Computes summary results such as counts, sums, averages, mins, and maxes, often grouped by one or more keys.
### Limit
Stops after producing a fixed number of rows.
These operators are simple individually, but query engines become interesting because they compose and can be reordered or fused in many ways.
---
## A tiny end-to-end example
Take this query:
```sql
SELECT name
FROM employees
WHERE age > 18
```
One reasonable logical plan is:
1. scan `employees`
2. filter rows by `age > 18`
3. project `name`
An optimized plan might notice that only `name` and `age` are needed and ask the data source for only those columns.
A physical plan might then choose:
1. Parquet scan with projection pushdown
2. vectorized filter over `age`
3. projection of `name`
The result is the same, but the execution cost can be much lower.
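The "only `name` and `age` are needed" step can be sketched as a projection-pushdown pass over a toy plan. Plans here are nested dicts and all keys are hypothetical:

```python
# Projection pushdown sketch: walk the plan top-down, collecting the
# columns each operator needs, and record them on the scan at the bottom.

plan = {
    "op": "project", "exprs": ["name"],
    "input": {"op": "filter", "predicate_cols": ["age"],
              "input": {"op": "scan", "table": "employees", "columns": None}},
}

def push_projection(node, needed=None):
    if node["op"] == "scan":
        node["columns"] = sorted(needed) if needed else None
        return node
    if node["op"] == "project":
        needed = set(node["exprs"])
    elif node["op"] == "filter":
        needed = (needed or set()) | set(node["predicate_cols"])
    node["input"] = push_projection(node["input"], needed)
    return node

optimized = push_projection(plan)
# The scan now requests only the columns the query actually uses.
print(optimized["input"]["input"]["columns"])   # ['age', 'name']
```

The data source can then skip every other column entirely, which is where much of the cost reduction comes from.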
---
## What the optimizer is really buying
The optimizer is not magic. It only works because the engine has:
- a stable internal representation
- explicit schemas and types
- clear operator semantics
- enough separation between intent and execution
Without those pieces, the engine has very little room to improve the query.
---
## Practical mental model
If you need to explain a query engine in one sentence, this is a good version:
> A query engine is a planner and executor for declarative data operations.
If you need a slightly longer version:
- parsing turns syntax into structure
- planning turns structure into operators
- optimization rewrites operators into a cheaper form
- execution runs those operators over data
That model is simple, but it is enough to orient most of the important design discussions.
---
## Questions to keep in mind
- What is the engine's internal data model?
- Where is the boundary between logical and physical planning?
- What can be pushed down into the data source?
- Is the workload more row-oriented or analytics-oriented?
- What is the unit of execution: rows, batches, or fully materialized tables?
---
## Changelog
* **Mar 31, 2026** -- Initial version.