
Query Engine Primer

A reference for the core ideas behind how a modern query engine works.


Short definition

A query engine takes a declarative request for data and turns it into an executable plan.

The important separation is:

  • the user says what result they want
  • the engine decides how to compute it

That freedom is what makes optimization possible.


Big picture pipeline

Most query engines follow roughly this shape:

  1. accept a query written in SQL, a DataFrame API, or another front-end language
  2. parse it into a structured representation
  3. build a logical plan describing the required operations
  4. optimize or rewrite that plan
  5. choose a physical plan with concrete execution operators
  6. execute the plan against one or more data sources
  7. return the result as rows, batches, or some client-facing format

At a high level, a query engine behaves like a small compiler:

  • parsing corresponds to front-end syntax work
  • logical planning corresponds to building an internal representation
  • optimization corresponds to semantics-preserving rewrites
  • physical planning and execution correspond to code generation and runtime

The main pieces

Parser

The parser turns the query text into a structured form such as an abstract syntax tree. For SQL, this means turning raw text into nodes like SELECT, FROM, WHERE, GROUP BY, and expressions.
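
As a concrete sketch, the AST for a simple SELECT might look like the structure below. The node shapes are illustrative, not from any particular engine; real parsers use richer, typed node classes.

```python
# Hypothetical AST for: SELECT name FROM employees WHERE age > 18
# The dict-based node shapes are illustrative only.
ast = {
    "type": "select",
    "columns": ["name"],
    "table": "employees",
    "where": {"op": ">", "left": {"column": "age"}, "right": {"literal": 18}},
}

print(ast["where"]["op"])  # >
```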

Logical Plan

The logical plan captures the meaning of the query in terms of relational operators. It usually includes operators such as:

  • scan
  • projection
  • filter
  • join
  • aggregate
  • limit

At this stage the plan says what needs to happen, not exactly how each step will run.
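
A minimal way to picture this is as a tree of operator nodes, each wrapping its input. The class names below are illustrative, not taken from a specific engine:

```python
from dataclasses import dataclass

# Minimal logical-plan nodes; each node wraps its child, forming a tree.
# Illustrative shapes only -- real plans carry schemas and typed expressions.

@dataclass
class Scan:
    table: str

@dataclass
class Filter:
    input: object
    predicate: str  # e.g. "age > 18"

@dataclass
class Projection:
    input: object
    columns: list

# Plan for: SELECT name FROM employees WHERE age > 18
plan = Projection(Filter(Scan("employees"), "age > 18"), ["name"])
print(type(plan).__name__)  # Projection
```

Reading the tree bottom-up gives the order of operations: scan, then filter, then project.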

Optimizer

The optimizer rewrites the logical plan into an equivalent but cheaper form.

Typical optimizations include:

  • removing columns that are never used
  • pushing filters closer to the data source
  • simplifying expressions
  • reordering joins
  • replacing generic operators with cheaper specialized ones
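
As one small illustration, here is a sketch of predicate pushdown over a toy tuple-based plan: a filter sitting above a projection is rewritten to run below it, closer to the scan. The plan encoding is invented for this example.

```python
# Sketch of predicate pushdown on a tiny tuple-based plan representation.
# Nodes look like ("op", child, extra...); the shapes are illustrative only.

def push_filter_down(plan):
    # Rewrite Filter(Projection(x)) into Projection(Filter(x)) when the
    # filter only references a column that survives the projection.
    op, *rest = plan
    if op == "filter":
        child, predicate_col, value = rest
        if child[0] == "project" and predicate_col in child[2]:
            inner = ("filter", child[1], predicate_col, value)
            return ("project", inner, child[2])
    return plan

before = ("filter", ("project", ("scan", "employees"), ["name", "age"]), "age", 18)
after = push_filter_down(before)
print(after[0])  # project
```

Both plans compute the same result; the rewritten one filters rows before the projection runs, which matters once the filter can be handed to the data source itself.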

Physical Plan

The physical plan chooses specific implementations for each operator.

For example, a logical join might become:

  • hash join
  • sort-merge join
  • nested-loop join

This is where the engine commits to an execution strategy.
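
For intuition about one of those choices: a hash join builds a hash table on one input and probes it with the other, giving roughly linear cost instead of the quadratic cost of a nested loop. A minimal sketch, with invented example data:

```python
# Minimal hash join on lists of dicts: build a hash table on one input's
# key, then probe it with the other input. Illustrative only.

def hash_join(left, right, key):
    table = {}
    for row in left:                      # build phase
        table.setdefault(row[key], []).append(row)
    out = []
    for row in right:                     # probe phase
        for match in table.get(row[key], []):
            out.append({**match, **row})  # merge matching rows
    return out

employees = [{"dept": 1, "name": "Ada"}, {"dept": 2, "name": "Sam"}]
depts = [{"dept": 1, "dept_name": "Eng"}]
print(hash_join(employees, depts, "dept"))
# [{'dept': 1, 'name': 'Ada', 'dept_name': 'Eng'}]
```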

Executor

The executor runs the physical plan. It requests data from child operators, processes it, and produces output for parent operators or the client.
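
One common execution model is pull-based (often called the Volcano model): each operator asks its child for the next row or batch. Python generators make a minimal row-at-a-time sketch easy; the operator names here are illustrative.

```python
# A pull-based ("Volcano"-style) executor sketch using generators:
# each operator pulls rows from its child and yields its own output.

def scan(rows):
    yield from rows

def filter_op(child, predicate):
    for row in child:
        if predicate(row):
            yield row

def project(child, columns):
    for row in child:
        yield {c: row[c] for c in columns}

rows = [{"name": "Ada", "age": 36}, {"name": "Sam", "age": 15}]
plan = project(filter_op(scan(rows), lambda r: r["age"] > 18), ["name"])
print(list(plan))  # [{'name': 'Ada'}]
```

Vectorized engines keep the same pull shape but pass batches of column values instead of single rows, amortizing the per-call overhead.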


Logical plan vs physical plan

This distinction is one of the most important ideas in query processing.

| Layer | Main question answered | Example |
| --- | --- | --- |
| Logical plan | What operations are required? | Scan employees, filter age > 18, project name |
| Physical plan | How should those operations run? | Use Parquet scan, push down predicate, run vectorized filter |

Logical plans are about meaning.

Physical plans are about execution mechanics.

Keeping those layers separate gives the engine room to optimize without changing the query's semantics.


The data model inside the engine

Modern analytical engines often use a columnar, batch-oriented model rather than processing one row at a time.

The core concepts are:

  • Schema: the names and types of columns
  • Field: one column inside a schema
  • ColumnVector: the in-memory values for one column
  • RecordBatch: a schema plus a set of equal-length columns

This matters because analytical workloads often read only a few columns from very large datasets. A columnar layout makes those accesses much more efficient.

It also works well with vectorized execution, where one operator applies the same computation across many values in a batch.
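
The four concepts above can be sketched as plain Python structures. Real engines (for example those built on Apache Arrow) use typed, contiguous memory buffers instead of Python lists; this is only a toy model of the shapes involved.

```python
from dataclasses import dataclass

# Toy versions of the columnar building blocks. Illustrative only.

@dataclass
class Field:
    name: str
    dtype: str

@dataclass
class Schema:
    fields: list

@dataclass
class RecordBatch:
    schema: Schema
    columns: list  # one equal-length list of values per field

    def num_rows(self):
        return len(self.columns[0]) if self.columns else 0

schema = Schema([Field("name", "string"), Field("age", "int64")])
batch = RecordBatch(schema, [["Ada", "Sam"], [36, 15]])
print(batch.num_rows())  # 2
```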


Why columnar and batch-oriented execution matter

Compared with row-at-a-time execution, columnar batches have several advantages:

  • less overhead per record
  • better cache behavior
  • better compression
  • easier use of SIMD-style operations
  • simpler projection of only the needed columns

The tradeoff is that batch-oriented systems can be more complex to build and may be less natural for transactional or record-by-record workloads.

So "best" depends on the workload, but for analytics, columnar batches are a strong default.


The data source boundary

Every engine needs a boundary where external data enters the system.

A good data-source abstraction usually answers two questions:

  1. What is the schema?
  2. Can you scan the data, ideally with some pushdowns?

This is where the engine starts exploiting source capabilities such as:

  • projection pushdown
  • predicate pushdown
  • partition pruning
  • file-format-specific decoding

The abstraction should be simple enough to unify many backends, but rich enough to expose useful performance features.
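
A sketch of that boundary: a source exposes its schema and a scan that accepts optional projection and predicate arguments. The interface and class names below are invented for illustration, not any real engine's API.

```python
# Hypothetical data-source interface supporting projection and
# predicate pushdown, backed here by an in-memory list of dicts.

class InMemorySource:
    def __init__(self, rows):
        self.rows = rows

    def schema(self):
        # Derive column names from the first row (toy approach).
        return sorted(self.rows[0]) if self.rows else []

    def scan(self, projection=None, predicate=None):
        for row in self.rows:
            if predicate is not None and not predicate(row):
                continue  # predicate pushdown: drop rows at the source
            if projection is not None:
                row = {c: row[c] for c in projection}  # projection pushdown
            yield row

src = InMemorySource([{"name": "Ada", "age": 36}, {"name": "Sam", "age": 15}])
print(list(src.scan(projection=["name"], predicate=lambda r: r["age"] > 18)))
# [{'name': 'Ada'}]
```

A Parquet-backed source with the same interface could honor the projection by decoding only the requested column chunks, which is where the real savings come from.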


Common operators

Scan

Reads data from a source into the engine.

Projection

Chooses columns or computes derived expressions.

Filter

Keeps only rows that satisfy a predicate.

Join

Combines rows from multiple inputs using matching keys or conditions.

Aggregate

Computes summary results such as counts, sums, averages, mins, and maxes, often grouped by one or more keys.

Limit

Stops after producing a fixed number of rows.

These operators are simple individually; query engines become interesting because the operators compose and can be reordered or fused in many ways.


A tiny end-to-end example

Take this query:

SELECT name
FROM employees
WHERE age > 18

One reasonable logical plan is:

  1. scan employees
  2. filter rows by age > 18
  3. project name

An optimized plan might notice that only name and age are needed and ask the data source for only those columns.

A physical plan might then choose:

  1. Parquet scan with projection pushdown
  2. vectorized filter over age
  3. projection of name

The result is the same, but the execution cost can be much lower.
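
To make that cost difference tangible, here is a toy comparison where the number of values read stands in for I/O. The data and the `scan` helper are invented for this illustration; in a real columnar format the savings come from never decoding the skipped columns.

```python
# Illustrative comparison: scanning all columns vs. only the needed ones.
# Value counts stand in for I/O cost.

data = {"name": ["Ada", "Sam"], "age": [36, 15], "salary": [100, 90]}

def scan(columns):
    # Return the requested columns and how many values were "read".
    cols = {c: data[c] for c in columns}
    return cols, sum(len(v) for v in cols.values())

_, naive_cost = scan(["name", "age", "salary"])   # no projection pushdown
_, pushed_cost = scan(["name", "age"])            # only columns the query needs
print(naive_cost, pushed_cost)  # 6 4
```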


What the optimizer is really buying

The optimizer is not magic. It only works because the engine has:

  • a stable internal representation
  • explicit schemas and types
  • clear operator semantics
  • enough separation between intent and execution

Without those pieces, the engine has very little room to improve the query.


Practical mental model

If you need to explain a query engine in one sentence, this is a good version:

A query engine is a planner and executor for declarative data operations.

If you need a slightly longer version:

  • parsing turns syntax into structure
  • planning turns structure into operators
  • optimization rewrites operators into a cheaper form
  • execution runs those operators over data

That model is simple, but it is enough to orient most of the important design discussions.


Questions to keep in mind

  • What is the engine's internal data model?
  • Where is the boundary between logical and physical planning?
  • What can be pushed down into the data source?
  • Is the workload more row-oriented or analytics-oriented?
  • What is the unit of execution: rows, batches, or fully materialized tables?

Changelog

  • Mar 31, 2026 -- Initial version.