
Query Engine Primer

A reference for the core ideas behind how a modern query engine works.


Short definition

A query engine takes a declarative request for data and turns it into an executable plan.

The important separation is:

  • the user says what result they want
  • the engine decides how to compute it

That freedom is what makes optimization possible.


Big picture pipeline

Most query engines follow roughly this shape:

  1. accept a query written in SQL, a DataFrame API, or another front-end language
  2. parse it into a structured representation
  3. build a logical plan describing the required operations
  4. optimize or rewrite that plan
  5. choose a physical plan with concrete execution operators
  6. execute the plan against one or more data sources
  7. return the result as rows, batches, or some client-facing format

At a high level, a query engine behaves like a small compiler:

  • parsing corresponds to front-end syntax work
  • logical planning corresponds to building an internal representation
  • optimization corresponds to semantics-preserving rewrites
  • physical planning and execution correspond to code generation and runtime

The main pieces

Parser

The parser turns the query text into a structured form such as an abstract syntax tree. For SQL, this means turning raw text into nodes like SELECT, FROM, WHERE, GROUP BY, and expressions.
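
As a concrete sketch, the AST for a simple SELECT might look like the structure below. The node shapes are illustrative, not from any particular engine; real parsers use richer, typed node classes.

```python
# Hypothetical AST for: SELECT name FROM employees WHERE age > 18
# The dict-based node shapes are illustrative only.
ast = {
    "type": "select",
    "columns": ["name"],
    "table": "employees",
    "where": {"op": ">", "left": {"column": "age"}, "right": {"literal": 18}},
}

print(ast["where"]["op"])  # >
```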

Logical Plan

The logical plan captures the meaning of the query in terms of relational operators. It usually includes operators such as:

  • scan
  • projection
  • filter
  • join
  • aggregate
  • limit

At this stage the plan says what needs to happen, not exactly how each step will run.
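
A minimal way to picture this is as a tree of operator nodes, each wrapping its input. The class names below are illustrative, not taken from a specific engine:

```python
from dataclasses import dataclass

# Minimal logical-plan nodes; each node wraps its child, forming a tree.
# Illustrative shapes only -- real plans carry schemas and typed expressions.

@dataclass
class Scan:
    table: str

@dataclass
class Filter:
    input: object
    predicate: str  # e.g. "age > 18"

@dataclass
class Projection:
    input: object
    columns: list

# Plan for: SELECT name FROM employees WHERE age > 18
plan = Projection(Filter(Scan("employees"), "age > 18"), ["name"])
print(type(plan).__name__)  # Projection
```

Reading the tree bottom-up gives the order of operations: scan, then filter, then project.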

Optimizer

The optimizer rewrites the logical plan into an equivalent but cheaper form.

Typical optimizations include:

  • removing columns that are never used
  • pushing filters closer to the data source
  • simplifying expressions
  • reordering joins
  • replacing generic operators with cheaper specialized ones
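
As one small illustration, here is a sketch of predicate pushdown over a toy tuple-based plan: a filter sitting above a projection is rewritten to run below it, closer to the scan. The plan encoding is invented for this example.

```python
# Sketch of predicate pushdown on a tiny tuple-based plan representation.
# Nodes look like ("op", child, extra...); the shapes are illustrative only.

def push_filter_down(plan):
    # Rewrite Filter(Projection(x)) into Projection(Filter(x)) when the
    # filter only references a column that survives the projection.
    op, *rest = plan
    if op == "filter":
        child, predicate_col, value = rest
        if child[0] == "project" and predicate_col in child[2]:
            inner = ("filter", child[1], predicate_col, value)
            return ("project", inner, child[2])
    return plan

before = ("filter", ("project", ("scan", "employees"), ["name", "age"]), "age", 18)
after = push_filter_down(before)
print(after[0])  # project
```

Both plans compute the same result; the rewritten one filters rows before the projection runs, which matters once the filter can be handed to the data source itself.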

Physical Plan

The physical plan chooses specific implementations for each operator.

For example, a logical join might become:

  • hash join
  • sort-merge join
  • nested-loop join

This is where the engine commits to an execution strategy.
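
For intuition about one of those choices: a hash join builds a hash table on one input and probes it with the other, giving roughly linear cost instead of the quadratic cost of a nested loop. A minimal sketch, with invented example data:

```python
# Minimal hash join on lists of dicts: build a hash table on one input's
# key, then probe it with the other input. Illustrative only.

def hash_join(left, right, key):
    table = {}
    for row in left:                      # build phase
        table.setdefault(row[key], []).append(row)
    out = []
    for row in right:                     # probe phase
        for match in table.get(row[key], []):
            out.append({**match, **row})  # merge matching rows
    return out

employees = [{"dept": 1, "name": "Ada"}, {"dept": 2, "name": "Sam"}]
depts = [{"dept": 1, "dept_name": "Eng"}]
print(hash_join(employees, depts, "dept"))
# [{'dept': 1, 'name': 'Ada', 'dept_name': 'Eng'}]
```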

Executor

The executor runs the physical plan. It requests data from child operators, processes it, and produces output for parent operators or the client.
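
One common execution model is pull-based (often called the Volcano model): each operator asks its child for the next row or batch. Python generators make a minimal row-at-a-time sketch easy; the operator names here are illustrative.

```python
# A pull-based ("Volcano"-style) executor sketch using generators:
# each operator pulls rows from its child and yields its own output.

def scan(rows):
    yield from rows

def filter_op(child, predicate):
    for row in child:
        if predicate(row):
            yield row

def project(child, columns):
    for row in child:
        yield {c: row[c] for c in columns}

rows = [{"name": "Ada", "age": 36}, {"name": "Sam", "age": 15}]
plan = project(filter_op(scan(rows), lambda r: r["age"] > 18), ["name"])
print(list(plan))  # [{'name': 'Ada'}]
```

Vectorized engines keep the same pull shape but pass batches of column values instead of single rows, amortizing the per-call overhead.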


Logical plan vs physical plan

This distinction is one of the most important ideas in query processing.

| Layer | Main question answered | Example |
| --- | --- | --- |
| Logical plan | What operations are required? | Scan employees, filter age > 18, project name |
| Physical plan | How should those operations run? | Use Parquet scan, push down predicate, run vectorized filter |

Logical plans are about meaning.

Physical plans are about execution mechanics.

Keeping those layers separate gives the engine room to optimize without changing the query's semantics.


The data model inside the engine

Modern analytical engines often use a columnar, batch-oriented model rather than processing one row at a time.

The core concepts are:

  • Schema: the names and types of columns
  • Field: one column inside a schema
  • ColumnVector: the in-memory values for one column
  • RecordBatch: a schema plus a set of equal-length columns

This matters because analytical workloads often read only a few columns from very large datasets. A columnar layout makes those accesses much more efficient.

It also works well with vectorized execution, where one operator applies the same computation across many values in a batch.
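
The four concepts above can be sketched as plain Python structures. Real engines (for example those built on Apache Arrow) use typed, contiguous memory buffers instead of Python lists; this is only a toy model of the shapes involved.

```python
from dataclasses import dataclass

# Toy versions of the columnar building blocks. Illustrative only.

@dataclass
class Field:
    name: str
    dtype: str

@dataclass
class Schema:
    fields: list

@dataclass
class RecordBatch:
    schema: Schema
    columns: list  # one equal-length list of values per field

    def num_rows(self):
        return len(self.columns[0]) if self.columns else 0

schema = Schema([Field("name", "string"), Field("age", "int64")])
batch = RecordBatch(schema, [["Ada", "Sam"], [36, 15]])
print(batch.num_rows())  # 2
```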


Why columnar and batch-oriented execution matter

Compared with row-at-a-time execution, columnar batches have several advantages:

  • less overhead per record
  • better cache behavior
  • better compression
  • easier use of SIMD-style operations
  • simpler projection of only the needed columns

The tradeoff is that batch-oriented systems can be more complex to build and may be less natural for transactional or record-by-record workloads.

So "best" depends on the workload, but for analytics, columnar batches are a strong default.


The data source boundary

Every engine needs a boundary where external data enters the system.

A good data-source abstraction usually answers two questions:

  1. What is the schema?
  2. Can you scan the data, ideally with some pushdowns?

This is where the engine starts exploiting source capabilities such as:

  • projection pushdown
  • predicate pushdown
  • partition pruning
  • file-format-specific decoding

The abstraction should be simple enough to unify many backends, but rich enough to expose useful performance features.
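
A sketch of that boundary: a source exposes its schema and a scan that accepts optional projection and predicate arguments. The interface and class names below are invented for illustration, not any real engine's API.

```python
# Hypothetical data-source interface supporting projection and
# predicate pushdown, backed here by an in-memory list of dicts.

class InMemorySource:
    def __init__(self, rows):
        self.rows = rows

    def schema(self):
        # Derive column names from the first row (toy approach).
        return sorted(self.rows[0]) if self.rows else []

    def scan(self, projection=None, predicate=None):
        for row in self.rows:
            if predicate is not None and not predicate(row):
                continue  # predicate pushdown: drop rows at the source
            if projection is not None:
                row = {c: row[c] for c in projection}  # projection pushdown
            yield row

src = InMemorySource([{"name": "Ada", "age": 36}, {"name": "Sam", "age": 15}])
print(list(src.scan(projection=["name"], predicate=lambda r: r["age"] > 18)))
# [{'name': 'Ada'}]
```

A Parquet-backed source with the same interface could honor the projection by decoding only the requested column chunks, which is where the real savings come from.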


Common operators

Scan

Reads data from a source into the engine.

Projection

Chooses columns or computes derived expressions.

Filter

Keeps only rows that satisfy a predicate.

Join

Combines rows from multiple inputs using matching keys or conditions.

Aggregate

Computes summary results such as counts, sums, averages, mins, and maxes, often grouped by one or more keys.

Limit

Stops after producing a fixed number of rows.

These operators are simple individually; query engines become interesting because the operators compose and can be reordered or fused in many ways.


A tiny end-to-end example

Take this query:

SELECT name
FROM employees
WHERE age > 18

One reasonable logical plan is:

  1. scan employees
  2. filter rows by age > 18
  3. project name

An optimized plan might notice that only name and age are needed and ask the data source for only those columns.

A physical plan might then choose:

  1. Parquet scan with projection pushdown
  2. vectorized filter over age
  3. projection of name

The result is the same, but the execution cost can be much lower.
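
To make that cost difference tangible, here is a toy comparison where the number of values read stands in for I/O. The data and the `scan` helper are invented for this illustration; in a real columnar format the savings come from never decoding the skipped columns.

```python
# Illustrative comparison: scanning all columns vs. only the needed ones.
# Value counts stand in for I/O cost.

data = {"name": ["Ada", "Sam"], "age": [36, 15], "salary": [100, 90]}

def scan(columns):
    # Return the requested columns and how many values were "read".
    cols = {c: data[c] for c in columns}
    return cols, sum(len(v) for v in cols.values())

_, naive_cost = scan(["name", "age", "salary"])   # no projection pushdown
_, pushed_cost = scan(["name", "age"])            # only columns the query needs
print(naive_cost, pushed_cost)  # 6 4
```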


What the optimizer is really buying

The optimizer is not magic. It only works because the engine has:

  • a stable internal representation
  • explicit schemas and types
  • clear operator semantics
  • enough separation between intent and execution

Without those pieces, the engine has very little room to improve the query.


Practical mental model

If you need to explain a query engine in one sentence, this is a good version:

A query engine is a planner and executor for declarative data operations.

If you need a slightly longer version:

  • parsing turns syntax into structure
  • planning turns structure into operators
  • optimization rewrites operators into a cheaper form
  • execution runs those operators over data

That model is simple, but it is enough to orient most of the important design discussions.


Questions to keep in mind

  • What is the engine's internal data model?
  • Where is the boundary between logical and physical planning?
  • What can be pushed down into the data source?
  • Is the workload more row-oriented or analytics-oriented?
  • What is the unit of execution: rows, batches, or fully materialized tables?

Changelog

  • Mar 31, 2026 -- Initial version.