# Query Engine Primer

A reference for the core ideas behind how a modern query engine works.

---

## Short definition

A query engine takes a declarative request for data and turns it into an executable plan. The important separation is:

- the user says what result they want
- the engine decides how to compute it

That freedom is what makes optimization possible.

---

## Big picture pipeline

Most query engines follow roughly this shape:

1. accept a query from SQL, a DataFrame API, or some other front-end language
2. parse it into a structured representation
3. build a logical plan describing the required operations
4. optimize or rewrite that plan
5. choose a physical plan with concrete execution operators
6. execute the plan against one or more data sources
7. return the result as rows, batches, or some client-facing format

At a high level, a query engine behaves like a small compiler:

- parsing corresponds to front-end syntax work
- logical planning corresponds to building an internal representation
- optimization corresponds to semantics-preserving rewrites
- physical planning and execution correspond to code generation and runtime

---

## The main pieces

### Parser

The parser turns the query text into a structured form such as an abstract syntax tree. For SQL, this means turning raw text into nodes like `SELECT`, `FROM`, `WHERE`, `GROUP BY`, and expressions.

### Logical Plan

The logical plan captures the meaning of the query in terms of relational operators. It usually includes operators such as:

- scan
- projection
- filter
- join
- aggregate
- limit

At this stage the plan says what needs to happen, not exactly how each step will run.

### Optimizer

The optimizer rewrites the logical plan into an equivalent but cheaper form.
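As a minimal sketch of what such a rewrite looks like, here is one hypothetical rule expressed over toy plan nodes. The class and function names (`Scan`, `Filter`, `Projection`, `push_filter_below_projection`) are illustrative, not from any real engine:

```python
from dataclasses import dataclass

# Toy logical-plan nodes. A real engine would carry schemas and typed
# expressions; plain strings stand in for both here.
@dataclass
class Scan:
    table: str

@dataclass
class Filter:
    predicate: str
    input: object

@dataclass
class Projection:
    columns: list
    input: object

def push_filter_below_projection(plan):
    """Rewrite Filter(Projection(x)) into Projection(Filter(x)).

    This is only safe when the predicate uses columns the projection
    passes through unchanged, which we assume for simplicity.
    """
    if isinstance(plan, Filter) and isinstance(plan.input, Projection):
        proj = plan.input
        return Projection(proj.columns, Filter(plan.predicate, proj.input))
    return plan

plan = Filter("age > 18", Projection(["name", "age"], Scan("employees")))
optimized = push_filter_below_projection(plan)
# The filter now sits directly above the scan, closer to the data source.
```

Real optimizers apply many such rules, usually by walking the plan tree repeatedly until no rule fires.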
Typical optimizations include:

- removing columns that are never used
- pushing filters closer to the data source
- simplifying expressions
- reordering joins
- replacing generic operators with cheaper specialized ones

### Physical Plan

The physical plan chooses specific implementations for each operator. For example, a logical join might become:

- hash join
- sort-merge join
- nested-loop join

This is where the engine commits to an execution strategy.

### Executor

The executor runs the physical plan. It requests data from child operators, processes it, and produces output for parent operators or the client.

---

## Logical plan vs physical plan

This distinction is one of the most important ideas in query processing.

| Layer | Main question answered | Example |
|:--------------|:------------------------------------|:--------|
| Logical plan | What operations are required? | Scan `employees`, filter `age > 18`, project `name` |
| Physical plan | How should those operations run? | Use Parquet scan, push down predicate, run vectorized filter |

Logical plans are about meaning. Physical plans are about execution mechanics. Keeping those layers separate gives the engine room to optimize without changing the query's semantics.

---

## The data model inside the engine

Modern analytical engines often use a columnar, batch-oriented model rather than processing one row at a time. The core concepts are:

- `Schema`: the names and types of columns
- `Field`: one column inside a schema
- `ColumnVector`: the in-memory values for one column
- `RecordBatch`: a schema plus a set of equal-length columns

This matters because analytical workloads often read only a few columns from very large datasets. A columnar layout makes those accesses much more efficient. It also works well with vectorized execution, where one operator applies the same computation across many values in a batch.
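These concepts can be sketched in a few lines of Python. This is a simplified illustration, assuming plain lists as column storage; real engines use typed, memory-managed buffers (Arrow-style formats, for example):

```python
from dataclasses import dataclass

@dataclass
class Field:
    name: str
    data_type: str  # e.g. "int64", "utf8"

@dataclass
class Schema:
    fields: list  # list of Field

    def index_of(self, name):
        # Position of a named column within the schema.
        return next(i for i, f in enumerate(self.fields) if f.name == name)

@dataclass
class RecordBatch:
    schema: Schema
    columns: list  # one equal-length list of values per field

    def row_count(self):
        return len(self.columns[0]) if self.columns else 0

    def column(self, name):
        # Look up a column by name via the schema.
        return self.columns[self.schema.index_of(name)]

schema = Schema([Field("name", "utf8"), Field("age", "int64")])
batch = RecordBatch(schema, [["ada", "bob"], [36, 17]])
# batch.row_count() == 2
# batch.column("age") == [36, 17]
```

Note that the batch stores data column by column: fetching `age` touches one list and never materializes full rows.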
---

## Why columnar and batch-oriented execution matter

Compared with row-at-a-time execution, columnar batches have several advantages:

- less overhead per record
- better cache behavior
- better compression
- easier use of SIMD-style operations
- simpler projection of only the needed columns

The tradeoff is that batch-oriented systems can be more complex to build and may be less natural for transactional or record-by-record workloads. So "best" depends on the workload, but for analytics, columnar batches are a strong default.

---

## The data source boundary

Every engine needs a boundary where external data enters the system. A good data-source abstraction usually answers two questions:

1. What is the schema?
2. Can you scan the data, ideally with some pushdowns?

This is where the engine starts exploiting source capabilities such as:

- projection pushdown
- predicate pushdown
- partition pruning
- file-format-specific decoding

The abstraction should be simple enough to unify many backends, but rich enough to expose useful performance features.

---

## Common operators

### Scan

Reads data from a source into the engine.

### Projection

Chooses columns or computes derived expressions.

### Filter

Keeps only rows that satisfy a predicate.

### Join

Combines rows from multiple inputs using matching keys or conditions.

### Aggregate

Computes summary results such as counts, sums, averages, mins, and maxes, often grouped by one or more keys.

### Limit

Stops after producing a fixed number of rows.

These operators are simple individually, but query engines become interesting because they compose and can be reordered or fused in many ways.

---

## A tiny end-to-end example

Take this query:

```sql
SELECT name FROM employees WHERE age > 18
```

One reasonable logical plan is:

1. scan `employees`
2. filter rows by `age > 18`
3. project `name`

An optimized plan might notice that only `name` and `age` are needed and ask the data source for only those columns.
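In miniature, that optimized read path looks like the following toy execution, where plain Python lists stand in for column vectors and the "scan" has already projected only the two needed columns. The data values are invented for illustration:

```python
# Columns returned by a scan with projection pushdown: only `name` and `age`.
names = ["ada", "bob", "cleo"]
ages = [36, 17, 22]

# Vectorized filter: evaluate the predicate across the whole `age` column,
# producing a boolean selection vector rather than testing row by row.
selection = [age > 18 for age in ages]

# Projection: keep `name` for the selected positions only.
result = [name for name, keep in zip(names, selection) if keep]
# result == ["ada", "cleo"]
```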
A physical plan might then choose:

1. Parquet scan with projection pushdown
2. vectorized filter over `age`
3. projection of `name`

The result is the same, but the execution cost can be much lower.

---

## What the optimizer is really buying

The optimizer is not magic. It only works because the engine has:

- a stable internal representation
- explicit schemas and types
- clear operator semantics
- enough separation between intent and execution

Without those pieces, the engine has very little room to improve the query.

---

## Practical mental model

If you need to explain a query engine in one sentence, this is a good version:

> A query engine is a planner and executor for declarative data operations.

If you need a slightly longer version:

- parsing turns syntax into structure
- planning turns structure into operators
- optimization rewrites operators into a cheaper form
- execution runs those operators over data

That model is simple, but it is enough to orient most of the important design discussions.

---

## Questions to keep in mind

- What is the engine's internal data model?
- Where is the boundary between logical and physical planning?
- What can be pushed down into the data source?
- Is the workload more row-oriented or analytics-oriented?
- What is the unit of execution: rows, batches, or fully materialized tables?

---

## Changelog

* **Mar 31, 2026** -- The first version was created.