From 7100a757b3c4848abee6522a6ec4cb3afe256bd1 Mon Sep 17 00:00:00 2001
From: Hassan Abedi
Date: Tue, 31 Mar 2026 11:09:56 +0200
Subject: [PATCH] Add preliminary notes on query engine (architecture and internals)

---
 hqew/query-engine-glossary.md              |  51 ++++
 hqew/query-engine-primer.md                | 262 +++++++++++++++++++++
 scratches/how-query-engines-work-part-1.md | 246 +++++++++++++++++++
 3 files changed, 559 insertions(+)
 create mode 100644 hqew/query-engine-glossary.md
 create mode 100644 hqew/query-engine-primer.md
 create mode 100644 scratches/how-query-engines-work-part-1.md

diff --git a/hqew/query-engine-glossary.md b/hqew/query-engine-glossary.md
new file mode 100644
index 0000000..7a5110f
--- /dev/null
+++ b/hqew/query-engine-glossary.md
@@ -0,0 +1,51 @@
## Query Engine Glossary

* **Query Engine** — A query engine is the component that takes a declarative request for data and produces the result. The user says what they want, and the engine decides how to compute it.
* **Declarative Query** — A query that describes the desired result rather than the exact procedure. SQL is the standard example: you ask for rows matching some condition, not a specific loop nest.
* **Schema** — The structural description of data: column names, column types, and usually nullability. The schema is the contract that planning and execution rely on.
* **Field** — One named column inside a schema. A field has a name and a type.
* **Type System** — The set of value types the engine understands and the rules for how expressions over those values behave. It lets the engine reject invalid queries early and allocate correctly typed outputs.
* **Nullability** — Whether a column or expression may contain missing values. This matters for both semantics and execution because null handling can add branching and bookkeeping.
* **Row** — One logical record in a table. In a row-oriented system, processing often happens one record at a time.
* **Column** — All values for one field across many rows. In a columnar engine, values from the same column are stored together in memory.
* **Column Vector** — An in-memory representation of one column's values. It is the basic runtime container used by columnar engines.
* **Record Batch** — A chunk of data consisting of several equal-length column vectors plus a schema. It is a common unit of execution in modern columnar engines.
* **Apache Arrow** — A standard in-memory columnar format. It gives query engines a shared representation for schemas, arrays, null bitmaps, and record batches.
* **Data Source** — Anything the engine can read from: CSV files, Parquet files, databases, object stores, or in-memory tables. A clean engine usually abstracts these behind a common interface.
* **Scan** — The operator that reads data from a source into the engine. It is usually the leaf node of a query plan.
* **Projection** — Selecting only certain columns or computing new expressions from existing ones. In SQL this is mostly the `SELECT` list.
* **Filter / Selection** — Removing rows that do not satisfy a predicate. In SQL this is usually the `WHERE` clause.
* **Predicate** — A boolean expression used to decide whether a row should be kept, such as `age > 18`.
* **Projection Pushdown** — Asking the data source to read only the needed columns instead of materializing everything first.
* **Predicate Pushdown** — Asking the data source to apply filtering as early as possible, ideally before data is moved into the main engine.
* **Logical Plan** — A representation of what operations are required to answer the query, independent of low-level execution details. It describes intent rather than exact mechanics.
* **Physical Plan** — A concrete executable strategy for the query. It chooses specific operator implementations and an execution order.
* **Optimizer** — The component that rewrites plans into equivalent but more efficient forms. Examples include pruning unused columns, pushing filters down, and reordering joins.
* **Operator** — One node in a plan, such as scan, projection, filter, join, aggregate, or limit.
* **Expression** — A computation inside an operator, such as `price * quantity`, `a = b`, or `SUM(total)`.
* **Join** — Combining rows from two inputs based on some matching condition. This is one of the most important and expensive relational operators.
* **Aggregate** — Collapsing many rows into summary values such as `COUNT`, `SUM`, `MIN`, `MAX`, or grouped results.
* **Execution Model** — The style in which operators run: row-at-a-time, batch-oriented, vectorized, pipelined, or materializing intermediate results.
* **Row-Oriented Execution** — Processing one row at a time through a chain of operators. It is simple to understand but often pays overhead per row.
* **Vectorized Execution** — Processing a batch of values together, typically one column at a time. This reduces per-row overhead and works well with columnar memory layouts.
* **Materialization** — Fully storing an intermediate result before the next operator consumes it. This can simplify execution but costs memory and latency.
* **Pipeline** — A flow where one operator produces data incrementally and the next consumes it without waiting for the entire input to finish.
* **Backend** — The concrete storage or execution target under the engine, such as an in-memory table layer, Parquet files, Postgres, or a distributed service.

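Several of these terms compose naturally: a logical plan is a tree of operators, and operators contain expressions. As a minimal, illustrative sketch (all type names here are hypothetical, not taken from any particular engine):

```kotlin
// Expressions are computations inside operators.
sealed interface Expr
data class ColumnRef(val name: String) : Expr                   // e.g. `age`
data class LiteralInt(val value: Int) : Expr                    // e.g. `18`
data class GreaterThan(val left: Expr, val right: Expr) : Expr  // a predicate

// Operators are nodes in a plan; a logical plan is a tree of them.
sealed interface LogicalPlan
data class Scan(val table: String) : LogicalPlan                // leaf node
data class Filter(val input: LogicalPlan, val predicate: Expr) : LogicalPlan
data class Projection(val input: LogicalPlan, val columns: List<Expr>) : LogicalPlan

// "Keep adults, return their names" as a plan: scan -> filter -> projection.
val plan: LogicalPlan = Projection(
    input = Filter(
        input = Scan("employees"),
        predicate = GreaterThan(ColumnRef("age"), LiteralInt(18)),
    ),
    columns = listOf(ColumnRef("name")),
)
```

An optimizer works by rewriting trees like this one into equivalent but cheaper trees; a physical planner replaces each node with a concrete implementation.
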
## Changelog

* **Mar 31, 2026** -- The first version was created.

diff --git a/hqew/query-engine-primer.md b/hqew/query-engine-primer.md
new file mode 100644
index 0000000..466b1cf
--- /dev/null
+++ b/hqew/query-engine-primer.md
@@ -0,0 +1,262 @@
# Query Engine Primer

A reference for the core ideas behind how a modern query engine works.

---

## Short definition

A query engine takes a declarative request for data and turns it into an executable plan.

The important separation is:

- the user says what result they want
- the engine decides how to compute it

That freedom is what makes optimization possible.

---

## Big picture pipeline

Most query engines follow roughly this shape:

1. accept a query from SQL, a DataFrame API, or some other front-end language
2. parse it into a structured representation
3. build a logical plan describing the required operations
4. optimize or rewrite that plan
5. choose a physical plan with concrete execution operators
6. execute the plan against one or more data sources
7. return the result as rows, batches, or some client-facing format

At a high level, a query engine behaves like a small compiler:

- parsing corresponds to front-end syntax work
- logical planning corresponds to building an internal representation
- optimization corresponds to semantics-preserving rewrites
- physical planning and execution correspond to code generation and runtime

---

## The main pieces

### Parser

The parser turns the query text into a structured form such as an abstract syntax tree. For SQL, this means turning raw text into nodes like `SELECT`, `FROM`, `WHERE`, `GROUP BY`, and expressions.

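As a rough sketch of what a parser might hand to the planner, here is a hypothetical AST type for a simple `SELECT` (the node and field names are invented for illustration; a real parser would also parse the predicate into an expression tree rather than keeping it as text):

```kotlin
// A syntax-level representation: it mirrors the clauses of the query text,
// not relational operator semantics (that comes later, in the logical plan).
data class SqlSelect(
    val projection: List<String>,  // the SELECT list, e.g. ["name"]
    val table: String,             // the FROM clause, e.g. "employees"
    val where: String?,            // the WHERE predicate, e.g. "age > 18"
)

// What parsing "SELECT name FROM employees WHERE age > 18" might yield:
val ast = SqlSelect(listOf("name"), "employees", "age > 18")
```
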
### Logical Plan

The logical plan captures the meaning of the query in terms of relational operators. It usually includes operators such as:

- scan
- projection
- filter
- join
- aggregate
- limit

At this stage the plan says what needs to happen, not exactly how each step will run.

### Optimizer

The optimizer rewrites the logical plan into an equivalent but cheaper form.

Typical optimizations include:

- removing columns that are never used
- pushing filters closer to the data source
- simplifying expressions
- reordering joins
- replacing generic operators with cheaper specialized ones

### Physical Plan

The physical plan chooses specific implementations for each operator.

For example, a logical join might become:

- hash join
- sort-merge join
- nested-loop join

This is where the engine commits to an execution strategy.

### Executor

The executor runs the physical plan. It requests data from child operators, processes it, and produces output for parent operators or the client.

---

## Logical plan vs physical plan

This distinction is one of the most important ideas in query processing.

| Layer         | Main question answered           | Example |
|:--------------|:---------------------------------|:--------|
| Logical plan  | What operations are required?    | Scan `employees`, filter `age > 18`, project `name` |
| Physical plan | How should those operations run? | Use Parquet scan, push down predicate, run vectorized filter |

Logical plans are about meaning.

Physical plans are about execution mechanics.

Keeping those layers separate gives the engine room to optimize without changing the query's semantics.

---

## The data model inside the engine

Modern analytical engines often use a columnar, batch-oriented model rather than processing one row at a time.

The core concepts are:

- `Schema`: the names and types of columns
- `Field`: one column inside a schema
- `ColumnVector`: the in-memory values for one column
- `RecordBatch`: a schema plus a set of equal-length columns

This matters because analytical workloads often read only a few columns from very large datasets. A columnar layout makes those accesses much more efficient.

It also works well with vectorized execution, where one operator applies the same computation across many values in a batch.

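To make those four concepts concrete, here is a deliberately simplified sketch (real engines, such as Arrow-based ones, add null bitmaps, a much richer type set, and off-heap buffers):

```kotlin
enum class DataType { INT32, STRING }                  // a tiny type system

data class Field(val name: String, val type: DataType)
data class Schema(val fields: List<Field>)

// One column's values in memory, one subtype per physical representation.
sealed interface ColumnVector { val size: Int }
class IntVector(val values: IntArray) : ColumnVector {
    override val size get() = values.size
}
class StringVector(val values: List<String?>) : ColumnVector {
    override val size get() = values.size
}

// A record batch: a schema plus equal-length column vectors.
class RecordBatch(val schema: Schema, val columns: List<ColumnVector>) {
    val rowCount: Int get() = if (columns.isEmpty()) 0 else columns.first().size
    init {
        require(columns.all { it.size == rowCount }) {
            "all columns in a batch must have the same length"
        }
    }
}
```

The `require` check captures the record-batch invariant: every column in a batch covers exactly the same rows.
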
---

## Why columnar and batch-oriented execution matter

Compared with row-at-a-time execution, columnar batches have several advantages:

- less overhead per record
- better cache behavior
- better compression
- easier use of SIMD-style operations
- simpler projection of only the needed columns

The tradeoff is that batch-oriented systems can be more complex to build and may be less natural for transactional or record-by-record workloads.

So "best" depends on the workload, but for analytics, columnar batches are a strong default.

---

## The data source boundary

Every engine needs a boundary where external data enters the system.

A good data-source abstraction usually answers two questions:

1. What is the schema?
2. Can you scan the data, ideally with some pushdowns?

This is where the engine starts exploiting source capabilities such as:

- projection pushdown
- predicate pushdown
- partition pruning
- file-format-specific decoding

The abstraction should be simple enough to unify many backends, but rich enough to expose useful performance features.

---

## Common operators

### Scan

Reads data from a source into the engine.

### Projection

Chooses columns or computes derived expressions.

### Filter

Keeps only rows that satisfy a predicate.

### Join

Combines rows from multiple inputs using matching keys or conditions.

### Aggregate

Computes summary results such as counts, sums, averages, mins, and maxes, often grouped by one or more keys.

### Limit

Stops after producing a fixed number of rows.

These operators are simple individually, but query engines become interesting because they compose and can be reordered or fused in many ways.

---

## A tiny end-to-end example

Take this query:

```sql
SELECT name
FROM employees
WHERE age > 18
```

One reasonable logical plan is:

1. scan `employees`
2. filter rows by `age > 18`
3. project `name`

An optimized plan might notice that only `name` and `age` are needed and ask the data source for only those columns.

A physical plan might then choose:

1. Parquet scan with projection pushdown
2. vectorized filter over `age`
3. projection of `name`

The result is the same, but the execution cost can be much lower.

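As a sketch of what the vectorized filter step could look like at runtime, here is the `age > 18` predicate evaluated over a whole column at once (plain Kotlin arrays standing in for real column vectors):

```kotlin
// Evaluate the predicate over the entire batch in one tight loop,
// then use the surviving row indices to project the `name` column.
fun filterAndProject(age: IntArray, name: Array<String>): List<String> {
    val selected = ArrayList<Int>()
    for (i in age.indices) {
        if (age[i] > 18) selected.add(i)   // branches over one column only
    }
    return selected.map { name[it] }       // fetch `name` only for survivors
}

fun main() {
    val age = intArrayOf(17, 22, 19, 12)
    val name = arrayOf("Ana", "Ben", "Cleo", "Dan")
    println(filterAndProject(age, name))   // prints [Ben, Cleo]
}
```

The point is not this exact code but its shape: the predicate touches only the `age` column, and `name` values are materialized only for rows that survive.
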
---

## What the optimizer is really buying

The optimizer is not magic. It only works because the engine has:

- a stable internal representation
- explicit schemas and types
- clear operator semantics
- enough separation between intent and execution

Without those pieces, the engine has very little room to improve the query.

---

## Practical mental model

If you need to explain a query engine in one sentence, this is a good version:

> A query engine is a planner and executor for declarative data operations.

If you need a slightly longer version:

- parsing turns syntax into structure
- planning turns structure into operators
- optimization rewrites operators into a cheaper form
- execution runs those operators over data

That model is simple, but it is enough to orient most of the important design discussions.

---

## Questions to keep in mind

- What is the engine's internal data model?
- Where is the boundary between logical and physical planning?
- What can be pushed down into the data source?
- Is the workload more row-oriented or analytics-oriented?
- What is the unit of execution: rows, batches, or fully materialized tables?

---

## Changelog

* **Mar 31, 2026** -- The first version was created.

diff --git a/scratches/how-query-engines-work-part-1.md b/scratches/how-query-engines-work-part-1.md
new file mode 100644
index 0000000..fe1003f
--- /dev/null
+++ b/scratches/how-query-engines-work-part-1.md
@@ -0,0 +1,246 @@
# How Query Engines Work Part 1

This note covers the foundation chapters from Andy Grove's book _How Query Engines Work_:

- What Is a Query Engine?
- Apache Arrow
- Choosing a Type System
- Data Sources

The main sources were:

- https://howqueryengineswork.com/
- the local companion repo in `tmp/how-query-engines-work`

## Short answer

Part 1 frames a query engine as a specialized compiler for data work.

A user expresses _what_ data they want in a declarative form such as SQL or a DataFrame API, and the engine is responsible for figuring out _how_ to get it efficiently. The foundational design choices are:

- use columnar, batch-oriented data rather than row-at-a-time execution where possible
- make schemas and types explicit
- abstract over data sources behind a common interface
- treat in-memory representation as a first-class architectural decision, not an implementation detail

## Core mental model

The book's basic pipeline is:

1. parse the query text or API calls
2. build an abstract query representation
3. optimize or rewrite that representation
4. execute a concrete plan against data sources

This is why query engines feel compiler-like:

- a declarative input gets translated into an executable form
- the engine can validate before execution
- the engine can rewrite for efficiency without changing semantics

The key difference from a general compiler is that the runtime target is data processing rather than machine code.

## Important terminology

- `declarative query`: a request that says what result is wanted, not how to compute it
- `query engine`: software that turns declarative queries into results
- `parsing`: turning SQL text into a structured representation
- `planning`: deciding which logical operations are needed
- `optimization`: rewriting the plan into a more efficient equivalent
- `execution`: actually running operators over data
- `schema`: the names, types, and nullability of columns
- `field`: one named column in a schema
- `column vector`: one in-memory column of values
- `record batch`: a group of equal-length column vectors processed together
- `data source`: a backend that can expose a schema and stream batches
- `projection`: selecting only the needed columns
- `projection pushdown`: asking the data source to read only required columns
- `row-based execution`: processing one record at a time
- `columnar execution`: storing and often processing values by column
- `vectorized execution`: applying one operation across many values in a batch

## Notes by chapter

### What Is a Query Engine?

The book starts from the simplest possible idea: filtering a collection in code is already a tiny query engine. The real difference in production systems is scale, generality, and optimization.

The most important conceptual shift is:

- application code hard-codes the procedure
- query languages describe the desired result
- the engine gets freedom to choose the execution strategy

That separation matters because the same logical request can run:

- against a small file
- against a large local dataset
- against a distributed cluster

without changing the query text itself.

The book describes the core stages as:

1. parsing
2. planning
3. optimization
4. execution

That is a useful mental model for our own work too. Even if Geolog does not look like SQL, the same split applies: there is still a front-end language, an internal representation, possible rewrites, and some execution mechanism.

### Apache Arrow

The book treats Arrow as the foundation for the engine's in-memory model.

The main reason is not just standardization. It is that query workloads usually touch a subset of columns, not full rows. A columnar layout means:

- better cache locality for the values actually being processed
- better compression characteristics
- better support for SIMD-style vectorized operations

Arrow's core structures are simple but important:

- a data buffer holding values
- a validity bitmap for null tracking
- an offset buffer for variable-width values such as strings

Arrow also introduces the concept that ends up mattering most operationally in the book: the `RecordBatch`, meaning a schema plus a set of equal-length columns processed as one unit.

The key takeaway is that the unit of execution should not be "the whole dataset" or "one row". It should usually be "one batch".

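To make the three buffers concrete, here is a simplified sketch of how a nullable string column holding `["foo", null, "hi"]` could be laid out (plain Kotlin arrays standing in for Arrow's real buffer and bitmap types):

```kotlin
// All string bytes packed back to back in one data buffer.
val data = "foohi".toByteArray(Charsets.UTF_8)

// Value i occupies bytes offsets[i] until offsets[i + 1]; equal offsets mean
// an empty (or null) value. There are n + 1 offsets for n values.
val offsets = intArrayOf(0, 3, 3, 5)

// Validity per value; real Arrow packs this into a bitmap, one bit per value.
val validity = booleanArrayOf(true, false, true)

fun valueAt(i: Int): String? =
    if (!validity[i]) null
    else String(data, offsets[i], offsets[i + 1] - offsets[i], Charsets.UTF_8)

fun main() {
    println((0..2).map { valueAt(it) })  // prints [foo, null, hi]
}
```

Offsets make variable-width values cheap to slice, and the validity bitmap keeps null tracking out of the data buffer itself.
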
### Choosing a Type System

The type-system chapter argues that the engine must know more than raw values. It needs enough metadata to:

- reject invalid expressions early
- determine result types
- allocate correctly typed output storage
- skip unnecessary null checks when nullability is known

The design choice in the book is to build on Arrow's types instead of inventing a proprietary type system. That gives:

- a shared type vocabulary
- direct compatibility with Arrow-based formats and tools
- less impedance mismatch between storage and execution

The local companion code reflects this directly:

- [Schema.kt](/home/hassan/Workspace/useful-notes/tmp/how-query-engines-work/datatypes/src/main/kotlin/Schema.kt)
- [ColumnVector.kt](/home/hassan/Workspace/useful-notes/tmp/how-query-engines-work/datatypes/src/main/kotlin/ColumnVector.kt)
- [RecordBatch.kt](/home/hassan/Workspace/useful-notes/tmp/how-query-engines-work/datatypes/src/main/kotlin/RecordBatch.kt)

Two ideas here are especially worth carrying forward:

- the schema is part of the contract all the way through the engine
- batches and vectors are the runtime currency of execution

The chapter also contrasts row-at-a-time iterator execution with batch-oriented vectorized execution. The row-at-a-time model is simpler, but it pays overhead per row. Batch-oriented processing reduces that overhead and opens the door to data-parallel computation.

### Data Sources

The data-source chapter introduces a very clean abstraction: all backends should answer two questions.

1. What is your schema?
2. Can you scan and stream batches, optionally for only certain columns?

That shows up directly in the local code:

- [DataSource.kt](/home/hassan/Workspace/useful-notes/tmp/how-query-engines-work/datasource/src/main/kotlin/DataSource.kt)

This is the interface:

```kotlin
interface DataSource {
    fun schema(): Schema
    fun scan(projection: List<String>): Sequence<RecordBatch>
}
```

That is a small API, but it carries a lot of architectural weight:

- planning depends on `schema()`
- execution depends on `scan(...)`
- projection pushdown begins at the source boundary
- streaming is built in through `Sequence`

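To see the weight it carries, here is a hypothetical in-memory implementation of that interface. It assumes the book's `Schema`, `Field`, `ColumnVector`, and `RecordBatch` types; the class itself is invented for illustration and is not taken from the companion repo:

```kotlin
// A toy backend: one table held fully in memory as named column vectors.
// Projection pushdown happens inside scan(): only the requested columns
// are ever placed into the batches that flow into the engine.
class InMemoryDataSource(
    private val tableSchema: Schema,
    private val columnsByName: Map<String, ColumnVector>,
) : DataSource {

    override fun schema(): Schema = tableSchema

    override fun scan(projection: List<String>): Sequence<RecordBatch> = sequence {
        // Convention assumed here: an empty projection means "all columns".
        val names = projection.ifEmpty { tableSchema.fields.map { it.name } }
        val fields = names.map { n -> tableSchema.fields.first { it.name == n } }
        // A real source would slice rows into fixed-size batches and might
        // also support predicate pushdown; this sketch yields a single batch.
        yield(RecordBatch(Schema(fields), names.map { columnsByName.getValue(it) }))
    }
}
```

With something like this in place, `scan(listOf("name", "age"))` never materializes any other column, which is exactly where projection pushdown begins.
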
The CSV and Parquet discussion makes the practical tradeoff clear:

- CSV is simple and widespread but weakly typed and expensive to parse
- Parquet is much more engine-friendly because it is columnar, typed, and supports efficient projection

This suggests a general rule: a data-source abstraction should hide source-specific details, but the engine should still exploit source-specific capabilities when they matter for performance.

## What seems most important

These are the main durable concepts from Part 1:

- A query engine is best understood as a planner and executor over declarative requests.
- The in-memory data model is a core design choice, not a storage afterthought.
- Columnar batches are a strong default for analytics-oriented execution.
- Schemas and nullability are execution metadata, not just documentation.
- Data sources should be abstracted behind a common interface.
- Pushdown starts at the boundary with the data source.
- A clean separation between logical intent and physical execution strategy is foundational.

## Relevance to Geolog

There is not a one-to-one mapping from the book to Geolog, but several ideas transfer cleanly.

First, our existing notes already treat the IR as the contract between front-end work and execution work. The book reinforces that this is the right general shape. A query engine needs a representation that is stable enough for validation, rewriting, and execution.

Second, the book's distinction between declarative intent and execution strategy maps well onto Geolog's likely split between:

- source language and elaboration
- lowered IR
- runtime planning and execution

Third, the data-source abstraction looks very relevant if Geolog will query or check laws over multiple backends. A backend-neutral interface should probably expose:

- what data is available
- how its shape is described
- how to stream or otherwise iterate over it
- what pushdowns or source-side evaluation are possible

Fourth, the book is more relational and analytics-oriented than Geolog's logical/chase concerns, so some adaptation is needed. Columnar record batches are a natural fit for relational scans, filters, projections, aggregates, and joins. They are a less obvious fit for:

- existential witness generation
- branching search
- equality merging
- provenance-heavy chase steps

So the likely lesson is not "copy this architecture wholesale". The lesson is that Geolog will probably still need:

- a stable logical representation
- an execution-oriented runtime representation
- a clear backend boundary
- explicit decisions about the unit of execution

## Questions worth keeping open

- Is the first Geolog executor fundamentally row-oriented, set-oriented, or batch-oriented?
- Do we want a strict split between logical planning and physical execution, or a thinner execution layer at first?
- What is the Geolog equivalent of `Schema` and `RecordBatch`?
- What source capabilities do we eventually want to push down: projection, filtering, joins, law checks?
- Is the first backend abstraction only about storage, or also about partial execution?

## Bottom line

Part 1 is mostly about architectural posture.

Before building planners or optimizers, the book insists on getting four things straight:

- what a query engine is
- how data lives in memory
- how types flow through the engine
- how data enters the system

That is a good sequence. It suggests that for our own work, arguments about optimization or execution strategy should come after we are clear on:

- the contract representation
- the runtime unit of data
- the backend boundary
- the basic execution model

## Changelog

* **Mar 31, 2026** -- First version created.