Add preliminary notes on query engines (architecture and internals)

Hassan Abedi 2026-03-31 11:09:56 +02:00
parent 4082b42ac6
commit 7100a757b3
3 changed files with 559 additions and 0 deletions


@@ -0,0 +1,51 @@
## Query Engine Glossary
* **Query Engine** — A query engine is the component that takes a declarative request for data and produces the result. The user says what they
want, and the engine decides how to compute it.
* **Declarative Query** — A query that describes the desired result rather than the exact procedure. SQL is the standard example: you ask for rows
matching some condition, not a specific loop nest.
* **Schema** — The structural description of data: column names, column types, and usually nullability. The schema is the contract that planning and
execution rely on.
* **Field** — One named column inside a schema. A field has a name and a type.
* **Type System** — The set of value types the engine understands and the rules for how expressions over those values behave. It lets the engine
reject invalid queries early and allocate correctly typed outputs.
* **Nullability** — Whether a column or expression may contain missing values. This matters for both semantics and execution because null handling can
add branching and bookkeeping.
* **Row** — One logical record in a table. In a row-oriented system, processing often happens one record at a time.
* **Column** — All values for one field across many rows. In a columnar engine, values from the same column are stored together in memory.
* **Column Vector** — An in-memory representation of one column's values. It is the basic runtime container used by columnar engines.
* **Record Batch** — A chunk of data consisting of several equal-length column vectors plus a schema. It is a common unit of execution in modern
columnar engines.
* **Apache Arrow** — A standard in-memory columnar format. It gives query engines a shared representation for schemas, arrays, null bitmaps, and
record batches.
* **Data Source** — Anything the engine can read from: CSV files, Parquet files, databases, object stores, or in-memory tables. A clean engine
usually abstracts these behind a common interface.
* **Scan** — The operator that reads data from a source into the engine. It is usually the leaf node of a query plan.
* **Projection** — Selecting only certain columns or computing new expressions from existing ones. In SQL this is mostly the `SELECT` list.
* **Filter / Selection** — Removing rows that do not satisfy a predicate. In SQL this is usually the `WHERE` clause.
* **Predicate** — A boolean expression used to decide whether a row should be kept, such as `age > 18`.
* **Projection Pushdown** — Asking the data source to read only the needed columns instead of materializing everything first.
* **Predicate Pushdown** — Asking the data source to apply filtering as early as possible, ideally before data is moved into the main engine.
* **Logical Plan** — A representation of what operations are required to answer the query, independent of low-level execution details. It describes
intent rather than exact mechanics.
* **Physical Plan** — A concrete executable strategy for the query. It chooses specific operator implementations and an execution order.
* **Optimizer** — The component that rewrites plans into equivalent but more efficient forms. Examples include pruning unused columns, pushing filters
down, and reordering joins.
* **Operator** — One node in a plan, such as scan, projection, filter, join, aggregate, or limit.
* **Expression** — A computation inside an operator, such as `price * quantity`, `a = b`, or `SUM(total)`.
* **Join** — Combining rows from two inputs based on some matching condition. This is one of the most important and expensive relational operators.
* **Aggregate** — Collapsing many rows into summary values such as `COUNT`, `SUM`, `MIN`, `MAX`, or grouped results.
* **Execution Model** — The style in which operators run: row-at-a-time, batch-oriented, vectorized, pipelined, or materializing intermediate
results.
* **Row-Oriented Execution** — Processing one row at a time through a chain of operators. It is simple to understand but often pays overhead per row.
* **Vectorized Execution** — Processing a batch of values together, typically one column at a time. This reduces per-row overhead and works well with
columnar memory layouts.
* **Materialization** — Fully storing an intermediate result before the next operator consumes it. This can simplify execution but costs memory and
latency.
* **Pipeline** — A flow where one operator produces data incrementally and the next consumes it without waiting for the entire input to finish.
* **Backend** — The concrete storage or execution target under the engine, such as an in-memory table layer, Parquet files, Postgres, or a
distributed service.
## Changelog
* **Mar 31, 2026** -- The first version was created.

262
hqew/query-engine-primer.md Normal file

@@ -0,0 +1,262 @@
# Query Engine Primer
A reference for the core ideas behind how a modern query engine works.
---
## Short definition
A query engine takes a declarative request for data and turns it into an executable plan.
The important separation is:
- the user says what result they want
- the engine decides how to compute it
That freedom is what makes optimization possible.
---
## Big picture pipeline
Most query engines follow roughly this shape:
1. accept a query from SQL, a DataFrame API, or some other front-end language
2. parse it into a structured representation
3. build a logical plan describing the required operations
4. optimize or rewrite that plan
5. choose a physical plan with concrete execution operators
6. execute the plan against one or more data sources
7. return the result as rows, batches, or some client-facing format
At a high level, a query engine behaves like a small compiler:
- parsing corresponds to front-end syntax work
- logical planning corresponds to building an internal representation
- optimization corresponds to semantics-preserving rewrites
- physical planning and execution correspond to code generation and runtime
---
## The main pieces
### Parser
The parser turns the query text into a structured form such as an abstract syntax tree. For SQL, this means turning raw text into nodes like
`SELECT`, `FROM`, `WHERE`, `GROUP BY`, and expressions.
### Logical Plan
The logical plan captures the meaning of the query in terms of relational operators. It usually includes operators such as:
- scan
- projection
- filter
- join
- aggregate
- limit
At this stage the plan says what needs to happen, not exactly how each step will run.
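To make this concrete, here is a standalone Kotlin sketch of a logical plan as a small tree of operators. The names and shapes (`LogicalPlan`, `Scan`, `Filter`, `Projection`, output columns as plain strings) are simplifications made up for this note, not any engine's real API.
```kotlin
// Standalone sketch: a logical plan as a tree of operators.
// Output schemas are simplified to lists of column names.
sealed interface LogicalPlan {
    fun outputColumns(): List<String>   // what this node produces
    fun children(): List<LogicalPlan>   // what it reads from
}

data class Scan(val table: String, val columns: List<String>) : LogicalPlan {
    override fun outputColumns() = columns
    override fun children() = emptyList<LogicalPlan>()
}

data class Filter(val input: LogicalPlan, val predicate: String) : LogicalPlan {
    override fun outputColumns() = input.outputColumns()   // filtering keeps the shape
    override fun children() = listOf(input)
}

data class Projection(val input: LogicalPlan, val columns: List<String>) : LogicalPlan {
    override fun outputColumns() = columns                  // output is exactly what is projected
    override fun children() = listOf(input)
}
```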
### Optimizer
The optimizer rewrites the logical plan into an equivalent but cheaper form.
Typical optimizations include:
- removing columns that are never used
- pushing filters closer to the data source
- simplifying expressions
- reordering joins
- replacing generic operators with cheaper specialized ones
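As a sketch of what one such rewrite looks like in code, here is a rule in the spirit of projection pushdown, reusing the illustrative plan types from the sketch above; a real optimizer applies many rules across the whole tree until nothing changes.
```kotlin
// Illustrative rule: when a Projection sits directly on a Scan, ask the scan
// for only the projected columns. The general case (pushing through filters,
// joins, and expressions) needs column-usage analysis and is deliberately
// not handled here.
fun pushProjectionIntoScan(plan: LogicalPlan): LogicalPlan = when (plan) {
    is Projection -> when (val child = plan.input) {
        is Scan -> Projection(child.copy(columns = plan.columns), plan.columns)
        else    -> Projection(pushProjectionIntoScan(child), plan.columns)
    }
    is Filter -> Filter(pushProjectionIntoScan(plan.input), plan.predicate)
    is Scan   -> plan
}
```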
### Physical Plan
The physical plan chooses specific implementations for each operator.
For example, a logical join might become:
- hash join
- sort-merge join
- nested-loop join
This is where the engine commits to an execution strategy.
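To show what committing to one implementation means, here is the hash-join idea in a few lines of plain Kotlin, using simple maps as rows rather than the batch types discussed later; it illustrates the algorithm, not an engine-grade operator.
```kotlin
// Build a hash table on one input's join key, then probe it with the other.
// (SQL semantics would additionally drop null join keys.)
fun hashJoin(
    left: List<Map<String, Any?>>,
    right: List<Map<String, Any?>>,
    key: String,
): List<Map<String, Any?>> {
    val built = left.groupBy { it[key] }                   // build phase
    return right.flatMap { row ->                          // probe phase
        (built[row[key]] ?: emptyList()).map { it + row }  // merge matching rows
    }
}
```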
### Executor
The executor runs the physical plan. It requests data from child operators, processes it, and produces output for parent operators or the client.
---
## Logical plan vs physical plan
This distinction is one of the most important ideas in query processing.
| Layer | Main question answered | Example |
|:--------------|:------------------------------------|:--------|
| Logical plan | What operations are required? | Scan `employees`, filter `age > 18`, project `name` |
| Physical plan | How should those operations run? | Use Parquet scan, push down predicate, run vectorized filter |
Logical plans are about meaning.
Physical plans are about execution mechanics.
Keeping those layers separate gives the engine room to optimize without changing the query's semantics.
---
## The data model inside the engine
Modern analytical engines often use a columnar, batch-oriented model rather than processing one row at a time.
The core concepts are:
- `Schema`: the names and types of columns
- `Field`: one column inside a schema
- `ColumnVector`: the in-memory values for one column
- `RecordBatch`: a schema plus a set of equal-length columns
This matters because analytical workloads often read only a few columns from very large datasets. A columnar layout makes those accesses much more
efficient.
It also works well with vectorized execution, where one operator applies the same computation across many values in a batch.
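A minimal Kotlin sketch of these four pieces, using plain lists instead of Arrow buffers, looks like this; the names follow the list above, but the shapes are simplifications for this note.
```kotlin
// Simplified data model: a real engine would back these with Arrow arrays.
data class Field(val name: String, val type: String)
data class Schema(val fields: List<Field>)

// All values of one column for a batch; null entries are the missing values.
class ColumnVector(val field: Field, val values: List<Any?>) {
    fun size() = values.size
}

// A schema plus equal-length column vectors: the unit the engine executes on.
class RecordBatch(val schema: Schema, val columns: List<ColumnVector>) {
    fun rowCount() = if (columns.isEmpty()) 0 else columns[0].size()
}
```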
---
## Why columnar and batch-oriented execution matter
Compared with row-at-a-time execution, columnar batches have several advantages:
- less overhead per record
- better cache behavior
- better compression
- easier use of SIMD-style operations
- simpler projection of only the needed columns
The tradeoff is that batch-oriented systems can be more complex to build and may be less natural for transactional or record-by-record workloads.
So "best" depends on the workload, but for analytics, columnar batches are a strong default.
---
## The data source boundary
Every engine needs a boundary where external data enters the system.
A good data-source abstraction usually answers two questions:
1. What is the schema?
2. Can you scan the data, ideally with some pushdowns?
This is where the engine starts exploiting source capabilities such as:
- projection pushdown
- predicate pushdown
- partition pruning
- file-format-specific decoding
The abstraction should be simple enough to unify many backends, but rich enough to expose useful performance features.
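A sketch of that boundary, reusing the simplified `Schema` and `RecordBatch` types from the data-model sketch above: the interface mirrors the two questions, and `InMemoryTable` is a backend made up for this note to show projection pushdown in miniature.
```kotlin
interface DataSource {
    fun schema(): Schema
    fun scan(projection: List<String>): Sequence<RecordBatch>
}

class InMemoryTable(
    private val tableSchema: Schema,
    private val batches: List<RecordBatch>,
) : DataSource {
    override fun schema() = tableSchema
    // Projection pushdown in miniature: only the requested columns leave the source.
    override fun scan(projection: List<String>): Sequence<RecordBatch> =
        batches.asSequence().map { batch ->
            val kept = batch.columns.filter { it.field.name in projection }
            RecordBatch(Schema(kept.map { it.field }), kept)
        }
}
```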
---
## Common operators
### Scan
Reads data from a source into the engine.
### Projection
Chooses columns or computes derived expressions.
### Filter
Keeps only rows that satisfy a predicate.
### Join
Combines rows from multiple inputs using matching keys or conditions.
### Aggregate
Computes summary results such as counts, sums, averages, mins, and maxes, often grouped by one or more keys.
### Limit
Stops after producing a fixed number of rows.
These operators are simple individually, but query engines become interesting because they compose and can be reordered or fused in many ways.
---
## A tiny end-to-end example
Take this query:
```sql
SELECT name
FROM employees
WHERE age > 18
```
One reasonable logical plan is:
1. scan `employees`
2. filter rows by `age > 18`
3. project `name`
An optimized plan might notice that only `name` and `age` are needed and ask the data source for only those columns.
A physical plan might then choose:
1. Parquet scan with projection pushdown
2. vectorized filter over `age`
3. projection of `name`
The result is the same, but the execution cost can be much lower.
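Written with the illustrative plan types from the logical-plan sketch earlier in this note (the extra `id` and `dept` columns are made up), the before-and-after looks like:
```kotlin
// Before optimization: scan every column, then filter and project.
val naive = Projection(
    Filter(Scan("employees", listOf("id", "name", "age", "dept")), "age > 18"),
    listOf("name"),
)

// After projection pushdown: the scan only reads the two columns that are used.
val optimized = Projection(
    Filter(Scan("employees", listOf("name", "age")), "age > 18"),
    listOf("name"),
)
```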
---
## What the optimizer is really buying
The optimizer is not magic. It only works because the engine has:
- a stable internal representation
- explicit schemas and types
- clear operator semantics
- enough separation between intent and execution
Without those pieces, the engine has very little room to improve the query.
---
## Practical mental model
If you need to explain a query engine in one sentence, this is a good version:
> A query engine is a planner and executor for declarative data operations.
If you need a slightly longer version:
- parsing turns syntax into structure
- planning turns structure into operators
- optimization rewrites operators into a cheaper form
- execution runs those operators over data
That model is simple, but it is enough to orient most of the important design discussions.
---
## Questions to keep in mind
- What is the engine's internal data model?
- Where is the boundary between logical and physical planning?
- What can be pushed down into the data source?
- Is the workload more row-oriented or analytics-oriented?
- What is the unit of execution: rows, batches, or fully materialized tables?
---
## Changelog
* **Mar 31, 2026** -- The first version was created.


@@ -0,0 +1,246 @@
# How Query Engines Work, Part 1
This note covers the foundation chapters from Andy Grove's book _How Query Engines Work_:
- What Is a Query Engine?
- Apache Arrow
- Choosing a Type System
- Data Sources
The main sources were:
- https://howqueryengineswork.com/
- the local companion repo in `tmp/how-query-engines-work`
## Short answer
Part 1 frames a query engine as a specialized compiler for data work.
A user expresses _what_ data they want in a declarative form such as SQL or a DataFrame API, and the engine is responsible for figuring out _how_ to get it efficiently. The foundational design choices are:
- use columnar, batch-oriented data rather than row-at-a-time execution where possible
- make schemas and types explicit
- abstract over data sources behind a common interface
- treat in-memory representation as a first-class architectural decision, not an implementation detail
## Core mental model
The book's basic pipeline is:
1. parse the query text or API calls
2. build an abstract query representation
3. optimize or rewrite that representation
4. execute a concrete plan against data sources
This is why query engines feel compiler-like:
- a declarative input gets translated into an executable form
- the engine can validate before execution
- the engine can rewrite for efficiency without changing semantics
The key difference from a general compiler is that the runtime target is data processing rather than machine code.
## Important terminology
- `declarative query`: a request that says what result is wanted, not how to compute it
- `query engine`: software that turns declarative queries into results
- `parsing`: turning SQL text into a structured representation
- `planning`: deciding which logical operations are needed
- `optimization`: rewriting the plan into a more efficient equivalent
- `execution`: actually running operators over data
- `schema`: the names, types, and nullability of columns
- `field`: one named column in a schema
- `column vector`: one in-memory column of values
- `record batch`: a group of equal-length column vectors processed together
- `data source`: a backend that can expose a schema and stream batches
- `projection`: selecting only the needed columns
- `projection pushdown`: asking the data source to read only required columns
- `row-based execution`: processing one record at a time
- `columnar execution`: storing and often processing values by column
- `vectorized execution`: applying one operation across many values in a batch
## Notes by chapter
### What Is a Query Engine?
The book starts from the simplest possible idea: filtering a collection in code is already a tiny query engine. The real difference in production systems is scale, generality, and optimization.
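As a reminder of how small that starting point is, here is the "tiny query engine" as ordinary application code; `Employee` and the age threshold are made up for this illustration.
```kotlin
// Hard-coded procedure: the programmer decides to iterate, compare, and collect.
// A declarative query states only the desired result and leaves the rest open.
data class Employee(val name: String, val age: Int)

fun adultNames(employees: List<Employee>): List<String> =
    employees.filter { it.age > 18 }.map { it.name }
```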
The most important conceptual shift is:
- application code hard-codes the procedure
- query languages describe the desired result
- the engine gets freedom to choose the execution strategy
That separation matters because the same logical request can run:
- against a small file
- against a large local dataset
- against a distributed cluster
without changing the query text itself.
The book describes the core stages as:
1. parsing
2. planning
3. optimization
4. execution
That is a useful mental model for our own work too. Even if Geolog does not look like SQL, the same split applies: there is still a front-end language, an internal representation, possible rewrites, and some execution mechanism.
### Apache Arrow
The book treats Arrow as the foundation for the engine's in-memory model.
The main reason is not just standardization. It is that query workloads usually touch a subset of columns, not full rows. A columnar layout means:
- better cache locality for the values actually being processed
- better compression characteristics
- better support for SIMD-style vectorized operations
Arrow's core structures are simple but important:
- a data buffer holding values
- a validity bitmap for null tracking
- an offset buffer for variable-width values such as strings
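As a rough illustration (plain Kotlin arrays, not Arrow's real bit-packed buffers), a nullable string column holding `"hi"`, a null, and `"arrow"` maps onto those three structures like this:
```kotlin
val values   = "hiarrow".toByteArray()            // data buffer: all bytes back to back
val offsets  = intArrayOf(0, 2, 2, 7)             // offsets: value i spans offsets[i] until offsets[i + 1]
val validity = booleanArrayOf(true, false, true)  // validity bitmap: false marks a null
```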
Arrow also introduces the concept that matters most operationally in the book: the `RecordBatch`, a schema plus a set of equal-length columns processed as one unit.
The key takeaway is that the unit of execution should not be "the whole dataset" or "one row". It should usually be "one batch".
### Choosing a Type System
The type-system chapter argues that the engine must know more than raw values. It needs enough metadata to:
- reject invalid expressions early
- determine result types
- allocate correctly typed output storage
- skip unnecessary null checks when nullability is known
The design choice in the book is to build on Arrow's types instead of inventing a proprietary type system. That gives:
- a shared type vocabulary
- direct compatibility with Arrow-based formats and tools
- less impedance mismatch between storage and execution
The local companion code reflects this directly:
- [Schema.kt](/home/hassan/Workspace/useful-notes/tmp/how-query-engines-work/datatypes/src/main/kotlin/Schema.kt)
- [ColumnVector.kt](/home/hassan/Workspace/useful-notes/tmp/how-query-engines-work/datatypes/src/main/kotlin/ColumnVector.kt)
- [RecordBatch.kt](/home/hassan/Workspace/useful-notes/tmp/how-query-engines-work/datatypes/src/main/kotlin/RecordBatch.kt)
Two ideas here are especially worth carrying forward:
- the schema is part of the contract all the way through the engine
- batches and vectors are the runtime currency of execution
The chapter also contrasts row-at-a-time iterator execution with batch-oriented vectorized execution. The row-at-a-time model is simpler, but it pays overhead per row. Batch-oriented processing reduces that overhead and opens the door to data-parallel computation.
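A small Kotlin sketch of that contrast, using plain stand-ins rather than the book's Arrow-backed classes:
```kotlin
// Row-at-a-time: one lookup, one cast, one branch per record.
fun filterRows(rows: List<Map<String, Any?>>): List<Map<String, Any?>> =
    rows.filter { (it["age"] as Int) > 18 }

// Vectorized: one tight pass over a column, producing a selection mask that
// downstream operators can apply to every column in the batch.
fun filterColumn(age: IntArray): BooleanArray =
    BooleanArray(age.size) { i -> age[i] > 18 }
```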
### Data Sources
The data-source chapter introduces a very clean abstraction: all backends should answer two questions.
1. What is your schema?
2. Can you scan and stream batches, optionally for only certain columns?
That shows up directly in the local code:
- [DataSource.kt](/home/hassan/Workspace/useful-notes/tmp/how-query-engines-work/datasource/src/main/kotlin/DataSource.kt)
This is the interface:
```kotlin
interface DataSource {
    fun schema(): Schema
    fun scan(projection: List<String>): Sequence<RecordBatch>
}
```
That is a small API, but it carries a lot of architectural weight:
- planning depends on `schema()`
- execution depends on `scan(...)`
- projection pushdown begins at the source boundary
- streaming is built in through `Sequence<RecordBatch>`
The CSV and Parquet discussion makes the practical tradeoff clear:
- CSV is simple and widespread but weakly typed and expensive to parse
- Parquet is much more engine-friendly because it is columnar, typed, and supports efficient projection
This suggests a general rule: data-source abstraction should hide source-specific details, but the engine should still exploit source-specific capabilities when they matter for performance.
## What seems most important
These are the main durable concepts from Part 1:
- A query engine is best understood as a planner and executor over declarative requests.
- The in-memory data model is a core design choice, not a storage afterthought.
- Columnar batches are a strong default for analytics-oriented execution.
- Schemas and nullability are execution metadata, not just documentation.
- Data sources should be abstracted behind a common interface.
- Pushdown starts at the boundary with the data source.
- A clean separation between logical intent and physical execution strategy is foundational.
## Relevance to Geolog
There is not a one-to-one mapping from the book to Geolog, but several ideas transfer cleanly.
First, our existing notes already treat the IR as the contract between front-end work and execution work. The book reinforces that this is the right general shape. A query engine needs a representation that is stable enough for validation, rewriting, and execution.
Second, the book's distinction between declarative intent and execution strategy maps well onto Geolog's likely split between:
- source language and elaboration
- lowered IR
- runtime planning and execution
Third, the data-source abstraction looks very relevant if Geolog will query or check laws over multiple backends. A backend-neutral interface should probably expose:
- what data is available
- how its shape is described
- how to stream or otherwise iterate over it
- what pushdowns or source-side evaluation are possible
Fourth, the book is more relational and analytics-oriented than Geolog's logical/chase concerns, so some adaptation is needed. Columnar record batches are a natural fit for relational scans, filters, projections, aggregates, and joins. They are a less obvious fit for:
- existential witness generation
- branching search
- equality merging
- provenance-heavy chase steps
So the likely lesson is not "copy this architecture wholesale". The lesson is that Geolog will probably still need:
- a stable logical representation
- an execution-oriented runtime representation
- a clear backend boundary
- explicit decisions about the unit of execution
## Questions worth keeping open
- Is the first Geolog executor fundamentally row-oriented, set-oriented, or batch-oriented?
- Do we want a strict split between logical planning and physical execution, or a thinner execution layer at first?
- What is the Geolog equivalent of `Schema` and `RecordBatch`?
- What source capabilities do we eventually want to push down: projection, filtering, joins, law checks?
- Is the first backend abstraction only about storage, or also about partial execution?
## Bottom line
Part 1 is mostly about architectural posture.
Before building planners or optimizers, the book insists on getting four things straight:
- what a query engine is
- how data lives in memory
- how types flow through the engine
- how data enters the system
That is a good sequence. It suggests that for our own work, arguments about optimization or execution strategy should come after we are clear on:
- the contract representation
- the runtime unit of data
- the backend boundary
- the basic execution model
## Changelog
* **Mar 31, 2026** -- The first version was created.