# How Query Engines Work Part 1

This note covers the foundation chapters from Andy Grove's book _How Query Engines Work_:

- What Is a Query Engine?
- Apache Arrow
- Choosing a Type System
- Data Sources

The main sources were:

- https://howqueryengineswork.com/
- the local companion repo in `tmp/how-query-engines-work`

## Short answer

Part 1 frames a query engine as a specialized compiler for data work. A user expresses _what_ data they want in a declarative form such as SQL or a DataFrame API, and the engine is responsible for figuring out _how_ to get it efficiently.

The foundational design choices are:

- use columnar, batch-oriented data rather than row-at-a-time execution where possible
- make schemas and types explicit
- abstract over data sources behind a common interface
- treat in-memory representation as a first-class architectural decision, not an implementation detail

## Core mental model

The book's basic pipeline is:

1. parse the query text or API calls
2. build an abstract query representation
3. optimize or rewrite that representation
4. execute a concrete plan against data sources

This is why query engines feel compiler-like:

- a declarative input gets translated into an executable form
- the engine can validate before execution
- the engine can rewrite for efficiency without changing semantics

The key difference from a general compiler is that the runtime target is data processing rather than machine code.

## Important terminology

- `declarative query`: a request that says what result is wanted, not how to compute it
- `query engine`: software that turns declarative queries into results
- `parsing`: turning SQL text into a structured representation
- `planning`: deciding which logical operations are needed
- `optimization`: rewriting the plan into a more efficient equivalent
- `execution`: actually running operators over data
- `schema`: the names, types, and nullability of columns
- `field`: one named column in a schema
- `column vector`: one in-memory column of values
- `record batch`: a group of equal-length column vectors processed together
- `data source`: a backend that can expose a schema and stream batches
- `projection`: selecting only the needed columns
- `projection pushdown`: asking the data source to read only required columns
- `row-based execution`: processing one record at a time
- `columnar execution`: storing and often processing values by column
- `vectorized execution`: applying one operation across many values in a batch

## Notes by chapter

### What Is a Query Engine?

The book starts from the simplest possible idea: filtering a collection in code is already a tiny query engine. The real difference in production systems is scale, generality, and optimization.

The most important conceptual shift is:

- application code hard-codes the procedure
- query languages describe the desired result
- the engine gets freedom to choose the execution strategy

That separation matters because the same logical request can run:

- against a small file
- against a large local dataset
- against a distributed cluster

without changing the query text itself.

The book describes the core stages as:

1. parsing
2. planning
3. optimization
4. execution

That is a useful mental model for our own work too. Even if Geolog does not look like SQL, the same split applies: there is still a front-end language, an internal representation, possible rewrites, and some execution mechanism.
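The same split can be sketched in a few lines of Kotlin. This is only an illustration of the parse/plan/optimize/execute stages, not the book's API: `LogicalPlan`, `Scan`, `Filter`, `Projection`, `parse`, and `optimize` are hypothetical stand-ins, and the "parser" just returns a canned plan.

```kotlin
// Hard-coded procedure: the application decides *how* to compute the result.
fun adultNamesHardCoded(people: List<Pair<String, Int>>): List<String> =
    people.filter { (_, age) -> age >= 18 }.map { (name, _) -> name }

// Declarative request: the caller states *what* it wants and the engine
// chooses the strategy. These types are hypothetical stand-ins for the
// book's parse -> plan -> optimize -> execute pipeline.
sealed interface LogicalPlan
data class Scan(val table: String) : LogicalPlan
data class Filter(val input: LogicalPlan, val predicate: String) : LogicalPlan
data class Projection(val input: LogicalPlan, val columns: List<String>) : LogicalPlan

// A real parser would build the plan from the query text; this one is canned.
fun parse(sql: String): LogicalPlan =
    Projection(Filter(Scan("people"), "age >= 18"), listOf("name"))

// Rewrites that preserve semantics, e.g. pushing filters toward the scan.
fun optimize(plan: LogicalPlan): LogicalPlan = plan

fun main() {
    val plan = optimize(parse("SELECT name FROM people WHERE age >= 18"))
    println(plan) // the engine, not the caller, decided this shape
}
```

Nothing about the `Projection(Filter(Scan(...)))` tree says how it will run; that freedom to choose an execution strategy is exactly what the later chapters exploit.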
### Apache Arrow

The book treats Arrow as the foundation for the engine's in-memory model. The main reason is not just standardization. It is that query workloads usually touch a subset of columns, not full rows.

A columnar layout means:

- better cache locality for the values actually being processed
- better compression characteristics
- better support for SIMD-style vectorized operations

Arrow's core structures are simple but important:

- a data buffer holding values
- a validity bitmap for null tracking
- an offset buffer for variable-width values such as strings

Arrow also introduces the concept that ends up mattering most operationally in the book: the `RecordBatch`, meaning a schema plus a set of equal-length columns processed as one unit.

The key takeaway is that the unit of execution should not be "the whole dataset" or "one row". It should usually be "one batch".

### Choosing a Type System

The type-system chapter argues that the engine must know more than raw values. It needs enough metadata to:

- reject invalid expressions early
- determine result types
- allocate correctly typed output storage
- skip unnecessary null checks when nullability is known

The design choice in the book is to build on Arrow's types instead of inventing a proprietary type system. That gives:

- a shared type vocabulary
- direct compatibility with Arrow-based formats and tools
- less impedance mismatch between storage and execution

The local companion code reflects this directly:

- [Schema.kt](/home/hassan/Workspace/useful-notes/tmp/how-query-engines-work/datatypes/src/main/kotlin/Schema.kt)
- [ColumnVector.kt](/home/hassan/Workspace/useful-notes/tmp/how-query-engines-work/datatypes/src/main/kotlin/ColumnVector.kt)
- [RecordBatch.kt](/home/hassan/Workspace/useful-notes/tmp/how-query-engines-work/datatypes/src/main/kotlin/RecordBatch.kt)

Two ideas here are especially worth carrying forward:

- the schema is part of the contract all the way through the engine
- batches and vectors are the runtime currency of execution

The chapter also contrasts row-at-a-time iterator execution with batch-oriented vectorized execution. The row-at-a-time model is simpler, but it pays overhead per row. Batch-oriented processing reduces that overhead and opens the door to data-parallel computation.

### Data Sources

The data-source chapter introduces a very clean abstraction: all backends should answer two questions.

1. What is your schema?
2. Can you scan and stream batches, optionally for only certain columns?

That shows up directly in the local code:

- [DataSource.kt](/home/hassan/Workspace/useful-notes/tmp/how-query-engines-work/datasource/src/main/kotlin/DataSource.kt)

This is the interface:

```kotlin
interface DataSource {
    fun schema(): Schema
    fun scan(projection: List<String>): Sequence<RecordBatch>
}
```

That is a small API, but it carries a lot of architectural weight:

- planning depends on `schema()`
- execution depends on `scan(...)`
- projection pushdown begins at the source boundary
- streaming is built in through the `Sequence` return type

The CSV and Parquet discussion makes the practical tradeoff clear:

- CSV is simple and widespread but weakly typed and expensive to parse
- Parquet is much more engine-friendly because it is columnar, typed, and supports efficient projection

This suggests a general rule: the data-source abstraction should hide source-specific details, but the engine should still exploit source-specific capabilities when they matter for performance.
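To make the pushdown concrete, here is a minimal in-memory data source written against simplified stand-ins for `Schema`, `Field`, `ColumnVector`, and `RecordBatch` (the real definitions in the companion repo's `datatypes` module differ in detail). The point is that the projection list decides which columns ever get materialized.

```kotlin
// Simplified stand-ins for the companion repo's datatypes module.
data class Field(val name: String, val dataType: String)

data class Schema(val fields: List<Field>) {
    // Keep only the requested fields, in the requested order.
    fun select(names: List<String>): Schema =
        Schema(names.map { n -> fields.first { it.name == n } })
}

class ColumnVector(private val values: List<Any?>) {
    fun getValue(i: Int): Any? = values[i]
    fun size(): Int = values.size
}

class RecordBatch(val schema: Schema, val columns: List<ColumnVector>) {
    fun rowCount(): Int = columns.firstOrNull()?.size() ?: 0
}

interface DataSource {
    fun schema(): Schema
    fun scan(projection: List<String>): Sequence<RecordBatch>
}

// An in-memory source: projection pushdown means only the requested
// columns are wrapped into vectors and handed to the engine.
class InMemoryDataSource(
    private val sourceSchema: Schema,
    private val columnData: Map<String, List<Any?>>, // column name -> values
    private val batchSize: Int = 1024
) : DataSource {

    override fun schema(): Schema = sourceSchema

    override fun scan(projection: List<String>): Sequence<RecordBatch> {
        // An empty projection falls back to all columns.
        val names = projection.ifEmpty { sourceSchema.fields.map { it.name } }
        val projected = sourceSchema.select(names)
        val rowCount = columnData.getValue(names.first()).size
        return (0 until rowCount step batchSize).asSequence().map { start ->
            val end = minOf(start + batchSize, rowCount)
            val vectors = names.map { ColumnVector(columnData.getValue(it).subList(start, end)) }
            RecordBatch(projected, vectors)
        }
    }
}

fun main() {
    val source = InMemoryDataSource(
        sourceSchema = Schema(listOf(Field("id", "Int64"), Field("name", "Utf8"), Field("age", "Int64"))),
        columnData = mapOf(
            "id" to listOf(1L, 2L, 3L),
            "name" to listOf("alice", "bob", "carol"),
            "age" to listOf(30L, 40L, 50L)
        )
    )
    // Only the "name" column is ever materialized into batches.
    for (batch in source.scan(listOf("name"))) {
        println("${batch.rowCount()} rows x ${batch.schema.fields.size} column(s)")
    }
}
```

A Parquet-backed implementation would go further and skip reading the unprojected column chunks on disk, which is part of what makes that format engine-friendly.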
## What seems most important

These are the main durable concepts from Part 1:

- A query engine is best understood as a planner and executor over declarative requests.
- The in-memory data model is a core design choice, not a storage afterthought.
- Columnar batches are a strong default for analytics-oriented execution.
- Schemas and nullability are execution metadata, not just documentation.
- Data sources should be abstracted behind a common interface.
- Pushdown starts at the boundary with the data source.
- A clean separation between logical intent and physical execution strategy is foundational.

## Relevance to Geolog

There is not a one-to-one mapping from the book to Geolog, but several ideas transfer cleanly.

First, our existing notes already treat the IR as the contract between front-end work and execution work. The book reinforces that this is the right general shape. A query engine needs a representation that is stable enough for validation, rewriting, and execution.

Second, the book's distinction between declarative intent and execution strategy maps well onto Geolog's likely split between:

- source language and elaboration
- lowered IR
- runtime planning and execution

Third, the data-source abstraction looks very relevant if Geolog will query or check laws over multiple backends. A backend-neutral interface should probably expose:

- what data is available
- how its shape is described
- how to stream or otherwise iterate over it
- what pushdowns or source-side evaluation are possible

Fourth, the book is more relational and analytics-oriented than Geolog's logical/chase concerns, so some adaptation is needed. Columnar record batches are a natural fit for relational scans, filters, projections, aggregates, and joins. They are a less obvious fit for:

- existential witness generation
- branching search
- equality merging
- provenance-heavy chase steps

So the likely lesson is not "copy this architecture wholesale". The lesson is that Geolog will probably still need:

- a stable logical representation
- an execution-oriented runtime representation
- a clear backend boundary
- explicit decisions about the unit of execution

## Questions worth keeping open

- Is the first Geolog executor fundamentally row-oriented, set-oriented, or batch-oriented?
- Do we want a strict split between logical planning and physical execution, or a thinner execution layer at first?
- What is the Geolog equivalent of `Schema` and `RecordBatch`?
- What source capabilities do we eventually want to push down: projection, filtering, joins, law checks?
- Is the first backend abstraction only about storage, or also about partial execution?

## Bottom line

Part 1 is mostly about architectural posture. Before building planners or optimizers, the book insists on getting four things straight:

- what a query engine is
- how data lives in memory
- how types flow through the engine
- how data enters the system

That is a good sequence. It suggests that for our own work, arguments about optimization or execution strategy should come after we are clear on:

- the contract representation
- the runtime unit of data
- the backend boundary
- the basic execution model

## Changelog

* **Mar 31, 2026** -- First version created.