## Query Engine Glossary

* **Query Engine** — A query engine is the component that takes a declarative request for data and produces the result. The user says what they want, and the engine decides how to compute it.
* **Declarative Query** — A query that describes the desired result rather than the exact procedure. SQL is the standard example: you ask for rows matching some condition, not a specific loop nest.
* **Schema** — The structural description of data: column names, column types, and usually nullability. The schema is the contract that planning and execution rely on.
* **Field** — One named column inside a schema. A field has a name and a type.
* **Type System** — The set of value types the engine understands and the rules for how expressions over those values behave. It lets the engine reject invalid queries early and allocate correctly typed outputs.
* **Nullability** — Whether a column or expression may contain missing values. This matters for both semantics and execution because null handling can add branching and bookkeeping.
* **Row** — One logical record in a table. In a row-oriented system, processing often happens one record at a time.
* **Column** — All values for one field across many rows. In a columnar engine, values from the same column are stored together in memory.
* **Column Vector** — An in-memory representation of one column's values. It is the basic runtime container used by columnar engines.
* **Record Batch** — A chunk of data consisting of several equal-length column vectors plus a schema. It is a common unit of execution in modern columnar engines.
* **Apache Arrow** — A standard in-memory columnar format. It gives query engines a shared representation for schemas, arrays, null bitmaps, and record batches.
* **Data Source** — Anything the engine can read from: CSV files, Parquet files, databases, object stores, or in-memory tables. A clean engine usually abstracts these behind a common interface.
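
To make the schema, field, column vector, and record batch entries above concrete, here is a minimal sketch in plain Python. Every class and method name (`Field`, `Schema`, `ColumnVector`, `RecordBatch`, `index_of`, `row`) is a hypothetical illustration and not the API of Apache Arrow or any other library. It assumes missing values are represented as `None`, whereas a real columnar engine would typically use typed buffers and a separate null bitmap.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple


@dataclass(frozen=True)
class Field:
    """One named, typed column in a schema (hypothetical helper)."""
    name: str
    dtype: type
    nullable: bool = True


@dataclass(frozen=True)
class Schema:
    """Ordered collection of fields; the contract for a batch."""
    fields: Tuple[Field, ...]

    def index_of(self, name: str) -> int:
        for i, field in enumerate(self.fields):
            if field.name == name:
                return i
        raise KeyError(name)


@dataclass
class ColumnVector:
    """All values of one field, stored together. None marks a missing value."""
    field: Field
    values: List[Any]

    def __post_init__(self) -> None:
        for v in self.values:
            if v is None:
                if not self.field.nullable:
                    raise ValueError(f"{self.field.name} is not nullable")
            elif not isinstance(v, self.field.dtype):
                raise TypeError(f"{self.field.name} expects {self.field.dtype.__name__}")


@dataclass
class RecordBatch:
    """Equal-length column vectors plus the schema that describes them."""
    schema: Schema
    columns: List[ColumnVector]

    def __post_init__(self) -> None:
        if [c.field for c in self.columns] != list(self.schema.fields):
            raise ValueError("columns do not match schema")
        if len({len(c.values) for c in self.columns}) > 1:
            raise ValueError("columns must have equal length")

    @property
    def num_rows(self) -> int:
        return len(self.columns[0].values) if self.columns else 0

    def column(self, name: str) -> ColumnVector:
        return self.columns[self.schema.index_of(name)]

    def row(self, i: int) -> tuple:
        # Reassemble one logical record from the columnar layout.
        return tuple(c.values[i] for c in self.columns)


schema = Schema((Field("id", int, nullable=False), Field("price", float)))
batch = RecordBatch(
    schema,
    [
        ColumnVector(schema.fields[0], [1, 2, 3]),
        ColumnVector(schema.fields[1], [9.5, None, 4.0]),
    ],
)
```

The validation in `__post_init__` is one way to show why the schema acts as a contract: a batch that violates the declared types, nullability, or equal-length rule is rejected before any operator sees it.
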
* **Scan** — The operator that reads data from a source into the engine. It is usually the leaf node of a query plan.
* **Projection** — Selecting only certain columns or computing new expressions from existing ones. In SQL this is mostly the `SELECT` list.
* **Filter / Selection** — Removing rows that do not satisfy a predicate. In SQL this is usually the `WHERE` clause.
* **Predicate** — A boolean expression used to decide whether a row should be kept, such as `age > 18`.
* **Projection Pushdown** — Asking the data source to read only the needed columns instead of materializing everything first.
* **Predicate Pushdown** — Asking the data source to apply filtering as early as possible, ideally before data is moved into the main engine.
* **Logical Plan** — A representation of what operations are required to answer the query, independent of low-level execution details. It describes intent rather than exact mechanics.
* **Physical Plan** — A concrete executable strategy for the query. It chooses specific operator implementations and an execution order.
* **Optimizer** — The component that rewrites plans into equivalent but more efficient forms. Examples include pruning unused columns, pushing filters down, and reordering joins.
* **Operator** — One node in a plan, such as scan, projection, filter, join, aggregate, or limit.
* **Expression** — A computation inside an operator, such as `price * quantity`, `a = b`, or `SUM(total)`.
* **Join** — Combining rows from two inputs based on some matching condition. This is one of the most important and expensive relational operators.
* **Aggregate** — Collapsing many rows into summary values such as `COUNT`, `SUM`, `MIN`, `MAX`, or grouped results.
* **Execution Model** — The style in which operators run: row-at-a-time, batch-oriented, vectorized, pipelined, or materializing intermediate results.
* **Row-Oriented Execution** — Processing one row at a time through a chain of operators. It is simple to understand but often pays overhead per row.
* **Vectorized Execution** — Processing a batch of values together, typically one column at a time. This reduces per-row overhead and works well with columnar memory layouts.
* **Materialization** — Fully storing an intermediate result before the next operator consumes it. This can simplify execution but costs memory and latency.
* **Pipeline** — A flow where one operator produces data incrementally and the next consumes it without waiting for the entire input to finish.
* **Backend** — The concrete storage or execution target under the engine, such as an in-memory table layer, Parquet files, Postgres, or a distributed service.

## Changelog

* **Mar 31, 2026** -- The first version was created.