## Query Engine Glossary

* **Query Engine** — A query engine is the component that takes a declarative request for data and produces the result. The user says what they want, and the engine decides how to compute it.
* **Declarative Query** — A query that describes the desired result rather than the exact procedure. SQL is the standard example: you ask for rows matching some condition, not a specific loop nest.
* **Schema** — The structural description of data: column names, column types, and usually nullability. The schema is the contract that planning and execution rely on.
* **Field** — One named column inside a schema. A field has a name and a type.
* **Type System** — The set of value types the engine understands and the rules for how expressions over those values behave. It lets the engine reject invalid queries early and allocate correctly typed outputs.
* **Nullability** — Whether a column or expression may contain missing values. This matters for both semantics and execution because null handling can add branching and bookkeeping.
* **Row** — One logical record in a table. In a row-oriented system, processing often happens one record at a time.
* **Column** — All values for one field across many rows. In a columnar engine, values from the same column are stored together in memory.
* **Column Vector** — An in-memory representation of one column's values. It is the basic runtime container used by columnar engines.
* **Record Batch** — A chunk of data consisting of several equal-length column vectors plus a schema. It is a common unit of execution in modern columnar engines.
* **Apache Arrow** — A standard in-memory columnar format. It gives query engines a shared representation for schemas, arrays, null bitmaps, and record batches.
* **Data Source** — Anything the engine can read from: CSV files, Parquet files, databases, object stores, or in-memory tables. A clean engine usually abstracts these behind a common interface.
* **Scan** — The operator that reads data from a source into the engine. It is usually the leaf node of a query plan.
* **Projection** — Selecting only certain columns or computing new expressions from existing ones. In SQL this is mostly the `SELECT` list.
* **Filter / Selection** — Removing rows that do not satisfy a predicate. In SQL this is usually the `WHERE` clause.
* **Predicate** — A boolean expression used to decide whether a row should be kept, such as `age > 18`.
* **Projection Pushdown** — Asking the data source to read only the needed columns instead of materializing everything first.
* **Predicate Pushdown** — Asking the data source to apply filtering as early as possible, ideally before data is moved into the main engine.
* **Logical Plan** — A representation of what operations are required to answer the query, independent of low-level execution details. It describes intent rather than exact mechanics.
* **Physical Plan** — A concrete executable strategy for the query. It chooses specific operator implementations and an execution order.
* **Optimizer** — The component that rewrites plans into equivalent but more efficient forms. Examples include pruning unused columns, pushing filters down, and reordering joins.
* **Operator** — One node in a plan, such as scan, projection, filter, join, aggregate, or limit.
* **Expression** — A computation inside an operator, such as `price * quantity`, `a = b`, or `SUM(total)`.
* **Join** — Combining rows from two inputs based on some matching condition. This is one of the most important and expensive relational operators.
* **Aggregate** — Collapsing many rows into summary values such as `COUNT`, `SUM`, `MIN`, `MAX`, or grouped results.
* **Execution Model** — The style in which operators run: row-at-a-time, batch-oriented, vectorized, pipelined, or materializing intermediate results.
* **Row-Oriented Execution** — Processing one row at a time through a chain of operators. It is simple to understand but often pays overhead per row.
* **Vectorized Execution** — Processing a batch of values together, typically one column at a time. This reduces per-row overhead and works well with columnar memory layouts.
* **Materialization** — Fully storing an intermediate result before the next operator consumes it. This can simplify execution but costs memory and latency.
* **Pipeline** — A flow where one operator produces data incrementally and the next consumes it without waiting for the entire input to finish.
* **Backend** — The concrete storage or execution target under the engine, such as an in-memory table layer, Parquet files, Postgres, or a distributed service.
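
To make several of the terms above concrete (schema, field, record batch, scan, filter, projection, pipeline), here is a minimal sketch in plain Python. It is an illustration only, not how any particular engine is built: the names `Field`, `Schema`, `RecordBatch`, `scan`, `filter_op`, and `project` are hypothetical, column vectors are ordinary lists, and nulls are modeled as `None` that never satisfies a predicate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    type: type
    nullable: bool = False


@dataclass(frozen=True)
class Schema:
    fields: tuple  # tuple of Field


@dataclass
class RecordBatch:
    schema: Schema
    columns: dict  # column name -> list of values (a toy column vector)


def scan(rows, schema, batch_size):
    """Leaf operator: read row tuples and emit columnar record batches."""
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        cols = {f.name: [r[i] for r in chunk] for i, f in enumerate(schema.fields)}
        yield RecordBatch(schema, cols)


def filter_op(batches, column, predicate):
    """Keep rows whose value in `column` is non-null and satisfies the predicate."""
    for batch in batches:
        mask = [v is not None and predicate(v) for v in batch.columns[column]]
        cols = {
            name: [v for v, keep in zip(values, mask) if keep]
            for name, values in batch.columns.items()
        }
        yield RecordBatch(batch.schema, cols)


def project(batches, names):
    """Keep only the named columns, in the requested order."""
    for batch in batches:
        by_name = {f.name: f for f in batch.schema.fields}
        schema = Schema(tuple(by_name[n] for n in names))
        yield RecordBatch(schema, {n: batch.columns[n] for n in names})


people = Schema((Field("name", str), Field("age", int, nullable=True)))
rows = [("Ada", 36), ("Bob", 17), ("Cy", None), ("Dee", 52)]

# Roughly: SELECT name FROM people WHERE age > 18
plan = project(
    filter_op(scan(rows, people, batch_size=2), "age", lambda age: age > 18),
    ["name"],
)
result = [name for batch in plan for name in batch.columns["name"]]
print(result)  # ['Ada', 'Dee']
```

Because each operator is a generator, batches flow through the chain one at a time, which is the pipelined style described above. Wrapping any stage in `list(...)` before passing it on would materialize that intermediate result instead.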
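
The logical plan, optimizer, and pushdown entries can be sketched in the same spirit. The example below is a toy under stated assumptions: the node classes `Scan`, `Filter`, and `Project` and the function `optimize` are hypothetical, predicates are opaque strings, and the data source is assumed to apply its predicate before returning only the requested columns.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Scan:
    table: str
    columns: Optional[Tuple[str, ...]] = None  # None means "return every column"
    predicate: Optional[str] = None            # filter applied by the source itself


@dataclass(frozen=True)
class Filter:
    input: object  # child plan node
    predicate: str


@dataclass(frozen=True)
class Project:
    input: object  # child plan node
    columns: Tuple[str, ...]


def optimize(plan):
    """Rewrite a plan bottom-up, folding filters and projections into scans."""
    if isinstance(plan, Filter):
        child = optimize(plan.input)
        # Predicate pushdown: let the source do the filtering.
        if isinstance(child, Scan) and child.predicate is None:
            return replace(child, predicate=plan.predicate)
        return Filter(child, plan.predicate)
    if isinstance(plan, Project):
        child = optimize(plan.input)
        # Projection pushdown: ask the source for only the needed columns.
        if isinstance(child, Scan) and child.columns is None:
            return replace(child, columns=plan.columns)
        return Project(child, plan.columns)
    return plan


# Logical plan for roughly: SELECT name FROM people WHERE age > 18
logical = Project(Filter(Scan("people"), "age > 18"), ("name",))
optimized = optimize(logical)
print(optimized)
# Scan(table='people', columns=('name',), predicate='age > 18')
```

The three-node logical plan collapses into a single scan that carries both the column list and the predicate. A real optimizer would also have to check which columns a predicate reads and whether the source can evaluate it, which this sketch skips.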
## Changelog

* **Mar 31, 2026** -- The first version was created.