useful-notes/hqew/017-expressions-types-and-nullability.md

# Expressions, Types, and Nullability

A reference for how query engines represent computations over data and why types and nulls shape both planning and execution.

---

## Short answer

Expressions are the smallest units of query computation.

Operators such as projection, filter, join, and aggregate do not do useful work by themselves. They rely on expressions to answer questions like:

- which column is being referenced?
- what value does this literal represent?
- how are two values compared?
- how is a new value computed?
- what type does the result have?
- how should nulls be handled?

In the companion engine, expressions exist in two layers:

- logical expressions for planning
- physical expressions for execution

Types and nullability matter in both layers.

---

## Why expressions deserve their own note

It is easy to talk about query engines only in terms of operators:

- scan
- filter
- join
- aggregate

But operators are really containers for expression evaluation.

For example:

- a filter is only meaningful because it evaluates a boolean predicate
- a projection is only meaningful because it evaluates output expressions
- an aggregate is only meaningful because it evaluates aggregate inputs and grouping keys

So expressions are not side details. They are a large part of the engine's real semantics.

---

## The two-layer model

Relevant modules:

- `tmp/how-query-engines-work/logical-plan/src/main/kotlin/LogicalExpr.kt`
- `tmp/how-query-engines-work/logical-plan/src/main/kotlin/Expressions.kt`
- `tmp/how-query-engines-work/physical-plan/src/main/kotlin/expressions/Expressions.kt`
- `tmp/how-query-engines-work/physical-plan/src/main/kotlin/expressions/BinaryExpression.kt`
- `tmp/how-query-engines-work/physical-plan/src/main/kotlin/expressions/CastExpression.kt`
- `tmp/how-query-engines-work/physical-plan/src/main/kotlin/expressions/AggregateExpression.kt`
- `tmp/how-query-engines-work/datatypes/src/main/kotlin/Schema.kt`
- `tmp/how-query-engines-work/datatypes/src/main/kotlin/ColumnVector.kt`
- `tmp/how-query-engines-work/datatypes/src/main/kotlin/RecordBatch.kt`
- `tmp/how-query-engines-work/datatypes/src/main/kotlin/ArrowTypes.kt`

The key distinction is:

- logical expressions answer planning questions
- physical expressions answer runtime evaluation questions

Logical expressions expose:

- names
- output types
- output fields
- planning-time structure

Physical expressions expose:

- `evaluate(input: RecordBatch): ColumnVector`

This split keeps naming and schema concerns in the planner and leaves the runtime with a single job: turning a batch into a vector.
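To make the split concrete, here is a minimal Kotlin sketch of the two interfaces side by side. Everything in it is a simplified stand-in rather than the companion engine's actual code: `DataType` replaces Arrow types, `ColumnVector` wraps a plain list, and `toField` takes a `Schema` directly, which is an assumption of this sketch.

```kotlin
// Simplified stand-ins for illustration only; not the companion engine's classes.
enum class DataType { INT64, FLOAT64, STRING, BOOL }
data class Field(val name: String, val type: DataType)
data class Schema(val fields: List<Field>)
class ColumnVector(val type: DataType, val values: List<Any?>)
class RecordBatch(val schema: Schema, val columns: List<ColumnVector>) {
    val rowCount: Int get() = columns.firstOrNull()?.values?.size ?: 0
}

// Planning layer: describes the output and touches no data.
interface LogicalExpr {
    fun toField(input: Schema): Field
}

// Execution layer: consumes a batch and produces a vector.
interface PhysicalExpr {
    fun evaluate(input: RecordBatch): ColumnVector
}

class LiteralLong(val n: Long) : LogicalExpr {
    override fun toField(input: Schema): Field = Field(n.toString(), DataType.INT64)
}

class LiteralLongExpression(val n: Long) : PhysicalExpr {
    // A literal is repeated once per row so that it lines up with other vectors.
    override fun evaluate(input: RecordBatch): ColumnVector =
        ColumnVector(DataType.INT64, List(input.rowCount) { n })
}

fun main() {
    val schema = Schema(listOf(Field("age", DataType.INT64)))
    val batch = RecordBatch(schema, listOf(ColumnVector(DataType.INT64, listOf(30L, 41L, null))))
    println(LiteralLong(1).toField(schema))                  // planning: a name and a type
    println(LiteralLongExpression(1).evaluate(batch).values) // execution: one value per row
}
```

The same literal appears twice on purpose: once as a description the planner can reason about, once as something that can actually fill a vector.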

---

## Logical expressions

Logical expressions represent computation before execution details are fixed.

Examples include:

- `Column`
- `ColumnIndex`
- `LiteralString`
- `LiteralLong`
- `LiteralDouble`
- `LiteralDate`
- `LiteralIntervalDays`
- `CastExpr`
- `Eq`, `Gt`, `LtEq`
- `And`, `Or`, `Not`
- `Add`, `Subtract`, `Multiply`, `Divide`
- aggregate expressions such as `Sum`, `Avg`, `Count`
- aliasing via `Alias`

The defining feature of a logical expression is `toField(input)`.

That means the expression can answer:

- what field name will this produce?
- what data type will it produce?

So logical expressions are where planning gets its schema information.
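As a rough illustration of how `toField` feeds schema information to the planner, the sketch below derives a projection's output schema from its expressions. The class names echo the list above, but the bodies, the `DataType` enum, the field name `"gt"`, and the helper `projectionSchema` are assumptions made for this example.

```kotlin
// Simplified stand-ins; the real engine uses Arrow types rather than this enum.
enum class DataType { INT64, FLOAT64, STRING, BOOL }
data class Field(val name: String, val type: DataType)
data class Schema(val fields: List<Field>)

interface LogicalExpr {
    fun toField(input: Schema): Field
}

data class Column(val name: String) : LogicalExpr {
    override fun toField(input: Schema): Field =
        input.fields.firstOrNull { it.name == name }
            ?: throw IllegalArgumentException("No column named '$name'")
}

data class LiteralLong(val n: Long) : LogicalExpr {
    override fun toField(input: Schema): Field = Field(n.toString(), DataType.INT64)
}

data class Gt(val l: LogicalExpr, val r: LogicalExpr) : LogicalExpr {
    // A comparison yields a boolean whatever its operand types are.
    override fun toField(input: Schema): Field = Field("gt", DataType.BOOL)
}

data class Alias(val expr: LogicalExpr, val alias: String) : LogicalExpr {
    // Aliasing renames the field but keeps the underlying type.
    override fun toField(input: Schema): Field = Field(alias, expr.toField(input).type)
}

// Hypothetical helper: a projection's output schema is the fields of its expressions.
fun projectionSchema(input: Schema, exprs: List<LogicalExpr>): Schema =
    Schema(exprs.map { it.toField(input) })

fun main() {
    val input = Schema(listOf(Field("name", DataType.STRING), Field("age", DataType.INT64)))
    val exprs = listOf(Column("name"), Alias(Gt(Column("age"), LiteralLong(18)), "is_adult"))
    println(projectionSchema(input, exprs))
}
```

No data is read anywhere in this sketch, which is the point: the output schema is known before execution starts.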

---

## Physical expressions

Physical expressions are executable.

They evaluate over an input `RecordBatch` and produce a `ColumnVector`.

That means even a scalar-looking expression such as:

```sql
age + 1
```

does not produce one scalar at runtime. It produces a full output vector for the batch.
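A minimal sketch of that behaviour, assuming a batch whose first column is `age`. The classes are simplified stand-ins holding `List<Long?>` instead of Arrow vectors, and the null-in, null-out rule for addition is an assumption that follows common SQL behaviour.

```kotlin
// Simplified stand-ins: a vector here is just a list of nullable longs.
class ColumnVector(val values: List<Long?>)
class RecordBatch(val columns: List<ColumnVector>) {
    val rowCount: Int get() = columns.first().values.size
}

interface Expression {
    fun evaluate(input: RecordBatch): ColumnVector
}

class ColumnExpression(val index: Int) : Expression {
    override fun evaluate(input: RecordBatch): ColumnVector = input.columns[index]
}

class LiteralLongExpression(val value: Long) : Expression {
    // The literal is expanded to one value per row in the batch.
    override fun evaluate(input: RecordBatch): ColumnVector =
        ColumnVector(List(input.rowCount) { value })
}

class AddExpression(val l: Expression, val r: Expression) : Expression {
    override fun evaluate(input: RecordBatch): ColumnVector {
        val left = l.evaluate(input).values
        val right = r.evaluate(input).values
        // Element-wise add; a null on either side yields null (assumed behaviour).
        return ColumnVector(left.zip(right) { a, b -> if (a == null || b == null) null else a + b })
    }
}

fun main() {
    val batch = RecordBatch(listOf(ColumnVector(listOf(30L, 41L, null))))
    val agePlusOne = AddExpression(ColumnExpression(0), LiteralLongExpression(1))
    println(agePlusOne.evaluate(batch).values) // [31, 42, null]
}
```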

Examples of physical expressions in the companion engine:

- `ColumnExpression`
- literal expressions
- comparison expressions
- boolean expressions
- arithmetic expressions
- `CastExpression`
- date/interval expressions
- aggregate runtime expressions via accumulator-based machinery

Every one of these takes a batch and returns a vector, which is one of the clearest signs that the engine is batch-oriented rather than row-oriented.

---

## The runtime currency: columns and batches

Expression evaluation works because the runtime model is built around:

- `Schema`
- `Field`
- `ColumnVector`
- `RecordBatch`

`Schema` holds the list of fields.

Each `Field` has:

- a name
- an Arrow data type

`ColumnVector` provides:

- its Arrow type
- random access to values
- a length

`RecordBatch` groups equal-length column vectors under one schema.

So the actual runtime unit of expression evaluation is not "a row object." It is "a batch plus one or more vectors derived from it."
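The four types can be sketched in a few lines of Kotlin. This is an illustrative model only: `DataType` stands in for Arrow types, `ListVector` is a hypothetical list-backed vector, and the method names `getValue` and `field` are assumptions of this sketch.

```kotlin
// Simplified stand-ins for the runtime model described above.
enum class DataType { INT64, STRING }
data class Field(val name: String, val type: DataType)
data class Schema(val fields: List<Field>)

// Assumed minimal vector contract: a type, random access, and a length.
interface ColumnVector {
    val type: DataType
    val size: Int
    fun getValue(i: Int): Any?
}

// Hypothetical list-backed implementation used only for this example.
class ListVector(override val type: DataType, private val data: List<Any?>) : ColumnVector {
    override val size: Int get() = data.size
    override fun getValue(i: Int): Any? = data[i]
}

class RecordBatch(val schema: Schema, val columns: List<ColumnVector>) {
    init {
        require(columns.size == schema.fields.size) { "one vector per field" }
        require(columns.map { it.size }.distinct().size <= 1) { "vectors must have equal length" }
    }
    val rowCount: Int get() = columns.firstOrNull()?.size ?: 0
    fun field(i: Int): ColumnVector = columns[i]
}

fun main() {
    val schema = Schema(listOf(Field("name", DataType.STRING), Field("age", DataType.INT64)))
    val batch = RecordBatch(
        schema,
        listOf(
            ListVector(DataType.STRING, listOf("ann", "bob")),
            ListVector(DataType.INT64, listOf(30L, null))
        )
    )
    // A "row" only exists as the same index read across every column.
    println((0 until batch.rowCount).map { r -> batch.columns.map { it.getValue(r) } })
}
```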

---

## Why types matter in planning

Types matter at planning time because the engine needs to know whether expressions make sense before it tries to run them.

Examples:

- a referenced column must exist in the input schema
- a cast target must be supported
- a comparison should have operands with compatible meaning
- an aggregate result has a particular output type

The companion engine uses Arrow types as the type vocabulary.

That is a strong design choice because it avoids inventing a parallel type universe disconnected from the runtime representation.

So when planning computes a `Field`, it is already aligning:

- expression semantics
- schema metadata
- runtime representation
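The planning-time checks listed above might look something like the following sketch. The functions `resolve` and `checkEq` are hypothetical, and the compatibility rule (same type, or both numeric) is an assumption chosen for illustration rather than the companion engine's actual rule.

```kotlin
// Simplified stand-ins; DataType replaces Arrow types for brevity.
enum class DataType { INT64, FLOAT64, STRING, BOOL }
data class Field(val name: String, val type: DataType)
data class Schema(val fields: List<Field>)

fun isNumeric(t: DataType): Boolean = t == DataType.INT64 || t == DataType.FLOAT64

// Resolve a column reference against the input schema, or fail during planning.
fun resolve(schema: Schema, name: String): Field =
    schema.fields.firstOrNull { it.name == name }
        ?: throw IllegalStateException("unknown column '$name'")

// Type-check `left = right` and return the field the comparison would produce.
fun checkEq(schema: Schema, left: String, right: String): Field {
    val l = resolve(schema, left)
    val r = resolve(schema, right)
    val compatible = l.type == r.type || (isNumeric(l.type) && isNumeric(r.type))
    check(compatible) { "cannot compare ${l.type} with ${r.type}" }
    return Field("$left = $right", DataType.BOOL)
}

fun main() {
    val schema = Schema(
        listOf(Field("age", DataType.INT64), Field("score", DataType.FLOAT64), Field("name", DataType.STRING))
    )
    println(checkEq(schema, "age", "score"))
    println(runCatching { checkEq(schema, "age", "name") }.exceptionOrNull()?.message)
    println(runCatching { checkEq(schema, "age", "salary") }.exceptionOrNull()?.message)
}
```

Both failures are reported without reading a single row, which is what makes them planning errors rather than runtime errors.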

---

## Why types matter in execution

At runtime, types affect:

- which physical vectors get allocated
- how values are read and written
- what coercions are legal
- what aggregate logic is valid
- how nulls are represented

For example:

- `CastExpression` allocates an output vector of the requested Arrow type
- selection expects a boolean `BitVector`
- aggregate accumulators dispatch on numeric runtime types
- numeric binary expressions may coerce mismatched numeric inputs to `Double`

So types are not paperwork. They determine concrete execution behavior.
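One way to picture the selection case is a filter that refuses any selection vector that is not boolean. The sketch below uses list-backed stand-ins instead of Arrow's `BitVector`, and treating a null selection value like false is an assumption of this example.

```kotlin
// Simplified stand-ins; a real engine would use Arrow vectors here.
enum class DataType { INT64, STRING, BOOL }
class ColumnVector(val type: DataType, val values: List<Any?>)

// Keep the rows whose selection value is true.
// Null selection values are dropped along with false ones (assumed behaviour).
fun filter(column: ColumnVector, selection: ColumnVector): ColumnVector {
    require(selection.type == DataType.BOOL) { "selection must be boolean, got ${selection.type}" }
    require(selection.values.size == column.values.size) { "length mismatch" }
    val kept = column.values.filterIndexed { i, _ -> selection.values[i] == true }
    // The output vector keeps the input column's type.
    return ColumnVector(column.type, kept)
}

fun main() {
    val names = ColumnVector(DataType.STRING, listOf("ann", "bob", "cy"))
    val selection = ColumnVector(DataType.BOOL, listOf(true, false, null))
    println(filter(names, selection).values) // [ann]
}
```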

---

## Type coercion

The companion engine has limited but explicit type coercion behavior.

One notable example is `BinaryExpression`:

1. evaluate the left vector
2. evaluate the right vector
3. if their Arrow types differ, try to coerce both to `Double` when both are numeric

This is a pragmatic runtime decision.

It shows two things clearly:

- expression evaluation often needs type-reconciliation logic
- even a small engine needs rules for mixed-type arithmetic and comparison

It also shows a current simplification: coercion is narrow and mostly numeric. A production engine would have a broader, more principled coercion matrix.
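The coercion rule can be sketched as follows. This is a simplified model, not the engine's `BinaryExpression`: vectors are list-backed, `DataType` stands in for Arrow types, and the helper names `coerce`, `toDouble`, and `add` are hypothetical.

```kotlin
// Simplified stand-ins for illustration.
enum class DataType { INT64, FLOAT64, STRING }
class ColumnVector(val type: DataType, val values: List<Any?>)

fun isNumeric(t: DataType): Boolean = t == DataType.INT64 || t == DataType.FLOAT64

// Widen every non-null numeric value to Double, keeping nulls as nulls.
fun toDouble(v: ColumnVector): ColumnVector =
    ColumnVector(DataType.FLOAT64, v.values.map { (it as Number?)?.toDouble() })

// Reconcile operand types before the operator itself runs.
fun coerce(l: ColumnVector, r: ColumnVector): Pair<ColumnVector, ColumnVector> = when {
    l.type == r.type -> l to r
    isNumeric(l.type) && isNumeric(r.type) -> toDouble(l) to toDouble(r)
    else -> throw IllegalStateException("cannot reconcile ${l.type} and ${r.type}")
}

fun add(l: ColumnVector, r: ColumnVector): ColumnVector {
    val (a, b) = coerce(l, r)
    require(isNumeric(a.type)) { "add needs numeric operands" }
    val out: List<Any?> = a.values.zip(b.values) { x, y ->
        when {
            x == null || y == null -> null
            a.type == DataType.FLOAT64 -> (x as Number).toDouble() + (y as Number).toDouble()
            else -> (x as Number).toLong() + (y as Number).toLong()
        }
    }
    return ColumnVector(a.type, out)
}

fun main() {
    val ints = ColumnVector(DataType.INT64, listOf(1L, 2L, null))
    val doubles = ColumnVector(DataType.FLOAT64, listOf(0.5, 1.5, 2.0))
    println(add(ints, doubles).values) // [1.5, 3.5, null]
    println(add(ints, ints).values)    // [2, 4, null]
}
```

Widening everything to `Double` is simple but lossy for 64-bit integers beyond 2^53, which is one reason a fuller coercion matrix would pick the narrowest common type instead.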

---

## Casting

`CastExpression` makes type conversion explicit.

It supports a range of targets such as:

- integer widths
- floating-point types
- string

At runtime it:

1. evaluates the source expression to a column vector
2. allocates a destination vector of the target type
3. converts each value
4. preserves nulls rather than coercing them into non-null sentinel values

That last point matters. Nulls are not just a parsing concern. They must survive evaluation correctly through type conversion.
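A minimal sketch of steps 2 to 4, assuming the source expression has already been evaluated to a vector. The `cast` and `convert` functions and the three-member `DataType` enum are hypothetical simplifications, not the engine's `CastExpression`.

```kotlin
// Simplified stand-ins for illustration.
enum class DataType { INT64, FLOAT64, STRING }
class ColumnVector(val type: DataType, val values: List<Any?>)

// Convert one non-null value to the target type.
fun convert(v: Any, target: DataType): Any = when (target) {
    DataType.INT64 -> if (v is Number) v.toLong() else v.toString().toLong()
    DataType.FLOAT64 -> if (v is Number) v.toDouble() else v.toString().toDouble()
    DataType.STRING -> v.toString()
}

fun cast(source: ColumnVector, target: DataType): ColumnVector {
    // Step 2: allocate the destination.
    val out = ArrayList<Any?>(source.values.size)
    for (v in source.values) {
        // Steps 3 and 4: convert present values, pass nulls through untouched.
        out.add(if (v == null) null else convert(v, target))
    }
    return ColumnVector(target, out)
}

fun main() {
    val strings = ColumnVector(DataType.STRING, listOf("1", null, "42"))
    val longs = cast(strings, DataType.INT64)
    println(longs.values)                         // [1, null, 42]
    println(cast(longs, DataType.FLOAT64).values) // [1.0, null, 42.0]
}
```

Note that the null never becomes `0` or an empty string. A sentinel would be indistinguishable from a real value once the cast has run.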

---

## Aggregate expressions and accumulators

Aggregates are special because their runtime behavior is stateful.

The physical layer separates two concerns:

- an `AggregateExpression` knows its input expression and how to create an accumulator
- an `Accumulator` maintains per-group state across many rows

Examples:

- `CountAccumulator` counts non-null inputs
- `SumAccumulator` adds values using type-dependent numeric logic
- `AvgAccumulator` tracks both sum and count

This design matters because aggregate evaluation is not just "run an expression on a batch." It is "maintain per-group state across many rows and many batches."
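The accumulator idea can be sketched as below. The interface shape (`accumulate`, `finalValue`), the `aggregate` driver, and the choice to sum in `Double` are all assumptions made to keep the example short; the source only says the real `SumAccumulator` uses type-dependent numeric logic.

```kotlin
// Assumed accumulator contract for this sketch.
interface Accumulator {
    fun accumulate(value: Any?)
    fun finalValue(): Any?
}

class CountAccumulator : Accumulator {
    private var count = 0L
    override fun accumulate(value: Any?) { if (value != null) count++ } // nulls are not counted
    override fun finalValue(): Any? = count
}

class SumAccumulator : Accumulator {
    private var sum: Double? = null // stays null until a non-null input arrives
    override fun accumulate(value: Any?) {
        if (value is Number) sum = (sum ?: 0.0) + value.toDouble()
    }
    override fun finalValue(): Any? = sum
}

class AvgAccumulator : Accumulator {
    private var sum = 0.0
    private var count = 0L
    override fun accumulate(value: Any?) {
        if (value is Number) { sum += value.toDouble(); count++ }
    }
    override fun finalValue(): Any? = if (count == 0L) null else sum / count
}

// Hypothetical driver: one accumulator per group key, fed across every batch.
fun aggregate(batches: List<List<Pair<String, Long?>>>, create: () -> Accumulator): Map<String, Any?> {
    val state = LinkedHashMap<String, Accumulator>()
    for (batch in batches) {
        for ((key, value) in batch) {
            state.getOrPut(key, create).accumulate(value)
        }
    }
    return state.mapValues { it.value.finalValue() }
}

fun main() {
    val batches = listOf(
        listOf("a" to 1L, "b" to 10L, "a" to null),
        listOf<Pair<String, Long?>>("a" to 3L)
    )
    println(aggregate(batches) { CountAccumulator() }) // {a=2, b=1}
    println(aggregate(batches) { SumAccumulator() })   // {a=4.0, b=10.0}
    println(aggregate(batches) { AvgAccumulator() })   // {a=2.0, b=10.0}
}
```

The state for group `a` survives from the first batch to the second, which is exactly what a plain `evaluate(batch)` call cannot express.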

---

## Nullability in practice

The notes in `hqew/001` and `hqew/014` already say nullability matters. The code shows why.

Null handling appears in several concrete ways:

- CSV readers call `setNull(...)` when cells are empty
- cast logic preserves nulls
- `COUNT` increments only for non-null values
- arithmetic and aggregate logic must decide what to do when values are null
- Arrow vectors use validity information to distinguish null from present values

That means nullability is part of:

- source ingestion
- expression semantics
- aggregate semantics
- output materialization

It is not just an annotation on a schema.
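To show how a null travels from ingestion into a vector, here is a small sketch of a column with a separate validity bitmap, loosely modelled on the idea behind Arrow's validity information. `LongColumn` and `readLongColumn` are hypothetical names, and `java.util.BitSet` is used only as a convenient stand-in for a real validity buffer.

```kotlin
import java.util.BitSet

// A column of longs plus a validity bitmap: bit set means "value present".
class LongColumn(val size: Int) {
    private val data = LongArray(size)
    private val validity = BitSet(size)

    fun set(i: Int, v: Long) { data[i] = v; validity.set(i) }
    fun setNull(i: Int) { validity.clear(i) }
    fun isNull(i: Int): Boolean = !validity.get(i)
    fun get(i: Int): Long? = if (isNull(i)) null else data[i]
}

// Empty CSV cells become nulls rather than zeros.
fun readLongColumn(cells: List<String>): LongColumn {
    val col = LongColumn(cells.size)
    cells.forEachIndexed { i, cell ->
        if (cell.isEmpty()) col.setNull(i) else col.set(i, cell.toLong())
    }
    return col
}

fun main() {
    val col = readLongColumn("30,,41".split(","))
    println((0 until col.size).map { col.get(it) })           // [30, null, 41]
    println((0 until col.size).count { !col.isNull(it) })     // COUNT-style: 2
}
```

The data slot behind the null still holds some long value, but readers must consult the validity bit first, so that value is never observed.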

---

## Three-valued logic and current simplification

A production SQL engine typically has careful three-valued logic rules for:

- `TRUE`
- `FALSE`
- `NULL`

The companion engine is simpler and more pedagogical. It uses Arrow and null-aware value handling, but it does not attempt the full depth of industrial SQL null semantics, and that gap is worth stating explicitly.

The important lesson is:

- null semantics are one of the places where "small query engine" and "full SQL engine" diverge quickly

So any future engine work should treat null behavior as a first-class semantics question, not a cleanup item.
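For reference, the standard SQL truth tables for `AND`, `OR`, and `NOT` can be written down compactly using Kotlin's `Boolean?`, with `null` standing in for SQL `NULL`. This sketch describes standard SQL behaviour; it is not a claim about what the companion engine implements, and the function names are made up for the example.

```kotlin
// SQL three-valued logic, with Boolean? standing in for TRUE / FALSE / NULL.
fun and3(a: Boolean?, b: Boolean?): Boolean? = when {
    a == false || b == false -> false // FALSE dominates, even against NULL
    a == null || b == null -> null
    else -> true
}

fun or3(a: Boolean?, b: Boolean?): Boolean? = when {
    a == true || b == true -> true    // TRUE dominates, even against NULL
    a == null || b == null -> null
    else -> false
}

fun not3(a: Boolean?): Boolean? = a?.not()

// A WHERE clause keeps a row only when the predicate is TRUE,
// so a NULL predicate drops the row just like FALSE does.
fun passesFilter(p: Boolean?): Boolean = p == true

fun main() {
    val values = listOf(true, false, null)
    for (a in values) {
        for (b in values) {
            println("$a AND $b = ${and3(a, b)},  $a OR $b = ${or3(a, b)}")
        }
    }
}
```

The asymmetric cases are the ones that small engines tend to get wrong: `FALSE AND NULL` is `FALSE`, while `TRUE AND NULL` is `NULL`.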

---

## Names vs positions

Expressions also sit at an important boundary between symbolic and positional access.

In the logical layer:

- columns are usually referenced by name

In the physical layer:

- columns are typically referenced by index

This shift matters because runtime execution wants fast positional access, while planning wants stable symbolic meaning.

That is why physical planning resolves `Column(name)` into `ColumnExpression(index)`.

This is one of the simplest examples of planning removing abstraction cost before execution begins.
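The name-to-index resolution can be sketched in a few lines. `Column` and `ColumnExpression` echo the names used above, but they are reduced to bare data classes here, and `toPhysical` is a hypothetical helper rather than the engine's planner.

```kotlin
// Simplified stand-ins for illustration.
data class Field(val name: String)
data class Schema(val fields: List<Field>)

data class Column(val name: String)          // logical reference: symbolic
data class ColumnExpression(val index: Int)  // physical reference: positional

// Resolve the name once, during physical planning.
fun toPhysical(expr: Column, input: Schema): ColumnExpression {
    val i = input.fields.indexOfFirst { it.name == expr.name }
    if (i == -1) throw IllegalArgumentException("No column named '${expr.name}'")
    return ColumnExpression(i)
}

fun main() {
    val schema = Schema(listOf(Field("id"), Field("name"), Field("age")))
    val physical = toPhysical(Column("age"), schema)

    // At runtime every batch is read by position; no string lookup is repeated.
    val batch: List<List<Any?>> = listOf(listOf(1L, 2L), listOf("ann", "bob"), listOf(30L, 41L))
    println(batch[physical.index]) // [30, 41]
}
```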

---

## Data types as part of architecture

The companion engine's use of Arrow types reinforces a larger design lesson:

- the type system is part of architecture

It shapes:

- schema interchange
- source integration
- vector allocation
- expression evaluation
- aggregate implementation

That is why "choosing a type system" appears so early in the book. It affects much more than parser validation.

---

## Main takeaways

- Expressions are where much of a query engine's real semantics live.
- Logical expressions are about meaning, names, and output fields.
- Physical expressions are about batch-oriented evaluation and vector production.
- Types are needed both for validation and for concrete runtime behavior.
- Nullability affects ingestion, casting, filtering, aggregation, and output materialization.
- Aggregate expressions are fundamentally stateful and need accumulator machinery rather than plain scalar evaluation.

---

## Related notes

- `hqew/001-query-engine-glossary.md`
- `hqew/004-query-planning.md`
- `hqew/006-query-execution-models.md`
- `hqew/014-how-query-engines-work-part-1.md`
- `hqew/015-sql-front-end-and-logical-planning.md`
- `hqew/016-physical-plans-and-operators.md`

---

## Changelog

* **Apr 7, 2026** -- Added a dedicated note on expressions, Arrow-based types, coercion, aggregates, and null handling.