# Expressions, Types, and Nullability

A reference for how query engines represent computations over data and why types and nulls shape both planning and execution.

---

## Short answer

Expressions are the smallest units of query computation.

Operators such as projection, filter, join, and aggregate do not do useful work by themselves. They rely on expressions to answer questions like:

- which column is being referenced?
- what value does this literal represent?
- how are two values compared?
- how is a new value computed?
- what type does the result have?
- how should nulls be handled?

In the companion engine, expressions exist in two layers:

- logical expressions for planning
- physical expressions for execution

Types and nullability matter in both layers.

---

## Why expressions deserve their own note

It is easy to talk about query engines only in terms of operators:

- scan
- filter
- join
- aggregate

But operators are really containers for expression evaluation.

For example:

- a filter is only meaningful because it evaluates a boolean predicate
- a projection is only meaningful because it evaluates output expressions
- an aggregate is only meaningful because it evaluates aggregate inputs and grouping keys

So expressions are not side details. They are a large part of the engine's real semantics.

---

## The two-layer model

Relevant modules:

- `tmp/how-query-engines-work/logical-plan/src/main/kotlin/LogicalExpr.kt`
- `tmp/how-query-engines-work/logical-plan/src/main/kotlin/Expressions.kt`
- `tmp/how-query-engines-work/physical-plan/src/main/kotlin/expressions/Expressions.kt`
- `tmp/how-query-engines-work/physical-plan/src/main/kotlin/expressions/BinaryExpression.kt`
- `tmp/how-query-engines-work/physical-plan/src/main/kotlin/expressions/CastExpression.kt`
- `tmp/how-query-engines-work/physical-plan/src/main/kotlin/expressions/AggregateExpression.kt`
- `tmp/how-query-engines-work/datatypes/src/main/kotlin/Schema.kt`
- `tmp/how-query-engines-work/datatypes/src/main/kotlin/ColumnVector.kt`
- `tmp/how-query-engines-work/datatypes/src/main/kotlin/RecordBatch.kt`
- `tmp/how-query-engines-work/datatypes/src/main/kotlin/ArrowTypes.kt`

The key distinction is:

- logical expressions answer planning questions
- physical expressions answer runtime evaluation questions

Logical expressions expose:

- names
- output types
- output fields
- planning-time structure

Physical expressions expose:

- `evaluate(input: RecordBatch): ColumnVector`

That is a clean and important design split: planning-time questions stay in the logical layer, and runtime evaluation stays in the physical layer.

---

## Logical expressions

Logical expressions represent computation before execution details are fixed.

Examples include:

- `Column`
- `ColumnIndex`
- `LiteralString`
- `LiteralLong`
- `LiteralDouble`
- `LiteralDate`
- `LiteralIntervalDays`
- `CastExpr`
- `Eq`, `Gt`, `LtEq`
- `And`, `Or`, `Not`
- `Add`, `Subtract`, `Multiply`, `Divide`
- aggregate expressions such as `Sum`, `Avg`, `Count`
- aliasing via `Alias`

The defining feature of a logical expression is `toField(input)`.

That means the expression can answer:

- what field name will this produce?
- what data type will it produce?

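
A rough sketch of how `toField(input)` can answer both questions without touching any data. Everything here is a simplified stand-in, not the companion engine's code: the `Sketch*` names are hypothetical, a small enum replaces Arrow types, the input is assumed to be a bare schema, and the rule that addition takes its left operand's type is an assumption made to keep the example short. With a schema holding an `INT64` column `age`, aliasing `age + 1` as `next_age` yields a field named `next_age` of type `INT64`.

```kotlin
// Hypothetical, simplified stand-ins; the companion engine uses Arrow types instead.
enum class SketchType { INT64, FLOAT64, STRING }

data class SketchField(val name: String, val type: SketchType)

data class SketchSchema(val fields: List<SketchField>)

// A logical expression only has to describe the field it would produce.
interface SketchLogicalExpr {
    fun toField(input: SketchSchema): SketchField
}

// Column reference by name: planning fails early if the column is unknown.
class SketchColumn(val name: String) : SketchLogicalExpr {
    override fun toField(input: SketchSchema): SketchField =
        input.fields.firstOrNull { it.name == name }
            ?: throw IllegalArgumentException("No column named '$name'")
}

// Integer literal: name and type are known without looking at the input.
class SketchLiteralLong(val value: Long) : SketchLogicalExpr {
    override fun toField(input: SketchSchema): SketchField =
        SketchField(value.toString(), SketchType.INT64)
}

// Addition: this sketch simply takes the type of the left operand (an assumption).
class SketchAdd(val l: SketchLogicalExpr, val r: SketchLogicalExpr) : SketchLogicalExpr {
    override fun toField(input: SketchSchema): SketchField =
        SketchField("add", l.toField(input).type)
}

// Alias: keeps the type, replaces the name.
class SketchAlias(val expr: SketchLogicalExpr, val alias: String) : SketchLogicalExpr {
    override fun toField(input: SketchSchema): SketchField =
        SketchField(alias, expr.toField(input).type)
}
```
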

So logical expressions are where planning gets its schema information.

---

## Physical expressions

Physical expressions are executable.

They evaluate over an input `RecordBatch` and produce a `ColumnVector`.

That means even a scalar-looking expression such as:

```sql
age + 1
```

does not produce one scalar at runtime. It produces a full output vector for the batch.

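
A minimal sketch of that behavior, using plain Kotlin lists as hypothetical stand-ins for `RecordBatch` and `ColumnVector` (the real types wrap Arrow vectors, and the `Sketch*` names are invented for this example). Evaluating `age + 1` visits every row of the batch and returns a whole column; propagating null through the addition is an assumption of this sketch.

```kotlin
// Hypothetical stand-ins: a batch is a list of columns, a column is a list of nullable values.
class SketchBatch(val columns: List<List<Long?>>) {
    val rowCount: Int get() = columns.firstOrNull()?.size ?: 0
}

interface SketchExpression {
    // One call evaluates the expression for every row in the batch.
    fun evaluate(input: SketchBatch): List<Long?>
}

// Reads an existing column by position.
class SketchColumnExpression(val index: Int) : SketchExpression {
    override fun evaluate(input: SketchBatch): List<Long?> = input.columns[index]
}

// A literal is expanded to one value per row.
class SketchLiteralLongExpression(val value: Long) : SketchExpression {
    override fun evaluate(input: SketchBatch): List<Long?> = List(input.rowCount) { value }
}

// Element-wise addition; a null on either side yields null in this sketch.
class SketchAddExpression(val l: SketchExpression, val r: SketchExpression) : SketchExpression {
    override fun evaluate(input: SketchBatch): List<Long?> {
        val left = l.evaluate(input)
        val right = r.evaluate(input)
        return left.indices.map { i ->
            val a = left[i]
            val b = right[i]
            if (a == null || b == null) null else a + b
        }
    }
}
```
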

Examples of physical expressions in the companion engine:

- `ColumnExpression`
- literal expressions
- comparison expressions
- boolean expressions
- arithmetic expressions
- `CastExpression`
- date/interval expressions
- aggregate runtime expressions via accumulator-based machinery

This is one of the clearest signs that the engine is batch-oriented rather than row-oriented.

---

## The runtime currency: columns and batches

Expression evaluation works because the runtime model is built around:

- `Schema`
- `Field`
- `ColumnVector`
- `RecordBatch`

`Schema` holds the list of fields.

Each `Field` has:

- a name
- an Arrow data type

`ColumnVector` provides:

- its Arrow type
- random access to values
- a length

`RecordBatch` groups equal-length column vectors under one schema.

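
The relationships above can be sketched in a few lines. These are hypothetical, simplified stand-ins (the `Sketch*` names and the two-value type enum are assumptions; the companion engine uses Arrow types and Arrow-backed vectors). The point is the shape: a field is a name plus a type, a vector offers type, length, and random access, and a batch refuses columns of unequal length.

```kotlin
// Hypothetical, simplified stand-ins; the companion engine uses Arrow types and vectors.
enum class SketchType { INT64, STRING }

data class SketchField(val name: String, val type: SketchType)

data class SketchSchema(val fields: List<SketchField>)

// A column exposes its type, random access to values, and a length.
interface SketchColumnVector {
    val type: SketchType
    val size: Int
    fun getValue(i: Int): Any?
}

class SketchListVector(
    override val type: SketchType,
    private val values: List<Any?>
) : SketchColumnVector {
    override val size: Int get() = values.size
    override fun getValue(i: Int): Any? = values[i]
}

// A batch pairs one schema with equal-length columns.
class SketchRecordBatch(val schema: SketchSchema, val columns: List<SketchColumnVector>) {
    init {
        require(columns.size == schema.fields.size) { "one column per field" }
        require(columns.map { it.size }.distinct().size <= 1) { "columns must have equal length" }
    }

    val rowCount: Int get() = columns.firstOrNull()?.size ?: 0

    fun field(i: Int): SketchColumnVector = columns[i]
}
```
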

So the actual runtime unit of expression evaluation is not "a row object." It is "a batch plus one or more vectors derived from it."

---

## Why types matter in planning

Types matter at planning time because the engine needs to know whether expressions make sense before it tries to run them.

Examples:

- a referenced column must exist in the input schema
- a cast target must be supported
- a comparison should have operands with compatible meaning
- an aggregate result has a particular output type

The companion engine uses Arrow types as the type vocabulary.

That is a strong design choice because it avoids inventing a parallel type universe disconnected from the runtime representation.

So when planning computes a `Field`, it is already aligning:

- expression semantics
- schema metadata
- runtime representation

---

## Why types matter in execution

At runtime, types affect:

- which physical vectors get allocated
- how values are read and written
- what coercions are legal
- what aggregate logic is valid
- how nulls are represented

For example:

- `CastExpression` allocates an output vector of the requested Arrow type
- selection expects a boolean `BitVector`
- aggregate accumulators dispatch on numeric runtime types
- numeric binary expressions may coerce mismatched numeric inputs to `Double`

So types are not paperwork. They determine concrete execution behavior.

---

## Type coercion

The companion engine has limited but explicit type coercion behavior.

One notable example is `BinaryExpression`:

- evaluate the left vector
- evaluate the right vector
- if their Arrow types differ, try to coerce both to `Double` when both are numeric

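
The reconciliation step can be sketched as follows. This is an illustration under stated assumptions, not the engine's `BinaryExpression`: `SketchType`, `SketchVector`, `toDoubleVector`, and `reconcile` are hypothetical names, and failing with an exception when no rule applies is a choice made for the sketch. One known cost of widening to `Double`: `Long` values beyond 2^53 cannot all be represented exactly.

```kotlin
// Hypothetical stand-ins; the companion engine works on Arrow vectors and Arrow types.
enum class SketchType { INT32, INT64, FLOAT64, STRING }

class SketchVector(val type: SketchType, val values: List<Any?>)

val numericTypes = setOf(SketchType.INT32, SketchType.INT64, SketchType.FLOAT64)

// Converts a numeric vector to FLOAT64, keeping nulls as nulls.
fun toDoubleVector(v: SketchVector): SketchVector =
    SketchVector(SketchType.FLOAT64, v.values.map { (it as Number?)?.toDouble() })

// Returns both operands with one shared type, or fails if no rule applies.
fun reconcile(left: SketchVector, right: SketchVector): Pair<SketchVector, SketchVector> {
    // Same type on both sides: nothing to do.
    if (left.type == right.type) return left to right
    // Mixed numeric types: widen both sides to FLOAT64.
    if (left.type in numericTypes && right.type in numericTypes) {
        return toDoubleVector(left) to toDoubleVector(right)
    }
    throw IllegalStateException("Cannot reconcile ${left.type} with ${right.type}")
}
```
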

This is a pragmatic runtime decision.

It shows two things clearly:

- expression evaluation often needs type-reconciliation logic
- even a small engine needs rules for mixed-type arithmetic and comparison

It also shows a current simplification: coercion is narrow and mostly numeric. A production engine would have a broader, more principled coercion matrix.

---

## Casting

`CastExpression` makes type conversion explicit.

It supports a range of targets such as:

- integer widths
- floating-point types
- string

At runtime it:

1. evaluates the source expression to a column vector
2. allocates a destination vector of the target type
3. converts each value
4. preserves nulls rather than coercing them into non-null sentinel values

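
Steps 2 to 4 can be sketched like this, assuming step 1 has already produced the source vector. The types and function names are hypothetical stand-ins, the set of supported conversions is deliberately tiny, and throwing on an unsupported combination is an assumption of the sketch rather than documented engine behavior.

```kotlin
// Hypothetical stand-ins for the source and target vector types.
enum class SketchType { INT64, FLOAT64, STRING }

class SketchVector(val type: SketchType, val values: List<Any?>)

// Converts one non-null value; unsupported combinations fail loudly.
fun castValue(value: Any, target: SketchType): Any = when (target) {
    SketchType.INT64 -> when (value) {
        is Number -> value.toLong()
        is String -> value.toLong()
        else -> throw IllegalStateException("Cannot cast $value to INT64")
    }
    SketchType.FLOAT64 -> when (value) {
        is Number -> value.toDouble()
        is String -> value.toDouble()
        else -> throw IllegalStateException("Cannot cast $value to FLOAT64")
    }
    SketchType.STRING -> value.toString()
}

// Build the destination, convert each value, and keep nulls as nulls
// instead of turning them into a sentinel such as 0 or "".
fun cast(source: SketchVector, target: SketchType): SketchVector =
    SketchVector(target, source.values.map { v -> if (v == null) null else castValue(v, target) })
```
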

That last point matters. Nulls are not just a parsing concern. They must survive evaluation correctly through type conversion.

---

## Aggregate expressions and accumulators

Aggregates are special because their runtime behavior is stateful.

The physical layer separates two concerns:

- an `AggregateExpression` knows its input expression and how to create an accumulator
- an `Accumulator` maintains per-group state across many rows

Examples:

- `CountAccumulator` counts non-null inputs
- `SumAccumulator` adds values using type-dependent numeric logic
- `AvgAccumulator` tracks both sum and count

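
A compact sketch of the accumulator idea. The interface and class names are hypothetical and only mirror the prose. Two simplifications are assumptions of this sketch: sums are kept as `Double` rather than dispatching on the input type, and null inputs are skipped so that an all-null input produces a null sum or average. In a grouped aggregate there would be one accumulator instance per group.

```kotlin
// Hypothetical interface; names follow the prose, not the companion engine's exact API.
interface SketchAccumulator {
    fun accumulate(value: Any?)
    fun finalValue(): Any?
}

// COUNT: only non-null inputs are counted.
class SketchCountAccumulator : SketchAccumulator {
    private var count = 0L
    override fun accumulate(value: Any?) { if (value != null) count++ }
    override fun finalValue(): Any? = count
}

// SUM over doubles for simplicity; nulls are skipped, and an all-null input sums to null.
class SketchSumAccumulator : SketchAccumulator {
    private var sum: Double? = null
    override fun accumulate(value: Any?) {
        if (value is Number) sum = (sum ?: 0.0) + value.toDouble()
    }
    override fun finalValue(): Any? = sum
}

// AVG: tracks both the running sum and the count of non-null inputs.
class SketchAvgAccumulator : SketchAccumulator {
    private var sum = 0.0
    private var count = 0L
    override fun accumulate(value: Any?) {
        if (value is Number) { sum += value.toDouble(); count++ }
    }
    override fun finalValue(): Any? = if (count == 0L) null else sum / count
}

// State survives across batches: every value of every batch feeds the same accumulator.
fun aggregate(acc: SketchAccumulator, batches: List<List<Any?>>): Any? {
    for (batch in batches) for (value in batch) acc.accumulate(value)
    return acc.finalValue()
}
```
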

This design is important because aggregate evaluation is not just "run an expression on a batch." It is "maintain state across many values, batch after batch."

---

## Nullability in practice

The notes in `hqew/001` and `hqew/014` already say nullability matters. The code shows why.

Null handling appears in several concrete ways:

- CSV readers call `setNull(...)` when cells are empty
- cast logic preserves nulls
- `COUNT` increments only for non-null values
- arithmetic and aggregate logic must decide what to do when values are null
- Arrow vectors use validity information to distinguish null from present values

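
To make the validity idea concrete, here is a small sketch of a nullable integer column that keeps values and validity side by side. The class and function names are hypothetical. Arrow stores validity as a packed bitmap; a `BooleanArray` is used here only to keep the example short. Note that a stored `0` and a null are different things.

```kotlin
// Hypothetical sketch of a nullable integer column: values plus a parallel validity mask.
class SketchNullableLongVector(capacity: Int) {
    private val values = LongArray(capacity)
    private val valid = BooleanArray(capacity) // false means "null"; every slot starts as null

    val size: Int get() = values.size

    fun set(i: Int, v: Long) { values[i] = v; valid[i] = true }
    fun setNull(i: Int) { valid[i] = false }
    fun isNull(i: Int): Boolean = !valid[i]
    fun get(i: Int): Long? = if (valid[i]) values[i] else null
}

// Ingestion: an empty CSV cell becomes a null slot, not a zero.
fun parseColumn(cells: List<String>): SketchNullableLongVector {
    val out = SketchNullableLongVector(cells.size)
    cells.forEachIndexed { i, cell ->
        if (cell.isEmpty()) out.setNull(i) else out.set(i, cell.toLong())
    }
    return out
}

// COUNT-style logic: only slots marked valid contribute.
fun countNonNull(v: SketchNullableLongVector): Int = (0 until v.size).count { !v.isNull(it) }
```
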

That means nullability is part of:

- source ingestion
- expression semantics
- aggregate semantics
- output materialization

It is not just an annotation on a schema.

---

## Three-valued logic and current simplification

A production SQL engine typically has careful three-valued logic rules for:

- `TRUE`
- `FALSE`
- `NULL`

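
For reference, the standard SQL rules can be sketched with Kotlin's nullable `Boolean`, where `null` plays the role of SQL `NULL`. The function names are illustrative only. This shows what full SQL semantics require, not what the companion engine implements.

```kotlin
// SQL AND: FALSE dominates; otherwise any NULL makes the result NULL.
fun and3(a: Boolean?, b: Boolean?): Boolean? = when {
    a == false || b == false -> false
    a == null || b == null -> null
    else -> true
}

// SQL OR: TRUE dominates; otherwise any NULL makes the result NULL.
fun or3(a: Boolean?, b: Boolean?): Boolean? = when {
    a == true || b == true -> true
    a == null || b == null -> null
    else -> false
}

// SQL NOT: NULL stays NULL.
fun not3(a: Boolean?): Boolean? = a?.not()

// A WHERE clause keeps a row only when the predicate is TRUE,
// so a NULL predicate filters the row out just like FALSE does.
fun keepRow(predicate: Boolean?): Boolean = predicate == true
```
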

The companion engine is simpler and more pedagogical. It uses Arrow and null-aware value handling, but it does not attempt the full depth of industrial SQL null semantics.

That is worth being explicit about.

The important lesson is:

- null semantics are one of the places where "small query engine" and "full SQL engine" diverge quickly

So any future engine work should treat null behavior as a first-class semantics question, not a cleanup item.

---

## Names vs positions

Expressions also sit at an important boundary between symbolic and positional access.

In the logical layer:

- columns are usually referenced by name

In the physical layer:

- columns are typically referenced by index

This shift matters because runtime execution wants fast positional access, while planning wants stable symbolic meaning.

That is why physical planning resolves `Column(name)` into `ColumnExpression(index)`.

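
A minimal sketch of that resolution step. The `Sketch*` classes, `resolve`, and `evaluate` are hypothetical stand-ins for illustration, and the input is reduced to a list of field names and a list of columns. The name lookup happens once during planning; at runtime only the index is used.

```kotlin
// Hypothetical stand-ins for the logical and physical column expressions.
data class SketchColumn(val name: String)          // logical: symbolic reference
data class SketchColumnExpression(val index: Int)  // physical: positional reference

// Resolve the name once, at planning time, against the input's field names.
fun resolve(column: SketchColumn, fieldNames: List<String>): SketchColumnExpression {
    val index = fieldNames.indexOf(column.name)
    require(index >= 0) { "No column named '${column.name}'" }
    return SketchColumnExpression(index)
}

// At runtime the physical expression is a plain positional lookup, once per batch.
fun evaluate(expr: SketchColumnExpression, batchColumns: List<List<Any?>>): List<Any?> =
    batchColumns[expr.index]
```
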

This is one of the simplest examples of planning removing abstraction cost before execution begins.

---

## Data types as part of architecture

The companion engine's use of Arrow types reinforces a larger design lesson:

- the type system is part of architecture

It shapes:

- schema interchange
- source integration
- vector allocation
- expression evaluation
- aggregate implementation

That is why "choosing a type system" appears so early in the book. It affects much more than parser validation.

---

## Main takeaways

- Expressions are where much of a query engine's real semantics live.
- Logical expressions are about meaning, names, and output fields.
- Physical expressions are about batch-oriented evaluation and vector production.
- Types are needed both for validation and for concrete runtime behavior.
- Nullability affects ingestion, casting, filtering, aggregation, and output materialization.
- Aggregate expressions are fundamentally stateful and need accumulator machinery rather than plain scalar evaluation.

---

## Related notes

- `hqew/001-query-engine-glossary.md`
- `hqew/004-query-planning.md`
- `hqew/006-query-execution-models.md`
- `hqew/014-how-query-engines-work-part-1.md`
- `hqew/015-sql-front-end-and-logical-planning.md`
- `hqew/016-physical-plans-and-operators.md`

---

## Changelog

* **Apr 7, 2026** -- Added a dedicated note on expressions, Arrow-based types, coercion, aggregates, and null handling.