Add a note file about SQL frontend and logical query planning

Hassan Abedi 2026-04-07 11:13:16 +02:00
parent d5bbc4886d
commit dfed3989a8

# SQL Front End and Logical Planning
A reference for how SQL enters the engine and becomes a logical plan.
---
## Short answer
The SQL front end is the layer that turns query text into an internal representation the rest of the engine can reason about.
In the companion engine, that front-end pipeline is:
1. tokenize SQL text
2. parse tokens into SQL AST nodes
3. translate SQL AST nodes into logical expressions and logical plan operators
4. attach schema and type information so later stages can optimize and execute
The important separation is:
- SQL syntax is one surface language
- the SQL AST captures parsed structure
- the logical plan captures query meaning in engine terms
That separation keeps optimization and execution independent from the original text syntax.
---
## Why this layer matters
Without a front-end layer, the engine would have to intermingle:
- user-facing syntax
- semantic validation
- operator construction
- runtime execution details
That would make the system harder to extend and much harder to optimize.
A clean front end gives the engine:
- one place to define supported SQL syntax
- one place to resolve names and types
- one place to reject invalid queries early
- one stable logical representation for later rewrites
---
## The pipeline in the companion repo
Relevant modules:
- `tmp/how-query-engines-work/sql/src/main/kotlin/SqlTokenizer.kt`
- `tmp/how-query-engines-work/sql/src/main/kotlin/Tokens.kt`
- `tmp/how-query-engines-work/sql/src/main/kotlin/SqlParser.kt`
- `tmp/how-query-engines-work/sql/src/main/kotlin/Expressions.kt`
- `tmp/how-query-engines-work/sql/src/main/kotlin/SqlPlanner.kt`
- `tmp/how-query-engines-work/logical-plan/src/main/kotlin/LogicalPlan.kt`
- `tmp/how-query-engines-work/logical-plan/src/main/kotlin/Expressions.kt`
Conceptually:
```text
SQL text
-> token stream
-> SQL expressions / SqlSelect AST
-> logical expressions
-> logical plan
```
---
## Tokenization
The tokenizer is responsible for recognizing the smallest meaningful pieces of the query:
- identifiers
- numeric literals
- string literals
- symbols such as `(`, `)`, `,`, `=`, `>`, `<=`
- SQL keywords such as `SELECT`, `FROM`, `WHERE`, `GROUP BY`
The `SqlTokenizer` handles:
- whitespace skipping
- quoted identifiers using backticks
- quoted strings using `'...'` or `"..."`
- escaped quotes inside strings
- special handling for ambiguous words such as `GROUP` and `ORDER`
That last point matters because `GROUP` might be:
- the start of the `GROUP BY` keyword sequence
- an identifier such as a table or column name in a different context
This is a useful reminder that even "simple SQL parsing" starts with contextual decisions.
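As a rough illustration, the responsibilities above can be sketched in Python. This is a standalone sketch, not the companion Kotlin `SqlTokenizer`; the keyword set, token kinds, and the simple uppercase keyword check are simplifying assumptions.

```python
import re

# Minimal tokenizer sketch: keywords, identifiers, numeric and string
# literals, and one- or two-character symbols. Whitespace is skipped.
KEYWORDS = {"SELECT", "FROM", "WHERE", "GROUP", "BY", "ORDER", "AND", "OR"}
TOKEN_RE = re.compile(r"""
      \s+                               # whitespace (skipped)
    | (?P<num>\d+(?:\.\d+)?)            # numeric literal
    | (?P<str>'(?:[^']|'')*')           # string literal, '' escapes a quote
    | (?P<word>[A-Za-z_][A-Za-z_0-9]*)  # keyword or identifier
    | (?P<sym><=|>=|!=|[(),=<>*+\-/])   # symbols, longest match first
""", re.VERBOSE)

def tokenize(sql: str) -> list[tuple[str, str]]:
    tokens, pos = [], 0
    while pos < len(sql):
        m = TOKEN_RE.match(sql, pos)
        if not m:
            raise ValueError(f"unexpected character at {pos}: {sql[pos]!r}")
        pos = m.end()
        if m.lastgroup == "word":
            text = m.group("word")
            # A real tokenizer decides keyword-vs-identifier contextually
            # (the GROUP/ORDER case above); this sketch only checks the set.
            if text.upper() in KEYWORDS:
                tokens.append(("keyword", text.upper()))
            else:
                tokens.append(("identifier", text))
        elif m.lastgroup:
            tokens.append((m.lastgroup, m.group(m.lastgroup)))
    return tokens
```

For example, `tokenize("SELECT name FROM employees WHERE age >= 18")` produces a flat stream of `(kind, text)` pairs that a parser can consume one at a time.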
---
## Parsing
The parser turns tokens into structured SQL expressions.
The companion parser uses a Pratt-parser style for expressions. That is a compact way to parse operator precedence without a large handwritten grammar.
Key precedence bands in `SqlParser` include:
- aliasing and sort direction
- `OR`
- `AND`
- comparisons
- addition and subtraction
- multiplication and division
- function calls
This means expressions like:
```sql
age > 18 AND salary > 100000
```
can be parsed with the right tree shape rather than as a flat token sequence.
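The precedence-band idea can be sketched as a tiny precedence-climbing loop over an already-tokenized expression. This is a Pratt-style sketch, not the companion `SqlParser`; the operand handling and the precedence table are simplified assumptions.

```python
from dataclasses import dataclass

# operator -> binding power; higher binds tighter, mirroring the bands above
PRECEDENCE = {"OR": 1, "AND": 2, "=": 3, "<": 3, ">": 3, "<=": 3, ">=": 3,
              "+": 4, "-": 4, "*": 5, "/": 5}

@dataclass
class Binary:
    op: str
    left: object
    right: object

def parse_expr(tokens, pos=0, min_prec=1):
    left = tokens[pos]              # operand: identifier or literal token
    pos += 1
    while pos < len(tokens):
        op = tokens[pos]
        prec = PRECEDENCE.get(op, 0)
        if prec < min_prec:
            break                   # operator binds looser than our caller
        # parse the right side with a higher minimum precedence, which
        # makes every listed operator left-associative
        right, pos = parse_expr(tokens, pos + 1, prec + 1)
        left = Binary(op, left, right)
    return left, pos

tree, _ = parse_expr(["age", ">", "18", "AND", "salary", ">", "100000"])
# AND ends up at the root with the two comparisons as its children
```

Because `AND` has lower binding power than `>`, the loop finishes each comparison before folding the two results into one `Binary("AND", ...)` node.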
The SQL AST layer includes nodes such as:
- `SqlIdentifier`
- `SqlString`
- `SqlLong`
- `SqlDouble`
- `SqlBinaryExpr`
- `SqlFunction`
- `SqlAlias`
- `SqlCast`
- `SqlSort`
- `SqlSelect`
That AST is still SQL-shaped. It reflects query clauses like projection lists, `WHERE`, `GROUP BY`, `HAVING`, `ORDER BY`, and `LIMIT`.
---
## What `SqlSelect` captures
The parser's main relation node is `SqlSelect`, which stores:
- projection expressions
- optional selection predicate
- grouping expressions
- ordering expressions
- optional having predicate
- optional limit
- table name
This is still a parsed SQL object, not yet a logical plan.
That distinction matters because the SQL AST still speaks in language-level constructs:
- aliases
- SQL function syntax
- keyword-driven clause structure
The logical plan instead speaks in engine-level operators:
- scan
- filter
- projection
- aggregate
- join
- limit
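To make the two vocabularies concrete, here is a minimal sketch of both layers in Python. The field names follow the lists above; the tuple-shaped predicate and the operator classes are illustrative assumptions, not the companion types.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SqlSelect:                            # SQL-shaped: clause by clause
    projection: list
    table_name: str
    selection: Optional[object] = None      # WHERE predicate
    group_by: list = field(default_factory=list)
    having: Optional[object] = None
    order_by: list = field(default_factory=list)
    limit: Optional[int] = None

@dataclass
class Scan:                                 # engine-shaped operators
    table_name: str

@dataclass
class Filter:
    input: object
    predicate: object

@dataclass
class Projection:
    input: object
    exprs: list

# one parsed query...
ast = SqlSelect(projection=["name"], table_name="employees",
                selection=("age", ">", 18))
# ...and the operator tree its meaning maps to
plan = Projection(Filter(Scan("employees"), ("age", ">", 18)), ["name"])
```

The `SqlSelect` stores clauses side by side; the logical plan nests operators so that each one consumes its input's output.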
---
## From SQL AST to logical expressions
`SqlPlanner` performs the semantic translation from SQL expressions into logical expressions and then into a `DataFrame` / logical plan.
Examples of mappings:
- `SqlIdentifier` -> `Column`
- string / long / double literals -> logical literal expressions
- SQL comparison operators -> `Eq`, `Gt`, `LtEq`, and similar nodes
- `AND` / `OR` -> boolean logical expressions
- `+`, `-`, `*`, `/` -> arithmetic logical expressions
- `CAST(...)` -> `CastExpr`
- aggregate calls such as `SUM(x)` -> `Sum(...)`
- `COUNT(*)` -> `Count(LiteralLong(1))`
This is the point where SQL surface syntax stops mattering and engine semantics take over.
---
## Name resolution and schema dependence
The logical layer is schema-aware.
For example:
- `Column(name)` resolves itself against the input plan's schema
- `toField(input)` computes the output field for a logical expression
- invalid column references become planning errors rather than execution surprises
This is important because planning is where the engine should answer:
- does this column exist?
- what type does this expression produce?
- what schema flows out of this operator?
The companion code uses `toField(input)` as the main hook that gives expressions their output metadata.
That means logical expressions carry more than syntax. They also carry planning-time type and schema consequences.
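The `toField(input)` idea can be sketched as a `to_field` method that resolves against a schema. The class and field names here are assumptions for illustration, not the companion API.

```python
from dataclasses import dataclass

@dataclass
class Field:
    name: str
    dtype: str

@dataclass
class Schema:
    fields: list

@dataclass
class Column:
    name: str
    def to_field(self, input_schema: Schema) -> Field:
        # name resolution happens here, against the input plan's schema
        for f in input_schema.fields:
            if f.name == self.name:
                return f
        # invalid references fail at planning time, not at execution
        raise ValueError(f"no column named {self.name!r}")

@dataclass
class Gt:
    left: object
    right: object
    def to_field(self, input_schema: Schema) -> Field:
        self.left.to_field(input_schema)  # validates the left operand
        return Field("gt", "bool")        # a comparison produces a boolean

schema = Schema([Field("name", "string"), Field("age", "int64")])
```

Here `Column("age").to_field(schema)` yields the `int64` field, while `Column("salary").to_field(schema)` raises a planning error before any row is touched.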
---
## Planning non-aggregate queries
For non-aggregate queries, the planner does more than a direct clause-to-operator mapping.
One subtle case is a filter that references columns not present in the final projection.
Example shape:
```sql
SELECT name
FROM employees
WHERE age > 18
```
If the filter uses `age` but the final output only includes `name`, the planner may need an intermediate projection that keeps both:
- output columns needed by the user
- extra columns needed by the filter
Then it can:
1. project enough columns to evaluate the filter
2. apply the filter
3. drop temporary columns not meant for final output
This is a small but important example of planning being semantic work rather than just syntactic translation.
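Those three steps can be sketched as a small planning function. The tuple-based plan nodes and the helper name are assumptions for illustration.

```python
def plan_select(projection, filter_columns, table):
    """Plan SELECT <projection> FROM <table> with a filter over filter_columns."""
    extra = [c for c in filter_columns if c not in projection]
    plan = ("Scan", table)
    if extra:
        # 1. project enough columns to evaluate the filter
        plan = ("Projection", plan, projection + extra)
    # 2. apply the filter
    plan = ("Filter", plan, filter_columns)
    if extra:
        # 3. drop the temporary columns from the final output
        plan = ("Projection", plan, projection)
    return plan
```

With `plan_select(["name"], ["age"], "employees")` the result nests projection, filter, projection, scan; when the filter only uses projected columns, no wrapping projections are added.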
---
## Planning aggregate queries
Aggregate queries add more complexity because the engine must distinguish between:
- grouping expressions
- aggregate expressions
- post-aggregate projection
- optional `HAVING`
The companion planner:
- detects whether the projection contains aggregate expressions
- rejects unsupported cases such as `GROUP BY` without aggregates
- builds a group-by input
- tracks which projected expressions correspond to group columns versus aggregate results
- creates the aggregate operator
- then applies a final projection over the aggregate output
This is important because aggregate SQL syntax often compresses several semantic stages into one query.
For example:
```sql
SELECT department, COUNT(*)
FROM employees
GROUP BY department
HAVING COUNT(*) > 5
```
is not one primitive operation. It becomes:
1. scan input
2. optionally filter pre-aggregation rows
3. group rows
4. compute aggregates
5. apply post-aggregation filtering
6. project final output
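To see that the stages really are separable, here is the same query decomposed into plain Python over in-memory rows. This is a semantic sketch with made-up data, not the engine's operators.

```python
from collections import defaultdict

# made-up employee rows: 7 eng, 3 sales, 6 hr
rows = [{"department": d}
        for d in ["eng"] * 7 + ["sales"] * 3 + ["hr"] * 6]

counts = defaultdict(int)
for row in rows:                  # stages 1-4: scan, group, COUNT(*)
    counts[row["department"]] += 1

# stage 5: HAVING COUNT(*) > 5 filters aggregate output, not input rows
result = sorted((dept, n) for dept, n in counts.items() if n > 5)

# stage 6: the final projection is (department, count), already the shape
# each tuple has, so nothing more needs to be dropped here
```

The key point the sketch makes visible: `HAVING` runs over grouped results, so it cannot be merged into the pre-aggregation filter stage.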
---
## What the front end validates
The front end already enforces meaningful semantic constraints, such as:
- referenced tables must exist
- referenced columns must exist
- aggregate functions must have valid arguments
- `COUNT(*)` is treated specially
- `LIMIT` must be numeric
- unsupported data types in `CAST` are rejected
This is the right place for those checks because execution should not be responsible for discovering syntax- or meaning-level errors.
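A sketch of two such planning-time checks. The supported-type set and the error messages are assumptions, not the companion's.

```python
SUPPORTED_CAST_TYPES = {"int", "long", "float", "double", "string"}

def validate_limit(limit_token: str) -> int:
    # LIMIT must be numeric; anything else is a planning error
    if not limit_token.isdigit():
        raise ValueError(f"LIMIT must be a number, got {limit_token!r}")
    return int(limit_token)

def validate_cast_target(type_name: str) -> str:
    # unsupported CAST targets are rejected before execution starts
    if type_name.lower() not in SUPPORTED_CAST_TYPES:
        raise ValueError(f"unsupported CAST target: {type_name}")
    return type_name.lower()
```

Both checks fail while the plan is being built, which is exactly the behavior the section describes: bad queries never reach an executor.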
---
## Supported surface area and current limits
The front end in the companion repo is intentionally small and teachable rather than SQL-complete.
It supports a useful subset:
- `SELECT`
- `FROM`
- `WHERE`
- `GROUP BY`
- `HAVING`
- `ORDER BY`
- `LIMIT`
- basic arithmetic and boolean expressions
- basic aggregate functions
- `CAST`
- date and interval literals
But it is still limited compared with a production SQL engine:
- no deep subquery machinery
- no rich multi-table SQL grammar
- no extensive type coercion rules
- limited cast target support
- limited aggregate/function surface
That is not a flaw in the teaching model. It keeps the front end small enough to reveal the core ideas clearly.
---
## Mental model: AST vs logical plan
This is the key conceptual distinction.
The SQL AST answers:
- what syntax did the user write?
The logical plan answers:
- what data operations does that syntax mean?
That distinction is one of the most important design boundaries in a query engine.
It is what allows:
- multiple front ends to target one engine
- optimizers to work on a syntax-independent representation
- physical planners to ignore SQL spelling details
---
## Main takeaways
- Tokenization and parsing are not just string processing. They define the supported query language.
- The SQL AST is still language-shaped, while the logical plan is engine-shaped.
- Planning is where the engine starts doing semantic work: name resolution, schema propagation, and operator construction.
- Aggregate queries require especially careful translation because one SQL clause can imply several execution stages.
- A small, explicit front-end pipeline makes the rest of the engine cleaner and easier to evolve.
---
## Related notes
- `hqew/002-query-engine-primer.md`
- `hqew/004-query-planning.md`
- `hqew/005-query-optimization.md`
- `hqew/014-how-query-engines-work-part-1.md`
- `hqew/016-physical-plans-and-operators.md`
- `hqew/017-expressions-types-and-nullability.md`
---
## Changelog
* **Apr 7, 2026** -- Added a dedicated note on SQL tokenization, parsing, AST structure, and logical planning.