Add a note file about SQL frontend and logical query planning

Hassan Abedi 2026-04-07 11:13:16 +02:00
parent d5bbc4886d
commit dfed3989a8

# SQL Front End and Logical Planning
A reference for how SQL enters the engine and becomes a logical plan.
---
## Short answer
The SQL front end is the layer that turns query text into an internal representation the rest of the engine can reason about.
In the companion engine, that front-end pipeline is:
1. tokenize SQL text
2. parse tokens into SQL AST nodes
3. translate SQL AST nodes into logical expressions and logical plan operators
4. attach schema and type information so later stages can optimize and execute
The important separation is:
- SQL syntax is one surface language
- the SQL AST captures parsed structure
- the logical plan captures query meaning in engine terms
That separation keeps optimization and execution independent from the original text syntax.
---
## Why this layer matters
Without a front-end layer, the engine would have to intermingle:
- user-facing syntax
- semantic validation
- operator construction
- runtime execution details
That would make the system harder to extend and much harder to optimize.
A clean front end gives the engine:
- one place to define supported SQL syntax
- one place to resolve names and types
- one place to reject invalid queries early
- one stable logical representation for later rewrites
---
## The pipeline in the companion repo
Relevant modules:
- `tmp/how-query-engines-work/sql/src/main/kotlin/SqlTokenizer.kt`
- `tmp/how-query-engines-work/sql/src/main/kotlin/Tokens.kt`
- `tmp/how-query-engines-work/sql/src/main/kotlin/SqlParser.kt`
- `tmp/how-query-engines-work/sql/src/main/kotlin/Expressions.kt`
- `tmp/how-query-engines-work/sql/src/main/kotlin/SqlPlanner.kt`
- `tmp/how-query-engines-work/logical-plan/src/main/kotlin/LogicalPlan.kt`
- `tmp/how-query-engines-work/logical-plan/src/main/kotlin/Expressions.kt`
Conceptually:
```text
SQL text
-> token stream
-> SQL expressions / SqlSelect AST
-> logical expressions
-> logical plan
```
---
## Tokenization
The tokenizer is responsible for recognizing the smallest meaningful pieces of the query:
- identifiers
- numeric literals
- string literals
- symbols such as `(`, `)`, `,`, `=`, `>`, `<=`
- SQL keywords such as `SELECT`, `FROM`, `WHERE`, `GROUP BY`
The `SqlTokenizer` handles:
- whitespace skipping
- quoted identifiers using backticks
- quoted strings using `'...'` or `"..."`
- escaped quotes inside strings
- special handling for ambiguous words such as `GROUP` and `ORDER`
That last point matters because `GROUP` might be:
- the start of the `GROUP BY` keyword sequence
- an identifier such as a table or column name in a different context
This is a useful reminder that even "simple SQL parsing" starts with contextual decisions.
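As a rough illustration, the responsibilities above can be sketched in Python. This is a standalone sketch, not the companion Kotlin `SqlTokenizer`; the keyword set, token kinds, and the simple uppercase keyword check are simplifying assumptions.

```python
import re

# Minimal tokenizer sketch: keywords, identifiers, numeric and string
# literals, and one- or two-character symbols. Whitespace is skipped.
KEYWORDS = {"SELECT", "FROM", "WHERE", "GROUP", "BY", "ORDER", "AND", "OR"}
TOKEN_RE = re.compile(r"""
      \s+                               # whitespace (skipped)
    | (?P<num>\d+(?:\.\d+)?)            # numeric literal
    | (?P<str>'(?:[^']|'')*')           # string literal, '' escapes a quote
    | (?P<word>[A-Za-z_][A-Za-z_0-9]*)  # keyword or identifier
    | (?P<sym><=|>=|!=|[(),=<>*+\-/])   # symbols, longest match first
""", re.VERBOSE)

def tokenize(sql: str) -> list[tuple[str, str]]:
    tokens, pos = [], 0
    while pos < len(sql):
        m = TOKEN_RE.match(sql, pos)
        if not m:
            raise ValueError(f"unexpected character at {pos}: {sql[pos]!r}")
        pos = m.end()
        if m.lastgroup == "word":
            text = m.group("word")
            # A real tokenizer decides keyword-vs-identifier contextually
            # (the GROUP/ORDER case above); this sketch only checks the set.
            if text.upper() in KEYWORDS:
                tokens.append(("keyword", text.upper()))
            else:
                tokens.append(("identifier", text))
        elif m.lastgroup:
            tokens.append((m.lastgroup, m.group(m.lastgroup)))
    return tokens
```

For example, `tokenize("SELECT name FROM employees WHERE age >= 18")` produces a flat stream of `(kind, text)` pairs that a parser can consume one at a time.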
---
## Parsing
The parser turns tokens into structured SQL expressions.
The companion parser uses a Pratt-parser style for expressions. That is a compact way to parse operator precedence without a large handwritten grammar.
Key precedence bands in `SqlParser` include:
- aliasing and sort direction
- `OR`
- `AND`
- comparisons
- addition and subtraction
- multiplication and division
- function calls
This means expressions like:
```sql
age > 18 AND salary > 100000
```
can be parsed with the right tree shape rather than as a flat token sequence.
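The precedence-band idea can be sketched as a tiny precedence-climbing loop over an already-tokenized expression. This is a Pratt-style sketch, not the companion `SqlParser`; the operand handling and the precedence table are simplified assumptions.

```python
from dataclasses import dataclass

# operator -> binding power; higher binds tighter, mirroring the bands above
PRECEDENCE = {"OR": 1, "AND": 2, "=": 3, "<": 3, ">": 3, "<=": 3, ">=": 3,
              "+": 4, "-": 4, "*": 5, "/": 5}

@dataclass
class Binary:
    op: str
    left: object
    right: object

def parse_expr(tokens, pos=0, min_prec=1):
    left = tokens[pos]              # operand: identifier or literal token
    pos += 1
    while pos < len(tokens):
        op = tokens[pos]
        prec = PRECEDENCE.get(op, 0)
        if prec < min_prec:
            break                   # operator binds looser than our caller
        # parse the right side with a higher minimum precedence, which
        # makes every listed operator left-associative
        right, pos = parse_expr(tokens, pos + 1, prec + 1)
        left = Binary(op, left, right)
    return left, pos

tree, _ = parse_expr(["age", ">", "18", "AND", "salary", ">", "100000"])
# AND ends up at the root with the two comparisons as its children
```

Because `AND` has lower binding power than `>`, the loop finishes each comparison before folding the two results into one `Binary("AND", ...)` node.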
The SQL AST layer includes nodes such as:
- `SqlIdentifier`
- `SqlString`
- `SqlLong`
- `SqlDouble`
- `SqlBinaryExpr`
- `SqlFunction`
- `SqlAlias`
- `SqlCast`
- `SqlSort`
- `SqlSelect`
That AST is still SQL-shaped. It reflects query clauses like projection lists, `WHERE`, `GROUP BY`, `HAVING`, `ORDER BY`, and `LIMIT`.
---
## What `SqlSelect` captures
The parser's main relation node is `SqlSelect`, which stores:
- projection expressions
- optional selection predicate
- grouping expressions
- ordering expressions
- optional having predicate
- optional limit
- table name
This is still a parsed SQL object, not yet a logical plan.
That distinction matters because the SQL AST still speaks in language-level constructs:
- aliases
- SQL function syntax
- keyword-driven clause structure
The logical plan instead speaks in engine-level operators:
- scan
- filter
- projection
- aggregate
- join
- limit
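To make the two vocabularies concrete, here is a minimal sketch of both layers in Python. The field names follow the lists above; the tuple-shaped predicate and the operator classes are illustrative assumptions, not the companion types.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SqlSelect:                            # SQL-shaped: clause by clause
    projection: list
    table_name: str
    selection: Optional[object] = None      # WHERE predicate
    group_by: list = field(default_factory=list)
    having: Optional[object] = None
    order_by: list = field(default_factory=list)
    limit: Optional[int] = None

@dataclass
class Scan:                                 # engine-shaped operators
    table_name: str

@dataclass
class Filter:
    input: object
    predicate: object

@dataclass
class Projection:
    input: object
    exprs: list

# one parsed query...
ast = SqlSelect(projection=["name"], table_name="employees",
                selection=("age", ">", 18))
# ...and the operator tree its meaning maps to
plan = Projection(Filter(Scan("employees"), ("age", ">", 18)), ["name"])
```

The `SqlSelect` stores clauses side by side; the logical plan nests operators so that each one consumes its input's output.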
---
## From SQL AST to logical expressions
`SqlPlanner` performs the semantic translation from SQL expressions into logical expressions and then into a `DataFrame` / logical plan.
Examples of mappings:
- `SqlIdentifier` -> `Column`
- string / long / double literals -> logical literal expressions
- SQL comparison operators -> `Eq`, `Gt`, `LtEq`, and similar nodes
- `AND` / `OR` -> boolean logical expressions
- `+`, `-`, `*`, `/` -> arithmetic logical expressions
- `CAST(...)` -> `CastExpr`
- aggregate calls such as `SUM(x)` -> `Sum(...)`
- `COUNT(*)` -> `Count(LiteralLong(1))`
This is the point where SQL surface syntax stops mattering and engine semantics take over.
---
## Name resolution and schema dependence
The logical layer is schema-aware.
For example:
- `Column(name)` resolves itself against the input plan's schema
- `toField(input)` computes the output field for a logical expression
- invalid column references become planning errors rather than execution surprises
This is important because planning is where the engine should answer:
- does this column exist?
- what type does this expression produce?
- what schema flows out of this operator?
The companion code uses `toField(input)` as the main hook that gives expressions their output metadata.
That means logical expressions carry more than syntax. They also carry planning-time type and schema consequences.
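The `toField(input)` idea can be sketched as a `to_field` method that resolves against a schema. The class and field names here are assumptions for illustration, not the companion API.

```python
from dataclasses import dataclass

@dataclass
class Field:
    name: str
    dtype: str

@dataclass
class Schema:
    fields: list

@dataclass
class Column:
    name: str
    def to_field(self, input_schema: Schema) -> Field:
        # name resolution happens here, against the input plan's schema
        for f in input_schema.fields:
            if f.name == self.name:
                return f
        # invalid references fail at planning time, not at execution
        raise ValueError(f"no column named {self.name!r}")

@dataclass
class Gt:
    left: object
    right: object
    def to_field(self, input_schema: Schema) -> Field:
        self.left.to_field(input_schema)  # validates the left operand
        return Field("gt", "bool")        # a comparison produces a boolean

schema = Schema([Field("name", "string"), Field("age", "int64")])
```

Here `Column("age").to_field(schema)` yields the `int64` field, while `Column("salary").to_field(schema)` raises a planning error before any row is touched.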
---
## Planning non-aggregate queries
For non-aggregate queries, the planner does more than a direct clause-to-operator mapping.
One subtle case is a filter that references columns not present in the final projection.
Example shape:
```sql
SELECT name
FROM employees
WHERE age > 18
```
If the filter uses `age` but the final output only includes `name`, the planner may need an intermediate projection that keeps both:
- output columns needed by the user
- extra columns needed by the filter
Then it can:
1. project enough columns to evaluate the filter
2. apply the filter
3. drop temporary columns not meant for final output
This is a small but important example of planning being semantic work rather than just syntactic translation.
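Those three steps can be sketched as a small planning function. The tuple-based plan nodes and the helper name are assumptions for illustration.

```python
def plan_select(projection, filter_columns, table):
    """Plan SELECT <projection> FROM <table> with a filter over filter_columns."""
    extra = [c for c in filter_columns if c not in projection]
    plan = ("Scan", table)
    if extra:
        # 1. project enough columns to evaluate the filter
        plan = ("Projection", plan, projection + extra)
    # 2. apply the filter
    plan = ("Filter", plan, filter_columns)
    if extra:
        # 3. drop the temporary columns from the final output
        plan = ("Projection", plan, projection)
    return plan
```

With `plan_select(["name"], ["age"], "employees")` the result nests projection, filter, projection, scan; when the filter only uses projected columns, no wrapping projections are added.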
---
## Planning aggregate queries
Aggregate queries add more complexity because the engine must distinguish between:
- grouping expressions
- aggregate expressions
- post-aggregate projection
- optional `HAVING`
The companion planner:
- detects whether the projection contains aggregate expressions
- rejects unsupported cases such as `GROUP BY` without aggregates
- builds a group-by input
- tracks which projected expressions correspond to group columns versus aggregate results
- creates the aggregate operator
- then applies a final projection over the aggregate output
This is important because aggregate SQL syntax often compresses several semantic stages into one query.
For example:
```sql
SELECT department, COUNT(*)
FROM employees
GROUP BY department
HAVING COUNT(*) > 5
```
is not one primitive operation. It becomes:
1. scan input
2. optionally filter pre-aggregation rows
3. group rows
4. compute aggregates
5. apply post-aggregation filtering
6. project final output
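To see that the stages really are separable, here is the same query decomposed into plain Python over in-memory rows. This is a semantic sketch with made-up data, not the engine's operators.

```python
from collections import defaultdict

# made-up employee rows: 7 eng, 3 sales, 6 hr
rows = [{"department": d}
        for d in ["eng"] * 7 + ["sales"] * 3 + ["hr"] * 6]

counts = defaultdict(int)
for row in rows:                  # stages 1-4: scan, group, COUNT(*)
    counts[row["department"]] += 1

# stage 5: HAVING COUNT(*) > 5 filters aggregate output, not input rows
result = sorted((dept, n) for dept, n in counts.items() if n > 5)

# stage 6: the final projection is (department, count), already the shape
# each tuple has, so nothing more needs to be dropped here
```

The key point the sketch makes visible: `HAVING` runs over grouped results, so it cannot be merged into the pre-aggregation filter stage.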
---
## What the front end validates
The front end already enforces meaningful semantic constraints, such as:
- referenced tables must exist
- referenced columns must exist
- aggregate functions must have valid arguments
- `COUNT(*)` is treated specially
- `LIMIT` must be numeric
- unsupported data types in `CAST` are rejected
This is the right place for those checks because execution should not be responsible for discovering syntax- or meaning-level errors.
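A sketch of two such planning-time checks. The supported-type set and the error messages are assumptions, not the companion's.

```python
SUPPORTED_CAST_TYPES = {"int", "long", "float", "double", "string"}

def validate_limit(limit_token: str) -> int:
    # LIMIT must be numeric; anything else is a planning error
    if not limit_token.isdigit():
        raise ValueError(f"LIMIT must be a number, got {limit_token!r}")
    return int(limit_token)

def validate_cast_target(type_name: str) -> str:
    # unsupported CAST targets are rejected before execution starts
    if type_name.lower() not in SUPPORTED_CAST_TYPES:
        raise ValueError(f"unsupported CAST target: {type_name}")
    return type_name.lower()
```

Both checks fail while the plan is being built, which is exactly the behavior the section describes: bad queries never reach an executor.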
---
## Supported surface area and current limits
The front end in the companion repo is intentionally small and teachable rather than SQL-complete.
It supports a useful subset:
- `SELECT`
- `FROM`
- `WHERE`
- `GROUP BY`
- `HAVING`
- `ORDER BY`
- `LIMIT`
- basic arithmetic and boolean expressions
- basic aggregate functions
- `CAST`
- date and interval literals
But it is still limited compared with a production SQL engine:
- no deep subquery machinery
- no rich multi-table SQL grammar
- no extensive type coercion rules
- limited cast target support
- limited aggregate/function surface
That is not a flaw in the teaching model. It keeps the front end small enough to reveal the core ideas clearly.
---
## Mental model: AST vs logical plan
This is the key conceptual distinction.
The SQL AST answers:
- what syntax did the user write?
The logical plan answers:
- what data operations does that syntax mean?
That distinction is one of the most important design boundaries in a query engine.
It is what allows:
- multiple front ends to target one engine
- optimizers to work on a syntax-independent representation
- physical planners to ignore SQL spelling details
---
## Main takeaways
- Tokenization and parsing are not just string processing. They define the supported query language.
- The SQL AST is still language-shaped, while the logical plan is engine-shaped.
- Planning is where the engine starts doing semantic work: name resolution, schema propagation, and operator construction.
- Aggregate queries require especially careful translation because one SQL clause can imply several execution stages.
- A small, explicit front-end pipeline makes the rest of the engine cleaner and easier to evolve.
---
## Related notes
- `hqew/002-query-engine-primer.md`
- `hqew/004-query-planning.md`
- `hqew/005-query-optimization.md`
- `hqew/014-how-query-engines-work-part-1.md`
- `hqew/016-physical-plans-and-operators.md`
- `hqew/017-expressions-types-and-nullability.md`
---
## Changelog
* **Apr 7, 2026** -- Added a dedicated note on SQL tokenization, parsing, AST structure, and logical planning.