From dfed3989a8aa67c1d965b3cc01aadf6215c05a85 Mon Sep 17 00:00:00 2001
From: Hassan Abedi
Date: Tue, 7 Apr 2026 11:13:16 +0200
Subject: [PATCH] Add a note file about SQL frontend and logical query planning

---
 .../015-sql-front-end-and-logical-planning.md | 371 ++++++++++++++++++
 1 file changed, 371 insertions(+)
 create mode 100644 hqew/015-sql-front-end-and-logical-planning.md

diff --git a/hqew/015-sql-front-end-and-logical-planning.md b/hqew/015-sql-front-end-and-logical-planning.md
new file mode 100644
index 0000000..432acfc
--- /dev/null
+++ b/hqew/015-sql-front-end-and-logical-planning.md
@@ -0,0 +1,371 @@
+# SQL Front End and Logical Planning
+
+A reference for how SQL enters the engine and becomes a logical plan.
+
+---
+
+## Short answer
+
+The SQL front end is the layer that turns query text into an internal representation the rest of the engine can reason about.
+
+In the companion engine, that front-end pipeline is:
+
+1. tokenize SQL text
+2. parse tokens into SQL AST nodes
+3. translate SQL AST nodes into logical expressions and logical plan operators
+4. attach schema and type information so later stages can optimize and execute
+
+The important separation is:
+
+- SQL syntax is one surface language
+- the SQL AST captures parsed structure
+- the logical plan captures query meaning in engine terms
+
+That separation keeps optimization and execution independent from the original text syntax.
+
+---
+
+## Why this layer matters
+
+Without a front-end layer, the engine would have to intermingle:
+
+- user-facing syntax
+- semantic validation
+- operator construction
+- runtime execution details
+
+That would make the system harder to extend and much harder to optimize.
+
+A clean front end gives the engine:
+
+- one place to define supported SQL syntax
+- one place to resolve names and types
+- one place to reject invalid queries early
+- one stable logical representation for later rewrites
+
+---
+
+## The pipeline in the companion repo
+
+Relevant modules:
+
+- `tmp/how-query-engines-work/sql/src/main/kotlin/SqlTokenizer.kt`
+- `tmp/how-query-engines-work/sql/src/main/kotlin/Tokens.kt`
+- `tmp/how-query-engines-work/sql/src/main/kotlin/SqlParser.kt`
+- `tmp/how-query-engines-work/sql/src/main/kotlin/Expressions.kt`
+- `tmp/how-query-engines-work/sql/src/main/kotlin/SqlPlanner.kt`
+- `tmp/how-query-engines-work/logical-plan/src/main/kotlin/LogicalPlan.kt`
+- `tmp/how-query-engines-work/logical-plan/src/main/kotlin/Expressions.kt`
+
+Conceptually:
+
+```text
+SQL text
+  -> token stream
+  -> SQL expressions / SqlSelect AST
+  -> logical expressions
+  -> logical plan
+```
+
+---
+
+## Tokenization
+
+The tokenizer is responsible for recognizing the smallest meaningful pieces of the query:
+
+- identifiers
+- numeric literals
+- string literals
+- symbols such as `(`, `)`, `,`, `=`, `>`, `<=`
+- SQL keywords such as `SELECT`, `FROM`, `WHERE`, `GROUP BY`
+
+The `SqlTokenizer` handles:
+
+- whitespace skipping
+- quoted identifiers using backticks
+- quoted strings using `'...'` or `"..."`
+- escaped quotes inside strings
+- special handling for ambiguous words such as `GROUP` and `ORDER`
+
+That last point matters because `GROUP` might be:
+
+- the `GROUP BY` keyword sequence
+- or an identifier such as a table/column name in a different context
+
+This is a useful reminder that even "simple SQL parsing" starts with contextual decisions.
+
+---
+
+## Parsing
+
+The parser turns tokens into structured SQL expressions.
+
+The companion parser uses a Pratt-parser style for expressions. That is a compact way to parse operator precedence without a large handwritten grammar.
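+
+The Pratt idea fits in a few lines. The following is a minimal, self-contained sketch and not the companion `SqlParser`: the `Expr`, `Ident`, `Num`, `Binary`, and `PrattSketch` names and the numeric binding powers are assumptions made for this example, and the input is assumed to be already split into tokens.
+
+```kotlin
+// Hypothetical AST for this sketch only; not the companion repo's Sql* classes.
+sealed interface Expr
+data class Ident(val name: String) : Expr
+data class Num(val value: Long) : Expr
+data class Binary(val left: Expr, val op: String, val right: Expr) : Expr
+
+// Assumed binding powers: a higher number binds tighter.
+val precedence = mapOf(
+    "OR" to 10, "AND" to 20,
+    "=" to 30, ">" to 30, "<" to 30,
+    "+" to 40, "-" to 40,
+    "*" to 50, "/" to 50,
+)
+
+class PrattSketch(private val tokens: List<String>) {
+    private var pos = 0
+
+    // Keep extending `left` while the next operator binds tighter than minPrec.
+    fun parse(minPrec: Int = 0): Expr {
+        var left = parsePrefix()
+        while (pos < tokens.size) {
+            val op = tokens[pos].uppercase()
+            val prec = precedence[op] ?: break // not an infix operator, e.g. ")"
+            if (prec <= minPrec) break
+            pos++ // consume the operator
+            // Using prec as the new floor makes equal-precedence operators left-associative.
+            left = Binary(left, op, parse(prec))
+        }
+        return left
+    }
+
+    // Prefix position: a number, an identifier, or a parenthesized expression.
+    private fun parsePrefix(): Expr {
+        val tok = tokens.getOrNull(pos++) ?: error("unexpected end of input")
+        if (tok == "(") {
+            val inner = parse(0)
+            check(tokens.getOrNull(pos++) == ")") { "expected ')'" }
+            return inner
+        }
+        return tok.toLongOrNull()?.let { Num(it) } ?: Ident(tok)
+    }
+}
+
+fun main() {
+    val tokens = "age > 18 AND salary > 100000".split(" ")
+    println(PrattSketch(tokens).parse())
+}
+```
+
+For `age > 18 AND salary > 100000`, each comparison is folded first because `>` binds tighter than `AND` in the assumed table, so `AND` ends up at the root of the tree with the two comparisons as its children.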
+
+Key precedence bands in `SqlParser` include:
+
+- aliasing and sort direction
+- `OR`
+- `AND`
+- comparisons
+- addition and subtraction
+- multiplication and division
+- function calls
+
+This means expressions like:
+
+```sql
+age > 18 AND salary > 100000
+```
+
+can be parsed with the right tree shape rather than as a flat token sequence.
+
+The SQL AST layer includes nodes such as:
+
+- `SqlIdentifier`
+- `SqlString`
+- `SqlLong`
+- `SqlDouble`
+- `SqlBinaryExpr`
+- `SqlFunction`
+- `SqlAlias`
+- `SqlCast`
+- `SqlSort`
+- `SqlSelect`
+
+That AST is still SQL-shaped. It reflects query clauses like projection lists, `WHERE`, `GROUP BY`, `HAVING`, `ORDER BY`, and `LIMIT`.
+
+---
+
+## What `SqlSelect` captures
+
+The parser's main relation node is `SqlSelect`, which stores:
+
+- projection expressions
+- optional selection predicate
+- grouping expressions
+- ordering expressions
+- optional having predicate
+- optional limit
+- table name
+
+This is still a parsed SQL object, not yet a logical plan.
+
+That distinction matters because the SQL AST still speaks in language-level constructs:
+
+- aliases
+- SQL function syntax
+- keyword-driven clause structure
+
+The logical plan instead speaks in engine-level operators:
+
+- scan
+- filter
+- projection
+- aggregate
+- join
+- limit
+
+---
+
+## From SQL AST to logical expressions
+
+`SqlPlanner` performs the semantic translation from SQL expressions into logical expressions and then into a `DataFrame` / logical plan.
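+
+To make that translation step concrete, here is a minimal sketch of a recursive AST-to-logical-expression mapping. It borrows class names mentioned in this note (`SqlIdentifier`, `SqlLong`, `SqlBinaryExpr`, `Column`, `LiteralLong`, `Gt`), but the field names, the `And` node, and the `toLogical` function are simplified assumptions for illustration, not the companion `SqlPlanner` code.
+
+```kotlin
+// Simplified stand-ins; the companion repo's real classes carry more detail.
+sealed interface SqlExpr
+data class SqlIdentifier(val id: String) : SqlExpr
+data class SqlLong(val value: Long) : SqlExpr
+data class SqlBinaryExpr(val l: SqlExpr, val op: String, val r: SqlExpr) : SqlExpr
+
+sealed interface LogicalExpr
+data class Column(val name: String) : LogicalExpr
+data class LiteralLong(val n: Long) : LogicalExpr
+data class Gt(val l: LogicalExpr, val r: LogicalExpr) : LogicalExpr
+data class And(val l: LogicalExpr, val r: LogicalExpr) : LogicalExpr // assumed name
+
+// Hypothetical helper: walk the SQL-shaped tree and emit engine-shaped nodes.
+fun toLogical(e: SqlExpr): LogicalExpr = when (e) {
+    is SqlIdentifier -> Column(e.id)
+    is SqlLong -> LiteralLong(e.value)
+    is SqlBinaryExpr -> {
+        val l = toLogical(e.l)
+        val r = toLogical(e.r)
+        when (e.op) {
+            ">" -> Gt(l, r)
+            "AND" -> And(l, r)
+            // Unknown operators fail at planning time, not at execution time.
+            else -> throw IllegalStateException("unsupported operator: ${e.op}")
+        }
+    }
+}
+
+fun main() {
+    // age > 18 AND salary > 100000, already parsed into the SQL-shaped tree
+    val sql = SqlBinaryExpr(
+        SqlBinaryExpr(SqlIdentifier("age"), ">", SqlLong(18)),
+        "AND",
+        SqlBinaryExpr(SqlIdentifier("salary"), ">", SqlLong(100000)),
+    )
+    println(toLogical(sql))
+}
+```
+
+The operator string disappears during the walk: after translation, the tree contains only typed logical nodes, which is what lets later stages ignore SQL spelling.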
+
+Examples of mappings:
+
+- `SqlIdentifier` -> `Column`
+- string / long / double literals -> logical literal expressions
+- SQL comparison operators -> `Eq`, `Gt`, `LtEq`, and similar nodes
+- `AND` / `OR` -> boolean logical expressions
+- `+`, `-`, `*`, `/` -> arithmetic logical expressions
+- `CAST(...)` -> `CastExpr`
+- aggregate calls such as `SUM(x)` -> `Sum(...)`
+- `COUNT(*)` -> `Count(LiteralLong(1))`
+
+This is the point where SQL surface syntax stops mattering and engine semantics take over.
+
+---
+
+## Name resolution and schema dependence
+
+The logical layer is schema-aware.
+
+For example:
+
+- `Column(name)` resolves itself against the input plan's schema
+- `toField(input)` computes the output field for a logical expression
+- invalid column references become planning errors rather than execution surprises
+
+This is important because planning is where the engine should answer:
+
+- does this column exist?
+- what type does this expression produce?
+- what schema flows out of this operator?
+
+The companion code uses `toField(input)` as the main hook that gives expressions their output metadata.
+
+That means logical expressions carry more than syntax. They also carry planning-time type and schema consequences.
+
+---
+
+## Planning non-aggregate queries
+
+For non-aggregate queries, the planner does more than a direct clause-to-operator mapping.
+
+One subtle case is a filter that references columns not present in the final projection.
+
+Example shape:
+
+```sql
+SELECT name
+FROM employees
+WHERE age > 18
+```
+
+If the filter uses `age` but the final output only includes `name`, the planner may need an intermediate projection that keeps both:
+
+- output columns needed by the user
+- extra columns needed by the filter
+
+Then it can:
+
+1. project enough columns to evaluate the filter
+2. apply the filter
+3. drop temporary columns not meant for final output
+
+This is a small but important example of planning being semantic work rather than just syntactic translation.
+
+---
+
+## Planning aggregate queries
+
+Aggregate queries add more complexity because the engine must distinguish between:
+
+- grouping expressions
+- aggregate expressions
+- post-aggregate projection
+- optional `HAVING`
+
+The companion planner:
+
+- detects whether the projection contains aggregate expressions
+- rejects unsupported cases such as `GROUP BY` without aggregates
+- builds a group-by input
+- tracks which projected expressions correspond to group columns versus aggregate results
+- creates the aggregate operator
+- then applies a final projection over the aggregate output
+
+This is important because aggregate SQL syntax often compresses several semantic stages into one query.
+
+For example:
+
+```sql
+SELECT department, COUNT(*)
+FROM employees
+GROUP BY department
+HAVING COUNT(*) > 5
+```
+
+is not one primitive operation. It becomes:
+
+1. scan input
+2. optionally filter pre-aggregation rows
+3. group rows
+4. compute aggregates
+5. apply post-aggregation filtering
+6. project final output
+
+---
+
+## What the front end validates
+
+The front end already enforces meaningful semantic constraints, such as:
+
+- referenced tables must exist
+- referenced columns must exist
+- aggregate functions must have valid arguments
+- `COUNT(*)` is treated specially
+- `LIMIT` must be numeric
+- unsupported data types in `CAST` are rejected
+
+This is the right place for those checks because execution should not be responsible for discovering syntax- or meaning-level errors.
+
+---
+
+## Supported surface area and current limits
+
+The front end in the companion repo is intentionally small and teachable rather than SQL-complete.
+
+It supports a useful subset:
+
+- `SELECT`
+- `FROM`
+- `WHERE`
+- `GROUP BY`
+- `HAVING`
+- `ORDER BY`
+- `LIMIT`
+- basic arithmetic and boolean expressions
+- basic aggregate functions
+- `CAST`
+- date and interval literals
+
+But it is still limited compared with a production SQL engine:
+
+- no deep subquery machinery
+- no rich multi-table SQL grammar
+- no extensive type coercion rules
+- limited cast target support
+- limited aggregate/function surface
+
+That is not a flaw in the teaching model. It keeps the front end small enough to reveal the core ideas clearly.
+
+---
+
+## Mental model: AST vs logical plan
+
+This is the key conceptual distinction.
+
+The SQL AST answers:
+
+- what syntax did the user write?
+
+The logical plan answers:
+
+- what data operations does that syntax mean?
+
+That distinction is one of the most important design boundaries in a query engine.
+
+It is what allows:
+
+- multiple front ends to target one engine
+- optimizers to work on a syntax-independent representation
+- physical planners to ignore SQL spelling details
+
+---
+
+## Main takeaways
+
+- Tokenization and parsing are not just string processing. They define the supported query language.
+- The SQL AST is still language-shaped, while the logical plan is engine-shaped.
+- Planning is where the engine starts doing semantic work: name resolution, schema propagation, and operator construction.
+- Aggregate queries require especially careful translation because one SQL clause can imply several execution stages.
+- A small, explicit front-end pipeline makes the rest of the engine cleaner and easier to evolve.
+
+---
+
+## Related notes
+
+- `hqew/002-query-engine-primer.md`
+- `hqew/004-query-planning.md`
+- `hqew/005-query-optimization.md`
+- `hqew/014-how-query-engines-work-part-1.md`
+- `hqew/016-physical-plans-and-operators.md`
+- `hqew/017-expressions-types-and-nullability.md`
+
+---
+
+## Changelog
+
+* **Apr 7, 2026** -- Added a dedicated note on SQL tokenization, parsing, AST structure, and logical planning.