Add a note file about SQL frontend and logical query planning
Commit `dfed3989a8` (parent `d5bbc4886d`)

New file: `hqew/015-sql-front-end-and-logical-planning.md` (371 lines added)

# SQL Front End and Logical Planning

A reference for how SQL enters the engine and becomes a logical plan.

---

## Short answer

The SQL front end is the layer that turns query text into an internal representation the rest of the engine can reason about.

In the companion engine, that front-end pipeline is:

1. tokenize SQL text
2. parse tokens into SQL AST nodes
3. translate SQL AST nodes into logical expressions and logical plan operators
4. attach schema and type information so later stages can optimize and execute

The important separation is:

- SQL syntax is one surface language
- the SQL AST captures parsed structure
- the logical plan captures query meaning in engine terms

That separation keeps optimization and execution independent from the original text syntax.

---

## Why this layer matters

Without a front-end layer, the engine would have to intermingle:

- user-facing syntax
- semantic validation
- operator construction
- runtime execution details

That would make the system harder to extend and much harder to optimize.

A clean front end gives the engine:

- one place to define supported SQL syntax
- one place to resolve names and types
- one place to reject invalid queries early
- one stable logical representation for later rewrites

---

## The pipeline in the companion repo

Relevant modules:

- `tmp/how-query-engines-work/sql/src/main/kotlin/SqlTokenizer.kt`
- `tmp/how-query-engines-work/sql/src/main/kotlin/Tokens.kt`
- `tmp/how-query-engines-work/sql/src/main/kotlin/SqlParser.kt`
- `tmp/how-query-engines-work/sql/src/main/kotlin/Expressions.kt`
- `tmp/how-query-engines-work/sql/src/main/kotlin/SqlPlanner.kt`
- `tmp/how-query-engines-work/logical-plan/src/main/kotlin/LogicalPlan.kt`
- `tmp/how-query-engines-work/logical-plan/src/main/kotlin/Expressions.kt`

Conceptually:

```text
SQL text
  -> token stream
  -> SQL expressions / SqlSelect AST
  -> logical expressions
  -> logical plan
```

---

## Tokenization

The tokenizer is responsible for recognizing the smallest meaningful pieces of the query:

- identifiers
- numeric literals
- string literals
- symbols such as `(`, `)`, `,`, `=`, `>`, `<=`
- SQL keywords such as `SELECT`, `FROM`, `WHERE`, `GROUP BY`

The `SqlTokenizer` handles:

- whitespace skipping
- quoted identifiers using backticks
- quoted strings using `'...'` or `"..."`
- escaped quotes inside strings
- special handling for ambiguous words such as `GROUP` and `ORDER`

That last point matters because `GROUP` might be:

- the `GROUP BY` keyword sequence
- or an identifier such as a table/column name in a different context

This is a useful reminder that even "simple SQL parsing" starts with contextual decisions.
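
The contextual `GROUP` / `ORDER` decision can be sketched in a few lines. This is a minimal Python illustration, not the companion repo's `SqlTokenizer`: the names `tokenize`, `TOKEN_RE`, and `KEYWORDS`, the `(kind, text)` token shape, and the doubled-quote escape (`''`) are all assumptions made for the example, and only single-quoted strings are handled. The sketch treats `GROUP` or `ORDER` as a keyword only when the next word is `BY`, and as an identifier otherwise.

```python
import re

# Words that are always keywords in this sketch. GROUP and ORDER are
# deliberately absent: they only count as keywords when followed by BY.
KEYWORDS = {"SELECT", "FROM", "WHERE", "HAVING", "LIMIT", "AND", "OR", "AS"}

TOKEN_RE = re.compile(r"""
    (?P<ws>\s+)
  | (?P<number>\d+(?:\.\d+)?)
  | (?P<string>'(?:[^']|'')*')
  | (?P<quoted>`[^`]*`)
  | (?P<word>[A-Za-z_][A-Za-z0-9_]*)
  | (?P<symbol><=|>=|!=|<>|[(),=<>*+\-/])
""", re.VERBOSE)


def tokenize(sql):
    """Split SQL text into (kind, text) pairs (hypothetical token shape)."""
    # Pass 1: carve the text into raw lexemes, dropping whitespace.
    raw, pos = [], 0
    while pos < len(sql):
        m = TOKEN_RE.match(sql, pos)
        if m is None:
            raise ValueError(f"unexpected character {sql[pos]!r} at offset {pos}")
        pos = m.end()
        if m.lastgroup != "ws":
            raw.append((m.lastgroup, m.group()))

    # Pass 2: classify words, looking one token ahead for GROUP BY / ORDER BY.
    tokens, i = [], 0
    while i < len(raw):
        kind, text = raw[i]
        if kind == "quoted":
            kind, text = "identifier", text[1:-1]  # strip the backticks
        elif kind == "string":
            text = text[1:-1].replace("''", "'")  # unescape doubled quotes
        elif kind == "word":
            upper = text.upper()
            next_is_by = (
                i + 1 < len(raw)
                and raw[i + 1][0] == "word"
                and raw[i + 1][1].upper() == "BY"
            )
            if upper in ("GROUP", "ORDER") and next_is_by:
                tokens.append(("keyword", upper + " BY"))
                i += 2
                continue
            if upper in KEYWORDS:
                kind, text = "keyword", upper
            else:
                kind = "identifier"
        tokens.append((kind, text))
        i += 1
    return tokens
```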

---

## Parsing

The parser turns tokens into structured SQL expressions.

The companion parser uses a Pratt-parser style for expressions. That is a compact way to parse operator precedence without a large handwritten grammar.

Key precedence bands in `SqlParser` include:

- aliasing and sort direction
- `OR`
- `AND`
- comparisons
- addition and subtraction
- multiplication and division
- function calls

This means expressions like:

```sql
age > 18 AND salary > 100000
```

can be parsed with the right tree shape, with `AND` at the root and the two comparisons as its operands, rather than as a flat token sequence.
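
To make the precedence idea concrete, here is a minimal precedence-climbing sketch in Python. It is not the companion `SqlParser`: the `PRECEDENCE` table, its numeric values, the `parse_expression` name, and the nested-tuple tree shape are all assumptions chosen for illustration, and it only handles bare operands joined by binary operators (no parentheses, aliases, or function calls). Each operator's binding power decides whether the loop keeps extending the current left-hand side or returns to the caller.

```python
# Higher number = binds tighter. The values are illustrative only.
PRECEDENCE = {
    "OR": 10,
    "AND": 20,
    "=": 40, "<": 40, ">": 40, "<=": 40, ">=": 40,
    "+": 50, "-": 50,
    "*": 60, "/": 60,
}


def parse_expression(tokens, pos=0, min_prec=0):
    """Parse tokens[pos:] and return (tree, next_pos).

    Operands are plain strings; binary nodes are (op, left, right) tuples.
    """
    left = tokens[pos]  # prefix position: a bare identifier or literal
    pos += 1
    while pos < len(tokens):
        op = tokens[pos].upper()
        prec = PRECEDENCE.get(op)
        # Stop when the next token is not an operator, or when it binds
        # no tighter than the operator that called us (left-associative).
        if prec is None or prec <= min_prec:
            break
        right, pos = parse_expression(tokens, pos + 1, prec)
        left = (op, left, right)
    return left, pos


tree, _ = parse_expression("age > 18 AND salary > 100000".split())
# tree == ("AND", (">", "age", "18"), (">", "salary", "100000"))
```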

The SQL AST layer includes nodes such as:

- `SqlIdentifier`
- `SqlString`
- `SqlLong`
- `SqlDouble`
- `SqlBinaryExpr`
- `SqlFunction`
- `SqlAlias`
- `SqlCast`
- `SqlSort`
- `SqlSelect`

That AST is still SQL-shaped. It reflects query clauses like projection lists, `WHERE`, `GROUP BY`, `HAVING`, `ORDER BY`, and `LIMIT`.

---

## What `SqlSelect` captures

The parser's main relation node is `SqlSelect`, which stores:

- projection expressions
- optional selection predicate
- grouping expressions
- ordering expressions
- optional having predicate
- optional limit
- table name

This is still a parsed SQL object, not yet a logical plan.
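
As a rough picture of that shape, the fields listed above could be modeled like this. The sketch is Python rather than the repo's Kotlin, and the field names (`projection`, `selection`, `group_by`, and so on) and the use of plain strings for expressions are assumptions for illustration, not the actual `SqlSelect` definition.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SqlSelect:
    """Hypothetical mirror of the clause structure of one SELECT statement."""
    projection: List[str]                 # expressions after SELECT
    table_name: str                       # name after FROM
    selection: Optional[str] = None       # WHERE predicate, if any
    group_by: List[str] = field(default_factory=list)
    order_by: List[str] = field(default_factory=list)
    having: Optional[str] = None          # HAVING predicate, if any
    limit: Optional[int] = None


# SELECT name FROM employees WHERE age > 18
query = SqlSelect(
    projection=["name"],
    table_name="employees",
    selection="age > 18",
)
```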

That distinction matters because the SQL AST still speaks in language-level constructs:

- aliases
- SQL function syntax
- keyword-driven clause structure

The logical plan instead speaks in engine-level operators:

- scan
- filter
- projection
- aggregate
- join
- limit

---

## From SQL AST to logical expressions

`SqlPlanner` performs the semantic translation from SQL expressions into logical expressions and then into a `DataFrame` / logical plan.

Examples of mappings:

- `SqlIdentifier` -> `Column`
- string / long / double literals -> logical literal expressions
- SQL comparison operators -> `Eq`, `Gt`, `LtEq`, and similar nodes
- `AND` / `OR` -> boolean logical expressions
- `+`, `-`, `*`, `/` -> arithmetic logical expressions
- `CAST(...)` -> `CastExpr`
- aggregate calls such as `SUM(x)` -> `Sum(...)`
- `COUNT(*)` -> `Count(LiteralLong(1))`

This is the point where SQL surface syntax stops mattering and engine semantics take over.
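
A few of those mappings can be sketched as one recursive function. This is a simplified Python illustration, not the companion `SqlPlanner`: the node names echo the ones in this note, but the `to_logical` function, the single `BinaryExpr` / `AggregateExpr` classes with a `kind` string (instead of one class per operator), and the use of `SqlIdentifier("*")` to represent the star are assumptions.

```python
from dataclasses import dataclass


# --- SQL-side nodes (simplified) ---
@dataclass(frozen=True)
class SqlIdentifier:
    name: str

@dataclass(frozen=True)
class SqlLong:
    value: int

@dataclass(frozen=True)
class SqlBinaryExpr:
    left: object
    op: str
    right: object

@dataclass(frozen=True)
class SqlFunction:
    name: str
    args: tuple


# --- Logical-side nodes (simplified: one class plus a kind tag) ---
@dataclass(frozen=True)
class Column:
    name: str

@dataclass(frozen=True)
class LiteralLong:
    value: int

@dataclass(frozen=True)
class BinaryExpr:
    kind: str
    left: object
    right: object

@dataclass(frozen=True)
class AggregateExpr:
    kind: str
    arg: object


BINARY_KINDS = {"=": "Eq", ">": "Gt", "<=": "LtEq", "AND": "And", "OR": "Or",
                "+": "Add", "-": "Subtract", "*": "Multiply", "/": "Divide"}
AGGREGATE_KINDS = {"SUM": "Sum", "COUNT": "Count"}


def to_logical(expr):
    """Translate one SQL AST node into a logical expression."""
    if isinstance(expr, SqlIdentifier):
        return Column(expr.name)
    if isinstance(expr, SqlLong):
        return LiteralLong(expr.value)
    if isinstance(expr, SqlBinaryExpr):
        kind = BINARY_KINDS.get(expr.op.upper())
        if kind is None:
            raise ValueError(f"unsupported operator: {expr.op}")
        return BinaryExpr(kind, to_logical(expr.left), to_logical(expr.right))
    if isinstance(expr, SqlFunction):
        kind = AGGREGATE_KINDS.get(expr.name.upper())
        if kind is None or len(expr.args) != 1:
            raise ValueError(f"unsupported function call: {expr.name}")
        # COUNT(*) counts rows, so it is rewritten to count a constant.
        if kind == "Count" and expr.args[0] == SqlIdentifier("*"):
            return AggregateExpr("Count", LiteralLong(1))
        return AggregateExpr(kind, to_logical(expr.args[0]))
    raise ValueError(f"unsupported SQL node: {expr!r}")
```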

---

## Name resolution and schema dependence

The logical layer is schema-aware.

For example:

- `Column(name)` resolves itself against the input plan's schema
- `toField(input)` computes the output field for a logical expression
- invalid column references become planning errors rather than execution surprises

This is important because planning is where the engine should answer:

- does this column exist?
- what type does this expression produce?
- what schema flows out of this operator?

The companion code uses `toField(input)` as the main hook that gives expressions their output metadata.

That means logical expressions carry more than syntax. They also carry planning-time type and schema consequences.
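
The idea behind a `toField`-style hook can be sketched as follows. This is a Python approximation under stated assumptions: `to_field` takes the input schema directly (a list of `Field` objects) rather than an input plan, and `Field`, `PlanningError`, the type-name strings, and the output names `"gt"` and `str(value)` are hypothetical choices for the example.

```python
from dataclasses import dataclass


class PlanningError(Exception):
    """Raised when an expression cannot be resolved at planning time."""


@dataclass(frozen=True)
class Field:
    name: str
    data_type: str


@dataclass(frozen=True)
class Column:
    name: str

    def to_field(self, schema):
        # Resolve the column by name against the input schema.
        for f in schema:
            if f.name == self.name:
                return f
        raise PlanningError(f"no column named {self.name!r}")


@dataclass(frozen=True)
class LiteralLong:
    value: int

    def to_field(self, schema):
        return Field(str(self.value), "int64")


@dataclass(frozen=True)
class Gt:
    left: object
    right: object

    def to_field(self, schema):
        # Resolving both operands surfaces bad column references early.
        self.left.to_field(schema)
        self.right.to_field(schema)
        return Field("gt", "boolean")


schema = [Field("name", "string"), Field("age", "int64")]
age_check = Gt(Column("age"), LiteralLong(18))
result_field = age_check.to_field(schema)  # a boolean-typed field
```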

---

## Planning non-aggregate queries

For non-aggregate queries, the planner does more than a direct clause-to-operator mapping.

One subtle case is a filter that references columns not present in the final projection.

Example shape:

```sql
SELECT name
FROM employees
WHERE age > 18
```

If the filter uses `age` but the final output only includes `name`, the planner may need an intermediate projection that keeps both:

- output columns needed by the user
- extra columns needed by the filter

Then it can:

1. project enough columns to evaluate the filter
2. apply the filter
3. drop temporary columns not meant for final output

This is a small but important example of planning being semantic work rather than just syntactic translation.
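
The three steps above can be sketched with plans written as nested tuples. This is an illustrative Python sketch, not the companion planner: the `plan_non_aggregate` function, its parameters, and the `("Scan", ...)`, `("Projection", ...)`, `("Selection", ...)` tuple encoding are assumptions, and the predicate is kept as an opaque string with its referenced columns passed in separately.

```python
def plan_non_aggregate(table, projection, predicate=None, predicate_columns=()):
    """Build a nested-tuple plan for SELECT <projection> FROM <table> [WHERE ...]."""
    plan = ("Scan", table)
    if predicate is None:
        return ("Projection", list(projection), plan)

    # Columns the filter needs that the user did not ask for.
    extra = [c for c in predicate_columns if c not in projection]

    # 1. Project enough columns to evaluate the filter.
    plan = ("Projection", list(projection) + extra, plan)
    # 2. Apply the filter.
    plan = ("Selection", predicate, plan)
    # 3. Drop the temporary columns, if any were added.
    if extra:
        plan = ("Projection", list(projection), plan)
    return plan


# SELECT name FROM employees WHERE age > 18
plan = plan_non_aggregate("employees", ["name"], "age > 18", ["age"])
```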

---

## Planning aggregate queries

Aggregate queries add more complexity because the engine must distinguish between:

- grouping expressions
- aggregate expressions
- post-aggregate projection
- optional `HAVING`

The companion planner:

- detects whether the projection contains aggregate expressions
- rejects unsupported cases such as `GROUP BY` without aggregates
- builds a group-by input
- tracks which projected expressions correspond to group columns versus aggregate results
- creates the aggregate operator
- then applies a final projection over the aggregate output

This is important because aggregate SQL syntax often compresses several semantic stages into one query.

For example:

```sql
SELECT department, COUNT(*)
FROM employees
GROUP BY department
HAVING COUNT(*) > 5
```

is not one primitive operation. It becomes:

1. scan input
2. optionally filter pre-aggregation rows
3. group rows
4. compute aggregates
5. apply post-aggregation filtering
6. project final output
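
The staging above can be sketched as a small plan builder. This is a hedged Python illustration rather than the companion planner's logic: `plan_aggregate`, its parameters, and the nested-tuple plan encoding are assumptions, and expressions are kept as plain strings. Grouping and aggregate computation (steps 3 and 4) are folded into a single `Aggregate` node here.

```python
def plan_aggregate(table, group_by, aggregates, projection, where=None, having=None):
    """Build a nested-tuple plan for a GROUP BY query."""
    if group_by and not aggregates:
        # Mirrors the restriction described above: GROUP BY needs aggregates.
        raise ValueError("GROUP BY without aggregate expressions is not supported")

    plan = ("Scan", table)                                  # 1. scan input
    if where is not None:
        plan = ("Selection", where, plan)                   # 2. pre-aggregation filter
    plan = ("Aggregate", list(group_by), list(aggregates), plan)  # 3-4. group + aggregate
    if having is not None:
        plan = ("Selection", having, plan)                  # 5. post-aggregation filter
    return ("Projection", list(projection), plan)           # 6. final output


# SELECT department, COUNT(*) FROM employees
# GROUP BY department HAVING COUNT(*) > 5
plan = plan_aggregate(
    table="employees",
    group_by=["department"],
    aggregates=["COUNT(*)"],
    projection=["department", "COUNT(*)"],
    having="COUNT(*) > 5",
)
```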

---

## What the front end validates

The front end already enforces meaningful semantic constraints, such as:

- referenced tables must exist
- referenced columns must exist
- aggregate functions must have valid arguments
- `COUNT(*)` is treated specially
- `LIMIT` must be numeric
- unsupported data types in `CAST` are rejected

This is the right place for those checks because execution should not be responsible for discovering syntax- or meaning-level errors.
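
Three of those checks (table existence, column existence, numeric `LIMIT`) can be sketched as a planning-time validation pass. This is a Python illustration under stated assumptions: the `validate` function, the `PlanningError` type, and the catalog shape (a dict from table name to a set of column names) are hypothetical and not taken from the companion code.

```python
class PlanningError(Exception):
    """Raised before execution when a query is semantically invalid."""


def validate(table, columns, limit, catalog):
    """Check a query's references against a catalog of {table: {columns}}."""
    if table not in catalog:
        raise PlanningError(f"unknown table: {table}")

    known = catalog[table]
    for column in columns:
        if column not in known:
            raise PlanningError(f"unknown column {column!r} in table {table!r}")

    # bool is a subclass of int in Python, so exclude it explicitly.
    if limit is not None and (isinstance(limit, bool) or not isinstance(limit, int)):
        raise PlanningError(f"LIMIT must be numeric, got {limit!r}")


catalog = {"employees": {"name", "age", "department"}}
validate("employees", ["name", "age"], 10, catalog)  # passes silently
```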

---

## Supported surface area and current limits

The front end in the companion repo is intentionally small and teachable rather than SQL-complete.

It supports a useful subset:

- `SELECT`
- `FROM`
- `WHERE`
- `GROUP BY`
- `HAVING`
- `ORDER BY`
- `LIMIT`
- basic arithmetic and boolean expressions
- basic aggregate functions
- `CAST`
- date and interval literals

But it is still limited compared with a production SQL engine:

- no deep subquery machinery
- no rich multi-table SQL grammar
- no extensive type coercion rules
- limited cast target support
- limited aggregate/function surface

That is not a flaw in the teaching model. It keeps the front end small enough to reveal the core ideas clearly.

---

## Mental model: AST vs logical plan

This is the key conceptual distinction.

The SQL AST answers:

- what syntax did the user write?

The logical plan answers:

- what data operations does that syntax mean?

That distinction is one of the most important design boundaries in a query engine.

It is what allows:

- multiple front ends to target one engine
- optimizers to work on a syntax-independent representation
- physical planners to ignore SQL spelling details

---

## Main takeaways

- Tokenization and parsing are not just string processing. They define the supported query language.
- The SQL AST is still language-shaped, while the logical plan is engine-shaped.
- Planning is where the engine starts doing semantic work: name resolution, schema propagation, and operator construction.
- Aggregate queries require especially careful translation because one SQL query can imply several execution stages.
- A small, explicit front-end pipeline makes the rest of the engine cleaner and easier to evolve.

---

## Related notes

- `hqew/002-query-engine-primer.md`
- `hqew/004-query-planning.md`
- `hqew/005-query-optimization.md`
- `hqew/014-how-query-engines-work-part-1.md`
- `hqew/016-physical-plans-and-operators.md`
- `hqew/017-expressions-types-and-nullability.md`

---

## Changelog

* **Apr 7, 2026** -- Added a dedicated note on SQL tokenization, parsing, AST structure, and logical planning.