# SQL Front End and Logical Planning

A reference for how SQL enters the engine and becomes a logical plan.

---

## Short answer

The SQL front end is the layer that turns query text into an internal representation the rest of the engine can reason about.

In the companion engine, that front-end pipeline is:

1. tokenize SQL text
2. parse tokens into SQL AST nodes
3. translate SQL AST nodes into logical expressions and logical plan operators
4. attach schema and type information so later stages can optimize and execute

The important separation is:

- SQL syntax is one surface language
- the SQL AST captures parsed structure
- the logical plan captures query meaning in engine terms

That separation keeps optimization and execution independent from the original text syntax.

---

## Why this layer matters

Without a front-end layer, the engine would have to intermingle:

- user-facing syntax
- semantic validation
- operator construction
- runtime execution details

That would make the system harder to extend and much harder to optimize.

A clean front end gives the engine:

- one place to define supported SQL syntax
- one place to resolve names and types
- one place to reject invalid queries early
- one stable logical representation for later rewrites

---

## The pipeline in the companion repo

Relevant implementation areas:

- SQL tokenization
- SQL tokens and keywords
- SQL parsing
- SQL AST expressions
- SQL-to-logical planning
- logical plan interfaces
- logical expressions

Conceptually:

```text
SQL text
-> token stream
-> SQL expressions / SqlSelect AST
-> logical expressions
-> logical plan
```

---

## Tokenization

The tokenizer is responsible for recognizing the smallest meaningful pieces of the query:

- identifiers
- numeric literals
- string literals
- symbols such as `(`, `)`, `,`, `=`, `>`, `<=`
- SQL keywords such as `SELECT`, `FROM`, `WHERE`, `GROUP BY`

The `SqlTokenizer` handles:

- whitespace skipping
- quoted identifiers using backticks
- quoted strings using `'...'` or `"..."`
- escaped quotes inside strings
- special handling for ambiguous words such as `GROUP` and `ORDER`

That last point matters because `GROUP` might be:

- the start of the `GROUP BY` keyword sequence
- an identifier, such as a table or column name, in a different context

This is a useful reminder that even "simple SQL parsing" starts with contextual decisions.
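
The responsibilities listed above can be sketched in a few dozen lines. This is a minimal Python illustration, not the companion `SqlTokenizer`: the `Token` shape, the keyword set, and the backslash escape convention are all assumptions made for the example. The point it shows is the lookahead: `GROUP` and `ORDER` only become keywords when the next word is `BY`.

```python
from dataclasses import dataclass

# Hypothetical keyword set; a real tokenizer would define its own.
KEYWORDS = {"SELECT", "FROM", "WHERE", "HAVING", "LIMIT", "AND", "OR", "AS"}
# Two-character symbols come first so "<=" is not split into "<" and "=".
SYMBOLS = ("<=", ">=", "!=", "(", ")", ",", "=", ">", "<", "+", "-", "*", "/")


@dataclass(frozen=True)
class Token:
    kind: str  # "KEYWORD", "IDENT", "NUMBER", "STRING", or "SYMBOL"
    text: str


def read_string(sql, start):
    """Read a quoted string starting at sql[start]; return (text, next_index)."""
    quote, chars, i = sql[start], [], start + 1
    while i < len(sql):
        if sql[i] == "\\" and i + 1 < len(sql):  # assumed escape convention
            chars.append(sql[i + 1])
            i += 2
        elif sql[i] == quote:
            return "".join(chars), i + 1
        else:
            chars.append(sql[i])
            i += 1
    raise ValueError("unterminated string literal")


def followed_by_by(sql, i):
    """Return the index just past BY if the next word is BY, else -1."""
    while i < len(sql) and sql[i].isspace():
        i += 1
    next_char = sql[i + 2:i + 3]
    if sql[i:i + 2].upper() == "BY" and not (next_char.isalnum() or next_char == "_"):
        return i + 2
    return -1


def tokenize(sql):
    tokens, i = [], 0
    while i < len(sql):
        ch = sql[i]
        if ch.isspace():
            i += 1
        elif ch == "`":  # backtick-quoted identifier
            end = sql.index("`", i + 1)
            tokens.append(Token("IDENT", sql[i + 1:end]))
            i = end + 1
        elif ch in "'\"":
            text, i = read_string(sql, i)
            tokens.append(Token("STRING", text))
        elif ch.isdigit():
            j = i
            while j < len(sql) and (sql[j].isdigit() or sql[j] == "."):
                j += 1
            tokens.append(Token("NUMBER", sql[i:j]))
            i = j
        elif ch.isalpha() or ch == "_":
            j = i
            while j < len(sql) and (sql[j].isalnum() or sql[j] == "_"):
                j += 1
            word, upper = sql[i:j], sql[i:j].upper()
            after_by = followed_by_by(sql, j) if upper in ("GROUP", "ORDER") else -1
            if after_by != -1:  # GROUP BY / ORDER BY become one keyword token
                tokens.append(Token("KEYWORD", upper + " BY"))
                j = after_by
            elif upper in KEYWORDS:
                tokens.append(Token("KEYWORD", upper))
            else:  # includes a bare GROUP or ORDER used as a name
                tokens.append(Token("IDENT", word))
            i = j
        else:
            for symbol in SYMBOLS:
                if sql.startswith(symbol, i):
                    tokens.append(Token("SYMBOL", symbol))
                    i += len(symbol)
                    break
            else:
                raise ValueError(f"unexpected character {ch!r} at offset {i}")
    return tokens
```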

---

## Parsing

The parser turns tokens into structured SQL expressions.

The companion parser uses a Pratt-parser style for expressions. That is a compact way to handle operator precedence without writing a separate grammar rule for each precedence level.

Key precedence bands in `SqlParser` include:

- aliasing and sort direction
- `OR`
- `AND`
- comparisons
- addition and subtraction
- multiplication and division
- function calls

This means expressions like:

```sql
age > 18 AND salary > 100000
```

can be parsed with the right tree shape rather than as a flat token sequence.
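
To make the tree shape concrete, here is a small Pratt-style expression parser in Python. It is a sketch under stated assumptions rather than the companion `SqlParser`: it takes an already split list of token strings, the numeric precedence values are made up for illustration, and only identifiers, integer literals, parentheses, and binary operators are handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SqlIdentifier:
    name: str


@dataclass(frozen=True)
class SqlLong:
    value: int


@dataclass(frozen=True)
class SqlBinaryExpr:
    left: object
    op: str
    right: object


# Hypothetical binding powers: a larger number binds more tightly.
PRECEDENCE = {
    "OR": 20, "AND": 30,
    "=": 40, ">": 40, "<": 40, ">=": 40, "<=": 40,
    "+": 50, "-": 50,
    "*": 60, "/": 60,
}


class PrattParser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def parse(self, min_precedence=0):
        left = self._parse_prefix()
        while self.pos < len(self.tokens):
            op = self.tokens[self.pos].upper()
            precedence = PRECEDENCE.get(op, 0)
            # Stop when the next operator binds no tighter than the caller's.
            if precedence <= min_precedence:
                break
            self.pos += 1
            right = self.parse(precedence)
            left = SqlBinaryExpr(left, op, right)
        return left

    def _parse_prefix(self):
        if self.pos >= len(self.tokens):
            raise ValueError("unexpected end of input")
        token = self.tokens[self.pos]
        self.pos += 1
        if token == "(":
            inner = self.parse(0)
            if self.pos >= len(self.tokens) or self.tokens[self.pos] != ")":
                raise ValueError("expected ')'")
            self.pos += 1
            return inner
        if token.isdigit():
            return SqlLong(int(token))
        return SqlIdentifier(token)


tree = PrattParser("age > 18 AND salary > 100000".split()).parse()
```

Here `tree` is an `AND` node whose two children are the `>` comparisons, because `>` has a higher binding power than `AND` in the table. Operators at the same level group to the left, so `a - b - c` parses as `(a - b) - c`.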

The SQL AST layer includes nodes such as:

- `SqlIdentifier`
- `SqlString`
- `SqlLong`
- `SqlDouble`
- `SqlBinaryExpr`
- `SqlFunction`
- `SqlAlias`
- `SqlCast`
- `SqlSort`
- `SqlSelect`

That AST is still SQL-shaped. It reflects query clauses like projection lists, `WHERE`, `GROUP BY`, `HAVING`, `ORDER BY`, and `LIMIT`.

---

## What `SqlSelect` captures

The parser's main relation node is `SqlSelect`, which stores:

- projection expressions
- optional selection predicate
- grouping expressions
- ordering expressions
- optional having predicate
- optional limit
- table name

This is still a parsed SQL object, not yet a logical plan.

That distinction matters because the SQL AST still speaks in language-level constructs:

- aliases
- SQL function syntax
- keyword-driven clause structure

The logical plan instead speaks in engine-level operators:

- scan
- filter
- projection
- aggregate
- join
- limit
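
The contrast between the two shapes can be sketched in Python. The field names on `SqlSelect` below simply mirror the list of stored items above, and the operator classes are hypothetical stand-ins; expressions are plain strings to keep the example short. The AST is one flat record of clauses, while the plan is a tree in which each operator wraps its input.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SqlSelect:
    """Flat, clause-shaped record of what was parsed (illustrative fields)."""
    projection: list
    selection: Optional[str]
    group_by: list
    order_by: list
    having: Optional[str]
    limit: Optional[int]
    table_name: str


@dataclass
class Scan:
    table: str


@dataclass
class Filter:
    input: object
    predicate: str


@dataclass
class Projection:
    input: object
    exprs: list


@dataclass
class Limit:
    input: object
    n: int


def pretty(plan, indent=0):
    """Render a plan as an indented operator tree, root first."""
    pad = "  " * indent
    if isinstance(plan, Scan):
        return f"{pad}Scan: {plan.table}"
    if isinstance(plan, Filter):
        line = f"{pad}Filter: {plan.predicate}"
    elif isinstance(plan, Projection):
        line = f"{pad}Projection: {', '.join(plan.exprs)}"
    elif isinstance(plan, Limit):
        line = f"{pad}Limit: {plan.n}"
    else:
        raise TypeError(f"unknown operator: {plan!r}")
    return line + "\n" + pretty(plan.input, indent + 1)


# SELECT name FROM employees WHERE age > 18 LIMIT 10, in both shapes.
ast = SqlSelect(projection=["name"], selection="age > 18", group_by=[],
                order_by=[], having=None, limit=10, table_name="employees")
plan = Limit(Projection(Filter(Scan("employees"), "age > 18"), ["name"]), 10)
print(pretty(plan))
```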

---

## From SQL AST to logical expressions

`SqlPlanner` performs the semantic translation from SQL expressions into logical expressions and then into a `DataFrame` / logical plan.

Examples of mappings:

- `SqlIdentifier` -> `Column`
- string / long / double literals -> logical literal expressions
- SQL comparison operators -> `Eq`, `Gt`, `LtEq`, and similar nodes
- `AND` / `OR` -> boolean logical expressions
- `+`, `-`, `*`, `/` -> arithmetic logical expressions
- `CAST(...)` -> `CastExpr`
- aggregate calls such as `SUM(x)` -> `Sum(...)`
- `COUNT(*)` -> `Count(LiteralLong(1))`

This is the point where SQL surface syntax stops mattering and engine semantics take over.
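
A recursive translation function is one natural way to express these mappings. The Python sketch below is illustrative only: it collapses all binary operators into a single `BinaryExpr` with a `kind` tag instead of separate `Eq` / `Gt` classes, covers only a few node types, and assumes that `*` inside `COUNT(*)` reaches the planner as `SqlIdentifier("*")`. Names other than those in the list above are hypothetical.

```python
from dataclasses import dataclass


# SQL AST side
@dataclass(frozen=True)
class SqlIdentifier:
    name: str

@dataclass(frozen=True)
class SqlLong:
    value: int

@dataclass(frozen=True)
class SqlBinaryExpr:
    left: object
    op: str
    right: object

@dataclass(frozen=True)
class SqlFunction:
    name: str
    args: tuple


# Logical side
@dataclass(frozen=True)
class Column:
    name: str

@dataclass(frozen=True)
class LiteralLong:
    value: int

@dataclass(frozen=True)
class BinaryExpr:
    kind: str  # "Eq", "Gt", "And", "Add", ...
    left: object
    right: object

@dataclass(frozen=True)
class Sum:
    expr: object

@dataclass(frozen=True)
class Count:
    expr: object


BINARY_KINDS = {
    "=": "Eq", "!=": "Neq", ">": "Gt", ">=": "GtEq", "<": "Lt", "<=": "LtEq",
    "AND": "And", "OR": "Or",
    "+": "Add", "-": "Subtract", "*": "Multiply", "/": "Divide",
}


def to_logical(expr):
    """Translate one SQL AST expression into a logical expression."""
    if isinstance(expr, SqlIdentifier):
        return Column(expr.name)
    if isinstance(expr, SqlLong):
        return LiteralLong(expr.value)
    if isinstance(expr, SqlBinaryExpr):
        kind = BINARY_KINDS.get(expr.op.upper())
        if kind is None:
            raise ValueError(f"unsupported operator: {expr.op}")
        return BinaryExpr(kind, to_logical(expr.left), to_logical(expr.right))
    if isinstance(expr, SqlFunction) and len(expr.args) == 1:
        name, arg = expr.name.upper(), expr.args[0]
        if name == "COUNT":
            # COUNT(*) counts rows, so any non-null constant works as the argument.
            if arg == SqlIdentifier("*"):
                return Count(LiteralLong(1))
            return Count(to_logical(arg))
        if name == "SUM":
            return Sum(to_logical(arg))
    raise ValueError(f"unsupported SQL expression: {expr!r}")
```

After this step nothing downstream needs to know whether the user wrote `count(*)` or `COUNT( * )`; both arrive as the same logical node.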

---

## Name resolution and schema dependence

The logical layer is schema-aware.

For example:

- `Column(name)` resolves itself against the input plan's schema
- `toField(input)` computes the output field for a logical expression
- invalid column references become planning errors rather than execution surprises

This is important because planning is where the engine should answer:

- does this column exist?
- what type does this expression produce?
- what schema flows out of this operator?

The companion code uses `toField(input)` as the main hook that gives expressions their output metadata.

That means logical expressions carry more than syntax. They also carry planning-time type and schema consequences.
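
The idea behind `toField(input)` can be sketched in Python as a `to_field` method on each expression. Everything here is an assumption made for illustration: the `Field` shape, the type names as strings, the `PlanningError` class, and the choice to have `Gt` resolve its children eagerly so that a bad reference fails during planning.

```python
from dataclasses import dataclass


class PlanningError(Exception):
    """Raised while planning, before any data is read."""


@dataclass(frozen=True)
class Field:
    name: str
    data_type: str


@dataclass(frozen=True)
class Scan:
    table: str
    fields: tuple

    def schema(self):
        return self.fields


@dataclass(frozen=True)
class Column:
    name: str

    def to_field(self, input_plan):
        # Resolve the name against whatever schema the input plan produces.
        for field in input_plan.schema():
            if field.name == self.name:
                return field
        raise PlanningError(f"no column named {self.name!r} in input schema")


@dataclass(frozen=True)
class LiteralLong:
    value: int

    def to_field(self, input_plan):
        return Field(str(self.value), "int64")


@dataclass(frozen=True)
class Gt:
    left: object
    right: object

    def to_field(self, input_plan):
        # Resolving both sides surfaces unknown columns at planning time.
        self.left.to_field(input_plan)
        self.right.to_field(input_plan)
        return Field("gt", "boolean")


@dataclass(frozen=True)
class Projection:
    input: object
    exprs: tuple

    def schema(self):
        # The operator's output schema is derived from its expressions.
        return tuple(expr.to_field(self.input) for expr in self.exprs)


employees = Scan("employees", (Field("name", "string"), Field("age", "int64")))
plan = Projection(employees, (Column("name"), Gt(Column("age"), LiteralLong(18))))
```

With these definitions, `plan.schema()` yields a `string` field named `name` and a `boolean` field for the comparison, while `Column("salary").to_field(employees)` raises `PlanningError` without touching any data.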

---

## Planning non-aggregate queries

For non-aggregate queries, the planner does more than a direct clause-to-operator mapping.

One subtle case is a filter that references columns not present in the final projection.

Example shape:

```sql
SELECT name
FROM employees
WHERE age > 18
```

If the filter uses `age` but the final output only includes `name`, the planner may need an intermediate projection that keeps both:

- output columns needed by the user
- extra columns needed by the filter

Then it can:

1. project enough columns to evaluate the filter
2. apply the filter
3. drop temporary columns not meant for final output

This is a small but important example of planning being semantic work rather than just syntactic translation.
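
The three steps can be written down directly. The Python sketch below is a simplification under stated assumptions: the predicate is a single `(column, operator, literal)` tuple, projections are plain column names, and the operator classes are hypothetical rather than the companion planner's types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Projection:
    input: object
    columns: tuple


@dataclass(frozen=True)
class Filter:
    input: object
    predicate: tuple  # (column, operator, literal), a simplifying assumption


def plan_non_aggregate(table, projection, predicate=None):
    """Build Scan -> Projection -> Filter -> (optional) final Projection."""
    wanted = tuple(projection)
    plan = Scan(table)
    if predicate is None:
        return Projection(plan, wanted)

    filter_column = predicate[0]
    extra = () if filter_column in wanted else (filter_column,)

    # 1. project enough columns to evaluate the filter
    plan = Projection(plan, wanted + extra)
    # 2. apply the filter
    plan = Filter(plan, predicate)
    # 3. drop temporary columns not meant for final output
    if extra:
        plan = Projection(plan, wanted)
    return plan


plan = plan_non_aggregate("employees", ["name"], ("age", ">", 18))
```

For the query above, `plan` is a projection of `name` over a filter on `age > 18` over a projection of `name, age` over the scan. When the filter column is already in the select list, the final projection is skipped because there is nothing to drop.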

---

## Planning aggregate queries

Aggregate queries add more complexity because the engine must distinguish between:

- grouping expressions
- aggregate expressions
- post-aggregate projection
- optional `HAVING`

The companion planner:

- detects whether the projection contains aggregate expressions
- rejects unsupported cases such as `GROUP BY` without aggregates
- builds a group-by input
- tracks which projected expressions correspond to group columns versus aggregate results
- creates the aggregate operator
- then applies a final projection over the aggregate output

This is important because aggregate SQL syntax often compresses several semantic stages into one query.

For example:

```sql
SELECT department, COUNT(*)
FROM employees
GROUP BY department
HAVING COUNT(*) > 5
```

is not one primitive operation. It becomes:

1. scan input
2. optionally filter pre-aggregation rows
3. group rows
4. compute aggregates
5. apply post-aggregation filtering
6. project final output
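
Those six stages map onto a short planning function. The following Python sketch keeps expressions as plain strings and recognizes aggregates by a hypothetical prefix check, so it illustrates the ordering of operators rather than how the companion planner is actually written.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Filter:
    input: object
    predicate: str


@dataclass(frozen=True)
class Aggregate:
    input: object
    group_by: tuple
    aggregates: tuple


@dataclass(frozen=True)
class Projection:
    input: object
    exprs: tuple


# Hypothetical detection rule: an expression is an aggregate if it starts
# with one of these calls. A real planner would inspect AST node types.
AGGREGATE_PREFIXES = ("COUNT(", "SUM(", "MIN(", "MAX(", "AVG(")


def is_aggregate(expr):
    return expr.upper().startswith(AGGREGATE_PREFIXES)


def plan_aggregate(table, projection, group_by, where=None, having=None):
    aggregates = tuple(e for e in projection if is_aggregate(e))
    if group_by and not aggregates:
        raise ValueError("GROUP BY without aggregate expressions is not supported")
    for expr in projection:
        # Every projected expression is either a group column or an aggregate.
        if not is_aggregate(expr) and expr not in group_by:
            raise ValueError(f"{expr} must appear in GROUP BY or be aggregated")

    plan = Scan(table)                                   # 1. scan input
    if where is not None:
        plan = Filter(plan, where)                       # 2. pre-aggregation filter
    plan = Aggregate(plan, tuple(group_by), aggregates)  # 3-4. group and aggregate
    if having is not None:
        plan = Filter(plan, having)                      # 5. post-aggregation filter
    return Projection(plan, tuple(projection))           # 6. final projection


plan = plan_aggregate(
    "employees", ["department", "COUNT(*)"], ["department"], having="COUNT(*) > 5"
)
```

Note that `WHERE` and `HAVING` both become `Filter` operators; what distinguishes them in the plan is only their position relative to `Aggregate`.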

---

## What the front end validates

The front end already enforces meaningful semantic constraints, such as:

- referenced tables must exist
- referenced columns must exist
- aggregate functions must have valid arguments
- `COUNT(*)` is treated specially
- `LIMIT` must be numeric
- unsupported data types in `CAST` are rejected

This is the right place for those checks because execution should not be responsible for discovering syntax- or meaning-level errors.
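
A few of these checks are easy to show in isolation. In this Python sketch the query is summarized as a dictionary, the catalog maps table names to column sets, and the list of supported cast targets is invented for the example. It also collects messages into a list instead of raising on the first problem, which is a choice made here for readability, not a description of the companion code.

```python
# Hypothetical set of cast targets; the real supported list may differ.
SUPPORTED_CAST_TYPES = {"int", "long", "float", "double", "string"}


def validate(query, catalog):
    """Return a list of planning-time error messages (empty if valid)."""
    errors = []

    table = query["table"]
    if table not in catalog:
        # Without a schema there is nothing to check columns against.
        return [f"unknown table: {table}"]

    known_columns = catalog[table]
    for column in query.get("columns", []):
        if column not in known_columns:
            errors.append(f"unknown column: {column}")

    limit = query.get("limit")
    if limit is not None and not isinstance(limit, int):
        errors.append(f"LIMIT must be numeric, got {limit!r}")

    for target in query.get("cast_targets", []):
        if target.lower() not in SUPPORTED_CAST_TYPES:
            errors.append(f"unsupported CAST target: {target}")

    return errors


catalog = {"employees": {"name", "age", "department"}}
problems = validate(
    {"table": "employees", "columns": ["name", "salary"],
     "limit": "ten", "cast_targets": ["uuid"]},
    catalog,
)
```

Every message in `problems` is produced before an operator is built or a row is read, which is the property the section argues for.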

---

## Supported surface area and current limits

The front end in the companion repo is intentionally small and teachable rather than SQL-complete.

It supports a useful subset:

- `SELECT`
- `FROM`
- `WHERE`
- `GROUP BY`
- `HAVING`
- `ORDER BY`
- `LIMIT`
- basic arithmetic and boolean expressions
- basic aggregate functions
- `CAST`
- date and interval literals

But it is still limited compared with a production SQL engine:

- no deep subquery machinery
- no rich multi-table SQL grammar
- no extensive type coercion rules
- limited cast target support
- limited aggregate/function surface

That is not a flaw in the teaching model. It keeps the front end small enough to reveal the core ideas clearly.

---

## Mental model: AST vs logical plan

This is the key conceptual distinction.

The SQL AST answers:

- what syntax did the user write?

The logical plan answers:

- what data operations does that syntax mean?

That distinction is one of the most important design boundaries in a query engine.

It is what allows:

- multiple front ends to target one engine
- optimizers to work on a syntax-independent representation
- physical planners to ignore SQL spelling details
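
The first of those benefits can be demonstrated with two tiny front ends that produce the same plan. This Python sketch is deliberately simplified: the parsed SQL is a dictionary, predicates are strings, and the `DataFrame` builder and operator classes are hypothetical stand-ins that only borrow the names used in this note.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Filter:
    input: object
    predicate: str


@dataclass(frozen=True)
class Projection:
    input: object
    columns: tuple


class DataFrame:
    """Front end 1: a fluent builder that wraps a logical plan."""

    def __init__(self, plan):
        self.plan = plan

    def filter(self, predicate):
        return DataFrame(Filter(self.plan, predicate))

    def project(self, *columns):
        return DataFrame(Projection(self.plan, columns))


def plan_sql_select(select):
    """Front end 2: turn a parsed SELECT (here a dict) into a logical plan."""
    plan = Scan(select["table"])
    if select.get("where") is not None:
        plan = Filter(plan, select["where"])
    return Projection(plan, tuple(select["projection"]))


from_sql = plan_sql_select(
    {"table": "employees", "where": "age > 18", "projection": ["name"]}
)
from_api = DataFrame(Scan("employees")).filter("age > 18").project("name").plan
```

Both `from_sql` and `from_api` are the same operator tree, so an optimizer or physical planner that accepts one accepts the other without knowing which surface syntax produced it.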

---

## Main takeaways

- Tokenization and parsing are not just string processing. They define the supported query language.
- The SQL AST is still language-shaped, while the logical plan is engine-shaped.
- Planning is where the engine starts doing semantic work: name resolution, schema propagation, and operator construction.
- Aggregate queries require especially careful translation because one SQL clause can imply several execution stages.
- A small, explicit front-end pipeline makes the rest of the engine cleaner and easier to evolve.

---

## Related notes

- `hqew/002-query-engine-primer.md`
- `hqew/004-query-planning.md`
- `hqew/005-query-optimization.md`
- `hqew/014-how-query-engines-work-part-1.md`
- `hqew/016-physical-plans-and-operators.md`
- `hqew/017-expressions-types-and-nullability.md`

---

## Changelog

* **Apr 7, 2026** -- Removed references to ignored local paths.
* **Apr 7, 2026** -- Added a dedicated note on SQL tokenization, parsing, AST structure, and logical planning.