SQL Front End and Logical Planning

A reference for how SQL enters the engine and becomes a logical plan.


Short answer

The SQL front end is the layer that turns query text into an internal representation the rest of the engine can reason about.

In the companion engine, that front-end pipeline is:

  1. tokenize SQL text
  2. parse tokens into SQL AST nodes
  3. translate SQL AST nodes into logical expressions and logical plan operators
  4. attach schema and type information so later stages can optimize and execute

The important separation is:

  • SQL syntax is one surface language
  • the SQL AST captures parsed structure
  • the logical plan captures query meaning in engine terms

That separation keeps optimization and execution independent from the original text syntax.


Why this layer matters

Without a front-end layer, the engine would have to intermingle:

  • user-facing syntax
  • semantic validation
  • operator construction
  • runtime execution details

That would make the system harder to extend and much harder to optimize.

A clean front end gives the engine:

  • one place to define supported SQL syntax
  • one place to resolve names and types
  • one place to reject invalid queries early
  • one stable logical representation for later rewrites

The pipeline in the companion repo

Relevant implementation areas:

  • SQL tokenization
  • SQL tokens and keywords
  • SQL parsing
  • SQL AST expressions
  • SQL-to-logical planning
  • logical plan interfaces
  • logical expressions

Conceptually:

SQL text
  -> token stream
  -> SQL expressions / SqlSelect AST
  -> logical expressions
  -> logical plan

Tokenization

The tokenizer is responsible for recognizing the smallest meaningful pieces of the query:

  • identifiers
  • numeric literals
  • string literals
  • symbols such as (, ), comma, =, >, and <=
  • SQL keywords such as SELECT, FROM, WHERE, and GROUP BY

The SqlTokenizer handles:

  • whitespace skipping
  • quoted identifiers using backticks
  • quoted strings using '...' or "..."
  • escaped quotes inside strings
  • special handling for ambiguous words such as GROUP and ORDER

That last point matters because the word GROUP might be:

  • the first word of the GROUP BY keyword sequence
  • an identifier, such as a table or column name, in a different context

This is a useful reminder that even "simple SQL parsing" starts with contextual decisions.
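
The tokenizer behavior listed above can be sketched in a few dozen lines. This is a minimal illustration in Python, not the companion SqlTokenizer: the Token class, the keyword and symbol sets, and the choice of a doubled quote as the string escape are all assumptions made for the example. The part worth noticing is the lookahead that decides whether GROUP or ORDER is a keyword.

```python
from dataclasses import dataclass

# Hypothetical keyword and symbol sets; the real tokenizer's sets may differ.
KEYWORDS = {"SELECT", "FROM", "WHERE", "GROUP", "ORDER", "BY",
            "HAVING", "LIMIT", "AND", "OR", "AS"}
SYMBOLS = ["<=", ">=", "!=", "(", ")", ",", "=", "<", ">", "+", "-", "*", "/"]
AMBIGUOUS = {"GROUP", "ORDER"}  # treated as keywords only when followed by BY


@dataclass(frozen=True)
class Token:
    kind: str  # "KEYWORD", "IDENT", "NUMBER", "STRING", or "SYMBOL"
    text: str


def _read_string(sql, start):
    """Read a quoted string at sql[start]; a doubled quote is assumed to be an escape."""
    quote, i, chars = sql[start], start + 1, []
    while i < len(sql):
        if sql[i] == quote:
            if sql[i + 1:i + 2] == quote:
                chars.append(quote)
                i += 2
                continue
            return i + 1, "".join(chars)
        chars.append(sql[i])
        i += 1
    raise ValueError("unterminated string literal")


def _followed_by_by(sql, pos):
    """True if the next word after position pos is BY."""
    rest = sql[pos:].lstrip()
    after = rest[2:3]
    return rest[:2].upper() == "BY" and not (after.isalnum() or after == "_")


def tokenize(sql):
    tokens, i = [], 0
    while i < len(sql):
        ch = sql[i]
        if ch.isspace():                       # whitespace skipping
            i += 1
        elif ch == "`":                        # backtick-quoted identifier
            end = sql.index("`", i + 1)
            tokens.append(Token("IDENT", sql[i + 1:end]))
            i = end + 1
        elif ch in "'\"":                      # quoted string literal
            i, text = _read_string(sql, i)
            tokens.append(Token("STRING", text))
        elif ch.isdigit():                     # numeric literal
            j = i
            while j < len(sql) and (sql[j].isdigit() or sql[j] == "."):
                j += 1
            tokens.append(Token("NUMBER", sql[i:j]))
            i = j
        elif ch.isalpha() or ch == "_":        # keyword or identifier
            j = i
            while j < len(sql) and (sql[j].isalnum() or sql[j] == "_"):
                j += 1
            word = sql[i:j]
            is_keyword = word.upper() in KEYWORDS
            if word.upper() in AMBIGUOUS and not _followed_by_by(sql, j):
                is_keyword = False             # contextual decision: plain identifier
            tokens.append(Token("KEYWORD", word.upper()) if is_keyword
                          else Token("IDENT", word))
            i = j
        else:                                  # symbol, longest match first
            sym = next((s for s in SYMBOLS if sql.startswith(s, i)), None)
            if sym is None:
                raise ValueError(f"unexpected character {ch!r} at offset {i}")
            tokens.append(Token("SYMBOL", sym))
            i += len(sym)
    return tokens
```

With this sketch, `tokenize("SELECT group FROM t")` yields an identifier token for `group`, while the same word in `GROUP BY dept` becomes a keyword.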


Parsing

The parser turns tokens into structured SQL expressions.

The companion parser uses a Pratt-parser style (top-down operator precedence) for expressions. Each operator is assigned a precedence, so one small loop handles operator precedence without a separate grammar rule for every precedence level.

Key precedence bands in SqlParser include:

  • aliasing and sort direction
  • OR
  • AND
  • comparisons
  • addition and subtraction
  • multiplication and division
  • function calls

This means expressions like:

age > 18 AND salary > 100000

are parsed as (age > 18) AND (salary > 100000): the AND node sits at the root with the two comparisons as its children, because comparisons bind more tightly than AND.
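
A Pratt-style expression loop is short enough to show in full. The following is a sketch in Python under stated assumptions: the numeric precedence values, the tuple-based tree, and the `parse_expression` helper are invented for illustration and are not taken from SqlParser. Aliasing, sort direction, and function calls are left out to keep it small.

```python
import re

# Illustrative binding powers, loosest first. The numbers are assumptions.
PRECEDENCE = {
    "OR": 20, "AND": 30,
    "=": 40, "!=": 40, "<": 40, "<=": 40, ">": 40, ">=": 40,
    "+": 50, "-": 50,
    "*": 60, "/": 60,
}


class PrattParser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    def parse(self, min_prec=0):
        """Parse an expression whose operators all bind tighter than min_prec."""
        left = self.parse_prefix()
        while True:
            op = self.peek()
            prec = PRECEDENCE.get(op.upper(), 0) if op is not None else 0
            if prec <= min_prec:        # operator belongs to an outer call
                return left
            self.advance()
            right = self.parse(prec)    # same-level operators stay left-associative
            left = (op.upper(), left, right)

    def parse_prefix(self):
        if self.peek() is None:
            raise ValueError("unexpected end of input")
        tok = self.advance()
        if tok == "(":
            inner = self.parse(0)
            if self.peek() != ")":
                raise ValueError("expected ')'")
            self.advance()
            return inner
        return int(tok) if tok.isdigit() else tok


def parse_expression(text):
    # Simplified lexer for this sketch: operators, parentheses, and words only.
    parser = PrattParser(re.findall(r"<=|>=|!=|[()=<>+\-*/]|\w+", text))
    tree = parser.parse()
    if parser.peek() is not None:
        raise ValueError(f"unexpected token {parser.peek()!r}")
    return tree


tree = parse_expression("age > 18 AND salary > 100000")
# tree == ("AND", (">", "age", 18), (">", "salary", 100000))
```

The recursive call passes the current operator's precedence as the new floor, which is what makes `a + b * 2` group as `a + (b * 2)` without any grammar rule dedicated to multiplication.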

The SQL AST layer includes nodes such as:

  • SqlIdentifier
  • SqlString
  • SqlLong
  • SqlDouble
  • SqlBinaryExpr
  • SqlFunction
  • SqlAlias
  • SqlCast
  • SqlSort
  • SqlSelect

That AST is still SQL-shaped. It reflects query clauses like projection lists, WHERE, GROUP BY, HAVING, ORDER BY, and LIMIT.


What SqlSelect captures

The parser's main relation node is SqlSelect, which stores:

  • projection expressions
  • optional selection predicate
  • grouping expressions
  • ordering expressions
  • optional having predicate
  • optional limit
  • table name
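
As a rough picture of that node, here is a sketch using Python dataclasses. The class names follow the AST nodes listed earlier, but every field name and default below is an assumption for illustration rather than the companion code's actual definition.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class SqlIdentifier:
    name: str


@dataclass
class SqlLong:
    value: int


@dataclass
class SqlBinaryExpr:
    left: Any
    op: str
    right: Any


@dataclass
class SqlSelect:
    # One field per item in the list above; names are hypothetical.
    projection: List[Any]
    table_name: str
    selection: Optional[Any] = None          # WHERE predicate
    group_by: List[Any] = field(default_factory=list)
    order_by: List[Any] = field(default_factory=list)
    having: Optional[Any] = None
    limit: Optional[int] = None


# Hand-built AST for: SELECT name FROM employees WHERE age > 18 LIMIT 10
query = SqlSelect(
    projection=[SqlIdentifier("name")],
    table_name="employees",
    selection=SqlBinaryExpr(SqlIdentifier("age"), ">", SqlLong(18)),
    limit=10,
)
```

Note that the structure mirrors the clauses of the SQL statement one for one. Nothing here says how rows are scanned or filtered.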

This is still a parsed SQL object, not yet a logical plan.

That distinction matters because the SQL AST still speaks in language-level constructs:

  • aliases
  • SQL function syntax
  • keyword-driven clause structure

The logical plan instead speaks in engine-level operators:

  • scan
  • filter
  • projection
  • aggregate
  • join
  • limit

From SQL AST to logical expressions

SqlPlanner performs the semantic translation from SQL expressions into logical expressions and then into a DataFrame / logical plan.

Examples of mappings:

  • SqlIdentifier -> Column
  • string / long / double literals -> logical literal expressions
  • SQL comparison operators -> Eq, Gt, LtEq, and similar nodes
  • AND / OR -> boolean logical expressions
  • +, -, *, / -> arithmetic logical expressions
  • CAST(...) -> CastExpr
  • aggregate calls such as SUM(x) -> Sum(...)
  • COUNT(*) -> Count(LiteralLong(1))

This is the point where SQL surface syntax stops mattering and engine semantics take over.
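
The mappings above amount to a recursive translation function. The sketch below is in Python and makes several simplifying assumptions: the field names are invented, a single generic `BinaryExpr` with a `name` stands in for separate Eq, Gt, and LtEq classes, and `*` is represented as `SqlIdentifier("*")`. None of this is claimed to match SqlPlanner's actual code.

```python
from dataclasses import dataclass
from typing import Any, List


# SQL AST side (names from the note, fields assumed)
@dataclass
class SqlIdentifier:
    name: str

@dataclass
class SqlLong:
    value: int

@dataclass
class SqlBinaryExpr:
    left: Any
    op: str
    right: Any

@dataclass
class SqlFunction:
    name: str
    args: List[Any]


# Logical side (simplified stand-ins)
@dataclass
class Column:
    name: str

@dataclass
class LiteralLong:
    value: int

@dataclass
class BinaryExpr:
    name: str      # "Eq", "Gt", "And", "Add", ...
    left: Any
    right: Any

@dataclass
class Sum:
    expr: Any

@dataclass
class Count:
    expr: Any


BINARY_OPS = {"=": "Eq", ">": "Gt", "<=": "LtEq", "AND": "And", "OR": "Or",
              "+": "Add", "-": "Subtract", "*": "Multiply", "/": "Divide"}


def to_logical(expr):
    """Translate one SQL AST node into a logical expression."""
    if isinstance(expr, SqlIdentifier):
        return Column(expr.name)
    if isinstance(expr, SqlLong):
        return LiteralLong(expr.value)
    if isinstance(expr, SqlBinaryExpr):
        if expr.op not in BINARY_OPS:
            raise ValueError(f"unsupported operator: {expr.op}")
        return BinaryExpr(BINARY_OPS[expr.op],
                          to_logical(expr.left), to_logical(expr.right))
    if isinstance(expr, SqlFunction):
        fn = expr.name.upper()
        if fn == "COUNT" and expr.args == [SqlIdentifier("*")]:
            return Count(LiteralLong(1))      # COUNT(*) special case
        if fn == "COUNT":
            return Count(to_logical(expr.args[0]))
        if fn == "SUM":
            return Sum(to_logical(expr.args[0]))
        raise ValueError(f"unsupported function: {expr.name}")
    raise ValueError(f"unsupported SQL expression: {expr!r}")
```

After this step, nothing downstream needs to know whether the user wrote `COUNT(*)` or `count(*)`; both arrive as the same logical node.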


Name resolution and schema dependence

The logical layer is schema-aware.

For example:

  • Column(name) resolves itself against the input plan's schema
  • toField(input) computes the output field for a logical expression
  • invalid column references become planning errors rather than execution surprises

This is important because planning is where the engine should answer:

  • does this column exist?
  • what type does this expression produce?
  • what schema flows out of this operator?

The companion code uses toField(input) as the main hook that gives expressions their output metadata.

That means logical expressions carry more than syntax. They also carry planning-time type and schema consequences.
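
To make the role of toField(input) concrete, here is a small sketch in Python. Only the idea comes from the note; the `Field`, `Scan`, `Projection`, and `PlanningError` definitions, the type names, and the output name chosen for a comparison are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Any, List


class PlanningError(Exception):
    pass


@dataclass(frozen=True)
class Field:
    name: str
    data_type: str


@dataclass
class Scan:
    table: str
    fields: List[Field]

    def schema(self):
        return self.fields


@dataclass
class Column:
    name: str

    def to_field(self, input_plan):
        # Name resolution: look the column up in the input plan's schema.
        for f in input_plan.schema():
            if f.name == self.name:
                return f
        raise PlanningError(f"no column named {self.name!r}")


@dataclass
class LiteralLong:
    value: int

    def to_field(self, input_plan):
        return Field(str(self.value), "int64")


@dataclass
class Gt:
    left: Any
    right: Any

    def to_field(self, input_plan):
        # Resolving the children first surfaces bad column references here.
        self.left.to_field(input_plan)
        self.right.to_field(input_plan)
        return Field("gt", "boolean")


@dataclass
class Projection:
    input: Any
    exprs: List[Any]

    def schema(self):
        # The operator's output schema is derived from its expressions.
        return [e.to_field(self.input) for e in self.exprs]


scan = Scan("employees", [Field("name", "string"), Field("age", "int64")])
plan = Projection(scan, [Column("name"), Gt(Column("age"), LiteralLong(18))])
```

Calling `plan.schema()` answers all three planning questions at once: it fails if a column is missing, and otherwise reports the name and type of every output field.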


Planning non-aggregate queries

For non-aggregate queries, the planner does more than a direct clause-to-operator mapping.

One subtle case is a filter that references columns not present in the final projection.

Example shape:

SELECT name
FROM employees
WHERE age > 18

If the filter uses age but the final output only includes name, the planner may need an intermediate projection that keeps both:

  • output columns needed by the user
  • extra columns needed by the filter

Then it can:

  1. project enough columns to evaluate the filter
  2. apply the filter
  3. drop temporary columns not meant for final output

This is a small but important example of planning being semantic work rather than just syntactic translation.
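
The three steps can be traced over in-memory rows. This Python sketch is only a model of the idea: rows are plain dictionaries, and the function name and its parameters are hypothetical, not part of the companion planner.

```python
def select_with_filter(rows, output_cols, predicate, predicate_cols):
    """Model of: wide projection -> filter -> drop temporary columns."""
    # Columns the filter needs that the user did not ask for.
    extra = [c for c in dict.fromkeys(predicate_cols) if c not in output_cols]

    # 1. project enough columns to evaluate the filter
    wide_cols = list(output_cols) + extra
    projected = [{c: row[c] for c in wide_cols} for row in rows]

    # 2. apply the filter
    filtered = [row for row in projected if predicate(row)]

    # 3. drop temporary columns not meant for final output
    if extra:
        filtered = [{c: row[c] for c in output_cols} for row in filtered]
    return filtered


employees = [
    {"name": "Ada", "age": 36, "dept": "eng"},
    {"name": "Bo", "age": 17, "dept": "ops"},
    {"name": "Cy", "age": 52, "dept": "eng"},
]

# SELECT name FROM employees WHERE age > 18
result = select_with_filter(employees, ["name"], lambda r: r["age"] > 18, ["age"])
# result == [{"name": "Ada"}, {"name": "Cy"}]
```

If the filter only touches columns already in the output, `extra` is empty and the final projection is skipped.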


Planning aggregate queries

Aggregate queries add more complexity because the engine must distinguish between:

  • grouping expressions
  • aggregate expressions
  • post-aggregate projection
  • optional HAVING

The companion planner:

  • detects whether the projection contains aggregate expressions
  • rejects unsupported cases such as GROUP BY without aggregates
  • builds a group-by input
  • tracks which projected expressions correspond to group columns versus aggregate results
  • creates the aggregate operator
  • then applies a final projection over the aggregate output

This is important because aggregate SQL syntax often compresses several semantic stages into one query.

For example:

SELECT department, COUNT(*)
FROM employees
GROUP BY department
HAVING COUNT(*) > 5

is not one primitive operation. It becomes:

  1. scan input
  2. optionally filter pre-aggregation rows
  3. group rows
  4. compute aggregates
  5. apply post-aggregation filtering
  6. project final output
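
The six stages can be followed in a few lines of Python over in-memory rows. This is a sketch of the semantics only, assuming rows are dictionaries; the function name and parameters are hypothetical and the companion engine's operators are not shown here.

```python
from collections import Counter


def departments_with_more_than(rows, min_count, where=None):
    scanned = iter(rows)                                   # 1. scan input
    if where is not None:
        scanned = (r for r in scanned if where(r))         # 2. pre-aggregation filter
    counts = Counter(r["department"] for r in scanned)     # 3 and 4. group, COUNT(*)
    kept = [(d, n) for d, n in counts.items() if n > min_count]  # 5. HAVING
    return [{"department": d, "count": n} for d, n in kept]      # 6. final projection


employees = [{"department": "eng"}] * 6 + [{"department": "ops"}] * 2

# SELECT department, COUNT(*) FROM employees
# GROUP BY department HAVING COUNT(*) > 5
result = departments_with_more_than(employees, 5)
# result == [{"department": "eng", "count": 6}]
```

The WHERE-style filter runs before grouping and sees raw rows, while the HAVING filter runs after grouping and sees aggregate results. That ordering is what the logical plan has to make explicit.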

What the front end validates

The front end already enforces meaningful semantic constraints, such as:

  • referenced tables must exist
  • referenced columns must exist
  • aggregate functions must have valid arguments
  • COUNT(*) is treated specially
  • LIMIT must be numeric
  • unsupported data types in CAST are rejected

This is the right place for those checks because execution should not be responsible for discovering syntax- or meaning-level errors.
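
A few of these checks can be sketched as a single planning-time function. The Python below is illustrative only: the dictionary-shaped query, the catalog mapping, and the set of supported cast targets are assumptions, not the companion engine's data structures.

```python
SUPPORTED_CAST_TYPES = {"int", "long", "float", "double", "string"}  # assumed set


def validate(query, catalog):
    """Return a list of planning errors; an empty list means the query is valid."""
    table = query["table"]
    if table not in catalog:
        return [f"unknown table: {table}"]   # nothing else can be resolved

    errors = []
    known_columns = catalog[table]
    for col in query.get("columns", []):
        if col not in known_columns:
            errors.append(f"unknown column: {col}")

    limit = query.get("limit")
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if limit is not None and (isinstance(limit, bool) or not isinstance(limit, int)):
        errors.append(f"LIMIT must be numeric, got {limit!r}")

    for target in query.get("cast_targets", []):
        if target.lower() not in SUPPORTED_CAST_TYPES:
            errors.append(f"unsupported CAST target: {target}")
    return errors


catalog = {"employees": {"name", "age", "department"}}
ok = validate({"table": "employees", "columns": ["name", "age"], "limit": 10}, catalog)
bad = validate({"table": "employees", "columns": ["salary"], "limit": "ten",
                "cast_targets": ["uuid"]}, catalog)
```

Here `ok` is empty and `bad` lists three problems, all reported before any row is read.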


Supported surface area and current limits

The front end in the companion repo is intentionally small and teachable rather than SQL-complete.

It supports a useful subset:

  • SELECT
  • FROM
  • WHERE
  • GROUP BY
  • HAVING
  • ORDER BY
  • LIMIT
  • basic arithmetic and boolean expressions
  • basic aggregate functions
  • CAST
  • date and interval literals

But it is still limited compared with a production SQL engine:

  • no deep subquery machinery
  • no rich multi-table SQL grammar
  • no extensive type coercion rules
  • limited cast target support
  • limited aggregate/function surface

That is not a flaw in the teaching model. It keeps the front end small enough to reveal the core ideas clearly.


Mental model: AST vs logical plan

This is the key conceptual distinction.

The SQL AST answers:

  • what syntax did the user write?

The logical plan answers:

  • what data operations does that syntax mean?

That distinction is one of the most important design boundaries in a query engine.

It is what allows:

  • multiple front ends to target one engine
  • optimizers to work on a syntax-independent representation
  • physical planners to ignore SQL spelling details
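
The first point can be shown with a toy example: a DataFrame-style builder and a SQL-style planner that produce the same logical plan. Everything in this Python sketch is hypothetical, including the operator classes, the tuple-encoded predicate, and the `plan_sql_select` helper, and the plan shape is deliberately simplified to filter-then-project.

```python
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Filter:
    input: Any
    predicate: Tuple


@dataclass(frozen=True)
class Projection:
    input: Any
    columns: Tuple[str, ...]


class DataFrame:
    """Front end 1: a fluent builder that wraps a logical plan."""

    def __init__(self, plan):
        self.plan = plan

    def filter(self, predicate):
        return DataFrame(Filter(self.plan, predicate))

    def select(self, *columns):
        return DataFrame(Projection(self.plan, tuple(columns)))


def plan_sql_select(columns, table, where=None):
    """Front end 2: stand-in for a SQL planner, fed already-parsed clause pieces."""
    plan = Scan(table)
    if where is not None:
        plan = Filter(plan, where)
    return Projection(plan, tuple(columns))


predicate = (">", "age", 18)
from_dataframe = DataFrame(Scan("employees")).filter(predicate).select("name").plan
from_sql = plan_sql_select(["name"], "employees", where=predicate)
```

Both routes end in an identical plan object, so an optimizer or physical planner written against these operators would not need to know which front end was used.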

Main takeaways

  • Tokenization and parsing are not just string processing. They define the supported query language.
  • The SQL AST is still language-shaped, while the logical plan is engine-shaped.
  • Planning is where the engine starts doing semantic work: name resolution, schema propagation, and operator construction.
  • Aggregate queries require especially careful translation because a single SELECT statement can imply several logical stages.
  • A small, explicit front-end pipeline makes the rest of the engine cleaner and easier to evolve.

Related notes

  • hqew/002-query-engine-primer.md
  • hqew/004-query-planning.md
  • hqew/005-query-optimization.md
  • hqew/014-how-query-engines-work-part-1.md
  • hqew/016-physical-plans-and-operators.md
  • hqew/017-expressions-types-and-nullability.md

Changelog

  • Apr 7, 2026 -- Removed references to ignored local tmp/ paths.
  • Apr 7, 2026 -- Added a dedicated note on SQL tokenization, parsing, AST structure, and logical planning.