SQL Front End and Logical Planning
A reference for how SQL enters the engine and becomes a logical plan.
Short answer
The SQL front end is the layer that turns query text into an internal representation the rest of the engine can reason about.
In the companion engine, that front-end pipeline is:
- tokenize SQL text
- parse tokens into SQL AST nodes
- translate SQL AST nodes into logical expressions and logical plan operators
- attach schema and type information so later stages can optimize and execute
The important separation is:
- SQL syntax is one surface language
- the SQL AST captures parsed structure
- the logical plan captures query meaning in engine terms
That separation keeps optimization and execution independent from the original text syntax.
Why this layer matters
Without a front-end layer, the engine would have to intermingle:
- user-facing syntax
- semantic validation
- operator construction
- runtime execution details
That would make the system harder to extend and much harder to optimize.
A clean front end gives the engine:
- one place to define supported SQL syntax
- one place to resolve names and types
- one place to reject invalid queries early
- one stable logical representation for later rewrites
The pipeline in the companion repo
Relevant modules:
- tmp/how-query-engines-work/sql/src/main/kotlin/SqlTokenizer.kt
- tmp/how-query-engines-work/sql/src/main/kotlin/Tokens.kt
- tmp/how-query-engines-work/sql/src/main/kotlin/SqlParser.kt
- tmp/how-query-engines-work/sql/src/main/kotlin/Expressions.kt
- tmp/how-query-engines-work/sql/src/main/kotlin/SqlPlanner.kt
- tmp/how-query-engines-work/logical-plan/src/main/kotlin/LogicalPlan.kt
- tmp/how-query-engines-work/logical-plan/src/main/kotlin/Expressions.kt
Conceptually:
SQL text
-> token stream
-> SQL expressions / SqlSelect AST
-> logical expressions
-> logical plan
Tokenization
The tokenizer is responsible for recognizing the smallest meaningful pieces of the query:
- identifiers
- numeric literals
- string literals
- symbols such as ( ) , = > <=
- SQL keywords such as SELECT, FROM, WHERE, GROUP BY
The SqlTokenizer handles:
- whitespace skipping
- quoted identifiers using backticks
- quoted strings using '...' or "..."
- escaped quotes inside strings
- special handling for ambiguous words such as GROUP and ORDER
That last point matters because GROUP might be:
- the GROUP BY keyword sequence
- or an identifier such as a table/column name in a different context
This is a useful reminder that even "simple SQL parsing" starts with contextual decisions.
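The contextual decision described above can be sketched in a few lines. This is a minimal Python illustration, not the companion's Kotlin SqlTokenizer: the token kind names, the keyword set, the rule that GROUP and ORDER count as keywords only when followed by BY, and the doubled-quote escape are all assumptions made for the sketch.

```python
import re
from typing import List, Tuple

Token = Tuple[str, str]  # (kind, text); the kind names are hypothetical

KEYWORDS = {"SELECT", "FROM", "WHERE", "HAVING", "LIMIT", "AND", "OR", "AS"}
SYMBOLS = ["<=", ">=", "!=", "(", ")", ",", "=", "<", ">", "+", "-", "*", "/"]
NUMBER = re.compile(r"\d+(\.\d+)?")
WORD = re.compile(r"[A-Za-z_][A-Za-z_0-9]*")
BY = re.compile(r"\s+BY\b", re.IGNORECASE)


def read_string(sql: str, start: int) -> Tuple[str, int]:
    """Read a quoted string starting at sql[start]; return (text, next index)."""
    quote, i, buf = sql[start], start + 1, []
    while i < len(sql):
        if sql[i] == quote:
            if sql[i + 1:i + 2] == quote:  # assumed escape: a doubled quote
                buf.append(quote)
                i += 2
                continue
            return "".join(buf), i + 1
        buf.append(sql[i])
        i += 1
    raise ValueError("unterminated string literal")


def tokenize(sql: str) -> List[Token]:
    tokens: List[Token] = []
    i = 0
    while i < len(sql):
        ch = sql[i]
        if ch.isspace():  # whitespace only separates tokens
            i += 1
            continue
        if ch == "`":  # backtick-quoted identifier, never treated as a keyword
            end = sql.index("`", i + 1)
            tokens.append(("IDENT", sql[i + 1:end]))
            i = end + 1
            continue
        if ch in ("'", '"'):
            text, i = read_string(sql, i)
            tokens.append(("STRING", text))
            continue
        m = NUMBER.match(sql, i)
        if m:
            tokens.append(("NUMBER", m.group()))
            i = m.end()
            continue
        m = WORD.match(sql, i)
        if m:
            word, i = m.group(), m.end()
            upper = word.upper()
            if upper in ("GROUP", "ORDER"):
                by = BY.match(sql, i)  # look ahead: keyword only if BY follows
                if by:
                    tokens.append(("KEYWORD", upper + " BY"))
                    i = by.end()
                else:
                    tokens.append(("IDENT", word))
            elif upper in KEYWORDS:
                tokens.append(("KEYWORD", upper))
            else:
                tokens.append(("IDENT", word))
            continue
        for sym in SYMBOLS:  # two-character symbols are listed first
            if sql.startswith(sym, i):
                tokens.append(("SYMBOL", sym))
                i += len(sym)
                break
        else:
            raise ValueError(f"unexpected character {ch!r} at index {i}")
    return tokens
```

With this rule, `tokenize("SELECT group FROM t")` yields an identifier for `group`, while `GROUP BY` in the same position yields a single keyword token.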
Parsing
The parser turns tokens into structured SQL expressions.
The companion parser uses a Pratt-parser style for expressions. That is a compact way to handle operator precedence: each operator gets a numeric precedence, so the parser does not need a separate grammar rule for every precedence level.
Key precedence bands in SqlParser include:
- aliasing and sort direction
- OR
- AND
- comparisons
- addition and subtraction
- multiplication and division
- function calls
This means expressions like:
age > 18 AND salary > 100000
can be parsed with the right tree shape rather than as a flat token sequence.
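To make the tree shape concrete, here is a minimal Pratt-style expression parser in Python. It is a sketch rather than the companion's SqlParser: the numeric binding powers are hypothetical (only their relative order matters, and that order is assumed to run from OR as the loosest up to multiplication and division as the tightest), tokens are plain strings, and the tree is built from nested tuples.

```python
from typing import List, Union

# Hypothetical binding powers; a higher number binds more tightly.
PRECEDENCE = {
    "OR": 20, "AND": 30,
    "=": 40, "<": 40, ">": 40, "<=": 40, ">=": 40,
    "+": 50, "-": 50,
    "*": 60, "/": 60,
}

Expr = Union[str, tuple]  # a leaf token or a (left, operator, right) tuple


class PrattParser:
    def __init__(self, tokens: List[str]) -> None:
        self.tokens = tokens
        self.pos = 0

    def peek_precedence(self) -> int:
        """Binding power of the next token, or 0 if it is not an operator."""
        if self.pos < len(self.tokens):
            return PRECEDENCE.get(self.tokens[self.pos].upper(), 0)
        return 0

    def parse(self, precedence: int = 0) -> Expr:
        left = self.parse_prefix()
        # Keep absorbing operators that bind more tightly than the caller's level.
        while precedence < self.peek_precedence():
            op = self.tokens[self.pos].upper()
            self.pos += 1
            right = self.parse(PRECEDENCE[op])
            left = (left, op, right)
        return left

    def parse_prefix(self) -> Expr:
        if self.pos >= len(self.tokens):
            raise ValueError("unexpected end of input")
        tok = self.tokens[self.pos]
        self.pos += 1
        if tok == "(":
            inner = self.parse(0)
            if self.pos >= len(self.tokens) or self.tokens[self.pos] != ")":
                raise ValueError("expected ')'")
            self.pos += 1
            return inner
        return tok  # identifier or literal


tree = PrattParser("age > 18 AND salary > 100000".split()).parse()
# tree == (("age", ">", "18"), "AND", ("salary", ">", "100000"))
```

Because `>` has a higher binding power than `AND`, each comparison is grouped first and `AND` ends up at the root of the tree.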
The SQL AST layer includes nodes such as:
- SqlIdentifier
- SqlString
- SqlLong
- SqlDouble
- SqlBinaryExpr
- SqlFunction
- SqlAlias
- SqlCast
- SqlSort
- SqlSelect
That AST is still SQL-shaped. It reflects query clauses like projection lists, WHERE, GROUP BY, HAVING, ORDER BY, and LIMIT.
What SqlSelect captures
The parser's main relation node is SqlSelect, which stores:
- projection expressions
- optional selection predicate
- grouping expressions
- ordering expressions
- optional having predicate
- optional limit
- table name
This is still a parsed SQL object, not yet a logical plan.
That distinction matters because the SQL AST still speaks in language-level constructs:
- aliases
- SQL function syntax
- keyword-driven clause structure
The logical plan instead speaks in engine-level operators:
- scan
- filter
- projection
- aggregate
- join
- limit
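The contrast between clause-shaped and operator-shaped data can be sketched as follows. This Python dataclass only mirrors the list of things SqlSelect is described as storing; the field names, the use of plain strings for expressions, and the `implied_operators` helper are all hypothetical and are not taken from the companion code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SqlSelect:
    """Clause-shaped record of a parsed SELECT (field names are assumptions)."""
    projection: List[str]
    table_name: str
    selection: Optional[str] = None          # WHERE predicate
    group_by: List[str] = field(default_factory=list)
    order_by: List[str] = field(default_factory=list)
    having: Optional[str] = None
    limit: Optional[int] = None


def implied_operators(select: SqlSelect) -> List[str]:
    """Name the engine-level operators the clauses imply, in a rough
    conceptual order. ORDER BY and joins are left out of this sketch."""
    ops = ["scan"]
    if select.selection is not None:
        ops.append("filter")
    if select.group_by:
        ops.append("aggregate")
    if select.having is not None:
        ops.append("filter")
    ops.append("projection")
    if select.limit is not None:
        ops.append("limit")
    return ops


query = SqlSelect(
    projection=["department", "COUNT(*)"],
    table_name="employees",
    group_by=["department"],
    having="COUNT(*) > 5",
)
ops = implied_operators(query)
```

The dataclass still reads like the SQL text, while the returned list already reads like an engine pipeline. A real planner may arrange these operators differently from this conceptual order.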
From SQL AST to logical expressions
SqlPlanner performs the semantic translation from SQL expressions into logical expressions and then into a DataFrame / logical plan.
Examples of mappings:
- SqlIdentifier -> Column
- string / long / double literals -> logical literal expressions
- SQL comparison operators -> Eq, Gt, LtEq, and similar nodes
- AND / OR -> boolean logical expressions
- +, -, *, / -> arithmetic logical expressions
- CAST(...) -> CastExpr
- aggregate calls such as SUM(x) -> Sum(...)
- COUNT(*) -> Count(LiteralLong(1))
This is the point where SQL surface syntax stops mattering and engine semantics take over.
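A recursive translation of this kind might look like the sketch below. It is written in Python with class names borrowed from the mappings above, but it is not the companion's SqlPlanner: the generic `Binary` node standing in for Eq, Gt, LtEq and friends, the operator names beyond those three, and the representation of `*` as `SqlIdentifier("*")` are assumptions made to keep the example short.

```python
from dataclasses import dataclass
from typing import Tuple


# --- SQL AST side (language-shaped) ---
@dataclass(frozen=True)
class SqlIdentifier:
    id: str

@dataclass(frozen=True)
class SqlLong:
    value: int

@dataclass(frozen=True)
class SqlBinaryExpr:
    left: object
    op: str
    right: object

@dataclass(frozen=True)
class SqlFunction:
    name: str
    args: Tuple[object, ...]


# --- logical side (engine-shaped) ---
@dataclass(frozen=True)
class Column:
    name: str

@dataclass(frozen=True)
class LiteralLong:
    n: int

@dataclass(frozen=True)
class Binary:  # simplified stand-in for Eq, Gt, LtEq, And, Or, ...
    kind: str
    left: object
    right: object

@dataclass(frozen=True)
class Sum:
    expr: object

@dataclass(frozen=True)
class Count:
    expr: object


# Names other than Eq, Gt and LtEq are hypothetical.
BINARY_KINDS = {
    "=": "Eq", ">": "Gt", "<=": "LtEq", "AND": "And", "OR": "Or",
    "+": "Add", "-": "Subtract", "*": "Multiply", "/": "Divide",
}


def to_logical(expr: object) -> object:
    """Translate one SQL AST node into a logical expression, recursively."""
    if isinstance(expr, SqlIdentifier):
        return Column(expr.id)
    if isinstance(expr, SqlLong):
        return LiteralLong(expr.value)
    if isinstance(expr, SqlBinaryExpr):
        kind = BINARY_KINDS.get(expr.op.upper())
        if kind is None:
            raise ValueError(f"unsupported operator: {expr.op}")
        return Binary(kind, to_logical(expr.left), to_logical(expr.right))
    if isinstance(expr, SqlFunction):
        name = expr.name.upper()
        if name == "COUNT" and expr.args == (SqlIdentifier("*"),):
            return Count(LiteralLong(1))  # COUNT(*) counts a constant per row
        if len(expr.args) != 1:
            raise ValueError(f"{name} expects exactly one argument")
        arg = to_logical(expr.args[0])
        if name == "SUM":
            return Sum(arg)
        if name == "COUNT":
            return Count(arg)
        raise ValueError(f"unsupported function: {expr.name}")
    raise ValueError(f"unsupported SQL expression: {expr!r}")
```

After translation, nothing in the result records whether the user wrote `count(*)` or `COUNT(*)`; only the engine-level meaning remains.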
Name resolution and schema dependence
The logical layer is schema-aware.
For example:
- Column(name) resolves itself against the input plan's schema
- toField(input) computes the output field for a logical expression
- invalid column references become planning errors rather than execution surprises
This is important because planning is where the engine should answer:
- does this column exist?
- what type does this expression produce?
- what schema flows out of this operator?
The companion code uses toField(input) as the main hook that gives expressions their output metadata.
That means logical expressions carry more than syntax. They also carry planning-time type and schema consequences.
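The idea behind toField(input) can be illustrated with a small Python sketch. It borrows the names Column and Field from the text, but everything else is an assumption: the `to_field` spelling, the string type names such as `int64` and `boolean`, the `PlanningError` class, and the minimal `Scan` plan that exposes a schema.

```python
from dataclasses import dataclass
from typing import Tuple


class PlanningError(Exception):
    """Raised when an expression cannot be resolved at planning time."""


@dataclass(frozen=True)
class Field:
    name: str
    data_type: str  # type names here are hypothetical


@dataclass(frozen=True)
class Scan:
    table: str
    fields: Tuple[Field, ...]

    def schema(self) -> Tuple[Field, ...]:
        return self.fields


@dataclass(frozen=True)
class Column:
    name: str

    def to_field(self, plan: Scan) -> Field:
        # Resolve the column by name against the input plan's schema.
        for f in plan.schema():
            if f.name == self.name:
                return f
        raise PlanningError(f"no column named '{self.name}' in {plan.table}")


@dataclass(frozen=True)
class LiteralLong:
    n: int

    def to_field(self, plan: Scan) -> Field:
        return Field(str(self.n), "int64")


@dataclass(frozen=True)
class Gt:
    left: object
    right: object

    def to_field(self, plan: Scan) -> Field:
        # Resolving both sides first surfaces bad column references early.
        self.left.to_field(plan)
        self.right.to_field(plan)
        return Field("gt", "boolean")


employees = Scan("employees", (Field("name", "string"), Field("age", "int64")))
predicate_field = Gt(Column("age"), LiteralLong(18)).to_field(employees)
```

Asking `Column("salary").to_field(employees)` raises `PlanningError` before any data is read, which is the planning-time failure described above.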
Planning non-aggregate queries
For non-aggregate queries, the planner does more than a direct clause-to-operator mapping.
One subtle case is a filter that references columns not present in the final projection.
Example shape:
SELECT name
FROM employees
WHERE age > 18
If the filter uses age but the final output only includes name, the planner may need an intermediate projection that keeps both:
- output columns needed by the user
- extra columns needed by the filter
Then it can:
- project enough columns to evaluate the filter
- apply the filter
- drop temporary columns not meant for final output
This is a small but important example of planning being semantic work rather than just syntactic translation.
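The three steps above can be sketched as a small plan builder. This Python example is illustrative only: the operator classes, the `plan_non_aggregate` function, and the choice to pass the predicate as a string alongside an explicit list of the columns it uses are hypothetical simplifications, not the companion planner's API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Scan:
    table: str


@dataclass
class Projection:
    input: object
    columns: List[str]


@dataclass
class Filter:
    input: object
    predicate: str


def plan_non_aggregate(
    table: str,
    projection: List[str],
    predicate: Optional[str] = None,
    predicate_columns: Optional[List[str]] = None,
) -> object:
    plan: object = Scan(table)
    if predicate is None:
        return Projection(plan, projection)
    # Columns the filter needs that the user did not ask for.
    missing = [c for c in (predicate_columns or []) if c not in projection]
    if not missing:
        return Filter(Projection(plan, projection), predicate)
    widened = Projection(plan, projection + missing)  # keep temporary columns
    filtered = Filter(widened, predicate)             # evaluate the predicate
    return Projection(filtered, projection)           # drop temporary columns


plan = plan_non_aggregate("employees", ["name"], "age > 18", ["age"])
```

For the example query, `plan` is a projection of `name` over a filter over a wider projection of `name` and `age`, so `age` exists only long enough to evaluate the predicate.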
Planning aggregate queries
Aggregate queries add more complexity because the engine must distinguish between:
- grouping expressions
- aggregate expressions
- post-aggregate projection
- optional HAVING
The companion planner:
- detects whether the projection contains aggregate expressions
- rejects unsupported cases such as GROUP BY without aggregates
- builds a group-by input
- tracks which projected expressions correspond to group columns versus aggregate results
- creates the aggregate operator
- then applies a final projection over the aggregate output
This is important because aggregate SQL syntax often compresses several semantic stages into one query.
For example:
SELECT department, COUNT(*)
FROM employees
GROUP BY department
HAVING COUNT(*) > 5
is not one primitive operation. It becomes:
- scan input
- optionally filter pre-aggregation rows
- group rows
- compute aggregates
- apply post-aggregation filtering
- project final output
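The staging above can be written out as a plan-building function. The following Python sketch is a simplified illustration under stated assumptions: the operator classes, the `Agg` record, `plan_aggregate`, and the rule that every plain projected column must appear in GROUP BY are hypothetical, and predicates are kept as strings instead of expression trees.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


class PlanningError(Exception):
    pass


@dataclass
class Agg:
    func: str
    arg: str

    def output_name(self) -> str:
        return f"{self.func}({self.arg})"


@dataclass
class Scan:
    table: str


@dataclass
class Filter:
    input: object
    predicate: str


@dataclass
class Aggregate:
    input: object
    group_by: List[str]
    aggregates: List[Agg]


@dataclass
class Projection:
    input: object
    columns: List[str]


def plan_aggregate(
    table: str,
    projection: List[Union[str, Agg]],
    group_by: List[str],
    where: Optional[str] = None,
    having: Optional[str] = None,
) -> Projection:
    aggregates = [p for p in projection if isinstance(p, Agg)]
    if not aggregates:
        # Covers the rejected case of GROUP BY without aggregates.
        raise PlanningError("aggregate planning needs at least one aggregate")
    for p in projection:
        if isinstance(p, str) and p not in group_by:
            raise PlanningError(f"column '{p}' must appear in GROUP BY")
    plan: object = Scan(table)                          # scan input
    if where is not None:
        plan = Filter(plan, where)                      # pre-aggregation filter
    plan = Aggregate(plan, group_by, aggregates)        # group and aggregate
    if having is not None:
        plan = Filter(plan, having)                     # post-aggregation filter
    output = [p.output_name() if isinstance(p, Agg) else p for p in projection]
    return Projection(plan, output)                     # final projection


plan = plan_aggregate(
    "employees",
    ["department", Agg("COUNT", "*")],
    group_by=["department"],
    having="COUNT(*) > 5",
)
```

Each line in the body corresponds to one of the stages listed above, which is why a single SELECT statement ends up as a stack of four operators here.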
What the front end validates
The front end already enforces meaningful semantic constraints, such as:
- referenced tables must exist
- referenced columns must exist
- aggregate functions must have valid arguments
- COUNT(*) is treated specially
- LIMIT must be numeric
- unsupported data types in CAST are rejected
This is the right place for those checks because execution should not be responsible for discovering syntax- or meaning-level errors.
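A few of these checks can be sketched as a standalone validation pass. This Python example is hypothetical throughout: the `validate_query` function, the catalog shape, the set of supported cast types, and the decision to collect error messages in a list instead of raising on the first problem are assumptions for illustration, not a description of the companion planner.

```python
from typing import Dict, Iterable, List, Optional

# Hypothetical set of cast targets; the real supported list may differ.
SUPPORTED_CAST_TYPES = {"int", "long", "float", "double", "string"}


def validate_query(
    catalog: Dict[str, List[str]],
    table: str,
    columns: Iterable[str],
    limit: Optional[str] = None,
    cast_targets: Iterable[str] = (),
) -> List[str]:
    """Return a list of planning-time error messages (empty means valid)."""
    if table not in catalog:
        # Without a table there is no schema to check columns against.
        return [f"unknown table: {table}"]
    errors: List[str] = []
    known = set(catalog[table])
    for column in columns:
        if column not in known:
            errors.append(f"unknown column: {column}")
    if limit is not None and not limit.isdigit():
        errors.append(f"LIMIT must be numeric: {limit}")
    for target in cast_targets:
        if target.lower() not in SUPPORTED_CAST_TYPES:
            errors.append(f"unsupported CAST type: {target}")
    return errors


catalog = {"employees": ["name", "age", "department"]}
problems = validate_query(
    catalog, "employees", ["name", "salary"], limit="ten", cast_targets=["uuid"]
)
```

Here `problems` reports the unknown column, the non-numeric LIMIT, and the unsupported cast type, all before any operator has been executed.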
Supported surface area and current limits
The front end in the companion repo is intentionally small and teachable rather than SQL-complete.
It supports a useful subset:
- SELECT
- FROM
- WHERE
- GROUP BY
- HAVING
- ORDER BY
- LIMIT
- basic arithmetic and boolean expressions
- basic aggregate functions
- CAST
- date and interval literals
But it is still limited compared with a production SQL engine:
- no deep subquery machinery
- no rich multi-table SQL grammar
- no extensive type coercion rules
- limited cast target support
- limited aggregate/function surface
That is not a flaw in the teaching model. It keeps the front end small enough to reveal the core ideas clearly.
Mental model: AST vs logical plan
This is the key conceptual distinction.
The SQL AST answers:
- what syntax did the user write?
The logical plan answers:
- what data operations does that syntax mean?
That distinction is one of the most important design boundaries in a query engine.
It is what allows:
- multiple front ends to target one engine
- optimizers to work on a syntax-independent representation
- physical planners to ignore SQL spelling details
Main takeaways
- Tokenization and parsing are not just string processing. They define the supported query language.
- The SQL AST is still language-shaped, while the logical plan is engine-shaped.
- Planning is where the engine starts doing semantic work: name resolution, schema propagation, and operator construction.
- Aggregate queries require especially careful translation because one SQL clause can imply several execution stages.
- A small, explicit front-end pipeline makes the rest of the engine cleaner and easier to evolve.
Related notes
- hqew/002-query-engine-primer.md
- hqew/004-query-planning.md
- hqew/005-query-optimization.md
- hqew/014-how-query-engines-work-part-1.md
- hqew/016-physical-plans-and-operators.md
- hqew/017-expressions-types-and-nullability.md
Changelog
- Apr 7, 2026 -- Added a dedicated note on SQL tokenization, parsing, AST structure, and logical planning.