Hassan Abedi 1599c82614 Update the newly added note files and remove references to ignored files

2026-04-07 15:32:39 +02:00

SQL Front End and Logical Planning

A reference for how SQL enters the engine and becomes a logical plan.

Short answer

The SQL front end is the layer that turns query text into an internal representation the rest of the engine can reason about.

In the companion engine, that front-end pipeline is:

tokenize SQL text
parse tokens into SQL AST nodes
translate SQL AST nodes into logical expressions and logical plan operators
attach schema and type information so later stages can optimize and execute

The important separation is:

SQL syntax is one surface language
the SQL AST captures parsed structure
the logical plan captures query meaning in engine terms

That separation keeps optimization and execution independent from the original text syntax.

Why this layer matters

Without a front-end layer, the engine would have to intermingle:

user-facing syntax
semantic validation
operator construction
runtime execution details

That would make the system harder to extend and much harder to optimize.

A clean front end gives the engine:

one place to define supported SQL syntax
one place to resolve names and types
one place to reject invalid queries early
one stable logical representation for later rewrites

The pipeline in the companion repo

Relevant implementation areas:

SQL tokenization
SQL tokens and keywords
SQL parsing
SQL AST expressions
SQL-to-logical planning
logical plan interfaces
logical expressions

Conceptually:

SQL text
  -> token stream
  -> SQL expressions / SqlSelect AST
  -> logical expressions
  -> logical plan

Tokenization

The tokenizer is responsible for recognizing the smallest meaningful pieces of the query:

identifiers
numeric literals
string literals
symbols such as (, ), ,, =, >, <=
SQL keywords such as SELECT, FROM, WHERE, GROUP BY

The SqlTokenizer handles:

whitespace skipping
quoted identifiers using backticks
quoted strings using '...' or "..."
escaped quotes inside strings
special handling for ambiguous words such as GROUP and ORDER

That last point matters because GROUP might be:

the GROUP BY keyword sequence
or an identifier such as a table/column name in a different context

This is a useful reminder that even "simple SQL parsing" starts with contextual decisions.
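The tokenizer responsibilities above can be sketched in a few dozen lines of Python. This is an illustrative sketch, not the companion SqlTokenizer: the `Token` class, the keyword and symbol sets, and the rule that a doubled quote is an escaped quote are all assumptions made for the example. The part to notice is the contextual branch, where GROUP and ORDER become keywords only when the next word is BY.

```python
from dataclasses import dataclass

# Hypothetical keyword and symbol sets for this sketch only.
KEYWORDS = {"SELECT", "FROM", "WHERE", "BY", "AND", "OR", "AS"}
CONTEXTUAL = {"GROUP", "ORDER"}  # keywords only when followed by BY
TWO_CHAR_SYMBOLS = {"<=", ">=", "!="}
ONE_CHAR_SYMBOLS = set("(),=<>+-*/")


@dataclass(frozen=True)
class Token:
    kind: str  # "KEYWORD", "IDENT", "NUMBER", "STRING", or "SYMBOL"
    text: str


def read_quoted(sql, start):
    """Read a quoted string starting at sql[start]; return (value, next_index)."""
    quote, chars, i = sql[start], [], start + 1
    while i < len(sql):
        if sql[i] != quote:
            chars.append(sql[i])
            i += 1
        elif sql[i + 1:i + 2] == quote:  # doubled quote = escaped quote (assumed)
            chars.append(quote)
            i += 2
        else:
            return "".join(chars), i + 1
    raise ValueError("unterminated string literal")


def tokenize(sql):
    tokens, i = [], 0
    while i < len(sql):
        ch = sql[i]
        if ch.isspace():
            i += 1
        elif ch == "`":  # backtick-quoted identifier
            end = sql.index("`", i + 1)
            tokens.append(Token("IDENT", sql[i + 1:end]))
            i = end + 1
        elif ch in "'\"":
            value, i = read_quoted(sql, i)
            tokens.append(Token("STRING", value))
        elif ch.isdigit():
            end = i
            while end < len(sql) and (sql[end].isdigit() or sql[end] == "."):
                end += 1
            tokens.append(Token("NUMBER", sql[i:end]))
            i = end
        elif ch.isalpha() or ch == "_":
            end = i
            while end < len(sql) and (sql[end].isalnum() or sql[end] == "_"):
                end += 1
            word = sql[i:end]
            upper = word.upper()
            if upper in CONTEXTUAL:
                # Contextual decision: look ahead at the next word.
                following = sql[end:].split(None, 1)
                is_keyword = bool(following) and following[0].upper() == "BY"
            else:
                is_keyword = upper in KEYWORDS
            tokens.append(Token("KEYWORD", upper) if is_keyword else Token("IDENT", word))
            i = end
        elif sql[i:i + 2] in TWO_CHAR_SYMBOLS:
            tokens.append(Token("SYMBOL", sql[i:i + 2]))
            i += 2
        elif ch in ONE_CHAR_SYMBOLS:
            tokens.append(Token("SYMBOL", ch))
            i += 1
        else:
            raise ValueError(f"unexpected character {ch!r} at position {i}")
    return tokens
```

With this sketch, `tokenize("select group from t")` yields `group` as an identifier, while the same word in `GROUP BY name` comes out as a keyword.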

Parsing

The parser turns tokens into structured SQL expressions.

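The companion parser uses a Pratt-parser style for expressions. Each operator is assigned a precedence, and the parser uses those values to decide how tokens nest, so operator precedence is handled without a separate grammar rule for each level.

Key precedence bands in SqlParser, from lowest to highest precedence, include: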

aliasing and sort direction
OR
AND
comparisons
addition and subtraction
multiplication and division
function calls

This means expressions like:

age > 18 AND salary > 100000

are parsed into a tree with AND at the root and a comparison on each side, rather than being left as a flat token sequence.
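A minimal Pratt-style expression parser, assuming the precedence ordering listed above, might look like the following. It is a sketch rather than the companion SqlParser: the numeric precedence values, the regex-based `lex` helper, and the tuple-shaped output are all choices made for this example, and aliasing, sort direction, and function calls are left out.

```python
import re

# Binding power per infix operator: a higher number binds tighter. The order
# follows the precedence bands listed above; the numbers are arbitrary.
PRECEDENCE = {
    "OR": 20,
    "AND": 30,
    "=": 40, "!=": 40, "<": 40, "<=": 40, ">": 40, ">=": 40,
    "+": 50, "-": 50,
    "*": 60, "/": 60,
}

TOKEN_RE = re.compile(r"\s*(<=|>=|!=|[()=<>+\-*/]|\d+(?:\.\d+)?|\w+)")


def lex(text):
    """Split an expression into raw token strings (a stand-in tokenizer)."""
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"cannot tokenize at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class PrattParser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        token = self.peek()
        if token is None:
            raise ValueError("unexpected end of input")
        self.pos += 1
        return token

    def next_precedence(self):
        token = self.peek()
        return 0 if token is None else PRECEDENCE.get(token.upper(), 0)

    def parse(self, min_precedence=0):
        left = self.parse_prefix()
        # Keep folding operators that bind tighter than the caller's operator.
        while self.next_precedence() > min_precedence:
            op = self.advance().upper()
            right = self.parse(PRECEDENCE[op])
            left = (op, left, right)
        return left

    def parse_prefix(self):
        token = self.advance()
        if token == "(":
            inner = self.parse(0)
            if self.advance() != ")":
                raise ValueError("expected ')'")
            return inner
        return token  # identifier or literal


def parse_expr(text):
    parser = PrattParser(lex(text))
    tree = parser.parse()
    if parser.peek() is not None:
        raise ValueError(f"unexpected token {parser.peek()!r}")
    return tree
```

Here `parse_expr("age > 18 AND salary > 100000")` returns `("AND", (">", "age", "18"), (">", "salary", "100000"))`. The whole precedence table lives in one dictionary, which is what keeps this style compact.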

The SQL AST layer includes nodes such as:

SqlIdentifier
SqlString
SqlLong
SqlDouble
SqlBinaryExpr
SqlFunction
SqlAlias
SqlCast
SqlSort
SqlSelect

That AST is still SQL-shaped. It reflects query clauses like projection lists, WHERE, GROUP BY, HAVING, ORDER BY, and LIMIT.

What `SqlSelect` captures

The parser's main relation node is SqlSelect, which stores:

projection expressions
optional selection predicate
grouping expressions
ordering expressions
optional having predicate
optional limit
table name

This is still a parsed SQL object, not yet a logical plan.
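As a rough picture of that shape, the fields listed above can be written as a plain record. The Python field names and the helper AST classes below are hypothetical stand-ins chosen for this sketch; they are not taken from the companion code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SqlIdentifier:
    name: str


@dataclass(frozen=True)
class SqlLong:
    value: int


@dataclass(frozen=True)
class SqlBinaryExpr:
    left: object
    op: str
    right: object


@dataclass(frozen=True)
class SqlSelect:
    projection: tuple            # expressions in the SELECT list
    selection: Optional[object]  # WHERE predicate, if any
    group_by: tuple              # GROUP BY expressions
    order_by: tuple              # ORDER BY expressions
    having: Optional[object]     # HAVING predicate, if any
    limit: Optional[int]
    table_name: str


# SELECT name FROM employees WHERE age > 18
select = SqlSelect(
    projection=(SqlIdentifier("name"),),
    selection=SqlBinaryExpr(SqlIdentifier("age"), ">", SqlLong(18)),
    group_by=(),
    order_by=(),
    having=None,
    limit=None,
    table_name="employees",
)
```

Every field corresponds to a clause the user wrote. Nothing here says how the query will be evaluated.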

That distinction matters because the SQL AST still speaks in language-level constructs:

aliases
SQL function syntax
keyword-driven clause structure

The logical plan instead speaks in engine-level operators:

scan
filter
projection
aggregate
join
limit

From SQL AST to logical expressions

SqlPlanner performs the semantic translation from SQL expressions into logical expressions and then into a DataFrame / logical plan.

Examples of mappings:

SqlIdentifier -> Column
string / long / double literals -> logical literal expressions
SQL comparison operators -> Eq, Gt, LtEq, and similar nodes
AND / OR -> boolean logical expressions
+, -, *, / -> arithmetic logical expressions
CAST(...) -> CastExpr
aggregate calls such as SUM(x) -> Sum(...)
COUNT(*) -> Count(LiteralLong(1))

This is the point where SQL surface syntax stops mattering and engine semantics take over.
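The mappings above amount to a recursive translation over the AST. The sketch below assumes simplified stand-ins for both sides: the class shapes, the single `BinaryExpr` node with a `kind` tag (instead of separate Eq, Gt, and LtEq classes), and the representation of `*` as `SqlIdentifier("*")` are assumptions for illustration only.

```python
from dataclasses import dataclass


# --- SQL AST side (syntax-shaped, hypothetical shapes) ---
@dataclass(frozen=True)
class SqlIdentifier:
    name: str


@dataclass(frozen=True)
class SqlLong:
    value: int


@dataclass(frozen=True)
class SqlBinaryExpr:
    left: object
    op: str
    right: object


@dataclass(frozen=True)
class SqlFunction:
    name: str
    args: tuple


# --- Logical side (engine-shaped, hypothetical shapes) ---
@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class LiteralLong:
    value: int


@dataclass(frozen=True)
class BinaryExpr:
    kind: str  # "Eq", "Gt", "And", "Add", ...
    left: object
    right: object


@dataclass(frozen=True)
class Sum:
    expr: object


@dataclass(frozen=True)
class Count:
    expr: object


BINARY_KINDS = {
    "=": "Eq", "!=": "Neq", ">": "Gt", ">=": "GtEq", "<": "Lt", "<=": "LtEq",
    "AND": "And", "OR": "Or",
    "+": "Add", "-": "Subtract", "*": "Multiply", "/": "Divide",
}


def to_logical(expr):
    """Translate one SQL AST node into a logical expression."""
    if isinstance(expr, SqlIdentifier):
        return Column(expr.name)
    if isinstance(expr, SqlLong):
        return LiteralLong(expr.value)
    if isinstance(expr, SqlBinaryExpr):
        kind = BINARY_KINDS.get(expr.op.upper())
        if kind is None:
            raise ValueError(f"unsupported operator: {expr.op}")
        return BinaryExpr(kind, to_logical(expr.left), to_logical(expr.right))
    if isinstance(expr, SqlFunction):
        name = expr.name.upper()
        if name == "COUNT" and expr.args == (SqlIdentifier("*"),):
            return Count(LiteralLong(1))  # COUNT(*) counts a constant per row
        if name == "COUNT":
            return Count(to_logical(expr.args[0]))
        if name == "SUM":
            return Sum(to_logical(expr.args[0]))
        raise ValueError(f"unsupported function: {expr.name}")
    raise ValueError(f"unsupported SQL expression: {expr!r}")
```

After this step the operator spelling (`>` versus `Gt`) is gone; later stages only see logical nodes.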

Name resolution and schema dependence

The logical layer is schema-aware.

For example:

Column(name) resolves itself against the input plan's schema
toField(input) computes the output field for a logical expression
invalid column references become planning errors rather than execution surprises

This is important because planning is where the engine should answer:

does this column exist?
what type does this expression produce?
what schema flows out of this operator?

The companion code uses toField(input) as the main hook that gives expressions their output metadata.

That means logical expressions carry more than syntax. They also carry planning-time type and schema consequences.
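The idea behind toField(input) can be shown with a small sketch. The Python names below (`to_field`, `Field`, `Scan`, `PlanningError`, the type strings) are hypothetical and only mirror the behaviour described above: each expression asks its input plan for a schema and either returns an output field or fails during planning.

```python
from dataclasses import dataclass


class PlanningError(Exception):
    """Raised while planning, before any rows are read."""


@dataclass(frozen=True)
class Field:
    name: str
    data_type: str


@dataclass(frozen=True)
class Scan:
    """A leaf plan whose schema is known up front."""
    table: str
    fields: tuple

    def schema(self):
        return self.fields


@dataclass(frozen=True)
class Column:
    name: str

    def to_field(self, input_plan):
        # Name resolution: look the column up in the input plan's schema.
        for field in input_plan.schema():
            if field.name == self.name:
                return field
        raise PlanningError(f"no column named {self.name!r} in input schema")


@dataclass(frozen=True)
class LiteralLong:
    value: int

    def to_field(self, input_plan):
        return Field(str(self.value), "Int64")


@dataclass(frozen=True)
class Gt:
    left: object
    right: object

    def to_field(self, input_plan):
        # Resolve both operands first so bad references fail here.
        self.left.to_field(input_plan)
        self.right.to_field(input_plan)
        return Field("gt", "Boolean")


employees = Scan("employees", (Field("name", "Utf8"), Field("age", "Int64")))
predicate = Gt(Column("age"), LiteralLong(18))
```

Calling `predicate.to_field(employees)` gives a Boolean field, while `Column("salary").to_field(employees)` raises `PlanningError` without touching any data.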

Planning non-aggregate queries

For non-aggregate queries, the planner does more than a direct clause-to-operator mapping.

One subtle case is a filter that references columns not present in the final projection.

Example shape:

SELECT name
FROM employees
WHERE age > 18

If the filter uses age but the final output only includes name, the planner may need an intermediate projection that keeps both:

output columns needed by the user
extra columns needed by the filter

Then it can:

project enough columns to evaluate the filter
apply the filter
drop temporary columns not meant for final output

This is a small but important example of planning being semantic work rather than just syntactic translation.
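The three steps above can be sketched as a small plan builder. The plan node classes and the `plan_select` function are hypothetical, and predicates are kept as plain strings with their referenced columns passed in separately, which a real planner would derive from the expression tree.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Projection:
    input: object
    columns: tuple


@dataclass(frozen=True)
class Selection:
    input: object
    predicate: str


def plan_select(table, projected, predicate=None, predicate_columns=()):
    """Build a logical plan for SELECT projected FROM table [WHERE predicate]."""
    plan = Scan(table)
    if predicate is None:
        return Projection(plan, tuple(projected))
    # Columns the filter reads that the user did not ask for.
    extra = tuple(c for c in dict.fromkeys(predicate_columns) if c not in projected)
    # 1. Project enough columns to evaluate the filter.
    plan = Projection(plan, tuple(projected) + extra)
    # 2. Apply the filter.
    plan = Selection(plan, predicate)
    # 3. Drop temporary columns not meant for final output.
    if extra:
        plan = Projection(plan, tuple(projected))
    return plan


plan = plan_select("employees", ["name"], "age > 18", ["age"])
```

For the example query, `plan` is a projection of `name` over a filter over a projection of `name` and `age` over the scan. When the filter only uses columns already in the output, the final projection is not needed and the sketch skips it.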

Planning aggregate queries

Aggregate queries add more complexity because the engine must distinguish between:

grouping expressions
aggregate expressions
post-aggregate projection
optional HAVING

The companion planner:

detects whether the projection contains aggregate expressions
rejects unsupported cases such as GROUP BY without aggregates
builds a group-by input
tracks which projected expressions correspond to group columns versus aggregate results
creates the aggregate operator
then applies a final projection over the aggregate output

This is important because aggregate SQL syntax often compresses several semantic stages into one query.

For example:

SELECT department, COUNT(*)
FROM employees
GROUP BY department
HAVING COUNT(*) > 5

is not one primitive operation. It becomes:

scan input
optionally filter pre-aggregation rows
group rows
compute aggregates
apply post-aggregation filtering
project final output
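The stages above can be sketched as a plan builder that stacks one operator per stage. Everything here is a simplified assumption for illustration: expressions are plain strings, aggregates are detected by function name prefix, and the node classes and `plan_aggregate` are hypothetical rather than the companion planner's API.

```python
from dataclasses import dataclass


class PlanningError(Exception):
    pass


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Selection:
    input: object
    predicate: str


@dataclass(frozen=True)
class Aggregate:
    input: object
    group_by: tuple
    aggregates: tuple


@dataclass(frozen=True)
class Projection:
    input: object
    columns: tuple


AGGREGATE_PREFIXES = ("COUNT(", "SUM(", "MIN(", "MAX(", "AVG(")


def is_aggregate(expr):
    return expr.upper().startswith(AGGREGATE_PREFIXES)


def plan_aggregate(table, select_list, group_by, where=None, having=None):
    # Split the SELECT list into group columns and aggregate results.
    aggregates = tuple(e for e in select_list if is_aggregate(e))
    if group_by and not aggregates:
        raise PlanningError("GROUP BY without aggregates is not supported")
    plan = Scan(table)
    if where is not None:
        plan = Selection(plan, where)    # pre-aggregation filter
    plan = Aggregate(plan, tuple(group_by), aggregates)
    if having is not None:
        plan = Selection(plan, having)   # post-aggregation filter
    return Projection(plan, tuple(select_list))


plan = plan_aggregate(
    "employees",
    select_list=["department", "COUNT(*)"],
    group_by=["department"],
    having="COUNT(*) > 5",
)
```

The resulting `plan` nests four operators (projection, selection, aggregate, scan), which makes visible how one query expands into several logical stages.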

What the front end validates

The front end already enforces meaningful semantic constraints, such as:

referenced tables must exist
referenced columns must exist
aggregate functions must have valid arguments
COUNT(*) is treated specially (it is planned as a count over a literal rather than over a column)
LIMIT must be numeric
unsupported data types in CAST are rejected

This is the right place for those checks because execution should not be responsible for discovering syntax- or meaning-level errors.
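A few of these checks can be sketched as a single planning-time function. The catalog shape (a dict from table name to a set of column names), the list of cast targets, and the `validate_query` signature are all assumptions for this example; the companion code performs its checks inside the planner rather than in one standalone function.

```python
# Hypothetical list of cast targets; the real supported set may differ.
SUPPORTED_CAST_TYPES = {"int", "bigint", "float", "double", "string"}


def validate_query(catalog, table, columns, limit_text=None, cast_types=()):
    """Return a list of planning errors; an empty list means the query passes."""
    errors = []
    schema = catalog.get(table)
    if schema is None:
        errors.append(f"table not found: {table}")
    else:
        for column in columns:
            if column not in schema:
                errors.append(f"column not found: {column}")
    if limit_text is not None and not limit_text.isdigit():
        errors.append(f"LIMIT must be numeric: {limit_text}")
    for type_name in cast_types:
        if type_name.lower() not in SUPPORTED_CAST_TYPES:
            errors.append(f"unsupported CAST type: {type_name}")
    return errors


catalog = {"employees": {"name", "age", "department"}}
problems = validate_query(catalog, "employees", ["name", "salary"], limit_text="ten")
```

Here `problems` reports the unknown `salary` column and the non-numeric limit, and no rows were read to find either one.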

Supported surface area and current limits

The front end in the companion repo is intentionally small and teachable rather than SQL-complete.

It supports a useful subset:

SELECT
FROM
WHERE
GROUP BY
HAVING
ORDER BY
LIMIT
basic arithmetic and boolean expressions
basic aggregate functions
CAST
date and interval literals

But it is still limited compared with a production SQL engine:

no deep subquery machinery
no rich multi-table SQL grammar
no extensive type coercion rules
limited cast target support
limited aggregate/function surface

That is not a flaw in the teaching model. It keeps the front end small enough to reveal the core ideas clearly.

Mental model: AST vs logical plan

This is the key conceptual distinction.

The SQL AST answers:

what syntax did the user write?

The logical plan answers:

what data operations does that syntax mean?

That distinction is one of the most important design boundaries in a query engine.

It is what allows:

multiple front ends to target one engine
optimizers to work on a syntax-independent representation
physical planners to ignore SQL spelling details

Main takeaways

Tokenization and parsing are not just string processing. They define the supported query language.
The SQL AST is still language-shaped, while the logical plan is engine-shaped.
Planning is where the engine starts doing semantic work: name resolution, schema propagation, and operator construction.
Aggregate queries require especially careful translation because a single SQL statement can imply several logical stages.
A small, explicit front-end pipeline makes the rest of the engine cleaner and easier to evolve.

hqew/002-query-engine-primer.md
hqew/004-query-planning.md
hqew/005-query-optimization.md
hqew/014-how-query-engines-work-part-1.md
hqew/016-physical-plans-and-operators.md
hqew/017-expressions-types-and-nullability.md

Changelog

Apr 7, 2026 -- Removed references to ignored local tmp/ paths.
Apr 7, 2026 -- Added a dedicated note on SQL tokenization, parsing, AST structure, and logical planning.
