Add note files about query planning and optimization

Hassan Abedi 2026-03-31 16:16:53 +02:00
parent 584f82bb82
commit 2a33f8b483
2 changed files with 305 additions and 0 deletions

hqew/004-query-planning.md Normal file

@ -0,0 +1,147 @@
# Query Planning
A reference for how a query request becomes an internal plan.
---
## Short answer
Query planning is the stage where a query engine turns a user request into a structured representation of work to be done.
The main point is to separate:
- the syntax the user wrote
- the meaning of the query
- the later execution strategy
Without that separation, optimization and backend-independent execution become much harder.
---
## Typical pipeline
Planning usually sits between parsing and optimization:
1. parse query text or API calls
2. build an AST or similar syntax tree
3. resolve names and types
4. produce a logical plan
5. hand that plan to the optimizer
The exact boundaries differ across systems, but the general idea is stable.
---
## What planning does
### Translate syntax into operations
The planner turns syntax such as `SELECT`, `WHERE`, `GROUP BY`, and `JOIN` into relational operators such as:
- scan
- projection
- filter
- join
- aggregate
- limit
### Resolve names
The planner figures out what table or source a name refers to and which columns expressions mention.
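A minimal sketch of name resolution, assuming a hypothetical in-memory `CATALOG` and a `tables` mapping from FROM-clause aliases to table names (neither comes from these notes). It handles qualified names such as `d.id` and rejects unknown or ambiguous unqualified names:

```python
# Hypothetical catalog: table name -> column names.
CATALOG = {
    "employees": ["id", "name", "age"],
    "departments": ["id", "title"],
}


def resolve_column(column, tables):
    """Resolve a column reference to a (table, column) pair.

    `tables` maps each alias in the FROM clause to a catalog table name.
    """
    if "." in column:
        # Qualified reference such as "d.id": look up the alias first.
        alias, name = column.split(".", 1)
        if alias not in tables:
            raise LookupError(f"unknown table or alias: {alias}")
        if name not in CATALOG[tables[alias]]:
            raise LookupError(f"unknown column: {column}")
        return tables[alias], name

    # Unqualified reference: exactly one table in scope must provide it.
    matches = [t for t in tables.values() if column in CATALOG[t]]
    if not matches:
        raise LookupError(f"unknown column: {column}")
    if len(matches) > 1:
        raise LookupError(f"ambiguous column: {column}")
    return matches[0], column
```

A real planner would also resolve functions, nested scopes, and output aliases; this sketch covers only flat column lookup.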
### Check types
The planner verifies that expressions are valid, such as comparing compatible types or ensuring aggregates are used correctly.
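As an illustration, type checking can be sketched as a recursive walk over a small expression tree. The node classes (`Column`, `Literal`, `Compare`) and the string type names are assumptions made for this example, not part of any particular engine:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class Literal:
    value: object


@dataclass(frozen=True)
class Compare:
    op: str
    left: object
    right: object


NUMERIC = {"int", "float"}


def type_of(expr, schema):
    """Return the type name of `expr`, or raise TypeError if it is invalid."""
    if isinstance(expr, Column):
        if expr.name not in schema:
            raise TypeError(f"unknown column: {expr.name}")
        return schema[expr.name]
    if isinstance(expr, Literal):
        # bool is tested before int because bool is a subclass of int in Python.
        for py_type, name in ((bool, "bool"), (int, "int"), (float, "float"), (str, "str")):
            if isinstance(expr.value, py_type):
                return name
        raise TypeError(f"unsupported literal: {expr.value!r}")
    if isinstance(expr, Compare):
        left = type_of(expr.left, schema)
        right = type_of(expr.right, schema)
        # Operands must share a type, or both be numeric.
        if left == right or {left, right} <= NUMERIC:
            return "bool"
        raise TypeError(f"cannot compare {left} with {right}")
    raise TypeError(f"unsupported expression: {expr!r}")
```

With a schema of `{"name": "str", "age": "int"}`, the predicate `age > 18` checks as `bool`, while `name > 18` is rejected.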
### Build expressions
Predicates and computed columns are turned into internal expression trees.
### Attach schema information
The planner determines the shape of operator outputs so later stages know what columns and types flow through the plan.
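One way to picture schema attachment is a function that derives each operator's output schema from its child. The operator classes and the `CATALOG` contents below are hypothetical, chosen to match the `employees` example used elsewhere in these notes:

```python
from dataclasses import dataclass

# Hypothetical catalog: table name -> {column name: type name}.
CATALOG = {"employees": {"id": "int", "name": "str", "age": "int"}}


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Filter:
    child: object
    predicate: str  # kept as text here; a real plan would hold an expression tree


@dataclass(frozen=True)
class Projection:
    child: object
    columns: tuple


def output_schema(node):
    """Return {column: type} describing the rows this operator produces."""
    if isinstance(node, Scan):
        return dict(CATALOG[node.table])
    if isinstance(node, Filter):
        # A filter removes rows but leaves the column layout unchanged.
        return output_schema(node.child)
    if isinstance(node, Projection):
        child = output_schema(node.child)
        return {name: child[name] for name in node.columns}
    raise TypeError(f"unsupported operator: {node!r}")
```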
---
## AST vs logical plan
This distinction matters.
- the AST reflects the query language syntax
- the logical plan reflects the data operations implied by that syntax
For example, SQL text contains clause keywords and aliases that matter while parsing and resolving names, but they carry no further information once the engine knows that the query means "scan, filter, then project."
So planning is partly a translation from language syntax into operation-oriented semantics.
---
## A tiny example
Query:
```sql
SELECT name
FROM employees
WHERE age > 18
```
The parser may produce an AST containing nodes like:
- `SelectStatement`
- `FromClause`
- `WhereClause`
The planner turns that into a logical plan:
1. `Scan(employees)`
2. `Filter(age > 18)`
3. `Projection(name)`
That logical plan is what later stages optimize.
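The translation above can be sketched in a few lines. The class shapes are assumptions for illustration (the predicate is kept as plain text rather than an expression tree), but the node names follow the example:

```python
from dataclasses import dataclass
from typing import Optional


# AST nodes mirror the syntax of the query.
@dataclass(frozen=True)
class FromClause:
    table: str


@dataclass(frozen=True)
class WhereClause:
    predicate: str


@dataclass(frozen=True)
class SelectStatement:
    columns: tuple
    from_clause: FromClause
    where_clause: Optional[WhereClause] = None


# Logical plan nodes describe data operations.
@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Filter:
    child: object
    predicate: str


@dataclass(frozen=True)
class Projection:
    child: object
    columns: tuple


def plan_select(ast):
    """Build the logical plan bottom-up: scan, then filter, then project."""
    plan = Scan(ast.from_clause.table)
    if ast.where_clause is not None:
        plan = Filter(plan, ast.where_clause.predicate)
    return Projection(plan, ast.columns)
```

Note that the plan nests in the opposite order from the text: the projection is the root, and the scan is the leaf.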
---
## Why planning matters
Planning is valuable because it creates the first stable representation of meaning inside the engine.
That gives the system a place to:
- validate the query
- reason about schemas
- rewrite plans
- compare equivalent formulations
- target different execution backends
In practice, planning is the bridge between the front-end language and the execution engine.
---
## Common complications
Planning gets harder when the query language includes:
- nested queries
- correlated subqueries
- user-defined functions
- ambiguous names
- multiple source types
- non-relational operators
This is why planning is often a substantial subsystem, not just a parser post-processing step.
---
## Practical mental model
If parsing answers "what syntax did the user write?", planning answers "what data operations does that syntax mean?"
That is the cleanest way to think about it.
---
## Changelog
* **Mar 31, 2026** -- First version created.


@ -0,0 +1,158 @@
# Query Optimization
A reference for how query engines make a plan cheaper without changing its meaning.
---
## Short answer
Query optimization is the process of rewriting a logical or physical plan into an equivalent but more efficient form.
The key word is equivalent: the result must stay the same even though the execution strategy changes.
---
## Why optimization exists
There are usually many ways to compute the same query.
For example, an engine may be able to:
- read all columns or only the needed ones
- filter before or after another operator
- join tables in different orders
- pick different join algorithms
Optimization tries to choose a cheaper plan in terms of CPU, memory, I/O, and network cost.
---
## Common optimizations
### Projection pushdown
Read only the columns that are actually needed.
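A rough sketch of the idea, assuming hypothetical plan nodes in which each `Filter` records the columns its predicate reads. The rewrite walks down the plan, collects the columns required above each node, and narrows the `Scan` to that set:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Scan:
    table: str
    columns: Optional[tuple] = None  # None means "read every column"


@dataclass(frozen=True)
class Filter:
    child: object
    predicate: str
    predicate_columns: tuple  # columns the predicate reads


@dataclass(frozen=True)
class Projection:
    child: object
    columns: tuple


def push_projection(node, required=None):
    """Narrow scans to the columns that operators above them actually use."""
    if isinstance(node, Projection):
        child = push_projection(node.child, set(node.columns))
        return Projection(child, node.columns)
    if isinstance(node, Filter):
        # The filter needs its own columns in addition to what the parent needs.
        needed = None if required is None else required | set(node.predicate_columns)
        child = push_projection(node.child, needed)
        return Filter(child, node.predicate, node.predicate_columns)
    if isinstance(node, Scan):
        if required is None:
            return node
        return Scan(node.table, tuple(sorted(required)))
    raise TypeError(f"unsupported operator: {node!r}")
```

For `SELECT name FROM employees WHERE age > 18`, the scan ends up reading only `age` and `name`.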
### Predicate pushdown
Apply filters as early as possible, ideally inside the data source.
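As a sketch, the rewrite below pushes a filter beneath a join when the predicate reads columns from only one side. It assumes inner joins and globally unique (qualified) column names; the node classes are hypothetical. Outer joins need more care, and pushing into the data source itself depends on what the source supports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    table: str
    columns: frozenset


@dataclass(frozen=True)
class Join:  # assumed to be an inner join
    left: object
    right: object
    condition: str


@dataclass(frozen=True)
class Filter:
    child: object
    predicate: str
    columns: frozenset  # columns the predicate reads


def output_columns(node):
    if isinstance(node, Scan):
        return node.columns
    if isinstance(node, Filter):
        return output_columns(node.child)
    return output_columns(node.left) | output_columns(node.right)


def push_filter(node):
    """Move filters below inner joins when one side supplies every column."""
    if isinstance(node, Join):
        return Join(push_filter(node.left), push_filter(node.right), node.condition)
    if not isinstance(node, Filter):
        return node
    child = push_filter(node.child)
    if isinstance(child, Join):
        if node.columns <= output_columns(child.left):
            pushed = push_filter(Filter(child.left, node.predicate, node.columns))
            return Join(pushed, child.right, child.condition)
        if node.columns <= output_columns(child.right):
            pushed = push_filter(Filter(child.right, node.predicate, node.columns))
            return Join(child.left, pushed, child.condition)
    return Filter(child, node.predicate, node.columns)
```

A predicate that reads columns from both sides stays above the join.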
### Constant folding
Precompute expressions such as `2 + 3` or simplify boolean expressions before execution.
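A small sketch of constant folding over a hypothetical expression tree. Operations whose operands are both literals are evaluated at plan time, and `TRUE AND x` is reduced to `x`; the node classes and the supported operator set are assumptions for this example:

```python
import operator
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    value: object


@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class BinaryOp:
    op: str
    left: object
    right: object


OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    ">": operator.gt,
    "and": lambda a, b: a and b,
}


def is_true(expr):
    # Identity check, because Literal(1) == Literal(True) under Python equality.
    return isinstance(expr, Literal) and expr.value is True


def fold(expr):
    """Evaluate constant subexpressions bottom-up."""
    if not isinstance(expr, BinaryOp):
        return expr
    left, right = fold(expr.left), fold(expr.right)
    if isinstance(left, Literal) and isinstance(right, Literal):
        return Literal(OPS[expr.op](left.value, right.value))
    if expr.op == "and":
        # TRUE AND x is equivalent to x.
        if is_true(left):
            return right
        if is_true(right):
            return left
    return BinaryOp(expr.op, left, right)
```

With this sketch, `age > 2 + 3` becomes `age > 5`, and `(2 > 1) AND active` becomes `active`.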
### Expression simplification
Rewrite expressions into simpler equivalent forms.
### Join reordering
Change the order of joins to reduce intermediate result size.
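To make this concrete, here is a deliberately naive sketch that enumerates every left-deep join order and keeps the one with the smallest estimated total of intermediate rows. The estimate (product of row counts times pairwise selectivities) and the exhaustive search are simplifying assumptions; real optimizers use richer statistics and prune the search.

```python
from itertools import permutations


def intermediate_rows(order, rows, selectivity):
    """Estimate the total rows produced by all joins in a left-deep order."""
    joined = [order[0]]
    current = float(rows[order[0]])
    total = 0.0
    for table in order[1:]:
        # Apply every join predicate that links `table` to tables already joined.
        # Pairs without a predicate default to 1.0, which models a cross product.
        factor = 1.0
        for other in joined:
            factor *= selectivity.get(frozenset((other, table)), 1.0)
        current = current * rows[table] * factor
        total += current
        joined.append(table)
    return total


def best_join_order(rows, selectivity):
    """Try every permutation; only feasible for a handful of tables."""
    return min(
        permutations(rows),
        key=lambda order: intermediate_rows(order, rows, selectivity),
    )
```

The number of permutations grows factorially with the number of tables, which is why exhaustive enumeration only works for small joins.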
### Limit pushdown
Push `LIMIT` closer to the source or to earlier stages when semantics allow it.
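A sketch of one case where the semantics do allow it: a projection neither drops nor reorders rows, so a limit can move beneath it. A filter does drop rows, so the limit must stay above it. The node classes are hypothetical:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Filter:
    child: object
    predicate: str


@dataclass(frozen=True)
class Projection:
    child: object
    columns: tuple


@dataclass(frozen=True)
class Limit:
    child: object
    count: int


def push_limit(node):
    """Move Limit below Projection; never below Filter."""
    if isinstance(node, Limit):
        child = push_limit(node.child)
        if isinstance(child, Projection):
            # Safe: the projection emits exactly one output row per input row.
            inner = push_limit(Limit(child.child, node.count))
            return Projection(inner, child.columns)
        # Not safe below a filter: limiting first could discard matching rows.
        return Limit(child, node.count)
    if isinstance(node, Projection):
        return Projection(push_limit(node.child), node.columns)
    if isinstance(node, Filter):
        return Filter(push_limit(node.child), node.predicate)
    return node
```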
### Operator fusion
Combine adjacent operations to reduce overhead.
---
## Rule-based vs cost-based optimization
### Rule-based optimization
This applies fixed rewrite rules such as:
- push filters below projections
- remove unused columns
- simplify expressions
Strengths:
- simple
- predictable
- easy to implement incrementally
Weaknesses:
- cannot decide between multiple legal alternatives, because a fixed rule carries no notion of which one is cheaper
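A rule-based rewriter can be sketched as a list of small functions applied until none of them changes the plan. The two rules here (merging stacked filters and dropping a filter with no predicates) and the node classes are assumptions picked for brevity:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Filter:
    child: object
    predicates: tuple  # interpreted as a conjunction (AND)


def merge_filters(node):
    """Filter(Filter(x, p), q) becomes Filter(x, p AND q)."""
    if isinstance(node, Filter) and isinstance(node.child, Filter):
        return Filter(node.child.child, node.child.predicates + node.predicates)
    return node


def drop_empty_filter(node):
    """A filter with no predicates keeps every row, so it can be removed."""
    if isinstance(node, Filter) and not node.predicates:
        return node.child
    return node


RULES = (merge_filters, drop_empty_filter)


def rewrite(node):
    """Rewrite children first, then apply rules here until a fixed point."""
    if isinstance(node, Filter):
        node = Filter(rewrite(node.child), node.predicates)
    changed = True
    while changed:
        changed = False
        for rule in RULES:
            new_node = rule(node)
            if new_node != node:
                node, changed = new_node, True
    return node
```

Each rule is independent and easy to test, which is where the predictability of this approach comes from.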
### Cost-based optimization
This estimates the cost of alternative plans and chooses the cheapest one according to a cost model.
It often depends on:
- table sizes
- value distributions
- selectivity estimates
- available indexes
Strengths:
- can choose among many alternatives
- important for complex join planning
Weaknesses:
- depends on statistics quality
- more complex to implement
Most serious engines use both.
---
## Logical vs physical optimization
Optimization can happen at two levels.
### Logical optimization
Rewrite the plan while staying in logical-operator space.
Examples:
- pushdown rewrites
- removing dead columns
- simplifying expressions
### Physical optimization
Choose concrete execution strategies.
Examples:
- hash join vs sort-merge join
- vectorized filter vs generic filter
- index scan vs full scan
This distinction matters because some improvements are about semantics-preserving algebra, while others are about operator implementation choices.
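As a rough illustration of a physical choice, the sketch below picks between an index scan and a full scan using a toy cost model. The function name, the per-row cost constants, and the formula are all assumptions invented for this example; real engines calibrate such models against their storage layer.

```python
def choose_scan(table_rows, selectivity, has_index, seq_cost=1.0, random_cost=4.0):
    """Return (strategy, estimated_cost) for reading rows that match a predicate.

    Toy model: a full scan reads every row sequentially, while an index scan
    fetches only the matching rows but pays a higher random-access cost per row.
    """
    full_cost = table_rows * seq_cost
    if not has_index:
        return "full scan", full_cost
    index_cost = table_rows * selectivity * random_cost
    if index_cost < full_cost:
        return "index scan", index_cost
    return "full scan", full_cost
```

Under these assumed constants, a predicate that matches 1% of rows favors the index, while one that matches half the table favors the full scan. The logical plan is the same in both cases; only the implementation differs.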
---
## Why optimization is hard
Optimization is difficult because:
- the search space can explode
- estimates are imperfect
- the cheapest local rewrite is not always globally best
- different workloads care about different costs
So optimizers approximate: they search for a good plan under imperfect estimates rather than prove that a plan is optimal.
---
## Practical mental model
If planning answers "what operations are needed?", optimization answers "what is the cheapest equivalent way to arrange and implement them?"
That is the essential idea.
---
## Changelog
* **Mar 31, 2026** -- First version created.