# Query Optimization
A reference for how query engines make a plan cheaper without changing its meaning.
---
## Short answer
Query optimization is the process of rewriting a logical or physical plan into an equivalent but more efficient form.
The key word is equivalent: the result must stay the same even though the execution strategy changes.
---
## Why optimization exists
There are usually many ways to compute the same query.
For example, an engine may be able to:
- read all columns or only the needed ones
- filter before or after another operator
- join tables in different orders
- pick different join algorithms
Optimization tries to choose a cheaper plan in terms of CPU, memory, I/O, and network cost.
---
## Common optimizations
### Projection pushdown
Read only the columns that are actually needed.
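
As a minimal sketch of the idea, assuming rows are plain dictionaries: the planner works out which columns the query touches and asks the scan for only those. `required_columns` and `scan` are hypothetical helpers invented for this illustration, not part of any real engine.

```python
def required_columns(select_cols, filter_cols):
    """Columns the scan must read: projected columns plus filter inputs."""
    return sorted(set(select_cols) | set(filter_cols))

def scan(table, columns):
    """Hypothetical scan that materializes only the requested columns."""
    return [{c: row[c] for c in columns} for row in table]

table = [
    {"id": 1, "name": "a", "age": 30, "bio": "long text"},
    {"id": 2, "name": "b", "age": 17, "bio": "long text"},
]

# SELECT name FROM table WHERE age > 18 needs only `name` and `age`.
needed = required_columns(select_cols=["name"], filter_cols=["age"])
rows = scan(table, needed)
```

The wide `bio` column is never read, which is where the savings come from in columnar formats.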
### Predicate pushdown
Apply filters as early as possible, ideally inside the data source.
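
One way to picture this is as a rewrite on a tiny plan tree. The sketch below assumes a hypothetical `Scan` node that can carry a predicate for the source to evaluate; `Scan`, `Filter`, and `push_filter_into_scan` are illustrative names only.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Scan:
    table: str
    predicate: Optional[str] = None  # filter evaluated by the data source

@dataclass
class Filter:
    predicate: str
    child: object

def push_filter_into_scan(plan):
    """Filter(Scan(t)) -> Scan(t, predicate) when the scan has no predicate yet."""
    if (
        isinstance(plan, Filter)
        and isinstance(plan.child, Scan)
        and plan.child.predicate is None
    ):
        return Scan(plan.child.table, plan.predicate)
    return plan

plan = Filter("age > 18", Scan("users"))
optimized = push_filter_into_scan(plan)  # Scan(table='users', predicate='age > 18')
```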
### Constant folding
Evaluate expressions whose operands are all constants, such as `2 + 3` or `TRUE AND FALSE`, once at planning time instead of once per row.
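
A minimal sketch, assuming expressions are nested tuples of the form `(op, left, right)` with numbers as constants and strings as column names. This representation and the `fold` function are assumptions made for illustration.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def fold(expr):
    """Fold constant subexpressions bottom-up.

    expr is a number, a column name (str), or a tuple (op, left, right).
    """
    if not isinstance(expr, tuple):
        return expr
    op, left, right = expr
    left, right = fold(left), fold(right)
    if isinstance(left, (int, float)) and isinstance(right, (int, float)):
        return OPS[op](left, right)  # both sides constant: evaluate now
    return (op, left, right)

folded = fold(("*", "price", ("+", 2, 3)))  # ("*", "price", 5)
```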
### Expression simplification
Rewrite expressions into simpler equivalent forms, for example `x AND TRUE` into `x`.
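
A small sketch of boolean simplification, again assuming a hypothetical tuple representation `(op, left, right)` with Python `True`/`False` as literals and strings as column references. Only identities that also hold under SQL's three-valued logic are used.

```python
def simplify(expr):
    """Apply boolean identities bottom-up."""
    if not isinstance(expr, tuple):
        return expr
    op, left, right = expr
    left, right = simplify(left), simplify(right)
    if op == "and":
        if left is True:
            return right
        if right is True:
            return left
        if left is False or right is False:
            return False
    if op == "or":
        if left is False:
            return right
        if right is False:
            return left
        if left is True or right is True:
            return True
    return (op, left, right)

simplified = simplify(("and", ("or", "is_active", False), True))  # "is_active"
```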
### Join reordering
Change the order of joins to reduce intermediate result size.
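
To see why order matters, the sketch below scores every left-deep order of three tables by the total number of rows its joins produce. The table sizes and pairwise selectivities are made-up numbers, and the size formula (product of input sizes times selectivity) is a deliberately crude assumption.

```python
from itertools import permutations

# Hypothetical row counts.
rows = {"orders": 1_000_000, "customers": 10_000, "regions": 10}

# Hypothetical join selectivities per table pair.
selectivity = {
    frozenset(["orders", "customers"]): 1 / 10_000,
    frozenset(["customers", "regions"]): 1 / 10,
    frozenset(["orders", "regions"]): 1.0,  # no join predicate: cross product
}

def intermediate_rows(order):
    """Sum of the output sizes of every join in a left-deep join order."""
    joined = [order[0]]
    size = rows[order[0]]
    total = 0
    for t in order[1:]:
        sel = 1.0
        for j in joined:
            sel *= selectivity[frozenset([j, t])]
        size = size * rows[t] * sel
        total += size
        joined.append(t)
    return total

best = min(permutations(rows), key=intermediate_rows)
worst = max(permutations(rows), key=intermediate_rows)
```

Under these assumed numbers, the cheapest orders join `customers` with `regions` first and bring in `orders` last, while joining `orders` with `regions` first creates a large cross product.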
### Limit pushdown
Push `LIMIT` closer to the source or to earlier stages when doing so cannot change the result, for example below a projection but not below a filter or a sort.
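
A sketch using Python generators, assuming a query with no `ORDER BY`: pushing the limit below a projection returns the same rows while reading far fewer from the source. `scan` and `project` are hypothetical stand-ins for real operators, and the `counter` dictionary exists only to count reads.

```python
from itertools import islice

def scan(n, counter):
    """Hypothetical source that counts how many rows were read."""
    for i in range(n):
        counter["read"] += 1
        yield {"id": i}

def project(rows):
    return ({"id2": r["id"] * 2} for r in rows)

# LIMIT 5 applied after projecting a fully materialized input.
naive = {"read": 0}
out_naive = list(project(list(scan(1000, naive))))[:5]

# LIMIT 5 pushed below the projection: the scan stops after 5 rows.
pushed = {"read": 0}
out_pushed = list(project(islice(scan(1000, pushed), 5)))
```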
### Operator fusion
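Combine adjacent operators into one, such as a filter followed by a projection, to avoid materializing intermediate results and paying per-operator overhead.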
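
A toy illustration, assuming rows are plain integers: the unfused version builds an intermediate list between the filter and the map, while the fused version does both in one pass. Both function names are hypothetical.

```python
def unfused(rows):
    filtered = [r for r in rows if r > 10]  # intermediate list is materialized
    return [r * 2 for r in filtered]

def fused(rows):
    # Filter and map in a single pass, with no intermediate list.
    return [r * 2 for r in rows if r > 10]

data = [5, 12, 20, 7]
```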
---
## Rule-based vs cost-based optimization
### Rule-based optimization
This applies fixed rewrite rules such as:
- push filters below projections
- remove unused columns
- simplify expressions
Strengths:
- simple
- predictable
- easy to implement incrementally
Weaknesses:
- cannot compare alternatives: when several legal plans exist, fixed rules have no way to tell which is cheaper
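
A rule-based rewriter can be sketched as a list of rule functions applied until none of them changes the plan. The tuple plan encoding (`("scan", table)`, `("filter", pred, child)`, `("project", cols, child)`) and both rules are assumptions for illustration. The filter rule also assumes the predicate only uses columns the projection passes through unchanged.

```python
def push_filter_below_project(plan):
    # Filter(Project(x)) -> Project(Filter(x))
    if plan[0] == "filter" and plan[2][0] == "project":
        _, pred, (_, cols, child) = plan
        return ("project", cols, ("filter", pred, child))
    return plan

def merge_projects(plan):
    # Project(Project(x)) -> Project(x), keeping the outer column list.
    if plan[0] == "project" and plan[2][0] == "project":
        return ("project", plan[1], plan[2][2])
    return plan

RULES = [push_filter_below_project, merge_projects]

def rewrite(plan):
    """Rewrite children first, then apply rules until none fires."""
    if plan[0] == "scan":
        return plan
    plan = plan[:2] + (rewrite(plan[2]),)
    for rule in RULES:
        new = rule(plan)
        if new != plan:
            return rewrite(new)
    return plan

plan = ("filter", "age > 18", ("project", ("name", "age"), ("scan", "users")))
optimized = rewrite(plan)
```

New rules can be added to `RULES` one at a time, which is what makes this style easy to grow incrementally.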
### Cost-based optimization
This estimates the cost of alternative plans and chooses the cheapest one according to a cost model.
It often depends on:
- table sizes
- value distributions
- selectivity estimates
- available indexes
Strengths:
- can choose among many alternatives
- important for complex join planning
Weaknesses:
- depends on statistics quality
- more complex to implement
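
The sketch below shows how statistics can feed a choice between plans: a histogram gives a selectivity estimate, and a toy cost model picks an access path. The histogram layout, the uniform-within-bucket assumption, and the `seq_cost` and `random_cost` constants are all hypothetical.

```python
def estimate_selectivity(histogram, lo, hi):
    """Estimated fraction of rows with lo <= value < hi.

    histogram is a list of (bucket_lo, bucket_hi, row_count) tuples.
    Assumes values are spread uniformly inside each bucket.
    """
    total = sum(count for _, _, count in histogram)
    matched = 0.0
    for b_lo, b_hi, count in histogram:
        overlap = max(0, min(hi, b_hi) - max(lo, b_lo))
        matched += count * overlap / (b_hi - b_lo)
    return matched / total

def choose_access_path(table_rows, selectivity, has_index,
                       seq_cost=1.0, random_cost=4.0):
    """Pick the cheaper of a full scan and an index scan under a toy model."""
    candidates = {"full_scan": table_rows * seq_cost}
    if has_index:
        candidates["index_scan"] = table_rows * selectivity * random_cost
    return min(candidates, key=candidates.get)

histogram = [(0, 10, 500), (10, 20, 300), (20, 30, 200)]
narrow = estimate_selectivity(histogram, 25, 30)  # about 0.1
wide = estimate_selectivity(histogram, 0, 20)     # about 0.8
```

With these numbers the narrow predicate favors the index and the wide one favors the full scan. If the histogram is stale, the same code confidently picks the wrong path, which is the statistics-quality weakness in practice.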
Most production engines use both: rules for rewrites that always help, and cost estimates to choose among alternatives.
---
## Logical vs physical optimization
Optimization can happen at two levels.
### Logical optimization
Rewrite the plan while staying in logical-operator space.
Examples:
- pushdown rewrites
- removing dead columns
- simplifying expressions
### Physical optimization
Choose concrete execution strategies.
Examples:
- hash join vs sort-merge join
- vectorized filter vs generic filter
- index scan vs full scan
This distinction matters because some improvements are about semantics-preserving algebra, while others are about operator implementation choices.
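
To make the physical side concrete, here is a sketch of two implementations of the same logical inner equi-join. Both are simplified in-memory versions with hypothetical names. They return the same rows and differ only in how they find matches.

```python
def hash_join(left, right, key):
    """Build a hash table on the right input, then probe it with the left."""
    table = {}
    for r in right:
        table.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in table.get(l[key], [])]

def sort_merge_join(left, right, key):
    """Sort both inputs on the key, then merge them with two cursors."""
    left = sorted(left, key=lambda r: r[key])
    right = sorted(right, key=lambda r: r[key])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Pair this left row with the whole run of equal keys on the right.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                out.append({**left[i], **right[j_end]})
                j_end += 1
            i += 1
    return out

users = [{"uid": 1, "name": "a"}, {"uid": 2, "name": "b"}]
orders = [{"uid": 1, "oid": 10}, {"uid": 1, "oid": 11}, {"uid": 3, "oid": 12}]
via_hash = hash_join(users, orders, "uid")
via_merge = sort_merge_join(users, orders, "uid")
```

A physical optimizer would choose between them based on factors such as input sizes, available memory, and whether the inputs are already sorted.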
---
## Why optimization is hard
Optimization is difficult because:
- the search space can explode
- estimates are imperfect
- the cheapest local rewrite is not always globally best
- different workloads care about different costs
So optimizers rely on heuristics and estimates: they aim for a good plan, not a provably optimal one.
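
The search-space point can be made concrete by counting join orders. For `n` tables there are `n!` left-deep orders, and `(2(n-1))! / (n-1)!` ordered bushy trees if cross products are allowed. The function names below are illustrative.

```python
from math import factorial

def left_deep_orders(n):
    """Number of left-deep join orders for n tables."""
    return factorial(n)

def bushy_trees(n):
    """Number of ordered bushy join trees for n tables (cross products allowed)."""
    return factorial(2 * (n - 1)) // factorial(n - 1)

for n in (3, 5, 10):
    print(n, left_deep_orders(n), bushy_trees(n))
```

At ten tables the bushy count is already in the tens of billions, which is why exhaustive enumeration is usually replaced by dynamic programming with pruning or by heuristics.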
---
## Practical mental model
If planning answers "what operations are needed?", optimization answers "what is the cheapest equivalent way to arrange and implement them?"
That is the essential idea.
---
## Changelog
* **Mar 31, 2026** -- First version created.