# Query Optimization

A reference for how query engines make a plan cheaper without changing its meaning.

---

## Short answer

Query optimization is the process of rewriting a logical or physical plan into an equivalent but more efficient form.

The key word is equivalent: the result must stay the same even though the execution strategy changes.

---

## Why optimization exists

There are usually many ways to compute the same query.

For example, an engine may be able to:

- read all columns or only the needed ones
- filter before or after another operator
- join tables in different orders
- pick different join algorithms

Optimization tries to choose a cheaper plan in terms of CPU, memory, I/O, and network cost.

---

## Common optimizations

### Projection pushdown

Read only the columns that are actually needed.
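
A minimal sketch of the idea, assuming a hypothetical in-memory table (`ROWS`) and a made-up `scan` helper that accepts an optional column list. Real engines would push the column list into the file reader or storage layer; the cell counter here only stands in for I/O.

```python
# Hypothetical in-memory table; column names are illustrative only.
ROWS = [
    {"id": 1, "name": "ada", "city": "London", "age": 36},
    {"id": 2, "name": "alan", "city": "Wilmslow", "age": 41},
]


def scan(rows, columns=None):
    """Return rows plus the number of cells read; `columns` is the pushed-down projection."""
    out, cells_read = [], 0
    for row in rows:
        wanted = row.keys() if columns is None else columns
        out.append({c: row[c] for c in wanted})
        cells_read += len(out[-1])
    return out, cells_read


def project(rows, columns):
    """Standalone projection applied after the scan."""
    return [{c: row[c] for c in columns} for row in rows]


# Without pushdown: scan every column, then drop the unused ones.
all_rows, cells_naive = scan(ROWS)
naive = project(all_rows, ["name"])

# With pushdown: the scan only touches the requested column.
pushed, cells_pushed = scan(ROWS, columns=["name"])

print(naive == pushed, cells_naive, cells_pushed)  # True 8 2
```

Both plans return the same rows; only the amount of data touched differs, which is the whole point of the rewrite.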

### Predicate pushdown

Apply filters as early as possible, ideally inside the data source.
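
As an illustration, the sketch below assumes a hypothetical `Source` class whose `scan` method can take a predicate. The class, its `rows_emitted` counter, and the `ORDERS` data are invented for this example and do not model any particular engine.

```python
class Source:
    """Hypothetical data source that can optionally filter while scanning."""

    def __init__(self, rows):
        self.rows = rows
        self.rows_emitted = 0  # rows handed to the rest of the plan

    def scan(self, predicate=None):
        for row in self.rows:
            if predicate is None or predicate(row):
                self.rows_emitted += 1
                yield row


def is_large(row):
    return row["amount"] > 900


ORDERS = [{"id": i, "amount": i * 10} for i in range(1, 101)]

# Filter applied after the scan: every row leaves the source.
late = Source(ORDERS)
late_result = [row for row in late.scan() if is_large(row)]

# Filter pushed into the scan: only matching rows leave the source.
early = Source(ORDERS)
early_result = list(early.scan(predicate=is_large))

print(len(early_result), late.rows_emitted, early.rows_emitted)  # 10 100 10
```

The result is identical in both cases, but the pushed-down version moves ten rows out of the source instead of one hundred.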

### Constant folding

Evaluate constant expressions such as `2 + 3` at planning time, and collapse boolean expressions whose operands are all constants, so they are not recomputed for every row.
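
One way to picture this, assuming expressions are stored as nested tuples such as `("add", left, right)`. The tuple encoding and the `fold` function are illustrative choices, not a standard representation.

```python
import operator

# Supported binary operators in this toy expression language.
OPS = {"add": operator.add, "mul": operator.mul, "and": operator.and_}


def fold(expr):
    """Replace any operator whose operands are both literals with a single literal."""
    kind = expr[0]
    if kind in ("lit", "col"):
        return expr
    left, right = fold(expr[1]), fold(expr[2])
    if left[0] == "lit" and right[0] == "lit":
        return ("lit", OPS[kind](left[1], right[1]))
    return (kind, left, right)


# price * (2 + 3): the constant subtree is folded once, at planning time.
expr = ("mul", ("col", "price"), ("add", ("lit", 2), ("lit", 3)))
folded = fold(expr)
print(folded)  # ('mul', ('col', 'price'), ('lit', 5))
```

The column reference keeps the multiplication from being folded, so only the constant subtree changes.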

### Expression simplification

Rewrite expressions into simpler equivalent forms, such as `NOT (NOT p)` into `p`.
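
A small sketch of a few boolean identities, again assuming a made-up nested-tuple encoding. Only rewrites that also hold under SQL-style three-valued logic are included: `p AND TRUE`, `p OR FALSE`, and double negation.

```python
# Identity element per operator: p AND TRUE = p, p OR FALSE = p.
IDENTITY = {"and": True, "or": False}


def is_lit(expr, value):
    """True when `expr` is exactly the boolean literal `value`."""
    return expr[0] == "lit" and expr[1] is value


def simplify(expr):
    """Apply the identities bottom-up so inner rewrites expose outer ones."""
    kind = expr[0]
    if kind in ("lit", "col"):
        return expr
    args = [simplify(arg) for arg in expr[1:]]
    if kind == "not" and args[0][0] == "not":
        return args[0][1]  # NOT (NOT p) -> p
    if kind in IDENTITY:
        left, right = args
        if is_lit(right, IDENTITY[kind]):
            return left
        if is_lit(left, IDENTITY[kind]):
            return right
    return (kind, *args)


# NOT (NOT active) AND TRUE
expr = ("and", ("not", ("not", ("col", "active"))), ("lit", True))
print(simplify(expr))  # ('col', 'active')
```

Unlike constant folding, none of these rewrites needs the operands to be constants; they rely on algebraic identities that hold for any input.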

### Join reordering

Change the order of joins to reduce intermediate result size.
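
To make the effect concrete, the sketch below compares left-deep join orders for three tables. The table names, cardinalities (`CARD`), and selectivities (`SEL`) are invented, and the size estimate (product of cardinalities times selectivities) is a deliberately naive assumption.

```python
from itertools import permutations

# Hypothetical row counts per table.
CARD = {"orders": 1_000_000, "customers": 10_000, "regions": 10}

# Hypothetical selectivity for each pair that has a join predicate.
SEL = {
    frozenset(("orders", "customers")): 1 / 10_000,
    frozenset(("customers", "regions")): 1 / 10,
}


def intermediate_rows(order):
    """Estimated rows produced by all join steps of a left-deep order."""
    joined = [order[0]]
    rows = CARD[order[0]]
    total = 0
    for table in order[1:]:
        selectivity = 1.0
        for prev in joined:
            # A pair with no predicate behaves like a cross join (selectivity 1).
            selectivity *= SEL.get(frozenset((prev, table)), 1.0)
        rows = rows * CARD[table] * selectivity
        joined.append(table)
        total += rows
    return total


best = min(permutations(CARD), key=intermediate_rows)
worst = max(permutations(CARD), key=intermediate_rows)
print(best, worst)
```

Under these made-up numbers, joining the two small tables first keeps the intermediate result small, while any order that joins `orders` with `regions` first pays for a cross join.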

### Limit pushdown

Push `LIMIT` closer to the source or to earlier stages when semantics allow it, for example below a projection but not below a filter.
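
The "when semantics allow it" condition can be seen in a toy pipeline. The `fetch` function below is a hypothetical source call that returns a materialized list, and the table has a fixed order so that results are comparable.

```python
# Hypothetical table with a deterministic row order.
TABLE = [{"id": i, "even": i % 2 == 0} for i in range(1000)]


def fetch(limit=None):
    """Hypothetical source call; `limit` is the pushed-down LIMIT."""
    return TABLE[:limit]  # TABLE[:None] returns every row


def project_ids(rows):
    return [row["id"] for row in rows]


def keep_even(rows):
    return [row for row in rows if row["even"]]


# Safe: a projection emits one row per input row, so LIMIT commutes with it.
late = project_ids(fetch())[:5]
early = project_ids(fetch(limit=5))

# Not safe: a filter changes which rows qualify, so LIMIT cannot move below it.
correct = project_ids(keep_even(fetch()))[:5]
wrong = project_ids(keep_even(fetch(limit=5)))

print(late == early, correct, wrong)  # True [0, 2, 4, 6, 8] [0, 2, 4]
```

The second pair shows why an optimizer has to check what sits between the `LIMIT` and the source before moving it.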

### Operator fusion

Combine adjacent operations into a single pass to reduce per-operator overhead and avoid materializing intermediate results.
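
A rough sketch of fusing a map and a filter, assuming operators that each materialize a list. The helper names are illustrative; a real engine might instead generate one compiled loop for the fused operators.

```python
def map_op(rows, fn):
    """Standalone map operator: materializes a full intermediate list."""
    return [fn(row) for row in rows]


def filter_op(rows, pred):
    """Standalone filter operator: makes a second pass over that list."""
    return [row for row in rows if pred(row)]


def fused_map_filter(rows, fn, pred):
    """Fused operator: one pass, no intermediate list."""
    out = []
    for row in rows:
        value = fn(row)
        if pred(value):
            out.append(value)
    return out


def double(price):
    return price * 2


def is_big(value):
    return value > 20


PRICES = [5, 12, 20, 7]
unfused = filter_op(map_op(PRICES, double), is_big)
fused = fused_map_filter(PRICES, double, is_big)
print(unfused == fused, fused)  # True [24, 40]
```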

---

## Rule-based vs cost-based optimization

### Rule-based optimization

This applies fixed rewrite rules such as:

- push filters below projections
- remove unused columns
- simplify expressions

Strengths:

- simple
- predictable
- easy to implement incrementally

Weaknesses:

- cannot compare alternatives when more than one legal rewrite exists
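
A minimal sketch of a rule-based rewriter, assuming plans are nested tuples (`("filter", predicate, child)`, `("project", columns, child)`, `("scan", table)`), predicates are opaque strings, and projections only select existing columns. The two rules and the fixed-point loop are illustrative, not a description of any specific engine.

```python
def push_filter_below_project(plan):
    """filter(project(child)) -> project(filter(child))."""
    if plan[0] == "filter" and plan[2][0] == "project":
        _, pred, (_, cols, child) = plan
        return ("project", cols, ("filter", pred, child))
    return plan


def merge_projects(plan):
    """project(project(child)) -> project(child), keeping the outer column list."""
    if plan[0] == "project" and plan[2][0] == "project":
        return ("project", plan[1], plan[2][2])
    return plan


RULES = [push_filter_below_project, merge_projects]


def rewrite(plan):
    """Rewrite children first, then apply rules until none of them fires."""
    if plan[0] != "scan":
        plan = (plan[0], plan[1], rewrite(plan[2]))
    for rule in RULES:
        new_plan = rule(plan)
        if new_plan != plan:
            return rewrite(new_plan)
    return plan


plan = (
    "filter", "age > 30",
    ("project", ("name", "age"),
     ("project", ("name", "age", "city"),
      ("scan", "people"))),
)
optimized = rewrite(plan)
print(optimized)
# ('project', ('name', 'age'), ('filter', 'age > 30', ('scan', 'people')))
```

Each rule fires whenever its pattern matches, without asking whether the result is cheaper. That is what makes this style predictable, and also why it cannot choose between two rewrites that are both legal.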

### Cost-based optimization

This estimates the cost of alternative plans and chooses the cheapest one according to a cost model.

It often depends on:

- table sizes
- value distributions
- selectivity estimates
- available indexes

Strengths:

- can choose among many alternatives
- important for complex join planning

Weaknesses:

- depends on statistics quality
- more complex to implement
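
As a sketch of the idea, the code below picks between a full scan and an index scan. The cost formulas and their constants are invented for illustration; real cost models are calibrated per engine and per storage layer.

```python
def full_scan_cost(table_rows):
    """Assumed cost: one unit per row read sequentially."""
    return table_rows * 1.0


def index_scan_cost(table_rows, selectivity):
    """Assumed cost: fixed lookup overhead plus pricier random reads per match."""
    return 50 + table_rows * selectivity * 4.0


def choose_access_path(table_rows, selectivity, has_index):
    """Return the name of the cheapest candidate under the assumed model."""
    candidates = {"full_scan": full_scan_cost(table_rows)}
    if has_index:
        candidates["index_scan"] = index_scan_cost(table_rows, selectivity)
    return min(candidates, key=candidates.get)


# A selective predicate favors the index; an unselective one does not.
print(choose_access_path(1_000_000, 0.001, has_index=True))  # index_scan
print(choose_access_path(1_000_000, 0.5, has_index=True))    # full_scan

# If stale statistics report 0.001 while the true selectivity is 0.5,
# the optimizer picks the index scan even though it is the costlier plan.
estimated_choice = choose_access_path(1_000_000, 0.001, has_index=True)
true_cost = index_scan_cost(1_000_000, 0.5)
print(estimated_choice, true_cost > full_scan_cost(1_000_000))  # index_scan True
```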

Most serious engines use both.

---

## Logical vs physical optimization

Optimization can happen at two levels.

### Logical optimization

Rewrite the plan while staying in logical-operator space.

Examples:

- pushdown rewrites
- removing dead columns
- simplifying expressions

### Physical optimization

Choose concrete execution strategies.

Examples:

- hash join vs sort-merge join
- vectorized filter vs generic filter
- index scan vs full scan
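
To illustrate a physical choice, here is one logical operation (an inner equi-join) with two interchangeable implementations. The function names and sample data are made up; the point is only that both strategies return the same bag of rows.

```python
def hash_join(left, right, key):
    """Build a hash table on `right`, then probe it with each row of `left`."""
    table = {}
    for rrow in right:
        table.setdefault(rrow[key], []).append(rrow)
    return [{**lrow, **rrow} for lrow in left for rrow in table.get(lrow[key], [])]


def sort_merge_join(left, right, key):
    """Sort both inputs on `key`, then merge runs of equal keys."""
    left = sorted(left, key=lambda row: row[key])
    right = sorted(right, key=lambda row: row[key])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lkey, rkey = left[i][key], right[j][key]
        if lkey < rkey:
            i += 1
        elif lkey > rkey:
            j += 1
        else:
            # Find the run of right rows sharing this key, then pair it
            # with every left row that has the same key.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lkey:
                j_end += 1
            while i < len(left) and left[i][key] == lkey:
                out.extend({**left[i], **rrow} for rrow in right[j:j_end])
                i += 1
            j = j_end
    return out


def canonical(rows):
    """Order-insensitive form, since a join result is a bag of rows."""
    return sorted(tuple(sorted(row.items())) for row in rows)


USERS = [{"uid": 2, "name": "bo"}, {"uid": 1, "name": "al"}]
ORDERS = [
    {"uid": 1, "item": "pen"},
    {"uid": 2, "item": "ink"},
    {"uid": 1, "item": "pad"},
    {"uid": 3, "item": "cap"},
]

same = canonical(hash_join(USERS, ORDERS, "uid")) == canonical(
    sort_merge_join(USERS, ORDERS, "uid")
)
print(same)  # True
```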

This distinction matters because some improvements are about semantics-preserving algebra, while others are about operator implementation choices.

---

## Why optimization is hard

Optimization is difficult because:

- the search space can explode
- estimates are imperfect
- the cheapest local rewrite is not always globally best
- different workloads care about different costs
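
The first point can be quantified for join ordering alone. The sketch below counts join orders using two standard combinatorial results: `n!` left-deep orders, and `(2(n-1))! / (n-1)!` ordered bushy trees over `n` tables. It ignores physical operator choices, so the real space is larger still.

```python
from math import factorial


def left_deep_orders(n):
    """Number of left-deep join orders for n tables."""
    return factorial(n)


def bushy_plans(n):
    """Number of ordered binary join trees with n labeled leaves."""
    return factorial(2 * (n - 1)) // factorial(n - 1)


for n in (3, 5, 10):
    print(n, left_deep_orders(n), bushy_plans(n))
# 3 6 12
# 5 120 1680
# 10 3628800 17643225600
```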

So optimizers approximate: they look for a good plan rather than prove that one is optimal.

---

## Practical mental model

If planning answers "what operations are needed?", optimization answers "what is the cheapest equivalent way to arrange and implement them?"

That is the essential idea.

---

## Changelog

* **Mar 31, 2026** -- First version created.