285 lines
7.1 KiB
Markdown
285 lines
7.1 KiB
Markdown
|
|
# Cost Models, Statistics, and Cardinality Estimation
|
||
|
|
|
||
|
|
A reference for how query engines estimate plan cost and why those estimates so often go wrong.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Short answer
|
||
|
|
|
||
|
|
A cost-based optimizer needs estimates for:
|
||
|
|
|
||
|
|
- how many rows each operator will produce
|
||
|
|
- how expensive each operator will be
|
||
|
|
- how much memory, I/O, CPU, and network work a plan will require
|
||
|
|
|
||
|
|
The most important estimate is usually cardinality:
|
||
|
|
|
||
|
|
- how many rows flow out of each stage
|
||
|
|
|
||
|
|
If those estimates are bad, the optimizer may choose:
|
||
|
|
|
||
|
|
- the wrong join order
|
||
|
|
- the wrong join algorithm
|
||
|
|
- the wrong access path
|
||
|
|
- the wrong distributed execution shape
|
||
|
|
|
||
|
|
So statistics and cardinality estimation are central to serious optimization.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Why this matters
|
||
|
|
|
||
|
|
Rule-based optimization can improve many queries with local rewrites such as:
|
||
|
|
|
||
|
|
- projection pushdown
|
||
|
|
- predicate pushdown
|
||
|
|
- constant folding
|
||
|
|
|
||
|
|
But once the engine must choose among many legal alternatives, it needs a basis for comparison.
|
||
|
|
|
||
|
|
That basis is usually a cost model informed by statistics.
|
||
|
|
|
||
|
|
Without that, the engine has little principled way to decide:
|
||
|
|
|
||
|
|
- which join should happen first
|
||
|
|
- whether an index scan beats a full scan
|
||
|
|
- whether repartitioning is worth it
|
||
|
|
- whether a partial aggregation is worthwhile
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Cardinality first
|
||
|
|
|
||
|
|
Cardinality estimation is usually the most important subproblem because many other costs depend on row counts.
|
||
|
|
|
||
|
|
If the engine can estimate:
|
||
|
|
|
||
|
|
- input row counts
|
||
|
|
- filter selectivity
|
||
|
|
- grouping cardinality
|
||
|
|
- join output size
|
||
|
|
|
||
|
|
then it can derive rough downstream costs.
|
||
|
|
|
||
|
|
If it gets those row counts badly wrong, even a sophisticated cost model can make poor decisions.
|
||
|
|
|
||
|
|
That is why cardinality estimation is often treated as the heart of the optimizer.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## What a cost model tries to estimate
|
||
|
|
|
||
|
|
A cost model usually approximates some combination of:
|
||
|
|
|
||
|
|
- CPU work
|
||
|
|
- memory use
|
||
|
|
- local I/O
|
||
|
|
- network movement
|
||
|
|
- latency
|
||
|
|
|
||
|
|
Different engines weight these differently depending on workload.
|
||
|
|
|
||
|
|
For example:
|
||
|
|
|
||
|
|
- a single-node analytical engine may care most about CPU and memory bandwidth
|
||
|
|
- a distributed engine may care heavily about shuffle size
|
||
|
|
- an interactive engine may favor latency over total throughput
|
||
|
|
|
||
|
|
So a cost model is always tied to system goals, not just pure algebra.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Common sources of statistics
|
||
|
|
|
||
|
|
Engines often collect or maintain:
|
||
|
|
|
||
|
|
- total row count
|
||
|
|
- per-column distinct counts
|
||
|
|
- min/max values
|
||
|
|
- null counts
|
||
|
|
- histograms
|
||
|
|
- file or partition sizes
|
||
|
|
- index metadata
|
||
|
|
|
||
|
|
These statistics can exist at several levels:
|
||
|
|
|
||
|
|
- table-level
|
||
|
|
- partition-level
|
||
|
|
- file-level
|
||
|
|
- row-group or block-level
|
||
|
|
|
||
|
|
More granular statistics can improve planning, but they also cost more to collect, store, and keep fresh.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Selectivity estimation
|
||
|
|
|
||
|
|
Selectivity is the fraction of rows expected to survive a predicate.
|
||
|
|
|
||
|
|
Examples:
|
||
|
|
|
||
|
|
- `age > 18`
|
||
|
|
- `country = 'US'`
|
||
|
|
- `status IN (...)`
|
||
|
|
|
||
|
|
The optimizer uses selectivity estimates to predict:
|
||
|
|
|
||
|
|
- filter output size
|
||
|
|
- downstream join input size
|
||
|
|
- aggregate input size
|
||
|
|
- whether an index is worthwhile
|
||
|
|
|
||
|
|
This is deceptively hard because selectivity depends on data distribution, not just syntax.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Join-order planning
|
||
|
|
|
||
|
|
Join planning is where bad estimates become especially expensive.
|
||
|
|
|
||
|
|
If a query joins several inputs, many legal join orders may exist. Their costs can differ dramatically because intermediate result sizes can explode or collapse depending on the order.
|
||
|
|
|
||
|
|
The optimizer therefore needs estimates for:
|
||
|
|
|
||
|
|
- base table sizes
|
||
|
|
- filter selectivities
|
||
|
|
- join key distinct counts
|
||
|
|
- skew and duplication patterns
|
||
|
|
|
||
|
|
If those estimates are poor, the engine may build large hash tables too early, shuffle too much data, or materialize huge intermediates unnecessarily.
|
||
|
|
|
||
|
|
This is one reason join optimization is considered one of the hardest parts of query planning.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Histograms and distributions
|
||
|
|
|
||
|
|
Simple row counts are rarely enough.
|
||
|
|
|
||
|
|
Two columns can have the same row count but very different value distributions.
|
||
|
|
|
||
|
|
Histograms help by approximating:
|
||
|
|
|
||
|
|
- frequency of values
|
||
|
|
- value ranges
|
||
|
|
- skew
|
||
|
|
|
||
|
|
They improve estimates for:
|
||
|
|
|
||
|
|
- range predicates
|
||
|
|
- uneven value distributions
|
||
|
|
- non-uniform grouping sizes
|
||
|
|
|
||
|
|
But histograms are still approximations, and they can become stale.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Correlation and independence assumptions
|
||
|
|
|
||
|
|
Many optimizers make simplifying assumptions such as:
|
||
|
|
|
||
|
|
- predicates are independent
|
||
|
|
- column distributions are uniform
|
||
|
|
- joins follow regular key patterns
|
||
|
|
|
||
|
|
These assumptions are convenient but often wrong.
|
||
|
|
|
||
|
|
For example, predicates on `city` and `country` are usually correlated. Treating them as independent can badly underestimate or overestimate result size.
|
||
|
|
|
||
|
|
This is one of the biggest reasons real-world cardinality estimation is difficult.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Error propagation
|
||
|
|
|
||
|
|
Estimation errors compound through a plan.
|
||
|
|
|
||
|
|
If an early filter estimate is off, then:
|
||
|
|
|
||
|
|
- join inputs are misestimated
|
||
|
|
- join outputs are misestimated
|
||
|
|
- aggregate sizes are misestimated
|
||
|
|
- memory and network costs are misestimated
|
||
|
|
|
||
|
|
By the time the optimizer compares full plans, small early mistakes may have become large planning errors.
|
||
|
|
|
||
|
|
This is why even mature query engines still struggle with difficult workloads.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Distributed cost models
|
||
|
|
|
||
|
|
Distributed engines have additional cost dimensions:
|
||
|
|
|
||
|
|
- shuffle volume
|
||
|
|
- partition balance
|
||
|
|
- scheduling overhead
|
||
|
|
- serialization cost
|
||
|
|
- network latency
|
||
|
|
|
||
|
|
In that world, a plan that is locally cheap may still be globally expensive if it causes large repartitioning or skewed tasks.
|
||
|
|
|
||
|
|
So distributed cost models are not just single-node cost models with larger numbers. They need explicit reasoning about data movement.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Statistics freshness
|
||
|
|
|
||
|
|
Statistics are only useful if they are reasonably current.
|
||
|
|
|
||
|
|
Problems appear when:
|
||
|
|
|
||
|
|
- tables grow significantly
|
||
|
|
- distributions drift
|
||
|
|
- partitions change
|
||
|
|
- indexes are added or removed
|
||
|
|
|
||
|
|
This creates a maintenance tradeoff:
|
||
|
|
|
||
|
|
- fresh statistics improve planning
|
||
|
|
- but collecting them can be expensive
|
||
|
|
|
||
|
|
Many systems therefore settle for approximate or periodic stats rather than perfect real-time accuracy.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Rule-based vs cost-based planning
|
||
|
|
|
||
|
|
This note should be read as a deepening of cost-based optimization, not a replacement for rule-based optimization.
|
||
|
|
|
||
|
|
Rule-based planning is still useful for:
|
||
|
|
|
||
|
|
- safe local rewrites
|
||
|
|
- simplifying plans
|
||
|
|
- obvious pushdowns
|
||
|
|
|
||
|
|
Cost-based planning becomes necessary when the engine must choose among many globally different alternatives.
|
||
|
|
|
||
|
|
The two approaches are complementary.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Main takeaways
|
||
|
|
|
||
|
|
- Cardinality estimation is often the most important input to a cost-based optimizer.
|
||
|
|
- Cost models compare legal plans using rough estimates of CPU, memory, I/O, and network work.
|
||
|
|
- Statistics such as row counts, distinct counts, and histograms make those estimates less blind.
|
||
|
|
- Join planning is especially sensitive to bad estimates.
|
||
|
|
- Correlation, skew, and stale statistics are major reasons optimizers make poor choices.
|
||
|
|
- Distributed engines must model data movement explicitly, not just local operator cost.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Related notes
|
||
|
|
|
||
|
|
- `hqew/005-query-optimization.md`
|
||
|
|
- `hqew/008-joins-and-aggregations.md`
|
||
|
|
- `hqew/009-distributed-query-engines.md`
|
||
|
|
- `hqew/014-data-sources-and-file-formats.md`
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Changelog
|
||
|
|
|
||
|
|
* **Apr 8, 2026** -- Added a dedicated note on cost models, optimizer statistics, and cardinality estimation.
|