useful-notes/hqew/018-cost-models-statistics-and-cardinality-estimation.md

7.1 KiB

Cost Models, Statistics, and Cardinality Estimation

A reference for how query engines estimate plan cost and why those estimates so often go wrong.


Short answer

A cost-based optimizer needs estimates for:

  • how many rows each operator will produce
  • how expensive each operator will be
  • how much memory, I/O, CPU, and network work a plan will require

The most important estimate is usually cardinality:

  • how many rows flow out of each stage

If those estimates are bad, the optimizer may choose:

  • the wrong join order
  • the wrong join algorithm
  • the wrong access path
  • the wrong distributed execution shape

So statistics and cardinality estimation are central to serious optimization.


Why this matters

Rule-based optimization can improve many queries with local rewrites such as:

  • projection pushdown
  • predicate pushdown
  • constant folding

But once the engine must choose among many legal alternatives, it needs a basis for comparison.

That basis is usually a cost model informed by statistics.

Without that, the engine has little principled way to decide:

  • which join should happen first
  • whether an index scan beats a full scan
  • whether repartitioning is worth it
  • whether a partial aggregation is worthwhile

Cardinality first

Cardinality estimation is usually the most important subproblem because many other costs depend on row counts.

If the engine can estimate:

  • input row counts
  • filter selectivity
  • grouping cardinality
  • join output size

then it can derive rough downstream costs.

If it gets those row counts badly wrong, even a sophisticated cost model can make poor decisions.

That is why cardinality estimation is often treated as the heart of the optimizer.


What a cost model tries to estimate

A cost model usually approximates some combination of:

  • CPU work
  • memory use
  • local I/O
  • network movement
  • latency

Different engines weight these differently depending on workload.

For example:

  • a single-node analytical engine may care most about CPU and memory bandwidth
  • a distributed engine may care heavily about shuffle size
  • an interactive engine may favor latency over total throughput

So a cost model is always tied to system goals, not just pure algebra.


Common sources of statistics

Engines often collect or maintain:

  • total row count
  • per-column distinct counts
  • min/max values
  • null counts
  • histograms
  • file or partition sizes
  • index metadata

These statistics can exist at several levels:

  • table-level
  • partition-level
  • file-level
  • row-group or block-level

More granular statistics can improve planning, but they also cost more to collect, store, and keep fresh.


Selectivity estimation

Selectivity is the fraction of rows expected to survive a predicate.

Examples:

  • age > 18
  • country = 'US'
  • status IN (...)

The optimizer uses selectivity estimates to predict:

  • filter output size
  • downstream join input size
  • aggregate input size
  • whether an index is worthwhile

This is deceptively hard because selectivity depends on data distribution, not just syntax.


Join-order planning

Join planning is where bad estimates become especially expensive.

If a query joins several inputs, many legal join orders may exist. Their costs can differ dramatically because intermediate result sizes can explode or collapse depending on the order.

The optimizer therefore needs estimates for:

  • base table sizes
  • filter selectivities
  • join key distinct counts
  • skew and duplication patterns

If those estimates are poor, the engine may build large hash tables too early, shuffle too much data, or materialize huge intermediates unnecessarily.

This is one reason join optimization is considered one of the hardest parts of query planning.


Histograms and distributions

Simple row counts are rarely enough.

Two columns can have the same row count but very different value distributions.

Histograms help by approximating:

  • frequency of values
  • value ranges
  • skew

They improve estimates for:

  • range predicates
  • uneven value distributions
  • non-uniform grouping sizes

But histograms are still approximations, and they can become stale.


Correlation and independence assumptions

Many optimizers make simplifying assumptions such as:

  • predicates are independent
  • column distributions are uniform
  • joins follow regular key patterns

These assumptions are convenient but often wrong.

For example, predicates on city and country are usually correlated. Treating them as independent can badly underestimate or overestimate result size.

This is one of the biggest reasons real-world cardinality estimation is difficult.


Error propagation

Estimation errors compound through a plan.

If an early filter estimate is off, then:

  • join inputs are misestimated
  • join outputs are misestimated
  • aggregate sizes are misestimated
  • memory and network costs are misestimated

By the time the optimizer compares full plans, small early mistakes may have become large planning errors.

This is why even mature query engines still struggle with difficult workloads.


Distributed cost models

Distributed engines have additional cost dimensions:

  • shuffle volume
  • partition balance
  • scheduling overhead
  • serialization cost
  • network latency

In that world, a plan that is locally cheap may still be globally expensive if it causes large repartitioning or skewed tasks.

So distributed cost models are not just single-node cost models with larger numbers. They need explicit reasoning about data movement.


Statistics freshness

Statistics are only useful if they are reasonably current.

Problems appear when:

  • tables grow significantly
  • distributions drift
  • partitions change
  • indexes are added or removed

This creates a maintenance tradeoff:

  • fresh statistics improve planning
  • but collecting them can be expensive

Many systems therefore settle for approximate or periodic stats rather than perfect real-time accuracy.


Rule-based vs cost-based planning

This note should be read as a deepening of cost-based optimization, not a replacement for rule-based optimization.

Rule-based planning is still useful for:

  • safe local rewrites
  • simplifying plans
  • obvious pushdowns

Cost-based planning becomes necessary when the engine must choose among many globally different alternatives.

The two approaches are complementary.


Main takeaways

  • Cardinality estimation is often the most important input to a cost-based optimizer.
  • Cost models compare legal plans using rough estimates of CPU, memory, I/O, and network work.
  • Statistics such as row counts, distinct counts, and histograms make those estimates less blind.
  • Join planning is especially sensitive to bad estimates.
  • Correlation, skew, and stale statistics are major reasons optimizers make poor choices.
  • Distributed engines must model data movement explicitly, not just local operator cost.

  • hqew/005-query-optimization.md
  • hqew/008-joins-and-aggregations.md
  • hqew/009-distributed-query-engines.md
  • hqew/014-data-sources-and-file-formats.md

Changelog

  • Apr 8, 2026 -- Added a dedicated note on cost models, optimizer statistics, and cardinality estimation.