habedi-work/useful-notes

Fork 0

Hassan Abedi 405a609eb8 Add note files for data sources, cost models, and cardinality estimation

2026-04-08 15:09:35 +02:00

7.1 KiB

Raw Blame History

Cost Models, Statistics, and Cardinality Estimation

A reference for how query engines estimate plan cost and why those estimates so often go wrong.

Short answer

A cost-based optimizer needs estimates for:

how many rows each operator will produce
how expensive each operator will be
how much memory, I/O, CPU, and network work a plan will require

The most important estimate is usually cardinality:

how many rows flow out of each stage

If those estimates are bad, the optimizer may choose:

the wrong join order
the wrong join algorithm
the wrong access path
the wrong distributed execution shape

So statistics and cardinality estimation are central to serious optimization.

Why this matters

Rule-based optimization can improve many queries with local rewrites such as:

projection pushdown
predicate pushdown
constant folding

But once the engine must choose among many legal alternatives, it needs a basis for comparison.

That basis is usually a cost model informed by statistics.

Without that, the engine has little principled way to decide:

which join should happen first
whether an index scan beats a full scan
whether repartitioning is worth it
whether a partial aggregation is worthwhile

Cardinality first

Cardinality estimation is usually the most important subproblem because many other costs depend on row counts.

If the engine can estimate:

input row counts
filter selectivity
grouping cardinality
join output size

then it can derive rough downstream costs.

If it gets those row counts badly wrong, even a sophisticated cost model can make poor decisions.

That is why cardinality estimation is often treated as the heart of the optimizer.

What a cost model tries to estimate

A cost model usually approximates some combination of:

CPU work
memory use
local I/O
network movement
latency

Different engines weight these differently depending on workload.

For example:

a single-node analytical engine may care most about CPU and memory bandwidth
a distributed engine may care heavily about shuffle size
an interactive engine may favor latency over total throughput

So a cost model is always tied to system goals, not just pure algebra.

Common sources of statistics

Engines often collect or maintain:

total row count
per-column distinct counts
min/max values
null counts
histograms
file or partition sizes
index metadata

These statistics can exist at several levels:

table-level
partition-level
file-level
row-group or block-level

More granular statistics can improve planning, but they also cost more to collect, store, and keep fresh.

Selectivity estimation

Selectivity is the fraction of rows expected to survive a predicate.

Examples:

age > 18
country = 'US'
status IN (...)

The optimizer uses selectivity estimates to predict:

filter output size
downstream join input size
aggregate input size
whether an index is worthwhile

This is deceptively hard because selectivity depends on data distribution, not just syntax.

Join-order planning

Join planning is where bad estimates become especially expensive.

If a query joins several inputs, many legal join orders may exist. Their costs can differ dramatically because intermediate result sizes can explode or collapse depending on the order.

The optimizer therefore needs estimates for:

base table sizes
filter selectivities
join key distinct counts
skew and duplication patterns

If those estimates are poor, the engine may build large hash tables too early, shuffle too much data, or materialize huge intermediates unnecessarily.

This is one reason join optimization is considered one of the hardest parts of query planning.

Histograms and distributions

Simple row counts are rarely enough.

Two columns can have the same row count but very different value distributions.

Histograms help by approximating:

frequency of values
value ranges
skew

They improve estimates for:

range predicates
uneven value distributions
non-uniform grouping sizes

But histograms are still approximations, and they can become stale.

Correlation and independence assumptions

Many optimizers make simplifying assumptions such as:

predicates are independent
column distributions are uniform
joins follow regular key patterns

These assumptions are convenient but often wrong.

For example, predicates on city and country are usually correlated. Treating them as independent can badly underestimate or overestimate result size.

This is one of the biggest reasons real-world cardinality estimation is difficult.

Error propagation

Estimation errors compound through a plan.

If an early filter estimate is off, then:

join inputs are misestimated
join outputs are misestimated
aggregate sizes are misestimated
memory and network costs are misestimated

By the time the optimizer compares full plans, small early mistakes may have become large planning errors.

This is why even mature query engines still struggle with difficult workloads.

Distributed cost models

Distributed engines have additional cost dimensions:

shuffle volume
partition balance
scheduling overhead
serialization cost
network latency

In that world, a plan that is locally cheap may still be globally expensive if it causes large repartitioning or skewed tasks.

So distributed cost models are not just single-node cost models with larger numbers. They need explicit reasoning about data movement.

Statistics freshness

Statistics are only useful if they are reasonably current.

Problems appear when:

tables grow significantly
distributions drift
partitions change
indexes are added or removed

This creates a maintenance tradeoff:

fresh statistics improve planning
but collecting them can be expensive

Many systems therefore settle for approximate or periodic stats rather than perfect real-time accuracy.

Rule-based vs cost-based planning

This note should be read as a deepening of cost-based optimization, not a replacement for rule-based optimization.

Rule-based planning is still useful for:

safe local rewrites
simplifying plans
obvious pushdowns

Cost-based planning becomes necessary when the engine must choose among many globally different alternatives.

The two approaches are complementary.

Main takeaways

Cardinality estimation is often the most important input to a cost-based optimizer.
Cost models compare legal plans using rough estimates of CPU, memory, I/O, and network work.
Statistics such as row counts, distinct counts, and histograms make those estimates less blind.
Join planning is especially sensitive to bad estimates.
Correlation, skew, and stale statistics are major reasons optimizers make poor choices.
Distributed engines must model data movement explicitly, not just local operator cost.

hqew/005-query-optimization.md
hqew/008-joins-and-aggregations.md
hqew/009-distributed-query-engines.md
hqew/014-data-sources-and-file-formats.md

Changelog

Apr 8, 2026 -- Added a dedicated note on cost models, optimizer statistics, and cardinality estimation.

7.1 KiB Raw Blame History