# Cost Models, Statistics, and Cardinality Estimation A reference for how query engines estimate plan cost and why those estimates so often go wrong. --- ## Short answer A cost-based optimizer needs estimates for: - how many rows each operator will produce - how expensive each operator will be - how much memory, I/O, CPU, and network work a plan will require The most important estimate is usually cardinality: - how many rows flow out of each stage If those estimates are bad, the optimizer may choose: - the wrong join order - the wrong join algorithm - the wrong access path - the wrong distributed execution shape So statistics and cardinality estimation are central to serious optimization. --- ## Why this matters Rule-based optimization can improve many queries with local rewrites such as: - projection pushdown - predicate pushdown - constant folding But once the engine must choose among many legal alternatives, it needs a basis for comparison. That basis is usually a cost model informed by statistics. Without that, the engine has little principled way to decide: - which join should happen first - whether an index scan beats a full scan - whether repartitioning is worth it - whether a partial aggregation is worthwhile --- ## Cardinality first Cardinality estimation is usually the most important subproblem because many other costs depend on row counts. If the engine can estimate: - input row counts - filter selectivity - grouping cardinality - join output size then it can derive rough downstream costs. If it gets those row counts badly wrong, even a sophisticated cost model can make poor decisions. That is why cardinality estimation is often treated as the heart of the optimizer. --- ## What a cost model tries to estimate A cost model usually approximates some combination of: - CPU work - memory use - local I/O - network movement - latency Different engines weight these differently depending on workload. For example: - a single-node analytical engine may care most about CPU and memory bandwidth - a distributed engine may care heavily about shuffle size - an interactive engine may favor latency over total throughput So a cost model is always tied to system goals, not just pure algebra. --- ## Common sources of statistics Engines often collect or maintain: - total row count - per-column distinct counts - min/max values - null counts - histograms - file or partition sizes - index metadata These statistics can exist at several levels: - table-level - partition-level - file-level - row-group or block-level More granular statistics can improve planning, but they also cost more to collect, store, and keep fresh. --- ## Selectivity estimation Selectivity is the fraction of rows expected to survive a predicate. Examples: - `age > 18` - `country = 'US'` - `status IN (...)` The optimizer uses selectivity estimates to predict: - filter output size - downstream join input size - aggregate input size - whether an index is worthwhile This is deceptively hard because selectivity depends on data distribution, not just syntax. --- ## Join-order planning Join planning is where bad estimates become especially expensive. If a query joins several inputs, many legal join orders may exist. Their costs can differ dramatically because intermediate result sizes can explode or collapse depending on the order. The optimizer therefore needs estimates for: - base table sizes - filter selectivities - join key distinct counts - skew and duplication patterns If those estimates are poor, the engine may build large hash tables too early, shuffle too much data, or materialize huge intermediates unnecessarily. This is one reason join optimization is considered one of the hardest parts of query planning. --- ## Histograms and distributions Simple row counts are rarely enough. Two columns can have the same row count but very different value distributions. Histograms help by approximating: - frequency of values - value ranges - skew They improve estimates for: - range predicates - uneven value distributions - non-uniform grouping sizes But histograms are still approximations, and they can become stale. --- ## Correlation and independence assumptions Many optimizers make simplifying assumptions such as: - predicates are independent - column distributions are uniform - joins follow regular key patterns These assumptions are convenient but often wrong. For example, predicates on `city` and `country` are usually correlated. Treating them as independent can badly underestimate or overestimate result size. This is one of the biggest reasons real-world cardinality estimation is difficult. --- ## Error propagation Estimation errors compound through a plan. If an early filter estimate is off, then: - join inputs are misestimated - join outputs are misestimated - aggregate sizes are misestimated - memory and network costs are misestimated By the time the optimizer compares full plans, small early mistakes may have become large planning errors. This is why even mature query engines still struggle with difficult workloads. --- ## Distributed cost models Distributed engines have additional cost dimensions: - shuffle volume - partition balance - scheduling overhead - serialization cost - network latency In that world, a plan that is locally cheap may still be globally expensive if it causes large repartitioning or skewed tasks. So distributed cost models are not just single-node cost models with larger numbers. They need explicit reasoning about data movement. --- ## Statistics freshness Statistics are only useful if they are reasonably current. Problems appear when: - tables grow significantly - distributions drift - partitions change - indexes are added or removed This creates a maintenance tradeoff: - fresh statistics improve planning - but collecting them can be expensive Many systems therefore settle for approximate or periodic stats rather than perfect real-time accuracy. --- ## Rule-based vs cost-based planning This note should be read as a deepening of cost-based optimization, not a replacement for rule-based optimization. Rule-based planning is still useful for: - safe local rewrites - simplifying plans - obvious pushdowns Cost-based planning becomes necessary when the engine must choose among many globally different alternatives. The two approaches are complementary. --- ## Main takeaways - Cardinality estimation is often the most important input to a cost-based optimizer. - Cost models compare legal plans using rough estimates of CPU, memory, I/O, and network work. - Statistics such as row counts, distinct counts, and histograms make those estimates less blind. - Join planning is especially sensitive to bad estimates. - Correlation, skew, and stale statistics are major reasons optimizers make poor choices. - Distributed engines must model data movement explicitly, not just local operator cost. --- ## Related notes - `hqew/005-query-optimization.md` - `hqew/008-joins-and-aggregations.md` - `hqew/009-distributed-query-engines.md` - `hqew/014-data-sources-and-file-formats.md` --- ## Changelog * **Apr 8, 2026** -- Added a dedicated note on cost models, optimizer statistics, and cardinality estimation.