Add note files for data sources, cost models, and cardinality estimation

Hassan Abedi 2026-04-08 15:09:35 +02:00
parent 780a9db6a8
commit 405a609eb8
5 changed files with 580 additions and 3 deletions


@@ -0,0 +1,293 @@
# Data Sources and File Formats
A reference for the boundary where external data enters a query engine.
---
## Short answer
A query engine needs a small, stable interface for reading data from many backends.
That interface should answer questions like:
- what schema does this source expose?
- how is the data scanned?
- what columns can be projected?
- what filters or other pushdowns can the source perform?
- what batch or streaming model does it support?
The file format matters because it strongly affects:
- schema fidelity
- parsing cost
- projection efficiency
- compression
- pushdown opportunities
So data sources are not just plumbing. They shape both planning and execution.
---
## Why this matters
Most engines do not operate over one monolithic storage system.
They often need to read from:
- CSV files
- Parquet files
- object stores
- database connectors
- in-memory tables
- remote services
Without a clean source boundary, the engine becomes tightly coupled to one representation and one storage layout.
With a clean source boundary, the engine can:
- plan against many backends
- reuse operators above the source layer
- push work down where possible
- preserve a clean distinction between engine logic and source-specific behavior
---
## What a source interface should expose
At minimum, a data source abstraction should expose:
1. schema discovery
2. scanning
In practice, useful source interfaces often also need to expose:
- projection support
- predicate pushdown support
- partition awareness
- ordering guarantees
- file- or block-level metadata
- statistics useful for planning
The challenge is to keep the interface:
- small enough to stay generic
- rich enough to preserve important performance features
That tradeoff is one of the core design decisions in a query engine.
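
As a rough illustration of the "minimum" contract above, here is a minimal sketch in Python. The names `Field`, `DataSource`, and `InMemorySource` are hypothetical and not taken from any particular engine; batches are modeled as plain dictionaries of column lists to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import Iterator, Optional, Protocol, Sequence


@dataclass(frozen=True)
class Field:
    name: str
    dtype: str          # e.g. "int64", "utf8"; a stand-in for a real type system
    nullable: bool = True


class DataSource(Protocol):
    """Hypothetical minimal source contract: schema discovery plus scanning."""

    def schema(self) -> Sequence[Field]: ...

    def scan(self, projection: Optional[Sequence[str]] = None) -> Iterator[dict]:
        """Yield batches as {column name: list of values}."""
        ...


class InMemorySource:
    """Toy backend that satisfies DataSource structurally."""

    def __init__(self, fields, columns, batch_size=2):
        self._fields = list(fields)
        self._columns = columns          # {name: list of values}
        self._batch_size = batch_size

    def schema(self):
        return self._fields

    def scan(self, projection=None):
        names = list(projection) if projection is not None else [f.name for f in self._fields]
        num_rows = len(next(iter(self._columns.values()), []))
        for start in range(0, num_rows, self._batch_size):
            end = start + self._batch_size
            yield {name: self._columns[name][start:end] for name in names}


source = InMemorySource(
    fields=[Field("id", "int64", nullable=False), Field("name", "utf8")],
    columns={"id": [1, 2, 3], "name": ["a", "b", None]},
)
batches = list(source.scan(projection=["id"]))
```

Everything beyond `schema()` and `scan()` (statistics, ordering, partitioning) would be an extension of this contract, which is where the size-versus-richness tradeoff shows up.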
---
## Schema discovery
Planning depends on schema information before execution starts.
The engine needs to know:
- column names
- data types
- nullability
- sometimes partition columns or hidden metadata columns
There are two common patterns.
### Declared schema
The source is given an explicit schema ahead of time.
Strengths:
- predictable
- fast
- good for strongly typed pipelines
Weaknesses:
- can drift from the actual data
- requires external coordination
### Inferred schema
The source inspects data and infers structure.
Strengths:
- convenient
- good for ad hoc exploration
Weaknesses:
- often expensive
- can be fragile
- may infer overly weak or inconsistent types
This is why many engines support both.
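
To make the fragility of inference concrete, here is a small sketch using only the Python standard library. The helpers `infer_type` and `infer_schema` and the type names are assumptions made for illustration; real engines use more careful rules.

```python
import csv
import io


def infer_type(values):
    """Pick the narrowest type that fits every non-empty sample value."""
    non_empty = [v for v in values if v != ""]
    for name, caster in (("int64", int), ("float64", float)):
        try:
            for v in non_empty:
                caster(v)
            return name
        except ValueError:
            continue
    return "utf8"


def infer_schema(text, sample_rows=100):
    reader = csv.DictReader(io.StringIO(text))
    # Only the first `sample_rows` rows are inspected.
    sample = [row for _, row in zip(range(sample_rows), reader)]
    return {col: infer_type([row[col] for row in sample]) for col in reader.fieldnames}


data = "id,price,code\n1,9.5,007\n2,,A12\n"

declared = {"id": "int64", "price": "float64", "code": "utf8"}
inferred = infer_schema(data)                      # sees both rows
inferred_small = infer_schema(data, sample_rows=1)  # sees only the first row
```

With the full sample the inferred schema matches the declared one, but with a one-row sample `code` is inferred as `int64` because `007` parses as an integer. A larger sample costs more to read, which is the expense-versus-accuracy tension described above.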
---
## Scan behavior
Scanning is the runtime act of turning a source into batches the engine can consume.
Important scan questions include:
- does the source stream data incrementally or materialize eagerly?
- what is the batch size?
- does it preserve source ordering?
- does it decode only requested columns?
- can it skip files, row groups, or blocks?
In a batch-oriented engine, the scan operator is not just "read rows." It is "produce typed batches with predictable shape."
That is a much stronger contract.
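
The difference between "read rows" and "produce batches" can be sketched with a generator. This is an illustration only; `scan_batches` and `row_stream` are hypothetical names, and a real engine would emit typed arrays rather than Python lists.

```python
from itertools import islice


def scan_batches(rows, column_names, batch_size):
    """Turn an iterable of row tuples into column-oriented batches, lazily."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        # Transpose rows into columns so every batch has the same set of keys.
        yield {name: [row[i] for row in chunk] for i, name in enumerate(column_names)}


def row_stream():
    # Stand-in for a file or network reader; rows are produced on demand.
    for i in range(5):
        yield (i, f"user{i}")


batches = list(scan_batches(row_stream(), ["id", "name"], batch_size=2))
```

Because the scan is a generator, at most one batch of rows is held at a time, and source order is preserved. Only the final batch may be shorter than `batch_size`.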
---
## Projection pushdown
Projection pushdown means reading only the columns the plan actually needs.
This is often the first and easiest optimization at the source boundary.
It matters because analytical queries frequently touch:
- a few columns
- from very wide datasets
Projection pushdown reduces:
- I/O
- decoding cost
- memory traffic
- unnecessary batch width
Columnar formats benefit especially strongly because unused columns can often be skipped almost completely.
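
A toy model of why columnar layouts make this cheap, assuming a hypothetical `ColumnarFile` class in which each column is stored as its own encoded chunk:

```python
class ColumnarFile:
    """Toy columnar layout: each column is stored and decoded independently."""

    def __init__(self, encoded_columns):
        self._encoded = encoded_columns      # {name: comma-joined text}
        self.decoded = []                    # records which columns were actually decoded

    def read(self, projection):
        out = {}
        for name in projection:
            self.decoded.append(name)
            out[name] = [int(v) for v in self._encoded[name].split(",")]
        return out


f = ColumnarFile({"a": "1,2,3", "b": "4,5,6", "c": "7,8,9", "d": "10,11,12"})

# SELECT sum(a) FROM t  -> the plan needs only column "a".
batch = f.read(projection=["a"])
total = sum(batch["a"])
```

Columns `b`, `c`, and `d` are never decoded. A row-oriented text format would still have to parse past every field of every row to reach `a`.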
---
## Predicate pushdown
Predicate pushdown means applying filters inside the source, or at least using source metadata to skip irrelevant data.
This can happen at several levels:
- file pruning
- partition pruning
- row-group pruning
- index-assisted filtering
- direct source-side evaluation
Predicate pushdown can eliminate far more data than projection pushdown, but it is also harder to implement well because the engine must understand:
- which predicates are safe to push
- what the source can evaluate
- when pushdown changes cost but not semantics
So pushdown is both an optimization problem and an interface-design problem.
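
Row-group pruning, one of the levels listed above, can be sketched as follows. `RowGroup` and `scan_greater_than` are hypothetical names; the min/max fields stand in for the kind of per-block metadata a format may store.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    min_value: int
    max_value: int
    values: list


def scan_greater_than(row_groups, threshold):
    """Evaluate `value > threshold`, skipping groups whose max rules them out."""
    groups_read = 0
    matches = []
    for group in row_groups:
        if group.max_value <= threshold:
            continue                      # pruned: no row in this group can match
        groups_read += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, groups_read


groups = [
    RowGroup(1, 10, [1, 5, 10]),
    RowGroup(11, 20, [11, 15, 20]),
    RowGroup(21, 30, [21, 25, 30]),
]
matches, groups_read = scan_greater_than(groups, threshold=18)
```

Pruning only skips groups that provably contain no match, so the filter is still applied to the surviving rows. That is what keeps the result identical to the unpruned scan.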
---
## CSV vs Parquet
This contrast is one of the most useful mental models for file-format choice.
### CSV
Good for:
- portability
- simplicity
- easy inspection
Weak for:
- type fidelity
- parsing cost
- projection efficiency
- null representation consistency
- analytical performance
CSV is row-oriented, text-based, and expensive to parse repeatedly.
### Parquet
Good for:
- columnar access
- stronger typing
- compression
- metadata-driven skipping
- analytical scans
Weak for:
- human readability
- simplicity of implementation
- some update-heavy workflows
Parquet is much more natural for analytical query engines because it aligns with columnar, batch-oriented execution.
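
The type-fidelity and null-representation weaknesses of CSV are easy to demonstrate with Python's standard `csv` module. This is only a round-trip illustration, not a benchmark:

```python
import csv
import io

rows = [{"id": 1, "score": 2.5, "note": None}, {"id": 2, "score": 3.0, "note": ""}]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "score", "note"])
writer.writeheader()
writer.writerows(rows)

buf.seek(0)
round_tripped = list(csv.DictReader(buf))
```

After the round trip every value is a string (`"1"`, `"2.5"`), and the missing `note` and the empty-string `note` have both become `""`. A reader has to re-infer or be told the types, and it cannot distinguish null from empty without an extra convention.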
---
## Source capabilities and planning
Not every source supports the same optimizations.
A planner often needs to know whether a source can:
- project columns
- evaluate filters
- expose statistics
- support partition pruning
- preserve sort order
This means source interfaces are not only runtime contracts. They also influence planning quality.
If the source abstraction hides too much, the engine loses optimization opportunities.
If it exposes too much, the abstraction becomes brittle and source-specific.
That is the central tension.
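
One way to picture capability-aware planning is a planner that keeps an operator above the scan whenever the source cannot do the work itself. The `Capabilities` flags and the tuple-based plan shape below are assumptions made for this sketch:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capabilities:
    projection: bool = False
    filters: bool = False


def plan_scan(caps, columns, predicate):
    """Return a list of plan steps, outermost first (hypothetical plan shape)."""
    scan_args = {}
    steps = []
    if caps.projection:
        scan_args["columns"] = columns
    else:
        steps.append(("Project", columns))      # engine narrows the batch itself
    if caps.filters:
        scan_args["predicate"] = predicate
    else:
        steps.append(("Filter", predicate))     # engine evaluates the filter itself
    steps.append(("Scan", scan_args))
    return steps


basic_plan = plan_scan(Capabilities(), ["id"], "id > 10")
capable_plan = plan_scan(Capabilities(projection=True, filters=True), ["id"], "id > 10")
```

Both plans compute the same result. The capable source collapses the plan to a single scan, while the basic source leaves `Project` and `Filter` in the engine.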
---
## Source neutrality vs source exploitation
Good query engines try to be source-neutral at the operator level while still exploiting source-specific strengths.
That usually means:
- a generic scan abstraction
- plus capability-aware planning
This is often better than either extreme:
- fully generic but performance-blind
- or fully source-coupled but hard to extend
In practice, engines usually live somewhere in the middle.
---
## Main takeaways
- The data-source boundary is one of the most important interfaces in a query engine.
- Schema discovery is planning-critical, not just metadata housekeeping.
- Projection and predicate pushdown begin at the source boundary.
- File formats directly affect execution cost and optimization opportunities.
- CSV is convenient but weak for analytical execution; Parquet is far more engine-friendly.
- A good source abstraction balances genericity with enough capability exposure to support real optimization.
---
## Related notes
- `hqew/007-storage-and-indexes.md`
- `hqew/014-how-query-engines-work-part-1.md`
- `hqew/016-physical-plans-and-operators.md`
- `hqew/018-cost-models-statistics-and-cardinality-estimation.md`
---
## Changelog
* **Apr 8, 2026** -- Added a dedicated note on source interfaces, pushdown, schema discovery, and file-format tradeoffs.


@@ -368,5 +368,5 @@ It is what allows:
 ## Changelog
-* **Apr 7, 2026** -- Removed references to ignored local `tmp/` paths.
+* **Apr 7, 2026** -- Removed references to ignored local paths.
 * **Apr 7, 2026** -- Added a dedicated note on SQL tokenization, parsing, AST structure, and logical planning.


@@ -375,5 +375,5 @@ So this note should be read as a physical-plan primer, not as the full space of
 ## Changelog
-* **Apr 7, 2026** -- Removed references to ignored local `tmp/` paths.
+* **Apr 7, 2026** -- Removed references to ignored local paths.
 * **Apr 7, 2026** -- Added a dedicated note on physical planning, executable operators, and batch-oriented runtime behavior.


@@ -380,5 +380,5 @@ That is why "choosing a type system" appears so early in the book. It affects mu
 ## Changelog
-* **Apr 7, 2026** -- Removed references to ignored local `tmp/` paths.
+* **Apr 7, 2026** -- Removed references to ignored local paths.
 * **Apr 7, 2026** -- Added a dedicated note on expressions, Arrow-based types, coercion, aggregates, and null handling.


@@ -0,0 +1,284 @@
# Cost Models, Statistics, and Cardinality Estimation
A reference for how query engines estimate plan cost and why those estimates so often go wrong.
---
## Short answer
A cost-based optimizer needs estimates for:
- how many rows each operator will produce
- how expensive each operator will be
- how much memory, I/O, CPU, and network work a plan will require
The most important estimate is usually cardinality:
- how many rows flow out of each stage
If those estimates are bad, the optimizer may choose:
- the wrong join order
- the wrong join algorithm
- the wrong access path
- the wrong distributed execution shape
So statistics and cardinality estimation are central to serious optimization.
---
## Why this matters
Rule-based optimization can improve many queries with local rewrites such as:
- projection pushdown
- predicate pushdown
- constant folding
But once the engine must choose among many legal alternatives, it needs a basis for comparison.
That basis is usually a cost model informed by statistics.
Without that, the engine has little principled way to decide:
- which join should happen first
- whether an index scan beats a full scan
- whether repartitioning is worth it
- whether a partial aggregation is worthwhile
---
## Cardinality first
Cardinality estimation is usually the most important subproblem because many other costs depend on row counts.
If the engine can estimate:
- input row counts
- filter selectivity
- grouping cardinality
- join output size
then it can derive rough downstream costs.
If it gets those row counts badly wrong, even a sophisticated cost model can make poor decisions.
That is why cardinality estimation is often treated as the heart of the optimizer.
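
A minimal sketch of how row-count estimates flow through a plan. The helper names and the numbers are made up for illustration:

```python
def estimate_filter(input_rows, selectivity):
    return input_rows * selectivity


def estimate_group_by(input_rows, distinct_keys):
    # A group-by cannot produce more groups than input rows or distinct keys.
    return min(input_rows, distinct_keys)


scan_rows = 1_000_000
filtered = estimate_filter(scan_rows, selectivity=0.05)
groups = estimate_group_by(filtered, distinct_keys=200)

# Same plan, but the filter selectivity is misjudged by 10x.
filtered_bad = estimate_filter(scan_rows, selectivity=0.5)
```

Every operator above the filter inherits its estimate. A 10x selectivity error becomes a 10x error in the rows the aggregation, or a join in its place, is expected to process.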
---
## What a cost model tries to estimate
A cost model usually approximates some combination of:
- CPU work
- memory use
- local I/O
- network movement
- latency
Different engines weight these differently depending on workload.
For example:
- a single-node analytical engine may care most about CPU and memory bandwidth
- a distributed engine may care heavily about shuffle size
- an interactive engine may favor latency over total throughput
So a cost model is always tied to system goals, not just pure algebra.
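
The dependence on system goals can be sketched as a weighted sum whose weights differ per deployment. The resource units, plan figures, and weights below are invented for the example and carry no empirical meaning:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanEstimate:
    cpu: float        # abstract work units, not measured values
    memory: float
    io: float
    network: float


def cost(plan, weights):
    return (weights["cpu"] * plan.cpu + weights["memory"] * plan.memory
            + weights["io"] * plan.io + weights["network"] * plan.network)


memory_heavy = PlanEstimate(cpu=10, memory=50, io=5, network=2)
network_heavy = PlanEstimate(cpu=12, memory=10, io=5, network=40)
plans = [memory_heavy, network_heavy]

# Hypothetical weightings: one ignores the network, one penalizes it heavily.
single_node = {"cpu": 1.0, "memory": 1.0, "io": 1.0, "network": 0.0}
distributed = {"cpu": 1.0, "memory": 0.2, "io": 1.0, "network": 2.0}

best_single = min(plans, key=lambda p: cost(p, single_node))
best_distributed = min(plans, key=lambda p: cost(p, distributed))
```

The same two plans rank differently under the two weightings. The cost model encodes priorities as well as physics.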
---
## Common sources of statistics
Engines often collect or maintain:
- total row count
- per-column distinct counts
- min/max values
- null counts
- histograms
- file or partition sizes
- index metadata
These statistics can exist at several levels:
- table-level
- partition-level
- file-level
- row-group or block-level
More granular statistics can improve planning, but they also cost more to collect, store, and keep fresh.
---
## Selectivity estimation
Selectivity is the fraction of rows expected to survive a predicate.
Examples:
- `age > 18`
- `country = 'US'`
- `status IN (...)`
The optimizer uses selectivity estimates to predict:
- filter output size
- downstream join input size
- aggregate input size
- whether an index is worthwhile
This is deceptively hard because selectivity depends on data distribution, not just syntax.
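
Two common textbook heuristics, sketched under explicit uniformity assumptions. The function names and statistics are hypothetical:

```python
def eq_selectivity(distinct_count):
    # Assumes every distinct value is equally frequent.
    return 1.0 / distinct_count


def gt_selectivity(min_value, max_value, value):
    # Assumes values are spread uniformly between min and max.
    if value >= max_value:
        return 0.0
    if value < min_value:
        return 1.0
    return (max_value - value) / (max_value - min_value)


sel_age = gt_selectivity(min_value=0, max_value=100, value=18)   # age > 18
sel_country = eq_selectivity(distinct_count=50)                  # country = 'US'
```

Both estimates use only summary statistics. If most rows actually have `country = 'US'`, the `1 / distinct_count` guess of 2% is far too low, which is the distribution dependence noted above.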
---
## Join-order planning
Join planning is where bad estimates become especially expensive.
If a query joins several inputs, many legal join orders may exist. Their costs can differ dramatically because intermediate result sizes can explode or collapse depending on the order.
The optimizer therefore needs estimates for:
- base table sizes
- filter selectivities
- join key distinct counts
- skew and duplication patterns
If those estimates are poor, the engine may build large hash tables too early, shuffle too much data, or materialize huge intermediates unnecessarily.
This is one reason join optimization is considered one of the hardest parts of query planning.
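
To see how join order changes intermediate sizes, the sketch below uses the common textbook estimate for an equi-join, `|R| * |S| / max(ndv_R, ndv_S)`, which assumes uniform key distributions. The table sizes and distinct counts are made up:

```python
def join_rows(left_rows, right_rows, left_ndv, right_ndv):
    """Textbook equi-join estimate under uniformity and key containment."""
    return left_rows * right_rows / max(left_ndv, right_ndv)


fact_rows, dim_a_rows, dim_b_rows = 1_000_000, 1_000, 10
ndv_a, ndv_b = 1_000, 1_000   # distinct join-key values on the fact side

# Order 1: (fact JOIN dim_a) JOIN dim_b
inter_1 = join_rows(fact_rows, dim_a_rows, ndv_a, dim_a_rows)
final_1 = join_rows(inter_1, dim_b_rows, ndv_b, dim_b_rows)

# Order 2: (fact JOIN dim_b) JOIN dim_a
inter_2 = join_rows(fact_rows, dim_b_rows, ndv_b, dim_b_rows)
final_2 = join_rows(inter_2, dim_a_rows, ndv_a, dim_a_rows)
```

Both orders are estimated to produce the same final row count, but the first carries a 1,000,000-row intermediate and the second only 10,000 rows. An optimizer that misjudges `dim_b` as large would see no reason to join it first.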
---
## Histograms and distributions
Simple row counts are rarely enough.
Two columns can have the same row count, and even the same number of distinct values, but very different value distributions.
Histograms help by approximating:
- frequency of values
- value ranges
- skew
They improve estimates for:
- range predicates
- uneven value distributions
- non-uniform grouping sizes
But histograms are still approximations, and they can become stale.
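
A small equi-width histogram shows the gain over a plain min/max assumption on skewed data. This is a sketch only; real systems often prefer equi-depth or other histogram variants, and the function names here are hypothetical.

```python
def build_histogram(values, lo, hi, buckets):
    """Equi-width histogram: counts per bucket over [lo, hi)."""
    width = (hi - lo) / buckets
    counts = [0] * buckets
    for v in values:
        index = min(int((v - lo) / width), buckets - 1)
        counts[index] += 1
    return counts, width


def estimate_less_than(counts, width, lo, value):
    """Estimate rows with v < value, assuming uniformity inside each bucket."""
    total = 0.0
    for i, count in enumerate(counts):
        start = lo + i * width
        end = start + width
        if value >= end:
            total += count                                  # bucket fully below value
        elif value > start:
            total += count * (value - start) / width        # partial bucket
    return total


# Skewed data: 90 small values, 10 large ones, all within [0, 100).
values = [1] * 90 + [95] * 10
counts, width = build_histogram(values, lo=0, hi=100, buckets=4)

hist_estimate = estimate_less_than(counts, width, lo=0, value=25)
uniform_estimate = len(values) * (25 - 0) / (100 - 0)
actual = sum(1 for v in values if v < 25)
```

For the predicate `v < 25`, the uniform assumption predicts 25 rows while the histogram predicts 90, which matches the data here. The histogram is still wrong for a predicate such as `v < 1`, because it assumes values are spread evenly inside each bucket.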
---
## Correlation and independence assumptions
Many optimizers make simplifying assumptions such as:
- predicates are independent
- column distributions are uniform
- joins follow regular key patterns
These assumptions are convenient but often wrong.
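For example, predicates on `city` and `country` are usually correlated: nearly every row with `city = 'Paris'` also has `country = 'France'`. Multiplying the two selectivities as if they were independent underestimates that combination, and overestimates contradictory ones such as `city = 'Paris' AND country = 'Germany'`.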
This is one of the biggest reasons real-world cardinality estimation is difficult.
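
The effect of the independence assumption can be checked directly on a tiny, made-up table:

```python
rows = (
    [("Paris", "France")] * 40
    + [("Lyon", "France")] * 10
    + [("Berlin", "Germany")] * 50
)
n = len(rows)

sel_city = sum(1 for city, _ in rows if city == "Paris") / n
sel_country = sum(1 for _, country in rows if country == "France") / n

# Independence assumption: multiply the individual selectivities.
independent_estimate = n * sel_city * sel_country

actual_match = sum(1 for c, k in rows if c == "Paris" and k == "France")
actual_mismatch = sum(1 for c, k in rows if c == "Paris" and k == "Germany")
```

The independent estimate is 20 rows for any pairing of a 40% city with a 50% country. The true answer is 40 rows for Paris and France and 0 rows for Paris and Germany, so the same assumption is wrong in both directions.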
---
## Error propagation
Estimation errors compound through a plan.
If an early filter estimate is off, then:
- join inputs are misestimated
- join outputs are misestimated
- aggregate sizes are misestimated
- memory and network costs are misestimated
By the time the optimizer compares full plans, small early mistakes may have become large planning errors.
This is why even mature query engines still struggle with difficult workloads.
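
Because cardinality estimates multiply along a plan, errors multiply too. A back-of-the-envelope sketch, assuming every step is off by the same factor in the same direction (a pessimistic simplification):

```python
def compounded_error(per_step_factor, steps):
    """Worst case when each estimate is off by the same factor in the same direction."""
    return per_step_factor ** steps


errors = {steps: compounded_error(2.0, steps) for steps in (1, 3, 5)}
```

A modest 2x error per step grows to 8x after three dependent estimates and 32x after five. In practice some errors cancel, so this is an upper bound on this simple model rather than a prediction.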
---
## Distributed cost models
Distributed engines have additional cost dimensions:
- shuffle volume
- partition balance
- scheduling overhead
- serialization cost
- network latency
In that world, a plan that is locally cheap may still be globally expensive if it causes large repartitioning or skewed tasks.
So distributed cost models are not just single-node cost models with larger numbers. They need explicit reasoning about data movement.
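
A deliberately simplified comparison of two distributed join strategies by bytes moved. The formulas ignore data that already sits on the right worker, serialization, and skew, and the function names are hypothetical:

```python
def shuffle_join_bytes(left_bytes, right_bytes):
    # Both inputs are repartitioned by join key, so each crosses the network once.
    return left_bytes + right_bytes


def broadcast_join_bytes(small_bytes, workers):
    # The small side is copied to every worker; the large side stays in place.
    return small_bytes * workers


def choose_join(left_bytes, right_bytes, workers):
    small = min(left_bytes, right_bytes)
    broadcast = broadcast_join_bytes(small, workers)
    shuffle = shuffle_join_bytes(left_bytes, right_bytes)
    return ("broadcast", broadcast) if broadcast < shuffle else ("shuffle", shuffle)


GB = 1024 ** 3
MB = 1024 ** 2
small_dim = choose_join(100 * GB, 10 * MB, workers=20)
two_large = choose_join(100 * GB, 80 * GB, workers=20)
```

With a 10 MB dimension table, broadcasting moves about 200 MB instead of over 100 GB. With two large inputs, broadcasting would move 1,600 GB, so the shuffle wins. Neither decision can be made from local operator cost alone.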
---
## Statistics freshness
Statistics are only useful if they are reasonably current.
Problems appear when:
- tables grow significantly
- distributions drift
- partitions change
- indexes are added or removed
This creates a maintenance tradeoff:
- fresh statistics improve planning
- but collecting them can be expensive
Many systems therefore settle for approximate or periodic stats rather than perfect real-time accuracy.
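
One simple periodic-refresh policy is to recollect statistics once the number of changed rows exceeds a fraction of the table size. The function name, parameters, and default thresholds below are hypothetical, chosen only to illustrate the tradeoff:

```python
def needs_refresh(rows_at_last_analyze, rows_changed_since,
                  change_fraction=0.1, min_changes=50):
    """Hypothetical policy: refresh once enough rows changed relative to table size."""
    threshold = min_changes + change_fraction * rows_at_last_analyze
    return rows_changed_since > threshold


small_drift = needs_refresh(1_000_000, rows_changed_since=20_000)
large_drift = needs_refresh(1_000_000, rows_changed_since=150_000)
```

A lower `change_fraction` keeps statistics fresher at the price of more frequent collection. The `min_changes` floor avoids constant refreshes on very small tables.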
---
## Rule-based vs cost-based planning
This note should be read as a deepening of cost-based optimization, not a replacement for rule-based optimization.
Rule-based planning is still useful for:
- safe local rewrites
- simplifying plans
- obvious pushdowns
Cost-based planning becomes necessary when the engine must choose among many globally different alternatives.
The two approaches are complementary.
---
## Main takeaways
- Cardinality estimation is often the most important input to a cost-based optimizer.
- Cost models compare legal plans using rough estimates of CPU, memory, I/O, and network work.
- Statistics such as row counts, distinct counts, and histograms make those estimates less blind.
- Join planning is especially sensitive to bad estimates.
- Correlation, skew, and stale statistics are major reasons optimizers make poor choices.
- Distributed engines must model data movement explicitly, not just local operator cost.
---
## Related notes
- `hqew/005-query-optimization.md`
- `hqew/008-joins-and-aggregations.md`
- `hqew/009-distributed-query-engines.md`
- `hqew/014-data-sources-and-file-formats.md`
---
## Changelog
* **Apr 8, 2026** -- Added a dedicated note on cost models, optimizer statistics, and cardinality estimation.