Add note files for data sources, cost models, and cardinality estimation
This commit is contained in:
parent
780a9db6a8
commit
405a609eb8
293
hqew/014-data-sources-and-file-formats.md
Normal file
293
hqew/014-data-sources-and-file-formats.md
Normal file
@ -0,0 +1,293 @@
|
|||||||
|
# Data Sources and File Formats
|
||||||
|
|
||||||
|
A reference for the boundary where external data enters a query engine.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Short answer
|
||||||
|
|
||||||
|
A query engine needs a small, stable interface for reading data from many backends.
|
||||||
|
|
||||||
|
That interface should answer questions like:
|
||||||
|
|
||||||
|
- what schema does this source expose?
|
||||||
|
- how is the data scanned?
|
||||||
|
- what columns can be projected?
|
||||||
|
- what filters or other pushdowns can the source perform?
|
||||||
|
- what batch or streaming model does it support?
|
||||||
|
|
||||||
|
The file format matters because it strongly affects:
|
||||||
|
|
||||||
|
- schema fidelity
|
||||||
|
- parsing cost
|
||||||
|
- projection efficiency
|
||||||
|
- compression
|
||||||
|
- pushdown opportunities
|
||||||
|
|
||||||
|
So data sources are not just plumbing. They shape both planning and execution.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Why this matters
|
||||||
|
|
||||||
|
Most engines do not operate over one monolithic storage system.
|
||||||
|
|
||||||
|
They often need to read from:
|
||||||
|
|
||||||
|
- CSV files
|
||||||
|
- Parquet files
|
||||||
|
- object stores
|
||||||
|
- database connectors
|
||||||
|
- in-memory tables
|
||||||
|
- remote services
|
||||||
|
|
||||||
|
Without a clean source boundary, the engine becomes tightly coupled to one representation and one storage layout.
|
||||||
|
|
||||||
|
With a clean source boundary, the engine can:
|
||||||
|
|
||||||
|
- plan against many backends
|
||||||
|
- reuse operators above the source layer
|
||||||
|
- push work down where possible
|
||||||
|
- preserve a clean distinction between engine logic and source-specific behavior
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## What a source interface should expose
|
||||||
|
|
||||||
|
At minimum, a data source abstraction should expose:
|
||||||
|
|
||||||
|
1. schema discovery
|
||||||
|
2. scanning
|
||||||
|
|
||||||
|
In practice, useful source interfaces often also need to expose:
|
||||||
|
|
||||||
|
- projection support
|
||||||
|
- predicate pushdown support
|
||||||
|
- partition awareness
|
||||||
|
- ordering guarantees
|
||||||
|
- file- or block-level metadata
|
||||||
|
- statistics useful for planning
|
||||||
|
|
||||||
|
The challenge is to keep the interface:
|
||||||
|
|
||||||
|
- small enough to stay generic
|
||||||
|
- rich enough to preserve important performance features
|
||||||
|
|
||||||
|
That tradeoff is one of the core design decisions in a query engine.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Schema discovery
|
||||||
|
|
||||||
|
Planning depends on schema information before execution starts.
|
||||||
|
|
||||||
|
The engine needs to know:
|
||||||
|
|
||||||
|
- column names
|
||||||
|
- data types
|
||||||
|
- nullability
|
||||||
|
- sometimes partition columns or hidden metadata columns
|
||||||
|
|
||||||
|
There are two common patterns.
|
||||||
|
|
||||||
|
### Declared schema
|
||||||
|
|
||||||
|
The source is given an explicit schema ahead of time.
|
||||||
|
|
||||||
|
Strengths:
|
||||||
|
|
||||||
|
- predictable
|
||||||
|
- fast
|
||||||
|
- good for strongly typed pipelines
|
||||||
|
|
||||||
|
Weaknesses:
|
||||||
|
|
||||||
|
- can drift from the actual data
|
||||||
|
- requires external coordination
|
||||||
|
|
||||||
|
### Inferred schema
|
||||||
|
|
||||||
|
The source inspects data and infers structure.
|
||||||
|
|
||||||
|
Strengths:
|
||||||
|
|
||||||
|
- convenient
|
||||||
|
- good for ad hoc exploration
|
||||||
|
|
||||||
|
Weaknesses:
|
||||||
|
|
||||||
|
- often expensive
|
||||||
|
- can be fragile
|
||||||
|
- may infer overly weak or inconsistent types
|
||||||
|
|
||||||
|
This is why many engines support both.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Scan behavior
|
||||||
|
|
||||||
|
Scanning is the runtime act of turning a source into batches the engine can consume.
|
||||||
|
|
||||||
|
Important scan questions include:
|
||||||
|
|
||||||
|
- does the source stream data incrementally or materialize eagerly?
|
||||||
|
- what is the batch size?
|
||||||
|
- does it preserve source ordering?
|
||||||
|
- does it decode only requested columns?
|
||||||
|
- can it skip files, row groups, or blocks?
|
||||||
|
|
||||||
|
In a batch-oriented engine, the scan operator is not just "read rows." It is "produce typed batches with predictable shape."
|
||||||
|
|
||||||
|
That is a much stronger contract.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Projection pushdown
|
||||||
|
|
||||||
|
Projection pushdown means reading only the columns the plan actually needs.
|
||||||
|
|
||||||
|
This is often the first and easiest optimization at the source boundary.
|
||||||
|
|
||||||
|
It matters because analytical queries frequently touch:
|
||||||
|
|
||||||
|
- a few columns
|
||||||
|
- from very wide datasets
|
||||||
|
|
||||||
|
Projection pushdown reduces:
|
||||||
|
|
||||||
|
- I/O
|
||||||
|
- decoding cost
|
||||||
|
- memory traffic
|
||||||
|
- unnecessary batch width
|
||||||
|
|
||||||
|
Columnar formats benefit especially strongly because unused columns can often be skipped almost completely.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Predicate pushdown
|
||||||
|
|
||||||
|
Predicate pushdown means applying filters inside the source, or at least using source metadata to skip irrelevant data.
|
||||||
|
|
||||||
|
This can happen at several levels:
|
||||||
|
|
||||||
|
- file pruning
|
||||||
|
- partition pruning
|
||||||
|
- row-group pruning
|
||||||
|
- index-assisted filtering
|
||||||
|
- direct source-side evaluation
|
||||||
|
|
||||||
|
Predicate pushdown is more powerful than projection pushdown, but also harder to implement well because the engine must understand:
|
||||||
|
|
||||||
|
- which predicates are safe to push
|
||||||
|
- what the source can evaluate
|
||||||
|
- when pushdown changes cost but not semantics
|
||||||
|
|
||||||
|
So pushdown is both an optimization problem and an interface-design problem.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## CSV vs Parquet
|
||||||
|
|
||||||
|
This contrast is one of the most useful mental models for file-format choice.
|
||||||
|
|
||||||
|
### CSV
|
||||||
|
|
||||||
|
Good for:
|
||||||
|
|
||||||
|
- portability
|
||||||
|
- simplicity
|
||||||
|
- easy inspection
|
||||||
|
|
||||||
|
Weak for:
|
||||||
|
|
||||||
|
- type fidelity
|
||||||
|
- parsing cost
|
||||||
|
- projection efficiency
|
||||||
|
- null representation consistency
|
||||||
|
- analytical performance
|
||||||
|
|
||||||
|
CSV is usually row-like, text-heavy, and expensive to parse repeatedly.
|
||||||
|
|
||||||
|
### Parquet
|
||||||
|
|
||||||
|
Good for:
|
||||||
|
|
||||||
|
- columnar access
|
||||||
|
- stronger typing
|
||||||
|
- compression
|
||||||
|
- metadata-driven skipping
|
||||||
|
- analytical scans
|
||||||
|
|
||||||
|
Weak for:
|
||||||
|
|
||||||
|
- human readability
|
||||||
|
- simplicity of implementation
|
||||||
|
- some update-heavy workflows
|
||||||
|
|
||||||
|
Parquet is much more natural for analytical query engines because it aligns with columnar, batch-oriented execution.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Source capabilities and planning
|
||||||
|
|
||||||
|
Not every source supports the same optimizations.
|
||||||
|
|
||||||
|
A planner often needs to know whether a source can:
|
||||||
|
|
||||||
|
- project columns
|
||||||
|
- evaluate filters
|
||||||
|
- expose statistics
|
||||||
|
- support partition pruning
|
||||||
|
- preserve sort order
|
||||||
|
|
||||||
|
This means source interfaces are not only runtime contracts. They also influence planning quality.
|
||||||
|
|
||||||
|
If the source abstraction hides too much, the engine loses optimization opportunities.
|
||||||
|
|
||||||
|
If it exposes too much, the abstraction becomes brittle and source-specific.
|
||||||
|
|
||||||
|
That is the central tension.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Source neutrality vs source exploitation
|
||||||
|
|
||||||
|
Good query engines try to be source-neutral at the operator level while still exploiting source-specific strengths.
|
||||||
|
|
||||||
|
That usually means:
|
||||||
|
|
||||||
|
- a generic scan abstraction
|
||||||
|
- plus capability-aware planning
|
||||||
|
|
||||||
|
This is often better than either extreme:
|
||||||
|
|
||||||
|
- fully generic but performance-blind
|
||||||
|
- or fully source-coupled but hard to extend
|
||||||
|
|
||||||
|
In practice, engines usually live somewhere in the middle.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Main takeaways
|
||||||
|
|
||||||
|
- The data-source boundary is one of the most important interfaces in a query engine.
|
||||||
|
- Schema discovery is planning-critical, not just metadata housekeeping.
|
||||||
|
- Projection and predicate pushdown begin at the source boundary.
|
||||||
|
- File formats directly affect execution cost and optimization opportunities.
|
||||||
|
- CSV is convenient but weak for analytical execution; Parquet is far more engine-friendly.
|
||||||
|
- A good source abstraction balances genericity with enough capability exposure to support real optimization.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Related notes
|
||||||
|
|
||||||
|
- `hqew/007-storage-and-indexes.md`
|
||||||
|
- `hqew/014-how-query-engines-work-part-1.md`
|
||||||
|
- `hqew/016-physical-plans-and-operators.md`
|
||||||
|
- `hqew/018-cost-models-statistics-and-cardinality-estimation.md`
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Changelog
|
||||||
|
|
||||||
|
* **Apr 8, 2026** -- Added a dedicated note on source interfaces, pushdown, schema discovery, and file-format tradeoffs.
|
||||||
@ -368,5 +368,5 @@ It is what allows:
|
|||||||
|
|
||||||
## Changelog
|
## Changelog
|
||||||
|
|
||||||
* **Apr 7, 2026** -- Removed references to ignored local `tmp/` paths.
|
* **Apr 7, 2026** -- Removed references to ignored local paths.
|
||||||
* **Apr 7, 2026** -- Added a dedicated note on SQL tokenization, parsing, AST structure, and logical planning.
|
* **Apr 7, 2026** -- Added a dedicated note on SQL tokenization, parsing, AST structure, and logical planning.
|
||||||
|
|||||||
@ -375,5 +375,5 @@ So this note should be read as a physical-plan primer, not as the full space of
|
|||||||
|
|
||||||
## Changelog
|
## Changelog
|
||||||
|
|
||||||
* **Apr 7, 2026** -- Removed references to ignored local `tmp/` paths.
|
* **Apr 7, 2026** -- Removed references to ignored local paths.
|
||||||
* **Apr 7, 2026** -- Added a dedicated note on physical planning, executable operators, and batch-oriented runtime behavior.
|
* **Apr 7, 2026** -- Added a dedicated note on physical planning, executable operators, and batch-oriented runtime behavior.
|
||||||
|
|||||||
@ -380,5 +380,5 @@ That is why "choosing a type system" appears so early in the book. It affects mu
|
|||||||
|
|
||||||
## Changelog
|
## Changelog
|
||||||
|
|
||||||
* **Apr 7, 2026** -- Removed references to ignored local `tmp/` paths.
|
* **Apr 7, 2026** -- Removed references to ignored local paths.
|
||||||
* **Apr 7, 2026** -- Added a dedicated note on expressions, Arrow-based types, coercion, aggregates, and null handling.
|
* **Apr 7, 2026** -- Added a dedicated note on expressions, Arrow-based types, coercion, aggregates, and null handling.
|
||||||
|
|||||||
284
hqew/018-cost-models-statistics-and-cardinality-estimation.md
Normal file
284
hqew/018-cost-models-statistics-and-cardinality-estimation.md
Normal file
@ -0,0 +1,284 @@
|
|||||||
|
# Cost Models, Statistics, and Cardinality Estimation
|
||||||
|
|
||||||
|
A reference for how query engines estimate plan cost and why those estimates so often go wrong.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Short answer
|
||||||
|
|
||||||
|
A cost-based optimizer needs estimates for:
|
||||||
|
|
||||||
|
- how many rows each operator will produce
|
||||||
|
- how expensive each operator will be
|
||||||
|
- how much memory, I/O, CPU, and network work a plan will require
|
||||||
|
|
||||||
|
The most important estimate is usually cardinality:
|
||||||
|
|
||||||
|
- how many rows flow out of each stage
|
||||||
|
|
||||||
|
If those estimates are bad, the optimizer may choose:
|
||||||
|
|
||||||
|
- the wrong join order
|
||||||
|
- the wrong join algorithm
|
||||||
|
- the wrong access path
|
||||||
|
- the wrong distributed execution shape
|
||||||
|
|
||||||
|
So statistics and cardinality estimation are central to serious optimization.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Why this matters
|
||||||
|
|
||||||
|
Rule-based optimization can improve many queries with local rewrites such as:
|
||||||
|
|
||||||
|
- projection pushdown
|
||||||
|
- predicate pushdown
|
||||||
|
- constant folding
|
||||||
|
|
||||||
|
But once the engine must choose among many legal alternatives, it needs a basis for comparison.
|
||||||
|
|
||||||
|
That basis is usually a cost model informed by statistics.
|
||||||
|
|
||||||
|
Without that, the engine has little principled way to decide:
|
||||||
|
|
||||||
|
- which join should happen first
|
||||||
|
- whether an index scan beats a full scan
|
||||||
|
- whether repartitioning is worth it
|
||||||
|
- whether a partial aggregation is worthwhile
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Cardinality first
|
||||||
|
|
||||||
|
Cardinality estimation is usually the most important subproblem because many other costs depend on row counts.
|
||||||
|
|
||||||
|
If the engine can estimate:
|
||||||
|
|
||||||
|
- input row counts
|
||||||
|
- filter selectivity
|
||||||
|
- grouping cardinality
|
||||||
|
- join output size
|
||||||
|
|
||||||
|
then it can derive rough downstream costs.
|
||||||
|
|
||||||
|
If it gets those row counts badly wrong, even a sophisticated cost model can make poor decisions.
|
||||||
|
|
||||||
|
That is why cardinality estimation is often treated as the heart of the optimizer.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## What a cost model tries to estimate
|
||||||
|
|
||||||
|
A cost model usually approximates some combination of:
|
||||||
|
|
||||||
|
- CPU work
|
||||||
|
- memory use
|
||||||
|
- local I/O
|
||||||
|
- network movement
|
||||||
|
- latency
|
||||||
|
|
||||||
|
Different engines weight these differently depending on workload.
|
||||||
|
|
||||||
|
For example:
|
||||||
|
|
||||||
|
- a single-node analytical engine may care most about CPU and memory bandwidth
|
||||||
|
- a distributed engine may care heavily about shuffle size
|
||||||
|
- an interactive engine may favor latency over total throughput
|
||||||
|
|
||||||
|
So a cost model is always tied to system goals, not just pure algebra.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Common sources of statistics
|
||||||
|
|
||||||
|
Engines often collect or maintain:
|
||||||
|
|
||||||
|
- total row count
|
||||||
|
- per-column distinct counts
|
||||||
|
- min/max values
|
||||||
|
- null counts
|
||||||
|
- histograms
|
||||||
|
- file or partition sizes
|
||||||
|
- index metadata
|
||||||
|
|
||||||
|
These statistics can exist at several levels:
|
||||||
|
|
||||||
|
- table-level
|
||||||
|
- partition-level
|
||||||
|
- file-level
|
||||||
|
- row-group or block-level
|
||||||
|
|
||||||
|
More granular statistics can improve planning, but they also cost more to collect, store, and keep fresh.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Selectivity estimation
|
||||||
|
|
||||||
|
Selectivity is the fraction of rows expected to survive a predicate.
|
||||||
|
|
||||||
|
Examples:
|
||||||
|
|
||||||
|
- `age > 18`
|
||||||
|
- `country = 'US'`
|
||||||
|
- `status IN (...)`
|
||||||
|
|
||||||
|
The optimizer uses selectivity estimates to predict:
|
||||||
|
|
||||||
|
- filter output size
|
||||||
|
- downstream join input size
|
||||||
|
- aggregate input size
|
||||||
|
- whether an index is worthwhile
|
||||||
|
|
||||||
|
This is deceptively hard because selectivity depends on data distribution, not just syntax.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Join-order planning
|
||||||
|
|
||||||
|
Join planning is where bad estimates become especially expensive.
|
||||||
|
|
||||||
|
If a query joins several inputs, many legal join orders may exist. Their costs can differ dramatically because intermediate result sizes can explode or collapse depending on the order.
|
||||||
|
|
||||||
|
The optimizer therefore needs estimates for:
|
||||||
|
|
||||||
|
- base table sizes
|
||||||
|
- filter selectivities
|
||||||
|
- join key distinct counts
|
||||||
|
- skew and duplication patterns
|
||||||
|
|
||||||
|
If those estimates are poor, the engine may build large hash tables too early, shuffle too much data, or materialize huge intermediates unnecessarily.
|
||||||
|
|
||||||
|
This is one reason join optimization is considered one of the hardest parts of query planning.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Histograms and distributions
|
||||||
|
|
||||||
|
Simple row counts are rarely enough.
|
||||||
|
|
||||||
|
Two columns can have the same row count but very different value distributions.
|
||||||
|
|
||||||
|
Histograms help by approximating:
|
||||||
|
|
||||||
|
- frequency of values
|
||||||
|
- value ranges
|
||||||
|
- skew
|
||||||
|
|
||||||
|
They improve estimates for:
|
||||||
|
|
||||||
|
- range predicates
|
||||||
|
- uneven value distributions
|
||||||
|
- non-uniform grouping sizes
|
||||||
|
|
||||||
|
But histograms are still approximations, and they can become stale.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Correlation and independence assumptions
|
||||||
|
|
||||||
|
Many optimizers make simplifying assumptions such as:
|
||||||
|
|
||||||
|
- predicates are independent
|
||||||
|
- column distributions are uniform
|
||||||
|
- joins follow regular key patterns
|
||||||
|
|
||||||
|
These assumptions are convenient but often wrong.
|
||||||
|
|
||||||
|
For example, predicates on `city` and `country` are usually correlated. Treating them as independent can badly underestimate or overestimate result size.
|
||||||
|
|
||||||
|
This is one of the biggest reasons real-world cardinality estimation is difficult.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Error propagation
|
||||||
|
|
||||||
|
Estimation errors compound through a plan.
|
||||||
|
|
||||||
|
If an early filter estimate is off, then:
|
||||||
|
|
||||||
|
- join inputs are misestimated
|
||||||
|
- join outputs are misestimated
|
||||||
|
- aggregate sizes are misestimated
|
||||||
|
- memory and network costs are misestimated
|
||||||
|
|
||||||
|
By the time the optimizer compares full plans, small early mistakes may have become large planning errors.
|
||||||
|
|
||||||
|
This is why even mature query engines still struggle with difficult workloads.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Distributed cost models
|
||||||
|
|
||||||
|
Distributed engines have additional cost dimensions:
|
||||||
|
|
||||||
|
- shuffle volume
|
||||||
|
- partition balance
|
||||||
|
- scheduling overhead
|
||||||
|
- serialization cost
|
||||||
|
- network latency
|
||||||
|
|
||||||
|
In that world, a plan that is locally cheap may still be globally expensive if it causes large repartitioning or skewed tasks.
|
||||||
|
|
||||||
|
So distributed cost models are not just single-node cost models with larger numbers. They need explicit reasoning about data movement.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Statistics freshness
|
||||||
|
|
||||||
|
Statistics are only useful if they are reasonably current.
|
||||||
|
|
||||||
|
Problems appear when:
|
||||||
|
|
||||||
|
- tables grow significantly
|
||||||
|
- distributions drift
|
||||||
|
- partitions change
|
||||||
|
- indexes are added or removed
|
||||||
|
|
||||||
|
This creates a maintenance tradeoff:
|
||||||
|
|
||||||
|
- fresh statistics improve planning
|
||||||
|
- but collecting them can be expensive
|
||||||
|
|
||||||
|
Many systems therefore settle for approximate or periodic stats rather than perfect real-time accuracy.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Rule-based vs cost-based planning
|
||||||
|
|
||||||
|
This note should be read as a deepening of cost-based optimization, not a replacement for rule-based optimization.
|
||||||
|
|
||||||
|
Rule-based planning is still useful for:
|
||||||
|
|
||||||
|
- safe local rewrites
|
||||||
|
- simplifying plans
|
||||||
|
- obvious pushdowns
|
||||||
|
|
||||||
|
Cost-based planning becomes necessary when the engine must choose among many globally different alternatives.
|
||||||
|
|
||||||
|
The two approaches are complementary.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Main takeaways
|
||||||
|
|
||||||
|
- Cardinality estimation is often the most important input to a cost-based optimizer.
|
||||||
|
- Cost models compare legal plans using rough estimates of CPU, memory, I/O, and network work.
|
||||||
|
- Statistics such as row counts, distinct counts, and histograms make those estimates less blind.
|
||||||
|
- Join planning is especially sensitive to bad estimates.
|
||||||
|
- Correlation, skew, and stale statistics are major reasons optimizers make poor choices.
|
||||||
|
- Distributed engines must model data movement explicitly, not just local operator cost.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Related notes
|
||||||
|
|
||||||
|
- `hqew/005-query-optimization.md`
|
||||||
|
- `hqew/008-joins-and-aggregations.md`
|
||||||
|
- `hqew/009-distributed-query-engines.md`
|
||||||
|
- `hqew/014-data-sources-and-file-formats.md`
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Changelog
|
||||||
|
|
||||||
|
* **Apr 8, 2026** -- Added a dedicated note on cost models, optimizer statistics, and cardinality estimation.
|
||||||
Loading…
x
Reference in New Issue
Block a user