diff --git a/hqew/014-data-sources-and-file-formats.md b/hqew/014-data-sources-and-file-formats.md new file mode 100644 index 0000000..9f596a1 --- /dev/null +++ b/hqew/014-data-sources-and-file-formats.md @@ -0,0 +1,293 @@ +# Data Sources and File Formats + +A reference for the boundary where external data enters a query engine. + +--- + +## Short answer + +A query engine needs a small, stable interface for reading data from many backends. + +That interface should answer questions like: + +- what schema does this source expose? +- how is the data scanned? +- what columns can be projected? +- what filters or other pushdowns can the source perform? +- what batch or streaming model does it support? + +The file format matters because it strongly affects: + +- schema fidelity +- parsing cost +- projection efficiency +- compression +- pushdown opportunities + +So data sources are not just plumbing. They shape both planning and execution. + +--- + +## Why this matters + +Most engines do not operate over one monolithic storage system. + +They often need to read from: + +- CSV files +- Parquet files +- object stores +- database connectors +- in-memory tables +- remote services + +Without a clean source boundary, the engine becomes tightly coupled to one representation and one storage layout. + +With a clean source boundary, the engine can: + +- plan against many backends +- reuse operators above the source layer +- push work down where possible +- preserve a clean distinction between engine logic and source-specific behavior + +--- + +## What a source interface should expose + +At minimum, a data source abstraction should expose: + +1. schema discovery +2. scanning + +In practice, useful source interfaces often also need to expose: + +- projection support +- predicate pushdown support +- partition awareness +- ordering guarantees +- file- or block-level metadata +- statistics useful for planning + +The challenge is to keep the interface: + +- small enough to stay generic +- rich enough to preserve important performance features + +That tradeoff is one of the core design decisions in a query engine. + +--- + +## Schema discovery + +Planning depends on schema information before execution starts. + +The engine needs to know: + +- column names +- data types +- nullability +- sometimes partition columns or hidden metadata columns + +There are two common patterns. + +### Declared schema + +The source is given an explicit schema ahead of time. + +Strengths: + +- predictable +- fast +- good for strongly typed pipelines + +Weaknesses: + +- can drift from the actual data +- requires external coordination + +### Inferred schema + +The source inspects data and infers structure. + +Strengths: + +- convenient +- good for ad hoc exploration + +Weaknesses: + +- often expensive +- can be fragile +- may infer overly weak or inconsistent types + +This is why many engines support both. + +--- + +## Scan behavior + +Scanning is the runtime act of turning a source into batches the engine can consume. + +Important scan questions include: + +- does the source stream data incrementally or materialize eagerly? +- what is the batch size? +- does it preserve source ordering? +- does it decode only requested columns? +- can it skip files, row groups, or blocks? + +In a batch-oriented engine, the scan operator is not just "read rows." It is "produce typed batches with predictable shape." + +That is a much stronger contract. + +--- + +## Projection pushdown + +Projection pushdown means reading only the columns the plan actually needs. + +This is often the first and easiest optimization at the source boundary. + +It matters because analytical queries frequently touch: + +- a few columns +- from very wide datasets + +Projection pushdown reduces: + +- I/O +- decoding cost +- memory traffic +- unnecessary batch width + +Columnar formats benefit especially strongly because unused columns can often be skipped almost completely. + +--- + +## Predicate pushdown + +Predicate pushdown means applying filters inside the source, or at least using source metadata to skip irrelevant data. + +This can happen at several levels: + +- file pruning +- partition pruning +- row-group pruning +- index-assisted filtering +- direct source-side evaluation + +Predicate pushdown is more powerful than projection pushdown, but also harder to implement well because the engine must understand: + +- which predicates are safe to push +- what the source can evaluate +- when pushdown changes cost but not semantics + +So pushdown is both an optimization problem and an interface-design problem. + +--- + +## CSV vs Parquet + +This contrast is one of the most useful mental models for file-format choice. + +### CSV + +Good for: + +- portability +- simplicity +- easy inspection + +Weak for: + +- type fidelity +- parsing cost +- projection efficiency +- null representation consistency +- analytical performance + +CSV is usually row-like, text-heavy, and expensive to parse repeatedly. + +### Parquet + +Good for: + +- columnar access +- stronger typing +- compression +- metadata-driven skipping +- analytical scans + +Weak for: + +- human readability +- simplicity of implementation +- some update-heavy workflows + +Parquet is much more natural for analytical query engines because it aligns with columnar, batch-oriented execution. + +--- + +## Source capabilities and planning + +Not every source supports the same optimizations. + +A planner often needs to know whether a source can: + +- project columns +- evaluate filters +- expose statistics +- support partition pruning +- preserve sort order + +This means source interfaces are not only runtime contracts. They also influence planning quality. + +If the source abstraction hides too much, the engine loses optimization opportunities. + +If it exposes too much, the abstraction becomes brittle and source-specific. + +That is the central tension. + +--- + +## Source neutrality vs source exploitation + +Good query engines try to be source-neutral at the operator level while still exploiting source-specific strengths. + +That usually means: + +- a generic scan abstraction +- plus capability-aware planning + +This is often better than either extreme: + +- fully generic but performance-blind +- or fully source-coupled but hard to extend + +In practice, engines usually live somewhere in the middle. + +--- + +## Main takeaways + +- The data-source boundary is one of the most important interfaces in a query engine. +- Schema discovery is planning-critical, not just metadata housekeeping. +- Projection and predicate pushdown begin at the source boundary. +- File formats directly affect execution cost and optimization opportunities. +- CSV is convenient but weak for analytical execution; Parquet is far more engine-friendly. +- A good source abstraction balances genericity with enough capability exposure to support real optimization. + +--- + +## Related notes + +- `hqew/007-storage-and-indexes.md` +- `hqew/014-how-query-engines-work-part-1.md` +- `hqew/016-physical-plans-and-operators.md` +- `hqew/018-cost-models-statistics-and-cardinality-estimation.md` + +--- + +## Changelog + +* **Apr 8, 2026** -- Added a dedicated note on source interfaces, pushdown, schema discovery, and file-format tradeoffs. diff --git a/hqew/015-sql-front-end-and-logical-planning.md b/hqew/015-sql-front-end-and-logical-planning.md index 6b4795c..f26329d 100644 --- a/hqew/015-sql-front-end-and-logical-planning.md +++ b/hqew/015-sql-front-end-and-logical-planning.md @@ -368,5 +368,5 @@ It is what allows: ## Changelog -* **Apr 7, 2026** -- Removed references to ignored local `tmp/` paths. +* **Apr 7, 2026** -- Removed references to ignored local paths. * **Apr 7, 2026** -- Added a dedicated note on SQL tokenization, parsing, AST structure, and logical planning. diff --git a/hqew/016-physical-plans-and-operators.md b/hqew/016-physical-plans-and-operators.md index bae7ea2..d9bd531 100644 --- a/hqew/016-physical-plans-and-operators.md +++ b/hqew/016-physical-plans-and-operators.md @@ -375,5 +375,5 @@ So this note should be read as a physical-plan primer, not as the full space of ## Changelog -* **Apr 7, 2026** -- Removed references to ignored local `tmp/` paths. +* **Apr 7, 2026** -- Removed references to ignored local paths. * **Apr 7, 2026** -- Added a dedicated note on physical planning, executable operators, and batch-oriented runtime behavior. diff --git a/hqew/017-expressions-types-and-nullability.md b/hqew/017-expressions-types-and-nullability.md index 8482a8c..5ab8254 100644 --- a/hqew/017-expressions-types-and-nullability.md +++ b/hqew/017-expressions-types-and-nullability.md @@ -380,5 +380,5 @@ That is why "choosing a type system" appears so early in the book. It affects mu ## Changelog -* **Apr 7, 2026** -- Removed references to ignored local `tmp/` paths. +* **Apr 7, 2026** -- Removed references to ignored local paths. * **Apr 7, 2026** -- Added a dedicated note on expressions, Arrow-based types, coercion, aggregates, and null handling. diff --git a/hqew/018-cost-models-statistics-and-cardinality-estimation.md b/hqew/018-cost-models-statistics-and-cardinality-estimation.md new file mode 100644 index 0000000..0872a7e --- /dev/null +++ b/hqew/018-cost-models-statistics-and-cardinality-estimation.md @@ -0,0 +1,284 @@ +# Cost Models, Statistics, and Cardinality Estimation + +A reference for how query engines estimate plan cost and why those estimates so often go wrong. + +--- + +## Short answer + +A cost-based optimizer needs estimates for: + +- how many rows each operator will produce +- how expensive each operator will be +- how much memory, I/O, CPU, and network work a plan will require + +The most important estimate is usually cardinality: + +- how many rows flow out of each stage + +If those estimates are bad, the optimizer may choose: + +- the wrong join order +- the wrong join algorithm +- the wrong access path +- the wrong distributed execution shape + +So statistics and cardinality estimation are central to serious optimization. + +--- + +## Why this matters + +Rule-based optimization can improve many queries with local rewrites such as: + +- projection pushdown +- predicate pushdown +- constant folding + +But once the engine must choose among many legal alternatives, it needs a basis for comparison. + +That basis is usually a cost model informed by statistics. + +Without that, the engine has little principled way to decide: + +- which join should happen first +- whether an index scan beats a full scan +- whether repartitioning is worth it +- whether a partial aggregation is worthwhile + +--- + +## Cardinality first + +Cardinality estimation is usually the most important subproblem because many other costs depend on row counts. + +If the engine can estimate: + +- input row counts +- filter selectivity +- grouping cardinality +- join output size + +then it can derive rough downstream costs. + +If it gets those row counts badly wrong, even a sophisticated cost model can make poor decisions. + +That is why cardinality estimation is often treated as the heart of the optimizer. + +--- + +## What a cost model tries to estimate + +A cost model usually approximates some combination of: + +- CPU work +- memory use +- local I/O +- network movement +- latency + +Different engines weight these differently depending on workload. + +For example: + +- a single-node analytical engine may care most about CPU and memory bandwidth +- a distributed engine may care heavily about shuffle size +- an interactive engine may favor latency over total throughput + +So a cost model is always tied to system goals, not just pure algebra. + +--- + +## Common sources of statistics + +Engines often collect or maintain: + +- total row count +- per-column distinct counts +- min/max values +- null counts +- histograms +- file or partition sizes +- index metadata + +These statistics can exist at several levels: + +- table-level +- partition-level +- file-level +- row-group or block-level + +More granular statistics can improve planning, but they also cost more to collect, store, and keep fresh. + +--- + +## Selectivity estimation + +Selectivity is the fraction of rows expected to survive a predicate. + +Examples: + +- `age > 18` +- `country = 'US'` +- `status IN (...)` + +The optimizer uses selectivity estimates to predict: + +- filter output size +- downstream join input size +- aggregate input size +- whether an index is worthwhile + +This is deceptively hard because selectivity depends on data distribution, not just syntax. + +--- + +## Join-order planning + +Join planning is where bad estimates become especially expensive. + +If a query joins several inputs, many legal join orders may exist. Their costs can differ dramatically because intermediate result sizes can explode or collapse depending on the order. + +The optimizer therefore needs estimates for: + +- base table sizes +- filter selectivities +- join key distinct counts +- skew and duplication patterns + +If those estimates are poor, the engine may build large hash tables too early, shuffle too much data, or materialize huge intermediates unnecessarily. + +This is one reason join optimization is considered one of the hardest parts of query planning. + +--- + +## Histograms and distributions + +Simple row counts are rarely enough. + +Two columns can have the same row count but very different value distributions. + +Histograms help by approximating: + +- frequency of values +- value ranges +- skew + +They improve estimates for: + +- range predicates +- uneven value distributions +- non-uniform grouping sizes + +But histograms are still approximations, and they can become stale. + +--- + +## Correlation and independence assumptions + +Many optimizers make simplifying assumptions such as: + +- predicates are independent +- column distributions are uniform +- joins follow regular key patterns + +These assumptions are convenient but often wrong. + +For example, predicates on `city` and `country` are usually correlated. Treating them as independent can badly underestimate or overestimate result size. + +This is one of the biggest reasons real-world cardinality estimation is difficult. + +--- + +## Error propagation + +Estimation errors compound through a plan. + +If an early filter estimate is off, then: + +- join inputs are misestimated +- join outputs are misestimated +- aggregate sizes are misestimated +- memory and network costs are misestimated + +By the time the optimizer compares full plans, small early mistakes may have become large planning errors. + +This is why even mature query engines still struggle with difficult workloads. + +--- + +## Distributed cost models + +Distributed engines have additional cost dimensions: + +- shuffle volume +- partition balance +- scheduling overhead +- serialization cost +- network latency + +In that world, a plan that is locally cheap may still be globally expensive if it causes large repartitioning or skewed tasks. + +So distributed cost models are not just single-node cost models with larger numbers. They need explicit reasoning about data movement. + +--- + +## Statistics freshness + +Statistics are only useful if they are reasonably current. + +Problems appear when: + +- tables grow significantly +- distributions drift +- partitions change +- indexes are added or removed + +This creates a maintenance tradeoff: + +- fresh statistics improve planning +- but collecting them can be expensive + +Many systems therefore settle for approximate or periodic stats rather than perfect real-time accuracy. + +--- + +## Rule-based vs cost-based planning + +This note should be read as a deepening of cost-based optimization, not a replacement for rule-based optimization. + +Rule-based planning is still useful for: + +- safe local rewrites +- simplifying plans +- obvious pushdowns + +Cost-based planning becomes necessary when the engine must choose among many globally different alternatives. + +The two approaches are complementary. + +--- + +## Main takeaways + +- Cardinality estimation is often the most important input to a cost-based optimizer. +- Cost models compare legal plans using rough estimates of CPU, memory, I/O, and network work. +- Statistics such as row counts, distinct counts, and histograms make those estimates less blind. +- Join planning is especially sensitive to bad estimates. +- Correlation, skew, and stale statistics are major reasons optimizers make poor choices. +- Distributed engines must model data movement explicitly, not just local operator cost. + +--- + +## Related notes + +- `hqew/005-query-optimization.md` +- `hqew/008-joins-and-aggregations.md` +- `hqew/009-distributed-query-engines.md` +- `hqew/014-data-sources-and-file-formats.md` + +--- + +## Changelog + +* **Apr 8, 2026** -- Added a dedicated note on cost models, optimizer statistics, and cardinality estimation.