diff --git a/hqew/014-data-sources-and-file-formats.md b/hqew/014-data-sources-and-file-formats.md
new file mode 100644
index 0000000..9f596a1
--- /dev/null
+++ b/hqew/014-data-sources-and-file-formats.md
@@ -0,0 +1,293 @@
+# Data Sources and File Formats
+
+A reference for the boundary where external data enters a query engine.
+
+---
+
+## Short answer
+
+A query engine needs a small, stable interface for reading data from many backends.
+
+That interface should answer questions like:
+
+- what schema does this source expose?
+- how is the data scanned?
+- what columns can be projected?
+- what filters or other pushdowns can the source perform?
+- what batch or streaming model does it support?
+
+The file format matters because it strongly affects:
+
+- schema fidelity
+- parsing cost
+- projection efficiency
+- compression
+- pushdown opportunities
+
+So data sources are not just plumbing. They shape both planning and execution.
+
+---
+
+## Why this matters
+
+Most engines do not operate over one monolithic storage system.
+
+They often need to read from:
+
+- CSV files
+- Parquet files
+- object stores
+- database connectors
+- in-memory tables
+- remote services
+
+Without a clean source boundary, the engine becomes tightly coupled to one representation and one storage layout.
+
+With a clean source boundary, the engine can:
+
+- plan against many backends
+- reuse operators above the source layer
+- push work down where possible
+- preserve a clean distinction between engine logic and source-specific behavior
+
+---
+
+## What a source interface should expose
+
+At minimum, a data source abstraction should expose:
+
+1. schema discovery
+2. scanning
+
+In practice, useful source interfaces often also need to expose:
+
+- projection support
+- predicate pushdown support
+- partition awareness
+- ordering guarantees
+- file- or block-level metadata
+- statistics useful for planning
+
+The challenge is to keep the interface:
+
+- small enough to stay generic
+- rich enough to preserve important performance features
+
+That tradeoff is one of the core design decisions in a query engine.
+
+---
+
+## Schema discovery
+
+Planning depends on schema information before execution starts.
+
+The engine needs to know:
+
+- column names
+- data types
+- nullability
+- sometimes partition columns or hidden metadata columns
+
+There are two common patterns.
+
+### Declared schema
+
+The source is given an explicit schema ahead of time.
+
+Strengths:
+
+- predictable
+- fast
+- good for strongly typed pipelines
+
+Weaknesses:
+
+- can drift from the actual data
+- requires external coordination
+
+### Inferred schema
+
+The source inspects data and infers structure.
+
+Strengths:
+
+- convenient
+- good for ad hoc exploration
+
+Weaknesses:
+
+- often expensive
+- can be fragile
+- may infer overly weak or inconsistent types
+
+This is why many engines support both.
+
+---
+
+## Scan behavior
+
+Scanning is the runtime act of turning a source into batches the engine can consume.
+
+Important scan questions include:
+
+- does the source stream data incrementally or materialize eagerly?
+- what is the batch size?
+- does it preserve source ordering?
+- does it decode only requested columns?
+- can it skip files, row groups, or blocks?
+
+In a batch-oriented engine, the scan operator is not just "read rows." It is "produce typed batches with predictable shape."
+
+That is a much stronger contract.
+
+---
+
+## Projection pushdown
+
+Projection pushdown means reading only the columns the plan actually needs.
+
+This is often the first and easiest optimization at the source boundary.
+
+It matters because analytical queries frequently touch:
+
+- a few columns
+- from very wide datasets
+
+Projection pushdown reduces:
+
+- I/O
+- decoding cost
+- memory traffic
+- unnecessary batch width
+
+Columnar formats benefit especially strongly because unused columns can often be skipped almost completely.
+
+---
+
+## Predicate pushdown
+
+Predicate pushdown means applying filters inside the source, or at least using source metadata to skip irrelevant data.
+
+This can happen at several levels:
+
+- file pruning
+- partition pruning
+- row-group pruning
+- index-assisted filtering
+- direct source-side evaluation
+
+Predicate pushdown is more powerful than projection pushdown, but also harder to implement well because the engine must understand:
+
+- which predicates are safe to push
+- what the source can evaluate
+- when pushdown changes cost but not semantics
+
+So pushdown is both an optimization problem and an interface-design problem.
+
+---
+
+## CSV vs Parquet
+
+This contrast is one of the most useful mental models for file-format choice.
+
+### CSV
+
+Good for:
+
+- portability
+- simplicity
+- easy inspection
+
+Weak for:
+
+- type fidelity
+- parsing cost
+- projection efficiency
+- null representation consistency
+- analytical performance
+
+CSV is usually row-like, text-heavy, and expensive to parse repeatedly.
+
+### Parquet
+
+Good for:
+
+- columnar access
+- stronger typing
+- compression
+- metadata-driven skipping
+- analytical scans
+
+Weak for:
+
+- human readability
+- simplicity of implementation
+- some update-heavy workflows
+
+Parquet is much more natural for analytical query engines because it aligns with columnar, batch-oriented execution.
+
+---
+
+## Source capabilities and planning
+
+Not every source supports the same optimizations.
+
+A planner often needs to know whether a source can:
+
+- project columns
+- evaluate filters
+- expose statistics
+- support partition pruning
+- preserve sort order
+
+This means source interfaces are not only runtime contracts. They also influence planning quality.
+
+If the source abstraction hides too much, the engine loses optimization opportunities.
+
+If it exposes too much, the abstraction becomes brittle and source-specific.
+
+That is the central tension.
+
+---
+
+## Source neutrality vs source exploitation
+
+Good query engines try to be source-neutral at the operator level while still exploiting source-specific strengths.
+
+That usually means:
+
+- a generic scan abstraction
+- plus capability-aware planning
+
+This is often better than either extreme:
+
+- fully generic but performance-blind
+- or fully source-coupled but hard to extend
+
+In practice, engines usually live somewhere in the middle.
+
+---
+
+## Main takeaways
+
+- The data-source boundary is one of the most important interfaces in a query engine.
+- Schema discovery is planning-critical, not just metadata housekeeping.
+- Projection and predicate pushdown begin at the source boundary.
+- File formats directly affect execution cost and optimization opportunities.
+- CSV is convenient but weak for analytical execution; Parquet is far more engine-friendly.
+- A good source abstraction balances genericity with enough capability exposure to support real optimization.
+
+---
+
+## Related notes
+
+- `hqew/007-storage-and-indexes.md`
+- `hqew/014-how-query-engines-work-part-1.md`
+- `hqew/016-physical-plans-and-operators.md`
+- `hqew/018-cost-models-statistics-and-cardinality-estimation.md`
+
+---
+
+## Changelog
+
+* **Apr 8, 2026** -- Added a dedicated note on source interfaces, pushdown, schema discovery, and file-format tradeoffs.
diff --git a/hqew/015-sql-front-end-and-logical-planning.md b/hqew/015-sql-front-end-and-logical-planning.md
index 6b4795c..f26329d 100644
--- a/hqew/015-sql-front-end-and-logical-planning.md
+++ b/hqew/015-sql-front-end-and-logical-planning.md
@@ -368,5 +368,5 @@ It is what allows:
 
 ## Changelog
 
-* **Apr 7, 2026** -- Removed references to ignored local `tmp/` paths.
+* **Apr 7, 2026** -- Removed references to ignored local paths.
 * **Apr 7, 2026** -- Added a dedicated note on SQL tokenization, parsing, AST structure, and logical planning.
diff --git a/hqew/016-physical-plans-and-operators.md b/hqew/016-physical-plans-and-operators.md
index bae7ea2..d9bd531 100644
--- a/hqew/016-physical-plans-and-operators.md
+++ b/hqew/016-physical-plans-and-operators.md
@@ -375,5 +375,5 @@ So this note should be read as a physical-plan primer, not as the full space of
 
 ## Changelog
 
-* **Apr 7, 2026** -- Removed references to ignored local `tmp/` paths.
+* **Apr 7, 2026** -- Removed references to ignored local paths.
 * **Apr 7, 2026** -- Added a dedicated note on physical planning, executable operators, and batch-oriented runtime behavior.
diff --git a/hqew/017-expressions-types-and-nullability.md b/hqew/017-expressions-types-and-nullability.md
index 8482a8c..5ab8254 100644
--- a/hqew/017-expressions-types-and-nullability.md
+++ b/hqew/017-expressions-types-and-nullability.md
@@ -380,5 +380,5 @@ That is why "choosing a type system" appears so early in the book. It affects mu
 
 ## Changelog
 
-* **Apr 7, 2026** -- Removed references to ignored local `tmp/` paths.
+* **Apr 7, 2026** -- Removed references to ignored local paths.
 * **Apr 7, 2026** -- Added a dedicated note on expressions, Arrow-based types, coercion, aggregates, and null handling.
diff --git a/hqew/018-cost-models-statistics-and-cardinality-estimation.md b/hqew/018-cost-models-statistics-and-cardinality-estimation.md
new file mode 100644
index 0000000..0872a7e
--- /dev/null
+++ b/hqew/018-cost-models-statistics-and-cardinality-estimation.md
@@ -0,0 +1,284 @@
+# Cost Models, Statistics, and Cardinality Estimation
+
+A reference for how query engines estimate plan cost and why those estimates so often go wrong.
+
+---
+
+## Short answer
+
+A cost-based optimizer needs estimates for:
+
+- how many rows each operator will produce
+- how expensive each operator will be
+- how much memory, I/O, CPU, and network work a plan will require
+
+The most important estimate is usually cardinality:
+
+- how many rows flow out of each stage
+
+If those estimates are bad, the optimizer may choose:
+
+- the wrong join order
+- the wrong join algorithm
+- the wrong access path
+- the wrong distributed execution shape
+
+So statistics and cardinality estimation are central to serious optimization.
+
+---
+
+## Why this matters
+
+Rule-based optimization can improve many queries with local rewrites such as:
+
+- projection pushdown
+- predicate pushdown
+- constant folding
+
+But once the engine must choose among many legal alternatives, it needs a basis for comparison.
+
+That basis is usually a cost model informed by statistics.
+
+Without that, the engine has little principled way to decide:
+
+- which join should happen first
+- whether an index scan beats a full scan
+- whether repartitioning is worth it
+- whether a partial aggregation is worthwhile
+
+---
+
+## Cardinality first
+
+Cardinality estimation is usually the most important subproblem because many other costs depend on row counts.
+
+If the engine can estimate:
+
+- input row counts
+- filter selectivity
+- grouping cardinality
+- join output size
+
+then it can derive rough downstream costs.
+
+If it gets those row counts badly wrong, even a sophisticated cost model can make poor decisions.
+
+That is why cardinality estimation is often treated as the heart of the optimizer.
+
+---
+
+## What a cost model tries to estimate
+
+A cost model usually approximates some combination of:
+
+- CPU work
+- memory use
+- local I/O
+- network movement
+- latency
+
+Different engines weight these differently depending on workload.
+
+For example:
+
+- a single-node analytical engine may care most about CPU and memory bandwidth
+- a distributed engine may care heavily about shuffle size
+- an interactive engine may favor latency over total throughput
+
+So a cost model is always tied to system goals, not just pure algebra.
+
+---
+
+## Common sources of statistics
+
+Engines often collect or maintain:
+
+- total row count
+- per-column distinct counts
+- min/max values
+- null counts
+- histograms
+- file or partition sizes
+- index metadata
+
+These statistics can exist at several levels:
+
+- table-level
+- partition-level
+- file-level
+- row-group or block-level
+
+More granular statistics can improve planning, but they also cost more to collect, store, and keep fresh.
+
+---
+
+## Selectivity estimation
+
+Selectivity is the fraction of rows expected to survive a predicate.
+
+Examples:
+
+- `age > 18`
+- `country = 'US'`
+- `status IN (...)`
+
+The optimizer uses selectivity estimates to predict:
+
+- filter output size
+- downstream join input size
+- aggregate input size
+- whether an index is worthwhile
+
+This is deceptively hard because selectivity depends on data distribution, not just syntax.
+
+---
+
+## Join-order planning
+
+Join planning is where bad estimates become especially expensive.
+
+If a query joins several inputs, many legal join orders may exist. Their costs can differ dramatically because intermediate result sizes can explode or collapse depending on the order.
+
+The optimizer therefore needs estimates for:
+
+- base table sizes
+- filter selectivities
+- join key distinct counts
+- skew and duplication patterns
+
+If those estimates are poor, the engine may build large hash tables too early, shuffle too much data, or materialize huge intermediates unnecessarily.
+
+This is one reason join optimization is considered one of the hardest parts of query planning.
+
+---
+
+## Histograms and distributions
+
+Simple row counts are rarely enough.
+
+Two columns can have the same row count but very different value distributions.
+
+Histograms help by approximating:
+
+- frequency of values
+- value ranges
+- skew
+
+They improve estimates for:
+
+- range predicates
+- uneven value distributions
+- non-uniform grouping sizes
+
+But histograms are still approximations, and they can become stale.
+
+---
+
+## Correlation and independence assumptions
+
+Many optimizers make simplifying assumptions such as:
+
+- predicates are independent
+- column distributions are uniform
+- joins follow regular key patterns
+
+These assumptions are convenient but often wrong.
+
+For example, predicates on `city` and `country` are usually correlated. Treating them as independent can badly underestimate or overestimate result size.
+
+This is one of the biggest reasons real-world cardinality estimation is difficult.
+
+---
+
+## Error propagation
+
+Estimation errors compound through a plan.
+
+If an early filter estimate is off, then:
+
+- join inputs are misestimated
+- join outputs are misestimated
+- aggregate sizes are misestimated
+- memory and network costs are misestimated
+
+By the time the optimizer compares full plans, small early mistakes may have become large planning errors.
+
+This is why even mature query engines still struggle with difficult workloads.
+
+---
+
+## Distributed cost models
+
+Distributed engines have additional cost dimensions:
+
+- shuffle volume
+- partition balance
+- scheduling overhead
+- serialization cost
+- network latency
+
+In that world, a plan that is locally cheap may still be globally expensive if it causes large repartitioning or skewed tasks.
+
+So distributed cost models are not just single-node cost models with larger numbers. They need explicit reasoning about data movement.
+
+---
+
+## Statistics freshness
+
+Statistics are only useful if they are reasonably current.
+
+Problems appear when:
+
+- tables grow significantly
+- distributions drift
+- partitions change
+- indexes are added or removed
+
+This creates a maintenance tradeoff:
+
+- fresh statistics improve planning
+- but collecting them can be expensive
+
+Many systems therefore settle for approximate or periodic stats rather than perfect real-time accuracy.
+
+---
+
+## Rule-based vs cost-based planning
+
+This note should be read as a deepening of cost-based optimization, not a replacement for rule-based optimization.
+
+Rule-based planning is still useful for:
+
+- safe local rewrites
+- simplifying plans
+- obvious pushdowns
+
+Cost-based planning becomes necessary when the engine must choose among many globally different alternatives.
+
+The two approaches are complementary.
+
+---
+
+## Main takeaways
+
+- Cardinality estimation is often the most important input to a cost-based optimizer.
+- Cost models compare legal plans using rough estimates of CPU, memory, I/O, and network work.
+- Statistics such as row counts, distinct counts, and histograms make those estimates less blind.
+- Join planning is especially sensitive to bad estimates.
+- Correlation, skew, and stale statistics are major reasons optimizers make poor choices.
+- Distributed engines must model data movement explicitly, not just local operator cost.
+
+---
+
+## Related notes
+
+- `hqew/005-query-optimization.md`
+- `hqew/008-joins-and-aggregations.md`
+- `hqew/009-distributed-query-engines.md`
+- `hqew/014-data-sources-and-file-formats.md`
+
+---
+
+## Changelog
+
+* **Apr 8, 2026** -- Added a dedicated note on cost models, optimizer statistics, and cardinality estimation.