useful-notes/hqew/014-data-sources-and-file-formats.md

# Data Sources and File Formats

A reference for the boundary where external data enters a query engine.

---

## Short answer

A query engine needs a small, stable interface for reading data from many backends.

That interface should answer questions like:

- what schema does this source expose?
- how is the data scanned?
- what columns can be projected?
- what filters or other pushdowns can the source perform?
- what batch or streaming model does it support?

The file format matters because it strongly affects:

- schema fidelity
- parsing cost
- projection efficiency
- compression
- pushdown opportunities

So data sources are not just plumbing. They shape both planning and execution.

---

## Why this matters

Most engines do not operate over one monolithic storage system.

They often need to read from:

- CSV files
- Parquet files
- object stores
- database connectors
- in-memory tables
- remote services

Without a clean source boundary, the engine becomes tightly coupled to one representation and one storage layout.

With a clean source boundary, the engine can:

- plan against many backends
- reuse operators above the source layer
- push work down where possible
- preserve a clean distinction between engine logic and source-specific behavior

---

## What a source interface should expose

At minimum, a data source abstraction should expose:

1. schema discovery
2. scanning

In practice, useful source interfaces often also need to expose:

- projection support
- predicate pushdown support
- partition awareness
- ordering guarantees
- file- or block-level metadata
- statistics useful for planning

The challenge is to keep the interface:

- small enough to stay generic
- rich enough to preserve important performance features

That tradeoff is one of the core design decisions in a query engine.
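The minimal version of that interface (schema discovery plus scanning) can be sketched as follows. This is an illustration only: `Field`, `Schema`, `Batch`, `DataSource`, and `InMemorySource` are hypothetical names chosen for this note, not the API of any particular engine, and the optional `projection` argument is an assumed first step toward the richer capabilities listed above.

```python
from dataclasses import dataclass
from typing import Iterator, Optional, Protocol, Sequence


@dataclass(frozen=True)
class Field:
    name: str
    dtype: str            # illustrative type tag such as "int64" or "string"
    nullable: bool = True


@dataclass(frozen=True)
class Schema:
    fields: tuple[Field, ...]

    def names(self) -> list[str]:
        return [f.name for f in self.fields]


# A batch in this sketch is a mapping from column name to a list of values.
Batch = dict[str, list]


class DataSource(Protocol):
    """The two required operations: schema discovery and scanning."""

    def schema(self) -> Schema: ...

    def scan(self, projection: Optional[Sequence[str]] = None) -> Iterator[Batch]: ...


class InMemorySource:
    """Simplest possible backend: columns already held in memory."""

    def __init__(self, schema: Schema, columns: dict[str, list]) -> None:
        self._schema = schema
        self._columns = columns

    def schema(self) -> Schema:
        return self._schema

    def scan(self, projection: Optional[Sequence[str]] = None) -> Iterator[Batch]:
        names = list(projection) if projection is not None else self._schema.names()
        # One batch holding every row; a real source would yield many batches.
        yield {name: self._columns[name] for name in names}
```

Operators above the source only ever see `schema()` and `scan()`, so a CSV or Parquet backend could be swapped in without touching them.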

---

## Schema discovery

Planning depends on schema information before execution starts.

The engine needs to know:

- column names
- data types
- nullability
- sometimes partition columns or hidden metadata columns

There are two common patterns.

### Declared schema

The source is given an explicit schema ahead of time.

Strengths:

- predictable
- fast
- good for strongly typed pipelines

Weaknesses:

- can drift from the actual data
- requires external coordination

### Inferred schema

The source inspects data and infers structure.

Strengths:

- convenient
- good for ad hoc exploration

Weaknesses:

- often expensive
- can be fragile
- may infer overly weak or inconsistent types

This is why many engines support both.
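To make the fragility of inference concrete, here is a small sketch of sample-based inference over CSV text. The function names, the three-type lattice (`int64`, `float64`, `string`), and the rule that an empty cell means null are all assumptions made for illustration.

```python
import csv
import io
from itertools import islice


def infer_type(values: list[str]) -> str:
    """Pick the narrowest of int64/float64/string that fits every non-empty value."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "string"  # nothing to learn from; fall back to the weakest type
    for name, parse in (("int64", int), ("float64", float)):
        try:
            for v in non_empty:
                parse(v)
            return name
        except ValueError:
            continue
    return "string"


def infer_schema(text: str, sample_rows: int = 100) -> dict[str, tuple[str, bool]]:
    """Return {column: (type, nullable)} inferred from the first sample_rows rows."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    sample = list(islice(reader, sample_rows))
    schema = {}
    for i, name in enumerate(header):
        column = [row[i] for row in sample]
        schema[name] = (infer_type(column), "" in column)
    return schema


data = "id,price,note\n1,9.5,ok\n2,,\n3,n/a,late\n"
full = infer_schema(data)                    # sees the "n/a" in row 3
partial = infer_schema(data, sample_rows=2)  # stops before it
```

With the full sample, `price` is inferred as `string`; with only two sampled rows it is inferred as `float64`, and the third row would then fail at scan time. A declared schema avoids that surprise at the cost of having to be kept in sync with the data.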

---

## Scan behavior

Scanning is the runtime act of turning a source into batches the engine can consume.

Important scan questions include:

- does the source stream data incrementally or materialize eagerly?
- what is the batch size?
- does it preserve source ordering?
- does it decode only requested columns?
- can it skip files, row groups, or blocks?

In a batch-oriented engine, the scan operator is not just "read rows." It is "produce typed batches with predictable shape."

That is a much stronger contract.
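One way to picture that contract is a generator that pulls rows incrementally and yields fixed-size, typed, column-oriented batches. The sketch below assumes rows arrive as dictionaries of strings and that a hypothetical `converters` mapping supplies one parser per column; neither is taken from a specific engine.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator


def scan_batches(
    rows: Iterable[dict[str, str]],
    columns: list[str],
    converters: dict[str, Callable[[str], object]],
    batch_size: int = 1024,
) -> Iterator[dict[str, list]]:
    """Yield column-oriented batches of at most batch_size rows, in source order."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, batch_size))  # pulls only one batch worth of rows
        if not chunk:
            return
        # Every batch has the same columns and the same value types.
        yield {name: [converters[name](row[name]) for row in chunk] for name in columns}
```

Because the function is a generator over an iterator, nothing is materialized beyond the current batch, and every batch except possibly the last has exactly `batch_size` rows.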

---

## Projection pushdown

Projection pushdown means reading only the columns the plan actually needs.

This is often the first and easiest optimization at the source boundary.

It matters because analytical queries frequently touch:

- a few columns
- from very wide datasets

Projection pushdown reduces:

- I/O
- decoding cost
- memory traffic
- unnecessary batch width

Columnar formats benefit the most, because unused columns can often be skipped with little or no I/O and no decoding.
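A toy columnar store makes the saving visible. In this sketch `ColumnStore`, `required_columns`, and the `decoded_columns` counter are hypothetical, and "decoding" is simulated by parsing strings to integers.

```python
from typing import Optional


def required_columns(output: list[str], predicate_columns: list[str]) -> list[str]:
    """Columns the scan must produce: outputs plus anything a filter reads."""
    return list(dict.fromkeys([*output, *predicate_columns]))  # ordered, de-duplicated


class ColumnStore:
    """Toy columnar source: each column is stored, and decoded, independently."""

    def __init__(self, encoded: dict[str, list[str]]) -> None:
        self._encoded = encoded
        self.decoded_columns: list[str] = []  # instrumentation for the example

    def scan(self, projection: Optional[list[str]] = None) -> dict[str, list[int]]:
        names = projection if projection is not None else list(self._encoded)
        batch = {}
        for name in names:
            self.decoded_columns.append(name)
            batch[name] = [int(v) for v in self._encoded[name]]
        return batch


store = ColumnStore({
    "id": ["1", "2"], "price": ["10", "20"], "qty": ["3", "4"], "pad": ["0", "0"],
})
needed = required_columns(output=["id"], predicate_columns=["price", "id"])
batch = store.scan(needed)
```

Only `id` and `price` are decoded; `qty` and `pad` are never touched. A row-oriented text format would still have to tokenize every field of every line to find the two it needs.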

---

## Predicate pushdown

Predicate pushdown means applying filters inside the source, or at least using source metadata to skip irrelevant data.

This can happen at several levels:

- file pruning
- partition pruning
- row-group pruning
- index-assisted filtering
- direct source-side evaluation

Predicate pushdown can eliminate more work than projection pushdown, since it can skip whole files, row groups, or rows rather than just columns. It is also harder to implement well, because the engine must understand:

- which predicates are safe to push
- what the source can evaluate
- whether pushing a predicate changes only the cost of the query and never its results

So pushdown is both an optimization problem and an interface-design problem.
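Row-group pruning is the easiest of these levels to sketch. The example below assumes each row group carries min/max statistics for a single integer column; `RowGroup`, `may_contain`, and `scan_with_pruning` are hypothetical names, and only three comparison operators are handled.

```python
import operator
from dataclasses import dataclass

COMPARE = {"=": operator.eq, "<": operator.lt, ">": operator.gt}


@dataclass
class RowGroup:
    min_value: int
    max_value: int
    values: list[int]


def may_contain(group: RowGroup, op: str, literal: int) -> bool:
    """Conservative check: False only when no row in the group can match."""
    if op == "=":
        return group.min_value <= literal <= group.max_value
    if op == "<":
        return group.min_value < literal
    if op == ">":
        return group.max_value > literal
    raise ValueError(f"unsupported operator: {op}")


def scan_with_pruning(groups: list[RowGroup], op: str, literal: int) -> tuple[list[int], int]:
    """Return (matching values, number of row groups skipped via statistics)."""
    matches = COMPARE[op]
    kept = [g for g in groups if may_contain(g, op, literal)]
    # Statistics only rule groups out; surviving groups are still filtered row by row.
    rows = [v for g in kept for v in g.values if matches(v, literal)]
    return rows, len(groups) - len(kept)
```

The key property is that pruning is conservative: a group is skipped only when its statistics prove that no row can match, so the result is identical to filtering every row and only the cost changes.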

---

## CSV vs Parquet

This contrast is one of the most useful mental models for file-format choice.

### CSV

Good for:

- portability
- simplicity
- easy inspection

Weak for:

- type fidelity
- parsing cost
- projection efficiency
- null representation consistency
- analytical performance

CSV is row-oriented plain text: every value has to be re-parsed from a string on every read, which makes repeated scans expensive.

### Parquet

Good for:

- columnar access
- stronger typing
- compression
- metadata-driven skipping
- analytical scans

Weak for:

- human readability
- simplicity of implementation
- some update-heavy workflows

Parquet is much more natural for analytical query engines because it aligns with columnar, batch-oriented execution.
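The type-fidelity and null-representation weaknesses of CSV are easy to reproduce with the Python standard library alone. The helper name `csv_round_trip` is made up for this example.

```python
import csv
import io


def csv_round_trip(rows: list[dict], fieldnames: list[str]) -> list[dict]:
    """Write rows as CSV text, then read them back with no schema supplied."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    buf.seek(0)
    return list(csv.DictReader(buf))


original = [
    {"id": 1, "score": 9.5, "note": None},  # note is null
    {"id": 2, "score": None, "note": ""},   # note is an empty string
]
restored = csv_round_trip(original, ["id", "score", "note"])
```

After the round trip every value is a string, and the null `note` in the first row is indistinguishable from the empty-string `note` in the second. A typed format such as Parquet stores the column types and null information in the file, so the reader does not have to guess.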

---

## Source capabilities and planning

Not every source supports the same optimizations.

A planner often needs to know whether a source can:

- project columns
- evaluate filters
- expose statistics
- support partition pruning
- preserve sort order

This means source interfaces are not only runtime contracts. They also influence planning quality.

If the source abstraction hides too much, the engine loses optimization opportunities.

If it exposes too much, the abstraction becomes brittle and source-specific.

That is the central tension.
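One common way to manage that tension is to have each source advertise a small set of capabilities and let the planner decide what to push. The sketch below is a hypothetical shape for this: `Capabilities`, `Predicate`, and `split_predicates` are illustrative names, and describing filter support as a set of operator strings is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capabilities:
    projection: bool = False
    filter_ops: frozenset = frozenset()  # operators the source can evaluate itself


@dataclass(frozen=True)
class Predicate:
    column: str
    op: str
    literal: object


def split_predicates(
    predicates: list[Predicate], caps: Capabilities
) -> tuple[list[Predicate], list[Predicate]]:
    """Split filters into those pushed to the source and those the engine keeps."""
    pushed = [p for p in predicates if p.op in caps.filter_ops]
    residual = [p for p in predicates if p.op not in caps.filter_ops]
    return pushed, residual


# Hypothetical capability declarations for two sources.
columnar_caps = Capabilities(projection=True, filter_ops=frozenset({"=", "<", ">"}))
text_caps = Capabilities()

filters = [Predicate("price", ">", 5), Predicate("name", "like", "a%")]
pushed, residual = split_predicates(filters, columnar_caps)
```

The planner stays generic because it only inspects `Capabilities`, while the residual list guarantees correctness: anything the source cannot evaluate is still applied by an ordinary filter operator above the scan.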

---

## Source neutrality vs source exploitation

Good query engines try to be source-neutral at the operator level while still exploiting source-specific strengths.

That usually means:

- a generic scan abstraction
- plus capability-aware planning

This is often better than either extreme:

- fully generic but performance-blind
- or fully source-coupled but hard to extend

In practice, engines usually live somewhere in the middle.

---

## Main takeaways

- The data-source boundary is one of the most important interfaces in a query engine.
- Schema discovery is planning-critical, not just metadata housekeeping.
- Projection and predicate pushdown begin at the source boundary.
- File formats directly affect execution cost and optimization opportunities.
- CSV is convenient but weak for analytical execution; Parquet is far more engine-friendly.
- A good source abstraction balances genericity with enough capability exposure to support real optimization.

---

## Related notes

- `hqew/007-storage-and-indexes.md`
- `hqew/014-how-query-engines-work-part-1.md`
- `hqew/016-physical-plans-and-operators.md`
- `hqew/018-cost-models-statistics-and-cardinality-estimation.md`

---

## Changelog

* **Apr 8, 2026** -- Added a dedicated note on source interfaces, pushdown, schema discovery, and file-format tradeoffs.