# Data Sources and File Formats

A reference for the boundary where external data enters a query engine.

---

## Short answer

A query engine needs a small, stable interface for reading data from many backends.

That interface should answer questions like:

- what schema does this source expose?
- how is the data scanned?
- what columns can be projected?
- what filters or other pushdowns can the source perform?
- what batch or streaming model does it support?

The file format matters because it strongly affects:

- schema fidelity
- parsing cost
- projection efficiency
- compression
- pushdown opportunities

So data sources are not just plumbing. They shape both planning and execution.

---

## Why this matters

Most engines do not operate over one monolithic storage system.

They often need to read from:

- CSV files
- Parquet files
- object stores
- database connectors
- in-memory tables
- remote services

Without a clean source boundary, the engine becomes tightly coupled to one representation and one storage layout.

With a clean source boundary, the engine can:

- plan against many backends
- reuse operators above the source layer
- push work down where possible
- preserve a clean distinction between engine logic and source-specific behavior

---

## What a source interface should expose

At minimum, a data source abstraction should expose:

1. schema discovery
2. scanning

In practice, useful source interfaces often also need to expose:

- projection support
- predicate pushdown support
- partition awareness
- ordering guarantees
- file- or block-level metadata
- statistics useful for planning

The challenge is to keep the interface:

- small enough to stay generic
- rich enough to preserve important performance features

That tradeoff is one of the core design decisions in a query engine.
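
The minimum contract above (schema discovery plus scanning) can be sketched in a few lines. This is an illustrative sketch only, not the interface of any particular engine: the names `Field`, `DataSource`, and `InMemorySource` are hypothetical, and a batch is modeled as a plain dictionary of column lists.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, Optional, Protocol, Sequence


@dataclass(frozen=True)
class Field:
    name: str
    dtype: str            # e.g. "int64" or "string"; a real engine would use a type enum
    nullable: bool = True


# Assumption: a batch is a mapping from column name to a list of values.
Batch = Dict[str, list]


class DataSource(Protocol):
    """Hypothetical minimal source contract: schema discovery plus scanning."""

    def schema(self) -> Sequence[Field]: ...

    def scan(self, projection: Optional[Sequence[str]] = None) -> Iterator[Batch]: ...


class InMemorySource:
    """Toy backend that holds column-oriented data in Python lists."""

    def __init__(self, fields, columns, batch_size=2):
        self._fields = list(fields)
        self._columns = columns
        self._batch_size = batch_size

    def schema(self):
        return self._fields

    def scan(self, projection=None):
        # Default to every column when the caller does not ask for a projection.
        names = list(projection) if projection is not None else [f.name for f in self._fields]
        num_rows = len(next(iter(self._columns.values()), []))
        for start in range(0, num_rows, self._batch_size):
            end = start + self._batch_size
            yield {name: self._columns[name][start:end] for name in names}


source = InMemorySource(
    fields=[Field("id", "int64", nullable=False), Field("city", "string")],
    columns={"id": [1, 2, 3], "city": ["Oslo", None, "Lima"]},
)
batches = list(source.scan(projection=["id"]))
print(batches)
```

Operators above the scan only depend on `schema()` and `scan()`, so a CSV or Parquet backend could be swapped in without changing them. The richer features listed above (statistics, ordering, partitions) would be optional additions to this contract.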

---

## Schema discovery

Planning depends on schema information before execution starts.

The engine needs to know:

- column names
- data types
- nullability
- sometimes partition columns or hidden metadata columns

There are two common patterns.

### Declared schema

The source is given an explicit schema ahead of time.

Strengths:

- predictable
- fast
- good for strongly typed pipelines

Weaknesses:

- can drift from the actual data
- requires external coordination

### Inferred schema

The source inspects data and infers structure.

Strengths:

- convenient
- good for ad hoc exploration

Weaknesses:

- often expensive
- can be fragile
- may infer overly weak or inconsistent types

This is why many engines support both.
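
To make the fragility of inference concrete, here is a small sketch that infers a schema from a CSV sample using only the standard library. The helpers `infer_type` and `infer_schema` and the three type names are assumptions made for illustration; real engines use richer type lattices and sampling strategies.

```python
import csv
import io


def infer_type(values):
    """Pick the narrowest type that fits every non-empty sample value."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "string"
    for dtype, parse in (("int64", int), ("float64", float)):
        try:
            for v in non_empty:
                parse(v)
            return dtype
        except ValueError:
            continue
    return "string"


def infer_schema(text, sample_rows=100):
    """Infer (name, dtype, nullable) from the header and the first sample_rows rows."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    sample = [row for _, row in zip(range(sample_rows), reader)]
    schema = []
    for i, name in enumerate(header):
        column = [row[i] for row in sample]
        # An empty field in the sample is treated as a null.
        schema.append((name, infer_type(column), "" in column))
    return schema


data = "id,price,note\n1,10,ok\n2,,\n3,12.5,late\n"

full = infer_schema(data)                    # looks at every row
partial = infer_schema(data, sample_rows=1)  # looks at the first row only
print(full)
print(partial)
```

With the full sample, `price` is inferred as a nullable `float64`. With a one-row sample it is inferred as a non-nullable `int64`, which later rows would violate. A declared schema avoids that class of surprise at the cost of keeping the declaration in sync with the data.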

---

## Scan behavior

Scanning is the runtime act of turning a source into batches the engine can consume.

Important scan questions include:

- does the source stream data incrementally or materialize eagerly?
- what is the batch size?
- does it preserve source ordering?
- does it decode only requested columns?
- can it skip files, row groups, or blocks?

In a batch-oriented engine, the scan operator is not just "read rows." It is "produce typed batches with predictable shape."

That is a much stronger contract.
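
One way to picture that contract is a generator that reads rows lazily and yields column-oriented batches with fixed column names and parsed types. The function `scan_csv`, the `PARSERS` table, and the convention that an empty field means null are assumptions for this sketch.

```python
import csv
import io
from itertools import islice

# Assumption: three illustrative type names mapped to Python parsers.
PARSERS = {"int64": int, "float64": float, "string": str}


def scan_csv(lines, schema, batch_size):
    """Lazily yield column-oriented batches that all share the declared schema."""
    reader = csv.reader(lines)
    header = next(reader)
    if header != [name for name, _ in schema]:
        raise ValueError("file header does not match the declared schema")
    while True:
        # Pull at most batch_size rows; nothing beyond that is read yet.
        rows = list(islice(reader, batch_size))
        if not rows:
            return
        batch = {}
        for i, (name, dtype) in enumerate(schema):
            parse = PARSERS[dtype]
            # Empty fields become None; everything else is parsed to the declared type.
            batch[name] = [None if row[i] == "" else parse(row[i]) for row in rows]
        yield batch


text = "id,score\n1,0.5\n2,\n3,0.9\n"
schema = [("id", "int64"), ("score", "float64")]
batches = list(scan_csv(io.StringIO(text), schema, batch_size=2))
print(batches)
```

The generator streams incrementally, preserves source order, and caps every batch at `batch_size` rows, so downstream operators can rely on the shape and types of what they receive.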

---

## Projection pushdown

Projection pushdown means reading only the columns the plan actually needs.

This is often the first and easiest optimization at the source boundary.

It matters because analytical queries frequently touch:

- a few columns
- from very wide datasets

Projection pushdown reduces:

- I/O
- decoding cost
- memory traffic
- unnecessary batch width

Columnar formats benefit especially strongly because unused columns can often be skipped almost completely.
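
The effect on decoding cost can be shown with a toy columnar layout in which every column is stored as its own encoded chunk. `ColumnarFile` is a hypothetical class, and JSON stands in for a real column encoding. The `decoded` list records which chunks a scan actually touched.

```python
import json


class ColumnarFile:
    """Toy columnar layout: each column is stored as its own encoded chunk."""

    def __init__(self, columns):
        # Encode each column separately, as a columnar format would.
        self._chunks = {name: json.dumps(values) for name, values in columns.items()}
        self.decoded = []  # names of the chunks that scans have decoded

    def scan(self, projection=None):
        names = list(self._chunks) if projection is None else list(projection)
        result = {}
        for name in names:
            self.decoded.append(name)
            result[name] = json.loads(self._chunks[name])
        return result


f = ColumnarFile({
    "id": [1, 2, 3],
    "name": ["a", "b", "c"],
    "amount": [10.0, 20.5, 7.25],
    "notes": ["x", "y", "z"],
})

# A query like SELECT sum(amount) needs only one of the four columns.
out = f.scan(projection=["amount"])
total = sum(out["amount"])
print(total, f.decoded)
```

Only the `amount` chunk is decoded. In a row-oriented text format the same query would still have to tokenize every field of every row before discarding the unused ones.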

---

## Predicate pushdown

Predicate pushdown means applying filters inside the source, or at least using source metadata to skip irrelevant data.

This can happen at several levels:

- file pruning
- partition pruning
- row-group pruning
- index-assisted filtering
- direct source-side evaluation

Predicate pushdown can eliminate more work than projection pushdown, since it can skip whole files, partitions, or row groups, but it is also harder to implement well because the engine must understand:

- which predicates are safe to push
- what the source can evaluate
- when pushdown changes cost but not semantics

So pushdown is both an optimization problem and an interface-design problem.
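
Row-group pruning is the easiest of these levels to sketch. The example below assumes each group carries min/max statistics for one integer column. `RowGroup`, `make_row_groups`, and `scan_greater_than` are hypothetical names, and only a single `value > threshold` predicate is handled.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    min_value: int
    max_value: int
    values: list


def make_row_groups(values, group_size):
    """Split values into groups and record min/max statistics for each group."""
    groups = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        groups.append(RowGroup(min(chunk), max(chunk), chunk))
    return groups


def scan_greater_than(groups, threshold):
    """Return values > threshold, skipping groups whose statistics rule them out."""
    matched, groups_read = [], 0
    for group in groups:
        if group.max_value <= threshold:
            continue  # pruned: no row in this group can satisfy value > threshold
        groups_read += 1
        matched.extend(v for v in group.values if v > threshold)
    return matched, groups_read


values = [3, 8, 1, 15, 12, 20, 4, 2, 6]
groups = make_row_groups(values, group_size=3)
matched, groups_read = scan_greater_than(groups, threshold=10)
print(matched, groups_read)
```

Two of the three groups are skipped without reading their values, yet the result is identical to filtering every row. That is the property to preserve: pruning may change cost but must not change the answer.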

---

## CSV vs Parquet

This contrast is one of the most useful mental models for file-format choice.

### CSV

Good for:

- portability
- simplicity
- easy inspection

Weak for:

- type fidelity
- parsing cost
- projection efficiency
- null representation consistency
- analytical performance

CSV is row-oriented and text-based, so every scan pays the parsing cost again.

### Parquet

Good for:

- columnar access
- stronger typing
- compression
- metadata-driven skipping
- analytical scans

Weak for:

- human readability
- simplicity of implementation
- some update-heavy workflows

Parquet is much more natural for analytical query engines because it aligns with columnar, batch-oriented execution.
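
The type-fidelity and null-representation weaknesses of CSV are easy to reproduce with Python's standard `csv` module. This is a small illustration rather than a benchmark, and the sample rows are made up for the purpose.

```python
import csv
import io

rows = [
    {"id": 1, "amount": 2.50, "note": None},  # note is a true null
    {"id": 2, "amount": 10.0, "note": ""},    # note is an empty string
]

# Write typed rows out as CSV text.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "amount", "note"])
writer.writeheader()
writer.writerows(rows)

# Read the same text back.
buffer.seek(0)
round_tripped = list(csv.DictReader(buffer))
print(round_tripped)
```

After the round trip every value is a string, and the null and the empty string have collapsed into the same `""`. A reader has to recover types and nulls by convention or inference, whereas a format that stores its schema alongside the data, as Parquet does, avoids that step.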

---

## Source capabilities and planning

Not every source supports the same optimizations.

A planner often needs to know whether a source can:

- project columns
- evaluate filters
- expose statistics
- support partition pruning
- preserve sort order

This means source interfaces are not only runtime contracts. They also influence planning quality.

If the source abstraction hides too much, the engine loses optimization opportunities.

If it exposes too much, the abstraction becomes brittle and source-specific.

That is the central tension.
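
One middle ground is for each source to declare a small set of capabilities and for the planner to split work accordingly. The sketch below is hypothetical throughout: `Capabilities`, `Predicate`, and `plan_scan` are illustrative names, and filter support is reduced to a set of operator strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    column: str
    op: str        # e.g. "=", ">", "like"
    value: object


@dataclass(frozen=True)
class Capabilities:
    projection: bool = False
    pushable_ops: frozenset = frozenset()  # operators the source can evaluate itself


def plan_scan(caps, needed_columns, predicates):
    """Split work between the source and the engine based on declared capabilities."""
    pushed = [p for p in predicates if p.op in caps.pushable_ops]
    residual = [p for p in predicates if p.op not in caps.pushable_ops]
    return {
        # None means the source returns all columns and the engine projects later.
        "projection": list(needed_columns) if caps.projection else None,
        "pushed_filters": pushed,
        # Residual filters run in a filter operator above the scan.
        "residual_filters": residual,
    }


predicates = [Predicate("year", "=", 2024), Predicate("name", "like", "A%")]
parquet_like = Capabilities(projection=True, pushable_ops=frozenset({"=", "<", ">"}))
csv_like = Capabilities()

plan_a = plan_scan(parquet_like, ["name", "year"], predicates)
plan_b = plan_scan(csv_like, ["name", "year"], predicates)
print(plan_a)
print(plan_b)
```

The same query yields different scan plans for the two sources, while the operators above the scan stay identical. Keeping the capability set small is what stops the abstraction from becoming source-specific.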

---

## Source neutrality vs source exploitation

Good query engines try to be source-neutral at the operator level while still exploiting source-specific strengths.

That usually means:

- a generic scan abstraction
- plus capability-aware planning

This is often better than either extreme:

- fully generic but performance-blind
- or fully source-coupled but hard to extend

In practice, engines usually live somewhere in the middle.

---

## Main takeaways

- The data-source boundary is one of the most important interfaces in a query engine.
- Schema discovery is planning-critical, not just metadata housekeeping.
- Projection and predicate pushdown begin at the source boundary.
- File formats directly affect execution cost and optimization opportunities.
- CSV is convenient but weak for analytical execution; Parquet is far more engine-friendly.
- A good source abstraction balances genericity with enough capability exposure to support real optimization.

---

## Related notes

- `hqew/007-storage-and-indexes.md`
- `hqew/014-how-query-engines-work-part-1.md`
- `hqew/016-physical-plans-and-operators.md`
- `hqew/018-cost-models-statistics-and-cardinality-estimation.md`

---

## Changelog

* **Apr 8, 2026** -- Added a dedicated note on source interfaces, pushdown, schema discovery, and file-format tradeoffs.