Data Sources and File Formats
A reference for the boundary where external data enters a query engine.
Short answer
A query engine needs a small, stable interface for reading data from many backends.
That interface should answer questions like:
- what schema does this source expose?
- how is the data scanned?
- what columns can be projected?
- what filters or other pushdowns can the source perform?
- what batch or streaming model does it support?
The file format matters because it strongly affects:
- schema fidelity
- parsing cost
- projection efficiency
- compression
- pushdown opportunities
So data sources are not just plumbing. They shape both planning and execution.
Why this matters
Most engines do not operate over one monolithic storage system.
They often need to read from:
- CSV files
- Parquet files
- object stores
- database connectors
- in-memory tables
- remote services
Without a clean source boundary, the engine becomes tightly coupled to one representation and one storage layout.
With a clean source boundary, the engine can:
- plan against many backends
- reuse operators above the source layer
- push work down where possible
- preserve a clean distinction between engine logic and source-specific behavior
What a source interface should expose
At minimum, a data source abstraction should expose:
- schema discovery
- scanning
In practice, useful source interfaces often also need to expose:
- projection support
- predicate pushdown support
- partition awareness
- ordering guarantees
- file- or block-level metadata
- statistics useful for planning
The challenge is to keep the interface:
- small enough to stay generic
- rich enough to preserve important performance features
That tradeoff is one of the core design decisions in a query engine.
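As a rough illustration of the minimum surface described above, the following sketch defines a source interface with only schema discovery and scanning, plus optional projection. The names `Field`, `Batch`, `DataSource`, and `InMemorySource` are hypothetical and chosen for this example; they do not come from any particular engine.

```python
from dataclasses import dataclass
from typing import Iterator, Optional, Protocol, Sequence


@dataclass(frozen=True)
class Field:
    name: str
    dtype: str            # simplified type tag such as "int64" or "string"
    nullable: bool = True


# For this sketch a batch is just a mapping of column name -> list of values.
Batch = dict[str, list]


class DataSource(Protocol):
    """Hypothetical minimal contract: schema discovery plus scanning."""

    def schema(self) -> Sequence[Field]: ...

    def scan(self, projection: Optional[Sequence[str]] = None) -> Iterator[Batch]: ...


class InMemorySource:
    """A trivial source that satisfies the protocol above."""

    def __init__(self, fields, columns):
        self._fields = list(fields)
        self._columns = columns

    def schema(self):
        return self._fields

    def scan(self, projection=None):
        # With no projection, expose every column in schema order.
        names = projection if projection is not None else [f.name for f in self._fields]
        yield {name: self._columns[name] for name in names}
```

Richer capabilities such as predicate pushdown or statistics could be layered on as optional methods, which is one way to keep the core contract small.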
Schema discovery
Planning depends on schema information before execution starts.
The engine needs to know:
- column names
- data types
- nullability
- sometimes partition columns or hidden metadata columns
There are two common patterns.
Declared schema
The source is given an explicit schema ahead of time.
Strengths:
- predictable
- fast
- good for strongly typed pipelines
Weaknesses:
- can drift from the actual data
- requires external coordination
Inferred schema
The source inspects data and infers structure.
Strengths:
- convenient
- good for ad hoc exploration
Weaknesses:
- often expensive
- can be fragile
- may infer overly broad types (for example, string for a numeric column) or types that differ from one file to the next
This is why many engines support both.
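To make the fragility of inference concrete, here is a minimal sketch of sample-based schema inference over CSV text using only the standard library. The functions `infer_type` and `infer_schema`, the three-type lattice, and the rule that an empty field means null are all assumptions made for this example.

```python
import csv
import io


def infer_type(values):
    """Return the narrowest of int64 / float64 / string that fits all non-empty values."""
    non_empty = [v for v in values if v != ""]
    for dtype, parse in (("int64", int), ("float64", float)):
        try:
            for v in non_empty:
                parse(v)
            return dtype
        except ValueError:
            continue
    return "string"


def infer_schema(text, sample_rows=100):
    """Infer {column: (dtype, nullable)} from at most `sample_rows` data rows."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    sample = [row for _, row in zip(range(sample_rows), reader)]
    schema = {}
    for i, name in enumerate(header):
        values = [row[i] for row in sample]
        # Assumption: an empty field represents a null.
        schema[name] = (infer_type(values), "" in values)
    return schema


data = "id\n1\n2\nabc\n"
small_sample = infer_schema(data, sample_rows=2)   # only sees "1" and "2"
full_sample = infer_schema(data)                   # also sees "abc"
```

In this sketch the two-row sample infers `int64` while the full pass infers `string`, which is the kind of inconsistency a declared schema avoids.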
Scan behavior
Scanning is the runtime act of turning a source into batches the engine can consume.
Important scan questions include:
- does the source stream data incrementally or materialize eagerly?
- what is the batch size?
- does it preserve source ordering?
- does it decode only requested columns?
- can it skip files, row groups, or blocks?
In a batch-oriented engine, the scan operator is not just "read rows." It is "produce typed batches with predictable shape."
That is a much stronger contract.
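One way to picture that contract is a scan that streams incrementally and yields fixed-size, column-oriented, typed batches. The sketch below assumes a hypothetical `scan_csv` function, a schema given as a mapping from column name to a cast function, and empty fields treated as nulls.

```python
import csv
import io


def scan_csv(lines, schema, batch_size=1024):
    """Lazily yield batches of the form {column: [typed values]}.

    `lines` is any iterable of CSV text lines (header first).
    `schema` maps column name -> cast function, e.g. {"id": int}.
    """
    reader = csv.DictReader(lines)
    batch = {name: [] for name in schema}
    rows = 0
    for record in reader:
        for name, cast in schema.items():
            raw = record[name]
            # Assumption: an empty field represents a null.
            batch[name].append(None if raw == "" else cast(raw))
        rows += 1
        if rows == batch_size:
            yield batch
            batch = {name: [] for name in schema}
            rows = 0
    if rows:
        yield batch  # final, possibly short, batch


text = "id,price\n1,1.5\n2,\n3,3.0\n"
batches = list(scan_csv(io.StringIO(text), {"id": int, "price": float}, batch_size=2))
```

Because `scan_csv` is a generator, nothing is read until the consumer asks for the next batch, and every batch has the same columns and value types.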
Projection pushdown
Projection pushdown means reading only the columns the plan actually needs.
This is often the first and easiest optimization at the source boundary.
It matters because analytical queries frequently touch:
- a few columns
- from very wide datasets
Projection pushdown reduces:
- I/O
- decoding cost
- memory traffic
- unnecessary batch width
Columnar formats benefit especially strongly because unused columns can often be skipped almost completely.
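The effect can be sketched with a toy columnar source in which each column is stored as a separately encoded blob. The class `ColumnarSource` and its use of JSON as the encoding are assumptions for illustration only; real columnar formats use far more compact encodings.

```python
import json


class ColumnarSource:
    """Toy columnar store: one independently encoded blob per column."""

    def __init__(self, columns):
        self._blobs = {name: json.dumps(values) for name, values in columns.items()}
        self.decoded = []  # records which columns were actually decoded

    def scan(self, projection=None):
        names = list(self._blobs) if projection is None else list(projection)
        batch = {}
        for name in names:
            # Only projected columns are ever decoded.
            self.decoded.append(name)
            batch[name] = json.loads(self._blobs[name])
        return batch


source = ColumnarSource({
    "id": [1, 2, 3],
    "name": ["a", "b", "c"],
    "payload": ["x" * 10] * 3,
})
projected = source.scan(projection=["id"])
```

After this call, `source.decoded` contains only `"id"`: the `name` and `payload` blobs were never touched.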
Predicate pushdown
Predicate pushdown means applying filters inside the source, or at least using source metadata to skip irrelevant data.
This can happen at several levels:
- file pruning
- partition pruning
- row-group pruning
- index-assisted filtering
- direct source-side evaluation
Predicate pushdown can eliminate more work than projection pushdown, because it can skip whole rows, row groups, or files rather than only columns. It is also harder to implement well, because the engine must understand:
- which predicates are safe to push
- what the source can evaluate
- when pushdown changes cost but not semantics
So pushdown is both an optimization problem and an interface-design problem.
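Row-group pruning is the easiest of these levels to sketch. The example below assumes each row group carries min/max statistics for a single integer column; `RowGroup`, `may_contain`, and `scan_with_pruning` are hypothetical names. Note that the statistics check is conservative: it only skips a group when no row in it can match, and surviving groups are still filtered row by row, so results are unchanged.

```python
import operator
from dataclasses import dataclass

OPS = {"=": operator.eq, "<": operator.lt, ">": operator.gt}


@dataclass
class RowGroup:
    min_value: int
    max_value: int
    values: list


def may_contain(group, op, literal):
    """Return False only when the statistics prove that no row can match."""
    if op == "=":
        return group.min_value <= literal <= group.max_value
    if op == "<":
        return group.min_value < literal
    if op == ">":
        return group.max_value > literal
    raise ValueError(f"unsupported operator: {op}")


def scan_with_pruning(groups, op, literal):
    """Return (matching values, number of row groups actually read)."""
    compare = OPS[op]
    matches, groups_read = [], 0
    for group in groups:
        if not may_contain(group, op, literal):
            continue  # pruned using metadata alone
        groups_read += 1
        matches.extend(v for v in group.values if compare(v, literal))
    return matches, groups_read


groups = [
    RowGroup(1, 10, [1, 5, 10]),
    RowGroup(11, 20, [11, 15, 20]),
    RowGroup(21, 30, [21, 25, 30]),
]
result, groups_read = scan_with_pruning(groups, ">", 18)
```

Here the first row group is skipped without reading its values, since its maximum of 10 cannot satisfy `> 18`.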
CSV vs Parquet
This contrast is one of the most useful mental models for file-format choice.
CSV
Good for:
- portability
- simplicity
- easy inspection
Weak for:
- type fidelity
- parsing cost
- projection efficiency
- null representation consistency
- analytical performance
CSV is row-oriented and text-based, so every scan has to re-parse every field, including columns the query never uses.
Parquet
Good for:
- columnar access
- stronger typing
- compression
- metadata-driven skipping
- analytical scans
Weak for:
- human readability
- simplicity of implementation
- some update-heavy workflows
Parquet is much more natural for analytical query engines because it aligns with columnar, batch-oriented execution.
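The type-fidelity and null-representation weaknesses of CSV are easy to demonstrate with Python's standard `csv` module. The helper `csv_round_trip` is a hypothetical name used only for this sketch.

```python
import csv
import io


def csv_round_trip(rows):
    """Write rows to CSV text and read them back with default settings."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    buffer.seek(0)
    return list(csv.reader(buffer))


original = [[1, 2.5, None, ""]]
restored = csv_round_trip(original)
```

After the round trip every value is a string, and both `None` and the empty string come back as `""`, so the reader cannot tell a null from an empty value without an external convention or schema.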
Source capabilities and planning
Not every source supports the same optimizations.
A planner often needs to know whether a source can:
- project columns
- evaluate filters
- expose statistics
- support partition pruning
- preserve sort order
This means source interfaces are not only runtime contracts. They also influence planning quality.
If the source abstraction hides too much, the engine loses optimization opportunities.
If it exposes too much, the abstraction becomes brittle and source-specific.
That is the central tension.
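A small sketch of capability-aware planning, under simplifying assumptions: the source advertises two boolean capabilities, predicates are opaque strings, and `needed_columns` already includes any columns the predicates reference. `Capabilities` and `plan_scan` are hypothetical names, not an API from any specific engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capabilities:
    projection: bool = False
    filters: bool = False


def plan_scan(caps, needed_columns, predicates):
    """Return plan steps bottom-up, pushing work into the scan where supported."""
    scan = {"op": "scan", "columns": None, "filters": []}
    plan = [scan]

    if caps.filters:
        scan["filters"] = list(predicates)          # source evaluates the filters
    elif predicates:
        plan.append({"op": "filter", "predicates": list(predicates)})

    if caps.projection:
        scan["columns"] = list(needed_columns)      # source reads only these columns
    else:
        plan.append({"op": "project", "columns": list(needed_columns)})

    return plan


capable = plan_scan(Capabilities(projection=True, filters=True), ["id"], ["id > 5"])
basic = plan_scan(Capabilities(), ["id"], ["id > 5"])
```

For the capable source the whole plan collapses into a single scan step, while for the basic source the engine keeps its own filter and project operators above an unrestricted scan. The operators above the scan are the same in both cases, which is what keeps the engine source-neutral.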
Source neutrality vs source exploitation
Good query engines try to be source-neutral at the operator level while still exploiting source-specific strengths.
That usually means:
- a generic scan abstraction
- plus capability-aware planning
This is often better than either extreme:
- fully generic but performance-blind
- or fully source-coupled but hard to extend
In practice, engines usually live somewhere in the middle.
Main takeaways
- The data-source boundary is one of the most important interfaces in a query engine.
- Schema discovery is planning-critical, not just metadata housekeeping.
- Projection and predicate pushdown begin at the source boundary.
- File formats directly affect execution cost and optimization opportunities.
- CSV is convenient but weak for analytical execution; Parquet is far more engine-friendly.
- A good source abstraction balances genericity with enough capability exposure to support real optimization.
Related notes
- hqew/007-storage-and-indexes.md
- hqew/014-how-query-engines-work-part-1.md
- hqew/016-physical-plans-and-operators.md
- hqew/018-cost-models-statistics-and-cardinality-estimation.md
Changelog
- Apr 8, 2026 -- Added a dedicated note on source interfaces, pushdown, schema discovery, and file-format tradeoffs.