
Hassan Abedi 405a609eb8 Add note files for data sources, cost models, and cardinality estimation

2026-04-08 15:09:35 +02:00


Data Sources and File Formats

A reference for the boundary where external data enters a query engine.

Short answer

A query engine needs a small, stable interface for reading data from many backends.

That interface should answer questions like:

what schema does this source expose?
how is the data scanned?
what columns can be projected?
what filters or other pushdowns can the source perform?
what batch or streaming model does it support?

The file format matters because it strongly affects:

schema fidelity
parsing cost
projection efficiency
compression
pushdown opportunities

So data sources are not just plumbing. They shape both planning and execution.

Why this matters

Most engines do not operate over one monolithic storage system.

They often need to read from:

CSV files
Parquet files
object stores
database connectors
in-memory tables
remote services

Without a clean source boundary, the engine becomes tightly coupled to one representation and one storage layout.

With a clean source boundary, the engine can:

plan against many backends
reuse operators above the source layer
push work down where possible
preserve a clean distinction between engine logic and source-specific behavior

What a source interface should expose

At minimum, a data source abstraction should expose:

schema discovery
scanning

In practice, useful source interfaces often also need to expose:

projection support
predicate pushdown support
partition awareness
ordering guarantees
file- or block-level metadata
statistics useful for planning

The challenge is to keep the interface:

small enough to stay generic
rich enough to preserve important performance features

That tradeoff is one of the core design decisions in a query engine.
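A minimal sketch of what such a boundary could look like, assuming a two-method contract (schema discovery plus scanning with an optional projection). The names `Field`, `DataSource`, and `InMemorySource` are hypothetical and not taken from any particular engine; batches are modeled as plain dicts of column lists for brevity.

```python
from dataclasses import dataclass
from typing import Iterator, Optional, Protocol, Sequence


@dataclass(frozen=True)
class Field:
    name: str
    dtype: str            # illustrative type names such as "int64" or "utf8"
    nullable: bool = True


class DataSource(Protocol):
    """Hypothetical minimal source contract: schema discovery plus scanning."""

    def schema(self) -> list[Field]: ...

    def scan(self, projection: Optional[Sequence[str]] = None) -> Iterator[dict[str, list]]: ...


class InMemorySource:
    """A toy columnar in-memory table that satisfies DataSource."""

    def __init__(self, fields: list[Field], columns: dict[str, list], batch_size: int = 2):
        self._fields = fields
        self._columns = columns
        self._batch_size = batch_size

    def schema(self) -> list[Field]:
        return list(self._fields)

    def scan(self, projection=None):
        # Default to all columns, in schema order, when no projection is given.
        names = list(projection) if projection is not None else [f.name for f in self._fields]
        n_rows = len(next(iter(self._columns.values()), []))
        for start in range(0, n_rows, self._batch_size):
            yield {name: self._columns[name][start:start + self._batch_size] for name in names}


source = InMemorySource(
    [Field("id", "int64", nullable=False), Field("city", "utf8")],
    {"id": [1, 2, 3], "city": ["Oslo", None, "Bergen"]},
)
batches = list(source.scan(projection=["id"]))
```

Richer capabilities (filters, statistics, partitioning) would extend this contract; where to stop is the tradeoff described above.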

Schema discovery

Planning depends on schema information before execution starts.

The engine needs to know:

column names
data types
nullability
sometimes partition columns or hidden metadata columns

There are two common patterns.

Declared schema

The source is given an explicit schema ahead of time.

Strengths:

predictable
fast
good for strongly typed pipelines

Weaknesses:

can drift from the actual data
requires external coordination

Inferred schema

The source inspects data and infers structure.

Strengths:

convenient
good for ad hoc exploration

Weaknesses:

often expensive, because data must be read or sampled before planning can start
can be fragile
may infer overly weak or inconsistent types

This is why many engines support both.
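The two patterns can be contrasted with a small sketch. The inference rule used here (try int, then float, else string, over a sample of rows) is an assumption chosen for illustration, not how any specific engine infers types; `infer_type`, `infer_schema`, and the sample data are hypothetical.

```python
import csv
import io

CSV_TEXT = "id,price,note\n1,9.5,ok\n2,,late\n3,12,\n"


def infer_type(values):
    """Pick the narrowest of int64/float64/utf8 that fits every non-empty value."""
    non_empty = [v for v in values if v != ""]
    for name, cast in (("int64", int), ("float64", float)):
        try:
            for v in non_empty:
                cast(v)
            return name
        except ValueError:
            continue
    return "utf8"


def infer_schema(text, sample_rows=100):
    """Infer (type, nullable) per column from the first `sample_rows` rows."""
    reader = csv.DictReader(io.StringIO(text))
    columns = {name: [] for name in reader.fieldnames}
    for i, row in enumerate(reader):
        if i >= sample_rows:
            break
        for name, value in row.items():
            columns[name].append(value)
    # An empty string in the sample is treated as a null marker.
    return {name: (infer_type(vals), "" in vals) for name, vals in columns.items()}


# Declared schema: supplied up front, no data access needed.
declared = {"id": ("int64", False), "price": ("float64", True), "note": ("utf8", True)}

# Inferred schema: derived from the data, and sensitive to the sample size.
inferred = infer_schema(CSV_TEXT)
inferred_small_sample = infer_schema(CSV_TEXT, sample_rows=1)
```

With the full sample the inferred schema matches the declared one. With a one-row sample, `price` is inferred as non-nullable even though later rows contain empty values, which illustrates the fragility noted above.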

Scan behavior

Scanning is the runtime act of turning a source into batches the engine can consume.

Important scan questions include:

does the source stream data incrementally or materialize eagerly?
what is the batch size?
does it preserve source ordering?
does it decode only requested columns?
can it skip files, row groups, or blocks?

In a batch-oriented engine, the scan operator is not just "read rows." It is "produce typed batches with predictable shape."

That is a much stronger contract.
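As a rough illustration of that contract, the following sketch streams a CSV source as column-oriented batches with a fixed column set, typed values, and a bounded row count. The schema representation (column name mapped to a Python type) and the function name `scan_batches` are assumptions made for this example.

```python
import csv
import io
from itertools import islice

CSV_TEXT = "id,city\n1,Oslo\n2,Bergen\n3,Tromso\n4,Bodo\n5,Alta\n"
SCHEMA = {"id": int, "city": str}   # declared schema: column name -> Python type


def scan_batches(text, schema, batch_size):
    """Stream column-oriented batches of at most `batch_size` rows."""
    reader = csv.DictReader(io.StringIO(text))
    while True:
        # Pull rows lazily so the whole source is never materialized at once.
        rows = list(islice(reader, batch_size))
        if not rows:
            return
        # Every batch has the same columns, in schema order, with typed values.
        yield {name: [cast(row[name]) for row in rows] for name, cast in schema.items()}


batches = list(scan_batches(CSV_TEXT, SCHEMA, batch_size=2))
```

Because `scan_batches` is a generator, downstream operators can consume one batch at a time; source ordering is preserved, and only the final batch may be shorter than `batch_size`.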

Projection pushdown

Projection pushdown means reading only the columns the plan actually needs.

This is often the first and easiest optimization at the source boundary.

It matters because analytical queries frequently touch:

a few columns
from very wide datasets

Projection pushdown reduces:

I/O
decoding cost
memory traffic
unnecessary batch width

Columnar formats benefit most, because each column is stored separately and unused columns can often be skipped without being read or decoded.
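A toy model of why column-wise storage makes projection cheap. `ColumnarFile` is a hypothetical stand-in for a real columnar format: each column is stored as its own JSON-encoded chunk, and a counter tracks how many bytes are decoded per read.

```python
import json


class ColumnarFile:
    """Toy columnar layout: each column is stored as its own encoded chunk."""

    def __init__(self, columns):
        self._chunks = {name: json.dumps(values).encode() for name, values in columns.items()}
        self.bytes_decoded = 0

    def read(self, projection=None):
        names = projection if projection is not None else list(self._chunks)
        out = {}
        for name in names:
            chunk = self._chunks[name]
            self.bytes_decoded += len(chunk)   # only requested chunks are touched
            out[name] = json.loads(chunk)
        return out


table = ColumnarFile({
    "id": [1, 2, 3],
    "city": ["Oslo", "Bergen", "Tromso"],
    "comment": ["a long free-text field"] * 3,
})

full = table.read()
bytes_full = table.bytes_decoded

table.bytes_decoded = 0
narrow = table.read(projection=["id"])
bytes_narrow = table.bytes_decoded
```

Reading only `id` decodes a small fraction of the bytes that a full read does. A row-oriented text format cannot do this as easily, since every line must still be parsed to locate the requested fields.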

Predicate pushdown

Predicate pushdown means applying filters inside the source, or at least using source metadata to skip irrelevant data.

This can happen at several levels:

file pruning
partition pruning
row-group pruning
index-assisted filtering
direct source-side evaluation

Predicate pushdown can eliminate more work than projection pushdown, since it skips rows, row groups, or whole files rather than just columns. It is also harder to implement well, because the engine must understand:

which predicates are safe to push
what the source can evaluate
whether pushing a predicate changes only cost and never results

So pushdown is both an optimization problem and an interface-design problem.
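Row-group pruning, one of the levels listed above, can be sketched with per-group min/max statistics. The `RowGroup` structure and the single hard-coded predicate shape (`value > threshold`) are simplifying assumptions; real formats keep statistics per column and support many predicate forms.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    values: list          # values of a single column, for brevity
    min_value: int
    max_value: int


def make_row_group(values):
    """Build a row group and record its min/max statistics at write time."""
    return RowGroup(values, min(values), max(values))


def scan_greater_than(row_groups, threshold):
    """Evaluate `value > threshold`, skipping groups whose max rules out a match."""
    matches, groups_read = [], 0
    for group in row_groups:
        if group.max_value <= threshold:
            continue                      # pruned using metadata only
        groups_read += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, groups_read


row_groups = [make_row_group(v) for v in ([1, 4, 7], [8, 12, 15], [20, 22, 31])]
matches, groups_read = scan_greater_than(row_groups, threshold=10)
```

The first group is skipped without touching its values because its maximum is below the threshold. The second group still needs row-level filtering: statistics can prove that a group contains no matches, but not that every row matches, so pruning has to stay conservative to preserve semantics.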

CSV vs Parquet

This contrast is one of the most useful mental models for file-format choice.

CSV

Good for:

portability
simplicity
easy inspection

Weak for:

type fidelity
parsing cost
projection efficiency
null representation consistency
analytical performance

CSV is row-oriented and text-based, so every scan pays the cost of parsing and type conversion again.

Parquet

Good for:

columnar access
stronger typing
compression
metadata-driven skipping
analytical scans

Weak for:

human readability
simplicity of implementation
update-heavy workflows, since files are typically rewritten rather than modified in place

Parquet is much more natural for analytical query engines because it aligns with columnar, batch-oriented execution.
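The type-fidelity and null-representation weaknesses of CSV are easy to demonstrate with a round trip through Python's standard `csv` module. The sample rows are made up for illustration.

```python
import csv
import io

rows = [
    {"id": 1, "price": 9.5, "note": None},   # note is a null
    {"id": 2, "price": None, "note": ""},    # note is an empty string
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "price", "note"])
writer.writeheader()
writer.writerows(rows)

buffer.seek(0)
round_tripped = list(csv.DictReader(buffer))
```

After the round trip every value is a string, and the null `note` in the first row can no longer be distinguished from the empty-string `note` in the second. A reader has to re-parse and re-guess types on every scan, whereas a typed format such as Parquet stores the schema alongside the data.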

Source capabilities and planning

Not every source supports the same optimizations.

A planner often needs to know whether a source can:

project columns
evaluate filters
expose statistics
support partition pruning
preserve sort order

This means source interfaces are not only runtime contracts. They also influence planning quality.

If the source abstraction hides too much, the engine loses optimization opportunities.

If it exposes too much, the abstraction becomes brittle and source-specific.

That is the central tension.
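One way to picture capability-aware planning is a planner that asks the source what it supports and falls back to engine-side operators otherwise. This is a sketch under assumptions: `Capabilities`, `plan_scan`, the dict-based plan steps, and the `(column, op, value)` predicate tuple are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capabilities:
    projection: bool = False
    filters: bool = False


def plan_scan(caps, output_columns, predicate=None):
    """Build a plan, pushing work into the scan where the source supports it.

    `predicate` is a hypothetical (column, op, value) tuple or None.
    """
    scan = {"op": "scan", "columns": None, "filter": None}
    plan = [scan]
    scan_columns = list(output_columns)
    if predicate is not None:
        if caps.filters:
            scan["filter"] = predicate
        else:
            plan.append({"op": "filter", "predicate": predicate})
            if predicate[0] not in scan_columns:
                scan_columns.append(predicate[0])   # engine-side filter needs its input column
    if caps.projection:
        scan["columns"] = scan_columns
    if not caps.projection or scan_columns != list(output_columns):
        # Trim to the requested output when the scan returned extra columns.
        plan.append({"op": "project", "columns": list(output_columns)})
    return plan


predicate = ("price", ">", 10)
basic_plan = plan_scan(Capabilities(), ["id"], predicate)
pushed_plan = plan_scan(Capabilities(projection=True, filters=True), ["id"], predicate)
partial_plan = plan_scan(Capabilities(projection=True), ["id"], predicate)
```

The operators above the scan stay the same in all three cases; only the amount of work delegated to the source changes. A source that reports no capabilities still produces a correct plan, just a slower one.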

Source neutrality vs source exploitation

Good query engines try to be source-neutral at the operator level while still exploiting source-specific strengths.

That usually means:

a generic scan abstraction
plus capability-aware planning

This is often better than either extreme:

fully generic but performance-blind
or fully source-coupled but hard to extend

In practice, engines usually live somewhere in the middle.

Main takeaways

The data-source boundary is one of the most important interfaces in a query engine.
Schema discovery is planning-critical, not just metadata housekeeping.
Projection and predicate pushdown begin at the source boundary.
File formats directly affect execution cost and optimization opportunities.
CSV is convenient but weak for analytical execution; Parquet is far more engine-friendly.
A good source abstraction balances genericity with enough capability exposure to support real optimization.

hqew/007-storage-and-indexes.md
hqew/014-how-query-engines-work-part-1.md
hqew/016-physical-plans-and-operators.md
hqew/018-cost-models-statistics-and-cardinality-estimation.md

Changelog

Apr 8, 2026 -- Added a dedicated note on source interfaces, pushdown, schema discovery, and file-format tradeoffs.
