Add note files about query execution models and indexes

This commit is contained in:
Hassan Abedi 2026-04-01 09:09:58 +02:00
parent 2a33f8b483
commit 8ed8347380
2 changed files with 320 additions and 0 deletions

# Query Execution Models
A reference for the main ways query operators run at runtime.
---
## Short answer
An execution model defines how operators consume input, produce output, and pass data through a plan.
The most important questions are:
- one row at a time or many values at once?
- pull-based or push-based?
- pipelined or materialized?
Those choices strongly affect latency, CPU efficiency, and implementation complexity.
---
## Row-at-a-time execution
In a row-oriented model, operators process one tuple at a time.
This is often implemented with an iterator interface where a parent asks a child for the next row.
Strengths:
- simple
- modular
- easy to debug
Weaknesses:
- high per-row overhead
- worse cache behavior for analytics
This model is historically important and still useful in many systems.
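The iterator (often called Volcano-style) interface can be sketched in a few lines of Python; the operator names and the in-memory table are illustrative, not taken from any particular engine:

```python
# Minimal sketch of a Volcano-style iterator model.
# Each operator exposes next(); a parent pulls one row at a time from its child.

class Scan:
    """Leaf operator: emits rows from an in-memory table, one per call."""
    def __init__(self, rows):
        self._it = iter(rows)

    def next(self):
        return next(self._it, None)  # None signals end-of-stream

class Filter:
    """Applies a predicate row by row, pulling from its child as needed."""
    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def next(self):
        while (row := self.child.next()) is not None:
            if self.predicate(row):
                return row
        return None

# The root drives execution by repeatedly calling next().
plan = Filter(Scan([{"x": 1}, {"x": 5}, {"x": 3}]), lambda r: r["x"] > 2)
out = []
while (row := plan.next()) is not None:
    out.append(row)
# out == [{"x": 5}, {"x": 3}]
```

The per-row `next()` calls are exactly where the dispatch overhead mentioned above comes from.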
---
## Batch-oriented execution
In a batch model, operators process chunks of rows together.
The batch may be row-based or columnar, but the main idea is to amortize operator overhead across many values.
Strengths:
- better CPU efficiency
- lower dispatch overhead
- easier parallelism inside an operator
Weaknesses:
- more bookkeeping
- more complex control flow
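A minimal batch-oriented sketch, with an artificially small batch size so the chunking is visible; the function names are illustrative:

```python
# Sketch of batch-oriented execution: operators exchange lists of rows
# ("batches") instead of single rows, amortizing dispatch overhead.

BATCH_SIZE = 2  # unrealistically small, just for illustration

def batched_scan(rows, batch_size=BATCH_SIZE):
    """Yield the input in fixed-size batches."""
    for i in range(0, len(rows), batch_size):
        yield rows[i:i + batch_size]

def batched_filter(batches, predicate):
    """One call handles a whole batch; per-row work is a tight inner loop."""
    for batch in batches:
        kept = [row for row in batch if predicate(row)]
        if kept:
            yield kept

result = []
for batch in batched_filter(batched_scan([1, 4, 2, 8, 3]), lambda v: v >= 3):
    result.extend(batch)
# result == [4, 8, 3]
```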
---
## Vectorized execution
Vectorized execution is a batch-oriented style where operators often process column vectors rather than full row objects.
This fits well with columnar memory layouts and analytical workloads.
Strengths:
- excellent cache locality
- better SIMD opportunities
- good fit for scans, filters, joins, and aggregates
Weaknesses:
- some control-flow-heavy logic is less natural
- more careful null and type handling is needed
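A rough sketch of the columnar style, using plain Python lists to stand in for contiguous column arrays and a selection vector in place of row objects:

```python
# Sketch of vectorized execution over column vectors (plain lists here,
# standing in for contiguous arrays). A filter produces a selection vector
# of qualifying positions instead of materializing row objects.

prices = [10.0, 25.0, 7.5, 40.0]   # one column
qtys   = [3,    1,    10,  2]      # another column

# Vectorized filter: one pass over a single column.
selection = [i for i, p in enumerate(prices) if p > 9.0]

# A downstream operator applies the selection to whichever columns it needs.
revenue = [prices[i] * qtys[i] for i in selection]
# selection == [0, 1, 3]; revenue == [30.0, 25.0, 80.0]
```

With real arrays, the tight per-column loops are what give the cache locality and SIMD opportunities listed above.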
---
## Pull vs push
### Pull-based execution
Parent operators ask children for data.
Strengths:
- natural operator trees
- straightforward control flow
Weaknesses:
- repeated per-row or per-batch dispatch at every operator boundary adds overhead
### Push-based execution
Child operators push data to parents or downstream consumers.
Strengths:
- natural for streaming or event-driven systems
- can work well with pipeline fusion
Weaknesses:
- control flow can be harder to reason about
Many systems combine these ideas rather than choosing only one.
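Push-based flow can be sketched with callbacks; the helper names here are illustrative:

```python
# Sketch of push-based execution: each operator is a consumer callback that
# a child invokes for every value it produces; the sink accumulates results.

def make_filter(predicate, downstream):
    """Returns a consumer that forwards qualifying values downstream."""
    def consume(value):
        if predicate(value):
            downstream(value)
    return consume

sink = []
pipeline = make_filter(lambda v: v % 2 == 0, sink.append)

# The source drives execution by pushing values into the pipeline.
for v in [1, 2, 3, 4, 5, 6]:
    pipeline(v)
# sink == [2, 4, 6]
```

Note the inversion relative to the pull sketch: here the source owns the loop, not the root of the plan.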
---
## Pipelining vs materialization
### Pipelined execution
Operators pass intermediate results incrementally.
Strengths:
- low latency
- less temporary storage in favorable cases
Weaknesses:
- some operators still create barriers
### Materializing execution
An operator stores its entire output before the next operator consumes it.
Strengths:
- simpler boundaries
- easier reuse of intermediates
Weaknesses:
- more memory and I/O cost
- higher latency
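The two styles can be contrasted in a small sketch, where a lazy generator stands in for a pipelined operator and a list for a materialized intermediate:

```python
# Sketch contrasting the two styles: the pipelined version streams values
# through a lazy generator, the materializing version builds the full
# intermediate list before the next stage reads it. The results are
# identical; only memory behavior and latency differ.

def pipelined(rows):
    doubled = (r * 2 for r in rows)        # lazy: no intermediate stored
    return [d for d in doubled if d > 4]

def materialized(rows):
    doubled = [r * 2 for r in rows]        # full intermediate in memory
    return [d for d in doubled if d > 4]

# pipelined([1, 2, 3, 4]) == materialized([1, 2, 3, 4]) == [6, 8]
```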
---
## Blocking operators
Some operators are naturally blocking.
Examples:
- sort
- some aggregates
- some join strategies
These operators shape the real execution behavior of the plan because they force buffering or full-input processing before useful output appears.
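A sketch of the barrier a blocking sort creates, compared with a streaming filter upstream of it:

```python
# Sketch of a blocking operator: sort must consume its entire input
# before it can emit its first output row, unlike a streaming filter.

def streaming_filter(rows, predicate):
    for row in rows:          # emits as soon as a row qualifies
        if predicate(row):
            yield row

def blocking_sort(rows):
    buffered = list(rows)     # barrier: the full input is buffered here
    buffered.sort()
    yield from buffered       # output only begins after the barrier

out = list(blocking_sort(streaming_filter([5, 1, 4, 2], lambda v: v > 1)))
# out == [2, 4, 5]
```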
---
## Practical mental model
Execution models are about runtime granularity and data flow.
If architecture asks "what kind of engine is this?", the execution model asks "how do operators actually run?"
---
## Changelog
* **April 1, 2026** -- First version created.

# Storage and Indexes
A reference for how storage layout and indexing shape query execution.
---
## Short answer
Storage is not just where data sits. It strongly influences which queries are cheap, which operators are natural, and what the optimizer can exploit.
Indexes matter because they trade extra write and storage cost for faster reads on selected access patterns.
---
## Row store vs column store
### Row store
Stores all fields of one row together.
Good for:
- point lookups
- updates of whole records
- transactional workloads
Weak for:
- scanning a few columns across many rows
### Column store
Stores values of the same column together.
Good for:
- analytical scans
- compression
- vectorized execution
- reading only selected columns
Weak for:
- reconstructing many full records repeatedly
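The two layouts can be sketched side by side; the tiny table and its column names are made up for illustration:

```python
# Sketch of the same table in row and column layouts, and why scanning
# one column touches less data in the columnar form.

rows = [  # row store: all fields of a record kept together
    {"id": 1, "name": "a", "price": 10.0},
    {"id": 2, "name": "b", "price": 25.0},
    {"id": 3, "name": "c", "price": 7.5},
]

columns = {  # column store: values of the same column kept together
    "id": [1, 2, 3],
    "name": ["a", "b", "c"],
    "price": [10.0, 25.0, 7.5],
}

# Row layout: a SUM over price still walks every full record.
row_total = sum(r["price"] for r in rows)

# Column layout: the scan reads only the one column it needs.
col_total = sum(columns["price"])
# row_total == col_total == 42.5
```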
---
## Why storage layout matters
The storage layout affects:
- I/O volume
- cache locality
- compression opportunities
- pushdown behavior
- operator implementation strategy
So storage is a first-order architecture decision, not just a persistence detail.
---
## Common index types
### B-tree
A classic ordered index, good for:
- point lookups
- range queries
- ordered scans
### Hash index
Optimized for exact-match lookups.
Good for:
- equality predicates
Weak for:
- range queries
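A hash index can be sketched as a dict from key to row positions; the table contents are illustrative:

```python
# Sketch of a hash index: a dict from key to row positions. Equality
# lookups become O(1) on average, but range predicates get no help
# because hash buckets carry no ordering.

table = [("alice", 30), ("bob", 25), ("carol", 30)]

# Build: index the second field (age).
age_index = {}
for pos, (_, age) in enumerate(table):
    age_index.setdefault(age, []).append(pos)

# Probe: answer the equality predicate age == 30 without a full scan.
matches = [table[pos] for pos in age_index.get(30, [])]
# matches == [("alice", 30), ("carol", 30)]
```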
### LSM-based indexing
Common in modern write-heavy systems.
Good for:
- high write throughput
- append-heavy workloads
Tradeoff:
- a read may have to consult the in-memory buffer plus several on-disk runs, so read cost depends on compaction keeping the number of runs small
### Inverted index
Maps terms to documents or postings.
Good for:
- text search
- filtering over tokenized fields
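A toy inverted index; the documents are made up for illustration:

```python
# Sketch of an inverted index: map each term to the ids of documents
# containing it, so a term query avoids scanning every document.

docs = {
    1: "query engines run plans",
    2: "plans contain operators",
    3: "operators run in pipelines",
}

inverted = {}
for doc_id, text in docs.items():
    for term in set(text.split()):          # set(): index each term once
        inverted.setdefault(term, set()).add(doc_id)

# AND query: intersect the postings of each term.
hits = inverted.get("run", set()) & inverted.get("operators", set())
# hits == {3}
```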
### Vector index
Supports approximate nearest-neighbor search over embeddings.
Good for:
- semantic search
- similarity retrieval
Tradeoff:
- often approximate rather than exact
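For illustration, here is an exact brute-force similarity scan, which is the operation a vector index approximates; real indexes use ANN structures (graphs, clustering) precisely to avoid comparing against every stored vector:

```python
# Sketch of vector similarity search as a brute-force scan: rank every
# stored embedding by cosine similarity to the query. A vector index
# approximates this result while skipping most of the comparisons.

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

embeddings = {  # toy 2-d embeddings, purely illustrative
    "doc_a": [1.0, 0.0],
    "doc_b": [0.7, 0.7],
    "doc_c": [0.0, 1.0],
}

query = [0.9, 0.1]
best = max(embeddings, key=lambda k: cosine(query, embeddings[k]))
# best == "doc_a"
```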
---
## What indexes buy
Indexes can help the engine avoid full scans and reduce candidate sets before expensive operators run.
They are most valuable when:
- the predicate is selective
- the access pattern repeats often
- the engine can exploit the index directly
They are less valuable when:
- most rows are needed anyway
- the predicate is too broad
- maintaining the index is too expensive for the workload
---
## Practical mental model
Tables define what data exists.
Storage layout defines how that data is physically organized.
Indexes define shortcuts through that organization.
That is the simplest useful framing.
---
## Changelog
* **April 1, 2026** -- First version created.