diff --git a/hqew/006-query-execution-models.md b/hqew/006-query-execution-models.md
new file mode 100644
index 0000000..69ebbe6
--- /dev/null
+++ b/hqew/006-query-execution-models.md
@@ -0,0 +1,167 @@
+# Query Execution Models
+
+A reference for the main ways query operators run at runtime.
+
+---
+
+## Short answer
+
+An execution model defines how operators consume input, produce output, and pass data through a plan.
+
+The most important questions are:
+
+- one row at a time or many values at once?
+- pull-based or push-based?
+- pipelined or materialized?
+
+Those choices strongly affect latency, CPU efficiency, and implementation complexity.
+
+---
+
+## Row-at-a-time execution
+
+In a row-oriented model, operators process one tuple at a time.
+
+This is often implemented with an iterator interface where a parent asks a child for the next row.
+
+Strengths:
+
+- simple
+- modular
+- easy to debug
+
+Weaknesses:
+
+- high per-row overhead
+- worse cache behavior for analytics
+
+This model is historically important and still useful in many systems.
+
+---
+
+## Batch-oriented execution
+
+In a batch model, operators process chunks of rows together.
+
+The batch may be row-based or columnar, but the main idea is to amortize operator overhead across many values.
+
+Strengths:
+
+- better CPU efficiency
+- lower dispatch overhead
+- easier parallelism inside an operator
+
+Weaknesses:
+
+- more bookkeeping
+- more complex control flow
+
+---
+
+## Vectorized execution
+
+Vectorized execution is a batch-oriented style where operators often process column vectors rather than full row objects.
+
+This fits well with columnar memory layouts and analytical workloads.
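
A minimal sketch of the contrast, assuming a toy in-memory batch stored as a dict of Python lists. The names `Batch`, `select_ge`, `sum_product`, and the `price`/`qty` columns are hypothetical and chosen only for illustration. Each vectorized primitive runs a tight loop over one column and passes a selection vector (the positions that survived the filter) to the next primitive, while the row-at-a-time version interleaves filter and aggregate per tuple.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical columnar batch: column name -> vector of values.
Batch = Dict[str, List[int]]


def row_at_a_time_total(rows: Iterable[Tuple[int, int]], min_qty: int) -> int:
    """Iterator style: one (price, qty) tuple per step, filter and sum interleaved."""
    total = 0
    for price, qty in rows:
        if qty >= min_qty:
            total += price * qty
    return total


def select_ge(column: List[int], bound: int) -> List[int]:
    """Filter primitive: scans one column and returns a selection vector."""
    return [i for i, value in enumerate(column) if value >= bound]


def sum_product(a: List[int], b: List[int], selection: List[int]) -> int:
    """Aggregate primitive: reads only the selected positions of two columns."""
    return sum(a[i] * b[i] for i in selection)


def vectorized_total(batch: Batch, min_qty: int) -> int:
    """Runs the filter over the whole batch, then the aggregate over survivors."""
    selection = select_ge(batch["qty"], min_qty)
    return sum_product(batch["price"], batch["qty"], selection)


batch: Batch = {"price": [10, 20, 30, 40], "qty": [1, 5, 2, 7]}
```

Both functions compute the same total for the same input. In pure Python the vectorized form is not faster; the sketch only shows the shape of the data flow that lets a real engine amortize dispatch cost and use SIMD.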
+
+Strengths:
+
+- excellent cache locality
+- better SIMD opportunities
+- good fit for scans, filters, joins, and aggregates
+
+Weaknesses:
+
+- some control-flow-heavy logic is less natural
+- more careful null and type handling is needed
+
+---
+
+## Pull vs push
+
+### Pull-based execution
+
+Parent operators ask children for data.
+
+Strengths:
+
+- natural operator trees
+- straightforward control flow
+
+Weaknesses:
+
+- can introduce repeated dispatch overhead
+
+### Push-based execution
+
+Child operators push data to parents or downstream consumers.
+
+Strengths:
+
+- natural for streaming or event-driven systems
+- can work well with pipeline fusion
+
+Weaknesses:
+
+- control flow can be harder to reason about
+
+Many systems combine these ideas rather than choosing only one.
+
+---
+
+## Pipelining vs materialization
+
+### Pipelined execution
+
+Operators pass intermediate results incrementally.
+
+Strengths:
+
+- low latency
+- less temporary storage in favorable cases
+
+Weaknesses:
+
+- some operators still create barriers
+
+### Materializing execution
+
+An operator stores its entire output before the next operator consumes it.
+
+Strengths:
+
+- simpler boundaries
+- easier reuse of intermediates
+
+Weaknesses:
+
+- more memory and I/O cost
+- higher latency
+
+---
+
+## Blocking operators
+
+Some operators are naturally blocking.
+
+Examples:
+
+- sort
+- some aggregates
+- some join strategies
+
+These operators shape the real execution behavior of the plan because they force buffering or full-input processing before useful output appears.
+
+---
+
+## Practical mental model
+
+Execution models are about runtime granularity and data flow.
+
+If architecture asks "what kind of engine is this?", the execution model asks "how do operators actually run?"
+
+---
+
+## Changelog
+
+* **April 1, 2026** -- First version created.
diff --git a/hqew/007-storage-and-indexes.md b/hqew/007-storage-and-indexes.md
new file mode 100644
index 0000000..711c09b
--- /dev/null
+++ b/hqew/007-storage-and-indexes.md
@@ -0,0 +1,153 @@
+# Storage and Indexes
+
+A reference for how storage layout and indexing shape query execution.
+
+---
+
+## Short answer
+
+Storage is not just where data sits. It strongly influences which queries are cheap, which operators are natural, and what the optimizer can exploit.
+
+Indexes matter because they trade extra write and storage cost for faster reads on selected access patterns.
+
+---
+
+## Row store vs column store
+
+### Row store
+
+Stores all fields of one row together.
+
+Good for:
+
+- point lookups
+- updates of whole records
+- transactional workloads
+
+Weak for:
+
+- scanning a few columns across many rows
+
+### Column store
+
+Stores values of the same column together.
+
+Good for:
+
+- analytical scans
+- compression
+- vectorized execution
+- reading only selected columns
+
+Weak for:
+
+- reconstructing many full records repeatedly
+
+---
+
+## Why storage layout matters
+
+The storage layout affects:
+
+- I/O volume
+- cache locality
+- compression opportunities
+- pushdown behavior
+- operator implementation strategy
+
+So storage is a first-order architecture decision, not just a persistence detail.
+
+---
+
+## Common index types
+
+### B-tree
+
+A classic ordered index, good for:
+
+- point lookups
+- range queries
+- ordered scans
+
+### Hash index
+
+Optimized for exact-match lookups.
+
+Good for:
+
+- equality predicates
+
+Weak for:
+
+- range queries
+
+### LSM-based indexing
+
+Common in modern write-heavy systems.
+
+Good for:
+
+- high write throughput
+- append-heavy workloads
+
+Tradeoff:
+
+- reads may need to check several sorted runs or levels until compaction merges them
+
+### Inverted index
+
+Maps terms to documents or postings.
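
A minimal sketch of the term-to-postings mapping, assuming whitespace tokenization and integer document ids. `build_inverted_index` and `search_all` are hypothetical helper names, not any particular engine's API. Each term maps to a sorted list of the documents that contain it, and an AND query intersects those lists.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_inverted_index(docs: Dict[int, str]) -> Dict[str, List[int]]:
    """Maps each lowercased term to a sorted posting list of document ids."""
    postings: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search_all(index: Dict[str, List[int]], terms: List[str]) -> List[int]:
    """AND query: returns ids of documents that contain every term."""
    if not terms:
        return []
    sets = [set(index.get(term.lower(), [])) for term in terms]
    return sorted(set.intersection(*sets))


docs = {1: "hash join build side", 2: "sort merge join", 3: "hash aggregate"}
index = build_inverted_index(docs)
matches = search_all(index, ["hash", "join"])
```

Real engines typically keep posting lists compressed and intersect them by merging sorted lists rather than building sets; the set intersection here is a simplification.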
+
+Good for:
+
+- text search
+- filtering over tokenized fields
+
+### Vector index
+
+Supports approximate nearest-neighbor search over embeddings.
+
+Good for:
+
+- semantic search
+- similarity retrieval
+
+Tradeoff:
+
+- often approximate rather than exact
+
+---
+
+## What indexes buy
+
+Indexes can help the engine avoid full scans and reduce candidate sets before expensive operators run.
+
+They are most valuable when:
+
+- the predicate is selective
+- the access pattern repeats often
+- the engine can exploit the index directly
+
+They are less valuable when:
+
+- most rows are needed anyway
+- the predicate is too broad
+- maintaining the index is too expensive for the workload
+
+---
+
+## Practical mental model
+
+Tables define what data exists.
+
+Storage layout defines how that data is physically organized.
+
+Indexes define shortcuts through that organization.
+
+That is the simplest useful framing.
+
+---
+
+## Changelog
+
+* **April 1, 2026** -- First version created.