diff --git a/hassan/005-felixs-datalog.md b/hassan/005-felixs-datalog.md new file mode 100644 index 0000000..fc2bc8e --- /dev/null +++ b/hassan/005-felixs-datalog.md @@ -0,0 +1,447 @@ +## felix-db: Project Internals + +This document provides detailed technical documentation of the felix-db Datalog implementation. + +### Overview + +felix-db is an experimental Datalog implementation in Haskell. Datalog is a declarative logic programming language (a subset of Prolog) commonly used +for database queries, program analysis, and reasoning systems. + +**Project Status:** The implementation is correct and fully functional for positive Datalog (without negation in rules). + +### Architecture + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ User Input │ +│ (Datalog text/files) │ +└─────────────────────────────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────┐ +│ DatalogParser.hs │ +│ (Megaparsec-based parser) │ +│ Text → Statement (Fact | Rule | Query) │ +└─────────────────────────────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────┐ +│ DatalogDB.hs │ +│ (Type class + core data structures) │ +│ Relation, RelationRule, RuleElement │ +└─────────────────────────────────────────────────────────────────┘ + │ + ┌───────────────┴───────────────┐ + ▼ ▼ +┌─────────────────────────┐ ┌─────────────────────────────────┐ +│ InMemoryDB.hs │ │ Rules.hs │ +│ (Concrete DB impl) │ │ (Rule digestion logic) │ +│ relations + constants │ │ digestHead, digestBody │ +└─────────────────────────┘ └─────────────────────────────────┘ + │ │ + └───────────────┬───────────────┘ + ▼ +┌─────────────────────────────────────────────────────────────────┐ +│ NaiveQE.hs │ +│ (Naive Query Engine - bottom-up) │ +│ Fixed-point computation of Herbrand model │ +└─────────────────────────────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────┐ +│ DigestedQuery.hs │ +│ (Query preprocessing) │ +│ Query → DigestedQuery with indexed variables │ +└─────────────────────────────────────────────────────────────────┘ +``` + +### Module Details + +#### 1. DatalogParser.hs + +**Purpose:** Parses Datalog syntax into an Abstract Syntax Tree (AST). + +**Parser Type:** Uses Megaparsec (`Parsec Void Text`). + +##### Core Types + +```haskell +data Term + = Var Text -- Variables: X, Person, Y1 (uppercase start) + | Sym Text -- Symbols: alice, "london", uk (lowercase/quoted) + | Num Integer + +data Literal = Literal + { positive :: Bool -- True = positive, False = negated + , predName :: Text -- Predicate name + , arguments :: [Term] -- Argument list + } + +data Statement + = Fact Literal -- p(a,b). + | Rule Head [Literal] -- p(X,Y) :- q(X,Z), r(Z,Y). + | Query [Text] [Literal] -- ?- p(X,Y). or ?- p(X,Y) → X,Y. +``` + +##### Syntax Support + +| Feature | Example | Notes | +|-------------------|-------------------------------------|--------------------------| +| Facts | `parent("alice", "bob").` | Ground atoms | +| Rules | `ancestor(X,Y) :- parent(X,Y).` | Head :- body | +| Queries | `?- parent(X,Y).` | Returns all variables | +| Projected queries | `?- edge(A,B), edge(B,C) → A,C.` | Explicit output vars | +| Negation | `not member(X,L)` or `!member(X,L)` | Parsed but not evaluated | +| Comments | `-- line` or `/* block */` | Haskell-style | +| Arrow variants | `:-`, `→`, `->` | All accepted | + +##### Entry Points + +```haskell +parseDatalog :: Text -> Either (ParseErrorBundle Text Void) Statement +parseDatalogFile :: Text -> Either (ParseErrorBundle Text Void) [Statement] +``` + +--- + +#### 2. DatalogDB.hs + +**Purpose:** Defines the core data structures and the `DatalogDB` type class. + +##### Core Types + +```haskell +data Relation = Relation + { name :: RelationId -- Relation name (Text) + , arity :: Int -- Number of arguments + , tuples :: Set [Constant] -- Known facts (extensional) + , rules :: [RelationRule] -- Derivation rules (intensional) + } + +data RelationRule = RelationRule + { headVariables :: [Text] -- Variables in scope + , bodyElements :: [RuleBodyElement] -- Body constraints + , outputElements :: [RuleElement] -- How to construct head tuple + } + +data RuleBodyElement = RuleBodyElement + { subRelationId :: RelationId -- Referenced relation + , ruleElements :: [RuleElement] -- How to map variables + } + +data RuleElement + = RuleElementConstant Constant -- Fixed value + | RuleElementVariable Int -- Index into headVariables +``` + +##### Type Class + +```haskell +class DatalogDB db where + emptyDB :: db + relationNames :: db -> [Text] + lookupRelation :: db -> Text -> Maybe Relation + insertRelation :: db -> Relation -> db + addConstants :: db -> Set Constant -> db + allConstants :: db -> Set Constant -- Herbrand universe +``` + +--- + +#### 3. InMemoryDB.hs + +**Purpose:** Concrete implementation of `DatalogDB` using in-memory data structures. + +```haskell +data InMemoryDB = InMemoryDB + { relations :: Map RelationId Relation + , constants :: Set Constant -- The Herbrand universe + } +``` + +##### Builder Functions + +```haskell +-- Build DB from facts only +withFacts :: [Text] -> InMemoryDB + +-- Build DB from facts and rules +withFactsAndRules :: [Text] -> [Text] -> InMemoryDB + +-- Build DB with explicit extra constants +withFactsRulesAndConstants :: [Text] -> [Text] -> [Constant] -> InMemoryDB +``` + +**Example:** + +```haskell +let db = withFactsAndRules + [ "parent(\"alice\", \"bob\")." + , "parent(\"bob\", \"carol\")." + ] + [ "ancestor(X,Y) :- parent(X,Y)." + , "ancestor(X,Z) :- parent(X,Y), ancestor(Y,Z)." + ] +``` + +--- + +#### 4. Rules.hs + +**Purpose:** Processes rules and builds the internal representation. + +##### Key Functions + +**`digestHead`** - Processes the rule head: + +1. Extracts variables using `nub` (preserves first-occurrence order) +2. Creates `RuleElement` list mapping terms to indices/constants +3. Collects constants for the Herbrand universe + +**`digestBody`** - Processes each body literal: + +1. Extends variable list with new variables from body +2. Creates `BodyConstraint` linking to sub-relations +3. Maps terms to `RuleElement` (index or constant) + +**Variable Indexing Example:** + +``` +Rule: ancestor(X,Y) :- parent(X,Z), ancestor(Z,Y). + +After digestHead: variableNames = ["X", "Y"] +After digestBody[0]: variableNames = ["X", "Y", "Z"] (Z added) +After digestBody[1]: variableNames = ["X", "Y", "Z"] (no change) + +Indices: X=0, Y=1, Z=2 +``` + +**`addRule`** - Main entry point: + +```haskell +addRule :: (DatalogDB db) => (Literal, [Literal]) -> db -> db +``` + +--- + +#### 5. NaiveQE.hs (Naive Query Engine) + +**Purpose:** Implements naive bottom-up evaluation with fixed-point iteration. + +```haskell +data NaiveQE db = NaiveQE + { db :: db + , herbrand :: Map Text Relation -- Computed model + } +``` + +##### Algorithm: `computeHerbrand` + +The naive evaluation algorithm: + +``` +1. Initialize Facts := all known facts from DB +2. Repeat: + a. For each rule R in each relation: + - Enumerate all variable assignments from Herbrand universe + - For each assignment that satisfies all body constraints: + - Compute the derived tuple + - If tuple is new, add to Facts and set changed=True + b. If no changes, terminate +3. Return final Facts (the minimal Herbrand model) +``` + +**Complexity:** O(n^k) per iteration where n = |constants|, k = |variables in rule| + +##### Query Execution + +```haskell +executeQuery :: (DatalogDB db) => NaiveQE db -> DigestedQuery -> QueryResults +``` + +1. Enumerate all assignments for sought variables +2. For each, check if any assignment to unsought variables satisfies all conditions +3. Return matching sought-variable tuples + +--- + +#### 6. DigestedQuery.hs + +**Purpose:** Preprocesses queries for efficient execution. + +```haskell +data DigestedQuery = DigestedQuery + { allBoundVariables :: [Text] -- All variables in query + , numSoughtVariables :: Int -- Variables in output + , conditions :: [DigestedQueryCondition] + } + +data DigestedQueryCondition = DigestedQueryCondition + { relationName :: Text + , entries :: [DigestedQueryEntry] + } + +data DigestedQueryEntry + = DigestedQueryEntryConstant Constant + | DigestedQueryEntryVariable Int -- Index into allBoundVariables +``` + +**Query Forms:** + +- `?- p(X,Y).` - Output all variables (X, Y) +- `?- edge(A,B), edge(B,C) → A,C.` - Output only A, C (B is existential) + +--- + +#### 7. QueryEngine.hs + +**Purpose:** Type class for query engines. + +```haskell +class QueryEngine qe where + queryEngine :: (DatalogDB db) => db -> qe db + query :: (DatalogDB db) => qe db -> Text -> Text +``` + +--- + +#### 8. Utility.hs + +**Purpose:** Helper functions. + +```haskell +-- Generate all possible maps from domain to codomain +allMaps :: (Ord a) => [a] -> [b] -> [Map.Map a b] +``` + +Used by `NaiveQE` to enumerate variable assignments. + +--- + +### Data Flow Example + +**Input:** + +```datalog +parent("alice", "bob"). +parent("bob", "carol"). +ancestor(X,Y) :- parent(X,Y). +ancestor(X,Z) :- parent(X,Y), ancestor(Y,Z). +?- ancestor(alice, X). +``` + +**Step 1: Parse** → `[Fact, Fact, Rule, Rule, Query]` + +**Step 2: Build DB** + +``` +relations = { + "parent" -> Relation { + tuples = {["alice","bob"], ["bob","carol"]}, + rules = [] + }, + "ancestor" -> Relation { + tuples = {}, + rules = [Rule1, Rule2] + } +} +constants = {"alice", "bob", "carol"} +``` + +**Step 3: Compute Herbrand Model** + +``` +Iteration 1: Apply rules + ancestor(alice,bob) ← parent(alice,bob) + ancestor(bob,carol) ← parent(bob,carol) + +Iteration 2: Apply rules again + ancestor(alice,carol) ← parent(alice,bob), ancestor(bob,carol) + +Iteration 3: No new facts → terminate +``` + +**Step 4: Execute Query** + +``` +?- ancestor(alice, X) +Result: bob, carol +``` + +--- + +### Correctness Assessment + +| Component | Status | Notes | +|-----------------|---------|--------------------------------------| +| Parser | Correct | Full Datalog syntax support | +| Data Model | Correct | Proper Relation/Rule representation | +| Rule Processing | Correct | Correct variable indexing with `nub` | +| Query Engine | Correct | Fixed-point terminates correctly | +| Tests | Pass | Comprehensive coverage | + +#### Limitations + +1. **No stratified negation** - Negation is parsed but ignored in evaluation +2. **Naive complexity** - Exponential in number of variables per rule +3. **No aggregation** - Standard Datalog without extensions +4. **No built-in predicates** - No arithmetic comparison, etc. + +--- + +### Test Coverage + +Tests are in `test/Test/Datalog/`: + +- **DatalogParserSpec.hs** - Parser correctness +- **InMemoryDBSpec.hs** - Database operations +- **RulesSpec.hs** - Rule digestion +- **NaiveQESpec.hs** - Query evaluation +- **DigestedQuerySpec.hs** - Query preprocessing + +**Notable Test Cases:** + +- Reflexive relations: `equiv(X, X) :- .` +- Symmetric relations: `equiv(Y, X) :- equiv(X, Y).` +- Transitive closure: `equiv(X, Z) :- equiv(X, Y), equiv(Y, Z).` +- Full equivalence relations (combining all three) +- Genealogical queries (niece, sibling relationships) +- Poset ordering with constants + +--- + +### Build & Run + +```bash +cabal build ## Build +cabal test ## Run all tests +cabal repl ## Interactive REPL + +## Run specific tests +cabal test --test-option=--match="/NaiveQE/" +``` + +--- + +### Dependencies + +- **megaparsec** - Parser combinators +- **containers** - Map, Set +- **text** - Text processing +- **hspec** - Testing framework + +--- + +### Future Work + +Potential enhancements: + +1. Implement stratified negation +2. Add semi-naive evaluation for better performance +3. Built-in predicates (comparison, arithmetic) +4. Magic sets optimization +5. Persistent storage backend + +## Changelog + +* **Mar 4, 2026** -- The first version was created. diff --git a/hassan/006-cales-datalog.md b/hassan/006-cales-datalog.md new file mode 100644 index 0000000..01f7574 --- /dev/null +++ b/hassan/006-cales-datalog.md @@ -0,0 +1,357 @@ +## toy-datalog: Implementation Notes + +### Overview + +**toy-datalog** is a Haskell implementation of a Datalog interpreter. Datalog is a declarative logic programming language that is a syntactic subset +of Prolog, commonly used for: + +- Database query languages (it's the theoretical foundation for SQL's recursive queries) +- Static program analysis +- Knowledge representation and reasoning +- Access control policies + +This implementation provides parsing, evaluation, and fixed-point computation for Datalog programs. + +**Author:** Cale Gibbard +**License:** BSD-2-Clause +**Version:** 0.1.0.0 + +--- + +### Architecture + +``` +┌─────────────────────────────────────────────────────────────┐ +│ Datalog Program (Text) │ +└─────────────────────────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────┐ +│ Parser (Megaparsec) │ +│ • parseTerm, parseAtom, parseRule, parseProgram │ +└─────────────────────────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────┐ +│ Abstract Syntax Tree │ +│ • Program = [Rule] │ +│ • Rule = Atom :- [Atom] │ +│ • Atom = RelId [Term] │ +│ • Term = Con ConId | Var VarId │ +└─────────────────────────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────┐ +│ Database │ +│ • Herbrand Universe (all constants) │ +│ • Relations (indexed tuples) │ +│ • Rules (indexed by head relation) │ +└─────────────────────────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────┐ +│ Evaluation Engine │ +│ • Forward chaining (bottom-up) │ +│ • Fixed-point iteration │ +│ • Variable binding and unification │ +└─────────────────────────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────┐ +│ Derived Facts (Complete Database) │ +└─────────────────────────────────────────────────────────────┘ +``` + +--- + +### Module Analysis + +#### 1. Datalog.Syntax (`src/Datalog/Syntax.hs`) + +Defines the core AST types: + +| Type | Description | Example | +|-----------|-----------------------------------------------|---------------------------------| +| `ConId` | Constant identifier (lowercase/numeric start) | `xerces`, `0`, `brooke` | +| `VarId` | Variable identifier (uppercase start) | `X`, `Y`, `Parent` | +| `RelId` | Relation/predicate name | `parent`, `ancestor` | +| `Term` | Either a constant or variable | `Con "xerces"`, `Var "X"` | +| `Atom` | Relation applied to terms | `parent(X, Y)` | +| `Rule` | Head atom with body atoms | `ancestor(X,Y) :- parent(X,Y).` | +| `Program` | Collection of rules | Full Datalog program | + +**Key Features:** + +- `HasConstants` typeclass for extracting constants from any Datalog structure +- `Pretty` typeclass for converting AST back to readable syntax +- Smart constructor `term` that auto-detects constants vs variables by case + +#### 2. Datalog.Parser (`src/Datalog/Parser.hs`) + +Megaparsec-based parser with: + +- **Lexical conventions:** + - Whitespace handling (spaces, line comments `--`, block comments `{- -}`) + - Constants: start with lowercase or digit + - Variables: start with uppercase + - Relations: start with lowercase + +- **Grammar:** + ``` + program ::= rule* + rule ::= atom '.' (fact) + | atom ':-' query '.' (rule with body) + query ::= atom (',' atom)* + atom ::= relId '(' term (',' term)* ')' + term ::= variable | constant + ``` + +#### 3. Datalog.Eval (`src/Datalog/Eval.hs`) + +The evaluation engine implements **naive bottom-up evaluation** (forward chaining). + +##### Data Structures + +**Relation:** + +```haskell +data Relation = Relation + { _relation_arity :: Int + , _relation_members :: Set [ConId] -- all tuples + , _relation_indices :: Map (Int, ConId) (Set [ConId]) -- index + } +``` + +The index maps `(position, value)` pairs to tuples containing that value at that position, enabling efficient lookups. + +**Database:** + +```haskell +data Database = Database + { _database_universe :: Set ConId -- Herbrand universe + , _database_relations :: Map RelId Relation + , _database_rules :: Map RelId [Rule] + } +``` + +##### Evaluation Algorithm + +1. **`evalAtomDb`**: Query an atom against the database + - Uses indices to find candidate tuples + - Computes intersection of candidates for each bound position + - Returns set of variable bindings + +2. **`evalConjunction`**: Evaluate a conjunction (rule body) + - Evaluates each atom + - Joins bindings using `><` operator (relational join) + +3. **`immediateConsequences`**: Derive new facts from a rule + - Evaluate the body to get bindings + - Project bindings onto head atom terms + +4. **`extendDb`**: Single iteration of forward chaining + - Apply all rules to derive new facts + - Add new tuples to relations + +5. **`extendFixedpointDb`**: Iterate until no new facts + - Repeatedly call `extendDb` until database stops changing + +#### 4. Datalog.IO (`src/Datalog/IO.hs`) + +File I/O utilities: + +- `readProgram`: Parse a file into a `Program` +- `readDatabase`: Parse and load into a `Database` + +--- + +### Correctness Analysis + +#### Verified Correct Behaviors + +1. **Variable binding consistency**: The `mkBinding` function in `unsafeEvalAtomRel` correctly handles the case where the same variable appears + multiple times in an atom. It checks that all occurrences bind to the same value (lines 125-130). + +2. **Join semantics**: The `joinBindings` function correctly implements a consistent merge of two bindings, failing when variables are bound to + different values. + +3. **Arity checking**: The implementation properly tracks and validates relation arities, preventing arity mismatches. + +4. **Fixed-point termination**: Since Datalog is guaranteed to terminate (finite Herbrand universe, monotonic derivation), the fixed-point iteration + is correct. + +5. **Index maintenance**: Indices are properly maintained during insertion via `insertInRelation`. + +#### Potential Issues + +1. **`projectTuple` with unbound variables** (lines 147-157): + + When the head of a rule contains a variable not appearing in the body, `projectTuple` generates all possible values from the Herbrand universe. + This is semantically correct for Datalog (the variable ranges over all constants), but: + - It can cause a combinatorial explosion + - This is an unusual Datalog pattern (typically all head variables appear in the body) + + Example: `foo(X) :- bar(Y).` would generate `foo(c)` for every constant `c` in the universe. + +2. **Naive evaluation strategy**: + + The implementation uses naive bottom-up evaluation, which re-derives all facts at each iteration. This is correct but inefficient compared to + semi-naive evaluation. + +3. **Unsafe partial function**: + + `insertManyInRelationDb` uses `error` on line 51 if the relation doesn't exist. However, this should never occur in practice due to prior arity + checks. + +4. **Empty atom arguments**: + + The parser allows zero-argument atoms `foo()`, and evaluation handles them correctly (they act as propositional facts). + +--- + +### Strengths + +#### 1. Clean Haskell Design + +- Proper use of newtypes for type safety (`ConId`, `VarId`, `RelId`) +- Effective use of typeclasses (`HasConstants`, `Pretty`) +- Clear separation of concerns between modules +- Appropriate use of `Either` for error handling + +#### 2. Efficient Indexing + +- Position-based indexing enables efficient lookups +- Intersection of index results for atoms with multiple constants +- Avoids full table scans when constants are present in queries + +#### 3. Correct Semantics + +- Proper handling of variable binding consistency +- Correct relational join semantics +- Sound fixed-point computation + +#### 4. Good Parser Design + +- Uses Megaparsec for good error messages +- Supports both Haskell-style comments (`--` and `{- -}`) +- Clean lexeme handling with whitespace consumption + +#### 5. Extensible Architecture + +- Easy to add new syntax constructs +- Database structure supports multiple relations +- Error types are well-defined and informative + +--- + +### Limitations + +#### 1. Performance Limitations + +| Issue | Impact | Potential Fix | +|-----------------------|----------------------------------------------------|----------------------------------| +| Naive evaluation | Re-derives all facts each iteration | Implement semi-naive evaluation | +| No query optimization | Joins evaluated left-to-right | Add query planner | +| Full materialization | All derived facts stored | Support demand-driven evaluation | +| No stratification | Potential issues with negation (not yet supported) | Add stratification analysis | + +#### 2. Missing Datalog Features + +| Feature | Status | Description | +|---------------------|-----------------|------------------------------| +| Negation | Not supported | `not p(X)` in rule bodies | +| Aggregation | Not supported | `count`, `sum`, `min`, `max` | +| Built-in predicates | Not supported | `X < Y`, `X = Y + 1` | +| Stratification | Not applicable | Would be needed for negation | +| Magic sets | Not implemented | Query optimization technique | +| Tabling/Memoization | Basic only | Via relation storage | + +#### 3. Usability Limitations + +- **No REPL**: No interactive query interface +- **No CLI executable**: Executable is commented out in cabal file +- **Limited error messages**: Parsing errors are good (Megaparsec), but evaluation errors are basic +- **No source locations**: Error messages don't include line numbers for semantic errors +- **No pretty-printed output**: Query results shown as Haskell data structures + +#### 4. Safety/Edge Cases + +- **Unsafe head variables**: Variables in rule heads not appearing in bodies cause universe enumeration +- **No recursion limit**: Deeply recursive programs could stack overflow +- **No fact limit**: Large derived fact sets could exhaust memory + +#### 5. Testing Gaps + +- Only parser golden tests are active +- Database and fixed-point tests are commented out +- No property-based testing +- No negative test cases (expected failures) + +--- + +### Example Programs + +#### Transitive Closure (ancestor.dl) + +```datalog +parent(xerces, brooke). +parent(brooke, damocles). + +ancestor(X, Y) :- parent(X, Y). +ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y). +``` + +Derives: `ancestor(xerces, brooke)`, `ancestor(brooke, damocles)`, `ancestor(xerces, damocles)` + +#### Symmetric Closure with Parity (graph.dl) + +```datalog +odd(X,Y) :- r(X,Y). +odd(X,Y) :- even(X,Z), r(Z,Y). +even(X,Y) :- odd(X,Z), r(Z,Y). + +r(0,1). r(1,2). r(2,3). r(3,4). r(4,5). +r(X,Y) :- r(Y,X). +``` + +Computes paths of odd/even length in a bidirectional graph. + +--- + +### Recommendations for Future Development + +#### High Priority + +1. **Enable the executable** with a simple REPL or query interface +2. **Implement semi-naive evaluation** for significant performance improvement +3. **Add more tests** including evaluation tests and property tests +4. **Add source locations** to AST for better error messages + +#### Medium Priority + +5. **Support negation** with stratification checking +6. **Add built-in predicates** for arithmetic comparisons +7. **Implement query optimization** (join reordering) +8. **Add aggregation** support + +#### Nice to Have + +9. **Magic sets transformation** for goal-directed evaluation +10. **Incremental evaluation** for database updates +11. **Parallel evaluation** of independent rules +12. **External data sources** (CSV, databases) + +--- + +### Conclusion + +toy-datalog is a well-structured, pedagogically clear implementation of core Datalog semantics. It correctly implements the fundamental operations of +bottom-up evaluation with proper variable binding and fixed-point computation. While it lacks advanced features like negation, aggregation, and query +optimization, it provides a solid foundation for learning about Datalog and could be extended into a more full-featured implementation. + +The code quality is high, with good use of Haskell idioms and proper error handling. The main limitations are performance (naive evaluation) and +missing features rather than correctness issues. + +## Changelog + +* **Mar 4, 2026** -- The first version was created.