useful-notes/hassan/006-cales-datalog.md

## toy-datalog: Implementation Notes

### Overview

**toy-datalog** is a Haskell implementation of a Datalog interpreter. Datalog is a declarative logic programming language that is a syntactic subset
of Prolog, commonly used for:

- Database query languages (it's the theoretical foundation for SQL's recursive queries)
- Static program analysis
- Knowledge representation and reasoning
- Access control policies

This implementation provides parsing, evaluation, and fixed-point computation for Datalog programs.

**Author:** Cale Gibbard
**License:** BSD-2-Clause
**Version:** 0.1.0.0

---

### Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                    Datalog Program (Text)                    │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                   Parser (Megaparsec)                        │
│  • parseTerm, parseAtom, parseRule, parseProgram            │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                     Abstract Syntax Tree                     │
│  • Program = [Rule]                                          │
│  • Rule = Atom :- [Atom]                                     │
│  • Atom = RelId [Term]                                       │
│  • Term = Con ConId | Var VarId                              │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                      Database                                │
│  • Herbrand Universe (all constants)                         │
│  • Relations (indexed tuples)                                │
│  • Rules (indexed by head relation)                          │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                  Evaluation Engine                           │
│  • Forward chaining (bottom-up)                              │
│  • Fixed-point iteration                                     │
│  • Variable binding and unification                          │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│              Derived Facts (Complete Database)               │
└─────────────────────────────────────────────────────────────┘
```

---

### Module Analysis

#### 1. Datalog.Syntax (`src/Datalog/Syntax.hs`)

Defines the core AST types:

| Type      | Description                                   | Example                         |
|-----------|-----------------------------------------------|---------------------------------|
| `ConId`   | Constant identifier (lowercase/numeric start) | `xerces`, `0`, `brooke`         |
| `VarId`   | Variable identifier (uppercase start)         | `X`, `Y`, `Parent`              |
| `RelId`   | Relation/predicate name                       | `parent`, `ancestor`            |
| `Term`    | Either a constant or variable                 | `Con "xerces"`, `Var "X"`       |
| `Atom`    | Relation applied to terms                     | `parent(X, Y)`                  |
| `Rule`    | Head atom with body atoms                     | `ancestor(X,Y) :- parent(X,Y).` |
| `Program` | Collection of rules                           | Full Datalog program            |

**Key Features:**

- `HasConstants` typeclass for extracting constants from any Datalog structure
- `Pretty` typeclass for converting AST back to readable syntax
- Smart constructor `term` that auto-detects constants vs variables by case

#### 2. Datalog.Parser (`src/Datalog/Parser.hs`)

Megaparsec-based parser with:

- **Lexical conventions:**
    - Whitespace handling (spaces, line comments `--`, block comments `{- -}`)
    - Constants: start with lowercase or digit
    - Variables: start with uppercase
    - Relations: start with lowercase

- **Grammar:**
  ```
  program  ::= rule*
  rule     ::= atom '.'                    (fact)
             | atom ':-' query '.'         (rule with body)
  query    ::= atom (',' atom)*
  atom     ::= relId '(' term (',' term)* ')'
  term     ::= variable | constant
  ```

#### 3. Datalog.Eval (`src/Datalog/Eval.hs`)

The evaluation engine implements **naive bottom-up evaluation** (forward chaining).

##### Data Structures

**Relation:**

```haskell
data Relation = Relation
  { _relation_arity   :: Int
  , _relation_members :: Set [ConId]           -- all tuples
  , _relation_indices :: Map (Int, ConId) (Set [ConId])  -- index
  }
```

The index maps `(position, value)` pairs to tuples containing that value at that position, enabling efficient lookups.

**Database:**

```haskell
data Database = Database
  { _database_universe  :: Set ConId        -- Herbrand universe
  , _database_relations :: Map RelId Relation
  , _database_rules     :: Map RelId [Rule]
  }
```

##### Evaluation Algorithm

1. **`evalAtomDb`**: Query an atom against the database
    - Uses indices to find candidate tuples
    - Computes intersection of candidates for each bound position
    - Returns set of variable bindings

2. **`evalConjunction`**: Evaluate a conjunction (rule body)
    - Evaluates each atom
    - Joins bindings using `><` operator (relational join)

3. **`immediateConsequences`**: Derive new facts from a rule
    - Evaluate the body to get bindings
    - Project bindings onto head atom terms

4. **`extendDb`**: Single iteration of forward chaining
    - Apply all rules to derive new facts
    - Add new tuples to relations

5. **`extendFixedpointDb`**: Iterate until no new facts
    - Repeatedly call `extendDb` until database stops changing

#### 4. Datalog.IO (`src/Datalog/IO.hs`)

File I/O utilities:

- `readProgram`: Parse a file into a `Program`
- `readDatabase`: Parse and load into a `Database`

---

### Correctness Analysis

#### Verified Correct Behaviors

1. **Variable binding consistency**: The `mkBinding` function in `unsafeEvalAtomRel` correctly handles the case where the same variable appears
   multiple times in an atom. It checks that all occurrences bind to the same value (lines 125-130).

2. **Join semantics**: The `joinBindings` function correctly implements a consistent merge of two bindings, failing when variables are bound to
   different values.

3. **Arity checking**: The implementation properly tracks and validates relation arities, preventing arity mismatches.

4. **Fixed-point termination**: Since Datalog is guaranteed to terminate (finite Herbrand universe, monotonic derivation), the fixed-point iteration
   is correct.

5. **Index maintenance**: Indices are properly maintained during insertion via `insertInRelation`.

#### Potential Issues

1. **`projectTuple` with unbound variables** (lines 147-157):

   When the head of a rule contains a variable not appearing in the body, `projectTuple` generates all possible values from the Herbrand universe.
   This is semantically correct for Datalog (the variable ranges over all constants), but:
    - It can cause a combinatorial explosion
    - This is an unusual Datalog pattern (typically all head variables appear in the body)

   Example: `foo(X) :- bar(Y).` would generate `foo(c)` for every constant `c` in the universe.

2. **Naive evaluation strategy**:

   The implementation uses naive bottom-up evaluation, which re-derives all facts at each iteration. This is correct but inefficient compared to
   semi-naive evaluation.

3. **Unsafe partial function**:

   `insertManyInRelationDb` uses `error` on line 51 if the relation doesn't exist. However, this should never occur in practice due to prior arity
   checks.

4. **Empty atom arguments**:

   The parser allows zero-argument atoms `foo()`, and evaluation handles them correctly (they act as propositional facts).

---

### Strengths

#### 1. Clean Haskell Design

- Proper use of newtypes for type safety (`ConId`, `VarId`, `RelId`)
- Effective use of typeclasses (`HasConstants`, `Pretty`)
- Clear separation of concerns between modules
- Appropriate use of `Either` for error handling

#### 2. Efficient Indexing

- Position-based indexing enables efficient lookups
- Intersection of index results for atoms with multiple constants
- Avoids full table scans when constants are present in queries

#### 3. Correct Semantics

- Proper handling of variable binding consistency
- Correct relational join semantics
- Sound fixed-point computation

#### 4. Good Parser Design

- Uses Megaparsec for good error messages
- Supports both Haskell-style comments (`--` and `{- -}`)
- Clean lexeme handling with whitespace consumption

#### 5. Extensible Architecture

- Easy to add new syntax constructs
- Database structure supports multiple relations
- Error types are well-defined and informative

---

### Limitations

#### 1. Performance Limitations

| Issue                 | Impact                                             | Potential Fix                    |
|-----------------------|----------------------------------------------------|----------------------------------|
| Naive evaluation      | Re-derives all facts each iteration                | Implement semi-naive evaluation  |
| No query optimization | Joins evaluated left-to-right                      | Add query planner                |
| Full materialization  | All derived facts stored                           | Support demand-driven evaluation |
| No stratification     | Potential issues with negation (not yet supported) | Add stratification analysis      |

#### 2. Missing Datalog Features

| Feature             | Status          | Description                  |
|---------------------|-----------------|------------------------------|
| Negation            | Not supported   | `not p(X)` in rule bodies    |
| Aggregation         | Not supported   | `count`, `sum`, `min`, `max` |
| Built-in predicates | Not supported   | `X < Y`, `X = Y + 1`         |
| Stratification      | Not applicable  | Would be needed for negation |
| Magic sets          | Not implemented | Query optimization technique |
| Tabling/Memoization | Basic only      | Via relation storage         |

#### 3. Usability Limitations

- **No REPL**: No interactive query interface
- **No CLI executable**: Executable is commented out in cabal file
- **Limited error messages**: Parsing errors are good (Megaparsec), but evaluation errors are basic
- **No source locations**: Error messages don't include line numbers for semantic errors
- **No pretty-printed output**: Query results shown as Haskell data structures

#### 4. Safety/Edge Cases

- **Unsafe head variables**: Variables in rule heads not appearing in bodies cause universe enumeration
- **No recursion limit**: Deeply recursive programs could stack overflow
- **No fact limit**: Large derived fact sets could exhaust memory

#### 5. Testing Gaps

- Only parser golden tests are active
- Database and fixed-point tests are commented out
- No property-based testing
- No negative test cases (expected failures)

---

### Example Programs

#### Transitive Closure (ancestor.dl)

```datalog
parent(xerces, brooke).
parent(brooke, damocles).

ancestor(X, Y) :- parent(X, Y).
ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y).
```

Derives: `ancestor(xerces, brooke)`, `ancestor(brooke, damocles)`, `ancestor(xerces, damocles)`

#### Symmetric Closure with Parity (graph.dl)

```datalog
odd(X,Y) :- r(X,Y).
odd(X,Y) :- even(X,Z), r(Z,Y).
even(X,Y) :- odd(X,Z), r(Z,Y).

r(0,1). r(1,2). r(2,3). r(3,4). r(4,5).
r(X,Y) :- r(Y,X).
```

Computes paths of odd/even length in a bidirectional graph.

---

### Recommendations for Future Development

#### High Priority

1. **Enable the executable** with a simple REPL or query interface
2. **Implement semi-naive evaluation** for significant performance improvement
3. **Add more tests** including evaluation tests and property tests
4. **Add source locations** to AST for better error messages

#### Medium Priority

5. **Support negation** with stratification checking
6. **Add built-in predicates** for arithmetic comparisons
7. **Implement query optimization** (join reordering)
8. **Add aggregation** support

#### Nice to Have

9. **Magic sets transformation** for goal-directed evaluation
10. **Incremental evaluation** for database updates
11. **Parallel evaluation** of independent rules
12. **External data sources** (CSV, databases)

---

### Conclusion

toy-datalog is a well-structured, pedagogically clear implementation of core Datalog semantics. It correctly implements the fundamental operations of
bottom-up evaluation with proper variable binding and fixed-point computation. While it lacks advanced features like negation, aggregation, and query
optimization, it provides a solid foundation for learning about Datalog and could be extended into a more full-featured implementation.

The code quality is high, with good use of Haskell idioms and proper error handling. The main limitations are performance (naive evaluation) and
missing features rather than correctness issues.

## Changelog

* **Mar 4, 2026** -- The first version was created.
Add notes about Cale's and Felix's Datalog implementations 2026-03-04 12:37:35 +01:00			`## toy-datalog: Implementation Notes`

			`### Overview`

			`toy-datalog is a Haskell implementation of a Datalog interpreter. Datalog is a declarative logic programming language that is a syntactic subset`
			`of Prolog, commonly used for:`

			`- Database query languages (it's the theoretical foundation for SQL's recursive queries)`
			`- Static program analysis`
			`- Knowledge representation and reasoning`
			`- Access control policies`

			`This implementation provides parsing, evaluation, and fixed-point computation for Datalog programs.`

			`Author: Cale Gibbard`
			`License: BSD-2-Clause`
			`Version: 0.1.0.0`

			`---`

			`### Architecture`

			```
			`┌─────────────────────────────────────────────────────────────┐`
			`│ Datalog Program (Text) │`
			`└─────────────────────────────────────────────────────────────┘`
			`│`
			`▼`
			`┌─────────────────────────────────────────────────────────────┐`
			`│ Parser (Megaparsec) │`
			`│ • parseTerm, parseAtom, parseRule, parseProgram │`
			`└─────────────────────────────────────────────────────────────┘`
			`│`
			`▼`
			`┌─────────────────────────────────────────────────────────────┐`
			`│ Abstract Syntax Tree │`
			`│ • Program = [Rule] │`
			`│ • Rule = Atom :- [Atom] │`
			`│ • Atom = RelId [Term] │`
			`│ • Term = Con ConId \| Var VarId │`
			`└─────────────────────────────────────────────────────────────┘`
			`│`
			`▼`
			`┌─────────────────────────────────────────────────────────────┐`
			`│ Database │`
			`│ • Herbrand Universe (all constants) │`
			`│ • Relations (indexed tuples) │`
			`│ • Rules (indexed by head relation) │`
			`└─────────────────────────────────────────────────────────────┘`
			`│`
			`▼`
			`┌─────────────────────────────────────────────────────────────┐`
			`│ Evaluation Engine │`
			`│ • Forward chaining (bottom-up) │`
			`│ • Fixed-point iteration │`
			`│ • Variable binding and unification │`
			`└─────────────────────────────────────────────────────────────┘`
			`│`
			`▼`
			`┌─────────────────────────────────────────────────────────────┐`
			`│ Derived Facts (Complete Database) │`
			`└─────────────────────────────────────────────────────────────┘`
			```

			`---`

			`### Module Analysis`

			#### 1. Datalog.Syntax (`src/Datalog/Syntax.hs`)

			`Defines the core AST types:`

			`\| Type \| Description \| Example \|`
			`\|-----------\|-----------------------------------------------\|---------------------------------\|`
			\| `ConId` \| Constant identifier (lowercase/numeric start) \| `xerces`, `0`, `brooke` \|
			\| `VarId` \| Variable identifier (uppercase start) \| `X`, `Y`, `Parent` \|
			\| `RelId` \| Relation/predicate name \| `parent`, `ancestor` \|
			\| `Term` \| Either a constant or variable \| `Con "xerces"`, `Var "X"` \|
			\| `Atom` \| Relation applied to terms \| `parent(X, Y)` \|
			\| `Rule` \| Head atom with body atoms \| `ancestor(X,Y) :- parent(X,Y).` \|
			\| `Program` \| Collection of rules \| Full Datalog program \|

			`Key Features:`

			- `HasConstants` typeclass for extracting constants from any Datalog structure
			- `Pretty` typeclass for converting AST back to readable syntax
			- Smart constructor `term` that auto-detects constants vs variables by case

			#### 2. Datalog.Parser (`src/Datalog/Parser.hs`)

			`Megaparsec-based parser with:`

			`- Lexical conventions:`
			- Whitespace handling (spaces, line comments `--`, block comments `{- -}`)
			`- Constants: start with lowercase or digit`
			`- Variables: start with uppercase`
			`- Relations: start with lowercase`

			`- Grammar:`
			```
			`program ::= rule*`
			`rule ::= atom '.' (fact)`
			`\| atom ':-' query '.' (rule with body)`
			`query ::= atom (',' atom)*`
			`atom ::= relId '(' term (',' term)* ')'`
			`term ::= variable \| constant`
			```

			#### 3. Datalog.Eval (`src/Datalog/Eval.hs`)

			`The evaluation engine implements naive bottom-up evaluation (forward chaining).`

			`##### Data Structures`

			`Relation:`

			```haskell
			`data Relation = Relation`
			`{ _relation_arity :: Int`
			`, _relation_members :: Set [ConId] -- all tuples`
			`, _relation_indices :: Map (Int, ConId) (Set [ConId]) -- index`
			`}`
			```

			The index maps `(position, value)` pairs to tuples containing that value at that position, enabling efficient lookups.

			`Database:`

			```haskell
			`data Database = Database`
			`{ _database_universe :: Set ConId -- Herbrand universe`
			`, _database_relations :: Map RelId Relation`
			`, _database_rules :: Map RelId [Rule]`
			`}`
			```

			`##### Evaluation Algorithm`

			1. `evalAtomDb`: Query an atom against the database
			`- Uses indices to find candidate tuples`
			`- Computes intersection of candidates for each bound position`
			`- Returns set of variable bindings`

			2. `evalConjunction`: Evaluate a conjunction (rule body)
			`- Evaluates each atom`
			- Joins bindings using `><` operator (relational join)

			3. `immediateConsequences`: Derive new facts from a rule
			`- Evaluate the body to get bindings`
			`- Project bindings onto head atom terms`

			4. `extendDb`: Single iteration of forward chaining
			`- Apply all rules to derive new facts`
			`- Add new tuples to relations`

			5. `extendFixedpointDb`: Iterate until no new facts
			- Repeatedly call `extendDb` until database stops changing

			#### 4. Datalog.IO (`src/Datalog/IO.hs`)

			`File I/O utilities:`

			- `readProgram`: Parse a file into a `Program`
			- `readDatabase`: Parse and load into a `Database`

			`---`

			`### Correctness Analysis`

			`#### Verified Correct Behaviors`

			1. Variable binding consistency: The `mkBinding` function in `unsafeEvalAtomRel` correctly handles the case where the same variable appears
			`multiple times in an atom. It checks that all occurrences bind to the same value (lines 125-130).`

			2. Join semantics: The `joinBindings` function correctly implements a consistent merge of two bindings, failing when variables are bound to
			`different values.`

			`3. Arity checking: The implementation properly tracks and validates relation arities, preventing arity mismatches.`

			`4. Fixed-point termination: Since Datalog is guaranteed to terminate (finite Herbrand universe, monotonic derivation), the fixed-point iteration`
			`is correct.`

			5. Index maintenance: Indices are properly maintained during insertion via `insertInRelation`.

			`#### Potential Issues`

			1. `projectTuple` with unbound variables (lines 147-157):

			When the head of a rule contains a variable not appearing in the body, `projectTuple` generates all possible values from the Herbrand universe.
			`This is semantically correct for Datalog (the variable ranges over all constants), but:`
			`- It can cause a combinatorial explosion`
			`- This is an unusual Datalog pattern (typically all head variables appear in the body)`

			Example: `foo(X) :- bar(Y).` would generate `foo(c)` for every constant `c` in the universe.

			`2. Naive evaluation strategy:`

			`The implementation uses naive bottom-up evaluation, which re-derives all facts at each iteration. This is correct but inefficient compared to`
			`semi-naive evaluation.`

			`3. Unsafe partial function:`

			`insertManyInRelationDb` uses `error` on line 51 if the relation doesn't exist. However, this should never occur in practice due to prior arity
			`checks.`

			`4. Empty atom arguments:`

			The parser allows zero-argument atoms `foo()`, and evaluation handles them correctly (they act as propositional facts).

			`---`

			`### Strengths`

			`#### 1. Clean Haskell Design`

			- Proper use of newtypes for type safety (`ConId`, `VarId`, `RelId`)
			- Effective use of typeclasses (`HasConstants`, `Pretty`)
			`- Clear separation of concerns between modules`
			- Appropriate use of `Either` for error handling

			`#### 2. Efficient Indexing`

			`- Position-based indexing enables efficient lookups`
			`- Intersection of index results for atoms with multiple constants`
			`- Avoids full table scans when constants are present in queries`

			`#### 3. Correct Semantics`

			`- Proper handling of variable binding consistency`
			`- Correct relational join semantics`
			`- Sound fixed-point computation`

			`#### 4. Good Parser Design`

			`- Uses Megaparsec for good error messages`
			- Supports both Haskell-style comments (`--` and `{- -}`)
			`- Clean lexeme handling with whitespace consumption`

			`#### 5. Extensible Architecture`

			`- Easy to add new syntax constructs`
			`- Database structure supports multiple relations`
			`- Error types are well-defined and informative`

			`---`

			`### Limitations`

			`#### 1. Performance Limitations`

			`\| Issue \| Impact \| Potential Fix \|`
			`\|-----------------------\|----------------------------------------------------\|----------------------------------\|`
			`\| Naive evaluation \| Re-derives all facts each iteration \| Implement semi-naive evaluation \|`
			`\| No query optimization \| Joins evaluated left-to-right \| Add query planner \|`
			`\| Full materialization \| All derived facts stored \| Support demand-driven evaluation \|`
			`\| No stratification \| Potential issues with negation (not yet supported) \| Add stratification analysis \|`

			`#### 2. Missing Datalog Features`

			`\| Feature \| Status \| Description \|`
			`\|---------------------\|-----------------\|------------------------------\|`
			\| Negation \| Not supported \| `not p(X)` in rule bodies \|
			\| Aggregation \| Not supported \| `count`, `sum`, `min`, `max` \|
			\| Built-in predicates \| Not supported \| `X < Y`, `X = Y + 1` \|
			`\| Stratification \| Not applicable \| Would be needed for negation \|`
			`\| Magic sets \| Not implemented \| Query optimization technique \|`
			`\| Tabling/Memoization \| Basic only \| Via relation storage \|`

			`#### 3. Usability Limitations`

			`- No REPL: No interactive query interface`
			`- No CLI executable: Executable is commented out in cabal file`
			`- Limited error messages: Parsing errors are good (Megaparsec), but evaluation errors are basic`
			`- No source locations: Error messages don't include line numbers for semantic errors`
			`- No pretty-printed output: Query results shown as Haskell data structures`

			`#### 4. Safety/Edge Cases`

			`- Unsafe head variables: Variables in rule heads not appearing in bodies cause universe enumeration`
			`- No recursion limit: Deeply recursive programs could stack overflow`
			`- No fact limit: Large derived fact sets could exhaust memory`

			`#### 5. Testing Gaps`

			`- Only parser golden tests are active`
			`- Database and fixed-point tests are commented out`
			`- No property-based testing`
			`- No negative test cases (expected failures)`

			`---`

			`### Example Programs`

			`#### Transitive Closure (ancestor.dl)`

			```datalog
			`parent(xerces, brooke).`
			`parent(brooke, damocles).`

			`ancestor(X, Y) :- parent(X, Y).`
			`ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y).`
			```

			Derives: `ancestor(xerces, brooke)`, `ancestor(brooke, damocles)`, `ancestor(xerces, damocles)`

			`#### Symmetric Closure with Parity (graph.dl)`

			```datalog
			`odd(X,Y) :- r(X,Y).`
			`odd(X,Y) :- even(X,Z), r(Z,Y).`
			`even(X,Y) :- odd(X,Z), r(Z,Y).`

			`r(0,1). r(1,2). r(2,3). r(3,4). r(4,5).`
			`r(X,Y) :- r(Y,X).`
			```

			`Computes paths of odd/even length in a bidirectional graph.`

			`---`

			`### Recommendations for Future Development`

			`#### High Priority`

			`1. Enable the executable with a simple REPL or query interface`
			`2. Implement semi-naive evaluation for significant performance improvement`
			`3. Add more tests including evaluation tests and property tests`
			`4. Add source locations to AST for better error messages`

			`#### Medium Priority`

			`5. Support negation with stratification checking`
			`6. Add built-in predicates for arithmetic comparisons`
			`7. Implement query optimization (join reordering)`
			`8. Add aggregation support`

			`#### Nice to Have`

			`9. Magic sets transformation for goal-directed evaluation`
			`10. Incremental evaluation for database updates`
			`11. Parallel evaluation of independent rules`
			`12. External data sources (CSV, databases)`

			`---`

			`### Conclusion`

			`toy-datalog is a well-structured, pedagogically clear implementation of core Datalog semantics. It correctly implements the fundamental operations of`
			`bottom-up evaluation with proper variable binding and fixed-point computation. While it lacks advanced features like negation, aggregation, and query`
			`optimization, it provides a solid foundation for learning about Datalog and could be extended into a more full-featured implementation.`

			`The code quality is high, with good use of Haskell idioms and proper error handling. The main limitations are performance (naive evaluation) and`
			`missing features rather than correctness issues.`

			`## Changelog`

			`* Mar 4, 2026 -- The first version was created.`