358 lines
15 KiB
Markdown
358 lines
15 KiB
Markdown
|
|
## toy-datalog: Implementation Notes
|
||
|
|
|
||
|
|
### Overview
|
||
|
|
|
||
|
|
**toy-datalog** is a Haskell implementation of a Datalog interpreter. Datalog is a declarative logic programming language that is a syntactic subset
|
||
|
|
of Prolog, commonly used for:
|
||
|
|
|
||
|
|
- Database query languages (it's the theoretical foundation for SQL's recursive queries)
|
||
|
|
- Static program analysis
|
||
|
|
- Knowledge representation and reasoning
|
||
|
|
- Access control policies
|
||
|
|
|
||
|
|
This implementation provides parsing, evaluation, and fixed-point computation for Datalog programs.
|
||
|
|
|
||
|
|
**Author:** Cale Gibbard
|
||
|
|
**License:** BSD-2-Clause
|
||
|
|
**Version:** 0.1.0.0
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
### Architecture
|
||
|
|
|
||
|
|
```
|
||
|
|
┌─────────────────────────────────────────────────────────────┐
|
||
|
|
│ Datalog Program (Text) │
|
||
|
|
└─────────────────────────────────────────────────────────────┘
|
||
|
|
│
|
||
|
|
▼
|
||
|
|
┌─────────────────────────────────────────────────────────────┐
|
||
|
|
│ Parser (Megaparsec) │
|
||
|
|
│ • parseTerm, parseAtom, parseRule, parseProgram │
|
||
|
|
└─────────────────────────────────────────────────────────────┘
|
||
|
|
│
|
||
|
|
▼
|
||
|
|
┌─────────────────────────────────────────────────────────────┐
|
||
|
|
│ Abstract Syntax Tree │
|
||
|
|
│ • Program = [Rule] │
|
||
|
|
│ • Rule = Atom :- [Atom] │
|
||
|
|
│ • Atom = RelId [Term] │
|
||
|
|
│ • Term = Con ConId | Var VarId │
|
||
|
|
└─────────────────────────────────────────────────────────────┘
|
||
|
|
│
|
||
|
|
▼
|
||
|
|
┌─────────────────────────────────────────────────────────────┐
|
||
|
|
│ Database │
|
||
|
|
│ • Herbrand Universe (all constants) │
|
||
|
|
│ • Relations (indexed tuples) │
|
||
|
|
│ • Rules (indexed by head relation) │
|
||
|
|
└─────────────────────────────────────────────────────────────┘
|
||
|
|
│
|
||
|
|
▼
|
||
|
|
┌─────────────────────────────────────────────────────────────┐
|
||
|
|
│ Evaluation Engine │
|
||
|
|
│ • Forward chaining (bottom-up) │
|
||
|
|
│ • Fixed-point iteration │
|
||
|
|
│ • Variable binding and unification │
|
||
|
|
└─────────────────────────────────────────────────────────────┘
|
||
|
|
│
|
||
|
|
▼
|
||
|
|
┌─────────────────────────────────────────────────────────────┐
|
||
|
|
│ Derived Facts (Complete Database) │
|
||
|
|
└─────────────────────────────────────────────────────────────┘
|
||
|
|
```
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
### Module Analysis
|
||
|
|
|
||
|
|
#### 1. Datalog.Syntax (`src/Datalog/Syntax.hs`)
|
||
|
|
|
||
|
|
Defines the core AST types:
|
||
|
|
|
||
|
|
| Type | Description | Example |
|
||
|
|
|-----------|-----------------------------------------------|---------------------------------|
|
||
|
|
| `ConId` | Constant identifier (lowercase/numeric start) | `xerces`, `0`, `brooke` |
|
||
|
|
| `VarId` | Variable identifier (uppercase start) | `X`, `Y`, `Parent` |
|
||
|
|
| `RelId` | Relation/predicate name | `parent`, `ancestor` |
|
||
|
|
| `Term` | Either a constant or variable | `Con "xerces"`, `Var "X"` |
|
||
|
|
| `Atom` | Relation applied to terms | `parent(X, Y)` |
|
||
|
|
| `Rule` | Head atom with body atoms | `ancestor(X,Y) :- parent(X,Y).` |
|
||
|
|
| `Program` | Collection of rules | Full Datalog program |
|
||
|
|
|
||
|
|
**Key Features:**
|
||
|
|
|
||
|
|
- `HasConstants` typeclass for extracting constants from any Datalog structure
|
||
|
|
- `Pretty` typeclass for converting AST back to readable syntax
|
||
|
|
- Smart constructor `term` that auto-detects constants vs variables by case
|
||
|
|
|
||
|
|
#### 2. Datalog.Parser (`src/Datalog/Parser.hs`)
|
||
|
|
|
||
|
|
Megaparsec-based parser with:
|
||
|
|
|
||
|
|
- **Lexical conventions:**
|
||
|
|
- Whitespace handling (spaces, line comments `--`, block comments `{- -}`)
|
||
|
|
- Constants: start with lowercase or digit
|
||
|
|
- Variables: start with uppercase
|
||
|
|
- Relations: start with lowercase
|
||
|
|
|
||
|
|
- **Grammar:**
|
||
|
|
```
|
||
|
|
program ::= rule*
|
||
|
|
rule ::= atom '.' (fact)
|
||
|
|
| atom ':-' query '.' (rule with body)
|
||
|
|
query ::= atom (',' atom)*
|
||
|
|
atom ::= relId '(' term (',' term)* ')'
|
||
|
|
term ::= variable | constant
|
||
|
|
```
|
||
|
|
|
||
|
|
#### 3. Datalog.Eval (`src/Datalog/Eval.hs`)
|
||
|
|
|
||
|
|
The evaluation engine implements **naive bottom-up evaluation** (forward chaining).
|
||
|
|
|
||
|
|
##### Data Structures
|
||
|
|
|
||
|
|
**Relation:**
|
||
|
|
|
||
|
|
```haskell
|
||
|
|
data Relation = Relation
|
||
|
|
{ _relation_arity :: Int
|
||
|
|
, _relation_members :: Set [ConId] -- all tuples
|
||
|
|
, _relation_indices :: Map (Int, ConId) (Set [ConId]) -- index
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
The index maps `(position, value)` pairs to tuples containing that value at that position, enabling efficient lookups.
|
||
|
|
|
||
|
|
**Database:**
|
||
|
|
|
||
|
|
```haskell
|
||
|
|
data Database = Database
|
||
|
|
{ _database_universe :: Set ConId -- Herbrand universe
|
||
|
|
, _database_relations :: Map RelId Relation
|
||
|
|
, _database_rules :: Map RelId [Rule]
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
##### Evaluation Algorithm
|
||
|
|
|
||
|
|
1. **`evalAtomDb`**: Query an atom against the database
|
||
|
|
- Uses indices to find candidate tuples
|
||
|
|
- Computes intersection of candidates for each bound position
|
||
|
|
- Returns set of variable bindings
|
||
|
|
|
||
|
|
2. **`evalConjunction`**: Evaluate a conjunction (rule body)
|
||
|
|
- Evaluates each atom
|
||
|
|
- Joins bindings using `><` operator (relational join)
|
||
|
|
|
||
|
|
3. **`immediateConsequences`**: Derive new facts from a rule
|
||
|
|
- Evaluate the body to get bindings
|
||
|
|
- Project bindings onto head atom terms
|
||
|
|
|
||
|
|
4. **`extendDb`**: Single iteration of forward chaining
|
||
|
|
- Apply all rules to derive new facts
|
||
|
|
- Add new tuples to relations
|
||
|
|
|
||
|
|
5. **`extendFixedpointDb`**: Iterate until no new facts
|
||
|
|
- Repeatedly call `extendDb` until database stops changing
|
||
|
|
|
||
|
|
#### 4. Datalog.IO (`src/Datalog/IO.hs`)
|
||
|
|
|
||
|
|
File I/O utilities:
|
||
|
|
|
||
|
|
- `readProgram`: Parse a file into a `Program`
|
||
|
|
- `readDatabase`: Parse and load into a `Database`
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
### Correctness Analysis
|
||
|
|
|
||
|
|
#### Verified Correct Behaviors
|
||
|
|
|
||
|
|
1. **Variable binding consistency**: The `mkBinding` function in `unsafeEvalAtomRel` correctly handles the case where the same variable appears
|
||
|
|
multiple times in an atom. It checks that all occurrences bind to the same value (lines 125-130).
|
||
|
|
|
||
|
|
2. **Join semantics**: The `joinBindings` function correctly implements a consistent merge of two bindings, failing when variables are bound to
|
||
|
|
different values.
|
||
|
|
|
||
|
|
3. **Arity checking**: The implementation properly tracks and validates relation arities, preventing arity mismatches.
|
||
|
|
|
||
|
|
4. **Fixed-point termination**: Since Datalog is guaranteed to terminate (finite Herbrand universe, monotonic derivation), the fixed-point iteration
|
||
|
|
is correct.
|
||
|
|
|
||
|
|
5. **Index maintenance**: Indices are properly maintained during insertion via `insertInRelation`.
|
||
|
|
|
||
|
|
#### Potential Issues
|
||
|
|
|
||
|
|
1. **`projectTuple` with unbound variables** (lines 147-157):
|
||
|
|
|
||
|
|
When the head of a rule contains a variable not appearing in the body, `projectTuple` generates all possible values from the Herbrand universe.
|
||
|
|
This is semantically correct for Datalog (the variable ranges over all constants), but:
|
||
|
|
- It can cause a combinatorial explosion
|
||
|
|
- This is an unusual Datalog pattern (typically all head variables appear in the body)
|
||
|
|
|
||
|
|
Example: `foo(X) :- bar(Y).` would generate `foo(c)` for every constant `c` in the universe.
|
||
|
|
|
||
|
|
2. **Naive evaluation strategy**:
|
||
|
|
|
||
|
|
The implementation uses naive bottom-up evaluation, which re-derives all facts at each iteration. This is correct but inefficient compared to
|
||
|
|
semi-naive evaluation.
|
||
|
|
|
||
|
|
3. **Unsafe partial function**:
|
||
|
|
|
||
|
|
`insertManyInRelationDb` uses `error` on line 51 if the relation doesn't exist. However, this should never occur in practice due to prior arity
|
||
|
|
checks.
|
||
|
|
|
||
|
|
4. **Empty atom arguments**:
|
||
|
|
|
||
|
|
The parser allows zero-argument atoms `foo()`, and evaluation handles them correctly (they act as propositional facts).
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
### Strengths
|
||
|
|
|
||
|
|
#### 1. Clean Haskell Design
|
||
|
|
|
||
|
|
- Proper use of newtypes for type safety (`ConId`, `VarId`, `RelId`)
|
||
|
|
- Effective use of typeclasses (`HasConstants`, `Pretty`)
|
||
|
|
- Clear separation of concerns between modules
|
||
|
|
- Appropriate use of `Either` for error handling
|
||
|
|
|
||
|
|
#### 2. Efficient Indexing
|
||
|
|
|
||
|
|
- Position-based indexing enables efficient lookups
|
||
|
|
- Intersection of index results for atoms with multiple constants
|
||
|
|
- Avoids full table scans when constants are present in queries
|
||
|
|
|
||
|
|
#### 3. Correct Semantics
|
||
|
|
|
||
|
|
- Proper handling of variable binding consistency
|
||
|
|
- Correct relational join semantics
|
||
|
|
- Sound fixed-point computation
|
||
|
|
|
||
|
|
#### 4. Good Parser Design
|
||
|
|
|
||
|
|
- Uses Megaparsec for good error messages
|
||
|
|
- Supports both Haskell-style comments (`--` and `{- -}`)
|
||
|
|
- Clean lexeme handling with whitespace consumption
|
||
|
|
|
||
|
|
#### 5. Extensible Architecture
|
||
|
|
|
||
|
|
- Easy to add new syntax constructs
|
||
|
|
- Database structure supports multiple relations
|
||
|
|
- Error types are well-defined and informative
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
### Limitations
|
||
|
|
|
||
|
|
#### 1. Performance Limitations
|
||
|
|
|
||
|
|
| Issue | Impact | Potential Fix |
|
||
|
|
|-----------------------|----------------------------------------------------|----------------------------------|
|
||
|
|
| Naive evaluation | Re-derives all facts each iteration | Implement semi-naive evaluation |
|
||
|
|
| No query optimization | Joins evaluated left-to-right | Add query planner |
|
||
|
|
| Full materialization | All derived facts stored | Support demand-driven evaluation |
|
||
|
|
| No stratification | Potential issues with negation (not yet supported) | Add stratification analysis |
|
||
|
|
|
||
|
|
#### 2. Missing Datalog Features
|
||
|
|
|
||
|
|
| Feature | Status | Description |
|
||
|
|
|---------------------|-----------------|------------------------------|
|
||
|
|
| Negation | Not supported | `not p(X)` in rule bodies |
|
||
|
|
| Aggregation | Not supported | `count`, `sum`, `min`, `max` |
|
||
|
|
| Built-in predicates | Not supported | `X < Y`, `X = Y + 1` |
|
||
|
|
| Stratification | Not applicable | Would be needed for negation |
|
||
|
|
| Magic sets | Not implemented | Query optimization technique |
|
||
|
|
| Tabling/Memoization | Basic only | Via relation storage |
|
||
|
|
|
||
|
|
#### 3. Usability Limitations
|
||
|
|
|
||
|
|
- **No REPL**: No interactive query interface
|
||
|
|
- **No CLI executable**: Executable is commented out in cabal file
|
||
|
|
- **Limited error messages**: Parsing errors are good (Megaparsec), but evaluation errors are basic
|
||
|
|
- **No source locations**: Error messages don't include line numbers for semantic errors
|
||
|
|
- **No pretty-printed output**: Query results shown as Haskell data structures
|
||
|
|
|
||
|
|
#### 4. Safety/Edge Cases
|
||
|
|
|
||
|
|
- **Unsafe head variables**: Variables in rule heads not appearing in bodies cause universe enumeration
|
||
|
|
- **No recursion limit**: Deeply recursive programs could stack overflow
|
||
|
|
- **No fact limit**: Large derived fact sets could exhaust memory
|
||
|
|
|
||
|
|
#### 5. Testing Gaps
|
||
|
|
|
||
|
|
- Only parser golden tests are active
|
||
|
|
- Database and fixed-point tests are commented out
|
||
|
|
- No property-based testing
|
||
|
|
- No negative test cases (expected failures)
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
### Example Programs
|
||
|
|
|
||
|
|
#### Transitive Closure (ancestor.dl)
|
||
|
|
|
||
|
|
```datalog
|
||
|
|
parent(xerces, brooke).
|
||
|
|
parent(brooke, damocles).
|
||
|
|
|
||
|
|
ancestor(X, Y) :- parent(X, Y).
|
||
|
|
ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y).
|
||
|
|
```
|
||
|
|
|
||
|
|
Derives: `ancestor(xerces, brooke)`, `ancestor(brooke, damocles)`, `ancestor(xerces, damocles)`
|
||
|
|
|
||
|
|
#### Symmetric Closure with Parity (graph.dl)
|
||
|
|
|
||
|
|
```datalog
|
||
|
|
odd(X,Y) :- r(X,Y).
|
||
|
|
odd(X,Y) :- even(X,Z), r(Z,Y).
|
||
|
|
even(X,Y) :- odd(X,Z), r(Z,Y).
|
||
|
|
|
||
|
|
r(0,1). r(1,2). r(2,3). r(3,4). r(4,5).
|
||
|
|
r(X,Y) :- r(Y,X).
|
||
|
|
```
|
||
|
|
|
||
|
|
Computes paths of odd/even length in a bidirectional graph.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
### Recommendations for Future Development
|
||
|
|
|
||
|
|
#### High Priority
|
||
|
|
|
||
|
|
1. **Enable the executable** with a simple REPL or query interface
|
||
|
|
2. **Implement semi-naive evaluation** for significant performance improvement
|
||
|
|
3. **Add more tests** including evaluation tests and property tests
|
||
|
|
4. **Add source locations** to AST for better error messages
|
||
|
|
|
||
|
|
#### Medium Priority
|
||
|
|
|
||
|
|
5. **Support negation** with stratification checking
|
||
|
|
6. **Add built-in predicates** for arithmetic comparisons
|
||
|
|
7. **Implement query optimization** (join reordering)
|
||
|
|
8. **Add aggregation** support
|
||
|
|
|
||
|
|
#### Nice to Have
|
||
|
|
|
||
|
|
9. **Magic sets transformation** for goal-directed evaluation
|
||
|
|
10. **Incremental evaluation** for database updates
|
||
|
|
11. **Parallel evaluation** of independent rules
|
||
|
|
12. **External data sources** (CSV, databases)
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
### Conclusion
|
||
|
|
|
||
|
|
toy-datalog is a well-structured, pedagogically clear implementation of core Datalog semantics. It correctly implements the fundamental operations of
|
||
|
|
bottom-up evaluation with proper variable binding and fixed-point computation. While it lacks advanced features like negation, aggregation, and query
|
||
|
|
optimization, it provides a solid foundation for learning about Datalog and could be extended into a more full-featured implementation.
|
||
|
|
|
||
|
|
The code quality is high, with good use of Haskell idioms and proper error handling. The main limitations are performance (naive evaluation) and
|
||
|
|
missing features rather than correctness issues.
|
||
|
|
|
||
|
|
## Changelog
|
||
|
|
|
||
|
|
* **Mar 4, 2026** -- The first version was created.
|