Add notes about Cale's and Felix's Datalog implementations

This commit is contained in:
Hassan Abedi 2026-03-04 12:37:35 +01:00
parent 9ae3bc9b14
commit 654012f2ef
2 changed files with 804 additions and 0 deletions

## felix-db: Project Internals
This document provides detailed technical documentation of the felix-db Datalog implementation.
### Overview
felix-db is an experimental Datalog implementation in Haskell. Datalog is a declarative logic programming language (a subset of Prolog) commonly used
for database queries, program analysis, and reasoning systems.
**Project Status:** The implementation is correct and fully functional for positive Datalog (without negation in rules).
### Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│ User Input │
│ (Datalog text/files) │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ DatalogParser.hs │
│ (Megaparsec-based parser) │
│ Text → Statement (Fact | Rule | Query) │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ DatalogDB.hs │
│ (Type class + core data structures) │
│ Relation, RelationRule, RuleElement │
└─────────────────────────────────────────────────────────────────┘
┌───────────────┴───────────────┐
▼ ▼
┌─────────────────────────┐ ┌─────────────────────────────────┐
│ InMemoryDB.hs │ │ Rules.hs │
│ (Concrete DB impl) │ │ (Rule digestion logic) │
│ relations + constants │ │ digestHead, digestBody │
└─────────────────────────┘ └─────────────────────────────────┘
│ │
└───────────────┬───────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ NaiveQE.hs │
│ (Naive Query Engine - bottom-up) │
│ Fixed-point computation of Herbrand model │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ DigestedQuery.hs │
│ (Query preprocessing) │
│ Query → DigestedQuery with indexed variables │
└─────────────────────────────────────────────────────────────────┘
```
### Module Details
#### 1. DatalogParser.hs
**Purpose:** Parses Datalog syntax into an Abstract Syntax Tree (AST).
**Parser Type:** Uses Megaparsec (`Parsec Void Text`).
##### Core Types
```haskell
data Term
= Var Text -- Variables: X, Person, Y1 (uppercase start)
| Sym Text -- Symbols: alice, "london", uk (lowercase/quoted)
| Num Integer
data Literal = Literal
{ positive :: Bool -- True = positive, False = negated
, predName :: Text -- Predicate name
, arguments :: [Term] -- Argument list
}
data Statement
= Fact Literal -- p(a,b).
| Rule Head [Literal] -- p(X,Y) :- q(X,Z), r(Z,Y).
| Query [Text] [Literal] -- ?- p(X,Y). or ?- p(X,Y) → X,Y.
```
##### Syntax Support
| Feature | Example | Notes |
|-------------------|-------------------------------------|--------------------------|
| Facts | `parent("alice", "bob").` | Ground atoms |
| Rules | `ancestor(X,Y) :- parent(X,Y).` | Head :- body |
| Queries | `?- parent(X,Y).` | Returns all variables |
| Projected queries | `?- edge(A,B), edge(B,C) → A,C.` | Explicit output vars |
| Negation | `not member(X,L)` or `!member(X,L)` | Parsed but not evaluated |
| Comments | `-- line` or `/* block */` | Haskell-style |
| Arrow variants | `:-`, `→`, `->` | All accepted |
##### Entry Points
```haskell
parseDatalog :: Text -> Either (ParseErrorBundle Text Void) Statement
parseDatalogFile :: Text -> Either (ParseErrorBundle Text Void) [Statement]
```
---
#### 2. DatalogDB.hs
**Purpose:** Defines the core data structures and the `DatalogDB` type class.
##### Core Types
```haskell
data Relation = Relation
{ name :: RelationId -- Relation name (Text)
, arity :: Int -- Number of arguments
, tuples :: Set [Constant] -- Known facts (extensional)
, rules :: [RelationRule] -- Derivation rules (intensional)
}
data RelationRule = RelationRule
{ headVariables :: [Text] -- Variables in scope
, bodyElements :: [RuleBodyElement] -- Body constraints
, outputElements :: [RuleElement] -- How to construct head tuple
}
data RuleBodyElement = RuleBodyElement
{ subRelationId :: RelationId -- Referenced relation
, ruleElements :: [RuleElement] -- How to map variables
}
data RuleElement
= RuleElementConstant Constant -- Fixed value
| RuleElementVariable Int -- Index into headVariables
```
##### Type Class
```haskell
class DatalogDB db where
emptyDB :: db
relationNames :: db -> [Text]
lookupRelation :: db -> Text -> Maybe Relation
insertRelation :: db -> Relation -> db
addConstants :: db -> Set Constant -> db
allConstants :: db -> Set Constant -- Herbrand universe
```
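As an illustration of how this class is meant to be instantiated, here is a minimal Map-backed instance in the spirit of `InMemoryDB`. This is a simplified sketch, not the project's code: `String` stands in for the real `Text`/`Constant` types, and `Relation` is pared down to a name and a tuple set.

```haskell
import qualified Data.Map as Map
import qualified Data.Set as Set

-- Simplified stand-ins for the real types (the project uses Text and a
-- richer Relation carrying arity and rules).
type Constant = String
data Relation = Relation { relName :: String, relTuples :: Set.Set [Constant] }

class DatalogDB db where
  emptyDB        :: db
  relationNames  :: db -> [String]
  lookupRelation :: db -> String -> Maybe Relation
  insertRelation :: db -> Relation -> db
  addConstants   :: db -> Set.Set Constant -> db
  allConstants   :: db -> Set.Set Constant

-- A Map-backed instance along the lines of InMemoryDB.
data MapDB = MapDB (Map.Map String Relation) (Set.Set Constant)

instance DatalogDB MapDB where
  emptyDB                        = MapDB Map.empty Set.empty
  relationNames (MapDB rs _)     = Map.keys rs
  lookupRelation (MapDB rs _) n  = Map.lookup n rs
  insertRelation (MapDB rs cs) r = MapDB (Map.insert (relName r) r rs) cs
  addConstants (MapDB rs cs) new = MapDB rs (Set.union cs new)
  allConstants (MapDB _ cs)      = cs
```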
---
#### 3. InMemoryDB.hs
**Purpose:** Concrete implementation of `DatalogDB` using in-memory data structures.
```haskell
data InMemoryDB = InMemoryDB
{ relations :: Map RelationId Relation
, constants :: Set Constant -- The Herbrand universe
}
```
##### Builder Functions
```haskell
-- Build DB from facts only
withFacts :: [Text] -> InMemoryDB
-- Build DB from facts and rules
withFactsAndRules :: [Text] -> [Text] -> InMemoryDB
-- Build DB with explicit extra constants
withFactsRulesAndConstants :: [Text] -> [Text] -> [Constant] -> InMemoryDB
```
**Example:**
```haskell
let db = withFactsAndRules
[ "parent(\"alice\", \"bob\")."
, "parent(\"bob\", \"carol\")."
]
[ "ancestor(X,Y) :- parent(X,Y)."
, "ancestor(X,Z) :- parent(X,Y), ancestor(Y,Z)."
]
```
---
#### 4. Rules.hs
**Purpose:** Processes rules and builds the internal representation.
##### Key Functions
**`digestHead`** - Processes the rule head:
1. Extracts variables using `nub` (preserves first-occurrence order)
2. Creates `RuleElement` list mapping terms to indices/constants
3. Collects constants for the Herbrand universe
**`digestBody`** - Processes each body literal:
1. Extends the variable list with any new variables from the body
2. Creates a `RuleBodyElement` linking to the referenced sub-relation
3. Maps terms to `RuleElement` (index or constant)
**Variable Indexing Example:**
```
Rule: ancestor(X,Y) :- parent(X,Z), ancestor(Z,Y).
After digestHead: variableNames = ["X", "Y"]
After digestBody[0]: variableNames = ["X", "Y", "Z"] (Z added)
After digestBody[1]: variableNames = ["X", "Y", "Z"] (no change)
Indices: X=0, Y=1, Z=2
```
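The first-occurrence indexing above can be sketched with a couple of hypothetical helpers (the real logic lives in `digestHead`/`digestBody`):

```haskell
import Data.List (nub)

-- Simplified AST terms for the sketch.
data Term = Var String | Sym String

-- Extend the variable list with any new variables, preserving
-- first-occurrence order (as nub does in digestHead).
extendVars :: [String] -> [Term] -> [String]
extendVars seen ts = seen ++ nub [v | Var v <- ts, v `notElem` seen]

-- A variable's index is its position in the accumulated list.
varIndex :: [String] -> String -> Int
varIndex vars v = length (takeWhile (/= v) vars)
```

For the rule above, `extendVars [] [Var "X", Var "Y"]` gives `["X","Y"]`; folding in `parent(X,Z)` and then `ancestor(Z,Y)` yields `["X","Y","Z"]`, so `Z` gets index 2.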
**`addRule`** - Main entry point:
```haskell
addRule :: (DatalogDB db) => (Literal, [Literal]) -> db -> db
```
---
#### 5. NaiveQE.hs (Naive Query Engine)
**Purpose:** Implements naive bottom-up evaluation with fixed-point iteration.
```haskell
data NaiveQE db = NaiveQE
{ db :: db
, herbrand :: Map Text Relation -- Computed model
}
```
##### Algorithm: `computeHerbrand`
The naive evaluation algorithm:
```
1. Initialize Facts := all known facts from DB
2. Repeat:
a. For each rule R in each relation:
- Enumerate all variable assignments from Herbrand universe
- For each assignment that satisfies all body constraints:
- Compute the derived tuple
- If tuple is new, add to Facts and set changed=True
b. If no changes, terminate
3. Return final Facts (the minimal Herbrand model)
```
**Complexity:** O(n^k) per iteration where n = |constants|, k = |variables in rule|
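The loop in step 2 is an instance of a generic fixed-point computation. A self-contained sketch (not the project's code), specialized to the transitive-closure rule from the examples:

```haskell
import qualified Data.Set as Set

-- Iterate a monotone step function until the fact set stops growing.
fixedPoint :: Ord a => (Set.Set a -> Set.Set a) -> Set.Set a -> Set.Set a
fixedPoint step facts
  | facts' == facts = facts
  | otherwise       = fixedPoint step facts'
  where facts' = Set.union facts (step facts)

-- One derivation step for ancestor(X,Z) :- ancestor(X,Y), ancestor(Y,Z).
transStep :: Set.Set (String, String) -> Set.Set (String, String)
transStep fs = Set.fromList
  [ (x, z) | (x, y) <- Set.toList fs, (y', z) <- Set.toList fs, y == y' ]
```

Starting from `{(alice,bob), (bob,carol)}`, `fixedPoint transStep` adds `(alice,carol)` on the first iteration and terminates on the second, mirroring the trace in the Data Flow Example below.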
##### Query Execution
```haskell
executeQuery :: (DatalogDB db) => NaiveQE db -> DigestedQuery -> QueryResults
```
1. Enumerate all assignments for sought variables
2. For each, check if any assignment to unsought variables satisfies all conditions
3. Return matching sought-variable tuples
---
#### 6. DigestedQuery.hs
**Purpose:** Preprocesses queries for efficient execution.
```haskell
data DigestedQuery = DigestedQuery
{ allBoundVariables :: [Text] -- All variables in query
, numSoughtVariables :: Int -- Variables in output
, conditions :: [DigestedQueryCondition]
}
data DigestedQueryCondition = DigestedQueryCondition
{ relationName :: Text
, entries :: [DigestedQueryEntry]
}
data DigestedQueryEntry
= DigestedQueryEntryConstant Constant
| DigestedQueryEntryVariable Int -- Index into allBoundVariables
```
**Query Forms:**
- `?- p(X,Y).` - Output all variables (X, Y)
- `?- edge(A,B), edge(B,C) → A,C.` - Output only A, C (B is existential)
---
#### 7. QueryEngine.hs
**Purpose:** Type class for query engines.
```haskell
class QueryEngine qe where
queryEngine :: (DatalogDB db) => db -> qe db
query :: (DatalogDB db) => qe db -> Text -> Text
```
---
#### 8. Utility.hs
**Purpose:** Helper functions.
```haskell
-- Generate all possible maps from domain to codomain
allMaps :: (Ord a) => [a] -> [b] -> [Map.Map a b]
```
Used by `NaiveQE` to enumerate variable assignments.
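The signature above comes from the source; the body below is one plausible implementation. Every total map is a choice of one codomain element per domain element, so there are |codomain|^|domain| maps in all:

```haskell
import qualified Data.Map as Map

-- Generate all possible maps from domain to codomain by taking the
-- cartesian product of the codomain with itself, once per domain element.
allMaps :: (Ord a) => [a] -> [b] -> [Map.Map a b]
allMaps domain codomain =
  [ Map.fromList (zip domain picks)
  | picks <- sequence (replicate (length domain) codomain) ]
```

For example, `allMaps ["X","Y"] [1,2,3]` yields the nine assignments of `{1,2,3}` to `X` and `Y`, which is exactly the enumeration `NaiveQE` performs over the Herbrand universe.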
---
### Data Flow Example
**Input:**
```datalog
parent("alice", "bob").
parent("bob", "carol").
ancestor(X,Y) :- parent(X,Y).
ancestor(X,Z) :- parent(X,Y), ancestor(Y,Z).
?- ancestor(alice, X).
```
**Step 1: Parse** → `[Fact, Fact, Rule, Rule, Query]`
**Step 2: Build DB**
```
relations = {
"parent" -> Relation {
tuples = {["alice","bob"], ["bob","carol"]},
rules = []
},
"ancestor" -> Relation {
tuples = {},
rules = [Rule1, Rule2]
}
}
constants = {"alice", "bob", "carol"}
```
**Step 3: Compute Herbrand Model**
```
Iteration 1: Apply rules
ancestor(alice,bob) ← parent(alice,bob)
ancestor(bob,carol) ← parent(bob,carol)
Iteration 2: Apply rules again
ancestor(alice,carol) ← parent(alice,bob), ancestor(bob,carol)
Iteration 3: No new facts → terminate
```
**Step 4: Execute Query**
```
?- ancestor(alice, X)
Result: bob, carol
```
---
### Correctness Assessment
| Component | Status | Notes |
|-----------------|---------|--------------------------------------|
| Parser | Correct | Full Datalog syntax support |
| Data Model | Correct | Proper Relation/Rule representation |
| Rule Processing | Correct | Correct variable indexing with `nub` |
| Query Engine | Correct | Fixed-point terminates correctly |
| Tests | Pass | Comprehensive coverage |
#### Limitations
1. **No stratified negation** - Negation is parsed but ignored in evaluation
2. **Naive complexity** - Exponential in number of variables per rule
3. **No aggregation** - Standard Datalog without extensions
4. **No built-in predicates** - No arithmetic comparison, etc.
---
### Test Coverage
Tests are in `test/Test/Datalog/`:
- **DatalogParserSpec.hs** - Parser correctness
- **InMemoryDBSpec.hs** - Database operations
- **RulesSpec.hs** - Rule digestion
- **NaiveQESpec.hs** - Query evaluation
- **DigestedQuerySpec.hs** - Query preprocessing
**Notable Test Cases:**
- Reflexive relations: `equiv(X, X) :- .`
- Symmetric relations: `equiv(Y, X) :- equiv(X, Y).`
- Transitive closure: `equiv(X, Z) :- equiv(X, Y), equiv(Y, Z).`
- Full equivalence relations (combining all three)
- Genealogical queries (niece, sibling relationships)
- Poset ordering with constants
---
### Build & Run
```bash
cabal build   # Build
cabal test    # Run all tests
cabal repl    # Interactive REPL

# Run specific tests
cabal test --test-option=--match="/NaiveQE/"
```
---
### Dependencies
- **megaparsec** - Parser combinators
- **containers** - Map, Set
- **text** - Text processing
- **hspec** - Testing framework
---
### Future Work
Potential enhancements:
1. Implement stratified negation
2. Add semi-naive evaluation for better performance
3. Built-in predicates (comparison, arithmetic)
4. Magic sets optimization
5. Persistent storage backend
## Changelog
* **Mar 4, 2026** -- The first version was created.

hassan/006-cales-datalog.md
## toy-datalog: Implementation Notes
### Overview
**toy-datalog** is a Haskell implementation of a Datalog interpreter. Datalog is a declarative logic programming language that is a syntactic subset
of Prolog, commonly used for:
- Database query languages (it's the theoretical foundation for SQL's recursive queries)
- Static program analysis
- Knowledge representation and reasoning
- Access control policies
This implementation provides parsing, evaluation, and fixed-point computation for Datalog programs.
**Author:** Cale Gibbard
**License:** BSD-2-Clause
**Version:** 0.1.0.0
---
### Architecture
```
┌─────────────────────────────────────────────────────────────┐
│ Datalog Program (Text) │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Parser (Megaparsec) │
│ • parseTerm, parseAtom, parseRule, parseProgram │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Abstract Syntax Tree │
│ • Program = [Rule] │
│ • Rule = Atom :- [Atom] │
│ • Atom = RelId [Term] │
│ • Term = Con ConId | Var VarId │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Database │
│ • Herbrand Universe (all constants) │
│ • Relations (indexed tuples) │
│ • Rules (indexed by head relation) │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Evaluation Engine │
│ • Forward chaining (bottom-up) │
│ • Fixed-point iteration │
│ • Variable binding and unification │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Derived Facts (Complete Database) │
└─────────────────────────────────────────────────────────────┘
```
---
### Module Analysis
#### 1. Datalog.Syntax (`src/Datalog/Syntax.hs`)
Defines the core AST types:
| Type | Description | Example |
|-----------|-----------------------------------------------|---------------------------------|
| `ConId` | Constant identifier (lowercase/numeric start) | `xerces`, `0`, `brooke` |
| `VarId` | Variable identifier (uppercase start) | `X`, `Y`, `Parent` |
| `RelId` | Relation/predicate name | `parent`, `ancestor` |
| `Term` | Either a constant or variable | `Con "xerces"`, `Var "X"` |
| `Atom` | Relation applied to terms | `parent(X, Y)` |
| `Rule` | Head atom with body atoms | `ancestor(X,Y) :- parent(X,Y).` |
| `Program` | Collection of rules | Full Datalog program |
**Key Features:**
- `HasConstants` typeclass for extracting constants from any Datalog structure
- `Pretty` typeclass for converting AST back to readable syntax
- Smart constructor `term` that auto-detects constants vs variables by case
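The case-based smart constructor can be sketched as follows (simplified to `String`; the real `ConId`/`VarId` are newtypes):

```haskell
import Data.Char (isUpper)

data Term = Con String | Var String deriving (Eq, Show)

-- Classify an identifier by its first character: uppercase means a
-- variable, lowercase or a digit means a constant.
term :: String -> Term
term s@(c:_) | isUpper c = Var s
term s                   = Con s
```

So `term "X"` yields `Var "X"`, while `term "xerces"` and `term "0"` both yield constants.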
#### 2. Datalog.Parser (`src/Datalog/Parser.hs`)
Megaparsec-based parser with:
- **Lexical conventions:**
- Whitespace handling (spaces, line comments `--`, block comments `{- -}`)
- Constants: start with lowercase or digit
- Variables: start with uppercase
- Relations: start with lowercase
- **Grammar:**
```
program ::= rule*
rule ::= atom '.' (fact)
| atom ':-' query '.' (rule with body)
query ::= atom (',' atom)*
atom ::= relId '(' term (',' term)* ')'
term ::= variable | constant
```
#### 3. Datalog.Eval (`src/Datalog/Eval.hs`)
The evaluation engine implements **naive bottom-up evaluation** (forward chaining).
##### Data Structures
**Relation:**
```haskell
data Relation = Relation
{ _relation_arity :: Int
, _relation_members :: Set [ConId] -- all tuples
, _relation_indices :: Map (Int, ConId) (Set [ConId]) -- index
}
```
The index maps `(position, value)` pairs to tuples containing that value at that position, enabling efficient lookups.
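A sketch of that index shape and the intersection-based lookup (hypothetical helper names; the real code is in `Datalog.Eval`):

```haskell
import qualified Data.Map as Map
import qualified Data.Set as Set

type ConId = String
type Index = Map.Map (Int, ConId) (Set.Set [ConId])

-- Index every tuple under each (position, value) pair it contains.
buildIndex :: Set.Set [ConId] -> Index
buildIndex tuples = Map.fromListWith Set.union
  [ ((i, c), Set.singleton t) | t <- Set.toList tuples, (i, c) <- zip [0 ..] t ]

-- Candidate tuples for an atom with constants at the given positions:
-- intersect the index entries for every bound position. Assumes at least
-- one bound position; an unconstrained atom falls back to a full scan.
candidates :: Index -> [(Int, ConId)] -> Set.Set [ConId]
candidates idx bound =
  foldr1 Set.intersection
    [ Map.findWithDefault Set.empty k idx | k <- bound ]
```

For the tuples `{[a,b], [a,c], [b,c]}`, binding position 0 to `a` yields `{[a,b], [a,c]}`, and additionally binding position 1 to `c` narrows that to `{[a,c]}` without scanning the whole relation.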
**Database:**
```haskell
data Database = Database
{ _database_universe :: Set ConId -- Herbrand universe
, _database_relations :: Map RelId Relation
, _database_rules :: Map RelId [Rule]
}
```
##### Evaluation Algorithm
1. **`evalAtomDb`**: Query an atom against the database
- Uses indices to find candidate tuples
- Computes intersection of candidates for each bound position
- Returns set of variable bindings
2. **`evalConjunction`**: Evaluate a conjunction (rule body)
- Evaluates each atom
- Joins bindings using `><` operator (relational join)
3. **`immediateConsequences`**: Derive new facts from a rule
- Evaluate the body to get bindings
- Project bindings onto head atom terms
4. **`extendDb`**: Single iteration of forward chaining
- Apply all rules to derive new facts
- Add new tuples to relations
5. **`extendFixedpointDb`**: Iterate until no new facts
- Repeatedly call `extendDb` until database stops changing
#### 4. Datalog.IO (`src/Datalog/IO.hs`)
File I/O utilities:
- `readProgram`: Parse a file into a `Program`
- `readDatabase`: Parse and load into a `Database`
---
### Correctness Analysis
#### Verified Correct Behaviors
1. **Variable binding consistency**: The `mkBinding` function in `unsafeEvalAtomRel` correctly handles the case where the same variable appears
multiple times in an atom. It checks that all occurrences bind to the same value (lines 125-130).
2. **Join semantics**: The `joinBindings` function correctly implements a consistent merge of two bindings, failing when variables are bound to
different values.
3. **Arity checking**: The implementation properly tracks and validates relation arities, preventing arity mismatches.
4. **Fixed-point termination**: Since Datalog is guaranteed to terminate (finite Herbrand universe, monotonic derivation), the fixed-point iteration
is correct.
5. **Index maintenance**: Indices are properly maintained during insertion via `insertInRelation`.
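The merge described in point 2 can be sketched as a conflict-checking map union (simplified types; the real `joinBindings` lives in `Datalog.Eval`):

```haskell
import qualified Data.Map as Map

type Binding = Map.Map String String  -- variable name -> constant

-- Merge two bindings, failing if any shared variable is bound to two
-- different constants.
joinBindings :: Binding -> Binding -> Maybe Binding
joinBindings b1 b2
  | consistent = Just (Map.union b1 b2)
  | otherwise  = Nothing
  where
    consistent = and
      [ Map.lookup v b2 `elem` [Nothing, Just c] | (v, c) <- Map.toList b1 ]
```

Joining `{X -> a}` with `{Y -> b}` succeeds, while joining `{X -> a}` with `{X -> b}` fails, which is what makes the relational join in `evalConjunction` sound.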
#### Potential Issues
1. **`projectTuple` with unbound variables** (lines 147-157):
When the head of a rule contains a variable not appearing in the body, `projectTuple` generates all possible values from the Herbrand universe.
This is semantically correct for Datalog (the variable ranges over all constants), but:
- It can cause a combinatorial explosion
- This is an unusual Datalog pattern (typically all head variables appear in the body)
Example: `foo(X) :- bar(Y).` would generate `foo(c)` for every constant `c` in the universe.
2. **Naive evaluation strategy**:
The implementation uses naive bottom-up evaluation, which re-derives all facts at each iteration. This is correct but inefficient compared to
semi-naive evaluation.
3. **Unsafe partial function**:
`insertManyInRelationDb` uses `error` on line 51 if the relation doesn't exist. However, this should never occur in practice due to prior arity
checks.
4. **Empty atom arguments**:
The parser allows zero-argument atoms `foo()`, and evaluation handles them correctly (they act as propositional facts).
---
### Strengths
#### 1. Clean Haskell Design
- Proper use of newtypes for type safety (`ConId`, `VarId`, `RelId`)
- Effective use of typeclasses (`HasConstants`, `Pretty`)
- Clear separation of concerns between modules
- Appropriate use of `Either` for error handling
#### 2. Efficient Indexing
- Position-based indexing enables efficient lookups
- Intersection of index results for atoms with multiple constants
- Avoids full table scans when constants are present in queries
#### 3. Correct Semantics
- Proper handling of variable binding consistency
- Correct relational join semantics
- Sound fixed-point computation
#### 4. Good Parser Design
- Uses Megaparsec for good error messages
- Supports both Haskell-style comment forms: line (`--`) and block (`{- -}`)
- Clean lexeme handling with whitespace consumption
#### 5. Extensible Architecture
- Easy to add new syntax constructs
- Database structure supports multiple relations
- Error types are well-defined and informative
---
### Limitations
#### 1. Performance Limitations
| Issue | Impact | Potential Fix |
|-----------------------|----------------------------------------------------|----------------------------------|
| Naive evaluation | Re-derives all facts each iteration | Implement semi-naive evaluation |
| No query optimization | Joins evaluated left-to-right | Add query planner |
| Full materialization | All derived facts stored | Support demand-driven evaluation |
| No stratification | Potential issues with negation (not yet supported) | Add stratification analysis |
#### 2. Missing Datalog Features
| Feature | Status | Description |
|---------------------|-----------------|------------------------------|
| Negation | Not supported | `not p(X)` in rule bodies |
| Aggregation | Not supported | `count`, `sum`, `min`, `max` |
| Built-in predicates | Not supported | `X < Y`, `X = Y + 1` |
| Stratification | Not applicable | Would be needed for negation |
| Magic sets | Not implemented | Query optimization technique |
| Tabling/Memoization | Basic only | Via relation storage |
#### 3. Usability Limitations
- **No REPL**: No interactive query interface
- **No CLI executable**: The executable stanza is commented out in the cabal file
- **Limited error messages**: Parsing errors are good (Megaparsec), but evaluation errors are basic
- **No source locations**: Error messages don't include line numbers for semantic errors
- **No pretty-printed output**: Query results shown as Haskell data structures
#### 4. Safety/Edge Cases
- **Unsafe head variables**: Variables in rule heads not appearing in bodies cause universe enumeration
- **No recursion limit**: Deeply recursive programs could stack overflow
- **No fact limit**: Large derived fact sets could exhaust memory
#### 5. Testing Gaps
- Only parser golden tests are active
- Database and fixed-point tests are commented out
- No property-based testing
- No negative test cases (expected failures)
---
### Example Programs
#### Transitive Closure (ancestor.dl)
```datalog
parent(xerces, brooke).
parent(brooke, damocles).
ancestor(X, Y) :- parent(X, Y).
ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y).
```
Derives: `ancestor(xerces, brooke)`, `ancestor(brooke, damocles)`, `ancestor(xerces, damocles)`
#### Symmetric Closure with Parity (graph.dl)
```datalog
odd(X,Y) :- r(X,Y).
odd(X,Y) :- even(X,Z), r(Z,Y).
even(X,Y) :- odd(X,Z), r(Z,Y).
r(0,1). r(1,2). r(2,3). r(3,4). r(4,5).
r(X,Y) :- r(Y,X).
```
Computes paths of odd/even length in a bidirectional graph.
---
### Recommendations for Future Development
#### High Priority
1. **Enable the executable** with a simple REPL or query interface
2. **Implement semi-naive evaluation** for significant performance improvement
3. **Add more tests** including evaluation tests and property tests
4. **Add source locations** to AST for better error messages
#### Medium Priority
5. **Support negation** with stratification checking
6. **Add built-in predicates** for arithmetic comparisons
7. **Implement query optimization** (join reordering)
8. **Add aggregation** support
#### Nice to Have
9. **Magic sets transformation** for goal-directed evaluation
10. **Incremental evaluation** for database updates
11. **Parallel evaluation** of independent rules
12. **External data sources** (CSV, databases)
---
### Conclusion
toy-datalog is a well-structured, pedagogically clear implementation of core Datalog semantics. It correctly implements the fundamental operations of
bottom-up evaluation with proper variable binding and fixed-point computation. While it lacks advanced features like negation, aggregation, and query
optimization, it provides a solid foundation for learning about Datalog and could be extended into a more full-featured implementation.
The code quality is high, with good use of Haskell idioms and proper error handling. The main limitations are performance (naive evaluation) and
missing features rather than correctness issues.
## Changelog
* **Mar 4, 2026** -- The first version was created.