useful-notes/hassan/005-felixs-datalog.md

448 lines
14 KiB
Markdown
Raw Permalink Normal View History

## felix-db: Project Internals
This document provides detailed technical documentation of the felix-db Datalog implementation.
### Overview
felix-db is an experimental Datalog implementation in Haskell. Datalog is a declarative logic programming language (a subset of Prolog) commonly used
for database queries, program analysis, and reasoning systems.
**Project Status:** The implementation is correct and fully functional for positive Datalog (without negation in rules).
### Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│ User Input │
│ (Datalog text/files) │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ DatalogParser.hs │
│ (Megaparsec-based parser) │
│ Text → Statement (Fact | Rule | Query) │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ DatalogDB.hs │
│ (Type class + core data structures) │
│ Relation, RelationRule, RuleElement │
└─────────────────────────────────────────────────────────────────┘
┌───────────────┴───────────────┐
▼ ▼
┌─────────────────────────┐ ┌─────────────────────────────────┐
│ InMemoryDB.hs │ │ Rules.hs │
│ (Concrete DB impl) │ │ (Rule digestion logic) │
│ relations + constants │ │ digestHead, digestBody │
└─────────────────────────┘ └─────────────────────────────────┘
│ │
└───────────────┬───────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ NaiveQE.hs │
│ (Naive Query Engine - bottom-up) │
│ Fixed-point computation of Herbrand model │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ DigestedQuery.hs │
│ (Query preprocessing) │
│ Query → DigestedQuery with indexed variables │
└─────────────────────────────────────────────────────────────────┘
```
### Module Details
#### 1. DatalogParser.hs
**Purpose:** Parses Datalog syntax into an Abstract Syntax Tree (AST).
**Parser Type:** Uses Megaparsec (`Parsec Void Text`).
##### Core Types
```haskell
data Term
= Var Text -- Variables: X, Person, Y1 (uppercase start)
| Sym Text -- Symbols: alice, "london", uk (lowercase/quoted)
| Num Integer
data Literal = Literal
{ positive :: Bool -- True = positive, False = negated
, predName :: Text -- Predicate name
, arguments :: [Term] -- Argument list
}
data Statement
= Fact Literal -- p(a,b).
| Rule Head [Literal] -- p(X,Y) :- q(X,Z), r(Z,Y).
| Query [Text] [Literal] -- ?- p(X,Y). or ?- p(X,Y) → X,Y.
```
##### Syntax Support
| Feature | Example | Notes |
|-------------------|-------------------------------------|--------------------------|
| Facts | `parent("alice", "bob").` | Ground atoms |
| Rules | `ancestor(X,Y) :- parent(X,Y).` | Head :- body |
| Queries | `?- parent(X,Y).` | Returns all variables |
| Projected queries | `?- edge(A,B), edge(B,C) → A,C.` | Explicit output vars |
| Negation | `not member(X,L)` or `!member(X,L)` | Parsed but not evaluated |
| Comments | `-- line` or `/* block */` | Haskell-style |
| Arrow variants | `:-`, `→`, `->` | All accepted |
##### Entry Points
```haskell
parseDatalog :: Text -> Either (ParseErrorBundle Text Void) Statement
parseDatalogFile :: Text -> Either (ParseErrorBundle Text Void) [Statement]
```
---
#### 2. DatalogDB.hs
**Purpose:** Defines the core data structures and the `DatalogDB` type class.
##### Core Types
```haskell
data Relation = Relation
{ name :: RelationId -- Relation name (Text)
, arity :: Int -- Number of arguments
, tuples :: Set [Constant] -- Known facts (extensional)
, rules :: [RelationRule] -- Derivation rules (intensional)
}
data RelationRule = RelationRule
{ headVariables :: [Text] -- Variables in scope
, bodyElements :: [RuleBodyElement] -- Body constraints
, outputElements :: [RuleElement] -- How to construct head tuple
}
data RuleBodyElement = RuleBodyElement
{ subRelationId :: RelationId -- Referenced relation
, ruleElements :: [RuleElement] -- How to map variables
}
data RuleElement
= RuleElementConstant Constant -- Fixed value
| RuleElementVariable Int -- Index into headVariables
```
##### Type Class
```haskell
class DatalogDB db where
emptyDB :: db
relationNames :: db -> [Text]
lookupRelation :: db -> Text -> Maybe Relation
insertRelation :: db -> Relation -> db
addConstants :: db -> Set Constant -> db
allConstants :: db -> Set Constant -- Herbrand universe
```
---
#### 3. InMemoryDB.hs
**Purpose:** Concrete implementation of `DatalogDB` using in-memory data structures.
```haskell
data InMemoryDB = InMemoryDB
{ relations :: Map RelationId Relation
, constants :: Set Constant -- The Herbrand universe
}
```
##### Builder Functions
```haskell
-- Build DB from facts only
withFacts :: [Text] -> InMemoryDB
-- Build DB from facts and rules
withFactsAndRules :: [Text] -> [Text] -> InMemoryDB
-- Build DB with explicit extra constants
withFactsRulesAndConstants :: [Text] -> [Text] -> [Constant] -> InMemoryDB
```
**Example:**
```haskell
let db = withFactsAndRules
[ "parent(\"alice\", \"bob\")."
, "parent(\"bob\", \"carol\")."
]
[ "ancestor(X,Y) :- parent(X,Y)."
, "ancestor(X,Z) :- parent(X,Y), ancestor(Y,Z)."
]
```
---
#### 4. Rules.hs
**Purpose:** Processes rules and builds the internal representation.
##### Key Functions
**`digestHead`** - Processes the rule head:
1. Extracts variables using `nub` (preserves first-occurrence order)
2. Creates `RuleElement` list mapping terms to indices/constants
3. Collects constants for the Herbrand universe
**`digestBody`** - Processes each body literal:
1. Extends variable list with new variables from body
2. Creates `BodyConstraint` linking to sub-relations
3. Maps terms to `RuleElement` (index or constant)
**Variable Indexing Example:**
```
Rule: ancestor(X,Y) :- parent(X,Z), ancestor(Z,Y).
After digestHead: variableNames = ["X", "Y"]
After digestBody[0]: variableNames = ["X", "Y", "Z"] (Z added)
After digestBody[1]: variableNames = ["X", "Y", "Z"] (no change)
Indices: X=0, Y=1, Z=2
```
**`addRule`** - Main entry point:
```haskell
addRule :: (DatalogDB db) => (Literal, [Literal]) -> db -> db
```
---
#### 5. NaiveQE.hs (Naive Query Engine)
**Purpose:** Implements naive bottom-up evaluation with fixed-point iteration.
```haskell
data NaiveQE db = NaiveQE
{ db :: db
, herbrand :: Map Text Relation -- Computed model
}
```
##### Algorithm: `computeHerbrand`
The naive evaluation algorithm:
```
1. Initialize Facts := all known facts from DB
2. Repeat:
a. For each rule R in each relation:
- Enumerate all variable assignments from Herbrand universe
- For each assignment that satisfies all body constraints:
- Compute the derived tuple
- If tuple is new, add to Facts and set changed=True
b. If no changes, terminate
3. Return final Facts (the minimal Herbrand model)
```
**Complexity:** O(n^k) per iteration where n = |constants|, k = |variables in rule|
##### Query Execution
```haskell
executeQuery :: (DatalogDB db) => NaiveQE db -> DigestedQuery -> QueryResults
```
1. Enumerate all assignments for sought variables
2. For each, check if any assignment to unsought variables satisfies all conditions
3. Return matching sought-variable tuples
---
#### 6. DigestedQuery.hs
**Purpose:** Preprocesses queries for efficient execution.
```haskell
data DigestedQuery = DigestedQuery
{ allBoundVariables :: [Text] -- All variables in query
, numSoughtVariables :: Int -- Variables in output
, conditions :: [DigestedQueryCondition]
}
data DigestedQueryCondition = DigestedQueryCondition
{ relationName :: Text
, entries :: [DigestedQueryEntry]
}
data DigestedQueryEntry
= DigestedQueryEntryConstant Constant
| DigestedQueryEntryVariable Int -- Index into allBoundVariables
```
**Query Forms:**
- `?- p(X,Y).` - Output all variables (X, Y)
- `?- edge(A,B), edge(B,C) → A,C.` - Output only A, C (B is existential)
---
#### 7. QueryEngine.hs
**Purpose:** Type class for query engines.
```haskell
class QueryEngine qe where
queryEngine :: (DatalogDB db) => db -> qe db
query :: (DatalogDB db) => qe db -> Text -> Text
```
---
#### 8. Utility.hs
**Purpose:** Helper functions.
```haskell
-- Generate all possible maps from domain to codomain
allMaps :: (Ord a) => [a] -> [b] -> [Map.Map a b]
```
Used by `NaiveQE` to enumerate variable assignments.
---
### Data Flow Example
**Input:**
```datalog
parent("alice", "bob").
parent("bob", "carol").
ancestor(X,Y) :- parent(X,Y).
ancestor(X,Z) :- parent(X,Y), ancestor(Y,Z).
?- ancestor(alice, X).
```
**Step 1: Parse** → `[Fact, Fact, Rule, Rule, Query]`
**Step 2: Build DB**
```
relations = {
"parent" -> Relation {
tuples = {["alice","bob"], ["bob","carol"]},
rules = []
},
"ancestor" -> Relation {
tuples = {},
rules = [Rule1, Rule2]
}
}
constants = {"alice", "bob", "carol"}
```
**Step 3: Compute Herbrand Model**
```
Iteration 1: Apply rules
ancestor(alice,bob) ← parent(alice,bob)
ancestor(bob,carol) ← parent(bob,carol)
Iteration 2: Apply rules again
ancestor(alice,carol) ← parent(alice,bob), ancestor(bob,carol)
Iteration 3: No new facts → terminate
```
**Step 4: Execute Query**
```
?- ancestor(alice, X)
Result: bob, carol
```
---
### Correctness Assessment
| Component | Status | Notes |
|-----------------|---------|--------------------------------------|
| Parser | Correct | Full Datalog syntax support |
| Data Model | Correct | Proper Relation/Rule representation |
| Rule Processing | Correct | Correct variable indexing with `nub` |
| Query Engine | Correct | Fixed-point terminates correctly |
| Tests | Pass | Comprehensive coverage |
#### Limitations
1. **No stratified negation** - Negation is parsed but ignored in evaluation
2. **Naive complexity** - Exponential in number of variables per rule
3. **No aggregation** - Standard Datalog without extensions
4. **No built-in predicates** - No arithmetic comparison, etc.
---
### Test Coverage
Tests are in `test/Test/Datalog/`:
- **DatalogParserSpec.hs** - Parser correctness
- **InMemoryDBSpec.hs** - Database operations
- **RulesSpec.hs** - Rule digestion
- **NaiveQESpec.hs** - Query evaluation
- **DigestedQuerySpec.hs** - Query preprocessing
**Notable Test Cases:**
- Reflexive relations: `equiv(X, X) :- .`
- Symmetric relations: `equiv(Y, X) :- equiv(X, Y).`
- Transitive closure: `equiv(X, Z) :- equiv(X, Y), equiv(Y, Z).`
- Full equivalence relations (combining all three)
- Genealogical queries (niece, sibling relationships)
- Poset ordering with constants
---
### Build & Run
```bash
cabal build ## Build
cabal test ## Run all tests
cabal repl ## Interactive REPL
## Run specific tests
cabal test --test-option=--match="/NaiveQE/"
```
---
### Dependencies
- **megaparsec** - Parser combinators
- **containers** - Map, Set
- **text** - Text processing
- **hspec** - Testing framework
---
### Future Work
Potential enhancements:
1. Implement stratified negation
2. Add semi-naive evaluation for better performance
3. Built-in predicates (comparison, arithmetic)
4. Magic sets optimization
5. Persistent storage backend
## Changelog
* **Mar 4, 2026** -- The first version was created.