useful-notes/hassan/006-cales-datalog.md

toy-datalog: Implementation Notes

Overview

toy-datalog is a Haskell implementation of a Datalog interpreter. Datalog is a declarative logic programming language that is a syntactic subset of Prolog, commonly used for:

  • Database query languages (it's the theoretical foundation for SQL's recursive queries)
  • Static program analysis
  • Knowledge representation and reasoning
  • Access control policies

This implementation provides parsing, evaluation, and fixed-point computation for Datalog programs.

Author: Cale Gibbard
License: BSD-2-Clause
Version: 0.1.0.0


Architecture

┌─────────────────────────────────────────────────────────────┐
│                    Datalog Program (Text)                    │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                   Parser (Megaparsec)                        │
│  • parseTerm, parseAtom, parseRule, parseProgram            │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                     Abstract Syntax Tree                     │
│  • Program = [Rule]                                          │
│  • Rule = Atom :- [Atom]                                     │
│  • Atom = RelId [Term]                                       │
│  • Term = Con ConId | Var VarId                              │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                      Database                                │
│  • Herbrand Universe (all constants)                         │
│  • Relations (indexed tuples)                                │
│  • Rules (indexed by head relation)                          │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                  Evaluation Engine                           │
│  • Forward chaining (bottom-up)                              │
│  • Fixed-point iteration                                     │
│  • Variable binding and unification                          │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│              Derived Facts (Complete Database)               │
└─────────────────────────────────────────────────────────────┘

Module Analysis

1. Datalog.Syntax (src/Datalog/Syntax.hs)

Defines the core AST types:

| Type | Description | Example |
|---|---|---|
| `ConId` | Constant identifier (lowercase/numeric start) | `xerces`, `0`, `brooke` |
| `VarId` | Variable identifier (uppercase start) | `X`, `Y`, `Parent` |
| `RelId` | Relation/predicate name | `parent`, `ancestor` |
| `Term` | Either a constant or variable | `Con "xerces"`, `Var "X"` |
| `Atom` | Relation applied to terms | `parent(X, Y)` |
| `Rule` | Head atom with body atoms | `ancestor(X,Y) :- parent(X,Y).` |
| `Program` | Collection of rules | Full Datalog program |

Key Features:

  • HasConstants typeclass for extracting constants from any Datalog structure
  • Pretty typeclass for converting AST back to readable syntax
  • Smart constructor term that auto-detects constants vs variables by case
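The case-based detection can be sketched as follows (a minimal sketch of the idea; the actual definition of `term` in Datalog.Syntax may differ in detail):

```haskell
import Data.Char (isUpper)

newtype ConId = ConId String deriving (Eq, Ord, Show)
newtype VarId = VarId String deriving (Eq, Ord, Show)

data Term = Con ConId | Var VarId deriving (Eq, Ord, Show)

-- Case-based classification: an identifier starting with an uppercase
-- letter is a variable; anything else (lowercase or digit) is a constant.
term :: String -> Term
term s@(c:_) | isUpper c = Var (VarId s)
term s                   = Con (ConId s)
```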

2. Datalog.Parser (src/Datalog/Parser.hs)

Megaparsec-based parser with:

  • Lexical conventions:

    • Whitespace handling (spaces, line comments --, block comments {- -})
    • Constants: start with lowercase or digit
    • Variables: start with uppercase
    • Relations: start with lowercase
  • Grammar:

    program  ::= rule*
    rule     ::= atom '.'                    (fact)
               | atom ':-' query '.'         (rule with body)
    query    ::= atom (',' atom)*
    atom     ::= relId '(' term (',' term)* ')'
    term     ::= variable | constant
    

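The grammar's `term` and `atom` productions can be illustrated with a self-contained sketch using base's `Text.ParserCombinators.ReadP` (the real module uses Megaparsec; the names `pName`, `pTerm`, `pAtom`, `runAtom` and the simplified whitespace handling here are illustrative only):

```haskell
import Data.Char (isAlphaNum, isDigit, isLower, isUpper)
import Text.ParserCombinators.ReadP

data Term = Con String | Var String deriving (Eq, Show)
data Atom = Atom String [Term] deriving (Eq, Show)

-- An identifier: a first character satisfying the predicate,
-- followed by any alphanumeric characters (maximal munch).
pName :: (Char -> Bool) -> ReadP String
pName first = (:) <$> satisfy first <*> munch isAlphaNum

-- term ::= variable | constant
pTerm :: ReadP Term
pTerm = (Var <$> pName isUpper)
    +++ (Con <$> pName (\c -> isLower c || isDigit c))

-- atom ::= relId '(' term (',' term)* ')'
pAtom :: ReadP Atom
pAtom = do
  rel <- pName isLower
  _   <- char '('
  ts  <- pTerm `sepBy1` (char ',' *> skipSpaces)
  _   <- char ')'
  pure (Atom rel ts)

-- Accept only a complete, unambiguous parse of the whole input.
runAtom :: String -> Maybe Atom
runAtom s = case [a | (a, "") <- readP_to_S (pAtom <* eof) s] of
  [a] -> Just a
  _   -> Nothing
```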
3. Datalog.Eval (src/Datalog/Eval.hs)

The evaluation engine implements naive bottom-up evaluation (forward chaining).

Data Structures

Relation:

data Relation = Relation
  { _relation_arity   :: Int
  , _relation_members :: Set [ConId]           -- all tuples
  , _relation_indices :: Map (Int, ConId) (Set [ConId])  -- index
  }

The index maps (position, value) pairs to tuples containing that value at that position, enabling efficient lookups.
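One possible shape for building and querying such an index (`buildIndex` and `lookupIndex` are illustrative names, not the module's own):

```haskell
import           Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map
import           Data.Set (Set)
import qualified Data.Set as Set

type ConId = String

-- File each tuple under every (position, value) pair it contains.
buildIndex :: Set [ConId] -> Map (Int, ConId) (Set [ConId])
buildIndex tuples = Map.fromListWith Set.union
  [ ((i, c), Set.singleton t)
  | t <- Set.toList tuples
  , (i, c) <- zip [0 ..] t
  ]

-- All tuples whose component at position i equals c.
lookupIndex :: Map (Int, ConId) (Set [ConId]) -> Int -> ConId -> Set [ConId]
lookupIndex idx i c = Map.findWithDefault Set.empty (i, c) idx
```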

Database:

data Database = Database
  { _database_universe  :: Set ConId        -- Herbrand universe
  , _database_relations :: Map RelId Relation
  , _database_rules     :: Map RelId [Rule]
  }

Evaluation Algorithm

  1. evalAtomDb: Query an atom against the database

    • Uses indices to find candidate tuples
    • Computes intersection of candidates for each bound position
    • Returns set of variable bindings
  2. evalConjunction: Evaluate a conjunction (rule body)

    • Evaluates each atom
    • Joins bindings using >< operator (relational join)
  3. immediateConsequences: Derive new facts from a rule

    • Evaluate the body to get bindings
    • Project bindings onto head atom terms
  4. extendDb: Single iteration of forward chaining

    • Apply all rules to derive new facts
    • Add new tuples to relations
  5. extendFixedpointDb: Iterate until no new facts

    • Repeatedly call extendDb until database stops changing
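The overall shape of steps 4 and 5 can be sketched generically (`fixpoint` and `stepTC` are illustrative stand-ins for `extendDb` and `extendFixedpointDb`, specialized here to transitive closure over a plain set of pairs):

```haskell
import           Data.Set (Set)
import qualified Data.Set as Set

-- Iterate one forward-chaining step until the database stops changing.
fixpoint :: Eq a => (a -> a) -> a -> a
fixpoint step x
  | x' == x   = x
  | otherwise = fixpoint step x'
  where x' = step x

-- One naive iteration of transitive closure: join every pair of facts
-- (x,z), (z,y) and add (x,y), mirroring immediateConsequences for the
-- rule ancestor(X,Y) :- parent(X,Z), ancestor(Z,Y).
stepTC :: Ord a => Set (a, a) -> Set (a, a)
stepTC facts = Set.union facts $ Set.fromList
  [ (x, y)
  | (x, z)  <- Set.toList facts
  , (z', y) <- Set.toList facts
  , z == z'
  ]
```

Termination follows the same argument as for the full interpreter: the set of derivable pairs is bounded and each step only grows it, so the iteration must reach a fixed point.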

4. Datalog.IO (src/Datalog/IO.hs)

File I/O utilities:

  • readProgram: Parse a file into a Program
  • readDatabase: Parse and load into a Database

Correctness Analysis

Verified Correct Behaviors

  1. Variable binding consistency: The mkBinding function in unsafeEvalAtomRel correctly handles the case where the same variable appears multiple times in an atom. It checks that all occurrences bind to the same value (lines 125-130).

  2. Join semantics: The joinBindings function correctly implements a consistent merge of two bindings, failing when variables are bound to different values.

  3. Arity checking: The implementation properly tracks and validates relation arities, preventing arity mismatches.

  4. Fixed-point termination: Because the Herbrand universe is finite and derivation only adds facts (monotonicity), bottom-up evaluation of pure Datalog always terminates, so the fixed-point iteration is guaranteed to reach quiescence.

  5. Index maintenance: Indices are properly maintained during insertion via insertInRelation.
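The join semantics of point 2 amount to a consistent merge of variable bindings, roughly as follows (a sketch; the real `joinBindings` may differ in representation):

```haskell
import           Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map

type Binding = Map String String  -- variable name to constant

-- Merge two bindings, failing when a shared variable is bound to
-- different constants on the two sides.
joinBindings :: Binding -> Binding -> Maybe Binding
joinBindings b1 b2
  | consistent = Just (Map.union b1 b2)
  | otherwise  = Nothing
  where
    consistent = and
      [ Map.lookup v b2 `elem` [Nothing, Just c]
      | (v, c) <- Map.toList b1
      ]
```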

Potential Issues

  1. projectTuple with unbound variables (lines 147-157):

    When the head of a rule contains a variable not appearing in the body, projectTuple generates all possible values from the Herbrand universe. This is semantically correct for Datalog (the variable ranges over all constants), but:

    • It can cause a combinatorial explosion
    • This is an unusual Datalog pattern (typically all head variables appear in the body)

    Example: foo(X) :- bar(Y). would generate foo(c) for every constant c in the universe, whenever bar holds for at least one Y.

  2. Naive evaluation strategy:

    The implementation uses naive bottom-up evaluation, which re-derives all facts at each iteration. This is correct but inefficient compared to semi-naive evaluation.

  3. Unsafe partial function:

    insertManyInRelationDb uses error on line 51 if the relation doesn't exist. However, this should never occur in practice due to prior arity checks.

  4. Empty atom arguments:

    The parser allows zero-argument atoms foo(), and evaluation handles them correctly (they act as propositional facts).
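The projection behavior described in issue 1 can be sketched with the list applicative (a hypothetical `projectHead`; the module's projectTuple works over its own types): an unbound head variable contributes every constant in the universe, so the number of derived tuples multiplies accordingly.

```haskell
import           Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map

type ConId = String

-- Head terms are either constants or variables; bound variables are
-- read from the body's binding, unbound ones range over the universe.
projectHead
  :: [ConId]               -- Herbrand universe
  -> Map String ConId      -- bindings produced by the rule body
  -> [Either ConId String] -- head terms: Left constant, Right variable
  -> [[ConId]]             -- one tuple per combination (list applicative)
projectHead universe binding = traverse fill
  where
    fill (Left c)  = [c]
    fill (Right v) = case Map.lookup v binding of
      Just c  -> [c]
      Nothing -> universe  -- unbound head variable: enumerate everything
```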


Strengths

1. Clean Haskell Design

  • Proper use of newtypes for type safety (ConId, VarId, RelId)
  • Effective use of typeclasses (HasConstants, Pretty)
  • Clear separation of concerns between modules
  • Appropriate use of Either for error handling

2. Efficient Indexing

  • Position-based indexing enables efficient lookups
  • Intersection of index results for atoms with multiple constants
  • Avoids full table scans when constants are present in queries

3. Correct Semantics

  • Proper handling of variable binding consistency
  • Correct relational join semantics
  • Sound fixed-point computation

4. Good Parser Design

  • Uses Megaparsec for good error messages
  • Supports both Haskell-style comments (-- and {- -})
  • Clean lexeme handling with whitespace consumption

5. Extensible Architecture

  • Easy to add new syntax constructs
  • Database structure supports multiple relations
  • Error types are well-defined and informative

Limitations

1. Performance Limitations

| Issue | Impact | Potential Fix |
|---|---|---|
| Naive evaluation | Re-derives all facts each iteration | Implement semi-naive evaluation |
| No query optimization | Joins evaluated left-to-right | Add a query planner |
| Full materialization | All derived facts stored | Support demand-driven evaluation |
| No stratification | Potential issues with negation (not yet supported) | Add stratification analysis |
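The semi-naive fix suggested above restricts each round's joins to the facts derived in the previous round (the "delta"). A sketch for the linear transitive-closure rule (illustrative only, not part of toy-datalog):

```haskell
import           Data.Set (Set)
import qualified Data.Set as Set

-- Semi-naive transitive closure: each round joins only the newly
-- derived pairs against the base edge relation, instead of re-joining
-- the entire fact set with itself as naive evaluation does.
semiNaiveTC :: Ord a => Set (a, a) -> Set (a, a)
semiNaiveTC edges = go edges edges
  where
    go known delta
      | Set.null new = known
      | otherwise    = go (Set.union known new) new
      where
        new = Set.fromList
                [ (x, y)
                | (x, z)  <- Set.toList delta
                , (z', y) <- Set.toList edges
                , z == z'
                ]
              `Set.difference` known
```

Joining the delta against the base relation suffices here because the recursive rule is linear; the general transformation tracks a delta per recursive predicate.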

2. Missing Datalog Features

| Feature | Status | Description |
|---|---|---|
| Negation | Not supported | `not p(X)` in rule bodies |
| Aggregation | Not supported | count, sum, min, max |
| Built-in predicates | Not supported | `X < Y`, `X = Y + 1` |
| Stratification | Not applicable | Would be needed for negation |
| Magic sets | Not implemented | Query optimization technique |
| Tabling/Memoization | Basic only | Via relation storage |

3. Usability Limitations

  • No REPL: No interactive query interface
  • No CLI executable: Executable is commented out in cabal file
  • Limited error messages: Parsing errors are good (Megaparsec), but evaluation errors are basic
  • No source locations: Error messages don't include line numbers for semantic errors
  • No pretty-printed output: Query results shown as Haskell data structures

4. Safety/Edge Cases

  • Unsafe head variables: Variables in rule heads not appearing in bodies cause universe enumeration
  • No recursion limit: Deeply recursive programs could stack overflow
  • No fact limit: Large derived fact sets could exhaust memory

5. Testing Gaps

  • Only parser golden tests are active
  • Database and fixed-point tests are commented out
  • No property-based testing
  • No negative test cases (expected failures)

Example Programs

Transitive Closure (ancestor.dl)

parent(xerces, brooke).
parent(brooke, damocles).

ancestor(X, Y) :- parent(X, Y).
ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y).

Derives: ancestor(xerces, brooke), ancestor(brooke, damocles), ancestor(xerces, damocles)

Symmetric Closure with Parity (graph.dl)

odd(X,Y) :- r(X,Y).
odd(X,Y) :- even(X,Z), r(Z,Y).
even(X,Y) :- odd(X,Z), r(Z,Y).

r(0,1). r(1,2). r(2,3). r(3,4). r(4,5).
r(X,Y) :- r(Y,X).

Computes which node pairs are connected by odd- and even-length paths, after the last rule symmetrizes the edge relation r.


Recommendations for Future Development

High Priority

  1. Enable the executable with a simple REPL or query interface
  2. Implement semi-naive evaluation for significant performance improvement
  3. Add more tests including evaluation tests and property tests
  4. Add source locations to AST for better error messages

Medium Priority

  1. Support negation with stratification checking
  2. Add built-in predicates for arithmetic comparisons
  3. Implement query optimization (join reordering)
  4. Add aggregation support

Nice to Have

  1. Magic sets transformation for goal-directed evaluation
  2. Incremental evaluation for database updates
  3. Parallel evaluation of independent rules
  4. External data sources (CSV, databases)

Conclusion

toy-datalog is a well-structured, pedagogically clear implementation of core Datalog semantics. It correctly implements the fundamental operations of bottom-up evaluation with proper variable binding and fixed-point computation. While it lacks advanced features like negation, aggregation, and query optimization, it provides a solid foundation for learning about Datalog and could be extended into a more full-featured implementation.

The code quality is high, with good use of Haskell idioms and proper error handling. The main limitations are performance (naive evaluation) and missing features rather than correctness issues.

Changelog

  • Mar 4, 2026: initial version.