useful-notes/hassan/006-cales-datalog.md

toy-datalog: Implementation Notes

Overview

toy-datalog is a Haskell implementation of a Datalog interpreter. Datalog is a declarative logic programming language that is a syntactic subset of Prolog, commonly used for:

  • Database query languages (it's the theoretical foundation for SQL's recursive queries)
  • Static program analysis
  • Knowledge representation and reasoning
  • Access control policies

This implementation provides parsing, evaluation, and fixed-point computation for Datalog programs.

Author: Cale Gibbard
License: BSD-2-Clause
Version: 0.1.0.0


Architecture

┌─────────────────────────────────────────────────────────────┐
│                    Datalog Program (Text)                    │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                   Parser (Megaparsec)                        │
│  • parseTerm, parseAtom, parseRule, parseProgram            │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                     Abstract Syntax Tree                     │
│  • Program = [Rule]                                          │
│  • Rule = Atom :- [Atom]                                     │
│  • Atom = RelId [Term]                                       │
│  • Term = Con ConId | Var VarId                              │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                      Database                                │
│  • Herbrand Universe (all constants)                         │
│  • Relations (indexed tuples)                                │
│  • Rules (indexed by head relation)                          │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                  Evaluation Engine                           │
│  • Forward chaining (bottom-up)                              │
│  • Fixed-point iteration                                     │
│  • Variable binding and unification                          │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│              Derived Facts (Complete Database)               │
└─────────────────────────────────────────────────────────────┘

Module Analysis

1. Datalog.Syntax (src/Datalog/Syntax.hs)

Defines the core AST types:

| Type | Description | Example |
|---|---|---|
| `ConId` | Constant identifier (lowercase/numeric start) | `xerces`, `0`, `brooke` |
| `VarId` | Variable identifier (uppercase start) | `X`, `Y`, `Parent` |
| `RelId` | Relation/predicate name | `parent`, `ancestor` |
| `Term` | Either a constant or variable | `Con "xerces"`, `Var "X"` |
| `Atom` | Relation applied to terms | `parent(X, Y)` |
| `Rule` | Head atom with body atoms | `ancestor(X,Y) :- parent(X,Y).` |
| `Program` | Collection of rules | Full Datalog program |

Key Features:

  • HasConstants typeclass for extracting constants from any Datalog structure
  • Pretty typeclass for converting AST back to readable syntax
  • Smart constructor term that auto-detects constants vs variables by case
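The case-based detection can be sketched as follows (a minimal sketch of the idea; the actual definition of `term` in Datalog.Syntax may differ in detail):

```haskell
import Data.Char (isUpper)

newtype ConId = ConId String deriving (Eq, Ord, Show)
newtype VarId = VarId String deriving (Eq, Ord, Show)

data Term = Con ConId | Var VarId deriving (Eq, Ord, Show)

-- Case-based classification: an identifier starting with an uppercase
-- letter is a variable; anything else (lowercase or digit) is a constant.
term :: String -> Term
term s@(c:_) | isUpper c = Var (VarId s)
term s                   = Con (ConId s)
```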

2. Datalog.Parser (src/Datalog/Parser.hs)

Megaparsec-based parser with:

  • Lexical conventions:

    • Whitespace handling (spaces, line comments --, block comments {- -})
    • Constants: start with lowercase or digit
    • Variables: start with uppercase
    • Relations: start with lowercase
  • Grammar:

    program  ::= rule*
    rule     ::= atom '.'                    (fact)
               | atom ':-' query '.'         (rule with body)
    query    ::= atom (',' atom)*
    atom     ::= relId '(' term (',' term)* ')'
    term     ::= variable | constant
    

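The grammar's `term` and `atom` productions can be illustrated with a self-contained sketch using base's `Text.ParserCombinators.ReadP` (the real module uses Megaparsec; the names `pName`, `pTerm`, `pAtom`, `runAtom` and the simplified whitespace handling here are illustrative only):

```haskell
import Data.Char (isAlphaNum, isDigit, isLower, isUpper)
import Text.ParserCombinators.ReadP

data Term = Con String | Var String deriving (Eq, Show)
data Atom = Atom String [Term] deriving (Eq, Show)

-- An identifier: a first character satisfying the predicate,
-- followed by any alphanumeric characters (maximal munch).
pName :: (Char -> Bool) -> ReadP String
pName first = (:) <$> satisfy first <*> munch isAlphaNum

-- term ::= variable | constant
pTerm :: ReadP Term
pTerm = (Var <$> pName isUpper)
    +++ (Con <$> pName (\c -> isLower c || isDigit c))

-- atom ::= relId '(' term (',' term)* ')'
pAtom :: ReadP Atom
pAtom = do
  rel <- pName isLower
  _   <- char '('
  ts  <- pTerm `sepBy1` (char ',' *> skipSpaces)
  _   <- char ')'
  pure (Atom rel ts)

-- Accept only a complete, unambiguous parse of the whole input.
runAtom :: String -> Maybe Atom
runAtom s = case [a | (a, "") <- readP_to_S (pAtom <* eof) s] of
  [a] -> Just a
  _   -> Nothing
```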
3. Datalog.Eval (src/Datalog/Eval.hs)

The evaluation engine implements naive bottom-up evaluation (forward chaining).

Data Structures

Relation:

data Relation = Relation
  { _relation_arity   :: Int
  , _relation_members :: Set [ConId]           -- all tuples
  , _relation_indices :: Map (Int, ConId) (Set [ConId])  -- index
  }

The index maps (position, value) pairs to tuples containing that value at that position, enabling efficient lookups.
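One possible shape for building and querying such an index (`buildIndex` and `lookupIndex` are illustrative names, not the module's own):

```haskell
import           Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map
import           Data.Set (Set)
import qualified Data.Set as Set

type ConId = String

-- File each tuple under every (position, value) pair it contains.
buildIndex :: Set [ConId] -> Map (Int, ConId) (Set [ConId])
buildIndex tuples = Map.fromListWith Set.union
  [ ((i, c), Set.singleton t)
  | t <- Set.toList tuples
  , (i, c) <- zip [0 ..] t
  ]

-- All tuples whose component at position i equals c.
lookupIndex :: Map (Int, ConId) (Set [ConId]) -> Int -> ConId -> Set [ConId]
lookupIndex idx i c = Map.findWithDefault Set.empty (i, c) idx
```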

Database:

data Database = Database
  { _database_universe  :: Set ConId        -- Herbrand universe
  , _database_relations :: Map RelId Relation
  , _database_rules     :: Map RelId [Rule]
  }

Evaluation Algorithm

  1. evalAtomDb: Query an atom against the database

    • Uses indices to find candidate tuples
    • Computes intersection of candidates for each bound position
    • Returns set of variable bindings
  2. evalConjunction: Evaluate a conjunction (rule body)

    • Evaluates each atom
    • Joins bindings using >< operator (relational join)
  3. immediateConsequences: Derive new facts from a rule

    • Evaluate the body to get bindings
    • Project bindings onto head atom terms
  4. extendDb: Single iteration of forward chaining

    • Apply all rules to derive new facts
    • Add new tuples to relations
  5. extendFixedpointDb: Iterate until no new facts

    • Repeatedly call extendDb until database stops changing
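The overall shape of steps 4 and 5 can be sketched generically (`fixpoint` and `stepTC` are illustrative stand-ins for `extendDb` and `extendFixedpointDb`, specialized here to transitive closure over a plain set of pairs):

```haskell
import           Data.Set (Set)
import qualified Data.Set as Set

-- Iterate one forward-chaining step until the database stops changing.
fixpoint :: Eq a => (a -> a) -> a -> a
fixpoint step x
  | x' == x   = x
  | otherwise = fixpoint step x'
  where x' = step x

-- One naive iteration of transitive closure: join every pair of facts
-- (x,z), (z,y) and add (x,y), mirroring immediateConsequences for the
-- rule ancestor(X,Y) :- parent(X,Z), ancestor(Z,Y).
stepTC :: Ord a => Set (a, a) -> Set (a, a)
stepTC facts = Set.union facts $ Set.fromList
  [ (x, y)
  | (x, z)  <- Set.toList facts
  , (z', y) <- Set.toList facts
  , z == z'
  ]
```

Termination follows the same argument as for the full interpreter: the set of derivable pairs is bounded and each step only grows it, so the iteration must reach a fixed point.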

4. Datalog.IO (src/Datalog/IO.hs)

File I/O utilities:

  • readProgram: Parse a file into a Program
  • readDatabase: Parse and load into a Database

Correctness Analysis

Verified Correct Behaviors

  1. Variable binding consistency: The mkBinding function in unsafeEvalAtomRel correctly handles the case where the same variable appears multiple times in an atom. It checks that all occurrences bind to the same value (lines 125-130).

  2. Join semantics: The joinBindings function correctly implements a consistent merge of two bindings, failing when variables are bound to different values.

  3. Arity checking: The implementation properly tracks and validates relation arities, preventing arity mismatches.

  4. Fixed-point termination: Because the Herbrand universe is finite and derivation only adds facts (monotonicity), bottom-up evaluation of pure Datalog always terminates, so the fixed-point iteration is guaranteed to reach quiescence.

  5. Index maintenance: Indices are properly maintained during insertion via insertInRelation.
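The join semantics of point 2 amount to a consistent merge of variable bindings, roughly as follows (a sketch; the real `joinBindings` may differ in representation):

```haskell
import           Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map

type Binding = Map String String  -- variable name to constant

-- Merge two bindings, failing when a shared variable is bound to
-- different constants on the two sides.
joinBindings :: Binding -> Binding -> Maybe Binding
joinBindings b1 b2
  | consistent = Just (Map.union b1 b2)
  | otherwise  = Nothing
  where
    consistent = and
      [ Map.lookup v b2 `elem` [Nothing, Just c]
      | (v, c) <- Map.toList b1
      ]
```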

Potential Issues

  1. projectTuple with unbound variables (lines 147-157):

    When the head of a rule contains a variable not appearing in the body, projectTuple generates all possible values from the Herbrand universe. This is semantically correct for Datalog (the variable ranges over all constants), but:

    • It can cause a combinatorial explosion
    • This is an unusual Datalog pattern (typically all head variables appear in the body)

    Example: foo(X) :- bar(Y). would generate foo(c) for every constant c in the universe, whenever bar holds for at least one Y.

  2. Naive evaluation strategy:

    The implementation uses naive bottom-up evaluation, which re-derives all facts at each iteration. This is correct but inefficient compared to semi-naive evaluation.

  3. Unsafe partial function:

    insertManyInRelationDb uses error on line 51 if the relation doesn't exist. However, this should never occur in practice due to prior arity checks.

  4. Empty atom arguments:

    The parser allows zero-argument atoms foo(), and evaluation handles them correctly (they act as propositional facts).
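The projection behavior described in issue 1 can be sketched with the list applicative (a hypothetical `projectHead`; the module's projectTuple works over its own types): an unbound head variable contributes every constant in the universe, so the number of derived tuples multiplies accordingly.

```haskell
import           Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map

type ConId = String

-- Head terms are either constants or variables; bound variables are
-- read from the body's binding, unbound ones range over the universe.
projectHead
  :: [ConId]               -- Herbrand universe
  -> Map String ConId      -- bindings produced by the rule body
  -> [Either ConId String] -- head terms: Left constant, Right variable
  -> [[ConId]]             -- one tuple per combination (list applicative)
projectHead universe binding = traverse fill
  where
    fill (Left c)  = [c]
    fill (Right v) = case Map.lookup v binding of
      Just c  -> [c]
      Nothing -> universe  -- unbound head variable: enumerate everything
```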


Strengths

1. Clean Haskell Design

  • Proper use of newtypes for type safety (ConId, VarId, RelId)
  • Effective use of typeclasses (HasConstants, Pretty)
  • Clear separation of concerns between modules
  • Appropriate use of Either for error handling

2. Efficient Indexing

  • Position-based indexing enables efficient lookups
  • Intersection of index results for atoms with multiple constants
  • Avoids full table scans when constants are present in queries

3. Correct Semantics

  • Proper handling of variable binding consistency
  • Correct relational join semantics
  • Sound fixed-point computation

4. Good Parser Design

  • Uses Megaparsec for good error messages
  • Supports both Haskell-style comments (-- and {- -})
  • Clean lexeme handling with whitespace consumption

5. Extensible Architecture

  • Easy to add new syntax constructs
  • Database structure supports multiple relations
  • Error types are well-defined and informative

Limitations

1. Performance Limitations

| Issue | Impact | Potential Fix |
|---|---|---|
| Naive evaluation | Re-derives all facts each iteration | Implement semi-naive evaluation |
| No query optimization | Joins evaluated left-to-right | Add a query planner |
| Full materialization | All derived facts stored | Support demand-driven evaluation |
| No stratification | Potential issues with negation (not yet supported) | Add stratification analysis |
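The semi-naive fix suggested above restricts each round's joins to the facts derived in the previous round (the "delta"). A sketch for the linear transitive-closure rule (illustrative only, not part of toy-datalog):

```haskell
import           Data.Set (Set)
import qualified Data.Set as Set

-- Semi-naive transitive closure: each round joins only the newly
-- derived pairs against the base edge relation, instead of re-joining
-- the entire fact set with itself as naive evaluation does.
semiNaiveTC :: Ord a => Set (a, a) -> Set (a, a)
semiNaiveTC edges = go edges edges
  where
    go known delta
      | Set.null new = known
      | otherwise    = go (Set.union known new) new
      where
        new = Set.fromList
                [ (x, y)
                | (x, z)  <- Set.toList delta
                , (z', y) <- Set.toList edges
                , z == z'
                ]
              `Set.difference` known
```

Joining the delta against the base relation suffices here because the recursive rule is linear; the general transformation tracks a delta per recursive predicate.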

2. Missing Datalog Features

| Feature | Status | Description |
|---|---|---|
| Negation | Not supported | `not p(X)` in rule bodies |
| Aggregation | Not supported | count, sum, min, max |
| Built-in predicates | Not supported | `X < Y`, `X = Y + 1` |
| Stratification | Not applicable | Would be needed for negation |
| Magic sets | Not implemented | Query optimization technique |
| Tabling/Memoization | Basic only | Via relation storage |

3. Usability Limitations

  • No REPL: No interactive query interface
  • No CLI executable: Executable is commented out in cabal file
  • Limited error messages: Parsing errors are good (Megaparsec), but evaluation errors are basic
  • No source locations: Error messages don't include line numbers for semantic errors
  • No pretty-printed output: Query results shown as Haskell data structures

4. Safety/Edge Cases

  • Unsafe head variables: Variables in rule heads not appearing in bodies cause universe enumeration
  • No recursion limit: Deeply recursive programs could stack overflow
  • No fact limit: Large derived fact sets could exhaust memory

5. Testing Gaps

  • Only parser golden tests are active
  • Database and fixed-point tests are commented out
  • No property-based testing
  • No negative test cases (expected failures)

Example Programs

Transitive Closure (ancestor.dl)

parent(xerces, brooke).
parent(brooke, damocles).

ancestor(X, Y) :- parent(X, Y).
ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y).

Derives: ancestor(xerces, brooke), ancestor(brooke, damocles), ancestor(xerces, damocles)

Symmetric Closure with Parity (graph.dl)

odd(X,Y) :- r(X,Y).
odd(X,Y) :- even(X,Z), r(Z,Y).
even(X,Y) :- odd(X,Z), r(Z,Y).

r(0,1). r(1,2). r(2,3). r(3,4). r(4,5).
r(X,Y) :- r(Y,X).

Computes which node pairs are connected by odd- and even-length paths, after the last rule symmetrizes the edge relation r.


Recommendations for Future Development

High Priority

  1. Enable the executable with a simple REPL or query interface
  2. Implement semi-naive evaluation for significant performance improvement
  3. Add more tests including evaluation tests and property tests
  4. Add source locations to AST for better error messages

Medium Priority

  1. Support negation with stratification checking
  2. Add built-in predicates for arithmetic comparisons
  3. Implement query optimization (join reordering)
  4. Add aggregation support

Nice to Have

  1. Magic sets transformation for goal-directed evaluation
  2. Incremental evaluation for database updates
  3. Parallel evaluation of independent rules
  4. External data sources (CSV, databases)

Conclusion

toy-datalog is a well-structured, pedagogically clear implementation of core Datalog semantics. It correctly implements the fundamental operations of bottom-up evaluation with proper variable binding and fixed-point computation. While it lacks advanced features like negation, aggregation, and query optimization, it provides a solid foundation for learning about Datalog and could be extended into a more full-featured implementation.

The code quality is high, with good use of Haskell idioms and proper error handling. The main limitations are performance (naive evaluation) and missing features rather than correctness issues.

Changelog

  • Mar 4, 2026: initial version.