Chapter 2: Parsing Lua with Analisar

Wire a real Lua parser into Salsa. Source text → AST, incrementally.

What You'll Learn

  • How to make parsing a tracked query (so it only re-runs when source changes)
  • How analisar's AST is structured
  • Why parsing needs to be separate from type-checking (incrementality!)
  • The difference between Salsa-owned data and borrowed data in tracked functions

Why Separate Parsing from Type-Checking?

Here's a question you might be asking: why not just type-check directly from the source text? Why have a parsing step at all?

Two reasons:

  1. Parsing is expensive, and type-checking is even more expensive. If you mash them together, you lose the ability to cache them independently. A change that only affects one function's body shouldn't require re-parsing the entire file, and it shouldn't require re-type-checking unrelated functions either.

  2. Different invalidation granularity. The parser runs on the whole source text; the type checker runs on individual expressions. By separating them, Salsa can skip the type checker when the source changes but the parse result comes out identical (think comments or whitespace edits that don't change the AST structure).

The pipeline so far:

SourceFile (input)
     │
     ▼
   parse()     ← tracked, cached
     │
     ▼
   LuaAst      ← owned data
     │
     ▼
  derived queries (type checking, name resolution, etc.)

Step 1: The Same Input

We reuse the SourceFile input from Chapter 1:

#[salsa::input]
pub struct SourceFile {
    #[returns(ref)]
    pub path: String,
    #[returns(ref)]
    pub text: String,
}

Nothing new here. The source text is our input, and Salsa tracks when it changes.

Step 2: An Owned AST

Analisar's AST uses borrowed strings (Cow<'a, str> for identifiers). This makes sense for analisar — it avoids allocating copies. But Salsa tracked functions need to return owned data. Why? Because the cached result has to survive across revisions. If the result borrows from the source text, and the source text changes, you've got a dangling reference.

So we define our own owned AST types:

#[derive(Debug, Clone, PartialEq, Eq, Hash, salsa::Update)]
pub struct LuaAst {
    pub statements: Vec<Statement>,
}

#[derive(Debug, Clone, PartialEq, Eq, Hash, salsa::Update)]
pub enum Statement {
    Assignment { local: bool, targets: Vec<Expression>, values: Vec<Expression> },
    Expression(Expression),
    Return(Vec<Expression>),
    Function { local: bool, name: String, params: Vec<String>, body: Vec<Statement> },
    // ...
}

Notice the salsa::Update derive. Salsa requires it for any type a tracked function returns: it tells Salsa how to safely update a cached value in place when a query re-executes (this matters for structural sharing in advanced use cases; for now, just know it's required).

But wait — isn't cloning expensive? Yes, for large files. That's why in Chapter 3 we'll use #[salsa::interned] for strings, and in Chapter 4 we'll use #[salsa::tracked] for AST nodes. Both avoid cloning. For now, we keep it simple with owned Strings.

Step 3: The Parse Query

#[salsa::tracked]
pub fn parse(db: &dyn salsa::Database, source: SourceFile) -> LuaAst {
    let text = source.text(db);
    convert_ast(text)
}

This is the key tracked function. It takes a SourceFile (input) and returns a LuaAst (owned data). Salsa caches this — if the source text hasn't changed since the last revision, the cached AST is returned without re-parsing.

The convert_ast function is where we bridge analisar's borrowed world into our owned world. We walk analisar's AST node by node and build our own representation. It's mechanical — mapping one enum variant to another — but it's important to understand why we do it:

  • analisar gives us a correct but borrowed AST
  • Our AST is owned, hashable, and Salsa-compatible
  • The conversion is the boundary between "external library" and "Salsa system"

This boundary pattern — converting external data into Salsa-owned data at the query boundary — is common in Salsa projects. rust-analyzer does the same thing with the Rust parser.

Step 4: Query Chaining

Now that we have parse(), we can build derived queries on top of it:

#[salsa::tracked]
pub fn top_level_names(db: &dyn salsa::Database, source: SourceFile) -> Vec<String> {
    let ast = parse(db, source);  // depends on parse
    let mut names = Vec::new();
    for stmt in &ast.statements {
        match stmt {
            Statement::Assignment { local: true, targets, .. } => {
                for target in targets {
                    if let Expression::Name(name) = target {
                        names.push(name.clone());
                    }
                }
            }
            Statement::Function { local: true, name, .. } => {
                names.push(name.clone());
            }
            _ => {}
        }
    }
    names
}

top_level_names calls parse(db, source), so it depends on the parse result. If the source changes:

  1. parse() re-runs (its input changed)
  2. top_level_names() re-runs (its dependency — parse — produced a new result)

But here's a subtlety: if parse() returns a different LuaAst (the source changed in a way that affects the AST), top_level_names() re-runs. If parse() returns the same LuaAst (e.g., you changed a comment that analisar doesn't include in the AST), top_level_names() returns its cached value.

Salsa checks the value of dependencies, not just whether they were re-executed: it re-runs parse(), compares the result to the old cached value, and only propagates the change if the value is actually different. Salsa calls this backdating; in the incremental-computation literature it's known as early cutoff.

Running

cargo run --bin ch02-parsing-lua

Key Takeaways

  1. Parsing is a query. By making parse() a tracked function, we get caching for free. This is the first step in any Salsa pipeline: raw input → parsed representation.

  2. Owned data at the boundary. External libraries often return borrowed data. Convert to owned at the Salsa boundary. Later we'll use interning to avoid the cloning cost.

  3. Query chaining. top_level_names depends on parse. Changing the source triggers both. But if parse returns the same value, downstream queries are skipped.

  4. The pipeline is already incremental. We have two layers of caching (parse + derived). Each layer only re-runs when its inputs change. In a real type checker, you'd have many more layers.

What's Next

Chapter 3: Interned Symbols — We'll use #[salsa::interned] for identifiers, which lets us compare names by ID instead of by string comparison. This is how compilers handle identifiers efficiently.