Compiler Internals
Deep dive into the Turf compiler internals: AST, Codegen, and LLVM integration.
This section provides a technical deep dive into how the Turf compiler transforms source code into native executables.
The Compilation Pipeline
- Lexical Analysis (Lexer.cpp): Converts the raw source text into a stream of tokens (e.g., TOKEN_INT, TOKEN_IDENTIFIER, TOKEN_PLUS).
- Parsing (Parser.cpp): A recursive-descent parser consumes tokens and builds an Abstract Syntax Tree (AST).
- Semantic Analysis: Performed during parsing and AST construction to ensure types are compatible and variables are declared before use.
- Code Generation (Codegen.cpp): The AST nodes are traversed, and each node’s codegen() method generates LLVM Intermediate Representation (IR).
- Linking & Optimization: The generated LLVM IR is passed to the LLVM backend to produce machine code, which is then linked into a final binary.
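The lexical analysis stage can be illustrated with a minimal sketch. It is not the code in Lexer.cpp: the token names TOKEN_INT, TOKEN_IDENTIFIER, and TOKEN_PLUS come from the list above, while TOKEN_EOF, the Token struct, and the tokenize() helper are assumptions made for illustration.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Token kinds named after the examples in the text; TOKEN_EOF is an assumption.
enum TokenKind { TOKEN_INT, TOKEN_IDENTIFIER, TOKEN_PLUS, TOKEN_EOF };

struct Token {
    TokenKind kind;
    std::string text;
};

// Hypothetical helper: turn raw source text into a token stream.
// Characters this sketch does not model are skipped.
std::vector<Token> tokenize(const std::string &src) {
    std::vector<Token> out;
    std::size_t i = 0;
    while (i < src.size()) {
        unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isspace(c)) {
            ++i;
        } else if (std::isdigit(c)) {
            std::size_t start = i;
            while (i < src.size() && std::isdigit(static_cast<unsigned char>(src[i])))
                ++i;
            out.push_back({TOKEN_INT, src.substr(start, i - start)});
        } else if (std::isalpha(c) || c == '_') {
            std::size_t start = i;
            while (i < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[i])) || src[i] == '_'))
                ++i;
            out.push_back({TOKEN_IDENTIFIER, src.substr(start, i - start)});
        } else if (c == '+') {
            out.push_back({TOKEN_PLUS, "+"});
            ++i;
        } else {
            ++i; // not modelled in this sketch
        }
    }
    out.push_back({TOKEN_EOF, ""});
    return out;
}
```

Under these assumptions, tokenize("x + 42") yields an identifier, a plus, an integer, and an end-of-file marker, which is the kind of stream a recursive-descent parser would consume.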
Multi-Pass Compilation
To support recursion and calls to functions defined later in the file, without requiring the developer to write prototypes by hand, Turf employs a multi-pass compilation strategy in main.cpp.
- Pre-Pass (Prototype Registration): The compiler performs an initial, partial parse of the source file. It looks specifically for fn (function definition) nodes. For each function found, it generates an LLVM function prototype (name and signature) but skips the body.
- Main Pass (Full Codegen): The compiler resets the lexer and performs a full parse. Because all functions were registered in the pre-pass, calls to functions defined later in the file are correctly resolved during this second pass.
// Pre-pass: register function prototypes
while (CurTok != TOK_EOF) {
    if (CurTok == TOK_FN) {
        auto AST = ParseExpression();
        AST->codegen(); // Registers prototype on first call
    } else {
        getNextToken();
    }
}
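The effect of the two passes can be shown without LLVM. The following sketch is a simplified model, not the logic in main.cpp: the Item struct, registerPrototypes(), and resolveCalls() are hypothetical names, and a set of function names stands in for the LLVM prototypes.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a parsed top-level item.
struct Item {
    enum Kind { FuncDef, Call } kind;
    std::string name; // function being defined or called
};

// Pre-pass: record every defined function name, ignoring bodies and calls.
std::set<std::string> registerPrototypes(const std::vector<Item> &items) {
    std::set<std::string> protos;
    for (const Item &it : items)
        if (it.kind == Item::FuncDef)
            protos.insert(it.name);
    return protos;
}

// Main pass: each call is checked against the registered prototypes.
// Returns the names of calls that could not be resolved.
std::vector<std::string> resolveCalls(const std::vector<Item> &items,
                                      const std::set<std::string> &protos) {
    std::vector<std::string> unresolved;
    for (const Item &it : items)
        if (it.kind == Item::Call && protos.count(it.name) == 0)
            unresolved.push_back(it.name);
    return unresolved;
}
```

In this model, a call that appears before its definition still resolves, because the pre-pass has seen the whole file before the main pass starts. Only calls to functions that are never defined are reported.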
Abstract Syntax Tree (AST)
The AST is the heart of the compiler’s intermediate representation. Every language construct is represented by a class inheriting from ExprAST (defined in AST.h).
Key AST Nodes
- NumberExprAST: Represents numeric literals (integers or doubles).
- BinaryExprAST: Represents operations like +, -, *, /. Includes support for type promotion (e.g., int + double results in double).
- VariableExprAST: Represents variable references, handled via the SymbolTable.
- IfExprAST: Represents conditional branching, implementing the if-then-else logic using LLVM br and phi nodes.
- WhileExprAST: Represents loop constructs with support for break and continue.
- FuncDefExprAST: Represents function definitions, managing its own local scope in the symbol table.
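The class hierarchy and the type promotion rule can be sketched as follows. This is an illustration, not the contents of AST.h: the class names follow the list above, but the two-member TurfType enum and the type() method are assumptions, and type() replaces codegen() so the sketch stays free of LLVM. The promotion rule is assumed to be symmetric, so double + int also yields double.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical two-member enum; the real TurfType is not shown in the text.
enum class TurfType { Int, Double };

// Base class named after the document's ExprAST. The real nodes expose
// codegen(); this sketch substitutes a type query instead.
struct ExprAST {
    virtual ~ExprAST() = default;
    virtual TurfType type() const = 0;
};

struct NumberExprAST : ExprAST {
    TurfType ty;
    double value;
    NumberExprAST(TurfType t, double v) : ty(t), value(v) {}
    TurfType type() const override { return ty; }
};

struct BinaryExprAST : ExprAST {
    char op;
    std::unique_ptr<ExprAST> lhs, rhs;
    BinaryExprAST(char o, std::unique_ptr<ExprAST> l, std::unique_ptr<ExprAST> r)
        : op(o), lhs(std::move(l)), rhs(std::move(r)) {}
    // Promotion: if either operand is a double, the result is a double.
    TurfType type() const override {
        bool anyDouble = lhs->type() == TurfType::Double ||
                         rhs->type() == TurfType::Double;
        return anyDouble ? TurfType::Double : TurfType::Int;
    }
};
```

A real code generator would apply the same decision before emitting an instruction, converting the integer operand to a floating-point value and then choosing a floating-point operation.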
Symbol Table Architecture
The SymbolTable (implemented in SymbolTable.cpp and SymbolTable.h) is responsible for managing variable scopes and ensuring semantic correctness.
- Scoped Logic: Turf uses a stack of maps (CurrentScope) to handle nested scopes (e.g., inside {...} blocks).
- Type Safety: Each entry in the symbol table stores the variable’s name, its alloca instruction (memory address), and its TurfType.
- Shadowing Detection: The compiler automatically detects if a new variable “shadows” one in an outer scope and issues a warning if enabled.
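A stack of maps with shadowing detection might look like the sketch below. It does not reproduce SymbolTable.h: the ScopedSymbolTable class, its method names, and the Symbol struct are hypothetical, and plain strings stand in for the alloca instruction and the TurfType.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical entry: the real table stores an LLVM alloca and a TurfType;
// plain strings stand in for both here.
struct Symbol {
    std::string type;
    std::string slot; // placeholder for the alloca
};

class ScopedSymbolTable {
    std::vector<std::unordered_map<std::string, Symbol>> scopes;

public:
    ScopedSymbolTable() { enterScope(); } // start with a global scope
    void enterScope() { scopes.emplace_back(); }
    void exitScope() {
        if (scopes.size() > 1) // never pop the global scope
            scopes.pop_back();
    }

    // Declares a name in the innermost scope. Returns true when the new
    // variable shadows a declaration in an enclosing scope.
    bool declare(const std::string &name, const Symbol &sym) {
        bool shadows = false;
        for (std::size_t i = 0; i + 1 < scopes.size(); ++i)
            if (scopes[i].count(name))
                shadows = true;
        scopes.back()[name] = sym;
        return shadows;
    }

    // Looks a name up from the innermost scope outward; nullptr if undeclared.
    const Symbol *lookup(const std::string &name) const {
        for (auto it = scopes.rbegin(); it != scopes.rend(); ++it) {
            auto found = it->find(name);
            if (found != it->end())
                return &found->second;
        }
        return nullptr;
    }
};
```

In this sketch the caller decides what to do with the shadowing result, for example printing a warning only when the corresponding option is enabled. Lookup walks from the innermost scope outward, so the inner declaration wins until its scope is exited.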
LLVM Backend Integration
Turf leverages the LLVM C++ API to bridge the gap between AST nodes and machine code.
- llvm::LLVMContext: Owns the global state and maintains the uniqueness of types and constants.
- llvm::IRBuilder<>: A powerful helper for emitting IR instructions.
- llvm::Module: The container for all generated IR, which is later optimized and converted to an object file via the TargetMachine.
Codegen Context
During codegen(), nodes often need access to shared LLVM state. Turf exposes the Builder, TheModule, and TheContext through global pointers, so AST classes do not have to store this state or receive it as parameters in order to emit instructions.
Built-in Functions
Turf includes a built-in function system (defined in Builtins.cpp). Functions like print() and printline() are implemented as calls to external functions declared in the LLVM module (e.g., C’s printf), allowing Turf programs to write to standard output.