SU-CS143 APR082025

Last edited: August 8, 2025

Lexer

Goal: identify tokens in the input string. Its a lot of regular expressions and DFAing.

Example

Consider:

if (i == j)
    z = 0;
else
    z = 1;

We want a linear algorithm for lexing, because quadratic algorithms are slow. The gaol here is to partition the input string into substrings.

Let’s make a Lexer!

identify token classes
describe which strings belong to each token

token classes

token classes define all items of interest; this is dependent on the choice of language and the design of the parser.

SU-CS143 APR102025

Last edited: August 8, 2025

Its DFA time! Deterministic Finite Automata + regular expression (complexity). See programmatically compiling RegEx to DFA.

SU-CS143 APR152025

Last edited: August 8, 2025

Another review of the compiler

Compiler Front End

Lexer => Parser => Semantic Analysis; usually language specific.

Compiler Mid End

Optimizer; usually shared across many languages.

Compiler Back End

Codegen

parser

parser does parsing

parser: takes sequence of tokens from the lexer and produce AST of the program. General goal is to distinguish between valid and invalid strings of tokens.

Importantly, we don’t really emit errors from parsing; instead we just do it at the layer below and above.

SU-CS143 APR172025

Last edited: August 8, 2025

and also abstract syntax tree and top-down parsing

error handling

Recall! Our compiler should detect non-valid programs and to translate valid ones.

Error Kind	Example	Detected by…
Lexical	$ <- doesn’t exist	lexer
Syntax	x * % <- doesnt’ make sense	parser
Semantic	int x; x(3) <- doesn’t check	type checker
Correctness		you!

Good error handler should report errors accurately and clearly; recovering from an error quickly, and not slow down compilation of valid code.

SU-CS143 APR242025

Last edited: August 8, 2025

Bottom-Up Parsing

Bottom-Up Parsing is more general than top-down parsing (i.e., it doesn’t require left factorization, etc.)

This is generally the preferred method. Key insight: we reduce strings into increasingly higher level symbols

Sketch

Because of the Properties of Bottom-Up Parsing, we can….

initialization

split the string into two pieces
right substring has terminals, not yet examined
left substring has terminals and non-terminals

We begin by splitting to the left since we begin with no non-terminals.