Posts

SU-CS143 APR082025

Last edited: August 8, 2025

Lexer

Goal: identify tokens in the input string. Its a lot of regular expressions and DFAing.

Example

Consider:

if (i == j)
    z = 0;
else
    z = 1;

We want a linear algorithm for lexing, because quadratic algorithms are slow. The gaol here is to partition the input string into substrings.

Let’s make a Lexer!

  1. identify token classes
  2. describe which strings belong to each token

token classes

token classes define all items of interest; this is dependent on the choice of language and the design of the parser.

SU-CS143 APR152025

Last edited: August 8, 2025

Another review of the compiler

Compiler Front End

Lexer => Parser => Semantic Analysis; usually language specific.

Compiler Mid End

Optimizer; usually shared across many languages.

Compiler Back End

Codegen

parser

parser does parsing

parser: takes sequence of tokens from the lexer and produce AST of the program. General goal is to distinguish between valid and invalid strings of tokens.

Importantly, we don’t really emit errors from parsing; instead we just do it at the layer below and above.

SU-CS143 APR172025

Last edited: August 8, 2025

and also abstract syntax tree and top-down parsing

error handling

Recall! Our compiler should detect non-valid programs and to translate valid ones.

Error KindExampleDetected by…
Lexical$ <- doesn’t existlexer
Syntaxx * % <- doesnt’ make senseparser
Semanticint x; x(3) <- doesn’t checktype checker
Correctnessyou!

Good error handler should report errors accurately and clearly; recovering from an error quickly, and not slow down compilation of valid code.

SU-CS143 APR242025

Last edited: August 8, 2025

Bottom-Up Parsing

Bottom-Up Parsing is more general than top-down parsing (i.e., it doesn’t require left factorization, etc.)

This is generally the preferred method. Key insight: we reduce strings into increasingly higher level symbols

Sketch

Because of the Properties of Bottom-Up Parsing, we can….

initialization

  • split the string into two pieces
  • right substring has terminals, not yet examined
  • left substring has terminals and non-terminals

We begin by splitting to the left since we begin with no non-terminals.