Houjun Liu

SU-CS143 APR082025

Lexer

Goal: identify tokens in the input string. Its a lot of regular expressions and DFAing.

Example

Consider:

if (i == j)
    z = 0;
else
    z = 1;

We want a linear algorithm for lexing, because quadratic algorithms are slow. The gaol here is to partition the input string into substrings.

Let’s make a Lexer!

  1. identify token classes
  2. describe which strings belong to each token

token classes

token classes define all items of interest; this is dependent on the choice of language and the design of the parser.

We separate the language into a few known tokens:

  • identifier: strings of letters or digits starting with a letter

  • integer: non-empty string of digits

  • keyword: else, if, begin, etc.

  • whitespace: non-empty sequence of blanks, newlines, or tabs

  • (: (

  • [: [

    etc.

  • a note on empty strings

    Its very easy to write regexp which matches on empty strings, which will cause your lexer to spin forever because every position has infinite empty strings. So make sure you don’t do that.

classifying tokens

  1. clasify each substring as a token
  2. return the value or lexeme of the token
    • lexeme: the actual substring
    • token: the classified token class which the substring belongs to

When we classify tokens, we want the “rule of maximum munch”:

throwing stuff away

We can just ignore 1) white spaces and 2) comments (what happens to <keyword> <whitespace> <keyword>? it just looks like <keyword> <keyword> which is fine).

Can this be hard?

Yes. Don’t be FORTRAN: “white space is insignificant”. And then suddenly you can’t tell the difference between:

      DO 5 I = 1,25

versus

      DO 5 I = 1.25

where the latter is a variable named DO5I and the former is a do jump loop.

Or if you don’t mix keywords and statements and now parsing takes forever.

formalism

language

see language

regular languages

See regular language. For properties of regular languages, we take a recursive definition for regular \(A, B\) and alphabet \(\Sigma\):

the following is regular—

  • empty string: \(L\qty(\epsilon) = \qty {’’}\)
  • member of the set: \(L\qty(‘c’) = \qty {c}\) where our particular \(c \in \Sigma\)
  • union: \(L\qty(A + B) = \qty {w | w \in L(A)\ \cup\ w \in L(B) }\)
  • concatenation: \(L\qty(A B) = \{vw | v \in L(A), w \in L(B)\}\)
  • clean star: \(L(A^{*}) = \qty {s_1 \cdot \dots \cdot s_{k} | k \geq 0, s_{i} \in L(A)}\)

component

  • keywords: else = e+l+s+e, etc.
  • integers: digit = 0+1+2+…; integer = digit digit* = digit+
  • identifier: letter (letter + digit)*