Lexer
Goal: identify tokens in the input string. Its a lot of regular expressions and DFAing.
Example
Consider:
if (i == j)
z = 0;
else
z = 1;
We want a linear algorithm for lexing, because quadratic algorithms are slow. The gaol here is to partition the input string into substrings.
Let’s make a Lexer!
- identify token classes
- describe which strings belong to each token
token classes
token classes define all items of interest; this is dependent on the choice of language and the design of the parser.
We separate the language into a few known tokens:
identifier: strings of letters or digits starting with a letter
integer: non-empty string of digits
keyword: else, if, begin, etc.
whitespace: non-empty sequence of blanks, newlines, or tabs
(: (
[: [
etc.
a note on empty strings
Its very easy to write regexp which matches on empty strings, which will cause your lexer to spin forever because every position has infinite empty strings. So make sure you don’t do that.
classifying tokens
- clasify each substring as a token
- return the value or lexeme of the token
- lexeme: the actual substring
- token: the classified token class which the substring belongs to
When we classify tokens, we want the “rule of maximum munch”:
throwing stuff away
We can just ignore 1) white spaces and 2) comments (what happens to <keyword> <whitespace> <keyword>
? it just looks like <keyword> <keyword>
which is fine).
Can this be hard?
Yes. Don’t be FORTRAN: “white space is insignificant”. And then suddenly you can’t tell the difference between:
DO 5 I = 1,25
versus
DO 5 I = 1.25
where the latter is a variable named DO5I
and the former is a do jump loop.
Or if you don’t mix keywords and statements and now parsing takes forever.
formalism
language
see language
regular languages
See regular language. For properties of regular languages, we take a recursive definition for regular \(A, B\) and alphabet \(\Sigma\):
the following is regular—
- empty string: \(L\qty(\epsilon) = \qty {’’}\)
- member of the set: \(L\qty(‘c’) = \qty {c}\) where our particular \(c \in \Sigma\)
- union: \(L\qty(A + B) = \qty {w | w \in L(A)\ \cup\ w \in L(B) }\)
- concatenation: \(L\qty(A B) = \{vw | v \in L(A), w \in L(B)\}\)
- clean star: \(L(A^{*}) = \qty {s_1 \cdot \dots \cdot s_{k} | k \geq 0, s_{i} \in L(A)}\)
component
- keywords: else = e+l+s+e, etc.
- integers: digit = 0+1+2+…; integer = digit digit* = digit+
- identifier: letter (letter + digit)*