Houjun Liu

programatically compiling RegEx to DFA

A high level sketch:

Another High Level Sketch

Step 1: Write some RegExps

Do that.

Step 2: Construct R

For regular expressions you have defined for keywords, identifies, numbers, etc… We want to construct an uber union regular expression:

\begin{align} R &= \text{Keyword} + \text{Identifier} + \text{Number} + \dots \\ &= R_1 | R_2 | R_3 | \dots \end{align}

Step 3: Tokenization

For input \(x_1, …, x_{n}\), for \(i \in 1 …n\) inclusive, we check:

\begin{equation} x_1 \dots x_{i} \stackrel{?}{\in} L\qty( R) \end{equation}

Upon success for some \(i\), we know that \(x_{1}, …, x_{i} \in L\qty(R_{j})\) for some \(j\). We can then send \(x_1, …, x_{i}\) to the parser and then continue lexing for input \(x_{i+1}…x_{n}\).

ambiguity

  • multiple matches, different lengths: we pick the longest possible substring
  • multiple matches, same length: pick the first rule
  • nothing matches: we just put a catch all rule in the bottom

How exactly does “matching” mean?

DFAs, NFAs, subset construction! Conventions used in this class:

  • \(\Sigma\) alphabet
  • \(S\) states
  • \(n\) start
  • \(F \subseteq S\) accepting states
  • \(S \to^{x} S\) transition

and recall regular expressions are equivalent to regular languages. When you attempt to take a transition that doesn’t exist in NFAs, you end into nothing.

algorithm for casting RegExp to NFA

By convention every NFA will have a single starting state and a single ending state. We will then compose every composition’s machine together.

concatenation

For parser of \(A\) and \(B\), we compose them by composing the accept state of \(A\) with an \(\epsilon\) move to \(B\), and make \(A\) no longer an accepting state.

union

We make an epsilon move from the starting state to both \(A\) and \(B\). And then when you hit the accepting state of either \(A\) or \(B\) you epsilon move to the accepting state.

kleene star

implementing DFAs as a table

  • columns: alphabet
  • rows: states

the way you implement this is to just look up the (state, symbol) pair, take the transition, and look up the new state, symbol, etc.

01
aab
bab

is

and so on. To construct this table, you just read off the transition function.