We use CFGs to highlight the recursive structure in expressions.
A CFG consists of
- Non-Recursed Terminals (
if
,else
, etc.) \(T\) - Non-Terminals (
expr
,stmt
, etc.) \(N\) - Start symbol (i.e.,
program
) \(S \in N\) - Set of productions
generative semantics
Let \(G\) be an CFG with start symbol \(S\); the language of \(G\) is…
\begin{equation} \qty {a_1, \dots, a_{n} | S \to^{*} a_{1}, a_{n}} \end{equation}
with each \(a_{j}, \forall j\) being terminals.
That is, the “language of a CFG” is just the set of languages that could be produced by as many moves as needed until we obtain the output.
productions
\begin{equation} X \to Y_1, Y_2, \dots, Y_{n} \end{equation}
whereby \(X \in N\), \(Y_{i} \in T \cup N \cup \qty {\varepsilon}\). Notably, we could replace a non-terminal with nothing.
derivation
a derivation is a sequence of productions which leads to a string of only terminals
A derivation can be drawn as a tree—
- start symbol is the root
- for production \(X \to Y_1, …, Y_{n}\), add children \(Y_1, …, Y_{n}\) to note \(X\)
stratification
stratification allows us to deal with the PEMDAS (/associativity) of the operations. Consider an addition and multiplication world:
E = E+E | E*E | (E) | id
how do we parse id * id + id
.
We can just have a “multiplication world” E’ language, whereby after en counting an multiplication, we are not allowed to do addition anymore unless we encounter a parentheses.
E = E' + E | E'
E' = id * E' | id | (E) * E' | (E)
dangling if problem
Consider the following ambiguous grammar:
E -> if E then E
| if E then E else E
| ....
Consider:
if something then if something then something else something
Is the else the first or the second expression?~:w
to fix this, we need to make a world of “matched if expression”
E -> MIF -- all "then" are matched
| UIF -- some "then" is unmatched
UIF -> if E then E
| if E then MIF else UIF
{- "we don't allow unmatched if expressions in then branches" -}
MIF -> if E then MIF else MIF
| OTHER
Most tools, like flex/bison, will allow you to control precedence.
conventions
non-terminals in upper case
terminals in lower case
start symbol is the non-terminal of the first production
membership in a language is a “yes”/“no”
we must handle errors gracefully
bison could help us actually implement the parser
example
EXPR -> if EXPR then EXPR else EXPR fi
| while EXPR loop EXPR pool
| id