A compiler compiles code!
parts of compilers
lexical analysis
turn the input characters into tokens — “identifying words”
an example!
if x == y then z = 1; else z = 2;
- if: keyword
- " “: white space
- x: identifier
- ==: relation
… and so on
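here’s a toy lexer sketch for that line (token kinds and struct names are made up for illustration, not from any real compiler):

#include <cctype>
#include <iostream>
#include <string>
#include <vector>

struct Token {
    std::string kind;  // "keyword", "identifier", "relation", ...
    std::string text;
};

// tokenize a tiny language: keywords, identifiers, numbers, ==, and punctuation
std::vector<Token> lex(const std::string& src) {
    std::vector<Token> out;
    for (size_t i = 0; i < src.size();) {
        unsigned char c = src[i];
        if (std::isspace(c)) { ++i; continue; }              // skip whitespace
        if (std::isalpha(c)) {                               // keyword or identifier
            size_t j = i;
            while (j < src.size() && std::isalnum((unsigned char)src[j])) ++j;
            std::string word = src.substr(i, j - i);
            bool kw = (word == "if" || word == "then" || word == "else");
            out.push_back({kw ? "keyword" : "identifier", word});
            i = j;
        } else if (std::isdigit(c)) {                        // number literal
            size_t j = i;
            while (j < src.size() && std::isdigit((unsigned char)src[j])) ++j;
            out.push_back({"number", src.substr(i, j - i)});
            i = j;
        } else if (src.compare(i, 2, "==") == 0) {           // two-char relation
            out.push_back({"relation", "=="});
            i += 2;
        } else {                                             // single char: = ;
            out.push_back({"punct", std::string(1, src[i])});
            ++i;
        }
    }
    return out;
}

int main() {
    for (const Token& t : lex("if x == y then z = 1; else z = 2;"))
        std::cout << t.kind << ": " << t.text << "\n";
}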
parsing
ASTify the tokens — “identifying sentences”.
digraph {
graph [bgcolor=transparent];
node [fontcolor=white, color=white, shape=none];
edge [fontcolor=white, color=white];
assign1 [label="assign"];
assign2 [label="assign"];
z1 [label="z", shape=none];
z2 [label="z", shape=none];
x [shape=none];
y [shape=none];
eq [shape=none,label="=="];
2 [shape=none,label="2"];
1 [shape=none,label="1"];
{x, eq, y} -> "relation" -> "predicate";
{z1, 1} -> assign1 -> "then-stmt";
{z2, 2} -> assign2 -> "else-stmt";
{"predicate", "then-stmt", "else-stmt"} -> "if-the-else";
}
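one way to hold that tree in code: a deliberately tiny AST sketch (node shapes here are illustrative):

#include <iostream>
#include <string>
#include <vector>

// just enough structure to hold the if-then-else above
struct Node {
    std::string label;
    std::vector<Node> kids;  // children by value: fine for a small demo tree
};

void print(const Node& n, int depth = 0) {
    std::cout << std::string(2 * depth, ' ') << n.label << "\n";
    for (const Node& k : n.kids) print(k, depth + 1);
}

int main() {
    // if x == y then z = 1; else z = 2;
    Node ast = {"if-then-else", {
        {"predicate", {{"relation", {{"x", {}}, {"==", {}}, {"y", {}}}}}},
        {"then-stmt", {{"assign", {{"z", {}}, {"1", {}}}}}},
        {"else-stmt", {{"assign", {{"z", {}}, {"2", {}}}}}},
    }};
    print(ast);
}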
semantic analysis
check the program’s meaning and lower it into an IR (LLVM IR, HLO, etc.) — “analyzing the sentences”
- type checking
- catching inconsistencies
- scoping rules, etc.
example
Look! Shadowing!!!
#include <iostream>

int main() {
    int i = 3;          // outer i
    {
        int i = 4;      // inner i shadows the outer one
        std::cout << i; // prints 4
    }
}
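under the hood, a semantic analyzer can track this with a stack of scopes. a sketch (a real one would hang types and other attributes off each entry):

#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// a scope stack: each block pushes a map; lookup walks from innermost out
struct Scopes {
    std::vector<std::unordered_map<std::string, int>> stack;

    void push() { stack.emplace_back(); }
    void pop()  { stack.pop_back(); }

    void declare(const std::string& name, int value) {
        stack.back()[name] = value;  // same-scope redeclaration would be an error in a real checker
    }

    // innermost declaration wins: this is shadowing
    const int* lookup(const std::string& name) const {
        for (auto it = stack.rbegin(); it != stack.rend(); ++it) {
            auto f = it->find(name);
            if (f != it->end()) return &f->second;
        }
        return nullptr;  // undeclared
    }
};

int main() {
    Scopes s;
    s.push(); s.declare("i", 3);           // outer { int i = 3;
    s.push(); s.declare("i", 4);           //   inner { int i = 4;
    std::cout << *s.lookup("i") << "\n";   //   prints 4: inner i shadows outer
    s.pop();                               //   }
    std::cout << *s.lookup("i") << "\n";   // prints 3: outer i visible again
    s.pop();                               // }
}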
optimization
rewrite the IR to make the program faster — “editing”.
goals
- run faster
- use less memory
- generally, conserve resources
tricky tricky!
Can
x = y * 0
be optimized to x = 0?
It’s tricky: y may not be numeric at all, and even if it is, y may be a float, in which case NaN * 0 = NaN, not 0.
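a sketch of the guard a folding pass needs (types and names are made up): only rewrite y * 0 to 0 when y’s type rules out NaN and friends.

#include <cmath>
#include <iostream>

enum class Type { Int, Float };

// x = y * 0  =>  x = 0 is only sound when y's type has no NaN / inf / -0.0
bool can_fold_mul_by_zero(Type y_type) {
    switch (y_type) {
        case Type::Int:   return true;   // integers: y * 0 == 0, always
        case Type::Float: return false;  // NaN * 0 = NaN, inf * 0 = NaN, -1.0 * 0 = -0.0
    }
    return false;
}

int main() {
    double y = std::nan("");
    std::cout << y * 0.0 << "\n";  // prints nan, not 0
    std::cout << can_fold_mul_by_zero(Type::Int)
              << can_fold_mul_by_zero(Type::Float) << "\n";  // 10
}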
code generation
generate machine code — “translation.” Consider: register allocation, etc.
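for flavor, a toy emitter for z = x + y that pretends registers are unlimited (real register allocation, where they aren’t, is the hard part; all names here are invented):

#include <iostream>
#include <string>

int next_reg = 0;  // a trivial "allocator": hand out fresh virtual registers forever

int load(const std::string& var) {
    int r = next_reg++;
    std::cout << "  load  r" << r << ", [" << var << "]\n";
    return r;
}

int main() {
    // z = x + y
    int rx = load("x");
    int ry = load("y");
    int rz = next_reg++;
    std::cout << "  add   r" << rz << ", r" << rx << ", r" << ry << "\n";
    std::cout << "  store [z], r" << rz << "\n";
}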
other stuff
intermediate representation
compilers typically translate between multiple intermediate languages.
- all but the first and last representations are called intermediate representations
- IRs are generally ordered in descending levels of abstraction
digraph {
rankdir=LR;
graph [bgcolor=transparent];
node [fontcolor=white, color=white, shape=none];
edge [fontcolor=white, color=white];
ir1 [label="ir"];
source -> ir -> "..." -> ir1 -> assembly;
}
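for a feel of what one of those middle languages looks like, here is z = x + y * 2 lowered by hand into a made-up three-address code (everything here is illustrative):

#include <iostream>
#include <string>
#include <vector>

// a made-up three-address-code IR: one op, two inputs, one output
struct Instr { std::string op, dst, a, b; };

int main() {
    // z = x + y * 2, flattened so each instruction does one thing
    std::vector<Instr> ir = {
        {"mul", "t0", "y", "2"},   // t0 = y * 2
        {"add", "z",  "x", "t0"},  // z  = x + t0
    };
    for (const Instr& i : ir)
        std::cout << i.dst << " = " << i.op << " " << i.a << ", " << i.b << "\n";
}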
issues
many pitfalls:
- your compiler may be slow
- it may not handle erroneous inputs gracefully
- language design is important – it determines what is ambiguous (hence what is easy / hard to compile)
- tradeoffs! in language design