CMSC 129
Principles of Compiler Design
Created by TJ Monserrat
Brief introduction of Designing Compilers
Structure of Compiler
Compilers
For the whole semester: We will learn how to design and implement compilers.
What are Compilers?
Programs that can turn everyday human language to things that can be executed by computers
What will we be doing for the project?
What will we be doing for the project?
What will we be doing for the project?
What will we be doing for the project?
Parts of Compiler: Lexical Analysis
Parts of Compiler: Lexical Analysis
Takes in Source Code or Character Stream and turns it into an array of tokens or Lexemes
Parts of Compiler: Lexical Analysis
Each token is in the form of:
[token-name, attribute-value]
-
token-name - abstract symbol or key identifier that is used during syntax analysis
-
attribute-value - Points to an entry in the symbol table for this token.
- That's why we need Symbol Tables...
- Information from the symbol-table entry is needed for semantic analysis and code generation.
These tokens will be passed to the Syntax Analyzer
Parts of Compiler: Lexical Analysis
For Example
position = initial + rate * 60
-
position is a lexeme that can mapped to a token of [id, 0].
-
id is an abstract symbol for it being an identifier or a variable
-
0 points to the index of the symbol-table entry for the word position
Parts of Compiler: Lexical Analysis
For Example
position = initial + rate * 60
-
= is a lexeme that can mapped to a token of [eq].
-
Since this token doesn't need an attribute, we can omit the second component
-
We could use any other token-name like '=' or 'assign'... as long as it serves the purpose: to identify that this is an assign statement.
Parts of Compiler: Lexical Analysis
For Example
position = initial + rate * 60
-
initial is a lexeme that can mapped to a token of [id, 1].
-
1 points to the index of the symbol-table entry for the word initial
-
+ is a lexeme that can mapped to a token of [add].
-
rate is a lexeme that can mapped to a token of [id, 2].
-
2 points to the index of the symbol-table entry for the word rate
Parts of Compiler: Lexical Analysis
For Example
position = initial + rate * 60
-
* is a lexeme that can mapped to a token of [mul].
-
60 is a lexeme that can mapped to a token of [number, 3].
-
3 points to the index of the symbol-table entry for the value 60
Parts of Compiler: Lexical Analysis
This makes the example:
position = initial + rate * 60
into...
Lexemes:
[ [id, 0], [eq], [id, 1], [add], [id, 2], [mul], [number, 3] ]
SymbolTable: {
0: { value: 'position', ... },
1: { value: 'initial', ... },
2: { value: 'rate', ... },
3: { value: 60, ... }
}
Parts of Compiler: Syntax Analysis
Parts of Compiler: Syntax Analysis
The Lexemes...
[ [id, 0], [eq], [id, 1], [add], [id, 2], [mul], [number, 3] ]
...will need to be fed to the syntax analyzer to produce a syntax tree like this...
[eq]
|--------
| |
[id, 0] [add]
|--------
| |
[id, 2] [mul]
|---------
| |
[id, 3] [number, 60]
Parts of Compiler: Syntax Analysis
To be able to do this...
[eq]
|--------
| |
[id, 0] [add]
|--------
| |
[id, 2] [mul]
|---------
| |
[id, 3] [number, 60]
We need to know how to parse them using Context-free grammars.
Parts of Compiler: Intermediate Code Generation
Parts of Compiler: Intermediate Code Generation
[eq]
|--------
| |
[id, 0] [add]
|--------
| |
[id, 2] [mul]
|---------
| |
[id, 3] [number, 60]
The system can create an intermediate code before translating it into the target program code.
t1 = 60
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
Parts of Compiler: Intermediate Code Generation
t1 = 60
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
We need to transform parse trees into a three-address code sequence. Note that three-address code sequences will be of the following...
-
It has AT MOST ONE OPERATOR on the right side
-
The compiler must generate a temporary name to hold the value computed
by a three-address instruction.
-
Some "three-address instructions" have fewer than three operands.
Parts of Compiler: Code Generation
Parts of Compiler: Code Generation
t1 = 60
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
We can then transform the intermediate code generation to the target code that can be understood by the system
LDF R2, id3
MULF R2, R2, #60.0
LDF R1, id2
ADDF R1, R1, R2
STF id1, R1
Parts of Compiler: Code Generation
t1 = 60
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
But for our class, we need to transform the intermediate code above to our pseudo-ASM that we will discuss next time.
load t1
60
load t2
mul id3 t1
load t3
add id2 t2
load id1
t3
Parts of Compiler: Symbol Table Management
Parts of Compiler: Symbol Table Management
An essential function of a compiler is to record the variable names used in the
source program and collect information about various attributes of each name.
Parts of Compiler: Symbol Table Management
These attributes may provide information about the storage allocated for:
-
name
-
type
-
scope (where in the program its value may be used)
-
in the case of procedure names:
-
the number of arguments
-
types of each of the arguments
-
the method of passing each argument
-
the type returned
Programming Language Basics
Things you need to ponder when designing a language:
Programming Language Basics - Static/Dynamic Distinction
-
Static - a language uses a policy that allows the compiler to decide an issue
-
Dynamic - a policy that only allows a decision to be made when
we execute the program
Programming Language Basics - Environments and States
-
Environment - mapping from names to locations in the store.
-
State - mapping from locations in store to their values.the state maps left-values to their corresponding right-values (x = 3)
Programming Language Basics - Static Scope and Block Structure
We should consider static-scope rules for a language with blocks, where a block is a grouping of declarations and statements.
Programming Language Basics - Static Scope and Block Structure
const y = 10;
const x = (z, a, b = 10) => {
let q = 0;
const m = (p) => {
return p + 5;
}
q = (b === 10) ? m(z) + a : m(b);
setTimeout(() => {
y = q + 5
}, 5000);
}
x(y, 25);
Programming Language Basics - Static Scope and Block Structure
const y = 10;
const x = (z, a, b = 10) => {
...
}
x(y, 25);
// Outer scope has variables y and x, where x is a function
Programming Language Basics - Static Scope and Block Structure
const x = (z, a, b = 10) => {
let q = 0;
const m = (p) => {
...
}
q = (b === 10) ? m(z) + a : m(b);
setTimeout(() => {
...
}, 5000);
}
// the function x has parameter values z, a, b, q, and m,
// where m is also a function
Programming Language Basics - Static Scope and Block Structure
const y = 10;
const x = (z, a, b = 10) => {
let q = 0;
const m = (p) => {
...
}
q = (b === 10) ? m(z) + a : m(b);
setTimeout(() => {
...
}, 5000);
}
// Take note that function x also includes variable y
// as part of its scope.
Programming Language Basics - Static Scope and Block Structure
const m = (p) => {
return p + 5;
}
// function m includes variable p.
Programming Language Basics - Static Scope and Block Structure
const y = 10;
const x = (z, a, b = 10) => {
let q = 0;
const m = (p) => {
return p + 5;
}
q = (b === 10) ? m(z) + a : m(b);
setTimeout(() => {
...
}, 5000);
}
// But the same as function x, function m also includes:
// z, a, b, q, and y as part of its scope
Programming Language Basics - Static Scope and Block Structure
setTimeout(() => {
y = q + 5
}, 5000);
// The anonymous function inside the parameter callback
// of the setTimeout function includes y and q even if not initialized...
Programming Language Basics - Static Scope and Block Structure
const y = 10;
const x = (z, a, b = 10) => {
let q = 0;
const m = (p) => {
...
}
q = (b === 10) ? m(z) + a : m(b);
setTimeout(() => {
y = q + 5
}, 5000);
}
// Because the anonymous function is within the scope of
// function x.
Programming Language Basics - Static Scope and Block Structure
Note that Scoping affects the Symbol Table structure.
Programming Language Basics - Parameter Passing
-
Pass-by-value - the actual parameter is evaluated (if it is an expression) or copied (if it is a variable)
-
Pass-by-reference - the address of the actual parameter is passed to the callee as the value of the corresponding formal parameter.
Exercises
Every exercise will have an effect on your project.
Exercises
-
Grammar Creation
-
1. Create the Grammar of the Intermediate Code
-
2. Create your own Source Code Grammar
-
Lexical Analysis
-
3. DFA of the Intermediate Code
-
4. DFA of your own Source Code
-
5. Token list creation of the Intermediate Code
-
6. Token list creation of your own Source Code
Exercises
-
Syntax Analysis
-
7. Error Reporting of the Intermediate Code
-
8. Error Reporting of your own Source Code
-
9. Syntax Tree creation of the Intermediate Code
-
10. Syntax Tree creation of your own Source Code
Exercises
-
Intermediate Code Generation
-
11. Three-address Code generation from the Syntax Tree of Intermediate Code
-
12. Three-address Code generation from the Syntax Tree of your own Source Code
Exercises
All exercises will building towards the project, Except the Target-Code Generation, because...
Project
Translate own Source code to Target-code generation, passing through all of the phases:
-
Lexical Analysis: Given Source Code, Show Lexemes
-
Syntax Analysis: Given Lexemes, Show Parse Tree
-
Intermediate Code Generation: Given Parse Tree, Show Three-Address Code
-
Target Code Generation: Given Three-Address Code, Show Target-Code generation
-
Run Target-Code Generation: Given input, show output
New Grading System
- 1.0 - All 14 exercises done and working, Project working with documentation
New Grading System - Subtractions
- Did 8-11 projects, all working: -0.25
- Did 5-7 projects, all working: -0.5
- Did 1-4 projects, all working: -0.75
- Did 0 projects, all working: -1.0
New Grading System - Subtractions
- Did not do final project: -1.0
- Did final project but has major bugs: -0.5
- Did not do documentation: -0.5
- Documentation lacking in substance: -0.25
New Grading System - Notes to live by...
- If after the grade subtraction, grade is < 3.0 then grade is INC
Target-Code: Pseudo-ASM Language
// SOURCE CODE
var y;
var x = 2;
y = x + 1
// PSEUDO-ASM
start main
var y
var x
load x 2
load y
add x 1
end
Target-Code: Pseudo-ASM Language
// SOURCE CODE
var x;
var y;
...
if ((x < 3) && (y >= 3)) {
x = 4
}
else {
x = 2
}
Target-Code: Pseudo-ASM Language
// PSEUDO-ASM
start main
var x
...
var temp1
load temp1
lt x 3
var temp2
load temp2
load gte y 3
var temp3
load temp3
and temp1 temp2
cond temp3
load x 4
else
load x 2
endcond
end
Target-Code: Pseudo-ASM Language
// SOURCE CODE
for (x=0; x<10; x++) {
console.log(x)
}
// or
var x = 0
while (x<10) {
console.log(x)
x++
}
Target-Code: Pseudo-ASM Language
// PSEUDO-ASM
start main
var x
load x 0
var temp1
load temp1
lt x 10
loop temp1
print x
load x
add x 1
endloop
end
Target-Code: Pseudo-ASM Language
// SOURCE CODE
var x = 0
do {
console.log(x)
x++;
} while(x<10)
Target-Code: Pseudo-ASM Language
// PSEUDO-ASM
start main
var x
load x 0
floop
print x
add x 1
var temp1
load temp1
lt x 10
endfloop temp1
end
Target-Code: Pseudo-ASM Language
// SOURCE CODE
function c() {
console.log("Hah")
}
function b(y, z) {
console.log(y)
console.log(z)
console.log("Hey")
}
function a() {
var x = 0;
var y = 1;
b(x, y);
console.log("Yoh")
c();
return x
}
var x;
x = a();
Target-Code: Pseudo-ASM Language
// PSEUDO-ASM
start c
print "Hah"
end
start b param var y var z
print y
print z
print "Hey"
end
start a
var x
var y
load x 0
load y 1
call b x y
print "Yoh"
call c
return x
end
start main
var x
load x
call a
end
CMSC 129
Principles of Compiler Design
Created by TJ Monserrat