More on compilation, regexes, and CFGs
TA: Paul Fisher
This is our chance to ask questions.
He uses slides, they’ll be posted on NYU classes.
Today, we’re going to recap regexes and then move on to CFGs and parse trees.
Phases of compilation
- Lexer
- Parser
- Semantic Analyzer
- Intermediate code generator
- Optimization (based on architecture of the system)
- Target code generation
The lexer
Turns a string of characters into a string of tokens, where each token represents something like a keyword, a literal, a symbol, etc. Tokens are the shortest strings of characters with individual meaning. We specify tokens with regular expressions.
Regex rules are the same we saw last time, that is ab
for concatenation, a|b
to mean “a or b”, and a*
to mean repetition of 0 or more “a”s.
Additional rules are in the slides, including square-bracket notation for ranges, ex. [a-z]
.
One important thing to remember is that regexes have no recursion, so you can never define something in terms of itself. As such, regexes cannot recognize nested constructs.
Context free grammars
The lexer turns a series of characters into a series of tokens. But there’s no structure to those tokens. They’re just an ordered string.
CFGs, unlike regexes, can represent recursion. And that makes CFGs strictly more powerful than regexes. (What this means is that every regex can be defined with a CFG, but not vice versa.) They’re used in the parser.
CFGs consist of the following components:
-
Productions: Substitute one symbol with another symbol or set of symbols (A ---> B)
-
Nonterminals
- All symbols that are on the left side of the producton
-
Terminals
- All symbols that are not on any left side of any production
-
Start symbol
- The nonterminal symbol on the left side of the first production (ex. representing the whole program)
Ack, slides move too quickly. See them online.
Productions look like this:
A ---> 0A1 | ɛ
This is a language of strings consisting of a number of 0s followed by an equal number of 1s.
For another example, give a grammar for all strings ending with a
that contain the set of characters {a, b}
, like aa
, ba
, aaba
, abbaa
…
S ---> aS|bS|a
Prof. Goldberg will continue going over CFGs next lecture, and next recitation will go into parse trees.