# Formal Languages

Basic concepts of any formal language:

• Alphabet $\Sigma$ = set of all possible symbols

• Sequence (or word) $\alpha$ = string of symbols

• Language = Particular set of sequences

Further concepts:

• Syntax: which sequences belong to the language

• Semantics: the meaning of these sequences

In the context of programming languages:

• Sequences in the formal language are all valid program codes

If a language L has a finite number of sequences it is described by listing all sequences

If it has an infinite number of sequences it is represented with more complex set notation or with a set of substitution rules

# Formal Grammars

finite way of describing an infinite number of strings

Given by a start symbol and a set of production rules

quadruple $(V_T,V_N,P,S)$ where:

• $V_T$ is a set of terminal symbols (or terminals)

• $V_N$ is a set of non-terminal symbols (or terminals)

• $P$ is a set of productions (or rules)

• S is the start symbol (which is a non-terminal)

$V=V_T\cup V_N$ is the set of all symbols where $V_T\cap V_N=\varnothing$

## Notation

• Non terminals: Capital letters

• Terminals: Lower case letters

• Sequences (strings): Greek letters

• Productions have the form: $\alpha \rightarrow \beta$

• $\alpha$ and $\beta$ are strings of both terminals and non-terminals

• $\alpha$ lies in $VY^+$ (it has at least one non-terminal)

• $\beta$ lies in $V^*$ (i.e. can also be the empty string $\epsilon$)

• Whenever we have two rules $\alpha\rightarrow \beta_1, \alpha\rightarrow \beta_2$ we write equivalently $\alpha \rightarrow \beta_1|\beta_2$

## Example

$$L_4=\{a^mb^n|m,n\geqslant 0\}$$

Grammar for this language: $G_4=\{ \{a,b\},\{S,A,B\},P,S\}$ where P includes:

$S\rightarrow AB$

$A\rightarrow a A$

$A\rightarrow\epsilon$

$B\rightarrow bB$

$B\rightarrow \epsilon$

Or equivalently $S\rightarrow AB$

$A\rightarrow aA | \epsilon$

$B\rightarrow bB | \epsilon$

$$S\rightarrow AB \rightarrow aAB \rightarrow aaAB \rightarrow aaaAB \rightarrow aaaB \rightarrow aaabB \rightarrow aaabbB \rightarrow aaabb$$

Or in short:

$$S\xrightarrow[+]{G_4}aaabb$$

# Language generated by a grammar

$$L(G)=\{\alpha\in V^*_T| S\xrightarrow[+]{G} \alpha \}$$

A non-terminal N is called

• Left-recursive if starting with $N$, we can produce a string that starts again with $N$

• e.g. $N\rightarrow Nab$
• Nullable if starting with N, we can produce $\epsilon$

• e.g. $N\rightarrow aNb | \epsilon$0
• Useless if starting with $N$, we can never produce a string consisting only of terminals

• e.g. $N\rightarrow aN | bN$

# The Chomsky Hierarchy

he classes of grammars we obtain by restricting the types of productions that can appear

Type 0 grammars - The set of all grammars

Type 1 (or context sensitive) grammars

• The grammars that have only productions of the form $\alpha \rightarrow \beta$, where $|\alpha| \geqslant |\beta|$

Type 2 (or context free) grammars:

• The context sensitive grammars, in which the left part of each production has one non terminal

• e.g. $A\rightarrow BaA$

Type 3 (or regular) grammars

• The context free grammars, in which all productions have one of the forms:

$$A\rightarrow a$$
$$A\rightarrow aB$$
• where B can also be equal to A. They are also called right linear grammars

Equivalent definition of a type 3 (regular) grammars:

• All productions have one of the forms

$$A\rightarrow a$$
$$A\rightarrow Ba$$
• They are also called left linear grammars

The chomsky hierarchy also classifies the languages generated by these grammars

# Regular Expressions

Regular expressions over an alphabet $\Sigma$ (recursive definition):

• If any element of $\Sigma$ is a regular expression (also the empty string $\epsilon$)

• If P and Q are regular expressions, then the following are as well:

• PQ, i.e. the concatenation of P and Q

• $P|Q$ i.e. the disjunction of P and Q (that is: P or Q)

• $P^*$ i.e. Kleene star of P (zero or more copies of P)

Regular grammars generate exactly regular expressions

Context free and regular languages are very useful in describing programming languages

Regular grammars:

• Many “local” features of a programming language can be expressed by regular expressions, e.g. constants, strings, identifiers

• A regular expression describing an identifier: $L(L'|D)^*$ Here:

• L is the set of all letters

• $L'=L\cup \{\_\}$ is the set of all letters, including ’_

• D is the set of all digits

# Context Free Grammars

• Many syntactic properties of a programming language can be expressed by a context free grammar

• e.g. the pattern of matching brackets of arbitrary length

• This can’t be expressed by a regular expression but by this context free grammar

$$S\rightarrow (S)$$
$$S\rightarrow SS$$
$$S\rightarrow \epsilon$$

# Calling Graph

Many algorithms in compiler construction first collect some basic data items then apply some rules to them to:

• Extend the known information about these data items

• Draw more general conclusions about them

These “information-improving” algorithms:

• Share a common structure

• Although they seem different

Basic example: the calling graph of a program

• Directed graph with a node for each procedure

• Edge from A to B, whenever A calls

An example of why we need to find the calling graph is to find out which procedures:

• Are recursive

• Can be expanded in-line inside another procedure

# Regular Definition

sequence of definitions of the form

$$\begin{array}{l}{d_{1} \rightarrow r_{1}} \\ {d_{2} \rightarrow r_{2}} \\ {\quad \cdots} \\ {d_{k} \rightarrow r_{k}}\end{array}$$

Where:

• Each $d_i$ is different and not in $\Sigma$

• Each $r_i$ is a regular expression in $\Sigma \cup \{r_1,r_2,...,r_{i-1}\}$

We avoid recursive definitions by sequentially “eliminating” all $d_i$’s, thus if we can write a set of regular expressions as a regular definition, then we have no recursive definitions