Top down constriction of a parse tree:

Start with the root, labelled by the starting symbol

Repeatedly perform the following steps:

At internal node N, labelled with non-terminal A

Select one of the production rules for A

Construct children at N for the symbols in the right part of this production rule

Find the next node to construct a subtree

- Typically, the leftmost unexpanded non-terminal of the current tree

During the construction of the parse tree the current terminal of the input that is being scanned is called the lookahead symbol

Our aim during top-down parsing - to construct the parse tree, such that the string generated by the parse tree matches the input string

For a match to occur - the starting symbol `stmt`

must derive a string
that starts with for

When a node in the parse tree:

Is labelled with a terminal

Matches the lookahead symbol

Then

The lookahead becomes the next terminal in the input

We consider the next child in the parse tree

When a node in the parse tree is labelled with a non-terminal then:

We repeat by selecting one of its production rules

Special case: $A\rightarrow \epsilon$ - we choose it when nothing else can be used

In general:

Many possibilities for a production at a non-terminal

The selection of one of them may involve trial and error

A selected production is unsuitable if after using this production, we can’t complete the tree to match the whole input string

If a selected production is unsuitable, backtrack and try another production until we

either match the input string

or we report error (input string not in the language)

A top-down parsing method, using recursive procedures to process the input

One procedure for each non-terminal of the grammar

In general, it requires backtracking until it finds the correct production rule

Very easy and intuitive to write code for it

Special case of recursive-descent parsing

It uniquely determines the steps of each procedure, therefore no backtracking is required

Runs in linear time

Not all grammars can be parsed by predictive parsing

LL(k) grammars:

A recursive de4cent parser can uniquely determine the next production rule, just by looking to the next k tokens of the input

Subclass of context free grammars

Predictive parsing is only possible for LL(k) grammars LL(k) grammars exclude:

All ambiguous grammars

All grammars containing left recursion

A grammar is left recursive if for some non-terminal A and some string $\alpha: A\xrightarrow{+}A\alpha$

**Immediate left recursion:**: $A\rightarrow A\alpha | \beta$

This can be rewritten as

$$A\rightarrow \beta A'$$

$$A'\rightarrow \alpha A' | \epsilon$$

**Left recursion in two or more steps**:

$$S\rightarrow A\alpha | \beta$$

$$A\rightarrow Sd|c|\epsilon$$

In this case it can occur that $S\rightarrow A\alpha \rightarrow Sd\alpha$

A left recursive grammar can lead a recursive-descent parser into an infinite loop, e.g.

$$A\rightarrow A\alpha|\beta$$

Expanding A using the production rule $A\rightarrow A\alpha$ will lead to backtracking forever

Take the following production with **immediate** left recursion

$$A\rightarrow A\alpha_1|A\alpha_2|...|A\alpha_m|\beta_1|\beta_2|...|\beta_n$$

where no $\beta_i$ begins with A

This can then be rewritten as:

$$A\rightarrow \beta_1A'|\beta_2A'|...|\beta_nA'$$

$$A'\rightarrow \alpha_1A'|\alpha_2A'|...|\alpha_mA'|\epsilon$$

Many times:

Two alternative production rules have the same prefix

The choice between them is not clear

Example: $A\rightarrow \alpha\beta_1|\alpha\beta_2$

Suppose the input begins with a string derived from $\alpha$

Which transition to choose?

We need to read more symbols from the input - this grammar is not an LL(1) grammar

We can turn this example into an LL(1) grammar in the following way

$$A\rightarrow \alpha A'$$

$$A' \rightarrow \beta_1|\beta_2$$

For each non-terminal A

Find the longest prefix $\alpha$ common to two or more of its alternative productions

$$A\rightarrow \alpha\beta_1|\alpha\beta_2|...|\alpha\beta_n|\gamma$$Replace these productions by the following

$$A\rightarrow \alpha A' | \gamma$$$$A'\rightarrow \beta_1|\beta_2|...|\beta_n$$

Repeat until no two alternative productions have a common prefix

When we consider a non-terminal:

How do we know which production rule to use?

The predictive parser looks ahead to the rest of the input and its decision is forced

LL(k) grammar:

L: left-to-right scan of the input tokens

L: leftmost derivation of the input string

k: look k tokens ahead in the input string

We must first:

Eliminate ambiguities

Eliminate left recursions

Perform left factoring

Interesting case:

LL(1) grammars - read only one token ahead in the input

More efficient parsing than other LL(k) grammars

How can we uniquely determine the next production in an LL(1) grammar?

We do these with the help of two functions (sets)

FIRST($\alpha$), where $\alpha$ is any string of grammar symbols

FOLLOW(A), where A is any non-terminal

And a predictive parsing table that associates terminals (of the grammar) with non-terminals (of the input)

The set FIRST($\alpha$), for a string $\alpha$, contains:

Every first terminal of a string derived from $\alpha$

i.e. if $\alpha\xRightarrow{*} c\gamma$, then $c\in FIRST(\alpha)$

also, if $\alpha\xRightarrow{*} \epsilon$, then $\epsilon\in FIRST(\alpha)$

Example usage:

Consider the two productions $A\rightarrow \gamma_1$ and $A\rightarrow \gamma_2$

Suppose that FIRST($\gamma_1$) and FIRST($\gamma_2$) are disjoint

let the next input symbol be a

if a belongs to FIRST($\gamma_1$), then choose $A\rightarrow \gamma_1$

if a belongs to FIRST($\gamma_2$), then choose $A\rightarrow \gamma_2$

Otherwise report error

The set FOLLOW(A), for a non-terminal contains:

Every terminal a that can appear immediately to the right of A in some string derived from S

i.e. there exists a derivation $S\xRightarrow{*}\alpha Aa\beta$ where $\alpha$ and $\beta$ are two strings

In addition, if A can be the rightmost symbol in some string derived from S, then $ is in FOLLOW(A) (where $ is the end of file symbol)

LL(1) Grammar

A Grammar G is an LL(1) grammar iff for every two productions $A\rightarrow \alpha | \beta$:

- There is no terminal a such that both $\alpha$ and $\beta$ derive a string starting with a
- At most one of $\alpha$ and $\beta$ can derive the empty string $\epsilon$
- If $\beta \Rightarrow \epsilon$ then $\alpha$ does not derive any string beginning with a terminal in FOLLOW(A) and vice versa

Therefore, an efficient algorithm to determine whether a grammar G is LL(1) grammar by computing all needed sets FIRST() and FOLLOW()

A language L is called an LL(1) language is L can be generated by an LL(1) grammar

If any language L is given by a non-LL(1) grammar L may still be an LL(1) language by a different grammar

We saw predictive parsing via recursive calls of procedures

We can “mimic” leftmost derivation:

- Explicitly maintain a stack
- “table driven predictive parsing”

If w is the input that has been matched so far, then the stack holds a sequence $\alpha$ of grammar symbols (terminals and non-terminals), such that

$$S\xRightarrow{*}w\alpha$$

where S is the start symbol of the grammar

Given a parsing table M and input w:

Initialise a stack containing S (with bottom symbol $)

Repeat until the stack contains only $:

Let the next input character be c

If the top of the stack is a terminal t, then:

If c and t don’t match, report an error

Otherwise match c and t - consume c from input and pop t from stack

If the top of the stack is a non-terminal A, then:

If $M[A,c]$ is undefined, report an error

Otherwise replace the top of the stack with $M[A,c]$