Lesson 1.4: Disambiguating Parses
The purpose of this lesson is to teach how to use K's builtin features for disambiguation to transform an ambiguous grammar into an unambiguous one that expresses the intended ASTs.
Priority blocks
In practice, very few formal languages outside the domain of natural language processing are ambiguous. The main reason for this is that parsing unambiguous languages is asymptotically faster than parsing ambiguous languages. Programming language designers instead usually use the notions of operator precedence and associativity to make expression grammars unambiguous. These mechanisms work by instructing the parser to reject certain ASTs in favor of others in case of ambiguities; it is often possible to remove all ambiguities in a grammar with these techniques.
While it is sometimes possible to explicitly rewrite the grammar to remove these parses, because K's grammar specification and AST generation are inextricably linked, this is generally discouraged. Instead, we use the approach of explicitly expressing the relative precedence of different operators in different situations in order to resolve the ambiguity.
For example, in C, &&
binds tighter in precedence than ||
, meaning that
the expression true && false || false
has only one valid AST:
(true && false) || false
.
Consider, then, the third iteration on the grammar of this definition
(lesson-04-a.k
):
kmodule LESSON-04-A syntax Boolean ::= "true" | "false" | "(" Boolean ")" [bracket] > "!" Boolean [function] > Boolean "&&" Boolean [function] > Boolean "^" Boolean [function] > Boolean "||" Boolean [function] endmodule
In this example, some of the |
symbols separating productions in a single
block have been replaced with >
. This serves to describe the
priority groups associated with this block of productions.
The first priority group consists of the atoms of the
language: true
, false
, and the bracket operator. In general, a priority
group starts either at the ::=
or >
operator and extends until either the
next >
operator or the end of the production block. Thus, we can see that the
second, third, fourth, and fifth priority groups in this grammar all consist
of a single production.
The meaning of these priority groups becomes apparent when parsing programs:
A symbol with a lesser priority, (i.e., one that binds looser), cannot
appear as the direct child of a symbol with a greater priority (i.e.,
one that binds tighter. In this case, the >
operator can be seen as a
greater-than operator describing a transitive partial ordering on the
productions in the production block, expressing their relative priority.
To see this more concretely, let's look again at the program
true && false || false
. As noted before, previously this program was
ambiguous because the parser could either choose that &&
was the child of ||
or vice versa. However, because a symbol with lesser priority (i.e., ||
)
cannot appear as the direct child of a symbol with greater priority
(i.e., &&
), the parser will reject the parse where ||
is under the
&&
operator. As a result, we are left with the unambiguous parse
(true && false) || false
. Similarly, true || false && false
parses
unambiguously as true || (false && false)
. Conversely, if the user explicitly
wants the other parse, they can express this using brackets by explicitly
writing true && (false || false)
. This still parses successfully because the
||
operator is no longer the direct child of the &&
operator, but is
instead the direct child of the ()
operator, and the &&
operator is an
indirect parent, which is not subject to the priority restriction.
Astute readers, however, will already have noticed what seems to be a
contradiction: we have defined ()
as also having greater priority than ||
.
One would think that this should mean that ||
cannot appear as a direct
child of ()
. This is a problem because priority groups are applied to every
possible parse separately. That is to say, even if the term is unambiguous
prior to this disambiguation rule, we still reject that parse if it violates
the rule of priority.
In fact, however, we do not reject this program as a parse error. Why is that?
Well, the rule for priority is slightly more complex than previously described.
In actual fact, it applies only conditionally. Specifically, it applies in
cases where the child is either the first or last production item in the
parent's production. For example, in the production Bool "&&" Bool
, the
first Bool
non-terminal is not preceded by any terminals, and the last Bool
non-terminal is not followed by any terminals. As a result of this, we apply
the priority rule to both children of &&
. However, in the ()
operator,
the sole non-terminal is both preceded by and followed by terminals. As a
result, the priority rule is not applied when ()
is the parent. Because of
this, the program we mentioned above successfully parses.
Exercise
Parse the program true && false || false
using kast, and confirm that the AST
places ||
as the top level symbol. Then modify the definition so that you
will get the alternative parse.
Associativity
Even having broken the expression grammar into priority blocks, the resulting
grammar is still ambiguous. We can see this if we try to parse the following
program (assoc.bool
):
true && false && false
Priority blocks will not help us here: the problem comes between two parses
where both possible parses have a direct parent and child which is within a
single priority block (in this case, &&
is in the same block as itself).
This is where the notion of associativity comes into play. Associativity applies the following additional rules to parses:
- a left-associative symbol cannot appear as a direct rightmost child of a symbol with equal priority;
- a right-associative symbol cannot appear as a direct leftmost child of a symbol with equal priority; and
- a non-associative symbol cannot appear as a direct leftmost or rightmost child of a symbol with equal priority.
In C, binary operators are all left-associative, meaning that the expression
true && false && false
parses unambiguously as (true && false) && false
,
because &&
cannot appear as the rightmost child of itself.
Consider, then, the fourth iteration on the grammar of this definition
(lesson-04-b.k
):
kmodule LESSON-04-B syntax Boolean ::= "true" | "false" | "(" Boolean ")" [bracket] > "!" Boolean [function] > left: Boolean "&&" Boolean [function] > left: Boolean "^" Boolean [function] > left: Boolean "||" Boolean [function] endmodule
Here each priority group, immediately after the ::=
or >
operator, can
be followed by a symbol representing the associativity of that priority group:
either left:
for left associativity, right:
for right associativity, or
non-assoc:
for non-associativity. In this example, each priority group we
apply associativity to has only a single production, but we could equally well
write a priority block with multiple productions and an associativity.
For example, consider the following, different grammar (lesson-04-c.k
):
kmodule LESSON-04-C syntax Boolean ::= "true" | "false" | "(" Boolean ")" [bracket] > "!" Boolean [function] > left: Boolean "&&" Boolean [function] | Boolean "^" Boolean [function] | Boolean "||" Boolean [function] endmodule
In this example, unlike the one above, &&
, ^
, and ||
have the same
priority. However, viewed as a group, the entire group is left associative.
This means that none of &&
, ^
, and ||
can appear as the right child of
any of &&
, ^
, or ||
. As a result of this, this grammar is also not
ambiguous. However, it expresses a different grammar, and you are encouraged
to think about what the differences are in practice.
Exercise
Parse the program true && false && false
yourself, and confirm that the AST
places the rightmost &&
at the top of the expression. Then modify the
definition to generate the alternative parse.
Explicit priority and associativity declarations
Previously we have only considered the case where all of the productions which you wish to express a priority or associativity relation over are co-located in the same block of productions. However, in practice this is not always feasible or desirable, especially as a definition grows in size across multiple modules.
As a result of this, K provides a second way of declaring priority and associativity relations.
Consider the following grammar, which we will name lesson-04-d.k
and which
will express the exact same grammar as lesson-04-b.k
kmodule LESSON-04-D syntax Boolean ::= "true" [group(literal)] | "false" [group(literal)] | "(" Boolean ")" [group(atom), bracket] | "!" Boolean [group(not), function] | Boolean "&&" Boolean [group(and), function] | Boolean "^" Boolean [group(xor), function] | Boolean "||" Boolean [group(or), function] syntax priority literal atom > not > and > xor > or syntax left and syntax left xor syntax left or endmodule
This introduces a couple of new features of K. First, the group(_)
attribute
is used to conceptually group together sets of sentences under a common
user-defined name. For example, literal
in the syntax priority
sentence is
used to refer to all the productions marked with the group(literal)
attribute,
i.e., true
and false
. A production can belong to multiple groups using
syntax such as group(myGrp1,myGrp2)
.
Once we understand this, it becomes relatively straightforward to understand
the meaning of this grammar. Each syntax priority
sentence defines a
priority relation where >
separates different priority groups. Each priority
group is defined by a list of one or more group names, and consists of all
productions which are members of at least one of those named groups.
In the same way, a syntax left
, syntax right
, or syntax non-assoc
sentence
defines an associativity relation among left-, right-, or non-associative
groups. Specifically, this means that:
syntax left a b
is different to:
syntax left a
syntax left b
As a consequence of this, syntax [left|right|non-assoc]
should not be used to
group together labels with different priority.
Prefer/avoid
Sometimes priority and associativity prove insufficient to disambiguate a grammar. In particular, sometimes it is desirable to be able to choose between two ambiguous parses directly while still not rejecting any parses if the term parsed is unambiguous. A good example of this is the famous "dangling else" problem in imperative C-like languages.
Consider the following definition (lesson-04-E.k
):
kmodule LESSON-04-E syntax Exp ::= "true" | "false" syntax Stmt ::= "if" "(" Exp ")" Stmt | "if" "(" Exp ")" Stmt "else" Stmt | "{" "}" endmodule
We can write the following program (dangling-else.if
):
if (true) if (false) {} else {}
This is ambiguous because it is unclear whether the else
clause is part of
the outer if
or the inner if
. At first we might try to resolve this with
priorities, saying that the if
without an else
cannot appear as a child of
the if
with an else
. However, because the non-terminal in the parent symbol
is both preceded and followed by a terminal, this will not work.
Instead, we can resolve the ambiguity directly by telling the parser to "prefer" or "avoid" certain productions when ambiguities arise. For example, when we parse this program, we see the following ambiguity as an error message:
[Error] Inner Parser: Parsing ambiguity.
1: syntax Stmt ::= "if" "(" Exp ")" Stmt
`if(_)__LESSON-04-E_Stmt_Exp_Stmt`(`true_LESSON-04-E_Exp`(.KList),`if(_)_else__LESSON-04-E_Stmt_Exp_Stmt_Stmt`(`false_LESSON-04-E_Exp`(.KList),`;_LESSON-04-E_Stmt`(.KList),`;_LESSON-04-E_Stmt`(.KList)))
2: syntax Stmt ::= "if" "(" Exp ")" Stmt "else" Stmt
`if(_)_else__LESSON-04-E_Stmt_Exp_Stmt_Stmt`(`true_LESSON-04-E_Exp`(.KList),`if(_)__LESSON-04-E_Stmt_Exp_Stmt`(`false_LESSON-04-E_Exp`(.KList),`;_LESSON-04-E_Stmt`(.KList)),`;_LESSON-04-E_Stmt`(.KList))
Source(./dangling-else.if)
Location(1,1,1,30)
Roughly, we see that the ambiguity is between an if
with an else
or an if
without an else
. Since we want to pick the first parse, we can tell K to
"avoid" the second parse with the avoid
attribute. Consider the following
modified definition (lesson-04-f.k
):
kmodule LESSON-04-F syntax Exp ::= "true" | "false" syntax Stmt ::= "if" "(" Exp ")" Stmt | "if" "(" Exp ")" Stmt "else" Stmt [avoid] | "{" "}" endmodule
Here we have added the avoid
attribute to the else
production. As a result,
when an ambiguity occurs and one or more of the possible parses has that symbol
at the top of the ambiguous part of the parse, we remove those parses from
consideration and consider only those remaining. The prefer
attribute behaves
similarly, but instead removes all parses which do not have that attribute.
In both cases, no action is taken if the parse is not ambiguous.
Exercises
-
Parse the program
if (true) if (false) {} else {}
usinglesson-04-f.k
and confirm that else clause is part of the innermostif
statement. Then modify the definition so that you will get the alternative parse. -
Modify your solution from Lesson 1.3, Exercise 2 so that unary negation should bind tighter than multiplication and division, which should bind tighter than addition and subtraction, and each binary operator should be left associative. Write these priority and associativity declarations explicitly, and then try to write them inline.
-
Write a simple grammar containing at least one ambiguity that cannot be resolved via priority or associativity, and then use the
prefer
attribute to resolve that ambiguity. -
Explain why the following grammar is not labeled ambiguous by the K parser when parsing
abb
, then make the parser realize the ambiguity.
kmodule EXERCISE4 syntax Expr ::= "a" Expr "b" | "abb" | "b" endmodule
Next lesson
Once you have completed the above exercises, you can continue to Lesson 1.5: Modules, Imports, and Requires.