Lesson 1.3: BNF Syntax and Parser Generation
In this lesson we will introduce more key aspects of the syntax and semantics of productions in K, and show how these, along with other syntactic sentences can be used to define grammars for parsing both rules and programs. In this context, you'll also learn about two additional types of productions, brakets and tokens.
K's approach to parsing
K's grammar is divided into two components: the outer syntax of K and the inner syntax of K. Outer syntax refers to the parsing of requires, modules, imports, and sentences in a K definition. Inner syntax refers to the parsing of rules and programs. Unlike the outer syntax of K, which is predetermined, much of the inner syntax of K is defined by you, the developer. When rules or programs are parsed, they are parsed within the context of a module. Rules are parsed in the context of the module in which they exist, whereas programs are parsed in the context of the main syntax module of a K definition.
Recall that a K definition consists of several modules, which in turn consist each of several sentences (productions, rules, etc.). Sentences within a module form the grammar of that module, and this grammar is used for parsing programs in the language you defined.
Basic BNF productions
To illustrate how this works, let's consider the K module below which defines a calculator for evaluating Boolean expressions containing operations AND, OR, NOT, and XOR.
Save the code below in file lesson-03-a.k
:
kmodule LESSON-03-A syntax Boolean ::= "true" | "false" | "!" Boolean [function] | Boolean "&&" Boolean [function] | Boolean "^" Boolean [function] | Boolean "||" Boolean [function] endmodule
Observe that the productions in this module look a little different than
what we have seen in the previous lesson. The reason is that K has two
mechanisms for defining productions. A more generic one using a variant
of BNF notation
and its special case we have seen in Lesson 1.2.
There the ::=
symbol was followed by an alphanumeric identifier and a
(possibly empty) comma-separated list of sorts in parentheses. In this
lesson, we focus on the former.
Recall the set of productions from the previous lesson:
kmodule LESSON-03-B syntax Color ::= Yellow() | Blue() syntax Fruit ::= Banana() | Blueberry() syntax Color ::= colorOf(Fruit) [function] endmodule
We can write an equivalent definition in BNF notation as follows:
kmodule LESSON-03-C syntax Color ::= "Yellow" "(" ")" | "Blue" "(" ")" syntax Fruit ::= "Banana" "(" ")" | "Blueberrry" "(" ")" syntax Color ::= "colorOf" "(" Fruit ")" [function] endmodule
Note that sort Fruit
of the function's argument is unchanged, but
everything else has been wrapped in double quotation marks. This is because
in BNF notation, we distinguish between two types of production items:
terminals and non-terminals. A terminal denotes a fixed sequence of
characters that is a verbatim part of the syntax of that production. For
example, Banana
, (
, )
, or colorOf
are such sequences of characters and
all considered terminals. Conversely, non-terminals, refer to a sort name,
like Fruit
, and the syntax of the production they belong to accepts any valid
term of that sort at that position.
In the previous lesson we executed successfully the program colorOf(Banana())
using krun
. That is because the program represented a term of sort Color
:
indeed, Banana()
is a term of sort Fruit
, hence a valid argument for
function colorOf
. krun
parses and interprets terms according to the grammar
you define. Under the hood, the term is automatically converted into an AST
(abstract syntax tree), and then the function colorOf
is evaluated using the
function rules provided in the definition.
How does K match the strings between the double quotes? The answer is that K
uses Flex to generate a
scanner for the grammar. Remember that a scanner, or lexical analyzer or lexer,
is a component of an interpreter that breaks down source code into tokens,
which are units such as keywords, variables, and operators. These tokens are
then processed by the parser, which interprets the structure of the code
according to the grammar rules. Flex looks for the longest possible match of a
regular expression in the input. If there are ambiguities between two or more
regular expressions, it will pick the one with the highest prec
attribute.
You can learn more about how Flex matching works in the
Flex Manual | Matching.
Returning to module LESSON-03-A
, we can see that it defines a simple BNF
grammar for expressions over Booleans. We have defined constructors
corresponding to the Boolean values true and false, and functions
corresponding to the Boolean operators AND, OR, NOT, and XOR. We have also
given a syntax for each of these functions based on their syntax in the C
programming language. As such, we can now write programs in the simple language
we have defined!
Save the code below in file and.bool
:
true && false
Now, let's compile our grammar first:
kompile lesson-03-a.k
Recall that compilation produces a parser, interpreter, and verifier for the grammar specified in the K definition. Interpreting the program by executing
krun and.bool
will raise an error:
terminate called after throwing an instance of 'std::runtime_error'
what(): No tag found for symbol Lbl'UndsAnd-And-UndsUnds'LESSON-03-A'Unds'Boolean'Unds'Boolean'Unds'Boolean{}. Maybe attempted to evaluate a symbol with no rules?
/nix/store/2av44963lcqsmkj7hwmjhrg9pzpbkr1i-k-7.1.232-88c9c766d76f624329400cd554fafd3beec16a15/bin-unwrapped/../lib/kframework/k-util.sh: line 114: 384286 Aborted (core dumped) "$@"
[Error] krun: ./lesson-03-a-kompiled/interpreter
/tmp/.krun-2025-04-29-12-58-55-lb8wwXcWVv/tmp.in.0nVZCuXTXv -1
/tmp/.krun-2025-04-29-12-58-55-lb8wwXcWVv/result.kore
Syntax error at /tmp/.krun-2025-04-29-12-58-55-lb8wwXcWVv/result.kore:1.1: Expected: [<id>, <string>] Actual: <EOF>
[Error] krun: kore-print --definition ./lesson-03-a-kompiled --output pretty
/tmp/.krun-2025-04-29-12-58-55-lb8wwXcWVv/result.kore --color on
This is expected, as we have not given rules defining the meaning of the &&
function, and the error message highlights exactly this—Maybe attempted
to evaluate a symbol with no rules?
While we cannot interpret the program just yet, we can parse it. To do this, run the command below from the same directory:
kast --output kore and.bool
You should see the following AST printed on standard output, minus the formatting:
inj{SortBoolean{}, SortKItem{}}(
Lbl'UndsAnd-And-UndsUnds'LESSON-03-A'Unds'Boolean'Unds'Boolean'Unds'Boolean{}(
Lbltrue'Unds'LESSON-03-A'Unds'Boolean{}(),
Lblfalse'Unds'LESSON-03-A'Unds'Boolean{}()
)
)
kast
is K's just-in-time parser, just another tool generated at compile time.
It produces a grammar from the K definition on the fly and uses it to parse
the program passed on the command line.
K allows for several AST representations and you can choose a specific one by
setting the --output
flag. You can see all possible value options by running
kast --help
. kore
used above is one of them and denotes KORE, the
intermediate representation of K. You can learn more about KORE in another
tutorial,
currently work-in-progress.
Value kast
for the flag gives us an AST in a more direct representation of
the original K definition.
Executing
kast --output kast and.bool
yields the following output, minus the formatting:
`_&&__LESSON-03-A_Boolean_Boolean_Boolean`(
`true_LESSON-03-A_Boolean`(.KList),
`false_LESSON-03-A_Boolean`(.KList)
)
Comparing both outputs, you can observe that the former is largely a
name-mangled version of the latter. A notable difference is the presence of the
inj
symbol in the KORE output and you can learn more about it in the
KORE tutorial.
Note that kast
also takes expressions as arguments, not only file names,
but not both at the same time. If you want to parse an expression, you need to
use flag -e
or --expression
:
kast --output kast -e "true && false"
Exercise
Parse the expression false || true
with --output kast
. See if you can
predict approximately what the corresponding output would be with
--output kore
, then run the command and compare it to your prediction.
Ambiguities
Now let's try a slightly more advanced example. Save the following program as
and-or.bool
:
true && false || false
If you try to parse it, you will see the following error:
[Error] Inner Parser: Parsing ambiguity.
1: syntax Boolean ::= Boolean "&&" Boolean [function]
`_&&__LESSON-03-A_Boolean_Boolean_Boolean`(`true_LESSON-03-A_Boolean`(.KList),`_||__LESSON-03-A_Boolean_Boolean_Boolean`(`false_LESSON-03-A_Boolean`(.KList),`false_LESSON-03-A_Boolean`(.KList)))
2: syntax Boolean ::= Boolean "||" Boolean [function]
`_||__LESSON-03-A_Boolean_Boolean_Boolean`(`_&&__LESSON-03-A_Boolean_Boolean_Boolean`(`true_LESSON-03-A_Boolean`(.KList),`false_LESSON-03-A_Boolean`(.KList)),`false_LESSON-03-A_Boolean`(.KList))
Source(./and-or.bool)
Location(1,1,1,23)
1 | true && false || false
. ^~~~~~~~~~~~~~~~~~~~~~
The error is saying that kast
was unable to parse the program because it is
ambiguous. K's just-in-time parser is a GLL (generalized
left-to-right, leftmost derivation) parser, which means it can handle
the full generality of context-free grammars, including those grammars which
are ambiguous. An ambiguous grammar is one where the same string can be parsed
as multiple distinct ASTs. In this example, it can't decide whether it should
be parsed as (true && false) || false
(Fig. 3-A) or as true && (false || false)
(Fig. 3-B).
Fig. 3-A
||
/ \
&& false
/ \
true false
Fig. 3-B
&&
/ \
true ||
/ \
false false
In Boolean logic and other programming languages such as C, logical AND has
precedence over logical OR, rendering the AST in Fig. 3-A the only valid one.
However, grammars defined in K assume all operators to have the same priority
in evaluation, unless specified otherwise. Both ASTs in Fig. 3-A and Fig. 3-B
are possible with the grammar we defined in module LESSON-3-A
, hence the
ambiguity reported by the parser. You will learn in the next lesson how to set
up precendence of some operators over others and define the logical connectives
the usual way. We continue this lesson by showing how to reduce ambiguity
through the use of brackets.
Brackets
With the grammar defined in module LESSON-03-A
there is no way of resolving
this ambiguity, making it impossible to write complex expressions in our small
language. The standard solution in most programming languages to this problem
is to use parentheses to indicate the appropriate grouping. K generalizes this
notion into a type of production called bracket.
A bracket production is any production with the bracket
attribute. It is
required that such a production only have a single non-terminal, and the sort
of the production must equal the sort of that non-terminal. K does not
otherwise impose any restrictions on the grammar provided for a bracket.
Like in other languages, the most common type of bracket is one in which a
non-terminal is surrounded by terminals representing one of the following
symbols ()
, []
, {}
, or <>
. For example, we can define the most common
type of bracket, the parentheses, quite simply. Consider the following modified
definition and save it to file lesson-03-d.k
:
kmodule LESSON-03-D syntax Boolean ::= "true" | "false" | "(" Boolean ")" [bracket] | "!" Boolean [function] | Boolean "&&" Boolean [function] | Boolean "^" Boolean [function] | Boolean "||" Boolean [function] endmodule
With this augmented definition, you are now able to parse more complex programs by explicitly grouping subterms with the bracket we have just defined.
Consider and-or-left.bool
:
(true && false) || false
and and-or-right.bool
:
true && (false || false)
When parsing these programs with kast
, you get a unique AST with no error.
If you check the output carefully, you will notice that the bracket itself does
not appear in the AST. In fact, this is a property unique to bracket
productions: they are not represented in the parsed AST of a term, and the
child of the bracket is folded immediately into the parent term. This is why we
have the requirement mentioned above, that a bracket production must have a
single non-terminal of the same sort as the production itself.
Exercise
Write out the AST you expect to be arising from parsing the two programs above
with --output kast
, then parse them and compare the result to the AST you
expected. Confirm for yourself that the bracket production does not appear in
the AST.
Tokens
So far we have seen how to define the grammar of a language and we have implicitly been using K's automatic lexer generation to produce a token for each terminal in our grammar. However, the grammar is not the only relevant part of parsing a language. Also relevant is the lexical syntax of the language, i.e., how the tokens are defined and recognized.
Sometimes we need to define more complex lexical syntax. Consider, for
instance, integers in C. They consist of a decimal, octal, or hexadecimal
number, followed by an optional suffix that specifies the type of the literal.
While it’s theoretically possible to define this syntax using a grammar, doing
so would be cumbersome and tedious. Additionally, you'd be faced with an AST
generated for the literal, which is not particularly convenient to work with.
As an alternative, K allows you to define token productions, which consist
of a regular expressions
followed by the token
attribute. The resulting AST would then consist of a
typed string containing the value recognized by the regular expression.
For example, the built-in integers in K are defined using the following production:
.k .excludesyntax Int ::= r"[\\+\\-]?[0-9]+" [token]
An integer is thus an optional sign followed by a nonzero sequence of digits.
The r
preceding the terminal indicates that what appears inside the double
quotes is a regular expression, and the token
attribute indicates that terms
which parse as this production should be converted into a token.
Before looking at how integers in C can be defined in K, let us mention that it is also possible to define tokens that do not use regular expressions. This can be useful when you wish to declare particular identifiers for use in your semantics later. For example:
.k .excludesyntax Id ::= "main" [token]
Here, we declare main
as a token of sort Id
. Instead of being parsed as a
symbol, it gets parsed as a token, generating a typed string in the AST.
This can be useful in a semantics of C because the parser typically doesn't
handle the main
function in any special way; it's only the semantics that
gives it special treatment.
The syntax of integers in C has a more complex lexical structure than the one of built-in integers in K, and a production defining them could look as follows:
.k .excludesyntax IntConstant ::= r"(([1-9][0-9]*)|(0[0-7]*)|(0[xX][0-9a-fA-F]+))(([uU][lL]?)|([uU]((ll)|(LL)))|([lL][uU]?)|(((ll)|(LL))[uU]?))?" [token]
This is a long and complex regular expression, hard to read. In addition, unlike a grammar, it is not particularly modular. However, we can get around this restriction by declaring explicit regular expressions, giving them a name, and referring to them in productions.
Consider the following (equivalent) way to define the lexical syntax of integers in C:
.k .excludesyntax IntConstant ::= r"({DecConstant}|{OctConstant}|{HexConstant})({IntSuffix}?)" [token] syntax lexical DecConstant = r"{NonzeroDigit}({Digit}*)" syntax lexical OctConstant = r"0({OctDigit}*)" syntax lexical HexConstant = r"{HexPrefix}({HexDigit}+)" syntax lexical HexPrefix = r"0x|0X" syntax lexical NonzeroDigit = r"[1-9]" syntax lexical Digit = r"[0-9]" syntax lexical OctDigit = r"[0-7]" syntax lexical HexDigit = r"[0-9a-fA-F]" syntax lexical IntSuffix = r"{UnsignedSuffix}({LongSuffix}?)|{UnsignedSuffix}{LongLongSuffix}|{LongSuffix}({UnsignedSuffix}?)|{LongLongSuffix}({UnsignedSuffix}?)" syntax lexical UnsignedSuffix = r"[uU]" syntax lexical LongSuffix = r"[lL]" syntax lexical LongLongSuffix = r"ll|LL"
As you can see, this is rather more verbose, but it has the benefit of being easier to read and understand, as well as providing increased modularity.
Note that we refer to a named regular expression by putting the name in curly
brackets. Note also that only the first sentence actually declares a new piece
of syntax in the language. syntax lexical
only declares an explicit regular
expression.
Finally, recall that K uses Flex to implement its lexical analysis. As such, you can refer to the Flex Manual | Patterns for a detailed description of the regular expression syntax supported. For performance reasons, Flex's regular expressions are actually a regular language, and thus lack some of the syntactic convenience of modern "regular expression" libraries. If you need features that are not part of the syntax of Flex regular expressions, you are encouraged to express them via a grammar instead.
Ahead-of-time parser generation
So far we have been entirely focused on K's support for just-in-time parsing, where the parser is generated on the fly prior to being used. This method offers faster parser generation, but its performance suffers if you have to repeatedly parse strings with the same parser. For this reason, when parsing programs, it is generally recommended to use K's ahead-of-time parser generation based on GNU Bison.
You can enable ahead-of-time parsing via the --gen-bison-parser
flag to
kompile
. This will make use of Bison's
LR(1) parser generator. As
such, if your grammar is not LR(1), it may not parse exactly the same as if
you were to use the just-in-time parser because Bison will automatically pick
one of the possible branches whenever it encounters a shift-reduce or
reduce-reduce conflict. In this case, you can either modify your grammar to be
LR(1), or you can use Bison's GLR support by passing flag
--gen-glr-bison-parser
to kompile
instead. Note that if your grammar is ambiguous,
the ahead-of-time parser will not provide you with particularly readable error
messages at this time.
kompile --gen-bison-parser 'lesson-03-a.k'
gives
[Warning] Compiler: Could not find main syntax module with name
LESSON-03-A-SYNTAX in definition. Use --syntax-module to specify one. Using
LESSON-03-A as default.
[Warning] Inner Parser: Skipping modules [ML-SYNTAX] tagged as not-lr1 which
are not supported by Bison.
We have seen the first warning before, and we discussed it in Lesson 1.2. The second warning we get is a side effect of the first one, informing that certain modules—e.g., for parsing—were excluded from the grammar generation because they were known to cause Bison to crash or behave incorrectly.
Next, run
lesson-03-a-kompiled/parser_PGM and-or.bool
to see that now you don't get an error when parsing. Even though our grammar is ambiguous, the LR(1) algorithm generates a single parse tree. The output, minus formatting, is the following:
inj{SortBoolean{}, SortKItem{}}(
Lbl'UndsAnd-And-UndsUnds'LESSON-03-A'Unds'Boolean'Unds'Boolean'Unds'Boolean{}(
Lbltrue'Unds'LESSON-03-A'Unds'Boolean{}(),
Lbl'UndsPipePipeUndsUnds'LESSON-03-A'Unds'Boolean'Unds'Boolean'Unds'Boolean{}(
Lblfalse'Unds'LESSON-03-A'Unds'Boolean{}(),
Lblfalse'Unds'LESSON-03-A'Unds'Boolean{}()
)
)
)
At closer look, you see it is the AST where ||
has higher priority (Fig. 3-B).
Compare this with the output given when running the same command, but when the
ahead-of-time parser has been enabled with flag --gen-glr-bison-parser
:
inj{SortBoolean{}, SortKItem{}}(
Lblamb{SortBoolean{}}(
Lbl'UndsAnd-And-UndsUnds'LESSON-03-A'Unds'Boolean'Unds'Boolean'Unds'Boolean{}(
Lbltrue'Unds'LESSON-03-A'Unds'Boolean{}(),
Lbl'UndsPipePipeUndsUnds'LESSON-03-A'Unds'Boolean'Unds'Boolean'Unds'Boolean{}(
Lblfalse'Unds'LESSON-03-A'Unds'Boolean{}(),
Lblfalse'Unds'LESSON-03-A'Unds'Boolean{}()
)
),
Lbl'UndsPipePipeUndsUnds'LESSON-03-A'Unds'Boolean'Unds'Boolean'Unds'Boolean{}(
Lbl'UndsAnd-And-UndsUnds'LESSON-03-A'Unds'Boolean'Unds'Boolean'Unds'Boolean{}(
Lbltrue'Unds'LESSON-03-A'Unds'Boolean{}(),
Lblfalse'Unds'LESSON-03-A'Unds'Boolean{}()
),
Lblfalse'Unds'LESSON-03-A'Unds'Boolean{}()
)
)
)
In this case, we get both ASTs. Since our grammar is ambiguous, the GLR
algorithm produces the complete parse forest. The ambiguity is indicated in the
above KORE output by node Lblamb
and its two children, each a possible AST of
the term true && false || false
.
Finally, note that, for a K definition named foo.k
, and directory
foo-kompiled
created when running kompile
, you can invoke the ahead-of-time
parser you generated by executing foo-kompiled/parser_PGM <file>
on a file.
Exercises
-
Compile
lesson-03-d.k
with ahead-of-time parsing enabled. Then compare how long it takes to runkast --output kore and-or-left.bool
with how long it takes to runlesson-03-d-kompiled/parser_PGM and-or-left.bool
. Confirm for yourself that both produce the same result, but that the latter is faster. -
Define a simple grammar consisting of integers, brackets, addition, subtraction, multiplication, division, and unary negation. Integers should be in decimal form and lexically without a sign, whereas negative numbers can be represented via unary negation. Ensure that you are able to parse some basic arithmetic expressions using a generated ahead-of-time parser. Do not worry about disambiguating the grammar or about writing rules to implement the operations in this definition.
-
Write a program where the meaning of the arithmetic expression based on the grammar you defined above is ambiguous, and then write programs that express each individual intended meaning using brackets.
Next lesson
Once you have completed the exercises above, you can continue to Lesson 1.4: Disambiguating Parses.