Copyright © 2008, 2009 Christian Surlykke
This document is licensed under the Creative Commons Attribution-Share Alike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.
The software that makes up Kanocc is licensed under GPL v3. See the README file that comes with kanocc or the individual sourcefiles that make up Kanocc.
Table of Contents
Table of Contents
Kanocc is a Ruby-framework for parsing and translating for context-free grammars.
Kanocc tries to be 'scripty': Emphasis is on ease of use and seamless integration with your Ruby programs. Performance is a secondary concern - so you probably won't use Kanocc to do a production quality C++ compiler.
I imagine that Kanocc could be useful, for example, when your program is reading input that is best analyzed using parsing techniques, or if you create a small domain specific language.
Unlike compiler-compilers such as yacc or bison, Kanocc doesn't produce an intermediate file (like a c-program) to be run through a compiler to produce an executable. Kanocc reads your grammar and parses and executes your input in one run. Hence it's name.
English is not my native language, and it shows in this document. My apologies.
Kanocc is largely untested software, so it will burn down your house, and break. If this happens, you get to keep the pieces.
In the following it's assumed that you have knowledge of parsers, grammars and syntax-directed translation. If not, you'll have to do some reading. There is a lot of literature you can choose from. In particular I recommend Compilers: Principles, Techniques and Tools (The Dragonbook), which was where I learned about the subject. Also, the Bison manual contains much useful information.
It is also assumed that you have working knowledge of Ruby
Table of Contents
In this chapter we'll look at how to write an interpreter using Kanocc.
Most texts on grammars, parsers and such use arithmetic expressions as example. Obviously we don't want to be at odds with this tradition, so we'll do an interpreter, that can understand and evaluate simple arithmetic expressions.
The grammar looks like this:
Expr ::= Expr "+" Term | Expr "-" Term | Term Term ::= Term "*" Factor | Term "/" Factor | Factor Factor ::= Number | "(" Expr ")"
This grammar is made up of the nonterminals Expr
, Term
and Factor
and 9 tokens:
Number
which is suppose to match a sequence of digits, and 8 string literals:
"("
, ")"
, "+"
, "-"
, "*"
, "/"
, "\n"
, "exit"
With Kanocc, you can do an interpreter for the above grammar like this:
require "rubygems" require "kanocc" # ================================================== # Define the token 'Number' and what it matches class Number < Kanocc::Token attr_reader :val pattern(/\d+/) { @val = @m[0].to_i} end # ================================================== # Define the nonterminals Expr, Term and Factor with # their rules # First, a forward declaration class Expr < Kanocc::Nonterminal end # Then - the real stuff. class Factor < Kanocc::Nonterminal attr_reader :val rule(Number) {@val = @rhs[0].val} rule("(", Expr, ")") { @val = @rhs[1].val} end class Term < Kanocc::Nonterminal attr_reader :val rule(Term, "*", Factor) { @val = @rhs[0].val * @rhs[2].val} rule(Term, "/", Factor) { @val = @rhs[0].val / @rhs[2].val} rule(Factor) {@val = @rhs[0].val} end class Expr attr_reader :val rule(Expr, "+", Term) { @val = @rhs[0].val + @rhs[2].val} rule(Expr, "-", Term) { @val = @rhs[0].val - @rhs[2].val} rule(Term) { @val = @rhs[0].val} end # ========== Set up a parser for Expr ============= myParser = Kanocc::Kanocc.new(Expr) # ========== And try it out ======================= puts myParser.parse('3 + 4 - 2').val puts myParser.parse('8 - 2 * 3').val
If you run this script, you'll get:
5 2
In the following we will go through the elements of this example
In our example we defined a class Number
, extending Kanocc::Token
.
By calling Kanocc::Token.pattern
we tell Kanocc how Numbers look.
The method pattern
must be given a regular expression as argument.
By referring to Number
in our rules, we make Kanocc look for Number
during lexical scanning.
Obviously, string literals (such as "("
, "exit"
, "\n"
,..)
doesn't need to have patterns attached. All you have to do is mention them in grammarrules,
and Kanocc will look for them in the input.
Tokens must always have a parameterless initialize
method,
so that Kanocc knows how to create instances.
You'll have noticed that the method Kanocc::Token.pattern
takes a block as
it's final argument. This allows you to initialize the token instances that Kanocc creates.
When Kanocc recognizes a token (of the Kanocc::Token
kind) in input, it will do the following:
Create an instance of the token - in our example above an instance of Number
- using it's
parameterless constructor.
Create a Match
object by applying the regular expression you gave, to
the portion of input being matched, and inject it into the newly created instance
(of Number
) as @m
Execute the block given to pattern
in the context of the newly created instance.
Note how, in our example, this allows us to set @val
in those instances of Number
that Kanocc creates during parsing.
It is possible to call pattern
more than once in a Token class. As an example let's
say we want to allow input of numbers in both decimal and hexadecimal form. We could then change the
definition of the class Num
to:
class Num < Kanocc::Token attr_reader :val pattern(/0x[0-9A-F]+/) { @val = @m[0].hex } # Hex notation pattern(/\d+/) { @val = @m[0].to_i} # Decimal notation end
With this change, the calculator in our example above will be able to accept expressions like:
"0xA3 + 7"
By default Kanocc considers anything that matches the regular expression /\s/
,
whitespace.
You may redefine this, if you wish, by calling the method Kanocc::Kanocc#set_whitespace
.
set_whitespace
takes a list of regular expressions as argument.
For example, to define that spaces, tabs and anything from a '#
' to lineending is whitespace,
you can do:
myParser.setWhitespace(/ /, /\t/, /#.*/)
- or, if you prefer:
myParser.setWhitespace(/ |\t|#.*/)
You have all the power of Ruby regular expressions at your disposal here.
If you're very observant, you'll have noticed that the literal "\n"
which
we used in our example above is also whitespace. This is ok. If a part of input can be recognized both as a token and as whitespace,
Kanocc will prefer to emit a token.
^{[1]}
In our example we defined 3 classes: Expr
, Term
and Factor
,
each of them extending Kanocc::Nonterminal
.
This is how we tell Kanocc about the nonterminals of our grammar.
Inside each class definition we make calls to the static method rule
.
With this we define the grammarrules belonging to that that particular nonterminal.
For examle, we told Kanocc about the rule:
Expr ::= Expr "+" Term
with the call
rule(Expr, "+", Term) {@val = @rhs[0].val + @rhs[2].val}
The arguments to rule
must be nonterminals
(that means: classes that inherit from Kanocc::Nonterminal
),
tokens (classes that inherit from Kanoc::Token
) or strings.
Nonterminals must always have a parameterless initialize
method so Kanocc
can create instances of them.
Like Token.pattern
, Nonterminal.rule
accepts a block as it's final argument.
This allows you to attach semantics to grammarrules.
Whenever Kanocc reduces by a grammarrule, it creates an instance of the class that has the rule.
For example, when reducing by the rule
Expr ::= Expr "+" Term
Kanocc will create an instance of the class Expr.
This happens bottom-up, so at this point Kanocc will already have made instances of the nonterminals on the right-hand side of the rule (Expr
and Term
).
Lets call these instances expr
and term
, respectively.
When Kanocc does the reduction it:
Calls Expr.new
to create a new instance of Expr
.
Injects the array [expr, "+", term]
into this instance as @rhs
Excecutes the given block in the context of this new instance.
Note how, in our example above, this allows us to set @val in the class instances Kanocc creates.
In our example above, we used an unambigous grammar.
We might, however, prefer the following ambigous grammar:
Expr ::= Expr '+' Expr | Expr - Expr | Expr * Expr | Expr / Expr | '(' Expr ')' | Number
This grammar is ambigous. For example it doesn't specify whether to interpret the expression
8 - 2 * 3
as (8 - 2) * 3
or 8 - (2 * 3)
.
To control this you assign precedence to your grammarules. This is done by following the call to rule by a call to
Nonterminal.precedence
with an integer as argument. Default precedence is 0, so if you give a precedence > 0 you increase precedence - i.e make Kanocc more inclined to use the production.
Using precedence, we can rewrite our example to:
require "rubygems" require "kanocc" class Number < Kanocc::Token attr_reader :val pattern(/\d+/) { @val = @m[0].to_i} end class Expr < Kanocc::Nonterminal attr_reader :val rule(Expr, "+", Expr) { @val = @rhs[0].val + @rhs[2].val} rule(Expr, "-", Expr) { @val = @rhs[0].val - @rhs[2].val} rule(Expr, "*", Expr) { @val = @rhs[0].val * @rhs[2].val}; precedence -1 rule(Expr, "/", Expr) { @val = @rhs[0].val / @rhs[2].val}; precedence -1 rule("(", Expr, ")") { @val = @rhs[1].val} rule(Number) {@val = @rhs[0].val} end myParser = Kanocc::Kanocc.new(Expr) puts myParser.parse('3 + 4 - 2').val puts myParser.parse('8 - 2 * 3').val
The two calls to precedence
gave the rules for multiplication and division lower precedence.
When mathematicians talk about operators and precedence they say that multiplication operators have higher precedence than addition operators. Also traditional parser-generators like yacc or bison would have you asign a higher precedence to '*'. Why then do we assign a lower precedence to multiplication?
The reason for this quirk is that the Earley parsing algorithm,
which is employed by Kanocc, constructs the parse tree top-down. The Earley parser must
find a way to expand the nonterminal Expr
to the
string '8 - 2 * 3'
.
This can be done in two ways:
Expr / | \ / | \ Expr '-' Expr / / | \ / / | \ Number Expr '*' Expr | | | '8' | | Number Number | | '2' '3'
or
Expr / | \ / | \ Expr '*' Expr / | \ \ / | \ \ Expr '-' Expr Number | | | | | '3' Number Number | | '8' '2'
Clearly, what we want is the first variant, so we need to make Kanocc choose the expansion
Expr --> Expr '-' Expr
before
Expr --> Expr '*' Expr
Therefore we must give multiplication rules a lower precedence.
Consider the expression: 8 - 5 + 3
. Given our grammar, Expr
may expand to that in 2 ways:
Expr / | \ / | \ Expr '+' Number / | \ | / | \ '3' Number '-' Number | | '8' '5'
or
Expr / | \ / | \ Number '-' Expr | / | \ '8' / | \ Number '+' Number | | '5' '3'
By default, Kanocc will choose the first variant, which is well in tune with the fact that operators '+' and '-' are left associative. If you look at the parse trees above, you'll notice that the topmost Expr of the first tree has it's rightmost nonterminal (Number) expand to '3' - one token - while the topmost Expr of the second tree has it's rightmost nonterminal (Expr) expand to '5' '+' '3' - a sequence of three tokens. You can control this by declaring wether Kanocc should expand the rightmost grammarsymbol of a rule to as much or as little as possible. By default Kanocc expands the rightmost grammarsymbol of a rule to as little as possible. If you want to change that, eg. have your addition-operators be right-associative, you can declare rules to derive right. You might do:
. . class Expr < Kanocc::Nonterminal rule(Expr, "+", Expr) { @val = @rhs[0].val + @rhs[2].val}; derives_right rule(Expr, "-", Expr) { @val = @rhs[0].val - @rhs[2].val}; derives_right end . .
If Kanocc is expanding a nonterminal and has several rules to choose from and they all derive right, it will choose to expand by the rule whos rightmost symbol can expand to the longest sequence of input tokens.
Once you have your tokens and nonterminals in place, the rest is simple:
Construct an instance of Kanocc::Kanocc
with the startsymbol
of your grammar as argument.
Then call Kannoc#parse with the input as argument, and Kanocc will try to reduce the input to the given startsymbol. If succesful, an instance of that startsymbol will be returned.
In our example:
myParser = Kanocc::Kanocc.new(Expr) puts myParser.parse('8 - 3 * 2').val
did that.
^{[1] } In full the Kanocc algorithm for lexical scanning is:
Often, a language you define will contain lists of stuff. Kanocc comes with
two metods for this particular purpose: zm
('zero or more') and om
('one or more').
Both methods take as their first argument the grammarsymbol that make up the list, and as an optional second parameter, a symbol that will be used as a separator.
Thus, zm(A)
, means a (possibly empty) list of A
's, and
zm(A, ";")
means a (possibly empty) list of A
's separated by semicolons.
You may also feed zm
and om
arrays - zm([A, B, C], ",")
would accept lists like:
A B C, A B C, A B C, A B C
Say we wanted to extend our parser above, so that it accepts several expressions, each on one line. We could rewrite the parser like this:
require "rubygems" require "kanocc" class Number < Kanocc::Token attr_reader :val pattern(/\d+/) { @val = @m[0].to_i} end class Expr < Kanocc::Nonterminal attr_reader :val rule(Expr, "+", Expr) { @val = @rhs[0].val + @rhs[2].val} rule(Expr, "-", Expr) { @val = @rhs[0].val - @rhs[2].val} rule(Expr, "*", Expr) { @val = @rhs[0].val * @rhs[2].val}; precedence -1 rule(Expr, "/", Expr) { @val = @rhs[0].val / @rhs[2].val}; precedence -1 rule("(", Expr, ")") { @val = @rhs[1].val} rule(Number) {@val = @rhs[0].val} end class Program < Kanocc::Nonterminal rule(zm(Expr, "\n"), "\n") {@rhs[0].elements.each {|expr| puts expr.val}} end Kanocc::Kanocc.new(Program).parse <<-EOF 2 + 3 4 - 1 8*(2 + 3) EOF
In other words: A Program
is a sequence of Expr
's separated by newlines, ending with a newline.
As an other example, consider compund statements - a sequence of statements surrounded by
"begin"
and "end"
and separated by semicolons. For example:
begin <stmt>; <stmt>; <stmt> end
In Kanocc you can write a grammar for that like:
class CompoundStatement < Kanocc::Nonterminal rule("begin", zm(Statement, ";"), "end") end
Table of Contents
Suppose you try this:
require "rubygems" require "kanocc" class Number < Kanocc::Token attr_reader :val pattern(/\d+/) { @val = @m[0].to_i} end class Expr < Kanocc::Nonterminal attr_reader :val rule(Expr, "+", Expr) { @val = @rhs[0].val + @rhs[2].val} rule(Expr, "-", Expr) { @val = @rhs[0].val - @rhs[2].val} rule(Expr, "*", Expr) { @val = @rhs[0].val * @rhs[2].val}; precedence -1 rule(Expr, "/", Expr) { @val = @rhs[0].val / @rhs[2].val}; precedence -1 rule("(", Expr, ")") { @val = @rhs[1].val} rule(Number) {@val = @rhs[0].val} end myParser = Kanocc::Kanocc.new(Expr) puts myParser.parse('3 + 4 - -').val
- i.e feed your parser input that is not syntactically correct.
You'll get something like:
/home/christian/projekter/Kanocc/lib/kanocc/earley.rb:166:in `reportParsingError': Could not consume input: "-" at position: 8 - expected Number or "(" (Kanocc::ParseException) from /home/christian/projekter/Kanocc/lib/kanocc/earley.rb:135:in `consume' from /home/christian/projekter/Kanocc/lib/kanocc.rb:133:in `parse' from /home/christian/projekter/Kanocc/lib/kanocc/scanner.rb:72:in `each_token' from /home/christian/projekter/Kanocc/lib/kanocc.rb:130:in `parse' from ./doc_calc.rb:49
This is the most basic error-handling Kanocc offers: When it finds that input cannot parse according to the grammar, it will terminate parsing ant raise a Kanocc::ParseException
. This exception will contain the offending input token, it's position, and a list of terminals that Kanocc was expecting at that point.
Rather than abort parsing when an error is discovered, it can be nice to be able to skip an error and continue parsing, so that at least all (or most of) the errors in input can be reported in one run.
Kanocc has a special nonterminal Kanocc::Error
which may be used on the
right-hand-side of rules. Error
can expand to anything, and rules containing
Error
on their right-hand-side will have a precedence of negative infinity. In other words:
A rule containing Error
will only be used if Kanocc can see no other way to reduce
the input to the start symbol.
Using Kanocc::Error
we can rewrite our example to:
require "rubygems" require "kanocc" class Number < Kanocc::Token attr_reader :val pattern(/\d+/) { @val = @m[0].to_i} end class Expr < Kanocc::Nonterminal attr_reader :val rule(Expr, "+", Expr) { @val = @rhs[0].val + @rhs[2].val} rule(Expr, "-", Expr) { @val = @rhs[0].val - @rhs[2].val} rule(Expr, "*", Expr) { @val = @rhs[0].val * @rhs[2].val}; precedence -1 rule(Expr, "/", Expr) { @val = @rhs[0].val / @rhs[2].val}; precedence -1 rule("(", Expr, ")") { @val = @rhs[1].val} rule(Number) {@val = @rhs[0].val} end class Line < Kanocc::Nonterminal rule(Expr, "\n") {puts @rhs[0].val} rule(Kanocc::Error, "\n") {puts "Sorry - didn't understand: " + @rhs[0].str.inspect} end class Program < Kanocc::Nonterminal rule(zm(line)) end Kanocc::Kanocc.new(Program).parse <<-EOI 3 + 4 - 2 8 - 2 * 3 8 * + 3 2 + 3 * 2 EOI
This should produce:
5 2 Sorry - didn't understand: " 8 * + 3" 8
To facillitate error messaging, an instance of Kanocc::Error
will have set two instance
variables:
@program
which refers the input being processed.
@start
which indexes the first character of the part of input that has been
reduced to this error.
@end
which indexes the first character after the erronous
part of input.
In our example, @program
would reference the string :
<<-EOT exit EOT
,
@start
and @end
would have the values 23 and 34 respectively, so that
@program[@start, @end]
would give
" 8 * + 3 "