Kanocc

Kanocc ain't no compiler-compiler

Documentation for the ruby parsing framework 'Kanocc'.

Christian Surlykke

This document is licensed under the Creative Commons Attribution-Share Alike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.

The software that makes up Kanocc is licensed under GPL v3. See the README file that comes with kanocc or the individual sourcefiles that make up Kanocc.


Table of Contents

1. Introduction
What is Kanocc?
Disclaimers
Reading this document
Notation
Getting and installing Kanocc
2. Writing an interpreter
Tokens
Semantics
More than one pattern
Whitespace
Nonterminals
Semantics
Precedence
Left- and right-assotiative operators
Setting up the parser
3. Lists
4. Errors
Error recovery

Chapter 1. Introduction

What is Kanocc?

Kanocc is a Ruby-framework for parsing and translating for context-free grammars.

Kanocc tries to be 'scripty': Emphasis is on ease of use and seamless integration with your Ruby programs. Performance is a secondary concern - so you probably won't use Kanocc to do a production quality C++ compiler.

I imagine that Kanocc could be useful, for example, when your program is reading input that is best analyzed using parsing techniques, or if you create a small domain specific language.

Unlike compiler-compilers such as yacc or bison, Kanocc doesn't produce an intermediate file (like a c-program) to be run through a compiler to produce an executable. Kanocc reads your grammar and parses and executes your input in one run. Hence it's name.

Disclaimers

  • English is not my native language, and it shows in this document. My apologies.

  • Kanocc is largely untested software, so it will burn down your house, and break. If this happens, you get to keep the pieces.

Reading this document

In the following it's assumed that you have knowledge of parsers, grammars and syntax-directed translation. If not, you'll have to do some reading. There is a lot of literature you can choose from. In particular I recommend Compilers: Principles, Techniques and Tools (The Dragonbook), which was where I learned about the subject. Also, the Bison manual contains much useful information.

It is also assumed that you have working knowledge of Ruby

Notation

Program examples will be shown like this:

class A
  def initialize
    puts "In A's initializer..."
  end
end

A.new

Output will be shown as:

In A's initializer...

Foo#bar means the instance-method bar of the class Foo.

Foo.bar means the class-method bar of the class Foo.

Getting and installing Kanocc

Kanocc is available as a gem from RubyForge. From the command-prompt, do:

gem install kanocc

and then in your Ruby-scripts:

require 'rubygems'
require 'kanocc'

Chapter 2. Writing an interpreter

In this chapter we'll look at how to write an interpreter using Kanocc.

Most texts on grammars, parsers and such use arithmetic expressions as example. Obviously we don't want to be at odds with this tradition, so we'll do an interpreter, that can understand and evaluate simple arithmetic expressions.

The grammar looks like this:

Expr    ::= Expr "+" Term 
          | Expr "-" Term
          | Term
  
Term    ::= Term "*" Factor
          | Term "/" Factor
          | Factor

Factor  ::= Number
          | "(" Expr ")"

This grammar is made up of the nonterminals Expr, Term and Factor and 9 tokens: Number which is suppose to match a sequence of digits, and 8 string literals: "(", ")", "+", "-", "*", "/", "\n", "exit"

With Kanocc, you can do an interpreter for the above grammar like this:

require "rubygems"   
require "kanocc"

# ==================================================
# Define the token 'Number' and what it matches 
        
class Number < Kanocc::Token 
  attr_reader :val 
  pattern(/\d+/) { @val = @m[0].to_i} 
end
 
# ==================================================
# Define the nonterminals Expr, Term and Factor with
# their rules
       
# First, a forward declaration
class Expr < Kanocc::Nonterminal 
end   
  
# Then - the real stuff.
class Factor < Kanocc::Nonterminal 
  attr_reader :val
  rule(Number) {@val = @rhs[0].val}
  rule("(", Expr, ")") { @val = @rhs[1].val}
end

class Term < Kanocc::Nonterminal 
  attr_reader :val
  rule(Term, "*", Factor) { @val = @rhs[0].val * @rhs[2].val}
  rule(Term, "/", Factor) { @val = @rhs[0].val / @rhs[2].val} 
  rule(Factor) {@val = @rhs[0].val}
end

class Expr 
  attr_reader :val 

  rule(Expr, "+", Term) { @val = @rhs[0].val + @rhs[2].val}
  rule(Expr, "-", Term) { @val = @rhs[0].val - @rhs[2].val}
  rule(Term) { @val = @rhs[0].val} 
end

# ========== Set up a parser for Expr ============= 
myParser = Kanocc::Kanocc.new(Expr) 
 
# ========== And try it out ======================= 
puts myParser.parse('3 + 4 - 2').val 
puts myParser.parse('8 - 2 * 3').val 

If you run this script, you'll get:

5 
2 

In the following we will go through the elements of this example

Tokens

In our example we defined a class Number, extending Kanocc::Token.

By calling Kanocc::Token.pattern we tell Kanocc how Numbers look. The method pattern must be given a regular expression as argument.

By referring to Number in our rules, we make Kanocc look for Number during lexical scanning.

Obviously, string literals (such as "(", "exit", "\n",..) doesn't need to have patterns attached. All you have to do is mention them in grammarrules, and Kanocc will look for them in the input.

Tokens must always have a parameterless initialize method, so that Kanocc knows how to create instances.

Semantics

You'll have noticed that the method Kanocc::Token.pattern takes a block as it's final argument. This allows you to initialize the token instances that Kanocc creates.

When Kanocc recognizes a token (of the Kanocc::Token kind) in input, it will do the following:

  1. Create an instance of the token - in our example above an instance of Number - using it's parameterless constructor.

  2. Create a Match object by applying the regular expression you gave, to the portion of input being matched, and inject it into the newly created instance (of Number) as @m

  3. Execute the block given to pattern in the context of the newly created instance.

Note how, in our example, this allows us to set @val in those instances of Number that Kanocc creates during parsing.

More than one pattern

It is possible to call pattern more than once in a Token class. As an example let's say we want to allow input of numbers in both decimal and hexadecimal form. We could then change the definition of the class Num to:

class Num < Kanocc::Token 
  attr_reader :val 
  pattern(/0x[0-9A-F]+/) { @val = @m[0].hex } # Hex notation
  pattern(/\d+/) { @val = @m[0].to_i}  # Decimal notation 
end

With this change, the calculator in our example above will be able to accept expressions like:

"0xA3 + 7"

Whitespace

By default Kanocc considers anything that matches the regular expression /\s/, whitespace.

You may redefine this, if you wish, by calling the method Kanocc::Kanocc#set_whitespace. set_whitespace takes a list of regular expressions as argument.

For example, to define that spaces, tabs and anything from a '#' to lineending is whitespace, you can do:

  myParser.setWhitespace(/ /, /\t/, /#.*/)

- or, if you prefer:

  myParser.setWhitespace(/ |\t|#.*/)

You have all the power of Ruby regular expressions at your disposal here.

If you're very observant, you'll have noticed that the literal "\n" which we used in our example above is also whitespace. This is ok. If a part of input can be recognized both as a token and as whitespace, Kanocc will prefer to emit a token. [1]

Nonterminals

In our example we defined 3 classes: Expr, Term and Factor, each of them extending Kanocc::Nonterminal. This is how we tell Kanocc about the nonterminals of our grammar.

Inside each class definition we make calls to the static method rule. With this we define the grammarrules belonging to that that particular nonterminal. For examle, we told Kanocc about the rule:

  
Expr ::= Expr "+" Term

with the call

rule(Expr, "+", Term) {@val = @rhs[0].val + @rhs[2].val}

The arguments to rule must be nonterminals (that means: classes that inherit from Kanocc::Nonterminal), tokens (classes that inherit from Kanoc::Token) or strings.

Nonterminals must always have a parameterless initialize method so Kanocc can create instances of them.

Semantics

Like Token.pattern, Nonterminal.rule accepts a block as it's final argument. This allows you to attach semantics to grammarrules.

Whenever Kanocc reduces by a grammarrule, it creates an instance of the class that has the rule.

For example, when reducing by the rule

Expr ::= Expr "+" Term

Kanocc will create an instance of the class Expr.

This happens bottom-up, so at this point Kanocc will already have made instances of the nonterminals on the right-hand side of the rule (Expr and Term). Lets call these instances expr and term, respectively.

When Kanocc does the reduction it:

  1. Calls Expr.new to create a new instance of Expr.

  2. Injects the array [expr, "+", term] into this instance as @rhs

  3. Excecutes the given block in the context of this new instance.

Note how, in our example above, this allows us to set @val in the class instances Kanocc creates.

Precedence

In our example above, we used an unambigous grammar.

We might, however, prefer the following ambigous grammar:

Expr ::= Expr '+' Expr
       | Expr - Expr
       | Expr * Expr
       | Expr / Expr
       | '(' Expr ')'
       | Number

This grammar is ambigous. For example it doesn't specify whether to interpret the expression 8 - 2 * 3 as (8 - 2) * 3 or 8 - (2 * 3) .

To control this you assign precedence to your grammarules. This is done by following the call to rule by a call to

Nonterminal.precedence

with an integer as argument. Default precedence is 0, so if you give a precedence > 0 you increase precedence - i.e make Kanocc more inclined to use the production.

Using precedence, we can rewrite our example to:

require "rubygems"   
require "kanocc"

class Number < Kanocc::Token 
  attr_reader :val 
  pattern(/\d+/) { @val = @m[0].to_i} 
end
         
class Expr < Kanocc::Nonterminal 
  attr_reader :val 
        
  rule(Expr, "+", Expr) { @val = @rhs[0].val + @rhs[2].val}
  rule(Expr, "-", Expr) { @val = @rhs[0].val - @rhs[2].val}
  rule(Expr, "*", Expr) { @val = @rhs[0].val * @rhs[2].val}; precedence -1
  rule(Expr, "/", Expr) { @val = @rhs[0].val / @rhs[2].val}; precedence -1
  rule("(", Expr, ")") { @val = @rhs[1].val}
  rule(Number) {@val = @rhs[0].val}
end

myParser = Kanocc::Kanocc.new(Expr) 
 
puts myParser.parse('3 + 4 - 2').val 
puts myParser.parse('8 - 2 * 3').val 

The two calls to precedence gave the rules for multiplication and division lower precedence.

When mathematicians talk about operators and precedence they say that multiplication operators have higher precedence than addition operators. Also traditional parser-generators like yacc or bison would have you asign a higher precedence to '*'. Why then do we assign a lower precedence to multiplication?

The reason for this quirk is that the Earley parsing algorithm, which is employed by Kanocc, constructs the parse tree top-down. The Earley parser must find a way to expand the nonterminal Expr to the string '8 - 2 * 3'. This can be done in two ways:

            Expr
           /  | \
          /   |  \
        Expr '-'  Expr
        /        /  | \
       /        /   |  \
   Number    Expr  '*'  Expr
     |        |          |
    '8'       |          |
            Number     Number
              |          |
             '2'        '3'

or

            Expr
           /  | \
          /   |  \
        Expr '*'  Expr
       /  | \        \
      /   |  \        \
   Expr  '-'  Expr   Number
    |          |       |
    |          |      '3'
  Number     Number
    |          |
   '8'        '2' 

Clearly, what we want is the first variant, so we need to make Kanocc choose the expansion

  Expr --> Expr '-' Expr

before

  Expr --> Expr '*' Expr

Therefore we must give multiplication rules a lower precedence.

Left- and right-assotiative operators

Consider the expression: 8 - 5 + 3. Given our grammar, Expr may expand to that in 2 ways:

            Expr
          /  |   \
         /   |    \
       Expr '+'   Number
      /  | \         |
    /    |  \       '3'
 Number '-' Number
   |          |
  '8'        '5'
  

or

       Expr
     /  |  \
    /   |   \
Number '-'   Expr
  |         /  |  \
 '8'       /   |   \
       Number '+'  Number
         |           |
        '5'         '3'

  

By default, Kanocc will choose the first variant, which is well in tune with the fact that operators '+' and '-' are left associative. If you look at the parse trees above, you'll notice that the topmost Expr of the first tree has it's rightmost nonterminal (Number) expand to '3' - one token - while the topmost Expr of the second tree has it's rightmost nonterminal (Expr) expand to '5' '+' '3' - a sequence of three tokens. You can control this by declaring wether Kanocc should expand the rightmost grammarsymbol of a rule to as much or as little as possible. By default Kanocc expands the rightmost grammarsymbol of a rule to as little as possible. If you want to change that, eg. have your addition-operators be right-associative, you can declare rules to derive right. You might do:

.
.

class Expr < Kanocc::Nonterminal
  rule(Expr, "+", Expr) { @val = @rhs[0].val + @rhs[2].val}; derives_right
  rule(Expr, "-", Expr) { @val = @rhs[0].val - @rhs[2].val}; derives_right
 end

.
.

If Kanocc is expanding a nonterminal and has several rules to choose from and they all derive right, it will choose to expand by the rule whos rightmost symbol can expand to the longest sequence of input tokens.

Setting up the parser

Once you have your tokens and nonterminals in place, the rest is simple: Construct an instance of Kanocc::Kanocc with the startsymbol of your grammar as argument.

Then call Kannoc#parse with the input as argument, and Kanocc will try to reduce the input to the given startsymbol. If succesful, an instance of that startsymbol will be returned.

In our example:

myParser = Kanocc::Kanocc.new(Expr)
puts myParser.parse('8 - 3 * 2').val

did that.



[1] In full the Kanocc algorithm for lexical scanning is:

  1. At current position in input, all patterns are considered - tokens, literals and whitespace.
  2. Those patterns that matches the longest substring starting at current position are picked out.
  3. If all these are whitespace patterns, current position is advanced to just after the matched substring
  4. If there are tokens and/or literals among the patterns all these are forwarded to the parser for consideration
  5. If nothing could be matched, the first character at current position is emitted as a literal
This scheme has the nice quality that lexical scanning (the process of transforming input to a stream of tokens) never fails.

Chapter 3. Lists

Often, a language you define will contain lists of stuff. Kanocc comes with two metods for this particular purpose: zm ('zero or more') and om ('one or more').

Both methods take as their first argument the grammarsymbol that make up the list, and as an optional second parameter, a symbol that will be used as a separator.

Thus, zm(A), means a (possibly empty) list of A's, and zm(A, ";") means a (possibly empty) list of A's separated by semicolons.

You may also feed zm and om arrays - zm([A, B, C], ",") would accept lists like:

A B C, A B C, A B C, A B C

Say we wanted to extend our parser above, so that it accepts several expressions, each on one line. We could rewrite the parser like this:

require "rubygems"   
require "kanocc"

class Number < Kanocc::Token
  attr_reader :val
  pattern(/\d+/) { @val = @m[0].to_i}
end


class Expr < Kanocc::Nonterminal
  attr_reader :val

  rule(Expr, "+", Expr) { @val = @rhs[0].val + @rhs[2].val}
  rule(Expr, "-", Expr) { @val = @rhs[0].val - @rhs[2].val}
  rule(Expr, "*", Expr) { @val = @rhs[0].val * @rhs[2].val}; precedence -1
  rule(Expr, "/", Expr) { @val = @rhs[0].val / @rhs[2].val}; precedence -1
  rule("(", Expr, ")") { @val = @rhs[1].val}
  rule(Number) {@val = @rhs[0].val}
end

class Program < Kanocc::Nonterminal
	rule(zm(Expr, "\n"), "\n") {@rhs[0].elements.each {|expr| puts expr.val}}
end

Kanocc::Kanocc.new(Program).parse <<-EOF
  2 + 3
  4 - 1
  8*(2 + 3)
EOF

In other words: A Program is a sequence of Expr's separated by newlines, ending with a newline.

As an other example, consider compund statements - a sequence of statements surrounded by "begin" and "end" and separated by semicolons. For example:

begin
  <stmt>;
  <stmt>;
  <stmt>
end

In Kanocc you can write a grammar for that like:

class CompoundStatement < Kanocc::Nonterminal
  rule("begin", zm(Statement, ";"), "end")
end

Chapter 4. Errors

Table of Contents

Error recovery

Suppose you try this:

require "rubygems"   
require "kanocc"

class Number < Kanocc::Token 
  attr_reader :val 
  pattern(/\d+/) { @val = @m[0].to_i} 
end
         
class Expr < Kanocc::Nonterminal 
  attr_reader :val 
        
  rule(Expr, "+", Expr) { @val = @rhs[0].val + @rhs[2].val}
  rule(Expr, "-", Expr) { @val = @rhs[0].val - @rhs[2].val}
  rule(Expr, "*", Expr) { @val = @rhs[0].val * @rhs[2].val}; precedence -1
  rule(Expr, "/", Expr) { @val = @rhs[0].val / @rhs[2].val}; precedence -1
  rule("(", Expr, ")") { @val = @rhs[1].val}
  rule(Number) {@val = @rhs[0].val}
end

myParser = Kanocc::Kanocc.new(Expr) 
 
puts myParser.parse('3 + 4 - -').val 

- i.e feed your parser input that is not syntactically correct.

You'll get something like:

/home/christian/projekter/Kanocc/lib/kanocc/earley.rb:166:in `reportParsingError': 
    Could not consume input: "-" at position: 8 - expected Number or "(" (Kanocc::ParseException)
        from /home/christian/projekter/Kanocc/lib/kanocc/earley.rb:135:in `consume'
        from /home/christian/projekter/Kanocc/lib/kanocc.rb:133:in `parse'
        from /home/christian/projekter/Kanocc/lib/kanocc/scanner.rb:72:in `each_token'
        from /home/christian/projekter/Kanocc/lib/kanocc.rb:130:in `parse'
        from ./doc_calc.rb:49

This is the most basic error-handling Kanocc offers: When it finds that input cannot parse according to the grammar, it will terminate parsing ant raise a Kanocc::ParseException. This exception will contain the offending input token, it's position, and a list of terminals that Kanocc was expecting at that point.

Error recovery

Rather than abort parsing when an error is discovered, it can be nice to be able to skip an error and continue parsing, so that at least all (or most of) the errors in input can be reported in one run.

Kanocc has a special nonterminal Kanocc::Error which may be used on the right-hand-side of rules. Error can expand to anything, and rules containing Error on their right-hand-side will have a precedence of negative infinity. In other words: A rule containing Error will only be used if Kanocc can see no other way to reduce the input to the start symbol.

Using Kanocc::Error we can rewrite our example to:

require "rubygems"
require "kanocc"

class Number < Kanocc::Token
attr_reader :val
pattern(/\d+/) { @val = @m[0].to_i}
end


class Expr < Kanocc::Nonterminal
  attr_reader :val

  rule(Expr, "+", Expr) { @val = @rhs[0].val + @rhs[2].val}
  rule(Expr, "-", Expr) { @val = @rhs[0].val - @rhs[2].val}
  rule(Expr, "*", Expr) { @val = @rhs[0].val * @rhs[2].val}; precedence -1
  rule(Expr, "/", Expr) { @val = @rhs[0].val / @rhs[2].val}; precedence -1
  rule("(", Expr, ")") { @val = @rhs[1].val}
  rule(Number) {@val = @rhs[0].val}
end

class Line < Kanocc::Nonterminal
  rule(Expr, "\n") {puts @rhs[0].val}
  rule(Kanocc::Error, "\n") {puts "Sorry - didn't understand: " + @rhs[0].str.inspect}
end

class Program < Kanocc::Nonterminal
  rule(zm(line))
end

Kanocc::Kanocc.new(Program).parse <<-EOI
  3 + 4 - 2
  8 - 2 * 3
  8 * + 3
  2 + 3 * 2
EOI

This should produce:

  5
  2
  Sorry - didn't understand: "  8 * + 3"
  8

To facillitate error messaging, an instance of Kanocc::Error will have set two instance variables:

  1. @program which refers the input being processed.

  2. @start which indexes the first character of the part of input that has been reduced to this error.

  3. @end which indexes the first character after the erronous part of input.

In our example, @program would reference the string :

<<-EOT
exit
EOT

, @start and @end would have the values 23 and 34 respectively, so that @program[@start, @end] would give

"  8 * + 3 "