Chapter 5
Grammar structure

5.1 TPG grammar structure

TPG grammars are contained in the doc string of the parser class. TPG grammars may contain three parts:

Options
are defined at the beginning of the grammar (see 5.3).
Tokens
are introduced by the token or separator keyword (see 6.2).
Rules
are described after tokens (see 5.5).

See figure 5.1 for a generic TPG grammar.




Figure 5.1: TPG grammar structure
import tpg

class Foo(tpg.Parser):  
    r"""  
 
        # Options  
        set lexer = CSL  
 
        # Tokens  
        separator spaces    '\s+'       ;  
        token int           '\d+'   int ;  
 
        # Rules  
        START -> X Y Z ;  
 
    """  
 
foo = Foo()  
result = foo("input string")



5.2 Comments

Comments in TPG start with # and run until the end of the line.

    # This is a comment

5.3 Options

Some options can be set at the beginning of TPG grammars. The syntax for options is:

set name = value
sets the name option to value.

5.3.1 Lexer option

The lexer option tells TPG which lexer to use.

set lexer = NamedGroupLexer
is the default lexer. It is context free and relies on named groups of the sre package, so it inherits the sre limit of 100 named groups, i.e. 100 tokens.
set lexer = Lexer
is similar to NamedGroupLexer but doesn’t use named groups. It is slower than NamedGroupLexer.
set lexer = CacheNamedGroupLexer
is similar to NamedGroupLexer except that tokens are first stored in a list. It is faster for heavy backtracking grammars.
set lexer = CacheLexer
is similar to Lexer except that tokens are first stored in a list. It is faster for heavy backtracking grammars.
set lexer = ContextSensitiveLexer
is the context sensitive lexer (see 8).
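
The idea of a lexer built on named groups can be sketched with the re module alone. This is only an illustration of the technique, not TPG’s implementation: the token table, the tokenize function and the token names are assumptions made for this example.

```python
import re

# Hypothetical token table: (name, regular expression, is_separator).
TOKENS = [
    ("spaces", r"\s+", True),
    ("int", r"\d+", False),
    ("ident", r"[A-Za-z_]\w*", False),
]

# One alternation in which every token is a named group.
MASTER = re.compile(
    "|".join("(?P<%s>%s)" % (name, regex) for name, regex, _ in TOKENS)
)
SEPARATORS = {name for name, _, is_separator in TOKENS if is_separator}


def tokenize(text):
    """Return (token name, matched text) pairs, skipping separators."""
    result = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError("unexpected character at position %d" % pos)
        # lastgroup holds the name of the group that matched.
        if match.lastgroup not in SEPARATORS:
            result.append((match.lastgroup, match.group()))
        pos = match.end()
    return result
```

Because every token occupies one named group of a single pattern, a limit on the number of groups becomes a limit on the number of tokens, which is the restriction mentioned for NamedGroupLexer.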

5.3.2 Word boundary option

The word_boundary option tells the lexer to search for word boundaries after identifiers.

set word_boundary = True
enables the word boundary search. This is the default.
set word_boundary = False
disables the word boundary search.
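
The effect of a word boundary search can be illustrated with plain re patterns. This sketch assumes that the search amounts to appending \b to an identifier-like pattern; that is an assumption made for illustration, not a description of TPG’s internals. The names KEYWORD, loose, strict and matches are hypothetical.

```python
import re

KEYWORD = "if"  # hypothetical identifier-like token

# Without a boundary check, the keyword also matches the start of a longer word.
loose = re.compile(KEYWORD)
# With \b appended, the match must end at a word boundary.
strict = re.compile(KEYWORD + r"\b")


def matches(pattern, text):
    """Tell whether the pattern matches at the beginning of text."""
    return pattern.match(text) is not None
```

With these definitions, loose accepts the beginning of "iffy" as the keyword, whereas strict rejects it and still accepts "if x".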

5.3.3 Regular expression options

The sre module accepts some options to define the behaviour of the compiled regular expressions. These options can be changed for each parser.

set lexer_ignorecase = True
enables the re.IGNORECASE option.
set lexer_locale = True
enables the re.LOCALE option.
set lexer_multiline = True
enables the re.MULTILINE option.
set lexer_dotall = True
enables the re.DOTALL option.
set lexer_verbose = True
enables the re.VERBOSE option.
set lexer_unicode = True
enables the re.UNICODE option.
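
How such options translate into a compiled expression can be sketched as follows. The OPTION_FLAGS table and the combine_flags function are hypothetical helpers written for this example; only the correspondence between option names and re flags comes from the list above.

```python
import re

# Option names from the list above, mapped to the re flags they enable.
OPTION_FLAGS = {
    "lexer_ignorecase": re.IGNORECASE,
    "lexer_locale": re.LOCALE,
    "lexer_multiline": re.MULTILINE,
    "lexer_dotall": re.DOTALL,
    "lexer_verbose": re.VERBOSE,
    "lexer_unicode": re.UNICODE,
}


def combine_flags(options):
    """Combine the flags of the options that are set to True."""
    flags = 0
    for name, enabled in options.items():
        if enabled:
            flags |= OPTION_FLAGS[name]
    return flags


# Counterpart of "set lexer_ignorecase = True" and "set lexer_multiline = True".
flags = combine_flags(
    {"lexer_ignorecase": True, "lexer_multiline": True, "lexer_dotall": False}
)
word = re.compile(r"^hello$", flags)
```

Here word finds HELLO on the second line of a multi-line string, which the same pattern compiled without flags would not.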

5.4 Python code

Python code sections are not handled by TPG: they are copied verbatim to the generated Python parser. TPG won’t complain about syntax errors in Python code sections; reporting them is Python’s job.

5.4.1 Syntax

Before TPG 3, Python code was enclosed in double curly brackets. That means that Python code must not contain two consecutive closing brackets. You can avoid this by writing } } (with a space) instead of }} (without a space). This syntax is still available, but the new syntax may be more readable. The new syntax uses $ to delimit code sections. When several $ sections are consecutive, they are treated as a single section.

5.4.2 Indentation

Python code can appear in several parts of a grammar. Since indentation has a special meaning in Python, it is important to know how TPG handles spaces and tabs at the beginning of lines.

When TPG encounters Python code, it removes from every non-blank line the leading spaces and tabs that all those lines have in common. TPG treats a space and a tab as the same character, so it is important to use one indentation style consistently and not to mix spaces and tabs. The code is then reindented according to where it is generated (in a class, in a method or in global space).
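
The behaviour described above can be approximated in a few lines. This is a minimal sketch under stated assumptions, not TPG’s own code: strip_common_indent and reindent are hypothetical names, and a tab is simply counted as one character, as the text says.

```python
def indent_width(line):
    """Count leading spaces and tabs, each treated as one character."""
    return len(line) - len(line.lstrip(" \t"))


def strip_common_indent(code):
    """Remove the indentation shared by all non-blank lines."""
    lines = code.splitlines()
    widths = [indent_width(line) for line in lines if line.strip()]
    common = min(widths) if widths else 0
    return "\n".join(line[common:] if line.strip() else "" for line in lines)


def reindent(code, prefix):
    """Indent every non-blank line with the prefix of its destination."""
    return "\n".join(prefix + line if line else line for line in code.splitlines())


# All lines share four spaces: the result is valid Python.
good = "    if 1==2:\n        print('???')\n    else:\n        print('OK')"
# The first line has only two spaces: only two are removed everywhere.
bad = "  if 1==2:\n        print('???')\n    else:\n        print('OK')"
```

Applying strip_common_indent to good yields code that starts in column zero, while applying it to bad leaves else indented by two spaces, which Python refuses to compile.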

Figure 5.2 shows how TPG handles indentation.



Figure 5.2: Code indentation examples (each _ stands for one space)

Example 1

Code in grammars (old syntax):
{{
____if_1==2:
________print_"???"
____else:
________print_"OK"
}}

Code in grammars (new syntax):
$__if_1==2:
$______print_"???"
$__else:
$______print_"OK"

Generated code:
if_1==2:
____print_"???"
else:
____print_"OK"

Comment: Correct: these lines have four spaces in common. These spaces are removed.

Example 2

Code in grammars (old syntax):
{{__if_1==2:
________print_"???"
____else:
________print_"OK"
}}

Code in grammars (new syntax):
The new syntax has no trouble in that case.

Generated code:
if_1==2:
______print_"???"
__else:
______print_"OK"

Comment: WRONG: it’s a bad idea to start a multiline code section on the first line since the common indentation may be different from what you expect. No error will be raised by TPG but Python won’t compile this code.

Example 3

Code in grammars (old syntax):
{{____print_"OK"_}}

Code in grammars (new syntax):
$_______print_"OK"
or
$_print_"OK"_$

Generated code:
print_"OK"

Comment: Correct: indentation does not matter in a one line Python code.


5.5 TPG parsers

TPG parsers are tpg.Parser classes. The grammar is the doc string of the class.

5.5.1 Methods

As TPG parsers are just Python classes, you can use them as normal classes. If you redefine the __init__ method, don’t forget to call tpg.Parser.__init__.
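
The pattern of redefining __init__ while still calling the base initializer looks like the sketch below. So that it runs without TPG installed, it uses a stand-in base class named Parser; with TPG the base class would be tpg.Parser and the call would go to tpg.Parser.__init__. The Calc class and its variables attribute are hypothetical.

```python
class Parser:
    # Stand-in for tpg.Parser, used only to keep this sketch self-contained.

    def __init__(self):
        # Represents whatever set-up the real base class performs.
        self.base_ready = True


class Calc(Parser):
    # Hypothetical parser class; with TPG the grammar would be its doc string.

    def __init__(self, variables=None):
        # Call the base initializer first so the parser is set up properly.
        Parser.__init__(self)
        # Then add the state that is specific to this parser.
        self.variables = dict(variables or {})


calc = Calc({"x": 1})
```

If the call to the base initializer were omitted, the object would lack everything the base class sets up, here the base_ready attribute.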

5.5.2 Rules

Each rule will be translated into a method of the parser.