Chapter 4: Bisonc++ grammar files

Bisonc++ processes a context-free grammar specification, producing a C++ class offering various predefined members, among which the member parse, that recognizes correct instances of the grammar.

In this chapter the organization and specification of such a grammar file is discussed in detail.

Having read this chapter you should be able to define a grammar for which Bisonc++ can generate a class, containing a member that recognizes correctly formulated (according to the grammar) input. Such a grammar must be in the class of LALR(1) grammars (see, e.g., Aho, Sethi & Ullman, 2003 (Addison-Wesley)).

4.1: Outline of a Bisonc++ Grammar File

The input file for bisonc++ is a grammar file. Different from Bison++ and Bison grammar files, bisonc++ grammar file consist of just two sections. The general form of a bisonc++ grammar file looks like this:

    Bisonc++ directives
    %%
    Grammar rules
        
Readers familiar with Bison may note that there is no C declaration section and no section to define Additional C code. With bisonc++ these sections are superfluous since, due to the fact that a bisonc++ parser is defined as a C++ class, all additional code required for the parser's implementation can be incorporated into the parser class itself. Also, C++ classes normally only require declarations that can be defined in the classes' header files, so also the `additional C code' section was omitted from bisonc++'s grammar file.

The `%%' is a separator that appears in every bisonc++ grammar file separating the two sections.

The directives section is used to declare the names of the terminal and nonterminal symbols, and may also describe operator precedence and the data types of semantic values of various symbols. Additional directives are used to define, e.g., the name of the generated parser class and a namespace in which the parser class is defined. All bisonc++ directives are covered in section 4.5.

Grammar rules define how to construct nonterminal symbols. The grammar rules section defines one or more bisonc++ grammar rules. See section 4.3, covering the syntax of grammar rules.

The `%%' separator must always be specified, even if the directives section is empty.

Bisonc++'s grammar file may be split into several files. Each file may be given a suggestive name. This allows quick identification of where a particular section or rule is found, and improves readability of the designed grammar. The %include-directive (see section 4.5.8) can be used to include a partial grammar specification file into another specification file.

4.2: Symbols, Terminal and Nonterminal Symbols

Symbols are the building blocks of bisonc++ grammars.

A terminal symbol (also known as a token) represents a class of syntactically equivalent symbols. Tokens are represented in bisonc++'s parser class by a number, defined in an enum. The parser's lex member function returns a token value indicating what kind of token has been read. You don't need to know what the code value is; instead, its symbol should always be used. By convention, it contains uppercase characters.

Nonterminal symbol define concepts of grammars. The symbol name is used in writing grammar rules. By convention, it contains lowercase characters.

Symbol names consist of letters, digits (not at the beginning), and underscores. Bisonc++ does not support periods in symbol names (users familiar with Bison may observe that Bison does support periods in symbol names, but the Bison's user guide states that `periods make sense only in nonterminals'. Even so, it appears that periods in symbols are hardly ever used).

There are two ways terminal symbols can be referred to:

By convention, a character token is only used to represent that particular character. Thus, the token '+' is represents the character `+' as a token.

All common escape sequences that can be used in C++'s character constants can be used in bisonc++ as well. Be careful not to use the 0 character as a character literal because its ASCII code, zero, is the code lex returns to indicate end-of-input (see section 5.3.1). If your program must be able to return 0-byte characters, define a special token (e.g., ZERO_BYTE) and return that token instead.

Note that literal string tokens, formally supported in Bison, are not supported by bisonc++. Again, such tokens are hardly ever encountered, and lexical scanner generators (like flex(1) and flexc++(1)) do not support them. Common practice is to define a symbolic name for a literal string token. So, a token like EQ may be defined in the grammar file, with the lexical scanner returning EQ when it matches ==.

The value returned by the parser's lex member is always a terminal token. The numeric code for a character token is simply the ASCII code of that character, and lex can simply return that character constant as a token value. Each named token becomes a C++ enumeration value in the parser base class header file, and lex can return such enumeration identifiers as well. When using an externally defined lexical scanner, that lexical scanner should include the parser's base class header file, and it should return either character constants or the token identifiers defined in that header file. So, if (%token NUM) was defined in the parser class Parser, then the lexical scanner may return Parser::NUM.

The symbol `error' is a reserved terminal symbol reserved by bisonc++ for error recovery purposes (see chapter 8). The error symbol should not be used for other purposes. In particular, the parser's member function lex should never return error. Several more identifiers should not be used as terminal symbols either. See section 4.5.29.1 for an overview.

4.3: Syntax of Grammar Rules

Grammar rules have the following characteristics:

4.4: Writing recursive rules

A rule is called recursive when its result nonterminal appears also in its production rules. Nearly all bisonc++ grammars use recursion, because that is the only way to define a sequence of any number of somethings. Consider this recursive definition of a comma-separated sequence of one or more expressions:

    expseq1:  
            expseq1 ',' exp
    | 
            exp
    ;
        
Since the recursive use of expseq1 is the leftmost symbol in the right hand side, we call this left recursion. By contrast, here the same construct is defined using right recursion:

    expseq1:  
            exp ',' expseq1
    | 
            exp
    ;
        
Any kind of sequence can be defined using either left recursion or right recursion, but whenever possible use left recursion, because it can parse a sequence of any number of elements given a limited stack space. Right recursion consumes stack space proportionally to the number of elements in the sequence, because all elements must be pushed on the stack before the rule can be applied even once. See chapter 7 for further explanation of this phenomenon.

Indirect or mutual recursion occurs when the result of the rule does not appear directly on its right hand side, but does appear in rules for other nonterminals which do appear on its right hand side. For example:


    expr:     
        primary '+' primary
    |
        primary
        ;

    primary:  
        constant
    | 
        '(' expr ')'
    ;
        
defines two mutually-recursive nonterminals, since each refers to the other.

4.5: Bisonc++ Directives

The declaration section of a grammar file defines the symbols used in formulating the production rules and the data types of semantic values. See section 4.2.

All token names (except for single-character literal tokens such as '+' and '*') must be declared. If data types must be associated with semantic values of symbols (see sections 4.6) then those symbols must also be declared.

By default the first rule in the rule section specifies the grammar's start symbol. The %start directive can be used to specify another rule (see section 3.1).

This section covers all of bisonc++'s declarations. Some have already been mentioned, but several more are available. Some define how the grammar parses its input (like %left, %right); other directives define, e.g., the name of the parsing function, or the name(s) of the files bisonc++ generates.

In particular readers familiar with Bison (or Bison++) should read this section thoroughly, since bisonc++'s directives are more extensive and different from the `declarations' offered by Bison, and the macros defined by Bison++.

Several directives expect file- or pathname arguments. File- or pathnames must be provided on the same line as the directive itself, starting at the first non-blank character after the directive. File- or pathnames may contain escape sequences (e.g., if you must: use `\ ' to include a blank in a filename), and continue until encountering the next blank character. Alternatively, file- or pathnames may be surrounded by double quotes ("...") or pointed brackets (<...>). Pointed brackets surrounding file- or path-names merely function to delimit filenames. They do not refer to, e.g., C++'s include path. No escape sequences are required for blanks within delimited file- or path-names.

Directives accepting a `filename' do not accept path names, i.e., they cannot contain directory separators (/); directives accepting a 'pathname' may contain directory separators.

Sometimes directives have analogous command-line options. In those cases command-line options take priority over directives.

Some directives may generate errors. This happens when a directive conflicts with the contents of a file bisonc++ cannot modify (e.g., a parser class header file exists, but doesn't define a namespace, but a %namespace directive was provided).

To solve such errore the offending directive could be omitted, the existing file could be removed, or the existing file could be hand-edited according to the directive's specification.

4.5.1: %baseclass-preinclude: specifying a header included by the baseclass

Syntax: %baseclass-preinclude pathname

Pathname defines the path to the file preincluded in the parser's base-class header. See the description of the --baseclass-preinclude option for details about this directive.

By default `pathname' is surrounded by double quotes; but the double quotes can also explicitly be specified. When the argument is surrounded by pointed brackets #include <header> is used.

4.5.2: %class-name: defining the name of the parser class

Syntax: %class-name parser-class-name

By default, bisonc++ generates a parser-class using the class name Parser. The default can be changed using this directive which defines the name of the C++ class that will be generated. It may be defined only once and parser-class-name must be a C++ identifier.

It is an error if this directive is used and an already existing parser-class header file does not define class `className' and/or if an already existing implementation header file does not define members of the class `className'.

4.5.3: %debug: adding debugging code to the `parse()' member

Syntax: %debug

Provide parse and its support functions with debugging code, showing the actual parsing process on the standard output stream.

When specified, debug output is shown by default, but its activity may be controlled using the setDebug(bool on-off) or setDebug(DebugMode_) members. Note that no #ifdef DEBUG macros are used anymore. Rerun bisonc++ without the --debug option to generate an equivalent parser not containing the debugging code.

4.5.4: %default-actions: adding `$$ = $1' action blocks to production rules

Syntax: %default-actions off|quiet|warn|std

By default, production rules without final action blocks are augmented by the bison(1) parser generator with $$ = $1 action blocks: the semantic value of the first component is returned as the rule's semantic value. Its manual also states that for empty rules there is no meaningful default action. However, it could be argued that empty production rules could return default semantic values, resulting in every matched rule having a defined semantic value.

When multiple semantic value types are used, the semantic value type returned by a $$ = $1 action is not uniquely defined. For one rule $1 might be an int field in a union, for another rule it might be a std::string *. With polymorphic semantic values comparable situations are encountered.

By default, bisonc++ mimics bison's behavior in that it adds a $$ = $1 action block to rules not having final action blocks, but not to empty production rules. This default behavior can also explicitly be configured using the default-actions std option or directive.

Bisonc++ also supports alternate ways of handling rules not having final action blocks. When off is specified, bisonc++ does not add $$ = $1 action blocks; when polymorphic semantic values are used, then specifying

When either warn or quiet are specified the types of $$ and $1 must match. When bisonc++ detects a type mismatches it issues errors.

When polymorphic semantic values are used then the default $$ = $1 action probably is less useful than when, e.g., plain %stype semantic values are used. After all, no semantic values are associated with $$. Furthermore, once the production rule has been recognized, the production rule is reduced to the rule's left-hand side non-terminal. Thus, $1 ceases to exists, immediately following the $$ = $1 statement. Therefore, if default actions are used in combination with polymorphic semantic value types they are implemented using move-operations: $$ = std::move($1). However, in this situation default actions can frequently be suppressed, slightly improving the efficency of the generated parser.

When a default action block can be added to a production rule and either warn or quiet was specified, bisonc++ compares the types associated with rule's nonterminal, and with the production rule's first element. The warn and quiet specifications make identical decisions about the action blocks to add, but in addition warn also shows a warning message that the action block is added to the production rule.

4.5.5: %error-verbose: dumping the parser's state stack

Syntax: %error-verbose

The parser's state stack is dumped to the standard error stream when an error is detected by the parse member function. After calling error the stack is dumped from the top of the stack (highest offset) down to its bottom (offset 0). Each stack element is prefixed by the stack element's index.

4.5.6: %expect: suppressing conflict warnings

Syntax: %expect number

Bisonc++ normally warns if there are any conflicts in the grammar (see section 7.1), but many real grammars have harmless shift/reduce conflicts which are resolved in a predictable way and which would be difficult to eliminate. It is desirable to suppress the warning about these conflicts unless the number of conflicts changes. You can do this with the %expect declaration.

The argument number is a decimal integer. The declaration says there should be no warning if there are number shift/reduce conflicts and no reduce/reduce conflicts. The usual warning is given if there are either more or fewer conflicts, or if there are any reduce/reduce conflicts.

In general, using %expect involves these steps:

Now bisonc++ will stop annoying you about the conflicts you have checked, but it will warn you again if changes in the grammar result in a different number or type of conflicts.

4.5.7: %flex: using the traditional `flex++' interface

Syntax: %flex

When provided, the scanner matched text function is called as d_scanner.YYText(), and the scanner token function is called as d_scanner.yylex(). This directive is only interpreted if the %scanner directive is also provided.

4.5.8: %include: splitting the input file

Syntax: %include pathname

This directive is used to switch to pathname while processing a grammar specification. Unless pathname defines an absolute file-path, pathname is searched relative to the location of bisonc++'s main grammar specification file (i.e., the grammar file that was specified as bisonc++'s command-line option). This directive can be used to split long grammar specification files in shorter, meaningful units. After processing pathname processing continues beyond the %include pathname directive.

Bisonc++'s main grammar specification file could simply be:


    %include spec/declarations
    %%
    %include spec/rules
        
where spec/declarations contains declarations and spec/rules contains the rules. Each of the files included using %include may itself use %include directives (which are then processed relative to their locations). The default nesting limit for %include directives is 10, but the option --max-inclusion-depth can be used to change this default.

%include directives should be specified on their own lines, not containing any other information.

4.5.9: %left, %right, %nonassoc: defining operator precedence

Syntax:
%left [ <type> ] terminal(s)
%nonassoc [ <type> ] terminal(s)
%right [ <type> ] terminal(s)
These directives are called precedence directives (see also section 4.5.9 for general information on operator precedence).

The %left, %right, and %nonassoc directives are used to declare tokens and to specify their precedence and associativity, all at once.

The <type> specification is optional, and specifies the type of the semantic value when a token specified to the right of a <type> specification is received. The pointed arrows are part of the type specification; the type itself must be a field of a %union (see section 4.5.33) or it must be a polymorphic tag (see section 4.6.1).

When multiple tokens are listed they must be separated by whitespace or by commas. Note that the precedence directives also serve to define token names: symbolic tokens mentioned with these directives should not be defined using %token directives.

4.5.10: %locationstruct: specifying a dedicated location struct

Syntax: %locationstruct struct-definition

Defines the organization of the location-struct data type LTYPE_. This struct should be specified analogously to the way the parser's stacktype is defined using %union (see below). The location struct type is named LTYPE_. If neither locationstruct nor LTYPE_ is specified, the default LTYPE_ struct is used.

4.5.11: %lsp-needed: using the default location type

Syntax: %lsp-needed

When this directive is specified the standard location stack is added to the generated parser class. The standard location type (defined in the parser's base class) is equal to the following struct:


    struct LTYPE_
    {
        int timestamp;
        int first_line;
        int first_column;
        int last_line;
        int last_column;
        char *text;
    };
           
Note that defining this struct type does not imply that its field are also assigned. Some form of communication with the lexical scanner is probably required to initialize the fields of this struct properly.

4.5.12: %ltype: using an existing location type

%ltype typename

Specifies a user-defined token location type. If %ltype is used, typename should be the name of an alternate default-constructible type (e.g., size_t). It should not be used together with a %locationstruct specification. From inside the parser class, this type may be referred to as LTYPE_.

Any text following %ltype up to the end of the line, up to the first of a series of trailing blanks or tabs or up to a comment-token (// or /*) becomes part of the type definition. Be sure not to end an %ltype definition in a semicolon.

4.5.13: %namespace: using a namespace

Syntax: %namespace namespace

Defines all of the code generated by bisonc++ in the namespace namespace. By default no namespace is defined.

If this options is used the implementation header will contain a commented out using namespace directive for the requested namespace.

In addition, the parser and parser base class header files also use the specified namespace to define their include guard directives.

It is an error to use this directive while an already existing parser-class header file and/or implementation header file does not specify namespace identifier.

4.5.14: %negative-dollar-indices: using constructions like $-1

Syntax: %negative-dollar-indices

Accept (do not generate warnings) zero- or negative dollar-indices in the grammar's action blocks. Zero or negative dollar-indices are commonly used to implement inherited attributes and should normally be avoided. When used they can be specified like $-1, or like $<type>-1, where type is an STYPE_ tag; or a %union field-name. See also sections 4.6.2, 5.6, and 4.6.1.

4.5.15: %no-lines: suppressing `#line' directives

Syntax: %no-lines

Do not put #line preprocessor directives in the file containing the parser's parse function. By default #line preprocessor directives are inserted just before action blocks in the generated parse.cc file.

The #line directives allow compilers and debuggers to associate errors with lines in your grammar specification file, rather than with the source file containing the parse function itself.

4.5.16: %polymorphic: using polymorphism to define multiple semantic values

Syntax: %polymorphic polymorphic-specification(s)

The %polymorphic directive is used to define a polymorphic semantic value class, offering a (preferred) alternative to (traditional) union types.

Refer to section 4.6.1 for a detailed description of the specification, characteristics, and use of polymorphic semantic values.

As a quick reference: to define multiple semantic values using a polymorphic semantic value class offering either an int, a std::string or a std::vector<double> specify:


    %polymorphic INT: int; STRING: std::string; 
                 VECT: std::vector<double>
                
and use %type specifications (cf. section 4.5.32) to associate (non-)terminals with specific semantic value types.

%stype, %union and %polymorphic are mutually exclusive: only one of these directives can be used.

4.5.17: %prec: overruling default precedences

Syntax: %prec token

The construction %prec token may be used in non-empty production rules to overrule the actual precedence of an operator in those production rules. Well known is the construction


    expression:
        '-' expression %prec UMINUS
        {
            ...
        }
                
Here, the default precedence and precedence of the `-' token as the subtraction operator is overruled by the precedence and precedence of the UMINUS token, which is frequently defined as:

    %right UMINUS
                
E.g., a list of arithmetic operators could consists of:

    %left '+' '-'
    %left '*' '/' '%'
    %right UMINUS
        
giving * / and % a higher precedence than + and -, ensuring at the same time that UMINUS is given both the highest precedence and a right-associativity.

In the above production rule the operator order would result in the construction


    '-' expression
        
being evaluated from right to left, having a higher precedence than either the multiplication or the addition operators.

The %prec directive associates priorities to (non-empty) rules. These priorities are interpreted whenever there are (shift-reduce) conflicts. If there are no conflicts, priorities are not required, and are consequently ignored.

When the parser analyzes the above grammar, a conflict is encountered. Consider the following simple grammar, in which only the minus ('-') operator is used, albeit in beinary and unary form:


    %token NR
    %left '-'
    %right UNARY

    expr:
        NR
    |
        expr '-' expr
    |
        '-' expr %prec UNARY
    ;
        
When analuzing this grammar, bisonc++ defines states (cf. chapter 7) defining what to do when encountering certain input tokens. Each possibility is defined by an item, in which a dot (.) is commonly used to show to which point a production rule has been recognized. For the above grammar one such state looks like this:

    0: expr -> '-' expr  .        { '-' <EOF> }
    1: expr -> expr  . '-' expr
        
The elements between parentheses define the look-ahead tokens: the token that may appear next for reducible rules. Item 0 is such a reducible rule, and it is used to reduce '-' expr to an expression (expr).

The second item shows the production rule defining the binary minus operator. Its left-hand side expression has been recognized, and the parser expects to see a minus operator next.

The conflict is caused by the expected minus operator in item 1, and a minus operator that may follow item 0. As there is only one look-ahead symbol (since bisonc++ can only handle LALR(1) grammars) the grammer contains a shift-reduce conflict: shift, and continue with item 1; or reduce, and continue with item 0. In this case, %prec solves the issue, by giving item 0 a higher precedence than item 1 (whose precedence is equal to the precedence of its first terminal token, which is the binary minus operator's precedence).

Although never encountered in real life, it's also possible to give the unary minus operator a lower priority than the binary minus operator. The grammar, in this case, looks like this:


    %token NR
    %right UNARY
    %left '-'

    expr:
        NR
    |
        expr '-' expr
    |
        '-' expr %prec UNARY
    ;
        

With this grammar we encounter a state with these two items:


    0:  expr -> '-' expr  .   { <EOF> } 
    1:  expr -> expr  . '-' expr
        
Here, the conflict no longer manifests itself, as the minus operator no longer appears in item 0's look-ahead set. The resulting parser will, when encountering a minus, shift the minus, and proceed according to item 1, and when anything else is encountered reduce the '-' expr production to expr. In real life this means that an expression like -4 - 3 evaluates to -1.

To illustrate a situation where %prec won't work consider this grammar:


    %token NR
    %left '-'
    %right POSTFIX

    expr:
        NR
    |
        expr '-' expr
    |
        expr '-' %prec POSTFIX
    ;
        
When this grammar is analyzed the following state is encountered:

    0:  expr -> expr '-'  . expr   
    1:  expr -> expr '-'  .         { '-' <EOF> }
    2:  expr ->  . NR   
    3:  expr ->  . expr '-' expr   
    4:  expr ->  . expr '-'   
      
To appreciate why %prec doesn't work here, consider the various look-ahead tokens. For items 0, 3, and 4 the look-ahead token is the non-terminal expr; for item 2 the look-ahead token is the terminal NR, and for item 1, handling the postfix minus operator, it is a minus character. Thus, there isn't any conflict between the shiftable items and the reducible item 1, and consequently the %prec specification isn't used. Any attempt to define a grammar handling a postfix minus operator will fail. A common solution consists of defining a separate operator, explicitly giving it its appropriate priority and associativity. E.g.,

    %token NR
    %left '-'
    %right '_'  // underscore

    expr:
        NR
    |
        expr '-' expr
    |
        expr '_'    // underscore as postfix minus
    ;
        

4.5.18: %print-tokens: displaying tokens and matched text

Syntax: %print-tokens

The %print-tokens directive provides an implementation of the Parser class's print_ function displaying the current token value and the text matched by the lexical scanner as received by the generated parse function.

The print_ function is also implemented if the --print command-line option is provided.

4.5.19: %prompt: show prompts when using debug

Syntax: %prompt

When adding debugging code (using the debug option or directive) debug information is displayed continuously while the parser processes its input. When using the prompt directive the generated parser displays a prompt (a question mark) at each step of the parsing process.

Caveat: when using this option the parser's input cannot be provided at the parser's standard input stream.

Instead of using the %prompt directive the --prompt option can also be used.

4.5.20: %required-tokens: defining the minimum number of tokens between error reports

Syntax: %required-tokens ntokens

Whenever a syntactic error is detected during the parsing process subsequent tokens received by the parsing function may easily cause yet another (spurious) syntactic error. In this situation error recovery in fact produces an avalanche of additional errors. If this happens the recovery process may benefit from a slight modification. Rather than reporting every syntactic error encountered by the parsing function, the parsing function may wait for a series of successfully processed tokens before reporting the next error.

The directive %required-tokens can be used to specify this number. E.g., the specification %required-tokens 10 requires the parsing function to process successfully a series of 10 tokens before another syntactic error is reported (and counted). If a syntactic error is encountered before processing 10 tokens then the counter counting the number of successfully processed tokens is reset to zero, no error is reported, but the error recoery procedure continues as usual. The number of required tokens can also be set using the option --required-tokens. By default the number of required tokens is initialized to 0.

4.5.21: %scanner: using a standard scanner interface

Syntax: %scanner header

Use header as the pathname of a file to include in the parser's class header file. See the description of the --scanner option for details about this directive. When this directive is used a Scanner d_scanner data member is automatically included in the generated parser, while the predefined int lex() member is simply returning d_scanner.lex()'s return value. When, in addition to the %scanner directive the %flex directive was also specified then the function d_scanner.YYText() is called.

Unless double quotes or angular brackets were explixity used, the specified header file will be surrounded by double quotes.

It is an error to use this directive in combination with an already existing parser-class header not including `header'.

4.5.22: %scanner-matched-text-function: define the name of the scanner's member returning the matched texttoken

Syntax: %scanner-matched-text-function function-call

The %scanner-matched-text-function directive defines the scanner function that must be called to obtain the text matching the most recently returned token. By default this is d_scanner.matched().

A complete function call expression should be provided (including a scanner object, if applicable). This option overrules the d_scanner.matched() call used by default when the %scanner directive is specified. Example:


    %scanner-matched-text-function myScanner.matchedText()
                
If the function call expression contains white space then the function-call specification must be surrounded by double quotes (").

Note that an expression is expected, not an expression statement: do not include a final semicolon in the specification.

4.5.23: %scanner-token-function: define the name of the scanner's token function

Syntax: %scanner-token-function function-call

This directive is used to specify how to call the scanner function returning the next token from the parser's lex function. A complete function call expression should be provided (including a scanner object, if applicable). Example:


    %scanner-token-function d_scanner.lex()
                
If the function call contains white space then the function call specification must be surrounded by double quotes.

It is an error to use this directive in combination with an already existing internal header file (.ih file) in which the specified function is not called.

Note that an expression is expected, not an expression statement: do not include a final semicolon in the specification.

4.5.24: %stack-expansion: the number of elements added to the semantic value stack

Syntax: %stack-expansion size

Defines the number of elements to be added to the generated parser's semantic value stack when it must be enlarged. By default 10 elements are added to the stack. This option/directive is interpreted only once, and only if size at least equals the default stack expansion size of 10.

4.5.25: %start: defining the start rule

Syntax: %start nonterminal symbol

By default bisonc++ uses the nonterminal that is defined by the first rule in a grammar specification file as the start symbol. I.e., the parser tries to recognize that nonterminal when parsing input.

This default behavior may be modified using the %start directive. The nonterminal symbol specifies a nonterminal that may be defined anywhere in the rules section of the grammar specification file. This nonterminal then becomes the grammar's start symbol.

4.5.26: %stype: defining a single semantic value type

Syntax: %stype typename

This directive defines the type of the semantic value of tokens. The specified type must be a default constructible type, like size_t or std::string. By default, bisonc++ uses int for the semantic value type of its parser's tokens. To use another single semantic value type , this directive must be used.

In programs using a simple grammar it may be sufficient to use the same data type for the semantic values of all language constructs (see, e.g., sections 6.1 and 6.2).

Any text following %stype up to the end of the line, up to the first of a series of trailing blanks or tabs or up to a comment-token (// or /*) becomes part of the type definition. Be sure not to end a %stype definition in a semicolon.

%stype, %union and %polymorphic are mutually exclusive: only one of these directives can be used.

Sources including the generated parser class header file should refer to the semantic value typename as STYPE_.

4.5.27: %tag-mismatches: check for tag-mismatches with polymorphic semantic values

Syntax: %tag-mismatches on|off

This directive is only interpreted when polymorphic semantic values are used. When on is specified (which is used by default) the parse member of the generated parser dynamically checks that the tag used when calling a semantic value's get member matches the actual tag of the semantic value.

If a mismatch is observed, then the parsing function aborts after displaying a fatal error message. If this happens, and if the option/directive debug was specified when bisonc++ created the parser's parsing function, then the program can be rerun, specifying parser.setDebug(Parser::ACTIONCASES) before calling the parsing function. As a result the case-entry numbers of the switch, defined in the parser's executeAction member, are inserted into the standard output stream. The action case number reported just before the program displays the fatal error message tells you in which of the grammar's action block the error was encountered.

4.5.28: %thread-safe: use thread-local data with polymorphic semantic values

Syntax: %thread-safe

Only used with polymorphic semantic values, and then only required when the same parser is used in multiple threads: it ensures that each thread's polymorphic code only accesses the errors counter (i.e., d_nErrors_) of its own parser.

Instead of using the %thread-safe directive the --thread-safe option can also be used.

4.5.29: %token: defining token names

Syntax:
%token terminal token(s)
%token [ <type> ] terminal token(s)

The %token directive is used to define one or more symbolic terminal tokens. When multiple tokens are listed they must be separated by whitespace or by commas.

The <type> specification is optional, and specifies the type of the semantic value when receiving one of the subsequently named tokens is received. The pointed arrows are part of the type specification; the type itself must be a field of a %union specification (see section 4.5.33) or a tag defined at the %polymorphic directive (see section 4.6.1).

bisonc++ traditionally converted symbolic tokens (including those defined by the precedence directives (cf. section 4.5.9)) into Parser::Tokens_ enumeration values (see section 4.5.2), allowing the lexical scanner to return named tokens as Parser::name. Although this approach is still available, it is deprecated as of bisonc++'s version 6.04.00. Starting with that version the token-path option or directive should be used, by default defining the symbolic tokens in the class Tokens, made available in a separate file which can be included by any class needing access to the grammar's symbolic tokens (cf. section 4.5.35.7).

4.5.29.1: Improper token names

Several identifiers cannot be used as token names as their use would collide with identifiers that are defined in the parser's base class.

In particular,

4.5.30: %token-class: defining the class name containing the semantic tokens

Syntax: %token-class classname

This directive is only processed when the token-path directive or option has also been specified.

Classname defines the name of the Tokens class that is defined when the token-path option (see below) is specified. By default the class name Tokens is used. Assuming the default, then classes that need access to the tokens defined by bisonc++ may derive from Tokens (in which case a token like IDENTIFIER can directly be used, or tokens must be provided with their class context (e.g., Tokens::IDENTIFIER). The enumeration Tokens_ in the Tokens class has public access rights.

4.5.31: %token-namespace: defining the namespce of the Tokens class

Syntax: %token-namespace namespace

This directive is only processed when the token-path directive or option has also been specified.

Namespace defines the namespace of the class containing the Tokens_ enumeration. By default no namespace is used. The namespace defined here can independently be specified from the Parser's namespace (specified by the namespace directive or option).

4.5.32: %type: associating semantic values with (non)terminals

Syntax: %type <type> symbol-list

To associate (non-)terminals with specific semantic value types the %type directive is used.

When %polymorphic is used to specify multiple semantic value types, (non-)terminals can be associated with one of the semantic value types specified with the %polymorphic directive.

When %union is used to specify multiple semantic value types, (non-)terminals can be associated with one of the union fields specified with the %union directive.

With this directive, symbol-list defines of one or more blank or comma delimited grammatical symbols (i.e., terminal and/or nonterminal symbols); type is either a polymorphic type-identifier or a field name defined in the %union specification. The specified nonterminal(s) are automatically associated with the indicate semantic type. The pointed arrows are part of the type specification.

When the semantic value type of a terminal symbol is defined the lexical scanner rather than the parser's actions must assign the appropriate semantic value to d_val_ just prior to returning the token. To associate terminal symbols with semantic values, terminal symbols can also be specified in a %type directive.

4.5.33: %union: using a 'union' to define multiple semantic values

Syntax: %union union-definition body

In the grammars of many programs different types of data are used for different terminal and nonterminal tokens. For example, a numeric constant may need type int or double, while a string needs type std::string, and an identifier might need a pointer to an entry in a symbol table.

Traditionally, the %union directive has always been used to accomplish this. The directive defines a C union-type whose fields specify one or more data types for semantic values. The directive %union is followed by a pair of braces containing one or more field definitions. For example:


    %union {
      double u_val;
      symrec *u_tptr;
    };
        
In this example the two fields represent a double and a symrec *. The associated field names are u_val and u_tptr, which are used in the %token and %type directives to specify types that are associated with terminal or nonterminal symbols (see section 4.5.32).

Notes:

Although the %union directive is still supported by bisonc++, its use is largely superseded by the newer %polymorphic directive, allowing bisonc++ and the C++ compiler to verify that the correct types are used when semantic values are assigned or retrieved, which, in turn, helps preventing run-time errors.

%stype, %union and %polymorphic are mutually exclusive: only one of these directives can be used.

4.5.34: %weak-tags: %polymorphic declaring 'enum Tag_'

Syntax: %weak-tags

By default, the %polymorphic directive declares a strongly typed enum: enum class Tag_, and code generated by bisonc++ always uses the Tag_ scope when referring to tag identifiers. It is often possible (by pre-associating tokens with tags, using %type directives) to avoid the use of tags in user-code.

If tags are explicitly used, then they must be prefixed with the Tag_ scope. Before the arrival of the C++-11 standard strongly typed enumerations didn't exist, and explicit enum-type scope prefixes were usually omitted.

The %weak-tags directive can be specified when the Tag_ enum should not be declared as a strongly typed enum. This directive should not be used, unless you know what you're doing.

4.5.35: Directives controlling the names of generated files

Unless otherwise specified, bisonc++ uses the name of the parser-class to derive the names of most of the files it may generate. Below <CLASS> should be interpreted as the name of the parser's class, Parser by default, but configurable using %class-name (see section 4.5.2).

4.5.35.1: %baseclass-header: defining the parser's base class header

Syntax: %baseclass-header filename

Filename defines the name of the file to contain the parser's base class. This class defines, e.g., the parser's symbolic tokens. Defaults to the name of the parser class plus the suffix base.h. It is always generated, unless (re)writing is suppressed by the --no-baseclass-header and --dont-rewrite-baseclass-header options. This directive is overruled by the --baseclass-header (-b) command-line option.

It is an error to use this directive while an already existing parser class header file does not contain #include "filename".

4.5.35.2: %class-header: defining the parser's class header

Syntax: %class-header filename

Filename defines the name of the file to contain the parser class. Defaults to the name of the parser class plus the suffix .h This directive is overruled by the --class-header (-c) command-line option.

It is an error to use this directive while an already existing parser-class header file does not define class `className' and/or if an already existing implementation header file does not define members of the class `className'.

4.5.35.3: %filenames: specifying a generic filename

Syntax: %filenames filename

Filename is a generic filename, used for all header files generated by bisonc++.

4.5.35.4: %implementation-header: defining the implementation header

Syntax: %implementation-header filename

Filename defines the name of the file to contain the implementation header. It defaults to the name of the generated parser class plus the suffix .ih.

The implementation header should contain all directives and declarations only used by the implementations of the parser's member functions. It is the only header file that is included by the source file containing parse's implementation. User defined implementation of other class members may use the same convention, thus concentrating all directives and declarations that are required for the compilation of other source files belonging to the parser class in one header file.

4.5.35.5: %parsefun-source: defining the parse() function's sourcefile

Syntax: %parsefun-source filename

Filename defines the name of the source file to contain the parser member function parse. Defaults to parse.cc.

4.5.35.6: %target-directory: defining the directory where files must be written

Syntax: %target-directory pathname

Pathname defines the directory where generated files should be written. By default this is the directory where bisonc++ is called.

4.5.35.7: %token-path: defining the path of the file containing the Tokens_ enumeration

Syntax: %token-path pathname

By default the symbolic tokens defined in the grammar specification are collected in the enumeration Tokens_ which is defined in the generated parser's base class. This is a suboptimal procedure in cases where these tokens are also used by other classes. In those cases the tokens should be collected in an escalated class which can be accessed by classes that are independent of the generated Parser class.

The directive %token-path is used for this: it expects a pathname path specification of the file to contain the struct Tokens defining the enumeration Tokens_ containing the symbolic tokens of the generated grammar. If this option is specified the ParserBase class is derived from it, thus making the tokens available to the generated parser class.

The name of the struct Tokens can be altered using the token-class option. By default (if token_path is not specified) the tokens are defined as the enum Tokens_ in the ParserBase class. The filenames and target-directory directive specifications are ignored by token-path.

4.6: The Meaning Of Things: Semantics

A language's grammar rules define its syntax. Its semantics are defined through semantic values that can be associated with terminal and nonterminal symbols, and generally by the actions taken when nonterminals or parts of production rules have been recognized.

For example, the calculator performs real-life calculations because the value associated with each expression is its computed numeric value; it correcly performs addition because the action for an expression like `x + y' is to add those numbers and to return their sum.

Two ways of defining semantics have already been discussed:

A third way for defining semantic values is discussed next (cf. section 4.6.1). Shorthand notations that can be used in action blocks are described next (cf. section ACTIONS). Finally, Action blocks usually appear at the end of production rules. But in fact they can be defined anywhere in production rules. Refer to this section's final subsection (section 4.6.2.4) for the characteristics of such mid-rule action blocks.

4.6.1: Polymorphism and Multiple Semantic Values: `%polymorphic'

Bisonc++ also supports polymorphic semantic values. Polymorphic semantic values were developed as an alternative to using unions. Traditional unions can still be used, but are now considered somewhat `old school'. Polymorphic semantic values were added to bisonc++ as the result of a suggestion originally made by Dallas A. Clement in September 2007.

In this section a simple example program is developed illustrating the use of polymorphic semantic values. The sources of the example can be retrieved from the distribution's poly directory.

One may wonder why a union is still used by bisonc++ as C++ offers inherently superior approaches to combine multiple types into one union type. The C++ way to do so is by defining a polymorphic base class and a series of derived classes implementing the various exclusive data types. The union approach is still supported by bisonc++, mainly for historic reasons as it is supported by bison(1) and bison++; dropping the union would needlessly impede backward compatibility.

The preferred alternative to a union, however, is a polymorphic base class. The example program (cf. poly) uses a polymorphic semantic value type, supporting either int or std::string semantic values. These types are asociated with tags (resp. INT and TEXT) using the %polymorphic directive, discussed next.

4.6.1.1: The %polymorphic directive

Specifying the %polymorphic directive results in a parser using polymorphic semantic values. Polymorphic semantic value specifications consist of a tag, which is a C++ identifier, and a C++ type definition.

Tags and type definitions are separated by colons, and multiple semantic value specifications are separated by semicolons. The semicolon trailing the final semantic value specification is optional.

The %polymorphic directive may be specified only once, and the %polymorphic, %stype and %union directives are mutually exclusive.

Here is an example, defining three semantic values types: int, std::string and std::vector<double>:


    %polymorphic INT: int; STRING: std::string; 
                 VECT: std::vector<double>
        
The identifier to the left of the colon is called the type-tag (or simply `tag'), and the type definition to the right of the colon is called the type-definition. Types specified at the %polymorphic type-definitions must be built-in types or class-type declarations. Since bisonc++ version 4.12.00 the types no longer have to be default-constructible.

As the parser's generic semantic value type is called STYPE_, and as functions called by the parser may return STYPE_ values and may expect STYPE_ arguments, grammar symbols can also be associated with the generic STYPE_ semantic type using %type <STYPE_> directives.

To prevent ambiguities the generic STYPE_ type cannot be specified as a polymorphic type. E.g., a specification like GENERIC: STYPE_ cannot be used when defining the tag/type pairs at the %polymorphic directive.

When polymorphic type-names refer to types that have not yet been declared by the parser's base class header, then these types must be declared in a separate header file, included into the parser's base class header file through the %baseclass-preinclude directive.

4.6.1.2: Code generated by %polymorphic

When polymorphic semantic values are used then, in addition to the parser and parse-base classes, several more classes are defined. Some of them are class templates; most are defined in parserbase.h; an occasional implementation is added to the parse.cc source file.

To minimize namespace pollution most of the extra code is contained in a namespace of its own: Meta_. If the %namespace directive was used then Meta_ is nested under the namespace declared by that directive. The name Meta_ is used because much of the code implementing polymorphic semantic values uses template meta programming.

The enumeration 'enum class Tag_'

One notable exception to the above is the enumeration Tag_. To simplify its use it is declared outside of Meta_ (but inside the %namespace namespace, if specified). Its tags are declared at the %polymorphic directive. Targs_ is a strongly typed enumeration. The %weak-tags directive can be used to declare a pre C++-11 standard `enum Tag_'.

The namespace Meta_

Below, DataType refers to the semantic value's data type that is associated with a Tag_ identifier.

Several classes are defined in the namespace Meta_. The most important class is SType, providing the interface to the semantic value types. The class SType becomes the parser's STYPE_ type. Each SType object is either a default SType object, not containing a semantic value, or it contains a semantic value of a single DataType. All operations related to semantic values are handled by this class.

The class SType provides the following public interface:

4.6.1.3: A parser using a polymorphic semantic value type

In this section a small demo-parser is developed using polymorphic semantic values. Its %polymorphic directive looks like this:

    %polymorphic INT: int; TEXT: std::string;
        
Furthermore, the grammar declares tokens INT and IDENTIFIER, and pre-associates the TEXT tag with the identifier nonterminal, associates the INT tag with the int nonterminal. The combi nonterminal is associated with the generic STYPE_ semantic value type:

    %type <TEXT>    identifier
    %type <INT>     int
    %type <STYPE_> combi
        

The parser's grammar is simple, expecting input lines, formatted according to the following (rule) production rule:


    rule:
        identifier '(' identifier ')' '\n'
    |
        identifier '=' int '\n'
    |
        combi '\n'
    ;
        

The rules for identifier and int return, respectively, text and an int value:


    identifier:
        IDENTIFIER
        {
            $$ = d_scanner.matched();
        }
    ;
    int:
        INT
        {
            $$ = d_scanner.intValue();
        }
    ;
        

These simple assignments can be used as int is pre-associated with the INT tag and identifier is asociated with the TEXT tag.

The combi rule, which is used in one of the production rules of `rule', accepts a single int value, as well as an identifier. So it cannot be associated with a single polymorphic type. But as it is associated with the generic STYPE_ type, it can pass on any polymorphic value. In rule's production rule the generic semantic value is then simply passed on to process, expecting a plain STYPE_ const &. The function process has to inspect the semantic value's tag to learn what kind of value is stored inside the received semantic value. Here are the definition of the combi nonterminal and action blocks for the rule nonterminal:


    combi:          
        int
    |
        identifier
    ;
        
    rule:
        identifier '(' identifier ')' '\n'
        {
            cout << $1 << " " << $3 << '\n';
        }
    |
        identifier '=' int '\n'
        {
            cout << $1 << " " << $3 << '\n';
        }
    |
        combi '\n'
        {
            process($1);
        }
    ;
        

Note that combi's production rules do not define action blocks. The standard way to handle these situations is to add $$ = $1 action blocks to non-empty production rules not defining final action blocks. This works well in the current example, but a default-actions quiet (or warn) option or directive can also be used.

The function process, called from combi's action block, inspects the semantic value's tag to select the proper way of handling the received semantic value. Here is its implementation:


    void Parser::process(STYPE_ &semVal) const
    {
        if (semVal.tag() == Tag_::INT)
            cout << "Saw an int-value: " << semVal.get<Tag_::INT>() << '\n';
        else
            cout << "Saw text: " << semVal.get<Tag_::TEXT>() << '\n';
    }
        

4.6.1.4: A scanner using a polymorphic semantic value type

The scanner recognizes input patterns, and returns Parser tokens (e.g., Parser::INT) matching the recognized input.

It is easily created by flexc++(1) processing the following simple specification file.

%interactive
%filenames scanner

%%

[ \t]+                  // skip white space

[0-9]+                  return Parser::INT;

[a-zA-Z_][a-zA-Z0-9_]*  return Parser::IDENTIFIER;

.|\n                    return matched()[0];





The reader may refer to flexc++(1) documentation for details about flexc++(1) specification files.

4.6.2: Actions and shorthand ($) notations

Often an action accompanies a production rule. It contains C++ code that is executed once that production rule is recognized. The task of most actions is to compute a semantic value for the nonterminal recognized by the production rule, often using the semantic values made available by components of the production rule.

An action consists of C++ statements surrounded by braces, much like a compound statement in C++. Most rules have just one action at the end of the rule, following all the components. Actions in the middle of a rule are tricky and should be used only for special purposes (see section 4.6.2.4).

The components of production rules are numbered, the first component having number 1. E.g., in a production rule


    nonterm:
        first second third
        
first is component #1, second is component #2, third is component #3. C++ code in action blocks may refer to semantic values of these components using dollar-notations, where $i refers to the semantic value of the i^th component.

Likewise, the semantic value of the rule's nonterminal is represented by $$. Here is a typical example:


    exp:
        ...
    |
        exp '+' exp
        { 
            $$ = $1 + $3; 
        }
    |
        ...    
        
This rule constructs exp from two exp nonterminals connected by a plus-sign token. In the action, $1 and $3 represent the semantic values of, respectively, the first (left-hand side) exp component, and the second (right-hand side) exp component.

The sum is assigned to $$, which becomes the semantic value of the exp nonterminal represented by the production rule.

Depending on the specification of the default-actions option/directive (cf. section 4.5.4) bisonc++ may supply non-empty production rules with default action blocks containing the statement $$ = $1: the semantic value of the first component of the production rule is returned as the nonterminal's semantic value. Of course, the default action only is valid if the two data types match. Empty production rules are not provided with default action blocks.

Negative dollar indices (e.g., $-1) are allowed, and refer to semantic values of elements in rules before the component naming the current rule's nonterminal. This is a risky practice, and you should use it only when you know what you're doing. Here is a situation where you can use this reliably:


    vardef:
        type varlist ';'
    ;
    
    varlist:
        varlist ',' variable
        {
            defineVar($-1, $3);
        }
    |
        variable
        {
            defineVar($-1, $1);
        }
    ;
        

As long as varlist is only used in the vardef rule varlist can be sure that the type of the variable is available as the semantic value of the component immediately before (hence: $-1) the varlist component in the vardef rule. See also section 5.6.

In addition to the dollar-notations shown here, bisonc++ supports several more dollar-notations. The next three subsections describe the dollar-notations that are available after specifying, respectively, the %stype, %union or %polymorphic directives.

4.6.2.1: %stype shorthand notations

The %stype directive is used to specify a single semantic value type. By default the semantic value type is int. For single type semantic values several dollar-notations are available. If some of the dollar-notations shown below appear to be redundant: that's because their meanings are identical when using single type semantic values. Their meanings differ, however, when using union or polymorphic semantic values. Furthermore, below the notation $1 is used as a generic reference to semantic values of production rule components. Instead of $1 other available numbered dollar references can also be used.

Here is the overview:

4.6.2.2: %union shorthand notations

The %union directive is used to specify a union of semantic value types. Although each semantic value type can be used for each STYPE_ variable, (non)terminals are in practice associated with a single type. These associations are automatically applied through bisonc++'s dollar-notations.

Note that the %union directive is now largely superseded by the facilities offered by the %polymorphic directive. Using %union is considered somewhat `old school', as %polymorphic implements cleaner type definitions, allowing bisonc++ and the C++ compiler to verify that types are correctly being used.

In the next overview the notation $1 is used as a generic reference to semantic values of production rule components. Instead of $1 other available numbered dollar references can also be used.

4.6.2.3: %polymorphic shorthand notations

The %polymorphic directive is used to specify a series of semantic value types. (Non)terminals can be associated with exactly one of these types, or with the generic STYPE_ semantic value type. When the latter type is used either another STYPE_ value can be assigned to it, or a value of one of the defined polymorphic value types can be assigned to it. At any time an STYPE_ can only hold one single value type. Polymorphic semantic types are type safe: types cannot be confused. Furthermore, as STYPE_ objects are responsible for their own memory management, memory leaks cannot occur, assuming that the different semantic value types do not leak.

In the next overview the available dollar-notations that are available with polymorphic semantic values are described. In this overview $1 is used as a generic reference to semantic values of production rule components. Instead of $1 other available numbered dollar references can also be used.

4.6.2.4: Mid-Rule Action Blocks

Occasionally it is useful to insert an action block somewhere in the middle of a production rule. These so-called mid-rule action blocks are comparable to the usual action blocks, appearing at the end of production rules, but mid-rule action blocks are executed before the parser has recognized the production rule's components that follow them.

A mid-rule action block can refer to the components preceding it using $i, but it may not (cannot) refer to subsequent components because it is executed before they have been observed.

The mid-rule action block itself counts as one of the components of the production rule. In the example shown below, stmnt can be referred to as $6 in the final action block.

Mid-rule action blocks can also have semantic values. When using %union or %polymorphic and a rule's nonterminal is associated with a union field or polymorphic token, then mid-rule action blocks loose those associations. When the $$, $$. or $$-> shorthand notations appear in mid-action blocks of production rules whose nonterminal is associated with a polymorphic type or union field then a warning is issued that automatic type associations do not apply. Using the _$$ shorthand notation prevents the warning from being issued.

Here is an example from a hypothetical grammar, defining a let statement that looks like `this:


    let (variable) statement
        
Here, variable is the name of a variable that must only exists for the statement's lifetime. To parse this construct, we must insert variable into a symbol table before parsing statement, and must remove it afterward. Here is how it is done:

    stmt:   
        LET '(' variable ')'
        {
            _$$.assign<SYMTAB>(symtab());
            addVariable($3); 
        }
        stmt    
        { 
            $$ = $6;
            restoreSymtab($5); 
        }
        
As soon as `let (variable)' has been recognized, the first action is executed. It saves the current symbol table as the mid-rule's semantic value, using the polymorphic tag SYMTAB (which could be associated with, e.g., an std::unordered_map). Then addVariable receives the new variable's name, adding it to the current symbol table. Once the first action is finished, the embedded statement (stmt) is parsed. Note that the mid-rule action is component number 5, so `stmt' is component number 6.

Once statement has been parsed, its semantic value is returned as the semantic value of the production rule's nonterminal. Then the semantic value from the mid-rule action block is used to restore the symbol table to its original state. This removes the temporary let-variable from the list so that it won't appear to exist while the rest of the program is parsed.

Defining mid-rule action blocks before a rule has completely been recognized often leads to conflicts since the parser, because of the single look-ahead tokoen, must make a decision about the parsing sequence to use. For example, the following two rules, without mid-rule actions, can coexist because the parser can always process the open-brace token and only then look at the next token before deciding whether there is a declaration or not:


    compound: 
        '{' declarations statements '}'
    | 
        '{' statements '}'
    ;
        
But when we an action block is inserted before the first open-parenthesis a conflict is introduced:

    compound: 
        { 
            prepareForLocalVariables(); 
        }
        '{' declarations statements '}'
    | 
        '{' statements '}'
    ;
        
Now the parser, encountering the open-parenthesis, must decide whether the open-parenthesis belongs to the first or second production rule. It must make this decision because it must execute the statement in the action block if it selects the first production rule, while no action is required when selecting the second production rule (cf. section 7.0.3).

This problem cannot be solved by putting identical actions into the two rules, like so:


        { 
            prepareForLocalVariables(); 
        }
        '{' declarations statements '}'
    | 
        { 
            prepareForLocalVariables(); 
        }
        '{' statements '}'
    ;
        
But this does not solve the issue, because bisonc++ considers all action blocks as unique elements of production rules, and does not inspect the action blocks' contents.

A solution to the above problem consists of hiding the action block in a support nonterminal symbol which recognizes the first block-open brace and then performs the required preparations:


    openblock:
        '{'
        { 
            prepareForLocalVariables(); 
        }
    ;

    compound: 
            openblock declarations statements '}'
    | 
            openblock statements '}'
    ;
        
Now bisonc++ can execute the action in the rule for subroutine without deciding which rule for compound it eventually uses. Note that the action is now at the end of its rule. Any mid-rule action can be converted to an end-of-rule action in this way, and this is what bisonc++ actually does to implement mid-rule actions. Bisonc++ converts mid-rule action blocks to hash-tag numbered elements in production rules. In a rule like

    string:
        {
            ...
        }
        TOKEN
        ...
    ;
        
warning messages referring to the mid-rule action block could look like this:

    [grammar: warning] Line 10: `rule 3: string ->  #0001': ...
        
Here, `#0001' is the hidden nonterminal merely containing the mid-rule action block. It is as though we had written this grammar:

    string:
        #0001 TOKEN
        ...
    ;

    #0001:
        {
            ...
        }
    ;
        

4.7: Basic Grammatical Constructions

In the following sections several basic building blocks for writing grammars are presented in their prototypical forms. When these building blocks are used to write a grammar, the resulting grammar is usually accepted by bisonc++. Moreover, these building blocks are frequently encountered in programming languages. When designing your own grammar, try to stick as closely as possible to the following basic forms.

4.7.1: Plain Alternatives

Simple alternatives can be specified using the vertical bar (|). Each alternative should begin with a unique identifying terminal token. The terminal token may actually be hidden in a nonterminal rule, in which case that nonterminal can be used as an alias for the nonterminal. In fact identical terminal tokens may be used if at some point the terminal tokens differ over different alternatives. Here are some examples:

    // Example 1:  plain terminal distinguishing tokens
    expr:
        ID        
    |
        NUMBER
    ;

    // Example 2:  nested terminal distinguishing tokens
    expr:
        id
    |
        number
    ;

    id:
        ID
    ;

    number:
        NUMBER
    ;

    // Example 3:  eventually diverting routes
    expr:
        ID
        id
    |
        ID
        number
    ;

    id:
        ID
    ;

    number:
        NUMBER
    ;
        

4.7.2: One Or More Alternatives, No Separators

A series of elements normally use left-recursion. For example, C++ supports string concatenation: series of double quote delimited ASCII characters define a string, and multiple white-space delimited strings are handled as one single string:

    "hello"         // multiple ws-delimited strings
    " " 
    "world"

    "hello world"   // same thing
        
Usually a parser is responsible for concatenating the individual string-parts, receiving one or more STRING tokens from the lexical scanner. A string rule handles one or more incoming STRING tokens:

    string:
        STRING
    |
        string STRING
        
The above rule can be used as a prototype for recognizing a series of elements. The token STRING may of course be embedded in another rule. The generic form of this rule could be formulated as follows:

    series:
        unit
    |
        series unit
        
Note that the single element is first recognized, whereafter the left-recursive alternative may be recognized repeatedly.

4.7.3: Zero Or More Alternatives, No Separators

An optional series of elements also use left-recursion, but the single element alternative remains empty. For example, in C++ a compound statement may contain statements or declarations, but it may also be empty:

    opt_statements:
        // empty
    |
        opt_statements statements
        
The above rule can be used as a prototype for recognizing a series of elements: the generic form of this rule could be formulated as follows:

    opt_series:
        // empty
    |
        opt_series unit
        
Note that the empty element is recognized first, even though it is empty, whereafter the left-recursive alternative may be recognized repeatedly. In practice this means that an action block may be defined at the empty alternative, which is then executed prior to the left-recursive alternative. Such an action block could be used to perform initializations necessary for the proper handling of the left-recursive alternative.

4.7.4: One Or More Alternatives, Using Separators

A series of elements which are separated from each other using some delimiter again normally uses left-recursion. For example, a C++ variable definition list consists of one or more identifiers, separated by comma's. If there is only one identifier no comma is used. Here is the rule defining a list using separators:

    variables:
        IDENTIFIER
    |
        variables ',' IDENTIFIER
        
The above rule can be used as a prototype for recognizing a series of delimited elements. The generic form of this rule could be formulated as follows:

    series:
        unit
    |
        series delimiter unit
        
Note that the single element is first recognized, whereafter the left-recursive alternative may be recognized repeatedly. In fact, this rule is not really different from the standard rule for a series, which does not hold true for the rule to recognize an optional series of delimited elements, covered in the next section.

4.7.5: Zero Or More Alternatives, Using Separators

An optional series of elements, separated from each other using delimiters occurs frequently in programming languages. For example, C++ functions have parameter lists which may or may not require arguments. Since a parameter list may be defined empty, an empty alternative is required. However, a simple generalization of the optional series construction (section 4.7.3) won't work, since that would imply that the first argument is preceded by a separator, which is clearly not the intention. So, the following construction is wrong:

    opt_parlist:
        // empty
    |
        opt_parlist ',' parameter
        
To define an optional series of delimited elements two rules are required: one rule handling the optional part, the other the delimited series of elements. So, the correct definition is as follows:

    opt_parlist:
        // empty
    |
        parlist
    ;

    parlist:
        parameter
    |
        parlist ',' parameter
    ;
        
Again, the above rule pair can be used as a prototype for recognizing an optional series of delimited elements. The generic form of these rules could be formulated as follows:

    opt_series:
        // empty
    |
        series
    ;

    series:
        element
    |
        series delimiter element
        
Note that the opt_series rules neatly distinguishes the no-element case from the case were elements are present. Usually these two cases need to be handled quite differently, and the opt_series rules empty alternative easily allows us to recognize the no-elements case.

4.7.6: Nested Blocks

Finally, we add the nested rule to our bag of rule-tricks. Again, nested rules appear frequently: parenthesized expressions and compound statements are two very well known examples. These kind of rules are characterized by the fact that the nested variant is itself an example of the element appearing in the nested variant. The definition of a statement is actually a bit more complex than the definition of an expression, since the statement appearing in the compound statement is in fact an optional series of elements. Let's first have a look at the nested expression rule. Here it is, in a basic form:

    expr:
        NUMBER
    |
        ID
    |
        expr '+' expr
    |
        ...
    |
        '(' expr ')'
    ;
        
This definition is simply characterized that the nonterminal expr appears within a set of parentheses, which is not too complex.

The definition of opt_statements, however, is a bit more complex. But acknowledging the fact that a statement contains among other elements a compound statement, and that a compound statement, in turn, contains opt_statements an opt_statements construction can be formulated accordingly:


    opt_statements:         // define an optional series
        // empty
    |
        opt_statements statement
    ;
        
    statement:              // define alternatives for `statement'
        expr_statement
    |
        if_statement
    |
        ...
    |
        compound_statement
    ;

    compound_statement:     // define the compound statement itself
        '{' opt_statements '}'
    ;
        

4.8: Multiple Parsers in the Same Program

Most programs that use bisonc++ parse only one language and therefore only need one bisonc++ parser. But what if you want to parse more than one language with the same program? Since bisonc++ constructs a class rather than a parsing function, this problem can easily be solved: simply define your second (third, fourth, ...) parser class, each having its own unique class-name, using the %class-name directive, and construct parser objects of each of the defined classes.

In multi-threaded programs each thread can create its own parser object, and then call that object's parse function: bisonc++ does not use static data that are modified by the generated parser. In multi-threaded programs, when the parser uses polymorphic semantic values, the option --thread-safe or the directive %thread-safe should be specified to ensure that the code handling the polymorphic semantic data types accesses the error-counter data member of their own parser object.