Chapter 3: Format of the input file

The flexc++ input file consists of two sections, separated by a line containing `%%'. The section above %% contains option specifications and definitions; the section below %% contains the regular expressions (and their (optional) actions). The general layout of flexc++'s input file, therefore, looks like this:


definitions
%%
rules
    

Optionally, a final line containing `%%' may follow the rules. The following sections cover the `definitions' and `rules' sections.

3.1: Definitions section

Flexc++ supports command-line options and input-file directives that control its behavior. Directives are covered in the next section (3.1.1); options are covered in section 1.1.1.

The definitions section may also contain declarations of named regular expressions. A named regular expression looks like this:

name   pattern

Here, name is an identifier, which may also contain hyphens (-); `pattern' is a regular expression (see section 3.4). Patterns start at the first non-blank character following the name, and end at the line's last non-blank character. A named regular expression cannot contain comment.
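The rule for delimiting the pattern can be sketched in C++. The helper below is hypothetical and is not part of flexc++; it merely assumes that the name is the first blank-delimited word on the line, and then applies the rule stated above to find the pattern:

```cpp
#include <string>
#include <utility>

// Splits a definition line into {name, pattern}. Hypothetical helper for
// illustration only; it does not validate the name or the pattern.
std::pair<std::string, std::string> splitDefinition(std::string const &line)
{
    std::string const blanks = " \t";

    std::size_t nameBegin = line.find_first_not_of(blanks);
    if (nameBegin == std::string::npos)
        return {};                                  // empty or blank line

    std::size_t nameEnd = line.find_first_of(blanks, nameBegin);
    if (nameEnd == std::string::npos)
        return {line.substr(nameBegin), ""};        // a name without pattern

    // the pattern starts at the first non-blank character after the name
    std::size_t patBegin = line.find_first_not_of(blanks, nameEnd);
    if (patBegin == std::string::npos)
        return {line.substr(nameBegin, nameEnd - nameBegin), ""};

    // and ends at the line's last non-blank character
    std::size_t patEnd = line.find_last_not_of(blanks);

    return {
        line.substr(nameBegin, nameEnd - nameBegin),
        line.substr(patBegin, patEnd - patBegin + 1)
    };
}
```

Note that blanks inside the pattern are kept: only the blanks following the name and the blanks at the end of the line are dropped.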

Finally, the definitions section may be used to declare mini-scanners (a.k.a. start conditions), cf. section 3.7. Start conditions are very useful for defining small `sub-languages' inside the language whose tokens must be recognized by the scanner. A commonly encountered example is the start condition recognizing C style multi-line comment.

3.1.1: Directives

Some directives require arguments, which usually follow an optional separating = character. Arguments of directives are text, surrounded by double quotes (strings), or embedded in raw string literals (rawstrings). Double quotes or backslashes inside strings must themselves be preceded by backslashes; these backslashes are not required when rawstrings are used.
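The difference between the two notations can be illustrated using plain C++ literals, which follow the same escaping idea. This is only an analogy: it is assumed here, not confirmed by the text above, that flexc++'s rawstrings use the same R"( ... )" notation as C++:

```cpp
#include <string>

// An ordinary string literal: embedded double quotes and backslashes
// must each be preceded by a backslash.
std::string const escaped = "say \"hi\" in C:\\tmp";

// The same text as a raw string literal: no escaping is needed.
std::string const raw = R"(say "hi" in C:\tmp)";
```

Both objects contain the same text; the raw version is easier to read when the argument contains many quotes or backslashes.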

The %s and %x directives are immediately followed by name lists, consisting of identifiers separated by blanks. Here is an example of the definition of a directive:


    %class-name = "MyScanner"
        

Directives accepting a `filename' do not accept path names, i.e., they cannot contain directory separators (/); directives accepting a `pathname' may contain directory separators. A `pathname' using blank characters should be surrounded by double quotes.

Some directives may generate errors. This happens when a directive conflicts with the contents of an existing file which flexc++ cannot modify (e.g., a scanner class header file exists and doesn't define a namespace, but a %namespace directive was provided). To resolve the error the offending directive could be omitted, the existing file could be removed, or the existing file could be hand-edited according to the directive's specification. Note that flexc++ currently does not handle the opposite error condition: if a previously used directive is omitted, then flexc++ does not detect the inconsistency. In those cases you may encounter compilation errors.

3.2: Rules section

The rules section of the flexc++ input file contains rules of the form:

pattern    action

Action is optional, and is separated from pattern by spaces and/or tabs. It consists of a single-line C++-statement, or it consists of a compound statement that may span several lines.

Alternatively, an action may consist of a vertical bar (`|'). A vertical bar indicates that pattern uses the same action as the next rule.

3.3: Comment

Comment may be used almost everywhere in flexc++'s input file. Both traditional C-style multi-line comment (i.e., /* ... */) and C++ style end-of-line comment (i.e., // ...) can be used. Indentation is optional.

When comment is encountered outside of an action, flexc++ discards the comment, while all comment provided in the context of actions is copied verbatim to the generated source file.

Comment cannot be used when defining named regular expressions in the definitions section.

3.4: Patterns

The patterns in the input (see Rules Section 3.2) are written using an extended set of regular expressions. These are:

Once a character class has started, all subsequent character (ranges) are added to the set, until the final closing bracket (]) has been reached.

Operator precedence

The operators used in specifying regular expressions have the following priorities (listed from lowest to highest):

Different from the lex-standard, but in line with most other regular expression engines, the interval operator is given higher precedence than concatenation. To require two repetitions of the word hello use (hello){2} rather than hello{2}, which to flexc++ is identical to the regular expression helloo.

Named regular expressions have the same precedence as parenthesized regular expressions. So after


    WORD  xyz[a-zA-Z]
    %%
    {WORD}{2}
        
the input xyzaxyzb is matched, whereas xyzab isn't.
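The precedence rule can be tried out without running flexc++. The sketch below uses std::regex (ECMAScript grammar) as a stand-in, since it also gives the interval operator a higher precedence than concatenation. It is a different engine than flexc++'s, and named regular expressions are not available in it, so {WORD}{2} is written out as a parenthesized expression:

```cpp
#include <regex>
#include <string>

// Returns true if all of 'text' is matched by 'pattern'.
// Uses std::regex as a stand-in; this is not flexc++'s matcher.
bool fullMatch(std::string const &pattern, std::string const &text)
{
    return std::regex_match(text, std::regex(pattern));
}
```

With this helper fullMatch("hello{2}", "helloo") and fullMatch("(hello){2}", "hellohello") are true, while fullMatch("hello{2}", "hellohello") is false. Likewise, fullMatch("(xyz[a-zA-Z]){2}", "xyzaxyzb") is true and fullMatch("(xyz[a-zA-Z]){2}", "xyzab") is false.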

In addition to characters and ranges of characters, character classes can also contain predefined character sets. These consist of certain names between [: and :] delimiters. The predefined character sets are:

     
         [:alnum:] [:alpha:] [:blank:]
         [:cntrl:] [:digit:] [:graph:]
         [:lower:] [:print:] [:punct:]
         [:space:] [:upper:] [:xdigit:]

These predefined sets designate sets of characters equivalent to the corresponding standard C isXXX function. For example, [:alnum:] defines all characters for which isalnum returns true.

As an illustration, the following character classes are equivalent:

 
         [[:alnum:]]
         [[:alpha:][:digit:]]
         [[:alpha:][0-9]]
         [a-zA-Z0-9]
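The equivalence of these four character classes can be checked using the corresponding isXXX functions. The following sketch assumes the default "C" locale and only inspects the 7-bit ASCII range; the function name is made up for this illustration:

```cpp
#include <cctype>

// Returns true if the four character classes shown above select the same
// characters in the range 0..127 (assumption: the "C" locale is active).
bool classesAgree()
{
    for (int ch = 0; ch < 128; ++ch)
    {
        bool alnum      = std::isalnum(ch) != 0;                // [[:alnum:]]
        bool alphaDigit = std::isalpha(ch) || std::isdigit(ch); // [[:alpha:][:digit:]]
        bool alphaRange = std::isalpha(ch) ||                   // [[:alpha:][0-9]]
                          (ch >= '0' && ch <= '9');
        bool ranges     = (ch >= 'a' && ch <= 'z') ||           // [a-zA-Z0-9]
                          (ch >= 'A' && ch <= 'Z') ||
                          (ch >= '0' && ch <= '9');

        if (alnum != alphaDigit || alnum != alphaRange || alnum != ranges)
            return false;
    }
    return true;
}
```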
    

Note that a negated character class like [^A-Z] matches a newline unless \n (or an equivalent escape sequence) is one of the characters explicitly present in the negated character class (e.g., [^A-Z\n]). This differs from the way many other regular expression engines treat negated character classes. Matching newlines means that a pattern like [^"]* can match the entire input unless there's another quote in the input.
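The effect can be demonstrated with std::regex, used here merely as a stand-in whose negated character classes also match newlines; it is not flexc++'s matcher, and the helper's name is hypothetical:

```cpp
#include <regex>
#include <string>

// Returns true if all of 'text' is matched by 'pattern' (std::regex,
// ECMAScript grammar; used for illustration only).
bool matchesWhole(std::string const &pattern, std::string const &text)
{
    return std::regex_match(text, std::regex(pattern));
}
```

Here matchesWhole("[^\"]*", "first line\nsecond line") is true: the negated class runs across the newline. After adding the newline to the class, matchesWhole("[^\"\n]*", "first line\nsecond line") is false.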

Flexc++ allows negation of character class expressions by prepending ^ to the name of a predefined character set. Here are the negated predefined character sets:

                
         [:^alnum:] [:^alpha:] [:^blank:]
         [:^cntrl:] [:^digit:] [:^graph:]
         [:^lower:] [:^print:] [:^punct:]
         [:^space:] [:^upper:] [:^xdigit:]
    

The `{+}' operator computes the union of two character classes. For example, [a-z]{+}[0-9] is the same as [a-z0-9].

The `{-}' operator computes the difference of two character classes. For example, [a-c]{-}[b-z] represents all the characters in the class [a-c] that are not in the class [b-z] (which in this case, is just the single character `a').
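Both operators are ordinary set operations. A minimal sketch, representing character classes as std::set<char> objects (the names CharSet, range, unite and subtract are chosen for this illustration and do not occur in flexc++):

```cpp
#include <algorithm>
#include <iterator>
#include <set>

using CharSet = std::set<char>;

// The character class [first-last]
CharSet range(char first, char last)
{
    CharSet ret;
    for (int ch = first; ch <= last; ++ch)
        ret.insert(static_cast<char>(ch));
    return ret;
}

// lhs{+}rhs: all characters that are in lhs or in rhs
CharSet unite(CharSet const &lhs, CharSet const &rhs)
{
    CharSet ret;
    std::set_union(lhs.begin(), lhs.end(), rhs.begin(), rhs.end(),
                   std::inserter(ret, ret.end()));
    return ret;
}

// lhs{-}rhs: all characters that are in lhs but not in rhs
CharSet subtract(CharSet const &lhs, CharSet const &rhs)
{
    CharSet ret;
    std::set_difference(lhs.begin(), lhs.end(), rhs.begin(), rhs.end(),
                        std::inserter(ret, ret.end()));
    return ret;
}
```

For example, subtract(range('a', 'c'), range('b', 'z')) only contains 'a', and unite(range('a', 'z'), range('0', '9')) contains the 36 characters of [a-z0-9].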

A rule can have at most one instance of trailing context (the / operator or the $ operator). The start condition, ^, and <<EOF>> patterns can only occur at the beginning of a pattern and, like / and $, cannot be grouped inside parentheses. A ^ which does not occur at the beginning of a rule or a $ which does not occur at the end of a rule loses its special properties and is treated as a normal character.

The following are invalid:

                
         foo/bar$
         <sc1>foo<sc2>bar
    
Note that the first of these can be rewritten `foo/bar\n'.

If the desired meaning is a `foo' or a `bar'-followed-by-a-newline, the following could be used (the special | action is explained below, see section 3.6):

                
         foo      |
         bar$     /* action goes here */
    
A comparable definition can be used to match a `foo' or a `bar'-at-the-beginning-of-a-line.

3.5: Character constants

Character constants are surrounded by single quote characters. They match single characters which, however, can be specified in various ways.

Considering the above, a character including its surrounding quotes (in this example: any character except for the newline character) can be matched by a regular expression consisting of an escaped quote character, followed by any character, followed by a quote character:


    \'.'        // matches characters surrounded by quotes
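The same idea can be tried out using std::regex, which serves as a stand-in here and is not flexc++'s matcher. In its ECMAScript grammar the single quote needs no escape, and the dot does not match a newline either. The function name is hypothetical:

```cpp
#include <regex>
#include <string>

// Returns true if 'text' consists of one character (not a newline)
// between single quotes.
bool isQuotedChar(std::string const &text)
{
    static std::regex const pattern("'.'");
    return std::regex_match(text, pattern);
}
```

So isQuotedChar("'a'") is true, whereas a quoted newline or a text like 'ab' is rejected.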
        

3.6: Actions

As described in Section 3.2, the second section of the flexc++ input file contains rules: pairs of patterns and (optional) actions.

Specifications of patterns end at the first unescaped white space character; the action then starts at the first non-white space character. It usually contains C++ code, with two exceptions: the empty and the bar (|) action (see below). If the C++ code starts with a brace ({), the action can span multiple lines until the matching closing brace (}) is encountered. Flexc++ correctly handles braces in strings and comments.

Actions can be empty (omitted). Empty actions discard the matched pattern. To avoid confusion it is advised to provide at least a simple comment stating that the matched input is ignored.

The bar action is an action containing only a single vertical bar (|). This tells flexc++ to use the action of the next rule. This can be repeated so the following rules all use the same action:


    a   |
    b   |
    c   std::cout << "Matched " << matched() << "\n";
        
Actions can return an int value, which is usually interpreted as a token by the program calling the scanner's lex member. When lex is called after it has returned it continues its pattern-matching process just beyond the last-matched point in the input stream.
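The way repeated calls of lex continue beyond the last-matched point can be illustrated with a small hand-written class. ToyScanner below is not code generated by flexc++, and its token values are made up; it only mimics the calling convention: blanks are discarded (comparable to an empty action), and every other match ends in a return statement:

```cpp
#include <cctype>
#include <string>
#include <utility>

// Toy stand-in for a generated scanner, for illustration only.
class ToyScanner
{
    std::string d_input;
    std::size_t d_pos = 0;              // just beyond the last-matched point
    std::string d_matched;

    // 0: white space, 1: digit, 2: any other character
    static int kind(char ch)
    {
        unsigned char uch = static_cast<unsigned char>(ch);
        return std::isspace(uch) ? 0 : std::isdigit(uch) ? 1 : 2;
    }

    public:
        enum Token { END = 0, WORD = 256, NUMBER };     // hypothetical values

        explicit ToyScanner(std::string input)
        :
            d_input(std::move(input))
        {}

        std::string const &matched() const
        {
            return d_matched;
        }

        int lex()
        {
            while (d_pos < d_input.size())
            {
                int type = kind(d_input[d_pos]);
                if (type == 0)          // like an empty action: discard
                {
                    ++d_pos;
                    continue;
                }

                std::size_t begin = d_pos;
                while (d_pos < d_input.size() && kind(d_input[d_pos]) == type)
                    ++d_pos;

                d_matched = d_input.substr(begin, d_pos - begin);
                return type == 1 ? NUMBER : WORD;   // the action's return
            }
            return END;                 // no more input
        }
};
```

Given the input "ab 12 cd", three subsequent calls of lex return WORD, NUMBER and WORD (with matched() returning ab, 12 and cd, respectively); the fourth call returns END.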

3.7: Start conditions (Mini scanners)

Flexc++ uses regular expressions to generically describe textual patterns. Often a flexc++ specification file uses multiple `sub-languages' having specialized tasks: a sub-language to describe the normal structure of the input, a sub-language to describe comment, a sub-language to describe strings, etc.

For flexible handling of these sub-languages flexc++, like flex, offers start conditions, a.k.a. mini scanners. A start condition can be declared in the definition section of the lexer file:


%x  string
%%
...
    
A %x is used to declare exclusive start conditions. Following %x a list (no commas) of start condition names is expected. Rules specified for exclusive start conditions only apply to that particular mini scanner. It is also possible to define inclusive start conditions using %s. Rules not explicitly associated with a start condition (or with the (default) start condition StartCondition_::INITIAL) also apply to inclusive start conditions.

A start condition is used in the rules section of the lexical scanner specification file as indicated in section 3.4. Here is a concrete example:


%x string
%%

\"              {
                    more();
                    begin(StartCondition_::string);
                }

<string>{
    \"          {
                    begin(StartCondition_::INITIAL);
                    return Token::STRING;
                }
    \\.|.       more();
}
    
This tells flexc++ that the double quote starts (begins) the StartCondition_::string start condition. The string start condition's rules then define what happens to double quoted strings. All its characters are collected, and eventually the string's content is returned by matched().

By default, scanners generated by flexc++ start in the StartCondition_::INITIAL start condition. When encountering a double quote, the scanner switches to the StartCondition_::string mini scanner. Now, only the rules that are defined for the string start condition are active. Once flexc++ encounters an unescaped double quote, it switches back to the StartCondition_::INITIAL start condition and returns Token::STRING to its caller, indicating that it has seen a C string.
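The switching between the two start conditions can be emulated by a small hand-written state machine. The function below is not what flexc++ generates; it is a simplified sketch in which all input outside of strings is ignored, and in which (unlike the . pattern) newlines inside strings are collected as well. The names StartCondition and stringTokens are chosen for this illustration:

```cpp
#include <string>
#include <vector>

enum class StartCondition { INITIAL, string };

// Returns the double quoted strings (quotes included) found in 'input'.
std::vector<std::string> stringTokens(std::string const &input)
{
    std::vector<std::string> tokens;
    std::string matched;                        // text collected so far
    StartCondition sc = StartCondition::INITIAL;

    for (std::size_t idx = 0; idx != input.size(); ++idx)
    {
        char ch = input[idx];

        if (sc == StartCondition::INITIAL)
        {
            if (ch == '"')                      // rule \" in INITIAL
            {
                matched = ch;                   // more(): keep the quote
                sc = StartCondition::string;    // begin(string)
            }
            continue;                           // other input: ignored here
        }

        matched += ch;                          // more(): collect the char.
        if (ch == '\\' && idx + 1 != input.size())
            matched += input[++idx];            // rule \\. : escaped char.
        else if (ch == '"')                     // rule \" in string
        {
            tokens.push_back(matched);          // like returning STRING
            sc = StartCondition::INITIAL;       // begin(INITIAL)
        }
    }
    return tokens;
}
```

For the input x "a\"b" y a single token is returned, containing "a\"b" including its surrounding quotes: the escaped double quote does not end the string.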

There is nothing special about either the function begin(StartCondition_) or the StartCondition_ enum itself. They can be used anywhere within the Scanner class. E.g., after providing the Scanner class with a std::stack<StartCondition_> d_scStack start conditions can be stacked. Calling member begin could be embedded in a member Scanner::push(StartCondition_) like this:


    void Scanner::push(StartCondition_ next)
    {
        d_scStack.push(startCondition()); // push the current SC.
        begin(next);                      // switch to the next
    }
        
In addition, for returning to the start condition currently on top of the stack simply call a member Scanner::popStartCondition(), implemented like this:

    void Scanner::popStartCondition()
    {
        begin(d_scStack.top());
        d_scStack.pop();
    }
        
push and popStartCondition should be given the same access rights as begin: they should be defined in the private section of the Scanner class.
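To experiment with the stacking logic without generating a scanner, a mock class can be used. In the sketch below Scanner is a hand-written stand-in, not the class generated by flexc++: begin and startCondition merely store and return a value, the comment start condition is made up, and all members are public so they can be called directly from a test:

```cpp
#include <stack>

enum class StartCondition_ { INITIAL, string, comment };

// Mock replacing the generated Scanner class, for illustration only.
class Scanner
{
    StartCondition_ d_current = StartCondition_::INITIAL;
    std::stack<StartCondition_> d_scStack;

    public:
        StartCondition_ startCondition() const      // the current SC
        {
            return d_current;
        }

        void begin(StartCondition_ next)            // switch to 'next'
        {
            d_current = next;
        }

        void push(StartCondition_ next)
        {
            d_scStack.push(startCondition());       // push the current SC.
            begin(next);                            // switch to the next
        }

        void popStartCondition()                    // requires a prior push
        {
            begin(d_scStack.top());
            d_scStack.pop();
        }
};
```

After push(StartCondition_::string) followed by push(StartCondition_::comment), calling popStartCondition twice returns the scanner to the string start condition and then to INITIAL.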

3.7.1: Notation details

Instead of using a mini scanner compound statement, it is also possible to define rules using explicit start condition specifications (cf. section 3.4). Here is the string start condition once again, now using explicit start condition specifications:

%x string
    
%%

\"              {
                    more();
                    begin(StartCondition_::string);
                }
<string>\"      {
                    begin(StartCondition_::INITIAL);
                    return Token::STRING;
                }
<string>\\.|.   more();
    

3.8: Members

The Scanner class offers the following members, which can be called from within actions (or by members called from those actions):

3.9: Handling input your own way

Assuming that the scanner class is called `Scanner', the class Input is nested within the class `ScannerBase'. The stream from which flexc++ retrieves characters is completely decoupled from the pattern-matching algorithm implemented in the ScannerBase class. The pattern-matching algorithm retrieves the next character from a class Input, nested under ScannerBase. This class will usually provide all the required functionality, but users of flexc++ may optionally provide their own Input class.

In situations where the default Input implementation doesn't suffice, simply `roll your own': implement the following interface and use the %option input-interface and %option input-implementation options in the lexer file to include, respectively, your own class Input interface in the generated Scannerbase.h file and your Input member function implementations in the generated lex.cc file.

When implementing your own class Input, the following public interface must at least be provided:


    class Input
    {
        public:
            Input();
                                            // dynamically allocated iStream
            Input(std::istream *iStream, size_t lineNr = 1);   
            size_t get();                   // the next character
            size_t lineNr() const;          
            size_t nPending() const;          
            void setPending(size_t nPending);          
            void reRead(size_t ch);         // push back 'ch' (if <= 0x100)
                                            // push back str from idx 'fmIdx'
            void reRead(std::string const &str, size_t fmIdx);

            void close();                 // delete dynamically allocated
    };
        
This interface may be augmented with additional members, but the shown interface is used by ScannerBase. Flexc++ places Input in ScannerBase's private interface and all communication with Input is handled by ScannerBase. Input's members must perform the following tasks: