This is Info file flex.info, produced by Makeinfo-1.55 from the input
file /root/flex-2.4.7/MISC/flex.texinfo.

START-INFO-DIR-ENTRY

* flex: (flex).
		Fast lexical analyzer generator.

END-INFO-DIR-ENTRY


File: flex.info,  Node: Performance,  Next: Incompatibilities,  Prev: Invoking,  Up: Top

Performance Considerations
**************************

   The main design goal of `flex' is that it generate high performance
scanners.  It has been optimized for dealing well with large sets of
rules.  Aside from the effects of table compression on scanner speed
outlined above, there are a number of options/actions which degrade
performance.  These are, from most expensive to least:

     `REJECT'
     
     pattern sets that require backtracking
     arbitrary trailing context
     
     `^' beginning-of-line operator
     `yymore'

with the first three all being quite expensive and the last two being
quite cheap.

   `REJECT' should be avoided at all costs  when  performance  is
important.  It is a particularly expensive option.

   Getting rid of backtracking is messy and often may be an enormous
amount of work for a complicated scanner.  In principle, one begins by
using the `-b' flag to generate a `lex.backtrack' file.  For example,
on the input

     %%
     foo        return TOK_KEYWORD;
     foobar     return TOK_KEYWORD;

the file looks like:

     State #6 is non-accepting -
      associated rule line numbers:
            2       3
      out-transitions: [ o ]
      jam-transitions: EOF [ \001-n  p-\177 ]
     
     State #8 is non-accepting -
      associated rule line numbers:
            3
      out-transitions: [ a ]
      jam-transitions: EOF [ \001-`  b-\177 ]
     
     State #9 is non-accepting -
      associated rule line numbers:
            3
      out-transitions: [ r ]
      jam-transitions: EOF [ \001-q  s-\177 ]
     
     Compressed tables always backtrack.

   The first few lines tell us that there's a scanner state in which it
can make a transition on an `o' but not on any other character, and
that in that state the currently scanned text does not match any rule.
The state occurs when trying to match the rules found at lines 2 and 3
in the input file.  If the scanner is in that state and then reads
something other than an `o', it will have to backtrack to find a rule
which is matched.  With a bit of headscratching one can see that this
must be the state it's in when it has seen `fo'.  When this has
happened, if anything other than another `o' is seen, the scanner will
have to back up to simply match the `f' (by the default rule).

   The comment regarding State #8 indicates there's a problem when
`foob' has been scanned.  Indeed, on any character other than a `b',
the scanner will have to back up to accept `foo'.  Similarly, the
comment for State #9 concerns when `fooba' has been scanned.

   The final comment reminds us that there's no point going to all the
trouble of removing backtracking from the rules unless we're using `-f'
or `-F', since there's no performance gain doing so with compressed
scanners.

   The way to remove the backtracking is to add "error" rules:

     %%
     foo         return TOK_KEYWORD;
     foobar      return TOK_KEYWORD;
     
     fooba       |
     foob        |
     fo          {
                 /* false alarm, not really a keyword */
                 return TOK_ID;
                 }

   Eliminating backtracking among a list of keywords can also be done
using a "catch-all" rule:

     %%
     foo         return TOK_KEYWORD;
     foobar      return TOK_KEYWORD;
     
     [a-z]+      return TOK_ID;

   This is usually the best solution when appropriate.

   Backtracking messages tend to cascade.  With a complicated set of
rules it's not uncommon to get hundreds of messages.  If one can
decipher them, though, it often only takes a dozen or so rules to
eliminate the backtracking (though it's easy to make a mistake and have
an error rule accidentally match a valid token.  A possible future
`flex' feature will be to automatically add rules to eliminate
backtracking).

   Variable trailing context (where both the leading and trailing parts
do not have a fixed length) entails almost the same performance loss as
`REJECT' (i.e., substantial).  So when possible a rule like:

     %%
     mouse|rat/(cat|dog)   run();

is better written:

     %%
     mouse/cat|dog         run();
     rat/cat|dog           run();

or as

     %%
     mouse|rat/cat         run();
     mouse|rat/dog         run();

   Note that here the special `|' action does not provide any savings,
and can even make things worse (*note Deficiencies and Bugs: Bugs.).

   Another area where the user can increase a scanner's performance (and
one that's easier to implement) arises from the fact that the longer the
tokens matched, the faster the scanner will run.  This is because with
long tokens the processing of most input characters takes place in the
(short) inner scanning loop, and does not often have to go through the
additional work of setting up the scanning environment (e.g., `yytext')
for the action.  Recall the scanner for C comments:

     %x comment
     %%
             int line_num = 1;
     
     "/*"         BEGIN(comment);
     
     <comment>[^*\n]*
     <comment>"*"+[^*/\n]*
     <comment>\n             ++line_num;
     <comment>"*"+"/"        BEGIN(INITIAL);

This could be sped up by writing it as:

     %x comment
     %%
             int line_num = 1;
     
     "/*"         BEGIN(comment);
     
     <comment>[^*\n]*
     <comment>[^*\n]*\n      ++line_num;
     <comment>"*"+[^*/\n]*
     <comment>"*"+[^*/\n]*\n ++line_num;
     <comment>"*"+"/"        BEGIN(INITIAL);

   Now instead of each newline requiring the processing of another
action, recognizing the newlines is "distributed" over the other rules
to keep the matched text as long as possible.  Note that adding rules
does not slow down the scanner!  The speed of the scanner is
independent of the number of rules or (modulo the considerations given
at the beginning of this section) how complicated the rules are with
regard to operators such as `*' and `|'.

   A final example in speeding up a scanner: suppose you want to scan
through a file containing identifiers and keywords, one per line and
with no other extraneous characters, and recognize all the keywords.  A
natural first approach is:

     %%
     asm      |
     auto     |
     break    |
     ... etc ...
     volatile |
     while    /* it's a keyword */
     
     .|\n     /* it's not a keyword */

To eliminate the back-tracking, introduce a catch-all rule:

     %%
     asm      |
     auto     |
     break    |
     ... etc ...
     volatile |
     while    /* it's a keyword */
     
     [a-z]+   |
     .|\n     /* it's not a keyword */

   Now, if it's guaranteed that there's exactly one word per line, then
we can reduce the total number of matches by a half by merging in the
recognition of newlines with that of the other tokens:

     %%
     asm\n    |
     auto\n   |
     break\n  |
     ... etc ...
     volatile\n |
     while\n  /* it's a keyword */
     
     [a-z]+\n |
     .|\n     /* it's not a keyword */

   One has to be careful here, as we have now reintroduced backtracking
into the scanner.  In particular, while we know that there will never be
any characters in the input stream other than letters or newlines,
`flex' can't figure this out, and it will plan for possibly needing
backtracking when it has scanned a token like `auto' and then the next
character is something other than a newline or a letter.  Previously it
would then just match the `auto' rule and be done, but now it has no
`auto' rule, only a `auto\n' rule.  To eliminate the possibility of
backtracking, we could either duplicate all rules but without final
newlines, or, since we never expect to encounter such an input and
therefore don't how it's classified, we can introduce one more
catch-all rule, this one which doesn't include a newline:

     %%
     asm\n    |
     auto\n   |
     break\n  |
     ... etc ...
     volatile\n |
     while\n  /* it's a keyword */
     [a-z]+\n |
     [a-z]+   |
     .|\n     /* it's not a keyword */

   Compiled with `-Cf', this is about as fast as one  can  get  a
`flex' scanner to go for this particular problem.

   A final note: `flex' is slow when matching `NUL''s, particularly
when  a  token  contains multiple `NUL''s.  It's best to write rules
which match short amounts of text if it's  anticipated that the text
will often include `NUL''s.


File: flex.info,  Node: Incompatibilities,  Next: Diagnostics,  Prev: Performance,  Up: Top

Incompatibilities with `lex' and POSIX
**************************************

   `flex' is a rewrite of the Unix tool `lex' (the two implementations
do not share any code, though), with some extensions and
incompatibilities, both of which are of concern to those who wish to
write scanners acceptable to either implementation.  At present, the
POSIX `lex' draft is very close to the original `lex' implementation,
so some of these incompatibilities are also in conflict with the POSIX
draft.  But the intent is that except as noted below, `flex' as it
presently stands will ultimately be POSIX conformant (i.e., that those
areas of conflict with the POSIX draft will be resolved in `flex''s
favor).  Please bear in mind that all the comments which follow are
with regard to the POSIX draft standard of Summer 1989, and not the
final document (or subsequent drafts); they are included so `flex'
users can be aware of the standardization issues and those areas where
`flex' may in the near future undergo changes incompatible with its
current definition.

   `flex' is fully compatible with `lex' with the following exceptions:

   * The undocumented `lex' scanner internal variable `yylineno' is not
     supported.  It is difficult to support this option efficiently,
     since it requires examining every character scanned and
     reexamining the characters when the scanner backs up.  Things get
     more complicated when the end of buffer or file is reached or a
     `NUL' is scanned (since the scan must then be restarted with the
     proper line number count), or the user uses the `yyless', `unput',
     or `REJECT' actions, or the multiple input buffer functions.

     The fix is to add rules which, upon seeing a newline, increment
     `yylineno'.  This is usually an easy process, though it can be a
     drag if some of the patterns can match multiple newlines along with
     other characters.

     `yylineno' is not part of the POSIX draft.

   * The `input' routine is not redefinable, though it may be called to
     read characters following whatever has been matched by a rule.  If
     `input' encounters an end-of-file the normal `yywrap' processing
     is done.  A "real" end-of-file is returned by `input' as `EOF'.

     Input is instead controlled by redefining the `YY_INPUT' macro.

     The `flex' restriction that `input' cannot be redefined is in
     accordance with the POSIX draft, but `YY_INPUT' has not yet been
     accepted into the draft (and probably won't; it looks like the
     draft will simply not specify any way of controlling the scanner's
     input other than by making an initial assignment to `yyin').

   * `flex' scanners do not use `stdio' for input.  Because of this,
     when writing an interactive scanner one must explicitly call
     `fflush' on the stream associated with the terminal after writing
     out a prompt.  With `lex' such writes are automatically flushed
     since `lex' scanners use `getchar' for their input.  Also, when
     writing interactive scanners with `flex', the `-I' flag must be
     used.

   * `flex' scanners are not as reentrant as `lex' scanners.  In
     particular,  if  you have an interactive scanner and an interrupt
     handler which long-jumps out of the  scanner, and  the  scanner is
     subsequently called again, you may get the following message:

          fatal flex scanner internal error--end of buffer missed

     To reenter the scanner, first use

          yyrestart( yyin );

   * `output' is not supported.  Output from the `ECHO' macro is done
     to the file-pointer `yyout' (default `stdout').

     The POSIX draft mentions that an `output' routine exists but
     currently gives no details as to what it does.

   * `lex' does not support exclusive start  conditions  (`%x'), though
     they are in the current POSIX draft.

   * When definitions are expanded, `flex'  encloses  them  in
     parentheses.  With `lex', the following:

          NAME    [A-Z][A-Z0-9]*
          %%
          foo{NAME}?      printf( "Found it\n" );
          %%

     will not match the string `foo' because, when the macro is
     expanded, the rule is equivalent to `foo[A-Z][A-Z0-9]*?' and the
     precedence is such that the `?' is associated with `[A-Z0-9]*'.
     With `flex', the rule will be expanded to `foo([A-Z][A-Z0-9]*)?'
     and so the string `foo' will match.  Note that because of this,
     the `^', `$', `<S>', `/', and `<<EOF>>' operators cannot be used
     in a `flex' definition.

     The POSIX draft interpretation is the same as in `flex'.

   * To specify a character class which matches anything but a left
     bracket (`]'), in `lex' one can use `[^]]' but with `flex' one
     must use `[^\]]'.  The latter works with `lex', too.

   * The lex `%r' (generate a Ratfor scanner)  option  is  not
     supported.  It is not part of the POSIX draft.

   * If you are providing your own `yywrap' routine, you must include a
     `#undef yywrap' in the definitions section (section 1).  Note that
     the `#undef' will have to be enclosed in `%{}'.

     The POSIX draft specifies that `yywrap' is a function, and this is
     very unlikely to change; so `flex' users are warned that `yywrap'
     is likely to be changed to a function in the near future.

   * After a call to `unput', `yytext' and `yyleng' are undefined until
     the next token is matched.  This is not the case with `lex' or the
     present POSIX draft.

   * The precedence of the `{}' (numeric range) operator is different.
     `lex' interprets `abc{1,3}' as "match one, two, or three
     occurrences of `abc'," whereas `flex' interprets it as "match `ab'
     followed by one, two, or three occurrences of `c'."  The latter is
     in agreement with the current POSIX draft.

   * The precedence of the `^'  operator  is  different.   `lex'
     interprets  `^foo|bar'  as  "match  either `foo' at the beginning
     of a line, or `bar' anywhere",  whereas  `flex' interprets  it  as
     "match either `foo' or `bar' if they come at the beginning of a
     line".   The  latter  is  in agreement with the current POSIX
     draft.

   * To refer to `yytext' outside of the scanner source file, the
     correct definition with `flex' is `extern char *yytext' rather
     than `extern char yytext[]'.  This is contrary to the current POSIX
     draft but a point on which `flex' will not be changing, as the
     array representation entails a serious performance penalty.  It is
     hoped that the POSIX draft will be amended to support the `flex'
     variety of declaration (as this is a fairly painless change to
     require of `lex' users).

   * `yyin' is initialized by `lex' to be `stdin'; `flex', on the other
     hand, initializes `yyin' to `NULL' and then assigns it to stdin
     the first time the scanner is called, providing `yyin' has not
     already been assigned to a non-`NULL' value.  The difference is
     subtle, but the net effect is that with `flex' scanners, `yyin'
     does not have a valid value until the scanner has been called.

   * The special table-size declarations such as `%a' supported by
     `lex' are not required by `flex' scanners; `flex' ignores them.

   * The name `FLEX_SCANNER' is `#define''d so scanners  may  be
     written for use with either `flex' or `lex'.

   The following `flex' features are not included in lex  or  the POSIX
draft standard:

     `yyterminate()'
     `<<EOF>>'
     `YY_DECL'
     `#line' directives
     `%{}' around actions
     `yyrestart()'
     comments beginning with `#' (deprecated)
     multiple actions on a line

   This last feature refers to the fact that with `flex' you can put
multiple actions on the same line, separated with semicolons, while with
`lex', the following

     foo    handle_foo(); ++num_foos_seen;

is (rather surprisingly) truncated to

     foo    handle_foo();

   `flex' does not truncate the action.  Actions that are not enclosed
in braces are simply terminated at the end of the line.


File: flex.info,  Node: Diagnostics,  Next: Bugs,  Prev: Incompatibilities,  Up: Top

Diagnostic Messages
*******************

`reject_used_but_not_detected undefined'
`yymore_used_but_not_detected undefined'
     These errors can occur at compile time.  They indicate that the
     scanner uses `REJECT' or `yymore' but that `flex' failed to notice
     the fact, meaning that `flex' scanned the first two sections
     looking for occurrences of these actions and failed to find any,
     but somehow you snuck some in (via a `#include' file, for example).
     Make an explicit reference to the action in your `flex' input file.
     (Note that previously `flex' supported a `%used'/`%unused'
     mechanism for dealing with this problem; this feature is still
     supported but now deprecated, and will go away soon unless the
     author hears from people who can argue compellingly that they need
     it.)

`flex scanner jammed'
     A scanner compiled with `-s' has encountered an input string which
     wasn't matched by any of its rules.

`flex input buffer overflowed'
     A scanner rule matched a string long enough to overflow the
     scanner's internal input buffer (16K bytes by default--controlled
     by `YY_BUF_SIZE' in `flex.skel'.  Note that to redefine this
     macro, you must first `#undefine' it).

`scanner requires -8 flag'
     Your scanner specification includes recognizing 8-bit characters
     and you did not specify the `-8' flag (and your site has not
     installed `flex' with `-8' as the default).

`fatal flex scanner internal error--end of buffer missed'
     This can occur in an scanner which is reentered after a long-jump
     has jumped out (or over) the scanner's activation frame.  Before
     reentering the scanner, use:

          yyrestart( yyin );

`too many %t classes!'
     You managed to put every single character into its own `%t' class.
     `flex' requires that at least one of the classes share characters.


File: flex.info,  Node: Bugs,  Next: Acknowledgements,  Prev: Diagnostics,  Up: Top

Deficiencies and Bugs
*********************

   Some trailing context patterns cannot be properly matched and
generate warning messages (`Dangerous trailing context').  These are
patterns where the ending of the first part of the rule matches the
beginning of the second part, such as `zx*/xy*', where the `x*' matches
the `x' at the beginning of the trailing context.  (Note that the POSIX
draft states that the text matched by such patterns is undefined.)

   For some trailing context rules, parts which are actually
fixed-length are not recognized as such, leading to the abovementioned
performance loss.  In particular, parts using `|' or {n} (such as
`foo{3}') are always considered variable-length.

   Combining trailing context with the special `|' action can result in
fixed trailing context being turned into the more expensive variable
trailing context.  For example, this happens in the following example:

     %%
     abc      |
     xyz/def

   Use of `unput' invalidates `yytext' and `yyleng'.

   Use of `unput' to push back more text than was  matched  can result
in the pushed-back text matching a beginning-of-line (`^') rule even
though it didn't come at  the  beginning  of the line (though this is
rare!).

   Pattern-matching of `NUL''s is substantially slower than matching
other characters.

   `flex' does not generate correct `#line' directives for code
internal to the scanner; thus, bugs in `flex.skel' yield bogus line
numbers.

   Due to both buffering of input and read-ahead, you cannot intermix
calls to `stdio.h' routines, such as, for example, `getchar', with
`flex' rules and expect it to work.  Call `input' instead.

   The total table entries listed by the `-v' flag excludes the number
of table entries needed to determine what rule has been matched.  The
number of entries is equal to the number of DFA states if the scanner
does not use `REJECT', and somewhat greater than the number of states
if it does.

   `REJECT' cannot be used with the `-f' or `-F' options.

   Some of the macros, such as `yywrap', may in the future become
functions which live in the `-lfl' library.  This will doubtless break
a lot of code, but may be required for POSIX compliance.

   The `flex' internal algorithms need documentation.


File: flex.info,  Node: Acknowledgements,  Prev: Bugs,  Up: Top

Contributors to `flex'
**********************

   The author of `flex' is Vern Paxson, with the help of many ideas and
much inspiration from Van Jacobson.  Original version by Jef Poskanzer.
The fast table representation is a partial implementation of a design
done by Van Jacobson.  The implementation was done by Kevin Gong and
Vern Paxson.

   Thanks to the many `flex' beta-testers, feedbackers, and
contributors, especially Casey Leedom, `benson@odi.com', Keith Bostic,
Frederic Brehm, Nick Christopher, Jason Coughlin, Scott David Daniels,
Leo Eskin, Chris Faylor, Eric Goldman, Eric Hughes, Jeffrey R. Jones,
Kevin B. Kenny, Ronald Lamprecht, Greg Lee, Craig Leres, Mohamed el
Lozy, Jim Meyering, Marc Nozell, Esmond Pitt, Jef Poskanzer, Jim
Roskind, Dave Tallman, Frank Whaley, Ken Yap, and those whose names have
slipped my marginal mail-archiving skills but whose contributions are
appreciated all the same.

   Thanks to Keith Bostic, John Gilmore, Craig Leres, Bob Mulcahy, Rich
Salz, and Richard Stallman for help with various distribution headaches.

   Thanks to Esmond Pitt and Earle Horton for 8-bit character support;
to Benson Margulies and Fred Burke for C++ support; to Ove Ewerlid for
the basics of support for `NUL''s; and to Eric Hughes for the basics of
support for multiple buffers.

   Work is being done on extending `flex' to generate scanners in which
the state machine is directly represented in C code rather than tables.
These scanners may well be substantially faster than those generated
using `-f' or `-F'.  If you are working in this area and are interested
in comparing notes and seeing whether redundant work can be avoided,
contact Ove Ewerlid (`ewerlid@mizar.DoCS.UU.SE').

   This work was primarily done when I was at the Real Time Systems
Group at the Lawrence Berkeley Laboratory in Berkeley, CA.  Many thanks
to all there for the support I received.

   Send comments to:

     Vern Paxson
     Computer Systems Engineering
     Bldg. 46A, Room 1123
     Lawrence Berkeley Laboratory
     Berkeley, CA 94720
     
     vern@ee.lbl.gov