ABNF Generator
abnfgen parses ABNF definitions from source files and generates
a state machine definition for
Ragel. The aim is to simplify
the implementation of parsers for grammars defined in ABNF, the notation used in
many RFCs.
ABNF
abnfgen takes one or more ABNF definitions specified on the command line. Unless a name
is recognized as the name of an internal rule list, it is treated as the name of a file
containing rule declarations in ABNF format (RFC2234). If no name is provided,
or the name "-" is used, standard input is read. Input files are processed in the order
provided.
Because several input files may be processed, a conflict can occur when the name of a rule
being read already exists in the rule list. In that case the later rule overrides the earlier
one. If "=/" is used, however, the rules are merged. An internal rule list cannot override
any other rule.
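The override and merge behaviour described above can be illustrated with a small Python sketch. It is not abnfgen's implementation: the function name add_rule, the dictionary layout and the internal flag are assumptions made for illustration, and so is the reading that a rule from an internal list never replaces a rule that is already loaded.

```python
def add_rule(rules, name, operator, alternatives, internal=False):
    """Add one parsed rule to `rules` (dict: rule name -> list of alternatives).

    operator is "=" (plain definition) or "=/" (incremental alternatives).
    """
    if name not in rules:
        rules[name] = list(alternatives)
    elif operator == "=/":
        rules[name].extend(alternatives)   # merge: append the new alternatives
    elif not internal:
        rules[name] = list(alternatives)   # later definition overrides the earlier one
    # assumption: a "=" rule from an internal list never replaces an existing rule
    return rules


rules = {}
add_rule(rules, "name", "=", ["ALPHA"])
add_rule(rules, "name", "=/", ["DIGIT"])               # merged into "name"
add_rule(rules, "body", "=", ["VCHAR"])
add_rule(rules, "body", "=", ["OCTET"])                # overrides the earlier "body"
add_rule(rules, "body", "=", ["CHAR"], internal=True)  # ignored
print(rules)  # {'name': ['ALPHA', 'DIGIT'], 'body': ['OCTET']}
```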
Syntax declarations are often embedded in plain text, so they must be separated manually
and checked to make sure every rule definition starts at the beginning of a line.
Although ABNF mandates CRLF line endings, abnfgen also accepts a plain LF.
Unfortunately, even when an RFC references the ABNF RFC when
declaring its own syntax, in many cases that syntax
deviates from it and will not pass through a strict parser, e.g. the alternation delimiter
"|" is used instead of "/", numbers are written as "0xFF" instead of "%xFF", etc.
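The manual clean-up of such deviations could be partly automated before the text is passed to abnfgen. The following Python sketch is not part of abnfgen; the function name normalize_abnf is hypothetical, and it handles only the two deviations named above, plus line endings. Text inside double quotes is left untouched, but comments and prose values are not treated specially, so the result should still be reviewed by hand.

```python
import re


def normalize_abnf(text):
    """Rewrite two common non-standard notations into strict ABNF (sketch only)."""
    # Accept CRLF or bare LF on input.
    lines = text.replace("\r\n", "\n").split("\n")
    out = []
    for line in lines:
        # Split on quoted strings; the capturing group keeps them at odd indexes.
        parts = re.split(r'("[^"]*")', line)
        for i in range(0, len(parts), 2):      # even indexes lie outside quotes
            part = parts[i].replace("|", "/")  # "|" -> "/" alternation delimiter
            part = re.sub(r"\b0x([0-9A-Fa-f]+)\b", r"%x\1", part)  # 0xFF -> %xFF
            parts[i] = part
        out.append("".join(parts))
    # Emit the CRLF line ends that ABNF mandates.
    return "\r\n".join(out)


print(normalize_abnf('sep = "|" | 0x2C\n'))
```

The quoted "|" survives while the bare one becomes "/", and 0x2C becomes %x2C.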
Because many RFCs reference the ABNF core definitions, these rules are
built into abnfgen as the internal "core" rule list.
If a problem is detected, a message is written to stderr and no other output is
generated unless the "-F" option is specified. This often happens when an unknown rule name
is referenced.
abnfgen can produce several output formats:
ABNF: normalized ABNF format
Ragel state machine definition: the main product, discussed below
abnfgen structures: may be used for compiling an internal rule list into abnfgen itself
Ragel
Ragel (http://www.cs.queensu.ca/~thurston/ragel/) is a fast GNU state machine compiler that
allows custom actions to be called in any state of processing. abnfgen generates a state machine
definition in a format Ragel understands, but the developer must understand its logic and check
that the definition meets the requirements. The main problem is ambiguity. It is often a
mind-bending problem to recognize which state is ambiguous and causes the machine to produce
false results or even to enter a never-ending loop. Note that the Ragel scanner is a very good
instrument that helps to overcome many ambiguities. Priorities also help, but they are a kind of magic.
Ragel does not support circular references (they are easy to detect because Ragel raises an
error message). Such a case must be corrected manually, probably by using a
separate state machine. The circularly dependent machines then call each other using the
fcall/fret
commands.
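Circular references can also be located in the grammar before Ragel is run. The Python sketch below is not part of abnfgen or Ragel; it assumes the rule dependencies have already been extracted into a hypothetical mapping from each rule name to the set of names it references, and it uses a plain depth-first search to report one cycle.

```python
def find_cycle(deps):
    """Return one circular chain of rule names, or None if there is none.

    deps maps a rule name to the set of rule names it references.
    """
    WHITE, GREY, BLACK = 0, 1, 2            # unvisited, on current path, finished
    color = {name: WHITE for name in deps}
    stack = []

    def visit(name):
        color[name] = GREY
        stack.append(name)
        for ref in sorted(deps.get(name, ())):
            state = color.get(ref, BLACK)   # names without an entry have no references
            if state == GREY:               # back edge: ref is on the current path
                return stack[stack.index(ref):] + [ref]
            if state == WHITE:
                found = visit(ref)
                if found:
                    return found
        stack.pop()
        color[name] = BLACK
        return None

    for name in deps:
        if color[name] == WHITE:
            found = visit(name)
            if found:
                return found
    return None


deps = {
    "expr": {"term"},
    "term": {"factor"},
    "factor": {"expr", "DIGIT"},            # refers back to "expr"
    "DIGIT": set(),
}
print(find_cycle(deps))  # ['expr', 'term', 'factor', 'expr']
```

The rules on the reported chain are the candidates for being split into separate machines.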
The rule that abnfgen takes to be the main rule is emitted as the
main:= instantiation. It is simply
the last rule that no other rule depends on.
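How the main rule might be picked can be sketched in Python. The sketch assumes the criterion is "the last rule that is not referenced by any other rule", which matches the sample output in this document, where list becomes main. The function name pick_main and the (name, references) layout are hypothetical and not taken from abnfgen.

```python
def pick_main(rules):
    """rules: list of (name, referenced_names) pairs in definition order."""
    referenced = set()
    for name, refs in rules:
        # A self-reference does not count as being used by another rule.
        referenced.update(r for r in refs if r != name)
    candidates = [name for name, _ in rules if name not in referenced]
    return candidates[-1] if candidates else None


rules = [
    ("list", {"item", "CRLF"}),
    ("item", {"name", "SP", "body"}),
    ("name", {"ALPHA", "DIGIT"}),
    ("body", set()),
]
print(pick_main(rules))  # list
```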
Examples
# pipe from stdin to stdout
abnfgen -f abnf
# load my.txt
abnfgen my.txt -f ragel -o my.rl
# load built-in rules core + abnf + my.txt
abnfgen core abnf my.txt -f ragel -o my.rl
# read RFC2234 core rules and RFC3261 and print to stdout
abnfgen core rfc3261.txt -f ragel
Input
list = 1*(item CRLF)
item = name ":" *SP body
name = ALPHA *( ALPHA / DIGIT / "_")
body = *( %x20-7e )
Ragel
abnfgen core test.txt -f ragel
# Generated by abnfgen at Sun Aug 12 15:03:04 2007
# Sources:
# core
# test.txt
%%{
# write your name
machine generated_from_abnf;
# generated rules, define required actions
ALPHA = 0x41..0x5a | 0x61..0x7a;
BIT = "0" | "1";
CHAR = 0x01..0x7f;
CR = "\r";
LF = "\n";
CRLF = CR LF;
CTL = 0x00..0x1f | 0x7f;
DIGIT = 0x30..0x39;
DQUOTE = "\"";
HEXDIG = DIGIT | "A"i | "B"i | "C"i | "D"i | "E"i | "F"i;
HTAB = "\t";
SP = " ";
WSP = SP | HTAB;
LWSP = ( WSP | ( CRLF WSP ) )*;
OCTET = 0x00..0xff;
VCHAR = 0x21..0x7e;
name = ALPHA ( ALPHA | DIGIT | "_" )*;
body = 0x20..0x7e*;
item = name ":" SP* body;
list = ( item CRLF )+;
# instantiate machine rules
main:= list;
}%%
ABNF output
abnfgen core test.txt -f abnf
; Generated by abnfgen at Sun Aug 12 14:58:43 2007
; Sources:
; core
; test.txt
ALPHA = %x41-5a / %x61-7a
BIT = "0" / "1"
CHAR = %x01-7f
CR = %x0d
CRLF = CR LF
CTL = %x00-1f / %x7f
DIGIT = %x30-39
DQUOTE = %x22
HEXDIG = DIGIT / "A" / "B" / "C" / "D" / "E" / "F"
HTAB = %x09
LF = %x0a
LWSP = *( WSP / ( CRLF WSP ) )
OCTET = %x00-ff
SP = " "
VCHAR = %x21-7e
WSP = SP / HTAB
list = 1*( item CRLF )
item = name ":" *SP body
name = ALPHA *( ALPHA / DIGIT / "_" )
body = *%x20-7e