Alang (Automata Language)
Alang is a formal language for defining finite-state automata using human-readable regular expressions. It supports many operations, such as union, intersection, complement and set difference, enabling expressions like "(a (b | c)* - (b b))?". Alang's syntax is defined by the Alang Grammar which is an LL(1) context-free grammar. The Alang parser is optimized for fast parsing of very large inputs. The parser validates syntactic correctness and generates detailed error messages for invalid inputs.
Alang Grammar Specification
Rule | Expansion |
---|---|
AlangExpr (root) | Union |
🔹Union | Difference ('|' Difference)* |
🔹Difference | Intersection ('-' Intersection)* |
🔹Intersection | Concatenation ('&' Concatenation)* |
🔹Concatenation | UnaryExpr+ |
UnaryExpr | PrimaryExpr (Option ┃ KleeneStar ┃ KleenePlus ┃ Complement)* |
🔹Option | PrimaryExpr '?' |
🔹KleeneStar | PrimaryExpr '*' |
🔹KleenePlus | PrimaryExpr '+' |
🔹Complement | Primary '~' |
PrimaryExpr | '(' AlangExpr ')' ┃ Atom ┃ Wildcard ┃ EmptySet |
🔹Atom | AtomChar+ |
🔹Wildcard | '.' |
🔹EmptySet | '(' ')' |
AtomChar | any character except operator characters and whitespace |
🔹 Denotes a node-type that can be included in the resulting parse tree.
The root rule AlangExpr must cover the entire input, with no residue.
Operators Ordered by Precedence (Lowest-to-Highest)
Precedence | Operation | Operator Character | Position & Arity |
---|---|---|---|
1 | Union | x | x | Infix Binary |
2 | Difference | x - x | Infix Binary |
3 | Intersection | x & x | Infix Binary |
4 | Concatenation | x x | Infix Implicit |
5 | Option | x? | Postfix Unary |
5 | Kleene Star | x* | Postfix Unary |
5 | Kleene Plus | x+ | Postfix Unary |
5 | Complement | x~ | Postfix Unary |
6 | Group | ( x ) | Enclosing Unary |
7 | Empty set | () | Empty parentheses |
7 | Wildcard | . | Atomic leaf |
7 | Atom | string literal | Atomic leaf |
Notes
- Operators with higher precedence levels bind more tightly than those with lower levels.
- Operators of the same precedence level are left-associative (left-to-right).
- Whitespace denotes any whitespace character (i.e. space, tab, newline, etc.)
- Atoms are user defined (alphabet symbols) and can contain any characters except for the operator characters and whitespace.
- Wildcard '.' matches exactly one “Atom” in the alphabet.
- Whitespace is allowed anywhere in the grammar, but it is never required unless to separate directly adjacent atoms.