**A grammar** is a list of rules and terminals that together define a language.
Lark begins the parse with the rule 'start', unless specified otherwise in the options.
Names of rules are always in lowercase, while names of terminals are always in uppercase. This distinction has practical effects on the shape of the generated parse tree and on the automatic construction of the lexer (also known as a tokenizer or scanner).
<a name="terms"></a>
## Terminals
Terminals are used to match text into symbols. They can be defined as a combination of literals and other terminals.
```perl
WHITESPACE: (" " | /\t/ )+
SQL_SELECT: "select"i
```
### Regular expressions & Ambiguity
Each terminal is eventually compiled to a regular expression. All the operators and references inside it are mapped to their respective expressions.
For example, in the following grammar, `A1` and `A2` are equivalent:
```perl
A1: "a" | "b"
A2: /a|b/
```
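To make the equivalence concrete, here is a minimal sketch using only Python's `re` module. The helper `literals_to_regex` is hypothetical and only illustrates the idea of mapping an alternation of literals to a regular expression; it is not Lark's actual compilation code, and the exact pattern Lark produces may differ.

```python
import re

def literals_to_regex(literals):
    """Hypothetical helper: join string literals into one regex alternation."""
    return "(?:" + "|".join(re.escape(s) for s in literals) + ")"

a1 = literals_to_regex(["a", "b"])  # stands in for  A1: "a" | "b"
a2 = r"a|b"                         # the pattern of  A2: /a|b/

# Both patterns accept and reject exactly the same sample inputs.
samples = ["a", "b", "c", "ab", ""]
results = [(bool(re.fullmatch(a1, s)), bool(re.fullmatch(a2, s))) for s in samples]
same = all(r1 == r2 for r1, r2 in results)
print(same)
```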
This means that inside terminals, Lark cannot detect or resolve ambiguity, even when using Earley.
For example, for this grammar:
```perl
start : (A | B)+
A : "a" | "ab"
B : "b"
```
We get this behavior:
```python
>>> p.parse("ab")
Tree(start, [Token(A, 'a'), Token(B, 'b')])
```
This happens because Python's regex engine always returns the first alternative that matches, not the longest one.
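The regex behavior can be reproduced without Lark. The sketch below uses only Python's `re` module; the patterns `a|ab` and `ab|a` are illustrative stand-ins for the alternatives of `A`, not necessarily the exact pattern Lark generates.

```python
import re

text = "ab"

# Alternatives are tried left to right. The first one that matches wins,
# even when a later alternative would consume more of the input.
short_first = re.match(r"a|ab", text).group()
long_first = re.match(r"ab|a", text).group()

print(short_first)  # a
print(long_first)   # ab
```

The regex engine never reports that both `a` and `ab` were possible matches, so the ambiguity is invisible to the parser.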
If you find yourself in this situation, the recommended solution is to use rules instead.
Example:
```python
>>> from lark import Lark
>>> p = Lark("""start: (a | b)+
... !a: "a" | "ab"
... !b: "b"
... """, ambiguity="explicit")
>>> print(p.parse("ab").pretty())
_ambig
  start
    a	ab
  start
    a	a
    b	b
```
<a name="rules"></a>
## Rules
**Syntax:**
Rules can be assigned priority only when using Earley.
Priority can be either positive or negative. If not specified for a terminal, it's assumed to be 1 (i.e. the default).