@@ -45,6 +45,7 @@ Notice punctuation doesn't appear in the resulting tree. It's automatically filt
- Read the [tutorial](/docs/json_tutorial.md), which shows how to write a JSON parser in Lark.
- Browse the [examples](/examples), which include a calculator, and a Python-code parser.
- Read the [reference](/docs/reference.md)
## List of Features
@@ -20,13 +20,16 @@ Knowledge assumed:
Lark accepts its grammars in a format called [EBNF](https://www.wikiwand.com/en/Extended_Backus%E2%80%93Naur_form). It basically looks like this:
    rule_name: some rules and TOKENS
             | or others
    rule_name : list of rules and TOKENS to match
              | another possible list of items
              | etc.
    TOKEN: "some text to match"
(*a token is a string or a regular expression*)
The parser will try to match each rule (left-part) by matching its items (right-part) sequentially, trying each alternative (In practice, the parser is predictive so we don't have to try every alternative).
How to structure those rules is beyond the scope of this tutorial, but often it's enough to follow one's intuition.
In the case of JSON, the structure is simple: A json document is either a list, or a dictionary, or a string/number/etc.
@@ -393,7 +396,7 @@ PyPy is awesome!
### Conclusion
We've brought the run-time down from 36 seconds to 1.4 seconds, in a series of small and simple steps.
We've brought the run-time down from 36 seconds to 1.1 seconds, in a series of small and simple steps.
Now let's compare the benchmarks in a nicely organized table.
@@ -0,0 +1,165 @@
# Lark Reference

## What is Lark?

Lark is a general-purpose parsing library. It's written in Python, and supports two parsing algorithms: Earley (default) and LALR(1).

## Grammar

Lark accepts its grammars in [EBNF](https://www.wikiwand.com/en/Extended_Backus%E2%80%93Naur_form) form.

The grammar is a list of rules and tokens, each on its own line.

Rules can be defined on multiple lines when using the *OR* operator ( | ).

Comments start with // and last to the end of the line (C++ style).

Lark begins the parse with the rule 'start', unless specified otherwise in the options.

### Tokens

Tokens are defined in terms of:

    NAME : "string" or /regexp/

    NAME.ignore : ..

.ignore is a flag that drops the token before it reaches the parser (usually whitespace)

Example:

    IF: "if"

    INTEGER : /[0-9]+/

    WHITESPACE.ignore: /[ \t\n]+/
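
To make the effect of `.ignore` concrete, here is a minimal sketch of a lexer in plain Python that mirrors the example tokens above. It is not Lark's implementation: `TOKEN_DEFS` and `tokenize` are hypothetical names, and the sketch simply tries each pattern in order with the standard `re` module.

```python
import re

# (name, pattern, ignore) triples mirroring the example token definitions.
# Hypothetical structure for illustration; not Lark's internal representation.
TOKEN_DEFS = [
    ("IF", r"if", False),
    ("INTEGER", r"[0-9]+", False),
    ("WHITESPACE", r"[ \t\n]+", True),
]


def tokenize(text):
    """Return (name, value) pairs, dropping tokens flagged as ignored."""
    tokens = []
    pos = 0
    while pos < len(text):
        for name, pattern, ignore in TOKEN_DEFS:
            match = re.compile(pattern).match(text, pos)
            if match:
                if not ignore:
                    # Only non-ignored tokens are passed on to the parser.
                    tokens.append((name, match.group()))
                pos = match.end()
                break
        else:
            # No token definition matched at this position.
            raise ValueError(f"Unexpected character {text[pos]!r} at position {pos}")
    return tokens


print(tokenize("if 42\n"))  # [('IF', 'if'), ('INTEGER', '42')]
```

The whitespace is matched and consumed like any other token, but it never reaches the returned list, which is the behaviour the `.ignore` flag describes.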
### Rules

Each rule is defined in terms of:

    name : list of items to match
         | another list of items    -> optional_alias
         | etc.

An alias is a name for the specific rule alternative. It affects tree construction.

An item is a:

- rule
- token
- (item item ..) - Group items
- [item item ..] - Maybe. Same as: "(item item ..)?"
- item? - Zero or one instances of item ("maybe")
- item\* - Zero or more instances of item
- item+ - One or more instances of item

Example:

    float: "-"? DIGIT* "." DIGIT+ exp
         | "-"? DIGIT+ exp

    exp: "-"? ("e" | "E") DIGIT+

    DIGIT: /[0-9]/

## Tree Construction

Lark builds a tree automatically based on the structure of the grammar. It also accepts some hints.

In general, Lark will place each rule as a branch, and its matches as the children of the branch.

Using item+ or item\* will result in a list of items.

Example:

    expr: "(" expr ")"
        | NAME+

    NAME: /\w+/

Lark will parse "(((hello world)))" as:

    expr
        expr
            expr
                expr
                    "hello"
                    "world"

Each pair of brackets contributes one expr branch, and the innermost expr holds the NAME+ match. The brackets themselves do not appear in the tree by design.

Tokens that won't appear in the tree are:

- Unnamed strings (like "keyword" or "+")
- Tokens whose name starts with an underscore (like \_DIGIT)

Tokens that *will* appear in the tree are:

- Unnamed regular expressions (like /[0-9]/)
- Named tokens whose name starts with a letter (like DIGIT)
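
The two lists above amount to a small decision rule. The following is a sketch of that rule in plain Python, assuming a token is described only by its name (or `None` if unnamed) and by whether it is a regular expression. `keeps_token` is a hypothetical helper, not part of Lark.

```python
def keeps_token(name, is_regexp):
    """Decide whether a matched token appears in the tree.

    name      -- the token's name, or None for an unnamed (inline) token
    is_regexp -- True if the token is defined by /regexp/, False for "string"
    """
    if name is None:
        # Unnamed strings are dropped; unnamed regular expressions are kept.
        return is_regexp
    # Named tokens are kept unless the name starts with an underscore.
    return not name.startswith("_")


print(keeps_token(None, False))     # unnamed string such as "+"
print(keeps_token(None, True))      # unnamed regexp such as /[0-9]/
print(keeps_token("_DIGIT", True))  # underscore-prefixed name
print(keeps_token("DIGIT", True))   # ordinary named token
```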
## Shaping the tree

1. Rules whose name begins with an underscore will be inlined into their containing rule.

Example:

    start: "(" _greet ")"
    _greet: /\w+/ /\w+/

Lark will parse "(hello world)" as:

    start
        "hello"
        "world"

2. Rules that receive a question mark (?) at the beginning of their definition will be inlined if they have a single child.

Example:

    start: greet greet
    ?greet: "(" /\w+/ ")"
          | /\w+/ /\w+/

Lark will parse "hello world (planet)" as:

    start
        greet
            "hello"
            "world"
        "planet"

3. Aliases - options in a rule can receive an alias. It will then be used as the branch name for the option.

Example:

    start: greet greet
    greet: "hello" -> hello
         | "world"

Lark will parse "hello world" as:

    start
        hello
        greet
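
The first two shaping rules can be sketched as a post-processing pass over an unshaped tree. This is only an illustration under stated assumptions: a branch is modelled as a `(rule_name, children)` tuple, a token as a plain string, and `shape` and `maybe_inline` are hypothetical names. Lark itself may apply these rules during tree construction rather than afterwards.

```python
def shape(node, maybe_inline=frozenset()):
    """Apply the underscore and question-mark inlining rules to a raw tree.

    node         -- a token (str) or a (rule_name, children) tuple
    maybe_inline -- names of rules that were defined with a leading "?"
    """
    if isinstance(node, str):
        return node
    name, children = node
    shaped = []
    for child in children:
        child = shape(child, maybe_inline)
        if isinstance(child, str):
            shaped.append(child)
            continue
        child_name, grandchildren = child
        if child_name.startswith("_"):
            # Rule 1: underscore rules are always inlined into their parent.
            shaped.extend(grandchildren)
        elif child_name in maybe_inline and len(grandchildren) == 1:
            # Rule 2: "?" rules are inlined only when they have a single child.
            shaped.extend(grandchildren)
        else:
            shaped.append(child)
    return (name, shaped)


# Rule 1: "(hello world)" with start: "(" _greet ")"
print(shape(("start", [("_greet", ["hello", "world"])])))

# Rule 2: "hello world (planet)" with ?greet
raw = ("start", [("greet", ["hello", "world"]), ("greet", ["planet"])])
print(shape(raw, maybe_inline={"greet"}))
```

Aliases need no separate pass in this model: an aliased alternative would simply carry the alias as its `rule_name`.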

## Lark Options

When initializing the Lark object, you can provide it with keyword options:

- start - The start symbol (Default: "start")
- parser - Decides which parser engine to use, "earley" or "lalr". (Default: "earley")
  Note: Both will use Lark's lexer.
- transformer - Applies the transformer to every parse tree (only allowed with parser="lalr")
- only\_lex - Don't build a parser. Useful for debugging (default: False)
- postlex - Lexer post-processing (Default: None)
- profile - Measure run-time usage in Lark. Read results from the profiler property (Default: False)

To be supported:

- debug
- cache\_grammar
- keep\_all\_tokens
@@ -21,10 +21,11 @@ class LarkOptions(object):
        transformer - Applies the transformer to every parse tree
        debug - Affects verbosity (default: False)
        only_lex - Don't build a parser. Useful for debugging (default: False)
        keep_all_tokens - Don't automagically remove "punctuation" tokens (default: True)
        keep_all_tokens - Don't automagically remove "punctuation" tokens (default: False)
        cache_grammar - Cache the Lark grammar (Default: False)
        postlex - Lexer post-processing (Default: None)
        start - The start symbol (Default: start)
        profile - Measure run-time usage in Lark. Read results from the profiler property (Default: False)
    """
    __doc__ += OPTIONS_DOC
    def __init__(self, options_dict):
@@ -39,7 +40,7 @@ class LarkOptions(object):
        self.parser = o.pop('parser', 'earley')
        self.transformer = o.pop('transformer', None)
        self.start = o.pop('start', 'start')
        self.profile = o.pop('profile', False) # XXX new
        self.profile = o.pop('profile', False)
        assert self.parser in ENGINE_DICT
        if self.parser == 'earley' and self.transformer: