@@ -1,5 +1,6 @@
 *.pyc
 *.pyo
+/.tox
 /lark_parser.egg-info/**
 tags
 .vscode
@@ -1,7 +1,7 @@
-# Features
+# Main Features
 - EBNF-inspired grammar, with extra features (See: [Grammar Reference](grammar.md))
 - Builds a parse-tree (AST) automagically based on the grammar
 - Stand-alone parser generator - create a small independent parser to embed in your project.
 - Automatic line & column tracking
 - Automatic terminal collision resolution
@@ -39,16 +39,17 @@ Lark extends the traditional YACC-based architecture with a *contextual lexer*,
 The contextual lexer communicates with the parser, and uses the parser's lookahead prediction to narrow its choice of tokens. So at each point, the lexer only matches the subgroup of terminals that are legal at that parser state, instead of all of the terminals. It's surprisingly effective at resolving common terminal collisions, and allows Lark to parse languages that LALR(1) was previously incapable of parsing.

 This is an improvement to LALR(1) that is unique to Lark.
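For context (not part of this diff): a minimal sketch of how a user opts into the contextual lexer. The grammar here is illustrative; `lexer='contextual'` is the option Lark pairs with `parser='lalr'`.

```python
from lark import Lark

# Minimal sketch: the contextual lexer is selected alongside the LALR(1) parser.
# At each parser state, only the terminals legal in that state are considered.
parser = Lark(r"""
    start: "print" NAME
    NAME: /[a-z]+/
    %ignore " "
""", parser='lalr', lexer='contextual')

print(parser.parse("print hello").pretty())
```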
 ### CYK Parser

 A [CYK parser](https://www.wikiwand.com/en/CYK_algorithm) can parse any context-free grammar at O(n^3*|G|).

 It's too slow to be practical for simple grammars, but it offers good performance for highly ambiguous grammars.
-# Other features
+# Extra features

+- Import rules and tokens from other Lark grammars, for code reuse and modularity.
 - Import grammars from Nearley.js

 ### Experimental features
@@ -59,4 +60,3 @@ It's too slow to be practical for simple grammars, but it offers good performance
 - Grammar composition
 - LALR(k) parser
 - Full regexp-collision support using NFAs
-- Automatically produce syntax-highlighters for grammars, for popular IDEs
@@ -109,6 +109,10 @@ four_words: word ~ 4
 All occurrences of the terminal will be ignored, and won't be part of the parse.

+Using the `%ignore` directive results in a cleaner grammar.
+It's especially important for the LALR(1) algorithm, because adding whitespace (or comments, or other extraneous elements) explicitly in the grammar harms its predictive abilities, which are based on a lookahead of 1.

 **Syntax:**
 ```html
 %ignore <TERMINAL>
@@ -122,9 +126,9 @@ COMMENT: "#" /[^\n]/*
 ```
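For context (not part of this diff): a minimal runnable sketch of `%ignore` in action. The grammar is illustrative; the regexp form `%ignore /.../` mirrors the pattern used elsewhere in these docs.

```python
from lark import Lark

# Minimal sketch: ignored terminals are consumed by the lexer but never
# appear in the parse-tree, so the grammar rules stay clean.
parser = Lark(r"""
    start: INT+
    COMMENT: "#" /[^\n]/*
    %import common.INT
    %ignore COMMENT
    %ignore /\s+/
""")

tree = parser.parse("1 2 # a comment\n3")
print(tree.children)  # three INT tokens; whitespace and the comment are gone
```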
 ### %import

-Allows to import terminals from lark grammars.
-Future versions will allow to import rules and macros.
+Allows importing terminals and rules from lark grammars.
+
+When importing rules, all their dependencies will be imported into a namespace, to avoid collisions. It's not possible to override their dependencies (e.g. like you would when inheriting a class).

 **Syntax:**
 ```html
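For context (not part of this diff): a minimal sketch of `%import` pulling named terminals from the `common` grammar module that ships with Lark; the grammar itself is illustrative.

```python
from lark import Lark

# Minimal sketch: NUMBER and WORD are imported from Lark's bundled `common` module
# instead of being redefined here.
parser = Lark(r"""
    start: NUMBER WORD
    %import common.NUMBER
    %import common.WORD
    %ignore " "
""")

print(parser.parse("3 apples").children)  # [Token(NUMBER, '3'), Token(WORD, 'apples')]
```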
@@ -10,7 +10,7 @@ This is the recommended process for working with Lark:
 3. Try your grammar in Lark against each input sample. Make sure the resulting parse-trees make sense.
 4. Use Lark's grammar features to [[shape the tree|Tree Construction]]: Get rid of superfluous rules by inlining them, and use aliases when specific cases need clarification.
 - You can perform steps 1-4 repeatedly, gradually growing your grammar to include more sentences.
@@ -18,39 +18,15 @@ This is the recommended process for working with Lark:
 Of course, some specific use-cases may deviate from this process. Feel free to suggest these cases, and I'll add them to this page.

-## Basic API Usage
-
-For common use, you only need to know 3 classes: Lark, Tree, Transformer ([[Classes Reference]])
-
-Here is some mock usage of them. You can see a real example in the [[examples]]
-
-```python
-from lark import Lark, Transformer
-
-grammar = """start: rules and more rules
-
-    rule1: other rules AND TOKENS
-        | rule1 "+" rule2 -> add
-        | some value [maybe]
-
-    rule2: rule1 "-" (rule2 | "whatever")*
-
-    TOKEN1: "a literal"
-    TOKEN2: TOKEN1 "and literals"
-"""
-
-parser = Lark(grammar)
-
-tree = parser.parse("some input string")
-
-class MyTransformer(Transformer):
-    def rule1(self, matches):
-        return matches[0] + matches[1]
-
-    # I don't have to implement rule2 if I don't feel like it!
-
-new_tree = MyTransformer().transform(tree)
-```
+## Getting started
+
+Browse the [Examples](https://github.com/lark-parser/lark/tree/master/examples) to find a template that suits your purposes.
+
+Read the tutorials to get a better understanding of how everything works. (links in the [main page](/))
+
+Use the [Cheatsheet (PDF)](lark_cheatsheet.pdf) for quick reference.
+
+Use the reference pages for more in-depth explanations. (links in the [main page](/))
 ## LALR usage

@@ -64,7 +40,7 @@ logging.basicConfig(level=logging.DEBUG)
 collision_grammar = '''
 start: as as
 as: a*
-a: 'a'
+a: "a"
 '''
 p = Lark(collision_grammar, parser='lalr', debug=True)
 ```
@@ -36,6 +36,8 @@ $ pip install lark-parser
 * Tutorials
   * [How to write a DSL](http://blog.erezsh.com/how-to-write-a-dsl-in-python-with-lark/) - Implements a toy LOGO-like language with an interpreter
   * [How to write a JSON parser](json_tutorial.md)
+  * External
+    * [Program Synthesis is Possible](https://www.cs.cornell.edu/~asampson/blog/minisynth.html) - Creates a DSL for Z3
 * Guides
   * [How to use Lark](how_to_use.md)
 * Reference

@@ -44,4 +46,5 @@ $ pip install lark-parser
   * [Classes](classes.md)
   * [Cheatsheet (PDF)](lark_cheatsheet.pdf)
 * Discussion
-  * [Forum (Google Groups)](https://groups.google.com/forum/#!forum/lark-parser)
+  * [Gitter](https://gitter.im/lark-parser/Lobby)
+  * [Forum (Google Groups)](https://groups.google.com/forum/#!forum/lark-parser)
@@ -79,7 +79,8 @@ By the way, if you're curious what these terminals signify, they are roughly equ
 Lark will accept this, if you really want to complicate your life :)

-(You can find the original definitions in [common.lark](/lark/grammars/common.lark).)
+You can find the original definitions in [common.lark](/lark/grammars/common.lark).
+They don't strictly adhere to [json.org](https://json.org/) - but our purpose here is to accept json, not validate it.

 Notice that terminals are written in UPPER-CASE, while rules are written in lower-case.
 I'll touch more on the differences between rules and terminals later.
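For context (not part of this diff): a tiny sketch of that UPPER-CASE/lower-case distinction; the names `value` and `STRING` are illustrative.

```python
from lark import Lark

# Terminals (UPPER-CASE) are matched by the lexer as single tokens;
# rules (lower-case) are matched by the parser and become tree branches.
parser = Lark(r"""
    value: STRING
    STRING: /"[^"]*"/
""", start='value')

print(parser.parse('"hello"'))  # Tree('value', [Token('STRING', '"hello"')])
```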
@@ -27,7 +27,7 @@ In accordance with these principles, I arrived at the following design choices:
 ### 1. Separation of code and grammar

 Grammars are the de-facto reference for your language, and for the structure of your parse-tree. For any non-trivial language, the conflation of code and grammar always turns out convoluted and difficult to read.

 The grammars in Lark are EBNF-inspired, so they are especially easy to read & work with.

@@ -45,13 +45,13 @@ And anyway, every parse-tree can be replayed as a state-machine, so there is no
 See this answer in more detail [here](https://github.com/erezsh/lark/issues/4).

-You can skip the building the tree for LALR(1), by providing Lark with a transformer (see the [JSON example](https://github.com/erezsh/lark/blob/master/examples/json_parser.py)).
+To improve performance, you can skip building the tree for LALR(1), by providing Lark with a transformer (see the [JSON example](https://github.com/erezsh/lark/blob/master/examples/json_parser.py)).
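For context (not part of this diff): a minimal sketch of that technique. With `parser='lalr'`, a transformer passed to the constructor is applied as the input is parsed, so no intermediate tree is built; the grammar and class here are illustrative.

```python
from lark import Lark, Transformer

class SumInts(Transformer):
    # Called during parsing; the INT tokens never accumulate into a full tree.
    def start(self, children):
        return sum(int(tok) for tok in children)

parser = Lark(r"""
    start: INT+
    %import common.INT
    %ignore " "
""", parser='lalr', transformer=SumInts())

print(parser.parse("1 2 3"))  # 6
```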
 ### 3. Earley is the default

 The Earley algorithm can accept *any* context-free grammar you throw at it (i.e. any grammar you can write in EBNF, it can parse). That makes it extremely useful for beginners, who are not aware of the strange and arbitrary restrictions that LALR(1) places on its grammars.

 As the users grow to understand the structure of their grammar, the scope of their target language and their performance requirements, they may choose to switch over to LALR(1) to gain a huge performance boost, possibly at the cost of some language features.

 In short, "Premature optimization is the root of all evil."

@@ -60,4 +60,4 @@ In short, "Premature optimization is the root of all evil."
 - Automatically resolve terminal collisions whenever possible
 - Automatically keep track of line & column numbers
@@ -22,6 +22,8 @@ It only works with the standard and contextual lexers.
 from lark import Lark, Token

 def tok_to_int(tok):
+    "Convert the value of `tok` from string to int, while maintaining line number & column."
+    # tok.type == 'INT'
     return Token.new_borrow_pos(tok.type, int(tok), tok)

 parser = Lark("""

@@ -54,7 +56,7 @@ parser = Lark("""
 %import common (INT, WS)
 %ignore COMMENT
 %ignore WS
 """, parser="lalr", lexer_callbacks={'COMMENT': comments.append})

 parser.parse("""
 1 2 3 # hello

@@ -71,4 +73,4 @@ Prints out:
 [Token(COMMENT, '# hello'), Token(COMMENT, '# world')]
 ```

 *Note: We don't have to return a token, because comments are ignored*
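For context (not part of this diff): a sketch of wiring `tok_to_int` in through `lexer_callbacks`, the same mechanism the recipe uses for `COMMENT`; the grammar is illustrative.

```python
from lark import Lark, Token

def tok_to_int(tok):
    "Convert the value of `tok` from string to int, while maintaining line number & column."
    return Token.new_borrow_pos(tok.type, int(tok), tok)

# A callback that returns a token replaces the original token in the stream.
parser = Lark(r"""
    start: INT*
    %import common.INT
    %ignore " "
""", parser='lalr', lexer_callbacks={'INT': tok_to_int})

print(parser.parse("1 2 3").children)  # INT tokens whose values are ints, positions kept
```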
@@ -126,4 +126,4 @@ Lark will parse "hello world" as:
 start
   greet
     planet
@@ -49,7 +49,7 @@ def test():
     res = ParseToDict().transform(tree)

     print('-->')
     print(res)  # prints {'alice': [1, 27, 3], 'bob': [4], 'carrie': [], 'dan': [8, 6]}

 if __name__ == '__main__':
@@ -162,7 +162,7 @@ IMAG_NUMBER: (_INT | FLOAT) ("j"|"J")
 %ignore /[\t \f]+/  // WS
 %ignore /\\[\t \f]*\r?\n/   // LINE_CONT
 %ignore COMMENT
 %declare _INDENT _DEDENT
@@ -86,4 +86,6 @@ class UnexpectedToken(ParseError, UnexpectedInput):
         super(UnexpectedToken, self).__init__(message.encode('utf-8'))

+class VisitError(Exception):
+    pass

 ###}
@@ -42,8 +42,8 @@ class LarkOptions(object):
         cache_grammar - Cache the Lark grammar (Default: False)
         postlex - Lexer post-processing (Default: None) Only works with the standard and contextual lexers.
         start - The start symbol (Default: start)
-        profile - Measure run-time usage in Lark. Read results from the profiler proprety (Default: False)
+        profile - Measure run-time usage in Lark. Read results from the profiler property (Default: False)
         priority - How priorities should be evaluated - auto, none, normal, invert (Default: auto)
         propagate_positions - Propagates [line, column, end_line, end_column] attributes into all tree branches.
         lexer_callbacks - Dictionary of callbacks for the lexer. May alter tokens during lexing. Use with caution.
         maybe_placeholders - Experimental feature. Instead of omitting optional rules (i.e. rule?), replace them with None
@@ -549,7 +549,7 @@ def import_from_grammar_into_namespace(grammar, namespace, aliases):
     imported_terms = dict(grammar.term_defs)
     imported_rules = {n:(n,deepcopy(t),o) for n,t,o in grammar.rule_defs}

     term_defs = []
     rule_defs = []
@@ -1,3 +1,4 @@
+from collections import Counter
 from ..utils import bfs, fzset, classify
 from ..exceptions import GrammarError

@@ -111,7 +112,10 @@ class GrammarAnalyzer(object):
         rules = parser_conf.rules + [Rule(NonTerminal('$root'), [NonTerminal(parser_conf.start), Terminal('$END')])]
         self.rules_by_origin = classify(rules, lambda r: r.origin)

-        assert len(rules) == len(set(rules))
+        if len(rules) != len(set(rules)):
+            duplicates = [item for item, count in Counter(rules).items() if count > 1]
+            raise GrammarError("Rules defined twice: %s" % ', '.join(str(i) for i in duplicates))

         for r in rules:
             for sym in r.expansion:
                 if not (sym.is_term or sym in self.rules_by_origin):
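For context (not part of this diff): a sketch of the user-visible effect. The exact error wording varies, and whether a duplicated rule is caught at grammar load or in this analyzer is an assumption here.

```python
from lark import Lark
from lark.exceptions import GrammarError

# A rule defined twice should now raise a GrammarError instead of
# tripping a bare assert deep inside the analyzer.
try:
    Lark(r"""
        start: "a"
        start: "b"
    """, parser='lalr')
except GrammarError as e:
    print("rejected:", e)
```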
@@ -100,10 +100,10 @@ class Reconstructor:
         for origin, rule_aliases in aliases.items():
             for alias in rule_aliases:
                 yield Rule(origin, [Terminal(alias)], alias=MakeMatchTree(origin.name, [NonTerminal(alias)]))
                 yield Rule(origin, [Terminal(origin.name)], alias=MakeMatchTree(origin.name, [origin]))

     def _match(self, term, token):
@@ -57,12 +57,16 @@ from functools import wraps, partial
 from contextlib import contextmanager

 Str = type(u'')

+try:
+    classtype = types.ClassType  # Python2
+except AttributeError:
+    classtype = type  # Python3

 def smart_decorator(f, create_decorator):
     if isinstance(f, types.FunctionType):
         return wraps(f)(create_decorator(f, True))

-    elif isinstance(f, (type, types.BuiltinFunctionType)):
+    elif isinstance(f, (classtype, type, types.BuiltinFunctionType)):
         return wraps(f)(create_decorator(f, False))

     elif isinstance(f, types.MethodType):
@@ -2,6 +2,7 @@ from functools import wraps
 from .utils import smart_decorator
 from .tree import Tree
+from .exceptions import VisitError, GrammarError

 ###{standalone
 from inspect import getmembers, getmro

@@ -28,16 +29,21 @@ class Transformer:
             except AttributeError:
                 return self.__default__(tree.data, children, tree.meta)
             else:
-                if getattr(f, 'meta', False):
-                    return f(children, tree.meta)
-                elif getattr(f, 'inline', False):
-                    return f(*children)
-                elif getattr(f, 'whole_tree', False):
-                    if new_children is not None:
-                        raise NotImplementedError("Doesn't work with the base Transformer class")
-                    return f(tree)
-                else:
-                    return f(children)
+                try:
+                    if getattr(f, 'meta', False):
+                        return f(children, tree.meta)
+                    elif getattr(f, 'inline', False):
+                        return f(*children)
+                    elif getattr(f, 'whole_tree', False):
+                        if new_children is not None:
+                            raise NotImplementedError("Doesn't work with the base Transformer class")
+                        return f(tree)
+                    else:
+                        return f(children)
+                except GrammarError:
+                    raise
+                except Exception as e:
+                    raise VisitError('Error trying to process rule "%s":\n\n%s' % (tree.data, e))

     def _transform_children(self, children):
         for c in children:
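For context (not part of this diff): a sketch of what the new wrapping means for users; the failing transformer is contrived.

```python
from lark import Lark, Transformer
from lark.exceptions import VisitError

class Broken(Transformer):
    def start(self, children):
        raise KeyError("oops")  # any error inside a rule handler...

tree = Lark(r"""
    start: WORD
    %import common.WORD
""").parse("hello")

try:
    Broken().transform(tree)
except VisitError as e:
    print(e)  # ...now surfaces as a VisitError naming the offending rule ("start")
```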
@@ -1,3 +1,3 @@
 %import common.NUMBER
 %import common.WORD
 %import common.WS
@@ -1311,8 +1311,8 @@ def _make_parser_test(LEXER, PARSER):
         self.assertEqual(p.parse("bb").children, [None, 'b', None, None, 'b', None])
         self.assertEqual(p.parse("abbc").children, ['a', 'b', None, None, 'b', 'c'])
         self.assertEqual(p.parse("babbcabcb").children,
             [None, 'b', None,
              'a', 'b', None,
              None, 'b', 'c',
              'a', 'b', 'c',
              None, 'b', None])
@@ -21,4 +21,4 @@ recreate=True
 commands=
     git submodule sync -q
     git submodule update --init
-    python -m tests
+    python -m tests {posargs}