@@ -1,18 +1,18 @@
# Lark - a modern parsing library for Python
# Lark - a parsing toolkit for Python
Lark is a parser built with a focus on ergonomics, performance and resilience.
Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
Lark can parse all context-free languages. That means it is capable of parsing almost any programming language out there, and to some degree most natural languages too.
Lark can parse all context-free languages. To put it simply, it means that it is capable of parsing almost any programming language out there, and to some degree most natural languages too.
**Who is it for?**
- **Beginners**: Lark is very friendly for experimentation. It can parse any grammar you throw at it, no matter how complicated or ambiguous, and do so efficiently. It also constructs an annotated parse-tree for you, using only the grammar, and it gives you convenient and flexible tools to process that parse-tree.
- **Beginners**: Lark is very friendly for experimentation. It can parse any grammar you throw at it, no matter how complicated or ambiguous, and do so efficiently. It also constructs an annotated parse-tree for you, using only the grammar and an input, and it gives you convenient and flexible tools to process that parse-tree.
- **Experts**: Lark implements both Earley(SPPF) and LALR(1), and several different lexers, so you can trade-off power and speed, according to your requirements. It also provides a variety of sophisticated features and utilities.
**What can it do?**
- Parse all context-free grammars, and handle any ambiguity
- Parse all context-free grammars, and handle any ambiguity gracefully
- Build an annotated parse-tree automagically, no construction code required.
- Provide first-rate performance in terms of both Big-O complexity and measured run-time (considering that this is Python ;)
- Run on every Python interpreter (it's pure-python)
@@ -33,7 +33,7 @@ Most importantly, Lark will save you time and prevent you from getting parsing h
### Install Lark
$ pip install lark-parser
$ pip install lark-parser --upgrade
Lark has no dependencies.
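As a quick check that the install worked, here is a minimal sketch in the spirit of the README's hello-world example (the grammar is illustrative, not part of this change):

```python
from lark import Lark

# Build a parser from an inline grammar and print the resulting parse-tree.
parser = Lark(r"""
    start: WORD "," WORD "!"

    %import common.WORD   // import WORD from the standard terminal library
    %ignore " "           // ignore spaces between tokens
""")

print(parser.parse("Hello, World!").pretty())
```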
@@ -77,12 +77,11 @@ Notice punctuation doesn't appear in the resulting tree. It's automatically filt
### Fruit flies like bananas
Lark is great at handling ambiguity. Let's parse the phrase "fruit flies like bananas":
Lark is great at handling ambiguity. Here is the result of parsing the phrase "fruit flies like bananas":

See more [examples here](https://github.com/lark-parser/lark/tree/master/examples)
See the code and more [examples here](https://github.com/lark-parser/lark/tree/master/examples)
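For reference, a hedged sketch of how the ambiguous parse above is produced, along the lines of the fruitflies example in the linked directory:

```python
from lark import Lark

# An intentionally ambiguous toy grammar: with ambiguity='explicit', the
# Earley parser returns every derivation under an '_ambig' node instead
# of picking just one.
grammar = r"""
    sentence: noun verb noun        -> simple
            | noun verb "like" noun -> comparative

    noun: adj? NOUN
    verb: VERB
    adj: ADJ

    NOUN: "flies" | "bananas" | "fruit"
    VERB: "like" | "flies"
    ADJ: "fruit"

    %import common.WS
    %ignore WS
"""

parser = Lark(grammar, start='sentence', ambiguity='explicit')
print(parser.parse('fruit flies like bananas').pretty())
```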
## List of main features
@@ -100,7 +99,7 @@ See more [examples here](https://github.com/lark-parser/lark/tree/master/example
- **Python 2 & 3** compatible
- Automatic line & column tracking
- Standard library of terminals (strings, numbers, names, etc.)
- Import grammars from Nearley.js
- Import grammars from Nearley.js ([read more](/docs/nearley.md))
- Extensive test suite [](https://codecov.io/gh/erezsh/lark)
- MyPy support using type stubs
- And much more!
@@ -159,25 +158,6 @@ Check out the [JSON tutorial](/docs/json_tutorial.md#conclusion) for more detail
Using Lark? Send me a message and I'll add your project!
### How to use Nearley grammars in Lark
Lark comes with a tool to convert grammars from [Nearley](https://github.com/Hardmath123/nearley), a popular Earley library for Javascript. It uses [Js2Py](https://github.com/PiotrDabkowski/Js2Py) to convert and run the Javascript postprocessing code segments.
Here's an example:
```bash
git clone https://github.com/Hardmath123/nearley
python -m lark.tools.nearley nearley/examples/calculator/arithmetic.ne main nearley > ncalc.py
```
You can use the output as a regular python module:
```python
>>> import ncalc
>>> ncalc.parse('sin(pi/4) ^ e')
0.38981434460254655
```
## License
Lark uses the [MIT license](LICENSE).
@@ -19,9 +19,9 @@
[Read more about the parsers](parsers.md)
# Extra features
- Import rules and tokens from other Lark grammars, for code reuse and modularity.
- Import grammars from Nearley.js
- Support for external regex module ([see here](classes.md#using-unicode-character-classes-with-regex))
- Import grammars from Nearley.js ([read more](nearley.md))
- CYK parser
### Experimental features
@@ -112,6 +112,19 @@ Terminals can be assigned priority only when using a lexer (future versions may
Priority can be either positive or negative. If not specified for a terminal, it defaults to 1.
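For example, a small sketch of the declaration syntax (`TERMINAL.priority`), assuming the standard lexer:

```perl
// "if" matches IF rather than NAME, because 2 beats NAME's default priority of 1
IF.2: "if"
NAME: /[a-z]+/
```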
### Regexp Flags
You can use flags on regexps and strings. For example:
```perl
SELECT: "select"i //# Will ignore case, and match SELECT or Select, etc.
MULTILINE_TEXT: /.+/s
```
Supported flags are one of: `imslu`. See Python's regex documentation for more details on each one.
Regexps/strings with different flags can only be concatenated in Python 3.6+
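A short usage sketch of the `i` flag from above (illustrative only):

```python
from lark import Lark

# The `i` flag makes the SELECT terminal case-insensitive.
parser = Lark(r"""
    start: SELECT
    SELECT: "select"i
""")

print(parser.parse("SeLeCt"))  # accepted despite the mixed case
```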
#### Notes for when using a lexer:
When using a lexer (standard or contextual), it is the grammar-author's responsibility to make sure the literals don't collide, or that if they do, they are matched in the desired order. Literals are matched according to the following precedence:
@@ -32,7 +32,7 @@ $ pip install lark-parser
* [Philosophy & Design Choices](philosophy.md)
* [Full List of Features](features.md)
* [Features](features.md)
* [Examples](https://github.com/lark-parser/lark/tree/master/examples)
* [Online IDE](https://lark-parser.github.io/lark/ide/app.html)
* Tutorials
@@ -49,6 +49,7 @@ $ pip install lark-parser
* [Visitors & Transformers](visitors.md)
* [Classes](classes.md)
* [Cheatsheet (PDF)](lark_cheatsheet.pdf)
* [Importing grammars from Nearley](nearley.md)
* Discussion
* [Gitter](https://gitter.im/lark-parser/Lobby)
* [Forum (Google Groups)](https://groups.google.com/forum/#!forum/lark-parser)
@@ -0,0 +1,47 @@
# Importing grammars from Nearley
Lark comes with a tool to convert grammars from [Nearley](https://github.com/Hardmath123/nearley), a popular Earley library for Javascript. It uses [Js2Py](https://github.com/PiotrDabkowski/Js2Py) to convert and run the Javascript postprocessing code segments.
## Requirements
1. Install Lark with the `nearley` component:
```bash
pip install lark-parser[nearley]
```
2. Acquire a copy of the nearley codebase. This can be done using:
```bash
git clone https://github.com/Hardmath123/nearley
```
## Usage
Here's an example of how to import nearley's calculator example into Lark:
```bash
git clone https://github.com/Hardmath123/nearley
python -m lark.tools.nearley nearley/examples/calculator/arithmetic.ne main nearley > ncalc.py
```
You can use the output as a regular python module:
```python
>>> import ncalc
>>> ncalc.parse('sin(pi/4) ^ e')
0.38981434460254655
```
The Nearley converter also supports an experimental converter for newer JavaScript (ES6+), using the `--es6` flag:
```bash
git clone https://github.com/Hardmath123/nearley
python -m lark.tools.nearley nearley/examples/calculator/arithmetic.ne main nearley --es6 > ncalc.py
```
## Notes
- Lark currently cannot import templates from Nearley
- Lark currently cannot export grammars to Nearley
These might get added in the future, if enough users ask for them.
@@ -13,7 +13,7 @@ It's possible to bypass the dynamic lexing, and use the regular Earley parser wi
Lark implements the Shared Packed Parse Forest data-structure for the Earley parser, in order to reduce the space and computation required to handle ambiguous grammars.
You can read more about SPPF [here](http://www.bramvandersanden.com/post/2014/06/shared-packed-parse-forest/)
You can read more about SPPF [here](https://web.archive.org/web/20191229100607/www.bramvandersanden.com/post/2014/06/shared-packed-parse-forest)
As a result, Lark can efficiently parse and store every ambiguity in the grammar, when using Earley.
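A brief sketch of what that looks like from the API, using an assumed toy grammar; with `ambiguity='explicit'`, every derivation is kept under an `_ambig` node:

```python
from lark import Lark

# '1+2+3' has two derivations: (1+2)+3 and 1+(2+3). Earley stores both
# in the SPPF, and ambiguity='explicit' exposes them in the tree.
parser = Lark(r"""
    start: expr
    ?expr: expr "+" expr
         | NUMBER
    %import common.NUMBER
""", parser='earley', ambiguity='explicit')

print(parser.parse("1+2+3").pretty())
```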
@@ -81,7 +81,7 @@ class UnexpectedInput(LarkError):
class UnexpectedCharacters(LexError, UnexpectedInput):
def __init__(self, seq, lex_pos, line, column, allowed=None, considered_tokens=None, state=None, token_history=None):
if isinstance(seq, bytes):
message = "No terminal defined for '%s' at line %d col %d" % (seq[lex_pos:lex_pos+1].decode("ascii", "backslashreplace"), line, column)
else:
@@ -85,7 +85,7 @@ TERMINALS = {
'RULE': '!?[_?]?[a-z][_a-z0-9]*',
'TERMINAL': '_?[A-Z][_A-Z0-9]*',
'STRING': r'"(\\"|\\\\|[^"\n])*?"i?',
'REGEXP': r'/(?!/)(\\/|\\\\|[^/\n])*?/[%s]*' % _RE_FLAGS,
'REGEXP': r'/(?!/)(\\/|\\\\|[^/])*?/[%s]*' % _RE_FLAGS,
'_NL': r'(\r?\n)+\s*',
'WS': r'[ \t]+',
'COMMENT': r'\s*//[^\n]*',
@@ -307,6 +307,7 @@ class PrepareAnonTerminals(Transformer_InPlace):
self.term_set = {td.name for td in self.terminals}
self.term_reverse = {td.pattern: td for td in terminals}
self.i = 0
self.rule_options = None
@inline_args
@@ -335,7 +336,7 @@ class PrepareAnonTerminals(Transformer_InPlace):
term_name = None
elif isinstance(p, PatternRE):
if p in self.term_reverse: # Kind of a wierd placement.name
if p in self.term_reverse: # Kind of a weird placement.name
term_name = self.term_reverse[p].name
else:
assert False, p
@@ -351,7 +352,10 @@ class PrepareAnonTerminals(Transformer_InPlace):
self.term_reverse[p] = termdef
self.terminals.append(termdef)
return Terminal(term_name, filter_out=isinstance(p, PatternStr))
filter_out = False if self.rule_options and self.rule_options.keep_all_tokens else isinstance(p, PatternStr)
return Terminal(term_name, filter_out=filter_out)
class _ReplaceSymbols(Transformer_InPlace):
" Helper for ApplyTemplates "
@@ -405,6 +409,13 @@ def _literal_to_pattern(literal):
flags = v[flag_start:]
assert all(f in _RE_FLAGS for f in flags), flags
if literal.type == 'STRING' and '\n' in v:
raise GrammarError('You cannot put newlines in string literals')
if literal.type == 'REGEXP' and '\n' in v and 'x' not in flags:
raise GrammarError('You can only use newlines in regular expressions '
'with the `x` (verbose) flag')
v = v[:flag_start]
assert v[0] == v[-1] and v[0] in '"/'
x = v[1:-1]
@@ -413,9 +424,11 @@ def _literal_to_pattern(literal):
if literal.type == 'STRING':
s = s.replace('\\\\', '\\')
return { 'STRING': PatternStr,
'REGEXP': PatternRE }[literal.type](s, flags)
return PatternStr(s, flags)
elif literal.type == 'REGEXP':
return PatternRE(s, flags)
else:
assert False, 'Invariant failed: literal.type not in ["STRING", "REGEXP"]'
@inline_args
@@ -541,7 +554,8 @@ class Grammar:
# =================
# 1. Pre-process terminals
transformer = PrepareLiterals() * PrepareSymbols() * PrepareAnonTerminals(terminals) # Adds to terminals
anon_tokens_transf = PrepareAnonTerminals(terminals)
transformer = PrepareLiterals() * PrepareSymbols() * anon_tokens_transf # Adds to terminals
# 2. Inline Templates
@@ -556,8 +570,10 @@ class Grammar:
i += 1
if len(params) != 0: # Dont transform templates
continue
ebnf_to_bnf.rule_options = RuleOptions(keep_all_tokens=True) if options.keep_all_tokens else None
rule_options = RuleOptions(keep_all_tokens=True) if options and options.keep_all_tokens else None
ebnf_to_bnf.rule_options = rule_options
ebnf_to_bnf.prefix = name
anon_tokens_transf.rule_options = rule_options
tree = transformer.transform(rule_tree)
res = ebnf_to_bnf.transform(tree)
rules.append((name, res, options))
@@ -834,7 +850,7 @@ class GrammarLoader:
if len(stmt.children) > 1:
path_node, arg1 = stmt.children
else:
path_node, = stmt.children
path_node ,= stmt.children
arg1 = None
if isinstance(arg1, Tree): # Multi import
@@ -86,6 +86,14 @@ def best_from_group(seq, group_key, cmp_key):
d[key] = item
return list(d.values())
def make_recons_rule(origin, expansion, old_expansion):
return Rule(origin, expansion, alias=MakeMatchTree(origin.name, old_expansion))
def make_recons_rule_to_term(origin, term):
return make_recons_rule(origin, [Terminal(term.name)], [term])
class Reconstructor:
"""
A Reconstructor that will, given a full parse Tree, generate source code.
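For context, a hedged sketch of typical Reconstructor usage, mirroring the test suite; the grammar is made up:

```python
from lark import Lark
from lark.reconstruct import Reconstructor

# Parse some input, then regenerate matching source text from the tree.
# Ignored whitespace is not stored in the tree, so it is not reproduced.
parser = Lark(r"""
    start: WORD ("," WORD)*
    %import common.WORD
    %ignore " "
""", maybe_placeholders=False)

tree = parser.parse("hello, world")
print(Reconstructor(parser).reconstruct(tree))  # -> hello,world
```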
@@ -100,6 +108,8 @@ class Reconstructor:
tokens, rules, _grammar_extra = parser.grammar.compile(parser.options.start)
self.write_tokens = WriteTokensTransformer({t.name:t for t in tokens}, term_subs)
self.rules_for_root = defaultdict(list)
self.rules = list(self._build_recons_rules(rules))
self.rules.reverse()
@@ -107,9 +117,8 @@ class Reconstructor:
self.rules = best_from_group(self.rules, lambda r: r, lambda r: -len(r.expansion))
self.rules.sort(key=lambda r: len(r.expansion))
callbacks = {rule: rule.alias for rule in self.rules} # TODO pass callbacks through dict, instead of alias?
self.parser = earley.Parser(ParserConf(self.rules, callbacks, parser.options.start),
self._match, resolve_ambiguity=True)
self.parser = parser
self._parser_cache = {}
def _build_recons_rules(self, rules):
expand1s = {r.origin for r in rules if r.options.expand1}
@@ -121,24 +130,36 @@ class Reconstructor:
rule_names = {r.origin for r in rules}
nonterminals = {sym for sym in rule_names
if sym.name.startswith('_') or sym in expand1s or sym in aliases }
if sym.name.startswith('_') or sym in expand1s or sym in aliases }
seen = set()
for r in rules:
recons_exp = [sym if sym in nonterminals else Terminal(sym.name)
for sym in r.expansion if not is_discarded_terminal(sym)]
# Skip self-recursive constructs
if recons_exp == [r.origin]:
if recons_exp == [r.origin] and r.alias is None:
continue
sym = NonTerminal(r.alias) if r.alias else r.origin
rule = make_recons_rule(sym, recons_exp, r.expansion)
yield Rule(sym, recons_exp, alias=MakeMatchTree(sym.name, r.expansion))
if sym in expand1s and len(recons_exp) != 1:
self.rules_for_root[sym.name].append(rule)
if sym.name not in seen:
yield make_recons_rule_to_term(sym, sym)
seen.add(sym.name)
else:
if sym.name.startswith('_') or sym in expand1s:
yield rule
else:
self.rules_for_root[sym.name].append(rule)
for origin, rule_aliases in aliases.items():
for alias in rule_aliases:
yield Rule(origin, [Terminal(alias)], alias=MakeMatchTree(origin.name, [NonTerminal(alias)]))
yield Rule(origin, [Terminal(origin.name)], alias=MakeMatchTree(origin.name, [origin]))
yield make_recons_rule_to_term(origin, NonTerminal(alias))
yield make_recons_rule_to_term(origin, origin)
def _match(self, term, token):
if isinstance(token, Tree):
@@ -149,7 +170,20 @@ class Reconstructor:
def _reconstruct(self, tree):
# TODO: ambiguity?
unreduced_tree = self.parser.parse(tree.children, tree.data) # find a full derivation
try:
parser = self._parser_cache[tree.data]
except KeyError:
rules = self.rules + best_from_group(
self.rules_for_root[tree.data], lambda r: r, lambda r: -len(r.expansion)
)
rules.sort(key=lambda r: len(r.expansion))
callbacks = {rule: rule.alias for rule in rules} # TODO pass callbacks through dict, instead of alias?
parser = earley.Parser(ParserConf(rules, callbacks, [tree.data]), self._match, resolve_ambiguity=True)
self._parser_cache[tree.data] = parser
unreduced_tree = parser.parse(tree.children, tree.data) # find a full derivation
assert unreduced_tree.data == tree.data
res = self.write_tokens.transform(unreduced_tree)
for item in res:
@@ -1,8 +1,9 @@
"Converts between Lark and Nearley grammars. Work in progress!"
"Converts Nearley grammars to Lark"
import os.path
import sys
import codecs
import argparse
from lark import Lark, InlineTransformer
@@ -137,7 +138,7 @@ def _nearley_to_lark(g, builtin_path, n2l, js_code, folder_path, includes):
return rule_defs
def create_code_for_nearley_grammar(g, start, builtin_path, folder_path):
def create_code_for_nearley_grammar(g, start, builtin_path, folder_path, es6=False):
import js2py
emit_code = []
@@ -160,7 +161,10 @@ def create_code_for_nearley_grammar(g, start, builtin_path, folder_path):
for alias, code in n2l.alias_js_code.items():
js_code.append('%s = (%s);' % (alias, code))
emit(js2py.translate_js('\n'.join(js_code)))
if es6:
emit(js2py.translate_js6('\n'.join(js_code)))
else:
emit(js2py.translate_js('\n'.join(js_code)))
emit('class TransformNearley(Transformer):')
for alias in n2l.alias_js_code:
emit(" %s = var.get('%s').to_python()" % (alias, alias))
@@ -173,18 +177,20 @@ def create_code_for_nearley_grammar(g, start, builtin_path, folder_path):
return ''.join(emit_code)
def main(fn, start, nearley_lib):
def main(fn, start, nearley_lib, es6=False):
with codecs.open(fn, encoding='utf8') as f:
grammar = f.read()
return create_code_for_nearley_grammar(grammar, start, os.path.join(nearley_lib, 'builtin'), os.path.abspath(os.path.dirname(fn)))
return create_code_for_nearley_grammar(grammar, start, os.path.join(nearley_lib, 'builtin'), os.path.abspath(os.path.dirname(fn)), es6=es6)
def get_arg_parser():
parser = argparse.ArgumentParser('Reads Nearley grammar (with js functions) outputs an equivalent lark parser.')
parser.add_argument('nearley_grammar', help='Path to the file containing the nearley grammar')
parser.add_argument('start_rule', help='Rule within the nearley grammar to make the base rule')
parser.add_argument('nearley_lib', help='Path to root directory of nearley codebase (used for including builtins)')
parser.add_argument('--es6', help='Enable experimental ES6 support', action='store_true')
return parser
if __name__ == '__main__':
if len(sys.argv) < 4:
print("Reads Nearley grammar (with js functions) outputs an equivalent lark parser.")
print("Usage: %s <nearley_grammar_path> <start_rule> <nearley_lib_path>" % sys.argv[0])
sys.exit(1)
fn, start, nearley_lib = sys.argv[1:]
print(main(fn, start, nearley_lib))
parser = get_arg_parser()
args = parser.parse_args()
print(main(fn=args.nearley_grammar, start=args.start_rule, nearley_lib=args.nearley_lib, es6=args.es6))
@@ -14,6 +14,8 @@ class Discard(Exception):
# Transformers
class _Decoratable:
"Provides support for decorating methods with @v_args"
@classmethod
def _apply_decorator(cls, decorator, **kwargs):
mro = getmro(cls)
@@ -12,3 +12,5 @@ pages:
- Visitors and Transformers: visitors.md
- Classes Reference: classes.md
- Recipes: recipes.md
- Import grammars from Nearley: nearley.md
- Tutorial - JSON Parser: json_tutorial.md
@@ -15,7 +15,8 @@ setup(
install_requires = [],
extras_require = {
"regex": ["regex"]
"regex": ["regex"],
"nearley": ["js2py"]
},
package_data = {'': ['*.md', '*.lark'], 'lark-stubs': ['*.pyi']},
@@ -721,7 +721,8 @@ def _make_parser_test(LEXER, PARSER):
""")
g.parse('\x01\x02\x03')
@unittest.skipIf(sys.version_info[:2]==(2, 7), "bytes parser isn't perfect in Python2.7, exceptions don't work correctly")
@unittest.skipIf(sys.version_info[0]==2 or sys.version_info[:2]==(3, 4),
"bytes parser isn't perfect in Python2, exceptions don't work correctly")
def test_bytes_utf8(self):
g = r"""
start: BOM? char+
@@ -1261,6 +1262,32 @@ def _make_parser_test(LEXER, PARSER):
tree = l.parse('aA')
self.assertEqual(tree.children, ['a', 'A'])
def test_token_flags_verbose(self):
g = _Lark(r"""start: NL | ABC
ABC: / [a-z] /x
NL: /\n/
""")
x = g.parse('a')
self.assertEqual(x.children, ['a'])
def test_token_flags_verbose_multiline(self):
g = _Lark(r"""start: ABC
ABC: / a b c
d
e f
/x
""")
x = g.parse('abcdef')
self.assertEqual(x.children, ['abcdef'])
def test_token_multiline_only_works_with_x_flag(self):
g = r"""start: ABC
ABC: / a b c
d
e f
/i
"""
self.assertRaises( GrammarError, _Lark, g)
@unittest.skipIf(PARSER == 'cyk', "No empty rules")
def test_twice_empty(self):
@@ -69,6 +69,35 @@ class TestReconstructor(TestCase):
self.assert_reconstruct(g, code)
def test_keep_tokens(self):
g = """
start: (NL | stmt)*
stmt: var op var
!op: ("+" | "-" | "*" | "/")
var: WORD
NL: /(\\r?\\n)+\s*/
""" + common
code = """
a+b
"""
self.assert_reconstruct(g, code)
def test_expand_rule(self):
g = """
?start: (NL | mult_stmt)*
?mult_stmt: sum_stmt ["*" sum_stmt]
?sum_stmt: var ["+" var]
var: WORD
NL: /(\\r?\\n)+\s*/
""" + common
code = ['a', 'a*b', 'a+b', 'a*b+c', 'a+b*c', 'a+b*c+d']
for c in code:
self.assert_reconstruct(g, c)
def test_json_example(self):
test_json = '''
{ | |||