
Merge branch 'master' of https://github.com/lark-parser/lark into error-handling

MegaIng1 · 4 years ago · commit 56083b0dbb
15 changed files with 225 additions and 67 deletions
1. README.md (+9 −29)
2. docs/features.md (+2 −2)
3. docs/grammar.md (+13 −0)
4. docs/index.md (+2 −1)
5. docs/nearley.md (+47 −0)
6. docs/parsers.md (+1 −1)
7. lark/exceptions.py (+1 −1)
8. lark/load_grammar.py (+25 −9)
9. lark/reconstruct.py (+43 −9)
10. lark/tools/nearley.py (+19 −13)
11. lark/visitors.py (+2 −0)
12. mkdocs.yml (+2 −0)
13. setup.py (+2 −1)
14. tests/test_parser.py (+28 −1)
15. tests/test_reconstructor.py (+29 −0)

README.md (+9 −29)

@@ -1,18 +1,18 @@
-# Lark - a modern parsing library for Python
+# Lark - a parsing toolkit for Python

-Lark is a parser built with a focus on ergonomics, performance and resilience.
+Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.

-Lark can parse all context-free languages. That means it is capable of parsing almost any programming language out there, and to some degree most natural languages too.
+Lark can parse all context-free languages. To put it simply, it means that it is capable of parsing almost any programming language out there, and to some degree most natural languages too.

**Who is it for?**

-- **Beginners**: Lark is very friendly for experimentation. It can parse any grammar you throw at it, no matter how complicated or ambiguous, and do so efficiently. It also constructs an annotated parse-tree for you, using only the grammar, and it gives you convenient and flexible tools to process that parse-tree.
+- **Beginners**: Lark is very friendly for experimentation. It can parse any grammar you throw at it, no matter how complicated or ambiguous, and do so efficiently. It also constructs an annotated parse-tree for you, using only the grammar and an input, and it gives you convenient and flexible tools to process that parse-tree.

- **Experts**: Lark implements both Earley(SPPF) and LALR(1), and several different lexers, so you can trade-off power and speed, according to your requirements. It also provides a variety of sophisticated features and utilities.

**What can it do?**

-- Parse all context-free grammars, and handle any ambiguity
+- Parse all context-free grammars, and handle any ambiguity gracefully
- Build an annotated parse-tree automagically, no construction code required.
- Provide first-rate performance in terms of both Big-O complexity and measured run-time (considering that this is Python ;)
- Run on every Python interpreter (it's pure-python)
@@ -33,7 +33,7 @@ Most importantly, Lark will save you time and prevent you from getting parsing h

### Install Lark

-    $ pip install lark-parser
+    $ pip install lark-parser --upgrade

Lark has no dependencies.

@@ -77,12 +77,11 @@ Notice punctuation doesn't appear in the resulting tree. It's automatically filt

### Fruit flies like bananas

-Lark is great at handling ambiguity. Let's parse the phrase "fruit flies like bananas":
+Lark is great at handling ambiguity. Here is the result of parsing the phrase "fruit flies like bananas":

![fruitflies.png](examples/fruitflies.png)

-See more [examples here](https://github.com/lark-parser/lark/tree/master/examples)
-
+See the code and more [examples here](https://github.com/lark-parser/lark/tree/master/examples)


## List of main features
@@ -100,7 +99,7 @@ See more [examples here](https://github.com/lark-parser/lark/tree/master/example
- **Python 2 & 3** compatible
- Automatic line & column tracking
- Standard library of terminals (strings, numbers, names, etc.)
-- Import grammars from Nearley.js
+- Import grammars from Nearley.js ([read more](/docs/nearley.md))
- Extensive test suite [![codecov](https://codecov.io/gh/erezsh/lark/branch/master/graph/badge.svg)](https://codecov.io/gh/erezsh/lark)
- MyPy support using type stubs
- And much more!
@@ -159,25 +158,6 @@ Check out the [JSON tutorial](/docs/json_tutorial.md#conclusion) for more detail

Using Lark? Send me a message and I'll add your project!

-### How to use Nearley grammars in Lark
-
-Lark comes with a tool to convert grammars from [Nearley](https://github.com/Hardmath123/nearley), a popular Earley library for Javascript. It uses [Js2Py](https://github.com/PiotrDabkowski/Js2Py) to convert and run the Javascript postprocessing code segments.
-
-Here's an example:
-```bash
-git clone https://github.com/Hardmath123/nearley
-python -m lark.tools.nearley nearley/examples/calculator/arithmetic.ne main nearley > ncalc.py
-```
-
-You can use the output as a regular python module:
-
-```python
->>> import ncalc
->>> ncalc.parse('sin(pi/4) ^ e')
-0.38981434460254655
-```
-
-
## License

Lark uses the [MIT license](LICENSE).


docs/features.md (+2 −2)

@@ -19,9 +19,9 @@
[Read more about the parsers](parsers.md)

# Extra features

- Import rules and tokens from other Lark grammars, for code reuse and modularity.
-- Import grammars from Nearley.js
 - Support for external regex module ([see here](classes.md#using-unicode-character-classes-with-regex))
+- Import grammars from Nearley.js ([read more](nearley.md))
- CYK parser

### Experimental features


docs/grammar.md (+13 −0)

@@ -112,6 +112,19 @@ Terminals can be assigned priority only when using a lexer (future versions may

Priority can be either positive or negative. If not specified for a terminal, it defaults to 1.

+### Regexp Flags
+
+You can use flags on regexps and strings. For example:
+
+```perl
+SELECT: "select"i //# Will ignore case, and match SELECT or Select, etc.
+MULTILINE_TEXT: /.+/s
+```
+
+Supported flags are one of: `imslu`. See Python's regex documentation for more details on each one.
+
+Regexps/strings of different flags can only be concatenated in Python 3.6+

#### Notes for when using a lexer:

When using a lexer (standard or contextual), it is the grammar-author's responsibility to make sure the literals don't collide, or that if they do, they are matched in the desired order. Literals are matched according to the following precedence:
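As a quick illustration of the new flags section, here is a minimal runnable sketch of the `i` flag (the grammar is illustrative only, not taken from this commit):

```python
from lark import Lark  # pip install lark-parser

# Illustrative only: the `i` flag makes a string terminal case-insensitive.
parser = Lark(r'''
    start: SELECT WORD
    SELECT: "select"i
    WORD: /[a-z]+/
    %ignore " "
''')

print(parser.parse("SELECT foo"))  # matches despite the upper-case keyword
print(parser.parse("Select foo"))
```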


docs/index.md (+2 −1)

@@ -32,7 +32,7 @@ $ pip install lark-parser


* [Philosophy & Design Choices](philosophy.md)
-* [Full List of Features](features.md)
+* [Features](features.md)
* [Examples](https://github.com/lark-parser/lark/tree/master/examples)
* [Online IDE](https://lark-parser.github.io/lark/ide/app.html)
* Tutorials
@@ -49,6 +49,7 @@ $ pip install lark-parser
* [Visitors & Transformers](visitors.md)
* [Classes](classes.md)
* [Cheatsheet (PDF)](lark_cheatsheet.pdf)
+* [Importing grammars from Nearley](nearley.md)
* Discussion
* [Gitter](https://gitter.im/lark-parser/Lobby)
* [Forum (Google Groups)](https://groups.google.com/forum/#!forum/lark-parser)

docs/nearley.md (+47 −0, new file)

@@ -0,0 +1,47 @@
# Importing grammars from Nearley

Lark comes with a tool to convert grammars from [Nearley](https://github.com/Hardmath123/nearley), a popular Earley library for Javascript. It uses [Js2Py](https://github.com/PiotrDabkowski/Js2Py) to convert and run the Javascript postprocessing code segments.

## Requirements

1. Install Lark with the `nearley` component:
```bash
pip install lark-parser[nearley]
```

2. Acquire a copy of the nearley codebase. This can be done using:
```bash
git clone https://github.com/Hardmath123/nearley
```

## Usage

Here's an example of how to import nearley's calculator example into Lark:

```bash
git clone https://github.com/Hardmath123/nearley
python -m lark.tools.nearley nearley/examples/calculator/arithmetic.ne main nearley > ncalc.py
```

You can use the output as a regular python module:

```python
>>> import ncalc
>>> ncalc.parse('sin(pi/4) ^ e')
0.38981434460254655
```

The Nearley converter also supports an experimental converter for newer JavaScript (ES6+), using the `--es6` flag:

```bash
git clone https://github.com/Hardmath123/nearley
python -m lark.tools.nearley nearley/examples/calculator/arithmetic.ne main nearley --es6 > ncalc.py
```

## Notes

- Lark currently cannot import templates from Nearley

- Lark currently cannot export grammars to Nearley

These might get added in the future, if enough users ask for them.

docs/parsers.md (+1 −1)

@@ -13,7 +13,7 @@ It's possible to bypass the dynamic lexing, and use the regular Earley parser wi

Lark implements the Shared Packed Parse Forest data-structure for the Earley parser, in order to reduce the space and computation required to handle ambiguous grammars.

-You can read more about SPPF [here](http://www.bramvandersanden.com/post/2014/06/shared-packed-parse-forest/)
+You can read more about SPPF [here](https://web.archive.org/web/20191229100607/www.bramvandersanden.com/post/2014/06/shared-packed-parse-forest)

As a result, Lark can efficiently parse and store every ambiguity in the grammar, when using Earley.
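For context, a minimal sketch of what storing every ambiguity looks like from the API (illustrative grammar, not part of this commit; `ambiguity='explicit'` asks the Earley parser to keep all derivations):

```python
from lark import Lark

# Illustrative only: with ambiguity='explicit', every valid derivation is
# kept in the tree under an '_ambig' node instead of being resolved away.
parser = Lark('''
    start: ab
    ab: "a" "b"
      | a "b"
    a: "a"
''', ambiguity='explicit')

print(parser.parse("ab").pretty())
```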



lark/exceptions.py (+1 −1)

@@ -81,7 +81,7 @@ class UnexpectedInput(LarkError):

class UnexpectedCharacters(LexError, UnexpectedInput):
def __init__(self, seq, lex_pos, line, column, allowed=None, considered_tokens=None, state=None, token_history=None):
if isinstance(seq, bytes):
message = "No terminal defined for '%s' at line %d col %d" % (seq[lex_pos:lex_pos+1].decode("ascii", "backslashreplace"), line, column)
else:
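A hedged aside on the bytes branch above: in Python 3, indexing `bytes` returns an `int`, which is why the message slices a single byte and decodes it with a lossless error handler:

```python
# Illustrative only, not from the diff.
seq = b"\xffabc"
lex_pos = 0
print(seq[lex_pos])  # 255 -- an int, useless in an error message
print(seq[lex_pos:lex_pos+1].decode("ascii", "backslashreplace"))  # \xff
```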


lark/load_grammar.py (+25 −9)

@@ -85,7 +85,7 @@ TERMINALS = {
'RULE': '!?[_?]?[a-z][_a-z0-9]*',
'TERMINAL': '_?[A-Z][_A-Z0-9]*',
'STRING': r'"(\\"|\\\\|[^"\n])*?"i?',
-'REGEXP': r'/(?!/)(\\/|\\\\|[^/\n])*?/[%s]*' % _RE_FLAGS,
+'REGEXP': r'/(?!/)(\\/|\\\\|[^/])*?/[%s]*' % _RE_FLAGS,
'_NL': r'(\r?\n)+\s*',
'WS': r'[ \t]+',
'COMMENT': r'\s*//[^\n]*',
@@ -307,6 +307,7 @@ class PrepareAnonTerminals(Transformer_InPlace):
self.term_set = {td.name for td in self.terminals}
self.term_reverse = {td.pattern: td for td in terminals}
self.i = 0
+self.rule_options = None


@inline_args
@@ -335,7 +336,7 @@ class PrepareAnonTerminals(Transformer_InPlace):
term_name = None

elif isinstance(p, PatternRE):
-if p in self.term_reverse: # Kind of a wierd placement.name
+if p in self.term_reverse: # Kind of a weird placement.name
term_name = self.term_reverse[p].name
else:
assert False, p
@@ -351,7 +352,10 @@ class PrepareAnonTerminals(Transformer_InPlace):
self.term_reverse[p] = termdef
self.terminals.append(termdef)

-return Terminal(term_name, filter_out=isinstance(p, PatternStr))
+filter_out = False if self.rule_options and self.rule_options.keep_all_tokens else isinstance(p, PatternStr)
+
+return Terminal(term_name, filter_out=filter_out)


class _ReplaceSymbols(Transformer_InPlace):
" Helper for ApplyTemplates "
@@ -405,6 +409,13 @@ def _literal_to_pattern(literal):
flags = v[flag_start:]
assert all(f in _RE_FLAGS for f in flags), flags

+if literal.type == 'STRING' and '\n' in v:
+    raise GrammarError('You cannot put newlines in string literals')
+
+if literal.type == 'REGEXP' and '\n' in v and 'x' not in flags:
+    raise GrammarError('You can only use newlines in regular expressions '
+                       'with the `x` (verbose) flag')

v = v[:flag_start]
assert v[0] == v[-1] and v[0] in '"/'
x = v[1:-1]
@@ -413,9 +424,11 @@ def _literal_to_pattern(literal):

if literal.type == 'STRING':
s = s.replace('\\\\', '\\')

-return { 'STRING': PatternStr,
-         'REGEXP': PatternRE }[literal.type](s, flags)
+    return PatternStr(s, flags)
+elif literal.type == 'REGEXP':
+    return PatternRE(s, flags)
+else:
+    assert False, 'Invariant failed: literal.type not in ["STRING", "REGEXP"]'


@inline_args
@@ -541,7 +554,8 @@ class Grammar:
# =================

# 1. Pre-process terminals
-transformer = PrepareLiterals() * PrepareSymbols() * PrepareAnonTerminals(terminals) # Adds to terminals
+anon_tokens_transf = PrepareAnonTerminals(terminals)
+transformer = PrepareLiterals() * PrepareSymbols() * anon_tokens_transf # Adds to terminals

# 2. Inline Templates

@@ -556,8 +570,10 @@ class Grammar:
i += 1
if len(params) != 0: # Dont transform templates
continue
-ebnf_to_bnf.rule_options = RuleOptions(keep_all_tokens=True) if options.keep_all_tokens else None
+rule_options = RuleOptions(keep_all_tokens=True) if options and options.keep_all_tokens else None
+ebnf_to_bnf.rule_options = rule_options
ebnf_to_bnf.prefix = name
+anon_tokens_transf.rule_options = rule_options
tree = transformer.transform(rule_tree)
res = ebnf_to_bnf.transform(tree)
rules.append((name, res, options))
@@ -834,7 +850,7 @@ class GrammarLoader:
if len(stmt.children) > 1:
path_node, arg1 = stmt.children
else:
-path_node, = stmt.children
+path_node ,= stmt.children
arg1 = None

if isinstance(arg1, Tree): # Multi import
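To make the `keep_all_tokens` change above concrete, a minimal sketch of the behavior it fixes (illustrative grammar, not part of this commit): with a `!` prefix, anonymous tokens such as `"+"` are kept in the tree instead of being filtered out.

```python
from lark import Lark

# Illustrative only: '!' marks the rule as keep_all_tokens, so the anonymous
# "+" token survives into the parse tree.
parser = Lark('''
    !start: NAME "+" NAME
    NAME: /[a-z]+/
    %ignore " "
''')

print(parser.parse("a + b").children)  # ['a', '+', 'b'] -- the "+" is kept
```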


lark/reconstruct.py (+43 −9)

@@ -86,6 +86,14 @@ def best_from_group(seq, group_key, cmp_key):
d[key] = item
return list(d.values())


+def make_recons_rule(origin, expansion, old_expansion):
+    return Rule(origin, expansion, alias=MakeMatchTree(origin.name, old_expansion))
+
+def make_recons_rule_to_term(origin, term):
+    return make_recons_rule(origin, [Terminal(term.name)], [term])


class Reconstructor:
"""
A Reconstructor that will, given a full parse Tree, generate source code.
@@ -100,6 +108,8 @@ class Reconstructor:
tokens, rules, _grammar_extra = parser.grammar.compile(parser.options.start)

self.write_tokens = WriteTokensTransformer({t.name:t for t in tokens}, term_subs)
+self.rules_for_root = defaultdict(list)
+
self.rules = list(self._build_recons_rules(rules))
self.rules.reverse()

@@ -107,9 +117,8 @@ class Reconstructor:
self.rules = best_from_group(self.rules, lambda r: r, lambda r: -len(r.expansion))

self.rules.sort(key=lambda r: len(r.expansion))
-callbacks = {rule: rule.alias for rule in self.rules} # TODO pass callbacks through dict, instead of alias?
-self.parser = earley.Parser(ParserConf(self.rules, callbacks, parser.options.start),
-                            self._match, resolve_ambiguity=True)
+self.parser = parser
+self._parser_cache = {}

def _build_recons_rules(self, rules):
expand1s = {r.origin for r in rules if r.options.expand1}
@@ -121,24 +130,36 @@ class Reconstructor:

rule_names = {r.origin for r in rules}
nonterminals = {sym for sym in rule_names
-if sym.name.startswith('_') or sym in expand1s or sym in aliases }
+if sym.name.startswith('_') or sym in expand1s or sym in aliases }

+seen = set()
for r in rules:
recons_exp = [sym if sym in nonterminals else Terminal(sym.name)
for sym in r.expansion if not is_discarded_terminal(sym)]

# Skip self-recursive constructs
-if recons_exp == [r.origin]:
+if recons_exp == [r.origin] and r.alias is None:
continue

sym = NonTerminal(r.alias) if r.alias else r.origin
+rule = make_recons_rule(sym, recons_exp, r.expansion)

-yield Rule(sym, recons_exp, alias=MakeMatchTree(sym.name, r.expansion))
+if sym in expand1s and len(recons_exp) != 1:
+    self.rules_for_root[sym.name].append(rule)
+
+    if sym.name not in seen:
+        yield make_recons_rule_to_term(sym, sym)
+        seen.add(sym.name)
+else:
+    if sym.name.startswith('_') or sym in expand1s:
+        yield rule
+    else:
+        self.rules_for_root[sym.name].append(rule)

for origin, rule_aliases in aliases.items():
for alias in rule_aliases:
-    yield Rule(origin, [Terminal(alias)], alias=MakeMatchTree(origin.name, [NonTerminal(alias)]))
-yield Rule(origin, [Terminal(origin.name)], alias=MakeMatchTree(origin.name, [origin]))
+    yield make_recons_rule_to_term(origin, NonTerminal(alias))
+yield make_recons_rule_to_term(origin, origin)

def _match(self, term, token):
if isinstance(token, Tree):
@@ -149,7 +170,20 @@ class Reconstructor:

def _reconstruct(self, tree):
# TODO: ambiguity?
-unreduced_tree = self.parser.parse(tree.children, tree.data) # find a full derivation
+try:
+    parser = self._parser_cache[tree.data]
+except KeyError:
+    rules = self.rules + best_from_group(
+        self.rules_for_root[tree.data], lambda r: r, lambda r: -len(r.expansion)
+    )
+
+    rules.sort(key=lambda r: len(r.expansion))
+
+    callbacks = {rule: rule.alias for rule in rules} # TODO pass callbacks through dict, instead of alias?
+    parser = earley.Parser(ParserConf(rules, callbacks, [tree.data]), self._match, resolve_ambiguity=True)
+    self._parser_cache[tree.data] = parser
+
+unreduced_tree = parser.parse(tree.children, tree.data) # find a full derivation
assert unreduced_tree.data == tree.data
res = self.write_tokens.transform(unreduced_tree)
for item in res:
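For orientation, a minimal sketch of the public API this module serves (illustrative, not part of this commit); the change above simply builds and caches one Earley parser per tree root, instead of one parser shared by all start symbols:

```python
from lark import Lark
from lark.reconstruct import Reconstructor

# Illustrative only: round-trip a tree back to source text. The "," is an
# anonymous filtered token; the Reconstructor re-inserts it.
parser = Lark('''
    start: WORD "," WORD
    %import common.WORD
''')

tree = parser.parse("hello,world")
print(Reconstructor(parser).reconstruct(tree))  # hello,world
```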


lark/tools/nearley.py (+19 −13)

@@ -1,8 +1,9 @@
"Converts between Lark and Nearley grammars. Work in progress!"
"Converts Nearley grammars to Lark"

import os.path
import sys
import codecs
+import argparse


from lark import Lark, InlineTransformer
@@ -137,7 +138,7 @@ def _nearley_to_lark(g, builtin_path, n2l, js_code, folder_path, includes):
return rule_defs


-def create_code_for_nearley_grammar(g, start, builtin_path, folder_path):
+def create_code_for_nearley_grammar(g, start, builtin_path, folder_path, es6=False):
import js2py

emit_code = []
@@ -160,7 +161,10 @@ def create_code_for_nearley_grammar(g, start, builtin_path, folder_path):
for alias, code in n2l.alias_js_code.items():
js_code.append('%s = (%s);' % (alias, code))

-emit(js2py.translate_js('\n'.join(js_code)))
+if es6:
+    emit(js2py.translate_js6('\n'.join(js_code)))
+else:
+    emit(js2py.translate_js('\n'.join(js_code)))
emit('class TransformNearley(Transformer):')
for alias in n2l.alias_js_code:
emit(" %s = var.get('%s').to_python()" % (alias, alias))
@@ -173,18 +177,20 @@ def create_code_for_nearley_grammar(g, start, builtin_path, folder_path):

return ''.join(emit_code)

-def main(fn, start, nearley_lib):
+def main(fn, start, nearley_lib, es6=False):
with codecs.open(fn, encoding='utf8') as f:
grammar = f.read()
-return create_code_for_nearley_grammar(grammar, start, os.path.join(nearley_lib, 'builtin'), os.path.abspath(os.path.dirname(fn)))
+return create_code_for_nearley_grammar(grammar, start, os.path.join(nearley_lib, 'builtin'), os.path.abspath(os.path.dirname(fn)), es6=es6)

+def get_arg_parser():
+    parser = argparse.ArgumentParser('Reads Nearley grammar (with js functions) outputs an equivalent lark parser.')
+    parser.add_argument('nearley_grammar', help='Path to the file containing the nearley grammar')
+    parser.add_argument('start_rule', help='Rule within the nearley grammar to make the base rule')
+    parser.add_argument('nearley_lib', help='Path to root directory of nearley codebase (used for including builtins)')
+    parser.add_argument('--es6', help='Enable experimental ES6 support', action='store_true')
+    return parser

if __name__ == '__main__':
-    if len(sys.argv) < 4:
-        print("Reads Nearley grammar (with js functions) outputs an equivalent lark parser.")
-        print("Usage: %s <nearley_grammar_path> <start_rule> <nearley_lib_path>" % sys.argv[0])
-        sys.exit(1)
-
-    fn, start, nearley_lib = sys.argv[1:]
-
-    print(main(fn, start, nearley_lib))
+    parser = get_arg_parser()
+    args = parser.parse_args()
+    print(main(fn=args.nearley_grammar, start=args.start_rule, nearley_lib=args.nearley_lib, es6=args.es6))

lark/visitors.py (+2 −0)

@@ -14,6 +14,8 @@ class Discard(Exception):
# Transformers

class _Decoratable:
"Provides support for decorating methods with @v_args"

@classmethod
def _apply_decorator(cls, decorator, **kwargs):
mro = getmro(cls)


mkdocs.yml (+2 −0)

@@ -12,3 +12,5 @@ pages:
- Visitors and Transformers: visitors.md
- Classes Reference: classes.md
- Recipes: recipes.md
+- Import grammars from Nearley: nearley.md
+- Tutorial - JSON Parser: json_tutorial.md

setup.py (+2 −1)

@@ -15,7 +15,8 @@ setup(
install_requires = [],

extras_require = {
"regex": ["regex"]
"regex": ["regex"],
"nearley": ["js2py"]
},

package_data = {'': ['*.md', '*.lark'], 'lark-stubs': ['*.pyi']},


tests/test_parser.py (+28 −1)

@@ -721,7 +721,8 @@ def _make_parser_test(LEXER, PARSER):
""")
g.parse('\x01\x02\x03')

-@unittest.skipIf(sys.version_info[:2]==(2, 7), "bytes parser isn't perfect in Python2.7, exceptions don't work correctly")
+@unittest.skipIf(sys.version_info[0]==2 or sys.version_info[:2]==(3, 4),
+                 "bytes parser isn't perfect in Python2, exceptions don't work correctly")
def test_bytes_utf8(self):
g = r"""
start: BOM? char+
@@ -1261,6 +1262,32 @@ def _make_parser_test(LEXER, PARSER):
tree = l.parse('aA')
self.assertEqual(tree.children, ['a', 'A'])

+def test_token_flags_verbose(self):
+    g = _Lark(r"""start: NL | ABC
+              ABC: / [a-z] /x
+              NL: /\n/
+              """)
+    x = g.parse('a')
+    self.assertEqual(x.children, ['a'])
+
+def test_token_flags_verbose_multiline(self):
+    g = _Lark(r"""start: ABC
+              ABC: / a b c
+                       d
+                      e f
+                  /x
+              """)
+    x = g.parse('abcdef')
+    self.assertEqual(x.children, ['abcdef'])
+
+def test_token_multiline_only_works_with_x_flag(self):
+    g = r"""start: ABC
+            ABC: / a b c
+                     d
+                    e f
+                /i
+            """
+    self.assertRaises( GrammarError, _Lark, g)

@unittest.skipIf(PARSER == 'cyk', "No empty rules")
def test_twice_empty(self):


tests/test_reconstructor.py (+29 −0)

@@ -69,6 +69,35 @@ class TestReconstructor(TestCase):

self.assert_reconstruct(g, code)

+def test_keep_tokens(self):
+    g = """
+    start: (NL | stmt)*
+    stmt: var op var
+    !op: ("+" | "-" | "*" | "/")
+    var: WORD
+    NL: /(\\r?\\n)+\s*/
+    """ + common
+
+    code = """
+    a+b
+    """
+
+    self.assert_reconstruct(g, code)
+
+def test_expand_rule(self):
+    g = """
+    ?start: (NL | mult_stmt)*
+    ?mult_stmt: sum_stmt ["*" sum_stmt]
+    ?sum_stmt: var ["+" var]
+    var: WORD
+    NL: /(\\r?\\n)+\s*/
+    """ + common
+
+    code = ['a', 'a*b', 'a+b', 'a*b+c', 'a+b*c', 'a+b*c+d']
+
+    for c in code:
+        self.assert_reconstruct(g, c)
def test_json_example(self):
test_json = '''
{

