@@ -1,18 +1,18 @@
# Lark - a modern parsing library for Python
# Lark - a parsing toolkit for Python
Lark is a parser built with a focus on ergonomics, performance and resilience.
Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
Lark can parse all context-free languages. That means it is capable of parsing almost any programming language out there, and to some degree most natural languages too.
Lark can parse all context-free languages. To put it simply, it means that it is capable of parsing almost any programming language out there, and to some degree most natural languages too.
**Who is it for?**
- **Beginners**: Lark is very friendly for experimentation. It can parse any grammar you throw at it, no matter how complicated or ambiguous, and do so efficiently. It also constructs an annotated parse-tree for you, using only the grammar, and it gives you convenient and flexible tools to process that parse-tree.
- **Beginners**: Lark is very friendly for experimentation. It can parse any grammar you throw at it, no matter how complicated or ambiguous, and do so efficiently. It also constructs an annotated parse-tree for you, using only the grammar and an input, and it gives you convenient and flexible tools to process that parse-tree.
- **Experts**: Lark implements both Earley(SPPF) and LALR(1), and several different lexers, so you can trade-off power and speed, according to your requirements. It also provides a variety of sophisticated features and utilities.
**What can it do?**
- Parse all context-free grammars, and handle any ambiguity
- Parse all context-free grammars, and handle any ambiguity gracefully
- Build an annotated parse-tree automagically, no construction code required.
- Provide first-rate performance in terms of both Big-O complexity and measured run-time (considering that this is Python ;)
- Run on every Python interpreter (it's pure-python)
@@ -33,7 +33,7 @@ Most importantly, Lark will save you time and prevent you from getting parsing h
### Install Lark
$ pip install lark-parser
$ pip install lark-parser --upgrade
Lark has no dependencies.
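As a quick check that the install worked, here is a minimal sketch in the spirit of the README's hello-world example (the grammar is illustrative, not part of this change):

```python
from lark import Lark

# Build a parser from an inline grammar and print the resulting parse-tree.
parser = Lark(r"""
    start: WORD "," WORD "!"

    %import common.WORD   // import WORD from the standard terminal library
    %ignore " "           // ignore spaces between tokens
""")

print(parser.parse("Hello, World!").pretty())
```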
@@ -77,12 +77,11 @@ Notice punctuation doesn't appear in the resulting tree. It's automatically filt
### Fruit flies like bananas
Lark is great at handling ambiguity. Let's parse the phrase "fruit flies like bananas":
Lark is great at handling ambiguity. Here is the result of parsing the phrase "fruit flies like bananas":

See more [examples here](https://github.com/lark-parser/lark/tree/master/examples)
See the code and more [examples here](https://github.com/lark-parser/lark/tree/master/examples)
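For reference, a hedged sketch of how the ambiguous parse above is produced, along the lines of the fruitflies example in the linked directory:

```python
from lark import Lark

# An intentionally ambiguous toy grammar: with ambiguity='explicit', the
# Earley parser returns every derivation under an '_ambig' node instead
# of picking just one.
grammar = r"""
    sentence: noun verb noun        -> simple
            | noun verb "like" noun -> comparative

    noun: adj? NOUN
    verb: VERB
    adj: ADJ

    NOUN: "flies" | "bananas" | "fruit"
    VERB: "like" | "flies"
    ADJ: "fruit"

    %import common.WS
    %ignore WS
"""

parser = Lark(grammar, start='sentence', ambiguity='explicit')
print(parser.parse('fruit flies like bananas').pretty())
```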
## List of main features
@@ -100,7 +99,7 @@ See more [examples here](https://github.com/lark-parser/lark/tree/master/example
- **Python 2 & 3** compatible
- Automatic line & column tracking
- Standard library of terminals (strings, numbers, names, etc.)
- Import grammars from Nearley.js
- Import grammars from Nearley.js ([read more](/docs/nearley.md))
- Extensive test suite [](https://codecov.io/gh/erezsh/lark)
- MyPy support using type stubs
- And much more!
@@ -159,25 +158,6 @@ Check out the [JSON tutorial](/docs/json_tutorial.md#conclusion) for more detail
Using Lark? Send me a message and I'll add your project!
### How to use Nearley grammars in Lark
Lark comes with a tool to convert grammars from [Nearley](https://github.com/Hardmath123/nearley), a popular Earley library for Javascript. It uses [Js2Py](https://github.com/PiotrDabkowski/Js2Py) to convert and run the Javascript postprocessing code segments.
Here's an example:
```bash
git clone https://github.com/Hardmath123/nearley
python -m lark.tools.nearley nearley/examples/calculator/arithmetic.ne main nearley > ncalc.py
```
You can use the output as a regular python module:
```python
>>> import ncalc
>>> ncalc.parse('sin(pi/4) ^ e')
0.38981434460254655
```
## License
Lark uses the [MIT license](LICENSE).
@@ -19,9 +19,9 @@
[Read more about the parsers](parsers.md)
# Extra features
- Import rules and tokens from other Lark grammars, for code reuse and modularity.
- Import grammars from Nearley.js
- Support for external regex module ([see here](classes.md#using-unicode-character-classes-with-regex))
- Import grammars from Nearley.js ([read more](nearley.md))
- CYK parser
### Experimental features
@@ -112,6 +112,19 @@ Terminals can be assigned priority only when using a lexer (future versions may
Priority can be either positive or negative. If not specified for a terminal, it defaults to 1.
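For example, a small sketch of the declaration syntax (`TERMINAL.priority`), assuming the standard lexer:

```perl
// "if" matches IF rather than NAME, because 2 beats NAME's default priority of 1
IF.2: "if"
NAME: /[a-z]+/
```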
### Regexp Flags
You can use flags on regexps and strings. For example:
```perl
SELECT: "select"i //# Will ignore case, and match SELECT or Select, etc.
MULTILINE_TEXT: /.+/s
```
Supported flags are one of: `imslu`. See Python's regex documentation for more details on each one.
Regexps/strings with different flags can only be concatenated in Python 3.6+
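A short usage sketch of the `i` flag from above (illustrative only):

```python
from lark import Lark

# The `i` flag makes the SELECT terminal case-insensitive.
parser = Lark(r"""
    start: SELECT
    SELECT: "select"i
""")

print(parser.parse("SeLeCt"))  # accepted despite the mixed case
```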
#### Notes for when using a lexer:
When using a lexer (standard or contextual), it is the grammar-author's responsibility to make sure the literals don't collide, or that if they do, they are matched in the desired order. Literals are matched according to the following precedence:
@@ -32,7 +32,7 @@ $ pip install lark-parser
* [Philosophy & Design Choices](philosophy.md)
* [Full List of Features](features.md)
* [Features](features.md)
* [Examples](https://github.com/lark-parser/lark/tree/master/examples)
* [Online IDE](https://lark-parser.github.io/lark/ide/app.html)
* Tutorials
@@ -49,6 +49,7 @@ $ pip install lark-parser
* [Visitors & Transformers](visitors.md)
* [Classes](classes.md)
* [Cheatsheet (PDF)](lark_cheatsheet.pdf)
* [Importing grammars from Nearley](nearley.md)
* Discussion
* [Gitter](https://gitter.im/lark-parser/Lobby)
* [Forum (Google Groups)](https://groups.google.com/forum/#!forum/lark-parser)
@@ -0,0 +1,47 @@
# Importing grammars from Nearley
Lark comes with a tool to convert grammars from [Nearley](https://github.com/Hardmath123/nearley), a popular Earley library for Javascript. It uses [Js2Py](https://github.com/PiotrDabkowski/Js2Py) to convert and run the Javascript postprocessing code segments.
## Requirements
1. Install Lark with the `nearley` component:
```bash
pip install lark-parser[nearley]
```
2. Acquire a copy of the nearley codebase. This can be done using:
```bash
git clone https://github.com/Hardmath123/nearley
```
## Usage
Here's an example of how to import nearley's calculator example into Lark:
```bash
git clone https://github.com/Hardmath123/nearley
python -m lark.tools.nearley nearley/examples/calculator/arithmetic.ne main nearley > ncalc.py
```
You can use the output as a regular python module:
```python
>>> import ncalc
>>> ncalc.parse('sin(pi/4) ^ e')
0.38981434460254655
```
The Nearley converter also supports an experimental converter for newer JavaScript (ES6+), using the `--es6` flag:
```bash
git clone https://github.com/Hardmath123/nearley
python -m lark.tools.nearley nearley/examples/calculator/arithmetic.ne main nearley --es6 > ncalc.py
```
## Notes
- Lark currently cannot import templates from Nearley
- Lark currently cannot export grammars to Nearley
These might get added in the future, if enough users ask for them.
@@ -13,7 +13,7 @@ It's possible to bypass the dynamic lexing, and use the regular Earley parser wi
Lark implements the Shared Packed Parse Forest data-structure for the Earley parser, in order to reduce the space and computation required to handle ambiguous grammars.
You can read more about SPPF [here](http://www.bramvandersanden.com/post/2014/06/shared-packed-parse-forest/)
You can read more about SPPF [here](https://web.archive.org/web/20191229100607/www.bramvandersanden.com/post/2014/06/shared-packed-parse-forest)
As a result, Lark can efficiently parse and store every ambiguity in the grammar, when using Earley.
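A brief sketch of what that looks like from the API, using an assumed toy grammar; with `ambiguity='explicit'`, every derivation is kept under an `_ambig` node:

```python
from lark import Lark

# '1+2+3' has two derivations: (1+2)+3 and 1+(2+3). Earley stores both
# in the SPPF, and ambiguity='explicit' exposes them in the tree.
parser = Lark(r"""
    start: expr
    ?expr: expr "+" expr
         | NUMBER
    %import common.NUMBER
""", parser='earley', ambiguity='explicit')

print(parser.parse("1+2+3").pretty())
```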
@@ -81,7 +81,7 @@ class UnexpectedInput(LarkError):
class UnexpectedCharacters(LexError, UnexpectedInput):
def __init__(self, seq, lex_pos, line, column, allowed=None, considered_tokens=None, state=None, token_history=None):
if isinstance(seq, bytes):
message = "No terminal defined for '%s' at line %d col %d" % (seq[lex_pos:lex_pos+1].decode("ascii", "backslashreplace"), line, column)
else:
@@ -85,7 +85,7 @@ TERMINALS = {
'RULE': '!?[_?]?[a-z][_a-z0-9]*',
'TERMINAL': '_?[A-Z][_A-Z0-9]*',
'STRING': r'"(\\"|\\\\|[^"\n])*?"i?',
'REGEXP': r'/(?!/)(\\/|\\\\|[^/\n])*?/[%s]*' % _RE_FLAGS,
'REGEXP': r'/(?!/)(\\/|\\\\|[^/])*?/[%s]*' % _RE_FLAGS,
'_NL': r'(\r?\n)+\s*',
'WS': r'[ \t]+',
'COMMENT': r'\s*//[^\n]*',
@@ -307,6 +307,7 @@ class PrepareAnonTerminals(Transformer_InPlace):
self.term_set = {td.name for td in self.terminals}
self.term_reverse = {td.pattern: td for td in terminals}
self.i = 0
self.rule_options = None
@inline_args
@@ -335,7 +336,7 @@ class PrepareAnonTerminals(Transformer_InPlace):
term_name = None
elif isinstance(p, PatternRE):
if p in self.term_reverse: # Kind of a wierd placement.name
if p in self.term_reverse: # Kind of a weird placement.name
term_name = self.term_reverse[p].name
else:
assert False, p
@@ -351,7 +352,10 @@ class PrepareAnonTerminals(Transformer_InPlace):
self.term_reverse[p] = termdef
self.terminals.append(termdef)
return Terminal(term_name, filter_out=isinstance(p, PatternStr))
filter_out = False if self.rule_options and self.rule_options.keep_all_tokens else isinstance(p, PatternStr)
return Terminal(term_name, filter_out=filter_out)
class _ReplaceSymbols(Transformer_InPlace):
" Helper for ApplyTemplates "
@@ -405,6 +409,13 @@ def _literal_to_pattern(literal):
flags = v[flag_start:]
assert all(f in _RE_FLAGS for f in flags), flags
if literal.type == 'STRING' and '\n' in v:
raise GrammarError('You cannot put newlines in string literals')
if literal.type == 'REGEXP' and '\n' in v and 'x' not in flags:
raise GrammarError('You can only use newlines in regular expressions '
'with the `x` (verbose) flag')
v = v[:flag_start]
assert v[0] == v[-1] and v[0] in '"/'
x = v[1:-1]
@@ -413,9 +424,11 @@ def _literal_to_pattern(literal):
if literal.type == 'STRING':
s = s.replace('\\\\', '\\')
return { 'STRING': PatternStr,
'REGEXP': PatternRE }[literal.type](s, flags)
return PatternStr(s, flags)
elif literal.type == 'REGEXP':
return PatternRE(s, flags)
else:
assert False, 'Invariant failed: literal.type not in ["STRING", "REGEXP"]'
@inline_args
@@ -541,7 +554,8 @@ class Grammar:
# =================
# 1. Pre-process terminals
transformer = PrepareLiterals() * PrepareSymbols() * PrepareAnonTerminals(terminals) # Adds to terminals
anon_tokens_transf = PrepareAnonTerminals(terminals)
transformer = PrepareLiterals() * PrepareSymbols() * anon_tokens_transf # Adds to terminals
# 2. Inline Templates
@@ -556,8 +570,10 @@ class Grammar:
i += 1
if len(params) != 0: # Dont transform templates
continue
ebnf_to_bnf.rule_options = RuleOptions(keep_all_tokens=True) if options.keep_all_tokens else None
rule_options = RuleOptions(keep_all_tokens=True) if options and options.keep_all_tokens else None
ebnf_to_bnf.rule_options = rule_options
ebnf_to_bnf.prefix = name
anon_tokens_transf.rule_options = rule_options
tree = transformer.transform(rule_tree)
res = ebnf_to_bnf.transform(tree)
rules.append((name, res, options))
@@ -834,7 +850,7 @@ class GrammarLoader:
if len(stmt.children) > 1:
path_node, arg1 = stmt.children
else:
path_node, = stmt.children
path_node ,= stmt.children
arg1 = None
if isinstance(arg1, Tree): # Multi import
@@ -86,6 +86,14 @@ def best_from_group(seq, group_key, cmp_key):
d[key] = item
return list(d.values())
def make_recons_rule(origin, expansion, old_expansion):
return Rule(origin, expansion, alias=MakeMatchTree(origin.name, old_expansion))
def make_recons_rule_to_term(origin, term):
return make_recons_rule(origin, [Terminal(term.name)], [term])
class Reconstructor:
"""
A Reconstructor that will, given a full parse Tree, generate source code.
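For context, a hedged sketch of typical Reconstructor usage, mirroring the test suite; the grammar is made up:

```python
from lark import Lark
from lark.reconstruct import Reconstructor

# Parse some input, then regenerate matching source text from the tree.
# Ignored whitespace is not stored in the tree, so it is not reproduced.
parser = Lark(r"""
    start: WORD ("," WORD)*
    %import common.WORD
    %ignore " "
""", maybe_placeholders=False)

tree = parser.parse("hello, world")
print(Reconstructor(parser).reconstruct(tree))  # -> hello,world
```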
@@ -100,6 +108,8 @@ class Reconstructor:
tokens, rules, _grammar_extra = parser.grammar.compile(parser.options.start)
self.write_tokens = WriteTokensTransformer({t.name:t for t in tokens}, term_subs)
self.rules_for_root = defaultdict(list)
self.rules = list(self._build_recons_rules(rules))
self.rules.reverse()
@@ -107,9 +117,8 @@ class Reconstructor:
self.rules = best_from_group(self.rules, lambda r: r, lambda r: -len(r.expansion))
self.rules.sort(key=lambda r: len(r.expansion))
callbacks = {rule: rule.alias for rule in self.rules} # TODO pass callbacks through dict, instead of alias?
self.parser = earley.Parser(ParserConf(self.rules, callbacks, parser.options.start),
self._match, resolve_ambiguity=True)
self.parser = parser
self._parser_cache = {}
def _build_recons_rules(self, rules):
expand1s = {r.origin for r in rules if r.options.expand1}
@@ -121,24 +130,36 @@ class Reconstructor:
rule_names = {r.origin for r in rules}
nonterminals = {sym for sym in rule_names
if sym.name.startswith('_') or sym in expand1s or sym in aliases }
if sym.name.startswith('_') or sym in expand1s or sym in aliases }
seen = set()
for r in rules:
recons_exp = [sym if sym in nonterminals else Terminal(sym.name)
for sym in r.expansion if not is_discarded_terminal(sym)]
# Skip self-recursive constructs
if recons_exp == [r.origin]:
if recons_exp == [r.origin] and r.alias is None:
continue
sym = NonTerminal(r.alias) if r.alias else r.origin
rule = make_recons_rule(sym, recons_exp, r.expansion)
yield Rule(sym, recons_exp, alias=MakeMatchTree(sym.name, r.expansion))
if sym in expand1s and len(recons_exp) != 1:
self.rules_for_root[sym.name].append(rule)
if sym.name not in seen:
yield make_recons_rule_to_term(sym, sym)
seen.add(sym.name)
else:
if sym.name.startswith('_') or sym in expand1s:
yield rule
else:
self.rules_for_root[sym.name].append(rule)
for origin, rule_aliases in aliases.items():
for alias in rule_aliases:
yield Rule(origin, [Terminal(alias)], alias=MakeMatchTree(origin.name, [NonTerminal(alias)]))
yield Rule(origin, [Terminal(origin.name)], alias=MakeMatchTree(origin.name, [origin]))
yield make_recons_rule_to_term(origin, NonTerminal(alias))
yield make_recons_rule_to_term(origin, origin)
def _match(self, term, token):
if isinstance(token, Tree):
@@ -149,7 +170,20 @@ class Reconstructor:
def _reconstruct(self, tree):
# TODO: ambiguity?
unreduced_tree = self.parser.parse(tree.children, tree.data) # find a full derivation
try:
parser = self._parser_cache[tree.data]
except KeyError:
rules = self.rules + best_from_group(
self.rules_for_root[tree.data], lambda r: r, lambda r: -len(r.expansion)
)
rules.sort(key=lambda r: len(r.expansion))
callbacks = {rule: rule.alias for rule in rules} # TODO pass callbacks through dict, instead of alias?
parser = earley.Parser(ParserConf(rules, callbacks, [tree.data]), self._match, resolve_ambiguity=True)
self._parser_cache[tree.data] = parser
unreduced_tree = parser.parse(tree.children, tree.data) # find a full derivation
assert unreduced_tree.data == tree.data
res = self.write_tokens.transform(unreduced_tree)
for item in res:
@@ -1,8 +1,9 @@
"Converts between Lark and Nearley grammars. Work in progress!"
"Converts Nearley grammars to Lark"
import os.path
import sys
import codecs
import argparse
from lark import Lark, InlineTransformer
@@ -137,7 +138,7 @@ def _nearley_to_lark(g, builtin_path, n2l, js_code, folder_path, includes):
return rule_defs
def create_code_for_nearley_grammar(g, start, builtin_path, folder_path):
def create_code_for_nearley_grammar(g, start, builtin_path, folder_path, es6=False):
import js2py
emit_code = []
@@ -160,7 +161,10 @@ def create_code_for_nearley_grammar(g, start, builtin_path, folder_path):
for alias, code in n2l.alias_js_code.items():
js_code.append('%s = (%s);' % (alias, code))
emit(js2py.translate_js('\n'.join(js_code)))
if es6:
emit(js2py.translate_js6('\n'.join(js_code)))
else:
emit(js2py.translate_js('\n'.join(js_code)))
emit('class TransformNearley(Transformer):')
for alias in n2l.alias_js_code:
emit(" %s = var.get('%s').to_python()" % (alias, alias))
@@ -173,18 +177,20 @@ def create_code_for_nearley_grammar(g, start, builtin_path, folder_path):
return ''.join(emit_code)
def main(fn, start, nearley_lib):
def main(fn, start, nearley_lib, es6=False):
with codecs.open(fn, encoding='utf8') as f:
grammar = f.read()
return create_code_for_nearley_grammar(grammar, start, os.path.join(nearley_lib, 'builtin'), os.path.abspath(os.path.dirname(fn)))
return create_code_for_nearley_grammar(grammar, start, os.path.join(nearley_lib, 'builtin'), os.path.abspath(os.path.dirname(fn)), es6=es6)
def get_arg_parser():
parser = argparse.ArgumentParser('Reads Nearley grammar (with js functions) outputs an equivalent lark parser.')
parser.add_argument('nearley_grammar', help='Path to the file containing the nearley grammar')
parser.add_argument('start_rule', help='Rule within the nearley grammar to make the base rule')
parser.add_argument('nearley_lib', help='Path to root directory of nearley codebase (used for including builtins)')
parser.add_argument('--es6', help='Enable experimental ES6 support', action='store_true')
return parser
if __name__ == '__main__':
if len(sys.argv) < 4:
print("Reads Nearley grammar (with js functions) outputs an equivalent lark parser.")
print("Usage: %s <nearley_grammar_path> <start_rule> <nearley_lib_path>" % sys.argv[0])
sys.exit(1)
fn, start, nearley_lib = sys.argv[1:]
print(main(fn, start, nearley_lib))
parser = get_arg_parser()
args = parser.parse_args()
print(main(fn=args.nearley_grammar, start=args.start_rule, nearley_lib=args.nearley_lib, es6=args.es6))
@@ -14,6 +14,8 @@ class Discard(Exception):
# Transformers
class _Decoratable:
"Provides support for decorating methods with @v_args"
@classmethod
def _apply_decorator(cls, decorator, **kwargs):
mro = getmro(cls)
@@ -12,3 +12,5 @@ pages:
- Visitors and Transformers: visitors.md
- Classes Reference: classes.md
- Recipes: recipes.md
- Import grammars from Nearley: nearley.md
- Tutorial - JSON Parser: json_tutorial.md
@@ -15,7 +15,8 @@ setup(
install_requires = [],
extras_require = {
"regex": ["regex"]
"regex": ["regex"],
"nearley": ["js2py"]
},
package_data = {'': ['*.md', '*.lark'], 'lark-stubs': ['*.pyi']},
@@ -721,7 +721,8 @@ def _make_parser_test(LEXER, PARSER):
""")
g.parse('\x01\x02\x03')
@unittest.skipIf(sys.version_info[:2]==(2, 7), "bytes parser isn't perfect in Python2.7, exceptions don't work correctly")
@unittest.skipIf(sys.version_info[0]==2 or sys.version_info[:2]==(3, 4),
"bytes parser isn't perfect in Python2, exceptions don't work correctly")
def test_bytes_utf8(self):
g = r"""
start: BOM? char+
@@ -1261,6 +1262,32 @@ def _make_parser_test(LEXER, PARSER):
tree = l.parse('aA')
self.assertEqual(tree.children, ['a', 'A'])
def test_token_flags_verbose(self):
g = _Lark(r"""start: NL | ABC
ABC: / [a-z] /x
NL: /\n/
""")
x = g.parse('a')
self.assertEqual(x.children, ['a'])
def test_token_flags_verbose_multiline(self):
g = _Lark(r"""start: ABC
ABC: / a b c
d
e f
/x
""")
x = g.parse('abcdef')
self.assertEqual(x.children, ['abcdef'])
def test_token_multiline_only_works_with_x_flag(self):
g = r"""start: ABC
ABC: / a b c
d
e f
/i
"""
self.assertRaises( GrammarError, _Lark, g)
@unittest.skipIf(PARSER == 'cyk', "No empty rules")
def test_twice_empty(self):
@@ -69,6 +69,35 @@ class TestReconstructor(TestCase):
self.assert_reconstruct(g, code)
def test_keep_tokens(self):
g = """
start: (NL | stmt)*
stmt: var op var
!op: ("+" | "-" | "*" | "/")
var: WORD
NL: /(\\r?\\n)+\s*/
""" + common
code = """
a+b
"""
self.assert_reconstruct(g, code)
def test_expand_rule(self):
g = """
?start: (NL | mult_stmt)*
?mult_stmt: sum_stmt ["*" sum_stmt]
?sum_stmt: var ["+" var]
var: WORD
NL: /(\\r?\\n)+\s*/
""" + common
code = ['a', 'a*b', 'a+b', 'a*b+c', 'a+b*c', 'a+b*c+d']
for c in code:
self.assert_reconstruct(g, c)
def test_json_example(self):
test_json = '''
{ | |||