
Breaking backwards compatibility:

* Removed the scanless parsing feature (dynamic lexing is king)
* Default LALR lexer is now contextual
tags/gm/2021-09-23T00Z/github.com--lark-parser-lark/0.6.0
Erez Shinan 6 years ago
parent
commit
33caa391d5
12 changed files with 43 additions and 349 deletions
  1. docs/reference.md (+0, -176)
  2. examples/README.md (+2, -1)
  3. examples/conf_earley.py (+6, -7)
  4. examples/conf_lalr.py (+5, -4)
  5. lark/grammar.py (+2, -5)
  6. lark/lark.py (+4, -4)
  7. lark/load_grammar.py (+3, -73)
  8. lark/parse_tree_builder.py (+0, -13)
  9. lark/parser_frontends.py (+1, -27)
  10. lark/tools/standalone.py (+1, -1)
  11. tests/__main__.py (+2, -2)
  12. tests/test_parser.py (+17, -36)

docs/reference.md (+0, -176)

@@ -1,176 +0,0 @@
# Lark Reference

## What is Lark?

Lark is a general-purpose parsing library. It's written in Python, and supports two parsing algorithms: Earley (default) and LALR(1).

Lark also supports scanless parsing (with Earley), contextual lexing (with LALR), and regular lexing for both parsers.

Lark is a re-write of my previous parsing library, [PlyPlus](https://github.com/erezsh/plyplus).

## Grammar

Lark accepts its grammars in [EBNF](https://www.wikiwand.com/en/Extended_Backus%E2%80%93Naur_form) form.

The grammar is a list of rules and terminals, each on its own line.

Rules and terminals can be defined on multiple lines when using the *OR* operator ( | ).

Comments start with // and last to the end of the line (C++ style).

Lark begins the parse with the rule 'start', unless specified otherwise in the options.

It might help to think of Rules and Terminals as existing in two separate layers, so that all the terminals are recognized first, and all the rules are recognized afterwards. This is not always how things happen (depending on your choice of parser & lexer), but the concept is relevant in all cases.

### Rules

Each rule is defined in terms of:

name : list of items to match
| another list of items -> optional_alias
| etc.

An alias is a name for the specific rule alternative. It affects tree construction.

An item is a:

- rule
- terminal
- (item item ..) - Group items
- [item item ..] - Maybe. Same as: "(item item ..)?"
- item? - Zero or one instances of item ("maybe")
- item\* - Zero or more instances of item
- item+ - One or more instances of item


Example:

float: "-"? DIGIT* "." DIGIT+ exp
| "-"? DIGIT+ exp

exp: "-"? ("e" | "E") DIGIT+

DIGIT: /[0-9]/

### Terminals

Terminals are defined just like rules, but cannot contain rules:

NAME : list of items to match

Example:

IF: "if"
INTEGER : /[0-9]+/
DECIMAL: INTEGER "." INTEGER
WHITESPACE: (" " | /\t/ )+

## Tree Construction

Lark builds a tree automatically based on the structure of the grammar. It also accepts some hints.

In general, Lark will place each rule as a branch, and its matches as the children of the branch.

Terminals are always values in the tree, never branches.

In grammar rules, using item+ or item\* will result in a list of items.

Example:

expr: "(" expr ")"
| NAME+

NAME: /\w+/

%ignore " "

Lark will parse "(((hello world)))" as:

expr
    expr
        expr
            "hello"
            "world"

The brackets do not appear in the tree by design.

Terminals that won't appear in the tree are:

- Unnamed literals (like "keyword" or "+")
- Terminals whose name starts with an underscore (like \_DIGIT)

Terminals that *will* appear in the tree are:

- Unnamed regular expressions (like /[0-9]/)
- Named terminals whose name starts with a letter (like DIGIT)

## Shaping the tree

a. Rules whose name begins with an underscore will be inlined into their containing rule.

Example:

start: "(" _greet ")"
_greet: /\w+/ /\w+/

Lark will parse "(hello world)" as:

start
    "hello"
    "world"


b. Rules that receive a question mark (?) at the beginning of their definition will be inlined if they have a single child.

Example:

start: greet greet
?greet: "(" /\w+/ ")"
| /\w+/ /\w+/

Lark will parse "hello world (planet)" as:

start
    greet
        "hello"
        "world"
    "planet"

c. Rules that begin with an exclamation mark will keep all their terminals (they won't get filtered).

d. Aliases - options in a rule can receive an alias. It will then be used as the branch name for the option.

Example:

start: greet greet
greet: "hello" -> hello
| "world"

Lark will parse "hello world" as:

start
    hello
    greet

## Lark Options

When initializing the Lark object, you can provide it with keyword options:

- start - The start symbol (Default: "start")
- parser - Decides which parser engine to use, "earley" or "lalr". (Default: "earley")
Note: "lalr" requires a lexer
- lexer - Decides whether or not to use a lexer stage
- None: Don't use a lexer
- "standard": Use a standard lexer
- "contextual": Stronger lexer (only works with parser="lalr")
- "auto" (default): Choose for me based on grammar and parser

- transformer - Applies the transformer to every parse tree (only allowed with parser="lalr")
- postlex - Lexer post-processing (Default: None)

To be supported:

- debug
- cache\_grammar
- keep\_all\_tokens
- profile - Measure run-time usage in Lark. Read results from the profiler property (Default: False)

examples/README.md (+2, -1)

@@ -12,5 +12,6 @@

- [error\_reporting\_lalr.py](error_reporting_lalr.py) - A demonstration of example-driven error reporting with the LALR parser
- [python\_parser.py](python_parser.py) - A fully-working Python 2 & 3 parser (but not production ready yet!)
- [conf.py](conf.py) - Demonstrates the power of LALR's contextual lexer on a toy configuration language
- [conf\_lalr.py](conf_lalr.py) - Demonstrates the power of LALR's contextual lexer on a toy configuration language
- [conf\_earley.py](conf_earley.py) - Demonstrates the power of Earley's dynamic lexer on a toy configuration language
- [reconstruct\_json.py](reconstruct_json.py) - Demonstrates the experimental text-reconstruction feature

examples/conf_nolex.py → examples/conf_earley.py (+6, -7)

@@ -1,16 +1,14 @@
#
# This example demonstrates scanless parsing using the dynamic-lexer earley frontend
# This example demonstrates parsing using the dynamic-lexer earley frontend
#
# Using a lexer for configuration files is tricky, because values don't
# have to be surrounded by delimiters. Using a standard lexer for this just won't work.
#
# In this example we use a dynamic lexer and let the Earley parser resolve the ambiguity.
#
# Future versions of lark will make it easier to write these kinds of grammars.
#
# Another approach is to use the contextual lexer with LALR. It is less powerful than Earley,
# but it can handle some ambiguity when lexing and it's much faster.
# See examples/conf.py for an example of that approach.
# See examples/conf_lalr.py for an example of that approach.
#


@@ -19,14 +17,14 @@ from lark import Lark
parser = Lark(r"""
start: _NL? section+
section: "[" NAME "]" _NL item+
item: NAME "=" VALUE _NL
VALUE: /./*
item: NAME "=" VALUE? _NL
VALUE: /./+
%import common.CNAME -> NAME
%import common.NEWLINE -> _NL

%import common.WS_INLINE
%ignore WS_INLINE
""", lexer='dynamic')
""", parser="earley")

def test():
sample_conf = """
@@ -34,6 +32,7 @@ def test():

a=Hello
this="that",4
empty=
"""

r = parser.parse(sample_conf)
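
One plausible reading of the `VALUE` change above (an assumption; the commit does not state its reasoning) is that `/./*` can match the empty string while `/./+` cannot, so the "no value" case moves into the rule as `VALUE?`. The sketch below shows only that regex difference using Python's `re` module, not Lark itself; the variable names are illustrative.

```python
import re

# The old terminal body, read as a plain regex: it also accepts "".
star_matches_empty = re.fullmatch(r'.*', '') is not None

# The new terminal body requires at least one character.
plus_matches_empty = re.fullmatch(r'.+', '') is not None

# A non-empty value from the sample config still matches the new form.
plus_matches_value = re.fullmatch(r'.+', '"that",4') is not None

print(star_matches_empty, plus_matches_empty, plus_matches_value)
```

With the terminal no longer able to be empty, a line such as `empty=` is accepted only because the rule now marks `VALUE` as optional.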

examples/conf.py → examples/conf_lalr.py (+5, -4)

@@ -1,16 +1,16 @@
#
# This example demonstrates the power of the contextual lexer, by parsing a config file.
#
# The tokens NAME and VALUE match the same input. A regular lexer would arbitrarily
# The tokens NAME and VALUE match the same input. A standard lexer would arbitrarily
# choose one over the other, which would lead to a (confusing) parse error.
# However, due to the unambiguous structure of the grammar, the LALR(1) algorithm knows
# However, due to the unambiguous structure of the grammar, Lark's LALR(1) algorithm knows
# which one of them to expect at each point during the parse.
# The lexer then only matches the tokens that the parser expects.
# The result is a correct parse, something that is impossible with a regular lexer.
#
# Another approach is to discard a lexer altogether and use the Earley algorithm.
# It will handle more cases than the contextual lexer, but at the cost of performance.
# See examples/conf_nolex.py for an example of that approach.
# See examples/conf_earley.py for an example of that approach.
#

from lark import Lark
@@ -25,13 +25,14 @@ parser = Lark(r"""

%import common.WS_INLINE
%ignore WS_INLINE
""", parser="lalr", lexer="contextual")
""", parser="lalr")


sample_conf = """
[bla]
a=Hello
this="that",4
empty=
"""

print(parser.parse(sample_conf).pretty())
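
The idea described in the header comment, that the lexer only tries the terminals the parser expects, can be sketched without Lark. Everything below is a toy under stated assumptions: the terminal patterns, the `expected` list, and the helper names are hypothetical, and a real contextual lexer would derive the accepted set from the LALR parse table rather than from a hard-coded list.

```python
import re

# Hypothetical terminals; NAME and VALUE can both match the same text.
TERMINALS = {
    'NAME': re.compile(r'[A-Za-z_]\w*'),
    'VALUE': re.compile(r'[^\n]+'),
    'EQUAL': re.compile(r'='),
}

def next_token(text, pos, accepted):
    """Try only the terminals that are acceptable at this point."""
    for name in accepted:
        m = TERMINALS[name].match(text, pos)
        if m:
            return name, m.group(), m.end()
    raise ValueError('no accepted terminal matches at position %d' % pos)

# Stand-in for parser states while reading one `NAME "=" VALUE` item.
expected = [('NAME',), ('EQUAL',), ('VALUE',)]

def lex_item(line):
    pos, tokens = 0, []
    for accepted in expected:
        name, value, pos = next_token(line, pos, accepted)
        tokens.append((name, value))
    return tokens

tokens = lex_item('a=Hello')
print(tokens)

# Without context, VALUE alone would swallow the whole line.
greedy = TERMINALS['VALUE'].match('a=Hello').group()
print(greedy)
```

The last two lines illustrate the failure mode the comment attributes to a standard lexer: with no knowledge of what the parser expects, `VALUE` can consume text that should have been a `NAME`.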

lark/grammar.py (+2, -5)

@@ -46,20 +46,17 @@ class Rule(object):


class RuleOptions:
def __init__(self, keep_all_tokens=False, expand1=False, create_token=None, filter_out=False, priority=None):
def __init__(self, keep_all_tokens=False, expand1=False, filter_out=False, priority=None):
self.keep_all_tokens = keep_all_tokens
self.expand1 = expand1
self.create_token = create_token # used for scanless postprocessing
self.priority = priority

self.filter_out = filter_out # remove this rule from the tree
# used for "token"-rules in scanless

def __repr__(self):
return 'RuleOptions(%r, %r, %r, %r, %r)' % (
return 'RuleOptions(%r, %r, %r, %r)' % (
self.keep_all_tokens,
self.expand1,
self.create_token,
self.priority,
self.filter_out
)

lark/lark.py (+4, -4)

@@ -23,9 +23,9 @@ class LarkOptions(object):
Note: "lalr" requires a lexer

lexer - Decides whether or not to use a lexer stage
None: Don't use a lexer (scanless, only works with parser="earley")
"standard": Use a standard lexer
"contextual": Stronger lexer (only works with parser="lalr")
"dynamic": Flexible and powerful (only with parser="earley")
"auto" (default): Choose for me based on grammar and parser

ambiguity - Decides how to handle ambiguity in the parse. Only relevant if parser="earley"
@@ -131,7 +131,7 @@ class Lark:

if self.options.lexer == 'auto':
if self.options.parser == 'lalr':
self.options.lexer = 'standard'
self.options.lexer = 'contextual'
elif self.options.parser == 'earley':
self.options.lexer = 'dynamic'
elif self.options.parser == 'cyk':
@@ -139,7 +139,7 @@ class Lark:
else:
assert False, self.options.parser
lexer = self.options.lexer
assert lexer in ('standard', 'contextual', 'dynamic', None)
assert lexer in ('standard', 'contextual', 'dynamic')

if self.options.ambiguity == 'auto':
if self.options.parser == 'earley':
@@ -154,7 +154,7 @@ class Lark:
self.grammar = load_grammar(grammar, self.source)

# Compile the EBNF grammar into BNF
tokens, self.rules, self.ignore_tokens = self.grammar.compile(lexer=bool(lexer), start=self.options.start)
tokens, self.rules, self.ignore_tokens = self.grammar.compile()

self.lexer_conf = LexerConf(tokens, self.ignore_tokens, self.options.postlex, self.options.lexer_callbacks)
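
The net effect of the `lexer` changes in this file can be sketched in isolation. `resolve_lexer` is a hypothetical helper, not a Lark function; it mirrors only the branches visible in the hunks above, raises `ValueError` where the real code uses `assert`, and leaves out the `cyk` branch because its body is not shown.

```python
# Hypothetical stand-alone sketch of lexer resolution after this commit.
VALID_LEXERS = ('standard', 'contextual', 'dynamic')

def resolve_lexer(parser, lexer='auto'):
    """Return the lexer name that would be used for the given parser."""
    if lexer == 'auto':
        if parser == 'lalr':
            lexer = 'contextual'  # was 'standard' before this commit
        elif parser == 'earley':
            lexer = 'dynamic'
        else:
            # The 'cyk' branch exists in the real code, but its body is
            # not visible in the hunk, so this sketch does not guess it.
            raise ValueError('parser not covered by this sketch: %r' % (parser,))
    if lexer not in VALID_LEXERS:
        # lexer=None (scanless) is no longer an accepted value.
        raise ValueError('unknown lexer: %r' % (lexer,))
    return lexer

print(resolve_lexer('lalr'))
print(resolve_lexer('earley'))
print(resolve_lexer('lalr', 'standard'))
```

Callers that relied on the old LALR default can keep it by passing `lexer="standard"` explicitly, which is what the change to `lark/tools/standalone.py` in this commit does.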



lark/load_grammar.py (+3, -73)

@@ -363,12 +363,6 @@ class PrepareLiterals(InlineTransformer):
regexp = '[%s-%s]' % (start, end)
return ST('pattern', [PatternRE(regexp)])

class SplitLiterals(InlineTransformer):
def pattern(self, p):
if isinstance(p, PatternStr) and len(p.value)>1:
return ST('expansion', [ST('pattern', [PatternStr(ch, flags=p.flags)]) for ch in p.value])
return ST('pattern', [p])

class TokenTreeToPattern(Transformer):
def pattern(self, ps):
p ,= ps
@@ -405,15 +399,6 @@ class TokenTreeToPattern(Transformer):
def alias(self, t):
raise GrammarError("Aliasing not allowed in terminals (You used -> in the wrong place)")

def _interleave(l, item):
for e in l:
yield e
if isinstance(e, Tree):
if e.data in ('literal', 'range'):
yield item
elif is_terminal(e):
yield item

def _choice_of_rules(rules):
return ST('expansions', [ST('expansion', [Token('RULE', name)]) for name in rules])

@@ -423,62 +408,9 @@ class Grammar:
self.rule_defs = rule_defs
self.ignore = ignore

def _prepare_scanless_grammar(self, start):
# XXX Pretty hacky! There should be a better way to write this method..

rule_defs = deepcopy(self.rule_defs)
term_defs = self.token_defs

# Implement the "%ignore" feature without a lexer..
terms_to_ignore = {name:'__'+name for name in self.ignore}
if terms_to_ignore:
assert set(terms_to_ignore) <= {name for name, _t in term_defs}

term_defs = [(terms_to_ignore.get(name,name),t) for name,t in term_defs]
expr = Token('RULE', '__ignore')
for r, tree, _o in rule_defs:
for exp in tree.find_data('expansion'):
exp.children = list(_interleave(exp.children, expr))
if r == start:
exp.children = [expr] + exp.children
for exp in tree.find_data('expr'):
exp.children[0] = ST('expansion', list(_interleave(exp.children[:1], expr)))

_ignore_tree = ST('expr', [_choice_of_rules(terms_to_ignore.values()), Token('OP', '?')])
rule_defs.append(('__ignore', _ignore_tree, None))

# Convert all tokens to rules
new_terminal_names = {name: '__token_'+name for name, _t in term_defs}

for name, tree, options in rule_defs:
for exp in chain( tree.find_data('expansion'), tree.find_data('expr') ):
for i, sym in enumerate(exp.children):
if sym in new_terminal_names:
exp.children[i] = Token(sym.type, new_terminal_names[sym])

for name, (tree, priority) in term_defs: # TODO transfer priority to rule?
if any(tree.find_data('alias')):
raise GrammarError("Aliasing not allowed in terminals (You used -> in the wrong place)")

if name.startswith('_'):
options = RuleOptions(filter_out=True, priority=-priority)
else:
options = RuleOptions(keep_all_tokens=True, create_token=name, priority=-priority)

name = new_terminal_names[name]
inner_name = name + '_inner'
rule_defs.append((name, _choice_of_rules([inner_name]), None))
rule_defs.append((inner_name, tree, options))

return [], rule_defs


def compile(self, lexer=False, start=None):
if not lexer:
token_defs, rule_defs = self._prepare_scanless_grammar(start)
else:
token_defs = list(self.token_defs)
rule_defs = self.rule_defs
def compile(self):
token_defs = list(self.token_defs)
rule_defs = self.rule_defs

# =================
# Compile Tokens
@@ -495,8 +427,6 @@ class Grammar:

# 1. Pre-process terminals
transformer = PrepareLiterals()
if not lexer:
transformer *= SplitLiterals()
transformer *= ExtractAnonTokens(tokens) # Adds to tokens

# 2. Convert EBNF to BNF (and apply step 1)


lark/parse_tree_builder.py (+0, -13)

@@ -18,17 +18,6 @@ class ExpandSingleChild:
return self.node_builder(children)


class CreateToken:
"Used for fixing the results of scanless parsing"

def __init__(self, token_name, node_builder):
self.node_builder = node_builder
self.token_name = token_name

def __call__(self, children):
return self.node_builder( [Token(self.token_name, ''.join(children))] )


class PropagatePositions:
def __init__(self, node_builder):
self.node_builder = node_builder
@@ -116,10 +105,8 @@ class ParseTreeBuilder:
options = rule.options
keep_all_tokens = self.always_keep_all_tokens or (options.keep_all_tokens if options else False)
expand_single_child = options.expand1 if options else False
create_token = options.create_token if options else False

wrapper_chain = filter(None, [
create_token and partial(CreateToken, create_token),
(expand_single_child and not rule.alias) and ExpandSingleChild,
maybe_create_child_filter(rule.expansion, () if keep_all_tokens else filter_out, self.ambiguous),
self.propagate_positions and PropagatePositions,


lark/parser_frontends.py (+1, -27)

@@ -72,30 +72,6 @@ def tokenize_text(text):
col_start_pos = i + ch.rindex('\n')
yield Token('CHAR', ch, line=line, column=i - col_start_pos)

class Earley_NoLex:
def __init__(self, lexer_conf, parser_conf, options=None):
self._prepare_match(lexer_conf)

self.parser = earley.Parser(parser_conf, self.match,
resolve_ambiguity=get_ambiguity_resolver(options))


def match(self, term, text, index=0):
return self.regexps[term.name].match(text, index)

def _prepare_match(self, lexer_conf):
self.regexps = {}
for t in lexer_conf.tokens:
regexp = t.pattern.to_regexp()
width = get_regexp_width(regexp)
if width != (1,1):
raise GrammarError('Scanless parsing (lexer=None) requires all tokens to have a width of 1 (terminal %s: %s is %s)' % (t.name, regexp, width))
self.regexps[t.name] = re.compile(regexp)

def parse(self, text):
token_stream = tokenize_text(text)
return self.parser.parse(token_stream)

class Earley(WithLexer):
def __init__(self, lexer_conf, parser_conf, options=None):
self.init_traditional_lexer(lexer_conf)
@@ -190,9 +166,7 @@ def get_frontend(parser, lexer):
else:
raise ValueError('Unknown lexer: %s' % lexer)
elif parser=='earley':
if lexer is None:
return Earley_NoLex
elif lexer=='standard':
if lexer=='standard':
return Earley
elif lexer=='dynamic':
return XEarley


lark/tools/standalone.py (+1, -1)

@@ -168,7 +168,7 @@ class TreeBuilderAtoms:
print('parse_tree_builder = ParseTreeBuilder(RULES.values(), Tree)')

def main(fobj, start):
lark_inst = Lark(fobj, parser="lalr", start=start)
lark_inst = Lark(fobj, parser="lalr", lexer="standard", start=start)

lexer_atoms = LexerAtoms(lark_inst.parser.lexer)
parser_atoms = ParserAtoms(lark_inst.parser.parser)


tests/__main__.py (+2, -2)

@@ -19,10 +19,10 @@ from .test_parser import (
TestEarleyStandard,
TestCykStandard,
TestLalrContextual,
TestEarleyScanless,
# TestEarleyScanless,
TestEarleyDynamic,

TestFullEarleyScanless,
# TestFullEarleyScanless,
TestFullEarleyDynamic,

TestParsers,


tests/test_parser.py (+17, -36)

@@ -48,9 +48,6 @@ class TestParsers(unittest.TestCase):

self.assertRaises(GrammarError, Lark, g, parser='lalr')

l = Lark(g, parser='earley', lexer=None)
self.assertRaises(ParseError, l.parse, 'a')

l = Lark(g, parser='earley', lexer='dynamic')
self.assertRaises(ParseError, l.parse, 'a')

@@ -155,7 +152,7 @@ class TestParsers(unittest.TestCase):

def _make_full_earley_test(LEXER):
class _TestFullEarley(unittest.TestCase):
def test_anon_in_scanless(self):
def test_anon(self):
# Fails an Earley implementation without special handling for empty rules,
# or re-processing of already completed rules.
g = Lark(r"""start: B
@@ -164,14 +161,14 @@ def _make_full_earley_test(LEXER):

self.assertEqual( g.parse('abc').children[0], 'abc')

def test_earley_scanless(self):
def test_earley(self):
g = Lark("""start: A "b" c
A: "a"+
c: "abc"
""", parser="earley", lexer=LEXER)
x = g.parse('aaaababc')

def test_earley_scanless2(self):
def test_earley2(self):
grammar = """
start: statement+

@@ -187,24 +184,19 @@ def _make_full_earley_test(LEXER):
l.parse(program)


# XXX Fails for scanless mode
# XXX Decided not to fix, because
# a) It's a subtle bug
# b) Scanless is intended for deprecation
#
# def test_earley_scanless3(self):
# "Tests prioritization and disambiguation for pseudo-terminals (there should be only one result)"
def test_earley3(self):
"Tests prioritization and disambiguation for pseudo-terminals (there should be only one result)"

# grammar = """
# start: A A
# A: "a"+
# """
grammar = """
start: A A
A: "a"+
"""

# l = Lark(grammar, parser='earley', lexer=LEXER)
# res = l.parse("aaa")
# self.assertEqual(res.children, ['aa', 'a'])
l = Lark(grammar, parser='earley', lexer=LEXER)
res = l.parse("aaa")
self.assertEqual(res.children, ['aa', 'a'])

def test_earley_scanless4(self):
def test_earley4(self):
grammar = """
start: A A?
A: "a"+
@@ -259,7 +251,6 @@ def _make_full_earley_test(LEXER):
assert x.data == '_ambig', x
assert len(x.children) == 2

@unittest.skipIf(LEXER==None, "BUG in scanless parsing!") # TODO fix bug!
def test_fruitflies_ambig(self):
grammar = """
start: noun verb noun -> simple
@@ -350,7 +341,7 @@ def _make_full_earley_test(LEXER):
# assert x.data != '_ambig', x
# assert len(x.children) == 1

_NAME = "TestFullEarley" + (LEXER or 'Scanless').capitalize()
_NAME = "TestFullEarley" + LEXER.capitalize()
_TestFullEarley.__name__ = _NAME
globals()[_NAME] = _TestFullEarley

@@ -402,7 +393,6 @@ def _make_parser_test(LEXER, PARSER):
""")
g.parse(u'\xa3\u0101\u00a3')

@unittest.skipIf(LEXER is None, "Regexps >1 not supported with scanless parsing")
def test_unicode2(self):
g = _Lark(r"""start: UNIA UNIB UNIA UNIC
UNIA: /\xa3/
@@ -614,11 +604,7 @@ def _make_parser_test(LEXER, PARSER):
self.assertSequenceEqual(x.children, ['HelloWorld'])


@unittest.skipIf(LEXER is None, "Known bug with scanless parsing") # TODO
def test_token_collision2(self):
# NOTE: This test reveals a bug in token reconstruction in Scanless Earley
# I probably need to re-write grammar transformation

g = _Lark("""
!start: "starts"

@@ -662,7 +648,6 @@ def _make_parser_test(LEXER, PARSER):
x = g.parse('aaaab')
x = g.parse('b')

@unittest.skipIf(LEXER in (None, 'dynamic'), "Known bug with scanless parsing") # TODO
def test_token_not_anon(self):
"""Tests that "a" is matched as A, rather than an anonymous token.

@@ -755,7 +740,6 @@ def _make_parser_test(LEXER, PARSER):
""")
x = g.parse('AB')

@unittest.skipIf(LEXER == None, "Scanless can't handle regexps")
def test_regex_quote(self):
g = r"""
start: SINGLE_QUOTED_STRING | DOUBLE_QUOTED_STRING
@@ -866,7 +850,6 @@ def _make_parser_test(LEXER, PARSER):
"""
self.assertRaises( GrammarError, _Lark, g)

@unittest.skipIf(LEXER==None, "TODO: Fix scanless parsing or get rid of it") # TODO
def test_line_and_column(self):
g = r"""!start: "A" bc "D"
!bc: "B\nC"
@@ -1054,7 +1037,6 @@ def _make_parser_test(LEXER, PARSER):



@unittest.skipIf(LEXER==None, "Scanless doesn't support regular expressions")
@unittest.skipIf(PARSER == 'cyk', "No empty rules")
def test_ignore(self):
grammar = r"""
@@ -1081,7 +1063,6 @@ def _make_parser_test(LEXER, PARSER):
self.assertEqual(tree.children, [])


@unittest.skipIf(LEXER==None, "Scanless doesn't support regular expressions")
def test_regex_escaping(self):
g = _Lark("start: /[ab]/")
g.parse('a')
@@ -1188,7 +1169,7 @@ def _make_parser_test(LEXER, PARSER):



_NAME = "Test" + PARSER.capitalize() + (LEXER or 'Scanless').capitalize()
_NAME = "Test" + PARSER.capitalize() + LEXER.capitalize()
_TestParser.__name__ = _NAME
globals()[_NAME] = _TestParser

@@ -1199,13 +1180,13 @@ _TO_TEST = [
('dynamic', 'earley'),
('standard', 'lalr'),
('contextual', 'lalr'),
(None, 'earley'),
# (None, 'earley'),
]

for _LEXER, _PARSER in _TO_TEST:
_make_parser_test(_LEXER, _PARSER)

for _LEXER in (None, 'dynamic'):
for _LEXER in ('dynamic',):
_make_full_earley_test(_LEXER)

if __name__ == '__main__':
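
The simplified `_NAME` expressions depend on `None` no longer appearing in `_TO_TEST`. A minimal sketch of why, using a hypothetical helper `make_test_name` that copies the new naming expression:

```python
def make_test_name(parser, lexer):
    # Same shape as the new expression in _make_parser_test.
    return "Test" + parser.capitalize() + lexer.capitalize()

lalr_name = make_test_name('lalr', 'contextual')
earley_name = make_test_name('earley', 'dynamic')
print(lalr_name, earley_name)

# With the old (None, 'earley') entry, the `or 'Scanless'` fallback was
# needed because None has no .capitalize() method.
try:
    make_test_name('earley', None)
    none_failed = False
except AttributeError:
    none_failed = True
print(none_failed)
```

The generated names line up with the classes still imported in `tests/__main__.py`, such as `TestLalrContextual` and `TestEarleyDynamic`.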

