diff --git a/README.md b/README.md
index 2b47706..9a13831 100644
--- a/README.md
+++ b/README.md
@@ -28,7 +28,7 @@ Here is a little program to parse "Hello, World!" (Or any other similar phrase):
 from lark import Lark
 l = Lark('''start: WORD "," WORD "!"
             WORD: /\w+/
-            SPACE.ignore: " "
+            %ignore " "
          ''')
 print( l.parse("Hello, World!") )
 ```
@@ -53,11 +53,12 @@ parser = Lark('''?sum: product
          | product "*" item -> mul
          | product "/" item -> div
-    ?item: /[\d.]+/ -> number
+    ?item: NUMBER -> number
          | "-" item -> neg
          | "(" sum ")"
-    SPACE.ignore: /\s+/
+    %import common.NUMBER
+    %ignore /\s+/
 ''', start='sum')

 class CalculateTree(InlineTransformer):
@@ -92,24 +93,24 @@ Lark has no dependencies.

 ## List of Features

- - EBNF grammar with a little extra
+ - Python 2 & 3 compatible
  - Earley & LALR(1)
+ - EBNF grammar with a little extra
  - Builds an AST automagically based on the grammar
- - Optional Lexer
- - Automatic line & column tracking
- - Automatic token collision resolution (unless both tokens are regexps)
- - Python 2 & 3 compatible
+ - Standard library of terminals (strings, numbers, names, etc.)
  - Unicode fully supported
  - Extensive test suite
+ - Lexer (optional)
+ - Automatic line & column tracking
+ - Automatic token collision resolution (unless both terminals are regexps)
+ - Contextual lexing for LALR

 ## Coming soon

 These features are planned to be implemented in the near future:

- - Standard library of tokens (string, int, name, etc.)
- - Contextual lexing for LALR (already working, needs some finishing touches)
  - Parser generator - create a small parser, independent of Lark, to embed in your project.
- - Grammar composition (in cases that the tokens can reliably signify a grammar change)
+ - Grammar composition
  - Optimizations in both the parsers and the lexer
  - Better handling of ambiguity

diff --git a/docs/json_tutorial.md b/docs/json_tutorial.md
index 23a6161..c83d6c7 100644
--- a/docs/json_tutorial.md
+++ b/docs/json_tutorial.md
@@ -20,13 +20,13 @@ Knowledge assumed:
 Lark accepts its grammars in a format called [EBNF](https://www.wikiwand.com/en/Extended_Backus%E2%80%93Naur_form). It basically looks like this:

-    rule_name : list of rules and TOKENS to match
+    rule_name : list of rules and TERMINALS to match
               | another possible list of items
               | etc.

-    TOKEN: "some text to match"
+    TERMINAL: "some text to match"

-(*a token is a string or a regular expression*)
+(*a terminal is a string or a regular expression*)

 The parser will try to match each rule (left-part) by matching its items (right-part) sequentially, trying each alternative (in practice, the parser is predictive, so it doesn't have to try every alternative).

@@ -57,20 +57,32 @@ A quick explanation of the syntax:

 Lark also supports the rule+ operator, meaning one or more instances. It also supports the rule? operator, which is another way to say *optional*.

-Of course, we still haven't defined "STRING" and "NUMBER".
+Of course, we still haven't defined "STRING" and "NUMBER". Luckily, both of these terminals are already defined in Lark's common library:

-We'll do that now, and also take care of the white-space, which is part of the text.
+    %import common.ESCAPED_STRING -> STRING
+    %import common.SIGNED_NUMBER -> NUMBER
+
+The arrow (->) renames the imported terminals. Since the renaming only adds obscurity in this case, we'll use their original names from here on.
+
+We'll also take care of the white-space, which is part of the text.
+
+    %import common.WS
+    %ignore WS
+
+We tell our parser to ignore whitespace. Otherwise, we'd have to fill our grammar with WS terminals.
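The ignore mechanism described above can be sketched in miniature. This is not Lark's implementation, just a toy lexer (with made-up terminal names) showing the idea: an ignored terminal is still matched, but is dropped before the parser ever sees it.

```python
import re

# Toy sketch of %ignore: SPACE is matched like any other terminal,
# but filtered out of the token stream before parsing.
TERMINALS = [
    ("WORD",  re.compile(r"\w+")),
    ("COMMA", re.compile(r",")),
    ("BANG",  re.compile(r"!")),
    ("SPACE", re.compile(r" +")),   # the terminal a grammar would %ignore
]
IGNORED = {"SPACE"}

def lex(text):
    tokens, pos = [], 0
    while pos < len(text):
        for name, pattern in TERMINALS:
            m = pattern.match(text, pos)
            if m:
                if name not in IGNORED:      # drop ignored terminals here
                    tokens.append((name, m.group()))
                pos = m.end()
                break
        else:
            raise ValueError("No terminal matches at position %d" % pos)
    return tokens

print(lex("Hello, World!"))
# [('WORD', 'Hello'), ('COMMA', ','), ('WORD', 'World'), ('BANG', '!')]
```

Without the `IGNORED` filter, every rule in the grammar would have to mention whitespace explicitly, which is exactly what `%ignore WS` saves us from.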
+
+By the way, if you're curious what these terminals signify, they are roughly equivalent to this:

     NUMBER : /-?\d+(\.\d+)?([eE][+-]?\d+)?/
     STRING : /".*?(?<!\\)"/

@@ ... @@
     ?value: object
           | array
           | string
-          | number
+          | SIGNED_NUMBER      -> number
           | "true"             -> true
           | "false"            -> false
           | "null"             -> null
     ...
-    number : /-?\d+(\.\d+)?([eE][+-]?\d+)?/
-    string : /".*?(?<!\\)"/
+    string : ESCAPED_STRING
+
+    %import common.ESCAPED_STRING
+    %import common.SIGNED_NUMBER
+    %import common.WS
+    %ignore WS

@@ ... @@
     ?value: object
           | array
           | string
-          | number
+          | SIGNED_NUMBER      -> number
           | "true"             -> true
           | "false"            -> false
           | "null"             -> null

@@ -175,10 +186,12 @@ json_parser = Lark(r"""
     dict : "{" [pair ("," pair)*] "}"
     pair : string ":" value

-    number : /-?\d+(\.\d+)?([eE][+-]?\d+)?/
-    string : /".*?(?<!\\)"/
+    string : ESCAPED_STRING
+
+    %import common.ESCAPED_STRING
+    %import common.SIGNED_NUMBER
+    %import common.WS
+    %ignore WS

@@ ... @@
     ?value: object
           | array
           | string
-          | number
+          | SIGNED_NUMBER      -> number
           | "true"             -> true
           | "false"            -> false
           | "null"             -> null

@@ -292,10 +305,12 @@ json_grammar = r"""
     dict : "{" [pair ("," pair)*] "}"
     pair : string ":" value

-    number : /-?\d+(\.\d+)?([eE][+-]?\d+)?/
-    string : /".*?(?<!\\)"/
+    string : ESCAPED_STRING
+
+    %import common.ESCAPED_STRING
+    %import common.SIGNED_NUMBER
+    %import common.WS
+    %ignore WS

@@ ... @@
     ... > /dev/null

-    real    0m7.722s
-    user    0m7.504s
-    sys     0m0.175s
+    real    0m7.554s
+    user    0m7.352s
+    sys     0m0.148s

 Ah, that's much better. The resulting JSON is of course exactly the same. You can run it for yourself and see.
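Since the tutorial leans on these two regexes, they are easy to sanity-check directly with Python's re module. A quick sketch — note that the `(?<!\\)` negative lookbehind in the STRING pattern (stopping the lazy match only at a quote that isn't escaped) is my reading of the pattern, stated here as an assumption:

```python
import re

# The regexes the tutorial gives as rough equivalents of
# common.SIGNED_NUMBER and common.ESCAPED_STRING.
NUMBER = re.compile(r"-?\d+(\.\d+)?([eE][+-]?\d+)?")
STRING = re.compile(r'".*?(?<!\\)"')  # lazy match, can't end on an escaped quote

assert NUMBER.fullmatch("-12.5e3")
assert NUMBER.fullmatch("42")
assert not NUMBER.fullmatch("1.")            # digits required after the dot

assert STRING.fullmatch('"hello"')
assert STRING.fullmatch(r'"say \"hi\""')     # escaped quotes don't end the string
assert not STRING.fullmatch('"unterminated')
```

This kind of spot check is a cheap way to convince yourself the imported terminals behave the way the grammar assumes.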
 Comments start with // and last to the end of the line (C++ style).

 Lark begins the parse with the rule 'start', unless specified otherwise in the options.

-### Tokens
-
-Tokens are defined in terms of:
-
-    NAME : "string" or /regexp/
-
-    NAME.ignore : ..
-
-.ignore is a flag that drops the token before it reaches the parser (usually whitespace)
-
-Example:
-
-    IF: "if"
-
-    INTEGER : /[0-9]+/
-
-    WHITESPACE.ignore: /[ \t\n]+/
+It might help to think of rules and terminals as existing in two separate layers, so that all the terminals are recognized first, and all the rules are recognized afterwards. This is not always how things happen (depending on your choice of parser & lexer), but the concept is relevant in all cases.

 ### Rules

@@ -47,9 +33,9 @@ Each rule is defined in terms of:

 An alias is a name for the specific rule alternative. It affects tree construction.

 An item is a:
+
 - rule
-- token
+- terminal
 - (item item ..) - Group items
 - [item item ..] - Maybe. Same as: "(item item ..)?"
 - item? - Zero or one instances of item ("maybe")
@@ -66,13 +52,28 @@ Example:

     DIGIT: /[0-9]/

+### Terminals
+
+Terminals are defined just like rules, but cannot contain rules:
+
+    NAME : list of items to match
+
+Example:
+
+    IF: "if"
+    INTEGER : /[0-9]+/
+    DECIMAL: INTEGER "." INTEGER
+    WHITESPACE: (" " | /\t/ )+
+
 ## Tree Construction

 Lark builds a tree automatically based on the structure of the grammar. It also accepts some hints.

 In general, Lark will place each rule as a branch, and its matches as the children of the branch.

-Using item+ or item\* will result in a list of items.
+Terminals are always values in the tree, never branches.
+
+In grammar rules, using item+ or item\* will result in a list of items.

 Example:

@@ -81,6 +82,8 @@ Example:

     NAME: /\w+/

+    %ignore " "
+
 Lark will parse "(((hello world)))" as:

     expr
         expr
             expr
                 "hello"
                 "world"

 The brackets do not appear in the tree by design.
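The bracket example works because anonymous literals like "(" are filtered out of the tree. A toy sketch of that filtering decision (this is not Lark's actual code; the function name and signature are invented for illustration):

```python
# Toy sketch of the tree-filtering rule: anonymous string literals and
# terminals whose name starts with an underscore are dropped from the
# tree; unnamed regexps and ordinary named terminals are kept as values.
def keep_in_tree(terminal_name, is_literal_string):
    if terminal_name is None:
        # unnamed regexps stay, unnamed literals like "(" or "keyword" go
        return not is_literal_string
    # _DIGIT is filtered, DIGIT appears
    return not terminal_name.startswith("_")

assert keep_in_tree("DIGIT", False)        # named terminal -> appears
assert keep_in_tree(None, False)           # unnamed regexp -> appears
assert not keep_in_tree(None, True)        # "(" or "keyword" -> filtered
assert not keep_in_tree("_DIGIT", False)   # underscore prefix -> filtered
```

The sketch only mirrors the rule as the documentation states it; the point is that punctuation disappears automatically while meaningful terminals survive as leaf values.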
-Tokens that won't appear in the tree are:
+Terminals that won't appear in the tree are:

- - Unnamed strings (like "keyword" or "+")
- - Tokens whose name starts with an underscore (like \_DIGIT)
+ - Unnamed literals (like "keyword" or "+")
+ - Terminals whose name starts with an underscore (like \_DIGIT)

-Tokens that *will* appear in the tree are:
+Terminals that *will* appear in the tree are:

  - Unnamed regular expressions (like /[0-9]/)
- - Named tokens whose name starts with a letter (like DIGIT)
+ - Named terminals whose name starts with a letter (like DIGIT)

 ## Shaping the tree

@@ -133,7 +136,9 @@ Lark will parse "hello world (planet)" as:
         "world"
         "planet"

-c. Aliases - options in a rule can receive an alias. It will be then used as the branch name for the option.
+c. Rules that begin with an exclamation mark will keep all their terminals (they won't get filtered).
+
+d. Aliases - options in a rule can receive an alias. It will then be used as the branch name for the option.

 Example:

@@ -153,15 +158,19 @@ When initializing the Lark object, you can provide it with keyword options:

 - start - The start symbol (Default: "start")
 - parser - Decides which parser engine to use, "earley" or "lalr". (Default: "earley")
-  Note: Both will use Lark's lexer.
+  Note: "lalr" requires a lexer
+- lexer - Decides whether or not to use a lexer stage
+  - None: Don't use a lexer
+  - "standard": Use a standard lexer
+  - "contextual": Stronger lexer (only works with parser="lalr")
+  - "auto" (default): Choose for me based on grammar and parser
+
 - transformer - Applies the transformer to every parse tree (only allowed with parser="lalr")
-- only\_lex - Don't build a parser. Useful for debugging (default: False)
 - postlex - Lexer post-processing (Default: None)
-- profile - Measure run-time usage in Lark. Read results from the profiler property (Default: False)

 To be supported:

 - debug
 - cache\_grammar
 - keep\_all\_tokens
-
+- profile - Measure run-time usage in Lark.
  Read results from the profiler property (Default: False)

diff --git a/examples/json_parser.py b/examples/json_parser.py
index 5b910ef..ba4efbd 100644
--- a/examples/json_parser.py
+++ b/examples/json_parser.py
@@ -15,7 +15,7 @@ json_grammar = r"""
     ?value: object
           | array
          | string
-          | number
+          | SIGNED_NUMBER      -> number
           | "true"             -> true
           | "false"            -> false
           | "null"             -> null
@@ -24,7 +24,6 @@ json_grammar = r"""
     object : "{" [pair ("," pair)*] "}"
     pair : string ":" value

-    number: SIGNED_NUMBER
     string : ESCAPED_STRING

     %import common.ESCAPED_STRING

diff --git a/lark/lark.py b/lark/lark.py
index 3fb4d52..2624ada 100644
--- a/lark/lark.py
+++ b/lark/lark.py
@@ -18,9 +18,10 @@ class LarkOptions(object):
     """
     OPTIONS_DOC = """
-        parser - Which parser engine to use ("earley" or "lalr". Default: "earley")
+        parser - Decides which parser engine to use, "earley" or "lalr". (Default: "earley")
                  Note: "lalr" requires a lexer
-        lexer - Whether or not to use a lexer stage
+
+        lexer - Decides whether or not to use a lexer stage
                 None: Don't use a lexer
                 "standard": Use a standard lexer
                 "contextual": Stronger lexer (only works with parser="lalr")
@@ -28,7 +29,6 @@ class LarkOptions(object):
         transformer - Applies the transformer to every parse tree
         debug - Affects verbosity (default: False)
-        only_lex - Don't build a parser.
                   Useful for debugging (default: False)
         keep_all_tokens - Don't automagically remove "punctuation" tokens (default: False)
         cache_grammar - Cache the Lark grammar (Default: False)
         postlex - Lexer post-processing (Default: None)
@@ -40,7 +40,6 @@ class LarkOptions(object):
         o = dict(options_dict)

         self.debug = bool(o.pop('debug', False))
-        self.only_lex = bool(o.pop('only_lex', False))
         self.keep_all_tokens = bool(o.pop('keep_all_tokens', False))
         self.tree_class = o.pop('tree_class', Tree)
         self.cache_grammar = o.pop('cache_grammar', False)
@@ -51,12 +50,13 @@ class LarkOptions(object):
         self.start = o.pop('start', 'start')
         self.profile = o.pop('profile', False)

-        # assert self.parser in ENGINE_DICT
+        assert self.parser in ('earley', 'lalr', None)
+
         if self.parser == 'earley' and self.transformer:
             raise ValueError('Cannot specify an auto-transformer when using the Earley algorithm. '
                              'Please use your transformer on the resulting parse tree, or use a different algorithm (i.e. lalr)')
         if self.keep_all_tokens:
-            raise NotImplementedError("Not implemented yet!")
+            raise NotImplementedError("keep_all_tokens: Not implemented yet!")

         if o:
             raise ValueError("Unknown options: %s" % o.keys())
@@ -166,8 +166,6 @@ class Lark:
         return stream

     def parse(self, text):
-        assert not self.options.only_lex
-
         return self.parser.parse(text)

         # if self.profiler:
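The LarkOptions constructor above follows a simple and reusable validation pattern: pop each known option out of a copy of the dict, then treat anything left over as an error. A minimal sketch of the same pattern (the class and option names here are illustrative, not Lark's API):

```python
# Miniature version of the options-dict pattern used by LarkOptions:
# consume known keys with pop(), validate, and reject leftovers so that
# a misspelled option fails loudly instead of being silently ignored.
class Options:
    def __init__(self, options_dict):
        o = dict(options_dict)                  # copy; pop() mutates it
        self.debug = bool(o.pop('debug', False))
        self.parser = o.pop('parser', 'earley')
        assert self.parser in ('earley', 'lalr', None)
        if o:                                   # anything left is unknown
            raise ValueError("Unknown options: %s" % list(o.keys()))

opts = Options({'parser': 'lalr'})
print(opts.parser)   # lalr
```

The payoff is the final `if o:` check: because every recognized key was popped, whatever remains must be a typo or an unsupported option, exactly like the `raise ValueError("Unknown options: ...")` in the diff.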