
Updated docs to match v0.2

Erez Shinan, 7 years ago · parent commit df4d471641
5 changed files with 115 additions and 93 deletions
  1. README.md (+12, -11)
  2. docs/json_tutorial.md (+55, -40)
  3. docs/reference.md (+41, -32)
  4. examples/json_parser.py (+1, -2)
  5. lark/lark.py (+6, -8)

README.md (+12, -11)

@@ -28,7 +28,7 @@ Here is a little program to parse "Hello, World!" (Or any other similar phrase):
from lark import Lark
l = Lark('''start: WORD "," WORD "!"
WORD: /\w+/
SPACE.ignore: " "
%ignore " "
''')
print( l.parse("Hello, World!") )
```
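
For reference, the snippet's new form runs as-is (a quick sketch; the printed repr is approximate and may differ between Lark versions):

```
from lark import Lark

l = Lark(r'''start: WORD "," WORD "!"
             WORD: /\w+/
             %ignore " "
         ''')
print( l.parse("Hello, World!") )
# Roughly: Tree(start, [Token(WORD, 'Hello'), Token(WORD, 'World')])
```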
@@ -53,11 +53,12 @@ parser = Lark('''?sum: product
| product "*" item -> mul
| product "/" item -> div

?item: /[\d.]+/ -> number
?item: NUMBER -> number
| "-" item -> neg
| "(" sum ")"

SPACE.ignore: /\s+/
%import common.NUMBER
%ignore /\s+/
''', start='sum')

class CalculateTree(InlineTransformer):
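
Assembled into a runnable sketch, with the grammar in its updated form. The hunk stops right after the class line, so the transformer body below is my own guess, mirroring the rule aliases (add, sub, mul, div, neg, number) rather than the commit's actual code; InlineTransformer is used as Lark exported it at the time (later versions replaced it):

```
from lark import Lark, InlineTransformer

parser = Lark(r'''?sum: product
                 | sum "+" product -> add
                 | sum "-" product -> sub

             ?product: item
                 | product "*" item -> mul
                 | product "/" item -> div

             ?item: NUMBER -> number
                 | "-" item -> neg
                 | "(" sum ")"

             %import common.NUMBER
             %ignore /\s+/
''', start='sum')

class CalculateTree(InlineTransformer):
    def add(self, a, b): return a + b
    def sub(self, a, b): return a - b
    def mul(self, a, b): return a * b
    def div(self, a, b): return a / b
    def neg(self, a): return -a
    def number(self, n): return float(n)

print( CalculateTree().transform(parser.parse('1 + 2*3 - (4/2)')) )  # 5.0
```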
@@ -92,24 +93,24 @@ Lark has no dependencies.

## List of Features

- EBNF grammar with a little extra
- Python 2 & 3 compatible
- Earley & LALR(1)
- EBNF grammar with a little extra
- Builds an AST automagically based on the grammar
- Optional Lexer
- Automatic line & column tracking
- Automatic token collision resolution (unless both tokens are regexps)
- Python 2 & 3 compatible
- Standard library of terminals (strings, numbers, names, etc.)
- Unicode fully supported
- Extensive test suite
- Lexer (optional)
- Automatic line & column tracking
- Automatic token collision resolution (unless both terminals are regexps)
- Contextual lexing for LALR

## Coming soon

These features are planned to be implemented in the near future:

- Standard library of tokens (string, int, name, etc.)
- Contextual lexing for LALR (already working, needs some finishing touches)
- Parser generator - create a small parser, independent of Lark, to embed in your project.
- Grammar composition (in cases that the tokens can reliably signify a grammar change)
- Grammar composition
- Optimizations in both the parsers and the lexer
- Better handling of ambiguity



docs/json_tutorial.md (+55, -40)

@@ -20,13 +20,13 @@ Knowledge assumed:

Lark accepts its grammars in a format called [EBNF](https://www.wikiwand.com/en/Extended_Backus%E2%80%93Naur_form). It basically looks like this:

rule_name : list of rules and TOKENS to match
rule_name : list of rules and TERMINALS to match
| another possible list of items
| etc.

TOKEN: "some text to match"
TERMINAL: "some text to match"

(*a token is a string or a regular expression*)
(*a terminal is a string or a regular expression*)

The parser will try to match each rule (left-part) by matching its items (right-part) sequentially, trying each alternative (in practice, the parser is predictive, so we don't have to try every alternative).

@@ -57,20 +57,32 @@ A quick explanation of the syntax:

Lark also supports the rule+ operator, meaning one or more instances. It also supports the rule? operator which is another way to say *optional*.

Of course, we still haven't defined "STRING" and "NUMBER".
Of course, we still haven't defined "STRING" and "NUMBER". Luckily, both these literals are already defined in Lark's common library:

We'll do that now, and also take care of the white-space, which is part of the text.
%import common.ESCAPED_STRING -> STRING
%import common.SIGNED_NUMBER -> NUMBER

The arrow (->) renames the terminals. But that only adds obscurity in this case, so going forward we'll just use their original names.

We'll also take care of the white-space, which is part of the text.

%import common.WS
%ignore WS

We tell our parser to ignore whitespace. Otherwise, we'd have to fill our grammar with WS terminals.

By the way, if you're curious what these terminals signify, they are roughly equivalent to this:

NUMBER : /-?\d+(\.\d+)?([eE][+-]?\d+)?/
STRING : /".*?(?<!\\)"/
%ignore /[ \t\n\f\r]+/

WS.ignore: /[ \t\n]+/

Upper-case names signify tokens, while lower-case names signify rules. Rules can contain other rules and tokens, while tokens can only contain a single value.
Lark will accept this, if you really want to complicate your life :)

These regular-expressions are a bit complex, but there's no simple way around it. We want to match "3.14" and also "-2e10", and that's just how it's done.
(You can find the original definitions in [common.g](/lark/grammars/common.g).)

Notice that WS, which matches whitespace, gets flagged with "ignore". This tells Lark not to pass it to the parser. Otherwise, we'd have to fill our grammar with WS tokens.
Notice that terminals are written in UPPER-CASE, while rules are written in lower-case.
I'll touch more on the differences between rules and terminals later.
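
To make the %import-and-rename form concrete, here is a tiny grammar of my own (not from the tutorial) that uses the renamed terminals:

```
from lark import Lark

g = Lark(r'''start: STRING ":" NUMBER

             %import common.ESCAPED_STRING -> STRING
             %import common.SIGNED_NUMBER  -> NUMBER
             %import common.WS
             %ignore WS
''')
print( g.parse('"pi" : 3.14') )
# Roughly: Tree(start, [Token(STRING, '"pi"'), Token(NUMBER, '3.14')])
```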

## Part 2 - Creating the Parser

@@ -83,19 +95,19 @@ from lark import Lark
json_parser = Lark(r"""
value: dict
| list
| STRING
| NUMBER
| ESCAPED_STRING
| SIGNED_NUMBER
| "true" | "false" | "null"

list : "[" [value ("," value)*] "]"

dict : "{" [pair ("," pair)*] "}"
pair : STRING ":" value

NUMBER : /-?\d+(\.\d+)?([eE][+-]?\d+)?/
STRING : /".*?(?<!\\)"/
pair : ESCAPED_STRING ":" value

WS.ignore: /[ \t\n]+/
%import common.ESCAPED_STRING
%import common.SIGNED_NUMBER
%import common.WS
%ignore WS

""", start='value')
```
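
A quick way to exercise this parser (a sketch; the input mirrors the test string implied by the output shown later in the tutorial):

```
text = '{"key": ["item0", "item1", 3.14, true]}'
tree = json_parser.parse(text)
print( tree.pretty() )
```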
@@ -120,14 +132,14 @@ value

As promised, Lark automagically creates a tree that represents the parsed text.

But something is suspiciously missing from the tree. Where are the curly braces, the commas and all the other punctuation tokens?
But something is suspiciously missing from the tree. Where are the curly braces, the commas and all the other punctuation literals?

Lark automatically filters out tokens from the tree, based on the following criteria:
Lark automatically filters out literals from the tree, based on the following criteria:

- Filter out string tokens without a name, or with a name that starts with an underscore.
- Keep regex tokens, even unnamed ones, unless their name starts with an underscore.
- Filter out string literals without a name, or with a name that starts with an underscore.
- Keep regexps, even unnamed ones, unless their name starts with an underscore.

Unfortunately, this means that it will also filter out tokens like "true" and "false", and we will lose that information. The next section, "Shaping the tree" deals with this issue, and others.
Unfortunately, this means that it will also filter out literals like "true" and "false", and we will lose that information. The next section, "Shaping the Tree", deals with this issue and others.
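
To see the filtering criteria in action, here is a toy grammar of my own (not from the tutorial): the anonymous "if" literal and the underscore-prefixed _SEMI are dropped, while the named NAME terminal is kept:

```
from lark import Lark

p = Lark(r'''start: "if" NAME _SEMI
             NAME: /\w+/
             _SEMI: ";"
             %ignore " "
''')
print( p.parse("if x ;") )
# Roughly: Tree(start, [Token(NAME, 'x')])
```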

## Part 3 - Shaping the Tree

@@ -141,21 +153,20 @@ I'll present the solution, and then explain it:
?value: dict
| list
| string
| number
| SIGNED_NUMBER -> number
| "true" -> true
| "false" -> false
| "null" -> null

...

number : /-?\d+(\.\d+)?([eE][+-]?\d+)?/
string : /".*?(?<!\\)"/
string : ESCAPED_STRING

1. Those little arrows signify *aliases*. An alias is a name for a specific part of the rule. In this case, we will name *true/false/null* matches, and this way we won't lose the information.
1. Those little arrows signify *aliases*. An alias is a name for a specific part of the rule. In this case, we will name the *true/false/null* matches, and this way we won't lose the information. We also alias *SIGNED_NUMBER* to mark it for later processing.

2. The question mark prefixing *value* ("?value") tells the tree-builder to inline this branch if it has only one member. In this case, *value* will always have only one member.
2. The question-mark prefixing *value* ("?value") tells the tree-builder to inline this branch if it has only one member. In this case, *value* will always have only one member, and will always be inlined.

3. We turned the *string* and *number* tokens into rules containing anonymous tokens. This way they will appear in the tree as a branch. You will see why that's useful in the next part of the tutorial. Note that these anonymous tokens won't get filtered out, because they are regular expressions.
3. We turned the *ESCAPED_STRING* terminal into a rule. This way it will appear in the tree as a branch. This is equivalent to aliasing (like we did for the number), but now *string* can also be used elsewhere in the grammar (namely, in the *pair* rule).

Here is the new grammar:

@@ -165,7 +176,7 @@ json_parser = Lark(r"""
?value: dict
| list
| string
| number
| SIGNED_NUMBER -> number
| "true" -> true
| "false" -> false
| "null" -> null
@@ -175,10 +186,12 @@ json_parser = Lark(r"""
dict : "{" [pair ("," pair)*] "}"
pair : string ":" value

number : /-?\d+(\.\d+)?([eE][+-]?\d+)?/
string : /".*?(?<!\\)"/
string : ESCAPED_STRING

WS.ignore: /[ \t\n]+/
%import common.ESCAPED_STRING
%import common.SIGNED_NUMBER
%import common.WS
%ignore WS

""", start='value')
```
@@ -229,7 +242,7 @@ And when we run it, we get this:
{Tree(string, [Token(ANONRE_1, "key")]): [Tree(string, [Token(ANONRE_1, "item0")]), Tree(string, [Token(ANONRE_1, "item1")]), Tree(number, [Token(ANONRE_0, 3.14)]), Tree(true, [])]}
```

This is pretty close. Let's write a full transformer that can handle the tokens too.
This is pretty close. Let's write a full transformer that can handle the terminals too.

Also, our definitions of list and dict are a bit verbose. We can do better:
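
The hunk jumps past the tutorial's actual code here, but one plausible shape for that transformer (my reconstruction, not the commit's) is to have it build the Python objects directly:

```
from lark import Transformer

class TreeToJson(Transformer):
    def string(self, s):
        (s,) = s
        return s[1:-1]        # drop the surrounding quotes

    def number(self, n):
        (n,) = n
        return float(n)

    list = list               # children of a list rule become a Python list
    pair = tuple
    dict = dict

    def true(self, _): return True
    def false(self, _): return False
    def null(self, _): return None
```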

@@ -282,7 +295,7 @@ json_grammar = r"""
?value: dict
| list
| string
| number
| SIGNED_NUMBER -> number
| "true" -> true
| "false" -> false
| "null" -> null
@@ -292,10 +305,12 @@ json_grammar = r"""
dict : "{" [pair ("," pair)*] "}"
pair : string ":" value

number : /-?\d+(\.\d+)?([eE][+-]?\d+)?/
string : /".*?(?<!\\)"/
string : ESCAPED_STRING

WS.ignore: /[ \t\n]+/
%import common.ESCAPED_STRING
%import common.SIGNED_NUMBER
%import common.WS
%ignore WS
"""

class TreeToJson(Transformer):
@@ -344,9 +359,9 @@ json_parser = Lark(json_grammar, start='value', parser='lalr')
```
$ time python tutorial_json.py json_data > /dev/null

real 0m7.722s
user 0m7.504s
sys 0m0.175s
real 0m7.554s
user 0m7.352s
sys 0m0.148s

Ah, that's much better. The resulting JSON is of course exactly the same. You can run it for yourself and see.



docs/reference.md (+41, -32)

@@ -4,37 +4,23 @@

Lark is a general-purpose parsing library. It's written in Python, and supports two parsing algorithms: Earley (default) and LALR(1).

Lark also supports scanless parsing (with Earley), contextual lexing (with LALR), and regular lexing for both parsers.

Lark is a re-write of my previous parsing library, [PlyPlus](https://github.com/erezsh/plyplus).

## Grammar

Lark accepts its grammars in [EBNF](https://www.wikiwand.com/en/Extended_Backus%E2%80%93Naur_form) form.

The grammar is a list of rules and tokens, each in their own line.
The grammar is a list of rules and terminals, each on its own line.

Rules can be defined on multiple lines when using the *OR* operator ( | ).
Rules and terminals can be defined on multiple lines when using the *OR* operator ( | ).

Comments start with // and last to the end of the line (C++ style)

Lark begins the parse with the rule 'start', unless specified otherwise in the options.

### Tokens

Tokens are defined in terms of:

NAME : "string" or /regexp/
NAME.ignore : ..

.ignore is a flag that drops the token before it reaches the parser (usually whitespace)

Example:

IF: "if"

INTEGER : /[0-9]+/

WHITESPACE.ignore: /[ \t\n]+/
It might help to think of Rules and Terminals as existing in two separate layers, so that all the terminals are recognized first, and all the rules are recognized afterwards. This is not always how things happen (depending on your choice of parser & lexer), but the concept is relevant in all cases.
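
A small sketch of the two layers (the lex method is assumed to behave as in current Lark):

```
from lark import Lark

p = Lark(r'''start: NAME "=" NUMBER
             NAME: /[a-z]+/
             NUMBER: /\d+/
             %ignore " "
''', parser='lalr')

print( list(p.lex("x = 42")) )   # layer 1: characters -> terminals
print( p.parse("x = 42") )       # layer 2: terminals -> rules
```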

### Rules

@@ -47,9 +33,9 @@ Each rule is defined in terms of:
An alias is a name for the specific rule alternative. It affects tree construction.

An item is a:
- rule
- token
- terminal
- (item item ..) - Group items
- [item item ..] - Maybe. Same as: "(item item ..)?"
- item? - Zero or one instances of item ("maybe")
@@ -66,13 +52,28 @@ Example:

DIGIT: /[0-9]/

### Terminals

Terminals are defined just like rules, but cannot contain rules:

NAME : list of items to match

Example:

IF: "if"
INTEGER : /[0-9]+/
DECIMAL: INTEGER "." INTEGER
WHITESPACE: (" " | /\t/ )+

## Tree Construction

Lark builds a tree automatically based on the structure of the grammar. It also accepts some hints.

In general, Lark will place each rule as a branch, and its matches as the children of the branch.

Using item+ or item\* will result in a list of items.
Terminals are always values in the tree, never branches.

In grammar rules, using item+ or item\* will result in a list of items.

Example:

@@ -81,6 +82,8 @@ Example:

NAME: /\w+/

%ignore " "

Lark will parse "(((hello world)))" as:

expr
@@ -91,15 +94,15 @@ Lark will parse "(((hello world)))" as:

The brackets do not appear in the tree by design.

Tokens that won't appear in the tree are:
Terminals that won't appear in the tree are:

- Unnamed strings (like "keyword" or "+")
- Tokens whose name starts with an underscore (like \_DIGIT)
- Unnamed literals (like "keyword" or "+")
- Terminals whose name starts with an underscore (like \_DIGIT)

Tokens that *will* appear in the tree are:
Terminals that *will* appear in the tree are:

- Unnamed regular expressions (like /[0-9]/)
- Named tokens whose name starts with a letter (like DIGIT)
- Named terminals whose name starts with a letter (like DIGIT)

## Shaping the tree

@@ -133,7 +136,9 @@ Lark will parse "hello world (planet)" as:
"world"
"planet"

c. Aliases - options in a rule can receive an alias. It will be then used as the branch name for the option.
c. Rules that begin with an exclamation mark will keep all their terminals (they won't get filtered).

d. Aliases - options in a rule can receive an alias. It will then be used as the branch name for the option.

Example:
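
A stand-in example of my own, since the docs' original sits outside this hunk: each alternative's alias becomes the branch name in the tree.

```
from lark import Lark

p = Lark(r'''start: NAME "(" ")" -> call
                  | NAME         -> var
             NAME: /\w+/
             %ignore " "
''')
print( p.parse("foo()") )
# Roughly: Tree(call, [Token(NAME, 'foo')])
```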

@@ -153,15 +158,19 @@ When initializing the Lark object, you can provide it with keyword options:

- start - The start symbol (Default: "start")
- parser - Decides which parser engine to use, "earley" or "lalr". (Default: "earley")
Note: Both will use Lark's lexer.
Note: "lalr" requires a lexer
- lexer - Decides whether or not to use a lexer stage
- None: Don't use a lexer
- "standard": Use a standard lexer
- "contextual": Stronger lexer (only works with parser="lalr")
- "auto" (default): Choose for me based on grammar and parser

- transformer - Applies the transformer to every parse tree (only allowed with parser="lalr")
- only\_lex - Don't build a parser. Useful for debugging (default: False)
- postlex - Lexer post-processing (Default: None)
- profile - Measure run-time usage in Lark. Read results from the profiler property (Default: False)

To be supported:

- debug
- cache\_grammar
- keep\_all\_tokens
- profile - Measure run-time usage in Lark. Read results from the profiler property (Default: False)
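
As a usage sketch, these options are plain keyword arguments to the constructor; per the notes above, the contextual lexer pairs with the LALR parser:

```
from lark import Lark

parser = Lark(r'''start: WORD+
                  WORD: /\w+/
                  %ignore " "
''', parser='lalr', lexer='contextual')

print( parser.parse("hello world") )
```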

examples/json_parser.py (+1, -2)

@@ -15,7 +15,7 @@ json_grammar = r"""
?value: object
| array
| string
| number
| SIGNED_NUMBER -> number
| "true" -> true
| "false" -> false
| "null" -> null
@@ -24,7 +24,6 @@ json_grammar = r"""
object : "{" [pair ("," pair)*] "}"
pair : string ":" value

number: SIGNED_NUMBER
string : ESCAPED_STRING

%import common.ESCAPED_STRING


lark/lark.py (+6, -8)

@@ -18,9 +18,10 @@ class LarkOptions(object):

"""
OPTIONS_DOC = """
parser - Which parser engine to use ("earley" or "lalr". Default: "earley")
parser - Decides which parser engine to use, "earley" or "lalr". (Default: "earley")
Note: "lalr" requires a lexer
lexer - Whether or not to use a lexer stage

lexer - Decides whether or not to use a lexer stage
None: Don't use a lexer
"standard": Use a standard lexer
"contextual": Stronger lexer (only works with parser="lalr")
@@ -28,7 +29,6 @@ class LarkOptions(object):

transformer - Applies the transformer to every parse tree
debug - Affects verbosity (default: False)
only_lex - Don't build a parser. Useful for debugging (default: False)
keep_all_tokens - Don't automagically remove "punctuation" tokens (default: False)
cache_grammar - Cache the Lark grammar (Default: False)
postlex - Lexer post-processing (Default: None)
@@ -40,7 +40,6 @@ class LarkOptions(object):
o = dict(options_dict)

self.debug = bool(o.pop('debug', False))
self.only_lex = bool(o.pop('only_lex', False))
self.keep_all_tokens = bool(o.pop('keep_all_tokens', False))
self.tree_class = o.pop('tree_class', Tree)
self.cache_grammar = o.pop('cache_grammar', False)
@@ -51,12 +50,13 @@ class LarkOptions(object):
self.start = o.pop('start', 'start')
self.profile = o.pop('profile', False)

# assert self.parser in ENGINE_DICT
assert self.parser in ('earley', 'lalr', None)

if self.parser == 'earley' and self.transformer:
raise ValueError('Cannot specify an auto-transformer when using the Earley algorithm.'
'Please use your transformer on the resulting parse tree, or use a different algorithm (i.e. lalr)')
if self.keep_all_tokens:
raise NotImplementedError("Not implemented yet!")
raise NotImplementedError("keep_all_tokens: Not implemented yet!")

if o:
raise ValueError("Unknown options: %s" % o.keys())
@@ -166,8 +166,6 @@ class Lark:
return stream

def parse(self, text):
assert not self.options.only_lex

return self.parser.parse(text)

# if self.profiler:

