
Added MkDocs documentation

Erez Shinan, 6 years ago
parent commit 17da44146a
8 changed files with 657 additions and 0 deletions:

1. docs/classes.md (+156, -0)
2. docs/features.md (+62, -0)
3. docs/grammar.md (+148, -0)
4. docs/how_to_use.md (+54, -0)
5. docs/index.md (+42, -0)
6. docs/philosophy.md (+63, -0)
7. docs/tree_construction.md (+129, -0)
8. mkdocs.yml (+3, -0)

docs/classes.md (+156, -0)

@@ -0,0 +1,156 @@
# Classes - Reference

This page details the important classes in Lark.

----

## Lark

The Lark class is the main interface for the library. It's mostly a thin wrapper for the many different parsers, and for the tree constructor.

### Methods

#### \_\_init\_\_(self, grammar, **options)

The Lark class accepts a grammar string or file object, and keyword options:

* start - The symbol in the grammar that begins the parse (Default: `"start"`)

* parser - Decides which parser engine to use, "earley", "lalr" or "cyk". (Default: `"earley"`)

* lexer - Overrides default lexer.

* transformer - Applies the transformer instead of building a parse tree (only allowed with parser="lalr")

* postlex - Lexer post-processing (Default: None. Only works when lexer is "standard" or "contextual")

* ambiguity - How to handle an ambiguous parse (only relevant for earley and cyk):

    * "resolve" - Let the parser choose the best derivation (greedy for tokens, non-greedy for rules). This is the default.

    * "explicit" - Return all derivations wrapped inside an "_ambig" data node.

* debug - Display warnings (such as Shift-Reduce warnings for LALR)

* keep_all_tokens - Don't throw away any terminals from the tree (Default=False)

* propagate_positions - Propagate line/column count to tree nodes (default=False)

#### parse(self, text)

Return a complete parse tree for the text (of type Tree)

If a transformer is supplied to `__init__`, returns whatever is the result of the transformation.

----

## Tree

The main tree class

### Properties

* `data` - The name of the rule or alias
* `children` - List of matched sub-rules and terminals
* `meta` - Line & Column numbers, if using `propagate_positions`

### Methods

#### \_\_init\_\_(self, data, children)

Creates a new tree, and stores "data" and "children" in attributes of the same name.

#### pretty(self, indent_str=' ')

Returns an indented string representation of the tree. Great for debugging.

#### find_pred(self, pred)

Returns all nodes of the tree for which `pred(node)` is true.

#### find_data(self, data)

Returns all nodes of the tree whose data equals the given data.

#### iter_subtrees(self)

Iterates over all the subtrees, never returning to the same node twice (Lark's parse-tree is actually a DAG)

#### \_\_eq\_\_, \_\_hash\_\_

Trees can be hashed and compared.

----

## Transformers & Visitors

Transformers & Visitors provide a convenient interface to process the parse-trees that Lark returns.

They are used by inheriting from the correct class (visitor or transformer), and implementing methods corresponding to the rules you wish to process. Each method accepts the matched children as an argument. This can be modified using the `v_args` decorator, which lets you inline the arguments (akin to `*args`), or add the tree's `meta` property as an argument.

See: https://github.com/lark-parser/lark/blob/master/lark/visitors.py

### Visitors

Visitors visit each node of the tree, and run the appropriate method on it according to the node's data.

They work bottom-up, starting with the leaves and ending at the root of the tree.

**Example**
```python
from lark import Visitor

class IncreaseAllNumbers(Visitor):
    def number(self, tree):
        assert tree.data == "number"
        tree.children[0] += 1

IncreaseAllNumbers().visit(parse_tree)
```

There are two classes that implement the visitor interface:

* Visitor - Visit every node (without recursion)

* Visitor_Recursive - Visit every node using recursion. Slightly faster.

### Transformers

Transformers visit each node of the tree, and run the appropriate method on it according to the node's data.

They work bottom-up, starting with the leaves and ending at the root of the tree.

Transformers can be used to implement map & reduce patterns.

Because nodes are reduced from leaf to root, at any point the callbacks may assume the children have already been transformed (if applicable).

Transformers can be chained into a new transformer by using multiplication.

**Example:**
```python
from lark import Tree, Transformer

class EvalExpressions(Transformer):
    def expr(self, args):
        return eval(args[0])

t = Tree('a', [Tree('expr', ['1+2'])])
print(EvalExpressions().transform(t))

# Prints: Tree(a, [3])
```


Here are the classes that implement the transformer interface:
- Transformer - Recursively transforms the tree. This is the one you probably want.
- Transformer_InPlace - Non-recursive. Changes the tree in-place instead of returning new instances
- Transformer_InPlaceRecursive - Recursive. Changes the tree in-place instead of returning new instances

## Token

When using a lexer, the resulting tokens in the trees will be of the Token class, which inherits from Python's string. So, normal string comparisons and operations will work as expected. Tokens also have other useful attributes:

* type - Name of the token (as specified in grammar).
* pos_in_stream - the index of the token in the text
* line - The line of the token in the text (starting with 1)
* column - The column of the token in the text (starting with 1)
* end_line - The line where the token ends
* end_column - The column where the token ends

docs/features.md (+62, -0)

@@ -0,0 +1,62 @@
# Features

- EBNF-inspired grammar, with extra features (See: [Grammar Reference](grammar.md))
- Builds a parse-tree (AST) automagically based on the grammar
- Stand-alone parser generator - create a small independent parser to embed in your project.
- Automatic line & column tracking
- Automatic terminal collision resolution
- Standard library of terminals (strings, numbers, names, etc.)
- Unicode fully supported
- Extensive test suite
- Python 2 & Python 3 compatible
- Pure-Python implementation

## Parsers

Lark implements the following parsing algorithms:

### Earley

An [Earley Parser](https://www.wikiwand.com/en/Earley_parser) is a chart parser capable of parsing any context-free grammar at O(n^3), and O(n^2) when the grammar is unambiguous. It can parse most LR grammars at O(n). Most programming languages are LR, and can be parsed in linear time.

Lark's Earley implementation runs on top of a skipping chart parser, which allows it to use regular expressions instead of matching characters one by one. This is a huge improvement to Earley that is unique to Lark. This feature is used by default, but can also be requested explicitly using `lexer='dynamic'`.

It's possible to bypass the dynamic lexer, and use the regular Earley parser with a traditional lexer that tokenizes as an independent first step. Doing so will provide a speed benefit, but will tokenize without using Earley's ambiguity-resolution ability. So choose this only if you know why! Activate with `lexer='standard'`.

**Note on ambiguity**

Lark by default can handle any ambiguity in the grammar (Earley+dynamic). The user may request to receive all derivations (using `ambiguity='explicit'`), or let Lark automatically choose the most fitting derivation (default behavior).

Lark also supports user-defined rule priority to steer the automatic ambiguity resolution.

### LALR(1)

[LALR(1)](https://www.wikiwand.com/en/LALR_parser) is a very efficient, tried-and-tested parsing algorithm. It's incredibly fast and requires very little memory. It can parse most programming languages (for example, Python and Java).

Lark comes with an efficient implementation that outperforms every other parsing library for Python (including PLY)

Lark extends the traditional YACC-based architecture with a *contextual lexer*, which automatically provides feedback from the parser to the lexer, making the LALR(1) algorithm stronger than ever.

The contextual lexer communicates with the parser, and uses the parser's lookahead prediction to narrow its choice of tokens. So at each point, the lexer only matches the subgroup of terminals that are legal at that parser state, instead of all of the terminals. It's surprisingly effective at resolving common terminal collisions, and makes it possible to parse languages that LALR(1) was previously incapable of parsing.

This is an improvement to LALR(1) that is unique to Lark.

### CYK Parser

A [CYK parser](https://www.wikiwand.com/en/CYK_algorithm) can parse any context-free grammar at O(n^3*|G|).

It's too slow to be practical for simple grammars, but it offers good performance for highly ambiguous grammars.

## Other features

- Import grammars from Nearley.js

### Experimental features
- Automatic reconstruction of input from parse-tree (see examples)

### Planned features (not implemented yet)
- Generate code in other languages than Python
- Grammar composition
- LALR(k) parser
- Full regexp-collision support using NFAs
- Automatically produce syntax-highlighters for grammars, for popular IDEs

docs/grammar.md (+148, -0)

@@ -0,0 +1,148 @@
# Grammar Reference

## Definitions

**A grammar** is a list of rules and terminals, that together define a language.

Terminals define the alphabet of the language, while rules define its structure.

In Lark, a terminal may be a string, a regular expression, or a concatenation of these and other terminals.

Each rule is a list of terminals and rules, whose location and nesting define the structure of the resulting parse-tree.

A **parsing algorithm** is an algorithm that takes a grammar definition and a sequence of symbols (members of the alphabet), and matches the entirety of the sequence by searching for a structure that is allowed by the grammar.

## General Syntax and notes

Grammars in Lark are based on [EBNF](https://en.wikipedia.org/wiki/Extended_Backus–Naur_form) syntax, with several enhancements.

Lark grammars are composed of a list of definitions and directives, each on its own line. A definition is either a named rule, or a named terminal.

**Comments** start with `//` and last to the end of the line (C++ style)

Lark begins the parse with the rule 'start', unless specified otherwise in the options.

Names of rules are always in lowercase, while names of terminals are always in uppercase. This distinction has practical effects for tree construction, and for building a lexer (aka tokenizer, or scanner).


## Terminals

Terminals are used to match text into symbols. They can be defined as a combination of literals and other terminals.

**Syntax:**

```html
<NAME> [. <priority>] : <literals-and-or-terminals>
```

Terminal names must be uppercase.

Literals can be one of:

* `"string"`
* `/regular expression+/`
* `"case-insensitive string"i`
* `/re with flags/imulx`
* Literal range: `"a".."z"`, `"1".."9"`, etc.

#### Notes for when using a lexer:

When using a lexer (standard or contextual), it is the grammar-author's responsibility to make sure the literals don't collide, or that if they do, they are matched in the desired order. Literals are matched in an order according to the following criteria:

1. Highest priority first (priority is specified as: TERM.number: ...)
2. Length of match (for regexps, the longest theoretical match is used)
3. Length of literal / pattern definition
4. Name

**Examples:**
```perl
IF: "if"
INTEGER : /[0-9]+/
INTEGER2 : ("0".."9")+ //# Same as INTEGER
DECIMAL.2: INTEGER "." INTEGER //# Will be matched before INTEGER
WHITESPACE: (" " | /\t/ )+
SQL_SELECT: "select"i
```

## Rules

**Syntax:**
```html
<name> : <items-to-match> [-> <alias> ]
| ...
```

Names of rules and aliases are always in lowercase.

Rule definitions can be extended to the next line by using the OR operator (signified by a pipe: `|` ).

An alias is a name for the specific rule alternative. It affects tree construction.


Each item is one of:

* `rule`
* `TERMINAL`
* `"string literal"` or `/regexp literal/`
* `(item item ..)` - Group items
* `[item item ..]` - Maybe. Same as: `(item item ..)?`
* `item?` - Zero or one instances of item ("maybe")
* `item*` - Zero or more instances of item
* `item+` - One or more instances of item
* `item ~ n` - Exactly *n* instances of item
* `item ~ n..m` - Between *n* to *m* instances of item

**Examples:**
```perl
hello_world: "hello" "world"
mul: [mul "*"] number //# Left-recursion is allowed!
expr: expr operator expr
| value //# Multi-line, belongs to expr

four_words: word ~ 4
```


## Directives

### %ignore

All occurrences of the terminal will be ignored, and won't be part of the parse.

**Syntax:**
```html
%ignore <TERMINAL>
```
**Examples:**
```perl
%ignore " "

COMMENT: "#" /[^\n]/*
%ignore COMMENT
```
### %import

Allows importing terminals from other lark grammars.

Future versions will allow importing rules and macros as well.

**Syntax:**
```html
%import <module>.<TERMINAL>
%import <module> (<TERM1> <TERM2>)
```

If the module path is absolute, Lark will attempt to load it from the built-in directory (currently, only `common.lark` is available).

If the module path is relative, such as `.path.to.file`, Lark will attempt to load it from the current working directory. Grammars must have the `.lark` extension.

**Example:**
```perl
%import common.NUMBER

%import .terminals_file (A B C)
```

### %declare

Declares a terminal without defining it. Useful for plugins that supply tokens from outside the text, such as a postlex stage.
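**Example:**

For instance, an indentation-aware grammar can declare terminals that are emitted by a lexer post-processing step rather than matched from the text (the terminal names here are illustrative):

```perl
%declare _INDENT _DEDENT
```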

docs/how_to_use.md (+54, -0)

@@ -0,0 +1,54 @@
# How To Use Lark - Guide

## Work process

This is the recommended process for working with Lark:

1. Collect or create input samples that demonstrate key features or behaviors of the language you're trying to parse.

2. Write a grammar. Aim for a structure that is intuitive, and that imitates how you would explain your language to a fellow human.

3. Try your grammar in Lark against each input sample. Make sure the resulting parse-trees make sense.

4. Use Lark's grammar features to [shape the tree](tree_construction.md): Get rid of superfluous rules by inlining them, and use aliases when specific cases need clarification.

- You can perform steps 1-4 repeatedly, gradually growing your grammar to include more sentences.

5. Create a transformer to evaluate the parse-tree into a structure you'll be comfortable working with. This may include evaluating literals, merging branches, or even converting the entire tree into your own set of AST classes.

Of course, some specific use-cases may deviate from this process. Feel free to suggest these cases, and I'll add them to this page.

## Basic API Usage

For common use, you only need to know 3 classes: Lark, Tree, Transformer (see the [Classes Reference](classes.md))

Here is some mock usage of them. You can see a real example in the [examples](https://github.com/lark-parser/lark/tree/master/examples).

```python
from lark import Lark, Transformer

grammar = """start: rules and more rules

rule1: other rules AND TOKENS
     | rule1 "+" rule2 -> add
     | some value [maybe]

rule2: rule1 "-" (rule2 | "whatever")*

TOKEN1: "a literal"
TOKEN2: TOKEN1 "and literals"
"""

parser = Lark(grammar)

tree = parser.parse("some input string")

class MyTransformer(Transformer):
    def rule1(self, matches):
        return matches[0] + matches[1]

    # I don't have to implement rule2 if I don't feel like it!

new_tree = MyTransformer().transform(tree)
```


docs/index.md (+42, -0)

@@ -0,0 +1,42 @@
# Lark - a modern parsing library for Python

Lark can parse any context-free grammar.

Lark provides

- Advanced grammar language, based on EBNF
- Three parsing algorithms to choose from: Earley, LALR(1) and CYK
- Automatic tree construction, based on grammar
- Fast unicode lexer with regexp support, and automatic line-counting

Code is hosted on Github: [https://github.com/lark-parser/lark](https://github.com/lark-parser/lark)

### Install
```bash
$ pip install lark-parser
```

#### Syntax Highlighting

- [Sublime Text & TextMate](https://github.com/lark-parser/lark_syntax)
- [Visual Studio Code](https://github.com/lark-parser/vscode-lark) (Or install through the vscode plugin system)


## Documentation Index


* [Philosophy & Design Choices](philosophy.md)
* [Full List of Features](features.md)
* [Examples](https://github.com/lark-parser/lark/tree/master/examples)
* Tutorials
* [How to write a DSL](http://blog.erezsh.com/how-to-write-a-dsl-in-python-with-lark/) - Implements a toy LOGO-like language with an interpreter
* [How to write a JSON parser](json_tutorial.md)
* Guides
* [How to use Lark](how_to_use.md)
* Reference
* [Grammar](grammar.md)
* [Tree Construction](tree_construction.md)
* [Classes](classes.md)
* [Cheatsheet (PDF)](lark_cheatsheet.pdf)
* Discussion
* [Forum (Google Groups)](https://groups.google.com/forum/#!forum/lark-parser)

docs/philosophy.md (+63, -0)

@@ -0,0 +1,63 @@
# Philosophy

Parsers are innately complicated and confusing. They're difficult to understand, difficult to write, and difficult to use. Even experts on the subject can become baffled by the nuances of these complicated state-machines.

Lark's mission is to make the process of writing them as simple and abstract as possible, by following these design principles:

### Design Principles

1. Readability matters

2. Keep the grammar clean and simple

3. Don't force the user to decide on things that the parser can figure out on its own

4. Usability is more important than performance

5. Performance is still very important

6. Follow the Zen of Python, whenever possible and applicable


In accordance with these principles, I arrived at the following design choices:

-----------

# Design Choices

### 1. Separation of code and grammar

Grammars are the de-facto reference for your language, and for the structure of your parse-tree. For any non-trivial language, the conflation of code and grammar always turns out convoluted and difficult to read.

The grammars in Lark are EBNF-inspired, so they are especially easy to read & work with.

### 2. Always build a parse-tree (unless told not to)

Trees are always simpler to work with than state-machines.

1. Trees allow you to see the "state-machine" visually

2. Trees allow your computation to be aware of previous and future states

3. Trees allow you to process the parse in steps, instead of forcing you to do it all at once.

And anyway, every parse-tree can be replayed as a state-machine, so there is no loss of information.

This answer is discussed in more detail [here](https://github.com/erezsh/lark/issues/4).

You can skip building the tree for LALR(1), by providing Lark with a transformer (see the [JSON example](https://github.com/erezsh/lark/blob/master/examples/json_parser.py)).

### 3. Earley is the default

The Earley algorithm can accept *any* context-free grammar you throw at it (i.e. any grammar you can write in EBNF, it can parse). That makes it extremely useful for beginners, who are not aware of the strange and arbitrary restrictions that LALR(1) places on its grammars.

As users grow to understand the structure of their grammar, the scope of their target language and their performance requirements, they may choose to switch over to LALR(1) to gain a huge performance boost, possibly at the cost of some language features.

In short, "Premature optimization is the root of all evil."

### Other design features

- Automatically resolve terminal collisions whenever possible

- Automatically keep track of line & column numbers

docs/tree_construction.md (+129, -0)

@@ -0,0 +1,129 @@
# Automatic Tree Construction - Reference


Lark builds a tree automatically based on the structure of the grammar, where each rule that is matched becomes a branch (node) in the tree, and its children are its matches, in the order of matching.

For example, the rule `node: child1 child2` will create a tree node with two children. If it is matched as part of another rule (i.e. if it isn't the root), the new rule's tree node will become its parent.

Using `item+` or `item*` will result in a list of items, equivalent to writing `item item item ..`.

### Terminals

Terminals are always values in the tree, never branches.

Lark filters out certain types of terminals by default, considering them punctuation:

- Terminals that won't appear in the tree are:

- Unnamed literals (like `"keyword"` or `"+"`)
- Terminals whose name starts with an underscore (like `_DIGIT`)

- Terminals that *will* appear in the tree are:

- Unnamed regular expressions (like `/[0-9]/`)
- Named terminals whose name starts with a letter (like `DIGIT`)

Rules prefixed with `!` will retain all their literals regardless.




**Example:**

```perl
expr: "(" expr ")"
| NAME+

NAME: /\w+/

%ignore " "
```

Lark will parse "((hello world))" as:

    expr
      expr
        expr
          "hello"
          "world"

The parentheses do not appear in the tree by design. The words appear because they are matched by a named terminal.


## Shaping the tree

Users can alter the automatic construction of the tree using a collection of grammar features.


* Rules whose name begins with an underscore will be inlined into their containing rule.

**Example:**

```perl
start: "(" _greet ")"
_greet: /\w+/ /\w+/
```

Lark will parse "(hello world)" as:

    start
      "hello"
      "world"


* Rules that receive a question mark (?) at the beginning of their definition will be inlined if they have a single child, after filtering.

**Example:**

```ruby
start: greet greet
?greet: "(" /\w+/ ")"
| /\w+/ /\w+/
```

Lark will parse "hello world (planet)" as:

    start
      greet
        "hello"
        "world"
      "planet"

* Rules that begin with an exclamation mark will keep all their terminals (they won't get filtered).

```perl
!expr: "(" expr ")"
| NAME+
NAME: /\w+/
%ignore " "
```

Will parse "((hello world))" as:

    expr
      (
      expr
        (
        expr
          hello
          world
        )
      )

Using the `!` prefix is usually a "code smell", and may point to a flaw in your grammar design.

* Aliases - options in a rule can receive an alias. It will then be used as the branch name for that option, instead of the rule name.

**Example:**

```ruby
start: greet greet
greet: "hello"
| "world" -> planet
```

Lark will parse "hello world" as:

    start
      greet
      planet

mkdocs.yml (+3, -0)

@@ -0,0 +1,3 @@
site_name: Lark
theme: readthedocs

