From 17da44146a2f2d7a4018acc13922f88d94b526ff Mon Sep 17 00:00:00 2001
From: Erez Shinan
Date: Fri, 3 Aug 2018 09:08:37 +0300
Subject: [PATCH] Added MkDocs documentation

---
 docs/classes.md           | 156 ++++++++++++++++++++++++++++++++++++++
 docs/features.md          |  62 +++++++++++++++
 docs/grammar.md           | 148 ++++++++++++++++++++++++++++++++++++
 docs/how_to_use.md        |  54 +++++++++++++
 docs/index.md             |  42 ++++++++++
 docs/philosophy.md        |  63 +++++++++++++++
 docs/tree_construction.md | 129 +++++++++++++++++++++++++++++++
 mkdocs.yml                |   3 +
 8 files changed, 657 insertions(+)
 create mode 100644 docs/classes.md
 create mode 100644 docs/features.md
 create mode 100644 docs/grammar.md
 create mode 100644 docs/how_to_use.md
 create mode 100644 docs/index.md
 create mode 100644 docs/philosophy.md
 create mode 100644 docs/tree_construction.md
 create mode 100644 mkdocs.yml

diff --git a/docs/classes.md b/docs/classes.md
new file mode 100644
index 0000000..27d1a48
--- /dev/null
+++ b/docs/classes.md
@@ -0,0 +1,156 @@
+# Classes - Reference
+
+This page details the important classes in Lark.
+
+----
+
+## Lark
+
+The Lark class is the main interface for the library. It's mostly a thin wrapper for the many different parsers, and for the tree constructor.
+
+### Methods
+
+#### \_\_init\_\_(self, grammar, **options)
+
+The Lark class accepts a grammar string or file object, and keyword options:
+
+* start - The symbol in the grammar that begins the parse (Default: `"start"`)
+
+* parser - Decides which parser engine to use: "earley", "lalr" or "cyk" (Default: `"earley"`)
+
+* lexer - Overrides the default lexer.
+
+* transformer - Applies the transformer instead of building a parse tree (only allowed with parser="lalr")
+
+* postlex - Lexer post-processing (Default: None. Only works when the lexer is "standard" or "contextual")
+
+* ambiguity (only relevant for earley and cyk)
+
+    * "explicit" - Return all derivations inside an "_ambig" data node.
+
+    * "resolve" - Let the parser choose the best derivation (greedy for tokens, non-greedy for rules. Default)
+
+* debug - Display warnings (such as Shift-Reduce warnings for LALR)
+
+* keep_all_tokens - Don't throw away any terminals from the tree (Default: False)
+
+* propagate_positions - Propagate line/column counts to tree nodes (Default: False)
+
+#### parse(self, text)
+
+Returns a complete parse tree for the text (of type Tree).
+
+If a transformer is supplied to `__init__`, returns whatever is the result of the transformation.
+
+----
+
+## Tree
+
+The main tree class.
+
+### Properties
+
+* `data` - The name of the rule or alias
+* `children` - List of matched sub-rules and terminals
+* `meta` - Line & column numbers, if using `propagate_positions`
+
+### Methods
+
+#### \_\_init\_\_(self, data, children)
+
+Creates a new tree, and stores "data" and "children" in attributes of the same name.
+
+#### pretty(self, indent_str='  ')
+
+Returns an indented string representation of the tree. Great for debugging.
+
+#### find_pred(self, pred)
+
+Returns all nodes of the tree for which pred(node) is true.
+
+#### find_data(self, data)
+
+Returns all nodes of the tree whose data equals the given data.
+
+#### iter_subtrees(self)
+
+Iterates over all the subtrees, never returning to the same node twice (Lark's parse-tree is actually a DAG).
+
+#### \_\_eq\_\_, \_\_hash\_\_
+
+Trees can be hashed and compared.
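+
+**Example** (an illustrative sketch; the grammar below is made up for demonstration, and the methods used are the ones documented above):
+
+```python
+from lark import Lark
+
+parser = Lark('''start: number+
+                 number: /[0-9]+/
+                 %ignore " "
+              ''')
+tree = parser.parse("12 34 56")
+
+print(tree.pretty())                    # indented, human-readable view of the tree
+print(list(tree.find_data('number')))   # every subtree produced by the "number" rule
+for subtree in tree.iter_subtrees():    # visits each subtree exactly once
+    print(subtree.data)
+```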
+
+----
+
+## Transformers & Visitors
+
+Transformers & Visitors provide a convenient interface to process the parse-trees that Lark returns.
+
+They are used by inheriting from the correct class (visitor or transformer), and implementing methods corresponding to the rules you wish to process. Each method accepts the children as an argument. That can be modified using the `v_args` decorator, which can inline the arguments (akin to `*args`), or add the tree's `meta` property as an argument.
+
+See: https://github.com/lark-parser/lark/blob/master/lark/visitors.py
+
+### Visitors
+
+Visitors visit each node of the tree, and run the appropriate method on it according to the node's data.
+
+They work bottom-up, starting with the leaves and ending at the root of the tree.
+
+**Example**
+```python
+from lark import Visitor
+
+class IncreaseAllNumbers(Visitor):
+    def number(self, tree):
+        assert tree.data == "number"
+        tree.children[0] += 1
+
+IncreaseAllNumbers().visit(parse_tree)
+```
+
+There are two classes that implement the visitor interface:
+
+* Visitor - Visit every node (without recursion)
+
+* Visitor_Recursive - Visit every node using recursion. Slightly faster.
+
+### Transformers
+
+Transformers visit each node of the tree, and run the appropriate method on it according to the node's data.
+
+They work bottom-up, starting with the leaves and ending at the root of the tree.
+
+Transformers can be used to implement map & reduce patterns.
+
+Because nodes are reduced from leaf to root, at any point the callbacks may assume the children have already been transformed (if applicable).
+
+Transformers can be chained into a new transformer by using multiplication.
+
+**Example:**
+```python
+from lark import Tree, Transformer
+
+class EvalExpressions(Transformer):
+    def expr(self, args):
+        return eval(args[0])
+
+t = Tree('a', [Tree('expr', ['1+2'])])
+print(EvalExpressions().transform(t))
+
+# Prints: Tree(a, [3])
+```
+
+Here are the classes that implement the transformer interface:
+
+- Transformer - Recursively transforms the tree. This is the one you probably want.
+- Transformer_InPlace - Non-recursive. Changes the tree in-place instead of returning new instances
+- Transformer_InPlaceRecursive - Recursive. Changes the tree in-place instead of returning new instances
+
+## Token
+
+When using a lexer, the resulting tokens in the trees will be of the Token class, which inherits from Python's string type. So, normal string comparisons and operations will work as expected. Tokens also have other useful attributes:
+
+* type - Name of the token (as specified in the grammar)
+* pos_in_stream - The index of the token in the text
+* line - The line of the token in the text (starting with 1)
+* column - The column of the token in the text (starting with 1)
+* end_line - The line where the token ends
+* end_column - The column where the token ends
\ No newline at end of file
diff --git a/docs/features.md b/docs/features.md
new file mode 100644
index 0000000..8241b1f
--- /dev/null
+++ b/docs/features.md
@@ -0,0 +1,62 @@
+# Features
+
+ - EBNF-inspired grammar, with extra features (See: [Grammar Reference](grammar.md))
+ - Builds a parse-tree (AST) automagically based on the grammar
+ - Stand-alone parser generator - create a small independent parser to embed in your project.
+ - Automatic line & column tracking
+ - Automatic terminal collision resolution
+ - Standard library of terminals (strings, numbers, names, etc.)
+ - Unicode fully supported
+ - Extensive test suite
+ - Python 2 & Python 3 compatible
+ - Pure-Python implementation
+
+## Parsers
+
+Lark implements the following parsing algorithms:
+
+### Earley
+
+An [Earley Parser](https://www.wikiwand.com/en/Earley_parser) is a chart parser capable of parsing any context-free grammar at O(n^3), and O(n^2) when the grammar is unambiguous. It can parse most LR grammars at O(n). Most programming languages are LR, and can be parsed in linear time.
+
+Lark's Earley implementation runs on top of a skipping chart parser, which allows it to use regular expressions, instead of matching characters one-by-one. This is a huge improvement to Earley that is unique to Lark. This feature is used by default, but can also be requested explicitly using `lexer='dynamic'`.
+
+It's possible to bypass the dynamic lexer, and use the regular Earley parser with a traditional lexer that tokenizes as an independent first step. Doing so will provide a speed benefit, but will tokenize without using Earley's ambiguity-resolution ability. So choose this only if you know why! Activate with `lexer='standard'`.
+
+**Note on ambiguity**
+
+Lark by default can handle any ambiguity in the grammar (Earley+dynamic). The user may request to receive all derivations (using ambiguity='explicit'), or let Lark automatically choose the most fitting derivation (default behavior).
+
+Lark also supports user-defined rule priority to steer the automatic ambiguity resolution.
+
+### LALR(1)
+
+[LALR(1)](https://www.wikiwand.com/en/LALR_parser) is a very efficient, tried-and-tested parsing algorithm. It's incredibly fast and requires very little memory. It can parse most programming languages (for example: Python and Java).
+
+Lark comes with an efficient implementation that outperforms every other parsing library for Python (including PLY).
+
+Lark extends the traditional YACC-based architecture with a *contextual lexer*, which automatically provides feedback from the parser to the lexer, making the LALR(1) algorithm stronger than ever.
+
+The contextual lexer communicates with the parser, and uses the parser's lookahead prediction to narrow its choice of tokens. So at each point, the lexer only matches the subgroup of terminals that are legal at that parser state, instead of all of the terminals. It's surprisingly effective at resolving common terminal collisions, and makes it possible to parse languages that LALR(1) was previously incapable of parsing.
+
+This is an improvement to LALR(1) that is unique to Lark.
+
+### CYK Parser
+
+A [CYK parser](https://www.wikiwand.com/en/CYK_algorithm) can parse any context-free grammar at O(n^3*|G|).
+
+It's too slow to be practical for simple grammars, but it offers good performance for highly ambiguous grammars.
+
+# Other features
+
+ - Import grammars from Nearley.js
+
+### Experimental features
+ - Automatic reconstruction of input from parse-tree (see examples)
+
+### Planned features (not implemented yet)
+ - Generate code in languages other than Python
+ - Grammar composition
+ - LALR(k) parser
+ - Full regexp-collision support using NFAs
+ - Automatically produce syntax-highlighters for grammars, for popular IDEs
\ No newline at end of file
diff --git a/docs/grammar.md b/docs/grammar.md
new file mode 100644
index 0000000..69a6704
--- /dev/null
+++ b/docs/grammar.md
@@ -0,0 +1,148 @@
+# Grammar Reference
+
+## Definitions
+
+**A grammar** is a list of rules and terminals that together define a language.
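+
+For example, here is a tiny grammar (made up purely for illustration) with two rules and one terminal:
+
+```perl
+start: greeting+
+greeting: WORD "!"
+WORD: /[a-z]+/
+```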
+
+Terminals define the alphabet of the language, while rules define its structure.
+
+In Lark, a terminal may be a string, a regular expression, or a concatenation of these and other terminals.
+
+Each rule is a list of terminals and rules, whose location and nesting define the structure of the resulting parse-tree.
+
+A **parsing algorithm** is an algorithm that takes a grammar definition and a sequence of symbols (members of the alphabet), and matches the entirety of the sequence by searching for a structure that is allowed by the grammar.
+
+## General Syntax and notes
+
+Grammars in Lark are based on [EBNF](https://en.wikipedia.org/wiki/Extended_Backus–Naur_form) syntax, with several enhancements.
+
+Lark grammars are composed of a list of definitions and directives, each on its own line. A definition is either a named rule, or a named terminal.
+
+**Comments** start with `//` and last to the end of the line (C++ style).
+
+Lark begins the parse with the rule 'start', unless specified otherwise in the options.
+
+Names of rules are always in lowercase, while names of terminals are always in uppercase. This distinction has practical effects for tree construction, and for building a lexer (aka tokenizer, or scanner).
+
+
+## Terminals
+
+Terminals are used to match text into symbols. They can be defined as a combination of literals and other terminals.
+
+**Syntax:**
+
+```html
+<NAME> [. <priority>] : <literals-and-or-terminals>
+```
+
+Terminal names must be uppercase.
+
+Literals can be one of:
+
+* `"string"`
+* `/regular expression+/`
+* `"case-insensitive string"i`
+* `/re with flags/imulx`
+* Literal range: `"a".."z"`, `"1".."9"`, etc.
+
+#### Notes for when using a lexer:
+
+When using a lexer (standard or contextual), it is the grammar-author's responsibility to make sure the literals don't collide, or that if they do, they are matched in the desired order. Literals are matched in order according to the following criteria:
+
+1. Highest priority first (priority is specified as: TERM.number: ...)
+2. Length of match (for regexps, the longest theoretical match is used)
+3. Length of literal / pattern definition
+4. Name
+
+**Examples:**
+```perl
+IF: "if"
+INTEGER : /[0-9]+/
+INTEGER2 : ("0".."9")+          //# Same as INTEGER
+DECIMAL.2: INTEGER "." INTEGER  //# Will be matched before INTEGER
+WHITESPACE: (" " | /\t/ )+
+SQL_SELECT: "select"i
+```
+
+## Rules
+
+**Syntax:**
+```html
+<name> : <items-to-match>  [-> <alias>]
+       | ...
+```
+
+Names of rules and aliases are always in lowercase.
+
+Rule definitions can be extended to the next line by using the OR operator (signified by a pipe: `|` ).
+
+An alias is a name for the specific rule alternative. It affects tree construction.
+
+
+Each item is one of:
+
+* `rule`
+* `TERMINAL`
+* `"string literal"` or `/regexp literal/`
+* `(item item ..)` - Group items
+* `[item item ..]` - Maybe. Same as: `(item item ..)?`
+* `item?` - Zero or one instances of item ("maybe")
+* `item*` - Zero or more instances of item
+* `item+` - One or more instances of item
+* `item ~ n` - Exactly *n* instances of item
+* `item ~ n..m` - Between *n* and *m* instances of item
+
+**Examples:**
+```perl
+hello_world: "hello" "world"
+mul: [mul "*"] number     //# Left-recursion is allowed!
+expr: expr operator expr
+    | value               //# Multi-line, belongs to expr
+
+four_words: word ~ 4
+```
+
+
+## Directives
+
+### %ignore
+
+All occurrences of the terminal will be ignored, and won't be part of the parse.
+
+**Syntax:**
+```html
+%ignore <TERMINAL>
+```
+**Examples:**
+```perl
+%ignore " "
+
+COMMENT: "#" /[^\n]/*
+%ignore COMMENT
+```
+### %import
+
+Allows you to import terminals from other lark grammar files.
+
+Future versions will allow importing rules and macros as well.
+
+**Syntax:**
+```html
+%import <module>.<TERMINAL>
+%import <module> (<TERM1> <TERM2>)
+```
+
+If the module path is absolute, Lark will attempt to load it from the built-in directory (currently, only `common.lark` is available).
+
+If the module path is relative, such as `.path.to.file`, Lark will attempt to load it from the current working directory. Grammars must have the `.lark` extension.
+
+**Example:**
+```perl
+%import common.NUMBER
+
+%import .terminals_file (A B C)
+```
+
+### %declare
+
+Declare a terminal without defining it. Useful for plugins.
\ No newline at end of file
diff --git a/docs/how_to_use.md b/docs/how_to_use.md
new file mode 100644
index 0000000..c0b7159
--- /dev/null
+++ b/docs/how_to_use.md
@@ -0,0 +1,54 @@
+# How To Use Lark - Guide
+
+## Work process
+
+This is the recommended process for working with Lark:
+
+1. Collect or create input samples that demonstrate key features or behaviors in the language you're trying to parse.
+
+2. Write a grammar. Try to aim for a structure that is intuitive, and in a way that imitates how you would explain your language to a fellow human.
+
+3. Try your grammar in Lark against each input sample. Make sure the resulting parse-trees make sense.
+
+4. Use Lark's grammar features to [shape the tree](tree_construction.md): get rid of superfluous rules by inlining them, and use aliases when specific cases need clarification.
+
+   - You can perform steps 1-4 repeatedly, gradually growing your grammar to include more sentences.
+
+5. Create a transformer to evaluate the parse-tree into a structure you'll be comfortable working with. This may include evaluating literals, merging branches, or even converting the entire tree into your own set of AST classes.
+
+Of course, some specific use-cases may deviate from this process. Feel free to suggest these cases, and I'll add them to this page.
+
+## Basic API Usage
+
+For common use, you only need to know 3 classes: Lark, Tree, Transformer (see the [Classes Reference](classes.md))
+
+Here is some mock usage of them. You can see a real example in the [examples](https://github.com/lark-parser/lark/tree/master/examples)
+
+```python
+from lark import Lark, Transformer
+
+grammar = """start: rules and more rules
+
+             rule1: other rules AND TOKENS
+                  | rule1 "+" rule2 -> add
+                  | some value [maybe]
+
+             rule2: rule1 "-" (rule2 | "whatever")*
+
+             TOKEN1: "a literal"
+             TOKEN2: TOKEN1 "and literals"
+          """
+
+parser = Lark(grammar)
+
+tree = parser.parse("some input string")
+
+class MyTransformer(Transformer):
+    def rule1(self, matches):
+        return matches[0] + matches[1]
+
+    # I don't have to implement rule2 if I don't feel like it!
+
+new_tree = MyTransformer().transform(tree)
+```
+
diff --git a/docs/index.md b/docs/index.md
new file mode 100644
index 0000000..40e48fe
--- /dev/null
+++ b/docs/index.md
@@ -0,0 +1,42 @@
+# Lark - a modern parsing library for Python
+
+Lark can parse any context-free grammar.
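+
+A quick taste (a minimal, illustrative example; the grammar here is made up for demonstration):
+
+```python
+from lark import Lark
+
+parser = Lark('''start: WORD "," WORD "!"
+                 WORD: /\w+/
+                 %ignore " "
+              ''')
+
+print(parser.parse("Hello, World!").pretty())
+```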
+
+Lark provides:
+
+- Advanced grammar language, based on EBNF
+- Three parsing algorithms to choose from: Earley, LALR(1) and CYK
+- Automatic tree construction, based on the grammar
+- Fast unicode lexer with regexp support, and automatic line-counting
+
+Code is hosted on Github: [https://github.com/lark-parser/lark](https://github.com/lark-parser/lark)
+
+### Install
+```bash
+$ pip install lark-parser
+```
+
+#### Syntax Highlighting
+
+- [Sublime Text & TextMate](https://github.com/lark-parser/lark_syntax)
+- [Visual Studio Code](https://github.com/lark-parser/vscode-lark) (Or install through the vscode plugin system)
+
+
+## Documentation Index
+
+
+* [Philosophy & Design Choices](philosophy.md)
+* [Full List of Features](features.md)
+* [Examples](https://github.com/lark-parser/lark/tree/master/examples)
+* Tutorials
+    * [How to write a DSL](http://blog.erezsh.com/how-to-write-a-dsl-in-python-with-lark/) - Implements a toy LOGO-like language with an interpreter
+    * [How to write a JSON parser](json_tutorial.md)
+* Guides
+    * [How to use Lark](how_to_use.md)
+* Reference
+    * [Grammar](grammar.md)
+    * [Tree Construction](tree_construction.md)
+    * [Classes](classes.md)
+    * [Cheatsheet (PDF)](lark_cheatsheet.pdf)
+* Discussion
+    * [Forum (Google Groups)](https://groups.google.com/forum/#!forum/lark-parser)
\ No newline at end of file
diff --git a/docs/philosophy.md b/docs/philosophy.md
new file mode 100644
index 0000000..246ee3c
--- /dev/null
+++ b/docs/philosophy.md
@@ -0,0 +1,63 @@
+# Philosophy
+
+Parsers are innately complicated and confusing. They're difficult to understand, difficult to write, and difficult to use. Even experts on the subject can become baffled by the nuances of these complicated state-machines.
+
+Lark's mission is to make the process of writing them as simple and abstract as possible, guided by the following design principles:
+
+### Design Principles
+
+1. Readability matters
+
+2. Keep the grammar clean and simple
+
+3. Don't force the user to decide on things that the parser can figure out on its own
+
+4. Usability is more important than performance
+
+5. Performance is still very important
+
+6. Follow the Zen of Python, whenever possible and applicable
+
+
+In accordance with these principles, I arrived at the following design choices:
+
+-----------
+
+# Design Choices
+
+### 1. Separation of code and grammar
+
+Grammars are the de-facto reference for your language, and for the structure of your parse-tree. For any non-trivial language, the conflation of code and grammar always turns out convoluted and difficult to read.
+
+The grammars in Lark are EBNF-inspired, so they are especially easy to read & work with.
+
+### 2. Always build a parse-tree (unless told not to)
+
+Trees are always simpler to work with than state-machines.
+
+1. Trees allow you to see the "state-machine" visually
+
+2. Trees allow your computation to be aware of previous and future states
+
+3. Trees allow you to process the parse in steps, instead of forcing you to do it all at once.
+
+And anyway, every parse-tree can be replayed as a state-machine, so there is no loss of information.
+
+This is explained in more detail [here](https://github.com/erezsh/lark/issues/4).
+
+You can skip building the tree for LALR(1), by providing Lark with a transformer (see the [JSON example](https://github.com/erezsh/lark/blob/master/examples/json_parser.py)).
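+
+A minimal sketch of that pattern (the grammar and transformer below are made up for illustration):
+
+```python
+from lark import Lark, Transformer
+
+class SumNumbers(Transformer):
+    def start(self, children):
+        return sum(int(tok) for tok in children)
+
+grammar = '''start: NUMBER+
+             NUMBER: /[0-9]+/
+             %ignore " "
+          '''
+
+# With parser='lalr' and a transformer, no tree is built;
+# the transformer's callbacks run as the rules are reduced.
+parser = Lark(grammar, parser='lalr', transformer=SumNumbers())
+print(parser.parse("1 2 3"))  # -> 6
+```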
+
+### 3. Earley is the default
+
+The Earley algorithm can accept *any* context-free grammar you throw at it (i.e. any grammar you can write in EBNF, it can parse). That makes it extremely useful for beginners, who are not aware of the strange and arbitrary restrictions that LALR(1) places on its grammars.
+
+As users grow to understand the structure of their grammar, the scope of their target language, and their performance requirements, they may choose to switch over to LALR(1) to gain a huge performance boost, possibly at the cost of some language features.
+
+In short, "Premature optimization is the root of all evil."
+
+### Other design features
+
+- Automatically resolve terminal collisions whenever possible
+
+- Automatically keep track of line & column numbers
+
\ No newline at end of file
diff --git a/docs/tree_construction.md b/docs/tree_construction.md
new file mode 100644
index 0000000..8d2c059
--- /dev/null
+++ b/docs/tree_construction.md
@@ -0,0 +1,129 @@
+# Automatic Tree Construction - Reference
+
+
+Lark builds a tree automatically based on the structure of the grammar, where each rule that is matched becomes a branch (node) in the tree, and its children are its matches, in the order of matching.
+
+For example, the rule `node: child1 child2` will create a tree node with two children. If it is matched as part of another rule (i.e. if it isn't the root), the new rule's tree node will become its parent.
+
+Using `item+` or `item*` will result in a list of items, equivalent to writing `item item item ..`.
+
+### Terminals
+
+Terminals are always values in the tree, never branches.
+
+Lark filters out certain types of terminals by default, considering them punctuation:
+
+- Terminals that won't appear in the tree are:
+
+    - Unnamed literals (like `"keyword"` or `"+"`)
+    - Terminals whose name starts with an underscore (like `_DIGIT`)
+
+- Terminals that *will* appear in the tree are:
+
+    - Unnamed regular expressions (like `/[0-9]/`)
+    - Named terminals whose name starts with a letter (like `DIGIT`)
+
+Rules prefixed with `!` will retain all their literals regardless.
+
+
+
+
+**Example:**
+
+```perl
+    expr: "(" expr ")"
+        | NAME+
+
+    NAME: /\w+/
+
+    %ignore " "
+```
+
+Lark will parse "((hello world))" as:
+
+    expr
+        expr
+            expr
+                "hello"
+                "world"
+
+The brackets do not appear in the tree by design. The words appear because they are matched by a named terminal.
+
+
+# Shaping the tree
+
+Users can alter the automatic construction of the tree using a collection of grammar features.
+
+
+* Rules whose name begins with an underscore will be inlined into their containing rule.
+
+**Example:**
+
+```perl
+    start: "(" _greet ")"
+    _greet: /\w+/ /\w+/
+```
+
+Lark will parse "(hello world)" as:
+
+    start
+        "hello"
+        "world"
+
+
+* Rules that receive a question mark (?) at the beginning of their definition will be inlined if they have a single child, after filtering.
+
+**Example:**
+
+```ruby
+    start: greet greet
+    ?greet: "(" /\w+/ ")"
+          | /\w+/ /\w+/
+```
+
+Lark will parse "hello world (planet)" as:
+
+    start
+        greet
+            "hello"
+            "world"
+        "planet"
+
+* Rules that begin with an exclamation mark will keep all their terminals (they won't get filtered).
+
+```perl
+    !expr: "(" expr ")"
+         | NAME+
+    NAME: /\w+/
+    %ignore " "
+```
+
+Will parse "((hello world))" as:
+
+    expr
+        (
+        expr
+            (
+            expr
+                hello
+                world
+            )
+        )
+
+Using the `!` prefix is usually a "code smell", and may point to a flaw in your grammar design.
+
+* Aliases - options in a rule can receive an alias. It will then be used as the branch name for the option, instead of the rule name.
+
+**Example:**
+
+```ruby
+    start: greet greet
+    greet: "hello"
+         | "world" -> planet
+```
+
+Lark will parse "hello world" as:
+
+    start
+        greet
+        planet
\ No newline at end of file
diff --git a/mkdocs.yml b/mkdocs.yml
new file mode 100644
index 0000000..7c32702
--- /dev/null
+++ b/mkdocs.yml
@@ -0,0 +1,3 @@
+site_name: Lark
+theme: readthedocs
+