@@ -10,3 +10,4 @@ tags | |||
.mypy_cache | |||
/dist | |||
/build | |||
docs/_build |
@@ -25,7 +25,7 @@ Most importantly, Lark will save you time and prevent you from getting parsing h | |||
### Quick links | |||
- [Documentation @readthedocs](https://lark-parser.readthedocs.io/) | |||
- [Cheatsheet (PDF)](/docs/lark_cheatsheet.pdf) | |||
- [Cheatsheet (PDF)](/docs/_static/lark_cheatsheet.pdf) | |||
- [Online IDE (very basic)](https://lark-parser.github.io/lark/ide/app.html) | |||
- [Tutorial](/docs/json_tutorial.md) for writing a JSON parser. | |||
- Blog post: [How to write a DSL with Lark](http://blog.erezsh.com/how-to-write-a-dsl-in-python-with-lark/) | |||
@@ -113,9 +113,9 @@ See the full list of [features here](https://lark-parser.readthedocs.io/en/lates | |||
Lark is the fastest and lightest (lower is better) | |||
![Run-time Comparison](docs/comparison_runtime.png) | |||
![Run-time Comparison](docs/_static/comparison_runtime.png) | |||
![Memory Usage Comparison](docs/comparison_memory.png) | |||
![Memory Usage Comparison](docs/_static/comparison_memory.png) | |||
Check out the [JSON tutorial](/docs/json_tutorial.md#conclusion) for more details on how the comparison was made. | |||
@@ -0,0 +1,20 @@ | |||
# Minimal makefile for Sphinx documentation | |||
# | |||
# You can set these variables from the command line. | |||
SPHINXOPTS = | |||
SPHINXBUILD = sphinx-build | |||
SPHINXPROJ = Lark | |||
SOURCEDIR = . | |||
BUILDDIR = _build | |||
# Put it first so that "make" without argument is like "make help". | |||
help: | |||
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) | |||
.PHONY: help Makefile | |||
# Catch-all target: route all unknown targets to Sphinx using the new | |||
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS). | |||
%: Makefile | |||
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) |
@@ -1 +0,0 @@ | |||
theme: jekyll-theme-slate |
@@ -1,284 +0,0 @@ | |||
# Classes Reference | |||
This page details the important classes in Lark. | |||
---- | |||
## lark.Lark | |||
The Lark class is the main interface for the library. It's mostly a thin wrapper for the many different parsers, and for the tree constructor. | |||
#### Lark.\_\_init\_\_ | |||
```python | |||
def __init__(self, grammar_string, **options): ... | |||
``` | |||
Creates an instance of Lark with the given grammar | |||
Example: | |||
```python | |||
>>> Lark(r'''start: "foo" ''') | |||
Lark(...) | |||
``` | |||
#### Lark.open | |||
```python | |||
def open(cls, grammar_filename, rel_to=None, **options): ... | |||
``` | |||
Creates an instance of Lark with the grammar given by its filename | |||
If rel_to is provided, the function will find the grammar filename in relation to it. | |||
Example: | |||
```python | |||
>>> Lark.open("grammar_file.lark", rel_to=__file__, parser="lalr") | |||
Lark(...) | |||
``` | |||
#### Lark.parse | |||
```python | |||
def parse(self, text, start=None, on_error=None): ... | |||
``` | |||
Parse the given text, according to the options provided. | |||
Returns a complete parse tree for the text (of type Tree) | |||
If a transformer is supplied to `__init__`, returns whatever is the result of the transformation. | |||
Parameters: | |||
* start: str - required if Lark was given multiple possible start symbols (using the start option). | |||
* on_error: function - if provided, will be called on UnexpectedToken error. Return true to resume parsing. LALR only. | |||
(See `examples/error_puppet.py` for an example of how to use `on_error`.) | |||
Example: | |||
```python | |||
>>> Lark(r'''start: "hello" " "+ /\w+/ ''').parse('hello kitty') | |||
Tree(start, [Token(__ANON_0, 'kitty')]) | |||
``` | |||
#### Lark.save / Lark.load | |||
```python | |||
def save(self, f): ... | |||
def load(cls, f): ... | |||
``` | |||
Useful for caching and multiprocessing. | |||
`save` saves the instance into the given file object | |||
`load` loads an instance from the given file object | |||
#### | |||
### Lark Options | |||
#### General options | |||
**start** - The start symbol. Either a string, or a list of strings for multiple possible starts (Default: "start") | |||
**debug** - Display debug information, such as warnings (default: False) | |||
**transformer** - Applies the transformer to every parse tree (equivlent to applying it after the parse, but faster) | |||
**propagate_positions** - Propagates (line, column, end_line, end_column) attributes into all tree branches. | |||
**maybe_placeholders** - | |||
- When True, the `[]` operator returns `None` when not matched. | |||
- When `False`, `[]` behaves like the `?` operator, and returns no value at all. | |||
- (default=`False`. Recommended to set to `True`) | |||
**g_regex_flags** - Flags that are applied to all terminals (both regex and strings) | |||
**regex** - Use the `regex` library instead of the built-in `re` module (See below) | |||
**keep_all_tokens** - Prevent the tree builder from automagically removing "punctuation" tokens (default: False) | |||
**cache** - Cache the results of the Lark grammar analysis, for x2 to x3 faster loading. LALR only for now. | |||
- When `False`, does nothing (default) | |||
- When `True`, caches to a temporary file in the local directory | |||
- When given a string, caches to the path pointed by the string | |||
#### Algorithm | |||
**parser** - Decides which parser engine to use, "earley" or "lalr". (Default: "earley") | |||
(there is also a "cyk" option for legacy) | |||
**lexer** - Decides whether or not to use a lexer stage | |||
- "auto" (default): Choose for me based on the parser | |||
- "standard": Use a standard lexer | |||
- "contextual": Stronger lexer (only works with parser="lalr") | |||
- "dynamic": Flexible and powerful (only with parser="earley") | |||
- "dynamic_complete": Same as dynamic, but tries *every* variation of tokenizing possible. (only with parser="earley") | |||
**ambiguity** - Decides how to handle ambiguity in the parse. Only relevant if parser="earley" | |||
- "resolve": The parser will automatically choose the simplest derivation (it chooses consistently: greedy for tokens, non-greedy for rules) | |||
- "explicit": The parser will return all derivations wrapped in "_ambig" tree nodes (i.e. a forest). | |||
#### Misc. | |||
- **postlex** - Lexer post-processing (Default: None) Only works with the standard and contextual lexers. | |||
- **priority** - How priorities should be evaluated - auto, none, normal, invert (Default: auto) | |||
- **lexer_callbacks** - Dictionary of callbacks for the lexer. May alter tokens during lexing. Use with caution. | |||
- **edit_terminals** - A callback | |||
- **use_bytes** - Accept and parse an input of type `bytes` instead of `str`. Grammar should still be specified as `str`, and terminal values are assumed to be `latin-1`. | |||
#### Using Unicode character classes with `regex` | |||
Python's builtin `re` module has a few persistent known bugs and also won't parse | |||
advanced regex features such as character classes. | |||
With `pip install lark-parser[regex]`, the `regex` module will be installed alongside `lark` | |||
and can act as a drop-in replacement to `re`. | |||
Any instance of `Lark` instantiated with `regex=True` will now use the `regex` module | |||
instead of `re`. For example, we can now use character classes to match PEP-3131 compliant Python identifiers. | |||
```python | |||
from lark import Lark | |||
>>> g = Lark(r""" | |||
?start: NAME | |||
NAME: ID_START ID_CONTINUE* | |||
ID_START: /[\p{Lu}\p{Ll}\p{Lt}\p{Lm}\p{Lo}\p{Nl}_]+/ | |||
ID_CONTINUE: ID_START | /[\p{Mn}\p{Mc}\p{Nd}\p{Pc}·]+/ | |||
""", regex=True) | |||
>>> g.parse('வணக்கம்') | |||
'வணக்கம்' | |||
``` | |||
---- | |||
## Tree | |||
The main tree class | |||
* `data` - The name of the rule or alias | |||
* `children` - List of matched sub-rules and terminals | |||
* `meta` - Line & Column numbers (if `propagate_positions` is enabled) | |||
* meta attributes: `line`, `column`, `start_pos`, `end_line`, `end_column`, `end_pos` | |||
#### \_\_init\_\_(self, data, children) | |||
Creates a new tree, and stores "data" and "children" in attributes of the same name. | |||
#### pretty(self, indent_str=' ') | |||
Returns an indented string representation of the tree. Great for debugging. | |||
#### find_pred(self, pred) | |||
Returns all nodes of the tree that evaluate pred(node) as true. | |||
#### find_data(self, data) | |||
Returns all nodes of the tree whose data equals the given data. | |||
#### iter_subtrees(self) | |||
Depth-first iteration. | |||
Iterates over all the subtrees, never returning to the same node twice (Lark's parse-tree is actually a DAG). | |||
#### iter_subtrees_topdown(self) | |||
Breadth-first iteration. | |||
Iterates over all the subtrees, return nodes in order like pretty() does. | |||
#### \_\_eq\_\_, \_\_hash\_\_ | |||
Trees can be hashed and compared. | |||
---- | |||
## Token | |||
When using a lexer, the resulting tokens in the trees will be of the Token class, which inherits from Python's string. So, normal string comparisons and operations will work as expected. Tokens also have other useful attributes: | |||
* `type` - Name of the token (as specified in grammar). | |||
* `pos_in_stream` - the index of the token in the text | |||
* `line` - The line of the token in the text (starting with 1) | |||
* `column` - The column of the token in the text (starting with 1) | |||
* `end_line` - The line where the token ends | |||
* `end_column` - The next column after the end of the token. For example, if the token is a single character with a `column` value of 4, `end_column` will be 5. | |||
* `end_pos` - the index where the token ends (basically pos_in_stream + len(token)) | |||
## Transformer | |||
## Visitor | |||
## Interpreter | |||
See the [visitors page](visitors.md) | |||
## UnexpectedInput | |||
- `UnexpectedInput` | |||
- `UnexpectedToken` - The parser recieved an unexpected token | |||
- `UnexpectedCharacters` - The lexer encountered an unexpected string | |||
After catching one of these exceptions, you may call the following helper methods to create a nicer error message: | |||
#### get_context(text, span) | |||
Returns a pretty string pinpointing the error in the text, with `span` amount of context characters around it. | |||
(The parser doesn't hold a copy of the text it has to parse, so you have to provide it again) | |||
#### match_examples(parse_fn, examples) | |||
Allows you to detect what's wrong in the input text by matching against example errors. | |||
Accepts the parse function (usually `lark_instance.parse`) and a dictionary of `{'example_string': value}`. | |||
The function will iterate the dictionary until it finds a matching error, and return the corresponding value. | |||
For an example usage, see: [examples/error_reporting_lalr.py](https://github.com/lark-parser/lark/blob/master/examples/error_reporting_lalr.py) | |||
### UnexpectedToken | |||
When the parser throws UnexpectedToken, it instanciates a puppet with its internal state. | |||
Users can then interactively set the puppet to the desired puppet state, and resume regular parsing. | |||
See [ParserPuppet](#ParserPuppet) | |||
### UnexpectedCharacters | |||
## ParserPuppet | |||
ParserPuppet gives you advanced control over error handling when parsing with LALR. | |||
For a simpler, more streamlined interface, see the `on_error` argument to `Lark.parse()`. | |||
#### choices(self) | |||
Returns a dictionary of token types, matched to their action in the parser. | |||
Only returns token types that are accepted by the current state. | |||
Updated by `feed_token()` | |||
#### feed_token(self, token) | |||
Feed the parser with a token, and advance it to the next state, as if it recieved it from the lexer. | |||
Note that `token` has to be an instance of `Token`. | |||
#### copy(self) | |||
Create a new puppet with a separate state. Calls to `feed_token()` won't affect the old puppet, and vice-versa. | |||
#### pretty(self) | |||
Print the output of `choices()` in a way that's easier to read. | |||
#### resume_parse(self) | |||
Resume parsing from the current puppet state. | |||
@@ -0,0 +1,67 @@ | |||
API Reference | |||
============= | |||
Lark | |||
---- | |||
.. autoclass:: lark.Lark | |||
:members: open, parse, save, load | |||
**Using Unicode character classes with regex** | |||
Python's built-in ``re`` module has a few persistent known bugs, and also doesn't support
advanced regex features such as Unicode character classes.
With ``pip install lark-parser[regex]``, the ``regex`` module will be installed alongside ``lark`` and can act as a drop-in replacement to ``re``.
Any instance of ``Lark`` instantiated with ``regex=True`` will now use the ``regex`` module instead of ``re``.
For example, we can now use character classes to match PEP-3131 compliant Python identifiers.
Example: | |||
:: | |||
from lark import Lark | |||
>>> g = Lark(r""" | |||
?start: NAME | |||
NAME: ID_START ID_CONTINUE* | |||
ID_START: /[\p{Lu}\p{Ll}\p{Lt}\p{Lm}\p{Lo}\p{Nl}_]+/ | |||
ID_CONTINUE: ID_START | /[\p{Mn}\p{Mc}\p{Nd}\p{Pc}·]+/ | |||
""", regex=True) | |||
>>> g.parse('வணக்கம்') | |||
'வணக்கம்' | |||
Tree | |||
---- | |||
.. autoclass:: lark.Tree | |||
:members: pretty, find_pred, find_data, iter_subtrees, | |||
iter_subtrees_topdown | |||
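A minimal usage sketch of the members above (the grammar and input are illustrative):

::

    from lark import Lark

    tree = Lark(r'''
        start: word+
        word: /\w+/
        %ignore " "
    ''').parse("hello world")

    print(tree.data)              # 'start'
    print(tree.pretty())          # indented view, handy for debugging
    for t in tree.find_data('word'):
        print(t.children)         # the matched tokens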
Token | |||
----- | |||
.. autoclass:: lark.Token | |||
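A small sketch of the string-like behavior (the values are illustrative; the constructor signature comes from ``lark/lexer.py``):

::

    from lark import Token

    tok = Token('WORD', 'hello', pos_in_stream=0, line=1, column=1)
    assert tok == 'hello'                 # Token subclasses str, so comparisons just work
    print(tok.type, tok.line, tok.column)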
Transformer, Visitor & Interpreter | |||
--------------------------------- | |||
See :doc:`visitors`. | |||
UnexpectedInput | |||
--------------- | |||
.. autoclass:: lark.exceptions.UnexpectedInput | |||
:members: get_context, match_examples | |||
.. autoclass:: lark.exceptions.UnexpectedToken | |||
.. autoclass:: lark.exceptions.UnexpectedCharacters | |||
.. _parserpuppet: | |||
ParserPuppet | |||
------------ | |||
.. autoclass:: lark.parsers.lalr_puppet.ParserPuppet | |||
:members: choices, feed_token, copy, pretty, resume_parse |
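A hedged sketch of the intended workflow, through the ``on_error`` hook (the ``parser``, the input, and the ``SEMICOLON`` terminal are illustrative assumptions; see ``examples/error_puppet.py`` for a complete, runnable version):

::

    from lark import Token

    def handle(e):                        # called as parser.parse(text, on_error=handle)
        if 'SEMICOLON' in e.puppet.choices():
            e.puppet.feed_token(Token('SEMICOLON', ';'))  # inject the missing ';'
            e.puppet.feed_token(e.token)                  # then retry the real token
            return True                   # True resumes parsing from the new state
        return False                      # False re-raises the error

    tree = parser.parse("a = 1 b = 2", on_error=handle)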
@@ -0,0 +1,179 @@ | |||
#!/usr/bin/env python3 | |||
# -*- coding: utf-8 -*- | |||
# | |||
# Lark documentation build configuration file, created by | |||
# sphinx-quickstart on Sun Aug 16 13:09:41 2020. | |||
# | |||
# This file is execfile()d with the current directory set to its | |||
# containing dir. | |||
# | |||
# Note that not all possible configuration values are present in this | |||
# autogenerated file. | |||
# | |||
# All configuration values have a default; values that are commented out | |||
# serve to show the default. | |||
# If extensions (or modules to document with autodoc) are in another directory, | |||
# add these directories to sys.path here. If the directory is relative to the | |||
# documentation root, use os.path.abspath to make it absolute, like shown here. | |||
# | |||
import os | |||
import sys | |||
sys.path.insert(0, os.path.abspath('..')) | |||
autodoc_member_order = 'bysource' | |||
# -- General configuration ------------------------------------------------ | |||
# If your documentation needs a minimal Sphinx version, state it here. | |||
# | |||
# needs_sphinx = '1.0' | |||
# Add any Sphinx extension module names here, as strings. They can be | |||
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom | |||
# ones. | |||
extensions = [ | |||
'sphinx.ext.autodoc', | |||
'sphinx.ext.napoleon', | |||
'sphinx.ext.coverage', | |||
'recommonmark', | |||
] | |||
# Add any paths that contain templates here, relative to this directory. | |||
templates_path = ['_templates'] | |||
# The suffix(es) of source filenames. | |||
# You can specify multiple suffix as a list of string: | |||
# | |||
# source_suffix = ['.rst', '.md'] | |||
source_suffix = { | |||
'.rst': 'restructuredtext', | |||
'.md': 'markdown' | |||
} | |||
# The master toctree document. | |||
master_doc = 'index' | |||
# General information about the project. | |||
project = 'Lark' | |||
copyright = '2020, Erez Shinan' | |||
author = 'Erez Shinan' | |||
# The version info for the project you're documenting, acts as replacement for | |||
# |version| and |release|, also used in various other places throughout the | |||
# built documents. | |||
# | |||
# The short X.Y version. | |||
version = '' | |||
# The full version, including alpha/beta/rc tags. | |||
release = '' | |||
# The language for content autogenerated by Sphinx. Refer to documentation | |||
# for a list of supported languages. | |||
# | |||
# This is also used if you do content translation via gettext catalogs. | |||
# Usually you set "language" from the command line for these cases. | |||
language = None | |||
# List of patterns, relative to source directory, that match files and | |||
# directories to ignore when looking for source files. | |||
# These patterns also affect html_static_path and html_extra_path
exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store'] | |||
# The name of the Pygments (syntax highlighting) style to use. | |||
pygments_style = 'sphinx' | |||
# If true, `todo` and `todoList` produce output, else they produce nothing. | |||
todo_include_todos = False | |||
# -- Options for HTML output ---------------------------------------------- | |||
# The theme to use for HTML and HTML Help pages. See the documentation for | |||
# a list of builtin themes. | |||
# | |||
html_theme = 'sphinx_rtd_theme' | |||
# Theme options are theme-specific and customize the look and feel of a theme | |||
# further. For a list of options available for each theme, see the | |||
# documentation. | |||
# | |||
# html_theme_options = {} | |||
# Add any paths that contain custom static files (such as style sheets) here, | |||
# relative to this directory. They are copied after the builtin static files, | |||
# so a file named "default.css" will overwrite the builtin "default.css". | |||
html_static_path = ['_static'] | |||
# Custom sidebar templates, must be a dictionary that maps document names | |||
# to template names. | |||
# | |||
# This is required for the alabaster theme | |||
# refs: http://alabaster.readthedocs.io/en/latest/installation.html#sidebars | |||
html_sidebars = { | |||
'**': [ | |||
'relations.html', # needs 'show_related': True theme option to display | |||
'searchbox.html', | |||
] | |||
} | |||
# -- Options for HTMLHelp output ------------------------------------------ | |||
# Output file base name for HTML help builder. | |||
htmlhelp_basename = 'Larkdoc' | |||
# -- Options for LaTeX output --------------------------------------------- | |||
latex_elements = { | |||
# The paper size ('letterpaper' or 'a4paper'). | |||
# | |||
# 'papersize': 'letterpaper', | |||
# The font size ('10pt', '11pt' or '12pt'). | |||
# | |||
# 'pointsize': '10pt', | |||
# Additional stuff for the LaTeX preamble. | |||
# | |||
# 'preamble': '', | |||
# Latex figure (float) alignment | |||
# | |||
# 'figure_align': 'htbp', | |||
} | |||
# Grouping the document tree into LaTeX files. List of tuples | |||
# (source start file, target name, title, | |||
# author, documentclass [howto, manual, or own class]). | |||
latex_documents = [ | |||
(master_doc, 'Lark.tex', 'Lark Documentation', | |||
'Erez Shinan', 'manual'), | |||
] | |||
# -- Options for manual page output --------------------------------------- | |||
# One entry per manual page. List of tuples | |||
# (source start file, name, description, authors, manual section). | |||
man_pages = [ | |||
(master_doc, 'lark', 'Lark Documentation', | |||
[author], 1) | |||
] | |||
# -- Options for Texinfo output ------------------------------------------- | |||
# Grouping the document tree into Texinfo files. List of tuples | |||
# (source start file, target name, title, author, | |||
# dir menu entry, description, category) | |||
texinfo_documents = [ | |||
(master_doc, 'Lark', 'Lark Documentation', | |||
author, 'Lark', 'One line description of project.', | |||
'Miscellaneous'), | |||
] | |||
@@ -1,4 +1,6 @@ | |||
# Main Features | |||
# Features | |||
## Main Features | |||
- Earley parser, capable of parsing any context-free grammar | |||
- Implements SPPF, for efficient parsing and storing of ambiguous grammars. | |||
- LALR(1) parser, limited in power of expression, but very efficient in space and performance (O(n)). | |||
@@ -18,7 +20,8 @@ | |||
[Read more about the parsers](parsers.md) | |||
# Extra features | |||
## Extra features | |||
- Import rules and tokens from other Lark grammars, for code reuse and modularity. | |||
- Support for external regex module ([see here](classes.md#using-unicode-character-classes-with-regex)) | |||
- Import grammars from Nearley.js ([read more](nearley.md)) | |||
@@ -1,13 +1,5 @@ | |||
# Grammar Reference | |||
Table of contents: | |||
1. [Definitions](#defs) | |||
1. [Terminals](#terms) | |||
1. [Rules](#rules) | |||
1. [Directives](#dirs) | |||
<a name="defs"></a> | |||
## Definitions | |||
A **grammar** is a list of rules and terminals, that together define a language. | |||
@@ -20,7 +12,7 @@ Each rule is a list of terminals and rules, whose location and nesting define th | |||
A **parsing algorithm** is an algorithm that takes a grammar definition and a sequence of symbols (members of the alphabet), and matches the entirety of the sequence by searching for a structure that is allowed by the grammar. | |||
## General Syntax and notes | |||
### General Syntax and notes | |||
Grammars in Lark are based on [EBNF](https://en.wikipedia.org/wiki/Extended_Backus–Naur_form) syntax, with several enhancements. | |||
@@ -58,7 +50,6 @@ Lark begins the parse with the rule 'start', unless specified otherwise in the o | |||
Names of rules are always in lowercase, while names of terminals are always in uppercase. This distinction has practical effects, for the shape of the generated parse-tree, and the automatic construction of the lexer (aka tokenizer, or scanner). | |||
<a name="terms"></a> | |||
## Terminals | |||
Terminals are used to match text into symbols. They can be defined as a combination of literals and other terminals. | |||
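For illustration, a short sketch of terminals built from literals and other terminals (the definitions are made up for the example):

```python
from lark import Lark

parser = Lark(r'''
    start: FLOAT
    DIGIT: "0".."9"     // a literal range
    INT: DIGIT+         // built from another terminal
    FLOAT: INT "." INT  // a combination of both
''')
print(parser.parse("3.14"))  # Tree(start, [Token(FLOAT, '3.14')])
```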
@@ -192,7 +183,6 @@ _ambig | |||
``` | |||
<a name="rules"></a> | |||
## Rules | |||
**Syntax:** | |||
@@ -22,11 +22,11 @@ Of course, some specific use-cases may deviate from this process. Feel free to s | |||
Browse the [Examples](https://github.com/lark-parser/lark/tree/master/examples) to find a template that suits your purposes. | |||
Read the tutorials to get a better understanding of how everything works. (links in the [main page](/)) | |||
Read the tutorials to get a better understanding of how everything works. (links in the [main page](/index)) | |||
Use the [Cheatsheet (PDF)](lark_cheatsheet.pdf) for quick reference. | |||
Use the [Cheatsheet (PDF)](/_static/lark_cheatsheet.pdf) for quick reference. | |||
Use the reference pages for more in-depth explanations. (links in the [main page](/)] | |||
Use the reference pages for more in-depth explanations. (links in the [main page](/index))
## LALR usage | |||
@@ -1,55 +0,0 @@ | |||
# Lark | |||
A modern parsing library for Python | |||
## Overview | |||
Lark can parse any context-free grammar. | |||
Lark provides: | |||
- Advanced grammar language, based on EBNF | |||
- Three parsing algorithms to choose from: Earley, LALR(1) and CYK | |||
- Automatic tree construction, inferred from your grammar | |||
- Fast unicode lexer with regexp support, and automatic line-counting | |||
Lark's code is hosted on Github: [https://github.com/lark-parser/lark](https://github.com/lark-parser/lark) | |||
### Install | |||
```bash | |||
$ pip install lark-parser | |||
``` | |||
#### Syntax Highlighting | |||
- [Sublime Text & TextMate](https://github.com/lark-parser/lark_syntax) | |||
- [Visual Studio Code](https://github.com/lark-parser/vscode-lark) (Or install through the vscode plugin system) | |||
- [Intellij & PyCharm](https://github.com/lark-parser/intellij-syntax-highlighting) | |||
----- | |||
## Documentation Index | |||
* [Philosophy & Design Choices](philosophy.md) | |||
* [Features](features.md) | |||
* [Examples](https://github.com/lark-parser/lark/tree/master/examples) | |||
* [Online IDE](https://lark-parser.github.io/lark/ide/app.html) | |||
* Tutorials | |||
* [How to write a DSL](http://blog.erezsh.com/how-to-write-a-dsl-in-python-with-lark/) - Implements a toy LOGO-like language with an interpreter | |||
* [How to write a JSON parser](json_tutorial.md) - Teaches you how to use Lark | |||
* Unofficial | |||
* [Program Synthesis is Possible](https://www.cs.cornell.edu/~asampson/blog/minisynth.html) - Creates a DSL for Z3 | |||
* Guides | |||
* [How to use Lark](how_to_use.md) | |||
* [How to develop Lark](how_to_develop.md) | |||
* Reference | |||
* [Grammar](grammar.md) | |||
* [Tree Construction](tree_construction.md) | |||
* [Visitors & Transformers](visitors.md) | |||
* [Classes](classes.md) | |||
* [Cheatsheet (PDF)](lark_cheatsheet.pdf) | |||
* [Importing grammars from Nearley](nearley.md) | |||
* Discussion | |||
* [Gitter](https://gitter.im/lark-parser/Lobby) | |||
* [Forum (Google Groups)](https://groups.google.com/forum/#!forum/lark-parser) |
@@ -0,0 +1,112 @@ | |||
.. Lark documentation master file, created by | |||
sphinx-quickstart on Sun Aug 16 13:09:41 2020. | |||
You can adapt this file completely to your liking, but it should at least | |||
contain the root `toctree` directive. | |||
Welcome to Lark's documentation! | |||
================================ | |||
.. toctree:: | |||
:maxdepth: 2 | |||
:caption: Overview | |||
:hidden: | |||
philosophy | |||
features | |||
parsers | |||
.. toctree:: | |||
:maxdepth: 2 | |||
:caption: Tutorials & Guides | |||
:hidden: | |||
json_tutorial | |||
how_to_use | |||
how_to_develop | |||
recipes | |||
.. toctree:: | |||
:maxdepth: 2 | |||
:caption: Reference | |||
:hidden: | |||
grammar | |||
tree_construction | |||
classes | |||
visitors | |||
nearley | |||
Lark is a modern parsing library for Python. Lark can parse any context-free grammar. | |||
Lark provides: | |||
- Advanced grammar language, based on EBNF | |||
- Three parsing algorithms to choose from: Earley, LALR(1) and CYK | |||
- Automatic tree construction, inferred from your grammar | |||
- Fast unicode lexer with regexp support, and automatic line-counting | |||
Install Lark | |||
-------------- | |||
.. code:: bash | |||
$ pip install lark-parser | |||
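A minimal sanity check after installing (the grammar is illustrative):

.. code:: python

    from lark import Lark

    parser = Lark(r'''
        start: "hello" NAME
        NAME: /\w+/
        %ignore " "
    ''')
    print(parser.parse("hello world").pretty())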
Syntax Highlighting | |||
------------------- | |||
- `Sublime Text & TextMate`_ | |||
- `Visual Studio Code`_ (Or install through the vscode plugin system) | |||
- `Intellij & PyCharm`_ | |||
.. _Sublime Text & TextMate: https://github.com/lark-parser/lark_syntax | |||
.. _Visual Studio Code: https://github.com/lark-parser/vscode-lark | |||
.. _Intellij & PyCharm: https://github.com/lark-parser/intellij-syntax-highlighting | |||
Resources | |||
--------- | |||
- :doc:`philosophy` | |||
- :doc:`features` | |||
- `Examples`_ | |||
- `Online IDE`_ | |||
- Tutorials | |||
- `How to write a DSL`_ - Implements a toy LOGO-like language with | |||
an interpreter | |||
- :doc:`json_tutorial` - Teaches you how to use Lark | |||
- Unofficial | |||
- `Program Synthesis is Possible`_ - Creates a DSL for Z3 | |||
- Guides | |||
- :doc:`how_to_use` | |||
- :doc:`how_to_develop` | |||
- Reference | |||
- :doc:`grammar` | |||
- :doc:`tree_construction` | |||
- :doc:`visitors` | |||
- :doc:`classes` | |||
- :doc:`nearley` | |||
- `Cheatsheet (PDF)`_ | |||
- Discussion | |||
- `Gitter`_ | |||
- `Forum (Google Groups)`_ | |||
.. _Examples: https://github.com/lark-parser/lark/tree/master/examples | |||
.. _Online IDE: https://lark-parser.github.io/lark/ide/app.html | |||
.. _How to write a DSL: http://blog.erezsh.com/how-to-write-a-dsl-in-python-with-lark/ | |||
.. _Program Synthesis is Possible: https://www.cs.cornell.edu/~asampson/blog/minisynth.html | |||
.. _Cheatsheet (PDF): _static/lark_cheatsheet.pdf | |||
.. _Gitter: https://gitter.im/lark-parser/Lobby | |||
.. _Forum (Google Groups): https://groups.google.com/forum/#!forum/lark-parser |
@@ -1,7 +1,6 @@ | |||
# Lark Tutorial - JSON parser | |||
# JSON parser - Tutorial | |||
Lark is a parser - a program that accepts a grammar and text, and produces a structured tree that represents that text. | |||
In this tutorial we will write a JSON parser in Lark, and explore Lark's various features in the process. | |||
It has 5 parts. | |||
@@ -0,0 +1,36 @@ | |||
@ECHO OFF | |||
pushd %~dp0 | |||
REM Command file for Sphinx documentation | |||
if "%SPHINXBUILD%" == "" ( | |||
set SPHINXBUILD=sphinx-build | |||
) | |||
set SOURCEDIR=. | |||
set BUILDDIR=_build | |||
set SPHINXPROJ=Lark | |||
if "%1" == "" goto help | |||
%SPHINXBUILD% >NUL 2>NUL | |||
if errorlevel 9009 ( | |||
echo. | |||
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx | |||
echo.installed, then set the SPHINXBUILD environment variable to point | |||
echo.to the full path of the 'sphinx-build' executable. Alternatively you | |||
echo.may add the Sphinx directory to PATH. | |||
echo. | |||
echo.If you don't have Sphinx installed, grab it from | |||
echo.http://sphinx-doc.org/ | |||
exit /b 1 | |||
) | |||
%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% | |||
goto end | |||
:help | |||
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% | |||
:end | |||
popd |
@@ -1,7 +1,7 @@ | |||
# Parsers | |||
Lark implements the following parsing algorithms: Earley, LALR(1), and CYK | |||
# Earley | |||
## Earley | |||
An [Earley Parser](https://www.wikiwand.com/en/Earley_parser) is a chart parser capable of parsing any context-free grammar at O(n^3), and O(n^2) when the grammar is unambiguous. It can parse most LR grammars at O(n). Most programming languages are LR, and can be parsed in linear time.
@@ -30,7 +30,7 @@ Lark provides the following options to combat ambiguity: | |||
**TODO: Add documentation on dynamic_complete** | |||
# LALR(1) | |||
## LALR(1) | |||
[LALR(1)](https://www.wikiwand.com/en/LALR_parser) is a very efficient, tried-and-tested parsing algorithm. It's incredibly fast and requires very little memory. It can parse most programming languages (for example: Python and Java).
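A quick sketch of opting into LALR(1) (the grammar is illustrative):

```python
from lark import Lark

parser = Lark(r'''
    start: NAME "=" NUMBER
    NAME: /[a-z]+/
    NUMBER: /\d+/
    %ignore " "
''', parser="lalr", lexer="contextual")  # the contextual lexer requires LALR

print(parser.parse("answer = 42").pretty())
```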
@@ -42,7 +42,7 @@ The contextual lexer communicates with the parser, and uses the parser's lookahe | |||
This is an improvement to LALR(1) that is unique to Lark. | |||
# CYK Parser | |||
## CYK Parser | |||
A [CYK parser](https://www.wikiwand.com/en/CYK_algorithm) can parse any context-free grammar at O(n^3*|G|). | |||
@@ -4,7 +4,7 @@ Parsers are innately complicated and confusing. They're difficult to understand, | |||
Lark's mission is to make the process of writing them as simple and abstract as possible, by following these design principles: | |||
### Design Principles | |||
## Design Principles | |||
1. Readability matters | |||
@@ -23,7 +23,7 @@ In accordance with these principles, I arrived at the following design choices: | |||
----------- | |||
# Design Choices | |||
## Design Choices | |||
### 1. Separation of code and grammar | |||
@@ -1,4 +1,4 @@ | |||
# Automatic Tree Construction - Reference | |||
# Tree Construction Reference | |||
Lark builds a tree automatically based on the structure of the grammar, where each rule that is matched becomes a branch (node) in the tree, and its children are its matches, in the order of matching. | |||
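For example (a minimal sketch; the grammar and input are illustrative):

```python
from lark import Lark

parser = Lark(r'''
    start: pair+
    pair: NAME "=" NUMBER
    NAME: /[a-z]+/
    NUMBER: /\d+/
    %ignore " "
''')
# Each matched rule becomes a branch; its matches are its children:
print(parser.parse("a = 1 b = 2"))
# Tree(start, [Tree(pair, [Token(NAME, 'a'), Token(NUMBER, '1')]),
#              Tree(pair, [Token(NAME, 'b'), Token(NUMBER, '2')])])
```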
@@ -13,7 +13,7 @@ If `maybe_placeholders=False` (the default), then `[]` behaves like `()?`. | |||
If `maybe_placeholders=True`, then using `[item]` will return the item if it matched, or the value `None`, if it didn't. | |||
### Terminals | |||
## Terminals | |||
Terminals are always values in the tree, never branches. | |||
@@ -74,7 +74,7 @@ Lark will parse "((hello world))" as: | |||
The brackets do not appear in the tree by design. The words appear because they are matched by a named terminal. | |||
# Shaping the tree | |||
## Shaping the tree | |||
Users can alter the automatic construction of the tree using a collection of grammar features. | |||
@@ -1,148 +0,0 @@ | |||
## Transformers & Visitors | |||
Transformers & Visitors provide a convenient interface to process the parse-trees that Lark returns. | |||
They are used by inheriting from the correct class (visitor or transformer), and implementing methods corresponding to the rule you wish to process. Each method accepts the children as an argument. That can be modified using the `v_args` decorator, which allows to inline the arguments (akin to `*args`), or add the tree `meta` property as an argument. | |||
See: <a href="https://github.com/lark-parser/lark/blob/master/lark/visitors.py">visitors.py</a> | |||
### Visitors | |||
Visitors visit each node of the tree, and run the appropriate method on it according to the node's data. | |||
They work bottom-up, starting with the leaves and ending at the root of the tree. | |||
**Example:** | |||
```python | |||
class IncreaseAllNumbers(Visitor): | |||
def number(self, tree): | |||
assert tree.data == "number" | |||
tree.children[0] += 1 | |||
IncreaseAllNumbers().visit(parse_tree) | |||
``` | |||
There are two classes that implement the visitor interface: | |||
* Visitor - Visit every node (without recursion) | |||
* Visitor_Recursive - Visit every node using recursion. Slightly faster. | |||
### Interpreter | |||
The interpreter walks the tree starting at the root (top-down). | |||
For each node, it calls the method corresponding with its `data` attribute. | |||
Unlike Transformer and Visitor, the Interpreter doesn't automatically visit its sub-branches. | |||
The user has to explicitly call `visit`, `visit_children`, or use the `@visit_children_decor`. | |||
This allows the user to implement branching and loops. | |||
**Example:** | |||
```python | |||
class IncreaseSomeOfTheNumbers(Interpreter): | |||
def number(self, tree): | |||
tree.children[0] += 1 | |||
def skip(self, tree): | |||
# skip this subtree. don't change any number node inside it. | |||
pass | |||
IncreaseSomeOfTheNumbers().visit(parse_tree) | |||
``` | |||
### Transformers | |||
Transformers visit each node of the tree, and run the appropriate method on it according to the node's data. | |||
They work bottom-up (or: depth-first), starting with the leaves and ending at the root of the tree. | |||
Transformers can be used to implement map & reduce patterns. | |||
Because nodes are reduced from leaf to root, at any point the callbacks may assume the children have already been transformed (if applicable). | |||
Transformers can be chained into a new transformer by using multiplication. | |||
`Transformer` can do anything `Visitor` can do, but because it reconstructs the tree, it is slightly less efficient. | |||
**Example:** | |||
```python | |||
from lark import Tree, Transformer | |||
class EvalExpressions(Transformer): | |||
def expr(self, args): | |||
return eval(args[0]) | |||
t = Tree('a', [Tree('expr', ['1+2'])]) | |||
print(EvalExpressions().transform( t )) | |||
# Prints: Tree(a, [3]) | |||
``` | |||
All these classes implement the transformer interface: | |||
- Transformer - Recursively transforms the tree. This is the one you probably want. | |||
- Transformer_InPlace - Non-recursive. Changes the tree in-place instead of returning new instances | |||
- Transformer_InPlaceRecursive - Recursive. Changes the tree in-place instead of returning new instances | |||
### visit_tokens | |||
By default, transformers only visit rules. `visit_tokens=True` will tell Transformer to visit tokens as well. This is a slightly slower alternative to `lexer_callbacks`, but it's easier to maintain and works for all algorithms (even when there isn't a lexer). | |||
**Example:** | |||
```python | |||
class T(Transformer): | |||
INT = int | |||
NUMBER = float | |||
def NAME(self, name): | |||
return lookup_dict.get(name, name) | |||
T(visit_tokens=True).transform(tree) | |||
``` | |||
### v_args | |||
`v_args` is a decorator. | |||
By default, callback methods of transformers/visitors accept one argument: a list of the node's children. `v_args` can modify this behavior. | |||
When used on a transformer/visitor class definition, it applies to all the callback methods inside it. | |||
`v_args` accepts one of three flags: | |||
- `inline` - Children are provided as `*args` instead of a list argument (not recommended for very long lists). | |||
- `meta` - Provides two arguments: `children` and `meta` (instead of just the first) | |||
- `tree` - Provides the entire tree as the argument, instead of the children. | |||
**Examples:** | |||
```python | |||
@v_args(inline=True) | |||
class SolveArith(Transformer): | |||
def add(self, left, right): | |||
return left + right | |||
class ReverseNotation(Transformer_InPlace): | |||
@v_args(tree=True) | |||
def tree_node(self, tree): | |||
tree.children = tree.children[::-1] | |||
``` | |||
### `__default__` and `__default_token__` | |||
These are the functions that are called on if a function with a corresponding name has not been found. | |||
- The `__default__` method has the signature `(data, children, meta)`, with `data` being the data attribute of the node. It defaults to reconstruct the Tree | |||
- The `__default_token__` just takes the `Token` as an argument. It defaults to just return the argument. | |||
### Discard | |||
When raising the `Discard` exception in a transformer callback, that node is discarded and won't appear in the parent. | |||
@@ -0,0 +1,102 @@ | |||
Transformers & Visitors | |||
======================= | |||
Transformers & Visitors provide a convenient interface to process the | |||
parse-trees that Lark returns. | |||
They are used by inheriting from the correct class (visitor or transformer), | |||
and implementing methods corresponding to the rule you wish to process. Each | |||
method accepts the children as an argument. That can be modified using the | |||
``v_args`` decorator, which allows inlining the arguments (akin to ``*args``),
or add the tree ``meta`` property as an argument. | |||
See: `visitors.py`_ | |||
.. _visitors.py: https://github.com/lark-parser/lark/blob/master/lark/visitors.py | |||
Visitor | |||
------- | |||
Visitors visit each node of the tree, and run the appropriate method on it according to the node's data. | |||
They work bottom-up, starting with the leaves and ending at the root of the tree. | |||
There are two classes that implement the visitor interface: | |||
- ``Visitor``: Visit every node (without recursion) | |||
- ``Visitor_Recursive``: Visit every node using recursion. Slightly faster. | |||
Example: | |||
:: | |||
class IncreaseAllNumbers(Visitor): | |||
def number(self, tree): | |||
assert tree.data == "number" | |||
tree.children[0] += 1 | |||
IncreaseAllNumbers().visit(parse_tree) | |||
.. autoclass:: lark.visitors.Visitor | |||
.. autoclass:: lark.visitors.Visitor_Recursive | |||
Interpreter | |||
----------- | |||
.. autoclass:: lark.visitors.Interpreter | |||
Example: | |||
:: | |||
class IncreaseSomeOfTheNumbers(Interpreter): | |||
def number(self, tree): | |||
tree.children[0] += 1 | |||
def skip(self, tree): | |||
# skip this subtree. don't change any number node inside it. | |||
pass | |||
IncreaseSomeOfTheNumbers().visit(parse_tree) | |||
Transformer | |||
----------- | |||
.. autoclass:: lark.visitors.Transformer | |||
:members: __default__, __default_token__ | |||
Example: | |||
:: | |||
from lark import Tree, Transformer | |||
class EvalExpressions(Transformer): | |||
def expr(self, args): | |||
return eval(args[0]) | |||
t = Tree('a', [Tree('expr', ['1+2'])]) | |||
print(EvalExpressions().transform( t )) | |||
# Prints: Tree(a, [3]) | |||
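Transformers can be chained into a new transformer by using multiplication, so that the output of one becomes the input of the next. A small sketch, reusing the class above:

::

    chained = EvalExpressions() * EvalExpressions()
    chained.transform(t)  # same as EvalExpressions().transform(EvalExpressions().transform(t))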
By default, transformers only visit rules. ``visit_tokens=True`` will tell Transformer to visit tokens as well. Example:
:: | |||
class T(Transformer): | |||
INT = int | |||
NUMBER = float | |||
def NAME(self, name): | |||
return lookup_dict.get(name, name) | |||
T(visit_tokens=True).transform(tree) | |||
v_args | |||
------ | |||
.. autofunction:: lark.visitors.v_args | |||
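Example:

::

    @v_args(inline=True)
    class SolveArith(Transformer):
        def add(self, left, right):
            return left + right


    class ReverseNotation(Transformer_InPlace):
        @v_args(tree=True)
        def tree_node(self, tree):
            tree.children = tree.children[::-1]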
Discard | |||
------- | |||
.. autoclass:: lark.visitors.Discard |
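Example (a minimal sketch; the ``comment`` rule name is illustrative):

::

    from lark.visitors import Discard, Transformer

    class StripComments(Transformer):
        def comment(self, children):
            raise Discard  # this node won't appear in the parent's children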
@@ -24,9 +24,25 @@ class UnexpectedEOF(ParseError): | |||
class UnexpectedInput(LarkError): | |||
"""UnexpectedInput Error. | |||
Used as a base class for the following exceptions: | |||
- ``UnexpectedToken``: The parser received an unexpected token
- ``UnexpectedCharacters``: The lexer encountered an unexpected string | |||
After catching one of these exceptions, you may call the following helper methods to create a nicer error message. | |||
""" | |||
pos_in_stream = None | |||
def get_context(self, text, span=40): | |||
"""Returns a pretty string pinpointing the error in the text, | |||
with span amount of context characters around it. | |||
Note: | |||
The parser doesn't hold a copy of the text it has to parse, | |||
so you have to provide it again | |||
""" | |||
pos = self.pos_in_stream | |||
start = max(pos - span, 0) | |||
end = pos + span | |||
@@ -40,11 +56,22 @@ class UnexpectedInput(LarkError): | |||
return (before + after + b'\n' + b' ' * len(before) + b'^\n').decode("ascii", "backslashreplace") | |||
def match_examples(self, parse_fn, examples, token_type_match_fallback=False, use_accepts=False): | |||
""" Given a parser instance and a dictionary mapping some label with | |||
some malformed syntax examples, it'll return the label for the | |||
example that bests matches the current error. | |||
It's recommended to call this with `use_accepts=True`. The default is False for backwards compatibility. | |||
"""Allows you to detect what's wrong in the input text by matching | |||
against example errors. | |||
Given a parser instance and a dictionary mapping some label with | |||
some malformed syntax examples, it'll return the label for the | |||
example that best matches the current error. The function will
iterate the dictionary until it finds a matching error, and | |||
return the corresponding value. | |||
For an example usage, see `examples/error_reporting_lalr.py` | |||
Parameters: | |||
parse_fn: parse function (usually ``lark_instance.parse``) | |||
examples: dictionary of ``{'example_string': value}``. | |||
use_accepts: Recommended to call this with ``use_accepts=True``. | |||
The default is ``False`` for backwards compatibility. | |||
""" | |||
assert self.state is not None, "Not supported for this exception" | |||
@@ -109,8 +136,13 @@ class UnexpectedCharacters(LexError, UnexpectedInput): | |||
super(UnexpectedCharacters, self).__init__(message) | |||
class UnexpectedToken(ParseError, UnexpectedInput): | |||
"""When the parser throws UnexpectedToken, it instanciates a puppet | |||
with its internal state. Users can then interactively set the puppet to | |||
the desired puppet state, and resume regular parsing. | |||
see: :ref:`ParserPuppet`. | |||
""" | |||
def __init__(self, token, expected, considered_rules=None, state=None, puppet=None): | |||
self.line = getattr(token, 'line', '?') | |||
self.column = getattr(token, 'column', '?') | |||
@@ -132,6 +164,7 @@ class UnexpectedToken(ParseError, UnexpectedInput): | |||
super(UnexpectedToken, self).__init__(message) | |||
class VisitError(LarkError): | |||
"""VisitError is raised when visitors are interrupted by an exception | |||
@@ -27,63 +27,67 @@ class LarkOptions(Serialize): | |||
""" | |||
OPTIONS_DOC = """ | |||
# General | |||
start - The start symbol. Either a string, or a list of strings for | |||
multiple possible starts (Default: "start") | |||
debug - Display debug information, such as warnings (default: False) | |||
transformer - Applies the transformer to every parse tree (equivlent to | |||
applying it after the parse, but faster) | |||
propagate_positions - Propagates (line, column, end_line, end_column) | |||
attributes into all tree branches. | |||
maybe_placeholders - When True, the `[]` operator returns `None` when not matched. | |||
When `False`, `[]` behaves like the `?` operator, | |||
and returns no value at all. | |||
(default=`False`. Recommended to set to `True`) | |||
regex - When True, uses the `regex` module instead of the stdlib `re`. | |||
cache - Cache the results of the Lark grammar analysis, for x2 to x3 faster loading. | |||
LALR only for now. | |||
When `False`, does nothing (default) | |||
When `True`, caches to a temporary file in the local directory | |||
When given a string, caches to the path pointed by the string | |||
g_regex_flags - Flags that are applied to all terminals | |||
(both regex and strings) | |||
keep_all_tokens - Prevent the tree builder from automagically | |||
removing "punctuation" tokens (default: False) | |||
# Algorithm | |||
parser - Decides which parser engine to use | |||
Accepts "earley" or "lalr". (Default: "earley") | |||
(there is also a "cyk" option for legacy) | |||
lexer - Decides whether or not to use a lexer stage | |||
"auto" (default): Choose for me based on the parser | |||
"standard": Use a standard lexer | |||
"contextual": Stronger lexer (only works with parser="lalr") | |||
"dynamic": Flexible and powerful (only with parser="earley") | |||
"dynamic_complete": Same as dynamic, but tries *every* variation | |||
of tokenizing possible. | |||
ambiguity - Decides how to handle ambiguity in the parse. | |||
Only relevant if parser="earley" | |||
"resolve": The parser will automatically choose the simplest | |||
derivation (it chooses consistently: greedy for | |||
tokens, non-greedy for rules) | |||
"explicit": The parser will return all derivations wrapped | |||
in "_ambig" tree nodes (i.e. a forest). | |||
# Domain Specific | |||
postlex - Lexer post-processing (Default: None) Only works with the | |||
standard and contextual lexers. | |||
priority - How priorities should be evaluated - auto, none, normal, | |||
invert (Default: auto) | |||
lexer_callbacks - Dictionary of callbacks for the lexer. May alter | |||
tokens during lexing. Use with caution. | |||
use_bytes - Accept an input of type `bytes` instead of `str` (Python 3 only). | |||
edit_terminals - A callback | |||
**=== General Options ===** | |||
start | |||
The start symbol. Either a string, or a list of strings for multiple possible starts (Default: "start") | |||
debug | |||
Display debug information, such as warnings (default: False) | |||
transformer | |||
Applies the transformer to every parse tree (equivalent to applying it after the parse, but faster)
propagate_positions | |||
Propagates (line, column, end_line, end_column) attributes into all tree branches. | |||
maybe_placeholders | |||
When True, the ``[]`` operator returns ``None`` when not matched. | |||
When ``False``, ``[]`` behaves like the ``?`` operator, and returns no value at all. | |||
(default= ``False``. Recommended to set to ``True``) | |||
regex | |||
When True, uses the ``regex`` module instead of the stdlib ``re``. | |||
cache | |||
Cache the results of the Lark grammar analysis, for x2 to x3 faster loading. LALR only for now. | |||
- When ``False``, does nothing (default) | |||
- When ``True``, caches to a temporary file in the local directory | |||
- When given a string, caches to the path pointed by the string | |||
g_regex_flags | |||
Flags that are applied to all terminals (both regex and strings) | |||
keep_all_tokens | |||
Prevent the tree builder from automagically removing "punctuation" tokens (default: False) | |||
**=== Algorithm Options ===** | |||
parser | |||
Decides which parser engine to use. Accepts "earley" or "lalr". (Default: "earley"). | |||
(there is also a "cyk" option for legacy) | |||
lexer | |||
Decides whether or not to use a lexer stage | |||
- "auto" (default): Choose for me based on the parser | |||
- "standard": Use a standard lexer | |||
- "contextual": Stronger lexer (only works with parser="lalr") | |||
- "dynamic": Flexible and powerful (only with parser="earley") | |||
- "dynamic_complete": Same as dynamic, but tries *every* variation of tokenizing possible. | |||
ambiguity | |||
Decides how to handle ambiguity in the parse. Only relevant if parser="earley" | |||
- "resolve" - The parser will automatically choose the simplest derivation | |||
(it chooses consistently: greedy for tokens, non-greedy for rules) | |||
- "explicit": The parser will return all derivations wrapped in "_ambig" tree nodes (i.e. a forest). | |||
**=== Misc. / Domain Specific Options ===** | |||
postlex | |||
Lexer post-processing (Default: None). Only works with the standard and contextual lexers.
priority | |||
How priorities should be evaluated - auto, none, normal, invert (Default: auto) | |||
lexer_callbacks | |||
Dictionary of callbacks for the lexer. May alter tokens during lexing. Use with caution. | |||
use_bytes | |||
Accept an input of type ``bytes`` instead of ``str`` (Python 3 only). | |||
edit_terminals | |||
A callback for editing the terminals before parsing.
""" | |||
if __doc__: | |||
__doc__ += OPTIONS_DOC | |||
@@ -156,12 +160,19 @@ class LarkOptions(Serialize): | |||
class Lark(Serialize): | |||
def __init__(self, grammar, **options): | |||
""" | |||
grammar : a string or file-object containing the grammar spec (using Lark's ebnf syntax) | |||
options : a dictionary controlling various aspects of Lark. | |||
""" | |||
"""Main interface for the library. | |||
It's mostly a thin wrapper for the many different parsers, and for the tree constructor. | |||
Parameters: | |||
grammar: a string or file-object containing the grammar spec (using Lark's EBNF syntax)
options: a dictionary controlling various aspects of Lark. | |||
Example: | |||
>>> Lark(r'''start: "foo" ''') | |||
Lark(...) | |||
""" | |||
def __init__(self, grammar, **options): | |||
self.options = LarkOptions(options) | |||
# Set regex or re module | |||
@@ -295,8 +306,8 @@ class Lark(Serialize): | |||
with FS.open(cache_fn, 'wb') as f: | |||
self.save(f) | |||
if __init__.__doc__: | |||
__init__.__doc__ += "\nOptions:\n" + LarkOptions.OPTIONS_DOC | |||
# TODO: merge with above | |||
__doc__ += "\n\n" + LarkOptions.OPTIONS_DOC | |||
__serialize_fields__ = 'parser', 'rules', 'options' | |||
@@ -314,11 +325,19 @@ class Lark(Serialize): | |||
return self.parser_class(self.lexer_conf, parser_conf, options=self.options) | |||
def save(self, f): | |||
"""Saves the instance into the given file object | |||
Useful for caching and multiprocessing. | |||
""" | |||
data, m = self.memo_serialize([TerminalDef, Rule]) | |||
pickle.dump({'data': data, 'memo': m}, f) | |||
@classmethod | |||
def load(cls, f): | |||
"""Loads an instance from the given file object | |||
Useful for caching and multiprocessing. | |||
""" | |||
inst = cls.__new__(cls) | |||
return inst._load(f) | |||
@@ -361,7 +380,7 @@ class Lark(Serialize): | |||
def open(cls, grammar_filename, rel_to=None, **options): | |||
"""Create an instance of Lark with the grammar given by its filename | |||
If rel_to is provided, the function will find the grammar filename in relation to it. | |||
If ``rel_to`` is provided, the function will find the grammar filename in relation to it. | |||
Example: | |||
@@ -396,11 +415,17 @@ class Lark(Serialize): | |||
"""Parse the given text, according to the options provided. | |||
Parameters: | |||
start: str - required if Lark was given multiple possible start symbols (using the start option). | |||
on_error: function - if provided, will be called on UnexpectedToken error. Return true to resume parsing. LALR only. | |||
text (str): Text to be parsed. | |||
start (str, optional): Required if Lark was given multiple possible start symbols (using the start option). | |||
on_error (function, optional): if provided, will be called on UnexpectedToken error. Return true to resume parsing. | |||
LALR only. See examples/error_puppet.py for an example of how to use on_error. | |||
Returns: | |||
If a transformer is supplied to ``__init__``, returns whatever is the | |||
result of the transformation. Otherwise, returns a Tree instance. | |||
Returns a tree, unless specified otherwise. | |||
""" | |||
try: | |||
return self.parser.parse(text, start=start) | |||
except UnexpectedToken as e: | |||
@@ -90,6 +90,25 @@ class TerminalDef(Serialize): | |||
class Token(Str): | |||
"""Token of a lexer. | |||
When using a lexer, the resulting tokens in the trees will be of the | |||
Token class, which inherits from Python's string. So, normal string | |||
comparisons and operations will work as expected. Tokens also have other | |||
useful attributes. | |||
Attributes: | |||
type_: Name of the token (as specified in grammar) | |||
pos_in_stream: The index of the token in the text | |||
line: The line of the token in the text (starting with 1) | |||
column: The column of the token in the text (starting with 1) | |||
end_line: The line where the token ends | |||
end_column: The next column after the end of the token. For example, | |||
if the token is a single character with a column value of 4, | |||
end_column will be 5. | |||
end_pos: the index where the token ends (basically pos_in_stream + | |||
len(token)) | |||
""" | |||
__slots__ = ('type', 'pos_in_stream', 'value', 'line', 'column', 'end_line', 'end_column', 'end_pos') | |||
def __new__(cls, type_, value, pos_in_stream=None, line=None, column=None, end_line=None, end_column=None, end_pos=None): | |||
@@ -7,6 +7,10 @@ from .. import Token | |||
class ParserPuppet(object): | |||
"""ParserPuppet gives you advanced control over error handling when parsing with LALR. | |||
For a simpler, more streamlined interface, see the ``on_error`` argument to ``Lark.parse()``. | |||
""" | |||
def __init__(self, parser, state_stack, value_stack, start, stream, set_state): | |||
self.parser = parser | |||
self._state_stack = state_stack | |||
@@ -18,8 +22,9 @@ class ParserPuppet(object): | |||
self.result = None | |||
def feed_token(self, token): | |||
"""Advance the parser state, as if it just received `token` from the lexer | |||
"""Feed the parser with a token, and advance it to the next state, as if it recieved it from the lexer. | |||
Note that ``token`` has to be an instance of ``Token``. | |||
""" | |||
end_state = self.parser.parse_table.end_states[self._start] | |||
state_stack = self._state_stack | |||
@@ -59,6 +64,10 @@ class ParserPuppet(object): | |||
value_stack.append(token) | |||
def copy(self): | |||
"""Create a new puppet with a separate state. | |||
Calls to feed_token() won't affect the old puppet, and vice-versa. | |||
""" | |||
return type(self)( | |||
self.parser, | |||
list(self._state_stack), | |||
@@ -80,6 +89,7 @@ class ParserPuppet(object): | |||
) | |||
def pretty(self): | |||
"""Print the output of ``choices()`` in a way that's easier to read.""" | |||
out = ["Puppet choices:"] | |||
for k, v in self.choices().items(): | |||
out.append('\t- %s -> %s' % (k, v)) | |||
@@ -87,6 +97,12 @@ class ParserPuppet(object): | |||
return '\n'.join(out) | |||
def choices(self): | |||
"""Returns a dictionary of token types, matched to their action in the parser. | |||
Only returns token types that are accepted by the current state. | |||
Updated by ``feed_token()``. | |||
""" | |||
return self.parser.parse_table.states[self._state_stack[-1]] | |||
def accepts(self): | |||
@@ -102,4 +118,8 @@ class ParserPuppet(object): | |||
return accepts | |||
def resume_parse(self): | |||
return self.parser.parse(self._stream, self._start, self._set_state, self._value_stack, self._state_stack) | |||
"""Resume parsing from the current puppet state.""" | |||
return self.parser.parse( | |||
self._stream, self._start, self._set_state, | |||
self._value_stack, self._state_stack | |||
) |
@@ -30,6 +30,7 @@ from io import open | |||
import codecs | |||
import sys | |||
import token, tokenize | |||
import os | |||
from pprint import pprint | |||
from os import path | |||
@@ -84,6 +85,37 @@ def extract_sections(lines): | |||
return {name:''.join(text) for name, text in sections.items()} | |||
def strip_docstrings(line_gen): | |||
""" Strip comments and docstrings from a file. | |||
Based on code from: https://stackoverflow.com/questions/1769332/script-to-remove-python-comments-docstrings | |||
""" | |||
res = [] | |||
prev_toktype = token.INDENT | |||
last_lineno = -1 | |||
last_col = 0 | |||
tokgen = tokenize.generate_tokens(line_gen) | |||
for toktype, ttext, (slineno, scol), (elineno, ecol), ltext in tokgen: | |||
if slineno > last_lineno: | |||
last_col = 0 | |||
if scol > last_col: | |||
res.append(" " * (scol - last_col)) | |||
if toktype == token.STRING and prev_toktype == token.INDENT: | |||
# Docstring | |||
res.append("#--") | |||
elif toktype == tokenize.COMMENT: | |||
# Comment | |||
res.append("##\n") | |||
else: | |||
res.append(ttext) | |||
prev_toktype = toktype | |||
last_col = ecol | |||
last_lineno = elineno | |||
return ''.join(res) | |||
def main(fobj, start):
    lark_inst = Lark(fobj, parser="lalr", lexer="contextual", start=start)
@@ -91,9 +123,12 @@ def main(fobj, start):
    print('__version__ = "%s"' % lark.__version__)
    print()

    for pyfile in EXTRACT_STANDALONE_FILES:
    for i, pyfile in enumerate(EXTRACT_STANDALONE_FILES):
        with open(os.path.join(_larkdir, pyfile)) as f:
            print (extract_sections(f)['standalone'])
            code = extract_sections(f)['standalone']
            if i:   # every file except the first, which keeps its docstrings
                code = strip_docstrings(iter(code.splitlines(True)).__next__)
            print(code)

    data, m = lark_inst.memo_serialize([TerminalDef, Rule])
    print( 'DATA = (' )
@@ -14,7 +14,19 @@ class Meta:
    def __init__(self):
        self.empty = True


class Tree(object):
    """The main tree class.

    Creates a new tree, and stores "data" and "children" in attributes of the same name.
    Trees can be hashed and compared.

    Parameters:
        data: The name of the rule or alias
        children: List of matched sub-rules and terminals
        meta: Line & Column numbers (if ``propagate_positions`` is enabled).
            meta attributes: line, column, start_pos, end_line, end_column, end_pos
    """
    def __init__(self, data, children, meta=None):
        self.data = data
        self.children = children
@@ -46,6 +58,10 @@ class Tree(object):
        return l

    def pretty(self, indent_str='  '):
        """Returns an indented string representation of the tree.

        Great for debugging.
        """
        return ''.join(self._pretty(0, indent_str))
    def __eq__(self, other):
@@ -61,6 +77,10 @@ class Tree(object):
        return hash((self.data, tuple(self.children)))

    def iter_subtrees(self):
        """Depth-first iteration.

        Iterates over all the subtrees, never returning to the same node twice (Lark's parse-tree is actually a DAG).
        """
        queue = [self]
        subtrees = OrderedDict()
        for subtree in queue:
@@ -72,11 +92,11 @@ class Tree(object):
        return reversed(list(subtrees.values()))
    def find_pred(self, pred):
        "Find all nodes where pred(tree) == True"
        """Returns all nodes of the tree for which ``pred(node)`` is true."""
        return filter(pred, self.iter_subtrees())

    def find_data(self, data):
        "Find all nodes where tree.data == data"
        """Returns all nodes of the tree whose data equals the given data."""
        return self.find_pred(lambda t: t.data == data)
###}
@@ -97,6 +117,10 @@ class Tree(object):
            yield c

    def iter_subtrees_topdown(self):
        """Top-down (pre-order) iteration.

        Iterates over all the subtrees, returning nodes in order like ``pretty()`` does.
        """
        stack = [self]
        while stack:
            node = stack.pop()
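To make the new docstrings concrete, here is a small sketch of the Tree API in use; the tree is built by hand rather than produced by a parser:

```python
from lark import Tree, Token

tree = Tree('start', [
    Tree('word', [Token('WORD', 'hello')]),
    Tree('word', [Token('WORD', 'world')]),
])

print(tree.pretty())                            # indented, human-readable dump
assert len(list(tree.find_data('word'))) == 2   # nodes whose data == 'word'
assert len(list(tree.iter_subtrees())) == 3     # 'start' plus the two 'word' nodes
```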
@@ -9,6 +9,9 @@ from .lexer import Token
from inspect import getmembers, getmro


class Discard(Exception):
    """When raising the Discard exception in a transformer callback,
    that node is discarded and won't appear in the parent.
    """
    pass
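A short sketch of ``Discard`` in action, with an invented grammar (not part of this diff); the discarded ``item`` node disappears from its parent's children:

```python
from lark import Lark
from lark.visitors import Transformer, Discard

parser = Lark(r'''
    start: item+
    item: WORD | COMMENT
    COMMENT: "#" /[^\n]*/
    %import common.WORD
    %import common.WS
    %ignore WS
''', parser='lalr')

class DropComments(Transformer):
    def item(self, children):
        (tok,) = children
        if tok.type == 'COMMENT':
            raise Discard   # this `item` vanishes from `start`'s children
        return tok

tree = DropComments().transform(parser.parse("foo # a comment"))
assert [str(t) for t in tree.children] == ["foo"]
```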
# Transformers
@@ -42,12 +45,31 @@ class _Decoratable:

class Transformer(_Decoratable):
    """Visits the tree recursively, starting with the leaves and finally the root (bottom-up)
    """Transformers visit each node of the tree, and run the appropriate method on it according to the node's data.

    Calls its methods (provided by user via inheritance) according to tree.data
    Calls its methods (provided by user via inheritance) according to ``tree.data``.
    The returned value replaces the old one in the structure.

    Can be used to implement map or reduce.
    They work bottom-up (or depth-first), starting with the leaves and ending at the root of the tree.
    Because nodes are reduced from leaf to root, at any point the callbacks may assume
    the children have already been transformed (if applicable).

    ``Transformer`` can do anything ``Visitor`` can do, but because it reconstructs the tree,
    it is slightly less efficient. It can be used to implement map or reduce patterns.

    All these classes implement the transformer interface:

    - ``Transformer`` - Recursively transforms the tree. This is the one you probably want.
    - ``Transformer_InPlace`` - Non-recursive. Changes the tree in-place instead of returning new instances.
    - ``Transformer_InPlaceRecursive`` - Recursive. Changes the tree in-place instead of returning new instances.

    Parameters:
        visit_tokens: By default, transformers only visit rules.
            Setting ``visit_tokens=True`` tells ``Transformer`` to visit tokens as well.
            This is a slightly slower alternative to ``lexer_callbacks``, but it's easier
            to maintain and works for all algorithms (even when there isn't a lexer).
    """
    __visit_tokens__ = True   # For backwards compatibility
@@ -110,11 +132,19 @@ class Transformer(_Decoratable):
        return TransformerChain(self, other)

    def __default__(self, data, children, meta):
        "Default operation on tree (for override)"
        """Default operation on tree (for override).

        Called if no method with a corresponding name was found.
        Defaults to reconstructing the Tree.
        """
        return Tree(data, children, meta)

    def __default_token__(self, token):
        "Default operation on token (for override)"
        """Default operation on token (for override).

        Called if no method with a corresponding name was found.
        Defaults to returning the argument unchanged.
        """
        return token
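For illustration, a minimal end-to-end ``Transformer``; the grammar and names are invented for this sketch:

```python
from lark import Lark, Transformer

parser = Lark(r'''
    start: NUMBER ("+" NUMBER)*
    %import common.NUMBER
    %import common.WS
    %ignore WS
''', parser='lalr')

class SumNumbers(Transformer):
    def start(self, children):
        # Bottom-up: any sub-rules would already be transformed by now;
        # the anonymous "+" tokens are filtered out, leaving only NUMBERs.
        return sum(int(tok) for tok in children)

assert SumNumbers().transform(parser.parse("1 + 2 + 3")) == 6
```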
@@ -211,10 +241,10 @@ class VisitorBase:

class Visitor(VisitorBase):
    """Bottom-up visitor, non-recursive
    """Bottom-up visitor, non-recursive.

    Visits the tree, starting with the leaves and finally the root (bottom-up)
    Calls its methods (provided by user via inheritance) according to tree.data
    Calls its methods (provided by user via inheritance) according to ``tree.data``
    """

    def visit(self, tree):
@@ -227,11 +257,12 @@ class Visitor(VisitorBase):
            self._call_userfunc(subtree)
        return tree
class Visitor_Recursive(VisitorBase):
    """Bottom-up visitor, recursive
    """Bottom-up visitor, recursive.

    Visits the tree, starting with the leaves and finally the root (bottom-up)
    Calls its methods (provided by user via inheritance) according to tree.data
    Calls its methods (provided by user via inheritance) according to ``tree.data``
    """

    def visit(self, tree):
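A small usage sketch for the visitor classes; the grammar is invented, and the visitor works purely by side effect:

```python
from lark import Lark, Visitor

parser = Lark(r'''
    start: pair+
    pair: WORD ":" WORD
    %import common.WORD
    %import common.WS
    %ignore WS
''', parser='lalr')

class CountPairs(Visitor):
    def __init__(self):
        self.count = 0
    def pair(self, tree):
        # Called once per `pair` node; nothing is returned or rebuilt.
        self.count += 1

v = CountPairs()
v.visit(parser.parse("a:b c:d"))
assert v.count == 2
```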
@@ -263,13 +294,15 @@ def visit_children_decor(func):

class Interpreter(_Decoratable):
    """Top-down visitor, recursive
    """Interpreter walks the tree starting at the root.

    Visits the tree, starting with the root and finally the leaves (top-down)
    Calls its methods (provided by user via inheritance) according to tree.data
    Unlike Transformer and Visitor, the Interpreter doesn't automatically visit its sub-branches.
    The user has to explicitly call visit, visit_children, or use the @visit_children_decor
    For each tree node, it calls its methods (provided by user via inheritance) according to ``tree.data``.

    Unlike ``Transformer`` and ``Visitor``, the Interpreter doesn't automatically visit its sub-branches.
    The user has to explicitly call ``visit``, ``visit_children``, or use the ``@visit_children_decor`` decorator.
    This allows the user to implement branching and loops.
    """

    def visit(self, tree):
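Because the Interpreter only descends where told, branching falls out naturally. A hedged sketch with an invented grammar:

```python
from lark import Lark
from lark.visitors import Interpreter

parser = Lark(r'''
    start: "if" cond "then" action
    cond: WORD
    action: WORD
    %import common.WORD
    %import common.WS
    %ignore WS
''', parser='lalr')

class IfWalker(Interpreter):
    def start(self, tree):
        cond, action = tree.children
        # Sub-branches are not visited automatically; we decide whether
        # to descend, which is what makes branching and loops possible.
        return self.visit(action) if self.visit(cond) else None
    def cond(self, tree):
        return str(tree.children[0]) == "yes"
    def action(self, tree):
        return str(tree.children[0])

assert IfWalker().visit(parser.parse("if yes then go")) == "go"
```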
@@ -352,8 +385,34 @@ def _vargs_meta(f, data, children, meta):

def _vargs_tree(f, data, children, meta):
    return f(Tree(data, children, meta))


def v_args(inline=False, meta=False, tree=False, wrapper=None):
    "A convenience decorator factory, for modifying the behavior of user-supplied visitor methods"
    """A convenience decorator factory for modifying the behavior of user-supplied visitor methods.

    By default, callback methods of transformers/visitors accept one argument: a list of the node's children.
    ``v_args`` can modify this behavior. When used on a transformer/visitor class definition,
    it applies to all the callback methods inside it.

    Parameters:
        inline: Children are provided as ``*args`` instead of a list argument (not recommended for very long lists).
        meta: Provides two arguments: ``children`` and ``meta`` (instead of just the first)
        tree: Provides the entire tree as the argument, instead of the children.

    Example:
        ::

            @v_args(inline=True)
            class SolveArith(Transformer):
                def add(self, left, right):
                    return left + right


            class ReverseNotation(Transformer_InPlace):
                @v_args(tree=True)
                def tree_node(self, tree):
                    tree.children = tree.children[::-1]
    """
    if tree and (meta or inline):
        raise ValueError("Visitor functions cannot combine 'tree' with 'meta' or 'inline'.")
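A runnable variant of the docstring's example, under the same assumptions (invented grammar):

```python
from lark import Lark, Transformer, v_args

parser = Lark(r'''
    start: NUMBER "+" NUMBER
    %import common.NUMBER
    %import common.WS
    %ignore WS
''', parser='lalr')

@v_args(inline=True)
class Add(Transformer):
    def start(self, left, right):
        # inline=True unpacks the children list into positional arguments.
        return int(left) + int(right)

assert Add().transform(parser.parse("2 + 3")) == 5
```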
@@ -1,16 +0,0 @@
site_name: Lark
theme: readthedocs
pages:
  - Main Page: index.md
  - Philosophy: philosophy.md
  - Features: features.md
  - Parsers: parsers.md
  - How To Use (Guide): how_to_use.md
  - How To Develop (Guide): how_to_develop.md
  - Grammar Reference: grammar.md
  - Tree Construction Reference: tree_construction.md
  - Visitors and Transformers: visitors.md
  - Classes Reference: classes.md
  - Recipes: recipes.md
  - Import grammars from Nearley: nearley.md
  - Tutorial - JSON Parser: json_tutorial.md
@@ -1,10 +1,7 @@
version: 2

mkdocs:
  configuration: mkdocs.yml
  fail_on_warning: false

formats: all

python:
  version: 3.5

# Build documentation in the docs/ directory with Sphinx
sphinx:
  configuration: docs/conf.py