diff --git a/README.md b/README.md
index 5a30750..0c09541 100644
--- a/README.md
+++ b/README.md
@@ -65,6 +65,8 @@ These features are planned to be implemented in the near future:
 ## Comparison to other parsers
+This is a feature comparison. For benchmarks vs pyparsing, check out the [JSON tutorial](/docs/json_tutorial.md#conclusion).
+
 | Library | Algorithm | LOC | Grammar | Builds AST
 |:--------|:----------|:----|:--------|:------------
 | Lark | Earley/LALR(1) | 0.5K | EBNF+ | Yes! |
diff --git a/docs/json_tutorial.md b/docs/json_tutorial.md
index 01c733e..a3d19da 100644
--- a/docs/json_tutorial.md
+++ b/docs/json_tutorial.md
@@ -27,7 +27,7 @@ Lark accepts its grammars in a format called [EBNF](https://www.wikiwand.com/en/
 (*a token is a string or a regular expression*)
-How to structure those rules is beyond the scope of this tutorial, but it's often enough to follow one's intuition.
+How to structure those rules is beyond the scope of this tutorial, but often it's enough to follow one's intuition.
 In the case of JSON, the structure is simple: A json document is either a list, or a dictionary, or a string/number/etc.
@@ -37,14 +37,14 @@ Let's write this structure in EBNF form:
     value: dict
          | list
-         | string
-         | number
+         | STRING
+         | NUMBER
          | "true" | "false" | "null"
     list : "[" [value ("," value)*] "]"
     dict : "{" [pair ("," pair)*] "}"
-    pair : string ":" value
+    pair : STRING ":" value
 A quick explanation of the syntax:
@@ -52,17 +52,19 @@ A quick explanation of the syntax:
 - rule\* means *any amount*. That means, zero or more instances of that rule.
 - [rule] means *optional*. That means zero or one instance of that rule.
-Lark also supports the rule+ operator, meaning one or more instances.
+Lark also supports the rule+ operator, meaning one or more instances. It also supports the rule? operator which is another way to say *optional*.
-Of course, we still haven't defined "string" and "number".
+Of course, we still haven't defined "STRING" and "NUMBER".
 We'll do that now, and also take care of the white-space, which is part of the text.
-    number : /-?\d+(\.\d+)?([eE][+-]?\d+)?/
-    string : /".*?(?
 >>> text = '{"key": ["item0", "item1", 3.14]}'
 >>> json_parser.parse(text)
-Tree(value, [Tree(dict, [Tree(pair, [Tree(string, [Token(ANONRE_1, "key")]), Tree(value, [Tree(list, [Tree(value, [Tree(string, [Token(ANONRE_1, "item0")])]), Tree(value, [Tree(string, [Token(ANONRE_1, "item1")])]), Tree(value, [Tree(number, [Token(ANONRE_0, 3.14)])])])])])])])
+Tree(value, [Tree(dict, [Tree(pair, [Token(STRING, "key"), Tree(value, [Tree(list, [Tree(value, [Token(STRING, "item0")]), Tree(value, [Token(STRING, "item1")]), Tree(value, [Token(NUMBER, 3.14)])])])])])])
 >>> print( _.pretty() )
 value
   dict
     pair
-      string "key"
+      "key"
       value
         list
-          value
-            string "item0"
-          value
-            string "item1"
-          value
-            number 3.14
+          value "item0"
+          value "item1"
+          value 3.14
 ```
 As promised, Lark automagically creates a tree that represents the parsed text.
@@ -125,8 +124,6 @@ Lark automatically filters out tokens from the tree, based on the following crit
 - Filter out string tokens without a name, or with a name that starts with an underscore.
 - Keep regex tokens, even unnamed ones, unless their name starts with an underscore.
-This tutorial won't give an example of named tokens, but you can find such use in the [calculator example](/examples/calc.py).
-
 Unfortunately, this means that it will also filter out tokens like "true" and "false", and we will lose that information. The next section, "Shaping the tree" deals with this issue, and others.
 ## Part 3 - Shaping the Tree
@@ -146,10 +143,17 @@ I'll present the solution, and then explain it:
           | "false" -> false
           | "null" -> null
+    ...
+
+    number : /-?\d+(\.\d+)?([eE][+-]?\d+)?/
+    string : /".*?(?