2015-10-02 to 2015-10-03
Luke Imhoff
Kronic.Deth@gmail.com @KronicDeth
Hello, my name is Luke Imhoff. I am the maintainer of intellij-elixir, the Elixir plugin for JetBrains IDEs, including IntelliJ IDEA itself and RubyMine. I have contributed to the Elixir standard library and found bugs in the native tokenizer and parser through my work on intellij-elixir. I also help run the Austin Elixir meetup.
For viewers who want to follow along with their own copy of the slides or the project source, both can be accessed at the shown addresses.
People may wonder why I took it upon myself to make an IDE plugin for Elixir, when I could have used a vim or emacs plugin.
I've used both emacs and vim. I started with emacs when I worked at Cray, only switching to vim when I needed something that worked over a high-latency, low-bandwidth CDMA modem on the highway.
I started using RubyMine when my boss at a previous job, Nicholas Cancelliere, introduced me to it. I was shocked that an IDE for a dynamic language like Ruby could support Find Usages, Go To Definition, and Refactor. I had been used to using ctags with vim.
I haven't completely abandoned vim either. I still use syntax highlighting plugins for vim in iTerm when I need to edit configuration files, and I use the IdeaVim plugin for JetBrains IDEs like RubyMine, WebStorm, and IntelliJ.
Without RubyMine, I don't think I'd have been able to dive through the complex code-base of Metasploit as quickly, and the graphical debugger allowed me to teach myself the internals of Rails and other DSLs.
I understood that many of the features I liked about RubyMine were shared across JetBrains' various IDEs, so if I could write the parts of a plugin that teach the JetBrains APIs to understand Elixir syntax and semantics, the features I really wanted would just work, without me having to understand how to write the cross-language parts.
However, just getting syntax lexing and parsing right ended up taking a year...
It was exactly 1 year between the initial commit of the project skeleton and the v1.0.0 tag.
<variable>
::=  is defined as
|    or

<expr> ::= <integer> | <expr> <op> <integer>
One of the standard ways of defining a syntax is in BNF, or Backus-Naur Form, which was first used for the Algol 60 standard.
Both YECC and Grammar Kit use a form of BNF, so I assumed it was just a matter of porting elixir dot Y-R-L (elixir.yrl) to Elixir dot B-N-F (Elixir.bnf).
grammar -> eoe : nil.
grammar -> expr_list : to_block('$1').
grammar -> eoe expr_list : to_block('$2').
grammar -> expr_list eoe : to_block('$1').
grammar -> eoe expr_list eoe : to_block('$2').
grammar -> '$empty' : nil.
YECC is a parser generator written in Erlang (and part of the standard distribution) that is based on yacc (with an 'a' instead of an 'e'), which is a parser generator written in C.
The YECC syntax differs from B-N-F in that it uses a skinny arrow (->) instead of colon colon equals (::=), and instead of using pipe (|) for OR, lines with the same rule name are repeated with alternative definitions. Finally, YECC supports running Erlang code on the tokens, using dollar number ($n) for positional references to the matched tokens.
Dollar empty ($empty) is a special token that matches no input. In formal grammars this is usually referred to as lowercase epsilon (ɛ).
private elixirFile ::= endOfExpression? (expressionList endOfExpression?)?
private expressionList ::= expression (endOfExpression expression | adjacentExpression)*
Grammar Kit is a parser generator written in Java and created by JetBrains.
Grammar Kit's B-N-F format does use colon colon equals (::=) like Backus-Naur Form, but it has some more powerful constructs beyond pipe (|) for OR.
Question mark (?) can be used for 0 or 1, parentheses (()) can be used for grouping, and star (*) can be used for 0 or more. Empty can be implied by question mark (?) or star (*) matching nothing, or by there being nothing on the right-hand-side of the colon colon equals (::=).
You'll also notice that there is no inline Java code after the rule definition, unlike in YECC where there was Erlang code. This is because Grammar Kit automatically generates an AST, which it calls a PSI (or Program Structure Interface) tree, from the matched rules, so there's no need to define how to build the tree.
Having Grammar Kit generate the AST is both good and bad. It's good because it removes a lot of redundant code, but it's bad because rules must evaluate in the correct order to reflect the desired nesting and associativity without any manual fixups, which are possible with the Erlang code in YECC.
YECC and Grammar Kit both support a form of BNF, and Grammar Kit seems like it would support an even more compact grammar because question mark (?), star (*), pipe (|) and parentheses (()) could eliminate some of the redundancy needed in the multi-clause rules in YECC.
After only 6 days I had quote unquote translated the BNF from YECC to Grammar Kit and ended up with a parser that froze the IDE, which Julius "h4cc" Beckmann reported. That left the slow process of translating the grammar correctly over the next 359 days. At the time of the freeze, I didn't even really understand why translating the BNF directly didn't work; all I knew was I needed to go slower and build the grammar up from simpler, testable pieces.
So, I went back and actually started to do the JetBrains tutorial step by step instead of jumping ahead, and searched Wikipedia, trying to find CompSci articles that explained the correct way to do this.
In order to support color syntax highlighting and mark syntax errors with the nice red squiggly underline, IntelliJ Elixir needed to be able to analyze Elixir syntax.
Syntactic analysis is usually broken down into two parts: the first breaks the raw text into tokens and the second checks if those tokens are arranged in the correct order.
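As a rough illustration, using hypothetical token names rather than the real ones, the text "1 + 2" first becomes a flat list of tokens, and the parser then checks their order and nests them into a tree:

# Hypothetical shapes for the two stages on the text "1 + 2":
tokens = [{:number, 1}, {:op, :+}, {:number, 2}] # lexing: a flat list of tokens
ast    = {:+, 1, 2}                              # parsing: ordered and nested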
In most programming languages, both lexers and parsers are built using generators that have an external DSL. In Elixir, the lexer is built using Erlang directly because Erlang pattern matching is compact enough that a generator is unnecessary. Additionally, Elixir syntax contains some features that normal lexer generators aren't expecting.
For IntelliJ Elixir, I used JFlex because it was the lexer generator recommended by JetBrains' plugin tutorial.
For parsing, Elixir does use a generator, called yecc, which generates Erlang code.
For IntelliJ Elixir, I used JetBrains' GrammarKit.
When I first started IntelliJ Elixir, I didn't understand the important differences between these two stacks, but I hope to explain to you what I learned along the way.
The first step of syntactic analysis is lexing, also known as tokenizing. Lexing breaks up the raw text into tokens, such as keywords, literals, operators, and identifiers.
Input is matched using some pattern. In native Elixir, this is pattern matching on Erlang string prefixes. In IntelliJ Elixir's JFlex file, it's regular expressions.
tokenize([$#|String], Line, Scope, Tokens) ->
  Rest = tokenize_comment(String),
  tokenize(Rest, Line, Scope, Tokens);

tokenize_comment("\r\n" ++ _ = Rest) -> Rest;
tokenize_comment("\n" ++ _ = Rest) -> Rest;
tokenize_comment([_|Rest]) -> tokenize_comment(Rest);
tokenize_comment([]) -> [].
COMMENT = "#" [^\r\n]* {EOL}?

<YYINITIAL> {
  {COMMENT} { yybegin(BODY); return ElixirTypes.COMMENT; }
}
<BODY> {
  {COMMENT} { return ElixirTypes.COMMENT; }
}

EOL
tokenize("\n" ++ Rest, Line, Scope, Tokens) ->
  tokenize(Rest, Line + 1, Scope, eol(Line, newline, Tokens));
tokenize("\r\n" ++ Rest, Line, Scope, Tokens) ->
  tokenize(Rest, Line + 1, Scope, eol(Line, newline, Tokens));

eol(_Line, _Mod, [{',',_}|_] = Tokens) -> Tokens;
eol(_Line, _Mod, [{eol,_,_}|_] = Tokens) -> Tokens;
eol(Line, Mod, Tokens) -> [{eol,Line,Mod}|Tokens].
EOL = \n|\r|\r\n

<YYINITIAL> {
  ({EOL}|{WHITE_SPACE})+ { yybegin(BODY); return TokenType.WHITE_SPACE; }
}
<BODY> {
  {EOL}({EOL}|{WHITE_SPACE})* { return ElixirTypes.EOL; }
}

Whitespace
tokenize([T|Rest], Line, Scope, Tokens) when ?is_horizontal_space(T) ->
  tokenize(strip_horizontal_space(Rest), Line, Scope, Tokens);

strip_horizontal_space([H|T]) when ?is_horizontal_space(H) ->
  strip_horizontal_space(T);
strip_horizontal_space(T) -> T.
<BODY> {
  {WHITE_SPACE}+ { return TokenType.WHITE_SPACE; }
}
In addition to turning runs of characters into single tokens, the lexer can be used to filter out runs that don't affect the meaning of the code, such as spaces, extra newlines, and comments.
The Erlang lexer processes the raw text as a char list, pattern matching on the head of the list and then recursively calling tokenize on the rest of the raw text. The Erlang lexer only keeps track of the effect of ignored characters on the current line number, which was the only metadata in Elixir v0.14.3.
In contrast, the JFlex lexer processes the raw text as Unicode characters. Regular expressions, and references to other named regular expressions (as seen in COMMENT using EOL), can be used to match text, with the longest match winning in a given state; in the Erlang lexer, by contrast, precedence follows the order of the clauses. The JFlex lexer emits a token even for ignored characters because in an editor, unlike a compiler, one cares about comments, whitespace, and extra newlines.
iex> greeting = "Hello #{"W#{"or"}ld"}"
"Hello World"
iex> tuple = "A tuple #{inspect {"Containing an #{:interpolated} string"}}"
"A tuple {\"Containing an interpolated string\"}"
Adding support for interpolation was tricky. At first glance, the #{ and } that surround interpolation should work just like curly braces in a language like C or Java, but braces in C or Java can just be lexed and the parser can decide whether they are matched. In languages like Ruby or Elixir that support interpolation, you need to track whether you're lexing fragments of a string, char list, or sigil, or normal code, because fragments are syntax highlighted and parsed differently than normal code. Finally, code inside interpolation can itself have strings, char lists, or sigils that also contain interpolation, recursively.
This recursion means that the finite automata JFlex generates can't lex Elixir on their own!
Finite automata can lex regular languages, which are languages that match formal regular expressions. Formal regular expressions only allow pipe (|) for or, parentheses (()) for grouping, and asterisk (*) for the Kleene star, which means zero or more. They can also have question mark (?) to mean zero or one if they can't use a symbol for no input, which would be lowercase epsilon formally. Formal regular expressions are missing the extended regular expression features, like back references, that would allow recursion.
Context-free languages are a step above regular languages and can be lexed by pushdown automata, which have a stack for keeping track of state. A pushdown automaton can use the current state along with the current input to decide whether parentheses are matched, by pushing opening parentheses onto a stack and popping on closing parentheses. If you try to pop when the stack is empty, then you have an unmatched closing parenthesis, and if the string ends with a non-empty stack, then you have an unmatched opening parenthesis.
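A minimal sketch of that stack discipline in Elixir — my own illustration, not code from either lexer:

defmodule Parens do
  # Pushdown-automaton-style check: the stack records unmatched "(" characters.
  def balanced?(text), do: check(String.to_charlist(text), [])

  defp check([?( | rest], stack), do: check(rest, [?( | stack])
  defp check([?) | _rest], []), do: false           # pop on an empty stack: unmatched ")"
  defp check([?) | rest], [?( | stack]), do: check(rest, stack)
  defp check([_other | rest], stack), do: check(rest, stack)
  defp check([], []), do: true                      # every "(" was matched
  defp check([], _stack), do: false                 # input ended with "(" still on the stack
end

Parens.balanced?("f(g(x))") # => true
Parens.balanced?("f(g(x)")  # => false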
When writing compilers and IDEs, designers have to care about this computational hierarchy because, as you move up it, performance guarantees get fuzzier and the machinery more complex, until you hit semidecidable languages and Turing machines, which may just spin forever on bad input.
tokenize([$"|T], Line, Scope, Tokens) ->
  handle_strings(T, Line, $", Scope, Tokens);
tokenize([$'|T], Line, Scope, Tokens) ->
  handle_strings(T, Line, $', Scope, Tokens);

handle_strings(T, Line, H, Scope, Tokens) ->
  case elixir_interpolation:extract(Line, Scope, true, T, H) of
    {error, Reason} ->
      interpolation_error(Reason, [H|T], Tokens, " (for string starting at line ~B)", [Line]);
    {NewLine, Parts, [$:|Rest]} when ?is_space(hd(Rest)) ->
      Unescaped = unescape_tokens(Parts),
      Key = case Scope#elixir_tokenizer.existing_atoms_only of
              true  -> kw_identifier_safe;
              false -> kw_identifier_unsafe
            end,
      tokenize(Rest, NewLine, Scope, [{Key, Line, Unescaped}|Tokens]);
    {NewLine, Parts, Rest} ->
      Token = {string_type(H), Line, unescape_tokens(Parts)},
      tokenize(Rest, NewLine, Scope, [Token|Tokens])
  end.

lib/elixir/src/elixir_interpolation.erl
extract(Line, Scope, true, [$#, ${|Rest], Buffer, Output, Last) ->
  Output1 = build_string(Line, Buffer, Output),
  case elixir_tokenizer:tokenize(Rest, Line, Scope) of
    {error, {EndLine, _, "}"}, [$}|NewRest], Tokens} ->
      Output2 = build_interpol(Line, Tokens, Output1),
      extract(EndLine, Scope, true, NewRest, [], Output2, Last);
    {error, Reason, _, _} ->
      {error, Reason};
    {ok, _EndLine, _} ->
      {error, {string, Line, "missing interpolation terminator:}", []}}
  end;
In native Elixir, elixir_tokenizer's tokenize calls handle_strings, which calls elixir_interpolation's extract, which calls elixir_tokenizer's tokenize, meaning the native lexer uses normal Erlang recursion to handle the interpolation recursion.
%{
  private java.util.Stack<Integer> lexicalStateStack = new java.util.Stack<Integer>();
%}

<YYINITIAL> {
  {DOUBLE_QUOTES} { lexicalStateStack.push(BODY);
                    yybegin(DOUBLE_QUOTED_STRING);
                    return ElixirTypes.DOUBLE_QUOTES; }
}
<DOUBLE_QUOTED_STRING> {
  {INTERPOLATION_START} { lexicalStateStack.push(yystate());
                          yybegin(INTERPOLATION);
                          return ElixirTypes.INTERPOLATION_START; }
  {DOUBLE_QUOTES}       { int previousLexicalState = lexicalStateStack.pop();
                          yybegin(previousLexicalState);
                          return ElixirTypes.DOUBLE_QUOTES; }
}
<BODY, INTERPOLATION> {
  {DOUBLE_QUOTES} { lexicalStateStack.push(yystate());
                    yybegin(DOUBLE_QUOTED_STRING);
                    return ElixirTypes.DOUBLE_QUOTES; }
}
<INTERPOLATION> {
  {INTERPOLATION_END} { int previousLexicalState = lexicalStateStack.pop();
                        yybegin(previousLexicalState);
                        return ElixirTypes.INTERPOLATION_END; }
}
JFlex's flex DSL only supports creating finite automata, so I added a manually managed stack in the Java code that the rule actions can manipulate on each match.
If the lexer hits a non-escaped double quote, it enters the DOUBLE_QUOTED_STRING state, which treats the #{ of the interpolation start token specially: it pushes the current state on top of the stack and begins the INTERPOLATION state.
The INTERPOLATION and BODY states are the same except that INTERPOLATION needs to pop and restore the previous lexical state if the closing curly brace (}) for INTERPOLATION is hit.
Associativity is how to group repeated operations of the same precedence.
I'm using the or operators, the word or and double pipe (||), for the left-associative operators because they're easy to distinguish. Similarly, I'm using the two operators, double plus (++) and diamond (<>), for the right-associative operators. Remember, the operators in each example are of the same precedence, so it's just the associativity controlling the nesting.
For left-associative operators, the right-most operator becomes the root of the tree, with the left operand's operation executing first, so reading the tree from the root the operators look like they're in the wrong order, but if you rearrange the nesting into pipeline order it makes sense.
For right-associative operators, reading the tree in order looks like it matches the order of the code, but when you change it to pipeline order you can see that the second operation must actually complete first, since its result becomes the second argument of the first operation.
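You can see both shapes by quoting the expressions in iex; the output below is schematic, with AST metadata elided as underscores:

iex> quote do: a || b || c
{:||, _, [{:||, _, [a, b]}, c]}   # left-associative: a || b nests as the left operand
iex> quote do: a ++ b ++ c
{:++, _, [a, {:++, _, [b, c]}]}   # right-associative: b ++ c nests as the right operand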
YECC is a LALR, or look-ahead, left-to-right, rightmost derivation parser generator. Grammar Kit is a Parsing Expression Grammar parser generator.
Rightmost derivation means the parse tree is actually built up from the right-most end of the text, opposite the way the text is read, which is somewhat counter-intuitive for humans, but turns out to be really powerful and efficient.
Leftmost derivation means the parse tree is built in the same order as the text is read and is easier for humans to understand, but it's not as theoretically powerful.
Importantly, rightmost derivation favors left-recursion, but left-recursion will cause an infinite loop in a leftmost derivation, so in a basic Parsing Expression Grammar one would need to rewrite the rules to eliminate the left-recursion.
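As a toy example of such a rewrite (mine, not from either grammar), the left-recursive rule expr ::= expr "-" int | int becomes expr ::= int ("-" int)*, with the left-nesting rebuilt as the loop goes:

defmodule LeftRecursionFree do
  # expr ::= int ("-" int)*  — iterate instead of recursing on the same input,
  # re-nesting the partial result to the left on each step.
  def expr([int | rest]) when is_integer(int), do: tail(int, rest)

  defp tail(left, [:-, int | rest]) when is_integer(int), do: tail({:-, left, int}, rest)
  defp tail(left, rest), do: {left, rest}
end

LeftRecursionFree.expr([1, :-, 2, :-, 3]) # => {{:-, {:-, 1, 2}, 3}, []}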
%% Changes in ops and precedence should be reflected on lib/elixir/lib/macro.ex
%% Note though the operator => in practice has lower precedence than all others,
%% its entry in the table is only to support the %{user | foo => bar} syntax.
Left       5 do.
Right     10 stab_op_eol.     %% ->
Left      20 ','.
Nonassoc  30 capture_op_eol.  %% &
Left      40 in_match_op_eol. %% <-, \\ (allowed in matches along =)
Right     50 when_op_eol.     %% when
Right     60 type_op_eol.     %% ::
Right     70 pipe_op_eol.     %% |
Right     80 assoc_op_eol.    %% =>
Right     90 match_op_eol.    %% =
Left     130 or_op_eol.       %% ||, |||, or, xor
Left     140 and_op_eol.      %% &&, &&&, and
Left     150 comp_op_eol.     %% ==, !=, =~, ===, !==
Left     160 rel_op_eol.      %% <, >, <=, >=
Left     170 arrow_op_eol.    %% < (op), (op) > (e.g |>, <<<, >>>)
Left     180 in_op_eol.       %% in
Right    200 two_op_eol.      %% ++, --, .., <>
Left     210 add_op_eol.      %% + (op), - (op)
Left     220 mult_op_eol.     %% * (op), / (op)
Left     250 hat_op_eol.      %% ^ (op) (e.g ^^^)
Nonassoc 300 unary_op_eol.    %% +, -, !, ^, not, ~~~
Left     310 dot_call_op.
Left     310 dot_op.          %% .
Nonassoc 320 at_op_eol.       %% @
Nonassoc 330 dot_identifier.
The YECC format has a section for declaring both the associativity and precedence of operators. The operator table doesn't completely reflect the precedence of all operator combinations because there are some precedence swaps in the Erlang helper code.
Nonassoc is used for non-binary, prefix operators that can't be chained at the same level.
Higher precedence operators (with greater numbers) in the table can act as arguments to the lower precedence operators.
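For example, mult_op_eol (220) outranks add_op_eol (210), so a multiplication can act as an argument to an addition; metadata is again elided from the quoted output:

iex> quote do: 1 + 2 * 3
{:+, _, [1, {:*, _, [2, 3]}]}   # the higher-precedence * becomes an argument of +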
matchedExpression ::= matchedExpressionCaptureOperation |
                      matchedExpressionInMatchOperation |
                      matchedExpressionWhenOperation |
                      matchedExpressionTypeOperation |
                      matchedExpressionPipeOperation |
                      matchedExpressionMatchOperation |
                      matchedExpressionOrOperation |
                      matchedExpressionAndOperation |
                      matchedExpressionComparisonOperation |
                      matchedExpressionRelationalOperation |
                      matchedExpressionArrowOperation |
                      matchedExpressionInOperation |
                      matchedExpressionTwoOperation |
                      matchedExpressionAdditionOperation |
                      matchedExpressionMultiplicationOperation |
                      matchedExpressionHatOperation |
                      matchedExpressionUnaryOperation |
                      matchedExpressionDotOperation |
                      matchedExpressionAtOperation |
                      identifierExpression |
                      accessExpression
There isn't a unified associativity and precedence table in Grammar Kit's BNF. Instead, precedence is the order of the choices in the parent expression rule, matchedExpression.
Instead of stating the precedence of the operators, as is done in YECC, in Grammar Kit the precedence of the operations using those operators is stated.
matchedExpressionAdditionOperation ::= matchedExpression DUAL_OPERATOR EOL* matchedExpression
matchedExpressionAndOperation ::= matchedExpression EOL* AND_OPERATOR EOL* matchedExpression
matchedExpressionArrowOperation ::= matchedExpression EOL* ARROW_OPERATOR EOL* matchedExpression
matchedExpressionAtOperation ::= AT_OPERATOR EOL* matchedExpression
matchedExpressionCaptureOperation ::= CAPTURE_OPERATOR EOL* matchedExpression
matchedExpressionComparisonOperation ::= matchedExpression EOL* COMPARISON_OPERATOR EOL* matchedExpression
matchedExpressionDotOperation ::= matchedExpression EOL* DOT_OPERATOR EOL* matchedExpression
matchedExpressionHatOperation ::= matchedExpression EOL* HAT_OPERATOR EOL* matchedExpression
matchedExpressionInMatchOperation ::= matchedExpression EOL* IN_MATCH_OPERATOR EOL* matchedExpression
matchedExpressionInOperation ::= matchedExpression EOL* IN_OPERATOR EOL* matchedExpression
matchedExpressionMatchOperation ::= matchedExpression EOL* MATCH_OPERATOR EOL* matchedExpression { rightAssociative = true }
matchedExpressionMultiplicationOperation ::= matchedExpression EOL* MULTIPLICATION_OPERATOR EOL* matchedExpression
matchedExpressionOrOperation ::= matchedExpression EOL* OR_OPERATOR EOL* matchedExpression
matchedExpressionPipeOperation ::= matchedExpression EOL* PIPE_OPERATOR EOL* matchedExpression { rightAssociative = true }
matchedExpressionRelationalOperation ::= matchedExpression EOL* RELATIONAL_OPERATOR EOL* matchedExpression
matchedExpressionTwoOperation ::= matchedExpression EOL* TWO_OPERATOR EOL* matchedExpression { rightAssociative = true }
matchedExpressionTypeOperation ::= matchedExpression EOL* TYPE_OPERATOR EOL* matchedExpression { rightAssociative = true }
matchedExpressionUnaryOperation ::= (DUAL_OPERATOR | UNARY_OPERATOR) EOL* matchedExpression
matchedExpressionWhenOperation ::= matchedExpression EOL* WHEN_OPERATOR EOL* matchedExpression { rightAssociative = true }
The pattern for binary operations is matchedExpression on either side of the operator.
Left associativity is assumed by default, so right associativity is indicated with right associative equals true (rightAssociative = true) in the braces after the rule.
You may have noticed that these rules appear to be recursive: on the previous slide, matchedExpression was defined as the ordered choice of all the operations on this slide, but all the operations use matchedExpression as both the left and right operand, so why doesn't the parser go into an infinite loop on the first binary rule, matched expression in match operation (matchedExpressionInMatchOperation)?
There is no loop because Grammar Kit has some extensions that allow it to detect this pattern of rules.
The rules for Parsing Expression Grammars say that left recursion is impossible, but if you declare the operation rules to extend the root expression rule, then Grammar Kit will trigger Pratt Parsing.
That being said, it is really easy to break this trigger, say if you rename a rule so that it no longer matches the extends pattern. If that happens, you'll see a cascade of left-recursion warnings, but because it's recursive, the editor can't tell you where to fix it.
public class ElixirParser implements PsiParser {
  /* ********************************************************** */
  // Expression root: matchedExpression
  // Operator priority table:
  // 0: PREFIX(matchedExpressionCaptureOperation)
  // 1: BINARY(matchedExpressionInMatchOperation)
  // 2: BINARY(matchedExpressionWhenOperation)
  // 3: BINARY(matchedExpressionTypeOperation)
  // 4: BINARY(matchedExpressionPipeOperation)
  // 5: BINARY(matchedExpressionMatchOperation)
  // 6: BINARY(matchedExpressionOrOperation)
  // 7: BINARY(matchedExpressionAndOperation)
  // 8: BINARY(matchedExpressionComparisonOperation)
  // 9: BINARY(matchedExpressionRelationalOperation)
  // 10: BINARY(matchedExpressionArrowOperation)
  // 11: BINARY(matchedExpressionInOperation)
  // 12: BINARY(matchedExpressionTwoOperation)
  // 13: BINARY(matchedExpressionAdditionOperation)
  // 14: BINARY(matchedExpressionMultiplicationOperation)
  // 15: BINARY(matchedExpressionHatOperation)
  // 16: PREFIX(matchedExpressionUnaryOperation)
  // 17: BINARY(matchedExpressionDotOperation)
  // 18: PREFIX(matchedExpressionAtOperation)
  // 19: ATOM(identifierExpression)
  // 20: ATOM(accessExpression)
  public static boolean matchedExpression(PsiBuilder b, int l, int g) {
    if (!recursion_guard_(b, l, "matchedExpression")) return false;
    addVariant(b, "<matched expression>");
    boolean r, p;
    Marker m = enter_section_(b, l, _NONE_, "<matched expression>");
    r = matchedExpressionCaptureOperation(b, l + 1);
    if (!r) r = matchedExpressionUnaryOperation(b, l + 1);
    if (!r) r = matchedExpressionAtOperation(b, l + 1);
    if (!r) r = identifierExpression(b, l + 1);
    if (!r) r = accessExpression(b, l + 1);
    p = r;
    r = r && matchedExpression_0(b, l + 1, g);
    exit_section_(b, l, m, null, r, p, null);
    return r || p;
  }
}
Pratt parsing is an extension to recursive-descent parsers, which include Parsing Expression Grammar parsers, that optimizes operator parsing: it exploits the patterns in how humans write binary operator expressions to eliminate left-recursion.
The optimization involves noticing that all operation rules eventually bottom out in rules that aren't left-recursive, or in plain tokens, so the parser generator itself can do the left-recursion elimination: look for those non-left-recursive rules first, then try to consume an operator, then do it all over again while keeping track of the precedence of the operator.
Here, you can see that the prefix operations, matchedExpressionUnaryOperation and matchedExpressionAtOperation, can be checked for first because they consume an operator token first and so aren't left-recursive.
Likewise, identifierExpression and accessExpression are atoms, so they don't include matchedExpression at all and so can be used to consume input immediately.
Once some input is consumed, matchedExpression goes into matched expression underscore 0 (matchedExpression_0). For user-written rules, the underscore-number system is used for anonymous groups in parentheses, but here it's meant to indicate that the parser is checking the tails of all the matchedExpression rules.
public class ElixirParser implements PsiParser {
  public static boolean matchedExpression_0(PsiBuilder b, int l, int g) {
    if (!recursion_guard_(b, l, "matchedExpression_0")) return false;
    boolean r = true;
    while (true) {
      Marker m = enter_section_(b, l, _LEFT_, null);
      if (g < 1 && matchedExpressionInMatchOperation_0(b, l + 1)) {
        r = matchedExpression(b, l, 1);
        exit_section_(b, l, m, MATCHED_EXPRESSION_IN_MATCH_OPERATION, r, true, null);
      }
      else if (g < 2 && matchedExpressionWhenOperation_0(b, l + 1)) {
        r = matchedExpression(b, l, 1);
        exit_section_(b, l, m, MATCHED_EXPRESSION_WHEN_OPERATION, r, true, null);
      }
      else if (g < 3 && matchedExpressionTypeOperation_0(b, l + 1)) {
        r = matchedExpression(b, l, 2);
        exit_section_(b, l, m, MATCHED_EXPRESSION_TYPE_OPERATION, r, true, null);
      }
      else if (g < 4 && matchedExpressionPipeOperation_0(b, l + 1)) {
        r = matchedExpression(b, l, 3);
        exit_section_(b, l, m, MATCHED_EXPRESSION_PIPE_OPERATION, r, true, null);
      }
      else if (g < 5 && matchedExpressionMatchOperation_0(b, l + 1)) {
        r = matchedExpression(b, l, 4);
        exit_section_(b, l, m, MATCHED_EXPRESSION_MATCH_OPERATION, r, true, null);
      }
      else if (g < 6 && matchedExpressionOrOperation_0(b, l + 1)) {
        r = matchedExpression(b, l, 6);
        exit_section_(b, l, m, MATCHED_EXPRESSION_OR_OPERATION, r, true, null);
      }
      // ...
      else {
        exit_section_(b, l, m, null, false, false, null);
        break;
      }
    }
    return r;
  }
}
The pattern used in matched expression (matchedExpression) and matched expression zero (matchedExpression_0) is based on Douglas Crockford's Top Down Operator Precedence implementation in JavaScript.
g is the right-binding power of the currently matched operator. Only operators with a stronger binding power (because they are higher precedence) can be matched when recursing, but if no stronger rule is matched on the recursive call to matchedExpression, then the while loop allows for matching operators of equal right-binding power.
The left-associative operations, InMatch at the beginning and or at the end of the excerpt, all follow the pattern of a recursive call to matchedExpression with g one greater than the maximum allowed for the current operator (for example: g less than 1 and recurse with 1; g less than 6 and recurse with 6). This ensures that adjacent left-associative operators at the same level are matched by the while loop, and they are then properly left-nested by the underscore left underscore (_LEFT_) directive in the m marker at the top of the while loop.
The underscore left underscore (_LEFT_) directive is similar to how the pipelines are rearranged in Elixir.
For right-associative operators, like when, type, pipe, and match, the if clauses recursively call matched expression (matchedExpression) with a g that can still match the current operator, which means adjacent right-associative operators, or any higher precedence operator, are properly nested at a lower level. Since the parser uses leftmost derivation, any nesting occurs at the right end, so the nesting gets the proper right-associative behavior.
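To summarize the pattern, here is a compact sketch of a Pratt parser loop in Elixir — my own illustration of the technique, not Grammar Kit's generated Java; the token shapes are invented (integer operands only, for brevity) and the binding powers just echo the YECC table:

defmodule Pratt do
  # {binding power, associativity} per operator: higher binds tighter.
  @ops %{:+ => {210, :left}, :* => {220, :left}, :++ => {200, :right}}

  def parse(tokens) do
    {ast, []} = expression(tokens, 0)
    ast
  end

  # Consume a non-left-recursive operand first (like identifierExpression or
  # accessExpression), then loop over operators (like matchedExpression_0).
  defp expression([operand | rest], min_bp) when is_integer(operand) do
    loop(operand, rest, min_bp)
  end

  defp loop(left, [op | rest] = tokens, min_bp) do
    case @ops do
      %{^op => {bp, assoc}} when bp > min_bp ->
        # Left-associative: recurse excluding this precedence (bp), so equal
        # operators are re-nested to the left by this loop. Right-associative:
        # recurse still admitting it (bp - 1), nesting them to the right.
        {right, rest} = expression(rest, if(assoc == :right, do: bp - 1, else: bp))
        loop({op, left, right}, rest, min_bp)

      _ ->
        {left, tokens}
    end
  end

  defp loop(left, [], _min_bp), do: {left, []}
end

Pratt.parse([1, :+, 2, :*, 3])   # => {:+, 1, {:*, 2, 3}}
Pratt.parse([1, :+, 2, :+, 3])   # => {:+, {:+, 1, 2}, 3}
Pratt.parse([1, :++, 2, :++, 3]) # => {:++, 1, {:++, 2, 3}}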
expr -> no_parens_expr : '$1'.

no_parens_expr -> dot_op_identifier call_args_no_parens_many_strict : build_identifier('$1', '$2').
no_parens_expr -> dot_identifier call_args_no_parens_many_strict : build_identifier('$1', '$2').

dot_identifier -> identifier : '$1'.
dot_identifier -> matched_expr dot_op identifier : build_dot('$2', '$1', '$3').

dot_op_identifier -> op_identifier : '$1'.
dot_op_identifier -> matched_expr dot_op op_identifier : build_dot('$2', '$1', '$3').

call_args_no_parens_comma_expr -> matched_expr ',' call_args_no_parens_expr : ['$3', '$1'].
call_args_no_parens_comma_expr -> call_args_no_parens_comma_expr ',' call_args_no_parens_expr : ['$3'|'$1'].

call_args_no_parens_many -> matched_expr ',' call_args_no_parens_kw : ['$1', '$3'].
call_args_no_parens_many -> call_args_no_parens_comma_expr : reverse('$1').
call_args_no_parens_many -> call_args_no_parens_comma_expr ',' call_args_no_parens_kw : reverse(['$3'|'$1']).

call_args_no_parens_many_strict -> call_args_no_parens_many : '$1'.
call_args_no_parens_many_strict -> empty_paren : throw_no_parens_strict('$1').
call_args_no_parens_many_strict -> open_paren call_args_no_parens_kw close_paren : throw_no_parens_strict('$1').
call_args_no_parens_many_strict -> open_paren call_args_no_parens_many close_paren : throw_no_parens_strict('$1').
No parentheses function calls were the first function calls to be implemented because no parens expr (no_parens_expr) is a direct child of the root expr rule.
I knew from using Elixir that these rules had to match the parentheses-less calls I was used to writing.
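For example (these are my own samples of the shapes involved):

send self(), :ping                               # unqualified call, no parentheses, two arguments
String.replace "a,b", ",", " ", global: true     # qualified (dot) call, no parentheses, trailing keywords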
no_parens_expr -> dot_identifier call_args_no_parens_many_strict

dot_identifier -> identifier
dot_identifier -> matched_expr dot_op identifier

Compact dot_identifier
no_parens_expr -> dot_identifier call_args_no_parens_many_strict

dot_identifier ::= (matched_expr dot_op)? identifier

Inline dot_identifier
no_parens_expr ::= (matched_expr dot_op)? identifier call_args_no_parens_many_strict
When converting the YECC BNF to Grammar Kit BNF, I needed to combine clauses using the extra features available in Parsing Expression Grammars. This also helps me reason about the rules, as Parsing Expression Grammars have a more regular-expression-like syntax: I can evaluate the rules in my head more easily because they eliminate redundancies in the grammar and make the branch points more obvious.
That first anonymous group was a problem: the Pratt parser generation in Grammar Kit only works if a rule is used as a direct operand, and matched expression (matched_expr) needs to be indirect to allow both qualified and unqualified no parentheses calls.
private expression ::= emptyParentheses |
                       unqualifiedNoParenthesesManyArgumentsCall |
                       matchedExpression

noParenthesesManyArgumentsUnqualifiedIdentifier ::= IDENTIFIER

unqualifiedNoParenthesesManyArgumentsCall ::= noParenthesesManyArgumentsUnqualifiedIdentifier noParenthesesManyArgumentsStrict
Note: I can't use noParenthesesNoArgumentsUnqualifiedVariable because it extends matchedExpression, and any rule that is part of the Pratt parser doesn't just match itself, but any higher precedence rule too, which isn't obvious unless you look at the generated Java code.
noParenthesesNoArgumentsUnqualifiedCallOrVariable ::= IDENTIFIER

matchedExpression ::= matchedCaptureNonNumericOperation |
                      // ...
                      matchedCallOperation |
                      noParenthesesNoArgumentsUnqualifiedCallOrVariable |
                      accessExpression

matchedCallOperation ::= matchedExpression noParenthesesManyArgumentsStrict
The qualified no parentheses call has to become part of the matchedExpression Pratt parser because that's the only way to allow all the pieces of matchedExpression to be qualifiers.
call_args_no_parens_many_strict -> call_args_no_parens_many : '$1'.
call_args_no_parens_many_strict -> empty_paren : throw_no_parens_strict('$1').
call_args_no_parens_many_strict -> open_paren call_args_no_parens_kw close_paren : throw_no_parens_strict('$1').
call_args_no_parens_many_strict -> open_paren call_args_no_parens_many close_paren : throw_no_parens_strict('$1').

src/org/elixir_lang/Elixir.bnf
/* Special class for wrapping rules so that {@link: org.elixir_lang.inspection.NoParenthesesStrict}
   can just search for ElixirNoParenthesesStrict instead of having to differentiate between
   valid and invalid rule classes. */
noParenthesesStrict ::= emptyParentheses |
                        OPENING_PARENTHESIS (noParenthesesKeywords | noParenthesesManyArguments) CLOSING_PARENTHESIS {
  implements = "org.elixir_lang.psi.QuotableArguments"
  methods = [quoteArguments]
}

private noParenthesesManyArgumentsStrict ::= noParenthesesManyArguments |
                                             noParenthesesStrict
Instead of having all 4 clauses from call_args_no_parens_many_strict directly under noParenthesesManyArgumentsStrict, I only have 2 rules, because that allowed me to use a single node, noParenthesesStrict, which was helpful for the first Inspection and Quick Fix.
function (first_positional, second_positional, key: value)

Code.string_to_quoted
{:error, {1, "unexpected parenthesis. If you are making a function call, do not insert spaces in between the function name and the opening parentheses. Syntax error before: ", "'('"}}

Error highlighting and Quick Fix
The Elixir parser throws errors immediately, but this stops the user from being able to correct later errors. This can be a wise choice, as anyone who has seen the noise that a single error in C or C++ can make later in the file can attest.
For an IDE, however, it would be annoying if syntax and error highlighting stopped at the first error, so invalid syntax, such as ambiguous parentheses, is allowed by IntelliJ Elixir's parser and only marked as an error using an Inspection, which does the red highlighting. Inspections in the JetBrains API allow the user to Quick Fix the error with Alt+Enter when the cursor is hovering over it.
Being able to correct errors is one of the reasons that an IDE's grammar can be more permissive than the compiler's grammar: permissiveness in the parser allows for more robust heuristics to mark the errors later.
call_args_no_parens_expr -> matched_expr : '$1'.
call_args_no_parens_expr -> empty_paren : nil.
call_args_no_parens_expr -> no_parens_expr : throw_no_parens_many_strict('$1').

call_args_no_parens_comma_expr -> matched_expr ',' call_args_no_parens_expr : ['$3', '$1'].
call_args_no_parens_comma_expr -> call_args_no_parens_comma_expr ',' call_args_no_parens_expr : ['$3'|'$1'].

call_args_no_parens_many -> matched_expr ',' call_args_no_parens_kw : ['$1', '$3'].
call_args_no_parens_many -> call_args_no_parens_comma_expr : reverse('$1').
call_args_no_parens_many -> call_args_no_parens_comma_expr ',' call_args_no_parens_kw : reverse(['$3'|'$1']).

src/org/elixir_lang/Elixir.bnf
private noParenthesesCommaExpression ::= matchedExpression (infixComma noParenthesesExpression)+

noParenthesesFirstPositional ::= matchedExpression {
  implements = "org.elixir_lang.psi.Quotable"
  methods = [quote]
}

noParenthesesOnePositionalAndKeywordsArguments ::= noParenthesesFirstPositional infixComma noParenthesesKeywords {
  implements = "org.elixir_lang.psi.QuotableArguments"
  methods = [quoteArguments]
}

noParenthesesManyPositionalAndMaybeKeywordsArguments ::= noParenthesesCommaExpression (infixComma noParenthesesKeywords)? {
  implements = "org.elixir_lang.psi.QuotableArguments"
  methods = [quoteArguments]
}
For valid arguments, call_args_no_parens_many's 3 clauses are just a way of saying there need to be at least 2 arguments and keywords can only appear at the end. This can be expressed slightly more compactly in Grammar Kit, but there still need to be 2 rules to ensure that if there is only one positional argument then there are also keyword arguments.
call_args_no_parens_expr -> matched_expr : '$1'.
call_args_no_parens_expr -> empty_paren : nil.
call_args_no_parens_expr -> no_parens_expr : throw_no_parens_many_strict('$1').

src/org/elixir_lang/Elixir.bnf
/* Have to prevent matchedExpression that is actually a keywordKey from being parsed as just a
   matchedExpression, or callArgumentsNoParenthesesCommaExpression COMMA EOL*
   callArgumentsNoParenthesesKeywords will never match. */
noParenthesesExpression ::= emptyParentheses |
                            /* Must be before matchedExpression because noParenthesesExpression is
                               `matchedExpressionDotIdentifier callArgumentsNoParenthesesManyStrict`,
                               which is longer than `matchedExpressionDotIdentifier` in
                               matchedExpression. */
                            /* This will be marked as an error by
                               {@link org.elixir_lang.inspection.NoParenthesesManyStrict} */
                            noParenthesesManyStrictNoParenthesesExpression |
                            matchedExpression !KEYWORD_PAIR_COLON {
  implements = "org.elixir_lang.psi.Quotable"
  methods = [quote]
}
Each argument is a call_args_no_parens_expr. There are a lot of comments in my implementation because I had a hard time debugging some errors I got during testing.
Having to put (!KEYWORD_PAIR_COLON) after matchedExpression is one of the places where one has to think recursively through the rules.
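Concretely, in the call from the earlier Quick Fix example, when the parser reaches key it already holds a valid matchedExpression, and only the lookahead at the colon tells it to back off so the keywords rule can match:

function first_positional, second_positional, key: value
# Without !KEYWORD_PAIR_COLON, `key` would be consumed as a bare
# matchedExpression (a variable), and the trailing `: value` could
# never be matched as part of noParenthesesKeywords.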