ANTLR Difficulties: Writing a Ruby Grammar

At Rostelecom-Solar we are developing a static code analyzer for vulnerabilities and undeclared capabilities (NDV), which works, among other things, on parse trees. To build them, we use an optimized version of ANTLR4, a tool for developing compilers, interpreters and translators.



The repository contains grammars for many programming languages. However, a Ruby grammar is missing; apparently, no one has implemented one. There is only a grammar for a similar home-grown language that parses only the simplest cases. This is not surprising: the Ruby grammar is hard to implement because the language has a non-trivial syntax. But such a grammar would be very useful to anyone who, for example, wants to write their own language and is thinking about how to do it, or who needs to solve the technical difficulties discussed in this article. So we will have to write a new grammar, which is exactly what we will do here.



ANTLR splits language analysis into a lexer and a parser.



The lexer generates tokens from specified sequences of characters over the language's alphabet. If a character sequence matches the definitions of several tokens, the longest match is chosen; among matches of equal length, the one with the highest priority (determined by the order of declaration) wins.
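For example, given the two lexer rules below (a toy illustration, not part of the Ruby grammar), the input format becomes a single Id token, because the longer match wins; the input for matches both rules at the same length, and Keyword wins because it is declared first:

Keyword: 'for';
Id:      [a-zA-Z_] [a-zA-Z_0-9]*;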



The parser forms sentences (complete commands) of the language from the tokens (also called terminal symbols) produced by the lexer.



In general, writing the grammar is no different from writing one for any other language. You can learn how from the author's book, the documentation, or numerous tutorials like this one.



In this article, the basic implementation is omitted; only the technical difficulties of writing the grammar are considered.



Lexer problems



Essentially, the only possible problem for a lexer is emitting a valid token for a given character sequence, and the obstacles that stand in the way of doing so.



Interpolation



Some Ruby strings allow interpolation: arbitrary code can be inserted inside them using the #{code} syntax. For instance:



a = 3
"Hello #{if a%2==1 then "Habr!" else "World!" end}"


We will cope with the help of lexer modes: specific "small lexers", each designed for a particular task, such as parsing a string:



DoubleQuote: '"' {++nestedStringLevel;}    -> pushMode(InterpolationString);


At each nesting level, the number of open braces must be tracked, because of situations like the following (here we need to exit the interpolation at the second closing curly brace):



"Kappa #{
    buf = ''
    [1, 2, 3].each { |x| buf += x.to_s }
    buf
}"


To do this, let's create a stack openedCurlyBracesInsideString. In total, inside the mode we have the token:



StartInterpolation: '#{' {openedCurlyBracesInsideString.push(0);}    -> pushMode(DEFAULT_MODE);


Now we need to exit the normal mode (DEFAULT_MODE) in time; for this we add actions to the brace tokens:



OpenCurlyBracket:             '{' {
    // inside an interpolation: one more unmatched '{' at the current level
    if (nestedStringLevel > 0 && !openedCurlyBracesInsideString.empty()) {
        final int buf = openedCurlyBracesInsideString.pop();
        openedCurlyBracesInsideString.push(buf + 1);
    }
};
CloseCurlyBracket:            '}' {
    // no unmatched '{' left at this level: this '}' closes the interpolation itself
    if (nestedStringLevel > 0 && !openedCurlyBracesInsideString.empty()
            && openedCurlyBracesInsideString.peek() <= 0) {
       popMode();
       openedCurlyBracesInsideString.pop();
    } else {
        if (!openedCurlyBracesInsideString.empty()) {
            final int buf = openedCurlyBracesInsideString.pop();
            openedCurlyBracesInsideString.push(buf - 1);
        }
    }
};
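The state used by these actions lives in the lexer members; the article does not show the declarations, but a minimal sketch of the fields used so far (names taken from the snippets above) could be:

@lexer::members {
    // depth of nested string literals currently being lexed
    int nestedStringLevel = 0;
    // per interpolation level: number of unmatched '{' seen so far
    final java.util.Stack<Integer> openedCurlyBracesInsideString = new java.util.Stack<>();
}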


%-notations



Ruby has additional Perl-inspired syntax for writing strings, arrays of strings and symbols (which in Ruby are not characters in the usual sense), regular expressions, and shell commands. The syntax of such commands is % followed by an optional type identifier and a delimiter character. For example, %w|a b c| is an array of three strings. Paired brackets {} [] () <> can also be used as delimiters. Simply listing all possible identifiers will not work: then, for example, the sequence



q = 3
5%q


will not be recognized correctly. The lexer would simply eat the longest sequence of characters, creating a %q token.



By creating a check for the absence of an obviously invalid delimiter, such as a whitespace character, a digit, or a letter, and adding it to the token as a predicate (a condition that must hold at a certain point for construction of the token to continue),



StringArrayConstructorwToken: '%w' {canBeDelimiter()}?;


we get protection that almost always works (more on that later). Depending on the alternative, we also set up the expectation of the corresponding delimiter.



StartArrayConstructorNotInterpolated
    : StringArrayConstructorwToken ( Brackets {setPairDelimiter();} | ~[(<{[] {setCharDelimiter();} ) {startStringMode(ArrayConstructorw);}
    ;

fragment Brackets: '(' | '[' | '{' | '<';


where startStringMode is a utility function for mode switching and nesting support.



public void startStringMode(final int mode) {
    pushMode(mode);
    ++nestedStringLevel;
}
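The delimiter helpers themselves are not shown in the article; a minimal sketch of what canBeDelimiter, setCharDelimiter, setPairDelimiter and isDelimiter might look like (the field and the bodies here are assumptions, not the article's code):

private char delimiter;

// an obviously invalid delimiter: whitespace, digit or letter
public boolean canBeDelimiter() {
    final char next = (char) _input.LA(1);
    return !Character.isWhitespace(next) && !Character.isLetterOrDigit(next);
}

public void setCharDelimiter() {
    // the closing delimiter is the character just consumed
    delimiter = (char) _input.LA(-1);
}

public void setPairDelimiter() {
    // for paired brackets the closing delimiter is the matching bracket
    switch ((char) _input.LA(-1)) {
        case '(': delimiter = ')'; break;
        case '[': delimiter = ']'; break;
        case '{': delimiter = '}'; break;
        case '<': delimiter = '>'; break;
    }
}

public boolean isDelimiter() {
    return (char) _input.LA(1) == delimiter;
}

In real code the delimiter would most likely be kept on a stack, mirroring the string nesting; a single field is enough to show the idea.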


Counterexample: 5%q|1 can be parsed correctly only with the parser's context, when it is known that no string can follow the 5 here.



You might think it would be enough to look ahead for a matching delimiter, but there is a counterexample for that too: 5%q|1|1. From this it becomes clear that when the analysis is split into a lexer and a parser, such a case cannot be parsed.



However, this happens very rarely, so ¯\_(ツ)_/¯ will do. Inside the mode, we just wait for the delimiter:



ArraywWhitespace: WhitespaceAll                           -> skip;
ArraywText:       ({!isDelimiter()}? ArraywTextFragment)+ -> type(StringPart);
ArraywEnd:        . {nestedStringLevel--;}                -> type(HereDocEnd), popMode;


where type changes the type of the generated tokens for convenience.



Division or start of regex



The syntax of regular expressions is /regexp/ (as well as the aforementioned percent notation). The same parser-context problem arises as in the previous section; to mitigate it, we create a check



public boolean canBeRegex() {
    return isPrevWS && " \t\r\u000B\f\b\n".indexOf((char) _input.LA(1)) == -1 || isPrevNL || isOp || prevNonWsType == StartInterpolation;
}


and add it to the token:



Divide:                       '/' {
    if (canBeRegex()) {
        startHereDoc(RegExp);
    }
};


To recalculate the variables isOp, isPrevNL and isPrevWS, we override the emit function, which performs the final creation of the token:



@Override
public void emit(final Token token) {
    final String txt = token.getText();
    final int type = token.getType();
    isPrevNL = type == NL;
    isPrevWS = type == WS;
    if (!isPrevWS && !isPrevNL && type != SingleLineComment && type != MultiLineComment) {
        isOp = OPERATORS.contains(type);
        prevNonWsChar = txt.charAt(txt.length() - 1);
        prevNonWsType = type;
    }
    super.emit(token);
}


where OPERATORS is a hash set of all operator token types.
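Such a set might be declared in @lexer::members like this (a minimal sketch; apart from Divide, the token names here are assumptions):

// token types of all operators; extend with the full operator list
private static final java.util.Set<Integer> OPERATORS = new java.util.HashSet<>(
        java.util.Arrays.asList(Plus, Minus, Mul, Divide, Mod, Assign, Less, Greater));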



Parser problems



Whitespace characters



Quite an unpleasant surprise was the effect of whitespace on parsing. Usually it does not affect the grammar in any way and is simply removed from the stream with -> skip or by moving it to another channel. However, a number of languages still distinguish some constructs with its help.



So, for example, a+3 and a + 3 cannot be a function call without parentheses, but a +3 can. Therefore, all parser rules look like this (NL is a newline, WS is whitespace):



    | expression WS* op=('+' | '-') (NL | WS)* expression


To mitigate the problem, let's write a listener that will clean such garbage out of our parse tree.



public class RemovingTokensListener implements ParseTreeListener {
    private List<Integer> unwantedTokens;

    ...

    @Override
    public void visitTerminal(final TerminalNode node) {
        // the terminal was just added to its parent context; detach it again
        if (this.unwantedTokens.contains(node.getSymbol().getType())) {
            ((ParserRuleContext) node.getParent().getRuleContext()).removeLastChild();
        }
    }
}
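Since removeLastChild only makes sense while the tree is being built, the listener is attached to the parser before parsing starts. A hedged usage sketch (the RubyLexer/RubyParser class names appear later in the article; the program start rule and the constructor taking the token-type list are assumptions):

RubyLexer lexer = new RubyLexer(CharStreams.fromFileName("example.rb"));
RubyParser parser = new RubyParser(new CommonTokenStream(lexer));
// WS and NL terminals are dropped while the tree is being built
parser.addParseListener(new RemovingTokensListener(Arrays.asList(RubyLexer.WS, RubyLexer.NL)));
ParseTree tree = parser.program();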


Heredoc - Lexer and parser



Heredoc is a special syntax for specifying multi-line strings. It may look like this:



<<STR
content line 1
content line 2
STR


or even like this (interestingly, markdown does not highlight this syntax correctly):



value = 123
print <<STR_WITH_INTERPOLATION, <<'STR_WITHOUT_INTERPOLATION'.strip
content 1 and #{value}
STR_WITH_INTERPOLATION
     content 2 and #{value}
STR_WITHOUT_INTERPOLATION


The problem is that you need to understand where the string ends, and it would also be convenient to know which content belongs to which string. To do this, we first create a list of heredocs pending processing (the parser, depending on the context and mode, may load an arbitrary number of tokens ahead):



public final HeredocsHolder HEREDOCS = new HeredocsHolder();

public static final class HereDocEntry {
    public final String name;
    public final HereDocType type;
    public final boolean isInterpolated;
    public int partsNumber;

    public HereDocEntry(final String name, final HereDocType type, final boolean isInterpolated) {
        this.name = name;
        this.type = type;
        this.isInterpolated = isInterpolated;
        this.partsNumber = 0;
    }
}

public static final class HeredocsHolder {
    public final List<HereDocEntry> entries;
    public int toProcess;

    HeredocsHolder() {
        this.entries = new ArrayList<>();
        this.toProcess = 0;
    }
}


and we will add entries to it as heredocs are encountered:



StartHereDoc
    : HereDocToken HereDocName {
        heredocIdentifier = getHeredocIdentifier('\'');
        setText(getText().trim());
        keepHereDoc(HereDoc, false);
    }
    ;


public void keepHereDoc(final int mode, final boolean isInterpolated) {
    HEREDOCS.entries.add(new HereDocEntry(heredocIdentifier, getHereDocType(), isInterpolated));
    ++HEREDOCS.toProcess;
    isFirstNL = true;
}
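The helpers getHeredocIdentifier and getHereDocType are not shown in the article; as an illustration, the identifier extraction might look like this (an assumption about the shape of the matched text, not the article's code):

// pull the terminator name out of text like "<<-'STR'" or "<<STR"
private String getHeredocIdentifier(final char quote) {
    String txt = getText().trim();
    txt = txt.substring(2);                        // drop the leading "<<"
    if (!txt.isEmpty() && (txt.charAt(0) == '-' || txt.charAt(0) == '~')) {
        txt = txt.substring(1);                    // drop the indent marker
    }
    if (!txt.isEmpty() && txt.charAt(0) == quote) {
        txt = txt.substring(1, txt.length() - 1);  // drop the surrounding quotes
    }
    return txt;
}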




Then, having seen the end of a line while heredocs are pending, we call the processing function:



NL:                           '\n' {
    final int next = _input.LA(1);
    if (HEREDOCS.toProcess > 0 && isFirstNL) {
        startHereDocRoutine();
        isFirstNL = false;
        insideHeredoc = true;
    }
};


The processing function itself is given below. Here we process only the most recent heredocs (the lexer could have run ahead, but the parser has not yet absorbed those tokens, so we save the information):



public void startHereDocRoutine() {
    final int stopIdx = HEREDOCS.entries.size() - HEREDOCS.toProcess;
    for (int i = HEREDOCS.entries.size() - 1; i >= stopIdx; --i) {
        if (HEREDOCS.entries.get(i).isInterpolated) {
            pushMode(HereDocInterpolated);
        } else {
            pushMode(HereDoc);
        }
    }
    nestedStringLevel += HEREDOCS.toProcess;
    currentHeredocIt = HEREDOCS.entries.listIterator(HEREDOCS.entries.size() - HEREDOCS.toProcess);
    currentHeredoc = currentHeredocIt.next();
}


We also need to override nextToken to exit the mode and count the number of tokens for each heredoc:



@Override
public Token nextToken()
{
    final CommonToken token = (CommonToken)super.nextToken();
    final int ttype = token.getType();
    if (insideHeredoc && ttype == StartInterpolation) {
        ++heredocTokensCount;
    }
    if (_mode == HereDoc || _mode == HereDocInterpolated) {
        if (ttype == VarName) {
            ++heredocTokensCount;
        } else if (ttype == StringPart) {
            ++heredocTokensCount;
            final String txt = token.getText();
            if (CheckHeredocEnd(txt)) {
                // the terminator line: turn it into HereDocEnd and close this heredoc
                token.setText(txt.trim());
                token.setType(HereDocEnd);
                nestedStringLevel--;
                popMode();
                // remember how many tokens this heredoc produced
                currentHeredoc.partsNumber = heredocTokensCount;
                if (currentHeredocIt.hasNext()) {
                    currentHeredoc = currentHeredocIt.next();
                }
                heredocTokensCount = 0;
                --HEREDOCS.toProcess;
                if (_mode == DEFAULT_MODE) {
                    insideHeredoc = false;
                }
            }
        }
    }
    return token;
}
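CheckHeredocEnd is also omitted in the article; a minimal sketch, assuming a heredoc terminates on a line that consists solely of the pending identifier (possibly indented):

private boolean CheckHeredocEnd(final String txt) {
    // the terminator must match the name of the heredoc currently being read
    return txt.trim().equals(currentHeredoc.name);
}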


Now let's move on to the parser.



Using @parser::members, we add to the parser: a reference to the lexer, the string nodes into which we will move their content, the number of interpolation nodes (by analogy with heredocTokensCount in the lexer), and a stack of statements with a flag indicating the need for processing:



    private final RubyLexer lexer = (RubyLexer)_input.getTokenSource();
    private final List<ParserRuleContext> CONTEXTS = new ArrayList<>();
    private final List<Integer> heredocRulesCount = new ArrayList<>();
    private final Stack<StatementEntry> statements = new Stack<>();

    private static final class StatementEntry {
        public boolean needProcess;
        public int currentHeredoc;

        StatementEntry() {
            this.needProcess = false;
            this.currentHeredoc = 0;
        }
    }


Let's modify the code of the statement rule itself:



statement
@init {
    statements.push(new StatementEntry());
}
@after {
    if (statements.peek().needProcess) {
        processHeredocs($ctx);
    }
    statements.pop();
}
    : statementWithoutHeredocTail ({statements.peek().needProcess}? interpolatedStringPart* HereDocEnd {++statements.peek().currentHeredoc;})*
    ;


@init is code executed when the parser enters the rule; @after, when it exits.



The stack is necessary to assign heredocs to the correct statement, since there may be new ones inside an interpolation.



To correctly recognize the cases where a heredoc can be an ordinary expression and where consecutive tokens can be considered a string, as well as the complex cases where the end of the string lies beyond the current expression, we write the following parser code:



string:
...
    | StartInterpolatedHereDoc (memberAccess* WS* NL interpolatedStringPart* HereDocEnd)? {
        if ($ctx.HereDocEnd() == null) {
            CONTEXTS.add($ctx);
            heredocRulesCount.add(0);
            statements.peek().needProcess = true;
        } else {
             lexer.HEREDOCS.entries.remove(0);
        }
    }
...


To do the actual counting of interpolation nodes, we modify the rule for the string content (the + 2 here is needed to count the tokens that open and close the interpolation):



interpolatedStringPart
locals[int nodesCount = 0]
    : StringPart
    | VarName
    | StartInterpolation (WS* statement {++$nodesCount;})* (WS* rawStatement {++$nodesCount;})? WS* '}' {
        final int cur = statements.peek().currentHeredoc;
        if (cur < heredocRulesCount.size()) {
            heredocRulesCount.set(cur, heredocRulesCount.get(cur) + $nodesCount + 2);
        }
    }
    ;


where locals is a list of local variables (they are referenced through $). Whitespace tokens are not counted, since they are removed during tree construction by our listener.



Now let's write the processHeredocs function itself. First, let's count how many nodes we need to pick up:



int heredocNodesCount = 0;
for (int i = 0; i < CONTEXTS.size(); ++i) {
    heredocNodesCount += lexer.HEREDOCS.entries.get(i).partsNumber;
    heredocNodesCount += heredocRulesCount.get(i);
}


Then we determine the child from which we will start moving the string content into place:



int currentChild = ctx.getChildCount() - heredocNodesCount;


And hook the content onto the corresponding node:



for (int i = 0; i < CONTEXTS.size(); ++i) {
    final RubyLexer.HereDocEntry entry = lexer.HEREDOCS.entries.remove(0);
    final ParserRuleContext currentContext = CONTEXTS.get(i);
    final int currentNodesCount = entry.partsNumber + heredocRulesCount.get(i);
    for (int j = 0; j < currentNodesCount; ++j, ++currentChild) {
        final ParseTree child = ctx.getChild(currentChild);
        if (child instanceof TerminalNode) {
            ((TerminalNodeImpl) child).setParent(currentContext);
            currentContext.addChild((TerminalNodeImpl) child);
        } else if (child instanceof ParserRuleContext) {
            ((ParserRuleContext) child).setParent(currentContext);
            currentContext.addChild((ParserRuleContext) child);
        } else {
            // parser failed
            clear();
            return;
        }
    }
}


Finally, we clean up the node itself, the heredoc contexts, and the list of interpolation node counts:



for (int i = 0; i < heredocNodesCount; ++i) {
    ctx.removeLastChild();
}
clear();


private void clear() {
    CONTEXTS.clear();
    heredocRulesCount.clear();
}


The final touch is to remove statementWithoutHeredocTail, an intermediate rule needed only for heredoc handling, by re-hanging the node's children onto its ancestor, using the same listener technique:



public class RemovingRulesListener implements ParseTreeListener {
    private List<Integer> unwantedRules;

    ...

    @Override
    public void exitEveryRule(final ParserRuleContext ctx) {
        if (this.unwantedRules.contains(ctx.getRuleIndex())) {
            final ParserRuleContext parentCtx =
                    (ParserRuleContext) ctx.getParent().getRuleContext();
            parentCtx.children.remove(ctx);
            ctx.children.forEach(
                    child -> {
                        if (child instanceof RuleContext) {
                            ((RuleContext) child).setParent(parentCtx);
                            parentCtx.addChild((RuleContext) child);
                        } else if (child instanceof TerminalNode) {
                            ((TerminalNodeImpl) child).setParent(parentCtx);
                            parentCtx.addChild((TerminalNodeImpl) child);
                        }
                    }
            );
        }
    }
}
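The rule-removing listener is attached the same way as the token-removing one, before parsing, so the intermediate rule disappears as the tree is built. A hedged sketch (the RULE_ constant follows ANTLR's generated naming convention; the constructor is assumed analogous):

parser.addParseListener(new RemovingRulesListener(
        Collections.singletonList(RubyParser.RULE_statementWithoutHeredocTail)));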


Ambiguity



An unsolved problem remains: distinguishing function calls from, for example, ordinary addition (as well as any symbol that is both a unary and a binary operation), which can only be solved by using symbol tables at runtime.



The bottom line is that on an input like a +a +a +a..., at each step there can be either ordinary addition or a call of a function without arguments (although in that case Ruby requires no space between the sign and the first argument), which apparently leads to exponential growth when walking the prediction graph.



However, simply forbidding ANTLR a space after a unary operator in this context will not work: we would have to manually rewrite the left-recursive expression rule, which ANTLR otherwise resolves automatically. Relying on the fact that "nobody" writes more than thirty terms in a row without a reason, the problem goes away.



Conclusion



Experimentally, this grammar can parse 99% of files.



For instance, aws-sdk-ruby, containing 3024 Ruby files, fails on only seven; fastlane, containing 1028, on two; and Ruby on Rails, containing 2081, on 19.



However, some fundamentally painful cases remain, such as heredocs included in an expression:



option(:sts_regional_endpoints,
  default: 'legacy',
  doc_type: String,
  docstring: <<-DOCS) do |cfg|
Passing in 'regional' to enable regional endpoint for STS for all supported
regions (except 'aws-global'), defaults to 'legacy' mode, using global endpoint
for legacy regions.
  DOCS
  resolve_sts_regional_endpoints(cfg)
end


or blocks of any kind used as expressions:



def test_group_by_with_order_by_virtual_count_attribute
    expected = { "SpecialPost" => 1, "StiPost" => 2 }
    actual = Post.group(:type).order(:count).limit(2).maximum(:comments_count)
    assert_equal expected, actual
end if current_adapter?(:PostgreSQLAdapter)


I hope the techniques described in this article will help you cope with the difficulties of your own grammar. If you think any of these problems can be solved better, welcome to the comments.



Author: Fedor Usov, developer of Solar appScreener


