Error recovery (as a way to implement automatic semicolon insertion)?

Feb 9, 2011 at 8:08 AM

Hi,

I am trying to create ecmascript'ish grammar using gplex and gppg however I am not sure how could I implement automatic semicolon insertion.

(This works OK): For the moment I have added support for restricted statements (continue, break, return, throw) by extending scanner, to return ';' when newline/EOF is encountered after continue/break/return/throw. I took similar approach for '++' and '--' (in this case I check for newline/EOF before the token, and if found I yyless(0) [to return -- or ++ later] and return ';').

(This work mostly - need more testing, but currently this seems to be OK): for statements continue, break, return, throw I have added similar rules to:

continue_statement
    :    KeyContinue ';'
    |    KeyContinue Identifier ';'
    |    KeyContinue error                  { if (!IsValidAutoSemicolon()) { YYABORT; } yyerrok(); }
    |    KeyContinue Identifier error    { if (!IsValidAutoSemicolon()) { YYABORT; } yyerrok(); }
    ;

The error token represents a location for semicolon, and if the next token is valid (newline, EOF or '}') then we assume that ';' was inserted and continue parsing.

(This does NOT work): I tried using the same reasoning for statements like (the semantic action does not executed):

expression_statement:
    :    expression_st ';'
    |    expression_st error                { if (!IsValidAutoSemicolon()) { YYABORT; } yyerrok(); }
    ;

However when I am testing this grammar part (currently I am using a simple test, of expression being an Identifier) I get unrecoverable syntax error (Parse returns false as no action that accepts "shift error" was found).

I am wondering if this is the intended behaviour? I have compiled a grammar showing the same behaviour:

lex:

%namespace TestError
%visibility internal

%%

"a"     return 'a';
"+"     return '+';
"*"     return '*';
";"     return ';';
"("     return '(';
")"     return ')';
"["     return '[';
"]"     return ']';


Y:

%partial
%namespace TestError
%start comp
%visibility internal

%left '+'

%%

comp
    :   slist
    ;

slist
    :   /* empty */
    |   slist stm
    ;

stm
    :   expr ';'
    |   expr error
    ;

expr
    :   secexpr
    |   expr '+' expr
    ;

primexpr
    :   literal
    |   primexpr '(' expr ')'
    ;

secexpr
    :   primexpr
    |   secexpr '[' expr ']'
    ;

literal
    :   'a'
    ;

%%

 


The input "a" [note missing ';'] generates error that "Syntax error, unexpected EOF, expecting '(', or ';', or '+', or error, or ')'". This however works fine if I use input: "a+a". It looks to be inconsistent behaviour (I assume this should be OK, as error recovery is guessing).

I was trying to figure out what is going on and it looks like, the error recovery routine tries to find parse table entry that would correspond to "shif error", however there is only entry for "reduce on error". I was trying to look for yacc documenation describing behaviour in case of error recovery and it seems that gppg follows it correctly, however I did not test that with original yacc tools.

Feb 9, 2011 at 11:47 AM

After having some thoughts I came to the following:
      Error recovery could be modifed to try any action involving "error" (instead of simply searching for "shift error" could try reducing as well), and only then to fallback to original behaviour of shifting error and discarding tokens.

As for the original issue: I think I will try hacking the ShiftReduceParser and change the default behaviour for Parse: if action==0 is encountered, then I will test whether I can use pseudo-Token AutoSemicolon to shift or reduce. Additionally I will modify my grammar to use AutoSemicolon in cases where I wanted to use token "error".

I don't think I will go for the first suggestion because I don't know what would be the implications of the change (should I file an enchancement for gppg?).

Marius

Coordinator
Feb 11, 2011 at 7:47 AM

Hi.  I have not had much of a chance to look at the details here, but here is a trick that I have used in the past. The idea is

Prod : Blah WeakSemi More ;

WeakSemi : /* empty */  { insert a virtual semicolon }

                | ';'                { /* default shift action */ }

                ;

Of course this can be tricky if you get a conflict on the weak semicolon non-terminal, but in that case the error action will be tricky also.

Cheers

John

Feb 11, 2011 at 11:24 AM

Currently I have it implemented as I mentioned in the second post.

Currently I have made the following changes:

ShiftReduceParser.cs:

...
#if TRACE_ACTIONS
                        Console.Error.WriteLine("Next token is {0}", TerminalToString(NextToken));
#endif
                    if (FsaState.ParserTable.ContainsKey(NextToken))
                        action = FsaState.ParserTable[NextToken];

ADDED: ---> CustomParserAction(ref NextToken, ref action, FsaState); // this is declared as protected virtual void CustomParserAction(ref int, ref int, State) {}
                }

                if (action > 0)         // shift
                {
...

Parser.cs:

        protected override void CustomParserAction(ref int nextToken, ref int action, QUT.Gppg.State state)
        {
            if (action == 0 && Scanner is Scanner)
            {
                Scanner s = (Scanner)Scanner;

                var token = (int)Tokens.AutoSemicolon;
                if (s.IsValidAutoSemicolonPosition
                    && state.ParserTable.ContainsKey(token)
                    && state.ParserTable[token] != 0)
                {
                    nextToken = token;
                    s.PushBack();
                    action = state.ParserTable[token];
                }
            }
        }

 

The idea is that when action == 0 (no valid action - error) I check whether we can automatically insert semicolon (ECMA standard specifies exact conditions for that), if we can, set NextToken to AutoSemicolon (I am not using real semicolon, because if would mess up the for statement) and ask the parser to return the current token back (so yylex returned the same token again).

This is hacky, however for the moment it works like a charm.

I considered using optional semicolon, however I didn't want to deal with conflicts thus tried going error route.

Marius