Grammar bug?

Dec 5, 2011 at 12:11 PM
Edited Dec 7, 2011 at 12:08 PM

Hi !

I'm writing a compiler for a PASCAL-like language. When I tried to
parse a program containing a CASE-statement I got a syntax error
where there isn't any. The grammar is quite big so I've included
only those parts that I think are relevant to the problem.

 

subrange        : expression DOTDOT expression 
                      ;

statement_list  : statement ';'  
                     | statement_list statement ';'   
                     ;

statement       : CASE expression OF
                           case_element_list
                           opt_default_stat
                       END_CASE
                     | assign_stat
                     | . . .
                     | . . .
                     | . . .
                     | /* other statements */
                     ;

case_element  : case_label_list ':' statement_list 
                     ;

case_element_list 
                     : case_element               
                     | case_element_list case_element
                     ;

case_label      : expression
                     | subrange 
                     ;

case_label_list : case_label
                     | case_label_list ',' case_label          
                     ;

opt_default_stat
                     : ELSE statement_list
                     | ELSE error
                     |
                     ;


 

Here is a piece of the sample program I tried to parse:

    CASE aColor OF
      BLUE, RED, BLACK..SILVER :
        value := 3.0*a**2 - 2.0*b**2;
      YELLOW, ORANGE:
        value := 2.0*a**2 + 3.0*b**2;
    END_CASE;

The first case_element is parsed correctly, but when the parser reads
the comma after YELLOW in the second case_element I get a syntax error.
The rule has one Shift/Reduce conflict on input symbol IDENT. If there is a
bug in the grammar, I just can't see it. Any idea on what is causing the
problem?

Coordinator
Dec 7, 2011 at 10:05 AM

hi cybernaut

I'll have a look at it in the next few days.  It is possible that it is a bug, but possibly also a grammar problem.

John

Dec 7, 2011 at 1:04 PM

Ok, thank you very much.

It seems that it's only when case_label_list starts with an identifier
that there is a syntax error (after the first case_element).  However,
if the list starts with an integer, or integer subrange, there is no
syntax error.

Coordinator
Dec 8, 2011 at 6:22 AM

Hi again

OK, the issue is a simple grammar bug.  GPPG parsers must work with only a single symbol of lookahead.  So consider the following phrase ...

CASE foo OF blah: a := b; blah2

Now at this state one whole case element may be complete. That is the case if the next token is ',' or ':'.  That isn't the case if the next token is (say) ":=", or '('. So the issue is ... in this grammar how do you know when a case element ends and a new case element starts?  Note that for the four examples I have given TWO tokens of lookahead are enough.  However what if you have a.b.c.d.e.f(blah, blah blah).  There is potentially an unbounded lookahead required.

This is why most pascal grammars require a distinguishing mark, such as '|' to mark the separation, and ANSI C (in switch statements) keep needing "case".

So the solution is to introduce a marker token to that a single lookahead is enough.  While you are at it, you might find is useful to allow for empty case_elements, and even empty case_element_lists.  I know it sounds crazy, who would write a case statement with zero cases?  Well it turns out that lots of programs that automatically generate source code from other input produce unnatural code like this.  I have been caught!

Hope this helps, good luck with the project.

John

Dec 8, 2011 at 11:43 AM

Thanks again for your help, John.

I suspected that was the problem. Using a special token to mark the
end of a case_element solves the problem. Unfortunately, I'm not really
allowed to do that because the language I'm working on is Structured Text,
which is one the languages supported by the IEC 61131-3 standard. I've read
that it is possible to rewrite an LR(k) grammar so that it is LR(1), but
I'm not sure how to do that. The easiest thing to do is ignore the standard
and add that special token.

Coordinator
Dec 11, 2011 at 3:41 AM

Hi cybernaut.

Well, putting in a divider symbol is best if you own the grammar.  However, if that is not the case there are a couple of tricks you might consider. Firstly, note that without the marker the grammar is still deterministic, it just requires a lookahead which does not have a constant bound.  If you just write some ad hoc code that reads symbols ahead until you find a symbol that settles the question it is easy.  In order to do this though you need to be able to push symbols back into the scanner.  Hanspeter Moessenboeck talks about how to do this with the latest version of COCO/R.  Worth a look.  In fact using COCO/R would be one attack on the problem, given that COCO/R is designed to do this kind of trick.  However, it means changing from bottom-up parsing to recursive descent.

The other attack would be to hack on the scanner to allow push back of symbols, and do something like the following:

You need to hack the grammar a bit so that there is a semantic action that you can call every time you complete a  statement recognition inside a case_element.  The semantic action sucks up tokens from the scanner until it can tell whether it is the real end or not. Either way ALL of the symbols get pushed back, but if your code figured that it really was the end of the case_element you push back a dummy marker symbol on top. 

If you really want to go down this path let me know, I think that I can figure out a grammar hack to do it, but I'll never know unless I try.

John

Dec 11, 2011 at 8:01 PM

Hi John,

Hacking the scanner and grammar is probably the easiest thing to do because switching to
recursive descent parsing would require a lot more work. I think have a general idea
of how to make it work and I'll try to get started on it as soon as I can.

Dec 14, 2011 at 8:41 AM
Edited Dec 14, 2011 at 8:55 AM

I think I got something that works. It didn't require that much code.
I added the methods GetNextToken(), DoPushBack() and a prolog to
the scanner.

The method GetNextToken() reads a token from the input, which is
stored in the token buffer together with the current value of yylval
and yylloc.

public Tokens GetNextToken()
{
    TokenData token = new TokenData();
    token.Enum = (Tokens)this.yylex();
    token.Value = this.yylval;
    token.Location = this.yylloc;
    this.tokenBuffer.Add(token);
    return token.Enum;
}

public void DoPushBack(bool doPushDummyToken)
{
    if (doPushDummyToken)
        this.tokenBuffer.Insert(0, dummyToken);
    this.doPushBack = true;
}

I put this code in the scanner prolog. It sets yylval and yylloc to
the values stored in the token buffer and returns the corresponding
enumerated value.

%{
    // Scanner prolog
    if (this.doPushBack)
    {
        TokenData token = this.tokenBuffer[0];
        this.yylval = token.Value;
        this.yylloc = token.Location;
        this.tokenBuffer.RemoveAt(0);
        this.doPushBack = this.tokenBuffer.Count > 0;
        return (int)token.Enum;
    }
%}

In the grammar file I added the rule case_stat_list. When a statement is reduced
it calls the method CheckForEndOfCaseStatList(), which reads one or more tokens
(using GetNextToken()) to determine whether or not the end of case_stat_list has
been reached. The method then calls the DoPushBack() method with the flag
doPushDummyToken set accordingly.

case_stat_list  : statement ';'                    {this.CheckForEndOfCaseStatList();}
                    | case_stat_list statement ';' {this.CheckForEndOfCaseStatList();}
                    ;

I use the dummy token WHEN to mark the beginning of a case_element

case_element    : WHEN case_label_list ':' case_stat_list
                       | WHEN error ':' case_stat_list
                       ;


Thank you very much for your help, John. I wouldn't have been able fix this
problem without your assistance.

Coordinator
Dec 15, 2011 at 2:10 AM
Edited Dec 15, 2011 at 2:10 AM

Hi cybernaut.  Nice solution.  I blogged about this "issue" for LALR parser generators at http://softwareautomata.blogspot.com, otherwise known as John Gough on Software Tools.  I was planning to leave your approach for part 2.