cc: Context-Sensitive Parsing of alignas

2023/09/14

This post demonstrates the ambiguity that arises when parsing the C23 AlignmentSpecifier alignas, and the context-sensitive nature of the parsing required to correctly parse the specifier.


The ambiguity:

The specifier is represented by the non-terminal AlignmentSpecifier in the C2x grammar:

    AlignmentSpecifier: alignas ( TypeName )
    AlignmentSpecifier: alignas ( ConstantExpression )

The ambiguity arises from the fact that both TypeName and ConstantExpression can derive an Identifier.

For e.g., in the declaration

    alignas(a) int b;

the variable b is declared to be of type int, with its address aligned at the result of the expression alignas(a).

The identifier a can be a TypedefName (TypeName derives TypedefName):

    typedef struct {
        long double ld;
    } a;

Or, it can be an identifier that represents a ConstantExpression (ConstantExpression derives PrimaryExpression derives Identifier):

    constexpr int a = 32;

Or, it can be an identifier that represents a ConstantExpression (ConstantExpression derives PrimaryExpression derives EnumerationConstant derives Identifier):

    enum {
        a = 32,
    };

The ambiguity in an LR(1) parser:

The collection of canonical-LR(1)-item-sets shows the reduce/reduce conflicts as described below.

print_item_set: item-set[   0]:k----------------------
[TranslationObject -> . TranslationUnit] jump=1
print_item_set: item-set[   0]:-----------------------
. . .
[AlignmentSpecifier -> . alignas ( TypeName )] jump=4025
[AlignmentSpecifier -> . alignas ( ConstantExpression )] jump=4025
. . .
print_item_set: item-set[   0]:done-------------------
print_item_set: item-set[4025]:k----------------------
[AlignmentSpecifier -> alignas . ( TypeName )] jump=4026
[AlignmentSpecifier -> alignas . ( ConstantExpression )] jump=4026
print_item_set: item-set[4025]:done-------------------
print_item_set: item-set[4026]:k----------------------
[AlignmentSpecifier -> alignas ( . TypeName )] jump=4027
[AlignmentSpecifier -> alignas ( . ConstantExpression )] jump=4029
print_item_set: item-set[4026]:-----------------------
. . .
[TypedefName -> . Identifier] jump=983
[PrimaryExpression -> . Identifier] jump=983
[EnumerationConstant -> . Identifier] jump=983
. . .
print_item_set: item-set[4026]:done-------------------
print_item_set: item-set[ 983]:k----------------------
[TypedefName -> Identifier .] las: ... ) ...
[PrimaryExpression -> Identifier .] las: ... ) ...
[EnumerationConstant -> Identifier .] las: ... ) ...
print_item_set: item-set[ 983]:done-------------------

The parser experiences reduce/reduce conflicts, as it lands into the item-set-#983 immediately after shifting the identifier a. The next available token in the input, and also the current look-head for the parser, is the punctuator ).


The ambiguity in an Earley parser:

In an Earley parser, the ambiguity manifests as two back-edges pointing to the completion of the non-terminal AlignmentSpecifier, corresponding to the two different ways the reduction to AlignmentSpecifier can be performed.

cc_token_print: CC_TOKEN_RIGHT_PAREN ')'
item-set[   4]:r[   5]----------------------
[   0]:[ALIGNMENT_SPECIFIER -> alignas ( TYPE_NAME ) .] (0)
[   1]:[ALIGNMENT_SPECIFIER -> alignas ( CONSTANT_EXPRESSION ) .] (0)
[   2]:[TYPE_SPECIFIER_QUALIFIER -> ALIGNMENT_SPECIFIER(4,0)(4,1) .] (0)
. . .

The context-sensitivity:

To determine exactly what the identifier a represents, the parser must consult the symbol-table as soon as it parses the identifier a. This also implies that the parser, as it parses the constructs, must keep the symbol-table up-to-date, so that relevant information is available to resolve any ambiguities.