It's alive!
Since I just got the parser working properly in the compiler, I thought I would explain a little more in depth how the parsing works, and how it interacts with the rest of the compiler.
The first thing that happens when the compiler tries to make sense out of an input
string is to load one or more syntax files. The idea is that the compiler will look
at the file extension and load syntax from the lang.<ext> package and use the
information found in this package to parse the file. This is not automated yet,
so the exact structure is not fixed. It should for example be possible for
a language implementation to intervene before the parser is called, and optionally
implement the entire parsing itself instead of relying on the default implementation.
But like I said, this is something that is yet to be planned.
What happens next is that the input string is passed to the parser together with the loaded syntax. The parser then outputs an abstract syntax tree. What is unusual about my approach is that this abstract syntax tree is not exposed outside the parser itself; instead, the syntax rules are annotated with function calls. These functions will be called by the compiler to ultimately build a syntax tree out of user-defined types.
Type preprocessor
Since the runtime has to call functions defined in the syntax definitions, we need some way of looking up functions from a string. Unfortunately, C++ does not provide any reflection features that we can use to look up functions from strings. That is, we need a function like this:
void *lookupFunction(const string &str);
This can be implemented by having a big lookup-table somewhere. But it is tedious to keep this updated manually. Therefore, I have written a small pre-processor that generates this table for me based on which classes and functions are marked in the C++ source code. Then we can simply parse this table at program startup and fill in the type information in the internal representation of loaded packages.
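As a minimal sketch of that idea, the generated table could feed a map that backs lookupFunction. The names here (demo.add, functionTable) are made up for illustration; the real compiler fills its internal package representation from the table instead of using a flat map:

```cpp
#include <map>
#include <string>

// A function the hypothetical pre-processor would have found and registered.
int add(int a, int b) { return a + b; }

// Sketch of a generated table: fully qualified name -> function address.
// Casting a function pointer to void * is conditionally supported in
// standard C++, but works on mainstream compilers.
std::map<std::string, void *> functionTable = {
    { "demo.add", reinterpret_cast<void *>(&add) },
};

// The lookup function from the post, backed by the generated table.
void *lookupFunction(const std::string &str) {
    std::map<std::string, void *>::const_iterator it = functionTable.find(str);
    return it == functionTable.end() ? 0 : it->second;
}
```

Calling through the result is then just a matter of casting the void * back to the right function pointer type, which the compiler knows from the parameter list stored alongside the address.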
For example: the functions supporting the syntax in lang.simple look like this:
// Tells the pre-processor which package to place the types in.
STORM_PKG(lang.simple);
class SExpr : public Object {
    STORM_CLASS;
public:
    STORM_CTOR SExpr(Type *t);
};

class SScope : public Object {
    STORM_CLASS;
public:
    STORM_CTOR SScope(Type *t);
    void STORM_FN expr(SExpr *expr);
};

SExpr *STORM_FN sOperator(SExpr *lhs, SExpr *rhs, Str *op);
SExpr *STORM_FN sVar(Str *op);
SExpr *STORM_FN sNr(Str *op);
Here, STORM_PKG tells the pre-processor that all subsequent type and function declarations are to be placed in the package lang.simple. All these keywords are C++ macros that expand to nothing, or to some small piece of boilerplate. STORM_FN, for example, expands to __cdecl to ensure that the correct calling convention is used for the functions.
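As a rough illustration of how such marker macros work (the exact expansions here are assumptions, beyond what the post states about STORM_FN), they could be defined like this:

```cpp
// Hypothetical definitions of the annotation macros: most expand to
// nothing and exist only as markers for the pre-processor to find.
#define STORM_PKG(pkg)   /* marker only; read by the pre-processor */
#define STORM_CLASS      /* marker only */
#define STORM_CTOR       /* marker only */
#define STORM_FN         /* expands to __cdecl under Visual Studio */

STORM_PKG(lang.simple);

// A free function annotated like sVar and sNr above:
int STORM_FN twice(int x) { return x * 2; }
```

Since the macros vanish during compilation, the annotations cost nothing at runtime; only the pre-processor, which scans the source text, ever sees them.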
The pre-processor will then generate something like this:
{ Name(L"lang.simple"), L"SScope", Name(L"SScope"), L"__ctor", list(1, Name(L"Type")), address(&create1<storm::SScope>) },
{ Name(L"lang.simple"), L"SScope", Name(), L"expr", list(1, Name(L"SExpr")), address(&storm::SScope::expr) },
{ Name(L"lang.simple"), null, Name(L"SExpr"), L"sNr", list(1, Name(L"Str")), address(&storm::sNr) },
{ Name(L"lang.simple"), null, Name(L"SExpr"), L"sOperator", list(3, Name(L"SExpr"), Name(L"SExpr"), Name(L"Str")), address(&storm::sOperator) },
{ Name(L"lang.simple"), null, Name(L"SExpr"), L"sVar", list(1, Name(L"Str")), address(&storm::sVar) },
This is a simple initializer list for a struct containing the package, the name of the enclosing type (for member functions), the result type, the name of the function, the parameters to the function, and finally the pointer to the actual function to call.
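The struct those rows initialize is not shown in the post, but judging from the fields it could look roughly like this. This is a hypothetical reconstruction; the names and exact types in the real compiler are assumptions:

```cpp
#include <string>
#include <vector>

// Sketch of a qualified name such as L"lang.simple"; an empty Name()
// in the table above stands for "no type" (e.g. a void result).
struct Name {
    std::wstring path;
    Name() {}
    explicit Name(const std::wstring &p) : path(p) {}
};

// One row of the generated table.
struct FnEntry {
    Name pkg;                 // package the function lives in
    const wchar_t *memberOf;  // enclosing type, or null for free functions
    Name result;              // result type of the function
    const wchar_t *name;      // function name (L"__ctor" for constructors)
    std::vector<Name> params; // parameter types, in order
    void *fn;                 // address of the function to call
};
```

At startup, the compiler can walk an array of these entries and attach each function to the right package and type.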
An interesting note about function pointers in C++ is that pointers to member functions cannot be cast to void * without some tricks. In Visual Studio, it actually works to cast the pointer to a void *& first. Don't ask why... Anyway, that is exactly what the address function does.
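For compilers where the void *& trick is unavailable, one way to sketch such a helper (hypothetical; this is not the implementation the post describes) is to copy the raw bytes of the member pointer:

```cpp
#include <cstring>

struct Dummy {
    int f(int x) { return x + 1; }
    int g(int x) { return x - 1; }
};

// Sketch of an 'address'-style helper: copy the first machine word of the
// member-function pointer. On common ABIs this word holds the code address
// for non-virtual members, but the standard gives no such guarantee --
// which is exactly why tricks like this are needed at all.
template <class M>
void *memberAddress(M member) {
    void *result = 0;
    std::memcpy(&result, &member, sizeof(void *));
    return result;
}
```

Note that a pointer to a member function can be larger than a void * (for example, to carry virtual-dispatch or this-adjustment information), so copying only the first word is itself an assumption that holds for simple, non-virtual members.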
The parser
Now that we have a way of calling functions, we can start to examine the connection with the actual grammar. Let's consider the following grammar:
DELIMITER => void : " *";
// Expression
SExpr => sOperator(lhs, rhs, op) : SExpr lhs, "\+" op, SExpr rhs;
SExpr => sOperator(lhs, rhs, op) : SExpr lhs, "-" op, SExpr rhs;
SExpr => sVar(v) : "[a-z]+" v;
SExpr => sNr(v) : "[0-9]+" v;
// Root rule with repetitions.
Root => SScope() : "{", (SExpr -> expr - ";",)+ - "}";
And say we want to parse the expression "{ a + b; }", which will produce this
abstract syntax tree:
Root => SScope (
    expr -> SExpr (
        lhs -> SExpr (
            v -> a
        ),
        op -> +,
        rhs -> SExpr (
            v -> b
        )
    )
)
The next step is to take another look at the syntax rules and call the functions present there. In this step, think of each rule as a function that will return some value. Each option is then one possible implementation of the function, and which one is used is determined by the result of the pattern matching.
For the root element, the syntax option reads:
Root => SScope() : "{", (SExpr -> expr - ";",)+ - "}";
The part => SScope() : means that the return value of the function is the result of the expression SScope(). Expressions here are extremely simple; they may only contain function calls or constructor calls. In this case we call the constructor of SScope, as seen above, with no parameters (the Type is added automatically, as that is an implementation detail). This object is then bound to the variable me during the rest of the evaluation of the rule.
Since the Root rule contains SExpr -> expr, this means that for each SExpr we match, we should call me.expr(v), where v is the result of the SExpr match. So first we have to evaluate that!
The option in question reads:
SExpr => sOperator(lhs, rhs, op) : SExpr lhs, "\+" op, SExpr rhs;
In the same way as for the root rule, this means that we should call sOperator(lhs, rhs, op) to find the result of this rule. In this case, however, we also have parameters, and these have to be evaluated before we can call sOperator. The op parameter is simple: it is just the matched string. But lhs and rhs have to be evaluated first. These match simple rules, so we just have to call sVar(v), where v is the string found (a and b in this case).
When we have evaluated lhs and rhs, we can finally call sOperator, and since this rule does not contain any ->, we return the result of sOperator back to the root rule, which can call me.expr(v) and return to whoever called the parser.
This parsed string is equivalent to the following code:
SScope me = new SScope();
me.expr(
    sOperator(
        sVar("a"),
        sVar("b"),
        "+"
    )
);
What is interesting about this is that you can use it to write programs that are executed when your program is compiled. At the moment no code can be generated, though, so you can only run code during compile time! Not very useful, but quite cool!
Comments
Alex Telon
2014-10-24 08:23 (UTC)
Just woke up and this is the first thing I read so I definitely have to read it again later to grasp all the details :P
I will have to ask more questions later but is there any specific reason as to why you did it this way? I mean by not exposing the syntax tree.
Filip Strömbäck
2014-10-24 10:04 (UTC)
The main reason for "hiding" the syntax tree is that I think it is easier to write clean and extensible code for language-specific logic when you do not have to start by figuring out what kind of syntax node we have. Consider the following two options for a rule:
Expr => assignment(lhs, rhs) : Expr lhs, "=", Expr rhs;
Expr => addition(lhs, rhs) : Expr lhs, "\+", Expr rhs;
Here, we can simply implement the two different functions as usual. But if the rules were not annotated, we would have to write code like this somewhere:
Expr *createExpr(SyntaxNode &node) {
    if (node is assignment) {
        return assignment(node.lhs, node.rhs);
    } else if (node is addition) {
        return addition(node.lhs, node.rhs);
    }
    // ...
}
This has three large disadvantages:
- How should we implement the node is assignment parts? How shall we identify each separate rule? We could add an identifying string for example, but then I think it is easier to just add the function we should call, like I have done. There may be other good alternatives here, but I have not come up with any yet.
- We need to manually decide which type the syntax node is by writing if-else statements somewhere, which is tedious! Some functional languages have solved the problem of having to write long if-else statements in the beginning of functions by introducing pattern matching on the parameters, and since we're already doing pattern matching on the syntax, why not go all the way?
- It is not easy to extend the Expr rule. To do that, the original author has to come up with some mechanism for extending their language. By providing "handlers" together with the syntax, you can easily extend all syntax the same way.
Actually, having the function calls inline with the syntax is not really needed, but I think it illustrates the option and its implications quite clearly. An even earlier idea was to have rules in a form like this:
Rule : Expr a, "=", Expr b {
    // code here
    return r;
}
But that would require really complicated semantics to support the two evaluation orders present in the current model. However, implementing this kind of syntax rule should actually be possible by adding a language that does this. The current mechanism for loading syntax will not work then; it assumes that all syntax definitions are located in .bnf files, which is a bad assumption, since someone may want to extend the syntax definition language somehow!
Yes, I agree that the syntax system turned out to have somewhat complicated semantics, but I hope the example will at least clarify it somewhat! Once you grasp it, I think it is quite a simple core concept. The rules are really just functions operating on a string!
Alex Telon
2014-10-24 23:17 (UTC)
Just woke up and this is the first thing I read so I definitely have to read it again later to grasp all the details :P
I will have to ask more questions later but is there any specific reason as to why you did it this way? I mean by not exposing the syntax tree.
Filip Strömbäck
2014-10-25 04:39 (UTC)
I've got the feeling that I have seen that message before... But according to the timestamp, it was not morning when you made that post!