<< PreviousNext >>

Runtime extensibility

It has been a while since I last uploaded any progress on my research here, so this time I want to show the current state of the compiler and the built-in language.

As I have explained earlier, when the compiler should compile some source files, it will look for a class in the type system named Reader placed in a package specific for the file extension of the source file. However, we need to be able to define some language in order to create this entry point for another language somewhere. Therefore the compiler has two languages built into it: syntax definitons and the basic language (currently called Basic Storm, bs). These two languages are implemented in C++ and then exposed to the type system of the compiler, so that the compiler can understand something and allow programmers to further extend the system by writing code in Basic Storm and other languages.

The Basic Storm language is implemented as a set of syntax rules (in a bnf-file), along with a number of C++ classes (which are also exposed to the compiler). Currently, the Basic Storm language is fairly competent and has even reached the point where it can be used to extend itself (and therefore possibly create other languages soon). Before we dive into how this can be used to extend itself, I would like to present the current state of Basic Storm.

Current state

The Basic Storm language currently has the most common constructs you would expect from a language, even though some things are likely still missing. The most exciting thing about this is that it is can interoperate more or less seamlessly with C++ if the C++ api:s is designed using what is supported by the language. At the moment, the biggest limitation is that it is not possible to interface with C++ templates (as they are generated compile-time, I would either have to instantiate all possible implementations at compile-time or be able to compile the C++ code later). Aside from that, as long as you follow the calling convention for the class types, and have made all types in the exported function calls visible to the compiler, you are good to go.

First, let's have a look at the class types. Class types work like in Java, where all classes implicitly inherit from Object (then the actual language may choose to hide this fact if it wants to). This means that the compiler will use the first few bytes of the object to store some type information along with a VTable and a reference count. Class types are always allocated on the stack and are reference counted. This means that you do not have to care about managing the objects lifetime as long as you do not create any cycles. In C++, you can use Auto<T> and Par<T> (for parameters) to manage the reference counting automatically. These also respect the calling conventions for the functions correctly. Basic Storm manages this automatically, but does not yet support weak references.

Since class types have an overhead for each allocation, they are not well suited for smaller types, for example 2d or 3d points. Therefore, the compiler also supports value types. These are in contrast to class types, always allocated on the stack and passed by value to functions. In order to make the value types compatible with C++, we need to know about copy constructors and assignment operators. One could argue that value types should not need any of those, and this is true in many cases, but as I need to handle the reference counting whenever you have a reference to a class type in a value, I thought it would be easier to implement full support from the beginning. This also means that I have to have support for destructors as well. It is worth to note that even though the compiler is fully aware of destructors, you can not yet declare a destructor in Basic Storm. I have simply not have the need for anything else than the default generated ones yet.

Here are some examples of declaring and using types in Basic Storm:

class Base {
    Int a;

    ctor(Int z) {
           // here, we can not access any members of this
           super() {
                      a = z;
           }
           // now, we can!
    }

    Int overload() {
        a;
    }

    Int sum() {
        a;
    }
}

class Derived extends Base {
    Int b;

    ctor (Int z, Int x) {
         super(z) {
                   b = x;
         }
    }

    Int overload() {
        b;
    }

    Int sum() {
        super:sum + b;
    }
}

This is a simple declaration of two classes, and should show fairly well how I have chosen to implement some important parts of the language. Firstly, I am currently using the : character to clarify the type. I would like to use ., just like member access later, but it was a lot easier to keep it separate for now. It is somewhat like C++, super::bar would be super:bar, and not super.bar.

Another thing you can see is how I have implemented initializer lists. In C++, you have to initialize your member variables the first thing you do in a constructor, but I have sometimes came across cases where I want to run some small amount of code before the initialization, but this is not trivially possible. To illustrate what I mean, let's look at this simple example:

class Foo {
public:
    int a, b;
    Foo(int z, int x) : a(x + z + 1), b(x + z - 1) {}
};

In this case it is not that bad, but imagine that x + z would be something more tedious to write, and that we have to initialize a and b in the initializer list (they are some class without default constructors). One way you could solve this would be with a factory method, like this:

class Foo {
public:
    int a, b;
    Foo create(int z, int x) {
        int t = x + z;
        return Foo(t + 1, t - 1);
    }
private:
    Foo(int a, int b) : a(a), b(b) {}
}

The downside here is that we can not use the standard initialization syntax elsewhere, and we have to spread out quite simple logic. Another way is to use a function to compute the complex expression, like this:

class Foo {
public:
    int a, b;
    Foo(int z, int x) : a(c(z, x) + 1), b(c(z, x) - 1) {}
private:
    inline int c(int z, int x) { return x + z; }
};

Now we can at least use the standard initialization syntax elsewhere, but it is still not as nice as I would like it to be. One thing I noticed about this is that it is actually possible to run code in the constructor before we completely initialize the object we are creating. Then, why am I not allowed to run regular code before this point? That is why I have implemented my initializer lists like a regular statement, so in Basic Storm, we can write:

class Foo {
    Int a, b;
    ctor(Int z, Int x) {
        Int t = x + z;
        super() {
            a = t + 1;
            b = t - 1;
        }
    }
}

Which is about what I want. Since this is a constructor, it is a little special. In the beginning of the constructor, we do not have an object and in the end we do. This is possibly the reason why C++ does not allow any code before the object is completely initialized, we do not want to run code as a member without acutally having an object (even if this is not entirely true in constructors all the time...) since it will have to behave differently. My solution here is to simply disable the use of the this variable before you have called super(), effectively preventing you from accessing the uninitialized object before it is initialized.

Now when I have introduced the lanugage a bit, let's have a look at the extensibility.

Extensibility

Recently, I really felt that the lack of an array type made it difficult to use, so I extended the type system so that every part of a name can take parameters, like this: lang(Int):foo(Int). This is then used to identify parameterized types, much like templates in C++. Even though I have not yet implemented a template system in Basic Storm, I can still implement the array type programmatically in C++ (should be possible in Basic Storm later as well). More or less I write some code that the compiler calls when it can not find a specific variant of the array and then I generate that type on the fly, much like the C++ compiler does with templates. With some clever design in C++, I do not even have to generate much machine code for each instance at all.

This means that arrays in Basic Storm looked like this:

Array<Int> foo;

Which is not too nice looking. Let's see what we can do to make it nicer, and how much of that we can do without re-compiling the Storm compiler!

The Basic Storm is, like I said earlier, implemented as code in C++ and as syntax rules in a bnf file. The C++ code will read the syntax and use the built-in parser to parse the file using the rules in the bnf file. These rules will in turn tell which functions to run to create the syntax tree, which can in turn be used to generate the finished machine code for the code that is being compiled. Therefore, let's have a look at the syntax file first:

// Name rule returns a String, created by taking the 'name' parameter.
Name => SStr(name) : "[A-Za-z][A-Za-z0-9]*" name;

// Part in a type name
TypePart => TypePart(n) : Name n;
TypePart => TypePart(n) : Name n, "<", Type -> add, (",", Type -> add, )* ">";

// Type.
Type => TypeName() : TypePart -> add, (":", TypePart -> add, )*;

First, let's have a look at the syntax of the bnf rules. The leftmost name is the name of the rule itself. In the above code we have three different rules: Name, TypePart and Type. The name after the double arrow (=>) is the function in Storm to call when we have found a match, using the parameters from after the colon (:). After the colon, the individual parts to be matched are listed separated by comma. These parts may be either a string (actually a regex) or the name of another rule to match. Each part is optionally followed by a variable name to bind it to (like in Name n), or an arrow with a member function name to call (like in Type -> add). Aside from this, we can group expressions with parens. Here, this is used to repeat the grouped parts zero or more times (as indicated by the *, just like in regex).

To illustrate this, let's look at what happens when we try to match the string Array<Int> to the rule Type: From the bottom up, we can see that Array and Int matches the regex inidcated in the rule Name, so they are both matched by that rule in the end. The rule TypePart matches either just a Name or a Name, then < followed by one or more comma-separated Types and then a >. A type may also be just a single TypePart, which is enough for this string.

In the end, we get:

Type          "Array<Int>"
\- TypePart   "Array<Int>"
 |- Name      "Array"
 \- Type      "Int"
  \- TypePart "Int"
   \- Name    "Int"

To understand how the syntax tree is produced, think of each rule as a function returning whatever is between => and :. Then we get the following three functions:

TypeName matchType(input) {
    TypeName me = TypeName();
    for each 'TypePart -> add' matched in input {
        // this is for the TypePart -> add part found.
        me.add(matchTypePart(input));
    }
    return me;
}

TypePart matchTypePart(input) {
    if (input matches 'Name n') {
        TypePart me = TypePart(n);
        return me;
    } else if (input matches 'Name n, "<", Type -> add, ... ">") {
        TypePart me = TypePart(n);
        for each 'Type -> add' matched in input {
            me.add(matchType(input));
        }
        return me;
    }
}

From this, we can see that the evaluation has three parts: 1: Evaluate any bound variables as needed (by calling the corresponding rule recursively). 2: Execute the result expression, assign it to the variable 'me'. 3: Evaluate expressions of the form X -> y and run me.y(X) for each of those.

For our example, we would end up with something like this:

TypeName result = TypeName();
TypePart p1 = TypePart("Array");
TypeName t1 = TypeName();
t1.add(TypePart("Int"));
p1.add(t1);
result.add(p1);

So, now to the question: How do we add nicer syntax for arrays?

Now, when we understand how the parser works, we can add a new option for the rule Type, since that rule is used everywhere we want to match a rule. For example, if we want to use the C-like Int[], we can add this option:

// Array type.
Type => arrayType(t) : Type t, "\[\]";

It will simply match another valid type, followed by [] (we need to escape it, otherwise it means the empty character set in regex). From what we saw earlier, this would execute something like the following code to build the syntax tree when we feed it with Int[]:

TypeName t = TypeName();
t.add(TypePart("Int"));
TypeName result = arrayType(t);

So, what does the function arrayType need to contain? Well, we need to do approximatly the same work as in the earlier example, when we parsed Array<Int>, so we can implement it like this:

TypeName arrayType(TypeName t) {
    TypePart array("Array");
    array.add(t);
    TypeName r;
    r.add(array);
    r;
}

Note that this is not implemented in C++, we can implement it in Basic Storm as long as we make sure to not use the syntax we are implementing in the same package. So with these two small modifications, we can add another more convenient way of defining arrays, and we do not even have to alter the compiler at all!

The same idea can also be used to implement easier initializing of arrays, for example using the syntax:

Int[] foo = [Int: 1, 3, 3, 7];

As you can see, this has some interesting potential, but it is not yet possible to implement a completely new language only from Basic Storm. The main reason for this is that not enough of the compiler is exposed to the typesystem in the compiler yet. This is mostly because the support of value types is quite recent, and the preprocessor I use to generate the information about the exposed functions has not entirely caught up yet. Aside from that, some internals have to be re-designed to be able to interact with the Basic Storm language. For example, some central functions in the compiler (function lookup for example) return borrowed pointers, which is not supported by Basic Storm. Other times, I am still using std::vector to functions. std::vector is not supported either, and instead I should use the Array type provided by the compiler now. Much of these are not too major, so it will be done eventually. However, the extensibility above shows an interesting way to use the interaction between the compiler itself and the code that has been compiled, and acts a little like a proof of concept for writing new languages within the compiler itself.

This also helped me uncover an interesting bug in the code generation. When I used boolean variables (or any type that was not a multiple of 4 bytes), i failed to properly round the stack allocation to a multiple of 4. As the x86 architecture does not complain a lot about misaligned memory accesses, this bug went on unnoticed until I tried to throw exceptions through any function that failed to align the stack. I noticed this when I was writing the function to initialize arrays. I made some mistake early on in that function that caused the called function to throw an exception. Since the stack was not properly aligned, the compiler crashed without much to say at that point. So when I removed lines of the function, seemingly completely different symptoms started appearing, which was very confusing. In the end, I found out the embarassing mistake. Where I was computing the sice of the stack allocations I had written:

roundUp(stackSize, sizeof(void *));

instead of

stackSize = roundUp(stackSize, sizeof(void *));

I guess I should take that as a lession in the importance of function naming! Maybe naming it roundedUp would have prevented the bug altogether!

Comments

New comment

You can use GitHub flavored markdown here. Parsed by Parsedown, which does not support all of GitHub's features. For example, specifying the language of code listings is not supported.

Name: