Runtime extensibility
It has been a while since I last uploaded any progress on my research here, so this time I want to show the current state of the compiler and the built-in language.
As I have explained earlier, when the compiler should compile some source files,
it will look for a class in the type system named Reader
placed in a package specific
for the file extension of the source file. However, we need to be able to define some
language in order to create this entry point for another language somewhere. Therefore
the compiler has two languages built into it: syntax definitons and the basic language
(currently called Basic Storm, bs). These two languages are implemented in C++ and then
exposed to the type system of the compiler, so that the compiler can understand something
and allow programmers to further extend the system by writing code in Basic Storm and other
languages.
The Basic Storm language is implemented as a set of syntax rules (in a bnf
-file), along
with a number of C++ classes (which are also exposed to the compiler). Currently, the Basic
Storm language is fairly competent and has even reached the point where it can be used to
extend itself (and therefore possibly create other languages soon). Before we dive into
how this can be used to extend itself, I would like to present the current state of Basic Storm.
Current state
The Basic Storm language currently has the most common constructs you would expect from a language, even though some things are likely still missing. The most exciting thing about this is that it is can interoperate more or less seamlessly with C++ if the C++ api:s is designed using what is supported by the language. At the moment, the biggest limitation is that it is not possible to interface with C++ templates (as they are generated compile-time, I would either have to instantiate all possible implementations at compile-time or be able to compile the C++ code later). Aside from that, as long as you follow the calling convention for the class types, and have made all types in the exported function calls visible to the compiler, you are good to go.
First, let's have a look at the class types. Class types work like in Java, where all classes
implicitly inherit from Object
(then the actual language may choose to hide this fact if it
wants to). This means that the compiler will use the first few bytes of the object to store
some type information along with a VTable and a reference count. Class types are always allocated
on the stack and are reference counted. This means that you do not have to care about managing
the objects lifetime as long as you do not create any cycles. In C++, you can use Auto<T>
and
Par<T>
(for parameters) to manage the reference counting automatically. These also respect
the calling conventions for the functions correctly. Basic Storm manages this automatically,
but does not yet support weak references.
Since class types have an overhead for each allocation, they are not well suited for smaller types, for example 2d or 3d points. Therefore, the compiler also supports value types. These are in contrast to class types, always allocated on the stack and passed by value to functions. In order to make the value types compatible with C++, we need to know about copy constructors and assignment operators. One could argue that value types should not need any of those, and this is true in many cases, but as I need to handle the reference counting whenever you have a reference to a class type in a value, I thought it would be easier to implement full support from the beginning. This also means that I have to have support for destructors as well. It is worth to note that even though the compiler is fully aware of destructors, you can not yet declare a destructor in Basic Storm. I have simply not have the need for anything else than the default generated ones yet.
Here are some examples of declaring and using types in Basic Storm:
class Base { Int a; ctor(Int z) { // here, we can not access any members of this super() { a = z; } // now, we can! } Int overload() { a; } Int sum() { a; } } class Derived extends Base { Int b; ctor (Int z, Int x) { super(z) { b = x; } } Int overload() { b; } Int sum() { super:sum + b; } }
This is a simple declaration of two classes, and should show fairly well how I have chosen
to implement some important parts of the language. Firstly, I am currently using the :
character
to clarify the type. I would like to use .
, just like member access later, but it was a lot
easier to keep it separate for now. It is somewhat like C++, super::bar
would be super:bar
,
and not super.bar
.
Another thing you can see is how I have implemented initializer lists. In C++, you have to initialize your member variables the first thing you do in a constructor, but I have sometimes came across cases where I want to run some small amount of code before the initialization, but this is not trivially possible. To illustrate what I mean, let's look at this simple example:
class Foo { public: int a, b; Foo(int z, int x) : a(x + z + 1), b(x + z - 1) {} };
In this case it is not that bad, but imagine that x + z
would be something more tedious to write,
and that we have to initialize a
and b
in the initializer list (they are some class without default
constructors). One way you could solve this would be with a factory method, like this:
class Foo { public: int a, b; Foo create(int z, int x) { int t = x + z; return Foo(t + 1, t - 1); } private: Foo(int a, int b) : a(a), b(b) {} }
The downside here is that we can not use the standard initialization syntax elsewhere, and we have to spread out quite simple logic. Another way is to use a function to compute the complex expression, like this:
class Foo { public: int a, b; Foo(int z, int x) : a(c(z, x) + 1), b(c(z, x) - 1) {} private: inline int c(int z, int x) { return x + z; } };
Now we can at least use the standard initialization syntax elsewhere, but it is still not as nice as I would like it to be. One thing I noticed about this is that it is actually possible to run code in the constructor before we completely initialize the object we are creating. Then, why am I not allowed to run regular code before this point? That is why I have implemented my initializer lists like a regular statement, so in Basic Storm, we can write:
class Foo { Int a, b; ctor(Int z, Int x) { Int t = x + z; super() { a = t + 1; b = t - 1; } } }
Which is about what I want. Since this is a constructor, it is a little special. In the beginning
of the constructor, we do not have an object and in the end we do. This is possibly the reason why
C++ does not allow any code before the object is completely initialized, we do not want to run code
as a member without acutally having an object (even if this is not entirely true in constructors all the time...)
since it will have to behave differently. My solution here is to simply disable the use of the this
variable before you have called super()
, effectively preventing you from accessing the uninitialized
object before it is initialized.
Now when I have introduced the lanugage a bit, let's have a look at the extensibility.
Extensibility
Recently, I really felt that the lack of an array type made it difficult to use, so I extended
the type system so that every part of a name can take parameters, like this: lang(Int):foo(Int)
.
This is then used to identify parameterized types, much like templates in C++. Even though I have
not yet implemented a template system in Basic Storm, I can still implement the array type programmatically
in C++ (should be possible in Basic Storm later as well). More or less I write some code that the
compiler calls when it can not find a specific variant of the array and then I generate that type
on the fly, much like the C++ compiler does with templates. With some clever design in C++, I do not
even have to generate much machine code for each instance at all.
This means that arrays in Basic Storm looked like this:
Array<Int> foo;
Which is not too nice looking. Let's see what we can do to make it nicer, and how much of that we can do without re-compiling the Storm compiler!
The Basic Storm is, like I said earlier, implemented as code in C++ and as syntax rules in a bnf
file.
The C++ code will read the syntax and use the built-in parser to parse the file using the rules
in the bnf
file. These rules will in turn tell which functions to run to create the syntax tree,
which can in turn be used to generate the finished machine code for the code that is being compiled.
Therefore, let's have a look at the syntax file first:
// Name rule returns a String, created by taking the 'name' parameter. Name => SStr(name) : "[A-Za-z][A-Za-z0-9]*" name; // Part in a type name TypePart => TypePart(n) : Name n; TypePart => TypePart(n) : Name n, "<", Type -> add, (",", Type -> add, )* ">"; // Type. Type => TypeName() : TypePart -> add, (":", TypePart -> add, )*;
First, let's have a look at the syntax of the bnf rules. The leftmost name is the name
of the rule itself. In the above code we have three different rules: Name
, TypePart
and Type
.
The name after the double arrow (=>
) is the function in Storm to call when we have found
a match, using the parameters from after the colon (:
). After the colon, the individual parts
to be matched are listed separated by comma. These parts may be either a string (actually a regex)
or the name of another rule to match. Each part is optionally followed by a variable name to bind
it to (like in Name n
), or an arrow with a member function name to call (like in Type -> add
).
Aside from this, we can group expressions with parens. Here, this is used to repeat the grouped
parts zero or more times (as indicated by the *
, just like in regex).
To illustrate this, let's look at what happens when we try to match the string Array<Int>
to
the rule Type
:
From the bottom up, we can see that Array
and Int
matches the regex inidcated in the rule Name
,
so they are both matched by that rule in the end. The rule TypePart
matches either just a Name
or
a Name
, then <
followed by one or more comma-separated Type
s and then a >
. A type may also be
just a single TypePart
, which is enough for this string.
In the end, we get:
Type "Array<Int>" \- TypePart "Array<Int>" |- Name "Array" \- Type "Int" \- TypePart "Int" \- Name "Int"
To understand how the syntax tree is produced, think of each rule as a function returning whatever
is between =>
and :
. Then we get the following three functions:
TypeName matchType(input) { TypeName me = TypeName(); for each 'TypePart -> add' matched in input { // this is for the TypePart -> add part found. me.add(matchTypePart(input)); } return me; } TypePart matchTypePart(input) { if (input matches 'Name n') { TypePart me = TypePart(n); return me; } else if (input matches 'Name n, "<", Type -> add, ... ">") { TypePart me = TypePart(n); for each 'Type -> add' matched in input { me.add(matchType(input)); } return me; } }
From this, we can see that the evaluation has three parts:
1: Evaluate any bound variables as needed (by calling the corresponding rule recursively).
2: Execute the result expression, assign it to the variable 'me'.
3: Evaluate expressions of the form X -> y
and run me.y(X)
for each of those.
For our example, we would end up with something like this:
TypeName result = TypeName(); TypePart p1 = TypePart("Array"); TypeName t1 = TypeName(); t1.add(TypePart("Int")); p1.add(t1); result.add(p1);
So, now to the question: How do we add nicer syntax for arrays?
Now, when we understand how the parser works, we can add a new option for the rule Type
, since
that rule is used everywhere we want to match a rule. For example, if we want to use the C-like Int[]
,
we can add this option:
// Array type. Type => arrayType(t) : Type t, "\[\]";
It will simply match another valid type, followed by []
(we need to escape it, otherwise it means the empty
character set in regex). From what we saw earlier, this would execute something like the following code to
build the syntax tree when we feed it with Int[]
:
TypeName t = TypeName(); t.add(TypePart("Int")); TypeName result = arrayType(t);
So, what does the function arrayType
need to contain? Well, we need to do approximatly the same work
as in the earlier example, when we parsed Array<Int>
, so we can implement it like this:
TypeName arrayType(TypeName t) { TypePart array("Array"); array.add(t); TypeName r; r.add(array); r; }
Note that this is not implemented in C++, we can implement it in Basic Storm as long as we make sure to not use the syntax we are implementing in the same package. So with these two small modifications, we can add another more convenient way of defining arrays, and we do not even have to alter the compiler at all!
The same idea can also be used to implement easier initializing of arrays, for example using the syntax:
Int[] foo = [Int: 1, 3, 3, 7];
As you can see, this has some interesting potential, but it is not yet possible to implement a completely
new language only from Basic Storm. The main reason for this is that not enough of the compiler is exposed
to the typesystem in the compiler yet. This is mostly because the support of value types is quite recent,
and the preprocessor I use to generate the information about the exposed functions has not entirely caught
up yet. Aside from that, some internals have to be re-designed to be able to interact with the Basic Storm
language. For example, some central functions in the compiler (function lookup for example) return borrowed
pointers, which is not supported by Basic Storm. Other times, I am still using std::vector
to functions.
std::vector
is not supported either, and instead I should use the Array
type provided by the compiler now.
Much of these are not too major, so it will be done eventually. However, the extensibility above shows an
interesting way to use the interaction between the compiler itself and the code that has been compiled, and
acts a little like a proof of concept for writing new languages within the compiler itself.
This also helped me uncover an interesting bug in the code generation. When I used boolean variables (or any type that was not a multiple of 4 bytes), i failed to properly round the stack allocation to a multiple of 4. As the x86 architecture does not complain a lot about misaligned memory accesses, this bug went on unnoticed until I tried to throw exceptions through any function that failed to align the stack. I noticed this when I was writing the function to initialize arrays. I made some mistake early on in that function that caused the called function to throw an exception. Since the stack was not properly aligned, the compiler crashed without much to say at that point. So when I removed lines of the function, seemingly completely different symptoms started appearing, which was very confusing. In the end, I found out the embarassing mistake. Where I was computing the sice of the stack allocations I had written:
roundUp(stackSize, sizeof(void *));
instead of
stackSize = roundUp(stackSize, sizeof(void *));
I guess I should take that as a lession in the importance of function naming! Maybe naming it roundedUp
would have prevented the bug altogether!
Comments
New comment
You can use GitHub flavored markdown here. Parsed by Parsedown, which does not support all of GitHub's features. For example, specifying the language of code listings is not supported.