
Code generation

Since I introduced the type system inside the compiler itself, I have been working towards actually implementing a small language, to better understand what needs to be in place. My current idea is to implement a small language in C++, which I will later use to implement the real Storm language. The main benefit of implementing the main language in a more basic language is that I can be sure that everything needed is properly exposed to the runtime. Aside from that, I will be able to take advantage of the compiler itself much earlier. The downside is of course that if I later decide to simplify the simple language, I will have to rewrite the implementation of the main language as well. I am not entirely sure that I will do it this way yet, but it seems like an interesting idea.

I have just been able to compile my first function in the simple language (currently called Basic Storm). It is a really simple function that just returns a number, but even a function like that relies on surprisingly many internals of the compiler. The first step is to identify that the source file actually contains a function. Then the compiler needs to figure out what machine code to generate, which includes keeping track of types in the machine code itself.

The source file that was properly compiled looks like this:

Int fn() {
    3;
}

It may look a little strange at first sight, but if you are familiar with a LISP dialect, think of it as (defun fn () 3) instead, and it should make sense. What I have done, mostly as an experiment, is to not differentiate between expressions and statements. Instead, everything is an expression, even blocks. By default, a block returns the value of its last expression (just like progn in LISP). I will later add some kind of return, both for blocks and for functions. An interesting consequence of this is that expressions like 1 + { 4; 5; } * 2, or even 2 + if (i == 3) { 3; } else { 4; }, become legal. The first expression returns 11, since the block returns its last expression (5). The second expression returns either 5 or 6, depending on the value of i. I have not yet implemented all of these constructs, and most importantly, there is no support for variables yet. But I think that should be fairly easy to add from here.

Implementation

The implementation of a language starts with something I call a package reader. A package reader is an object that reads the source files of a language and provides their contents to the Storm compiler. When the compiler examines a package, it looks at the file extensions in the associated directory and loads each set of files with the package reader found in lang.<ext>.Reader. The reader is handed the files it is supposed to examine and is then left on its own. For the Basic Storm language, the package reader does nothing more than parse each file independently using the syntax definitions.
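To make this concrete, here is a rough sketch of what such a reader could look like. The names (PkgReader, readAll, the use of plain strings for paths) are made up for this example and do not match the compiler's actual interface; the point is only the responsibility: take a set of files and populate a package with whatever they declare.

#include <string>
#include <utility>
#include <vector>

class Package; // the package being populated; defined elsewhere in the compiler

// Hypothetical sketch of a package reader. The real interface differs.
class PkgReader {
public:
    // The compiler hands the reader all files in the package that have the
    // extension this reader was registered for (via lang.<ext>.Reader).
    PkgReader(std::vector<std::string> files, Package &into)
        : files(std::move(files)), into(into) {}
    virtual ~PkgReader() {}

    // Examine the files and add whatever they declare to the package. The
    // Basic Storm reader would implement this by parsing each file
    // independently using the grammar shown below.
    virtual void readAll() = 0;

protected:
    std::vector<std::string> files; // source files this reader is responsible for
    Package &into;                  // the package the declarations belong to
};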

So the next step is to run the file contents against a set of BNF rules to find the functions in the file. This is currently done with these rules:

DELIMITER => void : "[ \n\r\t]*";
DELIMITER => void : "//.*\n";

// Find a name (exercise: why is [A-Za-z].* a bad idea?)
Name => Str(name) : "[A-Za-z]+" name;

// Package
Package => Pkg() : Name -> add - ("\." - Name -> add)*;

// Type
Type => TypeName(name) : Name name;
Type => TypeName(pkg, name) : Package pkg, "\.", Name name;

// Skip a block, assuming matching {} and well-behaved strings.
SkipBlock => cont : ( SkipContent ) cont;
SkipContent => void : SkipContent, SkipContent;
SkipContent => void : "[^{}\"']*";
SkipContent => void : "{", SkipContent, "}";
SkipContent => void : "\"" - StrContent - "\"";
SkipContent => void : "'[^']'";
SkipContent => void : "'\\.'";

StrContent => void : "[^\"\\]*";
StrContent => void : StrContent - "\\." - StrContent;

// Parameter list (possibly empty)
Params => Params() : DELIMITER;
Params => Params() : Type -> add, Name, (",", Type -> add, Name, )*;

// Function
Function => FunctionDecl(pos, name, result, params, contents) :
   Type result, Name name, "(", Params params, ")", "{", SkipBlock contents, "}";

// Root rule, used to parse an entire file.
File => Contents() : (Function -> add, )*;

The root rule here is File, which will look for any number of functions in the file. The runtime will create a Contents object for the file, and add() a number of FunctionDecl objects to it. We can also see that FunctionDecl is passed the result of the rule SkipBlock. SkipBlock does something new: it uses what I call capturing. In this case, it means that cont will contain the string matched between ( and ). The rule SkipContent is then in charge of matching brackets and ignoring strings, so that we extract entire blocks from the source text. For the source file above, the rule will return the string 3;.

Using this information, we can hand over these function definitions to the compiler. These functions are all loaded lazily, which in this case means that they are compiled the first time they are executed. How do we solve this smoothly? I do not want to have to do anything special for lazy compilation to work, since I will most likely forget to do it at some point. My solution relies on the fact that it is possible to move functions in memory using the code generation backend I have written previously. This works because whenever the backend writes machine code containing an address, it also keeps track of what that address points to, so that all pointers can be updated when their target is moved. Of course, this means that I cannot store a raw pointer to such a function from C++, as raw pointers are unknown to the referencing system and will therefore not be updated.

Using this system, we can generate a stub function that looks like this:

prolog
push <pointer to function>   ; pass the Function object as a parameter
call Function::load          ; compile the function, the code's address ends up in eax
epilog
jmp eax                      ; jump to the freshly compiled code

This assumes that the Function class has a load function that compiles the function and then returns the address of the compiled code. The compilation process will also update any remaining pointers to the stub, so that they point to the newly generated machine code. This means that aside from the compilation itself, there will be no runtime overhead on later calls to the function. Of course, it should later also be possible to force the compiler to generate any lazily compiled code ahead of time.
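As a rough illustration (not the compiler's actual code), load could be structured along these lines. compileBody is a hypothetical helper; the interesting part is the order of the steps.

// Hypothetical sketch of lazy loading; the real Function class differs.
class Function {
public:
    // Called from the stub the first time the function is executed.
    // Compiles the body and returns the address of the generated code.
    static const void *load(Function *fn) {
        // 1. Parse the stored body (the string captured by SkipBlock).
        // 2. Generate machine code for it (see the Expr classes below).
        // 3. Let the backend redirect every pointer that referred to the
        //    stub, so that it points at the new code instead.
        // 4. Return the address of the new code; the stub jumps to it.
        return fn->compileBody();
    }

private:
    const void *compileBody(); // steps 1-3, not shown in this sketch
};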

To compile the function itself, we first parse the contents using some additional rules:

// Expression
Expr(Block block);
Expr => intConstant(nr) : "[0-9]+" nr;
Expr => Block(block) : "{", (Stmt(me) -> expr, )*, "}";

// Statement
Stmt(Block block);
Stmt => expr : Expr(block) expr, ";";
Stmt => Block(block) : "{", (Stmt(me) -> expr, )*, "}";

// Root rule for parsing function bodies.
FunctionBody => FnBody() : (Stmt(me) -> expr, )*;

You can see that I have separated statements and blocks in the syntax definition of a function. This is only to handle semicolons properly: a statement is either an expression followed by a semicolon, or a block (without a semicolon). We could also write { 3; };, which would use the expression form of a block instead. Internally they are treated alike, since the only thing the first Stmt rule does is forward the return value of the Expr rule.

These rules result in a nice tree of information. But how do we turn that into machine code? My solution is to have a base class Expr that provides two functions for this purpose. First, it has the type function, which computes the type of that sub-expression. Second, it has the code(target) function, which generates the code for the expression, placing any result in target. If target is a special null value, no result is generated, since none is needed. These two functions are implemented in the two subclasses currently present: Constant and Block. The constant (returned by intConstant) simply returns the type of the constant and generates a single mov instruction. The block returns the type of its last expression, or void if it is an empty block. Its code generation is also easy: it just asks the contained expressions for their code in turn, asking all except the last one to not generate a result.
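As a sketch, the hierarchy could look roughly like this. Value, Variable, Listing, mov and the null-variable convention below are simplified stand-ins for the real types in the code generation library (their actual interfaces differ); only the two-function design follows the description above.

#include <cstddef>
#include <vector>

// Simplified stand-ins for the compiler's real types.
struct Value {                       // the type of an expression
    bool isVoid = true;
    static Value intType() { Value v; v.isVoid = false; return v; }
};
struct Variable { int id = -1; };    // id < 0 means "no result wanted"
struct Instr {};                     // one machine instruction
struct Listing { std::vector<Instr> instrs; };
Instr mov(Variable, int) { return Instr(); }
Listing &operator<<(Listing &l, Instr i) { l.instrs.push_back(i); return l; }

class Expr {
public:
    virtual ~Expr() {}
    virtual Value type() = 0;                            // type of this sub-expression
    virtual void code(Listing &to, Variable target) = 0; // emit code, result in 'target'
};

// An integer constant, as created by the intConstant rule.
class Constant : public Expr {
public:
    explicit Constant(int v) : value(v) {}
    Value type() override { return Value::intType(); }
    void code(Listing &to, Variable target) override {
        if (target.id >= 0)                  // only emit anything if a result is wanted
            to << mov(target, value);
    }
private:
    int value;
};

// A block: evaluates its expressions in order, the last one is the result.
class Block : public Expr {
public:
    Value type() override {
        return exprs.empty() ? Value() : exprs.back()->type(); // empty block => void
    }
    void code(Listing &to, Variable target) override {
        for (std::size_t i = 0; i < exprs.size(); i++) {
            bool last = (i + 1 == exprs.size());
            exprs[i]->code(to, last ? target : Variable()); // only the last produces the result
        }
    }
    std::vector<Expr *> exprs;
};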

Using this, we can implement the compilation of a function in the following steps:

  • Run the parser to get the syntax tree.
  • Check the type returned from the root node of the tree with the return type of the function.
  • Add a prolog and an epilog to the code returned by the root of the tree.
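Continuing the sketch above (and reusing its placeholder types), those three steps could be expressed roughly like this. parseBody is a hypothetical parser entry point, and the real code would of course use the library's prolog and epilog support shown below instead of returning a bare listing.

#include <stdexcept>
#include <string>

Block *parseBody(const std::string &body);   // hypothetical parser entry point

// Rough sketch of the three steps above; error handling and the real
// prolog/epilog generation are omitted.
Listing compileFunction(const std::string &body, Value declared) {
    // 1. Run the parser on the captured function body to get the syntax tree.
    Block *root = parseBody(body);

    // 2. Check the type of the tree against the declared return type.
    if (root->type().isVoid != declared.isVoid)
        throw std::runtime_error("return type mismatch");

    // 3. Generate the code, to be wrapped in a prolog and an epilog.
    Listing l;
    Variable result;
    result.id = 0;                            // where the block's result should end up
    root->code(l, result);
    return l;
}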

This is when I felt like I had produced something good. It is not too much work to implement more features of the language from here: just add a rule and another subclass of Expr, and implement it so that it generates the proper machine code. Writing the machine code is not too difficult either, since I have already written a library that generates machine code conveniently, helping with boring details like allocating space for local variables, generating function prologs and epilogs, managing blocks within a function, and handling exceptions (you can attach destructor functions to your variables that will be called if an exception is thrown). Aside from that, the library strives to hide the differences between at least x86 and x86-64 by providing a separate pointer size as well as supporting 64-bit values on a 32-bit CPU. So code generation looks like this:

Listing l;

// Parameter 1, size 4 (Int)
Variable param1 = l.frame.createParameter(4);
// Local variable, size 4 (Int)
Variable var1 = l.frame.createVariable(4);

l << prolog();
l << mov(var1, intConst(54));
l << add(var1, param1);
l << mov(eax, var1);
l << epilog();
l << ret();

So I just create some variables and parameters and then use them. Anyone who has worked with x86 assembler may notice that there is no single op-code that implements the add instruction used here. The generated instruction would look like this: add [ebp+<var1>], [ebp+<param1>], where <var1> and <param1> are the offsets of the two variables relative to ebp. Instead, the library understands this and expands it into two instructions:

mov ecx, [ebp+<param1>]
add [ebp+<var1>], ecx

It even keeps track of which registers are in use, so that it can choose a free register if one is available, or free one by temporarily pushing it onto the stack. The prolog and epilog also take care of saving any registers that are to be preserved across function calls. All of this makes code generation so much nicer to work with!
