Extensible syntax

In the main page, I stated that I want to include custom syntax in Storm. Since this can cause all kinds of unmaintainable code, you may wonder why I still think it is a good idea. This is what I will try to explain here, along with some ideas about how to define the new syntax and semantics.

The main reason I want to include custom syntax in Storm is that I think domain-specific languages (DSLs) are sometimes incredibly useful, even though most large languages, like Java and C++, do not allow them. A DSL would be very useful in, for example, GUI creation. Currently, your options are to write code that positions your UI elements manually, or to write tons of code that sets up the correct hierarchy for you. Another option is to write your layout hierarchy in an external language, like XML (as on Android). I think it would be a lot easier to create a simple DSL inside your regular language that generates the code for creating your UI elements from a simple representation that is easy for the programmer to understand. This also simplifies the development environment in some cases, since an external resource compiler is no longer needed.

Aside from those use cases, simply extending the language to fit your needs in a clean way is also very pleasant sometimes. This is possible to some extent in C/C++ through macros. The problem with macros is that they look like functions in most cases, which can be confusing. Custom syntax could improve readability instead: since it is a new syntactic construction, it is not easily confused with something else.

Currently, the Storm compiler uses a variant of EBNF to describe the syntax to use. However, it is somewhat altered from the standard form to be more extensible. The main difference is the absence of the or-operator |. Instead, one defines another rule with the same return-type to accomplish the same effect. To tie these syntax definitions to some underlying implementation, the syntax definitions can be annotated with function calls.

A simple syntax file may look like this:

// Rule for the , delimiter.
DELIMITER => void : " *";

Name => String(name) : "[A-Za-z][^ ]*" name;
Operator => String(op) : "[\+*/-]" op;

// A block.
Block(Scope parent);
Block => Block(parent) : "{", (Statement(me) -> statement, ";", )* "}";

// A statement
Statement(Scope scope);
Statement => Block(block) : Block(scope) block;
Statement => Expression(expr) : Expression(scope) expr;
Statement => Declaration(type, name) : Name type, Name name;

// Expression
Expression(Scope scope);
Expression => Operator(lhs, op, rhs) : Expression(scope) lhs, Operator op, Expression(scope) rhs;
Expression => String(name) : Name name;
Expression => Number(nr) : "[1-9][0-9]*" nr;

This may look quite complex at first glance, since it contains quite a lot of information. First, let me introduce the names for things I have chosen (may not be standard).

Syntax token: Either a regular expression or a reference to another rule, indicates what is to be matched.
Option: A sequence of tokens, written as one line in the source file and ending with a ;. Similar to a rule in BNF, but since my version disallows the OR operator, each such line describes only one alternative of a rule, which is why I chose this name.
Rule: A named collection of options.
Type: Regular types in the Storm language.
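
To make the terminology concrete, here is a minimal sketch in Python (not Storm) of how tokens, options, and rules could relate to each other. The names `Regex`, `RuleRef`, `Option`, and `Grammar` are hypothetical and chosen only for illustration; they are not part of the Storm compiler. The sketch also shows how appending a second option to a rule takes the place of the missing or-operator.

```python
from dataclasses import dataclass, field
from typing import Union


@dataclass(frozen=True)
class Regex:
    """A token that matches a regular expression."""
    pattern: str


@dataclass(frozen=True)
class RuleRef:
    """A token that refers to another rule by name."""
    rule: str


Token = Union[Regex, RuleRef]


@dataclass
class Option:
    """One line of the syntax file: a result plus the tokens to match."""
    result: str
    tokens: list


@dataclass
class Grammar:
    """Maps each rule name to the options that have been added to it."""
    rules: dict = field(default_factory=dict)

    def add_option(self, rule: str, option: Option) -> None:
        # There is no or-operator: an alternative is added by appending
        # another option to the same rule, possibly from another file.
        self.rules.setdefault(rule, []).append(option)


grammar = Grammar()
grammar.add_option("Expression", Option("String(name)", [RuleRef("Name")]))
grammar.add_option("Expression", Option("Number(nr)", [Regex("[1-9][0-9]*")]))
```

After the two calls, the `Expression` rule holds two options, which play the role that `A | B` would play in standard EBNF.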

Each line has the following syntax:

<rule> => <result> : <token>, ...;

Here, rule is the name of the rule that this option is appended to, and result is either a function call or a constructor invocation of a Storm type. The result is what will eventually be considered the result of parsing this option. Finally, there is a list of tokens, separated by either a comma (,) or a dash (-). A comma indicates that the two tokens are separated by whitespace, as matched by the DELIMITER rule. A dash means that nothing is allowed between the two tokens.
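
The difference between the two separators can be sketched in Python (not Storm) as follows. This is only an illustration under simplifying assumptions: it handles regular-expression tokens only (no references to other rules), it uses Python's `re` module, whose dialect may differ from Storm's, and the function name `compile_option` is hypothetical. The `DELIMITER` pattern is the one from the example file, so whitespace after a comma is optional here.

```python
import re

DELIMITER = " *"  # same pattern as the DELIMITER rule in the example file


def compile_option(tokens, separators):
    """Join regex tokens into one pattern.

    separators[i] sits between tokens[i] and tokens[i + 1]:
    "," inserts DELIMITER between them, "-" inserts nothing.
    """
    if len(separators) != len(tokens) - 1:
        raise ValueError("need exactly one separator between each token pair")
    parts = [f"(?:{tokens[0]})"]
    for sep, token in zip(separators, tokens[1:]):
        if sep == ",":
            parts.append(DELIMITER)
        elif sep != "-":
            raise ValueError(f"unknown separator: {sep!r}")
        parts.append(f"(?:{token})")
    return re.compile("".join(parts))


# The same three tokens, once joined with commas and once with dashes.
spaced = compile_option(["[a-z]+", "=", "[0-9]+"], [",", ","])
tight = compile_option(["[a-z]+", "=", "[0-9]+"], ["-", "-"])
```

With these definitions, `spaced` accepts both `x = 42` and `x=42`, while `tight` accepts only `x=42`.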

To simplify repetition, regex-like syntax is used. Simply enclose the token(s) you want to repeat in parentheses, and after the closing parenthesis put one of *, ?, or +, which have the same meaning as in regular expressions. Currently, only one repeat is supported per option.
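
As a rough illustration of what the three repeat markers mean, here is a Python sketch (not Storm) that repeats a single regex token greedily and without backtracking. The helper name `match_repeat` is hypothetical, and a real parser would also have to handle groups of several tokens and references to other rules.

```python
import re


def match_repeat(token, text, repeat):
    """Match `token` repeatedly from the start of `text`.

    Returns (list of matched strings, remaining text), or None if the
    minimum number of repetitions was not reached.
    """
    limits = {"*": (0, None), "?": (0, 1), "+": (1, None)}
    low, high = limits[repeat]
    pattern = re.compile(token)
    found = []
    pos = 0
    while high is None or len(found) < high:
        m = pattern.match(text, pos)
        # Stop on failure, and also on empty matches to avoid looping forever.
        if m is None or m.end() == pos:
            break
        found.append(m.group())
        pos = m.end()
    if len(found) < low:
        return None
    return found, text[pos:]
```

For example, `match_repeat("[a-z];", "a;b;c;}", "*")` collects the three statements-like pieces and leaves `}` for the next token, much like the repeated part of the Block option.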

All tokens in an option can be bound to variables or passed to methods of the result object. To bind the matched part of a token to a variable, simply write the variable name after the token to be matched, like this: "[abc]" foo.

To pass it to a method, simply use an arrow like this: "[abc]" -> foo, which will invoke the foo method of the result object.
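
The two ways of using a matched token can be sketched in Python (not Storm) like this. Everything here is hypothetical: the `NameList` type, the `build_result` helper, and the tuple encoding of matches are made up for illustration, and the sketch assumes that all bound variables are available before the constructor runs, which the description above does not spell out.

```python
class NameList:
    """Hypothetical result type that collects names one at a time."""

    def __init__(self, title):
        self.title = title
        self.names = []

    def add(self, name):
        self.names.append(name)


def build_result(constructor, params, matches):
    """Create the result of one option from its matched tokens.

    matches is a list of (kind, target, text) tuples, where kind is
    "bind" for `token name` and "call" for `token -> method`.
    """
    # Bound variables are collected first, so they can be passed to the
    # constructor named in the result part of the option (an assumption).
    bound = {target: text for kind, target, text in matches if kind == "bind"}
    result = constructor(*(bound[p] for p in params))
    # Arrow tokens are then delivered one by one, in match order.
    for kind, target, text in matches:
        if kind == "call":
            getattr(result, target)(text)
    return result


result = build_result(
    NameList,
    ["title"],
    [("bind", "title", "fruits"), ("call", "add", "apple"), ("call", "add", "pear")],
)
```

Here `title` reaches the result through the constructor, while each name arrives through a separate call to `add`.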

Each rule may also take one or more parameters. These are declared separately, in this case always directly above the first option of the rule. This simply means that invoking the rule requires some extra information. The parameters may be used freely in all options of the rule, like regular variables.

The execution order of these methods and function calls is sometimes important, which is why there is more than one way of doing it. The function call could be skipped, but that requires the creation of one Storm type per rule (or maybe even per option!), since each option will call specific methods on the object. However, the function call alone is not enough in all cases, specifically when implementing blocks.

When implementing a block, we want to keep track of what we have found so far inside the block, so that we can look at the types of expressions in the block. If we were to use only the function/constructor call, we would receive all child rules at once, already processed. We would then be unable to tell each child what is inside its scope, so the child rules could not check the types of their subexpressions. This is where the second alternative differs: it calls a member function once for each child rule in turn, so that you can report different scope information to each of them. To clarify why this is needed, consider the following example:

{
    int a;
    a + a;
}

To see that the addition in the second statement is legal, we need to know that the previous statement was a variable declaration. When parsing this, the Block rule starts by running the Block(parent) constructor, and then the statement method is called for each statement. Since the statement method is given the result specified in a Statement option, we have to execute the function in that option first. In this case, the first statement runs Declaration(type, name). When this value is passed to the statement method, we can see that a variable has been declared, and remember that for when the next statement is created. In this case it is Operator. If Operator wants to, it can freely look at the scope information to judge whether the addition is legal and act accordingly.
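
The order of events for this example can be sketched in Python (not Storm) as follows. The classes mirror the names used in the syntax file, but their contents are assumptions made for illustration: operands are plain variable names, `Block` is modelled as a kind of `Scope` (suggested by `Statement(me)` in the Block option), and an undeclared variable is reported with Python's `NameError`.

```python
class Scope:
    """Variables visible at some point in the program."""

    def __init__(self, parent=None):
        self.parent = parent
        self.variables = {}

    def declare(self, name, type_name):
        self.variables[name] = type_name

    def lookup(self, name):
        # Search this scope first, then the enclosing ones.
        scope = self
        while scope is not None:
            if name in scope.variables:
                return scope.variables[name]
            scope = scope.parent
        return None


class Declaration:
    def __init__(self, type_name, name):
        self.type_name = type_name
        self.name = name


class Operator:
    def __init__(self, scope, lhs, op, rhs):
        # The scope already contains every declaration seen so far.
        for operand in (lhs, rhs):
            if scope.lookup(operand) is None:
                raise NameError(f"undeclared variable: {operand}")
        self.lhs, self.op, self.rhs = lhs, op, rhs


class Block(Scope):
    def __init__(self, parent):
        super().__init__(parent)
        self.statements = []

    def statement(self, stmt):
        # Called once per child, in source order.
        if isinstance(stmt, Declaration):
            self.declare(stmt.name, stmt.type_name)
        self.statements.append(stmt)


# Mirrors "{ int a; a + a; }": constructor first, then one call per statement.
block = Block(None)
block.statement(Declaration("int", "a"))
block.statement(Operator(block, "a", "+", "a"))
```

Because the declaration is reported to the block before the `Operator` is created, the lookup of `a` succeeds. Had all statements been created before the block saw any of them, the same check would have failed.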

Comments

AlexTelon

2014-10-14 06:29 (UTC)

So if you want to have several Storm-EBNF (or whatever I should call your syntax) files/packages work together, as we talked about in our emails, how do you plan to get them to work together? Is it simply a matter of telling an EBNF "compiler", or maybe even the real Storm compiler, which rule sets to use? Or will there be some sort of header file that packs different EBNF files together?

Also, this comment box is quite small at the moment. It's like you don't want to get long comments ;)

Göran Rydqvist

2014-10-14 10:46 (UTC)

Extensible syntax rules (pun intended) the universe! Keep it up!

Filip Strömbäck

2014-10-15 03:28 (UTC)

AlexTelon: It is fairly simple, actually. When the compiler sees a certain file type, it automatically includes a package, for example lang.s for a .s file. That means that all syntax present in that package will be used to parse that file. As long as the language .s includes some means of including other packages as well, the syntax rules present there will also be used to parse the file.

Sorry about the small comment box, I initially thought that all browsers did like Firefox: included a resize feature by default. Fixed now!

Göran Rydqvist

2014-10-16 12:33 (UTC)

My personal feeling is that the compilation unit is way too coarse a level to switch languages. Consider, for instance, embedded SQL, where you want to type an SQL statement as an expression and keep using the result in your current lingua. So basically you could have a keyword or other syntax trigger to contextually switch language anywhere, especially at definition, class member, statement or expression levels, but basically anything goes, as long as it is good language design (intuitive, easy to understand, unambiguous etc). I am trying to say that there are no real borders between languages unless we have alien as well as human readers :), which suggests that future programming languages should be designed to be compatible with a basic framework. Programming language inventions and expressive power would then evolve as naturally as today's open source libraries. The current process of ISO committees considering suggestions for years and pushing out new standards to be implemented (maybe) in the next release of your development environment is not what I would call agile development. Programmers, rise with me and retake the power that is rightfully ours :D!

Alex Telon

2014-10-16 13:39 (UTC)

An open language framework would be awesome. Right now many try to get languages to inter-operate but every language does its thing in its own way which seems to make this difficult. I for one would gladly sacrifice some performance in higher-level languages if they would be compatible to a common basic framework. Or even better, imagine if gcc could give you normal optimized c++ but also be able to compile to another format that differs slightly from the ordinary object files in a way that makes it easier for other languages to tap into those files.

.NET's Roslyn compiler is also an interesting example of a way to give programmers more power (given that you work with .NET in this case). Roslyn's API, which can now give you syntax trees and other low-level details, makes it easier for a company to make its own version of the compiler.

The more I learn the less I feel that I know about languages, need to do something about that... https://imgflip.com/i/d4v7m

Filip Strömbäck

2014-10-17 04:36 (UTC)

Göran Rydqvist:

The idea of switching languages inline without limitations would be really great! This idea will, however, allow this too (to some extent at least). The idea is that you do not need to write all syntax definitions in the same file, nor even in the same package. This means that you can extend the language above somewhere else by, for example, adding another option like this:

Expression => InlineSql(sql) : "SQL: ", lang.sql.Root sql, " END";

This will of course assume some compatibility between the two implementations, since we're reusing the SQL rules in a completely different context. It will still be possible to write your own SQL implementation that is compatible with your language's implementation, but that will get tedious to do for every pair of programming languages!

You have a really great vision of extensible syntax! I had not imagined it going that far yet, but now that you mention it, it is not much more than the inevitable next step! It sounds really amazing if some language truly succeeds at extensible syntax and lets us proceed to the next level of programming!
