Looking up variables based on a context.

The general caveat for all these articles I am writing on Programming Language Design is: I don’t know. I am just messing around and trying to figure out how they work and how they can be implemented. The internals of most programming languages are scurry scurry1 places. When you read the source code for, say, the v8 engine, you are struck by the apparent elegance of its design; it’s quite fun to read the source, and you can even abstractly understand what is going on. But an abstract understanding is not the same as a pragmatic understanding.

Pragmatic Understanding

To see if you have a pragmatic understanding of something, unplug your computer from the internet, then write an algorithm or program in your preferred language that implements that understanding, accepting a series of inputs and consistently producing the expected outputs. You can only use printed references that detail libraries, i.e. you can look up the details of some STL functions and the like. When you can do this, you have a pragmatic understanding.

It’s easy to describe the process of creating a programming language like this:

Well, first you scan for tokens, then you create a parse tree, then you navigate that parse tree and either a) emit byte codes or b) execute core functions.2

What’s involved.

Nebulous Bit

The Nebulous Bit® is not really a single thing, but a grouping of things3. We can look at it a bit like an engine. In most programming languages you have the Byte Code Emitter (the fuel injector), the Virtual Machine (the crankshaft), the Object Store (the oil pump), and the Garbage Collector (the catalytic converter). I don’t think it’s an accident that v8’s optimizing compiler is called Crankshaft.

Byte Code Emitter

Most modern programming languages are implemented as compilers (there might be a few exceptions among really simple languages like Tcl or AWK, which I don’t know well; Ruby’s MRI executed directly on the tree, but YARV doesn’t). If someone says differently they are selling you a bill of goods. An interpreter, today, is little more than a really sophisticated dynamic compiler. It only compiles what you actually use in a program run, but more often than not, the “Fuel Injector” of the programming language’s “Engine” converts an AST4 into some intermediate language5.

For instance, Lua will “compile” your code to its intermediate language, which might look something like this contrived and non-functioning example:
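Say for x = 10 + y, a listing loosely in the style of what luac -l prints (the opcodes are real Lua mnemonics, but the operands here are made up):

```
; x = 10 + y  (contrived, non-functioning)
1  [1]  GETGLOBAL  0 -1    ; y
2  [1]  ADD        0 -2 0  ; 10 + y
3  [1]  SETGLOBAL  0 -3    ; x
4  [1]  RETURN     0 1
```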

I think, but do not know6, that it should be possible to skip the AST stage entirely and go straight to emitting byte code. The codes that you emit7 are either a) for an intermediate language (register- or stack-based) that is platform agnostic, like the byte code for the JVM or code for LLVM, or b) written directly into executable memory (v8 does this). If you are making a real compiler, you will probably emit something like Intel- or AT&T-syntax assembly or some such thing.

You can execute on the tree, which is in and of itself a small amount of compilation, or at the very least, simplification. Creating your tree is less about organizing what the programmer said and more about organizing what the programmer meant. For instance, there is only one loop: sure, languages provide you with various structures for looping that come with specific semantics, but eventually those can all be implemented with a single kind of loop. It is usually the parser’s job to encapsulate the linguistic semantics and to produce a consistent and simple tree.

Executing on the tree is considered very slow. Ruby was considered slow for this reason; MRI8 executed, apparently, directly on the tree. Later a virtual machine called YARV was added, which was faster than MRI but still slower than everything else. They keep claiming that Ruby is getting faster and I keep not noticing much of a difference. The common wisdom is that compiling to intermediate codes and executing those is better than direct execution on the tree, but common wisdom is worth about as much as an asshole on your elbow.
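To make “executing on the tree” concrete, here is a minimal sketch of a tree-walker (my own illustration, with invented node types): every node evaluates itself by recursing into its children, paying for a virtual call and a pointer chase at each step.

```cpp
#include <iostream>
#include <memory>

struct Node {
    virtual ~Node() = default;
    virtual int eval() const = 0;  // every node knows how to execute itself
};

struct Num : Node {
    int value;
    explicit Num(int v) : value(v) {}
    int eval() const override { return value; }
};

struct Add : Node {
    std::unique_ptr<Node> lhs, rhs;
    Add(std::unique_ptr<Node> l, std::unique_ptr<Node> r)
        : lhs(std::move(l)), rhs(std::move(r)) {}
    // The slow part: every "+" costs two virtual calls and two pointer chases.
    int eval() const override { return lhs->eval() + rhs->eval(); }
};

int main() {
    // The tree for "2 + 3", as a parser might have produced it.
    Add expr(std::unique_ptr<Node>(new Num(2)),
             std::unique_ptr<Node>(new Num(3)));
    std::cout << expr.eval() << "\n";  // prints 5
}
```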

Of course, when someone says “compile to intermediate codes”, most newbies have no idea what they mean. Essentially it means creating an array of byte code; here’s an example:
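Something like the following (a sketch; the opcode names and operand layout are invented for illustration):

```cpp
#include <cstdint>
#include <vector>

enum Op : uint8_t { LOADK, ADD, SETLOCAL, RET };

// Roughly: local 0 = 2 + 3. Each instruction is an opcode followed by its
// operands, laid out flat, one after another.
std::vector<uint8_t> code = {
    LOADK, 2,     // push the constant 2
    LOADK, 3,     // push the constant 3
    ADD,          // pop two values, push their sum
    SETLOCAL, 0,  // store the result in local slot 0
    RET
};
```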

You might wonder, as I did, how exactly this is more efficient. It is because of a few things: a) operations on an array or doubly linked list are very, very fast; moving forward is a simple matter of adding an offset or incrementing a pointer. b) offsets into the array or list can be “cached”, or saved, and references inside the code can be replaced with direct accesses. c) it’s easier to jump, return, and keep track of where you jumped from so that you can restore the frame/stack/wtfe.

Essentially, this is mimicking, in a certain sense, how the computer itself runs commands. That is why it is called a Virtual Machine. Virtual Machines also simplify, to a degree, the implementation of a programming language, because you break everything down into well-established machine codes that have already proven capable of representing every kind of algorithm ever.

Algorithms for optimizing such simple languages are very good, very clever, and very well researched. So once you have the intermediate format, you can apply some general algorithms to optimize the code, or, if you emit to a specific engine like LLVM, it can run those for you. What kind of Virtual Machine you design will depend on what kinds of things you want to do, but only a little. All computer programs are already represented fully by the processor’s machine codes, so creating a subset of that is all you could do anyway. Considering the features of your language, you might only have to define a few dozen codes.

Generally you will need:

  1. SETSOMETHING
  2. GETSOMETHING
  3. CALLSOMETHING
  4. PUSHSOMETHING
  5. POPSOMETHING
  6. MOVESOMETHING
  7. LOOPSOME
  8. RETURNTOSOMEWHERE
  9. ADDSOMETHING
  10. SUBSOMETHING
  11. MULSOMETHING
  12. DIVSOMETHING
  13. POWSOMETHING
  14. COMPARESOMETHING
  15. JUMPSOMEWHERE

These will tie directly into your core implementation: what does it mean to SET a variable? To GET a variable? To CALL a function, to JUMP to an instruction, and so on and so forth. That leads us to your:

Virtual Machine

A virtual machine receives some emitted codes and keeps track of them in some kind of list structure; for the sake of argument we’ll talk about it like it’s an array.

The Virtual Machine is just a Big API that glues together all of the parts inside the Nebulous Bit® of the language implementation. SET and GET will link to some kind of ObjectStore or Allocator that will provide non-terminal values for the operations of the machine. CALL may also link into core functions implemented deep in the internals of the programming language, or it will simply move the execution to some predetermined section of the list and start running codes from there until it hits an END, RESTORE, or RETURN code, in which case it will go back to its original position + 1 and continue.

The Virtual Machine will need to keep track of multiple restore points so that you can have a call stack and know where to return to when you are done.
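A minimal sketch of such a loop (my own illustration; the opcodes and layout are invented): the instruction pointer is just an index into the array, and a stack of restore points gives you the call stack.

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

enum Op : uint8_t { PUSH, ADD, PRINT, CALL, RETURN, HALT };

int main() {
    // Push 2 and 3, CALL the "function" at offset 8 (which adds them and
    // returns), print the result, halt.
    std::vector<int> code = {
        PUSH, 2,   // offset 0
        PUSH, 3,   // offset 2
        CALL, 8,   // offset 4: jump to 8, remembering where to come back to
        PRINT,     // offset 6
        HALT,      // offset 7
        ADD,       // offset 8: the "function" body
        RETURN     // offset 9
    };

    std::vector<int> stack;       // operand stack
    std::vector<size_t> returns;  // restore points: the call stack
    size_t ip = 0;                // instruction pointer: just an index

    for (;;) {
        switch (code[ip]) {
            case PUSH:   stack.push_back(code[ip + 1]); ip += 2; break;
            case ADD: {  int b = stack.back(); stack.pop_back();
                         int a = stack.back(); stack.pop_back();
                         stack.push_back(a + b); ip += 1; break; }
            case PRINT:  std::cout << stack.back() << "\n"; ip += 1; break;
            case CALL:   returns.push_back(ip + 2);  // the instruction after us
                         ip = code[ip + 1]; break;
            case RETURN: ip = returns.back(); returns.pop_back(); break;
            case HALT:   return 0;
        }
    }
}
```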

Object Store, AKA the Context

Regardless of what language you are actually implementing, you will need some form of context. You could have one context, the Global Context, and leave it at that. But people will hate you and won’t use your language.

The idea of a context can be summed up like this: Where the fuck is x defined? I’ve got x, where the fuck do I put it? The answer is the Current Context. The Virtual Machine needs to keep track of the current context; as you enter a function, a new context needs to be created that is linked to the previous context.

When you enter almost any kind of block you end up creating some kind of new context. Even a for loop can, or does, create a new context, or scope. You’ll notice there are a lot of words to describe functionally the same things. At the higher levels they can and perhaps do have shades of meaning, but at the lowest level, you pretty much implement them all the same kind of way.

You will need to create some kind of object (context, scope, receiver, object, whatever), and on that object there will be at least one bucket which holds stateful data, in the form of variables at least, possibly even functions. In Ruby, and probably other languages, you have something called the default receiver, which is just a fancy way of saying the current context. You can specify a context by prefixing a def with the class name, like so:
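In Ruby, something like this (a sketch; the method bodies are mine, but the names match the discussion below):

```ruby
class MyClass
  def someMethod
    x = 1  # x lives in someMethod's context, which is disposable on return
  end
end

# Or specify the context explicitly by prefixing the def with the class name:
def MyClass.someMethod
  x = 2
end
```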

This may seem complicated, but essentially what we are doing is appending a new context to a context chain when we enter a block, and closing that context when we exit. You may notice that we can has[sic] two different kinds of contexts. One is closeable, or deletable, insofar as it only needs to exist while we are in the block; once we leave it, it can safely be deleted, and more than that, it would be unsafe NOT to delete it. In the case of someMethod, that context should be deleted when we return from the method, but the context of MyClass should not be deleted.

It is my thinking that considering a context as a separate thing is really unnecessary. If you take the perspective that everything is an Object, or a Function (FunctionObject much? much too much), then you can simply treat contexts as objects like everything else.

The description of an object suits the concept of a context perfectly: Something that encapsulates state (variables), behavior (functions) and can send and receive messages (Where the fuck is x?) to itself and other objects.

For fun today, I slapped together a little proof of concept for disposable contexts using clang++ and c++11 features. Here’s a trivial and naive implementation of a small language that allows you to define and retrieve variables and navigate into and out of contexts.
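Something along these lines (a naive sketch, not a faithful listing; the class and the little def/get/next/back command language are my own invention):

```cpp
#include <iostream>
#include <memory>
#include <sstream>
#include <string>
#include <unordered_map>

class Context {
public:
    // The root context is its own sentinel: parent == nullptr.
    explicit Context(std::shared_ptr<Context> parent = nullptr)
        : parent_(std::move(parent)) {}

    void define(const std::string& name, const std::string& value) {
        vars_[name] = value;  // always defines locally, occluding any parent
    }

    // Search this context, then recursively search the parents. Hitting the
    // sentinel (no parent) means the name was never defined anywhere.
    std::string get(const std::string& name) const {
        auto it = vars_.find(name);
        if (it != vars_.end()) return it->second;
        if (parent_) return parent_->get(name);
        return "Undefined";  // a real language would throw here
    }

    std::shared_ptr<Context> parent() const { return parent_; }

private:
    std::shared_ptr<Context> parent_;
    std::unordered_map<std::string, std::string> vars_;
};

int main() {
    auto current = std::make_shared<Context>();  // the Global Context
    std::string line;
    while (std::getline(std::cin, line)) {
        std::istringstream in(line);
        std::string cmd, name, value;
        in >> cmd;
        if (cmd == "def") {           // def x 1
            in >> name >> value;
            current->define(name, value);
        } else if (cmd == "get") {    // get x
            in >> name;
            std::cout << current->get(name) << "\n";
        } else if (cmd == "next") {   // enter a new child context
            current = std::make_shared<Context>(current);
        } else if (cmd == "back") {   // leave it; the child is disposed of
            if (current->parent()) current = current->parent();
        }
    }
}
```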

Here is an example session:
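With the sketch above, a session might go something like this (input lines marked with > for readability):

```
> def x 1
> get x
1
> next
> def x 2
> get x
2
> get y
Undefined
> back
> get x
1
```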

This sample code illustrates some core concepts: creating new scopes and linking them to their parent scope, and recursively searching up the scopes until you hit the sentinel, at which point we return a default Undefined. In a real language, we would throw an exception. It also illustrates occlusion.

[Figure: proglang2 — a diagram of a context chain, with Context B occluding a definition from its parent scope]

When we enter into Context B, we define z = 4, and this definition should occlude the one in the previous scope, just as in the example, where we called next to create a new context and then defined x again with the value of 2. When we retrieved it, there was no need to search the parent contexts for the value.

So let’s do a little worst-case scenario; here’s a variation on that code:
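A sketch of that variation (assuming the Context class from the sketch above is in scope):

```cpp
#include <chrono>
#include <iostream>
#include <memory>
#include <string>

int main() {
    auto root = std::make_shared<Context>();
    root->define("x", "1");  // the needle, buried at the very bottom

    auto current = root;
    for (int i = 0; i < 20000; ++i) {
        current = std::make_shared<Context>(current);  // enter a new scope
        for (int j = 0; j < 20; ++j)                   // 20 locals apiece
            current->define("local_" + std::to_string(j), "0");
    }

    // Worst case: x is not local anywhere, so get() walks all 20,000
    // parents back up to the root.
    auto start = std::chrono::steady_clock::now();
    std::string result = current->get("x");
    auto stop = std::chrono::steady_clock::now();

    std::cout << result << " found in "
              << std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count()
              << " microseconds\n";
}
```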

Here we create 20,000 contexts, or receivers9; each context has 20 local variables, so that’s 400,000 variables defined. Let’s give it a run and see:

Not too shabby. It can certainly be optimized or re-designed in all manner of ways, right up to manually allocating the contexts and having a class keep track of all the positions. We could implement some interesting caching schemes, but there is one important point.

You have to search at least once

No matter how you slice it, you will have to search for the definition of a variable or function at least once, even if you cache it for all successive calls. At some point during your program, you will have to search SOME kind of hash map to get a pointer to the location, which you can then cache.

This was a serious point of confusion for me, because it seemed, from the way people discussed it, that there was something evil about the searching. They would talk about hidden classes, and inline caching, and all that jazz, but what they really mean is that once you do some work, i.e. searching for the variable, you should try your damnedest to never have to do that work again.
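The idea, sketched (everything here is invented for illustration; real inline caches are considerably more involved):

```cpp
#include <string>
#include <unordered_map>

// A scope with named variables; the map lookup is the "work".
struct Scope {
    std::unordered_map<std::string, int> vars;
};

// A call site remembers where the variable lives after the first search.
struct CachedRef {
    int* slot = nullptr;
};

int load(Scope& scope, CachedRef& cache, const std::string& name) {
    if (!cache.slot) {                        // first time: do the real search
        auto it = scope.vars.find(name);
        if (it == scope.vars.end()) return 0; // stand-in for Undefined
        cache.slot = &it->second;             // remember where it lives
    }
    return *cache.slot;  // every later call is a direct read, no search
}
```

This leans on the fact that pointers to unordered_map values stay valid until the element is erased, so the cached slot survives rehashing.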

There is no real way to optimize a linear search for a variable definition: at some point you will have to iterate over some structure and compare some kind of key to get at some kind of value, and a linear search cannot do better than O(n). What you can do is find clever ways to make n smaller, like separating things off into buckets or bins, using some clever mechanism such as binning by the first letter of the variable name so that you only have to search those smaller buckets. Which would suck if some programmer started all of his variable names with ‘a’. You could sort them by the length of the variable name? The alphabet one is best; you could even create a 27 x 27 bin that uses the first and second letters of the variable name to decide the bucket a definition should be placed in. Why 27? The 27th bin is for weird, non-alphabet starting characters, like ‘_’ and such. What if the first and second characters are both ‘_’?
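A sketch of that binning scheme (my own; it answers the last question by dumping both underscores into the 27th bin):

```cpp
#include <cctype>
#include <string>

// Map a character to one of 27 bins: 'a'..'z' -> 0..25, everything else
// (underscores, digits, etc.) -> 26, the "weird characters" bin.
int bin(char c) {
    int lower = std::tolower(static_cast<unsigned char>(c));
    return (lower >= 'a' && lower <= 'z') ? lower - 'a' : 26;
}

// Bucket index in a 27 x 27 grid, chosen by the first two characters.
// "__x" lands in bucket 26 * 27 + 26, same as every other "__" name.
int bucket(const std::string& name) {
    char first  = name.empty()    ? '\0' : name[0];
    char second = name.size() > 1 ? name[1] : '\0';
    return bin(first) * 27 + bin(second);
}
```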

This kind of optimization would only make sense when you have lots and lots of variables. I think, in the end, optimization routines should be selectable by the user; too few languages give users the ability to help the implementation figure out what was meant and what is needed. A lot of effort goes into complex heuristics that could probably be solved by asking the user to type a couple of extra letters.

Generally, how a language is implemented, and how a language pretends to be implemented, and what features it offers have more to do with Fads10 than with practicality.

What I wonder is this: Is it possible to create a programming language that allows the user to customize, via code or options like command-line switches, the types of features she wants? Is it possible to run in imperative mode, object mode, functional mode? Maybe even logical mode? Is it possible to extend the syntax from within the language, but also to extend the semantics? What about a language that can compile user code into modules, then recompile itself and link against those modules? Would that be code that evolves? I am not simply talking about self-rewriting code, but code where use by a user automatically leads to recompilation, optimizing the language for their uses.

Most languages are implemented in such a way that the implementation is really written in stone. I wonder whether that is necessary because it is intrinsic to language design, or whether it is only intrinsic to dogmatic theories of language implementation and optimization. Some food for thought.

  1. meaning scary scary
  2. this is actually not really a complete description, but it is a common one that you get. You will also need to create a mechanism for storing and retrieving values, for defining contexts, and a chain of lookup tables
  3. read: modules
  4. Abstract Syntax Tree
  5. in all fairness, an AST is practically the first level of compilation, because it’s almost always a simplified version of the commands typed. Some nodes are reordered, e.g. for operator precedence
  6. there are many things I do not know, unlike most people who talk about language design, I am not afraid of what I do not know, or about talking about what I do not know. I communicate my understanding, and my misunderstanding. Too many programmers pretend to be gurus when they are not, they’ll talk about Parse Trees, and S-Expressions, and LL vs. LALR but when it comes right down to it, most people have no fucking clue what any of these things really are. What the hell is compiling with continuations? Are we talking like Io continuations? What the fuck does that even mean.
  7. as you scan through the input, or token stream, each token puts your emitter machine into a certain state and it begins “emitting” certain kinds of codes, or tracking certain offsets, and so on. Commands (what the programmer typed) are kind of seen as events.
  8. Matz’s Ruby Interpreter
  9. really it’s all conceptually the same, or similar. There are always details, and the devil is in them, but for our purposes context, receiver, and scope are functionally the same.
  10. patterns and paradigms come and go. Functional? OOP? Pattern X, feature Y, closures, lambdas, async, coroutines, etc. and so forth. Some of these are good and they stick around; some are stupid, like “goto is evil” and “Globalz R Bad”. In my opinion OOP is a bit questionable, because most programs are transactional; their ability to maintain state is superfluous.
