Lexical Analysis: You should use FLEX…right?

It’s been a hobby of mine for the last few years to try to create my own programming language. Don’t ask why, I just want to. Over the years I have tried and failed multiple times, mainly because I am a self-taught programmer with no CS background, no math background, or any real formal education.

When you set out to learn how one makes a programming language, you will receive a series of answers similar to this:

  1. Use flex to generate a lexical analyzer.
  2. Use Bison to generate a parser.
  3. Forget both of those and use ANTLR.

To understand what you need to do to create a programming language, you will need to understand that, in general, all programming languages have a set of modules that take turns with the user’s input (i.e. the script).

The Lexer

The Lexer, or Scanner, is the first component of a programming language. It reads a stream of characters and groups them together into meaningful parts. The Lexer adds no semantic meaning to the stream; it just creates “meaningful” tokens. Lexers generally, but not always, do not create whitespace tokens: they “ignore” whitespace in the sense that it never leads to the emission of a token, but they “use” it to delineate tokens.

Flex has a simple interface for creating a lexer; a simple Scanner will look like this:
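(A minimal version; the exact rules and messages here are just for illustration.)

    %option noyywrap
    %%
    start       { printf("START\n"); }
    stop        { printf("STOP\n"); }
    [a-zA-Z]+   { printf("WORD: %s\n", yytext); }
    [ \t\n]+    ; /* eat whitespace */
    .           { printf("UNKNOWN: %s\n", yytext); }
    %%
    int main(void) { yylex(); return 0; }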

If we save that as test.y and run:
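    $ flex test.y
    $ cc lex.yy.c -o scanner
    $ ./scanner

(flex doesn’t actually care about the file extension, by the way, though .l is the usual convention.)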

We can then start entering some text. We can enter start, stop, and any other word.

Try this input:
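Something like this will do:

    start
    the quick brown fox
    stop

And you should see output along these lines:

    START
    WORD: the
    WORD: quick
    WORD: brown
    WORD: fox
    STOP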

Pretty cool huh…

Now try this:
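Paste in something like:

    aperçu
    antaŭrigardo
    預覽

With the rules above, the plain ASCII runs come back as WORDs while every non-ASCII byte falls through to the catch-all rule, one mangled byte at a time.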

Well, that’s going to be a problem. In this case, we just pasted in translated versions of the word preview, in French, Esperanto, and Traditional Chinese.

The problem with flex stems from the fact that it is a lexical analyzer that builds an automaton from regular expressions, and it makes some assumptions about the size of characters and strings in the system. Basically, it won’t work well with non-English characters without some additional work, a rethinking of your rules, and maybe even a bit of tweaking of the generated code.

So let’s try that:
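One common workaround, and the one I’ll sketch here, is to let any byte with the high bit set count as a word character, so multi-byte UTF-8 sequences get swallowed into words:

    %option noyywrap 8bit
    %%
    start                { printf("START\n"); }
    stop                 { printf("STOP\n"); }
    [a-zA-Z\x80-\xFF]+   { printf("WORD: %s\n", yytext); }
    [ \t\n]+             ; /* eat whitespace */
    .                    { printf("UNKNOWN: %s\n", yytext); }
    %%
    int main(void) { yylex(); return 0; }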

Now we compile that and try:
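    $ flex test.y && cc lex.yy.c -o scanner
    $ ./scanner
    aperçu
    WORD: aperçu
    antaŭrigardo
    WORD: antaŭrigardo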

Oh! Score. But let’s say I want to tokenize Chinese words too. Let’s say, for instance, my friend gave me a huge file of Chinese, French, and German word correspondences and I needed to parse it. Please tell me I don’t have to add all those characters in?

The answer is, probably not. I couldn’t figure out how to do it easily.

The thing is, flex is or was written by a very smart person. Whenever I see something created by a smart person, I get very afraid and remember the fable of The Two Pots.

Two pots had been left on the bank of a river, one of brass, and one of clay. When the tide rose they both floated off down the stream.

Now the clay pot tried its best to keep aloof from the brass one, which cried out: “Fear nothing, friend, I will not strike you.”

“But I may come in contact with you,” said the other, “if I come too close; and whether I hit you, or you hit me, I shall suffer for it.”

The fable of course means that the weak suffer in spite of the good will of the strong. This also transfers to the idea that the stupid suffer in spite of the good will of the smart. When smart people make tools to be used by stupid people, the stupid people usually just end up cutting off a finger.

I, not being as smart as the person who designed flex, wouldn’t want to build a programming language with it because I don’t understand how it works, at least not in a pragmatic sense. Abstractly speaking, Lexers are very simple things.

Also, learning to create a programming language is about learning how to “create” a programming language. In a sense, it’s the difference between baking a cake from scratch, and baking a cake from a store bought mix where you just add milk. There are plenty of programming languages that do in fact fit into the “just add milk” variety.

Now, if you’d like to make a simple command interface that links into your program and calls various functions based on inputs, flex is your friend. If you need to create a DSL and you have to do it in C/C++, flex is again your friend.

But if you want to create a programming language because you want to understand how they work, flex is your enemy, because the code it produces is pitched far above the level of a person asking the question “How do I create a programming language?” People might disagree with me, but they are ugly and stupid for doing so. I am learning how to make a programming language; I asked that question, got that response (use flex + bison), and found it unhelpful. I admit there could be a couple of really smart people who didn’t know how to create programming languages, asked that question, got that answer, and then, like one of those Zen stories, “became enlightened.” Good for them.

For the rest of us, we don’t realize that what we mean is “HOW do I create all of the parts of a programming language so that I can understand it enough to do it?” Not “how do lazy people do it?”

If you want to learn to program things, you need to use a programming language. If you want to learn to create a programming language, you’ll need to know how to program in a programming language. Spending hours and days learning the non-programming DSLs of flex and bison (EBNF) is pointless, because at the end all you know how to do is program in the flex and bison DSLs. Yes, you get a tree out of that, which you can then connect up to C/C++ code that actually does something, but you are no closer to understanding how to create a programming language than you were before. You are writing nothing more than glue code, which amounts to function calls. It’s Shake N’ Bake, and I helped.

How a Lexer Works

Lexers read an input stream until they reach a barrier. When that barrier is reached, they decide, based on simple rules, whether the stream up to the barrier is reasonable input; if the input matches a rule, they emit a Token, which is the text and the type of the token.

Here’s an example:
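    the quick brown fox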

What we are looking for is inclusion in a set. Actually, we are looking for inclusion in three sets, or in this case dictionaries: a set of word characters, a set of punctuation characters, and finally a set of barriers. In this case, we only have one punctuation character and one barrier (the space). With this information, we can write a simple lexer.
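Here’s a first attempt, in plain C++ (the set-and-loop shape follows the description above; the names and messages are my own):

    #include <iostream>
    #include <set>
    #include <string>

    int main() {
        std::set<char> word_chars;
        for (char c = 'a'; c <= 'z'; ++c) word_chars.insert(c);
        std::set<char> punctuation = {'.'};   // our one punctuation mark
        std::set<char> barriers    = {' '};   // our one barrier

        std::string input = "the quick brown fox";
        std::string token;

        for (char c : input) {
            if (barriers.count(c) || punctuation.count(c)) {
                // A barrier ends the current token.
                if (!token.empty())
                    std::cout << "WORD:  " << token << "\n";
                token.clear();
                continue;
            }
            if (punctuation.count(c)) {
                // We mean to emit punctuation as its own token here.
                std::cout << "PUNCT: " << c << "\n";
                continue;
            }
            if (word_chars.count(c))
                token += c;
        }
        if (!token.empty())
            std::cout << "WORD:  " << token << "\n";
        return 0;
    }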

What will this look like on the command line?
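    $ ./lexer
    WORD:  the
    WORD:  quick
    WORD:  brown
    WORD:  fox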

What happens if we change the input string?
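Let’s put some punctuation in there:

    std::string input = "the quick. brown fox.";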

And the output is:
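    $ ./lexer
    WORD:  the
    WORD:  quick
    WORD:  brown
    WORD:  fox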

OMG. What the hell happened? Welcome to your first hurdle. This is one of the reasons people say you should use something like flex: finding a way to ensure that all valid combinations of the input work, and that no rule collides with another rule, is difficult. What is happening here is that we treat membership in another set (the punctuation) as a barrier, but we reset the token on encountering any barrier and continue through the loop. We intend to eat the whitespace, but we accidentally eat the punctuation instead.

Let’s rewrite that to be a bit more supportive of variation.
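Something like this (punctuation now gets its own rule instead of riding along with the barriers):

    #include <iostream>
    #include <set>
    #include <string>

    int main() {
        std::set<char> word_chars;
        for (char c = 'a'; c <= 'z'; ++c) word_chars.insert(c);
        std::set<char> punctuation = {'.'};
        std::set<char> barriers    = {' '};

        std::string input = "the quick. brown fox.";
        std::string token;

        for (char c : input) {
            if (word_chars.count(c)) {
                token += c;                        // still inside a word
                continue;
            }
            // Anything else ends the current token.
            if (!token.empty()) {
                std::cout << "WORD:  " << token << "\n";
                token.clear();
            }
            if (punctuation.count(c)) {
                std::cout << "PUNCT: " << c << "\n";     // punctuation is a token
            } else if (!barriers.count(c)) {
                std::cout << "UNKNOWN: " << c << "\n";   // not in any dictionary
            }
            // barriers just get skipped
        }
        if (!token.empty())
            std::cout << "WORD:  " << token << "\n";
        return 0;
    }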

So it’s a little bit more code, but this one will support many different variations. Try a few and see what happens. What happens when you add characters not from the dictionary? What happens when you put lots of whitespace?

Modifying the code this way will allow you to support lots of whitespace between words:
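Concretely, switch to an index-based loop so the barrier rule can swallow a whole run in one go (same setup as before; this sketch is one way to do it):

    #include <iostream>
    #include <set>
    #include <string>

    int main() {
        std::set<char> word_chars;
        for (char c = 'a'; c <= 'z'; ++c) word_chars.insert(c);
        std::set<char> punctuation = {'.'};
        std::set<char> barriers    = {' '};

        std::string input = "the   quick.  brown     fox.";
        std::string token;

        for (std::size_t i = 0; i < input.size(); ++i) {
            char c = input[i];
            if (word_chars.count(c)) {
                token += c;
                continue;
            }
            if (!token.empty()) {
                std::cout << "WORD:  " << token << "\n";
                token.clear();
            }
            if (punctuation.count(c)) {
                std::cout << "PUNCT: " << c << "\n";
            } else if (barriers.count(c)) {
                // apply the barrier rule once, then swallow the whole run
                while (i + 1 < input.size() && barriers.count(input[i + 1]))
                    ++i;
            }
        }
        if (!token.empty())
            std::cout << "WORD:  " << token << "\n";
        return 0;
    }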

That is, once we encounter some whitespace, we do what is necessary based on the rule for the barrier, and then we consume all subsequent barriers, because we aren’t interested in them. What if we changed the barrier set to include semi-colons?

Let’s change the code to support semi-colons as well.
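It’s a one-line change to the barrier dictionary:

    std::set<char> barriers = {' ', ';'};   // semi-colons act as barriers now

Now “the quick;brown fox.” splits on the semi-colon exactly the way it splits on a space.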

Lexers aren’t rocket science. Once you get into the swing of things, they are actually rather easy to write. I would also point out that this Lexer is 15k in size compiled, and about 56 lines of code. You’ll see later that with simple conceptual modifications, we can homebrew a lexer for all kinds of things with less code that is more understandable, and the most important thing: more hackable. Code that works, but you can’t change it, is useless code. If you can’t understand it, or if the price for understanding it is too high, then it’s worthless to use it as your base because when tits become upwardly inclined, good luck trying to fix it.

So how does this compare with flex? In all fairness, a lot of work has gone into flex to make sure it is super optimal. Trying to beat its speed is a losing battle, especially if you are like me, and not as smart as the person who made it. We are interested in understanding how lexing works, not squeezing every cycle we can out of the lexer.

Just for shits and giggles, let’s compare them.
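First, a quick throwaway generator (anything that spits out a big file of words, periods, and random runs of spaces will do):

    #include <cstdlib>
    #include <fstream>

    int main() {
        const char* words[] = {"the", "quick", "brown", "fox"};
        std::ofstream out("test.txt");
        for (int i = 0; i < 50000; ++i) {      // word + '.' = 2 tokens, 100,000 total
            out << words[std::rand() % 4] << '.';
            int spaces = 1 + std::rand() % 4;  // random bits of space
            for (int j = 0; j < spaces; ++j)
                out << ' ';
        }
        return 0;
    }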

This will create a test file with 100,000 tokens and random bits of space.
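    #include <fstream>
    #include <iostream>
    #include <set>
    #include <string>

    int main() {
        std::set<char> word_chars;
        for (char c = 'a'; c <= 'z'; ++c) word_chars.insert(c);
        std::set<char> punctuation = {'.'};
        std::set<char> barriers    = {' ', '\n'};

        std::ifstream in("test.txt");
        std::string token;

        char c;
        while (in.get(c)) {
            if (word_chars.count(c)) {
                token += c;
                continue;
            }
            if (!token.empty()) {
                std::cout << "WORD:  " << token << "\n";
                token.clear();
            }
            if (punctuation.count(c))
                std::cout << "PUNCT: " << c << "\n";
            // barriers just get skipped
        }
        if (!token.empty())
            std::cout << "WORD:  " << token << "\n";
        return 0;
    }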

The above is our lexer code, brought back from earlier; the only real change is that it reads from the test file.

And here is our test.y for flex.
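    %option noyywrap
    /* same token rules as the homebrew lexer, so the output matches */
    %%
    [a-z]+     { printf("WORD:  %s\n", yytext); }
    "."        { printf("PUNCT: .\n"); }
    [ \t\n]+   ; /* eat whitespace */
    .          ;
    %%
    int main(void) { yylex(); return 0; }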

Here are the results:
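Your numbers will differ machine to machine (these are representative rather than exact, and the binary names are mine), but proportionally it comes out like this:

    $ time ./lexer > /dev/null
    real    0m0.120s

    $ time ./flexer > /dev/null
    real    0m0.040s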

So our lexer is three times slower than flex. This is to be expected: we wrote it as a very general lexer that we can swap all kinds of things in and out of; it’s an example, and it has zero optimizations. We might have expected it to be ten times slower or more. But all those thousands and thousands of lines of generated code only managed to squeeze 0.080s out of it.

To optimize this lexer, we could be predictive. Since we are looking for “the” and “quick” a lot, we could predict the outcome and jump forward based on the first letter, and so on and so forth. Just to sketch the idea (purely illustrative, and unmeasured):
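    #include <cstring>
    #include <iostream>
    #include <string>

    // Does the expected word start at position i?
    static bool match_at(const std::string& in, std::size_t i, const char* w) {
        return in.compare(i, std::strlen(w), w) == 0;
    }

    int main() {
        std::string input = "the quick the quick ";
        for (std::size_t i = 0; i < input.size(); ) {
            switch (input[i]) {            // dispatch on the first letter
            case 't':
                if (match_at(input, i, "the")) {
                    std::cout << "WORD: the\n";
                    i += 3;
                    continue;
                }
                break;
            case 'q':
                if (match_at(input, i, "quick")) {
                    std::cout << "WORD: quick\n";
                    i += 5;
                    continue;
                }
                break;
            }
            ++i;   // everything else is skipped
        }
        return 0;
    }

But remember, premature optimization is the root of all evil.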
