Jolie Rouge


all ur parentheses Я belong to me


Lexical Analysis: You should use FLEX…right?

It’s been a hobby of mine for the last few years to try to create my own programming language. Don’t ask why; I just want to. Over the years I have tried and failed multiple times, mainly because I am a self-taught programmer with no CS background, no Math background, no real education at all.

When you set out to learn how one makes a programming language, you will receive a series of answers similar to this:

  1. Use flex to generate a lexical analyzer.
  2. Use Bison to generate a parser.
  3. Forget both of those and use ANTLR.

To understand what you need to do to create a programming language, you will need to understand that, in general, all programming languages are built from a set of modules that take turns with the user’s input (i.e. the script): a lexer turns characters into tokens, a parser turns tokens into structure, and an interpreter or compiler gives that structure meaning.

The Lexer

The Lexer, or Scanner, is the first component of a programming language. It reads a stream of characters and groups them together into meaningful parts, adding no semantic meaning to the stream along the way. Lexers create “meaningful” tokens. They generally, but not always, do not create whitespace tokens: they “ignore” whitespace in the sense that it never leads to the emission of a token, but they “use” it to delineate the tokens around it.

Flex has a simple interface for creating a lexer. A minimal scanner, just two keyword rules plus a catch-all word rule, will look something like this:

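%{
#include <stdio.h>
%}

%%
stop        printf("Stop command received\n");
start       printf("Start command received\n");
[a-zA-Z]*   printf("WORD\n");
%%
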
If we save that as test.y and run:

flex test.y
cc lex.yy.c -o example -lfl
./example

We can then start entering some text: start, stop, or any other word.

Try this input:

jason@berserker:~/work/flex$ ./example 
start
Start command received

stop
Stop command received

hello
WORD

world
WORD

Pretty cool, huh…

Now try this:

jason@berserker:~/work/flex$ ./example 
start
Start command received

stop
Stop command received

avant-première
WORD
-WORD
èWORD

antaŭmontro
WORD
ŭWORD

預覽
預覽

Well, that’s going to be a problem. In this case, we just pasted in translated versions of the word preview, in French, Esperanto, and Traditional Chinese.

The problem with flex is that it is a lexical analyzer that builds an automaton from regular expressions, and it makes some assumptions about the size of characters and strings in the system; it thinks in bytes, not characters. Basically, it won’t work well with non-English characters without some additional work, a rethinking of your rules, and maybe even a bit of tweaking of the code it produces.

So let’s try that:

%{
#include <stdio.h>
%}

%%
stop printf("Stop command received\n");
start printf("Start command received\n");
[a-zA-ZÀàÂâÆæÇçÉéÈèÊêËëÎîÏïÔôŒœÙùÛûÜü]* printf("WORD\n");
%%

Now we compile that and try:

jason@berserker:~/work/flex$ ./example 
œuvre 
WORD

Oh! Score. But let’s say I want to tokenize Chinese words too. Let’s say, for instance, my friend gave me a huge file with Chinese, French, and German word correspondences and I needed to parse that. Please tell me I don’t have to add all those characters in?

The answer is, probably not. I couldn’t figure out how to do it easily.
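
For what it’s worth, the least painful approach I’ve seen described, and this is an untested sketch rather than something I got working myself, is to stop writing rules in characters and write them in raw UTF-8 byte ranges, so that any multi-byte sequence counts as a word character:

%{
#include <stdio.h>
%}

ASCII   [a-zA-Z]
CONT    [\x80-\xBF]
U2      [\xC2-\xDF]{CONT}
U3      [\xE0-\xEF]{CONT}{CONT}
U4      [\xF0-\xF4]{CONT}{CONT}{CONT}

%%
({ASCII}|{U2}|{U3}|{U4})+	printf("WORD\n");
.|\n			; /* ignore everything else */
%%

That dodges listing every character, but notice what it costs: flex still has no idea what a “letter” is, we’ve just smuggled the bytes past it.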

The thing is, flex is or was written by a very smart person. Whenever I see something created by a smart person, I get very afraid and remember the fable of The Two Pots.

Two pots had been left on the bank of a river, one of brass, and one of clay. When the tide rose they both floated off down the stream.

Now the clay pot tried its best to keep aloof from the brass one, which cried out: “Fear nothing, friend, I will not strike you.”

“But I may come in contact with you,” said the other, “if I come too close; and whether I hit you, or you hit me, I shall suffer for it.”

The fable of course means that the weak suffer in spite of the good will of the strong. This also transfers to the idea that the stupid suffer in spite of the good will of the smart. When smart people make tools to be used by stupid people, the stupid people usually just end up cutting off a finger.

I, not being as smart as the person who designed flex, wouldn’t want to build a programming language with it because I don’t understand how it works, at least not in a pragmatic sense. Abstractly speaking, Lexers are very simple things.

Also, learning to create a programming language is about learning how to “create” a programming language. In a sense, it’s the difference between baking a cake from scratch, and baking a cake from a store bought mix where you just add milk. There are plenty of programming languages that do in fact fit into the “just add milk” variety.

Now, if you’d like to make a simple command interface that links into your program and calls various functions based on input, flex is your friend. If you need to create a DSL and you have to do it in C/C++, flex is again your friend.

But if you want to create a programming language because you want to understand how they work, flex is your enemy, because the code it produces is pitched far above the level of a person asking the question “How do I create a programming language?” People might disagree with me, but they are ugly and stupid for doing so. I am learning how to make a programming language; I asked that question, I got that response (use flex + bison), and I found it unhelpful. I admit there could be a couple of really smart people who didn’t know how to create programming languages, asked that question, got that answer, and then, like one of those Zen stories, “became enlightened.” Good for them.

For the rest of us, what we really mean is “HOW do I create all of the parts of a programming language, so that I understand it well enough to do it myself?” Not “how do lazy people do it?”

If you want to learn to program things, you need to use a programming language. If you want to learn to create a programming language, you’ll need to know how to program in one. Spending hours and days learning the non-programming DSLs of flex and bison (EBNF) is pointless, because at the end all you know how to do is program in the flex and bison DSLs. Yes, you get a tree out of that, which you can then connect up to C/C++ code that actually does something, but you are no closer to understanding how to create a programming language than you were before. You are writing nothing more than glue code, which amounts to function calls. It’s Shake N’ Bake, and I helped.

How a Lexer Works

Lexers read an input stream until they reach a barrier. When that barrier is reached, they decide whether the stream up to the barrier is reasonable input based on simple rules; if the input matches a rule, they emit a Token, which is the text and the type of what was matched.

Here’s an example:

the quick brown fox.

Token: [the] [WORD]
Token: [quick] [WORD]
Token: [brown] [WORD]
Token: [fox] [WORD]
Token: [.] [PUNCTUATION]

The barrier is: ' ' (a space)

WORD dictionary: bcefhiknoqrtuw
PUNCTUATION dictionary: .

What we are looking for is inclusion in a set. Actually, we are looking for inclusion in three sets, or in this case dictionaries: a set of word characters, a set of punctuation characters, and a set of barriers. In this case, we only have one punctuation character and one barrier (the space). With this information, we can write a simple lexer.

#include <iostream>
#include <sstream>
#include <stdio.h>
typedef struct Token {
  std::string text;
  std::string type;
} Token;
int main(void) {

  std::string chars 		= "bcefhiknoqrtuw";
  std::string barriers 		= " ";
  std::string puns 		= ".";
  std::stringstream source;
  source.str("the quick brown fox.");

  std::string buffer;
  char current_character;
  Token * token = new Token;
  while (source.good()) {
    source.get(current_character);
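    // std::string::find returns npos (a value larger than size()) when the
    // character is absent, so "find(c) < size()" is a compact membership test.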
    if (barriers.find(current_character) < barriers.size()) {
      print_token:
      printf("Token: [%s] [%s]\n", token->text.c_str(),token->type.c_str());
      token->text = "";
      token->type = "";
      continue; //i.e. skip the barrier
    } else if (token->type == "WORD" && puns.find(current_character) < puns.size()) {
      goto print_token; // Because membership in another set is a kind of barrier
    } else if (token->type == "PUNCTUATION" && chars.find(current_character) < chars.size()) {
      goto print_token;// Because membership in another set is a kind of barrier
    } else if (chars.find(current_character) < chars.size()) {
      token->type = "WORD";
    } else if ( puns.find(current_character) < puns.size()) {
      token->type = "PUNCTUATION";
    }
    token->text += current_character;
  }
  printf("Token: [%s] [%s]\n", token->text.c_str(),token->type.c_str());
  return 0;
}

What will this look like on the command line?

jason@berserker:~/work/examples/lexer$ g++ lexer1.cpp -o lexer1.bin
jason@berserker:~/work/examples/lexer$ ./lexer1.bin 
Token: [the] [WORD]
Token: [quick] [WORD]
Token: [brown] [WORD]
Token: [fox] [WORD]
Token: [.] [PUNCTUATION]
jason@berserker:~/work/examples/lexer$

What happens if we change the input string?

#include <iostream>
#include <sstream>
#include <stdio.h>
typedef struct Token {
  std::string text;
  std::string type;
} Token;
int main(void) {

  std::string chars 		= "bcefhiknoqrtuw";
  std::string barriers 		= " ";
  std::string puns 		= ".";
  std::stringstream source;
  source.str("the quick brown fox. the brown. the quick. fox brown the ");

  std::string buffer;
  char current_character;
  Token * token = new Token;
  while (source.good()) {
    source.get(current_character);
    if (barriers.find(current_character) < barriers.size()) {
      print_token:
      printf("Token: [%s] [%s]\n", token->text.c_str(),token->type.c_str());
      token->text = "";
      token->type = "";
      continue; //i.e. skip the barrier
    } else if (token->type == "WORD" && puns.find(current_character) < puns.size()) {
      goto print_token; // Because membership in another set is a kind of barrier
    } else if (token->type == "PUNCTUATION" && chars.find(current_character) < chars.size()) {
      goto print_token;// Because membership in another set is a kind of barrier
    } else if (chars.find(current_character) < chars.size()) {
      token->type = "WORD";
    } else if (puns.find(current_character) < puns.size()) {
      token->type = "PUNCTUATION";
    }
    token->text += current_character;
  }
  printf("Token: [%s] [%s]\n", token->text.c_str(),token->type.c_str());
  return 0;
}

And the output is:

Token: [the] [WORD]
Token: [quick] [WORD]
Token: [brown] [WORD]
Token: [fox] [WORD]
Token: [] []
Token: [the] [WORD]
Token: [brown] [WORD]
Token: [] []
Token: [the] [WORD]
Token: [quick] [WORD]
Token: [] []
Token: [fox] [WORD]
Token: [brown] [WORD]
Token: [the] [WORD]
Token: [] []
Token: [] []

OMG, what the hell happened? Welcome to your first hurdle. This is one of the reasons people say you should use something like flex: finding a way to ensure that all valid combinations of the input work, and that no rule collides with another, is difficult. What is happening here is that we treat membership in another set as a barrier, but the barrier branch resets the token and then continues through the loop. We intend to eat the whitespace, but we accidentally eat the punctuation instead, and the space that follows the eaten punctuation then prints an empty token.

Let’s rewrite that to be a bit more supportive of variation.

#include <iostream>
#include <sstream>
#include <stdio.h>
typedef struct Token {
  std::string text;
  std::string type;
} Token;
void PrintToken(Token * token) {
  printf("Token: [%s] [%s]\n", token->text.c_str(),token->type.c_str());
}
int main(void) {
  std::string chars 		= "bcefhiknoqrtuw";
  std::string barriers 		= " ";
  std::string puns 		= ".";
  std::stringstream source;
  source.str("the quick brown fox. the brown. the quick. fox brown the");

  std::string buffer;
  char current_character;
  Token * token = new Token;
  while  (!source.eof()) {
    source.get(current_character);
    // We don't know about eof until we read, so we have a double check.
    // otherwise the last character read will end up doubled, in
    // this case, thee instead of the
    if (source.eof()) {
      break;
    }
    if (barriers.find(current_character) < barriers.size()) {
      PrintToken(token);    
      token->text = "";
      token->type = "";
      continue;
    } else if (token->type == "WORD" && puns.find(current_character) < puns.size()) {
      PrintToken(token);
      token->text = current_character;
      token->type = "PUNCTUATION";
      continue;
    } else if (token->type == "PUNCTUATION" && chars.find(current_character) < chars.size()) {
      PrintToken(token);
      token->text = current_character;
      token->type = "WORD";
      continue;
    } else if (chars.find(current_character) < chars.size()) {
      token->type = "WORD";
    } else if (puns.find(current_character) < puns.size()) {
      token->type = "PUNCTUATION";
    }
    token->text += current_character;
  }
  printf("Token: [%s] [%s]\n", token->text.c_str(),token->type.c_str());
  return 0;
}

So it’s a little bit more code, but this one will support many different variations. Try a few and see what happens. What happens when you add characters not from the dictionary? What happens when you put lots of whitespace?

Modifying the code this way will allow you to support lots of whitespace between words:

if (barriers.find(current_character) < barriers.size()) {
  PrintToken(token);
  token->text = "";
  token->type = "";
  while (source.peek() == ' ' && source.good()) {
    source.get(current_character);
  }
  continue;
}

That is, once we encounter some whitespace, we do what is necessary based on the rule for the barrier, and then we consume all of the subsequent barriers, because we aren’t interested in them. What if we changed the barriers to include semi-colons?

Let’s change the code to support semi-colons as well.

... 
std::string chars 		= "bcefhiknoqrtuw";
std::string barriers 		= " ;";
std::string puns 		= ".";
std::stringstream source;
source.str("the;quick; brown  ;;;; ; ;  fox.");
...
if (barriers.find(current_character) < barriers.size()) {
  PrintToken(token);
  token->text = "";
  token->type = "";
  while ((barriers.find(source.peek()) < barriers.size()) && source.good()) {
    source.get(current_character);
  }
  continue;
}

Lexers aren’t rocket science. Once you get into the swing of things, they are actually rather easy to write. I would also point out that this lexer is 15k in size compiled, and about 56 lines of code. You’ll see later that with simple conceptual modifications we can homebrew a lexer for all kinds of things with less code that is more understandable and, most importantly, more hackable. Code that works but that you can’t change is useless code. If you can’t understand it, or if the price of understanding it is too high, then it’s worthless as your base, because when tits become upwardly inclined, good luck trying to fix it.

So how does this compare with flex? In all fairness, a lot of work has gone into flex to make sure it is super optimal. Trying to beat its speed is a losing battle, especially if you are like me and not as smart as the person who made it. But then, we are interested in understanding how lexing works, not in squeezing every cycle we can out of the lexer.

Just for shits and giggles, let’s compare them.

word = ['the','quick','brown','fox','.']
File.open('test.txt','w+') do |f|
  100000.times do
    f.write word[rand(5)] + " "
    if rand(10) == 4 then
      rand(10).times do
        f.write(" ")
      end
    end
  end
end

This will create a test file with 100,000 tokens and random bits of space.

#include <iostream>
#include <fstream>
#include <stdio.h>
typedef struct Token {
  std::string text;
  std::string type;
} Token;
void PrintToken(Token * token) {
  printf("Token: [%s] [%s]\n", token->text.c_str(),token->type.c_str());
}
int main(void) {
  std::string chars 		= "bcefhiknoqrtuw";
  std::string barriers 		= " ;";
  std::string puns 		= ".";
  std::ifstream source;
  source.open("test.txt");

  std::string buffer;
  char current_character;
  Token * token = new Token;
  while  (!source.eof()) {
    source.get(current_character);
    // We don't know about eof until we read, so we have a double check.
    // otherwise the last character read will end up doubled, in
    // this case, thee instead of the
    if (source.eof()) {
      break;
    }
    if (barriers.find(current_character) < barriers.size()) {
      PrintToken(token);    
      token->text = "";
      token->type = "";
      while ((barriers.find(source.peek()) < barriers.size()) && source.good()) {
        source.get(current_character);
      }
      continue;
    } else if (token->type == "WORD" && puns.find(current_character) < puns.size()) {
      PrintToken(token);
      token->text = current_character;
      token->type = "PUNCTUATION";
      continue;
    } else if (token->type == "PUNCTUATION" && chars.find(current_character) < chars.size()) {
      PrintToken(token);
      token->text = current_character;
      token->type = "WORD";
      continue;
    } else if (chars.find(current_character) < chars.size()) {
      token->type = "WORD";
    } else if (puns.find(current_character) < puns.size()) {
      token->type = "PUNCTUATION";
    }
    token->text += current_character;
  }
  printf("Token: [%s] [%s]\n", token->text.c_str(),token->type.c_str());
  source.close();
  return 0;
}

The above is our lexer code, brought back from earlier with no real changes beyond reading from test.txt instead of a hard-coded string.

And here is our test.y for flex.

%{
  #include <stdio.h>
  #include <string.h>
  #include <iostream>
%}

%%
[bcefhiknoqrtuw]+	printf("Token: [%s] [%s] \n",yytext,"WORD");
[\.]+			printf("Token: [%s] [%s] \n",yytext,"PUNCTUATION");
%%

void yyerror(const char *str) {
	fprintf(stderr,"error: %s\n",str);
}

int yywrap() {
	return 1;
} 

int main(void) {
    FILE *myfile = fopen("test.txt", "r");
    if (!myfile) {
	    std::cout << "I can't open file!" << std::endl;
	    return -1;
    }
    yyin = myfile;
    yylex();
    fclose(myfile);
}
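
If you want to reproduce the comparison, the build and timing steps look something like this (the output file names here are placeholders; note the g++ this time, since the rules file pulls in iostream, and redirect stdout so terminal scrolling doesn’t swamp the numbers):

flex test.y
g++ lex.yy.c -o flex_lexer -lfl
g++ lexer2.cpp -o lexer2.bin
time ./flex_lexer > /dev/null
time ./lexer2.bin > /dev/null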

Here are the results:

My Lexer:

real    0m0.623s
user    0m0.120s
sys     0m0.220s

Flex Lexer:

real    0m0.542s
user    0m0.040s
sys     0m0.228s

So our lexer is three times slower than flex, comparing user time (0.120s vs 0.040s); the wall-clock times are nearly identical. This is to be expected: we wrote it as a very general lexer that we can swap all kinds of things in and out of, it’s an example, and it has zero optimizations. We might have expected it to be ten times slower or more, but all those thousands and thousands of lines of generated code only managed to squeeze 0.080s out of it.

To optimize this lexer, what we could do is be predictive. Since we are looking for “the” and “quick” a lot, we could guess the outcome and jump forward based on the first letter, and so on and so forth. Remember, though: premature optimization is the root of all evil.
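
To give a flavor of the idea, here is a sketch I put together just for illustration; the predict helper is hypothetical, not part of the lexer above, and to keep it short it only knows about WORD tokens:

#include <cstdio>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: try to match a whole expected word at the current
// stream position. On success the word is consumed; on failure every
// character we read is put back, so the caller can fall through to the
// slow path.
bool predict(std::istream &source, const std::string &expected) {
  std::string seen;
  for (char want : expected) {
    if (source.peek() != want) {
      for (auto it = seen.rbegin(); it != seen.rend(); ++it)
        source.putback(*it); // undo the speculative reads
      return false;
    }
    seen += static_cast<char>(source.get());
  }
  return true;
}

int main() {
  std::stringstream source("the quick the fox the ");
  char c;
  while (source.get(c)) {
    if (c == ' ') continue;           // barrier: skip it
    source.putback(c);
    if (predict(source, "the")) {     // speculate on the common case
      // (a real version would also confirm the next char is a barrier)
      std::printf("Token: [the] [WORD]\n");
    } else {                          // wrong guess: one char at a time
      std::string text;
      while (source.get(c) && c != ' ') text += c;
      std::printf("Token: [%s] [WORD]\n", text.c_str());
    }
  }
  return 0;
}

The win is that the common case costs one comparison per character, with no set lookups and no token bookkeeping; the cost is that every wrong guess pays for its reads twice.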