nand2tetris, Part 2.2

2022-09-14

Welcome back! In Part 2.2 we are going to start writing a compiler. It compiles a high-level language called Jack down to the VM commands we saw in the last chapter. For ease of organization I decided, once again, to write it in Python.

The compiler is a three-step process. In this blog post we are writing the first step: a tokenizer. What does a tokenizer do?

Tokenizer

Say we have a file called "Main.jack", containing only this line:

class Main {}

The tokenizer is going to read it, then extract meaningful elements or "tokens" one by one until the file is exhausted. For this file, the tokens will be class, Main, { and }.
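
Concretely, I want to end up with something like the list below. The exact representation is mine to choose; a list of (type, value) pairs captures the idea:

[("keyword", "class"), ("identifier", "Main"), ("symbol", "{"), ("symbol", "}")]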

But how? I've never taken a compiler course before, so I was at a complete (compilete?) loss.

I was tempted to say, "oh simple. I can just split the file and classify each piece by what it looks like." But split by what? Whitespace? No, because there is no whitespace between the curly braces. And while you could traverse it char by char, efficiency is up for debate.

I considered this a while back. Here's how I would have done it:

Suppose I'm inside the function that reads the file char by char. I'm going to define a list of tokens, let's say instances of the class Token, like this:

tokens = []

Then I will implement a method Token.append_char(char: str) that takes one character and:

  • append the character to itself if that is allowed (e.g. e after someVariabl)
  • do nothing if the character doesn't matter (e.g. a space after a paren)
  • signal its own termination if the character ends this token and starts another (e.g. + after a variable)
  • raise an error if the character is not allowed at all (e.g. a directly after 123)

This way, when we read a char, we try tokens[-1].append_char(char) to see if the last token accepts it. If not, we end the previous token and push a new one onto the list of tokens, roughly like the sketch below.
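
Here is a minimal sketch of how that could have looked in Python. The names (TokenComplete, tokenize_chars) and the identifier-only rule are made up just to show the shape; integers, strings, and the illegal-character case are left out:

class TokenComplete(Exception):
    # Raised by append_char when the character belongs to the next token.
    pass

class Token:
    def __init__(self, char):
        self.text = char
        self.is_word = char.isalpha() or char == "_"   # identifier/keyword?
        self.closed = False

    def append_char(self, char):
        if not self.closed and self.is_word and (char.isalnum() or char == "_"):
            self.text += char              # e.g. 'e' after 'someVariabl'
        else:
            self.closed = True
            raise TokenComplete()          # this char starts something new

def tokenize_chars(source):
    tokens = []
    for char in source:
        if tokens:
            try:
                tokens[-1].append_char(char)
                continue
            except TokenComplete:
                pass
        if not char.isspace():             # whitespace only separates tokens
            tokens.append(Token(char))
    return tokens

# [t.text for t in tokenize_chars("class Main {}")] -> ['class', 'Main', '{', '}']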

You can see how this is an efficiency disaster. Many of these function calls would be unnecessary if we could just look ahead more than one character at a time.

Because of this very concern, I rejected the char-by-char idea in favor of going line by line. The solution: I feed the line into a function which extracts and returns the first complete token. Then I ask the returned token how many characters long it is. If it answers n, I remove the first n characters (plus any leading whitespace) from the line, then feed what remains back into the same function. Repeat until the line has nothing left, and I get a list of tokens.
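
A rough sketch of that loop in Python, pretending for a moment that tokens are plain strings (in the real code they are Token objects that know their own length), with first_token standing in for the extraction function:

def tokenize_line(line, first_token):
    # Peel tokens off the front of the line until nothing is left.
    tokens = []
    line = line.lstrip()
    while line:
        token = first_token(line)             # extract the first complete token
        tokens.append(token)
        line = line[len(token):].lstrip()     # drop its n characters + whitespace
    return tokens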

I took advantage of certain rules in the Jack language to simplify things. For instance all symbols are 1 character long. No ==, not even <=, only these:

{}()[].,;+-*/&|<>=~

This means if the first character is in the list of symbols, that's gotta be it; we can stop looking.

If the first character is a digit, then the token is no doubt an integer literal. If it is a letter or underscore, we first check whether it's a keyword; if not, it must be an identifier (a class name, a variable, etc.).
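
Here is a sketch of that classification in Python. The symbol and keyword lists come straight from the Jack spec; the function name first_token and the (kind, text) return shape are just my illustration, and strings are handled separately below:

import re

SYMBOLS = set("{}()[].,;+-*/&|<>=~")
KEYWORDS = {
    "class", "constructor", "function", "method", "field", "static", "var",
    "int", "char", "boolean", "void", "true", "false", "null", "this",
    "let", "do", "if", "else", "while", "return",
}

def first_token(line):
    # `line` has no leading whitespace; return (kind, text) of its first token.
    first = line[0]
    if first in SYMBOLS:                      # symbols are always 1 character
        return ("symbol", first)
    if first.isdigit():                       # digits start an integer literal
        return ("integerConstant", re.match(r"\d+", line).group())
    if first.isalpha() or first == "_":       # a word: keyword or identifier
        word = re.match(r"\w+", line).group()
        return ("keyword" if word in KEYWORDS else "identifier", word)
    raise ValueError(f"Invalid token at: {line!r}")

# Plugged into the loop above (keeping only the text so len() works):
# tokenize_line("class Main {}", lambda s: first_token(s)[1])
# -> ["class", "Main", "{", "}"]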

And there are strings. The Jack standard specifies a string literal to be "zero or more non-double-quote characters surrounded by double quotes" (paraphrased). Easy, regex go brrr: (".*?")

The *? signals a non-greedy search. For example, suppose we have this string:

"string 1".append("string 2")

and match it against these two regexes:

(".*")  -> "string 1.append("string 2"
(".*?") -> "string 1"

The second one is correct.
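
A quick check with Python's re module confirms the difference:

import re

line = '"string 1".append("string 2")'
print(re.match(r'(".*")', line).group())    # "string 1".append("string 2"
print(re.match(r'(".*?")', line).group())   # "string 1"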

What about escape sequences?

Canonically, Jack does not support them. A string ends whenever a double quote is met. I don't like it. I want escape sequences. I need escape sequences.

And thus I defined a "language extension" which is opt-in through argument -E escape. The implementation is as simple as changing the regex, but it took me a while to figure out.

The regex is ("(.|\\")+?[^\\]"). Let's take it apart:

(                )  group 1
 "                      literal "
  (     )+?             group 2, one or more, non-greedy
   .                        any character
    |                       or
     \\"                    \"
           [^\\]        any character except \
                "       literal "

It will keep reading the line until it meets an unescaped double quote. Problem solved.
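
And a quick sanity check (the input here is just a line I made up):

import re

STRING_RE = re.compile(r'("(.|\\")+?[^\\]")')

line = r'"she said \"hi\"" + name'
print(STRING_RE.match(line).group())        # "she said \"hi\""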

When a string is done, we come back to the main tokenizer function and look for the next token. And this is how I tokenize the source code line by line.

A small caveat: Jack has multiline comments (/* ... */) that could mess things up if I read line by line naïvely. I have to keep a flag at the file's scope so I know whether I'm inside one or not.
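
A sketch of what that flag looks like in practice; this simplified version ignores the fact that /* could also appear inside a string literal:

in_comment = False                        # file-level flag: inside /* ... */ ?

def strip_block_comments(line):
    # Return `line` with any /* ... */ parts removed, remembering across
    # calls (i.e. across lines) whether we are still inside a comment.
    global in_comment
    out, i = "", 0
    while i < len(line):
        if in_comment:
            end = line.find("*/", i)
            if end == -1:
                return out                # rest of the line is comment
            in_comment, i = False, end + 2
        else:
            start = line.find("/*", i)
            if start == -1:
                return out + line[i:]
            out += line[i:start]
            in_comment, i = True, start + 2
    return out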

Error handling

A challenge I set for myself was to write sensible error messages, something to alert the Jack programmer (who is also me). Therefore, whenever I instantiate a token, I pass in the current line number and the token's position on that line. Whenever the tokenizer fails to recognize something, it puts a caret under it:

$ cat bad.jack
class Main {
    hehe I don't give a shit
}

$ python -m hackc bad.jack
bad.jack:2
    hehe I don't give a shit
              ^ Invalid token
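
The caret line itself is just padding with spaces. A rough sketch, with made-up names and assuming the position is a 0-based character index:

def report_invalid_token(path, line_no, line, col):
    # Print the offending line with a caret under character index `col`.
    print(f"{path}:{line_no}")
    print(line.rstrip("\n"))
    print(" " * col + "^ Invalid token")

# report_invalid_token("bad.jack", 2, "    hehe I don't give a shit", 14)
# reproduces the message above.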

Caveat: the caret only lines up correctly when the code is indented with spaces, because the position is counted in characters, not display columns, so tabs throw it off. It is possible to change this behavior.

Conclusion

The tokenizer is the first step from source code to executable programs. It bridges the gap between human and machine, and thus involves the most "stupid" intricacies. However, as soon as we get this over with, the rest is constructing a syntax tree from the list of tokens according to a handful of grammar rules.

You can check out my code here: hackc

See you in part 2.3!