Tokens and tokenizing

There is a lot of duplicate code: not only are the “ignore whitespace” snippets everywhere, but kw_print and kw_input are almost identical from start to finish, even though they perform completely opposite operations.

The reason is easy to spot if we look at the structure of each command and the parameters following it:

graph LR
  PRINT --> s(string)
  PRINT --> v(variable)

graph LR
  INPUT --> s(string) --> v(variable)

graph LR
  LET --> v(variable) --> e[=] --> s(string)

PRINT is followed by either a “string” or a variable name, INPUT is followed by both a “string” and a variable name, and LET is followed by a variable name, an equal sign and a “string”. Right now the string used by LET isn’t in quotes and doesn’t allow escape characters, but it really should.

It is quite obvious that our language, BASIC, consists of some “components”:

keywords, like PRINT, INPUT and LET
strings, that begin and end with “ and can contain escaped characters with \
variable names, that right now can be anything
equal signs
and a lot of optional whitespace: spaces and tabs mostly

In programming language parlance, these “components” are called tokens. A single token can be a single character, like the equal sign, or a long string of characters, like a string (pun intended). Because most programming languages also let you define functions, classes, structures, etc., a “variable name” isn’t only used for variables, so it is more often called an “identifier”.

That means we can set up some rules for the parameters each command expects, that is, which tokens should follow it:

PRINT <string> | <identifier> - meaning PRINT is followed by either a string or an identifier.
LET <identifier> <equal> <string> - meaning LET is followed by an identifier, then a =, then a string.
INPUT <identifier> | <string> <identifier> - meaning INPUT is followed by either an identifier, or a string and then an identifier.

There can be <whitespace> between any two tokens. A run of connected whitespace, such as several spaces in a row, is considered a single token.
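To make the rules a bit more concrete, here is a minimal Python sketch of how they could be written down as data and checked. The names STRING, IDENTIFIER, EQUAL, WHITESPACE, RULES and matches_rule are made up for this illustration and are not part of the interpreter's code. The sketch also assumes that each token has already been reduced to just its kind.

```python
# Hypothetical token kinds, invented for this sketch.
STRING = "string"
IDENTIFIER = "identifier"
EQUAL = "equal"
WHITESPACE = "whitespace"

# Each keyword maps to the alternative token sequences that may follow it.
RULES = {
    "PRINT": [[STRING], [IDENTIFIER]],
    "LET": [[IDENTIFIER, EQUAL, STRING]],
    "INPUT": [[IDENTIFIER], [STRING, IDENTIFIER]],
}


def matches_rule(keyword, token_kinds):
    """Return True if the token kinds after `keyword` fit one of its rules."""
    # Whitespace is allowed between any two tokens, so drop it before comparing.
    significant = [kind for kind in token_kinds if kind != WHITESPACE]
    return significant in RULES.get(keyword, [])
```

With this, matches_rule("LET", [WHITESPACE, IDENTIFIER, WHITESPACE, EQUAL, WHITESPACE, STRING]) is accepted, while matches_rule("PRINT", [IDENTIFIER, STRING]) is rejected because no PRINT rule has two parameters.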

What we need is a “tokenizer”: a function that reads the input line and converts it into tokens, so that LET name = "Peter" is converted into LET → <identifier “name”> → <equal> → <string “Peter”>. The kw_let function then receives that list of tokens to work with, and no longer has to handle all the whitespace, check for “ and \, and so on.
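As a rough illustration of the idea only, a tokenizer along these lines might look like the Python sketch below. It is not the implementation we are about to build, and it makes several assumptions that the text above does not settle: keywords are matched in uppercase only, identifiers consist of letters, digits and underscores, a backslash simply makes the next character literal, and whitespace tokens are thrown away instead of being passed on. The names tokenize, TOKEN_PATTERN and KEYWORDS are invented for this example.

```python
import re

KEYWORDS = {"PRINT", "INPUT", "LET"}

# One named group per kind of token (assumed syntax, see the notes above).
TOKEN_PATTERN = re.compile(r'''
    (?P<whitespace>[ \t]+)
  | (?P<string>"(?:\\.|[^"\\])*")
  | (?P<equal>=)
  | (?P<word>[A-Za-z_][A-Za-z0-9_]*)
''', re.VERBOSE)


def tokenize(line):
    """Convert one input line into a list of (kind, text) tuples."""
    tokens = []
    pos = 0
    while pos < len(line):
        match = TOKEN_PATTERN.match(line, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {line[pos]!r} at column {pos}")
        kind, text = match.lastgroup, match.group()
        pos = match.end()
        if kind == "whitespace":
            continue  # a whole run of spaces/tabs is one token, and we skip it
        if kind == "string":
            # Drop the surrounding quotes; assume \x simply stands for x.
            text = re.sub(r'\\(.)', r'\1', text[1:-1])
        elif kind == "word":
            kind = "keyword" if text in KEYWORDS else "identifier"
        tokens.append((kind, text))
    return tokens
```

Calling tokenize('LET name = "Peter"') gives [("keyword", "LET"), ("identifier", "name"), ("equal", "="), ("string", "Peter")], which is the token list from the example above in tuple form.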

So let’s write that!