For several years now I've been working on a personal project: building (and, really, researching) a "fake emulator" — that is, an emulator, written in JavaScript, for a computer that never existed. The machine is a tribute to the eight- and sixteen-bit computers of the 1980s and 90s.
But I love complexity, so this machine also got a brand-new instruction set. It's similar to the instruction sets of that era, but a little easier to work with. The Retroputer was born. Over the years the emulator has grown and improved, but most likely it will never be "finished" (after all, this is a personal research project).
When @bbcmicrobot appeared, I wanted to create something similar for the Retroputer. My JS development skills were mostly limited to the front-end, so this would be a great excuse to get some back-end experience. There is only one problem: Retroputer can only understand its own assembly language. It doesn't have BASIC support yet.
So I set out to build a BASIC interpreter the way it would have been done in the 80s — that is, entirely in assembly language. And I decided my work was worth sharing, because we rarely get to dive into areas so far below our usual abstractions. My everyday tool (JavaScript) makes many things trivial, to the point where they sometimes seem like magic. Understanding what happens at the lowest level often helps in understanding those abstractions.
So let's get started.
Retroputer characteristics
- 16KB of ROM (the KERNEL) with 1KB of scratch space
- 512KB of RAM, of which 64KB is addressable at a time
- a 16-bit 6516 CPU, addressing the 512KB as 4 banks of 64KB
- a 4025 video display controller
- an 1125 sound generator
- a 9800 I/O controller
When I wrote the assembler for the Retroputer, I was able to use the very handy tool Pegjs. It made quick work of the assembly syntax, but unfortunately nothing like it exists for Retroputer ASM. In other words, we'll have to do it the hard way.
Parsing itself is performed in several stages. A compiled language parses code into an abstract syntax tree (AST), or some other intermediate representation, and then uses it to generate finished native code. As a consequence, the program must be syntactically correct for compilation to succeed.
Some modern interpreters have this concept too, because it is often useful to generate an intermediate AST and execute it rather than the source code.
But for a BASIC interpreter on a machine with limited resources, it's most efficient to parse in several stages, some of which occur at runtime. However, this means syntax errors may not be detected until the program is run and execution reaches the code containing the error.
Retroputer BASIC parsing consists of three stages:
- Converting strings
- Tokenization
- Runtime syntax checking
The first two stages occur as the user enters the program (or loads it). The last occurs while the program runs. Roughly speaking, the first two stages build the airframe of the plane but don't guarantee it will fly. The last stage is the test flight: we hope it takes off, but we don't know for sure until we try.
Note: Retroputer BASIC source code (in development) is available on GitHub.
Converting strings
This is the simplest part of the whole process. The user-entered string is converted to uppercase to simplify (and speed up) the subsequent stages. BASIC is not case sensitive, so we can take advantage of that.
<pre>print 2+2
' becomes:
PRINT 2+2</pre>
It's very easy to do this in JavaScript, right?
theLine = theLine.toUpperCase();
But in assembly language, this process is more detailed. We need to read the character, convert it to uppercase, and then store it somewhere.
ld y, 0 # y is our index
_loop: ld al, [d, x, y] # [d,x] is pointer to string
cmp al, 97 # is al (char) in range?
brs n _continue # Not a lowercase char; continue
cmp al, 123 # high portion
brs !n _continue # not a lowercase char; continue
and al, 0b1101_1111 # uppercase!
st [d, x, y], al # store back (we modify the original)
_continue: inc y # move our index along
cmp al, 0 # check for NULL
brs !z _loop # No? Go back for more.
The above code doesn't quite match the semantics of the JavaScript version of the code. An important difference is that today we use Unicode to work with text, so converting input from lower to upper case can often be more difficult and sometimes impossible (depending on the language). Retroputer lives in the ASCII world (more precisely, in the world of its own ASCII variation called RetSCII), that is, all supported characters are encoded in eight bits. For many languages, this is completely insufficient, but it meets the realities of that time.
This also means we can exploit a nice property of ASCII to convert from lowercase to uppercase. The uppercase "A" is represented in ASCII by code 65, and the lowercase "a" by code 97. If you're familiar with powers of two, the difference should jump out at you.
It turns out that each lowercase letter's code is exactly 32 greater than the corresponding uppercase letter's. If we know the value is in the right range, all we have to do is subtract 32!
This approach works, but we can simply flip a bit instead. In the Retroputer's case this is no faster than subtraction, but by avoiding subtraction we don't have to worry about the carry/borrow flag during the calculation. So we can use a bitwise and to turn off the bit in the 32s place.
and al, 0b1101_1111 # turn off bit in 32-place
# versus
clr c # clear carry
sub al, 32 # subtract 32
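The same bit trick is easy to demonstrate in JavaScript (a quick sketch of the idea, not part of Retroputer itself; the function name is mine):

```javascript
// Uppercase a single ASCII character by clearing the bit in the 32s place.
// Codes 97..122 are "a".."z"; clearing that bit maps them onto 65..90 ("A".."Z").
function asciiUpper(ch) {
    const code = ch.charCodeAt(0);
    if (code >= 97 && code <= 122) {
        return String.fromCharCode(code & 0b1101_1111);
    }
    return ch; // not a lowercase letter; leave it alone
}
```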
But there's one subtlety: not everything can be converted to uppercase. For example, if the user typed a string literal, we need to be more careful — we don't want Retroputer BASIC to constantly yell at the user in all caps. (While many computers of the era had no lowercase at all, the Retroputer does.)
For example:
print "Hello, World!"
' should become:
PRINT "Hello, World!"
' and not
PRINT "HELLO, WORLD!"
This means that we need to keep track of whether we are in the middle of a string literal. In BASIC, there is only one designation for this: double quotes. If the character is a double quote, we can set the flag and, depending on the value of the flag, convert to uppercase or leave it as it is.
It turns out that JavaScript doesn't have built-in functionality for this, but we can create it ourselves:
const chars = theLine.split(""); // strings are immutable in JS, so work on an array
let insideString = false;
for (let i = 0; i < chars.length; i++) {
    const ch = chars[i];
    if (ch === `"`) insideString = !insideString;
    if (!insideString) {
        chars[i] = ch.toUpperCase();
    }
}
theLine = chars.join("");
Now the logic in JS more closely matches the assembler version, although we again make a little use of the Unicode support in JS.
The assembler version looks like this:
ld y, 0 # y is our index
ld bl, 0 # === insideString (false)
_loop: ld al, [d, x, y] # [d,x] is pointer to string
cmp al, 34 # is al a double quote?
brs !z _check_char # no? should we uppercase it?
xor bl, 0xFF # yes? toggle insideString
_check_char:
cmp bl, 0xFF # inside a string?
brs z _continue # yes? don't modify it
cmp al, 97 # is al (char) in range? "a"
brs n _continue # Not a lowercase char; continue
cmp al, 123 # high portion "z"
brs !n _continue # not a lowercase char; continue
and al, 0b1101_1111 # uppercase!
st [d, x, y], al # store back (we modify the original)
_continue: inc y # move our index along
cmp al, 0 # check for NULL
brs !z _loop # No? Go back for more.
So far we've only converted the input text to uppercase, but detecting string literals gives us another opportunity. We can do one more syntax check!
If at the end of the process we find that insideString is still true (bl = 0xFF), we can return an error, because it means there's an unterminated string literal somewhere on the line.
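In JavaScript terms, that final check might look like this (a sketch mirroring the assembly's flag, not Retroputer's actual code):

```javascript
// Returns true if the line contains an unterminated string literal,
// i.e. if insideString is still set once we run out of characters.
function hasUnterminatedString(line) {
    let insideString = false;
    for (const ch of line) {
        if (ch === `"`) insideString = !insideString;
    }
    return insideString;
}
```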
Note: It turns out that many versions of BASIC are rather lenient about missing closing quotes. I discovered this while building my own interpreter. Be that as it may, it doesn't seem right to me, so Retroputer BASIC doesn't allow it.
Tokenization
The next step in parsing is to transform the input string into something more efficient for the Retroputer BASIC interpreter to execute. We're trying to get as close as possible to the concept of an abstract syntax tree, but the result will still not be a tree. But it will be something that we can quickly handle at runtime.
A common feature of early microcomputers was very limited memory. The Retroputer has more memory than most typical machines of the time, but still far less than modern computers. So long BASIC programs could easily consume too much memory if they were stored exactly as the user typed them.
To save space, keywords are tokenized as the program is entered into memory. This process converts keywords into single-byte tokens. Keywords are always at least two bytes long, so there's always a savings as you type. It also means we can use a lookup table at runtime to call the corresponding assembly language routines.
However, Retroputer BASIC goes a little further than most BASICs of the time. It also converts numbers to binary form, marks strings, computes variable references, and more. Honestly, this sometimes wastes space, but the benefits in speed (and ease of implementation) outweigh that.
So, the following steps are used here:

- String literals are marked and copied as-is. Whenever a double quote is encountered, the text inside is stored unchanged: in PRINT "Hello, World", the literal "Hello, World" is copied directly into the output, preceded by a marker token, so it's never scanned for keywords.
- Numbers are converted to their binary form. There's no point in storing digits as text and re-parsing them on every run; the tokenizer converts them once, up front.
- Keywords are compressed into single-byte tokens. PRINT becomes one byte!
- Variable references are computed ahead of time (more on this later). Retroputer BASIC stores the variable's index alongside its name so lookups at runtime are fast. This sets Retroputer BASIC apart from most BASICs of the era.
In the post I will not go into detail about the implementation of this stage in assembly language, I will leave it for the future. But don't worry, there is a lot of code there .
Checking syntax at runtime
Last but not least comes the runtime syntax check. Once the code has been converted to tokenized form, the check is fairly easy to implement.
First, at runtime, BASIC checks whether it is currently looking at a token. All tokens have the most significant bit set (that is, values of 128 and above). If a token is found, we can determine which subroutine to call by looking it up in a vector table. This also makes it trivial to raise syntax errors: some keywords make no sense as statements, so for them the vector table simply points at a routine that generates a syntax error.
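As a rough JavaScript sketch of that dispatch (the token numbers follow the byte table shown later in this post; the handlers here are made-up stubs, not Retroputer's real routines):

```javascript
// Statements dispatch through a vector table keyed by token number.
// Tokens that make no sense as statements fall through to a syntax error.
const syntaxError = () => { throw new Error("?Syntax Error"); };
const statementVectors = {
    0x8A: () => "cleared the screen",   // CLS (stub)
    0xB5: () => "printed something",    // PRINT (stub)
};
function dispatchStatement(token) {
    const handler = statementVectors[token] || syntaxError;
    return handler();
}
```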
Once a statement token's handler is called, the handler takes over all further parsing. To move between tokens it can use gettok, gettok-raw, peektok, and so on. If a token isn't what the routine expects, the routine simply returns an error code. It is at this stage that both syntax errors and type mismatch errors are caught.
If a statement needs to evaluate an expression, another parsing stage kicks in. During expression parsing a separate vector lookup table is used, so we can intercept keywords that have no business in a mathematical expression and return the appropriate errors. For example, if you enter PRINT 2+CLS, you get a syntax error on CLS (CLS is shorthand for "clear screen").
Note: from this table, we can also determine the priority of the mathematical operators and the number of required parameters. This is important for the actual evaluation of the expression, but it can also be used to catch cases where the user has not provided enough arguments.
Since each token maps directly to an entry in the vector lookup table, the program can be executed quite quickly and with minimal effort. Parsing each type of statement is left to its handler, which in general poses no particular problems. Probably the hardest to parse are PRINT and INPUT, but even they consume one token at a time.
Since many of the checks aren't performed until the program runs, we may get partial output before the error occurs. For instance:

PRINT "Hello";CLS
' produces:
Hello
?Syntax Error
In addition, this means that if the program leaves the screen in a state in which we cannot see the text, then we may have problems with the elimination of errors. If a syntax error is displayed but we don't see it, what should we do?
This way of checking syntax certainly has its drawbacks, but it allows for a fairly simple interpreter.
Tokenizing user input
As mentioned earlier, it was common for BASIC versions of that era to use tokenization tactics. To save space and increase execution speed, keywords were "compressed" into single-byte tokens (or double-byte, if more keywords are needed).
Let's pretend we have a simple line of code that clears the screen and prints a standard greeting:
CLS: PRINT "Hello, World"
While this in itself does not take up much memory, if you write a long program, many words in that program will be keywords. If you compress them (tokenize), then you can save a solid fraction of the space. For example, after tokenization of the line shown above, there will be something like this in memory:
8A E3 20 B5 FB Hello, World 00 00
It shouldn't be too difficult to convert this back to the original construct:
Byte | Keyword | Notes |
---|---|---|
8A | CLS | |
E3 | ":" | End of statement |
20 | " " | We store at most one space |
B5 | PRINT | |
FB | String-in-code token | Next comes the string literal |
Hello, World, 00 | The string itself | Strings are null-terminated |
00 | End of line of code | Lines of code are also null-terminated |
You may think that this is a lot of work, but the space savings can be significant. This is not particularly noticeable in the example, but even from it you can imagine how quickly the saved space can accumulate. In this case, the compressed result is 19 bytes. The source text consists of 26 bytes (including the terminating null byte).
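We can sanity-check those byte counts quickly:

```javascript
// The source line: 25 characters plus a terminating NUL = 26 bytes.
const source = `CLS: PRINT "Hello, World"`;
const sourceBytes = source.length + 1;
// Crunched: CLS(1) + ":"(1) + space(1) + PRINT(1) + string token(1)
// + "Hello, World"(12) + string NUL(1) + line NUL(1) = 19 bytes.
const crunchedBytes = 1 + 1 + 1 + 1 + 1 + 12 + 1 + 1;
```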
Saving space becomes important when it comes to the BASIC interpreter running on a machine with very little RAM for user programs. Therefore, such compression is very important and attractive, even if it does not have additional benefits.
So how do we tokenize something like this? At first glance it seems trivial to do in JavaScript: with an array of keyword strings, we could quickly replace each keyword with its corresponding token. Sounds like a job for String#replace, right? A naive approach might look like this:
const tokens = ["CLS", "PRINT", ":" ];
function tokenize (inStr) {
    let newStr = inStr;
    tokens.forEach((token, idx) => newStr = newStr.replace(
        new RegExp(token, "g"), String.fromCharCode(128 + idx)
    ));
    return newStr;
}
Unfortunately, we soon realize that we cannot substitute string literals with tokens. This means that you need to do the processing character by character, taking into account the context, so as not to compress elements that are not keywords.
This new algorithm is closer to the assembly language code in Retroputer, but JS still makes things a lot easier. Here is the beginning of the JS code (we'll gradually expand on it in this post):
const tokens = ["CLS", "PRINT", ":" ];
function tokenize(inStr) {
let out = []; // return value is "crunched" array
let idx = 0; // index into inStr
let ch = ""; // current character
while (idx < inStr.length) {
ch = inStr.charCodeAt(idx); // get the current character code
// our logic is going to go here
out.push(ch); // for now push the character
idx++; // and keep going (next char)
}
out.push(0); // we end with NUL
return out;
}
We start with a very simplified list of tokens, because nobody wants to read the entire table in this format (it's long, and Retroputer actually generates its token tables from a JS file!). The idea is that when we see a token, we write its index into the output array.
At this point, we have a function that, for now, simply converts the string to its equivalent character codes. While it is not particularly useful, it can serve as a convenient framework.
The assembly language version is quite similar to the one shown above.
do {
call _get-source-index # get next character index (in y)
dl := <BP+source>,y # read next char from our source string (ch)
call _adv-source-index # advance our source index
call _get-target-index # get position in target ("out" in JS)
<BP+target>,y := dl # store to target (out[y] = ch)
call _adv-target-index # move it along
cmp dl, 0 # was char 0? if so, break
} while !z
I haven't shown _get-source-index or the other index functions above, because their jobs are clear from their names: they get, set, and advance the source and target indices. Note that unlike JS, assembly language has no dynamically allocated arrays, so this algorithm assumes a buffer of sufficient size has already been allocated. As we move along the input line, we need to know where to write the next token in the target buffer, and that's what the target index tracks. Each of these functions returns the index in Y. For example, _adv-target-index increments the target index by one (the equivalent of y++).
Be careful with literals
We need to be careful here: string literals can confuse the tokenizer. We can't simply replace every character sequence that matches a keyword with a token value. If a string literal contains "CLS", we shouldn't try to tokenize it: it isn't meant to be executed, and if we later printed the string, we'd print the byte corresponding to the token instead. That's almost certainly not what the developer intended.
On the other hand, numeric literals shouldn't have this problem, because BASIC doesn't have numeric keywords. Even so, there is no point in searching for a keyword when faced with a number - why waste time?
Tokenizing string literals
So let's first check whether a string literal is starting. In JS this isn't hard to do:
if (ch === 34) {
out.push (0xFB); // string-in-code token
idx++;
ch = inStr.charCodeAt(idx); // get next character after the quote
while (ch !== 34 && idx < inStr.length) {
out.push(ch);
idx++;
ch = inStr.charCodeAt(idx);
};
idx++; // go past the quote
out.push(0); // end of string
continue;
}
The double quotation mark is represented by character code 34. Other programming languages can recognize many more styles of quotation marks (like JS or C), but with BASIC it's simple: either double quotation marks, or nothing.
When we see that a string is starting, we can simply copy the remaining characters to the output array until we encounter the closing quote.
After reading the entire string, we need to add a zero byte because Retroputer BASIC uses C-style strings. If we wanted to use Pascal-style strings, we could store a counter and insert the length of the string literal. In any case, this does not pose any particular problem. The only reason I've used null-terminated strings is that they're very easy to work with in assembly language — we just need to compare to a null byte, not store a counter.
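For comparison, the Pascal-style emit would look something like this (hypothetical — Retroputer BASIC actually uses the null-terminated form; the function name is mine):

```javascript
// Hypothetical Pascal-style alternative: a length byte instead of a NUL.
function emitPascalString(out, str) {
    out.push(str.length & 0xFF); // length prefix (caps strings at 255 chars)
    for (const ch of str) {
        out.push(ch.charCodeAt(0));
    }
}
```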
That wasn't too hard in JavaScript. Still, I suspect most of you would reach for something more abstract, like a regular expression, and I think it actually makes our intent clearer:
if (ch === 34) {
out.push (0xFB); // string-in-code token
const stringLiteral = inStr.substr(idx+1).match(/^[^"]*/)[0];
idx += stringLiteral.length+1;
out.push(...Array.from(stringLiteral, ch => ch.charCodeAt(0)));
idx++; // go past the quote
out.push(0); // end of string
continue;
}
The code above does roughly the same thing, but instead of checking character by character, we let JS do the match. We know a match will succeed (we're inside a string), so we don't even need to check it. Then we advance the index past the literal and copy its characters into the output array. In my opinion this version is easier to understand, although it does assume familiarity with regular expressions and ES6 syntax such as the spread operator (...) and arrow functions.
Of course, in assembly language, we have to manually do all the work that JS does. The result is very similar to our first attempt at writing JS code:
cmp dl, constants.QUOTE # is dl a quote?
brs !Z not-a-string # nope; not a string
call _get-target-index # get next position in crunch target
dl := brodata.TOK_CODE_STRING # "String" token
<bp+target>,y := dl # Store it into the crunch target
call _adv-target-index
still-a-string:
call _get-source-index
dl := <bp+source>,y # get string character
call _adv-source-index
cmp dl, constants.QUOTE # if it's a quote, we can zero it
if Z {
dl := 0
}
call _get-target-index
<bp+target>,y := dl # write to target
call _adv-target-index
cmp dl, 0 # are we done?
brs !Z still-a-string # no, keep going
continue # next!
not-a-string:
A note about the Retroputer assembly language parser is in order: it has basic support for high-level constructs such as blocks and loops. So if Z {...} executes the block's contents when the zero flag is set, continue branches back to the top of the enclosing block, and break exits it. The assembler translates these into ordinary comparison and branch instructions, so the CPU never sees the high-level constructs, but they make the code a little easier to write.
Tokenizing numbers
Likewise, there's no point in searching the keyword list for numbers, so we can just skip over them. Most versions of BASIC do something similar to the string handling shown above: if the character read is a digit, it's copied to the output, and the cruncher moves on.
Retroputer BASIC (like some other BASICs, such as Atari BASIC) goes a step further: it converts the number to its binary representation. This is easy to do: when we encounter a digit, we multiply an accumulator by 10 and add the digit's value, repeating for as long as digits keep coming. (It's worth noting, though, that Retroputer BASIC currently works only with integer values; adding floating point is on my todo list.)
(I must say that currently Retroputer BASIC does nothing when an integer overflow occurs, whether signed or unsigned. When I add floating point numbers, Retroputer will convert to floating point as well.)
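The multiply-by-ten accumulation described above is easy to sketch in JavaScript (the function name and shape are mine, not Retroputer's):

```javascript
// Accumulate decimal digits: acc = acc * 10 + digit, as long as digits continue.
function parseIntegerAt(str, idx) {
    let acc = 0;
    while (idx < str.length && str[idx] >= "0" && str[idx] <= "9") {
        // the low nibble of an ASCII digit code is its numeric value
        acc = acc * 10 + (str.charCodeAt(idx) & 0b0000_1111);
        idx++;
    }
    return { value: acc, nextIdx: idx };
}
```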
Retroputer BASIC goes one step further still: it also recognizes hexadecimal numbers and converts them to their binary equivalent. The 0x prefix (as in JS) denotes them, and there's extra logic to return an error if more than one x is given.
In assembly language, checking whether a character is a digit isn't hard, but it takes a couple of comparisons: we need to make sure the character code lies between 0x30 and 0x39 (the character codes for "0" and "9").
Once we have a digit character, we can use another convenient property of the character set: 0x30 is the code for "0", 0x31 is the code for "1", and so on. We could subtract 0x30, but there's an even simpler way: just mask off the four most significant bits:
and dl, 0b0000_1111
Unfortunately, this won't work if we also need to handle hex digits. In that case we can subtract 0x30 and then, if the result is greater than 9, subtract 7 more (assuming the hex digits "A"–"F" are uppercase).
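In JavaScript the same subtract-then-adjust approach looks like this (a sketch; uppercase hex digits only, matching the assumption above):

```javascript
// Convert an ASCII character code for "0".."9" or "A".."F" to its numeric value.
function hexDigitValue(code) {
    let v = code - 0x30;   // "0".."9" -> 0..9; "A" (0x41) -> 17
    if (v > 9) v -= 7;     // pull "A".."F" down to 10..15
    return v;
}
```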
In JS we can use a regular expression again, and once we've matched a number we can call Number(). That gives us a bonus: hexadecimal numbers convert trivially too, because Number() handles them automatically when the string starts with 0x.
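A quick illustration of that bonus:

```javascript
const decimal = Number("123");  // a plain decimal string
const hex = Number("0x1A");     // hex handled automatically: 26
```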
How does it look in JavaScript?
if (ch >= 48 && ch <= 57) {
out.push (0xFD); // number token
const numberLiteralStr = inStr.substr(idx).match(/^[\dA-Fa-fxX]+/)[0];
idx += numberLiteralStr.length;
const numberLiteral = Number(numberLiteralStr);
const bytes = new Uint8Array(new Uint16Array([numberLiteral]).buffer);
out.push(...bytes)
continue;
}
The regular expression allows any combination of decimal or hexadecimal digits (plus x to switch to hexadecimal mode). It isn't perfect (it allows multiple x characters, for example), but it's good enough for now.
One part of the code above that may raise questions is the conversion of the number into bytes. Number() did the hard work of turning a string of digits into a number JavaScript can work with, but now we need to represent that number as a sequence of bytes. We could do it with arithmetic:
const hiByte = (numberLiteral & 0xFF00) >> 8;
const loByte = (numberLiteral & 0x00FF);
... and that's easy enough for integers. But by using JS typed arrays we can skip the arithmetic and, at the same time, prepare for handling floating point numbers in the future (we'd just swap the Uint16Array for a Float64Array).
The assembly language code for this task is somewhat longer, but it also does a little more. Retroputer applies one more optimization here: small numbers are stored in a smaller format. Values from 0 to 255 fit in a single byte, while larger numbers take two.
Search for keywords
So we've done quite a bit of work, but so far we haven't compressed a single keyword. Having dealt with numbers and string literals, we can be sure that everything else is either a keyword or a variable name. (Or a space, but it's easy to check.)
In BASIC, keywords don't always start with an alphabetic character: operators and delimiters are tokens too. But variables also start with an alphabetic character, so we can't tell up front whether we're about to crunch a keyword or a variable name. That's fine: if we don't find a match in the token list, we'll assume it's a variable.
So how do we check whether a potential keyword really is a keyword? If we were writing JavaScript, we could use Array#findIndex:
const tokenMatch = tokens.findIndex(token =>
inStr.substr(idx).startsWith(token));
if (tokenMatch > -1) {
out.push(tokenMatch + 128);
idx += tokens[tokenMatch].length;
continue;
}
Array#findIndex iterates over the tokens array, and for each token we check whether inStr (at the current idx) starts with it. With our simplified token list, the search proceeds like this (assume inStr.substr(idx) === "PRINT"):
token | .startsWith(token)? | Index |
---|---|---|
CLS | false | 0 |
PRINT | true | 1 |
Note: as with indexOf in JS, if nothing is found we get -1.
If a match is found, we can store its index in the output array. But how will we later tell a token from an ordinary character? Easy: turn on the most significant bit, which we can do by adding 128 to the token's index.
(Note: if we need more tokens than 128, then for some tokens we would have to use two bytes. This complicates things a little , but not so much. Some versions of BASIC do this, for example, in various Microsoft BASICs.)
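In JS, the corresponding checks are one-liners (helper names are mine, for illustration):

```javascript
// A token has the most significant bit set; its index is the low 7 bits.
const isToken = (byte) => (byte & 0x80) !== 0;
const tokenIndex = (byte) => byte & 0x7F;
```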
We are done with JavaScript, but how do we do it in assembly language?
It turns out to be almost the same, but the code will be much longer.
search-keywords:
bl := [d, x] # get first character of current token
cmp bl, constants.NUL # if it's NUL, we've exhausted the list
brs Z exit-keyword-search # so we're clearly not a keyword
clr Z # compare strings, but with partial equality
call [vectors.STRCMP] # so that our source doesn't need NUL between
# tokens; c will now be how many chars got
# compared
if !Z {
do { # no match; advance past our token in the list
inc x # character by character
bl := [d, x] # tokens are terminated with NULs
cmp bl, constants.NUL # so if we see it, we can stop
} while !z
inc x # go past the token #
inc x # in the lookup table
brs search-keywords # and keep looking for a match
}
clr c # found the match!
add x, c # c is the length of the match
inc x # past the NUL
bl := [d, x] # bl is now the token #
call _get-target-index
<bp+target>,y := bl # write the token #
call _adv-target-index
call _get-source-index # advance past the token in the source
clr c
add y, c # by adding the character count
dec y # back off by one (we'll advance again later)
call _set-source-index
continue
Well, it doesn't look so bad. The algorithm is almost the same, only the assembly language token table is structured slightly differently. It looks something like this:
CLS 00 80
PRINT 00 81
: 00 82
Each keyword is terminated with a null byte followed by a token number.
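Generating that table from the JS token list is straightforward (a sketch of the idea; Retroputer's real table generator lives in its build scripts):

```javascript
// Each entry: the keyword's character codes, a NUL, then the token number.
// A final NUL marks the end of the whole table.
function buildTokenTable(tokens) {
    const table = [];
    tokens.forEach((token, idx) => {
        for (const ch of token) table.push(ch.charCodeAt(0));
        table.push(0);         // NUL terminates the keyword
        table.push(128 + idx); // the token number
    });
    table.push(0); // empty entry: end of table
    return table;
}
```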
But we've glossed over something important: how is the string comparison itself done? The Retroputer kernel provides a STRCMP routine we can use. What does it look like?
strcmp: {
enter 0x00
push a
push b
push d
push y
pushf
if Z {
bl := 0x00 # Checking for full equality
} else {
bl := 0x01 # only checking for partial equality
}
_main:
y := 0 # start of string
top:
cl := [d, x, y] # character in string A
al := <bp+4>,y # character in string B
cmp bl, 0x01 # check if we're doing full equality
if Z {
cmp cl, 0 # we're not, so check for an early nul
# in string b
brs Z strings-are-equal # if it's NUL, we call them equal
}
cmp cl, al # check character
if Z {
cmp cl, 0 # equal, but check for NUL
brs Z strings-are-equal # NUL reached, strings are equal
inc y # next character
brs top # not NUL, so keep going...
}
# if here, the strings aren't equal
if N {
popf # string is less than
set N
clr Z
brs _out
} else {
popf # string is greater than
clr N
clr Z
brs _out
}
strings-are-equal:
popf
clr N # Not less than
set Z # but Equal
_out:
c := y # make sure we know how many chars
# were compared
pop y
pop d
pop b
pop a
exit 0x00
ret
}
I don't know about you, but I'm loving JS's String#startsWith more and more. Yes, the assembly does the same thing, but we had to write all of the internal logic ourselves!
Variable handling
The keyword crunching is now done, so we could stop here. But Retroputer BASIC takes one more step: tokenizing variables. I believe only a few BASICs of the 80s and 90s did this, because with limited memory it can hurt more than help. But when memory is plentiful, tokenizing variables improves performance.
Here's what Retroputer BASIC does:
- It reads up to the first two characters of the variable name. (This was a standard inconvenience of the BASIC versions of the era due to memory constraints.)
- From these two characters, it determines the index of the variable. "A" is variable 0, "A0" is variable 53, and so on. The equation is pretty simple, but it doesn't interest us right now.
- BASIC also records the variable's type. In BASIC, a trailing $ denotes a string variable, and the type determines how the value is stored.
- BASIC then writes the variable's index (along with its name) into the tokenized output, so at runtime it can look the variable up directly instead of parsing the name again. Fast!
(Note: when Retroputer BASIC learns to dynamically allocate memory for variables, we will replace the index with a pointer to a variable. Another item in the todo list!)
This makes variable lookup at runtime incredibly fast: we don't have to parse the variable name and recompute the index every time we encounter it. In tight loops the savings can be substantial. But it comes at a cost: we have to store the index and the variable name together, so every variable reference adds four extra bytes to the output.
It's up to you to decide if it's worth it.
Be that as it may, in JavaScript it is also easy to determine whether the remainder of the character stream is a variable:
const variableMatch = inStr.substr(idx).match(/^[A-Z]+[A-Z0-9]*[\$]*/);
if (variableMatch) {
const variableName = variableMatch[0];
idx += variableName.length;
// tokenize it (omitted; nothing new here)
continue;
}
I will not describe in detail the code that Retroputer currently uses to tokenize variables, it is too verbose. We'll come back to it when I add dynamic memory allocation for variables.
Tokenization completed
If we now test our code in JS, we get a byte array similar to the one Retroputer BASIC uses internally:
console.log(toHex(tokenize(`CLS: PRINT "Hello, World"`)));
// 80 82 20 81 20 FB 48 65 6C 6C 6F 2C 20 57 6F 72 6C 64 00 00
Wow, that's a lot of work to save a few bytes. But when you have only a couple of kilobytes of free memory, it's worth it! And saving space isn't the only advantage of crunching the user's input: it also speeds up execution.
To explain why this happens, we need to understand how Retroputer executes code. I won't go into details for now, but in a nutshell, the following is required to execute the code:
- Get a byte from memory
- If the byte has the most significant bit set, it's a token; otherwise it's a syntax error (or a NUL, in which case we've reached the end of the program line)
- We look up the token's handler in an array that contains pointers to the routines that process each token
- We call the handler (it is responsible for receiving arguments and the like)
- We repeat all over again
This is another reason for tokenizing keywords - executing the logic of the keyword becomes trivial, just find the address in the array and call it.
I would like to emphasize that while tokenization is important for saving space, it also improves execution speed.
For example, a JS run loop might look something like this:
const handlers = [ cls, print, endOfStmt ];
bytes.forEach(byte =>
    byte >= 128 && handlers[byte - 128] && handlers[byte - 128]());
Of course, in reality everything is not so simple, it is much more, but this is already a topic for another post!
Resources
- The complete JavaScript code for this post
- Retroputer source code (currently in development)
- List of resources used for this project