Writing an 80s-style BASIC interpreter





For several years I have been working on a personal project: creating (and researching) a "fake emulator", that is, an emulator written in JavaScript for a computer that never existed. The machine was intended as a tribute to the eight- and sixteen-bit computers of the 1980s and '90s.



But I love complexity: this machine would also use a brand-new instruction set. It is similar to the instruction sets of that era, but a little easier to work with. And so the Retroputer was born. Over the years the emulator has kept expanding and improving, but it will most likely never be "finished" (after all, this is a personal research project).



When @bbcmicrobot appeared, I wanted to create something similar for the Retroputer. My JS development skills were mostly limited to the front-end, so this would be a great excuse to get some back-end experience. There is only one problem: Retroputer can only understand its own assembly language. It doesn't have BASIC support yet.



So I decided to create a BASIC interpreter in the style of the '80s, that is, written entirely in assembly language, just as it would have been back then. I thought this work was worth sharing, because we rarely get to dive into areas so far below our usual abstractions. My everyday tool (JavaScript) makes many things trivial, to the point where they can seem like magic. Understanding the lowest level of these processes often helps in understanding the abstractions built on top of them.



So let's get started.



Retroputer characteristics



  • 16 KB of ROM (the KERNEL) with 1 KB of scratch space
  • 512 KB of RAM, in 64 KB banks
  • A 16-bit 6516 CPU, which sees the 512 KB of RAM as 4 banks of 64 KB
  • A 4025 video chip
  • An 1125 sound chip
  • A 9800 I/O controller




When I was writing the assembler for the Retroputer, I was able to use the very handy pegjs tool. It let me put together a custom assembly syntax quickly, but unfortunately nothing like it exists for Retroputer ASM. In other words, we will have to do things the hard way.



The parsing itself is performed in several stages. A compiled language parses the code into an abstract syntax tree (AST) (or some other intermediate representation) and then uses that tree to generate finished native code. As a consequence, the program must be syntactically correct for compilation to succeed.



Some modern interpreters have this concept too, because it is often useful to generate an intermediate AST and execute it rather than the source code.



But for a BASIC interpreter on a machine with limited resources, it is most efficient to parse in several stages, some of which happen at runtime. This means, however, that syntax errors cannot be detected until you run the program and reach the part of the code containing the error.



Retroputer BASIC parsing consists of three stages:



  1. Converting strings
  2. Tokenization
  3. Runtime syntax checking


The first two stages occur when the user enters (or loads) the program. The last occurs while the program runs. Essentially, the first two stages build the aircraft, but they don't guarantee it will fly. The last stage is the test flight: we hope it will take off, but we don't know for sure until we try.



Note: the Retroputer BASIC source code (a work in progress) is available on GitHub.



Converting strings



This is the simplest part of the whole process. The string entered by the user is converted to uppercase to simplify (and speed up) the later stages. BASIC is not case-sensitive, so we can take advantage of that.



print 2+2
' becomes:
PRINT 2+2


It's very easy to do this in JavaScript, right?



theLine = theLine.toUpperCase();


But in assembly language this process is more involved: we need to read each character, convert it to uppercase, and then store it somewhere.



           ld  y,  0                # y is our index
_loop:     ld  al, [d, x, y]        # [d,x] is pointer to string
           cmp al, 97               # is al (char) in range?
           brs n   _continue        # Not a lowercase char; continue
           cmp al, 123              # high portion
           brs !n  _continue        # not a lowercase char; continue
           and al, 0b1101_1111      # uppercase!
           st  [d, x, y], al        # store back (we modify the original)
_continue: inc y                    # move our index along
           cmp al, 0                # check for NULL
           brs !z _loop             # No? Go back for more.


The above code doesn't quite match the semantics of the JavaScript version. An important difference is that today we use Unicode for text, so converting input from lowercase to uppercase is often harder, and sometimes impossible (depending on the language). The Retroputer lives in the world of ASCII (more precisely, its own ASCII variant called RetSCII), meaning every supported character fits into eight bits. For many languages that is completely inadequate, but it matches the reality of the time.



This also means we can use a handy feature of ASCII to convert from lowercase to uppercase. An uppercase "A" is represented by code 65, and a lowercase "a" by code 97. If you are familiar with powers of two, you may already have noticed the difference.



So lowercase letters have codes that are exactly 32 greater than their uppercase counterparts. If we know the value is in the right range, all we need to do is subtract 32!
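The arithmetic is easy to check in JavaScript. Here is a minimal sketch of the subtract-32 approach (the range guard mirrors the assembly's comparisons against 97 and 123):

```javascript
// Lowercase ASCII letters sit exactly 32 above their uppercase counterparts,
// so subtracting 32 uppercases a character that's in the "a".."z" range.
function toUpper(code) {
  return code >= 97 && code <= 122 ? code - 32 : code;
}

console.log(String.fromCharCode(toUpper(97)));  // "a" becomes "A"
```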



This approach works, but we can simply flip a bit instead. In the Retroputer's case this is no faster than subtracting, but without the subtraction we don't have to worry about the carry/borrow flag during the calculation. So we can use a bitwise and to turn off the bit in the 32s place.



and al, 0b1101_1111         # turn off bit in 32-place
# versus
clr c                       # clear carry
sub al, 32                  # subtract 32


But there is one subtlety: not everything can be converted to uppercase. String literals, for example, require more care, because we don't want Retroputer BASIC to constantly yell at the user in all caps. (While many computers of the era had no lowercase at all, the Retroputer does.)



For example:



print "Hello, World!"
' should become:
PRINT "Hello, World!"
' and not
PRINT "HELLO, WORLD!"


This means that we need to keep track of whether we are in the middle of a string literal. In BASIC, there is only one designation for this: double quotes. If the character is a double quote, we can set the flag and, depending on the value of the flag, convert to uppercase or leave it as it is.



It turns out that JavaScript doesn't have built-in functionality for this, but we can create it ourselves:



const len = theLine.length;
let insideString = false;
let newLine = "";                 // JS strings are immutable, so build a new one
for (let i = 0; i < len; i++) {
    let ch = theLine[i];
    if (ch === `"`) insideString = !insideString;
    if (!insideString) ch = ch.toUpperCase();
    newLine += ch;
}
theLine = newLine;


Now the logic in JS more closely matches the assembler version, although we again make a little use of the Unicode support in JS.



The assembler version looks like this:



           ld  y,  0                # y is our index
           ld  bl, 0                # === insideString (false)
_loop:     ld  al, [d, x, y]        # [d,x] is pointer to string
           cmp al, 34               # is al a double quote?
           brs !z  _check_char      # no? should we uppercase it?
           xor bl, 0xFF             # yes? toggle insideString
_check_char:
           cmp bl, 0xFF             # inside a string?
           brs z   _continue        # yes? don't modify it
           cmp al, 97               # is al (char) in range? "a"
           brs n   _continue        # Not a lowercase char; continue
           cmp al, 123              # high portion           "z"
           brs !n  _continue        # not a lowercase char; continue
           and al, 0b1101_1111      # uppercase!
           st  [d, x, y], al        # store back (we modify the original)
_continue: inc y                    # move our index along
           cmp al, 0                # check for NULL
           brs !z _loop             # No? Go back for more.


So far we have only converted the input text to uppercase, but detecting string literals opens up another possibility: we can perform one more syntax check!



If at the end of the process we find that insideString is still true (bl = 0xFF), we can return an error, because it means there is an unterminated string literal somewhere in the line.
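In JS, the same check is just a few lines. This sketch re-scans a line with the same quote-toggling logic as the earlier loop:

```javascript
// Detect an unterminated string literal: toggle a flag on every double quote;
// if the flag is still set at the end of the line, a quote was never closed.
function hasUnterminatedString(line) {
  let insideString = false;
  for (const ch of line) {
    if (ch === `"`) insideString = !insideString;
  }
  return insideString;
}
```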



Note: it turns out that many versions of BASIC are rather lenient about missing closing quotes. I discovered this while writing my own interpreter. Even so, it doesn't seem right to me, so Retroputer BASIC doesn't allow it.



Tokenization



The next step in parsing is to transform the input line into something the Retroputer BASIC interpreter can execute more efficiently. We're aiming for something close to the concept of an abstract syntax tree, though the result won't actually be a tree. It will, however, be something we can process quickly at runtime.



A common feature of early microcomputers was very limited memory. The Retroputer has more memory than most standard machines of the time, but still far less than modern computers. So long BASIC programs could easily take up too much memory if they were stored exactly as the user typed them.



To save space, keywords are tokenized as the program is entered into memory. This process converts each keyword into a single-byte token. Keywords are always at least two characters long, so the savings add up as you type. It also means we can use a lookup table at runtime to call the corresponding assembly language routines.



However, Retroputer BASIC goes a little further than most BASICs of the era: it also converts numbers to binary form, tags strings, computes variable references, and more. Honestly, this sometimes uses a bit more space, but the benefits in speed (and ease of implementation) outweigh that.



So, the following steps are used here:







  1. Keywords, operators, and separators are converted to single-byte tokens by looking them up in the keyword table. Anything that doesn't match a keyword is treated as a variable reference.


  2. String literals are tagged with a token of their own and copied through unchanged. For example, in PRINT "Hello, World" the text "Hello, World" is stored exactly as entered, preserving the user's capitalization.

    The string itself is terminated with a null byte.


  3. Numbers are converted to their binary representation. In JavaScript this would be trivial!

    Hexadecimal numbers are handled as well.


  4. Retroputer BASIC also tokenizes variable references (most BASICs of the era did not). This takes extra space in the output, but it makes looking up variables at runtime much faster.

In this post I won't go into detail about the assembly language implementation of this stage; I'll leave that for the future. But don't worry, there is plenty of code there.



Checking syntax at runtime



Last but not least is the runtime syntax check. Once the code has been converted to tokenized form, this is fairly easy to implement.



First, at runtime, BASIC checks whether it is currently processing a token. All tokens have the most significant bit set (that is, values of 128 and higher). If a token is found, we can determine which subroutine to call by looking it up in a vector table. This also makes reporting syntax errors trivial: some keywords make no sense as statements, so their vector table entries simply point to the routine that generates a syntax error.
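The dispatch idea is easy to sketch in JS. Note that the handler behavior and the specific token values below are made up for illustration; only the high-bit convention comes from the text:

```javascript
// Hypothetical sketch of vector-table dispatch: a token's low 7 bits index
// a table of handlers, and unknown tokens land on the syntax-error routine.
const SYNTAX_ERROR = () => "?Syntax Error";
const handlers = new Array(128).fill(SYNTAX_ERROR);
handlers[0x8A - 128] = () => "cleared screen";  // assumed CLS token value
handlers[0xB5 - 128] = () => "printed";         // assumed PRINT token value

function execToken(token) {
  if ((token & 0x80) === 0) return SYNTAX_ERROR(); // not a token at all
  return handlers[token & 0x7F]();
}
```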



Once a statement token's handler is called, the handler takes over all further parsing. It can move through the tokens using gettok, gettok-raw, peektok, and so on. If a token is something the routine doesn't expect, it simply returns an error code. It is at this stage that both syntax errors and type mismatch errors are caught.
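A hypothetical JS analogue of gettok/peektok makes the mechanism clear (the names come from the text; this implementation is just a sketch):

```javascript
// A tiny token-stream reader over a crunched line of bytes:
// gettok consumes the next byte, peektok looks ahead without consuming.
function makeTokenStream(bytes) {
  let i = 0;
  return {
    gettok: () => bytes[i++],
    peektok: () => bytes[i],
  };
}
```

A handler can then peek to decide whether the next token is one it expects, and return an error code if not.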



If a statement needs to evaluate an expression, another parsing stage kicks in. Expression parsing uses a separate vector lookup table, so we can intercept keywords that make no sense in a mathematical expression and return the appropriate errors. For example, if you try to enter PRINT 2+CLS, you get a syntax error on CLS (CLS is shorthand for "clear screen").



Note: this table also tells us the precedence of mathematical operators and the number of required parameters. That matters for actually evaluating the expression, but it can also be used to catch cases where the user hasn't supplied enough arguments.



Since each token maps directly to a vector table entry, the program can be executed fairly quickly and with minimal effort. The job of parsing each statement type rests with its handler, and in general this causes no particular problems. The hardest to parse are probably PRINT and INPUT, but even there we take one token at a time.



Since many of the checks are not performed until the runtime of the program, this means that we may have partial results before the error occurs. For instance:



PRINT "Hello";CLS
Hello
?Syntax Error


In addition, if the program leaves the screen in a state where we can't see the text, we may have trouble diagnosing errors. If a syntax error is printed but we can't see it, what should we do?



This way of checking syntax certainly has its drawbacks, but it allows for a fairly simple interpreter.



Tokenizing user input



As mentioned earlier, BASICs of the era commonly used this tokenization tactic. To save space and increase execution speed, keywords were "crunched" into single-byte tokens (or two-byte ones, if more keywords were needed).



Let's pretend we have a simple line of code that clears the screen and prints a standard greeting:



CLS: PRINT "Hello, World"


While this line on its own doesn't take much memory, a long program will contain many keywords. Crunching (tokenizing) them saves a solid fraction of the space. For example, after tokenizing the line above, memory contains something like this:



8A E3 20 B5 FB Hello, World 00 00


It shouldn't be too difficult to convert this back to the original construct:



Byte              Keyword          Notes
8A                CLS
E3                :                End of statement
20                " "              We store at most one space
B5                PRINT
FB                (string token)   A string literal comes next
Hello, World 00                    Strings are null-terminated
00                                 End of the line of code; lines are also null-terminated

You may think this is a lot of work, but the space savings can be significant. It's not especially noticeable in this example, but even here you can imagine how quickly the savings accumulate: the crunched result is 19 bytes, while the source text takes 26 bytes (including the terminating null).



Saving space becomes important when it comes to the BASIC interpreter running on a machine with very little RAM for user programs. Therefore, such compression is very important and attractive, even if it does not have additional benefits.



So how do we tokenize something like this? At first glance it seems pretty trivial to do in JavaScript: with an array of keywords, we could quickly replace each keyword with its corresponding token. Right?



Seems like a job for String#replace? A naive approach might look like this:



const tokens = ["CLS", "PRINT", ":"];
function tokenize(inStr) {
    let newStr = inStr;              // must be "let": we reassign it below
    tokens.forEach((token, idx) => {
        newStr = newStr.replace(
            new RegExp(token, "g"),
            String.fromCharCode(128 + idx)
        );
    });
    return newStr;
}


Unfortunately, we quickly realize that we can't substitute tokens inside string literals. This means we have to process the input character by character, keeping track of context, so that we don't crunch things that aren't keywords.



This new algorithm is closer to the assembly code in Retroputer, but JS still makes things a lot easier. Here is the skeleton of the JS code (we'll gradually expand it over the course of this post):



const tokens = ["CLS", "PRINT", ":" ];

function tokenize(inStr) {
    let out = [];                    // return value is "crunched" array
    let idx = 0;                     // index into inStr
    let ch = "";                     // current character
    while (idx < inStr.length) {
        ch = inStr.charCodeAt(idx);  // get the current character code
        
        // our logic is going to go here

        out.push(ch);                // for now push the character
        idx++;                       // and keep going (next char)
    }
    out.push(0);                     // we end with NUL
    return out;
}


We start with a very simplified token list, because no one wants to see the entire table in this format (it's long, and Retroputer actually generates its token tables from a JS file!), but for our purposes this will do. The idea is that when we see a token, we write its index into the output array.



At this point, we have a function that, for now, simply converts the string to its equivalent character codes. While it is not particularly useful, it can serve as a convenient framework.



The assembly language version is quite similar to the one shown above.



  do {
    call _get-source-index     # get next character index (in y)
    dl := <BP+source>,y        # read next char from our source string (ch)
    call _adv-source-index     # advance our source index
    call _get-target-index     # get position in target   ("out" in JS)
    <BP+target>,y := dl        # store to target          (out[y] = ch)
    call _adv-target-index     # move it along
    cmp dl, 0                  # was char 0? if so, break
  } while !z


I haven't included _get-source-index or the other routines in the example above, because their jobs are clear from their names: they get or set the source and target indices, and advance them. It is worth noting that, unlike JS, assembly language has no dynamically allocated arrays, so this algorithm allocates a buffer of sufficient size up front. As we move along the input line, we need to know where to write the next token in the target buffer, which is what the target index tracks. Each of these routines returns its index in Y; for example, _adv-target-index increments the target index by one (the equivalent of y++).



Be careful with literals



We need to be careful here: string literals can confuse the tokenizer, so we cannot simply replace every character sequence that matches a keyword with a token value. If a string literal contains "CLS", we shouldn't tokenize it: the string isn't executable, and if we printed it, we would output the token's byte instead of the text. That is almost certainly not what the developer intended.



Numeric literals, on the other hand, don't have this problem, because BASIC has no all-numeric keywords. Even so, there is no point searching the keyword list when we've hit a number: why waste the time?



Tokenizing string literals



So first, let's check whether a string literal is starting; in JS that's not particularly hard:



if (ch === 34) {
  out.push(0xFB);              // string-in-code token
  idx++;
  ch = inStr.charCodeAt(idx);  // get next character after the quote
  while (ch !== 34 && idx < inStr.length) {
    out.push(ch);
    idx++;
    ch = inStr.charCodeAt(idx);
  }
  idx++;                       // go past the quote
  out.push(0);                 // end of string
  continue;
}


A double quotation mark is character code 34. Other programming languages (like JS or C) recognize more kinds of quotes, but BASIC keeps it simple: double quotes or nothing.



When we see a string beginning, we can simply copy the following characters into the output array until we meet the closing quote.



After reading the whole string, we need to add a null byte, because Retroputer BASIC uses C-style strings. If we wanted Pascal-style strings, we could keep a counter and insert the length of the literal instead. Either way, it poses no particular problem. The only reason I used null-terminated strings is that they are very easy to work with in assembly language: we just compare against a null byte instead of maintaining a counter.
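The difference between the two representations is easy to show. This sketch encodes the same text both ways (the helper names are my own):

```javascript
// C-style: the characters followed by a terminating NUL byte.
function encodeC(s) {
  return [...s].map(ch => ch.charCodeAt(0)).concat(0);
}

// Pascal-style: a length byte followed by the characters, no terminator.
function encodePascal(s) {
  return [s.length, ...[...s].map(ch => ch.charCodeAt(0))];
}
```

Both take the same number of bytes for short strings; the C-style form just trades the length byte for a terminator.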



That wasn't so hard in JavaScript. Still, I suspect most of you would reach for something more abstract, like a regular expression. I think that version makes our intent more obvious:



if (ch === 34) {
  out.push(0xFB);              // string-in-code token
  const stringLiteral = inStr.substr(idx+1).match(/^[^"]*/)[0];
  idx += stringLiteral.length+1;
  out.push(...Array.from(stringLiteral, ch => ch.charCodeAt(0)));
  idx++;                       // go past the quote
  out.push(0);                 // end of string
  continue;
}


The code above does roughly the same thing, but instead of checking character by character, we let JS simply execute match. We know we'll get a match (we're inside a string), so we don't even check whether the match succeeded. We then advance the index past the string and copy its characters into the output array. To my mind this version is much easier to understand (though it does assume familiarity with regular expressions and some ES6 syntax, such as spread and arrow functions).



Of course, in assembly language, we have to manually do all the work that JS does. The result is very similar to our first attempt at writing JS code:



  cmp dl, constants.QUOTE         # is dl a quote?
  brs !Z not-a-string             # nope; not a string

  call _get-target-index          # get next position in crunch target
  dl := brodata.TOK_CODE_STRING   # "String" token
  <bp+target>,y := dl             # Store it into the crunch target
  call _adv-target-index

still-a-string:
  call _get-source-index
  dl := <bp+source>,y             # get string character
  call _adv-source-index
  cmp dl, constants.QUOTE         # if it's a quote, we can zero it
  if Z { 
    dl := 0 
  }
  call _get-target-index
  <bp+target>,y := dl             # write to target
  call _adv-target-index
  cmp dl, 0                       # are we done?
  brs !Z still-a-string           # no, keep going
  continue                        # next!
not-a-string:


It's worth a note about the Retroputer assembly language parser: it has basic support for high-level constructs such as blocks and loops. So if Z { ... } executes the contents of the block when the zero flag is set, and continue can be used to branch back to the top of the block (break exits it). The assembler translates these into ordinary compare-and-branch instructions, so the CPU never sees the high-level constructs, but they make the code a little easier to write.



Tokenizing numbers



It likewise makes no sense to search the keyword list for numbers, so we can just skip past them. Most BASICs do something similar to the string handling shown above: if the character read is a digit, it is copied straight to the output, and the cruncher moves on.



Retroputer BASIC (like some other BASICs, such as Atari BASIC) goes a step further: it converts the number to its binary format right away. That is very easy to do: when we meet a digit, we multiply the accumulator by 10 and add the digit, repeating for as long as digits keep coming. (It's worth noting, though, that Retroputer BASIC currently works only with integer values; floating point is on my todo list.)
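The multiply-by-10 accumulation described above looks like this as a JS sketch (the function name and return shape are my own):

```javascript
// Accumulate decimal digits into a binary value: multiply the accumulator
// by 10 and add each digit, stopping at the first non-digit character.
function readNumber(str, idx) {
  let acc = 0;
  while (idx < str.length && str[idx] >= "0" && str[idx] <= "9") {
    acc = acc * 10 + (str.charCodeAt(idx) - 48); // 48 is the code for "0"
    idx++;
  }
  return { value: acc, next: idx };
}
```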



(I should say that Retroputer BASIC currently does nothing about integer overflow, signed or unsigned. When I add floating point, Retroputer will convert overflowing values to floating point as well.)



Retroputer BASIC goes yet another step further: it also recognizes hexadecimal numbers and converts them to their binary equivalent. It uses the 0x notation for such numbers (as in JS), and there is extra logic to return an error if more than one x is given.



In assembly language, checking whether a character is a digit isn't hard, but it takes a couple of comparisons: we need to make sure the character code lies between 0x30 and 0x39 (the character codes for "0" and "9").



Once we have a digit character, we can exploit another convenient property of the character set: 0x30 is the code for "0", 0x31 the code for "1", and so on. We could subtract 0x30 if we wanted, but there's an even simpler way: just mask off the four most significant bits:



and dl, 0b0000_1111


Unfortunately, that won't work if we also need to handle hex digits. In that case we can subtract 0x30, compare with 10, and subtract 7 more if the result is 10 or greater (given that the hex digits "A" through "F" are uppercase).
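Here is the subtract-then-adjust logic as a JS sketch. It assumes a valid digit character ("0"-"9" or uppercase "A"-"F"), just as the assembly would after its range checks:

```javascript
// Convert one hex-digit character code to its numeric value:
// subtract 0x30 ("0"), then adjust by 7 for "A".."F"
// (e.g. "A" is 0x41; 0x41 - 0x30 = 17, and 17 - 7 = 10).
function hexDigitValue(code) {
  let v = code - 0x30;
  if (v >= 10) v -= 7;
  return v;
}
```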



In JS we can use regular expressions again, and once we've matched a number we can call Number(). That gives us another advantage: hex numbers convert trivially too, because Number() handles them automatically when the number starts with 0x.



How does it look in JavaScript?



if (ch >= 48 && ch <= 57) {
  out.push(0xFD);            // number token
  const numberLiteralStr = inStr.substr(idx).match(/^[\dA-Fa-fxX]+/)[0];
  idx += numberLiteralStr.length;
  const numberLiteral = Number(numberLiteralStr);
  const bytes = new Uint8Array(new Uint16Array([numberLiteral]).buffer);
  out.push(...bytes);
  continue;
}


The regular expression accepts any run of decimal or hex digits (plus x to switch into hex mode). It isn't perfect: for example, it allows multiple x characters. But it's good enough for now.



The byte conversion in the code above may raise questions. Number() did the hard work of turning the string of digits into a number JavaScript can work with, but now we need to represent it as a sequence of bytes. We could do the math ourselves:



const hiByte = (numberLiteral & 0xFF00) >> 8;
const loByte = (numberLiteral & 0x00FF);


...and for an integer that's easy. But by using JS typed arrays, we can skip the arithmetic entirely, and at the same time prepare for handling floating-point numbers in the future (we'd simply swap the Uint16Array for a Float64Array).
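The typed-array trick is worth seeing in isolation. Note that typed arrays expose the platform's byte order, which is little-endian on effectively every JS engine in use today:

```javascript
// View a 16-bit value as its component bytes by sharing the underlying
// ArrayBuffer between a Uint16Array and a Uint8Array (little-endian).
const bytes = new Uint8Array(new Uint16Array([0x1234]).buffer);
// bytes[0] is the low byte (0x34), bytes[1] the high byte (0x12)
```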



The assembly language code for this task is a bit longer, but it also does a little more. Retroputer applies one more optimization: if the number is small, it is stored in a smaller format. Values from 0 to 255 fit in a single byte, while larger numbers take two.
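The small-number optimization can be sketched in JS. The token values and output layout here are assumptions for illustration, not Retroputer's actual encoding:

```javascript
// Hypothetical small-number encoding: one data byte for 0..255,
// two (little-endian) for larger 16-bit values, each behind its own token.
const TOK_BYTE = 0xFD;  // assumed token for a one-byte number
const TOK_WORD = 0xFE;  // assumed token for a two-byte number

function encodeNumber(n) {
  return n <= 255
    ? [TOK_BYTE, n]
    : [TOK_WORD, n & 0xFF, (n >> 8) & 0xFF];
}
```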



Searching for keywords



So we've done quite a bit of work, and we still haven't crunched a single keyword. With numbers and string literals out of the way, we can be sure that whatever remains is either a keyword or a variable name. (Or a space, but that's easy to check for.)



In BASIC, keywords don't always start with an alphabetic character: operators and separators are tokens too. But variables do start with an alphabetic character, so we can't immediately tell whether we're about to crunch a keyword or a variable. That's fine: if we don't find a match in the token list, we'll assume it's a variable.



So how do we check whether a potential keyword really is one? If we were writing JavaScript, we could use Array#findIndex:



const tokenMatch = tokens.findIndex(token => 
  inStr.substr(idx).startsWith(token));
if (tokenMatch > -1) {
  out.push(tokenMatch + 128);
  idx += tokens[tokenMatch].length;
  continue;
}


Array#findIndex iterates over the tokens array, and for each token we check whether inStr (at the current idx) starts with it. With our simplified token list, the walk looks like this (assume inStr.substr(idx) === "PRINT"):



token    .startsWith(token)?    index
CLS      false                  0
PRINT    true                   1


Note: as with indexOf in JS, we get -1 if nothing is found.



If a match is found, we can store its index in the output array. But how do we later distinguish a token from a plain character? Easy: turn on the most significant bit, which we can do by adding 128 to the token's index.
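The corresponding runtime check is a one-liner; any byte with the high bit set is a token rather than a character:

```javascript
// A byte with the most significant bit set (value >= 128) is a token.
const isToken = (b) => (b & 0x80) !== 0;
```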



(Note: if we needed more than 128 tokens, some of them would have to take two bytes. That complicates things a little, but not much. Some BASICs do exactly this, for example various Microsoft BASICs.)



We are done with JavaScript, but how do we do it in assembly language?



It turns out to be almost the same, but the code will be much longer.



search-keywords:
  bl := [d, x]                 # get first character of current token
  cmp bl, constants.NUL        # if it's NUL, we've exhausted the list
  brs Z exit-keyword-search    # so we're clearly not a keyword
  clr Z                        # compare strings, but with partial equality
  call [vectors.STRCMP]        # so that our source doesn't need NUL between
                               # tokens; c will now be how many chars got 
                               # compared
  if !Z {
    do {                       # no match; advance past our token in the list
      inc x                    # character by character
      bl := [d, x]             # tokens are terminated with NULs
      cmp bl, constants.NUL    # so if we see it, we can stop
    } while !z
    inc x                      # go past the token #
    inc x                      # in the lookup table
    brs search-keywords        # and keep looking for a match
  }
  clr c                        # found the match!
  add x, c                     # c is the length of the match
  inc x                        # past the NUL
  bl := [d, x]                 # bl is now the token #
  call _get-target-index
  <bp+target>,y := bl          # write the token #
  call _adv-target-index
  call _get-source-index       # advance past the token in the source
  clr c
  add y, c                     # by adding the character count
  dec y                        # back off by one (we'll advance again later)
  call _set-source-index
  continue


Well, that doesn't look so bad. The algorithm is almost the same; only the token table in assembly language is structured slightly differently. It looks something like this:



CLS   00 80
PRINT 00 81
:     00 82


Each keyword is terminated with a null byte followed by a token number.
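This table layout (keyword characters, a NUL, then the token number) is easy to generate. Recall that Retroputer actually builds its token tables from a JS file; this is a simplified sketch of the idea, not the real generator:

```javascript
// Build a keyword table: for each keyword, emit its character codes,
// a NUL terminator, and its token number (128 + index).
// A NUL where a keyword should start marks the end of the table.
function buildTokenTable(keywords) {
  const out = [];
  keywords.forEach((kw, i) => {
    for (const ch of kw) out.push(ch.charCodeAt(0));
    out.push(0, 128 + i);
  });
  out.push(0);
  return out;
}
```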



However, we've glossed over something important: how is the string comparison actually done? There is a STRCMP routine in the Retroputer kernel that we can use, but what does it look like?



strcmp: {
  enter 0x00
  push a
  push b
  push d
  push y
  pushf
  if Z {
    bl := 0x00                  # Checking for full equality
  } else {
    bl := 0x01                  # only checking for partial equality
  }
_main:
  y := 0                        # start of string
top:
  cl := [d, x, y]               # character in string A
  al := <bp+4>,y                # character in string B
  cmp bl, 0x01                  # check if we're doing full equality
  if Z {
    cmp cl, 0                   # we're not, so check for an early nul
                                # in string b
    brs Z strings-are-equal     # if it's NUL, we call them equal
  }
  cmp cl, al                    # check character
  if Z {
    cmp cl, 0                   # equal, but check for NUL
    brs Z strings-are-equal     # NUL reached, strings are equal
    inc y                       # next character
    brs top                     # not NUL, so keep going...
  }
 # if here, the strings aren't equal
  if N {
    popf                        # string is less than
    set N
    clr Z
    brs _out
  } else {
    popf                        # string is greater than
    clr N
    clr Z
    brs _out
  }
strings-are-equal:
  popf
  clr N                         # Not less than
  set Z                         # but Equal
_out:
  c := y                        # make sure we know how many chars
                                # were compared
  pop y
  pop d
  pop b
  pop a
  exit 0x00
  ret
}


I don’t know about you, but I’m appreciating JavaScript’s String#startsWith method more and more. Yes, it does the same thing, but we don't have to write all the internal logic ourselves!
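For comparison, here is a minimal JS sketch of the routine's two modes: full equality, or a prefix match when the caller asks for partial comparison. This strcmp is a hypothetical stand-in, not the kernel routine:

```javascript
// Sketch of the two comparison modes: full equality vs. prefix match.
// Returns 0 when "equal" under the chosen mode, -1/1 otherwise.
function strcmp(a, b, partialOk = false) {
  if (a === b || (partialOk && a.startsWith(b))) return 0;
  return a < b ? -1 : 1;  // less-than / greater-than, like the N flag
}

console.log(strcmp("PRINT", "PRINT"));          // 0
console.log(strcmp("PRINT X", "PRINT", true));  // 0 (prefix matches)
```

Three lines of logic versus forty-odd lines of assembly, flags and stack juggling included.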



Variable handling



Keyword tokenization is now complete, so we could stop here. But Retroputer BASIC goes one step further and tokenizes variables as well. It seems that only a few BASICs of the 80s and 90s did this, because with limited memory it can actually hurt. With memory to spare, though, tokenizing variables helps improve performance.



Here's what Retroputer BASIC does:



  • It reads up to the first two characters of the variable name. (This was a standard inconvenience of the BASIC versions of the era due to memory constraints.)
  • From these two characters, it determines the index of the variable. "A" is variable 0, "A0" is variable 53, and so on. The equation is pretty simple, but it doesn't interest us right now.
  • BASIC continues scanning to the end of the variable name. If the name ends with a type sigil such as $, BASIC records the variable as a string. The type is stored along with the index.
  • BASIC then writes the variable's token and index to the output instead of its full name. Much shorter!
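The index equation itself isn't spelled out here; a scheme in the same spirit might look like the following. The constants are my own illustration (the article's mapping, where "A0" is variable 53, clearly uses slightly different ones):

```javascript
// Hypothetical two-character variable index: single letters A-Z take
// slots 0-25, two-character names (letter + letter/digit) follow.
// NOT Retroputer's exact equation, just the general idea.
function varIndex(name) {
  const first = name.charCodeAt(0) - 65;           // 'A' -> 0 ... 'Z' -> 25
  if (name.length === 1) return first;
  const c = name.charCodeAt(1);
  const second = c >= 65 ? c - 65 : c - 48 + 26;   // A-Z -> 0-25, 0-9 -> 26-35
  return 26 + first * 36 + second;
}

console.log(varIndex("A"));   // 0
console.log(varIndex("AA"));  // 26
```

The point is that two characters map to a small integer with a couple of multiplies and adds, which is cheap even in assembly.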


(Note: when Retroputer BASIC learns to dynamically allocate memory for variables, we will replace the index with a pointer to the variable. Another item on the todo list!)



This makes finding variables incredibly fast at runtime: we don't have to parse the variable name and calculate its index every time we encounter it. In tight loops, the savings can be substantial. But it comes at a cost: we need to store the pointer and the variable name together, so each variable we encounter adds four extra bytes to the output.



It's up to you to decide if it's worth it.



Be that as it may, in JavaScript it is also easy to determine whether the remainder of the character stream is a variable:



const variableMatch = inStr.substr(idx).match(/^[A-Z]+[A-Z0-9]*[\$]*/);
if (variableMatch) {
  const variableName = variableMatch[0];
  idx += variableName.length;
  // tokenize it (omitted; nothing new here)
  continue;
}


I won't describe in detail the code Retroputer currently uses to tokenize variables; it is too verbose. We'll come back to it when I add dynamic memory allocation for variables.



Tokenization completed



If we now test our code in JS, we get a byte array similar to the one Retroputer BASIC uses internally:



console.log(toHex(tokenize(`CLS: PRINT "Hello, World"`)));
// 80 82 20 81 20 FB 48 65 6C 6C 6F 2C 20 57 6F 72 6C 64 00 00 
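The toHex helper isn't shown in the article; a minimal version producing the format above might be:

```javascript
// Format a byte array as space-separated uppercase hex pairs (assumed helper).
const toHex = bytes =>
  Array.from(bytes)
    .map(b => b.toString(16).padStart(2, "0").toUpperCase())
    .join(" ");

console.log(toHex([0x80, 0x82, 0x20])); // "80 82 20"
```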


Wow, that's a lot of work to save a few bytes. But when there are only a couple of kilobytes of free memory, it's worth it! And saving space isn't the only advantage of compressing the user's input: it also speeds up program execution.



To explain why this happens, we need to understand how Retroputer executes code. I won't go into details for now, but in a nutshell, the following is required to execute the code:



  • Fetch a byte from memory
  • If the byte's most significant bit is set, it is a token; otherwise it is a syntax error (or a NUL, and either way we are done with the program line)
  • Look the token up in a handler array, which contains pointers to the functions that process each token
  • Call the handler (it is responsible for consuming its arguments and the like)
  • Repeat
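The classification in the second step can be sketched like this (classify is a hypothetical helper for illustration):

```javascript
// Classify a fetched byte, per the steps above.
function classify(byte) {
  if (byte === 0x00) return "end-of-line";   // NUL ends the program line
  if (byte & 0x80) return "token";           // high bit set: it's a token
  return "syntax-error";                     // anything else is an error
}

console.log(classify(0x81)); // "token"
```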


This is another reason for tokenizing keywords - executing the logic of the keyword becomes trivial, just find the address in the array and call it.



I would like to emphasize that while tokenization is important for saving space, it also improves execution speed.



For example, a JS run loop might look something like this:



const handlers = [ cls, print, endOfStmt ];  // indexed by token & 0x7F
bytes.forEach(byte => handlers[byte & 0x7F] && handlers[byte & 0x7F]());


Of course, in reality everything is not so simple, it is much more, but this is already a topic for another post!



Resources







