In my opinion, one of the best programming languages of its age was BBC BASIC. The other day I was looking through some old disks (or should that be ‘discs’?) and found some programs I’d written for the Acorn Risc PC in BASIC, and wanted to have a reminisce by looking at them.
Unfortunately the files are stored in a tokenised binary format, and the details of the format aren’t easy to find. There are a few sites around which purport to tell you the format, but they’re eiher out of date, or wrong, or for spin-off commercial versions like R.T. Russell’s BBC BASIC. These have almost identical token lists, but when I tried using them, a few of the more advanced tokens in BASIC V didn’t come out right.
After a little further searching, I found that the source code of BBC BASIC V is available as part of RISC OS Open. I downloaded it and spent a few happy hours re-reading ARM assembler to derive the format.
Each line is stored as a sequence of bytes:
0x0d [line num hi] [line num lo] [line len] [data...]
The line number is as you’d expect — the line number — with one exception. The maximum line
number is 65279 (
0xfeff) as the special marker
0x0d 0xff is used to signify the end of the program.
The line length includes the three preceding bytes, making the maximum length of a line 251 bytes.
The line data itself is tokenised. The original BBC BASIC treated any character with the top bit set as
a token (with one exception), and a table of 128 tokens was used to determine this. In BBC BASIC V,
any character value greater or equal to
0x7f is interpreted as a token, and there are three
“extended” tokens (
0xc8) which use the next byte to select further tokens.
The exception mentioned above — which applies to both original BBC BASIC and BBC BASIC V — is that
0x8d is used to signify “there’s a line number reference coming up.” In both versions of BASIC it was
still de rigeur to use
GOSUB — which require line numbers — and as an optimisation the
line number is stored in a 3-byte binary format instead of the equivalent ASCII digits.
This format is described in more detail in another blog post.
The tokens are broken into four categories:
0x7fand are looked up for all values except
0x8d. (See the Python implementation below for the list of all tokens.)
0xc6byte and start at value
0x8e. There are only two tokens,
0xc7and start at value
0x8e, and include
0xc8and also start at value
0x8e. These include keywords like
Putting this all together, I was able to write a quick Python script to decode my old BASIC programs, which is available here.
After all that effort, my old programs were…interesting — more on them another time.