BBC BASIC V file format

In my opinion, one of the best programming languages of its age was BBC BASIC. The other day I was looking through some old disks (or should that be ‘discs’?) and found some programs I’d written for the Acorn Risc PC in BASIC, and wanted to have a reminisce by looking at them.

Unfortunately the files are stored in a tokenised binary format, and the details of the format aren’t easy to find. There are a few sites around which purport to tell you the format, but they’re either out of date, or wrong, or for spin-off commercial versions like R.T. Russell’s BBC BASIC. These have almost identical token lists, but when I tried using them, a few of the more advanced tokens in BASIC V didn’t come out right.

After a little further searching, I found that the source code of BBC BASIC V is available as part of RISC OS Open. I downloaded it and spent a few happy hours re-reading ARM assembler to derive the format.

Each line is stored as a sequence of bytes:

0x0d [line num hi] [line num lo] [line len] [data...]

The line number is exactly what you’d expect, with one restriction: the maximum line number is 65279 (0xfeff), because the special marker 0x0d 0xff is used to signify the end of the program. The line length counts the whole line, including the three preceding bytes and the length byte itself, so a line can hold at most 251 bytes of data.
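As a rough illustration of the layout above, here is a minimal Python sketch that splits a program into lines. The function name read_lines and its error handling are my own choices for this example, and it assumes the length byte counts all four header bytes; treat it as a sketch rather than a definitive reader.

```python
def read_lines(data: bytes) -> list[tuple[int, bytes]]:
    """Split a tokenised program into (line_number, raw_line_data) pairs."""
    lines = []
    pos = 0
    while True:
        # Every line, and the end marker, starts with 0x0d.
        if pos + 2 > len(data) or data[pos] != 0x0D:
            raise ValueError(f"bad line header at offset {pos}")
        # 0x0d 0xff marks the end of the program.
        if data[pos + 1] == 0xFF:
            return lines
        if pos + 4 > len(data):
            raise ValueError(f"truncated line header at offset {pos}")
        # Line number is stored high byte first.
        line_number = (data[pos + 1] << 8) | data[pos + 2]
        # Assumption: the length covers the 4-byte header plus the data.
        length = data[pos + 3]
        if length < 4 or pos + length > len(data):
            raise ValueError(f"bad line length at offset {pos}")
        lines.append((line_number, data[pos + 4 : pos + length]))
        pos += length
```

Each returned pair holds the line number and the still-tokenised bytes of that line, ready to be passed to a detokeniser.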

The line data itself is tokenised. The original BBC BASIC treated any character with the top bit set as a token (with one exception), and looked it up in a table of 128 tokens. In BBC BASIC V, any character value greater than or equal to 0x7f is interpreted as a token, and there are three “extended” tokens (0xc6, 0xc7 and 0xc8) which use the next byte to select further tokens.

The exception mentioned above, which applies to both original BBC BASIC and BBC BASIC V, is that 0x8d is used to signify “there’s a line number reference coming up.” In both versions of BASIC it was still de rigueur to use GOTO and GOSUB, which require line numbers, and as an optimisation the line number is stored in a 3-byte binary format instead of the equivalent ASCII digits. This format is described in more detail in another blog post.

The tokens are broken into four categories:

  1. The main token list, holding all the main keywords. These values start at 0x7f and are looked up for all values except 0xc6, 0xc7, 0xc8 and 0x8d. (See the Python implementation below for the list of all tokens.)
  2. Extended function tokens. These follow a 0xc6 byte and start at value 0x8e. There are only two tokens, SUM and BEAT.
  3. Extended command tokens. These follow a 0xc7 and start at value 0x8e, and include RENUMBER, EDIT and HELP.
  4. Extended statement tokens. These follow a 0xc8 and also start at value 0x8e. These include keywords like CASE, MOUSE and SYS.
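The four categories can be sketched as a small dispatch loop in Python. This is an illustration only, not the script mentioned in this post: the names MAIN_TOKENS, EXTENDED_TOKENS and detokenise are made up for the example, and the tables are deliberately tiny. The SUM and BEAT entries follow from the description above (taking SUM as the first value, 0x8e), while the 0xf1 entry for PRINT is from memory of the classic BBC BASIC token table and should be checked against a full list. Unknown tokens and line number references are emitted as placeholders, and the sketch does not special-case bytes inside quoted strings.

```python
# Tiny illustrative subsets; a real decoder needs the complete tables.
MAIN_TOKENS = {0xF1: "PRINT"}  # assumption: recalled value, verify before use
EXTENDED_TOKENS = {
    0xC6: {0x8E: "SUM", 0x8F: "BEAT"},  # extended functions
    0xC7: {},                           # extended commands (not filled in)
    0xC8: {},                           # extended statements (not filled in)
}
LINE_NUMBER_MARKER = 0x8D


def detokenise(line_data: bytes) -> str:
    """Expand the tokens in one line's data into text."""
    out = []
    i = 0
    while i < len(line_data):
        byte = line_data[i]
        i += 1
        if byte == LINE_NUMBER_MARKER:
            # Three encoded bytes follow; decoding them is not covered here,
            # so they are shown as a hex placeholder.
            out.append("<line ref " + line_data[i : i + 3].hex() + ">")
            i += 3
        elif byte in EXTENDED_TOKENS:
            # The next byte selects a keyword from the extended table.
            if i >= len(line_data):
                raise ValueError("extended token prefix at end of line")
            selector = line_data[i]
            i += 1
            placeholder = f"<{byte:02x} {selector:02x}>"
            out.append(EXTENDED_TOKENS[byte].get(selector, placeholder))
        elif byte >= 0x7F:
            # Main token list; unknown values become a hex placeholder.
            out.append(MAIN_TOKENS.get(byte, f"<{byte:02x}>"))
        else:
            # Anything below 0x7f is a plain character.
            out.append(chr(byte))
    return "".join(out)
```

Keeping the extended tables keyed by their prefix byte means the three extended categories share one code path, and filling in the real token lists only changes the tables.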

Putting this all together, I was able to write a quick Python script to decode my old BASIC programs, which is available here.

After all that effort, my old programs were…interesting — more on them another time.

Posted at 11:35:00 GMT on 14th November 2007.

About Matt Godbolt

Matt Godbolt is a C++ developer working in Chicago in the finance industry.