Matt Godbolt’s blog

Self-indulgent postings

Elite's crazy tokenised string routine

The other day some discussions between friends led to someone posting a faked up Elite status screen with a silly rating instead of “Harmless”. Given what I’ve been spending all my time on recently I thought it would be more amusing to post a playable version with the “Harmless” replaced by patching the Elite code.

I naively thought this would be easy to achieve: I added a quick routine in jsbeeb to search for a string in memory and looked for Harmless. No joy. I tried a case-insensitive version. Likewise, no luck. This wasn’t going to be as simple as I thought.

I next put a conditional breakpoint on the OS write character routine OSWRCH, conditional on the character being the H of the Harmless I was hoping to replace.

In jsbeeb (at the time of writing) breakpoints are a little awkward to use: Assign a function to processor.debugInstruction which will be called at every instruction with the program counter and should return true to stop the program, or false to carry on. OSWRCH is at $ffee so bringing up the Javascript console and typing:

processor.debugInstruction = function(pc) { 
    return pc === 0xffee && processor.a == 72; 

A quick run and…no joy. Of course, Elite doesn’t use the OS routines to print text — it has its own.

At this point I remembered that Ian Bell, co-creator of Elite, had posted up some old disc images of the source code of the tape version of Elite. It’s not quite the same as this disc image I’m using, but it should be a good starting point. For ease of reading without firing up an emulator, he’s also posted a ZIP of the pertinent files as text.

Opening up the source gives an amazing insight into how the game was made: The source is a terrific mess. It’s split over several BASIC files which assemble the code by daisy-chaining each other. The code has very few comments and whitespace; crams many 6502 instructions per line; and uses many tricks like skipping over the next few instructions in branches via PC-relative mechanisms like BNE P%+4. It’s pretty hard-going stuff.

Luckily one of the few comments is \PRINT and so this is the first clue: the print routine is at the obfuscatedly-named label TT26. There’s no easy way to tie labels in the tape source to addresses in the disc version. I opted to search for the opcodes either side of the label, and struck gold: TT26 lives at address $1d56. I moved my Javascript breakpoint there and was able to catch the letter ‘H’ being drawn. Lucky Elite uses ASCII or otherwise I wouldn’t have been able to catch it here.

Here I stepped out (keypress ‘o’ in the latest jsbeeb debugger) and found the calling routine, and traced its origin to $31ac. Searching in the source I found this corresponds to label TT27, which sadly has no handy comments. This routine is the “print a string character” routine, but its input is definitely not ASCII.

The short version

The code EORs its input with 35, then interprets the character as a token: some tokens are ASCII, some are colour and case-changing codes, some refer to parts of the player’s state and some refer to one of a list of two-letter expansions, making the strings in general shorter. The string “Harmless” is actually all caps (the title casing is done by the code) and is encoded as 6 bytes: H [AR] M [LE] S S where the AR and LE are two-letter expansions. The string itself is stored at $0769. Capital letters are left unchanged, so to replace the string with, say, “Monkey” we can patch the code to put 6e4c4d48465a (“MONKEY” EORed with 35) in the relevant place. This can be achieved with a URL parameter in jsbeeb, adding &patch=@31a6,0769:6e4c4d48465a to the autoboot URL to yield a handy bootable patched URL.

The “@” sign in the patch URL sets a breakpoint at that location, and then applies the patch (in addr:bytestopatch) when the breakpoint is hit. That way, Elite loads up first, then the first time the print routine is called the patch is made.

The long version

Let’s dig into that print routine a bit and see what it’s doing. Here I’m going to use the label names from the source. To follow along at home, fire up Elite in jsbeeb, hit the Home key and then go to $31ac by typing into the disassembly address box, or open up ELITEB.TXT from the source ZIP.

The “print message” routine directly follows the print character routine, and I’ve annotated the whole lot. Some of the auxiliary routines are just above .TT27, but to keep things a little shorter I’ve left those out, and just explained them in the text (e.g. the csh routine).

.TT27       ; print char routine
TAX         ; save A into X, and set flags
BEQ csh     ; if zero, print "Cash:" and then
            ; the player's cash
BMI TT43    ; if negative (bit 7 set) go to TT43
            ; which treats the bottom 7 bits as an index
            ; into a token table
DEX         ; decrement and set flags...
BEQ tal     ; if zero, A was 1; go to tal
            ; which prints "tally" (a number)
BEQ ypl     ; if A was 2; go to to ypl
            ; which prints the current planet name
BNE P%+5
JMP cpl     ; if A was 3; go to cpl
            ; which prints the selected planet name
BEQ cmn     ; if A was 4; go to cmn
            ; which prints the player name (commander?)
BEQ fwl     ; if A was 5; go to fwl
            ; which just prints the player's cash
BNE P%+7
LDA #128    ; if A was 6, store 128 in QQ17
STA QQ17    ; which is a case-switch flag
RTS         ; and return
DEX         ; (this comes from P%+7 above)
BNE P%+5
STX QQ17    ; if A was 8, store zero ...
RTS         ; in case-switch flag and return
BEQ crlf    ; if A was 9, tab over and put a colon
            ; (maybe)
CMP #&60    ; NB we are comparing with the original 'A' again
BCS ex      ; If it's lower-case (or above), head to ex
            ; which treats it as a message index to print
CMP #14     ; if less than 14..
BCC P%+6    ; treat as a capitcal
CMP #32     ; if less than 32
BCC qw      ; ...go to 'qw' which treats it as message number
            ; A + 114
; So now we have a capital letter.
LDX QQ17    ; otherwise check the case-switch flag
BEQ TT74    ; zero? leave as-is and just print
BMI TT41    ; negative? head to TT41
BIT QQ17    ; bit 6 set?
BVS TT46    ; ...go to TT46 (which prints the character
            ; as-is, then clears bit 6 of case-switch)
; This prints the char in the A register in lowercase:
CMP #65     ; less than 'A' ?
BCC TT44    ; ...then print as is
CMP #&5B    ; more than 'Z' ?
BCS TT44    ; ...just print
ADC #32     ; otherwise add 32 to make it lower-case
.TT44       ; TT44 just jumps to TT26 (the character
JMP TT26    ; print routine).
BVS TT45    ; if bit 6 of case-switch set, go to TT45
CMP #65     ; if less than 65 ('A')
BCC TT74    ; just print, else..
ORA #64
STA QQ17    ; set bit '6' in case-switch
BNE TT44    ; and print as-is - i.e. this character
            ; will be printed but the next will be
            ; down-cased
ADC #114    ; add 114 to A
BNE ex      ; and print it as a canned message

LDA #21
STA XC      ; store 21 in the x position
BNE TT73    ; and go to TT73 (which prints a colon)

CPX #FF     ; if case-switch was $ff
BEQ TT48    ; ...skip entirely
CMP #65     ; if it's 'A' or above
BCS TT42    ; ...print in lowercase
PHA         ; save character to print
AND #191    ; clear bit 6 of case-switch
STA QQ17    
PLA         ; restore character
JMP TT26    ; and print as-is

CMP #160    ; is it >160?
BCS TT47    ; print as message number (A-160)
; Else we're going to print it as a one or
; two-character token.
AND #127    ; clear top bit to leave token number
ASL A       ; double it to get offset into table
TAY         ; put in Y register
LDA QQ16,Y  ; read first char
JSR TT27    ; print it
LDA QQ16+1,Y; read second char
CMP #63     ; if it's 63
BEQ TT48    ; ...skip (not that I've noticed any tokens that
            ; take advantage of this)
JMP TT27    ; else print second char and exit

; We got here if A>160 which means print canned message A-160
SBC #160        ; A-=160
; This is the main entry routine that prints canned message A
TAX             ; save A in X
; point zero-page location (V,V+1) to QQ18
; which is the message table.
LDA #(QQ18 MOD 256)
LDA #(QQ18 DIV 256)
LDY #0          ; init loop counter
BEQ TT50        ; Skip if this is the zeroth message

; TT51 to TT59 skip to the end of a zero-terminated string
LDA (V),Y       ; get current byte
BEQ TT49        ; if we found the end, exit loop
INY             ; otherwise move to next char
BNE TT51        ; if we didn't overflow, reloop
INC V+1         ; handle overflow
BNE TT51        ; and reloop
INY             ; skip the zero-terminator
BNE TT59        ; handle overflow
INC V+1         
DEX             ; decrement the message index
BNE TT51        ; and reloop if we haven't skipped to target

; Now we're at the main loop where (V),Y point at the message
; to print.
; Printing a char can recursively call us, so stack (V+1) 
; and Y (which is enough to preserve our state).
PHA             ; preserve "Y"
PHA             ; preserve (V+1)
LDA (V),Y       ; Read a string byte
EOR #35         ; "Decrypt" it
JSR TT27        ; call the print char routine
PLA             ; restore V+1
PLA             ; restore Y
INY             ; move to next char
BNE P%+4        ; overflowed?
INC V+1         ; handle overflow
LDA (V),Y       ; hit the end of the string?
BNE TT50        ; if not, go around again
RTS             ; we're finished

The token table (at QQ18) maps the following indices to these two-byte strings:

00  AL
01  LE
02  XE
03  GE
04  ZA
05  CE
06  BI
07  SO
08  US
09  ES
0a  AR
0b  MA
0c  IN
0d  DI
0e  RE
0f  A?
10  ER
11  AT
12  EN
13  BE
14  RA
15  LA
16  VE
17  TI
18  ED
19  OR
1a  QU
1b  AN
1c  TE
1d  IS
1e  RI
1f  ON

So the HARMLESS string was encoded as:

48  'H'
8a  reference to 'AR'
4d  'M'
81  reference to 'LE'
53  'S'
53  'S'

Then EORed with 35 ($23) and zero-terminated to yield 6b a9 6e a2 70 70 00

To make your own simple strings you can use this snippet of python:

def encode(msg):
    return "".join("{:02x}".format(ord(x)^35) for x in msg)

msg = raw_input("Text to encode: ")
print encode(msg)

Filed under: Coding

Posted at 20:00:00 BST on 12th June 2014.

jsbeeb Part Four - IRQs and timers

This is the fourth post in my series on emulating a BBC Micro in Javascript. I’d recommend reading the previous part on 6502 internal timings before reading this post. It’s also handy to have read read the first part which covers general stuff, and the second part which focuses on the video hardware.

Again thanks to my good chum Rich Talbot-Watkins who helped demystify what goes on in the Beeb.

May I interrupt you a moment?

Like most CPUs, the 6502 has a physical pin (the IRQ line) which causes an interrupt when brought low: the current instruction is completed and then instead of proceeding to the next instruction, the CPU enters interrupt mode (which disables further interrupts) and jumps to an operating system handler. This handler then deals with the interrupt — a timer firing, a keyboard event, disc activity and so on. The handler is responsible for:

Of course in reality nothing is that simple. On the 6502 the simple pipelining has an interesting side effect: the IRQ pin is only checked on the penultimate cycle of each instruction. This is the point at which the next instruction’s address is determined and fetched, ready for execution on the next cycle.

The penultimate cycle in question is the last “logical” cycle of the 6502’s state machine, not necessarily the penultimate physical cycle — in the case of cycle stretching on the final cycle of the instruction, those stretched cycles don’t cause the 6502 to recheck the IRQ line.

If the IRQ line is brought back high mid-instruction when the processor had already noticed it being low, the interrupt is still taken — even though there’s no evidence left for the interrupt handler to determine why it was called.

Not emulating these strange quirks leads to subtle differences in when interrupts are taken, which causes the legendary Kevin Edwards’ protection systems to fail to decode.

Given all this complexity, it’s interesting to think how experiment to try and work out how Kevin sussed all this out. Rich had a guess and in this fascinating thread Kevin confirms how it was done.

Running out of time(rs)

While we’re talking cycle correctness we should mention the hardware timers. There are two timers in each of two 6522 Versatile Interface Adapters (VIA). Each timer can generate an interrupt when it overruns, and automatically reloads from a programmable start value.

From our cycle correctness point of view the interesting thing about the timers is the exact time that the timer fires. They tick at 1MHz, and generate their interrupt a half-cycle after they underflow below $0000 (back to $ffff). At this point they either stop (a “one-shot” timer), or reload their initial value and start again (a “free-running” timer).

When used in the latter mode the reload happens on the half-cycle after the interrupt: When a value N is written to the timer’s latched start value, the first interrupt will happen N + 1.5 1MHz cycles later (i.e. 2N + 3 CPU cycles); and subsequent interrupts happen at N + 2 1MHz cycle (2N + 4 CPU cycle) intervals. (Recall that the CPU clock runs at 2MHz.)

If we configured a free-running timer with a counter of 2, and assuming we could acknowledge the IRQ almost straight away1, it would look something like:

Some games rely on the exact timing of the timers to change video settings mid-frame. For example, the driving game Revs sets the timer to fire every 312.5 * 64 - 2 cycles — which is to say exactly the number of 1MHz ticks there are in an interlaced PAL TV frame, minus two to account for the later time the timer fires. Most games use the vertical blanking interrupt generated by the video circuitry at the top of a frame to set a one-shot timer to fire after the appropriate time, but Revs just leaves a free-running timer going. If the emulation wasn’t spot on, then over time the place where the video settings change would slowly creep up the screen.

Acknowledging the timer

Reading the low 8 bits of a timer has an extra meaning: it acknowledges any interrupts from that timer. This is what an interrupt handler would do: once it had ascertained that it was a timer that caused the IRQ (by checking the VIA status register) it would read from that timer to stop it holding the CPU IRQ pin low.

By now you’ll realise there’s going to be a catch: if the CPU tries to acknowledge the timer firing on the same cycle that it fired, then the acknowledgement gets lost2. An interrupt will still be generated on the next cycle. Again, Kevin Edwards’ protection systems take advantage of this unusual behaviour.

Of course, on a real system all the components run in parallel. In an emulator they’re emulated serially. Empirically, we determined that to get things to work right, at each tick we do things in this order:

  1. Check to see if any IRQs are pending (if this is a penultimate CPU cycle).
  2. Generate any IRQs due to happen this cycle.
  3. Let any CPU memory accesses happen (which may of course clear IRQs etc).

Also there are a few extra + 1s in the code when counters are being loaded from or written to account for the actual point within a cycle the CPU would actually read or write the timer.

Most of the relevant code for the interrupts in in the via.js source file. The interrupt generation and counter handling is in the polltime function and it looks a little like:

// "cycles" is the number of 2MHz cycles that
// have occurred since the last time we checked.
count -= cycles;
justHit = false;
// Have we fired? -3 here accounts for the
// initial late fire time.
if (count <= -3) {
  // "justHit" is our note that we
  // should ignore any acknowledgment on
  // this cycle.
  if (count === -3) justHit = true;
// Reload the timer, accounting for the
// extra 4 2MHz ticks of a reload.
while (count < -3) 
  count += reload + 4;

Getting the emulation cycle-perfect has been a very satisfying achievement. Thanks are due to Ed Spittles for gathering real-world measurements, and Rich Talbot-Watkins for poring over technical documents and making spreadsheets of possible internal timing behaviour. It’s all been worth it as we finally cracked Kevin’s preotection system.

As an amusing side-note, when Rich and I first tried cracking the Alien 8 protection system (one of Kevin’s earlier accomplishments) some 20 years ago we started down this road. We tried building a 6502 emulator — in BBC Basic — to run the decryption code. We never got it working, and now I have a much fuller appreciation of why!

  1. The mark “x” on the diagram is where the IRQ is acknowledged. 

  2. This seems to be a 6522 glitch, though it’s not documented anywhere. 

Filed under: Coding Emulation

Posted at 14:40:00 BST on 4th June 2014.

jsbeeb Part Three - 6502 CPU timings

This is the third post in my series on emulating a BBC Micro in Javascript. You might find it instructive to read the first part which covers general stuff, or the second part which focuses on the video hardware. This post will cover the subtleties of the 6502’s instruction timings. In the next post I’ll cover how interrupts and hardware timers fit into the mix.

This time around the thanks really have to go to my good chum Rich Talbot-Watkins. He and I have been friends since we were twelve years old and have been programming together since we met. His knowledge of the Beeb is legendary — he still writes games for it even now. His help in getting the timings spot on in jsbeeb was invaluable.

Why is timing so important anyway?

Getting the instruction timings right is paramount for good emulation. I covered some of this in the first post, but so many tricks on the BBC required intimate knowledge of the instruction and hardware timings that if an emulator didn’t account for them properly, some things wouldn’t work right.

The most challenging example of timings were games protection systems. These would be used to prevent disassembly, copying and cheats. The game code would be encrypted first and would be decrypted at runtime by code. The code would often use XORs with hardware timers (amongst other things) to make it difficult to decrypt manually. In order to decrypt correctly the relative timing of the decryption code and the hardware timers has to be emulated absolutely perfectly.

Worse still, it’s not just how many CPU cycles each instruction takes that needs to be correct, but the fact that the memory reads and writes happen on particular cycles within an instruction that need to be correct.

Let’s delve a little into that:

Life of an instruction

The 6502 at the heart of the Beeb is simple but powerful. Like most other processors, instructions are fetched from RAM and executed in turn. The 6502 has a very simple pipeline — the fetch for the next instruction happens during the execution of the previous instruction. This has important consequences that we’ll talk about later.

Most CPUs have a physical pin dedicated to indicate “I need to access memory”, but to save costs this was left off of the 6502. Instead the memory system is permanently engaged. As almost every clock cycle needs access to memory (to read instructions or data) this is generally a win. Again, this is an important fact which leads to some unusual behaviour we must account for.

The 6502 treats the 256 bytes at the bottom of RAM (the “zero page”) specially. Instructions accessing the zero page are encoded differently and run a little faster (as they don’t need to encode two address bytes). The zero page can also be treated as an array of 16-bit index pointers.

Let’s go through a quick example, calculating a very simple checksum over ten bytes:

.checksum        ; Sum 10 bytes pointed by $70/$71
    LDA #0       ; Set A (our checksum) to zero
    TAY          ; Also put zero in Y (the loop counter)
    CLC          ; Clear carry (all adds are with carry)
    ADC ($70), Y ; Add check sum
    INY          ; Y++
    CPY #10      ; is Y 10?
    BNE lp       ; if not, loop around
    RTS          ; we're done, result in A

This example assembles to the 12 bytes:

a9 00 a8 18 71 70 c8 c0 0a d0 f8 60

Using Visual 6502, we can see what memory accesses happen on each clock tick:

  #  addr  data rw Comment
  0 $0000  $a9   1  LDA #
  1 $0001  $00   1  #0
  2 $0002  $a8   1  TAY
  3 $0003  $18   1  (CLC)
  4 $0003  $18   1  CLC
  5 $0004  $71   1  (ADC)
  6 $0004  $71   1  ADC (zp),Y
  7 $0005  $70   1  $70
  8 $0070  $00   1  addrLo
  9 $0071  $00   1  addrHi
 10 $0000  $a9   1  val at (addr)
 11 $0006  $c8   1  INY
 12 $0007  $c0   1  (CPY) 
 13 $0007  $c0   1  CPY #
 14 $0008  $0a   1  #10 
 15 $0009  $d0   1  BNE
 16 $000a  $f8   1  lp
 17 $000b  $60   1  (RTS)
 18 $0003  $18   1  CLC

Here the columns indicate the cycle number; the value output on the physical address bus (addr); the data on the data bus (data); the value on the “read not write” pin (rw), where a 1 indicates a read access, and a 0 indicates a write; and then a comment explaining a little of what’s going on that cycle.

As you can see the memory is accessed unconditionally on every cycle. Points to note:

On cycles 3, 12 and 17 the opcode following the current instruction is fetched prematurely. In the case of cycles 3 and 12, the instruction is a single byte instruction. Each instruction takes a minimum of two clock cycles, and in this case the following instruction is fetched twice. In the case of cycle 17, the next instruction is fetched even though the branch isn’t taken.

The sequence of fetches for the ADC ($70), Y instruction is opcode, zero page address, zero page low, zero page high (the low and high together give a 16-bit address), and finally address pointed to plus Y.

So far so good - it seems unusual to our modern “memory is slow” mindset that the processor touches RAM every cycle, but this is from an age where processors and RAM were clocked at the same speed.

Now let’s get a little more interesting. Let’s set the address pointed to by $70/$71 to be $0eff and take a look at what happens for Y=0 and Y=1, where we’d expect $0eff+0 = $0eff and $0eff+1 = $0d00 to be read from.

First, here’s Y=0:

  #  addr  data rw Comment
  6 $0004  $71  1  ADC (zp),Y
  7 $0005  $70  1  $70
  8 $0070  $ff  1  addrLo=$ff
  9 $0071  $0e  1  addrHi=$0e
 10 $0eff  $00  1  val at $0eff

Nothing shocking there. Now take a look at the next iteration where Y=1:

  #  addr  data rw Comment
 20 $0004  $71  1  ADC (zp),Y
 21 $0005  $70  1  $70 
 22 $0070  $ff  1  addrLo=$ff
 23 $0071  $0e  1  addrHi=$0e
 24 $0e00  $00  1  val at $0e00 ?!
 25 $0f00  $00  1  val at $0f00

Whoah — what’s all that? On cycle 24, we fetch the byte at $0e00 which is not the right address at all. Then there’s an extra cycle where we read from the correct place.

The 6502 is an 8-bit machine and so adding an 8-bit offset to a 16-bit address ought to take two cycles: one to add the low bits together, and then one to add any carry to the high bits. As the address bus is always active, the non-carried address is output on the first cycle. If there’s no carry, the 6502 stops there, else it does another read, this time with the correct address. Neat, eh?

Things aren’t always that simple, however. For instructions that both read and write, the double-read always happens. For example, the instruction INC $1234,X will always do two reads and one write, even if there’s no carry. This is because even if there’s no carry to do, there’s still work to be done waiting for the increment operation to finish before the final result can be stored. There’s nothing to short-cut. What’s more, the increment operation takes a while longer and the write happens twice; once with the unmodified value, and once with the correct value. This is what it looks like, when there’s no carry (for INC $3412,X with X=0):

  #  addr  data rw Comment
  2 $0002  $fe  1  INC Abs,X
  3 $0003  $12  1  addrLo=$12
  4 $0004  $34  1  addrHi=$34
  5 $3412  $00  1  read $3412 (and get 0)
  6 $3412  $00  1  read $3412 again
  7 $3412  $00  0  write back 0
  8 $3412  $01  0  write back 1

And when there’s a carry (INC $3412,X with X=$FF):

  #  addr  data rw Comment
  2 $0002  $fe  1  INC Abs,X
  3 $0003  $12  1  addrLo=$12
  4 $0004  $34  1  addrHi=$34
  5 $3411  $00  1  read $3411 (non-carry addr)
  6 $3511  $00  1  read $3511 (correct addr)
  7 $3511  $00  0  write back 0
  8 $3511  $01  0  write back 1

Wait around for ages and then two turn up at once

Just when you thought all this was making sense, there’s another thing to consider. Inside the Beeb there are two buses. Fast peripherals and RAM can run at the same speed as the CPU itself and are clocked at 2MHz. Some of the peripherals can’t work at this blazing speed, and instead need to be communicated with at the slothly 1MHz. These peripherals are memory mapped and the Beeb supports this variable speed memory access with a bit of help of some external circuitry which looks at the memory addresses on the bus and slows the CPU clock down when accessing areas of memory where slow peripherals are mapped. The gory details are mostly covered in the BBC Micro hardware guide, but the main thing we need to worry about is how our CPU clock gets synchronized up and cycle-stretched to talk to the 1MHz bus.

There are two possible cases — one where the bus access starts in the middle of a 2MHz pulse, and one where they coincide.

Most of the interesting things happen on the falling edge of the clock (although the 6502 does some things on the rising edge too). That means that depending on whether we’re on an odd or even cycle relative to the 1MHz timer we’ll get cycle stretched to either 2 or 3 cycles. And of course, each time the processor accesses memory, it may get stretched. So, putting this together with the previous section on all the extra accesses that happen, you can see that an instruction modifying the memory on a 1MHz peripheral can take many more cycles than it would otherwise seem to need.

Let’s take a specific example, from the legendary Kevin Edwards’ Nightshade protection. Here, Kevin uses a read-modify-write instruction on a 1MHz-bus attached timer register at $fe48. This register is the low 8 bits of a 16-bit timer that counts down at 1MHz. Writing to the register replaces the bottom 8 bits with the written value (which then continues counting down at 1MHz). Here the read-modify-write instruction is a rotate left, which suffers from the same double-write behaviour as the DEC instruction from the previous section.

  #  addr  data rw Comment
  0 $0d2d  $2e  1  ROL abs
  1 $0d2e  $48  1  addrLo=$48
  2 $0d2f  $fe  1  addrHi=$fe
  3 $fe48  $00  1  read...
  4 $fe48  $00  1  ..stretch..
  5 $fe48  $08  1 complete
  6 $fe48  $08  0  write unmodified..
  7 $fe48  $08  0  ..stretch..complete
  8 $fe48  $10  0  write modified..
  9 $fe48  $10  0  ..stretch..complete

Unlike the previous examples, this one is not generated by Visual 6502 but is hand-calculated, and shows what happens on each 2MHz timer cycle. As far as the 6502 is concerned it only executed 6 cycles, but because its external clock was slowed down, the wall-clock time taken is 10 2MHz ticks.

This example shows three stretches. Remember that all the time, the timer itself is counting down too, so unless the emulation models the exact cycle within the instruction that the reads and writes happen, the timer value will not be updated properly. Kevin also uses the fact that when timers overrun interrupts are generated to make the code even more difficult to emulate. In some cases, writing to the timers will suppress or cancel pending interrupts, so subtle timing differences can cause wildly different interrupt behaviour too.

Implementation details

jsbeeb implements all the complex behaviour by “compiling” instructions from a table of opcode side effects and knowledge of what addressing modes cause what kinds of memory accesses. A list of cycles where memory accesses is kept, and then an optimization pass is made: any memory accesses that are known not to refer to any hardware devices (and thus don’t have any sensitive time dependencies or side effects) are coalesced where possible. Sequences of CPU cycles with no hardware visible effects are also coalesced so that the various state machines (like timers, video, sound etc) can be run for as long a stretch as possible. Running them all a single cycle at a time is somewhat inefficient.

For memory accesses that can’t be optimized in this way, code is generated to check for 1MHz bus accesses and appropriately sync the CPU clock.

The code for this is in the InstructionGen class in 6502.opcodes.js.

Let’s look at INC Abs,X from earlier. The compilation starts at getInstruction which is given the text of opcode and its arguments. It calls getOp with the opcode part (INC) and that returns a struct describing what the instruction does (its op) and what bus cycles are needed (read and/or write). In this case the operation needs both a read and write bus cycle, and the op is the javascript snippet:

REG = (REG + 1) & 0xff;

The cpu.setzn part sets the zero and negative processor flags according to its argument. REG here is a variable that will contain the read-in value from the read bus cycle and will be also be the value written out in the write cycle.

Next getInstruction gets together the addressing mode part, by parsing the opcode arguments abs,x 1. Now the instruction starts being put together. In the following code snippets I’ve removed some if checks (describing them instead), and also have shortened the lines to fit here. Check the source on github for full details.

First the 16-bit address is fetched from the two bytes at the current program counter. Then the indexed address is calculated, along with a non-carried version. We account for the three cycles this takes (to fetch the opcode byte, and then the two bytes of the address):

ig = new InstructionGen();
ig.append("addr = cpu.getw();");
ig.append("addrWithCarry = addr + cpu.x;");
ig.append("addrNonCarry = (addr&0xff00) 
            | (addrWithCarry&0xff);");

Next we perform the operation’s required bus accesses. In our case we need a read and a write. Here we account for the spurious read and write. The readOp and writeOp methods of the InstructionGen are responsible for ticking the CPU the appropriate amount of time depending on cycle stretching.

if ( && op.write) { // read/modify/write
  // First a spurious read of the non-carried address
  // Now the actual read
  ig.readOp("addrWithCarry", "REG");
  // And a write of the unmodified value
  ig.writeOp("addrWithCarry", "REG");

Now we apply the operation and write back the final result.

if (op.write)
  ig.writeOp("addrWithCarry", "REG");

The final code is the result of calling render on the InstructionGen, which does all the timer magic.

The final compiled code for INC Abs,X comes out as something like:2

addr = cpu.getw();
addrWithCarry = (addr + cpu.x) & 0xffff;
addrNonCarry = (addr&0xff00) 
    | (addrWithCarry&0xff);
    * ((cpu.cycles & 1) + 1));
    * (!(cpu.cycles & 1) + 1));
REG = cpu.readmem(addrWithCarry);
    * (!(cpu.cycles & 1) + 1));
cpu.writemem(addrWithCarry, REG);
REG = (REG + 1) & 0xff;
    * (!(cpu.cycles & 1) + 1));
cpu.writemem(addrWithCarry, REG);

To get the actual version, fire up jsbeeb and type instructions6502[0xfe] into the Javascript console.

Next time I’ll cover how the 6502 deals with interrupts and how that interacts with the pipelining. I’ll also cover one of the more common sources of interrupts: the 6522 Versatile Interface Adapter’s timers.

  1. The code uses lowercase throughout although I’ve used more usual capitalization in this write-up. Sorry for any confusion! 

  2. As I look at the generated code I realise I could make it a bit more efficient and readable. I’ll look at updating that later so hopefully by the time you take the trouble to look it will be much nicer. 

Filed under: Coding Emulation

Posted at 16:10:00 BST on 29th May 2014.

Emulating a BBC Micro in Javascript - Part Two

Following on from my previous post, I’m going to talk a bit about emulating the video hardware of a BBC Micro. Firstly, a big credit to Tom Walker for his b-em emulator upon which much of the jsbeeb video code is based. Thanks Tom!

The video hardware

When most people think of a BBC Micro, they think of the iconic “Teletext” display, referred to as MODE 7. It was the default screen mode and had some great benefits: it had high-resolution, clear text, many colours, and some cool effects. The fact it only took up 1KB of memory was nice too, especially when that may be 1/16th of your whole RAM.

Picture of a typical MODE 7 display
MODE 7 in all its glory

MODE 7 was supplied by a separate chip from the rest of the screen modes, and although it’s very interesting and complex, it’s not what I’m going to talk about today. I’m going to talk about the rest of the video circuitry; the graphics modes provided by a Motorola 6845CRTC, in combination with a custom video ULA which provided timings and palette configuration.

The 6502 CPU communicated with the 6845 by writing to its memory mapped area at 0xfe00 to 0xfe07. The 6845 has 18 internal registers and to access them one would write the address number to 0xfe00 and then the value required to 0xfe01. (The other addresses in the range mapped to the same two functions).

The registers control things like where the television sync pulse would be generated, when and where horizontal lines would start and end, and what memory address the bitmapped screen was stored at. By judicious programming of the start of screen memory, games could perform hardware scrolling, albeit at a somewhat coarse level of 8 pixels in the X direction.

jsbeeb operates in a physical PAL TV space of 1280x768 pixels. This allows it to account for things like the vertical sync pulse being configurable. At the end of each frame, the area of the 1280x768 that actually contained pixels is then blitted to a 698x571 canvas element. The unusual size is to purposefully blur the pixels a little to simulate an authentic TV experience.

Some of the screen timings and registers
Some of the relationships between register values, courtesy of the Advanced User Guide

Every clock cycle (of the 2MHz clock), then 6845 processes one byte of screen memory, generating anywhere from two to eight logical pixels, depending on the number of colours per pixel and other register settings. In jsbeeb’s terms, each tick generates between eight and sixteen physical pixels in PAL space.

On each clock cycle, jsbeeb looks at all the hardware registers, determines which settings are in effect, and renders either 8 or 16 32-bpp pixels into its buffer. The video code runs every time the CPU emulator calls its polltime(), making it a kind of co-routine with the CPU (along with the other peripheral emulators). (Code is in video.js)

The reason the state is re-checked each cycle is because it was very common for games to alter the settings mid-frame by clever use of well-timed interrupts to, for example, scroll part of the screen, or change the palette of colours available halfway down the screen.

Checking each time is a little inefficient – as is ticking the screen code every time the CPU counter ticks. In future updates I plan on accumulating screen clock cycles and running the screen for longer batches if and only if I know for certain none of the registers have changed. At this point I can also consider better code arrangements to take advantage of rendering multiple screen pixels with the same settings.

The framebuffer object written to is a 32-bpp overlay on the 8-bpp underlying canvas’s framebuffer. This lets me write a single 32-bpp value to set a whole pixel. The mappings between the BBC’s colour format and the 32-bpp values required are all done when the palette registers are written to, instead of each time a colour is needed.

The code for a single screen byte dat looks something like:

var tblOff = dat * 16;
for (i = 0; i < charWidth; ++i) { // is 8 or 16
    fb32[cur + i] = ulapal[table4bpp[tblOff + i]];

The table4bpp is a table mapping the contributions of each display byte’s bits to the 8 or 16 physical pixels of a character (created here). It maps to a 4-bit value which indexes into the palette, ulapal.

Conceivably it might be worth regenerating the table each time the palette is modified rather than doubly-indexing into tables for every physical pixel: I haven’t done the profile so I can’t be sure. In reality the table is also indexed by some other of the video registers, but these also could be folded into the table regeneration. I’m intrigued now and will have to take a look!

In spite of the complexity, the performance of the screen code is suprisingly good. It does appear near the top of the profiles but upon further investigation a lot of the time is spent at the end of each frame in the blitting code that takes the area of the PAL physical screen that was rendered to and stretch blits it into the actual canvas object displayed on the web page. Blitting code is currently in the big blob of main.js. Suggestions on how to improve its speed welcomed!

Returning briefly to MODE 7 support; this rather complex mode is implemented only partially at the moment. Code is in teletext.js, and there’s a lot of allusions to interlacing that I don’t currently do. Tom Walker’s b-em is much better at this part and it’s one of the reasons his emulator’s MODE 7 display looks so much nicer: emulating the TV’s higher-resolution interlacing makes everything smoother. I’m hoping to find time to return to this soon.

Filed under: Coding Emulation

Posted at 12:10:00 BST on 16th May 2014.