Matt Godbolt’s blog

Self-indulgent postings

Debugging BBC Master demos with jsbeeb

Over the last few weeks I’ve really been concentrating on shoring up the emulation quality of jsbeeb, mainly by adding test cases for all the undefined opcodes. Thankfully, there are some processor test suites out there and I’ve been able to get them running in jsbeeb as part of the continuous build. It now takes about 40 minutes to run all the tests, but I’m pretty darned sure jsbeeb has an accurate NMOS 6502 emulation.

Most recently I’ve been taking a glance over the BBC Master emulation, both in terms of hardware and the slightly different CMOS 65SC12 chip it used.

With a little help from my friends on the stardot forums I was able to get some tests run on real Masters and compare the output of the few instructions I couldn’t find info for. Most notably the unusual behaviour of the 6502 when in Binary Coded Decimal mode has mostly been fixed in the 65SC12.

To test the emulation quality, Rich Talbot-Watkins suggested I try out a Master-only demo that Tom Walker (of b-em fame) had written1. I fired it up and was both very impressed at the quality of the demo, and pleased that it worked first time!

The Master demo, by Tom Walker
Tom’s impressive Master demo

Then I restarted it and the second time it ran, the sound went all crazy.


A bit of experimentation and it seemed about 30% of the time the sound would go nuts about a minute into the demo, squeaking and missing half the notes.

At first I suspected the sound chip emulation. The emulation was taken from my earlier Sega Master System emulator which happened to use the same chip, and I had recently updated the emulation there to fix up some other problems that Rich had noticed. I’d taken the information from the SMS Power emulation site, and it was pretty accurate, but maybe it didn’t match the chip in the Beeb.

It’s definitely the case that Tom’s b-em emulator and jsbeeb differ on how they latch the register values — maybe that was it? I ended up writing a “b-em” style sound system and running it in parallel with the jsbeeb one and comparing their outputs. They agreed 100%, even when the squeaks were there.

Scratch that as a source of error, then. It was a bit suspect anyway as the problem is intermittent and (somewhat) non-deterministic2.

While instrumenting the sound code I was able to notice a pattern: at the point the squeaks started, two of the sound channels would be set to $020 values. This was a handy automatic diagnostic of the problem — it seems there’s no normal time when the music ought to be this squeaky.

The sound code in the demo is based off of Rich’s “Blurp” music player. As such he had a pretty good understanding of how it worked, and he was able to determine what inputs to the sound code would make it choose to play a sound of frequency $020, reading from a particular part of the frequency lookup table.

Armed with this information I added a breakpoint on reading that frequency table location. That allowed me to see when it was being read, but there wasn’t any obvious reason why: I needed to somehow track to the root cause.

Writing an emulator gives you amazing flexibility in debugging — I added a circular buffer of the last 256 values of all the registers after executing each instruction. That meant when the breakpoint tripped I had a history of how the code and the values in the A, X, Y and flags registers had evolved.

Poring over the history I noticed something odd in the code compared to the version in Rich’s original: there was one comparison with zero, which was followed by a branch-if-less-than — a branch that would never be taken.

PC   opcode      A  X  Y   ; my comment
8248 CPX #$00   20 02 00
824a BNE $8291  20 02 00
8291 LDX #$00   20 00 00
8293 CMP #$c0   20 00 00   ; >= c0?
8295 BCC $829b  20 00 00   ; no...
829b CMP #$00   20 00 00   ; >= 0 (er?)
829d BCC $82a3  20 00 00   ; no... carry is set

This is deeply suspicious — this wasn’t in Rich’s Blurp code and didn’t make any sense. On a hunch I reloaded until I got a “working” version where the sound didn’t squeak and lo and behold: The code at location $829b was comparing with #$60. Something was overwriting the code!

Excitedly I breakpointed on memory writes to the $60 byte at location $829a and was rewarded with the culprit: a pretty screen-clearing routine that shutter-style cleared the screen at the point in the demo where the sound went wrong!

Hooray! Now I knew how the corruption was happening. Next I needed to work out why it was intermittent…

The clear routine writes zero bytes in a shutter-style pattern to the screen memory. Nothing too unusual there. The screen memory ends at $7fff in the Master memory map, and then in the bank $8000-#bfff there’s a selectable 16KB bank which can be ROM or one of four banks of RAM. The sound playing routines live in one of the RAM banks at location $8000.

To apparently save some cycles somewhere, the screen routine doesn’t explicitly check for overrunning the memory at $8000: it writes past the end in some cases. 100% reproducible disaster, you might think. But no: Tom’s smarter than that and every time he’s about to write to the screen, he swaps the RAM out of $8000 and instead pages in some ROM — which of course makes the writes harmless. Here’s a snippet of the code:

LDA #$0f
STA $fe30     ; page in ROM
SEI           ; disable IRQs
LDA #$00
STA ($72),Y   ; store zero
CLI           ; re-enable IRQs
LDA $72
ADC #$08      ; move to the next char
STA $72
BNE loop      ; and keep clearing

The interesting part is where interrupts are disabled: recall I said the sound playing routine is in one of the RAM banks that gets paged in at $8000? Well, the sound player is driven by interrupts. The main interrupt handler (which lives down in unpaged memory) pages in the relevant RAM bank when it realises it needs to handle sound (in response to a timer firing). Critically, when it returns it doesn’t restore the page that was previously there.

To prevent this being a problem, Tom sensibly disables interrupts while he does his write. However he doesn’t disable interrupts between paging the ROM in and writing so there’s a tiny window of opportunity where, should an IRQ happen, the sound handling bank is left in just before the zero byte is written.

The window is tiny - there’s only 6 cycles between the point of no return for the store, and the SEI taking effect. Additionally, it’s only if the clearing routine is just past the end of the screen when the IRQ happens that it causes an issue.

Now we understand how, where and why. One last nagging thing to take care of — why is it intermittent? Further instrumentation gives the answer: the timer values that are used to set the interrupt frequency aren’t being deterministically set. Instead their values are loaded in the interrupt routine itself. That means the “start” time of the interrupts depends on how many cycles have passed since the last hard reset. I was able to use this to find the 6-cycle window: by externally varying the timer values at the point at which the first instruction of the demo executes I could find the values that caused it to go wrong.

b-em and other emulators don’t seem to see this issue — I believe as they don’t model the interrupts as precisely in all cases.

The only remaining mystery is why nobody has ever seen the issue on a real BBC Master. The marvellous Ed Spittles has run the demo a number of times on a real Master and never seen the issue. This leaves me feeling that perhaps I’m missing something, but it could just be an emulation artifact (maybe in how long the disc drive takes to load?) that makes the few timer values where it’s a problem more likely.

Any ideas on this gratefully accepted! For now, I’m happy enough with my explanation and can hopefully get on with some of the other things that need attending to.

Big thanks to Rich for putting up with my enormous nocturnal email rambles as I picked my way through all this in addition to giving fantastic advice on what the problem might be. Again, also thanks to Tom Walker for both the demo and for his excellent emulator which was the inspiration for jsbeeb.

  1. Sadly I don’t know the license state of the demo so I haven’t been able to share it here. 

  2. From an autoboot (with a deterministic start time) it would always fail. Subsequent runs would fail about 30% of the time. 

Filed under: Emulation

Posted at 03:50:00 GMT on 31st October 2014.

Elite's crazy tokenised string routine

The other day some discussions between friends led to someone posting a faked up Elite status screen with a silly rating instead of “Harmless”. Given what I’ve been spending all my time on recently I thought it would be more amusing to post a playable version with the “Harmless” replaced by patching the Elite code.

I naively thought this would be easy to achieve: I added a quick routine in jsbeeb to search for a string in memory and looked for Harmless. No joy. I tried a case-insensitive version. Likewise, no luck. This wasn’t going to be as simple as I thought.

I next put a conditional breakpoint on the OS write character routine OSWRCH, conditional on the character being the H of the Harmless I was hoping to replace.

In jsbeeb (at the time of writing) breakpoints are a little awkward to use: Assign a function to processor.debugInstruction which will be called at every instruction with the program counter and should return true to stop the program, or false to carry on. OSWRCH is at $ffee so bringing up the Javascript console and typing:

processor.debugInstruction = function(pc) { 
    return pc === 0xffee && processor.a == 72; 

A quick run and…no joy. Of course, Elite doesn’t use the OS routines to print text — it has its own.

At this point I remembered that Ian Bell, co-creator of Elite, had posted up some old disc images of the source code of the tape version of Elite. It’s not quite the same as this disc image I’m using, but it should be a good starting point. For ease of reading without firing up an emulator, he’s also posted a ZIP of the pertinent files as text.

Opening up the source gives an amazing insight into how the game was made: The source is a terrific mess. It’s split over several BASIC files which assemble the code by daisy-chaining each other. The code has very few comments and whitespace; crams many 6502 instructions per line; and uses many tricks like skipping over the next few instructions in branches via PC-relative mechanisms like BNE P%+4. It’s pretty hard-going stuff.

Luckily one of the few comments is \PRINT and so this is the first clue: the print routine is at the obfuscatedly-named label TT26. There’s no easy way to tie labels in the tape source to addresses in the disc version. I opted to search for the opcodes either side of the label, and struck gold: TT26 lives at address $1d56. I moved my Javascript breakpoint there and was able to catch the letter ‘H’ being drawn. Lucky Elite uses ASCII or otherwise I wouldn’t have been able to catch it here.

Here I stepped out (keypress ‘o’ in the latest jsbeeb debugger) and found the calling routine, and traced its origin to $31ac. Searching in the source I found this corresponds to label TT27, which sadly has no handy comments. This routine is the “print a string character” routine, but its input is definitely not ASCII.

The short version

The code EORs its input with 35, then interprets the character as a token: some tokens are ASCII, some are colour and case-changing codes, some refer to parts of the player’s state and some refer to one of a list of two-letter expansions, making the strings in general shorter. The string “Harmless” is actually all caps (the title casing is done by the code) and is encoded as 6 bytes: H [AR] M [LE] S S where the AR and LE are two-letter expansions. The string itself is stored at $0769. Capital letters are left unchanged, so to replace the string with, say, “Monkey” we can patch the code to put 6e4c4d48465a (“MONKEY” EORed with 35) in the relevant place. This can be achieved with a URL parameter in jsbeeb, adding &patch=@31a6,0769:6e4c4d48465a to the autoboot URL to yield a handy bootable patched URL.

The “@” sign in the patch URL sets a breakpoint at that location, and then applies the patch (in addr:bytestopatch) when the breakpoint is hit. That way, Elite loads up first, then the first time the print routine is called the patch is made.

The long version

Let’s dig into that print routine a bit and see what it’s doing. Here I’m going to use the label names from the source. To follow along at home, fire up Elite in jsbeeb, hit the Home key and then go to $31ac by typing into the disassembly address box, or open up ELITEB.TXT from the source ZIP.

The “print message” routine directly follows the print character routine, and I’ve annotated the whole lot. Some of the auxiliary routines are just above .TT27, but to keep things a little shorter I’ve left those out, and just explained them in the text (e.g. the csh routine).

.TT27       ; print char routine
TAX         ; save A into X, and set flags
BEQ csh     ; if zero, print "Cash:" and then
            ; the player's cash
BMI TT43    ; if negative (bit 7 set) go to TT43
            ; which treats the bottom 7 bits as an index
            ; into a token table
DEX         ; decrement and set flags...
BEQ tal     ; if zero, A was 1; go to tal
            ; which prints "tally" (a number)
BEQ ypl     ; if A was 2; go to to ypl
            ; which prints the current planet name
BNE P%+5
JMP cpl     ; if A was 3; go to cpl
            ; which prints the selected planet name
BEQ cmn     ; if A was 4; go to cmn
            ; which prints the player name (commander?)
BEQ fwl     ; if A was 5; go to fwl
            ; which just prints the player's cash
BNE P%+7
LDA #128    ; if A was 6, store 128 in QQ17
STA QQ17    ; which is a case-switch flag
RTS         ; and return
DEX         ; (this comes from P%+7 above)
BNE P%+5
STX QQ17    ; if A was 8, store zero ...
RTS         ; in case-switch flag and return
BEQ crlf    ; if A was 9, tab over and put a colon
            ; (maybe)
CMP #&60    ; NB we are comparing with the original 'A' again
BCS ex      ; If it's lower-case (or above), head to ex
            ; which treats it as a message index to print
CMP #14     ; if less than 14..
BCC P%+6    ; treat as a capitcal
CMP #32     ; if less than 32
BCC qw      ; ...go to 'qw' which treats it as message number
            ; A + 114
; So now we have a capital letter.
LDX QQ17    ; otherwise check the case-switch flag
BEQ TT74    ; zero? leave as-is and just print
BMI TT41    ; negative? head to TT41
BIT QQ17    ; bit 6 set?
BVS TT46    ; ...go to TT46 (which prints the character
            ; as-is, then clears bit 6 of case-switch)
; This prints the char in the A register in lowercase:
CMP #65     ; less than 'A' ?
BCC TT44    ; ...then print as is
CMP #&5B    ; more than 'Z' ?
BCS TT44    ; ...just print
ADC #32     ; otherwise add 32 to make it lower-case
.TT44       ; TT44 just jumps to TT26 (the character
JMP TT26    ; print routine).
BVS TT45    ; if bit 6 of case-switch set, go to TT45
CMP #65     ; if less than 65 ('A')
BCC TT74    ; just print, else..
ORA #64
STA QQ17    ; set bit '6' in case-switch
BNE TT44    ; and print as-is - i.e. this character
            ; will be printed but the next will be
            ; down-cased
ADC #114    ; add 114 to A
BNE ex      ; and print it as a canned message

LDA #21
STA XC      ; store 21 in the x position
BNE TT73    ; and go to TT73 (which prints a colon)

CPX #FF     ; if case-switch was $ff
BEQ TT48    ; ...skip entirely
CMP #65     ; if it's 'A' or above
BCS TT42    ; ...print in lowercase
PHA         ; save character to print
AND #191    ; clear bit 6 of case-switch
STA QQ17    
PLA         ; restore character
JMP TT26    ; and print as-is

CMP #160    ; is it >160?
BCS TT47    ; print as message number (A-160)
; Else we're going to print it as a one or
; two-character token.
AND #127    ; clear top bit to leave token number
ASL A       ; double it to get offset into table
TAY         ; put in Y register
LDA QQ16,Y  ; read first char
JSR TT27    ; print it
LDA QQ16+1,Y; read second char
CMP #63     ; if it's 63
BEQ TT48    ; ...skip (not that I've noticed any tokens that
            ; take advantage of this)
JMP TT27    ; else print second char and exit

; We got here if A>160 which means print canned message A-160
SBC #160        ; A-=160
; This is the main entry routine that prints canned message A
TAX             ; save A in X
; point zero-page location (V,V+1) to QQ18
; which is the message table.
LDA #(QQ18 MOD 256)
LDA #(QQ18 DIV 256)
LDY #0          ; init loop counter
BEQ TT50        ; Skip if this is the zeroth message

; TT51 to TT59 skip to the end of a zero-terminated string
LDA (V),Y       ; get current byte
BEQ TT49        ; if we found the end, exit loop
INY             ; otherwise move to next char
BNE TT51        ; if we didn't overflow, reloop
INC V+1         ; handle overflow
BNE TT51        ; and reloop
INY             ; skip the zero-terminator
BNE TT59        ; handle overflow
INC V+1         
DEX             ; decrement the message index
BNE TT51        ; and reloop if we haven't skipped to target

; Now we're at the main loop where (V),Y point at the message
; to print.
; Printing a char can recursively call us, so stack (V+1) 
; and Y (which is enough to preserve our state).
PHA             ; preserve "Y"
PHA             ; preserve (V+1)
LDA (V),Y       ; Read a string byte
EOR #35         ; "Decrypt" it
JSR TT27        ; call the print char routine
PLA             ; restore V+1
PLA             ; restore Y
INY             ; move to next char
BNE P%+4        ; overflowed?
INC V+1         ; handle overflow
LDA (V),Y       ; hit the end of the string?
BNE TT50        ; if not, go around again
RTS             ; we're finished

The token table (at QQ18) maps the following indices to these two-byte strings:

00  AL
01  LE
02  XE
03  GE
04  ZA
05  CE
06  BI
07  SO
08  US
09  ES
0a  AR
0b  MA
0c  IN
0d  DI
0e  RE
0f  A?
10  ER
11  AT
12  EN
13  BE
14  RA
15  LA
16  VE
17  TI
18  ED
19  OR
1a  QU
1b  AN
1c  TE
1d  IS
1e  RI
1f  ON

So the HARMLESS string was encoded as:

48  'H'
8a  reference to 'AR'
4d  'M'
81  reference to 'LE'
53  'S'
53  'S'

Then EORed with 35 ($23) and zero-terminated to yield 6b a9 6e a2 70 70 00

To make your own simple strings you can use this snippet of python:

def encode(msg):
    return "".join("{:02x}".format(ord(x)^35) for x in msg)

msg = raw_input("Text to encode: ")
print encode(msg)

Filed under: Coding

Posted at 20:00:00 BST on 12th June 2014.

jsbeeb Part Four - IRQs and timers

This is the fourth post in my series on emulating a BBC Micro in Javascript. I’d recommend reading the previous part on 6502 internal timings before reading this post. It’s also handy to have read read the first part which covers general stuff, and the second part which focuses on the video hardware.

Again thanks to my good chum Rich Talbot-Watkins who helped demystify what goes on in the Beeb.

May I interrupt you a moment?

Like most CPUs, the 6502 has a physical pin (the IRQ line) which causes an interrupt when brought low: the current instruction is completed and then instead of proceeding to the next instruction, the CPU enters interrupt mode (which disables further interrupts) and jumps to an operating system handler. This handler then deals with the interrupt — a timer firing, a keyboard event, disc activity and so on. The handler is responsible for:

Of course in reality nothing is that simple. On the 6502 the simple pipelining has an interesting side effect: the IRQ pin is only checked on the penultimate cycle of each instruction. This is the point at which the next instruction’s address is determined and fetched, ready for execution on the next cycle.

The penultimate cycle in question is the last “logical” cycle of the 6502’s state machine, not necessarily the penultimate physical cycle — in the case of cycle stretching on the final cycle of the instruction, those stretched cycles don’t cause the 6502 to recheck the IRQ line.

If the IRQ line is brought back high mid-instruction when the processor had already noticed it being low, the interrupt is still taken — even though there’s no evidence left for the interrupt handler to determine why it was called.

Not emulating these strange quirks leads to subtle differences in when interrupts are taken, which causes the legendary Kevin Edwards’ protection systems to fail to decode.

Given all this complexity, it’s interesting to think how experiment to try and work out how Kevin sussed all this out. Rich had a guess and in this fascinating thread Kevin confirms how it was done.

Running out of time(rs)

While we’re talking cycle correctness we should mention the hardware timers. There are two timers in each of two 6522 Versatile Interface Adapters (VIA). Each timer can generate an interrupt when it overruns, and automatically reloads from a programmable start value.

From our cycle correctness point of view the interesting thing about the timers is the exact time that the timer fires. They tick at 1MHz, and generate their interrupt a half-cycle after they underflow below $0000 (back to $ffff). At this point they either stop (a “one-shot” timer), or reload their initial value and start again (a “free-running” timer).

When used in the latter mode the reload happens on the half-cycle after the interrupt: When a value N is written to the timer’s latched start value, the first interrupt will happen N + 1.5 1MHz cycles later (i.e. 2N + 3 CPU cycles); and subsequent interrupts happen at N + 2 1MHz cycle (2N + 4 CPU cycle) intervals. (Recall that the CPU clock runs at 2MHz.)

If we configured a free-running timer with a counter of 2, and assuming we could acknowledge the IRQ almost straight away1, it would look something like:

Some games rely on the exact timing of the timers to change video settings mid-frame. For example, the driving game Revs sets the timer to fire every 312.5 * 64 - 2 cycles — which is to say exactly the number of 1MHz ticks there are in an interlaced PAL TV frame, minus two to account for the later time the timer fires. Most games use the vertical blanking interrupt generated by the video circuitry at the top of a frame to set a one-shot timer to fire after the appropriate time, but Revs just leaves a free-running timer going. If the emulation wasn’t spot on, then over time the place where the video settings change would slowly creep up the screen.

Acknowledging the timer

Reading the low 8 bits of a timer has an extra meaning: it acknowledges any interrupts from that timer. This is what an interrupt handler would do: once it had ascertained that it was a timer that caused the IRQ (by checking the VIA status register) it would read from that timer to stop it holding the CPU IRQ pin low.

By now you’ll realise there’s going to be a catch: if the CPU tries to acknowledge the timer firing on the same cycle that it fired, then the acknowledgement gets lost2. An interrupt will still be generated on the next cycle. Again, Kevin Edwards’ protection systems take advantage of this unusual behaviour.

Of course, on a real system all the components run in parallel. In an emulator they’re emulated serially. Empirically, we determined that to get things to work right, at each tick we do things in this order:

  1. Check to see if any IRQs are pending (if this is a penultimate CPU cycle).
  2. Generate any IRQs due to happen this cycle.
  3. Let any CPU memory accesses happen (which may of course clear IRQs etc).

Also there are a few extra + 1s in the code when counters are being loaded from or written to account for the actual point within a cycle the CPU would actually read or write the timer.

Most of the relevant code for the interrupts in in the via.js source file. The interrupt generation and counter handling is in the polltime function and it looks a little like:

// "cycles" is the number of 2MHz cycles that
// have occurred since the last time we checked.
count -= cycles;
justHit = false;
// Have we fired? -3 here accounts for the
// initial late fire time.
if (count <= -3) {
  // "justHit" is our note that we
  // should ignore any acknowledgment on
  // this cycle.
  if (count === -3) justHit = true;
// Reload the timer, accounting for the
// extra 4 2MHz ticks of a reload.
while (count < -3) 
  count += reload + 4;

Getting the emulation cycle-perfect has been a very satisfying achievement. Thanks are due to Ed Spittles for gathering real-world measurements, and Rich Talbot-Watkins for poring over technical documents and making spreadsheets of possible internal timing behaviour. It’s all been worth it as we finally cracked Kevin’s preotection system.

As an amusing side-note, when Rich and I first tried cracking the Alien 8 protection system (one of Kevin’s earlier accomplishments) some 20 years ago we started down this road. We tried building a 6502 emulator — in BBC Basic — to run the decryption code. We never got it working, and now I have a much fuller appreciation of why!

  1. The mark “x” on the diagram is where the IRQ is acknowledged. 

  2. This seems to be a 6522 glitch, though it’s not documented anywhere. 

Filed under: Coding Emulation

Posted at 14:40:00 BST on 4th June 2014.

jsbeeb Part Three - 6502 CPU timings

This is the third post in my series on emulating a BBC Micro in Javascript. You might find it instructive to read the first part which covers general stuff, or the second part which focuses on the video hardware. This post will cover the subtleties of the 6502’s instruction timings. In the next post I’ll cover how interrupts and hardware timers fit into the mix.

This time around the thanks really have to go to my good chum Rich Talbot-Watkins. He and I have been friends since we were twelve years old and have been programming together since we met. His knowledge of the Beeb is legendary — he still writes games for it even now. His help in getting the timings spot on in jsbeeb was invaluable.

Why is timing so important anyway?

Getting the instruction timings right is paramount for good emulation. I covered some of this in the first post, but so many tricks on the BBC required intimate knowledge of the instruction and hardware timings that if an emulator didn’t account for them properly, some things wouldn’t work right.

The most challenging example of timings were games protection systems. These would be used to prevent disassembly, copying and cheats. The game code would be encrypted first and would be decrypted at runtime by code. The code would often use XORs with hardware timers (amongst other things) to make it difficult to decrypt manually. In order to decrypt correctly the relative timing of the decryption code and the hardware timers has to be emulated absolutely perfectly.

Worse still, it’s not just how many CPU cycles each instruction takes that needs to be correct, but the fact that the memory reads and writes happen on particular cycles within an instruction that need to be correct.

Let’s delve a little into that:

Life of an instruction

The 6502 at the heart of the Beeb is simple but powerful. Like most other processors, instructions are fetched from RAM and executed in turn. The 6502 has a very simple pipeline — the fetch for the next instruction happens during the execution of the previous instruction. This has important consequences that we’ll talk about later.

Most CPUs have a physical pin dedicated to indicate “I need to access memory”, but to save costs this was left off of the 6502. Instead the memory system is permanently engaged. As almost every clock cycle needs access to memory (to read instructions or data) this is generally a win. Again, this is an important fact which leads to some unusual behaviour we must account for.

The 6502 treats the 256 bytes at the bottom of RAM (the “zero page”) specially. Instructions accessing the zero page are encoded differently and run a little faster (as they don’t need to encode two address bytes). The zero page can also be treated as an array of 16-bit index pointers.

Let’s go through a quick example, calculating a very simple checksum over ten bytes:

.checksum        ; Sum 10 bytes pointed by $70/$71
    LDA #0       ; Set A (our checksum) to zero
    TAY          ; Also put zero in Y (the loop counter)
    CLC          ; Clear carry (all adds are with carry)
    ADC ($70), Y ; Add check sum
    INY          ; Y++
    CPY #10      ; is Y 10?
    BNE lp       ; if not, loop around
    RTS          ; we're done, result in A

This example assembles to the 12 bytes:

a9 00 a8 18 71 70 c8 c0 0a d0 f8 60

Using Visual 6502, we can see what memory accesses happen on each clock tick:

  #  addr  data rw Comment
  0 $0000  $a9   1  LDA #
  1 $0001  $00   1  #0
  2 $0002  $a8   1  TAY
  3 $0003  $18   1  (CLC)
  4 $0003  $18   1  CLC
  5 $0004  $71   1  (ADC)
  6 $0004  $71   1  ADC (zp),Y
  7 $0005  $70   1  $70
  8 $0070  $00   1  addrLo
  9 $0071  $00   1  addrHi
 10 $0000  $a9   1  val at (addr)
 11 $0006  $c8   1  INY
 12 $0007  $c0   1  (CPY) 
 13 $0007  $c0   1  CPY #
 14 $0008  $0a   1  #10 
 15 $0009  $d0   1  BNE
 16 $000a  $f8   1  lp
 17 $000b  $60   1  (RTS)
 18 $0003  $18   1  CLC

Here the columns indicate the cycle number; the value output on the physical address bus (addr); the data on the data bus (data); the value on the “read not write” pin (rw), where a 1 indicates a read access, and a 0 indicates a write; and then a comment explaining a little of what’s going on that cycle.

As you can see the memory is accessed unconditionally on every cycle. Points to note:

On cycles 3, 12 and 17 the opcode following the current instruction is fetched prematurely. In the case of cycles 3 and 12, the instruction is a single byte instruction. Each instruction takes a minimum of two clock cycles, and in this case the following instruction is fetched twice. In the case of cycle 17, the next instruction is fetched even though the branch isn’t taken.

The sequence of fetches for the ADC ($70), Y instruction is opcode, zero page address, zero page low, zero page high (the low and high together give a 16-bit address), and finally address pointed to plus Y.

So far so good - it seems unusual to our modern “memory is slow” mindset that the processor touches RAM every cycle, but this is from an age where processors and RAM were clocked at the same speed.

Now let’s get a little more interesting. Let’s set the address pointed to by $70/$71 to be $0eff and take a look at what happens for Y=0 and Y=1, where we’d expect $0eff+0 = $0eff and $0eff+1 = $0d00 to be read from.

First, here’s Y=0:

  #  addr  data rw Comment
  6 $0004  $71  1  ADC (zp),Y
  7 $0005  $70  1  $70
  8 $0070  $ff  1  addrLo=$ff
  9 $0071  $0e  1  addrHi=$0e
 10 $0eff  $00  1  val at $0eff

Nothing shocking there. Now take a look at the next iteration where Y=1:

  #  addr  data rw Comment
 20 $0004  $71  1  ADC (zp),Y
 21 $0005  $70  1  $70 
 22 $0070  $ff  1  addrLo=$ff
 23 $0071  $0e  1  addrHi=$0e
 24 $0e00  $00  1  val at $0e00 ?!
 25 $0f00  $00  1  val at $0f00

Whoah — what’s all that? On cycle 24, we fetch the byte at $0e00 which is not the right address at all. Then there’s an extra cycle where we read from the correct place.

The 6502 is an 8-bit machine and so adding an 8-bit offset to a 16-bit address ought to take two cycles: one to add the low bits together, and then one to add any carry to the high bits. As the address bus is always active, the non-carried address is output on the first cycle. If there’s no carry, the 6502 stops there, else it does another read, this time with the correct address. Neat, eh?

Things aren’t always that simple, however. For instructions that both read and write, the double-read always happens. For example, the instruction INC $1234,X will always do two reads and one write, even if there’s no carry. This is because even if there’s no carry to do, there’s still work to be done waiting for the increment operation to finish before the final result can be stored. There’s nothing to short-cut. What’s more, the increment operation takes a while longer and the write happens twice; once with the unmodified value, and once with the correct value. This is what it looks like, when there’s no carry (for INC $3412,X with X=0):

  #  addr  data rw Comment
  2 $0002  $fe  1  INC Abs,X
  3 $0003  $12  1  addrLo=$12
  4 $0004  $34  1  addrHi=$34
  5 $3412  $00  1  read $3412 (and get 0)
  6 $3412  $00  1  read $3412 again
  7 $3412  $00  0  write back 0
  8 $3412  $01  0  write back 1

And when there’s a carry (INC $3412,X with X=$FF):

  #  addr  data rw Comment
  2 $0002  $fe  1  INC Abs,X
  3 $0003  $12  1  addrLo=$12
  4 $0004  $34  1  addrHi=$34
  5 $3411  $00  1  read $3411 (non-carry addr)
  6 $3511  $00  1  read $3511 (correct addr)
  7 $3511  $00  0  write back 0
  8 $3511  $01  0  write back 1

Wait around for ages and then two turn up at once

Just when you thought all this was making sense, there’s another thing to consider. Inside the Beeb there are two buses. Fast peripherals and RAM can run at the same speed as the CPU itself and are clocked at 2MHz. Some of the peripherals can’t work at this blazing speed, and instead need to be communicated with at the slothly 1MHz. These peripherals are memory mapped and the Beeb supports this variable speed memory access with a bit of help of some external circuitry which looks at the memory addresses on the bus and slows the CPU clock down when accessing areas of memory where slow peripherals are mapped. The gory details are mostly covered in the BBC Micro hardware guide, but the main thing we need to worry about is how our CPU clock gets synchronized up and cycle-stretched to talk to the 1MHz bus.

There are two possible cases — one where the bus access starts in the middle of a 2MHz pulse, and one where they coincide.

Most of the interesting things happen on the falling edge of the clock (although the 6502 does some things on the rising edge too). That means that depending on whether we’re on an odd or even cycle relative to the 1MHz timer we’ll get cycle stretched to either 2 or 3 cycles. And of course, each time the processor accesses memory, it may get stretched. So, putting this together with the previous section on all the extra accesses that happen, you can see that an instruction modifying the memory on a 1MHz peripheral can take many more cycles than it would otherwise seem to need.

Let’s take a specific example, from the legendary Kevin Edwards’ Nightshade protection. Here, Kevin uses a read-modify-write instruction on a 1MHz-bus attached timer register at $fe48. This register is the low 8 bits of a 16-bit timer that counts down at 1MHz. Writing to the register replaces the bottom 8 bits with the written value (which then continues counting down at 1MHz). Here the read-modify-write instruction is a rotate left, which suffers from the same double-write behaviour as the DEC instruction from the previous section.

  #  addr  data rw Comment
  0 $0d2d  $2e  1  ROL abs
  1 $0d2e  $48  1  addrLo=$48
  2 $0d2f  $fe  1  addrHi=$fe
  3 $fe48  $00  1  read...
  4 $fe48  $00  1  ..stretch..
  5 $fe48  $08  1 complete
  6 $fe48  $08  0  write unmodified..
  7 $fe48  $08  0  ..stretch..complete
  8 $fe48  $10  0  write modified..
  9 $fe48  $10  0  ..stretch..complete

Unlike the previous examples, this one is not generated by Visual 6502 but is hand-calculated, and shows what happens on each 2MHz timer cycle. As far as the 6502 is concerned it only executed 6 cycles, but because its external clock was slowed down, the wall-clock time taken is 10 2MHz ticks.

This example shows three stretches. Remember that all the time, the timer itself is counting down too, so unless the emulation models the exact cycle within the instruction that the reads and writes happen, the timer value will not be updated properly. Kevin also uses the fact that when timers overrun interrupts are generated to make the code even more difficult to emulate. In some cases, writing to the timers will suppress or cancel pending interrupts, so subtle timing differences can cause wildly different interrupt behaviour too.

Implementation details

jsbeeb implements all the complex behaviour by “compiling” instructions from a table of opcode side effects and knowledge of what addressing modes cause what kinds of memory accesses. A list of cycles where memory accesses is kept, and then an optimization pass is made: any memory accesses that are known not to refer to any hardware devices (and thus don’t have any sensitive time dependencies or side effects) are coalesced where possible. Sequences of CPU cycles with no hardware visible effects are also coalesced so that the various state machines (like timers, video, sound etc) can be run for as long a stretch as possible. Running them all a single cycle at a time is somewhat inefficient.

For memory accesses that can’t be optimized in this way, code is generated to check for 1MHz bus accesses and appropriately sync the CPU clock.

The code for this is in the InstructionGen class in 6502.opcodes.js.

Let’s look at INC Abs,X from earlier. The compilation starts at getInstruction which is given the text of opcode and its arguments. It calls getOp with the opcode part (INC) and that returns a struct describing what the instruction does (its op) and what bus cycles are needed (read and/or write). In this case the operation needs both a read and write bus cycle, and the op is the javascript snippet:

REG = (REG + 1) & 0xff;

The cpu.setzn part sets the zero and negative processor flags according to its argument. REG here is a variable that will contain the read-in value from the read bus cycle and will be also be the value written out in the write cycle.

Next getInstruction gets together the addressing mode part, by parsing the opcode arguments abs,x 1. Now the instruction starts being put together. In the following code snippets I’ve removed some if checks (describing them instead), and also have shortened the lines to fit here. Check the source on github for full details.

First the 16-bit address is fetched from the two bytes at the current program counter. Then the indexed address is calculated, along with a non-carried version. We account for the three cycles this takes (to fetch the opcode byte, and then the two bytes of the address):

ig = new InstructionGen();
ig.append("addr = cpu.getw();");
ig.append("addrWithCarry = addr + cpu.x;");
ig.append("addrNonCarry = (addr&0xff00) 
            | (addrWithCarry&0xff);");

Next we perform the operation’s required bus accesses. In our case we need a read and a write. Here we account for the spurious read and write. The readOp and writeOp methods of the InstructionGen are responsible for ticking the CPU the appropriate amount of time depending on cycle stretching.

if ( && op.write) { // read/modify/write
  // First a spurious read of the non-carried address
  // Now the actual read
  ig.readOp("addrWithCarry", "REG");
  // And a write of the unmodified value
  ig.writeOp("addrWithCarry", "REG");

Now we apply the operation and write back the final result.

if (op.write)
  ig.writeOp("addrWithCarry", "REG");

The final code is the result of calling render on the InstructionGen, which does all the timer magic.

The final compiled code for INC Abs,X comes out as something like:2

addr = cpu.getw();
addrWithCarry = (addr + cpu.x) & 0xffff;
addrNonCarry = (addr&0xff00) 
    | (addrWithCarry&0xff);
    * ((cpu.cycles & 1) + 1));
    * (!(cpu.cycles & 1) + 1));
REG = cpu.readmem(addrWithCarry);
    * (!(cpu.cycles & 1) + 1));
cpu.writemem(addrWithCarry, REG);
REG = (REG + 1) & 0xff;
    * (!(cpu.cycles & 1) + 1));
cpu.writemem(addrWithCarry, REG);

To get the actual version, fire up jsbeeb and type instructions6502[0xfe] into the Javascript console.

Next time I’ll cover how the 6502 deals with interrupts and how that interacts with the pipelining. I’ll also cover one of the more common sources of interrupts: the 6522 Versatile Interface Adapter’s timers.

  1. The code uses lowercase throughout although I’ve used more usual capitalization in this write-up. Sorry for any confusion! 

  2. As I look at the generated code I realise I could make it a bit more efficient and readable. I’ll look at updating that later so hopefully by the time you take the trouble to look it will be much nicer. 

Filed under: Coding Emulation

Posted at 16:10:00 BST on 29th May 2014.