jsbeeb Part Three - 6502 CPU timings — Matt Godbolt’s blog

jsbeeb Part Three - 6502 CPU timings

This is the third post in my series on emulating a BBC Micro in Javascript. You might find it instructive to read the first part which covers general stuff, or the second part which focuses on the video hardware. This post will cover the subtleties of the 6502’s instruction timings. In the next post I’ll cover how interrupts and hardware timers fit into the mix.

This time around the thanks really have to go to my good chum Rich Talbot-Watkins. He and I have been friends since we were twelve years old and have been programming together since we met. His knowledge of the Beeb is legendary — he still writes games for it even now. His help in getting the timings spot on in jsbeeb was invaluable.

Why is timing so important anyway?

Getting the instruction timings right is paramount for good emulation. I covered some of this in the first post, but so many tricks on the BBC required intimate knowledge of the instruction and hardware timings that if an emulator didn’t account for them properly, some things wouldn’t work right.

The most challenging example of timings were games protection systems. These would be used to prevent disassembly, copying and cheats. The game code would be encrypted first and would be decrypted at runtime by code. The code would often use XORs with hardware timers (amongst other things) to make it difficult to decrypt manually. In order to decrypt correctly the relative timing of the decryption code and the hardware timers has to be emulated absolutely perfectly.

Worse still, it’s not just how many CPU cycles each instruction takes that needs to be correct, but the fact that the memory reads and writes happen on particular cycles within an instruction that need to be correct.

Let’s delve a little into that:

Life of an instruction

The 6502 at the heart of the Beeb is simple but powerful. Like most other processors, instructions are fetched from RAM and executed in turn. The 6502 has a very simple pipeline — the fetch for the next instruction happens during the execution of the previous instruction. This has important consequences that we’ll talk about later.

Most CPUs have a physical pin dedicated to indicate “I need to access memory”, but to save costs this was left off of the 6502. Instead the memory system is permanently engaged. As almost every clock cycle needs access to memory (to read instructions or data) this is generally a win. Again, this is an important fact which leads to some unusual behaviour we must account for.

The 6502 treats the 256 bytes at the bottom of RAM (the “zero page”) specially. Instructions accessing the zero page are encoded differently and run a little faster (as they don’t need to encode two address bytes). The zero page can also be treated as an array of 16-bit index pointers.

Let’s go through a quick example, calculating a very simple checksum over ten bytes:

.checksum        ; Sum 10 bytes pointed by $70/$71
    LDA #0       ; Set A (our checksum) to zero
    TAY          ; Also put zero in Y (the loop counter)
.lp
    CLC          ; Clear carry (all adds are with carry)
    ADC ($70), Y ; Add check sum
    INY          ; Y++
    CPY #10      ; is Y 10?
    BNE lp       ; if not, loop around
    RTS          ; we're done, result in A

This example assembles to the 12 bytes:

a9 00 a8 18 71 70 c8 c0 0a d0 f8 60

Using Visual 6502, we can see what memory accesses happen on each clock tick:

  #  addr  data rw Comment
  0 $0000  $a9   1  LDA #
  1 $0001  $00   1  #0
  2 $0002  $a8   1  TAY
  3 $0003  $18   1  (CLC)
  4 $0003  $18   1  CLC
  5 $0004  $71   1  (ADC)
  6 $0004  $71   1  ADC (zp),Y
  7 $0005  $70   1  $70
  8 $0070  $00   1  addrLo
  9 $0071  $00   1  addrHi
 10 $0000  $a9   1  val at (addr)
 11 $0006  $c8   1  INY
 12 $0007  $c0   1  (CPY)
 13 $0007  $c0   1  CPY #
 14 $0008  $0a   1  #10
 15 $0009  $d0   1  BNE
 16 $000a  $f8   1  lp
 17 $000b  $60   1  (RTS)
 18 $0003  $18   1  CLC

Here the columns indicate the cycle number; the value output on the physical address bus (addr); the data on the data bus (data); the value on the “read not write” pin (rw), where a 1 indicates a read access, and a 0 indicates a write; and then a comment explaining a little of what’s going on that cycle.

As you can see the memory is accessed unconditionally on every cycle. Points to note:

On cycles 3, 12 and 17 the opcode following the current instruction is fetched prematurely. In the case of cycles 3 and 12, the instruction is a single byte instruction. Each instruction takes a minimum of two clock cycles, and in this case the following instruction is fetched twice. In the case of cycle 17, the next instruction is fetched even though the branch isn’t taken.

The sequence of fetches for the ADC ($70), Y instruction is opcode, zero page address, zero page low, zero page high (the low and high together give a 16-bit address), and finally address pointed to plus Y.

So far so good - it seems unusual to our modern “memory is slow” mindset that the processor touches RAM every cycle, but this is from an age where processors and RAM were clocked at the same speed.

Now let’s get a little more interesting. Let’s set the address pointed to by $70/$71 to be $0eff and take a look at what happens for Y=0 and Y=1, where we’d expect $0eff+0 = $0eff and $0eff+1 = $0d00 to be read from.

First, here’s Y=0:

  #  addr  data rw Comment
  6 $0004  $71  1  ADC (zp),Y
  7 $0005  $70  1  $70
  8 $0070  $ff  1  addrLo=$ff
  9 $0071  $0e  1  addrHi=$0e
 10 $0eff  $00  1  val at $0eff

Nothing shocking there. Now take a look at the next iteration where Y=1:

  #  addr  data rw Comment
 20 $0004  $71  1  ADC (zp),Y
 21 $0005  $70  1  $70
 22 $0070  $ff  1  addrLo=$ff
 23 $0071  $0e  1  addrHi=$0e
 24 $0e00  $00  1  val at $0e00 ?!
 25 $0f00  $00  1  val at $0f00

Whoah — what’s all that? On cycle 24, we fetch the byte at $0e00 which is not the right address at all. Then there’s an extra cycle where we read from the correct place.

The 6502 is an 8-bit machine and so adding an 8-bit offset to a 16-bit address ought to take two cycles: one to add the low bits together, and then one to add any carry to the high bits. As the address bus is always active, the non-carried address is output on the first cycle. If there’s no carry, the 6502 stops there, else it does another read, this time with the correct address. Neat, eh?

Things aren’t always that simple, however. For instructions that both read and write, the double-read always happens. For example, the instruction INC $1234,X will always do two reads and one write, even if there’s no carry. This is because even if there’s no carry to do, there’s still work to be done waiting for the increment operation to finish before the final result can be stored. There’s nothing to short-cut. What’s more, the increment operation takes a while longer and the write happens twice; once with the unmodified value, and once with the correct value. This is what it looks like, when there’s no carry (for INC $3412,X with X=0):

  #  addr  data rw Comment
  2 $0002  $fe  1  INC Abs,X
  3 $0003  $12  1  addrLo=$12
  4 $0004  $34  1  addrHi=$34
  5 $3412  $00  1  read $3412 (and get 0)
  6 $3412  $00  1  read $3412 again
  7 $3412  $00  0  write back 0
  8 $3412  $01  0  write back 1

And when there’s a carry (INC $3412,X with X=$FF):

  #  addr  data rw Comment
  2 $0002  $fe  1  INC Abs,X
  3 $0003  $12  1  addrLo=$12
  4 $0004  $34  1  addrHi=$34
  5 $3411  $00  1  read $3411 (non-carry addr)
  6 $3511  $00  1  read $3511 (correct addr)
  7 $3511  $00  0  write back 0
  8 $3511  $01  0  write back 1

Wait around for ages and then two turn up at once

Just when you thought all this was making sense, there’s another thing to consider. Inside the Beeb there are two buses. Fast peripherals and RAM can run at the same speed as the CPU itself and are clocked at 2MHz. Some of the peripherals can’t work at this blazing speed, and instead need to be communicated with at the slothly 1MHz. These peripherals are memory mapped and the Beeb supports this variable speed memory access with a bit of help of some external circuitry which looks at the memory addresses on the bus and slows the CPU clock down when accessing areas of memory where slow peripherals are mapped. The gory details are mostly covered in the BBC Micro hardware guide, but the main thing we need to worry about is how our CPU clock gets synchronized up and cycle-stretched to talk to the 1MHz bus.

There are two possible cases — one where the bus access starts in the middle of a 2MHz pulse, and one where they coincide.

Most of the interesting things happen on the falling edge of the clock (although the 6502 does some things on the rising edge too). That means that depending on whether we’re on an odd or even cycle relative to the 1MHz timer we’ll get cycle stretched to either 2 or 3 cycles. And of course, each time the processor accesses memory, it may get stretched. So, putting this together with the previous section on all the extra accesses that happen, you can see that an instruction modifying the memory on a 1MHz peripheral can take many more cycles than it would otherwise seem to need.

Let’s take a specific example, from the legendary Kevin Edwards’ Nightshade protection. Here, Kevin uses a read-modify-write instruction on a 1MHz-bus attached timer register at $fe48. This register is the low 8 bits of a 16-bit timer that counts down at 1MHz. Writing to the register replaces the bottom 8 bits with the written value (which then continues counting down at 1MHz). Here the read-modify-write instruction is a rotate left, which suffers from the same double-write behaviour as the DEC instruction from the previous section.

  #  addr  data rw Comment
  0 $0d2d  $2e  1  ROL abs
  1 $0d2e  $48  1  addrLo=$48
  2 $0d2f  $fe  1  addrHi=$fe
  3 $fe48  $00  1  read...
  4 $fe48  $00  1  ..stretch..
  5 $fe48  $08  1  ..read complete
  6 $fe48  $08  0  write unmodified..
  7 $fe48  $08  0  ..stretch..complete
  8 $fe48  $10  0  write modified..
  9 $fe48  $10  0  ..stretch..complete

Unlike the previous examples, this one is not generated by Visual 6502 but is hand-calculated, and shows what happens on each 2MHz timer cycle. As far as the 6502 is concerned it only executed 6 cycles, but because its external clock was slowed down, the wall-clock time taken is 10 2MHz ticks.

This example shows three stretches. Remember that all the time, the timer itself is counting down too, so unless the emulation models the exact cycle within the instruction that the reads and writes happen, the timer value will not be updated properly. Kevin also uses the fact that when timers overrun interrupts are generated to make the code even more difficult to emulate. In some cases, writing to the timers will suppress or cancel pending interrupts, so subtle timing differences can cause wildly different interrupt behaviour too.

Implementation details

jsbeeb implements all the complex behaviour by “compiling” instructions from a table of opcode side effects and knowledge of what addressing modes cause what kinds of memory accesses. A list of cycles where memory accesses is kept, and then an optimization pass is made: any memory accesses that are known not to refer to any hardware devices (and thus don’t have any sensitive time dependencies or side effects) are coalesced where possible. Sequences of CPU cycles with no hardware visible effects are also coalesced so that the various state machines (like timers, video, sound etc) can be run for as long a stretch as possible. Running them all a single cycle at a time is somewhat inefficient.

For memory accesses that can’t be optimized in this way, code is generated to check for 1MHz bus accesses and appropriately sync the CPU clock.

The code for this is in the InstructionGen class in 6502.opcodes.js.

Let’s look at INC Abs,X from earlier. The compilation starts at getInstruction which is given the text of opcode and its arguments. It calls getOp with the opcode part (INC) and that returns a struct describing what the instruction does (its op) and what bus cycles are needed (read and/or write). In this case the operation needs both a read and write bus cycle, and the op is the javascript snippet:

REG = (REG + 1) & 0xff;
cpu.setzn(REG);

The cpu.setzn part sets the zero and negative processor flags according to its argument. REG here is a variable that will contain the read-in value from the read bus cycle and will be also be the value written out in the write cycle.

Next getInstruction gets together the addressing mode part, by parsing the opcode arguments abs,x ¹. Now the instruction starts being put together. In the following code snippets I’ve removed some if checks (describing them instead), and also have shortened the lines to fit here. Check the source on github for full details.

First the 16-bit address is fetched from the two bytes at the current program counter. Then the indexed address is calculated, along with a non-carried version. We account for the three cycles this takes (to fetch the opcode byte, and then the two bytes of the address):

ig = new InstructionGen();
ig.append("addr = cpu.getw();");
ig.append("addrWithCarry = addr + cpu.x;");
ig.append("addrNonCarry = (addr&0xff00)
            | (addrWithCarry&0xff);");
ig.tick(3);

Next we perform the operation’s required bus accesses. In our case we need a read and a write. Here we account for the spurious read and write. The readOp and writeOp methods of the InstructionGen are responsible for ticking the CPU the appropriate amount of time depending on cycle stretching.

if (op.read && op.write) { // read/modify/write
  // First a spurious read of the non-carried address
  ig.readOp("addrNonCarry");
  // Now the actual read
  ig.readOp("addrWithCarry", "REG");
  // And a write of the unmodified value
  ig.writeOp("addrWithCarry", "REG");
}

Now we apply the operation and write back the final result.

ig.append(op.op);
if (op.write)
  ig.writeOp("addrWithCarry", "REG");

The final code is the result of calling render on the InstructionGen, which does all the timer magic.

The final compiled code for INC Abs,X comes out as something like:²

addr = cpu.getw();
addrWithCarry = (addr + cpu.x) & 0xffff;
addrNonCarry = (addr&0xff00)
    | (addrWithCarry&0xff);
cpu.polltime(4+cpu.is1MHzAccess(addrNonCarry)
    * ((cpu.cycles & 1) + 1));
cpu.readmem(addrNonCarry);
cpu.polltime(1+cpu.is1MHzAccess(addrWithCarry)
    * (!(cpu.cycles & 1) + 1));
REG = cpu.readmem(addrWithCarry);
cpu.polltime(1+cpu.is1MHzAccess(addrWithCarry)
    * (!(cpu.cycles & 1) + 1));
cpu.checkInt();
cpu.writemem(addrWithCarry, REG);
REG = (REG + 1) & 0xff;
cpu.setzn(REG);
cpu.polltime(1+cpu.is1MHzAccess(addrWithCarry)
    * (!(cpu.cycles & 1) + 1));
cpu.writemem(addrWithCarry, REG);

To get the actual version, fire up jsbeeb and type instructions6502[0xfe] into the Javascript console.

Next time I’ll cover how the 6502 deals with interrupts and how that interacts with the pipelining. I’ll also cover one of the more common sources of interrupts: the 6522 Versatile Interface Adapter’s timers.

The code uses lowercase throughout although I’ve used more usual capitalization in this write-up. Sorry for any confusion! ↩
As I look at the generated code I realise I could make it a bit more efficient and readable. I’ll look at updating that later so hopefully by the time you take the trouble to look it will be much nicer. ↩