Emulating The Core, Part 1: The Fetch-Decode-Execute Cycle

In previous posts we have attempted to give an integral overview of the Nintendo Game Boy console.  We have discussed considerably its various subsystems and tried to show how they interact to make the whole Game Boy experience possible[1].  We have, additionally, studied the Game Boy’s bootstrap ROM; a 256 byte program responsible for starting up the console upon power on.  With this background, therefore, we have gained knowledge about the way the Game Boy behaves, and how it get things done.  It is about time to begin with the actual emulation of the Nintendo Game Boy system.

First, lets quickly review some theoretical background about computer systems in general, and how it applies to emulation. (It is ‘safe’ to ignore the following BACKGROUND note and continue with THE FETCH-DECODE-EXECUTE PRINCIPLE section).


The foundations of computer emulation is not, as might be though, some kind of black art driven by sorcery and obscure low-level implementation.  Theoretically, emulation is guaranteed possible because both source and target machines are (normally) computationally equivalent.  Being computationally equivalent means that both machines have the same computational power; it can be shown that whatever the target machine can compute, the source machine can also compute (and vice-versa).  Some care must be taken here because one can easily fall into the error of considering the real, physical machines, as Turing-complete or Turing-equivalent; because real machines have finite resources, they are not Turing-complete; they instead map into the Linear Bounded Automaton model.  However, the important thing is that the possibility of emulation is fundamentally a well-known problem from the Theory of Computation.

On the practical side, the way real computer systems are implemented makes it possible to design an emulation system following common ‘rules’ or patterns, quite independent of the programming language and platform of choice. Of course, this doesn’t mean that emulating a computer system is easy, or that there isn’t room for creativity and aesthetics; the work is extremely tedious, actually, and knowledge of the emulated computer system is needed in great detail.

Indeed, real computer systems share common design and implementation aspects that makes it feasible and even natural to emulate a computer system on an entirely different one.  This has to do (as already stated), on the theoretical or abstract side, with the Turing Machine model of computation.  A Turing Machine is an abstract computational device proposed by mathematician Alan Turing; it is a mathematical theory about computation. A Universal Turing Machine is a Turing Machine that can simulate some other, arbitrary, Turing Machine.  It turns out that the Universal Turing Machine captures the notion of computation essential to the construction of algorithms,  but also is ‘not so abstract’, so that implementation of real machines is feasible based (fundamentally) on it. Indeed, the differences between a hypothetical Universal Turing Machine and a real machine, besides the fact that a UTM can theoretically address an unlimited amount of memory, is more technological or practical than fundamental (for example, a UTM’s memory is based on a sequential tape, whereas a real machine would use a random access memory device).

Turing Machines are also at the core of modern programming languages[2]; they are Turing-complete computational systems.  These programming languages are designed to support the semantics common to all computer systems (for example, bytes and words, memory reads and writes, etc), and through abstraction, they further extend these semantics and allow for more convenient and flexible mechanisms for driving the machine (for example, data structures and classes, procedures and functions, etc).

The lesson behind the above notes is that, while emulating the Nintendo Game Boy is a cumbersome task, the theory behind computers can help us to better understand why it is at all possible to do so.


The ‘core’ of a computer system is its Central Processing Unit (sometimes various CPUs somehow interconnected), and the core of its operation is what is commonly know as the Fetch-Decode-Execute Principle, or Instruction Cycle.

Fortunately, our target CPU is so simple that instructions are executed sequentially, although the same principle applies to deeply-pipelined, super-scalar CPUs as well, where the Instruction Cycle is broken up into various stages of what is known as a pipeline.  As a result, more than one instruction can be executed concurrently, and even in a different order than the program order.


Basic five-stage pipeline in a RISC machine (IF = Instruction Fetch, ID = Instruction Decode, EX = Execute, MEM = Memory access, WB = Register write back). In the fourth clock cycle (the green column), the earliest instruction is in MEM stage, and the latest instruction has not yet entered the pipeline.

However these implementation details may be, the Fetch-Decode-Execute Cycle, or Instruction Cycle, adopts the following generic form:

1. Fetch the next instruction to be executed from the memory location pointed to by the Program Counter Register (‘PC’) and advance it to point to the instruction that follows.

2. Decode the instruction just fetched.  This means determining which instruction is to be executed, so that the CPU can know how many operands the instruction needs.  Sometimes, the instruction passes through a somewhat different ‘decoding’ into smaller steps called micro-operations; other times the instruction is simple enough to be directly executed. This is, the instruction may be implemented directly in the hardware, or it may need to be driven by software in what is usually called microcode. There is much flexibility in decoding and executing an instruction.

3. Execute the instruction.  It might make use of the Arithmetic Logic Unit (ALU), for example.

4. Go to step 1.

Of course, this is a general and abstract view of the Fetch-Decode-Execute cycle, and for our purpose it is quite sufficient. For a more detailed explanation take a look at the resources at the end of the post.


The following code will be our starting point. The function rom_exec() is responsible for initializing the Game Boy’s Program Counter Register (‘PC’) to address 0x0000 or 0x0100.  Lets see:

FILE: gboy_x86_64.S

    movq $addr_sp, %r13 # set PC to 0x0000
    addq %rdi, %r13 # 0x0000 or 0x0100
    movq $z80_ldex, %r15 # %r15 holds pointer to opcode records
    movq (%r13), %r12 # %r12 holds the instruction's opcode
    call exec_next # GO GO GO!
    ret # return and change game

Our first instruction sets register %r13 to point to addr_sp.  This variable is defined as an array of 64kb, space which represents the Game Boy’s address space.  In this way, register %r13 points to the beginning of the Game Boy address space, or address 0x0000.  Recall that the bootstrap program is loaded into the first 256 bytes of the address space, that is, addresses 0x0000 to 0x00FF.  The register %r13 will act as the Game Boy’s Program Counter Register (PC) throughout all the emulation.  The second instruction adds to the PC the value in register %rdi.  Register %rdi is either 0 or 256.  The idea is simply that RealBoy has the option to not execute the bootstrap program; execution would then start directly at address 0x0100, immediately following the 256-byte bootstrap program, instead of 0x0000.  Lets just assume that %rdi is 0 so this instruction does nothing.

Next, the register %r15 is set to point to a special data structure which represents the Game Boy’s instruction set. This pointer will allow us to execute each instruction we fetch through the ‘PC’. We’ll further look at this in a while.

Now, the contents of the memory address pointed to by %r13 (that is, the value at address 0x0000 of the Game Boy’s address space) is copied to %r12. Because the Game Boy’s bootstrap ROM was already copied to the first 256 bytes of the address space, %r12 effectively holds the opcode for the first instruction of this program.  Finally, a call is made to exec_next(), where the instruction cycle is implemented; exec_next() should never return, unless the user chooses to change the game.

It’s not hard to get used to the x86-64 Assembly Language, and indeed it enables you to make some low-level decisions that can speed up things.  Of course, with today’s compilers technology and extremely fast machines this is hardly needed.  Nevertheless, this will help us better understand the machine, something we want.

For example, note that the register %r13 is used to simulate the Game Boy’s Program Counter Register.  This is fixed: %r13 is used to represent the PC, and only this.  Designating a CPU register for this only purpose is a nice decision, because we are always manipulating the PC.  If we didn’t have this exclusive resource, we would have to constantly save the value of the PC somewhere in memory.  As we know, register access is way faster than memory access.  It might not seem important (sure, memory access is fast enough in this case), but, considering that the PC is such a time-consuming resource (we really use it A LOT), it just doesn’t seem smart to save in main memory something that is always being needed.  The trade-off, however, is that now we have reduced the number of general-purpose registers we can use, and the x86-64 architecture is not so generous with the amount of these registers.


Let’s now proceed to the core of the RealBoy emulator: the exec_next() function (exec_next is short for execute next instruction).  Recall that rom_exec() got called to prepare the PC register to begin the Instruction Cycle.  Indeed, this function initializes the PC and fetches[3] the first instruction to execute from address 0x0000 of the Game Boy’s address space.  So, we have actually already completed the fetch part of the Fetch-Decode-Execute Cycle; this is done simply by reading the contents of the address pointed to by (the host’s machine) register %r13, which, recall, is used exclusively throughout the entire execution to represent the Game Boy CPU’s Program Counter (PC).

We’ll divide our study of ‘the core’ in two parts: first, our instruction is actually decoded and executed. Second, machine state is updated; timers are advanced, interrupts and keyboard inputs are processed, sound and video systems are updated, etc.

Now that we have the opcode of the instruction we want to execute, we proceed to decode and execute our instruction. Let’s see:

FILE: gboy_x86_64.S

    movq %r15, %rbx # pointer to instructions' records
    andq $0xff, %r12 # mask in opcode (record offset)
    shlq $5, %r12 # multiply by 32; each record is 32 bytes
    addq %r12, %rbx # point to instruction's record
    movq (%rbx), %r14 # save cycle duration of instruction
    movq %r14, %r12 # copy
    shrq $56, %r14 # cycle duration is the most significant byte
    shrq $40, %r12 # get byte that tells if instruction is a memory access
    andq $0xff, %r12
    movq %rbx, %rdx # pass record as argument
    addq $28, %rbx # instruction's function is the last 4 bytes of record
    movl (%ebx), %ebx # copy function
    testq $1, gbddb # are we debugging?
    jz no_ddb # no
    movq %r13, %rsi # argument to debugger
    movq %rdx, %rbp # save reg
    call gddb_main
    movq %rbp, %rdx # restore reg
    movq $0, inst_is_cb
    testq $DELAY, %r12 # delay instruction execution due to memory access?
    jz 1f # no; execute instruction normally
    jmp execute_precise
    call *%rbx # execute instruction

Our first instruction copies the contents of register %r15 to register %rbx. But, what is %r15? Remember that in function rom_exec() we had the following assignment: movq $z80_ldex, %r15, that is, %r15 points to the memory location where the z80_ldex begins.  We also said that this memory location is where a data structure describing the Game Boy’s instruction set is located.  As a matter of fact, register %r15 always points to this memory location.  It must be a very important memory location so that we assigned permanently a register to point to it, just as we assigned a permanent role to the %r13 register (to represent, throughout execution, the Game Boy CPU’s PC).  Indeed, as with register %r13, the permanent assignment of register %r15 is motivated by the extreme use we make of the data it references.


Recall that the first byte of every instruction of the Game Boy corresponds to the opcode of the instruction.  Furthermore, because an opcode is always 1 byte long, a total of 256 opcodes are offered by the Game Boy’s CPU[4].

In RealBoy, we have a special array data structure that represents these instructions; there is an entry or node for each instruction.  The data structure is indeed an array: a contiguous chunk of memory where all the entries reside contiguously, one next to the other.  Organizing the data structure as an array has the advantage that the opcode can be used as an index to its corresponding instruction’s entry.  This way, finding a particular entry is done immediately (in constant time).  Each entry has information regarding the way an instruction should be interpreted (emulated).  Let’s take a look at the format of this structure by looking at some of its entries:

FILE: gboy_x86_64.S

.byte 0x00, NULL, NULL, NULL, NULL, NULL, 1, 4 # nop
.ascii "nop"
.long NULL
.long op_nop

.byte 0x01, REG, BC, IMM, NULL, NULL, 3, 12 # ld BC, nn
.ascii "ld"
.long WORD
.long op_ld_imm_reg

.byte 0x02, REG_IND, BC, REG, A, DELAY|RD_XOR_WR, 1, 8 # ld (BC), A
.long BYTE
.ascii "ld (BC), A"
.long op_ld_reg_mem

.byte 0x03, REG, BC, NULL, NULL, NULL, 1, 8 # inc BC
.ascii "inc BC"
.long WORD
.long op_inc_reg_wr

(IMPORTANT NOTE: The .ascii field in the preceding Assembly code doesn’t exactly match the source file.  In the source file, space is filled with the NULL character until 16 bytes are reached; here we don’t show those NULL characters because WordPress deletes them.  Just remember that this field is always 16 bytes long)

Here we have shown the first 4 entries of the data structure which, effectively, corresponds to the first 4 instructions of the CPU’s instruction set.  This data structure could equivalently been represented by the following C structure construct:

struct z80_ldex {
    char dataset1[8];
    char instruction_string[16];
    long dataset2;
    void (execute_function)(struct z80_ldex *);
} z80_ldex[512];

So, note that our data structure z80_ldex is actually an array of 512 entries (remember that the opcode 0xCB is really an extension to 256 additional instructions, so we have a total of 256*2=512 instructions).  Also note that each entry is 32 bytes; because 32 is a power of 2, this will allow us to avoid a costly multiplication instruction when indexing the data structure, and instead use the cheap shift left instruction.

The most important field in this data structure is the execute_function field; it is a pointer to the function responsible for actually emulating the instruction.  Because a function can handle various similar instructions, dataset1 and dataset2 give further information about how to interpret the instruction.  As a matter of fact, a pointer to the opcode’s entry is passed to the function execute_function, so it can unambiguously emulate the current instruction.  For example, the same function is used to emulate all 8-bit register-to-register copy instructions, and the particular opcode entry tells the function which registers must be used as source and destination operands.  The field instruction_string is helpful to the debugger; RealBoy includes a very simple debugger that enables you to run any game in a controlled manner.  For example, you can control the machine state on a per-instruction basis (indeed the most common feature of a debugger). In fact, RealBoy’s most annoying bugs where caught debugging actual games.  This field is used to print the current instruction when stepping.  The last 2 bytes of dataset1 are also very important; they correspond to the instruction’s length in bytes and the instruction’s duration in CPU cycles, respectively. The first one is important to advance the ‘PC’ so it points to the next instruction to execute. The second one is important to keep accurate system timing.

Let’s work over this example.  We have stopped the debugger (gdb) at the beginning of the rom_exec function.


Before initializing registers %r12, %r13 and %r15

Let’s now step through the instructions until we hit the call exec_next instruction.  We are interested in the registers %r13, %r15 and %12, which are about to get modified:


After initializing registers %r12, %r13 and %r15

Note the values of %r13 and %r15; %r13 points to the address that represents the beginning of the Game Boy’s address space, and %r15 points to the beginning (the first element) of the instructions’ data structure.  The register %r12 was assigned the value pointed to by %r13: the first bytes of the Game Boy’s address space, which correspond to the beginning of the bootstrap ROM.  Let’s now jump over to exec_next.


Executed the first 2 instructions of exec_next

Note that the value at %r15 was copied to %rbx, and that %r12 is now 0x31. We masked in the first byte of %r12 with the instruction andq $0xff, %r12. The value 0x31 is indeed the opcode of the instruction we are about to execute.  Let’s further execute a couple of instructions:


Scale opcode index (%r12) and point to its corresponding entry in data structure (%rbx)

Remember that we pointed out earlier the advantage of our data structure having each entry be 32 bytes long.  We have our current opcode, which is 0x31, in register %r12.  Recall that this number is actually an index into our data structure; indeed, we want to access entry number 0x31 (decimal 49) in our data structure.  To accomplish this we have to scale our index to the size of each entry; this is, we’ve got to multiply 0x31 by the data structure’s entry size in bytes, which is 32.  Now, instead of using a costly multiplication operation, we accomplish our multiplication through a shift operation.  This shift operation is actually our shlq $5, %r12 instruction. We can do this because 32, which is the scaling factor, is a power of 2.  Now, having scaled our opcode 0x31 in register %r12, we now have the offset to the corresponding entry in our data structure.  Remember that register %rbx is a pointer to the beginning of the data structure (a pointer to the first entry), so simply adding %r12 to it makes it point to the desired entry.  We now have %rbx pointing to the entry in the data structure corresponding to opcode 0x31.

So, now we should have a better picture of how we are emulating ‘the core’, that is, the Fetch-Decode-Execute Cycle. In essence:

  1. We have a permanent register assigned to the Game Boy’s Program Counter Register (our %r13 register).  We extract the first byte of the address pointed to by this register to get or fetch the current instruction’s opcode.
  2. Because our opcode is actually an index to its corresponding entry in the opcodes’ data structure, we scale the value of the opcode by a factor equal to the size of the entries in bytes (32).
  3. Now that we have the offset to the beginning of our opcode’s entry, we add it to the register %rbx, which points to the beginning of the data structure.  We now have a pointer to the entry that corresponds to the opcode we are currently executing; we have just decoded our current instruction.

We now have to interpret the opcode so we can actually emulate the current instruction.

Let’s execute the next 4 instructions in gdb to see what happens:


%r14 holds the instruction’s duration. %r12 holds memory access information

Recall that %rbx now points to the beginning of our opcode’s entry in the data structure, so the instruction movq (%rbx), %r14 copies some bytes from this entry to register %r14.  If you remember the format of our data structure, the first 8 bytes correspond to dataset1 which, among other things, is where the instruction’s length and duration is stored.  Now, recall that we are on a 64-bit machine, so normally every data movement to and from memory will be 8 bytes. Of course, we can also move data in 1, 2, and 4 bytes, but our instruction movq (%rbx), %r14 copies the first 8 bytes (that is, the dataset1 field) to register %r14.

Now we copy the same 8 bytes (the dataset1 field) to register %r12. We wish to preserve the instruction’s duration in CPU cycles in register %r14; this corresponds to the least significant byte of the dataset1 field. Because our x86-64 machine is little endian, it orders the bytes starting from the least significant byte, so our instruction’s length byte is actually ‘the last byte’ in register %r14, so we must shift 56 bits to the right.  Meanwhile, register %r12 holds byte 6 of dataset1 field.  Indeed, we got this value by first shifting 40 bits to the right (shrq $40, %r12) and then masking in the byte (andq $0xff, %r12).  This byte tells us whether or not the instruction accesses memory.  In case it does, the instruction requires some special handling.  Because our current instruction only involves an immediate value and a register (no memory access), %r12 is 0.

Let’s see our opcode’s entry in the data structure to see what’s happening:

FILE: gboy_x86_64.S

/* ld SP, nn */
.byte 0x31, REG, SP, IMM, NULL, NULL, 3, 12
.ascii "ld"
.long WORD
.long op_ld_imm_sp

Observe that the instruction referred to by opcode 0x31 is a load to the Stack Pointer Register.  If you recall our exercise studying the Game Boy’s bootstrap program, this indeed corresponds to the very first instruction executed, where the stack was initialized.

Now look at the values corresponding to dataset1.  We are interested in the last three values (bytes) of this field: NULL, 3, 12.  NULL tells us that the current instruction does not access memory; this is very important for some instructions that require some extra handling for accuracy purposes.  The 3 is the instruction’s length in bytes: in this case, the first byte is, as we now, the opcode, and the next two bytes are the immediate operand that is assigned to the Stack Pointer (‘SP’).

Note the values of %r14 and %r12.  NULL expands to value 0, so our instruction does not access memory.  Handling instructions that do not access memory is easier, because memory access must be done at specific timings.  We can ignore memory timings for instructions that don’t do memory reads or writes.

Now, having all the information about the current instruction, we would want to actually execute it.  Remember that we have a pointer to the function that emulates each instruction in our data structure, so we want to call the function that emulates the instruction referenced by opcode 0x31.


%rdx used to pass argument to function. %ebx holds the function pointer.

Note here that we copied %rbx to %rdx.  Remember that %rbx points to the beginning of the opcode’s entry in the data structure; %rdx now points there as well.  The register %rdx actually holds an argument passed to the function we will call.  Now, in our data structure, the pointer to the function that emulates the instruction is the last 4 bytes of each entry.  That is why we add 28 to %rbx, so it points to this field.  The instruction movl (%ebx), %ebx copies the address of the function to %ebx (this is the lower 4 bytes of the %rbx register).  We do a 4-byte copy because memory addresses, even in 64-bit architectures, usually fit in 32 bits, so it would be a waste of memory to assign 8 bytes to this function pointer.  The important thing is that now we can call our function using the %rbx (really the %ebx) register.

The next two instructions check if debugging is enabled.  In case we enabled the debugger, a special call to gddb_main is made where execution is stopped, the current instruction is printed, and a command line interpreter is called, so the current state of our emulated machine can be checked (register values, contents of memory addresses, etc).  We don’t enable debugging so execution continues at line 157.


The debugging function is skipped

We now copy the value 0 to the variable inst_is_cb, a variable that is used to check if the current instruction is the one with opcode 0xCB, that is, the instruction extender.  Now, recall that we copied dataset1 from %r14 to %r12.  We said that %r14 holds the instruction’s duration in CPU cycles.  Well, %r12 also holds an important value for the executing instruction; it holds information that tells us if the instruction accesses memory.  Because our instruction does not access memory, %r12 is NULL. The instruction testq $DELAY, %r12 does this checking.  In case the instruction accesses memory, a jump is performed to execute_precise.  In our example, instead, a jump is performed to line 162.


Instruction does not access memory.

Now, because register %rbx contains the address of the function that will emulate our instruction, we use a call instruction to transfer control to that function:


Call to function that emulates our instruction.

This function is in charge of actually executing our emulated instruction, thus completing the Fetch-Decode-Execute Cycle.  It advances the PC so it points to the next instruction to execute.  In this case recall that the instruction’s length is 3 bytes, so the PC is advanced 3 bytes.  This particular instruction extracts the immediate value operand and copies it to the Stack Pointer Register (SP).  The memory location referenced by the label regs_sets is actually where we hold the values corresponding to each of the Game Boy’s registers.


After executing the function.

We see that, upon emulating our instruction, the SP has the value 0xFFFE, and the PC has the value 0x0003.  This is exactly what we expected.  Let’s return from our emulating function:


Return from function; we have successfully emulated the LD SP, $0xFFFE instruction

Our first part of ‘the core’ is almost complete.  Next we must test whether the special case in which the instruction corresponds to the instruction extender.  In this case, the special variable inst_is_cb (instruction is 0xCB) is set to 1 in the emulating function, so the following test would be true and a jump back to no_ddb would be performed.  The function that handles the instruction extender has already set up everything for executing the extended instruction; %r12 holds the byte that tells if the instruction accesses memory, %rbx points to the function that executes the extended instruction and %rdx points to the instruction’s entry in the data structure.  So, here we have a ‘mini-loop’ that is only used for handling this special case.  But, because our current instruction is not the instruction extender, the jump back to no_ddb is not performed, and execution continues at line 166.


Instruction is not instruction extender (opcode 0xCB)

With the special case of the instruction extender, we come to the end of the first part of emulating ‘the core’; the Fetch-Decode-Execute Cycle.  In our second part we deal with things that need to be done after executing every instruction, namely, to update the state of the rest of the machine with appropriate timing implied by the instruction we just executed.


We have seen how RealBoy deals with the Fetch-Decode-Execute Cycle, or Instruction Cycle.  This cycle is repeated ad infinitum until somebody decides to shut down the machine.

We saw how low-level knowledge can lead to interesting implementation decisions.  For example, we have avoided a costly multiplication instruction when scaling our index.  Instead, we used the ultra fast shift operation to do the scaling.  Because of the Fetch-Decode-Execute Cycle, this save is actually done for every instruction we emulate.

Next comes the second part of our study concerning how the RealBoy emulator handles the ubiquitous part of every emulator, namely, what needs to be done right after every instruction we emulate.  Lots of bookkeeping must be done; timers must be updated, interrupts must be processed, among many other things.

Continue with: Emulating The Core, Part 2: Interrupts and Timing.


Did you like this post? Do you have any suggestions? Please rate this post or leave us a comment so we can improve the quality of our work!


1. This, of course, have all been done at the interface level, that is, the part of the hardware that is directly exposed to the programmer. Very little, if any, information beyond this level have been noted.  We apologize to the hard-core hardware enthusiasts.
2. We mean imperative programming languages. Modern (pure) functional languages are based on lambda calculus instead of Turing Machines.
3. This is the only place where instruction fetching is done outside exec_next(); every further instruction fetched is done inside the main loop of this function, as we will see shortly.
4. Actually, the number of instructions offered is duplicated to 512 because opcode 0xCB is interpreted as a special instruction extender.


2 thoughts on “Emulating The Core, Part 1: The Fetch-Decode-Execute Cycle

  1. Hi!
    I’m reading your blog, and I think it’s the best thing I read for a long time: Entertaining, instructive and well written.

    I’ve a little question though. You state that r13 is dedicated for the PC. So why do you bother with a memory version of PC in regset ? I first thought it was for debugging (nicely done with the x/6xg &regsets) but then you make a copy from the memory PC to r13.

    Thank you, for your time and your work.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s