Assembling an assembler - overview

To make my relay computer do something interesting (or anything at all) it needs a list of instructions held in memory. Each instruction consists of an 8-bit value called an opcode (a portmanteau of operation and code) optionally followed by one or two further 8-bit values (usually referencing a location in memory). The computer will work through them one at a time doing whatever operation each opcode represents. Here’s an example program:

41 60 11 08 81 E8 00 05 E6 00 02

Can you guess what it does? If you’ve not seen it before or you aren’t extremely familiar with this computer’s instruction set it’s going to be a tough ask. Imagine then that you need to write a much more complicated program with hundreds of lines … a tough ask becomes a Herculean task. This is why ‘higher level’ languages came about: they make writing a program easier for the user but then, of course, there’s some translation needed to get back to the ‘low level’ list of machine code instructions.

Assembly language is a good choice for something that’s well balanced between human and machine. In the simplest cases most assembly language instructions have a 1:1 relationship with an equivalent 8-bit machine operation but beyond the basics assembly language can make looping, logic branching, handling data and so on much, much easier. Try this example:

start:  ldi a,1     ; initial setup A = 1
        ldi b,0     ;               B = 0

loop:   mov c,b     ; slide B -> C
        mov b,a     ;       A -> B
        add         ; and add together

done:   bcs done    ; infinite loop if overflowed

        jmp loop    ; otherwise have another go

If you haven’t spotted it (and don’t feel bad if you haven’t) this is the exact same program you saw before … just the assembly language equivalent. So, how do we get from one to the other then? Well that’s the job of an assembler. Let’s try the example above by hand first to get a feel for what’s involved.

To start with let’s look at our first set of assembly language instructions - ldi, mov and add:

These ‘railroad’ diagrams show the acceptable syntax of each assembly language instruction:

ldi loads an immediate value into either register a or b
mov copies a value from one register (a, b, c or d) to another register (again - a, b, c or d)
add adds the values in registers b and c together, placing the result in register a or d (if the destination isn’t specified it’ll assume register a).

As mentioned above each of these assembly instructions has a direct equivalent opcode and we can use our opcode charts to work out what they should be:

Load Immediate

SETAB 8

0 1 r d d d d d

Loads a value between -16 and +15 in register A or B.

    r = destination register (0-A, 1-B)
ddddd = value (-16..15)
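As a rough sketch of how that chart turns into a byte, the Python below packs the register bit and the value into the 0 1 r d d d d d pattern. The function name encode_ldi is hypothetical, and it assumes the -16..15 range is stored as 5-bit two's complement.

```python
def encode_ldi(register: str, value: int) -> int:
    """Pack an ldi instruction into the SETAB pattern 0 1 r d d d d d."""
    r = {"a": 0, "b": 1}[register.lower()]   # r: 0 = A, 1 = B
    if not -16 <= value <= 15:
        raise ValueError("ldi value must be between -16 and 15")
    ddddd = value & 0b11111                  # assumed: 5-bit two's complement
    return 0b01000000 | (r << 5) | ddddd


print(f"{encode_ldi('a', 1):08b}")   # 01000001
print(f"{encode_ldi('b', 0):08b}")   # 01100000
```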

8-Bit Move

MOV8 8

0 0 d d d s s s

Copies the content of one 8-bit register to another.

ddd = destination register (000-A, 001-B, 010-C, 011-D, 100-M1, 101-M2, 110-X, 111-Y)
sss = source register      (000-A, 001-B, 010-C, 011-D, 100-M1, 101-M2, 110-X, 111-Y)
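The same idea works for the move chart. This is only a sketch (REGISTERS and encode_mov are made-up names) that looks up the 3-bit codes listed above and drops them into the 0 0 d d d s s s pattern.

```python
# Position in the list is the 3-bit register code from the chart above.
REGISTERS = ["a", "b", "c", "d", "m1", "m2", "x", "y"]


def encode_mov(dest: str, src: str) -> int:
    """Pack a mov instruction into the MOV8 pattern 0 0 d d d s s s."""
    ddd = REGISTERS.index(dest.lower())
    sss = REGISTERS.index(src.lower())
    return (ddd << 3) | sss


print(f"{encode_mov('c', 'b'):08b}")   # 00010001
print(f"{encode_mov('b', 'a'):08b}")   # 00001000
```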

ALU Operation

ALU 8

1 0 0 0 r f f f

Performs an arithmetic or logic operation on the B (and optionally C) register(s).

  r = destination register (0-A, 1-D)
fff = function code (000-NOP, 001-ADD, 010-INC, 011-AND, 100-OR, 101-XOR, 110-NOT, 111-SHL)
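And once more for the ALU chart, again as a hedged sketch with hypothetical names (ALU_FUNCTIONS, encode_alu). The destination defaults to register a, matching the description of add given earlier.

```python
# Position in the list is the 3-bit function code from the chart above.
ALU_FUNCTIONS = ["nop", "add", "inc", "and", "or", "xor", "not", "shl"]


def encode_alu(function: str, dest: str = "a") -> int:
    """Pack an ALU instruction into the pattern 1 0 0 0 r f f f."""
    r = {"a": 0, "d": 1}[dest.lower()]       # r: 0 = A, 1 = D
    fff = ALU_FUNCTIONS.index(function.lower())
    return 0b10000000 | (r << 3) | fff


print(f"{encode_alu('add'):08b}")        # 10000001
print(f"{encode_alu('add', 'd'):08b}")   # 10001001
```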

We see here then that ldi is a specific variant of a SETAB class opcode, mov similarly for the MOV8 class and add for the ALU class. Taking all this together then we arrive at:

01000001  |  start:  ldi a,1     ; initial setup A = 1
01100000  |          ldi b,0     ;               B = 0

00010001  |  loop:   mov c,b     ; slide B -> C
00001000  |          mov b,a     ;       A -> B
10000001  |          add         ; and add together

????????  |  done:   bcs done    ; infinite loop if overflowed

????????  |          jmp loop    ; otherwise have another go

We’ve now got the 8-bit opcodes for most of our assembly instructions. Typically when writing these 8-bit values we do it in hexadecimal rather than binary (as it’s shorter) - let’s update what we’ve got:

41  |  start:  ldi a,1     ; initial setup A = 1
60  |          ldi b,0     ;               B = 0

11  |  loop:   mov c,b     ; slide B -> C
08  |          mov b,a     ;       A -> B
81  |          add         ; and add together

??  |  done:   bcs done    ; infinite loop if overflowed

??  |          jmp loop    ; otherwise have another go

Notice how this is now starting to resemble the ‘random numbers’ at the top of this post?
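If you'd rather not convert binary to hexadecimal by hand, a couple of lines of Python will do it. This is just an illustration of the conversion used above; binary_to_hex is a made-up helper name.

```python
opcodes_binary = ["01000001", "01100000", "00010001", "00001000", "10000001"]


def binary_to_hex(bits: str) -> str:
    """Convert an 8-character binary string to two upper-case hex digits."""
    return f"{int(bits, 2):02X}"


opcodes_hex = [binary_to_hex(bits) for bits in opcodes_binary]
print(opcodes_hex)   # ['41', '60', '11', '08', '81']
```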

So what about those last two lines? What do we do about the loop and jump? Let’s start with the railroad diagram for branching operations:

jmp will set the program counter to a given location unconditionally (i.e. always). bcs will also set the program counter to a given location but only if the last arithmetic operation set the ‘carry’ flag (i.e. it overflowed) otherwise the program counter will move on to the next location in memory. In the case of our code above the arithmetic operation is the immediately preceding add.
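To see how the two branches steer the example program, here's a small Python simulation of it. It's a sketch under a couple of assumptions: the registers are 8 bits wide and the carry flag is set whenever the sum of b and c doesn't fit in 8 bits. The function name run_program is hypothetical.

```python
def run_program(max_steps: int = 1000) -> list[int]:
    """Simulate the example program, returning each value add leaves in A
    up to (but not including) the add that sets the carry flag."""
    a, b, c = 1, 0, 0              # ldi a,1 / ldi b,0 (C is overwritten before use)
    results = []
    for _ in range(max_steps):
        c = b                      # mov c,b
        b = a                      # mov b,a
        total = b + c              # add
        a = total & 0xFF           # A keeps only the low 8 bits
        carry = total > 0xFF       # assumed meaning of the carry flag
        if carry:                  # bcs done: taken, so the program spins at done
            break
        results.append(a)          # jmp loop: always taken
    return results


print(run_program())   # [1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233]
```

Under those assumptions the program walks up the Fibonacci sequence and parks itself at done once the next value (377) no longer fits in 8 bits.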

jmp and bcs have opcode equivalents of course but how do we translate those labels, as in, how do we say where we want to jump to? Well, it turns out that whilst assembling we need to keep track of where the program counter will be. Let’s assume the program starts at the first location in memory (address 0000 in hexadecimal) and counting up from there we get the following:

0000: 41  |  start:  ldi a,1     ; initial setup A = 1
0001: 60  |          ldi b,0     ;               B = 0

0002: 11  |  loop:   mov c,b     ; slide B -> C
0003: 08  |          mov b,a     ;       A -> B
0004: 81  |          add         ; and add together

0005: ??  |  done:   bcs done    ; infinite loop if overflowed

0008: ??  |          jmp loop    ; otherwise have another go

We now have a 16-bit value at the far left of each line representing the location in memory where that program line is stored. Hang on though, why did the counter jump three places rather than one at the jumps? Well, each of those instructions needs to be followed by a 16-bit value giving the location in memory to jump to, like so:

Branch/Call & 16-bit Load Immediate

GOTO 24

1 1 d s c z n x

h h h h h h h h

l l l l l l l l

Branches to a given address if the stated condition flag(s) are set. The address of the next instruction can optionally be saved in the XY register. The M register can also be loaded with a 16-bit value (without jumping).

d = destination register (0-M, 1-J)
s = 1 = load PC if sign bit is set (if negative); 0 = ignore sign bit
c = 1 = load PC if carry bit is set (if carry); 0 = ignore carry bit
z = 1 = load PC if zero bit set (if result is zero); 0 = ignore if zero bit set
n = 1 = load PC if zero bit clear (if result is not zero); 0 = ignore if zero bit clear
x = 1 = copy PC to XY; 0 = no copy
hhhhhhhh = address high byte (to set in M2/J2)
llllllll = address low byte (to set in M1/J1)
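Here's a sketch of building one of these three-byte instructions in Python. The name encode_goto and its keyword arguments are hypothetical. It always targets J (d = 1) and treats setting both z and n as "always jump", which is how the E6 opcode used for jmp in this post reads against the chart.

```python
def encode_goto(address: int, *, carry: bool = False, always: bool = False) -> bytes:
    """Build a 3-byte GOTO-class instruction: opcode, address high, address low."""
    opcode = 0b11000000            # fixed top bits: 1 1
    opcode |= 1 << 5               # d = 1: load J (the jump target)
    if carry:
        opcode |= 1 << 3           # c = 1: jump if the carry bit is set
    if always:
        opcode |= 0b110            # z = 1 and n = 1: zero set or clear, so always
    high = (address >> 8) & 0xFF
    low = address & 0xFF
    return bytes([opcode, high, low])


print(encode_goto(0x0005, carry=True).hex(" ").upper())    # E8 00 05
print(encode_goto(0x0002, always=True).hex(" ").upper())   # E6 00 02
```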

… and those labels … they represent a certain location in memory making it much easier for the human to indicate where in the program they want to jump to without having to do the program counter tracking in their head.
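The program counter tracking can be sketched in Python as a single walk over the source, noting the address each label lands on. This is an illustration only: SIZES, SOURCE and collect_labels are made-up names, and the instruction sizes are the ones worked out by hand above.

```python
SIZES = {"ldi": 1, "mov": 1, "add": 1, "bcs": 3, "jmp": 3}   # bytes per instruction

SOURCE = """
start:  ldi a,1     ; initial setup A = 1
        ldi b,0     ;               B = 0

loop:   mov c,b     ; slide B -> C
        mov b,a     ;       A -> B
        add         ; and add together

done:   bcs done    ; infinite loop if overflowed

        jmp loop    ; otherwise have another go
"""


def collect_labels(source: str, origin: int = 0) -> dict[str, int]:
    """Walk the source once, recording the address each label points to."""
    labels, pc = {}, origin
    for line in source.splitlines():
        code = line.split(";", 1)[0].strip()     # drop the comment
        if not code:
            continue                             # blank or comment-only line
        if ":" in code:
            label, code = code.split(":", 1)
            labels[label.strip()] = pc           # label = current program counter
            code = code.strip()
        if code:
            pc += SIZES[code.split()[0]]         # advance by the instruction size
    return labels


print(collect_labels(SOURCE))   # {'start': 0, 'loop': 2, 'done': 5}
```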

We can now finish the hand assembling by replacing those labels with the now known program counter values to arrive at:

0000: 41        |  start:  ldi a,1     ; initial setup A = 1
0001: 60        |          ldi b,0     ;               B = 0

0002: 11        |  loop:   mov c,b     ; slide B -> C
0003: 08        |          mov b,a     ;       A -> B
0004: 81        |          add         ; and add together

0005: E8 00 05  |  done:   bcs done    ; infinite loop if overflowed

0008: E6 00 02  |          jmp loop    ; otherwise have another go
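Putting the hand-assembly steps together, here's a minimal two-pass sketch in Python that handles only the five mnemonics used in this example. It isn't the assembler this series goes on to build; every name in it (parse, assemble, BRANCH_OPCODES and so on) is hypothetical, and the opcode values are simply the ones derived by hand above.

```python
REGS = ["a", "b", "c", "d", "m1", "m2", "x", "y"]
SIZES = {"ldi": 1, "mov": 1, "add": 1, "bcs": 3, "jmp": 3}
BRANCH_OPCODES = {"bcs": 0xE8, "jmp": 0xE6}   # values worked out by hand above

SOURCE = """
start:  ldi a,1     ; initial setup A = 1
        ldi b,0     ;               B = 0
loop:   mov c,b     ; slide B -> C
        mov b,a     ;       A -> B
        add         ; and add together
done:   bcs done    ; infinite loop if overflowed
        jmp loop    ; otherwise have another go
"""


def parse(source):
    """Yield (label, mnemonic, operands) for each line that contains code."""
    for line in source.splitlines():
        code = line.split(";", 1)[0].strip()          # drop the comment
        if not code:
            continue
        label = None
        if ":" in code:
            label, code = (part.strip() for part in code.split(":", 1))
        mnemonic, _, rest = code.partition(" ")
        operands = [op.strip() for op in rest.split(",") if op.strip()]
        yield label, mnemonic, operands


def assemble(source, origin=0):
    lines = list(parse(source))

    # Pass 1: track the program counter and note where each label lands.
    labels, pc = {}, origin
    for label, mnemonic, _ in lines:
        if label:
            labels[label] = pc
        if mnemonic:
            pc += SIZES[mnemonic]

    # Pass 2: emit bytes, swapping labels for their addresses.
    output = bytearray()
    for _, mnemonic, ops in lines:
        if mnemonic == "ldi":
            r = {"a": 0, "b": 1}[ops[0]]
            output.append(0x40 | (r << 5) | (int(ops[1]) & 0x1F))
        elif mnemonic == "mov":
            output.append((REGS.index(ops[0]) << 3) | REGS.index(ops[1]))
        elif mnemonic == "add":
            dest = ops[0] if ops else "a"             # destination defaults to a
            output.append(0x80 | (0x08 if dest == "d" else 0) | 0b001)
        elif mnemonic in BRANCH_OPCODES:
            target = labels[ops[0]]
            output += bytes([BRANCH_OPCODES[mnemonic],
                             (target >> 8) & 0xFF, target & 0xFF])
    return bytes(output)


print(assemble(SOURCE).hex(" ").upper())   # 41 60 11 08 81 E8 00 05 E6 00 02
```

Two passes are used because a branch can refer to a label that hasn't been seen yet, so all label addresses need to be known before any branch is encoded.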

Removing all our ‘working out’ above we get back to where we first started:

41 60 11 08 81 E8 00 05 E6 00 02

Now, this is a really simple example but that’s the basics of an assembler. From here though you could extend the assembly language to allow the human programmer to change where the program counter starts from or maybe to allow some basic arithmetic around labels (so label1 + 5 for example … i.e. 5 locations further on from wherever label1 points to). Commercial assemblers can be much more sophisticated but it always comes down to producing a list of values that will be loaded into a computer’s memory for it to follow and operate on.
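As a taste of the label arithmetic idea, here's a hedged Python sketch that resolves operands such as label1 + 5 against a table of known label addresses. The resolve function and the operand syntax it accepts are assumptions made for illustration, not a description of any particular assembler.

```python
import re


def resolve(expression: str, labels: dict[str, int]) -> int:
    """Resolve 'label', 'label + n' or 'label - n' to a 16-bit address."""
    match = re.fullmatch(r"\s*(\w+)\s*(?:([+-])\s*(\d+)\s*)?", expression)
    if match is None:
        raise ValueError(f"cannot parse operand: {expression!r}")
    name, sign, offset = match.groups()
    address = labels[name]                    # KeyError if the label is unknown
    if sign == "+":
        address += int(offset)
    elif sign == "-":
        address -= int(offset)
    return address & 0xFFFF                   # keep within 16 bits


labels = {"start": 0x0000, "loop": 0x0002, "done": 0x0005}
print(resolve("loop + 5", labels))   # 7
print(resolve("done", labels))       # 5
```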

In designing my assembly language I’ve drawn inspiration from several existing ones out there to produce something that I personally find aesthetically pleasing. The mnemonics are mostly inspired by 6502 (used in computers like the Commodore 64 and BBC Micro) as I like that they are all three characters which makes everything line up neatly (to my eye) but I also like the parameter format of Z80 (used in the Sinclair ZX Spectrum - my first computer). If you compare my assembly language with those I’m sure you can see the influences coming through and as I add more to my assembly language you’ll see this pattern continue.

So, that’ll do for this post - next time in this mini series I’ll take a look at how I can write an assembler that will automate this whole assembly language to machine code translation process. Looking at what I did by hand above you might think it’s quite easy but think about what I was actually doing. There’s a lot that comes for free with the human brain and I’m automatically extracting a lot of semantic meaning out of what really is just a series of characters - a file of text … the assembler is going to need to be taught all of that.

If you’re hungry for more in the meantime though you can find the complete railroad diagrams here and there’s also a video below going through my chosen assembly language instructions …