ARM assembler written in Lisp

I've recently updated the uLisp ARM assembler to make it more compact; it will now fit on a board with about 2000 objects of workspace, with room to spare to write assembler programs and run them.

I thought it would be useful to describe how the assembler works. This will help anyone who wants to extend it to cater for ARM instructions that it doesn't currently support. It will also be helpful if you want to write an assembler for another processor, or even design your own processor and write an assembler for it; Lisp is an excellent language for this. For example, a printout of the whole ARM assembler fits on just over two A4 pages.

Instruction encodings

The starting point for writing an assembler is to get hold of a summary table of the processor's instruction encodings. For the ARM Thumb instruction set these are as follows:

ARMThumb1.gif

ARM Thumb instruction encodings for instructions starting #x0 to #x8.

ARMThumb2.gif

ARM Thumb instruction encodings for instructions starting #x9 to #xF.

You can see from these diagrams that the 16-bit instructions are arranged into consistent field patterns. This is true of most processor instruction sets, but some are more orderly than others (RISC-V is a nightmare!).

An example - LSL

As an example, consider the first instruction in the first table, LSL (Logical Shift Left) immediate:

ARMThumbLSL.gif

This consists of:

  • The four-bit value #b0000.
  • A one-bit op code, which is 0 for LSL and 1 for LSR.
  • An immed5 value, which is a 5-bit integer from 0 to 31 giving the size of the shift.
  • Lm, which is a value from 0 to 7 representing the source register R0 to R7.
  • Ld, which is a value from 0 to 7 representing the destination register R0 to R7.
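
To see how these fields combine into a single 16-bit word, here is a minimal sketch that packs them by hand with ash and logior. The function name lsl-immediate-by-hand is purely illustrative and is not part of the assembler; it assumes only the field layout listed above.

```lisp
;; Hypothetical helper, not part of the assembler: pack the five fields
;; of the LSL/LSR immediate instruction by hand, using the layout above.
(defun lsl-immediate-by-hand (op immed5 lm ld)
  (logior (ash #b0000 12)   ; bits 15-12: the fixed four-bit value
          (ash op 11)       ; bit 11: 0 for LSL, 1 for LSR
          (ash immed5 6)    ; bits 10-6: shift amount, 0 to 31
          (ash lm 3)        ; bits 5-3: source register
          ld))              ; bits 2-0: destination register

;; LSL r7, r4, #31
(lsl-immediate-by-hand 0 31 4 7)   ; => 2023
```

Writing every instruction this way would be tedious and error-prone, which is the motivation for a general-purpose packing function.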

Emitting bit fields

The first function we need is emit, which takes a specification defining the widths of the bit fields, and a list of arguments, and packs the values of the arguments into the bit fields:

(defun emit (bits &rest args)
  (let ((word 0) (shift -28))
    (mapc #'(lambda (value)
              (let ((width (logand (ash bits shift) #xf)))
                (incf shift 4)
                (unless (zerop (ash value (- width))) (error "Won't fit"))
                (setq word (logior (ash word width) value))))
          args)
    word))

The first argument, bits, is a 32-bit hexadecimal number in which each hex digit specifies the width of the next bit field. The function emit reads the hex digits in bits from left to right, packs the appropriate number of bits from each argument into word, and then returns the result.

For example, the bit fields for the LSL instruction could be specified by:

#x41533000

To make the bit fields easier to process, the widths are left-aligned: pad the bits parameter with trailing zeros to make it eight hex digits.
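
As a quick check on a bits value, the following sketch lists the field widths it encodes, reading the hex digits from left to right in the same way as emit. The name field-widths is an assumption for illustration only; it isn't part of the assembler.

```lisp
;; Hypothetical helper, not part of the assembler: list the non-zero
;; field widths encoded in a left-aligned, eight-hex-digit bits value.
(defun field-widths (bits)
  (let ((widths nil))
    (dotimes (i 8)
      ;; Digit i, counting from the left, sits (28 - 4i) bits up.
      (let ((width (logand (ash bits (- (* 4 i) 28)) #xf)))
        (unless (zerop width) (push width widths))))
    (reverse widths)))

(field-widths #x41533000)               ; => (4 1 5 3 3)
(apply #'+ (field-widths #x41533000))   ; => 16
```

The widths should add up to 16 for a 16-bit Thumb instruction, which makes this a handy sanity check when adding a new encoding.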

The remaining arguments are the values to be packed into the bit fields. If any argument won't fit into the corresponding bit field the error Won't fit will be displayed.

So for example, to emit the op code for the instruction:

LSL r7, r4, #31

evaluate:

> (emit #x41533000 0 0 31 4 7)
2023

If you print this as a 16-bit binary number with:

> (format t "~16,'0b" 2023)
0000011111100111 

you can see that the values have been put into the correct fields as required.
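
Another way to check the result is to unpack the word again. The sketch below is a hypothetical inverse of emit, here called unpack, which is not part of the assembler; it assumes the same left-aligned bits convention and simply peels the fields off the low end of the word.

```lisp
;; Hypothetical inverse of emit, not part of the assembler: split WORD
;; back into its field values using the same width digits.
(defun unpack (bits word)
  (let ((fields nil))
    ;; Walk the width digits from right to left, taking each field from
    ;; the low end of WORD; pushing restores left-to-right order.
    (dotimes (i 8)
      (let ((width (logand (ash bits (* -4 i)) #xf)))
        (unless (zerop width)
          (push (logand word (1- (ash 1 width))) fields)
          (setq word (ash word (- width))))))
    fields))

(unpack #x41533000 2023)   ; => (0 0 31 4 7)
```

The list returned matches the arguments that were passed to emit.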

Specifying registers

The next step is to be able to specify registers as r0 to r15, or their synonyms sp (for r13), lr (for r14), and pc (for r15). This is handled by the function regno:

(defun regno (sym)
  (case sym (sp 13) (lr 14) (pc 15)
    (t (read-from-string (subseq (string sym) 1)))))

For example:

> (regno 'r12)
12

We can now define the LSL instruction as the convenient Lisp function $lsl as follows:

(defun $lsl (argd argm immed5)
  (emit #x41533000 0 0 immed5 (regno argm) (regno argd)))

This allows us to specify the instruction using syntax that's close to ARM assembler syntax:

> ($lsl 'r7 'r4 31)
2023

I've used the convention that functions representing ARM instructions are prefixed by a $ sign; otherwise there would be a problem with instructions that are also existing Lisp functions, such as push and pop.

Handling addressing modes

The final complication is that some instruction mnemonics can generate different op codes, depending on the types of their arguments.

For example, there's also a variant of LSL that shifts a register Rd by the shift value specified in the register Rs:

ARMThumbLSL2.gif

Using this syntax, the following assembler instruction shifts the value in R7 by the value in R1:

LSL r7, r1

The block of register-to-register instructions that includes LSL is handled by the routine reg-reg:

(defun reg-reg (op argd argm)
  (emit #xa3300000 op (regno argm) (regno argd)))

Finally, we need to modify $lsl to include the register-to-register variant:

(defun $lsl (argd argm &optional arg2)
  (cond
   ((numberp arg2)
    (lsl-lsr-0 0 arg2 argm argd))
   ((numberp argm)
    (lsl-lsr-0 0 argm argd argd))
   (t
    (reg-reg #b0100000010 argd argm))))

where lsl-lsr-0 is defined as:

(defun lsl-lsr-0 (op immed5 argm argd)
  (emit #x41533000 0 op immed5 (regno argm) (regno argd)))

This expanded version of $lsl also handles the two-argument case where the source and destination registers are the same in an immediate shift; for example:

($lsl 'r1 31)
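
The same pattern extends to other instructions. As a sketch, a matching LSR function might look like the following. The definitions of emit, regno, lsl-lsr-0, and reg-reg are repeated from the text above so that the block runs on its own. The register-form op code #b0100000011 is an assumption taken from the Thumb encoding tables rather than from the assembler's source, so the assembler's own $lsr may differ.

```lisp
;; emit, regno, lsl-lsr-0 and reg-reg are repeated from the text above
;; so that this block runs on its own.
(defun emit (bits &rest args)
  (let ((word 0) (shift -28))
    (mapc #'(lambda (value)
              (let ((width (logand (ash bits shift) #xf)))
                (incf shift 4)
                (unless (zerop (ash value (- width))) (error "Won't fit"))
                (setq word (logior (ash word width) value))))
          args)
    word))

(defun regno (sym)
  (case sym (sp 13) (lr 14) (pc 15)
    (t (read-from-string (subseq (string sym) 1)))))

(defun lsl-lsr-0 (op immed5 argm argd)
  (emit #x41533000 0 op immed5 (regno argm) (regno argd)))

(defun reg-reg (op argd argm)
  (emit #xa3300000 op (regno argm) (regno argd)))

;; Sketch of LSR, mirroring $lsl: op bit 1 selects LSR in the immediate
;; form; #b0100000011 is assumed for the register-to-register form.
(defun $lsr (argd argm &optional arg2)
  (cond
   ((numberp arg2) (lsl-lsr-0 1 arg2 argm argd))   ; LSR rd, rm, #n
   ((numberp argm) (lsl-lsr-0 1 argm argd argd))   ; LSR rd, #n
   (t (reg-reg #b0100000011 argd argm))))          ; LSR rd, rs

($lsr 'r7 'r4 31)   ; => 4071  (#x0FE7)
($lsr 'r1 31)       ; => 4041  (#x0FC9)
($lsr 'r7 'r1)      ; => 16591 (#x40CF)
```

Only the op values change between $lsl and $lsr; the argument dispatch is identical.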

Running the assembler

To run the assembler in uLisp you use the built-in special form defcode, which generates an assembler listing and puts the machine code into RAM so you can execute it as if it were a normal Lisp function.

Greatest Common Divisor example

For example, to assemble a machine-code routine gcd to calculate the greatest common divisor of two numbers you'd evaluate:

; Greatest Common Divisor
(defcode gcd (x y)
  swap
  ($mov 'r2 'r1)
  ($mov 'r1 'r0)
  again
  ($mov 'r0 'r2)
  ($sub 'r2 'r2 'r1)
  ($blt swap)
  ($bne again)
  ($bx 'lr))

and you could then call:

> (gcd 3287 3460)
173
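
To follow what the routine does, it can help to model it in Lisp. The sketch below is a Common Lisp model, not code from the assembler: the name gcd-model and the variables r0, r1, and r2 are illustrative stand-ins for the registers. It assumes that the two parameters arrive in r0 and r1, that the result is returned in r0, that $blt branches when the subtraction gives a negative result and $bne when it gives a non-zero result, and that both arguments are positive integers.

```lisp
;; Hypothetical Common Lisp model of the gcd routine above, with one
;; variable per register and one statement per instruction.
(defun gcd-model (x y)
  (let ((r0 x) (r1 y) (r2 0))
    (tagbody
     swap
       (setq r2 r1)                  ; ($mov 'r2 'r1)
       (setq r1 r0)                  ; ($mov 'r1 'r0)
     again
       (setq r0 r2)                  ; ($mov 'r0 'r2)
       (setq r2 (- r2 r1))           ; ($sub 'r2 'r2 'r1)
       (when (< r2 0) (go swap))     ; ($blt swap)
       (when (/= r2 0) (go again)))  ; ($bne again)
    r0))                             ; ($bx 'lr) returns with result in r0

(gcd-model 3287 3460)   ; => 173
```

This is the subtractive form of Euclid's algorithm: the smaller value is repeatedly subtracted from the larger, and the two are swapped whenever the difference goes negative.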

Running the assembler in Common Lisp

You can also run the ARM assembler in a standard Common Lisp implementation. The Common Lisp version of the ARM assembler includes the following defcode macro, which lets you assemble an ARM function and print the machine code, like the defcode special form built into uLisp:

(defparameter *pc* 0)

(defmacro defcode (&body code)
  (let ((*print-pretty* t))
    (setq *pc* 0)
    (mapc
     #'(lambda (ins)
         (cond
          ((atom ins) (format t "~4,'0x      ~(~a~)~%" *pc* ins) (set ins *pc*))
          (t (format t "~4,'0x ~4,'0x ~(~a~)~%" *pc* (eval ins) ins) (incf *pc* 2))))
     (cddr code))
    nil))

Evaluating the Greatest Common Divisor example above generates the following output:

0000      swap
0000 000A ($mov 'r2 'r1)
0002 0001 ($mov 'r1 'r0)
0004      again
0004 0010 ($mov 'r0 'r2)
0006 1A52 ($sub 'r2 'r2 'r1)
0008 DBFA ($blt swap)
000A D1FB ($bne again)
000C 4770 ($bx 'lr)

In this case you won't be able to run the machine code, because the macro only prints the listing.
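
The two branch op codes in the listing show how labels are turned into relative offsets. As a sketch, and not the assembler's own code, the hypothetical function cond-branch below encodes a Thumb conditional branch as the bits 1101, a four-bit condition, and an 8-bit signed offset counted in halfwords from the address of the branch plus 4. The condition codes #b1011 for BLT and #b0001 for BNE are taken from the ARM documentation.

```lisp
;; Hypothetical helper, not the assembler's own code: encode a Thumb
;; conditional branch as 1101 cond imm8, where imm8 is the signed
;; halfword offset from the branch address plus 4.
(defun cond-branch (cond-code pc target)
  (let ((offset (ash (- target pc 4) -1)))
    (unless (<= -128 offset 127) (error "Branch out of range"))
    (logior #xd000 (ash cond-code 8) (logand offset #xff))))

(cond-branch #b1011 #x0008 #x0000)   ; BLT swap  => 56314 (#xDBFA)
(cond-branch #b0001 #x000A #x0004)   ; BNE again => 53755 (#xD1FB)
```

These values agree with the DBFA and D1FB entries in the listing above.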

Resources

For the full source of the ARM assembler see ARM assembler in uLisp.

For more information see ARM assembler overview.

For a list of the ARM Thumb instructions supported by the assembler see ARM assembler instructions.

For ARM assembler examples see ARM assembler examples.