Let's write a disassembler for ARMv6 in C

Let’s write a disassembler for ARMv6 in C

In this post I will show how to recreate objdump from binutils for Armv6-M. You will not only learn to decode Thumb instructions, but also to understand their semantics and to navigate Arm reference manuals.

I will describe all steps in great detail so you would only need a minimal background in C to follow along. If you are learning to program microcontrollers and want to understand what’s happening behind the scenes – this post is for you.

Diving into the lower levels is really not that scary. You just have to be brave enough to dispel the magic and see how simple the base of our technology is if you have the right ancient books. Getting close to the machine gives an exciting sense of affinity and helpful insights for daily work.

Thumb decoder

For this post I will specifically target Arm Cortex M0+ – a small power-efficient Armv6-M core released in 2012 which is used in microcontrollers like STM32G0. It does not support the 32 bit Arm instruction set and instead only has 16-bit Thumb instructions with a small number of 32-bit instructions introduced with Thumb-2 technology.

The most important book for us here is the Armv6-M Architecture Reference Manual (DDI0419C) [1]. Besides all the useful information it contains a complete set of all instructions supported by this core with their encodings, differences to other cores and even operation pseudocode. I will refer to this document later as A6-123 (chapter-page) or A1.1.1 (chapter). You can also check out the Arm Instruction Set Reference Guide (100076) [2] for a complete list of instructions.

How are instructions stored?

Each Thumb instruction is either a single 16-bit halfword or a 32-bit instruction consisting of two consecutive halfwords. Instruction fetches are always little-endian even if the memory system is configured as big-endian [A3.3.1]. This means that the least significant byte of a halfword is stored at address A, and most significant at A+1: e.g. an instruction 0x464e is stored in memory as mem[addr 0x0] = 0x4e, mem[addr 0x1] = 0x46.

Each halfword must be aligned in memory to a multiple of two, which means that you cannot put instructions at addresses like 0x3 or 0x127, only 0x2 and 0x128. Note that 32-bit instructions do not require an alignment of 4 as they are fetched as two independent halfwords.

The least significant bit of the Program Counter (PC) selects the execution state: 0 for Arm state, and 1 for Thumb. Since this core only supports Thumb state, bit[0] must always be 1 or else a HardFault occurs. This bit is ignored when calculating the fetch address so a branch to 0x123 will fetch an instruction from 0x122 [A4.1.1].

Now onto decoding! First, let’s create a struct to hold decoded instructions:

enum thumb_op_ty {
    thumb_op_movs_imm,
    thumb_op_bl
    /* ... */
};
struct thumb_op {
    enum thumb_op_ty ty;
    union {
        uint16_t push_regs;
        struct {
            uint16_t rd;
            uint32_t imm32; /* 32 bit wide immediate value */
        } movs_imm;
        int32_t bl_imm32;
        /* ... */
    } v;
};

This is a polymorphic structure that can represent any instruction based on ty. The union v here contains the operands according to ty: v.bl_imm32 is for ty == thumb_op_bl, v.movs_imm.rd is for ty == thumb_op_movs_imm, etc. To work with such structures a switch statement on ty is often used. It is important to always check ty before accessing values from union v. Otherwise you will get misinterpreted values as all union fields are located at the same address.

Next, let’s define our main function:

int thumb_decode(uint16_t hw1, uint16_t hw2, struct thumb_op * op)
{
    /* ... */
}

It accepts two halfwords hw1 and hw2 that represent two consecutive halfwords from memory, and puts the decoded instruction in op. The return value is the width of the instruction in bytes, i.e. by how much to increment the address for the next call.

Throughout the following sections I will use some bit-twiddling functions: - bits_(v, hi, lo) returns a bitfield x = v[hi:lo] such that bit at position hi in v becomes the highest bit of the result, and bit lo becomes bit zero, counting from zero: bits_(0b11110010100111, 7, 5) == 0b101 - bit_(v, n) returns one bit at position n from v, counting from zero: bit_(0b00100, 2) == 0b1 - sign_extend32_(v, b) extends a signed two’s complement bitfield v from a width of b bits to n bits by adding copies of the highest bit from v to fill the extra bits: sign_extend32_(0b1101, 4) == 0b11111111111111111111111111111101

static uint16_t bits_(uint16_t v, int hi, int lo)
{
    return (v >> lo) & ~(~0 << (hi - lo + 1));
}
static uint32_t bit_(uint32_t v, int n)
{
    return (v >> n) & 1;
}
static int32_t sign_extend32_(uint32_t v, int b)
{
    uint32_t m = (1U << (b - 1));
    v = v & ((1U << b) - 1);
    return (v ^ m) - m;
}

I will describe those functions in great detail in the appendix on Helper functions.

32-bit instructions: Branch with Link (`BL`)

Cortex M0+ supports seven 32-bit instructions from Thumb-2: MSR, MRS, DSB, DMB, ISB, UDF and BL [A5.3]. We can detect them by checking the highest 5 bits of the first halfword: if bit[15:11] is one of 0b11101, 0b11110 or 0b11111, we must fetch the second halfword; all other instructions are 16-bit [A5.1].

static int op_is_32_bit_(uint16_t hw1)
{
    return bits_(hw1, 15, 11) == 0b11101
            || bits_(hw1, 15, 11) == 0b11110
            || bits_(hw1, 15, 11) == 0b11111);
}

int thumb_decode(uint16_t hw1, uint16_t hw2, struct thumb_op * op)
{
    if (op_is_32_bit_(hw1)) {
        /* extended below */
        return 4;
    } else { /* 16-bit instructions */
        /* ... */
        return 2;
    }
}

I will only cover the BL instruction since its encoding is the most complex. The rest is left as an exercise to the reader.

The Branch with Link instruction BL calls a subroutine at an address relative to the current program counter (aka PC-relative address) [A6.7.13]. It encodes the address as a 25-bit even signed integer. The lowest bit is not encoded and is always 0 to double the range at no cost thanks to the fact that instructions must be aligned to a multiple of two, i.e. at even addresses. I will call the encoded relative address imm32. The mnemonic is bl <label>, and the address is calculated as an offset from the PC value of the BL instruction to this assembly label.

stuff:
    mov r0, r0
my_proc:
    add r1, r1, #42 ; mem_addr = 2, PC of an instruction = (2 + 4) | 0x1 = 7
    bx lr           ; mem_addr = 4, PC of an instruction = (4 + 4) | 0x1 = 9
my_caller:
    mov r1, #10     ; mem_addr = 6, PC of an instruction = (6 + 4) | 0x1 = 11
    bl my_proc      ; mem_addr = 8, PC of an instruction = (8 + 4) | 0x1 = 13

Note that the PC value of an instruction is its address plus 4 [A4.2.1], even if the instruction itself takes only 2 bytes. This has to do with the pipeline architecture where the fetch stage for the next instruction is in parallel with the execute stage for the current instruction. In our example, instruction bl my_proc is at address 8 in memory but its PC value is 13: add 4 and set the lowest bit (remember about Thumb state). The encoded offset is 2 (label addr) - 13 (PC) == -11 (imm32).

Now let’s get into decoding. For that we refer to the Encoding diagram from the chapter on the instruction [A6.7.13]:

| halfword 1 (`hw1`)                  | | halfword 2 (`hw2`)                  |
|15|14|13|12|11|10|9|8|7|6|5|4|3|2|1|0| |15|14|13|12|11|10|9|8|7|6|5|4|3|2|1|0|
|-------------------------------------|-|-------------------------------------|
| 1| 1| 1| 1| 0| S|             imm10 | | 1| 1|J1| 1|J2|                imm11 |

All bits explicitly marked with 0 or 1 must be set to that value, or else the instruction at hand is a different one or undefined. Names such as S, J1 or imm10 are used later to refer to the bitfields: hw1[9:0] == imm10, hw2[11] == J2 (see also Appendix E.1 in [1]).

    if (bits_(hw1, 15, 11) == 0b11110
            && bits_(hw2, 15, 14) == 0b11 && bit_(hw2, 12) == 0b1) {
        /* ... */
    }

Each Encoding diagram in the manual is followed by some pseudocode that translates the fields from the diagram into more meaningful values used in the instruction description and operational pseudocode:

I1 = NOT(J1 EOR S);
I2 = NOT(J2 EOR S);
imm32 = SignExtend(S:I1:I2:imm10:imm11:'0', 32);

This pseudocode uses some operators described in Appendix E.5 in [1]: - NOT(x) negates each bit of x, e.g. NOT(0b1010) == 0b0101; same as ~x in C; - a EOR b computes a bitwise exclusive OR (aka XOR), e.g. 0b1010 EOR 0b1100 == 0b0110; same as a ^ b in C; - a:b concatenates two bitstrings by placing all bits from a to the left (higher) of b, e.g. 0b101:0b011 == 0b101011; same as (a << n_bits(b)) | b in C, where n_bits(b) is the width of b in bits; - SignExtend(x, n) extends a signed two’s complement bitfield x to a length of n bits by adding copies of the highest bit from x to fill the extra bits, e.g. SignExtend(0b1101, 8) == 0b11111101 (0b1101 == -3); you will see the function sign_extend32_(x) == SignExtend(x, 32) in C, which I will describe later in great detail.

Fields' pseudocode is often followed by a condition such as:

If InITBlock() && !LastInITBlock() then UNPREDICTABLE;

Those are meant for the IT (If-Then) instruction, which is not supported on Armv6-M. InITBlock() and LastInITBlock() always return false, so you can just ignore those checks [A6.3].

Now all we have to do is to extract the fields and concatenate them:

        uint32_t s     = bit_(hw1, 10);
        uint32_t i1    = ~(bit_(hw2, 13) ^ s) & 1;
        uint32_t i2    = ~(bit_(hw2, 11) ^ s) & 1;
        uint32_t imm10 = bits_(hw1, 9, 0);
        uint32_t imm11 = bits_(hw2, 10, 0);
        uint32_t imm25 = 0;
        imm25 = imm25       | s;     /* bits[24] */
        imm25 = imm25 << 1  | i1;    /* bits[23] */
        imm25 = imm25 << 1  | i2;    /* bits[22] */
        imm25 = imm25 << 10 | imm10; /* bits[21:11] */
        imm25 = imm25 << 11 | imm11; /* bits[10:1] */
        imm25 = imm25 << 1  | 0;     /* bits[0] */
        op->ty = thumb_op_bl;
        op->v.bl_imm32 = sign_extend32_(imm25, 25);

Note the mask & 1 in i1 and i2: without it all bits above the first will be set after negating, and doing imm25 << 1 | i1 will overwrite fields added before.

Another way of writing the concatenation is with a union and bit-fields:

        union {
            struct {
                unsigned int zero : 1;
                unsigned int imm11 : 11;
                unsigned int imm10 : 10;
                unsigned int i2 : 1;
                unsigned int i1 : 1;
                unsigned int s : 1;
                unsigned int zero2 : 7;
            } b;
            uint32_t w;
        } imm32;
        uint32_t s = bit_(hw1, 10);
        imm32.w = 0;
        imm32.b.imm11 = bits_(hw2, 10, 0);
        imm32.b.imm10 = bits_(hw1, 9, 0);
        imm32.b.i2 = ~(bit_(hw2, 11) ^ s);
        imm32.b.i1 = ~(bit_(hw2, 13) ^ s);
        imm32.b.s = s;
        op->ty = thumb_op_bl;
        op->v.bl_imm32 = sign_extend32_(imm32.w, 25);

… but it relies on two implementation-dependent mechanisms, some might call it Undefined behavior squared. The standard does not specify the precise layout in memory of the bit-fields, so different compilers are free to reorder them. Then we try to interpret struct imm32.b as uint32_t. This is called type punning and is also undefined: for example, on big-endian systems, bit imm32.b.zero might be placed in the most significant byte of imm32.w, resulting in nonsense value (see also 6.8 Unions and 6.9 Bit-fields in [4]).

Now let’s walk through disassembling some real instruction codes that you can see in the second column of this neat disassembly (remember, to build a disassembler you will need a disassembler):

Disassembly of section .text:
00000000 <foo>:
       0:       1c00            adds    r0, r0, #0
00000002 <first>:
       2:       3001            adds    r0, #1
       4:       f0ab fe02       bl      abc0c <second>
       8:       382a            subs    r0, #42
        ...
000abc0a <bar>:
   abc0a:       1c00            adds    r0, r0, #0
000abc0c <second>:
   abc0c:       3802            subs    r0, #2
   abc0e:       f754 f9f8       bl      2 <first>
   abc12:       30e4            adds    r0, #228

Let’s decode our two BL instructions:

struct thumb_op op1, op2;
/*     4:       f0ab fe02       bl      abc0c <second> */
thumb_decode(0b1111000010101011, 0b1111111000000010, &op1);
/* abc0e:       f754 f9f8       bl      2 <first> */
thumb_decode(0b1111011101010100, 0b1111100111111000, &op2);

| halfword 1 (`hw1`)                  | | halfword 2 (`hw2`)                  |
|15|14|13|12|11|10|9|8|7|6|5|4|3|2|1|0| |15|14|13|12|11|10|9|8|7|6|5|4|3|2|1|0|
|--|--|--|--|--|--|-------------------|-|--|--|--|--|--|----------------------|
| 1| 1| 1| 1| 0| S|             imm10 | | 1| 1|J1| 1|J2|                imm11 |
|--|--|--|--|--|--|-op1---------------|-|--|--|--|--|--|----------------------|
| 1| 1| 1| 1| 0| 0|        0010101011 |-| 1| 1| 1| 1| 1|          11000000010 |
|--|--|--|--|--|--|-op2---------------|-|--|--|--|--|--|----------------------|
| 1| 1| 1| 1| 0| 1|        1101010100 |-| 1| 1| 1| 1| 1|          00111111000 |

First we check the highest 5 bits of hw1 to see if the instruction is 32-bit:

    bits_(hw1, 15, 11) == 0b11110;
    if (bits_(hw1, 15, 11) == 0b11101
            || bits_(hw1, 15, 11) == 0b11110
            || bits_(hw1, 15, 11) == 0b11111) {
        /* it's 32-bit! */
    }

Then we match the encoding:

    bits_(hw1, 15, 11) == 0b11110;
    bits_(hw2, 15, 14) == 0b11;
    bit_(hw2, 12) == 0b1;
    if (bits_(hw1, 15, 11) == 0b11110
            && bits_(hw2, 15, 14) == 0b11 && bit_(hw2, 12) == 0b1) {
        /* it's BL! */
    }

… and finally decode the operand:

For op1:
        uint32_t imm25 = 0;
        imm25 = imm25       | 0b0;                     /* s */
        /* == 0b00000000000000000000000000000000 */
        imm25 = imm25 << 1  | ~(0b1 ^ 0b0) & 1;        /* i1; 0b1 ^ 0b0 == 0b1 */
        /* == 0b00000000000000000000000000000000 */
        imm25 = imm25 << 1  | ~(0b1 ^ 0b0) & 1;        /* i2 */
        /* == 0b00000000000000000000000000000000 */
        imm25 = imm25 << 10 | 0b0010101011;            /* imm10 */
        /* == 0b00000000000000000000000010101011 */
        imm25 = imm25 << 11 | 0b11000000010;           /* imm11 */
        /* == 0b00000000000001010101111000000010 */
        imm25 = imm25 << 1  | 0;                       /* lowest bit is always 0 */
        /* == 0b00000000000010101011110000000100 */
        op->v.bl_imm32 = sign_extend32_(imm25, 25);
        /* == 0b00000000000010101011110000000100 */
        /* == +0xABC04 */
        op->ty = thumb_op_bl;
For op2:
        uint32_t imm25 = 0;
        imm25 = imm25       | 0b1;                     /* s */
        /* == 0b00000000000000000000000000000001 */
        imm25 = imm25 << 1  | ~(0b1 ^ 0b1) & 1;        /* i1 */
        /* == 0b00000000000000000000000000000011 */
        imm25 = imm25 << 1  | ~(0b1 ^ 0b1) & 1;        /* i2 */
        /* == 0b00000000000000000000000000000111 */
        imm25 = imm25 << 10 | 0b1101010100;            /* imm10 */
        /* == 0b00000000000000000001111101010100 */
        imm25 = imm25 << 11 | 0b00111111000;           /* imm11 */
        /* == 0b00000000111110101010000111111000 */
        imm25 = imm25 << 1  | 0;                       /* lowest bit is always 0 */
        /* == 0b00000001111101010100001111110000 */
        op->v.bl_imm32 = sign_extend32_(imm25, 25);
        /* == 0b11111111111101010100001111110000 */
        /* == -0xABC10 */
        op->ty = thumb_op_bl;

To check the results, let’s remember that imm32 is an offset from the PC-value of the instruction, which is equal to its address plus 4:

00000002 <first>:
       2:       3001            adds    r0, #1
       4:       f0ab fe02       bl      +0xABC04
        ; PC-value == 0x4 + 0x4 == 0x8
        ; target address = 0x8 + 0xABC04 == 0xABC0C == address of <second>
       8:       382a            subs    r0, #42
000abc0c <second>:
   abc0c:       3802            subs    r0, #2
   abc0e:       f754 f9f8       bl      -0xABC10
        ; PC-value == 0xABC0E + 0x4 == 0xABC12
        ; target address = 0xABC12 + (-0xABC10) == 0x2 == address of <first>
   abc12:       30e4            adds    r0, #228

Conditional instructions: 16-bit Branch (`B<c>`)

Cortex M0+ has only one conditional instruction: 16-bit B<c> with a jump range of -256 to +254 bytes. You can select one of 14 checks [A6.3]:

enum thumb_cond {
    thumb_cond_eq = 0b0000, /* Z == 1; equal */
    thumb_cond_ne = 0b0001, /* Z == 0; not equal */
    thumb_cond_cs = 0b0010, /* C == 1; carry set */
    thumb_cond_cc = 0b0011, /* C == 0; carry clear */
    thumb_cond_mi = 0b0100, /* N == 1; minus, negative */
    thumb_cond_pl = 0b0101, /* N == 0; plus, positive or zero */
    thumb_cond_vs = 0b0110, /* V == 1; overflow */
    thumb_cond_vc = 0b0111, /* V == 0; no overflow */
    thumb_cond_hi = 0b1000, /* C == 1 and Z == 0; unsigned higher */
    thumb_cond_ls = 0b1001, /* C == 0 or Z == 1; unsigned lower or same */
    thumb_cond_ge = 0b1010, /* N == V; signed greater than or equal */
    thumb_cond_lt = 0b1011, /* N != V; signed less than */
    thumb_cond_gt = 0b1100, /* Z == 0 and N == V; signed greater than */
    thumb_cond_le = 0b1101, /* Z == 1 or N != V; signed less than or equal */
    thumb_cond_al = 0b1110, /* always (unconditional); never encoded */

    /* synonyms in assembly */
    thumb_cond_hs = thumb_cond_cs, /* unsigned higher or same */
    thumb_cond_ld = thumb_cond_cc  /* unsigned lower */
};

Those conditions check flags in the APSR register [A2.3.2]. Most instructions update it, with precise mechanism described in their Operation pseudocode.

This instruction has two encodings: conditional and unconditional with bigger range. Let’s start with the first one:

Encoding T1:
|15|14|13|12|11|10|9|8|7|6|5|4|3|2|1|0|
|-------------------------------------|
| 1| 1| 0| 1|    cond |          imm8 |

if cond == '1110' then UNDEFINED;
if cond == '1111' then SEE SVC;
imm32 = SignExtend(imm8:'0', 32);

These two conditions in the pseudocode are very important. The first one enforces the rule that the condition AL (always) is never encoded in instructions; there is an unconditional encoding T2 for such cases. UNDEFINED triggers an Undefined instruction exception. The second condition says that an encoding with bits[15:8] == 0b11011111 is actually an SVC instruction, not B<c> [A6.7.68].

The second encoding is straightforward:

Encoding T2:
|15|14|13|12|11|10|9|8|7|6|5|4|3|2|1|0|
|-------------------------------------|
| 1| 1| 1| 0| 0|                imm11 |

imm32 = SignExtend(imm11:'0', 32);

To start decoding, let’s extend struct thumb_op:

enum thumb_op_ty {
    /* ... */
    thumb_op_b,
    /* ... */
};
struct thumb_op {
    enum thumb_op_ty ty;
    union {
        /* ... */
        struct {
            enum thumb_cond cond;
            int32_t imm32;
        } b;
        /* ... */
    } v;
};

Side note: gcc has a check -Wswitch-enum, which will warn you if you extended a struct and forgot to handle the new variants in a switch.

Now decoding is as simple as:

int thumb_decode(uint16_t hw1, uint16_t hw2, struct thumb_op * op)
{
    if (op_is_32_bit_(hw1)) {
        /* ... */
        return 4;
    } else { /* 16-bit instructions */
        if (bits_(hw1, 15, 12) == 0b1101
                && bits_(hw1, 11, 8) != 0b1111
                && bits_(hw1, 11, 8) != 0b1110) {
            /* A6-119: T1: B<c> <label> */
            op->ty = thumb_op_b;
            op->v.b.cond = bits_(hw1, 11, 8);
            op->v.b.imm32 = sign_extend32_(bits_(hw1, 7, 0) << 1, 9);
        } else if (bits_(hw1, 15, 11) == 0b11100) {
            /* A6-119: T2: B<c> <label> */
            op->ty = thumb_op_b;
            op->v.b.cond = thumb_cond_al;
            op->v.b.imm32 = sign_extend32_(bits_(hw1, 10, 0) << 1, 12);
        }
        /* ... */
        return 2;
    }
}

The check bits_(hw1, 11, 8) != 0b1111 && bits_(hw1, 11, 8) != 0b1110 mirrors the conditions in the pseudocode. It also ensures that the value in cond is a valid enumeration variant. The left shift in bits_(hw1, 7, 0) << 1 is the same as concatenation imm8:'0'.

Here is an example:

Disassembly of section .text:
00000100 <demo>:
 100:   2200            movs    r2, #0
 102:   2364            movs    r3, #100
00000104 <.loop>:
 104:   3201            adds    r2, #1
        ...
 1a6:   3b01            subs    r3, #1
 1a8:   d1ac            bne.n   104 <.loop> ; PC-value == 0x1AC

The suffix .n means narrow for 16-bit encoding. Objdump prints it to avoid confusion when a 32-bit encoding may be available, which is not the case for our core.

struct thumb_op op1;
/* 1a8:   d1ac            bne.n   104 <.loop> */
thumb_decode(0b1101000110101100, 0b0, &op1);

Encoding T1:
|15|14|13|12|11|10|9|8|7|6|5|4|3|2|1|0|
|--|--|--|--|---------|---------------|
| 1| 1| 0| 1|    cond |          imm8 |
|--|--|--|--|---------|---------------|
| 1| 1| 0| 1|    0001 |      10101100 |

As usual we match the encoding:

        bits_(hw1, 15, 12) == 0b1101
        bits_(hw1, 11, 8) == 0b0001
        if (bits_(hw1, 15, 12) == 0b1101
                && bits_(hw1, 11, 8) != 0b1111
                && bits_(hw1, 11, 8) != 0b1110) {
            /* it's B<c>! */
        }

… and unpack the fields:

op->ty = thumb_op_b;
op->v.b.cond = 0b0001; /* == thumb_cond_ne */
/* 0b10101100 << 1  == 0b101011000 */
op->v.b.imm32 = sign_extend32_(0b10101100 << 1, 9);
/* sign_extend32_(0b101011000, 9) == 0b11111111111111111111111101011000 */
/* == -0xA8 */

The PC-value of our branch is 0x1AC, so the target address is 0x1AC - 0xA8 == 0x104, which is precisely the address of the label .loop.

Instructions with multiple registers: 16-bit `PUSH` and `LDM`

This core has four instructions which can load/store multiple registers in one go [A4.7]. They save program memory and shave off some cycles compared to fetching and decoding multiple instructions. Those instructions are divided into two logical groups: PUSH / POP and STM / LDM.

PUSH and POP implement a full-descending stack, which is the default stack type for Arm [6]. Descending refers to the direction of stack growth: the value of the stack pointer (sp aka r13) after reset is at the top of the stack space, and new items are placed at decreasing addresses. Full stack means that sp always points to the last pushed item; initially sp points to the byte above the allowed stack space. PUSH accepts a list of up to nine different registers (r0-r7 + r14 aka lr). It decrements sp by the total size of the passed registers, then stores them one before another: the first register will be at the new sp, while the last one will be just below the old sp. POP does the inverse of that, except that instead of lr it accepts r15 aka pc, which lets you restore the stack frame and jump to the return address in one instruction.

; SP == 20
push {r0-r1, r4, lr}
; mem[addrs 7:4] = r0
; mem[addrs 11:8] = r1
; mem[addrs 15:12] = r4
; mem[addrs 19:16] = lr
; SP = 4
pop {r0-r1, r4, pc}
; r0 = mem[addrs 7:4]
; r1 = mem[addrs 11:8]
; r4 = mem[addrs 15:12]
; pc = mem[addrs 19:16] (jump)
; SP == 20

STM and LDM are similar but different. They let you change the base register but they can’t store/load lr/pc, accepting only up to eight registers. STM, unlike PUSH, places the registers at incrementing addresses above the base register. LDM on the other hand, behaves exactly like POP, loading from incrementing addresses above base. Another difference is that you can choose to leave the base register unchanged. This makes them useful for things like copying a struct:

; struct foo { int a, b, c, d, e, f } x, y;
; r0 == &x, r1 == &y
ldm r0, {r2-r7}
; r2 == x.a, r3 == x.b, ...
stm r1, {r2-r7}
; y.a == r2, y.b == r3, ...

Beware that both of those instructions have synonyms which assemble to the same code: STMEA/STMIA for STM, and LDMFD/LDMIA for LDM.

To decode those instructions let’s first enumerate the registers:

/* see [A2.3] */
enum arm_reg {
    arm_r0 = 1 << 0,
    arm_r1 = 1 << 1,
    arm_r2 = 1 << 2,
    arm_r3 = 1 << 3,
    arm_r4 = 1 << 4,
    arm_r5 = 1 << 5,
    arm_r6 = 1 << 6,
    arm_r7 = 1 << 7,
    arm_r8 = 1 << 8,
    arm_r9 = 1 << 9,
    arm_r10 = 1 << 10,
    arm_r11 = 1 << 11,
    arm_r12 = 1 << 12,
    arm_r13 = 1 << 13,
    arm_r14 = 1 << 14,
    arm_r15 = 1 << 15,

    arm_sp = arm_r13,
    arm_lr = arm_r14,
    arm_pc = arm_r15
};
enum arm_reg arm_reg_int(int v)
{
    return 1 << v;
}

You will see in a moment that this enum mirrors how multiple registers are encoded as bit positions: bit[0] is for r0, bit[7] for r7 and so on. arm_reg_int will come into play when decoding the base register number.

enum thumb_op_ty {
    /* ... */
    thumb_op_push,
    thumb_op_ldm
    /* ... */
};
struct thumb_op {
    enum thumb_op_ty ty;
    union {
        /* ... */
        uint16_t push_regs;
        struct {
            enum arm_reg rn;
            uint16_t regs;
            int wback;
        } ldm;
        /* ... */
    } v;
};

Note how push_regs and ldm.regs use a plain uint16_t instead of enum arm_reg. That is because they will hold a bitmask made up of multiple enumeration variants, while an enum type would suggest that the value exactly matches one of the variants.

Now that everything is prepared, let’s first decode PUSH [A6.7.50]:

Encoding T1:
|15|14|13|12|11|10|9|8|7|6|5|4|3|2|1|0|
|-------------------------------------|
| 1| 0| 1| 1| 0| 1|0|M| register_list |

registers = '0':M:000000:register_list;
UnalignedAllowed = FALSE;
if BitCount(registers) < 1 then UNPREDICTABLE;

registers here is a bitmask where bit N selects register N: if registers[0] and registers[14] are 1, PUSH will save registers r0 and r14 (lr). The concatenation in the pseudocode is just another way of saying that bit M selects register r14. The check for BitCount(registers) < 1) specifies that you must select at least one register.

We could also rephrase this diagram to make register_list more obvious:

|15|14|13|12|11|10| 9| 8| 7| 6| 5| 4| 3| 2| 1| 0|
|-----------------------------------------------|
| 1| 0| 1| 1| 0| 1| 0|lr|r7|r6|r5|r4|r3|r2|r1|r0|

With all that, decoding is pretty straightforward:

int thumb_decode(uint16_t hw1, uint16_t hw2, struct thumb_op * op)
{
    if (op_is_32_bit_(hw1)) {
        /* ... */
        return 4;
    } else { /* 16-bit instructions */
        /* ... */
        } else if (bits_(hw1, 15, 9) == 0b1011010) {
            /* A6-167: T1: PUSH <registers> */
            op->ty = thumb_op_push;
            op->v.push_regs = bits_(hw1, 7, 0) | bit_(hw1, 8) << 14;
        }
        /* ... */
        return 2;
    }
}

LDM has a similar encoding [A6.7.25] but with a couple of nuances:

Encoding T1:
|15|14|13|12|11|10|9|8|7|6|5|4|3|2|1|0|
|-------------------------------------|
| 1| 1| 0| 0| 1|   Rn | register_list |

n = UInt(Rn);
registers = '00000000':register_list;
wback = (registers<n> == '0')
If BitCount(registers) < 1 then UNPREDICTABLE;

Rn here is the number of the base register i.e., Rn == 0 for r0, 3 for r3 and so on. Note that Rn is a 3-bit field which means that this instruction can only use registers r0-r7, also known as low registers; this is the case for most of Thumb instructions. n = UInt(Rn) is just a formality to say that n is the number of the base register.

The definition of wback suggests that the instruction must increment the base register if you do not load to it: registers<n> == '0' is only going to be true if the bit at position Rn is not selected: suppose r0 = 0x10, mem[0x10]=0x42 before you try to execute ldm! r0, {r0, r1}; if that was allowed and you could writeback to a register present in the list, the value of r0 after the instruction will depend on how it is implemented in hardware – it could be either 0x10 + 8 = 0x18, 0x42 or even 0x42 + 8 = 0x4A. This has a side effect that Rn will be modified in any case: either by incrementing (wback) or by loading from memory.

int thumb_decode(uint16_t hw1, uint16_t hw2, struct thumb_op * op)
{
        /* ... */
        } else if (bits_(hw1, 15, 11) == 0b11001) {
            /* A6-137: T1: LDM <Rn>, <registers> or LDM <Rn>! <registers> */
            op->ty = thumb_op_ldm;
            op->v.ldm.rn = arm_reg_int(bits_(hw1, 10, 8));
            op->v.ldm.regs = bits_(hw1, 7, 0);
            op->v.ldm.wback = ! (op->v.ldm.regs & op->v.ldm.rn);
        }
        /* ... */
}

See how we can check if a register is selected with a simple bitwise AND: op->v.ldm.regs & op->v.ldm.rn. This is thanks to the fact that each register is stored at a different bit position in the mask, and that arm_reg_int() converts a register number to that bit position.

And one more time let’s go look at some real disassembly:

0: b5d0 push {r4, r6, r7, lr}

`PUSH`: 0b1011010111010000
|15|14|13|12|11|10| 9| 8| 7| 6| 5| 4| 3| 2| 1| 0|
|--|--|--|--|--|--|--|--|-----------------------|
| 1| 0| 1| 1| 0| 1| 0| M|         register_list |
|--|--|--|--|--|--|--|--|-----------------------|
|  |  |  |  |  |  |  |lr|r7|r6|r5|r4|r3|r2|r1|r0|
|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|
| 1| 0| 1| 1| 0| 1| 0| 1| 1| 1| 0| 1| 0| 0| 0| 0|

2: cf4e ldmia r7!, {r1, r2, r3, r6}

`LDMIA` (same as `LDM`): 0b1100111101001110
|15|14|13|12|11|10| 9| 8| 7| 6| 5| 4| 3| 2| 1| 0|
|--|--|--|--|--|--------|-----------------------|
| 1| 1| 0| 0| 1|     Rn |         register_list |
|--|--|--|--|--|--------|-----------------------|
|  |  |  |  |  |        |r7|r6|r5|r4|r3|r2|r1|r0|
|--|--|--|--|--|--------|--|--|--|--|--|--|--|--|
| 1| 1| 0| 0| 1|   1110 | 0| 1| 0| 0| 1| 1| 1| 0|

Now that we have figured out all essential instruction groups. Now you can extend the decoder to rest of instructions. A couple of tips before you start.

First, add a fallback clause so that unknown instructions won’t return garbage:

enum thumb_op_ty {
    /* ... */
    thumb_op_word,
    thumb_op_hword
    /* ... */
};
struct thumb_op {
    enum thumb_op_ty ty;
    union {
        /* ... */
        uint32_t word_val;
        uint16_t hword_val;
        /* ... */
    } v;
};
int thumb_decode(uint16_t hw1, uint16_t hw2, struct thumb_op * op)
{
    if (op_is_32_bit_(hw1)) {
        /* ... */
        } else {
            op->ty = thumb_op_word;
            op->v.word_val = hw1 | hw2 << 16;
        }
        return 4;
    } else {
        /* ... */
        } else {
            op->ty = thumb_op_hword;
            op->v.hword_val = hw1;
        }
        return 2;
    }
}

thumb_op_word and thumb_op_hword mirror the assembly directives .word and .hword which paste the literal bytes given.

Second tip is to always put instructions with more fixed bits above those with fewer: e.g. we check 8 bits for SVC (bits_(hw1, 15, 8) == 0b11011111) so it must go before the check for B<c> with only 4 fixed bits besides the restrictions on cond (bits_(hw1, 15, 12) == 0b1101). This will help avoid misinterpretation when a less specific encoding turns out to be a prefix of a larger one.

Putting it to work: printing and main()

Up until now we had to validate our decoder by hand, so let’s add a disassembler to our disassembler:

static const char * thumb_reg_str_(enum arm_reg v);
static void thumb_print_regs_(uint16_t v);
void thumb_print_disass(struct thumb_op * op, uint32_t addr)
{
    /* see PC value of an instruction @ A4-68: A4.2.1 */
    uint32_t pc = addr + 4;
    switch (op->ty) {
    case thumb_op_bl:
        printf(
            "bl        %x",
            pc + op->v.bl_imm32);
        break;
    case thumb_op_b:
        if (op->v.b.cond != thumb_cond_al) { /* Encoding T1 */
            printf(
                "b%s.n     %x",
                thumb_cond_str_(op->v.b.cond),
                pc + op->v.b.imm32);
        } else { /* Encoding T2 */
            printf(
                "b.n       %x",
                pc + op->v.b.imm32);
        }
        break;
    case thumb_op_push:
        printf("push      ");
        thumb_print_regs_(op->v.push_regs);
        break;
    case thumb_op_ldm:
        printf("ldm%c      %s ",
            op->v.ldm.wback ? '!' : ' ',
            thumb_reg_str_(op->v.ldm.rn));
        thumb_print_regs_(op->v.ldm.regs);
        break;
    case thumb_op_word:
        printf(".word     0x%08x", op->v.word_val);
        break;
    case thumb_op_hword:
        printf(".hword    0x%04x", op->v.hword_val);
        break;
    };
}

This function is simple, but be careful to follow the rules on what mnemonic to choose in the description of each instruction. Some instructions (like ADD) might allow multiple encodings; for those you might want to save the decoded encoding variant to ensure the disassembly will not loose the original intent.

static const char * thumb_reg_str_(enum arm_reg v)
{
    switch (v) {
    case arm_r0:
        return "r0";
    /* ... */
    case arm_r15:
        return "r15";
    };
}
static void thumb_print_regs_(uint16_t v)
{
    int i;
    int comma = 0;
    printf("{");
    for (i = 0; i <= 15; i++) {
        if (bit_(v, i)) {
            printf("%sr%i", comma ? ", " : "", i);
            comma = 1;
        }
    }
    printf("}");
}

And finally we can tie everything together and decode an actual file:

#include <stdint.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/stat.h>

static void print_bin16_(uint16_t v);
static int read_file_(const char * name, uint8_t ** data);
int main(int argc, char ** argv)
{
    uint8_t * file;
    int file_size;
    int addr;
    int len;
    uint16_t hw1, hw2;
    struct thumb_op op;

    if (argc < 2) {
        return 1;
    }
    file_size = read_file_(argv[1], &file);
    if (file_size < 0) {
        return 2;
    }

    addr = 0;
    while (1) {
        if (addr + 1 > file_size) {
            break;
        }
        hw1 = file[addr] | file[addr + 1] << 8;
        if (addr + 3 <= file_size) {
            hw2 = file[addr + 2] | file[addr + 3] << 8;
        }

        len = thumb_decode(hw1, hw2, &op);
        if (len == 2) {
            printf("%5x: %04x      ", addr, hw1);
            print_bin16_(hw1);
            printf("                     ");
            thumb_print_disass(&op, addr);
        } else if (len == 4) {
            printf("%5x: %04x %04x ", addr, hw1, hw2);
            print_bin16_(hw1);
            printf(" ");
            print_bin16_(hw2);
            printf("    ");
            thumb_print_disass(&op, addr);
        }
        addr += len;
        putchar('\n');
    }

    free(file);

    return 0;
}

static void print_bin16_(uint16_t v)
{
    int i;
    for (i = sizeof(v) * 8 - 1; i >= 0; i--) {
        putchar(((v >> i) & 1) ? '1' : '0');
    }
}

static int read_file_(const char * name, uint8_t ** data)
{
    int fd = -1;
    struct stat sb;
    fd = open(name, O_RDONLY);
    if (fd < 0) {
        perror("open");
        goto err;
    }
    if (fstat(fd, &sb) == -1) {
        perror("fstat");
        goto err;
    }
    if (sb.st_size > 0) {
        *data = malloc(sb.st_size);
    } else {
        goto err;
    }
    if (*data == NULL) {
        perror("malloc");
        goto err;
    }
    if (read(fd, *data, sb.st_size) != sb.st_size) {
        goto err;
    }
    return sb.st_size;
err:
    if (fd >= 0) {
        close(fd);
    }
    if (*data != NULL) {
        free(*data);
    }
    return -1;
}

This program works only with raw binary program images, so you should use objcopy if you want to disassemble an object file:

arm-none-eabi-objcopy -O binary a.o a.bin

All together this looks like this:

 0: f000 f800 1111000000000000 1111100000000000    bl        4
 4: b438      1011010000111000                     push      {r3, r4, r5}
 6: c838      1100100000111000                     ldm!      r0 {r3, r4, r5}
 8: 2200      0010001000000000                     movs      r2, #0
 a: 2364      0010001101100100                     movs      r3, #100
 c: 3201      0011001000000001                     adds      r2, #1
 e: 3b01      0011101100000001                     subs      r3, #1
10: d1fc      1101000111111100                     bne.n     c

Et voila! In a following post I will show how to parse an actual object file to disassemble it with symbol names and annotations for raw data.

Appendix: Bit twiddling deep dive

First, we need a function to work with multiple integers packed inside a single halfword:

static uint16_t bits_(uint16_t v, int hi, int lo)
{
    return (v >> lo) & ~(~0 << (hi - lo + 1));
}

This function extracts a right adjusted bit field that begins at bit index lo and ends at hi (inclusive) (see 2.9 Bitwise operations in [4]):

bits_(0b01010, 4, 0) == 0b01010;
bits_(0b11010111, 6, 4) == 0b101;

The expression v >> lo moves the first bit of our field to the bit zero of the halfword and discards all extra bits to the right of the field:

11010111 >> 4 == 00001101:
| bit #    | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
|----------|-------------------------------|
| operand  | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 |
|----------|-------------------------------|
|          | added (4)     |               | discarded (4)
|----------|---------------|---------------|
| result   | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 |

~0 is all 1 bits, shifting it left by the width of our bit field places that number of zeroes at the start, and negating that gives a mask with 1 bits for each bit of our bit field:

~(~0 << (6 - 4 + 1)) == 00000111
| bit #      | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
|------------|---|---|---|---|---|---|---|---|
| 0          | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
|------------|---|---|---|---|---|---|---|---|
| ~0         | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
|------------|---|---|---|---|---|---|---|---|
| ~0 << 3    | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
|------------|---|---|---|---|---|---|---|---|
| ~(~0 << 3) | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |

We then apply this mask with a binary AND to set all the extra bits to the left of our bit field to 0:

(11010111 >> 4) & ~(~0 << (6 - 4 + 1)) == 00000101
| bit #                  | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
|------------------------|---|---|---|---|---|---|---|---|
| v >> lo                | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 |
|------------------------|---|---|---|---|---|---|---|---|
| ~(~0 << (hi - lo + 1)) | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
|------------------------|---|---|---|---|---|---|---|---|
| &                      | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |

Let us also define a simpler function for extracting singular bits:

static uint32_t bit_(uint32_t v, int n)
{
    return (v >> n) & 1;
}

It places the bit number n to be the bit zero of the word and uses a mask with only the first bit 1 to discard the extra bits.

Some instructions encode signed numbers in bit fields. bits_() will work fine for positive numbers but we have to sign-extend the negative numbers.

Suppose we have a signed 4-bit field with a decimal -3 encoded as two’s-complement as 0b1101. Without sign-extend it will be represented as 0b00001101 in an 8 bit integer which is equal to decimal 13.

To sign-extend a bit field of variable width we just need to fill the higher bits of the word with the most significant bit of the field [3]:

static int32_t sign_extend32_(uint32_t v, unsigned int b)
{
    uint32_t m = (1U << (b - 1));
    v = v & ((1U << b) - 1);
    return (v ^ m) - m;
}

The expression v = v & ((1U << b) - 1) clears extra bits that are higher than our field:

0b10101101 & ((1U << 4) - 1) == 0b00001101:
| bit #         | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
|---------------|---|---|---|---|---|---|---|---|
| 1U << 4       | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
|---------------|---|---|---|---|---|---|---|---|
| (1U << 4) - 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
|---------------|---|---|---|---|---|---|---|---|
| v             | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 1 |
|---------------|---|---|---|---|---|---|---|---|
| &             | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 |

m = (1U << (b - 1)) creates a mask with a 1 bit placed at the most significant bit of our field: m == 0b1000.

The expression v ^ m can be interpreted mathematically as adding the smallest possible number with width b:

m == 0b1000 == -8
v == 0b1101 == -3
0b1101 ^ 0b1000 == 0b0101 == 5
just like -3 + (-8) == 5
or for positive:
v == 0b0110 == 6
0b0110 ^ 0b1000 == 0b1110 == -2
just like 6 + (-8) == -2

We cannot simply use an addition operator because the result will overflow to higher bits for negative numbers:

0b1101 + 0b1000 == 0b10101 == 21

This one XOR basically replaces two operations: an addition and a mask:

(0b1101 + 0b1000) & 0b1111
== 0b10101 & 0b01111
== 0b0101
== 5

We then subtract m to reverse our previous addition but at this time we specifically want it to cause an overflow:

m == 0b1000 == -8
v == 0b1101 == -3
0b1101 ^ 0b1000 == 0b00000101 == 5
0b00000101 - 0b00001000 == 0b11111101 == -3

Literature

[1] Armv6-M Architecture Reference Manual (DDI0419C)
https://developer.arm.com/documentation/ddi0419/latest
[2] Arm Instruction Set Reference Guide (100076)
https://developer.arm.com/documentation/100076/0100/
[3] Sean Eron Anderson: Bit Twiddling Hacks:
Sign extending from a variable bit-width
http://graphics.stanford.edu/~seander/bithacks.html#VariableSignExtend
[4] Ritchie & Kernighan: The C Programming Language (Second edition):
https://www.cs.sfu.ca/~ashriram/Courses/CS295/assets/books/C_Book_2nd.pdf

Andrei Liungrin