In this post I will show how to recreate objdump from binutils for Armv6-M.
You will not only learn to decode Thumb instructions, but also to understand
their semantics and to navigate Arm reference manuals.
I will describe all steps in great detail so you would only need a minimal background in C to follow along. If you are learning to program microcontrollers and want to understand what’s happening behind the scenes – this post is for you.
Diving into the lower levels is really not that scary. You just have to be brave enough to dispel the magic and see how simple the base of our technology is if you have the right ancient books. Getting close to the machine gives an exciting sense of affinity and helpful insights for daily work.
For this post I will specifically target Arm Cortex M0+ – a small power-efficient Armv6-M core released in 2012 which is used in microcontrollers like STM32G0. It does not support the 32 bit Arm instruction set and instead only has 16-bit Thumb instructions with a small number of 32-bit instructions introduced with Thumb-2 technology.
The most important book for us here is the Armv6-M Architecture Reference
Manual (DDI0419C) [1]. Besides all the useful information it contains a
complete set of all instructions supported by this core with their encodings,
differences to other cores and even operation pseudocode. I will refer to this
document later as A6-123 (chapter-page) or A1.1.1 (chapter). You can also
check out the Arm Instruction Set Reference Guide (100076) [2] for a complete
list of instructions.
Each Thumb instruction is either a single 16-bit halfword or a 32-bit
instruction consisting of two consecutive halfwords. Instruction fetches are
always little-endian even if the memory system is configured as big-endian
[A3.3.1]. This means that the least significant byte of a halfword is stored at
address A, and most significant at A+1: e.g. an instruction 0x464e is
stored in memory as mem[addr 0x0] = 0x4e, mem[addr 0x1] = 0x46.
Each halfword must be aligned in memory to a multiple of two, which means that
you cannot put instructions at addresses like 0x3 or 0x127, only 0x2 and
0x128. Note that 32-bit instructions do not require an alignment of 4 as they
are fetched as two independent halfwords.
The least significant bit of the Program Counter (PC) selects the execution
state: 0 for Arm state, and 1 for Thumb. Since this core only supports
Thumb state, bit[0] must always be 1 or else a HardFault occurs. This bit
is ignored when calculating the fetch address so a branch to 0x123 will fetch
an instruction from 0x122 [A4.1.1].
Now onto decoding! First, let’s create a struct to hold decoded instructions:
enum thumb_op_ty {
thumb_op_movs_imm,
thumb_op_bl
/* ... */
};
struct thumb_op {
enum thumb_op_ty ty;
union {
uint16_t push_regs;
struct {
uint16_t rd;
uint32_t imm32; /* 32 bit wide immediate value */
} movs_imm;
int32_t bl_imm32;
/* ... */
} v;
};
This is a polymorphic structure that can represent any instruction based on
ty. The union v here contains the operands according to ty: v.bl_imm32
is for ty == thumb_op_bl, v.movs_imm.rd is for ty == thumb_op_movs_imm,
etc. To work with such structures a switch statement on ty is often used.
It is important to always check ty before accessing values from union v.
Otherwise you will get misinterpreted values as all union fields are located
at the same address.
Next, let’s define our main function:
int thumb_decode(uint16_t hw1, uint16_t hw2, struct thumb_op * op)
{
/* ... */
}
It accepts two halfwords hw1 and hw2 that represent two consecutive
halfwords from memory, and puts the decoded instruction in op. The return
value is the width of the instruction in bytes, i.e. by how much to increment
the address for the next call.
Throughout the following sections I will use some bit-twiddling functions:
- bits_(v, hi, lo) returns a bitfield x = v[hi:lo] such that bit at
position hi in v becomes the highest bit of the result, and bit lo
becomes bit zero, counting from zero:
bits_(0b11110010100111, 7, 5) == 0b101
- bit_(v, n) returns one bit at position n from v, counting from zero:
bit_(0b00100, 2) == 0b1
- sign_extend32_(v, b) extends a signed two’s complement bitfield v from a
width of b bits to n bits by adding copies of the highest bit from v to
fill the extra bits:
sign_extend32_(0b1101, 4) == 0b11111111111111111111111111111101
static uint16_t bits_(uint16_t v, int hi, int lo)
{
return (v >> lo) & ~(~0 << (hi - lo + 1));
}
static uint32_t bit_(uint32_t v, int n)
{
return (v >> n) & 1;
}
static int32_t sign_extend32_(uint32_t v, int b)
{
uint32_t m = (1U << (b - 1));
v = v & ((1U << b) - 1);
return (v ^ m) - m;
}
I will describe those functions in great detail in the appendix on Helper
functions.
BL)Cortex M0+ supports seven 32-bit instructions from Thumb-2: MSR, MRS, DSB,
DMB, ISB, UDF and BL [A5.3]. We can detect them by checking the highest
5 bits of the first halfword: if bit[15:11] is one of 0b11101, 0b11110 or
0b11111, we must fetch the second halfword; all other instructions are
16-bit [A5.1].
static int op_is_32_bit_(uint16_t hw1)
{
return bits_(hw1, 15, 11) == 0b11101
|| bits_(hw1, 15, 11) == 0b11110
|| bits_(hw1, 15, 11) == 0b11111);
}
int thumb_decode(uint16_t hw1, uint16_t hw2, struct thumb_op * op)
{
if (op_is_32_bit_(hw1)) {
/* extended below */
return 4;
} else { /* 16-bit instructions */
/* ... */
return 2;
}
}
I will only cover the BL instruction since its encoding is the most complex.
The rest is left as an exercise to the reader.
The Branch with Link instruction BL calls a subroutine at an address relative
to the current program counter (aka PC-relative address) [A6.7.13]. It
encodes the address as a 25-bit even signed integer. The lowest bit is not
encoded and is always 0 to double the range at no cost thanks to the fact
that instructions must be aligned to a multiple of two, i.e. at even addresses.
I will call the encoded relative address imm32. The mnemonic is bl <label>,
and the address is calculated as an offset from the PC value of the BL
instruction to this assembly label.
stuff:
mov r0, r0
my_proc:
add r1, r1, #42 ; mem_addr = 2, PC of an instruction = (2 + 4) | 0x1 = 7
bx lr ; mem_addr = 4, PC of an instruction = (4 + 4) | 0x1 = 9
my_caller:
mov r1, #10 ; mem_addr = 6, PC of an instruction = (6 + 4) | 0x1 = 11
bl my_proc ; mem_addr = 8, PC of an instruction = (8 + 4) | 0x1 = 13
Note that the PC value of an instruction is its address plus 4 [A4.2.1], even
if the instruction itself takes only 2 bytes. This has to do with the pipeline
architecture where the fetch stage for the next instruction is in parallel
with the execute stage for the current instruction. In our example,
instruction bl my_proc is at address 8 in memory but its PC value is 13: add
4 and set the lowest bit (remember about Thumb state). The encoded offset is
2 (label addr) - 13 (PC) == -11 (imm32).
Now let’s get into decoding. For that we refer to the Encoding diagram from the chapter on the instruction [A6.7.13]:
| halfword 1 (`hw1`) | | halfword 2 (`hw2`) |
|15|14|13|12|11|10|9|8|7|6|5|4|3|2|1|0| |15|14|13|12|11|10|9|8|7|6|5|4|3|2|1|0|
|-------------------------------------|-|-------------------------------------|
| 1| 1| 1| 1| 0| S| imm10 | | 1| 1|J1| 1|J2| imm11 |
All bits explicitly marked with 0 or 1 must be set to that value, or else
the instruction at hand is a different one or undefined. Names such as S,
J1 or imm10 are used later to refer to the bitfields: hw1[9:0] == imm10,
hw2[11] == J2 (see also Appendix E.1 in [1]).
if (bits_(hw1, 15, 11) == 0b11110
&& bits_(hw2, 15, 14) == 0b11 && bit_(hw2, 12) == 0b1) {
/* ... */
}
Each Encoding diagram in the manual is followed by some pseudocode that translates the fields from the diagram into more meaningful values used in the instruction description and operational pseudocode:
I1 = NOT(J1 EOR S);
I2 = NOT(J2 EOR S);
imm32 = SignExtend(S:I1:I2:imm10:imm11:'0', 32);
This pseudocode uses some operators described in Appendix E.5 in [1]:
- NOT(x) negates each bit of x, e.g. NOT(0b1010) == 0b0101;
same as ~x in C;
- a EOR b computes a bitwise exclusive OR (aka XOR),
e.g. 0b1010 EOR 0b1100 == 0b0110; same as a ^ b in C;
- a:b concatenates two bitstrings by placing all bits from a to the left
(higher) of b, e.g. 0b101:0b011 == 0b101011;
same as (a << n_bits(b)) | b in C, where n_bits(b) is the width of b
in bits;
- SignExtend(x, n) extends a signed two’s complement bitfield x to a length
of n bits by adding copies of the highest bit from x to fill the extra
bits, e.g. SignExtend(0b1101, 8) == 0b11111101 (0b1101 == -3);
you will see the function sign_extend32_(x) == SignExtend(x, 32) in C,
which I will describe later in great detail.
Fields' pseudocode is often followed by a condition such as:
If InITBlock() && !LastInITBlock() then UNPREDICTABLE;
Those are meant for the IT (If-Then) instruction, which is not supported on
Armv6-M. InITBlock() and LastInITBlock() always return false, so you can
just ignore those checks [A6.3].
Now all we have to do is to extract the fields and concatenate them:
uint32_t s = bit_(hw1, 10);
uint32_t i1 = ~(bit_(hw2, 13) ^ s) & 1;
uint32_t i2 = ~(bit_(hw2, 11) ^ s) & 1;
uint32_t imm10 = bits_(hw1, 9, 0);
uint32_t imm11 = bits_(hw2, 10, 0);
uint32_t imm25 = 0;
imm25 = imm25 | s; /* bits[24] */
imm25 = imm25 << 1 | i1; /* bits[23] */
imm25 = imm25 << 1 | i2; /* bits[22] */
imm25 = imm25 << 10 | imm10; /* bits[21:11] */
imm25 = imm25 << 11 | imm11; /* bits[10:1] */
imm25 = imm25 << 1 | 0; /* bits[0] */
op->ty = thumb_op_bl;
op->v.bl_imm32 = sign_extend32_(imm25, 25);
Note the mask & 1 in i1 and i2: without it all bits above the first will
be set after negating, and doing imm25 << 1 | i1 will overwrite fields added
before.
Another way of writing the concatenation is with a union and bit-fields:
union {
struct {
unsigned int zero : 1;
unsigned int imm11 : 11;
unsigned int imm10 : 10;
unsigned int i2 : 1;
unsigned int i1 : 1;
unsigned int s : 1;
unsigned int zero2 : 7;
} b;
uint32_t w;
} imm32;
uint32_t s = bit_(hw1, 10);
imm32.w = 0;
imm32.b.imm11 = bits_(hw2, 10, 0);
imm32.b.imm10 = bits_(hw1, 9, 0);
imm32.b.i2 = ~(bit_(hw2, 11) ^ s);
imm32.b.i1 = ~(bit_(hw2, 13) ^ s);
imm32.b.s = s;
op->ty = thumb_op_bl;
op->v.bl_imm32 = sign_extend32_(imm32.w, 25);
… but it relies on two implementation-dependent mechanisms, some might call
it Undefined behavior squared. The standard does not specify the precise
layout in memory of the bit-fields, so different compilers are free to reorder
them. Then we try to interpret struct imm32.b as uint32_t. This is called
type punning and is also undefined: for example, on big-endian systems, bit
imm32.b.zero might be placed in the most significant byte of imm32.w,
resulting in nonsense value (see also 6.8 Unions and 6.9 Bit-fields in
[4]).
Now let’s walk through disassembling some real instruction codes that you can see in the second column of this neat disassembly (remember, to build a disassembler you will need a disassembler):
Disassembly of section .text:
00000000 <foo>:
0: 1c00 adds r0, r0, #0
00000002 <first>:
2: 3001 adds r0, #1
4: f0ab fe02 bl abc0c <second>
8: 382a subs r0, #42
...
000abc0a <bar>:
abc0a: 1c00 adds r0, r0, #0
000abc0c <second>:
abc0c: 3802 subs r0, #2
abc0e: f754 f9f8 bl 2 <first>
abc12: 30e4 adds r0, #228
Let’s decode our two BL instructions:
struct thumb_op op1, op2;
/* 4: f0ab fe02 bl abc0c <second> */
thumb_decode(0b1111000010101011, 0b1111111000000010, &op1);
/* abc0e: f754 f9f8 bl 2 <first> */
thumb_decode(0b1111011101010100, 0b1111100111111000, &op2);
| halfword 1 (`hw1`) | | halfword 2 (`hw2`) |
|15|14|13|12|11|10|9|8|7|6|5|4|3|2|1|0| |15|14|13|12|11|10|9|8|7|6|5|4|3|2|1|0|
|--|--|--|--|--|--|-------------------|-|--|--|--|--|--|----------------------|
| 1| 1| 1| 1| 0| S| imm10 | | 1| 1|J1| 1|J2| imm11 |
|--|--|--|--|--|--|-op1---------------|-|--|--|--|--|--|----------------------|
| 1| 1| 1| 1| 0| 0| 0010101011 |-| 1| 1| 1| 1| 1| 11000000010 |
|--|--|--|--|--|--|-op2---------------|-|--|--|--|--|--|----------------------|
| 1| 1| 1| 1| 0| 1| 1101010100 |-| 1| 1| 1| 1| 1| 00111111000 |
First we check the highest 5 bits of hw1 to see if the instruction is 32-bit:
bits_(hw1, 15, 11) == 0b11110;
if (bits_(hw1, 15, 11) == 0b11101
|| bits_(hw1, 15, 11) == 0b11110
|| bits_(hw1, 15, 11) == 0b11111) {
/* it's 32-bit! */
}
Then we match the encoding:
bits_(hw1, 15, 11) == 0b11110;
bits_(hw2, 15, 14) == 0b11;
bit_(hw2, 12) == 0b1;
if (bits_(hw1, 15, 11) == 0b11110
&& bits_(hw2, 15, 14) == 0b11 && bit_(hw2, 12) == 0b1) {
/* it's BL! */
}
… and finally decode the operand:
For op1:
uint32_t imm25 = 0;
imm25 = imm25 | 0b0; /* s */
/* == 0b00000000000000000000000000000000 */
imm25 = imm25 << 1 | ~(0b1 ^ 0b0) & 1; /* i1; 0b1 ^ 0b0 == 0b1 */
/* == 0b00000000000000000000000000000000 */
imm25 = imm25 << 1 | ~(0b1 ^ 0b0) & 1; /* i2 */
/* == 0b00000000000000000000000000000000 */
imm25 = imm25 << 10 | 0b0010101011; /* imm10 */
/* == 0b00000000000000000000000010101011 */
imm25 = imm25 << 11 | 0b11000000010; /* imm11 */
/* == 0b00000000000001010101111000000010 */
imm25 = imm25 << 1 | 0; /* lowest bit is always 0 */
/* == 0b00000000000010101011110000000100 */
op->v.bl_imm32 = sign_extend32_(imm25, 25);
/* == 0b00000000000010101011110000000100 */
/* == +0xABC04 */
op->ty = thumb_op_bl;
For op2:
uint32_t imm25 = 0;
imm25 = imm25 | 0b1; /* s */
/* == 0b00000000000000000000000000000001 */
imm25 = imm25 << 1 | ~(0b1 ^ 0b1) & 1; /* i1 */
/* == 0b00000000000000000000000000000011 */
imm25 = imm25 << 1 | ~(0b1 ^ 0b1) & 1; /* i2 */
/* == 0b00000000000000000000000000000111 */
imm25 = imm25 << 10 | 0b1101010100; /* imm10 */
/* == 0b00000000000000000001111101010100 */
imm25 = imm25 << 11 | 0b00111111000; /* imm11 */
/* == 0b00000000111110101010000111111000 */
imm25 = imm25 << 1 | 0; /* lowest bit is always 0 */
/* == 0b00000001111101010100001111110000 */
op->v.bl_imm32 = sign_extend32_(imm25, 25);
/* == 0b11111111111101010100001111110000 */
/* == -0xABC10 */
op->ty = thumb_op_bl;
To check the results, let’s remember that imm32 is an offset from the PC-value
of the instruction, which is equal to its address plus 4:
00000002 <first>:
2: 3001 adds r0, #1
4: f0ab fe02 bl +0xABC04
; PC-value == 0x4 + 0x4 == 0x8
; target address = 0x8 + 0xABC04 == 0xABC0C == address of <second>
8: 382a subs r0, #42
000abc0c <second>:
abc0c: 3802 subs r0, #2
abc0e: f754 f9f8 bl -0xABC10
; PC-value == 0xABC0E + 0x4 == 0xABC12
; target address = 0xABC12 + (-0xABC10) == 0x2 == address of <first>
abc12: 30e4 adds r0, #228
B<c>)Cortex M0+ has only one conditional instruction: 16-bit B<c> with a jump
range of -256 to +254 bytes. You can select one of 14 checks [A6.3]:
enum thumb_cond {
thumb_cond_eq = 0b0000, /* Z == 1; equal */
thumb_cond_ne = 0b0001, /* Z == 0; not equal */
thumb_cond_cs = 0b0010, /* C == 1; carry set */
thumb_cond_cc = 0b0011, /* C == 0; carry clear */
thumb_cond_mi = 0b0100, /* N == 1; minus, negative */
thumb_cond_pl = 0b0101, /* N == 0; plus, positive or zero */
thumb_cond_vs = 0b0110, /* V == 1; overflow */
thumb_cond_vc = 0b0111, /* V == 0; no overflow */
thumb_cond_hi = 0b1000, /* C == 1 and Z == 0; unsigned higher */
thumb_cond_ls = 0b1001, /* C == 0 or Z == 1; unsigned lower or same */
thumb_cond_ge = 0b1010, /* N == V; signed greater than or equal */
thumb_cond_lt = 0b1011, /* N != V; signed less than */
thumb_cond_gt = 0b1100, /* Z == 0 and N == V; signed greater than */
thumb_cond_le = 0b1101, /* Z == 1 or N != V; signed less than or equal */
thumb_cond_al = 0b1110, /* always (unconditional); never encoded */
/* synonyms in assembly */
thumb_cond_hs = thumb_cond_cs, /* unsigned higher or same */
thumb_cond_ld = thumb_cond_cc /* unsigned lower */
};
Those conditions check flags in the APSR register [A2.3.2]. Most instructions
update it, with precise mechanism described in their Operation pseudocode.
This instruction has two encodings: conditional and unconditional with bigger range. Let’s start with the first one:
Encoding T1:
|15|14|13|12|11|10|9|8|7|6|5|4|3|2|1|0|
|-------------------------------------|
| 1| 1| 0| 1| cond | imm8 |
if cond == '1110' then UNDEFINED;
if cond == '1111' then SEE SVC;
imm32 = SignExtend(imm8:'0', 32);
These two conditions in the pseudocode are very important. The first one
enforces the rule that the condition AL (always) is never encoded in
instructions; there is an unconditional encoding T2 for such cases.
UNDEFINED triggers an Undefined instruction exception. The second condition
says that an encoding with bits[15:8] == 0b11011111 is actually an SVC
instruction, not B<c> [A6.7.68].
The second encoding is straightforward:
Encoding T2:
|15|14|13|12|11|10|9|8|7|6|5|4|3|2|1|0|
|-------------------------------------|
| 1| 1| 1| 0| 0| imm11 |
imm32 = SignExtend(imm11:'0', 32);
To start decoding, let’s extend struct thumb_op:
enum thumb_op_ty {
/* ... */
thumb_op_b,
/* ... */
};
struct thumb_op {
enum thumb_op_ty ty;
union {
/* ... */
struct {
enum thumb_cond cond;
int32_t imm32;
} b;
/* ... */
} v;
};
Side note: gcc has a check -Wswitch-enum, which will warn you if you extended
a struct and forgot to handle the new variants in a switch.
Now decoding is as simple as:
int thumb_decode(uint16_t hw1, uint16_t hw2, struct thumb_op * op)
{
if (op_is_32_bit_(hw1)) {
/* ... */
return 4;
} else { /* 16-bit instructions */
if (bits_(hw1, 15, 12) == 0b1101
&& bits_(hw1, 11, 8) != 0b1111
&& bits_(hw1, 11, 8) != 0b1110) {
/* A6-119: T1: B<c> <label> */
op->ty = thumb_op_b;
op->v.b.cond = bits_(hw1, 11, 8);
op->v.b.imm32 = sign_extend32_(bits_(hw1, 7, 0) << 1, 9);
} else if (bits_(hw1, 15, 11) == 0b11100) {
/* A6-119: T2: B<c> <label> */
op->ty = thumb_op_b;
op->v.b.cond = thumb_cond_al;
op->v.b.imm32 = sign_extend32_(bits_(hw1, 10, 0) << 1, 12);
}
/* ... */
return 2;
}
}
The check bits_(hw1, 11, 8) != 0b1111 && bits_(hw1, 11, 8) != 0b1110 mirrors
the conditions in the pseudocode. It also ensures that the value in cond is a
valid enumeration variant. The left shift in bits_(hw1, 7, 0) << 1 is the
same as concatenation imm8:'0'.
Here is an example:
Disassembly of section .text:
00000100 <demo>:
100: 2200 movs r2, #0
102: 2364 movs r3, #100
00000104 <.loop>:
104: 3201 adds r2, #1
...
1a6: 3b01 subs r3, #1
1a8: d1ac bne.n 104 <.loop> ; PC-value == 0x1AC
The suffix .n means narrow for 16-bit encoding. Objdump prints it to avoid
confusion when a 32-bit encoding may be available, which is not the case for
our core.
struct thumb_op op1;
/* 1a8: d1ac bne.n 104 <.loop> */
thumb_decode(0b1101000110101100, 0b0, &op1);
Encoding T1:
|15|14|13|12|11|10|9|8|7|6|5|4|3|2|1|0|
|--|--|--|--|---------|---------------|
| 1| 1| 0| 1| cond | imm8 |
|--|--|--|--|---------|---------------|
| 1| 1| 0| 1| 0001 | 10101100 |
As usual we match the encoding:
bits_(hw1, 15, 12) == 0b1101
bits_(hw1, 11, 8) == 0b0001
if (bits_(hw1, 15, 12) == 0b1101
&& bits_(hw1, 11, 8) != 0b1111
&& bits_(hw1, 11, 8) != 0b1110) {
/* it's B<c>! */
}
… and unpack the fields:
op->ty = thumb_op_b;
op->v.b.cond = 0b0001; /* == thumb_cond_ne */
/* 0b10101100 << 1 == 0b101011000 */
op->v.b.imm32 = sign_extend32_(0b10101100 << 1, 9);
/* sign_extend32_(0b101011000, 9) == 0b11111111111111111111111101011000 */
/* == -0xA8 */
The PC-value of our branch is 0x1AC, so the target address is
0x1AC - 0xA8 == 0x104, which is precisely the address of the label .loop.
PUSH and LDMThis core has four instructions which can load/store multiple registers in one
go [A4.7]. They save program memory and shave off some cycles compared to
fetching and decoding multiple instructions. Those instructions are divided
into two logical groups: PUSH / POP and STM / LDM.
PUSH and POP implement a full-descending stack, which is the default
stack type for Arm [6]. Descending refers to the direction of stack growth: the
value of the stack pointer (sp aka r13) after reset is at the top of the
stack space, and new items are placed at decreasing addresses. Full stack means
that sp always points to the last pushed item; initially sp points to the
byte above the allowed stack space. PUSH accepts a list of up to nine
different registers (r0-r7 + r14 aka lr). It decrements sp by the total
size of the passed registers, then stores them one before another: the first
register will be at the new sp, while the last one will be just below the old
sp. POP does the inverse of that, except that instead of lr it accepts
r15 aka pc, which lets you restore the stack frame and jump to the return
address in one instruction.
; SP == 20
push {r0-r1, r4, lr}
; mem[addrs 7:4] = r0
; mem[addrs 11:8] = r1
; mem[addrs 15:12] = r4
; mem[addrs 19:16] = lr
; SP = 4
pop {r0-r1, r4, pc}
; r0 = mem[addrs 7:4]
; r1 = mem[addrs 11:8]
; r4 = mem[addrs 15:12]
; pc = mem[addrs 19:16] (jump)
; SP == 20
STM and LDM are similar but different. They let you change the base
register but they can’t store/load lr/pc, accepting only up to eight
registers. STM, unlike PUSH, places the registers at incrementing addresses
above the base register. LDM on the other hand, behaves exactly like POP,
loading from incrementing addresses above base. Another difference is that you
can choose to leave the base register unchanged. This makes them useful for
things like copying a struct:
; struct foo { int a, b, c, d, e, f } x, y;
; r0 == &x, r1 == &y
ldm r0, {r2-r7}
; r2 == x.a, r3 == x.b, ...
stm r1, {r2-r7}
; y.a == r2, y.b == r3, ...
Beware that both of those instructions have synonyms which assemble to the same
code: STMEA/STMIA for STM, and LDMFD/LDMIA for LDM.
To decode those instructions let’s first enumerate the registers:
/* see [A2.3] */
enum arm_reg {
arm_r0 = 1 << 0,
arm_r1 = 1 << 1,
arm_r2 = 1 << 2,
arm_r3 = 1 << 3,
arm_r4 = 1 << 4,
arm_r5 = 1 << 5,
arm_r6 = 1 << 6,
arm_r7 = 1 << 7,
arm_r8 = 1 << 8,
arm_r9 = 1 << 9,
arm_r10 = 1 << 10,
arm_r11 = 1 << 11,
arm_r12 = 1 << 12,
arm_r13 = 1 << 13,
arm_r14 = 1 << 14,
arm_r15 = 1 << 15,
arm_sp = arm_r13,
arm_lr = arm_r14,
arm_pc = arm_r15
};
enum arm_reg arm_reg_int(int v)
{
return 1 << v;
}
You will see in a moment that this enum mirrors how multiple registers are
encoded as bit positions: bit[0] is for r0, bit[7] for r7 and so on.
arm_reg_int will come into play when decoding the base register number.
enum thumb_op_ty {
/* ... */
thumb_op_push,
thumb_op_ldm
/* ... */
};
struct thumb_op {
enum thumb_op_ty ty;
union {
/* ... */
uint16_t push_regs;
struct {
enum arm_reg rn;
uint16_t regs;
int wback;
} ldm;
/* ... */
} v;
};
Note how push_regs and ldm.regs use a plain uint16_t instead of
enum arm_reg. That is because they will hold a bitmask made up of multiple
enumeration variants, while an enum type would suggest that the value exactly
matches one of the variants.
Now that everything is prepared, let’s first decode PUSH [A6.7.50]:
Encoding T1:
|15|14|13|12|11|10|9|8|7|6|5|4|3|2|1|0|
|-------------------------------------|
| 1| 0| 1| 1| 0| 1|0|M| register_list |
registers = '0':M:000000:register_list;
UnalignedAllowed = FALSE;
if BitCount(registers) < 1 then UNPREDICTABLE;
registers here is a bitmask where bit N selects register N: if registers[0]
and registers[14] are 1, PUSH will save registers r0 and r14 (lr).
The concatenation in the pseudocode is just another way of saying that bit M
selects register r14. The check for BitCount(registers) < 1) specifies that
you must select at least one register.
We could also rephrase this diagram to make register_list more obvious:
|15|14|13|12|11|10| 9| 8| 7| 6| 5| 4| 3| 2| 1| 0|
|-----------------------------------------------|
| 1| 0| 1| 1| 0| 1| 0|lr|r7|r6|r5|r4|r3|r2|r1|r0|
With all that, decoding is pretty straightforward:
int thumb_decode(uint16_t hw1, uint16_t hw2, struct thumb_op * op)
{
if (op_is_32_bit_(hw1)) {
/* ... */
return 4;
} else { /* 16-bit instructions */
/* ... */
} else if (bits_(hw1, 15, 9) == 0b1011010) {
/* A6-167: T1: PUSH <registers> */
op->ty = thumb_op_push;
op->v.push_regs = bits_(hw1, 7, 0) | bit_(hw1, 8) << 14;
}
/* ... */
return 2;
}
}
LDM has a similar encoding [A6.7.25] but with a couple of nuances:
Encoding T1:
|15|14|13|12|11|10|9|8|7|6|5|4|3|2|1|0|
|-------------------------------------|
| 1| 1| 0| 0| 1| Rn | register_list |
n = UInt(Rn);
registers = '00000000':register_list;
wback = (registers<n> == '0')
If BitCount(registers) < 1 then UNPREDICTABLE;
Rn here is the number of the base register i.e., Rn == 0 for r0, 3 for
r3 and so on. Note that Rn is a 3-bit field which means that this
instruction can only use registers r0-r7, also known as low registers; this
is the case for most of Thumb instructions. n = UInt(Rn) is just a formality
to say that n is the number of the base register.
The definition of wback suggests that the instruction must increment the base
register if you do not load to it: registers<n> == '0' is only going to be
true if the bit at position Rn is not selected: suppose r0 = 0x10,
mem[0x10]=0x42 before you try to execute ldm! r0, {r0, r1}; if that was
allowed and you could writeback to a register present in the list, the value of
r0 after the instruction will depend on how it is implemented in hardware –
it could be either 0x10 + 8 = 0x18, 0x42 or even 0x42 + 8 = 0x4A. This
has a side effect that Rn will be modified in any case: either by
incrementing (wback) or by loading from memory.
int thumb_decode(uint16_t hw1, uint16_t hw2, struct thumb_op * op)
{
/* ... */
} else if (bits_(hw1, 15, 11) == 0b11001) {
/* A6-137: T1: LDM <Rn>, <registers> or LDM <Rn>! <registers> */
op->ty = thumb_op_ldm;
op->v.ldm.rn = arm_reg_int(bits_(hw1, 10, 8));
op->v.ldm.regs = bits_(hw1, 7, 0);
op->v.ldm.wback = ! (op->v.ldm.regs & op->v.ldm.rn);
}
/* ... */
}
See how we can check if a register is selected with a simple bitwise AND:
op->v.ldm.regs & op->v.ldm.rn. This is thanks to the fact that each register
is stored at a different bit position in the mask, and that arm_reg_int()
converts a register number to that bit position.
And one more time let’s go look at some real disassembly:
0: b5d0 push {r4, r6, r7, lr}
`PUSH`: 0b1011010111010000
|15|14|13|12|11|10| 9| 8| 7| 6| 5| 4| 3| 2| 1| 0|
|--|--|--|--|--|--|--|--|-----------------------|
| 1| 0| 1| 1| 0| 1| 0| M| register_list |
|--|--|--|--|--|--|--|--|-----------------------|
| | | | | | | |lr|r7|r6|r5|r4|r3|r2|r1|r0|
|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|
| 1| 0| 1| 1| 0| 1| 0| 1| 1| 1| 0| 1| 0| 0| 0| 0|
2: cf4e ldmia r7!, {r1, r2, r3, r6}
`LDMIA` (same as `LDM`): 0b1100111101001110
|15|14|13|12|11|10| 9| 8| 7| 6| 5| 4| 3| 2| 1| 0|
|--|--|--|--|--|--------|-----------------------|
| 1| 1| 0| 0| 1| Rn | register_list |
|--|--|--|--|--|--------|-----------------------|
| | | | | | |r7|r6|r5|r4|r3|r2|r1|r0|
|--|--|--|--|--|--------|--|--|--|--|--|--|--|--|
| 1| 1| 0| 0| 1| 1110 | 0| 1| 0| 0| 1| 1| 1| 0|
Now that we have figured out all essential instruction groups. Now you can extend the decoder to rest of instructions. A couple of tips before you start.
First, add a fallback clause so that unknown instructions won’t return garbage:
enum thumb_op_ty {
/* ... */
thumb_op_word,
thumb_op_hword
/* ... */
};
struct thumb_op {
enum thumb_op_ty ty;
union {
/* ... */
uint32_t word_val;
uint16_t hword_val;
/* ... */
} v;
};
int thumb_decode(uint16_t hw1, uint16_t hw2, struct thumb_op * op)
{
if (op_is_32_bit_(hw1)) {
/* ... */
} else {
op->ty = thumb_op_word;
op->v.word_val = hw1 | hw2 << 16;
}
return 4;
} else {
/* ... */
} else {
op->ty = thumb_op_hword;
op->v.hword_val = hw1;
}
return 2;
}
}
thumb_op_word and thumb_op_hword mirror the assembly directives .word and
.hword which paste the literal bytes given.
Second tip is to always put instructions with more fixed bits above those with
fewer: e.g. we check 8 bits for SVC (bits_(hw1, 15, 8) == 0b11011111) so it
must go before the check for B<c> with only 4 fixed bits besides the
restrictions on cond (bits_(hw1, 15, 12) == 0b1101). This will help avoid
misinterpretation when a less specific encoding turns out to be a prefix of a
larger one.
Up until now we had to validate our decoder by hand, so let’s add a disassembler to our disassembler:
static const char * thumb_reg_str_(enum arm_reg v);
static void thumb_print_regs_(uint16_t v);
void thumb_print_disass(struct thumb_op * op, uint32_t addr)
{
/* see PC value of an instruction @ A4-68: A4.2.1 */
uint32_t pc = addr + 4;
switch (op->ty) {
case thumb_op_bl:
printf(
"bl %x",
pc + op->v.bl_imm32);
break;
case thumb_op_b:
if (op->v.b.cond != thumb_cond_al) { /* Encoding T1 */
printf(
"b%s.n %x",
thumb_cond_str_(op->v.b.cond),
pc + op->v.b.imm32);
} else { /* Encoding T2 */
printf(
"b.n %x",
pc + op->v.b.imm32);
}
break;
case thumb_op_push:
printf("push ");
thumb_print_regs_(op->v.push_regs);
break;
case thumb_op_ldm:
printf("ldm%c %s ",
op->v.ldm.wback ? '!' : ' ',
thumb_reg_str_(op->v.ldm.rn));
thumb_print_regs_(op->v.ldm.regs);
break;
case thumb_op_word:
printf(".word 0x%08x", op->v.word_val);
break;
case thumb_op_hword:
printf(".hword 0x%04x", op->v.hword_val);
break;
};
}
This function is simple, but be careful to follow the rules on what mnemonic to
choose in the description of each instruction. Some instructions (like ADD)
might allow multiple encodings; for those you might want to save the decoded
encoding variant to ensure the disassembly will not loose the original intent.
static const char * thumb_reg_str_(enum arm_reg v)
{
switch (v) {
case arm_r0:
return "r0";
/* ... */
case arm_r15:
return "r15";
};
}
static void thumb_print_regs_(uint16_t v)
{
int i;
int comma = 0;
printf("{");
for (i = 0; i <= 15; i++) {
if (bit_(v, i)) {
printf("%sr%i", comma ? ", " : "", i);
comma = 1;
}
}
printf("}");
}
And finally we can tie everything together and decode an actual file:
#include <stdint.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/stat.h>
static void print_bin16_(uint16_t v);
static int read_file_(const char * name, uint8_t ** data);
int main(int argc, char ** argv)
{
uint8_t * file;
int file_size;
int addr;
int len;
uint16_t hw1, hw2;
struct thumb_op op;
if (argc < 2) {
return 1;
}
file_size = read_file_(argv[1], &file);
if (file_size < 0) {
return 2;
}
addr = 0;
while (1) {
if (addr + 1 > file_size) {
break;
}
hw1 = file[addr] | file[addr + 1] << 8;
if (addr + 3 <= file_size) {
hw2 = file[addr + 2] | file[addr + 3] << 8;
}
len = thumb_decode(hw1, hw2, &op);
if (len == 2) {
printf("%5x: %04x ", addr, hw1);
print_bin16_(hw1);
printf(" ");
thumb_print_disass(&op, addr);
} else if (len == 4) {
printf("%5x: %04x %04x ", addr, hw1, hw2);
print_bin16_(hw1);
printf(" ");
print_bin16_(hw2);
printf(" ");
thumb_print_disass(&op, addr);
}
addr += len;
putchar('\n');
}
free(file);
return 0;
}
static void print_bin16_(uint16_t v)
{
int i;
for (i = sizeof(v) * 8 - 1; i >= 0; i--) {
putchar(((v >> i) & 1) ? '1' : '0');
}
}
static int read_file_(const char * name, uint8_t ** data)
{
int fd = -1;
struct stat sb;
fd = open(name, O_RDONLY);
if (fd < 0) {
perror("open");
goto err;
}
if (fstat(fd, &sb) == -1) {
perror("fstat");
goto err;
}
if (sb.st_size > 0) {
*data = malloc(sb.st_size);
} else {
goto err;
}
if (*data == NULL) {
perror("malloc");
goto err;
}
if (read(fd, *data, sb.st_size) != sb.st_size) {
goto err;
}
return sb.st_size;
err:
if (fd >= 0) {
close(fd);
}
if (*data != NULL) {
free(*data);
}
return -1;
}
This program works only with raw binary program images, so you should use
objcopy if you want to disassemble an object file:
arm-none-eabi-objcopy -O binary a.o a.bin
All together this looks like this:
0: f000 f800 1111000000000000 1111100000000000 bl 4
4: b438 1011010000111000 push {r3, r4, r5}
6: c838 1100100000111000 ldm! r0 {r3, r4, r5}
8: 2200 0010001000000000 movs r2, #0
a: 2364 0010001101100100 movs r3, #100
c: 3201 0011001000000001 adds r2, #1
e: 3b01 0011101100000001 subs r3, #1
10: d1fc 1101000111111100 bne.n c
Et voila! In a following post I will show how to parse an actual object file to disassemble it with symbol names and annotations for raw data.
First, we need a function to work with multiple integers packed inside a single halfword:
static uint16_t bits_(uint16_t v, int hi, int lo)
{
return (v >> lo) & ~(~0 << (hi - lo + 1));
}
This function extracts a right adjusted bit field that begins at bit index lo
and ends at hi (inclusive) (see 2.9 Bitwise operations in [4]):
bits_(0b01010, 4, 0) == 0b01010;
bits_(0b11010111, 6, 4) == 0b101;
The expression v >> lo moves the first bit of our field to the bit zero of
the halfword and discards all extra bits to the right of the field:
11010111 >> 4 == 00001101:
| bit # | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
|----------|-------------------------------|
| operand | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 |
|----------|-------------------------------|
| | added (4) | | discarded (4)
|----------|---------------|---------------|
| result | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 |
~0 is all 1 bits, shifting it left by the width of our bit field places
that number of zeroes at the start, and negating that gives a mask with 1
bits for each bit of our bit field:
~(~0 << (6 - 4 + 1)) == 00000111
| bit # | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
|------------|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
|------------|---|---|---|---|---|---|---|---|
| ~0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
|------------|---|---|---|---|---|---|---|---|
| ~0 << 3 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
|------------|---|---|---|---|---|---|---|---|
| ~(~0 << 3) | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
We then apply this mask with a binary AND to set all the extra bits to the left
of our bit field to 0:
(11010111 >> 4) & ~(~0 << (6 - 4 + 1)) == 00000101
| bit # | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
|------------------------|---|---|---|---|---|---|---|---|
| v >> lo | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 |
|------------------------|---|---|---|---|---|---|---|---|
| ~(~0 << (hi - lo + 1)) | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
|------------------------|---|---|---|---|---|---|---|---|
| & | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
Let us also define a simpler function for extracting singular bits:
static uint32_t bit_(uint32_t v, int n)
{
return (v >> n) & 1;
}
It places the bit number n to be the bit zero of the word and uses a mask
with only the first bit 1 to discard the extra bits.
Some instructions encode signed numbers in bit fields. bits_() will work fine
for positive numbers but we have to sign-extend the negative numbers.
Suppose we have a signed 4-bit field with a decimal -3 encoded as
two’s-complement as 0b1101. Without sign-extend it will be represented as
0b00001101 in an 8 bit integer which is equal to decimal 13.
To sign-extend a bit field of variable width we just need to fill the higher bits of the word with the most significant bit of the field [3]:
static int32_t sign_extend32_(uint32_t v, unsigned int b)
{
uint32_t m = (1U << (b - 1));
v = v & ((1U << b) - 1);
return (v ^ m) - m;
}
The expression v = v & ((1U << b) - 1) clears extra bits that are higher than
our field:
0b10101101 & ((1U << 4) - 1) == 0b00001101:
| bit # | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
|---------------|---|---|---|---|---|---|---|---|
| 1U << 4 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
|---------------|---|---|---|---|---|---|---|---|
| (1U << 4) - 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
|---------------|---|---|---|---|---|---|---|---|
| v | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 1 |
|---------------|---|---|---|---|---|---|---|---|
| & | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 |
m = (1U << (b - 1)) creates a mask with a 1 bit placed at the most
significant bit of our field: m == 0b1000.
The expression v ^ m can be interpreted mathematically as adding the smallest
possible number with width b:
m == 0b1000 == -8
v == 0b1101 == -3
0b1101 ^ 0b1000 == 0b0101 == 5
just like -3 + (-8) == 5
or for positive:
v == 0b0110 == 6
0b0110 ^ 0b1000 == 0b1110 == -2
just like 6 + (-8) == -2
We cannot simply use an addition operator because the result will overflow to higher bits for negative numbers:
0b1101 + 0b1000 == 0b10101 == 21
This one XOR basically replaces two operations: an addition and a mask:
(0b1101 + 0b1000) & 0b1111
== 0b10101 & 0b01111
== 0b0101
== 5
We then subtract m to reverse our previous addition but at this time we
specifically want it to cause an overflow:
m == 0b1000 == -8
v == 0b1101 == -3
0b1101 ^ 0b1000 == 0b00000101 == 5
0b00000101 - 0b00001000 == 0b11111101 == -3