Finished result
The final merged PR can be found here.
Task
What
The task at hand, which led me to code all of this, was to implement RzIL lifting for the SuperH-4 instruction set architecture. RzIL is Rizin’s intermediate language. It is based on BAP’s Core Theory, with minor deviations wherever required.
Why
The IL’s purpose is to generalize analysis across architectures by “lifting” (i.e. converting) the architecture-specific instructions to the common intermediate language, and applying the analysis passes on the “lifted” IL instructions. This reduces the amount of architecture-specific analysis code, leading to a better analysis loop and uniform analysis logic. It also separates the analysis logic from the instruction lifting logic, and allows for emulation and symbolic execution.
Things to keep in mind
Implementing the IL lifting is pretty straightforward from the conceptual point of view. One just needs to refer to the ISA documentation/manual and implement each instruction in terms of the IL opcodes. However, from the developer's point of view, it does require a bit more thought than just some grunt translation work. To make the IL lifting maintainable, readable and easy to extend to other instructions, one needs to come up with proper abstractions and methods, which are general enough to cover most of the cases, but at the same time simple enough to not cause any sort of spaghetti code. It’s imperative that the developer makes reasonable choices, keeping in mind the importance of both abstraction and simplicity, and finds a good balance between the two.
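To make "implement the instruction in terms of the IL opcodes" a bit more concrete, here is a toy sketch in C. It is not Rizin code: `toy_lift_add` and the textual IL form it prints are made up for illustration, and banked registers are ignored. The only fact taken from the SH-4 manual is the encoding of `add Rm, Rn` (`0011nnnnmmmm1100`).

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper (not part of Rizin): lift "add Rm, Rn" into a made-up
 * textual IL form. Returns 0 on success, -1 if the opcode is not ADD. */
int toy_lift_add(uint16_t opcode, char *out, size_t out_size) {
	/* ADD Rm, Rn is encoded as 0011 nnnn mmmm 1100 */
	if ((opcode & 0xf00f) != 0x300c) {
		return -1;
	}
	unsigned n = (opcode >> 8) & 0xf; /* destination register Rn */
	unsigned m = (opcode >> 4) & 0xf; /* source register Rm */
	/* Rn <- Rn + Rm: one "set variable" effect wrapping a pure addition */
	snprintf(out, out_size, "(set r%u (+ (var r%u) (var r%u)))", n, n, m);
	return 0;
}
```

Calling `toy_lift_add(0x312c, buf, sizeof buf)` fills `buf` with `(set r1 (+ (var r1) (var r2)))`. A real lifter builds IL operation objects rather than strings, but the shape of the work (decode fields, then describe the effect on variables) is the same.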
Disassembler
Once the IL is implemented, it needs to be tested. Rizin uses `rz-test` for automated testing. One can test assembling, disassembling and the IL lifting dump for any instruction using this. The disassembler for the SuperH-4 architecture was already present in the source tree, but it was GPL licensed code from the GNU Compiler Collection (the file in question is `sh-dis.c`). Rizin is also trying to get rid of GPL licensed code and rewrite it under the LGPL3 license. This was a good opportunity to revise and rewrite the disassembler.
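As a rough illustration of what a rewritten disassembler has to do, here is a minimal table-driven decoder sketch in C. The names (`ToyShOpDesc`, `toy_sh_disassemble`) are invented for this post and are not Rizin's API. Only three encodings from the SH-4 manual are covered: `nop`, `add Rm, Rn` and `mov Rm, Rn`.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* One row per instruction: which bits identify it, and how to print it. */
typedef struct {
	uint16_t mask;        /* bits that identify the instruction */
	uint16_t value;       /* expected value of those bits */
	const char *mnemonic; /* text to print */
	int has_regs;         /* 1 if the opcode carries Rm/Rn fields */
} ToyShOpDesc;

static const ToyShOpDesc toy_sh_ops[] = {
	{ 0xffff, 0x0009, "nop", 0 }, /* 0000 0000 0000 1001 */
	{ 0xf00f, 0x300c, "add", 1 }, /* 0011 nnnn mmmm 1100 */
	{ 0xf00f, 0x6003, "mov", 1 }, /* 0110 nnnn mmmm 0011 */
};

/* Returns 0 and writes the text into out, or -1 for an unknown opcode. */
int toy_sh_disassemble(uint16_t opcode, char *out, size_t out_size) {
	for (size_t i = 0; i < sizeof(toy_sh_ops) / sizeof(toy_sh_ops[0]); i++) {
		const ToyShOpDesc *d = &toy_sh_ops[i];
		if ((opcode & d->mask) != d->value) {
			continue;
		}
		if (d->has_regs) {
			unsigned n = (opcode >> 8) & 0xf;
			unsigned m = (opcode >> 4) & 0xf;
			/* SH syntax puts the source first: "op Rm, Rn" */
			snprintf(out, out_size, "%s r%u, r%u", d->mnemonic, m, n);
		} else {
			snprintf(out, out_size, "%s", d->mnemonic);
		}
		return 0;
	}
	return -1;
}
```

With this table, `0x312c` decodes to `add r2, r1` and `0x6123` to `mov r2, r1`. Keeping the encodings in a table rather than in a chain of `if`s is one way to keep such code easy to extend.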
Assembler
Also, there was no assembler for SuperH-4 architecture in Rizin. So this would also be a good time to implement and add one. This would make the analysis plugin for the SH-4 architecture “complete”. And, it would lead to more robust testing and allow users to assemble instructions (for example, in case they are building shellcode from instructions).
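The assembler is the inverse mapping: text in, opcode bits out. The sketch below is a deliberately tiny illustration in C, handling only `nop`, `add Rm, Rn` and `mov Rm, Rn`. The function name `toy_sh_assemble` is hypothetical, and the parsing via `sscanf` is just the shortest thing that works here, not how Rizin's assembler is written.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Returns 0 and stores the 16-bit opcode, or -1 if the text is not understood. */
int toy_sh_assemble(const char *text, uint16_t *opcode) {
	char mnemonic[8];
	unsigned m = 0, n = 0;
	int end = 0;
	uint16_t base;

	if (strcmp(text, "nop") == 0) {
		*opcode = 0x0009;
		return 0;
	}
	/* Expect "<mnemonic> r<m>, r<n>"; %n records how much input was consumed. */
	if (sscanf(text, "%7s r%u, r%u%n", mnemonic, &m, &n, &end) != 3) {
		return -1;
	}
	/* Reject trailing garbage and register numbers outside r0-r15. */
	if (text[end] != '\0' || m > 15 || n > 15) {
		return -1;
	}
	if (strcmp(mnemonic, "add") == 0) {
		base = 0x300c; /* 0011 nnnn mmmm 1100 */
	} else if (strcmp(mnemonic, "mov") == 0) {
		base = 0x6003; /* 0110 nnnn mmmm 0011 */
	} else {
		return -1;
	}
	*opcode = (uint16_t)(base | (n << 8) | (m << 4));
	return 0;
}
```

Assembling `add r2, r1` yields `0x312c`. Having both directions available is what makes round-trip tests (assemble, then disassemble, then compare) possible.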
Plan
So, the plan was to implement the IL lifting, re-implement the disassembler, implement the assembler and then add tests to verify the lifting, the disassembler and the assembler. Looking back, I would have wanted to first implement the disassembler and the assembler while writing the tests in parallel, and then implement the IL lifting.
Register model
The SuperH-4 register model is quite similar to that of most other architectures. It has general purpose registers, a status register (consisting of status bits), system registers, control registers and floating point registers. For the purpose of this article, we will not deal with floating point registers and instructions.
Banks
One specific quirk in this otherwise simple register model is banked registers. There are two banks, each of which has its own copy of the first 8 general purpose registers (`r0`-`r7`). Therefore, whenever accessing any of the banked registers, one also needs to specify the bank. Which bank is used depends on the privilege bit. One needs to be really careful about implementing this in the IL. We cannot just evaluate the privilege bit while lifting (in fact, the IL does not allow us to do so), and thus we have to implement this selection in the IL itself. This leads to a more verbose and complicated IL lifting, but it makes the IL more general and applicable even in the case of privileged instructions.
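The selection described above can be modelled with a small C sketch. This is a plain run-time model, not IL and not Rizin code: the `ToyShState` struct and both functions are hypothetical, and the mapping "privileged means bank 1" is a simplifying assumption made only for this example. The point is that the choice of bank happens when the register is accessed, which is exactly what the lifted IL has to express as a conditional instead of the lifter deciding up front.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy machine state (assumed layout, for illustration only). */
typedef struct {
	uint32_t banked[2][8]; /* r0-r7, one copy per bank */
	uint32_t unbanked[8];  /* r8-r15, shared */
	bool privileged;       /* privilege bit, only known at run time */
} ToyShState;

/* Read a general purpose register, picking the bank at access time. */
uint32_t toy_read_gpr(const ToyShState *st, unsigned reg) {
	reg &= 0xf;
	if (reg >= 8) {
		return st->unbanked[reg - 8];
	}
	/* Assumption: bank 1 when privileged, bank 0 otherwise. */
	return st->privileged ? st->banked[1][reg] : st->banked[0][reg];
}

/* Write a general purpose register with the same bank selection. */
void toy_write_gpr(ToyShState *st, unsigned reg, uint32_t value) {
	reg &= 0xf;
	if (reg >= 8) {
		st->unbanked[reg - 8] = value;
	} else if (st->privileged) {
		st->banked[1][reg] = value;
	} else {
		st->banked[0][reg] = value;
	}
}
```

Every read or write of `r0`-`r7` in a lifted instruction needs the equivalent of that conditional, which is where the extra verbosity comes from.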
Status registers
The status register (mnemonic `sr`) consists of multiple status bits. `sr` itself does not exist as a variable in the IL; only the individual status bits do. This makes accessing and modifying the status bits easier. However, reading or modifying the status register as a whole becomes relatively complex (but this is relatively uncommon, hence worth the tradeoff). For now, the IL does not allow overlapping registers, so it is hard to let both the status bits and the status register coexist as variables in the IL.
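To show why whole-register access gets more involved, here is a small C sketch that keeps the status bits as separate fields and assembles or splits the 32-bit `sr` value on demand. The struct and function names are hypothetical. The bit positions follow the SH-4 manual (T at bit 0, S at bit 1, IMASK at bits 4-7, Q at bit 8, M at bit 9, FD at bit 15, BL at bit 28, RB at bit 29, MD at bit 30).

```c
#include <stdbool.h>
#include <stdint.h>

/* Status bits stored individually, mirroring one IL variable per bit. */
typedef struct {
	bool t, s, q, m, fd, bl, rb, md;
	uint8_t imask; /* 4-bit interrupt mask */
} ToySrBits;

/* Build the 32-bit sr value out of the individual bits. */
uint32_t toy_sr_pack(const ToySrBits *b) {
	return ((uint32_t)b->t << 0) |
		((uint32_t)b->s << 1) |
		((uint32_t)(b->imask & 0xf) << 4) |
		((uint32_t)b->q << 8) |
		((uint32_t)b->m << 9) |
		((uint32_t)b->fd << 15) |
		((uint32_t)b->bl << 28) |
		((uint32_t)b->rb << 29) |
		((uint32_t)b->md << 30);
}

/* Split a 32-bit sr value back into bits; reserved bits are dropped. */
void toy_sr_unpack(uint32_t sr, ToySrBits *b) {
	b->t = (sr >> 0) & 1;
	b->s = (sr >> 1) & 1;
	b->imask = (sr >> 4) & 0xf;
	b->q = (sr >> 8) & 1;
	b->m = (sr >> 9) & 1;
	b->fd = (sr >> 15) & 1;
	b->bl = (sr >> 28) & 1;
	b->rb = (sr >> 29) & 1;
	b->md = (sr >> 30) & 1;
}
```

An instruction that only touches the T bit works on a single field, while one that loads or stores `sr` has to go through the full pack or unpack, which is the tradeoff mentioned above.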
ISA
Operands
The SuperH-4 instruction set architecture is pretty simple. It uses a load-store memory model, and when using indirect addressing modes, the operand value is the value stored in memory at the effective address. The number of bytes to read from the effective address depends on the scaling of the instruction. It is necessary to provide generic utility methods and structs so that getting and setting operand values is as clean as possible. The same goes for any other commonly used patterns: they should be abstracted out for cleaner code, but without introducing complexity, anti-patterns and “special” cases.
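As an example of such a utility, the following C sketch computes an effective address for a displacement addressing mode and reads a scaled operand from memory. The names are made up for this post, memory is a flat byte array, and little-endian byte order is an assumption of the sketch (the hardware itself can run in either endianness).

```c
#include <stddef.h>
#include <stdint.h>

/* Scaling is simply the operand width in bytes. */
typedef enum { TOY_BYTE = 1, TOY_WORD = 2, TOY_LONG = 4 } ToyScaling;

/* @(disp, Rn): the displacement is multiplied by the operand width. */
uint32_t toy_effective_addr(uint32_t base, uint32_t disp, ToyScaling scaling) {
	return base + disp * (uint32_t)scaling;
}

/* Read `scaling` bytes at addr (little-endian, by assumption).
 * Returns 0 on success, -1 if the access falls outside mem. */
int toy_mem_read(const uint8_t *mem, size_t mem_size, uint32_t addr,
	ToyScaling scaling, uint32_t *out) {
	if (addr > mem_size || mem_size - addr < (size_t)scaling) {
		return -1;
	}
	uint32_t value = 0;
	for (unsigned i = 0; i < (unsigned)scaling; i++) {
		value |= (uint32_t)mem[addr + i] << (8 * i);
	}
	*out = value;
	return 0;
}
```

With one helper for the address and one for the access, each addressing mode only has to say how its address is formed, and each instruction only has to say which scaling it uses.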
Instructions
For the aforementioned PR, I implemented the lifting, the assembler and the disassembler for CPU instructions only. Instructions that perform memory access have a scaling size. If no scaling is specified, long word scaling (4 bytes) is assumed by default.
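The default-scaling rule can be written down in a few lines of C. This is only a sketch: `toy_scaling_from_mnemonic` is a hypothetical helper, and it assumes the scaling is given as a `.b`, `.w` or `.l` suffix on the mnemonic, as in `mov.b`, `mov.w` and `mov.l`.

```c
#include <string.h>

/* Returns the access width in bytes for a mnemonic, or -1 for an unknown suffix. */
int toy_scaling_from_mnemonic(const char *mnemonic) {
	const char *dot = strrchr(mnemonic, '.');
	if (!dot) {
		return 4; /* no suffix: long word scaling by default */
	}
	if (strcmp(dot, ".b") == 0) {
		return 1; /* byte */
	}
	if (strcmp(dot, ".w") == 0) {
		return 2; /* word */
	}
	if (strcmp(dot, ".l") == 0) {
		return 4; /* long word */
	}
	return -1;
}
```

So `mov.w` gives 2, while both `mov.l` and a bare `mov` give 4.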
Future plans
- Run `rz-tracetest` to verify the IL lifting (this would require patching QEMU to generate traces for SuperH-4 emulation)
- Implement delayed branch (this requires support from the IL, so it’ll need work on RzIL internally)
- Add floating point instructions support in the assembler and the disassembler
- Implement IL lifting for floating point instructions (this also requires IL support for floating point numbers; a WIP PR for the IEEE 754 spec already exists)