Monday, October 1, 2012

Problems with software reverse engineering


A dark times to us all reverse engineers, especially the software kind. During my ~ten years of practice in the area, I’ve found out that today we have less tools for the task than what we had five years ago. Sure, nowadays IDA has a decompiler and some kind of a Python integration, but we all miss tools that served us for so long such as SoftICE and the Zynamics bindiff.
In this blog post I would like to tell my humble opinion on the subject.
Lets say a person, who I’ll refer to as the user, is given a one million pages long book (Font size 7). The user is being asked the simple question of “Does Alice ever meet Bob in this story?”. The user can read all million pages or die trying, in order to answer the question. But if the user had a digital copy of the book he could have searched for all of the places in the book where the word Alice and the word Bob are found not more than two pages apart. If the user had a Python interface to all the information found in the book, he could have used RegExp or some kind of lexical Python modules to maybe look for the word “meet” in all variations such as “meeting”, “met” or so on. Moreover, he could have found a way to figure out if there is a nickname for Alice such as “The Bitch”, and to automatically add this nickname to the search.
The way I see it, the current status of software reverse engineering is much closer to the hard copy book search task, than the digital one, and much further away from a decent Python interface.

The most commonly used tool IDA is giving the user a huge textual dump of disassembly, with some basic blocks, functions and higher level of understanding, where the highest level it gets is with the decompiler in which the user gets a C like code to read. I agree that it is easier to read C code than Opcodes but not by far. The user would never be able to read all the text for even the simplest program such as the Ms-Minesweeper. To produce this kind of text dump IDA had to had some kind of “understanding” of what each and every opcode is doing. I call this “opcoding doing” the opcode side effects. For instance, the side effects of the “ADD EAX, EBX” opcode would be reading EBX, EAX and writing to EAX and EFLAGS. IDA got this precious knowledge of side effects but it doesn’t let the user access it. I think this is the biggest point of failure in the understanding of what reverse engineering should look like.
Lets consider an example of creating a crack for a simple serial protected tool. The approach usually is of using a debugger to break on some known related API such as the GetDlgItemTextA or MessageBoxA to get to the point where the username is read or a wrong key message is displayed. Replacing this part of the task with a search of all functions in the code that are one or two steps away from both of these APIs, slicing with all the functions that contain the XOR opcode with two different operands, and some other characteristics relevant only to those kind of functions, would most likely produce a very short list of potential functions. When analyzing the functions, it is probably using many other functions that the user doesn’t care about, while only few of them are relevant to the serial validation. Displaying a list of all functions that are accessible from a certain function and filtering only the ones that might make an access to the buffer that is pointed by EDI or so, would make the job of the reverser much more effective.
The text of the opcodes is hardly ever what the reverser really cares about, he just wants to know what are the side effects of a function are, whether it changes EAX or not, whether it makes a copy of the buffer or not. The decompiled code is nice, but considering the fact that it’s not possible to recompile this code, why write it in C and not as an equations or list of arithmetic operations.
The reversing tool should be a much dummer tool that just creates the access point to as much information as available, the analysis should be done later on with scripts and plugins. The analysis that is done on a malware is not the same one that is done on a DRM product and is not the same one that is done on recovering algorithm.
However, putting all that aside, we have to have undo functionality. The lack of the undo function in IDA represents not only the fact that the tool is incomplete, it represents the fact that tool was made with small projects in mind, while sometimes the reversing work is something that is done over a long period of time. Symantec, Kaspersky and others told it took them a few months work of more than one reverser to get a basic understanding of the Stuxnet, which as far as I know, had very little anti-debugging tricks to it. IDA is not made for collaboration or long time research.

I would love to hear more opinions about this subject, and what people think should be done about it.

Cheers,
Assaf