lifting-bits / fcd

This project forked from fay59/fcd

An optimizing decompiler (modified to use remill semantics)

Home Page: http://zneak.github.io/fcd

License: Other

CMake 1.05% C++ 90.68% C 2.35% Python 5.91%

fcd's Issues

Handle global entities in AST generation

The AST generation algorithm does not consider most global entities in the input bitcode. Implementing handling of global entities, such as aggregate type definitions, global variables, function declarations, etc., would improve the presentation of bitcode lifted from binaries and extracted from C code, and enable proper presentation of bitcode extracted from C++ code. This would also put fcd+remill closer to outputting complete valid C code.

Refactoring AST generation

The code for the AST generation algorithms and data structures is in poor shape and should be refactored with good programming practices in mind. Using libclang for its AST, instead of the custom AST data structure we have now, should be considered.

The major points to consider when refactoring:

Classes should adhere to rule of three (or rule of five, or rule of zero).
Classes should cooperate flawlessly with STL containers.
Use STL containers wherever possible.
Avoid explicit dynamic memory allocation (using new) whenever possible. If a custom allocation mechanism is needed, use allocators and STL containers.
Use glog's CHECK() and LOG() instead of relying on llvm_unreachable() and LLVM diagnostics.
Do not use goto.
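
As a rough illustration of the points above, here is a minimal sketch of an AST node that follows the rule of zero, stores its children in an STL container, and never calls new explicitly. The names Expr, MakeExpr, MakeBinary and Print are hypothetical and are not fcd's actual classes; assert stands in for glog's CHECK() only to keep the sketch free of dependencies.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical AST node, not fcd's actual Expression class.
struct Expr {
  std::string token;                            // operator or leaf text
  std::vector<std::unique_ptr<Expr>> operands;  // owned children

  explicit Expr(std::string t) : token(std::move(t)) {}
  // Rule of zero: no destructor, copy, or move members are declared.
  // The members manage their own memory, so children are freed automatically.
};

// Creates a leaf node without an explicit `new` at the call site.
std::unique_ptr<Expr> MakeExpr(std::string token) {
  return std::make_unique<Expr>(std::move(token));
}

// Creates a binary node that takes ownership of both operands.
std::unique_ptr<Expr> MakeBinary(std::string op, std::unique_ptr<Expr> lhs,
                                 std::unique_ptr<Expr> rhs) {
  assert(lhs && rhs);  // a real implementation might use glog's CHECK()
  auto node = MakeExpr(std::move(op));
  node->operands.push_back(std::move(lhs));
  node->operands.push_back(std::move(rhs));
  return node;
}

// Prints leaves as-is and inner nodes as a parenthesized infix expression.
std::string Print(const Expr &e) {
  if (e.operands.empty()) return e.token;
  std::string out = "(";
  for (std::size_t i = 0; i < e.operands.size(); ++i) {
    if (i != 0) out += " " + e.token + " ";
    out += Print(*e.operands[i]);
  }
  return out + ")";
}
```

Because ownership lives in std::unique_ptr, nodes can be moved into and out of STL containers without any hand-written memory management.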

Remove obsoleted code and the dependency on Capstone

Once Remill is used for lifting, there should be no further need for Capstone or for some other parts of the codebase, most likely the x86_emulator.

Port ParameterRegistryPass and ProgramMemoryAliasAnalysis IR Passes

Changing the lifted IR has very likely broken ParameterRegistryPass and ProgramMemoryAliasAnalysis. ProgramMemoryAliasAnalysis should be easy to fix, since the original only checked for the !fcd.prgmem tag on memory ops. ParameterRegistryPass, however, is likely to be harder, because it seems to rely heavily on other resources from fcd/callconv/.

Port the ArgumentRecovery IR Pass

The pass attempts to recover function arguments for lifted IR functions. It relies heavily on information provided by the ParameterRegistry class in fcd/callconv/param_registry.(h|cpp) and the IR passes associated with it, namely ParameterRegistryPass and ProgramMemoryAliasAnalysis.

Presentation of string literals in output pseudocode

String literals are currently presented in the output pseudocode as initializer lists of ASCII character values instead of as C string literals. This should not remain the case long-term.
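
One possible shape for the conversion, as a minimal sketch: the helper below turns a NUL-terminated byte array into a C string literal. ToCStringLiteral is a hypothetical name and is not part of fcd. It assumes the input has already been recognized as a string constant, and it uses three-digit octal escapes for non-printable bytes so that a following digit can never be absorbed into the escape.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: renders a NUL-terminated byte array as a C string
// literal. The trailing NUL is implied by the literal and is not printed.
std::string ToCStringLiteral(const std::vector<unsigned char> &bytes) {
  assert(!bytes.empty() && bytes.back() == 0);
  std::string out = "\"";
  for (std::size_t i = 0; i + 1 < bytes.size(); ++i) {
    const unsigned char c = bytes[i];
    switch (c) {
      case '"':  out += "\\\""; break;
      case '\\': out += "\\\\"; break;
      case '\n': out += "\\n";  break;
      case '\t': out += "\\t";  break;
      default:
        if (c >= 0x20 && c < 0x7f) {
          out += static_cast<char>(c);  // printable ASCII is emitted as-is
        } else {
          char buf[8];
          std::snprintf(buf, sizeof buf, "\\%03o", static_cast<unsigned>(c));
          out += buf;                   // everything else as \ooo
        }
    }
  }
  return out + "\"";
}
```

With this, the bytes {104, 105, 10, 0} would be printed as "hi\n" rather than as a four-element initializer list.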

Move memop lowering and ISEL/intrinsics cleanup code to remill

The code in fcd is itself copied over from McSema and seems like a reasonable candidate for inclusion into remill as a fast out-of-the-box option to get a cleaner lifted IR. The code could be placed into remill/BC/Lowering.(h|cpp) or something similar.

Port the IdentifyLocals IR Pass

This pass attempts to recover the stack frame after function argument recovery is done. Because the lifted IR has changed, the pass needs to be updated.

Reimplement the TranslationContext class using Remill

Title is pretty self-explanatory. Relevant code is in fcd/codegen/translation_context.cpp. This will also include changes to the users of TranslationContext. Code will be formatted using Remill's clang-format.

Handle floats in AST generation

The AST generation algorithm completely disregards floating point types and crashes if the input bitcode contains them. Needless to say, floats should be handled better.

Analyzing RA location on the stack

The current approach to analyzing the stack frame in RemillStackRecovery can be supplemented by analyzing the location of the caller's return address relative to the top of the stack, as stored in the stack pointer. Note that this requires an ABI in which the callee is responsible for cleaning up the stack.

This kind of analysis could provide the number of stack parameters passed into the callee even when the callee does not access any of them. That is useful when analyzing callsites of the callee, so that we don't recover too many or too few parameters.
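
The arithmetic behind this idea can be sketched as follows. CountStackParams is a hypothetical helper, not code from RemillStackRecovery. It assumes a callee-cleanup ABI, a stack that grows downward, a stack pointer that points at the return address on entry, and parameters that each occupy exactly one stack slot.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: derives the number of stack parameters from how far
// the stack pointer moved between function entry and the point just after
// the callee's return instruction.
unsigned CountStackParams(std::uint64_t sp_at_entry,
                          std::uint64_t sp_after_ret, unsigned slot_size) {
  assert(slot_size != 0);
  assert(sp_after_ret >= sp_at_entry + slot_size);
  // Everything popped by the return: the return address plus the parameters.
  const std::uint64_t popped = sp_after_ret - sp_at_entry;
  const std::uint64_t param_bytes = popped - slot_size;  // drop return address
  assert(param_bytes % slot_size == 0);
  return static_cast<unsigned>(param_bytes / slot_size);
}
```

For example, a 32-bit callee that pops its 4-byte return address and 20 further bytes would be reported as taking five stack parameters, whether or not its body reads any of them.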

Improve handling of conditions in AST generation

The AST generation algorithm seems to perform boolean expression normalization when generating conditions. The implementation of the normalization algorithm seems to have significant performance issues that need to be addressed. If non-trivial transformations are required, the incorporation of an external library for this task should be considered.

Using McSema lifted bitcode

As it is, fcd only has very basic CFG recovery and doesn't handle a number of aspects of the symbol information usually present in a binary. McSema has more capabilities in this regard and produces bitcode similar to what fcd produces after lifting. It would definitely be worth checking whether bitcode lifted by McSema could be used in fcd in some way.

Refactor main.cpp

The main.cpp file hosts a lot of code which was either obsoleted by using Google Flags or does not work well with Remill's clang-format. Some refactoring is in order. This might also result in splitting main.cpp into smaller files.

Remill `State` and Intrinsic cleanup

As it is, fcd+remill produces pseudocode that contains __remill* intrinsics and leftover uses of the State and Memory variables and arguments. This makes fcd+remill produce superfluous code at best and crash during pseudocode generation at worst.

One way to fix this is to replace the calls to __remill* intrinsics with calls to new functions that do not use the State and Memory variables, and let subsequent optimizations deal with the rest. Ideally this will solve both problems.

Refactor python bindings

I don't have much experience in creating python bindings to LLVM or C++ code in general, but the current generation of binding.cpp during build via fcd/python/bindings.py from llvm-c/Core.h seems a bit clumsy and easily broken when migrating to different versions of LLVM. Maybe worth a refactor?

Migrate IR passes from `RemillTranslationContext::FinalizeModule()`

To facilitate the use of modules that were lifted by other remill-based tools (most notably McSema), IR passes such as RemillArgumentRecovery and RemillStackRecovery need to be run outside of RemillTranslationContext::FinalizeModule(). Ideally they would be moved into RunPassPipeline() or a related function in main.cpp. This will allow the passes to be run independently of lifting.

Refactor fcd's call convention detection

Fcd currently uses a lot of dirty hacks in its call convention detection IR passes and classes. Notable examples include wrapping an Executable class object in an LLVM ImmutablePass and having ParameterRegistryPass depend on it for target information. This shouldn't really be necessary or desirable. There also seems to be a lot of unused and possibly obsolete code.

Enhance function return type recovery in `RemillArgumentRecovery`

As it is now, the analysis of function return types implemented in RecoverRetType() in the RemillArgumentRecovery pass is pretty limited and would benefit from some enhancing.

@pgoodman in #13 proposed the following:

I think you need to look at all paths leading to a ret. You might be able to simplify your life using some existing CFG pass that restructures the function. This may need to be applied to a clone of the function.

What I would consider looking into would be a combined inside-out and outside-in approach:

look for all definitions of return registers that lead to a ret, add those to stored_regs[func].

for every call to a func present in stored_regs[func], check if anything from stored_regs[func] is used on a path that is reachable from the call, then do ret_reg[func][reg] += 1 to add confidence to that register.

If one function tail-calls another, then you may want to union together stored_regs[caller] += stored_regs[callee].

In descending order of confidence, type check ret_regs[func] for each function.

Some functions in stored_regs are never directly called, fall back to an approach like what you have, but expand it so that it looks beyond the terminal blocks.
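
The first three steps of the proposal can be sketched on a toy data model. This is only an illustration under stated assumptions: FuncSummary, CallSite and ScoreReturnRegs are hypothetical names, the per-function summaries are assumed to have been computed by some earlier analysis, and the tail-call union is applied before scoring so that wrappers inherit their callees' registers. Type checking and the fallback for functions that are never called directly are left out.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

using Reg = std::string;
using Func = std::string;

// Hypothetical summary of one call: the callee, plus the registers read on
// some path reachable from the call before they are redefined.
struct CallSite {
  Func callee;
  std::set<Reg> regs_used_after;
};

// Hypothetical per-function summary produced by an earlier analysis.
struct FuncSummary {
  std::set<Reg> regs_defined_before_ret;  // definitions that reach a ret
  std::vector<Func> tail_calls;           // functions this one tail-calls
  std::vector<CallSite> calls;            // ordinary calls in the body
};

std::map<Func, std::map<Reg, int>> ScoreReturnRegs(
    const std::map<Func, FuncSummary> &prog) {
  // Inside-out: collect candidate return registers per function.
  std::map<Func, std::set<Reg>> stored_regs;
  for (const auto &entry : prog)
    stored_regs[entry.first] = entry.second.regs_defined_before_ret;

  // Tail calls: stored_regs[caller] += stored_regs[callee], to a fixed point.
  for (bool changed = true; changed;) {
    changed = false;
    for (const auto &entry : prog) {
      for (const auto &callee : entry.second.tail_calls) {
        const auto it = stored_regs.find(callee);
        if (it == stored_regs.end()) continue;
        for (const auto &reg : it->second)
          if (stored_regs[entry.first].insert(reg).second) changed = true;
      }
    }
  }

  // Outside-in: each use after a call adds confidence to that register.
  std::map<Func, std::map<Reg, int>> ret_reg;
  for (const auto &entry : prog) {
    for (const auto &call : entry.second.calls) {
      const auto it = stored_regs.find(call.callee);
      if (it == stored_regs.end()) continue;
      for (const auto &reg : call.regs_used_after)
        if (it->second.count(reg) != 0) ++ret_reg[call.callee][reg];
    }
  }
  return ret_reg;
}
```

The resulting counts could then be sorted in descending order and type checked per function, as the proposal suggests.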

Segfault in `ExpressionUse::setUse(Expression*)`

When trying to generate pseudocode from a clang-compiled IR module (with --module-in and -nooptimize), the AST generation pass in the backend dies due to what I assume is an illegal write at expression->firstUse = this; in fcd/ast/expression_use.cpp.

The crash was produced using the following module.
miniz.zip

RFC: Recovery of parameters passed via stack

Consider the following code and a calling convention that uses no registers to pass parameters into a function:

int f(int p1, int p2, int p3, int p4, int p5) {
  int v1 = 5;
  return (v1 + p1 + p5) % 2;
}

int main(void) {
  return f(1,0,0,0,1);  
}

In this hypothetical scenario, with the current approach, fcd will recover the above as roughly:

int f(int p1, int p2) {
  int v1 = 5;
  return (v1 + p1 + p2) % 2;
}

int main(void) {
  return f(1,1);
}

My question here is if we want to analyze callsites and try to recover the original function prototype or leave it as is and give the user the option to provide a function prototype in a header file, similarly to how the original fcd version handles prototypes of external functions.

Refactor AddressSpaceAAWrapperPass

The simple alias analysis pass housed in fcd/pass_asaa.* is currently littered with LLVM version #ifdef directives. There was some effort to keep it tidy, but LLVM's alias analysis has seen a lot of changes from 3.5 onwards. The pass could use a refactor.

Interactivity or scripting

This is more of a braindump for a longer-term way of using fcd. When I look at decompiled code, one thing I notice is that there are lots of libc bringup routines (e.g. __libc_start_main) that could probably be elided, but coming up with a good policy for what to elide and when is not straightforward.

Another thing that comes up is how to specify things like headers to fcd in order for it to do a better job with decompilation. Interactivity has come up as an option, and I think that might be how fcd originally did things.

I think that a nice alternative might be a more scripting-oriented approach. It would be similar-ish to interactive, but permit more re-use down the line. For things like libc stuff, you can have a file like linux.py or ELF.py that just "does the right thing" for eliding stuff. Scripting may also enable things like specifying headers.

I'm not sure if scripting should be done by embedding a Python interpreter (this is what PointsTo did, and it worked reasonably well; it would mean adding a new command-line argument, something like --script), or by making fcd into a Python module (a bit harder, but it might make it easier to integrate with other stuff).

Any thoughts?

Refactor header declaration parsing

Header declaration parsing was disabled for llvm-3.8 and lower. The feature relies on the <clang/Index/CodegenNameGenerator.h> header which was introduced in llvm-3.9. The code in fcd/header_decls.cpp and fcd/clang/* needs a refactor anyway; maybe it would be worth investigating if the feature can work with llvm-3.8 and lower?

Bad value replacement in `ConvertRemillArgsToLocals()`

I built FCD+Remill using a Debug build of LLVM 4.0 to investigate another issue, but encountered this first:

In pass_argrec_remill.cpp:341 in function ConvertRemillArgsToLocals()

  auto pc_type = remill::AddressType(module);
  auto arg_pc = remill::NthArgument(func, remill::kPCArgNum);
  auto loc_pc = ir.CreateAlloca(pc_type, nullptr, "loc_pc");
  arg_pc->replaceAllUsesWith(loc_pc);

Assertion failed: (New->getType() == getType() && "replaceAllUses of value with new value of different type!"), function doRAUW, file ../llvm/lib/IR/Value.cpp, line 375.
(lldb) p arg_pc->dump()
i64 %pc
(lldb) p loc_pc->dump()
  %loc_pc = alloca i64
(lldb) p arg_pc->getType()->dump()
i64
(lldb) p loc_pc->getType()->dump()
i64*

Add a comprehensive help message to cmdline flags

Refactor fcd's CMakeLists.txt

The project structure and build of fcd have changed because fcd is now a subproject of Remill. The current CMake files work, but could definitely use a refactor so that they more closely resemble those of other Remill subprojects (McSema).

Analyze callsites in `RemillArgumentRecovery`

The RemillArgumentRecovery pass currently recovers function arguments based on how the argument-passing registers are used in the function body. Analyzing callsites would give us better results in a number of cases, e.g. for external functions.

Alias Analysis of Remill's `State` structure

As it stands, the RemillArgumentRecovery pass relies on a static table for information about register aliasing (e.g. RAX aliases RAX, EAX, AX, AH, AL). This approach isn't very flexible, and fcd+Remill would definitely benefit from an analysis that derives this information (and potentially more) from the State structure present in all functions lifted by Remill. This would also allow the RemillArgumentRecovery pass to be refactored to work both before and after passes like LLVM's mem2reg.
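
One way such an analysis could replace the static table, as a minimal sketch: if each register is known as a byte range inside State, two registers alias exactly when their ranges overlap. RegSlot, Overlaps and ComputeAliases are hypothetical names, and the offsets used in the usage note are made up for illustration rather than taken from Remill's actual State layout.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>

// Hypothetical description of where a register lives inside State.
struct RegSlot {
  std::size_t offset;  // byte offset from the start of State
  std::size_t size;    // width in bytes
};

// Two registers alias when their byte ranges intersect.
bool Overlaps(const RegSlot &a, const RegSlot &b) {
  return a.offset < b.offset + b.size && b.offset < a.offset + a.size;
}

// Builds the alias sets from the layout instead of a hand-written table.
std::map<std::string, std::set<std::string>> ComputeAliases(
    const std::map<std::string, RegSlot> &layout) {
  std::map<std::string, std::set<std::string>> aliases;
  for (const auto &a : layout) {
    assert(a.second.size > 0);
    for (const auto &b : layout)
      if (Overlaps(a.second, b.second)) aliases[a.first].insert(b.first);
  }
  return aliases;
}
```

With a little-endian layout such as RAX at offset 0 (8 bytes), EAX at 0 (4), AX at 0 (2), AL at 0 (1) and AH at 1 (1), RAX aliases all five names, while AH and AL do not alias each other.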

Broken CFG in `__libc_csu_init`

Seems like certain basic blocks are disassembled more times than necessary. The issue should be replicable using tests/retval-atoi.c.
