lifting-bits / fcd
This project forked from fay59/fcd
An optimizing decompiler (modified to use remill semantics)
Home Page: http://zneak.github.io/fcd
License: Other
The AST generation algorithm does not consider most global entities in the input bitcode. Implementing handling of global entities, such as aggregate type definitions, global variables, function declarations, etc., would improve the presentation of bitcode lifted from binaries and extracted from C code, and enable proper presentation of bitcode extracted from C++ code. This would also put fcd+remill closer to outputting complete valid C code.
The AST generation algorithms and data structure codebase could be in better shape and should be refactored with good programming practices in mind. Using libclang's AST instead of the custom AST data structure we now have should be considered.
The major points to consider when refactoring:
- [...] (new) whenever possible. If a custom allocation mechanism is needed, use allocators and STL containers.
- [...] CHECK() and LOG() instead of relying on llvm_unreachable() and LLVM diagnostics.
- [...] goto.
Once Remill is used for lifting, there should be no further need for capstone and other parts of the codebase, most likely the x86_emulator.
By changing the lifted IR, it's very likely that ParameterRegistryPass and ProgramMemoryAliasAnalysis have been broken. ProgramMemoryAliasAnalysis should be easy to fix, since the original one only checked for the !fcd.prgmem tag on memory ops. However, ParameterRegistryPass is likely to be worse, because it seems to rely heavily on other resources from fcd/callconv/.
The pass attempts to recover function arguments for lifted IR functions. The pass relies heavily on information provided by the ParameterRegistry class in fcd/callconv/param_registry.(h|cpp) and the IR passes associated with it, namely ParameterRegistryPass and ProgramMemoryAliasAnalysis.
String literals are currently presented as initializer lists with ASCII values as characters instead of C string literals in output pseudocode. This should obviously not be the case long-term.
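A minimal sketch of the conversion, assuming the initializer is available as a plain byte array. RenderCStringLiteral is a hypothetical helper, not an existing fcd function. It reports failure when the bytes are not a single NUL-terminated string, so the caller can fall back to the initializer list.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper (not part of fcd): renders a NUL-terminated byte array
// as a C string literal. Returns false if the bytes are empty, lack a trailing
// NUL, or contain an interior NUL; `out` is only meaningful on success.
bool RenderCStringLiteral(const std::vector<uint8_t> &bytes, std::string &out) {
  if (bytes.empty() || bytes.back() != 0) return false;
  std::string text = "\"";
  for (std::size_t i = 0; i + 1 < bytes.size(); ++i) {
    const uint8_t c = bytes[i];
    switch (c) {
      case 0: return false;  // interior NUL: not a single C string
      case '\n': text += "\\n"; break;
      case '\t': text += "\\t"; break;
      case '\r': text += "\\r"; break;
      case '"': text += "\\\""; break;
      case '\\': text += "\\\\"; break;
      default:
        if (c >= 0x20 && c < 0x7f) {
          text += static_cast<char>(c);  // printable ASCII is emitted as-is
        } else {
          // Three-digit octal escapes are the maximum width, so a following
          // digit character cannot be absorbed into the escape.
          char buf[5];
          std::snprintf(buf, sizeof(buf), "\\%03o", static_cast<unsigned>(c));
          text += buf;
        }
    }
  }
  text += "\"";
  out = text;
  return true;
}
```

For the bytes {104, 105, 10, 0} this yields the literal "hi\n" instead of a four-element initializer list.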
The code in fcd is itself copied over from McSema and seems like a reasonable candidate for inclusion into remill as a fast out-of-the-box option to get a cleaner lifted IR. The code could be placed into remill/BC/Lowering.(h|cpp) or something similar.
This pass attempts to recover the stack frame after function argument recovery is done. By changing the lifted IR, the pass needs to be updated.
Title is pretty self-explanatory. Relevant code is in fcd/codegen/translation_context.cpp. This will also include changes to the users of TranslationContext. Code will be formatted using Remill's clang-format.
The AST generation algorithm completely disregards floating point types and crashes if the input bitcode contains them. Needless to say, floats should be handled better.
The current approach to analyzing the stack frame in RemillStackRecovery can be supplemented by analyzing the location of the caller's return address relative to the top of the stack stored in the stack pointer. Note that this requires an ABI in which the callee is responsible for cleaning up the stack.
This kind of analysis could provide information about the number of stack parameters passed into the callee without the callee actually accessing any of the parameters. This is useful when analyzing callsites of the callee, so that we won't recover too many or too few parameters.
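The arithmetic behind this idea can be sketched as follows. This is an illustration under stated assumptions, not fcd code: CountStackParams and its parameters are hypothetical names, the stack is assumed to grow downward, the return address is assumed to occupy one word, and every stack parameter is assumed to be word-sized.

```cpp
#include <cstdint>

// Hypothetical helper (not part of fcd or Remill).
// sp_at_entry:     stack pointer on callee entry (points at the return address)
// sp_after_return: stack pointer right after the callee has returned
// word_size:       size of a stack slot in bytes
// Returns the number of word-sized stack parameters the callee cleaned up,
// or -1 if the two stack pointer values do not fit a callee-cleanup pattern.
int CountStackParams(uint64_t sp_at_entry, uint64_t sp_after_return,
                     unsigned word_size) {
  if (word_size == 0 || sp_after_return < sp_at_entry + word_size) return -1;
  // Returning pops the return address (one word) plus the parameter area.
  const uint64_t cleaned = sp_after_return - sp_at_entry - word_size;
  if (cleaned % word_size != 0) return -1;
  return static_cast<int>(cleaned / word_size);
}
```

For example, a 32-bit callee entered with the stack pointer at 0x1000 that leaves it at 0x1018 after returning has popped 4 bytes of return address and 20 bytes of parameters, i.e. five parameters, even if its body never reads any of them.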
The AST generation algorithm seems to perform boolean expression normalization when generating conditions. The implementation of the normalization algorithm seems to have significant performance issues that need to be addressed. If non-trivial transformations are required, the incorporation of an external library for this task should be considered.
As it is, fcd only has very basic CFG recovery and doesn't handle a number of aspects of symbol information usually present in a binary. McSema has more capabilities in this regard and produces bitcode similar to that of fcd after lifting. It would definitely be worth checking whether bitcode lifted by McSema could be used in fcd in some way.
The main.cpp file hosts a lot of code which was either obsoleted by using Google Flags or does not work well with Remill's clang-format. Some refactoring is in order. This might also result in splitting main.cpp into smaller files.
As it is, fcd+remill produces pseudocode that contains __remill* intrinsics and leftover uses of the State and Memory variables and arguments. This makes fcd+remill produce superfluous code at best and crash in pseudocode generation at worst.
One way to fix this is to replace the calls to __remill* intrinsics with calls to new functions that do not use the State and Memory variables and let subsequent optimizations deal with the rest. Ideally this will solve both problems.
I don't have much experience in creating Python bindings to LLVM or C++ code in general, but the current generation of binding.cpp during build via fcd/python/bindings.py from llvm-c/Core.h seems a bit clumsy and easily broken when migrating to different versions of LLVM. Maybe worth a refactor?
To facilitate usage of modules that were lifted by other remill-based tools (most notably McSema), IR passes such as RemillArgumentRecovery and RemillStackRecovery need to be run outside of RemillTranslationContext::FinalizeModule(), ideally in RunPassPipeline() or a related function in main.cpp. This will allow the passes to be run independently of lifting.
Fcd currently uses a lot of dirty hacks in its calling convention detection IR passes and classes. Notable examples include wrapping an Executable class object in an LLVM ImmutablePass and having ParameterRegistryPass depend on it for target information. This shouldn't really be necessary or desirable. There also seems to be a lot of unused and possibly obsolete code.
As it is now, the analysis of function return types implemented in RecoverRetType() in the RemillArgumentRecovery pass is pretty limited and would benefit from some enhancing.
@pgoodman in #13 proposed the following:
I think you need to look at all paths leading to a ret. You might be able to simplify your life using some existing CFG pass that restructures the function. This may need to be applied to a clone of the function.
What I would consider looking into would be a combination inside-out and outside-in approach:
- look for all definitions of return registers that lead to a ret, add those to stored_regs[func].
- for every call to a func present in stored_regs[func], check if anything from stored_regs[func] is used on a path that is reachable from the call, then do ret_reg[func][reg] += 1 to add confidence to that register.
- If one function tail-calls another, then you may want to union together stored_regs[caller] += stored_regs[callee].
- In descending order of confidence, type check ret_regs[func] for each function.
- Some functions in stored_regs are never directly called, fall back to an approach like what you have, but expand it so that it looks beyond the terminal blocks.
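The bookkeeping in the proposal above can be sketched on a toy model. This is only an illustration under assumptions: functions and registers are plain strings, CallSite and RankReturnRegs are hypothetical names, and the per-function register sets and per-call usage sets are assumed to have been computed by some earlier analysis. The type checking and the fallback for functions that are never called directly are left out.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

using RegSet = std::set<std::string>;
using Ranking = std::vector<std::pair<std::string, int>>;

// Hypothetical summary of one call site: the callee and the registers that
// are read on some path reachable from the call.
struct CallSite {
  std::string callee;
  RegSet regs_used_after;
};

// stored_regs[f]: return registers defined on a path leading to a ret in f.
// tail_calls:     (caller, callee) pairs.
// Returns, per function, its candidate registers in descending confidence.
std::map<std::string, Ranking> RankReturnRegs(
    std::map<std::string, RegSet> stored_regs,
    const std::vector<std::pair<std::string, std::string>> &tail_calls,
    const std::vector<CallSite> &calls) {
  // Union callee candidates into tail-calling callers until nothing changes,
  // so that chains of tail calls are covered as well.
  bool changed = true;
  while (changed) {
    changed = false;
    for (const auto &tc : tail_calls) {
      auto callee = stored_regs.find(tc.second);
      if (callee == stored_regs.end()) continue;
      const RegSet callee_regs = callee->second;  // copy: caller may be callee
      RegSet &caller_regs = stored_regs[tc.first];
      const std::size_t before = caller_regs.size();
      caller_regs.insert(callee_regs.begin(), callee_regs.end());
      changed = changed || caller_regs.size() != before;
    }
  }

  // Every candidate starts at confidence 0; each call site that uses the
  // register after the call adds one.
  std::map<std::string, std::map<std::string, int>> ret_regs;
  for (const auto &fn : stored_regs)
    for (const auto &reg : fn.second) ret_regs[fn.first][reg] = 0;
  for (const auto &call : calls) {
    auto it = stored_regs.find(call.callee);
    if (it == stored_regs.end()) continue;
    for (const auto &reg : it->second)
      if (call.regs_used_after.count(reg)) ++ret_regs[call.callee][reg];
  }

  // Sort candidates by descending confidence; ties keep name order.
  std::map<std::string, Ranking> ranked;
  for (const auto &fn : ret_regs) {
    Ranking order(fn.second.begin(), fn.second.end());
    std::stable_sort(order.begin(), order.end(),
                     [](const std::pair<std::string, int> &a,
                        const std::pair<std::string, int> &b) {
                       return a.second > b.second;
                     });
    ranked[fn.first] = order;
  }
  return ranked;
}
```

A later step would walk each ranking from the front and type check the candidates in that order.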
When trying to generate pseudocode from a clang-compiled IR module (with --module-in and -nooptimize), the AST generation pass in the backend dies due to what I assume to be an illegal write on expression->firstUse = this; in fcd/ast/expression_use.cpp.
The crash was produced using the following module.
miniz.zip
Consider the following code and a calling convention that uses no registers to pass parameters into a function:
int f(int p1, int p2, int p3, int p4, int p5) {
    int v1 = 5;
    return (v1 + p1 + p5) % 2;
}
int main(void) {
    return f(1,0,0,0,1);
}
In this hypothetical scenario, with the current approach, fcd will recover the above as roughly:
int f(int p1, int p2) {
    int v1 = 5;
    return (v1 + p1 + p2) % 2;
}
int main(void) {
    return f(1,1);
}
My question here is if we want to analyze callsites and try to recover the original function prototype or leave it as is and give the user the option to provide a function prototype in a header file, similarly to how the original fcd version handles prototypes of external functions.
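One possible callsite-based policy, sketched on a toy model. StackParamEvidence and RecoverStackParamCount are hypothetical names, and the slot indices and per-callsite counts are assumed to come from earlier analyses. The body contributes the highest stack slot it reads, and the callsites contribute the smallest number of slots that every caller sets up.

```cpp
#include <algorithm>
#include <set>
#include <vector>

// Hypothetical evidence gathered for one function (not an fcd data structure).
struct StackParamEvidence {
  std::set<unsigned> slots_read_in_body;        // e.g. {0, 4} for f above
  std::vector<unsigned> slots_set_up_at_calls;  // one count per call site
};

// Returns an estimate of the number of stack parameters.
unsigned RecoverStackParamCount(const StackParamEvidence &ev) {
  // The body proves that every slot up to the highest one read exists.
  unsigned from_body = 0;
  if (!ev.slots_read_in_body.empty())
    from_body = *ev.slots_read_in_body.rbegin() + 1;

  // Taking the minimum over call sites is a conservative choice: a caller
  // may store to more stack slots than the callee actually takes.
  unsigned from_calls = 0;
  if (!ev.slots_set_up_at_calls.empty())
    from_calls = *std::min_element(ev.slots_set_up_at_calls.begin(),
                                   ev.slots_set_up_at_calls.end());

  return std::max(from_body, from_calls);
}
```

For f above, a body that reads slots 0 and 4 together with a call site that sets up five slots gives five parameters, which matches the original prototype rather than the two-parameter one.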
The simple alias analysis pass housed in fcd/pass_asaa.* is currently littered with LLVM version #ifdef directives. There was some effort to keep it tidy, but LLVM's alias analysis has seen a lot of changes from 3.5 onwards. The pass could use a refactor.
This is more of a braindump for a longer-term way of using fcd. When I look at decompiled code, one thing I notice is that there are lots of libc bringup routines (e.g. __libc_start_main) that could probably be elided, but coming up with a good policy for what to elide and when is not straightforward.
Another thing that comes up is how to specify things like headers to fcd in order for it to do a better job with decompilation. Interactivity has come up as an option, and I think that might be how fcd originally did things.
I think that a nice alternative might be a more scripting-oriented approach. It would be similar-ish to interactive, but permit more re-use down the line. For things like libc stuff, you can have a file like linux.py or ELF.py that just "does the right thing" for eliding stuff. Scripting may also enable things like specifying headers.
I'm not sure if scripting should be done via embedding a Python interpreter (this is what PointsTo did, and it worked reasonably well; it would mean a new command-line argument, something like --script), or making fcd into a Python module (a bit harder, might make it easier to integrate with other stuff).
Any thoughts?
Header declaration parsing was disabled for llvm-3.8 and lower. The feature relies on the <clang/Index/CodegenNameGenerator.h> header which was introduced in llvm-3.9. The code in fcd/header_decls.cpp and fcd/clang/* needs a refactor anyway; maybe it would be worth investigating if the feature can work with llvm-3.8 and lower?
I've built FCD+Remill using a Debug build of LLVM 4.0 to investigate another issue, but encountered this first:
In pass_argrec_remill.cpp:341, in function ConvertRemillArgsToLocals():
auto pc_type = remill::AddressType(module);
auto arg_pc = remill::NthArgument(func, remill::kPCArgNum);
auto loc_pc = ir.CreateAlloca(pc_type, nullptr, "loc_pc");
arg_pc->replaceAllUsesWith(loc_pc);
Assertion failed: (New->getType() == getType() && "replaceAllUses of value with new value of different type!"), function doRAUW, file ../llvm/lib/IR/Value.cpp, line 375.
(lldb) p arg_pc->dump()
i64 %pc
(lldb) p loc_pc->dump()
%loc_pc = alloca i64
(lldb) p arg_pc->getType()->dump()
i64
(lldb) p loc_pc->getType()->dump()
i64*
The project structure and build of fcd have changed due to making fcd a subproject of Remill. The current CMake files work, but could definitely use a refactor to more closely resemble those of other Remill subprojects (McSema).
The RemillArgumentRecovery pass currently recovers function arguments based on the usage of argument-passing registers in the function body. Analyzing callsites would give us better results in a number of cases, e.g. for external functions.
As it stands, the RemillArgumentRecovery pass relies on a static table for information about register aliasing (e.g. RAX aliases RAX, EAX, AX, AH, AL). This approach isn't very flexible, and fcd+Remill would definitely benefit from an analysis that derives this information (and potentially more) from the State structure present in all functions lifted by Remill. This would also make it possible to refactor the RemillArgumentRecovery pass to work both before and after passes like LLVM's mem2reg.
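A minimal sketch of how aliasing could be derived from a structure layout instead of a static table, assuming each register is described by a byte offset and size within State. RegSlot, Overlaps and AliasesOf are hypothetical names, and the offsets in the usage note are made up for illustration; they are not Remill's actual layout. Two registers are treated as aliases when their byte ranges overlap.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical description of where a register lives inside State.
struct RegSlot {
  uint64_t offset;  // byte offset from the start of the structure
  uint64_t size;    // size in bytes
};

// Two slots alias when their half-open byte ranges intersect.
bool Overlaps(const RegSlot &a, const RegSlot &b) {
  return a.offset < b.offset + b.size && b.offset < a.offset + a.size;
}

// Returns the names of all registers (including `reg` itself) whose storage
// overlaps that of `reg`, in name order. Unknown registers yield an empty list.
std::vector<std::string> AliasesOf(const std::map<std::string, RegSlot> &layout,
                                   const std::string &reg) {
  std::vector<std::string> aliases;
  auto it = layout.find(reg);
  if (it == layout.end()) return aliases;
  for (const auto &entry : layout)
    if (Overlaps(it->second, entry.second)) aliases.push_back(entry.first);
  return aliases;
}
```

With RAX at offset 16 (8 bytes), EAX, AX and AL at offset 16 (4, 2 and 1 bytes) and AH at offset 17 (1 byte), RAX aliases all five names, while AH aliases AH, AX, EAX and RAX but not AL.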
Seems like certain basic blocks are disassembled more times than necessary. The issue should be reproducible using tests/retval-atoi.c.