
deepbindiff's Introduction

DeepBinDiff

This is the official repository for DeepBinDiff, a fine-grained binary diffing tool for x86 binaries. We will actively update it.

Paper

Please consider citing our paper.

Yue Duan, Xuezixiang Li, Jinghan Wang, and Heng Yin, "DeepBinDiff: Learning Program-Wide Code Representations for Binary Diffing", NDSS'2020

Requirements:

  • tensorflow (>= 1.14.0, < 2.0)
  • gensim
  • angr
  • networkx
  • lapjv

Run the tool

python3 src/deepbindiff.py --input1 path_to_the_first_binary --input2 path_to_the_second_binary --outputDir output/
  • For example, to compare O0 and O1 chroot binaries from coreutils v5.93, you may run:
python3 src/deepbindiff.py --input1 /home/DeepBinDiff/experiment_data/coreutils/binaries/coreutils-5.93-O0/chroot --input2 /home/DeepBinDiff/experiment_data/coreutils/binaries/coreutils-5.93-O1/chroot --outputDir output/
  • You can also use the src/analysis_in_batch.sh script to perform binary diffing in batches.

Misc

  1. IDA Pro or Angr?

We have both an IDA Pro version and an angr version. IDA Pro was used in order to directly compare with BinDiff, which also uses IDA Pro. The code here uses angr.

  2. Results?

Results are printed directly on the screen as "matched pairs" once the diffing is done. Each pair represents a matched pair of basic blocks in the two binaries. The numbers are the basic block indices, which can be found in the output/nodeIndexToCode file.

  3. CPU or GPU?

The current version uses the CPU only.

  4. NLP pre-training?

The current version uses an on-the-fly training process, meaning we only use the two input binaries for NLP training. Therefore, we don't need any pre-trained model. This eliminates the out-of-vocabulary (OOV) problem but slows down the process a bit.
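The on-the-fly scheme can be illustrated with a minimal, self-contained sketch (the token streams below are hypothetical, not taken from the tool): because the vocabulary is built from exactly the two input binaries, every token seen at diffing time is in-vocabulary by construction.

```python
# Minimal sketch of the on-the-fly vocabulary idea (hypothetical token streams,
# not the tool's actual data structures). DeepBinDiff builds its token
# vocabulary from the two input binaries only, so no token encountered during
# diffing can be out-of-vocabulary (OOV).

def build_vocabulary(token_streams):
    """Assign an index to every distinct token across all streams."""
    vocab = {}
    for stream in token_streams:
        for token in stream:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

# Hypothetical normalized token streams from binary 1 and binary 2.
tokens_bin1 = ["mov", "reg4", "imme", "call", "memset"]
tokens_bin2 = ["mov", "reg4", "reg4", "jmp", "imme"]

vocab = build_vocabulary([tokens_bin1, tokens_bin2])

# Every token from either binary is covered -- no OOV by construction.
assert all(t in vocab for t in tokens_bin1 + tokens_bin2)
```

The trade-off is that the embedding model must be retrained for every pair of binaries, which is where the slowdown mentioned above comes from.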

deepbindiff's People

Contributors

deepbindiff, vetsc, yueduan


deepbindiff's Issues

code KeyError

I ran this command:

python3 src/deepbindiff.py --input1 ./experiment_data/findutils/binaries/findutils-4.41-O2/find --input2 ./experiment_data/findutils/binaries/findutils-4.6-O2/find --outputDir ./result

It raises a KeyError:

Traceback (most recent call last):
  File "src/deepbindiff.py", line 243, in <module>
    main()
  File "src/deepbindiff.py", line 224, in main
    block_embeddings = cal_block_embeddings(blockIdxToTokens, blockIdxToOpcodeNum, blockIdxToOpcodeCounts, insToBlockCounts, tokenEmbeddings, reversed_dictionary)
  File "src/deepbindiff.py", line 118, in cal_block_embeddings
    tf_weight = opcodeCounts[token] / opcodeNum
KeyError: 'and'
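A defensive workaround (my own assumption about the intended behavior, not an official fix) is to treat tokens missing from opcodeCounts as contributing zero TF weight instead of raising. A minimal sketch with hypothetical counts, using names that mirror cal_block_embeddings in src/deepbindiff.py:

```python
# Sketch of a defensive workaround for the KeyError above (NOT an official
# fix): a token absent from opcodeCounts gets TF weight 0 instead of raising.
# opcodeCounts/opcodeNum are hypothetical stand-ins for the per-block values
# computed in cal_block_embeddings.

opcodeCounts = {"mov": 3, "push": 1}   # hypothetical per-block opcode counts
opcodeNum = 4                          # total opcodes in the block

def tf_weight(token):
    # dict.get avoids the KeyError when a token (e.g. 'and') was never counted
    return opcodeCounts.get(token, 0) / opcodeNum

assert tf_weight("mov") == 0.75
assert tf_weight("and") == 0.0   # previously raised KeyError: 'and'
```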

Problem About the Source Code

Thanks for your outstanding work, but I ran into some problems with your source code.
In preprocessing.py:

def offsetStrMappingGen(cfg1, cfg2, binary1, binary2, mneList):
    # count type of constants for feature vector generation

    # offsetStrMapping[offset] = strRef.strip()
    offsetStrMapping = {}

    # lists that store all the non-binary functions in bin1 and 2
    externFuncNamesBin1 = []
    externFuncNamesBin2 = []

    for func in cfg1.functions.values():
        if func.binary_name == binary1:
            for offset, strRef in func.string_references(vex_only=True):
                offset = str(offset)
                # offset = str(hex(offset))[:-1]
                if offset not in offsetStrMapping:
                    offsetStrMapping[offset] = ''.join(strRef.split())
        elif func.binary_name not in externFuncNamesBin1:
            externFuncNamesBin1.append(func.name)

def externBlocksAndFuncsToBeMerged(cfg1, cfg2, nodelist1, nodelist2, binary1, binary2, nodeDic1, nodeDic2, externFuncNamesBin1, externFuncNamesBin2, string_bid1, string_bid2):
    # toBeMerged[node1_id] = node2_id
    toBeMergedBlocks = {}
    toBeMergedBlocksReverse = {}

    # toBeMergedFuncs[func1_addr] = func2_addr
    toBeMergedFuncs = {}
    toBeMergedFuncsReverse = {}

    externFuncNameBlockMappingBin1 = {}
    externFuncNameBlockMappingBin2 = {}
    funcNameAddrMappingBin1 = {}
    funcNameAddrMappingBin2 = {}

    for func in cfg1.functions.values():
        binName = func.binary_name
        funcName = func.name
        funcAddr = func.addr
        blockList = list(func.blocks)

        if (binName == binary1) and (func.name in externFuncNamesBin1) and (len(blockList) == 1):
            for node in nodelist1:
                if (node.block is not None) and (node.block.addr == blockList[0].addr):
                    externFuncNameBlockMappingBin1[funcName] = nodeDic1[node]
                    funcNameAddrMappingBin1[funcName] = funcAddr

The non-binary functions are stored in externFuncNamesBin1, and their binary names are not binary1. So when execution reaches
if (binName == binary1) and (func.name in externFuncNamesBin1) and (len(blockList) == 1):, the condition will never be satisfied.

Where is /vec_all being created?

I was trying to execute the code and came across an error where:

embedding_file = "\\vec_all"

But this file is referenced without ever being created. How can I go about resolving this issue?

Any and every help is appreciated.

UPD: Resolved. The issue stems from the python3 command malfunctioning; using python in terminal commands instead seems to solve the problem.

How does DeepBinDiff deal with function names?

There are two types of function names: one is a string (memset below), and the other is derived from a memory address (sub_8084480). I couldn't find how DeepBinDiff handles them. Thank you.

push       eax
call       memset
push       eax
call       sub_8084480

Does the function 'normalization' handle that?

def normalization(opstr, offsetStrMapping):
    optoken = ''

    opstrNum = ""
    if opstr.startswith("0x") or opstr.startswith("0X"):
        opstrNum = str(int(opstr, 16))

    # normalize ptr
    if "ptr" in opstr:
        optoken = 'ptr'
        # nodeToIndex.write("ptr\n")
    # substitute offset with strings
    elif opstrNum in offsetStrMapping:
        optoken = offsetStrMapping[opstrNum]
        # nodeToIndex.write("str\n")
        # nodeToIndex.write(offsetStrMapping[opstr] + "\n")
    elif opstr.startswith("0x") or opstr.startswith("-0x") or opstr.replace('.','',1).replace('-','',1).isdigit():
        optoken = 'imme'
        # nodeToIndex.write("IMME\n")
    elif opstr in register_list_1_byte:
        optoken = 'reg1'
    elif opstr in register_list_2_byte:
        optoken = 'reg2'
    elif opstr in register_list_4_byte:
        optoken = 'reg4'
    elif opstr in register_list_8_byte:
        optoken = 'reg8'
    else:
        optoken = str(opstr)
        # nodeToIndex.write(opstr + "\n")
    return optoken
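Tracing through the function above answers the question: a named call target such as memset falls through to the final else branch and is kept verbatim, while an operand that reaches normalization as a raw hex address (an assumption about how the disassembler renders sub_8084480) becomes the generic imme token. A condensed, self-contained copy of normalization, with minimal hypothetical register lists, illustrates this:

```python
# Condensed, self-contained copy of normalization() from preprocessing.py,
# with minimal HYPOTHETICAL register lists, to show how the two call styles
# in the question are tokenized.

register_list_1_byte = {"al"}
register_list_2_byte = {"ax"}
register_list_4_byte = {"eax"}
register_list_8_byte = {"rax"}

def normalization(opstr, offsetStrMapping):
    opstrNum = ""
    if opstr.startswith(("0x", "0X")):
        opstrNum = str(int(opstr, 16))

    if "ptr" in opstr:                       # normalize ptr expressions
        return "ptr"
    elif opstrNum in offsetStrMapping:       # substitute offsets with strings
        return offsetStrMapping[opstrNum]
    elif opstr.startswith(("0x", "-0x")) or opstr.replace('.', '', 1).replace('-', '', 1).isdigit():
        return "imme"                        # immediates collapse to one token
    elif opstr in register_list_1_byte:
        return "reg1"
    elif opstr in register_list_2_byte:
        return "reg2"
    elif opstr in register_list_4_byte:
        return "reg4"
    elif opstr in register_list_8_byte:
        return "reg8"
    return str(opstr)                        # named symbols kept verbatim

# 'call memset': the symbol name survives (final else branch)
assert normalization("memset", {}) == "memset"
# 'call sub_8084480': assuming the operand is the raw address 0x8084480
assert normalization("0x8084480", {}) == "imme"
assert normalization("eax", {}) == "reg4"
```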

Unusable README file

The README file offers no information about how to actually use or evaluate this code base, nor does it provide any information on system/environment requirements.

Why can't this code run on a GPU?

Dear Sir or Madam,
When I run this code, it shows that it can only run on the CPU, but running the model on the CPU takes too much time. I want to know which part of the code prevents the model from running on the GPU, and whether it can be ported to the GPU. Thank you with my greatest respect.

Interpret the DeepBinDiff results

Hello,

I am writing to ask for some advice on interpreting the output of DeepBinDiff. In particular, I have two questions:

  1. The processing time seems too long. I use the following command:
➜  DeepBinDiff git:(master) ✗ python3 src/deepbindiff.py --input1 experiment_data/coreutils/binaries/coreutils-7.6-O0/true --input2 experiment_data/coreutils/binaries/coreutils-7.6-O3/true --outputDir output/

And the processing time is:

python3 src/deepbindiff.py --input1  --input2  --outputDir output/  63233.15s user 103785.18s system 1966% cpu 2:21:33.48 total

It takes quite a long time (we are running it on a 32-core server machine with 256GB RAM). Is it normal?

  2. The output of comparing true vs. true is as follows:
Reading...
time:  7696.551887512207
Saving embeddings...
Perform matching...
[[0.8654591  0.92791235 0.7441185  ... 0.9215279  0.9736992  0.97301173]
 [0.74939525 0.8753574  0.6855561  ... 0.9378971  0.95241654 0.9951242 ]
 [0.6706515  0.82596886 0.8066987  ... 0.805171   0.9380803  0.9999579 ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]
(1044, 1044)
matched pairs:
[[161, 875], [164, 867], [84, 828], [389, 1309], [71, 811], [346, 1287], [302, 1212], [208, 987],
 [90, 833], [74, 814], [218, 1292], [467, 1556], [110, 844], [91, 1102], [456, 1562], [279, 1196]
, [75, 816], [213, 1317], [264, 1166], [77, 815], [76, 819], [102, 834], [291, 1206], [70, 856],
[329, 1248], [602, 1578], [560, 1581], [455, 1584], [692, 1543], [222, 999], [375, 1301], [392, 1
218], [49, 789], [1, 809], [374, 1219], [635, 1724], [267, 1177], [458, 1541], [201, 977], [250,
1139], [597, 1699], [410, 1338], [257, 1154], [341, 1319], [244, 1161], [248, 1100], [203, 978],
[734, 1920], [546, 1644], [598, 1696], [304, 1327], [372, 1302], ...

May I ask how to interpret the results? I am familiar with BinDiff and expected a similar output format (function-level and binary-level similarity). Is it possible to convert the current output into a function- or binary-level similarity score? Thank you very much!
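DeepBinDiff itself only emits block-level matched pairs. One coarse way to derive a binary-level score from them (this is my own sketch, not an official metric of the tool) is a Dice-style ratio of matched blocks to total blocks:

```python
# Sketch: derive a coarse binary-level similarity score from DeepBinDiff's
# block-level "matched pairs" output. This is NOT an official metric of the
# tool, just a Dice-style ratio: 2*|matches| / (|blocks1| + |blocks2|).

def binary_similarity(matched_pairs, num_blocks1, num_blocks2):
    """Fraction of basic blocks covered by the matching, in [0, 1]."""
    return 2 * len(matched_pairs) / (num_blocks1 + num_blocks2)

# Hypothetical numbers: 400 matched pairs over 522 + 522 basic blocks.
pairs = [(i, i) for i in range(400)]
score = binary_similarity(pairs, 522, 522)
assert abs(score - 400 / 522) < 1e-9
```

A function-level score could be built the same way by first grouping block indices by the function they belong to, but that grouping is not part of the printed output.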

how to interpret the output?

It seems that DeepBinDiff produced the following files, but how should they be interpreted?
edgelist
edgelist_merged_tadw
nodeIndexToCode

thank you.
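Assuming edgelist uses the common node2vec-style format of one whitespace-separated pair of node indices per line (an assumption; the format is not documented), a minimal loader looks like:

```python
# Sketch for loading the `edgelist` output file. Format assumption (NOT
# documented): one whitespace-separated pair of basic-block node indices per
# line, the common node2vec-style edge-list format. `nodeIndexToCode` then
# maps each index back to the block's instructions.
import io

# Hypothetical contents of output/edgelist.
edgelist_text = "0 1\n1 2\n2 0\n"

def load_edges(fh):
    """Parse one (src, dst) index pair per non-empty line."""
    return [tuple(int(x) for x in line.split()) for line in fh if line.strip()]

edges = load_edges(io.StringIO(edgelist_text))
assert edges == [(0, 1), (1, 2), (2, 0)]
```

With the edges loaded, the indices in the printed "matched pairs" can be looked up in nodeIndexToCode to see which instructions each matched block contains.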

Bug with small binaries: "Sampler's range is too small"

I use DeepBinDiff as follows:

➜  DeepBinDiff git:(master) ✗ cat test.c 
#include<stdio.h>
void main()
{
    printf("hello world\n");
}
➜  DeepBinDiff git:(master) ✗ gcc test.c -o test1
➜  DeepBinDiff git:(master) ✗ gcc test.c -o test2
➜  DeepBinDiff git:(master) ✗ python3 src/deepbindiff.py --input1 ./test1 --input2 ./test2 --outputDir ./out

The full error output is as follows. It seems the problem is "Sampler's range is too small".

Traceback (most recent call last):
  File "/home/ling/.local/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/ling/.local/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1349, in _run_fn
    return self._call_tf_sessionrun(options, feed_dict, fetch_list,
  File "/home/ling/.local/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1441, in _call_tf_sessionrun
    return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.InvalidArgumentError: Sampler's range is too small.
     [[{{node nce_loss/LogUniformCandidateSampler}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "src/deepbindiff.py", line 233, in <module>
    main()
  File "src/deepbindiff.py", line 220, in main
    tokenEmbeddings = featureGen.tokenEmbeddingGeneration(article, blockBoundaryIndex, insnStartingIndices, indexToCurrentInsnsStart, dictionary, reversed_dictionary, opcode_idx_list)
  File "/mnt/hgfs/deepbindiff/DeepBinDiff/src/featureGen.py", line 392, in tokenEmbeddingGeneration
    embeddings = buildAndTraining(article, blockBoundaryIndex, insnStartingIndices, indexToCurrentInsnsStart, dictionary, opcode_idx_list)
  File "/mnt/hgfs/deepbindiff/DeepBinDiff/src/featureGen.py", line 348, in buildAndTraining
    _, loss_val = session.run([optimizer, loss], feed_dict=feed_dict)
  File "/home/ling/.local/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 957, in run
    result = self._run(None, fetches, feed_dict, options_ptr,
  File "/home/ling/.local/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1180, in _run
    results = self._do_run(handle, final_targets, final_fetches,
  File "/home/ling/.local/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1358, in _do_run
    return self._do_call(_run_fn, feeds, fetches, targets, options,
  File "/home/ling/.local/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Sampler's range is too small.
     [[node nce_loss/LogUniformCandidateSampler (defined at /mnt/hgfs/deepbindiff/DeepBinDiff/src/featureGen.py:310) ]]

Original stack trace for 'nce_loss/LogUniformCandidateSampler':
  File "src/deepbindiff.py", line 233, in <module>
    main()
  File "src/deepbindiff.py", line 220, in main
    tokenEmbeddings = featureGen.tokenEmbeddingGeneration(article, blockBoundaryIndex, insnStartingIndices, indexToCurrentInsnsStart, dictionary, reversed_dictionary, opcode_idx_list)
  File "/mnt/hgfs/deepbindiff/DeepBinDiff/src/featureGen.py", line 392, in tokenEmbeddingGeneration
    embeddings = buildAndTraining(article, blockBoundaryIndex, insnStartingIndices, indexToCurrentInsnsStart, dictionary, opcode_idx_list)
  File "/mnt/hgfs/deepbindiff/DeepBinDiff/src/featureGen.py", line 310, in buildAndTraining
    loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weights,
  File "/home/ling/.local/lib/python3.8/site-packages/tensorflow/python/ops/nn_impl.py", line 2046, in nce_loss
    logits, labels = _compute_sampled_logits(
  File "/home/ling/.local/lib/python3.8/site-packages/tensorflow/python/ops/nn_impl.py", line 1742, in _compute_sampled_logits
    sampled_values = candidate_sampling_ops.log_uniform_candidate_sampler(
  File "/home/ling/.local/lib/python3.8/site-packages/tensorflow/python/ops/candidate_sampling_ops.py", line 149, in log_uniform_candidate_sampler
    return gen_candidate_sampling_ops.log_uniform_candidate_sampler(
  File "/home/ling/.local/lib/python3.8/site-packages/tensorflow/python/ops/gen_candidate_sampling_ops.py", line 656, in log_uniform_candidate_sampler
    _, _, _op, _outputs = _op_def_library._apply_op_helper(
  File "/home/ling/.local/lib/python3.8/site-packages/tensorflow/python/framework/op_def_library.py", line 742, in _apply_op_helper
    op = g._create_op_internal(op_type_name, inputs, dtypes=None,
  File "/home/ling/.local/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 3319, in _create_op_internal
    ret = Operation(
  File "/home/ling/.local/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 1791, in __init__
    self._traceback = tf_stack.extract_stack()

IDA or Angr?

I want to run DeepBinDiff and have read some of your code.
It seems DeepBinDiff leverages angr to extract basic block information and generate the CFG, instead of IDA?

How to deal with instructions with different numbers of operands

Nice work.
However, I also have some questions about the instruction embedding. I hope to get your answer.

The instruction-embedding practice in your paper is: the opcode embedding times its TF-IDF weight, concatenated with the average of the operand embeddings. For example, if the dimension of a single opcode or operand embedding is 64, the final dimension of an instruction is 64+64=128 (I hope I understand correctly).

However, some opcodes have no operands, such as 'retn', 'pusha', 'popa', 'cdq', etc. How do you deal with such instructions? Ignore them, or concatenate a zero embedding as the operand part?

Thanks.
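The scheme described above can be sketched in a few lines (my reading of the paper; the zero-padding choice for operand-less opcodes is an assumption, not something the authors have confirmed):

```python
# Sketch of the instruction-embedding scheme described above, as I understand
# it from the paper: (TF-IDF * opcode embedding) ++ mean(operand embeddings).
# Padding operand-less opcodes (retn, cdq, ...) with zeros is an ASSUMPTION.

DIM = 4  # toy embedding dimension (the paper's example would use 64)

def instruction_embedding(opcode_vec, tfidf, operand_vecs):
    head = [tfidf * v for v in opcode_vec]          # weighted opcode part
    if operand_vecs:
        # element-wise average of the operand embeddings
        tail = [sum(col) / len(operand_vecs) for col in zip(*operand_vecs)]
    else:
        # no operands: pad with zeros so the dimension stays 2*DIM
        tail = [0.0] * DIM
    return head + tail                              # concat -> 2*DIM values

# 'retn' with no operands: second half is all zeros, dimension preserved.
emb = instruction_embedding([1.0] * DIM, 0.5, [])
assert emb == [0.5] * DIM + [0.0] * DIM
assert len(emb) == 2 * DIM
```

Either way, the key property is that every instruction embedding has the same fixed dimension, so blocks can be aggregated uniformly.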

No such file or directory: 'data/DeepBD/features'

Hello there,

Thanks a lot for providing such a nice tool. I am writing to report an error encountered when running it. Here is the error message I received:

 python3 src/deepbindiff.py --input1 experiment_data/coreutils/binaries/coreutils-7.6-O0/ls --input2 experiment_data/coreutils/binaries/coreutils-7.6-O3/ls --outputDir output/
....
....
Initialized
Average loss at step  0 :  134.17404174804688
Average loss at step  2000 :  5.122461198568344
Average loss at step  4000 :  3.3189247410297393
Average loss at step  6000 :  3.2604737248420714
Traceback (most recent call last):
  File "src/deepbindiff.py", line 230, in <module>
    main()
  File "src/deepbindiff.py", line 223, in main
    copyEverythingOver(outputDir, 'data/DeepBD/')
  File "src/deepbindiff.py", line 172, in copyEverythingOver
    copyfile(src_dir + node_features, dst_dir + node_features)
  File "/usr/lib/python3.6/shutil.py", line 121, in copyfile
    with open(dst, 'wb') as fdst:
FileNotFoundError: [Errno 2] No such file or directory: 'data/DeepBD/features'

Can anyone shed some light on this? Thank you very much!

Add requirements.txt

The codebase requires specific versions of tensorflow and gensim to run.
Please add the following requirements.txt to the repo:

tensorflow==1.15
gensim==3.8.3
angr
networkx
lapjv
scikit-learn

Memory Error

The sizes of the two binaries are 11.3MB and 18.2MB, which causes a MemoryError.

(screenshot: MemoryError traceback, 2022-01-21)

How to input half of the binaries into the model training

Hi, I am trying to run this code on a server, but the results are not good. I noticed this sentence in the paper: "We randomly select half of the binaries in our dataset for token embedding model training", but I cannot find a function in this code that loads half of the binaries of a dataset at once. Did I miss any important details? Or can this code only run on two binaries at a time?

requirements.txt?

Hey, I would love to get this up and running. It would seem that I'm using an incorrect version of TensorFlow, though.

I think having a requirements.txt would make installation easier.
