Git Product home page Git Product logo

libdb's Introduction

binary_tpl_detection

Dataset url: https://figshare.com/s/4a007e78f29243531b8c

Feature Extractor

  • The extractor extracts features from all binary files under a given directory and save features to a json file.
  • Input: directory
  • Output: two files, stored in a given target directory.
    • Information such as running time is stored in the status file.
    • Extracted features are stored in the features file, such as 9760608.json. The format of this json is a list of BinaryFile entity.
  • It is recommended to put your task code under consumer directory (in featureExtractor/bcat_client/src/main/java/thusca/bcat/client/consumer). See the example in consumer/BinFileFeatureExtractTest.java

Pre-requisites

Basic knowledge about Java Development, Springboot and Annotation Development.
For example, if you use IDE like VScode or Idea, basic java development environment need to be installed such as Java Extension Pack, MAVEN for JAVA. It should be noted that we use Lombok Annotation and Springboot in code that may depend on extensions Lombok Annotations Support and Spring Boot Tools for IDE to debug or run. Besides, LibmagicJnaWrapper depends on libmagic to get file type, please install this library and modify the paths in LibmagicJnaWrapper.java. It can be easily installed using apt/brew command on Linux/MacOS.

Build Artifact

Env:

  • Java: Java 11.
  • IntelliJ Idea. (We have found that the extractor artifact works well only under IntelliJ Idea to build the artifact. Tested successful under Windows IntelliJ Idea 2021.2)

Steps:

  1. Ghidra: 9.1.2. The file ghidra.jar is stored under /user/lib/ghidra.jar you should put it under /featureExtractor/bcat_client/lib first.
  2. Open Idea, open project "binary_lib_detection-main\featureExtractor". Wait until indexing finish, if error occurs, try reopen/clean the project.
  3. File -> Project Structure -> Project SDK, select Java SDK 11.
  4. File -> Project Structure -> Artifacts -> "+" -> jar -> from modules with dependencies -> Module ("bcat_client") -> Main Class ("ClientApplication") -> JAR files from libraries (select copy to the output directory and link via manifest) 5. The jars will be generated at path: featureExtractor\out\artifacts\bcat_client_jar, with bcat_client.jar inside.

Task

Methods for all tasks are stored under the directory /consumer. Building database: Code:Task2ExtractCoreFedora.java, Data: FedoraLib_Dataset. Set tha save path and get all features to build TPL feature database. We use the directory ../data/CoreFedoraFeatureJson0505 to represent the save path.

Run

Zip the bcat_client_jar folder and upload to a Linux server, unzip, and run:

java -jar bcat_client.jar

Note: Java 11 required.

Func similarity Model

This model is used to determine if two functions are similar based on Gemini Network.

Prepration and Data Data is stored in ../data/vector_deduplicate_gemini_format_less_compilation_cases.
or Cross-5C_Dataset.7z on figshare.

By default, we use the path ../data under main/torch to store the data. Please copy them under it.

Environment Step

The network is written using Torch 1.8 in Python 3.8. Torch installation is based on cuda 11.

conda create -n tpldetection python=3.8 ipykernel
bash
conda activate tpldetection
pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio==0.8.1 -f https://download.pytorch.org/whl/lts/1.8/torch_lts.html
pip install -r requirements.txt

Milvus v1.1.1(vector search engine) is necessary for function retrival. It requires docker 19.03 or higher ref: https://milvus.io/docs/v1.1.1/milvus_docker-gpu.md

sudo docker pull milvusdb/milvus:1.1.1-gpu-d061621-330cc6
mkdir -p /home/$USER/milvus/conf
cd /home/$USER/milvus/conf
wget https://raw.githubusercontent.com/milvus-io/milvus/v1.1.1/core/conf/demo/server_config.yaml

sudo docker run -d --name milvus_gpu_1.1.1 --gpus all \
-p 19530:19530 \
-p 19121:19121 \
-v /home/$USER/milvus/db:/var/lib/milvus/db \
-v /home/$USER/milvus/conf:/var/lib/milvus/conf \
-v /home/$USER/milvus/logs:/var/lib/milvus/logs \
-v /home/$USER/milvus/wal:/var/lib/milvus/wal \
milvusdb/milvus:1.1.1-gpu-d061621-330cc6

Run

Run the following command to train the model:

# train/validation dataset: /data/func_comparison/vector_deduplicate_our_format_less_compilation_cases/train_test
# test dataset: /data/func_comparison/vector_deduplicate_our_format_less_compilation_cases/valid
cd main/torch
bash run.sh

A trained model is saved under ../data/7fea_contra_torch_b128/saved_model/

Library detection

Database

Embedding

raw feature database: ../data/CoreFedoraFeatureJson0505

Embeddings: set the path ../data/CoreFedoraFeatureJson0505 as args.fedora_js. You can use mutilprocess to speed up and the code is writen in core_fedora_embeddings.py as follows:

with Pool(10) as p:
    p.starmap(core_fedora_embedding, [(i, True) for i in range(10)])

all embeddings are saved under the args.save_path. We use the path ../data/7fea_contra_torch_b128/core_funcs to represent it.

Indexing and Building Milvus dataset

run build_milvus_database.py to build function vector database using Mulvis.

the function get_bin_fcg is used to generate an indexing file containing binary to functions to accelarate.

get_bin2func_num generates an indexing from binary to the number of funtions in it.

Detection

Data: detection_targets. Firstly, extract features from APKs. See the method localExtractOSSPoliceApks in TaskProcessTargets.java under the directory consumer. We use the directory../data/detection_targets/feature_json to save all extracted features.

see the function detect_v2 in function_vector_channel. Other methods + FCG Filter can be seen in files xxx_afcg.py. Baselines are under the directory /related_work.

We combine basic feature channel (B2SFinder(basic features) + FCG Filter) and function vector channel together to report the final results.

All files named analyze_results.py are used to calculate precision and recall.

libdb's People

Contributors

lkpsg avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.