VulDetector

Description

  • VulDetector is a static-analysis tool that detects C/C++ vulnerabilities by comparing graphs at function granularity. At its core is a weighted feature graph (WFG) model, which characterizes a function with a small yet semantically rich graph. VulDetector first pinpoints vulnerability-sensitive keywords and uses them to slice the control flow graph (CFG) of a function, reducing the graph size without compromising security-related semantics. Each sliced subgraph is then characterized as a WFG, which captures both syntactic and semantic features, weighted by their degree of security relevance. Here we provide the key modules for WFG generation and comparison.
  • Please refer to our paper for more details.

Key Modules

  • DataPrepare: Extract function codes and CFGs from a program.
  • SenLocate: Locate the sensitive lines.
  • WFGParse: Generate WFGs from CFGs.
  • SimCompare: Compute the similarity of two WFGs.

Environment

  • Tested on Linux-3.10.0 (Red Hat 7.3) and Linux-4.15.0 (Ubuntu 18.04.3)

Setup

Install packages

  • Python packages (currently Python 2.7)
  • Necessary: hungarian, networkx
  • On-demand:
    • sklearn is required for determining keywords in stat_keywords.py
    • matplotlib is required for drawing graphs in code2graph.py (the drawing code is currently commented out)
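
Before running the modules, it can help to confirm which of these packages are importable. The helper below is a minimal sketch, not part of VulDetector; the function name missing_packages is hypothetical, and the import names are assumed to match the package names listed above.

```python
import importlib


def missing_packages(names):
    """Return the subset of `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Package names taken from the list above.
    required = ["hungarian", "networkx"]
    optional = ["sklearn", "matplotlib"]
    print("missing required: %s" % missing_packages(required))
    print("missing optional: %s" % missing_packages(optional))
```

An empty list for the required packages means the core modules should at least import; the optional ones only matter for stat_keywords.py and graph drawing.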

Setup LLVM

  • Download the source code of LLVM and Clang (6.0.1)
  • Use clang.path in llvm_clang to patch clang
    • xz -d cfe-6.0.1.src.tar.xz
    • tar -xvf cfe-6.0.1.src.tar
    • mv cfe-6.0.1.src clang
    • patch -p0 < clang.path
    • cp -r clang ./llvm-6.0.1/tools/
  • Build LLVM; refer to "Build LLVM and Clang" for instructions

Usage

Code Extraction

  • Extract the raw code of each function from the source code
  • Input: source code directory input/OpenSSL/ of the project
  • Output: output/OpenSSL_code_dir now contains the raw code for each function
  • Cmds
    • cd DataPrepare
    • python2.7 extract_func.py input/OpenSSL/ output/OpenSSL_code_dir input/OpenSSL/

CFG Description Generation

  • Generate raw CFG description for a project (e.g., OpenSSL).
  • Output: The cfg_desc.log records the parsed CFGs for the entire project
  • Cmds
    • cd input/OpenSSL
    • scan-build -enable-checker debug.DumpCFG make 2> data/cfg_desc/cfg_desc.log
  • NOTE: make sure the project is compiled with clang

CFG Extraction

  • Extract CFG description for each function from cfg_desc.log
  • Input: generated cfg_desc.log in the above step
  • Output: ../data/func_cfg now contains the CFGs for each function
  • Cmds
    • cd DataPrepare
    • python2.7 extract_cfg_desc.py ../data/cfg_desc/cfg_desc.log ../data/func_cfg/
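
To give a feel for what a per-function CFG description contains, the sketch below parses a small block listing into a successor map. It is an illustration only and not the logic of extract_cfg_desc.py: the SAMPLE text is hand-written in the style that clang's debug.DumpCFG checker typically prints (block headers such as [B1] followed by Preds/Succs lines), and the function name parse_cfg is hypothetical.

```python
import re

# Hand-written sample in the style of a clang CFG dump (an assumption,
# not output copied from cfg_desc.log).
SAMPLE = """\
 [B3 (ENTRY)]
   Succs (1): B2

 [B2]
   1: x
   T: if [B2.1]
   Preds (1): B3
   Succs (2): B1 B0

 [B1]
   1: free(p)
   Preds (1): B2
   Succs (1): B0

 [B0 (EXIT)]
   Preds (2): B1 B2
"""

# A block header is a line holding only "[Bn]" or "[Bn (LABEL)]".
BLOCK_RE = re.compile(r"^\s*\[(B\d+)(?: \([^)]*\))?\]\s*$")
SUCCS_RE = re.compile(r"^\s*Succs \(\d+\):(.*)$")


def parse_cfg(text):
    """Return {block_id: [successor block ids]} for one function's CFG dump."""
    succs = {}
    current = None
    for line in text.splitlines():
        header = BLOCK_RE.match(line)
        if header:
            current = header.group(1)
            succs[current] = []
            continue
        edge_line = SUCCS_RE.match(line)
        if edge_line and current is not None:
            succs[current] = re.findall(r"B\d+", edge_line.group(1))
    return succs


if __name__ == "__main__":
    print(parse_cfg(SAMPLE))
```

A successor map like this is enough to rebuild the graph structure, for example with networkx, before any slicing or weighting is applied.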

Sensitive Line Location

  • Get sensitive lines for a function
  • Input: source code of a function
  • Output: a list of matched keywords and their line numbers (line_no)
  • Cmds
    • cd ../SenLocate
    • python2.7 sensitive_parse.py ../data/func_code/cms_smime.c#small#do_free_upto#126.c
  • NOTE: You can provide your own keywords in sensitive_parse.py
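
The idea of matching keywords against source lines can be sketched as follows. This is a simplified illustration, not the code of sensitive_parse.py: the keyword list, the function name find_sensitive_lines, and the whole-word regex matching are all assumptions made for the example.

```python
import re

# Hypothetical keyword list for illustration; the real list lives in
# sensitive_parse.py and may differ.
KEYWORDS = ["free", "memcpy", "strcpy", "malloc"]


def find_sensitive_lines(code, keywords=KEYWORDS):
    """Return (keyword, line_no) pairs, with line_no counted from 1."""
    alternatives = "|".join(re.escape(k) for k in keywords)
    pattern = re.compile(r"\b(" + alternatives + r")\b")
    hits = []
    for line_no, line in enumerate(code.splitlines(), 1):
        for match in pattern.finditer(line):
            hits.append((match.group(1), line_no))
    return hits


EXAMPLE = """\
static void release(char *buf, char *src, size_t n)
{
    memcpy(buf, src, n);
    free(buf);
}
"""

if __name__ == "__main__":
    for keyword, line_no in find_sensitive_lines(EXAMPLE):
        print("%s at line %d" % (keyword, line_no))
```

Whole-word matching keeps identifiers such as freelist from being reported as a hit for free; whether the real script does the same is not stated here.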

WFG Generation

  • Generate WFGs from the CFG and source code.

  • Input: cfg_file (required), code_file, and sensitive_line_no. code_file can be set to 'no' and sensitive_line_no to '-1', depending on the use case; see the three cases below.

  • Output: Dumps the node attributes {lines, ast_feature, weight} of the WFG and stores the WFG as a dict in ../data/wfgs

  • i) leave code_file as 'no' to use the full graph as WFG (no slicing)

    • cd ../WFGParse
    • python2.7 code2graph.py ../data/func_cfg/cms_smime.c#do_free_upto no -1
  • ii) leave sensitive_line_no as '-1' to automatically search for sensitive lines and generate a WFG for each sensitive line.

    • python2.7 code2graph.py ../data/func_cfg/cms_smime.c#do_free_upto ../data/func_code/cms_smime.c#small#do_free_upto#126.c -1
  • iii) pass the specific sensitive_line_no to generate WFG

    • python2.7 code2graph.py ../data/func_cfg/cms_smime.c#do_free_upto ../data/func_code/cms_smime.c#small#do_free_upto#126.c 7
  • NOTE: Two key parameters, i.e., weigh_depth and decay_ratio, can be modified in config.py
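
How these two parameters might interact can be sketched with a breadth-first walk from the sensitive node. This is one plausible reading, stated as an assumption: here weigh_depth bounds how many hops from the sensitive node are kept, and decay_ratio multiplies the weight once per hop. The actual semantics in config.py and code2graph.py may differ, and the function name decayed_weights is hypothetical.

```python
from collections import deque


def decayed_weights(succs, start, weigh_depth, decay_ratio):
    """Weight nodes by hop distance from `start`, ignoring edge direction.

    Nodes farther than `weigh_depth` hops are left out (sliced away).
    This is an assumed interpretation, not VulDetector's documented behaviour.
    """
    # Build an undirected neighbour map so predecessors and successors both count.
    neighbours = {}
    for node, targets in succs.items():
        neighbours.setdefault(node, set())
        for target in targets:
            neighbours[node].add(target)
            neighbours.setdefault(target, set()).add(node)

    weights = {start: 1.0}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == weigh_depth:
            continue  # do not expand beyond the depth limit
        for nxt in sorted(neighbours.get(node, ())):
            if nxt not in weights:
                weights[nxt] = decay_ratio ** (depth + 1)
                queue.append((nxt, depth + 1))
    return weights


# Toy CFG as a successor map; B1 plays the role of the sensitive block.
cfg = {"B3": ["B2"], "B2": ["B1", "B0"], "B1": ["B0"], "B0": []}

if __name__ == "__main__":
    print(decayed_weights(cfg, "B1", weigh_depth=1, decay_ratio=0.5))
```

Under this reading, a larger weigh_depth keeps more of the function in the WFG, while a smaller decay_ratio concentrates the weight near the sensitive line.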

WFG Comparison

  • Compute the similarity of two WFGs
  • Input: WFG file path
  • Output: Similarity
  • Cmds
    • cd ../SimCompare
    • python2.7 cfgcmp.py ../data/wfgs/cms_smime.c#do_free_upto_-1 ../data/wfgs/cms_smime.c#do_free_upto_131
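
Since the hungarian package is a required dependency, the comparison presumably involves an optimal one-to-one matching between the nodes of two graphs. The sketch below illustrates that idea under stated assumptions and is not the algorithm in cfgcmp.py: it swaps in a brute-force search over assignments for the Hungarian algorithm (practical only for tiny graphs), scores node pairs by Jaccard overlap of their ast_feature entries, and represents a WFG as a plain list of node dicts. All of these choices, and the function names, are hypothetical.

```python
from itertools import permutations


def node_similarity(a, b):
    """Jaccard overlap of two nodes' feature sets (an illustrative choice)."""
    fa, fb = set(a["ast_feature"]), set(b["ast_feature"])
    if not fa and not fb:
        return 1.0
    return len(fa & fb) / float(len(fa | fb))


def graph_similarity(g1, g2):
    """Best one-to-one node matching score, normalised to [0, 1]."""
    small, large = (g1, g2) if len(g1) <= len(g2) else (g2, g1)
    if not large:
        return 1.0  # two empty graphs are treated as identical
    best = 0.0
    # Brute force over assignments; the Hungarian algorithm would find the
    # same optimum in polynomial time.
    for perm in permutations(range(len(large)), len(small)):
        score = sum(
            min(small[i]["weight"], large[j]["weight"])
            * node_similarity(small[i], large[j])
            for i, j in enumerate(perm)
        )
        best = max(best, score)
    total = max(sum(n["weight"] for n in g1), sum(n["weight"] for n in g2))
    return best / total if total else 0.0


# Toy graphs: g2 holds the same nodes as g1 in a different order.
g1 = [{"ast_feature": ["call", "free"], "weight": 1.0},
      {"ast_feature": ["if"], "weight": 0.5}]
g2 = [{"ast_feature": ["if"], "weight": 0.5},
      {"ast_feature": ["call", "free"], "weight": 1.0}]
g3 = [{"ast_feature": ["while"], "weight": 1.0}]

if __name__ == "__main__":
    print(graph_similarity(g1, g2))
    print(graph_similarity(g1, g3))
```

Weighting each matched pair by the smaller of the two node weights means that nodes close to the sensitive line dominate the score, which matches the intent of a weighted graph even if the real formula differs.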

Others

  • ./DataPrepare/stat_keywords.py is used to determine keywords for your own corpus.
  • ./DataPrepare/stat_ast_features.py is used to determine ast_features for your own corpus.

Contributors

cuilei, leontsui1987

vuldetector's Issues

LLVM setup

Hello, when configuring LLVM I could not find the clang.path file in the llvm directory.

cmake version

Which version of CMake is needed? I am using 3.16, but I get an error.

No input and output folders found

I searched all directories but could not find the input and output folders. Could you please provide a link to the dataset and the path where it needs to be placed?

Python Packages Version

Could you please share your Python package versions, for example the output of "pip2 list" or "conda list"?
