Git Product home page Git Product logo

cade's Introduction

CADE: Contrastive Autoencoder for Drifting detection and Explanation

The repository contains the code for detecting and explaining a specific type of concept drift (i.e., previously unseen families) in security applications like malware attribution and network intrusion classification.

Further details can be found in the paper "CADE: Detecting and Explaining Concept Drift Samples for Security Applications" by Limin Yang, Wenbo Guo, Qingying Hao, Arridhana Ciptadi, Ali Ahmadzadeh, Xinyu Xing, Gang Wang (USENIX Security 2021). We also include supplemental materials in the repo (USENIX_21_drifting_Supplementary_Materials.pdf) due to page limit. Check out http://liminyang.web.illinois.edu for up-to-date information on the project.

If you end up building on this research or code as part of a project or publication, please include a reference to the USENIX Security paper:

@inproceedings{yang2021cade,
    title = {CADE: Detecting and Explaining Concept Drift Samples for Security Applications},
    author = {Yang, Limin and Guo, Wenbo and Hao, Qingying and Ciptadi, Arridhana and Ahmadzadeh, Ali and Xing, Xinyu and Wang, Gang},
    booktitle = {Proc. of USENIX Security},
    year = {2021}
}

1. Installation

Before getting started we recommend setting up a Python 3.6.5 or 3.6.8 virtual environment (other Python 3.6 or above versions might also work but didn't test).

  • If you are using CPU-based tensorflow, install all required packages:

    pip install -r requirements-tensorflow-cpu.txt
    python setup.py install
  • If you are using GPU-based tensorflow, please try the following steps to setup:

    module load cuda-toolkit/9.0  # other versions might also work but didn't test
    # you may also try pyenv and virtualenv to create the virtual environment, here we use Anaconda
    conda create -n cade-gpu python=3.6.8
    conda activate cade-gpu
    pip install scipy==1.3.3
    pip install numpy==1.16.1
    pip install --ignore-installed tensorflow-gpu==1.12.0
    pip install keras==2.2.5
    pip install sklearn==0.23.2
    pip install matplotlib==3.1.2
    pip install seaborn==0.11.0
    pip install tqdm==4.49.0
    python setup.py install

2. Configuration

The preprocessed Drebin and IDS2018 dataset can be found under the data folder. If you prefer to modify the preprocessing step, you may download the original dataset here: https://www.sec.cs.tu-bs.de/~danarp/drebin/index.html and https://www.unb.ca/cic/datasets/ids-2018.html and fill out the configuration in cade/config.py.

3. Usage

There are a number of command line arguments to run our program:

$ python main.py -h
usage: main.py [-h] [--data DATA] [-c {mlp,rf}] [--stage {detect,explanation}]
               [--pure-ae {0,1}] [--quiet {0,1}] [--cae-hidden CAE_HIDDEN]
               [--cae-batch-size CAE_BATCH_SIZE] [--cae-lr CAE_LR]
               [--cae-epochs CAE_EPOCHS] [--cae-lambda-1 CAE_LAMBDA_1]
               [--similar-ratio SIMILAR_RATIO] [--margin MARGIN]
               [--display-interval DISPLAY_INTERVAL]
               [--mad-threshold MAD_THRESHOLD]
               [--exp-method {distance_mm1,approximation_loose}]
               [--exp-lambda-1 EXP_LAMBDA_1] [--mlp-retrain {0,1}]
               [--mlp-hidden MLP_HIDDEN] [--mlp-batch-size MLP_BATCH_SIZE]
               [--mlp-lr MLP_LR] [--mlp-epochs MLP_EPOCHS]
               [--mlp-dropout MLP_DROPOUT] [--newfamily-label NEWFAMILY_LABEL]
               [--tree TREE] [--rf-retrain {0,1}]

See cade/utils.py or run python main.py -h for detailed help. You may also check run_drebin_cade.sh for a bunch of examples.

4. Examples

4.1 Drift detection

  1. To get the detection performance of CADE on the Drebin dataset (iteratively choose one family from 8 families as the unseen family):

    ./run_drebin_cade.sh
    
    # After the shell script finished running
    python -u average_all_detection_results.py drebin 0
    # 0 means using CADE, while 1 means using Vanilla AE
  2. To get the detection performance of CADE on the IDS2018 dataset (iteratively choose one family from 3 families as the unseen family):

    ./run_ids_cade.sh
    
    # After the shell script finished running
    python -u average_all_detection_results.py IDS 0
  3. To get the detection performance of Vanilla Autoencoder on the Drebin dataset:

    ./run_drebin_pure_ae.sh
    
    # After the shell script finished running
    python -u average_all_detection_results.py drebin 1
  4. To get the detection performance of Vanilla Autoencoder on the IDS2018 dataset:

    ./run_ids_pure_ae.sh
    
    # After the shell script finished running
    python -u average_all_detection_results.py IDS 1

4.2 Drift explanation

  1. CADE explaining drift samples on the Drebin-Fakedoc setting (i.e., drebin_new_7):

    ./run_cade_exp_drebin_fakedoc.sh
    # It will generate reports/drebin_new_7/mask_distance_mm1_0.001.npz,
    # which is already provided.
    # This step is time-consuming and non-deterministic,
    # so we include the explanation output for saving reproduction time and easier comparison.
  2. CADE explaining drift samples on the IDS2018-Infiltration setting:

    ./run_cade_exp_ids_infiltration.sh
    # It will generate reports/IDS_new_Infilteration/mask_distance_mm1_0.001.npz,
    # which is already provided.
  3. Boundary-based explanation on the Drebin-Fakedoc setting:

    ./run_boundary_exp_drebin_fakedoc.sh
    # It will generate reports/drebin_new_7/mask_approximation_loose_0.001.npz,
    # which is already provided.
  4. Boundary-based explanation on the IDS2018-Infiltration setting:

    ./run_boundary_exp_ids_infiltration.sh
    # It will generate reports/IDS_new_Infilteration/mask_approximation_loose_0.001.npz,
    # which is already provided.
  5. Compare CADE with boundary-based explanation and random explanation (using distance as the evaluation metric)

    1. Drebin-FakeDoc
    # 1. To get original distance and CADE distance
    python -u evaluate_explanation_by_distance.py drebin_new_7 distance_mm1 0.001 1 0.1
    
    # 2. To get random explanation distance
    python -u evaluate_explanation_by_distance.py drebin_new_7 random 0.001 0 0.1
    # since we randomly run 100 times, there might be minor difference on the output.
    
    # 3. To get boundary-based explanation distance
    python -u evaluate_explanation_by_distance.py drebin_new_7 approximation_loose 0.001 0 0.1
    
    # 4. To get gradient-based explanation distance
    nohup python -u evaluate_explanation_by_distance.py drebin_new_7 gradient 0.001 0 0.1 \
    > logs/nohup-drebin_new_7-gradient-exp.log &
    1. IDS2018-Infiltration
    # 1. To get original distance and CADE distance
    nohup python -u evaluate_explanation_by_distance.py IDS_new_Infilteration distance_mm1 \
    0.001 1 0.1 > logs/nohup-IDS-distance-mm1-exp.log &
    
    # 2. To get random explanation distance
    nohup python -u evaluate_explanation_by_distance.py IDS_new_Infilteration random \
    0.001 0 0.1 > logs/nohup-IDS-random-exp.log &
    # since we randomly run 100 times, there might be minor difference on the output.
    
    # 3. To get boundary-based explanation distance
    nohup python -u evaluate_explanation_by_distance.py IDS_new_Infilteration \
    approximation_loose 0.001 0 0.1 > logs/nohup-IDS-boundary-exp.log &
    
    # 4. To get gradient-based explanation distance
    nohup python -u evaluate_explanation_by_distance.py IDS_new_Infilteration gradient \
    0.001 0 0.1 > logs/nohup-IDS-gradient-exp.log &

5. Contact

If you have any questions, please contact Limin ([email protected]).

6. Licensing

For ethical considerations, code and data is covered by a modified BSD 3-Clause License which restricts the use of the code to academic purposes and which specifically prohibits commercial applications.

Any redistribution or use of this software must be limited to the purposes of non-commercial scientific research or non-commercial education. Any other use, in particular any use for commercial purposes, is prohibited. This includes, without limitation, incorporation in a commercial product, use in a commercial service, or production of other artefacts for commercial purposes.

cade's People

Contributors

whyisyoung avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.