
malware-uncertainty's Introduction

Uncertainty Quantification for Android Malware Detectors

This code repository is for our ACSAC 2021 paper (to appear), entitled Can We Leverage Predictive Uncertainty to Detect Dataset Shift and Adversarial Examples in Android Malware Detection?.

Overview

Our aim is to explore uncertainty quantification as a means of hardening malware detectors in realistic environments (i.e., where natural adversaries exist). This approach is rarely investigated in the context of malware detection, where the properties of dataset shift differ from those in other domains (e.g., images). We are therefore motivated to evaluate the quality of the predictive uncertainties inherent in malware detectors under dataset shift. Specifically, we consider 4 Android malware detectors (DeepDrebin, MultimodalNN, DeepDroid and Droidetec) and 6 calibration methods (Vanilla, Temperature scaling, Monte Carlo dropout, Variational Bayesian inference, Deep Ensemble and Weighted Deep Ensemble). The dataset shift is specified as out-of-source data, temporal covariate shift or adversarial evasion attacks.
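To make the notion of predictive uncertainty concrete, here is a minimal sketch of how an ensemble of detectors can yield an uncertainty score for one app. It is an illustration only, not the implementation used in this repository: the function names and the example probabilities are hypothetical, and entropy of the averaged prediction is just one common uncertainty measure.

```python
import math


def ensemble_mean(probabilities):
    """Average the malware probabilities predicted by the ensemble members."""
    return sum(probabilities) / len(probabilities)


def binary_entropy(p, eps=1e-12):
    """Entropy (in bits) of a binary prediction with malware probability p."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def predictive_uncertainty(probabilities):
    """Return (mean probability, entropy of the mean) for one app."""
    mean = ensemble_mean(probabilities)
    return mean, binary_entropy(mean)


# Hypothetical outputs of three ensemble members for two apps.
agreeing = [0.97, 0.98, 0.99]      # members agree: low entropy
disagreeing = [0.05, 0.55, 0.90]   # members disagree: high entropy

if __name__ == "__main__":
    for name, probs in (("agreeing", agreeing), ("disagreeing", disagreeing)):
        mean, entropy = predictive_uncertainty(probs)
        print(f"{name}: mean={mean:.2f}, entropy={entropy:.3f} bits")
```

A well-calibrated detector would be expected to produce high entropy on shifted or adversarial inputs, so that such inputs could be flagged rather than trusted.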

Dependencies:

We develop the code on Windows and run it on Ubuntu 18.04. The code depends on Python 3.6. Other packages (e.g., TensorFlow) are listed in ./requirements.txt.

Configuration & Usage

1. Datasets

  • Three datasets are used: Drebin, VirusShare_Android_APK_2013 and Androzoo. Note that, for security reasons, the Android applications must be obtained by following each dataset's own access policy.

  For Drebin, the malicious APKs can be downloaded from the official website. We provide the sha256 hashes of a portion of the Drebin benign APKs; the corresponding APKs can be downloaded from Androzoo.

  For Androzoo, we use the dataset built by researchers Pendlebury et al. All APKs can be downloaded from Androzoo.

  For Virusshare, we use the file named VirusShare_Android_APK_2013.zip.

  For adversarial APKs, we resort to this repository.

  • We additionally provide the preprocessed data files, which are available at an anonymous url (the unzipped folder is ~213GB).

2. Configure

For convenience, we provide a conf (Windows) / conf-server (Ubuntu) file to assist customization (please pick one of the two and rename it to config). Before running, change the following settings:

  • Set project_root=/absolute/path/to/malware-uncertainty/.

  • Set database_dir=/absolute/path/to/datasets. Optionally, this directory holds the malware datasets (e.g., Drebin or Androzoo) with the following structure:

datasets
|---drebin
      |---malicious_samples  % malicious apps folder
      |---benign_samples     % benign apps folder
|---androzoo_tesseract
      |---malicious_samples
      |---benign_samples
      |   date_stamp.json    % date stamp for each app, we will provide
|---VirusShare_Android_APK_2013
      |---malicious_samples
      |---benign_samples
|---naive_data               % saving the preprocessed data files 
...
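Before launching a long run, it can help to confirm that database_dir matches the tree above. The following is a minimal sketch of such a check; the helper name missing_dirs and the EXPECTED_LAYOUT table are hypothetical and not part of the repository, and the location of date_stamp.json is left unchecked because the tree does not pin it down.

```python
from pathlib import Path

# Expected sub-folders, copied from the directory tree above.
EXPECTED_LAYOUT = {
    "drebin": ["malicious_samples", "benign_samples"],
    "androzoo_tesseract": ["malicious_samples", "benign_samples"],
    "VirusShare_Android_APK_2013": ["malicious_samples", "benign_samples"],
    "naive_data": [],
}


def missing_dirs(database_dir):
    """Return the expected directories that are absent under database_dir."""
    root = Path(database_dir)
    missing = []
    for dataset, subfolders in EXPECTED_LAYOUT.items():
        candidates = [Path(dataset)] + [Path(dataset) / s for s in subfolders]
        for rel in candidates:
            if not (root / rel).is_dir():
                missing.append(str(rel))
    return missing


if __name__ == "__main__":
    # Replace with the value of database_dir from your config file.
    for rel in missing_dirs("/absolute/path/to/datasets"):
        print("missing:", rel)
```

An empty result means every folder named in the tree exists; it does not verify that the folders contain any APKs.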

If no real apps are available, the preprocessed data files are enough to make the project work. In this case, two further configuration steps are needed:

  • Download the datasets folder from the anonymous url and put it in the project root directory, namely malware-uncertainty. Please note that this datasets folder is not necessarily the same directory as the database_dir set in the second step.
  • Download naive_data from the anonymous url and put it in the database_dir directory set in the second step (it needs to be unzipped: mv naive_data.tar.gz database_dir; cd database_dir; tar -xvzf naive_data.tar.gz ./).
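On platforms without a tar command (e.g., the Windows development setup), the unpacking step can be approximated in Python. This is a minimal sketch using the standard tarfile module; the helper name extract_archive is hypothetical, and the paths in the usage lines are placeholders for your own database_dir.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, target_dir):
    """Extract a .tar.gz archive into target_dir.

    Returns the sorted names of the entries present in target_dir afterwards.
    Only use this on archives from a source you trust.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(path=target)
    return sorted(p.name for p in target.iterdir())


if __name__ == "__main__":
    # Placeholder paths: point these at your own database_dir.
    archive = Path("/absolute/path/to/datasets/naive_data.tar.gz")
    if archive.exists():
        print(extract_archive(archive, archive.parent))
```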

3. Usage

We suggest creating a conda environment to run the code. The following instructions may be helpful:

  1. Create a new environment: conda create -n mal-uncertainty python=3.6
  2. Activate the environment and install dependencies: conda activate mal-uncertainty and pip install -r requirements.txt
  3. Run the experiments:
  • For training, all scripts are listed in ./run.sh
  • For producing the figures and table data, use the Python script ./experiments/table-figures.py (this part is not implemented for the malware detector Droidetec)

Warning

  • Feature extraction on Android applications is usually time consuming.
  • Two detectors (DeepDroid and Droidetec) consume a lot of RAM and computation because they use very long sequences to improve detection accuracy.

License && Issues

We will make our code publicly available under a formal license. For now, this is still ongoing work and we plan to report more results in the future. It is worth noting that we found two issues when checking our code:

  • No random seed is set, so the results cannot be reproduced exactly as in the paper; nevertheless, similar results can be achieved.
  • The training, validation, and test datasets are split randomly, so the results vary from run to run.
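One possible way to address both issues is to derive the split from a fixed seed. The sketch below is a suggestion under stated assumptions, not the repository's code: the function name split_indices, the 60/20/20 ratios and the seed value are all hypothetical. It uses only the Python standard library; seeds for NumPy and TensorFlow would need to be fixed separately to make training itself repeatable.

```python
import random


def split_indices(n_samples, ratios=(0.6, 0.2, 0.2), seed=0):
    """Shuffle range(n_samples) with a fixed seed and cut it into three parts.

    Returns (train, validation, test) lists of sample indices. The same
    n_samples, ratios and seed always produce the same split.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # local generator, global state untouched
    n_train = int(ratios[0] * n_samples)
    n_val = int(ratios[1] * n_samples)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]      # remainder goes to the test set
    return train, val, test


if __name__ == "__main__":
    train, val, test = split_indices(10, seed=0)
    print(len(train), len(val), len(test))
```

Saving the resulting index lists next to the preprocessed data would let later runs reuse exactly the same split.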

Contact

If you have any questions, please do not hesitate to contact us (Shouhuai Xu email: [email protected], Deqiang Li email: [email protected])

