
Explainable-AI-malware-detection

Malware detection with added explainability through saliency maps on Android APKs, using PyTorch and Androguard.

Requirements

To run the code you need to create a conda environment. You can do so by running the following commands after cloning the repository:

    conda env create --file ./conda-package-list.yml
    conda activate malware_detection_research

๐Ÿ—’๏ธ Note : we used Androguard 3.3.5 and not the version 4.0.2 because of a bug in the submodules not being recognized by PyLance.

โš ๏ธ Warning : this repository is developped as a python package. Thus, to run the scripts you need to be at the root of the repository and use the following syntax :

python -m folder.script

For example, to run the apk_to_image.py script with the required parameters, you need to run the following command:

python -m pre_processing.apk_to_image -t RGB -p random -e jpg

How to run the code

This repository contains multiple scripts that can be used to train a model, test it, and generate saliency maps.

๐Ÿ—’๏ธ Note : you can use the start_training.sh script and modify it to your needs. This script will run all the scripts in the correct order.

The usual workflow is:

  1. Place the APKs in the _dataset folder, in subdirectories corresponding to their nature. For example, our experiments used two datasets:

    • Random-split dataset: 30k_dataset, with four subdirectories: Goodware_Obf, Goodware_NoObf, Malware_Obf, Malware_NoObf.
    • Time-split dataset: 71k_dataset, with the same four subdirectories (Goodware_Obf, Goodware_NoObf, Malware_Obf, Malware_NoObf), but inside each of them the APKs are sorted by period: 2022_01, ..., 2022_12, 2023_01, ..., 2023_12, etc. The subdirectories are used to build the dataset and its labels. Their names are hardcoded, so you will need to change them in the code if you want to use different names (see model_training/train_test_model.py).

      โš ๏ธ Warning : if you have a time based dataset, you require an .csv file with a hash column to identify the APKs, a num_antivirus_malicious column to give how many detection one APK has on Virus Total (it's an int), first_submission_date with the date of the first submission of the APK on Virus Total in the format DD-MM-YYYY hh:mm and the column obfuscated (0 or 1) which indicates if the APK is obfuscated or not. You can find all the scripts to manipulate the dataset in pre_processing/dataset_manipulation/ folder. For indication, to organize the dataset you would first run pre_processing/dataset_manipulation/sort_by_period.py and then pre_processing/dataset_manipulation/select_data_by_period.py. The parameters are in the script usage section.

  2. Run pre_processing/apk_to_image.py to transform the APKs into images. You have to specify the image type (RGB or BW), the padding type (random, black, or white), and the image extension (jpg or png). Here's an example of how to run the script:

    python -m pre_processing.apk_to_image -t RGB -p random -e jpg

    This will create a folder in _images corresponding to the conversion you chose, with the nature of the APKs as subdirectories. For example, if you chose RGB with random padding, you will have the following structure:

    _images/Goodware_Obf_RGB_random/{apk_name}.jpg

    If you have a time-based dataset, you can use the --time_based flag. The images will then be created inside the train and test directories already produced by the sort_by_period.py script.

    python -m pre_processing.apk_to_image -t RGB -p random -e jpg --time_based
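
    For intuition, the general byte-to-image idea with random padding looks roughly like the sketch below. This is an illustration only, not the repository's exact algorithm (see pre_processing/apk_to_image.py for that):

        import math
        import numpy as np
        from PIL import Image

        def apk_to_rgb_image(apk_path, out_path):
            # Read the raw bytes of the APK.
            data = np.frombuffer(open(apk_path, "rb").read(), dtype=np.uint8)
            # Smallest square that holds every byte at 3 bytes (R, G, B) per pixel.
            side = math.ceil(math.sqrt(len(data) / 3))
            pad = side * side * 3 - len(data)
            # "Random" padding: fill the leftover pixels with random bytes.
            filler = np.random.randint(0, 256, pad, dtype=np.uint8)
            pixels = np.concatenate([data, filler]).reshape(side, side, 3)
            Image.fromarray(pixels, "RGB").save(out_path)

        apk_to_rgb_image("sample.apk", "sample.jpg")  # paths are placeholders
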
  3. Run pre_processing/create_train_test.py to automatically create the train and test directories with the given ratio (default 80:20 train/test).

    python -m pre_processing.create_train_test -r 0.8

    If you have a time-based dataset, you can use the --time_based flag. The images will then be sorted into the already created train and test directories without a random split.
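
    As an illustration of what such a split does, here is a minimal sketch (not the repository's script) that moves one class folder's images into train/ and test/ subfolders at a given ratio:

        import random
        import shutil
        from pathlib import Path

        def split_class_dir(class_dir: Path, ratio: float = 0.8) -> None:
            files = sorted(class_dir.glob("*.jpg"))
            random.shuffle(files)                 # random 80:20 split by default
            cut = int(len(files) * ratio)
            for subset, chunk in (("train", files[:cut]), ("test", files[cut:])):
                dest = class_dir.parent / subset / class_dir.name
                dest.mkdir(parents=True, exist_ok=True)
                for f in chunk:
                    shutil.move(str(f), str(dest / f.name))

        split_class_dir(Path("_images/Goodware_Obf_RGB_random"))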

  4. Run model_training/train_test_models.py to train and test ResNet18 and ResNet50 models over multiple epoch counts. You have to specify the image type (RGB or BW), the padding type (random, black, or white), and the image extension (jpg or png) so the script knows where to find the images. Here's an example of how to run the script:

    python -m model_training.train_test_models -t RGB -p random -ex jpg

    This will save the models in the _models folder, named {model_name}_{type}_{padding_type}_{epochs}_epochs_{extension}.pth. For example, if you chose RGB and random padding, you will get the following models:

    model_training/_models/resnet18_RGB_random_10_epochs_jpg.pth
    ...
    model_training/_models/resnet18_RGB_random_50_epochs_jpg.pth
    ...
    model_training/_models/resnet50_RGB_random_10_epochs_jpg.pth
    ...
    model_training/_models/resnet50_RGB_random_50_epochs_jpg.pth
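
    For reference, a minimal PyTorch sketch of the kind of fine-tuning such a script performs is shown below; the transforms, hyperparameters, and paths are assumptions, not values taken from model_training/train_test_models.py:

        import torch
        import torch.nn as nn
        from torch.utils.data import DataLoader
        from torchvision import datasets, models, transforms

        # ImageFolder infers the labels from the class subdirectories.
        # The "_images/train" path is an assumption for this sketch.
        transform = transforms.Compose([transforms.Resize((224, 224)),
                                        transforms.ToTensor()])
        train_set = datasets.ImageFolder("_images/train", transform=transform)
        loader = DataLoader(train_set, batch_size=32, shuffle=True)

        model = models.resnet18(weights=None)
        model.fc = nn.Linear(model.fc.in_features, len(train_set.classes))

        optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
        criterion = nn.CrossEntropyLoss()

        model.train()
        for epoch in range(10):
            for images, labels in loader:
                optimizer.zero_grad()
                loss = criterion(model(images), labels)
                loss.backward()
                optimizer.step()

        torch.save(model.state_dict(),
                   "model_training/_models/resnet18_sketch.pth")
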
  5. Run visualization/saliency.py to generate the saliency maps. You have to specify the image type (RGB or BW), the padding type (random, black, or white), the model name (resnet18 or resnet50), and the number of epochs (10, 20, ...). Here's an example of how to run the script:

    python -m visualization.saliency -t RGB -p random -mn resnet18 -e 5 -ex jpg

    This will save the saliency maps in the _saliency_maps folder.
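
    The classic gradient-based ("vanilla") saliency computation looks roughly like the sketch below; visualization/saliency.py may differ in its details:

        import torch

        def saliency_map(model, image):
            # image: a (1, 3, H, W) float tensor; returns an (H, W) map.
            model.eval()
            image = image.detach().clone().requires_grad_(True)
            scores = model(image)
            # Backpropagate the predicted class score to the input pixels.
            scores[0, scores.argmax()].backward()
            # Pixel importance = max absolute gradient across colour channels.
            return image.grad.abs().max(dim=1).values.squeeze(0)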

Experiment tracking

If you want to see the results of our experiments with TensorBoard (the built-in VS Code integration doesn't work with WSL), you can run the following command:

tensorboard --logdir=model_training/runs

โš ๏ธ Warning : you must be at the root of the repository to run this command.

๐Ÿ—’๏ธ Note : you can use tensorboard in the CLI if you're using a precompiled TensorFlow package (e.g you installed via pip.) See here for more details.

Acknowledgement

This project builds on the following works:

[1] Obfuscation detection for Android applications: we used its create_image.py and map-saturation.png to transform the APKs into images while developing our own method.

[2] Fast adversarial training using FGSM: based on the paper "Fast is better than free: Revisiting adversarial training" by Wong et al. We used its fast_adversarial.py to train our model.
