
Explainable-AI-malware-detection

Malware detection with added explainability through saliency maps on Android APKs, using PyTorch and Androguard.

Requirements

To run the code you need to create a conda environment. You can do so by running the following commands after cloning the repository:

    conda env create --file ./conda-package-list.yml
    conda activate malware_detection_research

๐Ÿ—’๏ธ Note : we used Androguard 3.3.5 and not the version 4.0.2 because of a bug in the submodules not being recognized by PyLance.

โš ๏ธ Warning : this repository is developped as a python package. Thus, to run the scripts you need to be at the root of the repository and use the following syntax :

python -m folder.script

For example, to run the apk_to_image.py script with the required parameters, you need to run the following command:

python -m pre_processing.apk_to_image -t RGB -p random -e jpg

How to run the code

This repository contains multiple scripts that can be used to train a model, test it, and generate saliency maps.

๐Ÿ—’๏ธ Note : you can use the start_training.sh script and modify it to your needs. This script will run all the scripts in the correct order.

The usual workflow is:

  1. Place the APKs in the _dataset folder, in subdirectories corresponding to their nature. For example, our experiments used two datasets:

    • Random-split dataset: 30k_dataset, with four subdirectories: Goodware_Obf, Goodware_NoObf, Malware_Obf, Malware_NoObf.
    • Time-split dataset: 71k_dataset, with the same four subdirectories (Goodware_Obf, Goodware_NoObf, Malware_Obf, Malware_NoObf), but inside each of them the APKs are sorted by period: 2022_01, ..., 2022_12, 2023_01, ..., 2023_12, etc. The subdirectories are used to build the dataset and its labels. Their names are hardcoded, so you will need to change them in the code if you want to use different names (see model_training/train_test_model.py).

      โš ๏ธ Warning : if you have a time based dataset, you require an .csv file with a hash column to identify the APKs, a num_antivirus_malicious column to give how many detection one APK has on Virus Total (it's an int), first_submission_date with the date of the first submission of the APK on Virus Total in the format DD-MM-YYYY hh:mm and the column obfuscated (0 or 1) which indicates if the APK is obfuscated or not. You can find all the scripts to manipulate the dataset in pre_processing/dataset_manipulation/ folder. For indication, to organize the dataset you would first run pre_processing/dataset_manipulation/sort_by_period.py and then pre_processing/dataset_manipulation/select_data_by_period.py. The parameters are in the script usage section.

  2. Run pre_processing/apk_to_image.py to transform the APKs into images. You have to specify the image type (RGB or BW), the padding type (random, black, or white), and the image extension (jpg or png). Here's an example of how to run the script:

    python -m pre_processing.apk_to_image -t RGB -p random -e jpg

    This will create a folder in _images corresponding to the conversion you chose, with the nature of the APKs as subdirectories. For example, if you chose RGB with random padding, you will have the following structure:

    _images/Goodware_Obf_RGB_random/{apk_name}.jpg

    If you have a time-based dataset, you can use the --time_based flag. The images will then be created inside the train and test directories already produced by the sort_by_period.py script.

    python -m pre_processing.apk_to_image -t RGB -p random -e jpg --time_based
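
    For intuition, the general byte-to-image idea with random padding looks roughly like the sketch below. This is an illustration only, not the repository's exact algorithm (see pre_processing/apk_to_image.py for that):

        import math
        import numpy as np
        from PIL import Image

        def apk_to_rgb_image(apk_path, out_path):
            # Read the raw bytes of the APK.
            data = np.frombuffer(open(apk_path, "rb").read(), dtype=np.uint8)
            # Smallest square that holds every byte at 3 bytes (R, G, B) per pixel.
            side = math.ceil(math.sqrt(len(data) / 3))
            pad = side * side * 3 - len(data)
            # "Random" padding: fill the leftover pixels with random bytes.
            filler = np.random.randint(0, 256, pad, dtype=np.uint8)
            pixels = np.concatenate([data, filler]).reshape(side, side, 3)
            Image.fromarray(pixels, "RGB").save(out_path)

        apk_to_rgb_image("sample.apk", "sample.jpg")  # paths are placeholders
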
  3. Run pre_processing/create_train_test.py to automatically create the train and test directories with the given ratio (default 80:20 train/test).

    python -m pre_processing.create_train_test -r 0.8

    If you have a time-based dataset, you can use the --time_based flag. The images will then be sorted into the already created train and test directories without a random split.
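
    As an illustration of what such a split does, here is a minimal sketch (not the repository's script) that moves one class folder's images into train/ and test/ subfolders at a given ratio:

        import random
        import shutil
        from pathlib import Path

        def split_class_dir(class_dir: Path, ratio: float = 0.8) -> None:
            files = sorted(class_dir.glob("*.jpg"))
            random.shuffle(files)                 # random 80:20 split by default
            cut = int(len(files) * ratio)
            for subset, chunk in (("train", files[:cut]), ("test", files[cut:])):
                dest = class_dir.parent / subset / class_dir.name
                dest.mkdir(parents=True, exist_ok=True)
                for f in chunk:
                    shutil.move(str(f), str(dest / f.name))

        split_class_dir(Path("_images/Goodware_Obf_RGB_random"))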

  4. Run model_training/train_test_models.py to train and test ResNet18 and ResNet50 models over multiple epoch counts. You have to specify the image type (RGB or BW), the padding type (random, black, or white), and the image extension (jpg or png) so the script knows where to find the images. Here's an example of how to run the script:

    python -m model_training.train_test_models -t RGB -p random -ex jpg

    This will save the models in the _models folder, named {model_name}_{type}_{padding_type}_{epochs}_epochs_{extension}.pth. For example, if you chose RGB and random padding, you will get the following models:

    model_training/_models/resnet18_RGB_random_10_epochs_jpg.pth
    ...
    model_training/_models/resnet18_RGB_random_50_epochs_jpg.pth
    ...
    model_training/_models/resnet50_RGB_random_10_epochs_jpg.pth
    ...
    model_training/_models/resnet50_RGB_random_50_epochs_jpg.pth
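
    For reference, a minimal PyTorch sketch of the kind of fine-tuning such a script performs is shown below; the transforms, hyperparameters, and paths are assumptions, not values taken from model_training/train_test_models.py:

        import torch
        import torch.nn as nn
        from torch.utils.data import DataLoader
        from torchvision import datasets, models, transforms

        # ImageFolder infers the labels from the class subdirectories.
        # The "_images/train" path is an assumption for this sketch.
        transform = transforms.Compose([transforms.Resize((224, 224)),
                                        transforms.ToTensor()])
        train_set = datasets.ImageFolder("_images/train", transform=transform)
        loader = DataLoader(train_set, batch_size=32, shuffle=True)

        model = models.resnet18(weights=None)
        model.fc = nn.Linear(model.fc.in_features, len(train_set.classes))

        optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
        criterion = nn.CrossEntropyLoss()

        model.train()
        for epoch in range(10):
            for images, labels in loader:
                optimizer.zero_grad()
                loss = criterion(model(images), labels)
                loss.backward()
                optimizer.step()

        torch.save(model.state_dict(),
                   "model_training/_models/resnet18_sketch.pth")
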
  5. Run visualization/saliency.py to generate the saliency maps. You have to specify the image type (RGB or BW), the padding type (random, black, or white), the model name (resnet18 or resnet50), and the number of epochs (10, 20, ...). Here's an example of how to run the script:

    python -m visualization.saliency -t RGB -p random -mn resnet18 -e 5 -ex jpg

    This will save the saliency maps in the _saliency_maps folder.
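
    The classic gradient-based ("vanilla") saliency computation looks roughly like the sketch below; visualization/saliency.py may differ in its details:

        import torch

        def saliency_map(model, image):
            # image: a (1, 3, H, W) float tensor; returns an (H, W) map.
            model.eval()
            image = image.detach().clone().requires_grad_(True)
            scores = model(image)
            # Backpropagate the predicted class score to the input pixels.
            scores[0, scores.argmax()].backward()
            # Pixel importance = max absolute gradient across colour channels.
            return image.grad.abs().max(dim=1).values.squeeze(0)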

Experiment tracking

If you want to see the results of our experiments with TensorBoard (the built-in VS Code integration doesn't work with WSL), you can run the following command:

tensorboard --logdir=model_training/runs

โš ๏ธ Warning : you must be at the root of the repository to run this command.

๐Ÿ—’๏ธ Note : you can use tensorboard in the CLI if you're using a precompiled TensorFlow package (e.g you installed via pip.) See here for more details.

Acknowledgement

This project builds on the following works:

[1] Obfuscation detection for Android applications: we used its create_image.py and map-saturation.png to transform the APKs into images while developing our own method.

[2] Fast adversarial training using FGSM: based on the paper "Fast is better than free: Revisiting adversarial training" by Wong et al. We used its fast_adversarial.py to train our model.
