
fouadtrad / leveraging-adversarial-samples-for-enhanced-classification-of-malicious-and-evasive-pdf-files


Repository for the paper "Leveraging Adversarial Samples for Enhanced Classification of Malicious and Evasive PDF Files" published in Applied Sciences, MDPI

Home Page: https://www.mdpi.com/2076-3417/13/6/3472

Jupyter Notebook 100.00%
evasion machine-learning malware pdf-malware evasive-samples

Leveraging-Adversarial-Samples-for-Enhanced-Classification-of-Malicious-and-Evasive-PDF-Files

Full Paper available at: https://www.mdpi.com/2076-3417/13/6/3472

Abstract:

The Portable Document Format (PDF) is considered one of the most popular formats due to its flexibility and portability across platforms. Although people have used machine learning techniques to detect malware in PDF files, the problem with these models is their weak resistance against evasion attacks, which constitutes a major security threat. The goal of this study is to introduce three machine learning-based systems that enhance malware detection in the presence of evasion attacks by substantially relying on evasive data to train malware and evasion detection models. To evaluate the robustness of the proposed systems, we used two testing datasets, a real dataset containing around 100 k PDF samples and an evasive dataset containing 500 k samples that we generated. We compared the results of the proposed systems to a baseline model that was not adversarially trained. When tested against the evasive dataset, the proposed systems provided an increase of around 80% in the F1-score compared to the baseline. This proves the value of the proposed approaches towards the ability to deal with evasive attacks.

Methodology:

In this study, we propose leveraging PDF samples that are known to be evasive to perform adversarial learning, where we train the malware classifier on data containing both evasive and non-evasive samples. Moreover, instead of relying on the certainty of the malware classifier’s predictions to detect evasion, we propose using this mix of evasive and non-evasive data to build standalone models that detect evasion. These two missing pieces are the focus of this study, and accordingly, we propose three approaches:

  1. Building a robust malware classifier by performing the training on a mix of evasive and non-evasive data.
  2. Building a hierarchical system that first classifies whether a PDF is evasive, and then checks for malware by forwarding the PDF either to a model that deals exclusively with evasive data or to another model that deals only with non-evasive data.
  3. Building a multi-label classifier that detects evasion and maliciousness simultaneously and independently instead of relying on two dependent models as in the second approach. This classifier is also trained on a combination of evasive and non-evasive data.
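The routing in the second approach can be sketched as follows. This is only an illustration of the control flow, not the repository's implementation: the function name `hierarchical_predict` and the three toy threshold models are hypothetical stand-ins for trained classifiers.

```python
from typing import Callable, Dict, Sequence

# A "model" here is any callable that maps a feature vector to 0 or 1.
Model = Callable[[Sequence[float]], int]


def hierarchical_predict(
    features: Sequence[float],
    evasion_model: Model,
    evasive_malware_model: Model,
    non_evasive_malware_model: Model,
) -> Dict[str, int]:
    """Stage 1 decides evasive vs. non-evasive; stage 2 routes the
    sample to the malware model that matches that decision."""
    is_evasive = evasion_model(features)
    if is_evasive:
        is_malicious = evasive_malware_model(features)
    else:
        is_malicious = non_evasive_malware_model(features)
    return {"evasive": is_evasive, "malicious": is_malicious}


# Hypothetical stand-in models: simple thresholds on made-up features,
# used only so that the routing is observable.
def toy_evasion_model(x: Sequence[float]) -> int:
    return int(x[0] > 0.5)


def toy_evasive_malware_model(x: Sequence[float]) -> int:
    return int(x[1] > 0.2)


def toy_non_evasive_malware_model(x: Sequence[float]) -> int:
    return int(x[1] > 0.8)


if __name__ == "__main__":
    for sample in ([0.9, 0.5], [0.1, 0.5]):
        print(sample, hierarchical_predict(
            sample,
            toy_evasion_model,
            toy_evasive_malware_model,
            toy_non_evasive_malware_model,
        ))
```

The same second feature value leads to different malware verdicts depending on which branch the first stage selects, which is the dependency between models that the third approach avoids.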

To implement these approaches, we collected training data from various sources; the samples fall under two categories: evasive and non-evasive. The training for the components of each system is shown in the figure below.
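As a rough sketch of how the two data categories could feed each component, the snippet below splits a labelled pool into one training set per model. The `Sample` type, the dictionary keys, and the toy rows are hypothetical and not taken from the repository; the sketch also assumes that the evasion detector in the second approach is trained on the full mix.

```python
from typing import Dict, List, NamedTuple, Tuple


class Sample(NamedTuple):
    features: Tuple[float, ...]  # hypothetical feature vector
    malicious: int               # 1 = malicious, 0 = benign
    evasive: int                 # 1 = evasive, 0 = non-evasive


def build_training_sets(samples: List[Sample]) -> Dict[str, list]:
    """Return (features, target) pairs for each component of the
    three approaches described above."""
    evasive = [s for s in samples if s.evasive == 1]
    non_evasive = [s for s in samples if s.evasive == 0]
    return {
        # Approach 1: one malware classifier trained on the full mix.
        "robust_classifier": [(s.features, s.malicious) for s in samples],
        # Approach 2: an evasion detector plus two specialised malware models.
        "evasion_detector": [(s.features, s.evasive) for s in samples],
        "evasive_malware_model": [(s.features, s.malicious) for s in evasive],
        "non_evasive_malware_model": [(s.features, s.malicious) for s in non_evasive],
        # Approach 3: one model with two independent labels per sample.
        "multi_label_classifier": [
            (s.features, (s.malicious, s.evasive)) for s in samples
        ],
    }


# Made-up rows, for illustration only.
samples = [
    Sample((0.1, 0.2), malicious=0, evasive=0),
    Sample((0.9, 0.8), malicious=1, evasive=0),
    Sample((0.4, 0.7), malicious=1, evasive=1),
    Sample((0.3, 0.1), malicious=0, evasive=1),
]
training_sets = build_training_sets(samples)

if __name__ == "__main__":
    for name, rows in training_sets.items():
        print(name, len(rows))
```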

Some Results

We test each of the systems against two datasets:

  1. One real dataset (100 k samples) collected from monitoring the network of a university campus.
  2. An evasive dataset (500 k samples) that we generated based on the data that we collected.

We compare the performance of the proposed systems with a baseline that was not adversarially trained. When testing against the first dataset, all systems perform similarly, which shows that the proposed systems still perform well on classic, non-evasive datasets.

When testing against the second dataset, the proposed approaches clearly outperform the baseline, which performs worse than random guessing.

These results highlight the importance of the suggested approaches and their resilience against PDF evasion attacks.
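The abstract reports this comparison in terms of F1-score. For readers unfamiliar with the metric, here is a minimal sketch of how it is computed from predictions. The label vectors below are made up for illustration and are not results from the paper; they only mimic a baseline that misses most malicious samples.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """F1 = harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 1 = malicious, 0 = benign.
y_true = [1, 1, 1, 1, 0, 0]
baseline_pred = [0, 0, 0, 1, 0, 0]  # misses three of four malicious samples
robust_pred = [1, 1, 1, 1, 0, 1]    # catches all four, one false alarm

if __name__ == "__main__":
    print("baseline F1:", round(f1_score(y_true, baseline_pred), 3))
    print("robust F1:  ", round(f1_score(y_true, robust_pred), 3))
```

Because F1 combines precision and recall, a model that lets most evasive malware through scores poorly even if it rarely raises false alarms.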

Citation

Any work that uses the data or code provided should cite the following paper:

F. Trad, A. Hussein, and A. Chehab, “Leveraging Adversarial Samples for Enhanced Classification of Malicious and Evasive PDF Files,” Applied Sciences, vol. 13, no. 6, p. 3472, Mar. 2023, doi: 10.3390/app13063472.
