
malicious_pdf_detection's Introduction

Malicious_pdf_detection

This project aims to detect whether a PDF file is clean or malicious.

To build your dataset, you can generate malicious PDF files from clean PDF files using the project https://github.com/jonaslejon/malicious-pdf. It is a project by jonaslejon (Jonas Lejon), maggick (maggick), and tonyarris (Tony Harris). For issues with generating malicious PDF files, please contact them or raise an issue on their repository.

Create two directories, maliciouspdf and cleanpdf, and place your malicious and clean PDF files in the corresponding directory.
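
The directory layout above could be walked and labeled roughly as follows. This is a minimal sketch, not the code in command_exec.py: the function name collect_labeled_pdfs and the label encoding (1 for malicious, 0 for clean) are assumptions made for illustration.

```python
from pathlib import Path


def collect_labeled_pdfs(root="."):
    """Return (path, label) pairs for the PDFs under maliciouspdf/ and cleanpdf/.

    The label encoding (1 = malicious, 0 = clean) is an assumption,
    not something the project documents.
    """
    labeled = []
    for folder, label in (("maliciouspdf", 1), ("cleanpdf", 0)):
        directory = Path(root) / folder
        if not directory.is_dir():
            continue  # skip a folder that has not been created yet
        for pdf in sorted(directory.iterdir()):
            # Keep regular files with a .pdf extension, case-insensitively.
            if pdf.is_file() and pdf.suffix.lower() == ".pdf":
                labeled.append((pdf, label))
    return labeled
```

Calling collect_labeled_pdfs() from the project root would then yield one entry per PDF, ready to be passed to a feature extraction step.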

  • command_exec.py iterates through every file in the maliciouspdf and cleanpdf folders.

  • feature_extraction.py extracts features from each PDF file based on its file structure. It uses the pdfid.py script, which is open source and part of peepdf.

  • classifier.py implements a Random Forest classifier and trains it on the data in pdfdataset_n.csv. 30% of the data is held out for testing. The observed accuracy is around 99%.
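
The training step described in the last bullet might look like the sketch below, which uses scikit-learn. It is an illustration under stated assumptions, not the contents of classifier.py: the helper name train_and_score, the number of trees, the fixed seed, and the stratified split are all assumptions, and synthetic data stands in for pdfdataset_n.csv because its column layout is not described here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def train_and_score(X, y, test_size=0.3, seed=0):
    """Fit a Random Forest on 70% of the data and score it on the other 30%."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y
    )
    # n_estimators=100 is an assumed setting, not the project's documented one.
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_train, y_train)
    return model, accuracy_score(y_test, model.predict(X_test))


# Synthetic stand-in for the real feature table (an assumption for this sketch).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model, accuracy = train_and_score(X, y)
print(f"held-out accuracy: {accuracy:.3f}")
```

The accuracy printed here describes the synthetic data only and says nothing about the roughly 99% reported for the project's dataset.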

We have already extracted the necessary features from these files into the dataset pdfdataset.csv. The file pdfdataset_n.csv is a min-max normalized version of it.
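
Min-max normalization rescales each feature column to the range [0, 1] using x' = (x - min) / (max - min). The repository does not show how pdfdataset_n.csv was produced, so the following is only a sketch of the general technique. The function name min_max_normalize and the choice to map constant columns to 0.0 are assumptions.

```python
def min_max_normalize(rows):
    """Scale each column of a list of numeric rows to the range [0, 1]."""
    if not rows:
        return []
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in rows:
        normalized.append([
            # A constant column has max == min; map it to 0.0 (an assumed
            # convention) to avoid dividing by zero.
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, lows, highs)
        ])
    return normalized


features = [[1, 10], [3, 10], [5, 30]]
print(min_max_normalize(features))
# prints [[0.0, 0.0], [0.5, 0.0], [1.0, 1.0]]
```

Applying the same idea to the numeric feature columns of pdfdataset.csv, while leaving the label column untouched, would give a table in the style of pdfdataset_n.csv.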

Please raise a PR if you have improvements for the project.


malicious_pdf_detection's Issues

How is pdfdataset_n.csv generated?

Hi, dear developer. Sorry, I have no background in machine learning. I would like to ask how pdfdataset_n.csv is generated. I don't see the relevant steps in the code. Is it pdfdataset.csv normalized to the range 0-1?
