
Comments (8)

ZaydH commented on August 19, 2024

The saved models are dataset specific. Since the authors of the SLEIPNIR dataset asked that it not be directly released, I did not release the models for it. I would be open to doing so, along with some other changes so that it works whether or not it is run with CUDA. If that would be useful, let me know.

You should be able to run with the bundled dataset. That dataset (not generated by me) is of dubious quality. It does, however, have a limited number of features, so training on it is quick. You can run it by calling:

python main.py 10 32 100 data/trial_mal.npy data/trial_ben.npy

The above command assumes you are calling it from the repo's root directory.
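
Before training, it can help to confirm what the two .npy files actually contain. The following is a minimal sketch, assuming they are plain NumPy arrays with one row per sample and one 0/1 column per feature; the thread does not document the layout, so treat that as an assumption. The helper name `describe_features` is hypothetical and not part of the repo.

```python
import numpy as np

def describe_features(x: np.ndarray) -> dict:
    """Summarize a 2-D feature matrix (rows = samples, columns = features)."""
    if x.ndim != 2:
        raise ValueError(f"expected a 2-D array, got shape {x.shape}")
    return {
        "n_samples": x.shape[0],
        "n_features": x.shape[1],
        # True only if every entry is exactly 0 or 1.
        "is_binary": bool(np.isin(x, (0, 1)).all()),
    }

# Small synthetic matrix standing in for a real dataset.
demo = np.array([[0, 1, 1],
                 [1, 0, 0]])
print(describe_features(demo))
```

From the repo's root directory, something like `describe_features(np.load("data/trial_mal.npy"))` should report the number of samples and features in the bundled malware set, provided the file has the assumed layout.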

from malwaregan.

bedangSen commented on August 19, 2024

> The saved models are dataset specific. Since the authors of the SLEIPNIR dataset asked that it not be directly released, I did not release the models for it. I would be open minded to do so with some other changes to make it work regardless of whether it is run with CUDA or not. If that would be useful, let me know.

Hey @ZaydH, that would be great! I followed the link you provided in your README.md to the repository with the SLEIPNIR dataset (https://github.com/ALFA-group/adv-malware-viz), and I have a few questions:

Are the contents of the datasets actual malware binaries? If so did you run the generate_vectors.py script to extract the features?

> You should be able to run with the bundled dataset. That dataset (not generated by me) is of dubious quality. It does have a limited number of features and would be quick to train. You can run it by calling:
> python main.py 10 32 100 data/trial_mal.npy data/trial_ben.npy

I haven't been able to figure out the format of the generated vectors from generate_vectors.py. But can I set the file path to the files generated from the generate_vectors.py script?

(Also wishing you a very Eid Mubarak! ^^)


ZaydH commented on August 19, 2024

Thanks for the kind words about Eid!

I am attending a conference at the moment, and generating the saved models would be difficult while I am away. I will add them when I am back home next week.

> Are the contents of the datasets actual malware binaries? If so did you run the generate_vectors.py script to extract the features?

The SLEIPNIR dataset only comes with the ~22K-dimension Boolean feature vectors. It does not come with the benign or malware binaries. There are services like VirusShare where developers can get access to malware binaries. Malwr is another site that used to provide malware binaries, but it has been down for a very long time and may never return.

I do not know how to generate the Boolean feature vectors from a compiled binary. I have seen other folks use Cuckoo, but I do not have any experience with it myself.

> I haven't been able to figure out the format of the generated vectors from generate_vectors.py. But can I set the file path to the files generated from the generate_vectors.py script?

I am not sure what the generate_vectors.py script is. I previously wrote my own script for merging the feature vectors for each SLEIPNIR malware. I uploaded build_datasets.py as a gist.


bedangSen commented on August 19, 2024

Regarding the extraction of the feature vectors, I have written a simple script based on the generate_vectors.py script I mentioned earlier. It takes in a collection of malicious and benign files and extracts the imported libraries to generate the feature vectors. I have written this module so that it is compatible with your MalGAN application.

My objective is to be able to develop a pipeline that goes from a malware binary to an adversarially generated malware binary. So far, I have the first part. So I can train your MalGAN model on the feature vectors based on a collection of my benign and malicious files. Once I have trained a model, I will pass it a malware file, and get the adversarially generated feature vector from your MalGAN model. The final step for my application is to reconstruct the malware from the generated feature vector and execute it in a sandboxed environment to make sure it retains its functionality.

Again, I am quite new to this so I just want to clarify: the saved models will give me an adversarially generated feature vector if I give it a malware feature vector?

P.S. This might be silly but I wanted to thank you for your beautifully written application. I learnt a lot from it! Especially the use of logging and argparse!


ZaydH commented on August 19, 2024

> Again, I am quite new to this so I just want to clarify: the saved models will give me an adversarially generated feature vector if I give it a malware feature vector?

The saved model will only do what you want if the feature vector you generate has the exact same set of features in the exact same order as the SLEIPNIR dataset (or whatever is used to train the model). Is that the case here? It feels like no, but I want you to be extra sure.
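
The ordering requirement above can be sketched as follows. This is only an illustration, assuming each feature has a name (for example, an imported library or API call) and that the feature order used at training time is available as a list. The names `to_feature_vector`, `training_features`, and the sample values are hypothetical and do not come from the repo or the SLEIPNIR dataset.

```python
from typing import Iterable, List, Sequence

def to_feature_vector(present: Iterable[str],
                      training_features: Sequence[str]) -> List[int]:
    """Encode `present` as a 0/1 vector in the training feature order.

    Features that are not in `training_features` cannot be represented
    by the trained model, so they are dropped.
    """
    present_set = set(present)
    return [1 if name in present_set else 0 for name in training_features]

# Hypothetical feature order fixed when the model was trained.
training_features = ["CreateFileA", "ReadFile", "WriteFile", "RegOpenKeyA"]

# Hypothetical features extracted from a new sample, in arbitrary order.
sample = {"WriteFile", "CreateFileA", "SomeUnknownCall"}

vector = to_feature_vector(sample, training_features)
print(vector)  # [1, 0, 1, 0]
```

If the extraction script produces a different feature set or order than the one the model was trained on, the positions in the vector stop meaning the same thing, which is why a model trained on one dataset does not transfer directly to vectors built another way.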

> Especially the use of logging and argparse!

I did not like how it was logging before. I changed most of this code to follow the paradigm in skorch.


vietvo89 commented on August 19, 2024

Hello guys

I am a newbie just starting research on malware evasion and detection. I am a bit confused about the dataset and how to handle it for training and verification. But it is interesting to read your conversation.

Hi Zay

I have requested the SLEIPNIR dataset through the GitHub repo you mentioned in the README file but have not received it yet. I also found that your training code runs on trial_mal.npy and trial_ben.npy. Are they from the SLEIPNIR dataset? If I am not mistaken, your MalGAN is built on Boolean feature vectors, which only record whether a feature is present or absent rather than keeping a specific value for each feature. Is that right? So I think your program may not be usable for training on feature-vector datasets like EMBER. Is that right? And it does not take the value of each feature into account?

Hi BedangSen
I think you intend to create a pipeline that can train directly on binary data, right? But how can you extract features from binary files? I think different malware types have different features, so you must know the generic matrix of various malware types to extract and map the features of each binary file. How do you do that?

Thanks


vietvo89 commented on August 19, 2024

Hello Zay and BedangSen

It seems that MalGAN exploits API features rather than static features of malware. If I am not mistaken, API features are based on API calls rather than static properties, so I think it makes sense to use Boolean feature vectors. Therefore, malware samples must be submitted to a tool like Cuckoo Sandbox to extract the API features of each sample. If so, there is no way to extract features directly from binary files for training, so BedangSen's idea may not work. Otherwise, you must change the architecture of the detector and generator to work with static features extracted directly from binary files. In that case, EMBER is a good, free dataset to work with.

https://github.com/yanminglai/Malware-GAN


ZaydH commented on August 19, 2024

The SLEIPNIR dataset is published by MIT. I do not have permission to share it. Abdullah al-Dujaili is super on top of responding. I recommend being a little patient with him as he is super busy.

The trial datasets I provide in this repo are directly from @yanminglai's repo. I only provide them because they give you something to try/demo. I am not sure how they were originally generated but I assume with Cuckoo. To be clear, this code only requires that the feature vectors be binary. Other than that, the code is dataset agnostic and you can generate the data however you want. Even if the original dataset did not contain only binary data, as long as you can binarize it, you'll be good to go.
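
As a rough illustration of the binarization step mentioned above, here is a minimal sketch using NumPy. It assumes the original features are counts or other non-negative values and that "present at least once" is a sensible rule; the threshold, the `binarize` helper, and the example data are all hypothetical, and the dtype the repo expects is not stated in this thread.

```python
import numpy as np

def binarize(x: np.ndarray, threshold: float = 0.0) -> np.ndarray:
    """Return a 0/1 array: 1 where a feature value exceeds `threshold`."""
    return (x > threshold).astype(np.uint8)

# Hypothetical count-valued features (rows = samples, columns = features).
counts = np.array([[0, 3, 1],
                   [7, 0, 0]])

binary = binarize(counts)
print(binary.tolist())  # [[0, 1, 1], [1, 0, 0]]
```

Thresholding discards the magnitude of each feature, so how well a dataset with real-valued features survives this step depends on the data and would need to be checked empirically.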

I am not familiar with the EMBER dataset. I cannot say whether it will work well or not.
