Comments (8)
The saved models are dataset specific. Since the authors of the SLEIPNIR dataset asked that it not be directly released, I did not release the models for it. I would be open minded to do so with some other changes to make it work regardless of whether it is run with CUDA or not. If that would be useful, let me know.
You should be able to run with the bundled dataset. That dataset (not generated by me) is of dubious quality. It does have a limited number of features and would be quick to train. You can run it by calling:
python main.py 10 32 100 data/trial_mal.npy data/trial_ben.npy
The above command assumes you are calling it from the repo's root directory.
from malwaregan.
The saved models are dataset specific. Since the authors of the SLEIPNIR dataset asked that it not be directly released, I did not release the models for it. I would be open minded to do so with some other changes to make it work regardless of whether it is run with CUDA or not. If that would be useful, let me know.
Hey @ZaydH that would be great! I followed the link in your README.md to the repository with the SLEIPNIR dataset (https://github.com/ALFA-group/adv-malware-viz), and I have a few questions:
Are the contents of the datasets actual malware binaries? If so, did you run the generate_vectors.py script to extract the features?
You should be able to run with the bundled dataset. That dataset (not generated by me) is of dubious quality. It does have a limited number of features and would be quick to train. You can run it by calling:
python main.py 10 32 100 data/trial_mal.npy data/trial_ben.npy
I haven't been able to figure out the format of the vectors generated by generate_vectors.py. But can I set the file path to the files generated by the generate_vectors.py script?
(Also wishing you a very Eid Mubarak! ^^)
Thanks for the kind words about Eid!
I am attending a conference at the moment, and generating the saved models would be difficult while I am away. I will add them when I am back home next week.
Are the contents of the datasets actual malware binaries? If so did you run the generate_vectors.py script to extract the features?
The SLEIPNIR dataset only comes with the ~22K-dimension Boolean feature vectors. It does not come with the benign or malware binaries. There are services like VirusShare where developers can get access to malware binaries. Malwr is another site that used to provide malware binaries, but it has been down for a very long time and may never return.
I do not know how to generate the Boolean feature vectors from a compiled binary. I have seen other folks use Cuckoo, but I do not have any experience with it myself.
I haven't been able to figure out the format of the generated vectors from generate_vectors.py. But can I set the file path to the files generated from the generate_vectors.py script?
I am not sure what the generate_vectors.py script is. I previously wrote my own script for merging the feature vectors for each SLEIPNIR malware. I uploaded build_datasets.py as a gist.
So regarding the extraction of the feature vectors, I have written a simple script based on the generate_vectors.py script I mentioned earlier. It takes in a collection of malicious and benign files and extracts the imported libraries to generate the feature vector. I have written this module so that it is compatible with your MalGAN application.
My objective is to be able to develop a pipeline that goes from a malware binary to an adversarially generated malware binary. So far, I have the first part. So I can train your MalGAN model on the feature vectors based on a collection of my benign and malicious files. Once I have trained a model, I will pass it a malware file, and get the adversarially generated feature vector from your MalGAN model. The final step for my application is to reconstruct the malware from the generated feature vector and execute it in a sandboxed environment to make sure it retains its functionality.
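As a rough illustration of the first stage described above, here is a minimal sketch of turning a set of imported names into a Boolean feature vector. It assumes the imports have already been extracted from the binary by some other tool; the function name `imports_to_vector` and the four-entry vocabulary are hypothetical and are not taken from generate_vectors.py or from MalGAN.

```python
def imports_to_vector(imported_names, vocabulary):
    """Return a 0/1 list with one entry per vocabulary item, in vocabulary order.

    `vocabulary` is the fixed, ordered list of features the model is trained on.
    Imports that are not in the vocabulary are silently ignored.
    """
    present = set(imported_names)
    return [1 if name in present else 0 for name in vocabulary]


# Hypothetical vocabulary and sample, for illustration only.
vocabulary = ["CreateFileA", "RegSetValueExA", "VirtualAlloc", "WriteFile"]
sample_imports = {"WriteFile", "CreateFileA", "ExitProcess"}

vector = imports_to_vector(sample_imports, vocabulary)
print(vector)  # [1, 0, 0, 1]
```

The same vocabulary, in the same order, would need to be used for every benign and malicious file so that column i always means the same feature.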
Again, I am quite new to this so I just want to clarify: the saved models will give me an adversarially generated feature vector if I give it a malware feature vector?
P.S. This might be silly but I wanted to thank you for your beautifully written application. I learnt a lot from it! Especially the use of logging and argparse!
Again, I am quite new to this so I just want to clarify: the saved models will give me an adversarially generated feature vector if I give it a malware feature vector?
The saved model will only do what you want if the feature vector you generate has the exact same set of features in the exact same order as the SLEIPNIR dataset (or whatever is used to train the model). Is that the case here? It feels like no, but I want you to be extra sure.
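To make the "same features, same order" requirement concrete, here is a minimal sketch that reorders the columns of a new feature matrix to match a reference feature list. It assumes that both datasets come with a list of feature names, which may not be the case for SLEIPNIR; `align_to_reference` and the example names are hypothetical. Filling missing features with 0 is also an assumption, and it may not be appropriate if many reference features are absent from the new data.

```python
import numpy as np


def align_to_reference(vectors, feature_names, reference_names):
    """Reorder the columns of `vectors` to follow `reference_names`.

    Features missing from the new data become all-zero columns; features that
    are not in the reference are dropped.
    """
    vectors = np.asarray(vectors)
    if vectors.ndim != 2 or vectors.shape[1] != len(feature_names):
        raise ValueError("expected one feature name per column")

    column_of = {name: i for i, name in enumerate(feature_names)}
    aligned = np.zeros((vectors.shape[0], len(reference_names)), dtype=vectors.dtype)
    for j, name in enumerate(reference_names):
        i = column_of.get(name)
        if i is not None:
            aligned[:, j] = vectors[:, i]
    return aligned


# Hypothetical example: the new data has features a, b, c; the model expects c, a, d.
new_data = np.array([[1, 0, 1]])
print(align_to_reference(new_data, ["a", "b", "c"], ["c", "a", "d"]))  # [[1 1 0]]
```

If the two feature lists barely overlap, the aligned vectors will be mostly zeros and a model trained on the reference dataset is unlikely to produce meaningful output for them.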
Especially the use of logging and argparse!
I did not like how it was logging before. I changed most of this code to follow the paradigm in skorch.
Hello guys
I am a newbie just starting out in research on malware evasion and detection. I am a bit confused about the dataset and how to handle it for training and verification. But it is interesting to read your conversation.
Hi Zay
I have requested the SLEIPNIR dataset from the GitHub repo mentioned in your README but have not received it yet. I also found that your training code runs on trial_mal.npy and trial_ben.npy. Are they from the SLEIPNIR dataset? If I am not mistaken, your MalGAN is built on Boolean feature vectors, which only record whether a feature is present or absent rather than keeping a specific value for each feature. Is that right? If so, I think your program may not be usable for training on feature vector datasets like EMBER. Is that right? And it does not take the value of each feature into account?
Hi BedangSen
I think you intend to create a pipeline that can train directly on binary data, right? But how can you extract features from binary files? I think different malware types have different features, so you must know the generic matrix of various malware types to extract and map the features of each binary file. How do you do that?
Thanks
Hello Zay and BedangSen
It seems that MalGAN exploits API features rather than static features of malware. If I am not mistaken, API features are based on API calls rather than static properties, so I think it makes sense to use Boolean feature vectors. Therefore, malware samples must be submitted to a tool like Cuckoo Sandbox to extract the API features of each sample. If so, there is no way to extract features from binary files directly for training, so BedangSen's idea may not work. Otherwise, you must change the architecture of the detector and generator to work with static features extracted directly from binary files. For that approach, EMBER is good and free to work with.
https://github.com/yanminglai/Malware-GAN
The SLEIPNIR dataset is published by MIT. I do not have permission to share it. Abdullah al-Dujaili is super on top of responding. I recommend being a little patient with him as he is super busy.
The trial datasets I provide in this repo are directly from @yanminglai's repo. I only provide them because they give you something to try/demo. I am not sure how they were originally generated but I assume with Cuckoo. To be clear, this code only requires that the feature vectors be binary. Other than that, the code is dataset agnostic and you can generate the data however you want. Even if the original dataset did not contain only binary data, as long as you can binarize it, you'll be good to go.
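As a minimal sketch of what "binarize it" could look like, the snippet below thresholds a numeric feature matrix into 0/1 values with NumPy and checks that the result is binary. The helper names `binarize` and `is_binary`, the threshold of 0, and the commented file paths are assumptions for illustration, not part of this repo; a sensible threshold depends on the dataset.

```python
import numpy as np


def binarize(features, threshold=0.0):
    """Map every value strictly above `threshold` to 1 and everything else to 0."""
    features = np.asarray(features, dtype=float)
    return (features > threshold).astype(np.float32)


def is_binary(features):
    """Return True if every entry is exactly 0 or 1."""
    return bool(np.isin(np.asarray(features), (0, 1)).all())


# Small made-up matrix of counts: 2 samples, 3 features.
counts = np.array([[0, 3, 0.5],
                   [2, 0, 0]])
binary = binarize(counts)
print(binary.tolist())    # [[0.0, 1.0, 1.0], [1.0, 0.0, 0.0]]
print(is_binary(binary))  # True

# The result could then be saved for training, e.g. (hypothetical path):
# np.save("data/my_mal.npy", binary)
```

Thresholding discards the magnitude of each feature, so whether this preserves enough signal for a given dataset is something to verify empirically.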
I am not familiar with the EMBER dataset. I cannot say whether it will work well or not.