Comments (6)
The optimized model is available in the large data download (https://pubdata.endgame.com/ember/ember_dataset_2018_2.tar.bz2). The LGB text file named ember_model_2018.txt is in there with the jsonl files:
$ tar xvf ember_dataset_2018_2.tar
x ember2018/
x ember2018/train_features_1.jsonl
x ember2018/train_features_0.jsonl
x ember2018/train_features_3.jsonl
x ember2018/test_features.jsonl
x ember2018/ember_model_2018.txt
x ember2018/train_features_5.jsonl
x ember2018/train_features_4.jsonl
x ember2018/train_features_2.jsonl
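For anyone who only wants to score samples with the released model rather than retrain it, the text file from the listing above can be loaded directly with LightGBM's Booster. A minimal sketch (the path ember2018/ember_model_2018.txt comes from the archive listing; the import is deferred so the existence check works even without lightgbm installed):

```python
import os

def load_ember_model(path):
    """Load the released LightGBM model if the file exists; else return None."""
    if not os.path.exists(path):
        return None
    import lightgbm as lgb  # deferred so the existence check needs no lightgbm
    return lgb.Booster(model_file=path)

model = load_ember_model("ember2018/ember_model_2018.txt")
```

The returned Booster can then be passed to model.predict() on vectorized feature rows.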
from ember.
Thanks for the quick reply! I assumed that ember_model_2018.txt was the "unoptimized" version. Seems like I've been using the right one all along!
To clarify: your research paper, EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models, mentions that:
From the vectorized features, we trained a gradient-boosted decision tree (GBDT) model using LightGBM with
default parameters (100 trees, 31 leaves per tree), resulting in fewer than 10K tunable parameters [14]. Model training
took 3 hours. Baseline model performance may be much improved with appropriate hyper-parameter optimization,
which is of less interest to us in this work.
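The paper's "fewer than 10K tunable parameters" figure can be sanity-checked with a back-of-envelope calculation (a sketch; the exact count depends on how one counts split thresholds, but the leaf outputs alone give the order of magnitude):

```python
def max_leaf_params(num_trees: int, leaves_per_tree: int) -> int:
    """Upper bound on the tunable leaf-output parameters of a GBDT:
    one learned output value per leaf, per tree."""
    return num_trees * leaves_per_tree

# LightGBM defaults cited in the EMBER paper: 100 trees, 31 leaves per tree
print(max_leaf_params(100, 31))  # 3100, comfortably under 10K
```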
Furthermore, in the ember source code, the optimized version would have been saved via lgbm_model.save_model(os.path.join(args.datadir, "optimised_model.txt")). Therefore, I assumed that ember_model_2018.txt was the original "unoptimized" version. Hence the clarification!
Ah. That makes sense. That quote was written and released with EMBER 2017. In that case, we only released the default model and I don't have any optimized model anymore to post. We did not release a new paper for the EMBER 2018 release. For this one, the model is already optimized with this grid search:
https://docs.google.com/presentation/d/1A13tsUkgWeujTy9SD-vDFfQp9fnIqbSE_tCihNPlArQ/edit#slide=id.g6318784c2c_0_1131
I see. So just to confirm: the ember_model_2018.txt inside https://pubdata.endgame.com/ember/ember_dataset_2018_2.tar.bz2 is the unoptimized LGB model?
My problem is that I am unable to train the LGB model locally due to Out-Of-Memory (OOM) issues, hence I am asking around for the optimized_model.txt so I can just load it in.
Once again, I am wondering if anyone out there has successfully trained LGB with the --optimize flag, arrived at the following best params, and is able to share the resulting optimized_model.txt.
From the slides shared by Phil:
best_params = {
"boosting": "gbdt",
"objective": "binary",
"num_iterations": 1000,
"learning_rate": 0.05,
"num_leaves": 2048,
"feature_fraction": 0.5,
"bagging_fraction": 1.0,
"max_depth": 15,
"min_data_in_leaf": 50,
}
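For anyone stuck at the same point, a minimal training sketch with those parameters follows. It assumes ember's read_vectorized_features helper, which memory-maps the vectorized .dat files and so keeps peak RAM far below re-parsing the jsonl files (this may also help with the OOM problem); imports are deferred so the module stays importable without lightgbm/ember installed:

```python
BEST_PARAMS = {  # from Phil Roth's CAMLIS 2019 slides
    "boosting": "gbdt",
    "objective": "binary",
    "num_iterations": 1000,
    "learning_rate": 0.05,
    "num_leaves": 2048,
    "feature_fraction": 0.5,
    "bagging_fraction": 1.0,
    "max_depth": 15,
    "min_data_in_leaf": 50,
}

def train_optimized(data_dir: str):
    """Train an EMBER 2018 model with the grid-search parameters above."""
    import lightgbm as lgb
    import ember

    # read_vectorized_features memory-maps the vectorized feature files
    X_train, y_train = ember.read_vectorized_features(data_dir, subset="train")
    train_rows = y_train != -1  # drop unlabeled rows, as ember's train script does
    dataset = lgb.Dataset(X_train[train_rows], y_train[train_rows])
    return lgb.train(BEST_PARAMS, dataset)
```

Training at num_leaves=2048 and 1000 iterations is still compute-heavy, but memory use is dominated by the dataset construction, not the booster.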
After taking another look at the ember source code for the train script, I realized that the default set of parameters is already the optimized set. Compare what Phil had in his CAMLIS 2019 presentation (posted in the comment above) with the original params:
Default train script:
params = {
"boosting": "gbdt",
"objective": "binary",
"num_iterations": 1000,
"learning_rate": 0.05,
"num_leaves": 2048,
"max_depth": 15,
"min_data_in_leaf": 50,
"feature_fraction": 0.5
}
The only difference is the addition of "bagging_fraction": 1.0, which according to the LightGBM documentation is already the default value.
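This claim is easy to check mechanically (a sketch; 1.0 is indeed LightGBM's documented default for bagging_fraction):

```python
camlis_best_params = {  # from the CAMLIS 2019 slides
    "boosting": "gbdt", "objective": "binary", "num_iterations": 1000,
    "learning_rate": 0.05, "num_leaves": 2048, "feature_fraction": 0.5,
    "bagging_fraction": 1.0, "max_depth": 15, "min_data_in_leaf": 50,
}
train_script_params = {  # defaults in ember's train script
    "boosting": "gbdt", "objective": "binary", "num_iterations": 1000,
    "learning_rate": 0.05, "num_leaves": 2048, "max_depth": 15,
    "min_data_in_leaf": 50, "feature_fraction": 0.5,
}

# keys where the slides differ from the train script
diff = {k: v for k, v in camlis_best_params.items()
        if train_script_params.get(k) != v}
print(diff)  # {'bagging_fraction': 1.0} -- LightGBM's default value anyway
```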