
Comments (6)

mrphilroth commented on May 12, 2024

The optimized model is available in the large data download (available at https://pubdata.endgame.com/ember/ember_dataset_2018_2.tar.bz2). The LGB text file named ember_model_2018.txt is in there with the jsonl files:

$ tar xvf ember_dataset_2018_2.tar.bz2
x ember2018/
x ember2018/train_features_1.jsonl
x ember2018/train_features_0.jsonl
x ember2018/train_features_3.jsonl
x ember2018/test_features.jsonl
x ember2018/ember_model_2018.txt
x ember2018/train_features_5.jsonl
x ember2018/train_features_4.jsonl
x ember2018/train_features_2.jsonl

from ember.

wilsoncwj commented on May 12, 2024

Thanks for the quick reply! I assumed that the ember_model_2018.txt was the "unoptimized" version. Seems like I've been using the right one all along!

from ember.

wilsoncwj commented on May 12, 2024

To clarify, in your research paper EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models, it was mentioned that:

From the vectorized features, we trained a gradient-boosted decision tree (GBDT) model using LightGBM with
default parameters (100 trees, 31 leaves per tree), resulting in fewer than 10K tunable parameters [14]. Model training
took 3 hours. Baseline model performance may be much improved with appropriate hyper-parameter optimization,
which is of less interest to us in this work.

Furthermore, in your source code for ember, the optimized version would have been saved as lgbm_model.save_model(os.path.join(args.datadir, "optimised_model.txt"))

Therefore, I assumed that the ember_model_2018.txt was the original "unoptimized" version.
Hence the clarification!

from ember.

mrphilroth commented on May 12, 2024

Ah. That makes sense. That quote was written and released with EMBER 2017. In that case, we only released the default model and I don't have any optimized model anymore to post. We did not release a new paper for the EMBER 2018 release. For this one, the model is already optimized with this grid search:
https://docs.google.com/presentation/d/1A13tsUkgWeujTy9SD-vDFfQp9fnIqbSE_tCihNPlArQ/edit#slide=id.g6318784c2c_0_1131

from ember.

wilsoncwj commented on May 12, 2024

I see. So just to confirm the ember_model_2018.txt as part of https://pubdata.endgame.com/ember/ember_dataset_2018_2.tar.bz2 is the unoptimized LGB model?

My problem is that I am unable to train the LGB model locally due to Out-Of-Memory (OOM) issues, hence I am asking around for the optimized_model.txt so I can just load it in.

Once again, wondering if anyone out there has successfully trained LGB with the --optimize flag, arrived at the following best params, and can share the resulting optimized_model.txt?

From the slides shared by Phil:

best_params = {
    "boosting": "gbdt",
    "objective": "binary",
    "num_iterations": 1000,
    "learning_rate": 0.05,
    "num_leaves": 2048,
    "feature_fraction": 0.5,
    "bagging_fraction": 1.0,
    "max_depth": 15,
    "min_data_in_leaf": 50,
}

from ember.

wilsoncwj commented on May 12, 2024

After looking again at the ember source code for the train script, I realize that the default set of parameters is already the optimized set. Compare what Phil had in his CAMLIS 2019 presentation (posted in the comment above) with the original params:

Default train script:
params = {
    "boosting": "gbdt",
    "objective": "binary",
    "num_iterations": 1000,
    "learning_rate": 0.05,
    "num_leaves": 2048,
    "max_depth": 15,
    "min_data_in_leaf": 50,
    "feature_fraction": 0.5
}

The only difference is the addition of "bagging_fraction": 1.0, which according to the LightGBM documentation is the default value anyway.

from ember.
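The claim in the last comment can be checked mechanically. A quick stdlib-only diff of the two dicts as quoted above (slide params vs. train-script params):

```python
# Params from Phil's CAMLIS 2019 slides.
slide_params = {
    "boosting": "gbdt",
    "objective": "binary",
    "num_iterations": 1000,
    "learning_rate": 0.05,
    "num_leaves": 2048,
    "feature_fraction": 0.5,
    "bagging_fraction": 1.0,
    "max_depth": 15,
    "min_data_in_leaf": 50,
}

# Default params in the ember train script.
script_params = {
    "boosting": "gbdt",
    "objective": "binary",
    "num_iterations": 1000,
    "learning_rate": 0.05,
    "num_leaves": 2048,
    "max_depth": 15,
    "min_data_in_leaf": 50,
    "feature_fraction": 0.5,
}

# Keys present only on the slide; every shared key has an identical value.
extra = set(slide_params) - set(script_params)
print(extra)  # -> {'bagging_fraction'}
```

Since LightGBM's documented default for bagging_fraction is 1.0, the two configurations train identical models.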
