
Submodular Streaming Maximization


18.06.2021: Our paper has been accepted at ECML/PKDD 2021. Stay tuned for an updated version. 🤩

This repository contains the code for our paper Very Fast Streaming Submodular Function Maximization, which introduces a new algorithm for maximizing nonnegative submodular functions over streaming data. For our experiments, we also implemented existing state-of-the-art streaming algorithms for which we could not find an implementation. The code focuses on easy extensibility and accessibility: it is mainly written in header-only C++ with a Python interface via pybind11. Detailed documentation can be found here.

How to use this code

First, clone this repository recursively:

git clone --recurse-submodules git@github.com:sbuschjaeger/SubmodularStreamingMaximization.git

Using the Python Interface

If you use Anaconda to manage your Python packages, you can use the following commands to install all dependencies, including the Python interface PySSM:

source /opt/anaconda3/bin/activate 
conda env create -f environment.yml  --force
conda activate pyssm

Note the --force option, which overwrites any existing environment called pyssm. Alternatively, you can install this package via pip:

pip install -e .
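
A quick smoke test to verify that the bindings were built and installed correctly (these imports are the ones used in the example below):

python -c "from PySSM import RBFKernel, IVM, Greedy"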

Once installed, you can simply import the desired submodular function and optimizer from PySSM. For a detailed explanation of the specific parameters and functions provided, please have a look at the documentation and the source code. The following example uses the Greedy optimizer to select a data summary by maximizing the Informative Vector Machine (the full example can be found in tests/main.py):

import numpy as np

from PySSM import RBFKernel
from PySSM import IVM
from PySSM import Greedy

# A small toy dataset of two-dimensional points
X = [
    [0, 0],
    [1, 1],
    [0.5, 1.0],
    [1.0, 0.5],
    [0, 0.5],
    [0.5, 1],
    [0.0, 1.0],
    [1.0, 0.]
]

K = 3  # summary size
kernel = RBFKernel(sigma=1, scale=1)
ivm = IVM(kernel=kernel, sigma=1.0)
greedy = Greedy(K, ivm)

# Batch interface: select a summary of size K from X
greedy.fit(X)

# Alternatively, you can use the streaming interface:
# for x in X:
#     greedy.next(x)

fval = greedy.get_fval()
solution = np.array(greedy.get_solution())

print("Found a solution with fval = {}".format(fval))
print(solution)
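
For intuition: the Informative Vector Machine scores a summary S by the log-determinant of its regularized kernel matrix, f(S) = 1/2 * logdet(I + K_S / sigma^2). The following numpy sketch mimics this computation; the exact parametrization of the RBF kernel and the IVM inside PySSM may differ in constants, so treat it as an illustration rather than a reference implementation:

import numpy as np

def rbf(x, y, sigma=1.0, scale=1.0):
    # A common RBF parametrization; PySSM's may differ in constants
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return scale * np.exp(-np.dot(d, d) / (2.0 * sigma ** 2))

def ivm(S, sigma=1.0):
    # f(S) = 1/2 * logdet(I + K_S / sigma^2)
    K = np.array([[rbf(x, y) for y in S] for x in S])
    return 0.5 * np.linalg.slogdet(np.eye(len(S)) + K / sigma ** 2)[1]

# Score a candidate summary of size K = 3
print(ivm([[0, 0], [1, 1], [0.5, 1.0]]))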

Using the C++ interface

The C++ code is header-only, so simply include the desired headers in your project and you are good to go. Note that you need a C++17 compiler (e.g. gcc-7 or clang-5). If you have trouble compiling, you can look at the CMakeLists.txt file, which compiles the Python bindings as well as the test file. The following example uses the Greedy optimizer to select a data summary by maximizing the Informative Vector Machine (the full example can be found in tests/main.cpp):

#include <iostream>
#include <vector>

#include "FastIVM.h"
#include "RBFKernel.h"
#include "Greedy.h"

int main() {
    // A small toy dataset of two-dimensional points
    std::vector<std::vector<double>> data = {
        {0, 0},
        {1, 1},
        {0.5, 1.0},
        {1.0, 0.5},
        {0, 0.5},
        {0.5, 1},
        {0.0, 1.0},
        {1.0, 0.0}
    };

    unsigned int K = 3; // summary size
    FastIVM fastIVM(K, RBFKernel(), 1.0);

    Greedy greedy(K, fastIVM);
    greedy.fit(data);
    auto solution = greedy.get_solution();
    double fval = greedy.get_fval();

    std::cout << "Found a solution with fval = " << fval << std::endl;
    for (auto const &x : solution) {
        for (auto xi : x) {
            std::cout << xi << " ";
        }
        std::cout << std::endl;
    }

    return 0;
}
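
Since the code is header-only, no library needs to be built. To compile the example standalone, a command along these lines should work; the include path is an assumption, so consult CMakeLists.txt for the exact directories:

g++ -std=c++17 -O3 -I include tests/main.cpp -o main
./main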

How to reproduce the experiments in our paper

This repository combines code for multiple datasets and two different repositories. It is structured as follows:

  • experiment_runner: Contains the code for the experiment_runner package. This is a small tool which we use to run multiple experiments in our computing environment. Basically, the experiment_runner accepts a list of experiments and runs these either on multiple machines or in a multi-threaded environment. All results are written to a *.jsonl file (see the sketch after this list for how to load one). This code is sparsely documented; for a commented example, please have a look at experiments/kdd99/run.{cpp,py}.
  • include/: Contains the C++ implementation of all optimizers. You can find more documentation under docs/html/index.html, including some examples of how to use specific optimizers.
  • tests/run.{py,cpp}: Contains sample code for C++ and Python.
  • DATASET/run.py: Contains the code to run the experiments on the specific DATASET.
  • DATASET/init.sh: Some datasets must be downloaded manually. In that case, you can use the provided init.sh to do so.
  • explore_results.ipynb: A Jupyter notebook to plot the results of each experiment.
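
Each run produces a *.jsonl file with one JSON object per line. A minimal sketch for loading such a file into a pandas DataFrame; the path is hypothetical and the exact field names depend on the experiment:

import pandas as pd

# Hypothetical results path; adjust to the experiment you actually ran
results = pd.read_json("experiments/kddcup99/results.jsonl", lines=True)
print(results.head())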

All experiments have been performed via the Python interface. Assuming that you followed the instructions above, simply navigate to the specific folder and execute the run file:

cd experiments/kddcup99
./run.py

The experiments folder contains the code which runs the experiments on various datasets via the Python interface. You will probably need to download the data first, which can be done using the init.{sh,py} scripts provided in each folder. Some notes on this:

  • creditfraud is hosted on Kaggle, which requires the kaggle-api to be installed and configured with an API key (see the sketch after this list). It might be easier to manually download this dataset from Kaggle.
  • fact-highlevel is hosted on the FACT project page. To process these files, some additional packages are required, which should be installed via the conda environment. If not, please make sure to have pyfact installed and working. Also note that these files are rather large (> 2 GB), so the download may take some time.
  • fact-lowlevel requires even more tools and packages. Please contact [email protected] if you are interested in these files.
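
For the creditfraud dataset, a download via the kaggle CLI typically looks as follows. The dataset slug refers to the well-known Kaggle credit card fraud dataset and is an assumption; check the provided init.sh for the exact source used here:

# Requires a configured kaggle API key (~/.kaggle/kaggle.json)
kaggle datasets download -d mlg-ulb/creditcardfraud
unzip creditcardfraud.zip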

Once the data is downloaded, you can start the experiments by executing run.py in the respective folder. This file starts all experiments for a single dataset, including all hyperparameter configurations, and may take some time (usually a few hours per dataset) to finish. The experiments are currently configured to launch 15 threads via joblib, so make sure your hardware is strong enough, or reduce the number of cores by setting n_cores at the end of each file. After the experiments have finished, you can browse the results using the explore_results Jupyter notebook. Note that, depending on which experiments you actually ran, you may want to adjust the list of datasets used for plotting in the second cell of this notebook.

Citing our Paper

@misc{buschjäger2020fast,
    title={Very Fast Streaming Submodular Function Maximization},
    author={Sebastian Buschjäger and Philipp-Jan Honysz and Katharina Morik},
    year={2020},
    eprint={2010.10059},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
