(SIGCOMM '22) Practical GAN-based Synthetic IP Header Trace Generation using NetShare

Home Page: https://www.pcapshare.com/

License: BSD 3-Clause Clear License

Python 93.06% C 2.87% Shell 4.06%

gan gans gans-models generative-adversarial-network netflow netflow-data netflow-v9 pcap pcap-generator synthetic-data

netshare's Introduction

Practical GAN-based Synthetic IP Header Trace Generation using NetShare

[paper (SIGCOMM 2022)] [talk (SIGCOMM 2022)] [talk (ZeekWeek 2022)] [talk (FloCon 2023)] [web service demo]

Authors: [Yucheng Yin] [Zinan Lin] [Minhao Jin] [Giulia Fanti] [Vyas Sekar]

Abstract: We explore the feasibility of using Generative Adversarial Networks (GANs) to automatically learn generative models to generate synthetic packet- and flow header traces for networking tasks (e.g., telemetry, anomaly detection, provisioning). We identify key fidelity, scalability, and privacy challenges and tradeoffs in existing GAN-based approaches. By synthesizing domain-specific insights with recent advances in machine learning and privacy, we identify design choices to tackle these challenges. Building on these insights, we develop an end-to-end framework, NetShare. We evaluate NetShare on six diverse packet header traces and find that: (1) across distributional metrics and traces, it achieves 46% more accuracy than baselines, and (2) it meets users’ requirements of downstream tasks in evaluating accuracy and rank ordering of candidate approaches.

News

[2023.04] Woohoo! New version released with a list of new features:

Bump Python version to 3.9
Replace tensorflow 1.15 with torch
Support generic dataset formats
Add SDMetrics for hyperparameter/model selection and data visualization

[2022.08]: The deprecated camera-ready branch holds the scripts we used to run all the experiments in the paper.

Users

NetShare has been used by several independent users/companies.

Datasets

We are adding more datasets! Feel free to add your own and contribute!

Our paper uses six public datasets for reproducibility. Please download the six datasets here and put them under traces/.

You may also refer to the README for detailed descriptions of the datasets.

Setup

Step 0: Install `libpcap` dependency (Optional)

If you are working with PCAP files and have not yet installed libpcap, install it first.

On macOS, install using Homebrew:
```
brew install libpcap
```
On Debian-based systems (e.g., Ubuntu), install using apt:
```
sudo apt install libpcap-dev
```
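
To confirm that libpcap is visible to the system before installing NetShare, a quick check along these lines may help. This is a minimal sketch using Python's standard `ctypes.util.find_library`; the helper name `has_libpcap` is hypothetical and not part of NetShare.

```python
from ctypes.util import find_library


def has_libpcap() -> bool:
    """Return True if a libpcap shared library can be located on this system."""
    # find_library returns a library name or path, or None if nothing is found.
    return find_library("pcap") is not None


if __name__ == "__main__":
    if has_libpcap():
        print("libpcap found")
    else:
        print("libpcap not found; install it with brew or apt as shown above")
```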

Step 1: Install NetShare Python package (Required)

We recommend installing NetShare in a virtual environment (e.g., Anaconda3). We test in a virtual environment with Python 3.9.

```
# Assume Anaconda is installed
# Create virtual environment if not exists
conda create --name NetShare python=3.9

# Activate virtual env
conda activate NetShare

# Install NetShare package
git clone https://github.com/netsharecmu/NetShare.git
pip3 install -e NetShare/

# Install SDMetrics package
git clone https://github.com/netsharecmu/SDMetrics_timeseries
pip3 install -e SDMetrics_timeseries/
```

Step 2: How to start Ray? (Optional but strongly recommended)

Ray is a unified framework for scaling AI and Python applications. Our framework utilizes Ray to increase parallelism and distribute workloads among the cluster automatically and efficiently.

Laptop/Single-machine (only recommended for demo/dev/fun)

```
ray start --head --port=6379 --include-dashboard=True --dashboard-host=0.0.0.0 --dashboard-port=8265
```

Please go to http://localhost:8265 to view the Ray dashboard.
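
Before running the driver with Ray enabled, it can be useful to confirm that the head node is listening. The sketch below only checks whether a TCP port accepts connections, using the two ports from the `ray start` command above; `is_port_open` is a hypothetical helper, not a Ray or NetShare API.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    for name, port in (("Ray head", 6379), ("Ray dashboard", 8265)):
        state = "reachable" if is_port_open("127.0.0.1", port) else "not reachable"
        print(f"{name} (port {port}): {state}")
```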

Multi-machines (strongly recommended for faster training/generation)

We provide a utility script and README under util/ for setting up a Ray cluster. As a reference, we are using Cloudlab, which is referred to as a "custom cluster" in the Ray documentation. If you are using a different cluster (e.g., AWS, GCP, Azure), please refer to the Ray doc for full reference.

Example usage

We are adding more examples of usage (PCAP, NetFlow, w/ and w/o DP). Please stay tuned!

Here is a minimal working example to generate synthetic netflow files without differential privacy. Please change your working directory to examples/<sub_example> by cd examples/<sub_example>.

You may refer to examples for more scripts and config files.

Driver code

```
import netshare.ray as ray
from netshare import Generator

if __name__ == '__main__':
    # Set to True to use Ray; leave as False to run without it
    ray.config.enabled = False
    ray.init(address="auto")

    # configuration file
    generator = Generator(config="config_example_netflow_nodp.json")

    # `work_folder` should not exist; otherwise an overwrite error will be thrown.
    # Please set `work_folder` to an *absolute path*
    # if you are using Ray with a multi-machine setup,
    # since Ray has bugs when dealing with relative paths.
    generator.train(work_folder='../../results/test-ugr16')
    generator.generate(work_folder='../../results/test-ugr16')
    generator.visualize(work_folder='../../results/test-ugr16')

    ray.shutdown()
```
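
The comments in the driver ask for a `work_folder` that does not yet exist and, on multi-machine Ray setups, is an absolute path. A minimal sketch of a pre-flight check for both conditions follows; `resolve_work_folder` is a hypothetical helper and not part of the NetShare API.

```python
import os


def resolve_work_folder(path: str) -> str:
    """Return `path` as an absolute path, refusing folders that already exist."""
    abs_path = os.path.abspath(path)
    if os.path.exists(abs_path):
        raise FileExistsError(
            f"work_folder {abs_path!r} already exists; choose a new folder"
        )
    return abs_path
```

The result could then be passed to the driver calls, for example `generator.train(work_folder=resolve_work_folder('../../results/test-ugr16'))`.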

The corresponding configuration file is config_example_netflow_nodp.json. You may refer to the README for detailed explanations of the configuration files.

After generation, you will be redirected to a dashboard where a side-by-side visual comparison between real and synthetic data will be shown.

Codebase structure (for dev purpose)

├── doc                       # (tentative) NetShare tutorials and APIs
├── examples                  # Examples of using NetShare on different datasets
├── netshare                  # NetShare source code
│   ├── configs               # Default configurations  
│   ├── generators            # Generator class
│   ├── model_managers        # Core of NetShare service (i.e., train/generate)
│   ├── models                # Timeseries GAN models (e.g., DoppelGANger)
│   ├── pre_post_processors   # Pre- and post-process data
│   ├── ray                   # Ray functions overloading
│   └── utils                 # Utility functions/common class definitions
├── traces                    # Traces/datasets
└── util                      # MISC/setup scripts
    └── ray                   # Ray setup script

References

Please cite our paper/codebase appropriately if you find NetShare useful.

@inproceedings{netshare-sigcomm2022,
  author = {Yin, Yucheng and Lin, Zinan and Jin, Minhao and Fanti, Giulia and Sekar, Vyas},
  title = {Practical GAN-Based Synthetic IP Header Trace Generation Using NetShare},
  year = {2022},
  isbn = {9781450394208},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3544216.3544251},
  doi = {10.1145/3544216.3544251},
  abstract = {We explore the feasibility of using Generative Adversarial Networks (GANs) to automatically learn generative models to generate synthetic packet- and flow header traces for networking tasks (e.g., telemetry, anomaly detection, provisioning). We identify key fidelity, scalability, and privacy challenges and tradeoffs in existing GAN-based approaches. By synthesizing domain-specific insights with recent advances in machine learning and privacy, we identify design choices to tackle these challenges. Building on these insights, we develop an end-to-end framework, NetShare. We evaluate NetShare on six diverse packet header traces and find that: (1) across all distributional metrics and traces, it achieves 46% more accuracy than baselines and (2) it meets users' requirements of downstream tasks in evaluating accuracy and rank ordering of candidate approaches.},
  booktitle = {Proceedings of the ACM SIGCOMM 2022 Conference},
  pages = {458–472},
  numpages = {15},
  keywords = {privacy, synthetic data generation, network packets, network flows, generative adversarial networks},
  location = {Amsterdam, Netherlands},
  series = {SIGCOMM '22}
}

Part of the source code is adapted from the following open-source projects:

netshare's Issues

Reproducibility on TON dataset experiments

Can you people release the configuration and driver scripts for evaluating the model accuracy when using the full TON dataset?
This should include both training/testing on the real dataset as well as training on synthetic and testing on the real dataset.

Problem with PCAP to CSV conversion in `netshare/pre_post_processors/netshare/main.c`

The program /netshare/pre_post_processors/netshare/main.c seems unable to identify protocols other than TCP and UDP.

This program converts TCP and UDP to their corresponding protocol numbers, but it ignores all other kinds of protocols. Is this a bug or is it by design?
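
For reference, the protocol field in the IP header is an 8-bit number assigned by IANA, so a converter can preserve protocols it does not recognize by passing the number through. The sketch below is illustrative Python, not the logic in main.c; `protocol_label` and the small lookup table are assumptions made for the example.

```python
# A few IANA-assigned IP protocol numbers (not exhaustive).
KNOWN_PROTOCOLS = {
    1: "ICMP",
    6: "TCP",
    17: "UDP",
    47: "GRE",
    50: "ESP",
}


def protocol_label(proto_number: int) -> str:
    """Map an IP protocol number to a name, keeping unknown numbers as-is."""
    if not 0 <= proto_number <= 255:
        raise ValueError(f"protocol number out of range: {proto_number}")
    # Unknown protocols are returned as their decimal number, not dropped.
    return KNOWN_PROTOCOLS.get(proto_number, str(proto_number))
```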

Netflow Training and generation ValueError: Pretrain_dir {} does not exist!

Hello,
Thank you for your work. I'm interested in implementing your solution for netflow traffic generation.
Unfortunately, at the end of the chunk0 training, I get the following error (I copy/paste only the last part of the stdout/stderr, because the entire output would be too long).

Stop data_loader #0: True
Stop data_loader #1: True
data loader ended
-------------
data loader ended
-------------
data loader endeddata loader ended
-------------

-------------
data loader ended
-------------
Stop data_loader #2: True
Stop data_loader #3: True
Stop data_loader #4: True
data loader ended
-------------
Stop data_loader #5: True
Stop data_loader #6: True
Stop data_loader #7: True
Stop data_loader #8: True
Stop data_loader #9: True
-------------
Finish launching chunk0 experiments ...
Start waiting for chunk0 from config_group_id 0experiments finished ...
Traceback (most recent call last):
  File "/srv/tempdd/aschoen/NetShare/netshare/model_managers/model_manager.py", line 34, in train
    model_config=model_config)
  File "/srv/tempdd/aschoen/NetShare/netshare/model_managers/netshare_manager/netshare_manager.py", line 42, in _train
    log_folder=log_folder)
  File "/srv/tempdd/aschoen/NetShare/netshare/ray/remote.py", line 34, in remote
    return ResultWrapper(self._ray_args[0](*args, **kwargs))
  File "/srv/tempdd/aschoen/NetShare/netshare/model_managers/netshare_manager/train_helper.py", line 104, in _train_specific_config_group
    log_folder)
  File "/srv/tempdd/aschoen/NetShare/netshare/model_managers/netshare_manager/train_helper.py", line 24, in _launch_other_chunks_training
    raise ValueError("Pretrain_dir {} does not exist!")
ValueError: Pretrain_dir {} does not exist!
Traceback (most recent call last):
  File "example_netflow.py", line 15, in <module>
    generator.train_and_generate(work_folder='../results/vis_test/')
  File "/srv/tempdd/aschoen/NetShare/netshare/generators/generator.py", line 203, in train_and_generate
    if not self.train(work_folder):
  File "/srv/tempdd/aschoen/NetShare/netshare/generators/generator.py", line 196, in train
    log_folder=self._get_model_log_folder(work_folder)):
  File "/srv/tempdd/aschoen/NetShare/netshare/generators/generator.py", line 122, in _train
    model_config=self._model_config)
  File "/srv/tempdd/aschoen/NetShare/netshare/model_managers/model_manager.py", line 34, in train
    model_config=model_config)
  File "/srv/tempdd/aschoen/NetShare/netshare/model_managers/netshare_manager/netshare_manager.py", line 42, in _train
    log_folder=log_folder)
  File "/srv/tempdd/aschoen/NetShare/netshare/ray/remote.py", line 34, in remote
    return ResultWrapper(self._ray_args[0](*args, **kwargs))
  File "/srv/tempdd/aschoen/NetShare/netshare/model_managers/netshare_manager/train_helper.py", line 104, in _train_specific_config_group
    log_folder)
  File "/srv/tempdd/aschoen/NetShare/netshare/model_managers/netshare_manager/train_helper.py", line 24, in _launch_other_chunks_training
    raise ValueError("Pretrain_dir {} does not exist!")
ValueError: Pretrain_dir {} does not exist!

Maybe it has something to do with my config file, especially the pretrain_dir argument. I've tried pretrain_dir = null and pretrain_dir = '/path/to/a/directory/on/my/computer', but both give me this error.
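
The literal `{}` in the error message suggests the string is raised without being formatted, so the path that was checked is never shown. A minimal sketch of a check that reports the path is below; `check_pretrain_dir` is a hypothetical stand-in for the check in train_helper.py, not the project's actual code.

```python
import os


def check_pretrain_dir(pretrain_dir):
    """Return pretrain_dir if it is a directory, else raise naming the path."""
    if pretrain_dir is None or not os.path.isdir(pretrain_dir):
        # Format the message so the user can see which path was checked.
        raise ValueError(f"Pretrain_dir {pretrain_dir!r} does not exist!")
    return pretrain_dir
```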

This is my complete config file


{
    "global_config": {
        "overwrite": true,
        "original_data_file": "../traces/ugr16/raw.csv",
        "dataset_type": "netflow",
        "n_chunks": 10,
        "dp": false,
        "word2vec_vecSize": 10,
        "timestamp": "interarrival",
        "truncate": "per_chunk",
        "max_flow_len": false 
    },
    "pre_post_processor": {
        "class": "NetsharePrePostProcessor",
        "config": {
            "norm_option": 0,
            "split_name": "multichunk_dep_v2",
            "df2chunks": "fixed_time",
            "full_IP_header": true,
            "encode_IP": "bit"
        }
    },
    "model_manager": {
        "class": "NetShareManager",
        "config": {
            "pretrain_dir": null,
            "skip_chunk0_train": false,
            "pretrain_non_dp": true,
            "pretrain_non_dp_reduce_time": 4.0,
            "pretrain_dp": false,
            "run": 0
        }
    },
    "model": {
        "class": "DoppelGANgerTFModel",
        "config": {
            "batch_size": 1000,
            "sample_len": [
                1,
                5,
                10
            ],
            "sample_len_expand": true,
            "iteration": 500,
            "vis_freq": 2001,
            "vis_num_sample": 5,
            "d_rounds": 5,
            "g_rounds": 1,
            "num_packing": 1,
            "noise": true,
            "attr_noise_type": "normal",
            "feature_noise_type": "normal",
            "rnn_mlp_num_layers": 0,
            "feed_back": false,
            "g_lr": 0.0001,
            "d_lr": 0.0001,
            "d_gp_coe": 10.0,
            "gen_feature_num_layers": 1,
            "gen_feature_num_units": 100,
            "gen_attribute_num_layers": 5,
            "gen_attribute_num_units": 512,
            "disc_num_layers": 5,
            "disc_num_units": 512,
            "initial_state": "random",
            "leaky_relu": false,
            "attr_d_lr": 0.0001,
            "attr_d_gp_coe": 10.0,
            "g_attr_d_coe": 1.0,
            "attr_disc_num_layers": 5,
            "attr_disc_num_units": 512,
            "aux_disc": true,
            "self_norm": false,
            "fix_feature_network": false,
            "debug": false,
            "combined_disc": true,
            "use_gt_lengths": false,
            "use_uniform_lengths": false,
            "num_cores": null,
            "sn_mode": null,
            "scale": 1.0,
            "extra_checkpoint_freq": 20000,
            "epoch_checkpoint_freq": 1000,
            "dp_noise_multiplier": null,
            "dp_l2_norm_clip": null
        }
    }
}

Thank you in advance for your help.
Hallavar

FileNotFoundError: [Errno 2] No such file or directory: '/data1/maoyuning/NetShare-master/results/test_cidds/generated_data/best_syn_dfs/syn.csv'

I'm running on a single machine (Ubuntu 16.04.3 LTS (GNU/Linux 4.4.0-148-generic x86_64)) with Ray turned on.
The driver.py looks like this:

import netshare.ray as ray

from netshare import Generator

if __name__ == '__main__':
    # Change to False if you would not like to use Ray
    ray.config.enabled = True
    ray.init(address="auto")

    # configuration file
    generator = Generator(config="netflow/config_example_netflow_nodp_cidds.json")

    # `work_folder` should not exist o/w an overwrite error will be thrown.
    # Please set the `worker_folder` as *absolute path*
    # if you are using Ray with multi-machine setup
    # since Ray has bugs when dealing with relative paths.
    generator.train_and_generate(work_folder='/data1/maoyuning/NetShare-master/results/test_cidds')

    ray.shutdown()

The netflow/config_example_netflow_nodp_cidds.json is as follows:

{
    "global_config": {
        "original_data_file": "../traces/cidds/raw.csv",
        "dataset_type": "netflow",
        "n_chunks": 10,
        "dp": false
    },
    "pre_post_processor": {
        "class": "NetsharePrePostProcessor",
        "config": {
            "max_flow_len": null
        }
    },
    "model": {
        "class": "DoppelGANgerTFModel",
        "config": {
            "iteration": 20,
            "extra_checkpoint_freq": 10,
            "epoch_checkpoint_freq": 5
        }
    },
    "default": "netflow.json"
}

Here is the error message:

Traceback (most recent call last):
  File "driver.py", line 17, in <module>
    generator.train_and_generate(work_folder='/data1/maoyuning/NetShare-master/results/test_cidds')
  File "/data1/maoyuning/NetShare-master/netshare/generators/generator.py", line 205, in train_and_generate
    if not self.generate(work_folder):
  File "/data1/maoyuning/NetShare-master/netshare/generators/generator.py", line 176, in generate
    work_folder)):
  File "/data1/maoyuning/NetShare-master/netshare/generators/generator.py", line 110, in _post_process
    log_folder=log_folder)
  File "/data1/maoyuning/NetShare-master/netshare/pre_post_processors/pre_post_processor.py", line 38, in post_process
    log_folder=log_folder)
  File "/data1/maoyuning/NetShare-master/netshare/pre_post_processors/netshare/netshare_pre_post_processor.py", line 532, in _post_process
    os.path.join(output_folder, "syn.csv")
  File "/data1/maoyuning/.conda/envs/tf-1.15-py36/lib/python3.6/shutil.py", line 120, in copyfile
    with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: '/data1/maoyuning/NetShare-master/results/test_cidds/generated_data/best_syn_dfs/syn.csv'
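
The traceback shows `shutil.copyfile` failing because the source `syn.csv` does not exist, so the root cause is probably an earlier step that did not produce it. A hedged sketch of a guard that fails with a more informative message is shown below; `copy_best_synthetic` and its arguments are hypothetical and do not mirror the actual code in netshare_pre_post_processor.py.

```python
import os
import shutil


def copy_best_synthetic(best_dir: str, output_folder: str) -> str:
    """Copy best_dir/syn.csv into output_folder, with a descriptive error."""
    src = os.path.join(best_dir, "syn.csv")
    if not os.path.isfile(src):
        # Report what is actually in the folder to help locate the failure.
        found = sorted(os.listdir(best_dir)) if os.path.isdir(best_dir) else []
        raise FileNotFoundError(
            f"{src} is missing; files present in {best_dir}: {found}. "
            "Check the generation logs for an earlier failure."
        )
    os.makedirs(output_folder, exist_ok=True)
    dst = os.path.join(output_folder, "syn.csv")
    shutil.copyfile(src, dst)
    return dst
```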

Bug when loading from chunk0 checkpoint

When loading from the chunk 0 checkpoint by setting skip_chunk0_train to true, the function _configs2configsgroup in netshare/model_managers/netshare_manager/netshare_util.py can't find the checkpoint correctly. I believe this is due to a bug in this function, where the for loop that finds the latest checkpoint uses an incorrect checkpoint file name format.

The current format is

ckpt_dir = os.path.join(
    configs[config_id]["result_folder"],
    "checkpoint",
    "epoch_id-{}".format(epoch_id)
)

whereas the correct format should be

ckpt_dir = os.path.join(
    configs[config_id]["result_folder"],
    "checkpoint",
    "epoch_id-{}.pt".format(epoch_id)
)
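
A lookup that expects the `.pt` suffix could be sketched as follows. This assumes checkpoints are stored as files named `epoch_id-<N>.pt` inside a `checkpoint` directory, as the issue describes; `find_latest_checkpoint` is a hypothetical helper, not the code in netshare_util.py.

```python
import os
import re

# Matches file names such as "epoch_id-20.pt" and captures the epoch number.
_CKPT_PATTERN = re.compile(r"^epoch_id-(\d+)\.pt$")


def find_latest_checkpoint(checkpoint_dir: str):
    """Return the path of the checkpoint with the highest epoch id, or None."""
    if not os.path.isdir(checkpoint_dir):
        return None
    best_epoch, best_path = -1, None
    for name in os.listdir(checkpoint_dir):
        match = _CKPT_PATTERN.match(name)
        # Compare epochs numerically so that 20 ranks above 5.
        if match and int(match.group(1)) > best_epoch:
            best_epoch = int(match.group(1))
            best_path = os.path.join(checkpoint_dir, name)
    return best_path
```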

Could you upload the synthetic data generated from NetShare trained on the cluster?

Could you please upload the synthetic datasets that are generated from NetShare trained on the 200 CPU cluster mentioned in the paper? The training of NetShare is quite compute-intensive and nearly impossible without a cluster. I have found the synthetic CAIDA and UGR16 from this repo (as csv files), but I can't find synthetic data of other datasets.

Missing "default" config of the pcap example

When I was running examples/driver.py using generator = Generator(config="pcap/config_example_pcap_nodp.json"), an error occurred:

Traceback (most recent call last):
  File "driver.py", line 10, in <module>
    generator = Generator(config="pcap/config_example_pcap_nodp.json")
  File "/home/xinyu/NetShare/netshare/generators/generator.py", line 39, in __init__
    self._overwrite = global_config['overwrite']
  File "/home/xinyu/anaconda3/envs/NetShare/lib/python3.6/site-packages/config_io/config.py", line 12, in __missing__
    raise AttributeError(key)
AttributeError: overwrite

After some debugging, I found out it's likely because the config file of the pcap example, examples/config_example_pcap_nodp.json, is missing a "default" argument, so config_io fails to read the default settings.

The config file currently is

 {
     "global_config": {
         "original_data_file": "../traces/caida/raw.pcap",
         "dataset_type": "pcap",
         "n_chunks": 10,
         "dp": false
     }
}

It should be

 {
     "global_config": {
         "original_data_file": "../traces/caida/raw.pcap",
         "dataset_type": "pcap",
         "n_chunks": 10,
         "dp": false
     },
     "default": "pcap.json"
}
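
Conceptually, the `"default"` key names a base configuration whose values are filled in wherever the user file is silent, which is why `overwrite` goes missing without it. The following is a minimal sketch of that kind of recursive merge; it is not config_io's actual implementation, and the values in the example defaults are made up for illustration.

```python
import copy


def merge_config(defaults: dict, overrides: dict) -> dict:
    """Return a new dict: `defaults` updated recursively with `overrides`."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections instead of replacing them wholesale.
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


if __name__ == "__main__":
    # Hypothetical defaults standing in for the contents of pcap.json.
    defaults = {"global_config": {"overwrite": False, "n_chunks": 1}}
    user = {"global_config": {"dataset_type": "pcap", "n_chunks": 10}}
    print(merge_config(defaults, user))
```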

Prepostprocessor for a new dataset

Hello,

I'm trying to adapt your solution for a new dataset based on the CIFlowMeter feature extractor.

I would like to know what the expected outputs of the preprocess and postprocess steps are.

I already have my data in the format required by DoppelGANger, so I was thinking about just loading my data in these functions.

In the output of preprocess, before it is given to the training step, should all my attributes be continuous, or should I keep some discrete attributes that will be managed by word2vec? If so, where do I indicate which attributes are continuous and which are discrete?

Also, in your example on zeek, you didn't change any argument in the PrePostProcessor config field of the config file

},
  "pre_post_processor": {
      "class": "ZeeklogPrePostProcessor",
      "config": {
          "norm_option": 0,
          "split_name": "multichunk_dep_v2",
          "df2chunks": "fixed_time",
          "full_IP_header": true,
          "encode_IP": "bit"
      }

Wouldn't that be a problem further down the pipeline?

If so, could you please provide some information on what each argument does? If it is not mandatory to change it, we can just keep it like this.

Thanks in advance.

[Documentation Improvement] Missing dependency

When I tried to run NetShare in a Docker container, I realized it had an implicit dependency on libpcap. I used this command to install pcap2csv.so:

sudo apt install libpcap-dev

Maybe it would be helpful to add this dependency line in the README file?

RuntimeError during pip install

Hi! I'm following the README.md in branch new_dataset, and after pip3 install -e ., I encountered the following error. Is it a known issue or is there any additional dependency that I need to install? Thanks!

ERROR: Command errored out with exit status 1:
     command: /Users/dorothyko/opt/anaconda3/envs/NetShare/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/8z/rhgl1dpd44j80z0vry1zh7k80000gn/T/pip-install-0xh9kpj3/dm-tree_8e1c2948f02b4e8dbaffa8699b00febb/setup.py'"'"'; __file__='"'"'/private/var/folders/8z/rhgl1dpd44j80z0vry1zh7k80000gn/T/pip-install-0xh9kpj3/dm-tree_8e1c2948f02b4e8dbaffa8699b00febb/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/8z/rhgl1dpd44j80z0vry1zh7k80000gn/T/pip-record-t1fpvnsw/install-record.txt --single-version-externally-managed --compile --install-headers /Users/dorothyko/opt/anaconda3/envs/NetShare/include/python3.6m/dm-tree
         cwd: /private/var/folders/8z/rhgl1dpd44j80z0vry1zh7k80000gn/T/pip-install-0xh9kpj3/dm-tree_8e1c2948f02b4e8dbaffa8699b00febb/
    Complete output (57 lines):
    running install
    running build
    running build_py
    creating build
    creating build/lib.macosx-10.9-x86_64-3.6
    creating build/lib.macosx-10.9-x86_64-3.6/tree
    copying tree/sequence.py -> build/lib.macosx-10.9-x86_64-3.6/tree
    copying tree/__init__.py -> build/lib.macosx-10.9-x86_64-3.6/tree
    copying tree/tree_test.py -> build/lib.macosx-10.9-x86_64-3.6/tree
    copying tree/tree_benchmark.py -> build/lib.macosx-10.9-x86_64-3.6/tree
    running build_ext
    Traceback (most recent call last):
      File "/private/var/folders/8z/rhgl1dpd44j80z0vry1zh7k80000gn/T/pip-install-0xh9kpj3/dm-tree_8e1c2948f02b4e8dbaffa8699b00febb/setup.py", line 77, in _check_build_environment
        subprocess.check_call(['cmake', '--version'])
      File "/Users/dorothyko/opt/anaconda3/envs/NetShare/lib/python3.6/subprocess.py", line 306, in check_call
        retcode = call(*popenargs, **kwargs)
      File "/Users/dorothyko/opt/anaconda3/envs/NetShare/lib/python3.6/subprocess.py", line 287, in call
        with Popen(*popenargs, **kwargs) as p:
      File "/Users/dorothyko/opt/anaconda3/envs/NetShare/lib/python3.6/subprocess.py", line 729, in __init__
        restore_signals, start_new_session)
      File "/Users/dorothyko/opt/anaconda3/envs/NetShare/lib/python3.6/subprocess.py", line 1364, in _execute_child
        raise child_exception_type(errno_num, err_msg, err_filename)
    FileNotFoundError: [Errno 2] No such file or directory: 'cmake': 'cmake'

    The above exception was the direct cause of the following exception:

    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/private/var/folders/8z/rhgl1dpd44j80z0vry1zh7k80000gn/T/pip-install-0xh9kpj3/dm-tree_8e1c2948f02b4e8dbaffa8699b00febb/setup.py", line 155, in <module>
        keywords='tree nest flatten',
      File "/Users/dorothyko/opt/anaconda3/envs/NetShare/lib/python3.6/site-packages/setuptools/__init__.py", line 153, in setup
        return distutils.core.setup(**attrs)
      File "/Users/dorothyko/opt/anaconda3/envs/NetShare/lib/python3.6/distutils/core.py", line 148, in setup
        dist.run_commands()
      File "/Users/dorothyko/opt/anaconda3/envs/NetShare/lib/python3.6/distutils/dist.py", line 955, in run_commands
        self.run_command(cmd)
      File "/Users/dorothyko/opt/anaconda3/envs/NetShare/lib/python3.6/distutils/dist.py", line 974, in run_command
        cmd_obj.run()
      File "/Users/dorothyko/opt/anaconda3/envs/NetShare/lib/python3.6/site-packages/setuptools/command/install.py", line 61, in run
        return orig.install.run(self)
      File "/Users/dorothyko/opt/anaconda3/envs/NetShare/lib/python3.6/distutils/command/install.py", line 545, in run
        self.run_command('build')
      File "/Users/dorothyko/opt/anaconda3/envs/NetShare/lib/python3.6/distutils/cmd.py", line 313, in run_command
        self.distribution.run_command(command)
      File "/Users/dorothyko/opt/anaconda3/envs/NetShare/lib/python3.6/distutils/dist.py", line 974, in run_command
        cmd_obj.run()
      File "/Users/dorothyko/opt/anaconda3/envs/NetShare/lib/python3.6/distutils/command/build.py", line 135, in run
        self.run_command(cmd_name)
      File "/Users/dorothyko/opt/anaconda3/envs/NetShare/lib/python3.6/distutils/cmd.py", line 313, in run_command
        self.distribution.run_command(command)
      File "/Users/dorothyko/opt/anaconda3/envs/NetShare/lib/python3.6/distutils/dist.py", line 974, in run_command
        cmd_obj.run()
      File "/private/var/folders/8z/rhgl1dpd44j80z0vry1zh7k80000gn/T/pip-install-0xh9kpj3/dm-tree_8e1c2948f02b4e8dbaffa8699b00febb/setup.py", line 70, in run
        self._check_build_environment()
      File "/private/var/folders/8z/rhgl1dpd44j80z0vry1zh7k80000gn/T/pip-install-0xh9kpj3/dm-tree_8e1c2948f02b4e8dbaffa8699b00febb/setup.py", line 82, in _check_build_environment
        ) from e
    RuntimeError: CMake must be installed to build the following extensions: _tree
    ----------------------------------------
ERROR: Command errored out with exit status 1: /Users/dorothyko/opt/anaconda3/envs/NetShare/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/8z/rhgl1dpd44j80z0vry1zh7k80000gn/T/pip-install-0xh9kpj3/dm-tree_8e1c2948f02b4e8dbaffa8699b00febb/setup.py'"'"'; __file__='"'"'/private/var/folders/8z/rhgl1dpd44j80z0vry1zh7k80000gn/T/pip-install-0xh9kpj3/dm-tree_8e1c2948f02b4e8dbaffa8699b00febb/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/8z/rhgl1dpd44j80z0vry1zh7k80000gn/T/pip-record-t1fpvnsw/install-record.txt --single-version-externally-managed --compile --install-headers /Users/dorothyko/opt/anaconda3/envs/NetShare/include/python3.6m/dm-tree Check the logs for full command output.
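
The final `RuntimeError` in this log indicates that building `dm-tree` requires the `cmake` executable on the PATH. A small pre-flight check like the sketch below can surface this before running pip; `missing_tools` is a hypothetical helper, and the tool list is an assumption based only on this log.

```python
import shutil


def missing_tools(tools):
    """Return the subset of `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(["cmake"])
    if missing:
        print("Install these before running pip:", ", ".join(missing))
    else:
        print("All build tools found")
```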

The package can only be installed in editable mode.

If the netshare package is installed through pip3 install . (without the editable option), I get the following error message when importing netshare:

Traceback (most recent call last):
  File "test.py", line 1, in <module>
    import netshare
  File "/home/runwei/anaconda3/envs/netshare/lib/python3.6/site-packages/netshare/__init__.py", line 1, in <module>
    from .generators.generator import Generator
  File "/home/runwei/anaconda3/envs/netshare/lib/python3.6/site-packages/netshare/generators/generator.py", line 5, in <module>
    import netshare.pre_post_processors as pre_post_processors
  File "/home/runwei/anaconda3/envs/netshare/lib/python3.6/site-packages/netshare/pre_post_processors/__init__.py", line 2, in <module>
    from .netshare.netshare_pre_post_processor import NetsharePrePostProcessor
ModuleNotFoundError: No module named 'netshare.pre_post_processors.netshare'
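
One common cause of this kind of `ModuleNotFoundError` in non-editable installs is a subpackage that the packaging step does not pick up, for example a directory without an `__init__.py` when `setuptools.find_packages` is used. Whether that is the cause here is not confirmed by the issue. The sketch below lists such directories in a source tree; `dirs_missing_init` is a hypothetical helper.

```python
import os


def dirs_missing_init(package_root: str):
    """List directories under package_root with .py files but no __init__.py."""
    missing = []
    for dirpath, dirnames, filenames in os.walk(package_root):
        # Skip cache directories, which never need an __init__.py.
        dirnames[:] = [d for d in dirnames if d != "__pycache__"]
        has_python = any(name.endswith(".py") for name in filenames)
        if has_python and "__init__.py" not in filenames:
            missing.append(os.path.relpath(dirpath, package_root))
    return sorted(missing)
```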

How to train the model with naive differential privacy?

I'm trying to train the model on a small test dataset with naive differential privacy. I tried to change something in the configuration, but I got empty output.
Here is the changed configuration:

{
    "global_config": {
        "original_data_file": "../traces/simple-network/Switch1_Ethernet1_to_PC1_Ethernet0-correct/raw.pcap",
        "dataset_type": "pcap",
        "n_chunks": 1,
        "dp": true
    },
    "model_manager": {
        "class": "NetShareManager",
        "config": {
            "pretrain_non_dp": false,
            "pretrain_non_dp_reduce_time": null,
            "pretrain_dp": false
        }
    },
    "model": {
        "class": "DoppelGANgerTFModel",
        "config": {
            "batch_size": 1,
            "sample_len": [
                1
            ],
            "iteration": 80000,
            "extra_checkpoint_freq": 4000,
            "epoch_checkpoint_freq": 1000,
            "gen_feature_num_layers": 1,
            "gen_feature_num_units": 100,
            "gen_attribute_num_layers": 1,
            "gen_attribute_num_units": 32,
            "disc_num_layers": 1,
            "disc_num_units": 32,
            "attr_disc_num_layers": 1,
            "attr_disc_num_units": 32,
            "dp_noise_multiplier": 0.2797,
            "dp_l2_norm_clip": 1.0
        }
    },
    "default": "pcap.json"
}

I suspect it is partly because I set pretrain_dp=false. But if pretrain_dp=true, I will be asked to provide a model pretrained with public dataset.
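
For context, `dp_l2_norm_clip` and `dp_noise_multiplier` are the usual two knobs of DP-SGD: each per-example gradient is clipped to an L2 norm bound, and Gaussian noise with standard deviation equal to the multiplier times that bound is added to the sum. The sketch below shows that generic step in plain Python; it is not NetShare's training code, and `dp_average_gradient` is a hypothetical name.

```python
import math
import random


def dp_average_gradient(per_example_grads, l2_norm_clip, noise_multiplier, rng=None):
    """Clip each gradient to l2_norm_clip, sum, add Gaussian noise, and average."""
    rng = rng or random.Random()
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Scale down only gradients whose norm exceeds the clip bound.
        scale = min(1.0, l2_norm_clip / norm) if norm > 0 else 1.0
        for i, g in enumerate(grad):
            total[i] += g * scale
    stddev = noise_multiplier * l2_norm_clip
    noisy = [t + rng.gauss(0.0, stddev) for t in total]
    return [n / len(per_example_grads) for n in noisy]
```

In this generic formulation, a batch size of 1 means the noise is not averaged over several examples, so every update carries the full noise standard deviation.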

ValueError: Variable DoppelGANgerGenerator/attribute_real/layer0/linear/matrix/Adam/ already exists, disallowed.

I just followed the instructions and ran the script driver.py. Here is the error message:

Traceback (most recent call last):
  File "/home/runwei/NetShare/netshare/models/model.py", line 27, in train
    log_folder=log_folder)
  File "/home/runwei/NetShare/netshare/models/doppelganger_tf_model.py", line 176, in _train
    gan.build()
  File "/home/runwei/NetShare/netshare/models/doppelganger_tf/doppelganger.py", line 293, in build
    self.build_loss()
  File "/home/runwei/NetShare/netshare/models/doppelganger_tf/doppelganger.py", line 708, in build_loss
    self.g_loss, var_list=self.generator.trainable_vars
  File "/home/runwei/anaconda3/envs/netshare/lib/python3.6/site-packages/tensorflow_core/python/training/optimizer.py", line 413, in minimize
    name=name)
  File "/home/runwei/anaconda3/envs/netshare/lib/python3.6/site-packages/tensorflow_core/python/training/optimizer.py", line 597, in apply_gradients
    self._create_slots(var_list)
  File "/home/runwei/anaconda3/envs/netshare/lib/python3.6/site-packages/tensorflow_core/python/training/adam.py", line 131, in _create_slots
    self._zeros_slot(v, "m", self._name)
  File "/home/runwei/anaconda3/envs/netshare/lib/python3.6/site-packages/tensorflow_core/python/training/optimizer.py", line 1156, in _zeros_slot
    new_slot_variable = slot_creator.create_zeros_slot(var, op_name)
  File "/home/runwei/anaconda3/envs/netshare/lib/python3.6/site-packages/tensorflow_core/python/training/slot_creator.py", line 190, in create_zeros_slot
    colocate_with_primary=colocate_with_primary)
  File "/home/runwei/anaconda3/envs/netshare/lib/python3.6/site-packages/tensorflow_core/python/training/slot_creator.py", line 164, in create_slot_with_initializer
    dtype)
  File "/home/runwei/anaconda3/envs/netshare/lib/python3.6/site-packages/tensorflow_core/python/training/slot_creator.py", line 74, in _create_slot_var
    validate_shape=validate_shape)
  File "/home/runwei/anaconda3/envs/netshare/lib/python3.6/site-packages/tensorflow_core/python/ops/variable_scope.py", line 1500, in get_variable
    aggregation=aggregation)
  File "/home/runwei/anaconda3/envs/netshare/lib/python3.6/site-packages/tensorflow_core/python/ops/variable_scope.py", line 1243, in get_variable
    aggregation=aggregation)
  File "/home/runwei/anaconda3/envs/netshare/lib/python3.6/site-packages/tensorflow_core/python/ops/variable_scope.py", line 567, in get_variable
    aggregation=aggregation)
  File "/home/runwei/anaconda3/envs/netshare/lib/python3.6/site-packages/tensorflow_core/python/ops/variable_scope.py", line 519, in _true_getter
    aggregation=aggregation)
  File "/home/runwei/anaconda3/envs/netshare/lib/python3.6/site-packages/tensorflow_core/python/ops/variable_scope.py", line 868, in _get_single_variable
    (err_msg, "".join(traceback.format_list(tb))))
ValueError: Variable DoppelGANgerGenerator/attribute_real/layer0/linear/matrix/Adam/ already exists, disallowed. Did you mean to set reuse=True or reuse=tf.AUTO_REUSE in VarScope? Originally defined at:

  File "/home/runwei/anaconda3/envs/netshare/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()
  File "/home/runwei/anaconda3/envs/netshare/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/home/runwei/anaconda3/envs/netshare/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/home/runwei/anaconda3/envs/netshare/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/runwei/anaconda3/envs/netshare/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)

TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

Hello,
I hit this error while trying to generate data on a GPU. The full error message is:

Traceback (most recent call last):
  File "/home/ubuntu/xzj/NetShare-new/netshare/models/model.py", line 34, in generate
    return self._generate(
  File "/home/ubuntu/xzj/NetShare-new/netshare/models/doppelganger_torch_model.py", line 247, in _generate
    ) = dg.generate(
  File "/home/ubuntu/xzj/NetShare-new/netshare/models/doppelganger_torch/doppelganger.py", line 237, in generate
    attribute, attribute_discrete, feature = tuple(
  File "/home/ubuntu/xzj/NetShare-new/netshare/models/doppelganger_torch/doppelganger.py", line 238, in <genexpr>
    np.concatenate(d, axis=0) for d in zip(*generated_data_list)
  File "<__array_function__ internals>", line 200, in concatenate
  File "/home/ubuntu/.conda/envs/NetShare-new/lib/python3.9/site-packages/torch/_tensor.py", line 970, in __array__
    return self.numpy()
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

It's just a small bug. I solved it by changing the function DoppelGANger._generate in ./netshare/models/doppelganger_torch/doppelganger.py.
The changed code looks like this:

def _generate(
        self,
        real_attribute_noise,
        addi_attribute_noise,
        feature_input_noise,
        h0,
        c0,
        given_attribute=None,
        given_attribute_discrete=None,
    ):

        self.generator.eval()
        self.discriminator.eval()
        if self.use_attr_discriminator:
            self.attr_discriminator.eval()

        if given_attribute is None and given_attribute_discrete is None:
            with torch.no_grad():
                attribute, attribute_discrete, feature = self.generator(
                    real_attribute_noise=real_attribute_noise.to(self.device),
                    addi_attribute_noise=addi_attribute_noise.to(self.device),
                    feature_input_noise=feature_input_noise.to(self.device),
                    h0=h0.to(self.device),
                    c0=c0.to(self.device)
                )
        else:
            given_attribute = torch.from_numpy(given_attribute).float()
            given_attribute_discrete = torch.from_numpy(
                given_attribute_discrete).float()
            with torch.no_grad():
                attribute, attribute_discrete, feature = self.generator(
                    real_attribute_noise=real_attribute_noise.to(self.device),
                    addi_attribute_noise=addi_attribute_noise.to(self.device),
                    feature_input_noise=feature_input_noise.to(self.device),
                    h0=h0.to(self.device),
                    c0=c0.to(self.device),
                    given_attribute=given_attribute.to(self.device),
                    given_attribute_discrete=given_attribute_discrete.to(self.device),
                )
        return attribute.cpu(), attribute_discrete.cpu(), feature.cpu()
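
A possible alternative, sketched here under assumptions, is to leave `_generate` untouched and convert each batch at the point where the traceback fails, namely the `np.concatenate` call in `generate`. The helper names `to_numpy` and `concat_batches` are hypothetical and not part of NetShare. The duck-typed check stands in for `isinstance(x, torch.Tensor)` so that the snippet runs without torch installed.

```python
import numpy as np


def to_numpy(x):
    """Return x as a NumPy array, copying device tensors to host memory first."""
    # torch.Tensor provides detach() and cpu(); checking for them by name
    # (an assumption made for this sketch) avoids importing torch here.
    if hasattr(x, "detach") and hasattr(x, "cpu"):
        x = x.detach().cpu().numpy()
    return np.asarray(x)


def concat_batches(generated_data_list):
    """Concatenate per-batch (attribute, attribute_discrete, feature) tuples."""
    return tuple(
        np.concatenate([to_numpy(t) for t in parts], axis=0)
        for parts in zip(*generated_data_list)
    )
```

Either approach moves the data to the CPU before NumPy touches it. Converting at the concatenation site has the small advantage that it also accepts batches that are already NumPy arrays.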

Importing ray raises an AttributeError on sys.stdout

Hello,

I'm trying to set up a simple NetFlow generation run on the UGR'16 dataset. I'm on Linux with Python 3.6.

My config.json is

{
    "global_config": {
        "original_data_file": "../traces/ugr16/raw.csv",
        "dataset_type": "netflow",
        "n_chunks": 10,
        "dp": false,
        "max_flow_len": false #to avoid the raising of an attribute error in netshare_pre_post_processor.py
    },
    "default": "netflow.json"
}

My driver.py is


from netshare import Generator
if __name__ == '__main__':
    # configuration file
    generator = Generator(config="netflow/config_example_netflow_nodp.json")

    # `work_folder` should not exist; otherwise an overwrite error will be thrown.
    # Please set `work_folder` as an *absolute path*
    # if you are using Ray with a multi-machine setup,
    # since Ray has bugs when dealing with relative paths.
    
    #generator.visualize(work_folder='../results/vis_test')
    generator.train_and_generate(work_folder='../results/vis_test/')
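
The comments in the driver ask for a work folder that does not exist yet and, with Ray across machines, an absolute path. A minimal sketch of one way to satisfy both is shown below. The helper `fresh_work_folder` and its `prefix` parameter are hypothetical and not part of NetShare.

```python
import os
import time


def fresh_work_folder(base_dir, prefix="run"):
    """Return an absolute path under base_dir that does not exist yet."""
    base_dir = os.path.abspath(base_dir)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    candidate = os.path.join(base_dir, f"{prefix}-{stamp}")
    path = candidate
    suffix = 0
    # Append a counter if a folder with the same timestamp already exists.
    while os.path.exists(path):
        suffix += 1
        path = f"{candidate}-{suffix}"
    return path
```

It could then be passed as `generator.train_and_generate(work_folder=fresh_work_folder("../results"))`, assuming the generator creates the folder itself.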

After building all the chunks, the program starts working on the first chunk and raises an AttributeError linked to ray (which is not used in the program).

The error traceback is


Traceback (most recent call last):
  File "/srv/tempdd/aschoen/NetShare/netshare/pre_post_processors/pre_post_processor.py", line 29, in pre_process
    log_folder=log_folder)
  File "/srv/tempdd/aschoen/NetShare/netshare/pre_post_processors/netshare/netshare_pre_post_processor.py", line 406, in _pre_process
    flowkeys_chunkidx=flowkeys_chunkidx,
  File "/srv/tempdd/aschoen/NetShare/netshare/ray/remote.py", line 25, in remote
    import ray
  File "/udd/aschoen/.local/lib/python3.6/site-packages/ray/__init__.py", line 169, in <module>
    from ray import autoscaler  # noqa:E402
  File "/udd/aschoen/.local/lib/python3.6/site-packages/ray/autoscaler/__init__.py", line 1, in <module>
    from ray.autoscaler import sdk
  File "/udd/aschoen/.local/lib/python3.6/site-packages/ray/autoscaler/sdk/__init__.py", line 1, in <module>
    from ray.autoscaler.sdk.sdk import (
  File "/udd/aschoen/.local/lib/python3.6/site-packages/ray/autoscaler/sdk/sdk.py", line 9, in <module>
    from ray.autoscaler._private import commands
  File "/udd/aschoen/.local/lib/python3.6/site-packages/ray/autoscaler/_private/commands.py", line 24, in <module>
    from ray.autoscaler._private import subprocess_output_util as cmd_output_util
  File "/udd/aschoen/.local/lib/python3.6/site-packages/ray/autoscaler/_private/subprocess_output_util.py", line 8, in <module>
    from ray.autoscaler._private.cli_logger import cf, cli_logger
  File "/udd/aschoen/.local/lib/python3.6/site-packages/ray/autoscaler/_private/cli_logger.py", line 61, in <module>
    import colorful as _cf
  File "/udd/aschoen/.conda/envs/NetShare-env/lib/python3.6/site-packages/colorful/__init__.py", line 133, in <module>
    sys.modules[__name__] = ColorfulModule(Colorful(), __name__)
  File "/udd/aschoen/.conda/envs/NetShare-env/lib/python3.6/site-packages/colorful/core.py", line 342, in __init__
    colormode = terminal.detect_color_support(env=os.environ)
  File "/udd/aschoen/.conda/envs/NetShare-env/lib/python3.6/site-packages/colorful/terminal.py", line 48, in detect_color_support
    if not sys.stdout.isatty():

I think the import of ray in the RemoteFunctionWrapper of remote.py is what causes this fault. Is there a way to bypass this wrapper by setting a specific argument in the JSON file?

If you have any advice, I would really appreciate it.

Good luck with your big merge ;)

Adrien
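
The traceback above stops at `sys.stdout.isatty()`. One plausible reading, which is an assumption because the final exception line is not shown, is that `sys.stdout` has been replaced by a logging object that lacks `isatty`. The sketch below shows a stdout wrapper that forwards any attribute it does not define to the underlying stream. The class name `Tee` is hypothetical and is not NetShare's actual logger.

```python
import io
import sys


class Tee:
    """Hypothetical stdout wrapper that duplicates writes to a log stream."""

    def __init__(self, stream, log):
        self._stream = stream
        self._log = log

    def write(self, data):
        self._log.write(data)
        return self._stream.write(data)

    def flush(self):
        self._stream.flush()
        self._log.flush()

    def __getattr__(self, name):
        # Called only when normal lookup fails. Private names are not
        # forwarded, which avoids recursion if _stream is not set yet.
        if name.startswith("_"):
            raise AttributeError(name)
        # Forward isatty(), fileno(), encoding, etc. to the real stream.
        return getattr(self._stream, name)
```

With such a wrapper installed, for example via `sys.stdout = Tee(sys.__stdout__, log_file)`, libraries that probe `sys.stdout.isatty()` at import time get an answer from the real stream instead of an AttributeError.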

Did you rename the attributes of the UGR'16 and CIDDS datasets?

Hello,
I'm wondering whether you rename the attributes of the .csv dataset in your preprocessing step.
For CIDDS, netshare_pre_post_processor.py seems to look up columns named 'td' and 'ts'. Neither attribute is present in the original .csv.
For UGR'16, the .csv provided by the team has no column names. Did you decide to name them? If so, how (which name corresponds to which column number)?

Could you please give a quick explanation of all the attributes/columns your program is supposed to use, so that users can adapt their datasets accordingly?
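
Until the expected columns are documented, a dataset can be adapted once the names are known. The sketch below, under stated assumptions, adds a header row to a header-less CSV or renames an existing header. The function `write_with_header` is hypothetical, and apart from `ts` and `td` (mentioned above) the column names in the usage example are placeholders and do not reflect the real layout of either dataset.

```python
import csv
import io


def write_with_header(src, dst, column_names, rename=None):
    """Copy CSV rows from src to dst, giving them a header row.

    column_names: names for a header-less file, or None if src has a header.
    rename: optional {old_name: new_name} mapping applied to the header.
    """
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    header = list(column_names) if column_names is not None else next(reader)
    rename = rename or {}
    writer.writerow([rename.get(name, name) for name in header])
    writer.writerows(reader)


# Usage with placeholder data: three unnamed columns receive names.
raw = io.StringIO("2016-03-18 10:00:00,0.5,42\n")
out = io.StringIO()
write_with_header(raw, out, ["ts", "td", "col3"])
```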

Difference in iteration count between the camera-ready branch and the main branch

Hi, I have a question about the configuration. I noticed that the camera-ready branch sets the default iteration count to 40 for pcap w/o dp, while the main branch sets it to 80000. Could you please explain this difference? Were the results in the NetShare paper produced with the config in the camera-ready branch?

ValueError: result_folder: /data1/maoyuning/NetShare-master/results/test_caida/models/chunkid-1/sample_len-100 not found in configs!

I followed the instructions and ran the script driver.py. The configuration is as follows:

{
    "global_config": {
        "original_data_file": "../traces/caida/raw.pcap",
        "dataset_type": "pcap",
        "n_chunks": 5,
        "dp": false
    },
    "model":{
        "class": "DoppelGANgerTFModel",
        "config": {
            "iteration": 10,
            "extra_checkpoint_freq": 5,
            "epoch_checkpoint_freq": 2
        }
    },
    "default": "pcap.json"
}
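
The configuration above overrides only a few keys and names `pcap.json` as its `default`. One plausible reading, which is an assumption and not confirmed NetShare behavior, is that the user's values are laid on top of the default file key by key. A minimal sketch of such a recursive merge follows. The function `merge_config` and the default values in the example are hypothetical.

```python
import copy


def merge_config(default, override):
    """Return a copy of default with values from override laid on top."""
    merged = copy.deepcopy(default)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: merge them instead of replacing wholesale.
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical defaults; only "iteration" is overridden by the user config.
defaults = {"model": {"config": {"iteration": 100, "batch_size": 8}}}
user = {"model": {"config": {"iteration": 10}}}
effective = merge_config(defaults, user)
```

Under this reading, keys that the user config omits keep their default values, so printing the merged result can help check which settings a run actually used.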