
fjxmlzn / doppelganger

290 stars · 74 forks · 69 KB

[IMC 2020 (Best Paper Finalist)] Using GANs for Sharing Networked Time Series Data: Challenges, Initial Promise, and Open Questions

Home Page: http://arxiv.org/abs/1909.13403

License: BSD 3-Clause Clear License

Python 100.00%
dataset-generation datasets doppelganger fidelity gan gans generative-adversarial-network privacy synthetic-data synthetic-data-generation synthetic-data-generator synthetic-dataset-generation time-series timeseries

doppelganger's People

Contributors: fjxmlzn

doppelganger's Issues

Time Series Length

Is it possible to generate a continuous stream of time series data using this model? For example, if I only have 10,000 samples of 500 time steps each, can I generate one sample with 4,000 steps (8 times the input length) with your package? And how is this achieved: is it done by feeding in the previous time step?

CLI getting stuck on running example_training/main.py

Hey, I am trying to run the example training with GPU on Windows 10 with Python 3.6 and tf 1.4. When running it with verbose imports, the last part before it gets stuck is this:
import 'pathos.secure.copier' # <_frozen_importlib_external.SourceFileLoader object at 0x000001D083E1C9B0>

C:\Users\e106923\Miniconda3\envs\DGAN\lib\site-packages\pathos-0.2.8-py3.6.egg\pathos\secure\__pycache__\tunnel.cpython-36.pyc matches C:\Users\e106923\Miniconda3\envs\DGAN\lib\site-packages\pathos-0.2.8-py3.6.egg\pathos\secure\tunnel.py

code object from 'C:\Users\e106923\Miniconda3\envs\DGAN\lib\site-packages\pathos-0.2.8-py3.6.egg\pathos\secure\__pycache__\tunnel.cpython-36.pyc'

import 'pathos.secure.tunnel' # <_frozen_importlib_external.SourceFileLoader object at 0x000001D083E1CC88>
import 'pathos.secure' # <_frozen_importlib_external.SourceFileLoader object at 0x000001D083E1C6A0>
import 'pathos' # <_frozen_importlib_external.SourceFileLoader object at 0x000001D083DA56D8>
import 'gpu_task_scheduler.gpu_task_scheduler' # <_frozen_importlib_external.SourceFileLoader object at 0x000001D083CA1D68>

C:\Users\e106923\Miniconda3\envs\DGAN\DoppelGANger-master\example_training\__pycache__\gan_task.cpython-36.pyc matches C:\Users\e106923\Miniconda3\envs\DGAN\DoppelGANger-master\example_training\gan_task.py

code object from 'C:\Users\e106923\Miniconda3\envs\DGAN\DoppelGANger-master\example_training\__pycache__\gan_task.cpython-36.pyc'

C:\Users\e106923\Miniconda3\envs\DGAN\lib\site-packages\gputaskscheduler-0.1.0-py3.6.egg\gpu_task_scheduler\__pycache__\gpu_task.cpython-36.pyc matches C:\Users\e106923\Miniconda3\envs\DGAN\lib\site-packages\gputaskscheduler-0.1.0-py3.6.egg\gpu_task_scheduler\gpu_task.py

code object from 'C:\Users\e106923\Miniconda3\envs\DGAN\lib\site-packages\gputaskscheduler-0.1.0-py3.6.egg\gpu_task_scheduler\__pycache__\gpu_task.cpython-36.pyc'

import 'gpu_task_scheduler.gpu_task' # <_frozen_importlib_external.SourceFileLoader object at 0x000001D083E26780>
import 'gan_task' # <_frozen_importlib_external.SourceFileLoader object at 0x000001D083E26208>

C:\Users\e106923\Miniconda3\envs\DGAN\lib\__pycache__\hmac.cpython-36.pyc matches C:\Users\e106923\Miniconda3\envs\DGAN\lib\hmac.py

code object from 'C:\Users\e106923\Miniconda3\envs\DGAN\lib\__pycache__\hmac.cpython-36.pyc'

import 'hmac' # <_frozen_importlib_external.SourceFileLoader object at 0x000001D083E266A0>

I can't figure out what the issue is, as there is no error but also no output.

Training does not run although the input is of the required form

Hello, I'm trying to run the TF2 version of DoppelGANger, but I'm running into some issues.
I have transformed the input data into the required form (according to the README). When trying to train the GAN, I run into the following error:

Traceback (most recent call last):
  File "C:\Users\dgtri\Desktop\DoppelGANger-TF2\example_training(wt)\main.py", line 29, in <module>
    normalize_per_sample(
  File "C:\Users\dgtri\Desktop\DoppelGANger-TF2\example_training(wt)\..\gan\util.py", line 182, in normalize_per_sample
    additional_attribute = np.stack(additional_attribute, axis=1)
  File "<__array_function__ internals>", line 5, in stack
  File "C:\Users\dgtri\anaconda3\lib\site-packages\numpy\core\shape_base.py", line 423, in stack
    raise ValueError('need at least one array to stack')
ValueError: need at least one array to stack

I am wondering what the cause of this error is.
It's worth noting that I have not pre-normalized the data inside the numpy arrays to be input.
I'm also attaching the numpy arrays in the screenshots below to make sure that they're in the right form.
Their shapes are:
(3682, 120, 4)
(3682, 5)
(3682, 120)

Screenshot (406)
Screenshot (407)
Screenshot (408)
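For reference, below is a minimal sketch of arrays with these shapes, assuming the data_train.npz layout described in the repository's README; random values stand in for the real data.

import numpy as np

# Hypothetical stand-ins matching the shapes above.
n_samples, max_len = 3682, 120

data_feature = np.random.rand(n_samples, max_len, 4).astype(np.float32)   # (3682, 120, 4)
data_attribute = np.random.rand(n_samples, 5).astype(np.float32)          # (3682, 5)
data_gen_flag = np.ones((n_samples, max_len), dtype=np.float32)           # (3682, 120)

# Assuming the README's packaging convention for the training file.
np.savez("data_train.npz",
         data_feature=data_feature,
         data_attribute=data_attribute,
         data_gen_flag=data_gen_flag)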

Thank you!

Dataset

Hi, I want to know where you found your datasets. I couldn't find more datasets with both measurements and attributes. Could you please tell me how to find such datasets? Thank you so much!

Timeseries with missing data

Hello,
I hope you are well.

First of all, amazing work. I particularly enjoyed reading the paper. It was very well-written and easily understandable.

I'm working with an open-source internet activity data set. It's a fairly small data set, with only hourly recordings over 5 weeks. From the way I understood the data formatting, I used the 'week of the year' as an attribute and the actual measurements as the feature.
The results were pretty impressive and I've attached a simple comparison plot of the real (orange) and generated (blue) sequences below.
intactivity_dpg4259

My data format looked something like this:

Week of year | 0   | 1   | 2   | 3   | ... | 168
0            | 123 | 456 | 678 | 567 | ... | 234
1            | 345 | 890 | 787 | 122 | ... | 345
...          | ... | ... | ... | ... | ... | ...

For now, there are 168 hours in each week, so the series length is constant and the series is active at every step. This made for fairly simple data pre-processing.
Now suppose I randomly removed some hourly values from each week and then trained the DG on that new, partial timeseries data.

Can the DG produce unique values for all hours, which I could then use to fill in the gaps in the original input?
Another way to phrase it: if each week had a different series length, could the DG produce the full 168 hours for each week based on the hourly values it gets?

If yes, what would the preprocessing of such a data set look like?

My idea is to add the hours as an additional feature, but I'm not sure whether I should drop the hours with missing values and let DG pad the end of the timeseries as it usually does, or do something else. I'm also not sure how this would be reflected in the data_gen_flag; would I just mark the timeseries as 'off' for those values?

I hope this makes sense. I'd love to hear your opinion on whether this sort of generation is possible, and a rough idea of what time attributes/features should be included to improve the results.
Thank you!
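For reference, a minimal sketch of the data_gen_flag convention as described in the README (1 while a sample is active, 0 for the padded tail); the per-week lengths here are made up.

import numpy as np

max_len = 168                    # hours in a full week
lengths = [168, 150, 160]        # hypothetical series length of each week

data_gen_flag = np.zeros((len(lengths), max_len), dtype=np.float32)
for i, length in enumerate(lengths):
    data_gen_flag[i, :length] = 1.0   # active timesteps; the tail stays 0 ("off")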

generated_samples

I am able to run the training algorithm, but when I run the data-generation step, it never creates any output in the "generated_samples" folder. I have attached the worker log here. Could you please help me with that?
Thanks in advance.

worker_generate_data.log

Docstrings in DoppelGANger

First of all, thanks for the great work. I was able to use it to generate time series of natural inflows into water reservoirs, and it worked without many changes to the code.

It took me a while to figure out the right format, though. I would have been much faster if the DoppelGANger class had proper docstrings. I pretty much had to guess and experiment to learn how the functions other than the constructor work. Providing some simple examples would also help.

Generating NAN values

The model is generating NaN values in the features and attributes.
My data has the following shapes:
print(data_feature.shape)
print(data_attribute.shape)
print(data_gen_flag.shape)

(500, 1, 3)
(500, 4)
(500, 1)

Reproducing Figure 1 WWT again

Hi @fjxmlzn,

Thank you for the phenomenal effort on the repo. Additionally, thank you for sharing the code for calculating the autocorrelation; it helped me match the autocorrelation curve in Figure 1.

I'm trying to reproduce the curve for DoppelGANger in Figure 1 below:
Screen Shot 2021-06-09 at 5 21 46 PM

First of all, I used the WWT data provided in the GDrive for training and testing. In addition, I ran the DoppelGANger framework provided in example_training(without_GPUTaskScheduler). I'm struggling to match the curve in Figure 1 with two different values of sample_len.

sample_len=5

acf_5

sample_len=10

acf_10

I was wondering if you can help me understand what went wrong and how I can reproduce the performance in the paper.
The hyperparameters used are:

"batch_size": 100,
"vis_freq": 200,
"vis_num_sample": 5,
"d_rounds": 1,
"g_rounds": 1,
"num_packing": 1,
"noise": True,
"feed_back": False,
"g_lr": 0.001,
"d_lr": 0.001,
"d_gp_coe": 10.0,
"gen_feature_num_layers": 1,
"gen_feature_num_units": 100,
"gen_attribute_num_layers": 3,
"gen_attribute_num_units": 100,
"disc_num_layers": 5,
"disc_num_units": 200,
"initial_state": "random",

"attr_d_lr": 0.001,
"attr_d_gp_coe": 10.0,
"g_attr_d_coe": 1.0,
"attr_disc_num_layers": 5,
"attr_disc_num_units": 200,

"epoch": [400],
"sample_len": [5, 10],
"extra_checkpoint_freq": [5],
"epoch_checkpoint_freq": [1],
"aux_disc": [True],
"self_norm": [True]

Real Valued Time Series with no attributes

Hi,

I have been meaning to test this model for the generation of real-valued time series only, with no (categorical) "attributes". Is that even possible?

Thanks in advance !

Dynamic attributes / attributes with time stamp?

Thank you so much for this nice work again! I have a question about the attribute (metadata) generator: is it possible to generate attributes with timestamps (dynamic attributes)? In the paper it seems to be an MLP that only generates static attributes. Or is there another approach by which we could input these dynamic attributes as conditions to generate features? The attributes might have the same length and timestamps as the features.

The question mainly arose from the technical blog posted on this repository: Hazy: (2) Generating Synthetic Sequential Data using GANs. In the second example, the author uses this kind of dynamic attribute, like hourly weather data. I also asked the author of the blog about the implementation details but have not received a reply.

Request for availability of the scripts used to reproduce figures

Dear @fjxmlzn and @wangchen615,

I am really interested in your work on synthetic generation of time series data with high-dimensional metadata; it is quite a wonderful paper to read (easily comprehensible for someone new to the field of GANs), and the proposed methods are sound and reasonable for alleviating some of the challenges in attaining the best synthetic-data fidelity.

Could I check with you about the availability of the scripts used to generate the figures in the paper? I would like to reproduce the results, especially the autocorrelation one. If possible, could you show how to run these functions or scripts?

Namely figures (1), (5), (6), (8), and (9)!

I really appreciate your help in this.

Thank you in advance

is_gen_flag

Thanks for your great efforts and achievements in DGAN!
I hope that you can answer my question below.
"Note that is_gen_flag should always set to False (default). is_gen_flag=True is for internal use only (see comments in doppelganger.py for details)."
Does this mean that I cannot set is_gen_flag to True, and that I must choose an appropriate number of timesteps for my examples?
Looking forward to your reply!

sample_len and sample_time

I'm confused by the meaning of these variables.

sample_len: The time series batch size.
self.sample_time = int(self.data_feature.shape[1] / self.sample_len)

In my view, data_feature.shape[1] is the length (number of steps) of the time series of a training sample, and it's a fixed value for all samples,
so sample_time should satisfy 0 <= sample_time < data_feature.shape[1].

Did I misunderstand?
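For reference, a worked example of the quoted line (the reading of sample_len as the number of timesteps emitted per RNN pass follows the paper's "batch generation"; this is not repository code):

series_length = 550   # data_feature.shape[1], e.g. the WWT series length
sample_len = 10       # timesteps generated per RNN pass ("batch generation")

sample_time = int(series_length / sample_len)
print(sample_time)    # 55: the number of RNN passes, not a timestep index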

TF2

Hi, I just want to know whether you are planning on releasing a version for TensorFlow 2. TF2 will probably be around for the next few years, and I think this is an interesting repository that could see more use in the near future. Thanks for your work!

Reproduce Figure 1 on WWT

Hi, thank you for your work and ultra-clean code; I appreciate it a lot. I'm trying to reproduce Figure 1 in the paper and ran into some problems. I've attached the image I got below:
acf

My first concern is that the ACF for the real WWT data (either train or test) doesn't match Figure 1 in the paper. This data was downloaded directly from the GDrive. I'm currently using statsmodels.tsa.stattools.acf and calculating the ACF for 550 nlags, averaged across 50,000 samples. Is this correct? I would appreciate pointers on how you calculated the ACF for the real data in Figure 1.
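For reference, a sketch of this averaged-ACF computation, assuming features of shape (num_samples, length, 1) loaded from the WWT data; variable names are illustrative.

import numpy as np
from statsmodels.tsa.stattools import acf

data_feature = np.load("data_train.npz")["data_feature"]  # (num_samples, length, 1)

nlags = min(550, data_feature.shape[1] - 1)  # nlags must stay below the series length
acfs = np.stack([acf(sample[:, 0], nlags=nlags, fft=True) for sample in data_feature])
mean_acf = acfs.mean(axis=0)                 # ACF averaged across samples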

My second concern is with the generated time series from DoppelGANger. I used the provided codebase with sample_len = 10, but as you can see in the ACF, it doesn't capture the weekly seasonality as clearly as Figure 1. Maybe this is related to how the ACF was calculated? The MSE between the DoppelGANger and real ACFs is 0.0005 in my case, which is close to the 0.0009 in Table 3, so I'm not sure what went wrong.

For quick reference:
Screen Shot 2021-06-09 at 5 21 46 PM

Training time

Hi, first of all, thanks for your great work! I was able to run the example code, but I ran into a problem: as I run the code multiple times, the per-epoch time multiplies. Maybe I don't understand the code well enough; I'm trying to get some help.

Dataset Format

Hello,
Thanks for the great work.
I prepared my dataset to use the tool, but when I run the training algorithm from "example_training > main.py", I get an exception at ..\gan\network.py line 262 in build: "unknown output type". I prepared the data_attribute_output and data_feature_output as per the instructions. Any idea why the exception is occurring?
Thanks in advance for your help.

About two MLPs

image

Is there actually only one MLP in the code? Are the "real" and "fake" metadata generated at the same time?

What tensorflow_privacy version are you using with tensorflow==1.4.0 and python==3.5?

Hi @fjxmlzn, I'm still trying to reproduce your data generation for WWT, and the best-case scenario I obtained is this, for eps=1.8E8:

epoch_id-14_feature_1

Can you tell me which version of tensorflow_privacy you used when testing with tensorflow==1.4.0 and python==3.5? For the example I'm attaching here, tensorflow_privacy is 0.4.0; for tensorflow_privacy>0.6.0, a version of tf>=2.0 is needed. Again, can you let me know exactly which version you used to obtain your results?

membership_inference_attack

Hey,
I am trying to reproduce Figure 15. Is it possible to share the source code for the membership inference attacks from your paper?
Thanks in advance for your help.

output feature has many 0.

Hi, when I try to generate the "web" output, the result is a series with a few initial values and all the others 0.

I don't know why this happens. I used tf 1.13 to run it.

For example:

array([[-0.07411094],
       [ 0.50166506],
       [ 0.30637154],
       [ 0.36993235],
       [ 0.08496358],
       [-0.11077818],
       [-0.81356037],
       [-0.16060808],
       [ 0.13039538],
       [ 0.13650048],
       [ 0.        ],
       [ 0.        ],
       [ 0.        ],
       [ 0.        ],
       [ 0.        ],
       [ 0.        ],
       [ 0.        ],
       [ 0.        ],
       [ 0.        ],
       [ 0.        ],
       [ 0.        ],
       [ 0.        ],
       [ 0.        ],
       [ 0.        ],
       [ 0.        ],
       [ 0.        ],
       [ 0.        ],
       [ 0.        ],
       [ 0.        ],
       [ 0.        ],

Code of AR and HMM baseline

Thanks for your great work, but I don't quite understand your implementation of the AR and HMM baselines:
(1) How do you train the HMM and use it to generate time series?
(2) In the AR model, if I understand correctly, you randomly sample the first record R1 from the distribution of the training data, and then autoregressively use the AR model to predict the next record. But with p=3, the model needs 3 past steps to predict the next step, so maybe you need to sample the first three records (see the sketch after this list)?
(3) Could you provide the code of these baselines?
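For reference, a minimal sketch of the seeding logic raised in point (2), with made-up AR(3) coefficients rather than the paper's fitted baseline:

import numpy as np

rng = np.random.default_rng(0)
p = 3
coeffs = np.array([0.5, 0.3, 0.1])   # hypothetical fitted AR(3) coefficients

# With p=3, three initial records must be sampled before the first prediction.
series = list(rng.normal(size=p))
for _ in range(100 - p):
    # newest value first, matching the coefficient order above
    nxt = coeffs @ np.array(series[-p:][::-1]) + rng.normal(scale=0.1)
    series.append(nxt)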

Data Format [Question]

Hello,

Is there any more documentation (notebooks/examples) on transforming data into the format required by the model?

Say I have the following columns, where Amount is the numerical column and Tag is the categorical column:

userId | date | Amount | Tag
A0|2020-01-01|200|green
A0|2020-01-02|300|blue
A2|2020-01-01|218|red
A2|2020-01-02|242|pink
A3|2020-01-04|38|red

What is the difference between data features and data attributes? I see that I would have to one-hot encode Tag into 4 columns [green, blue, red, pink] in this example, so the dataframe would become:

userId | date | Amount | green | blue | red | pink
A0|2020-01-01|200|1|0|0|0
A0|2020-01-02|300|0|1|0|0
A2|2020-01-01|218|0|0|1|0
A2|2020-01-02|242|0|0|0|1
A3|2020-01-04|38|0|0|1|0

How would I go about creating the data features/attributes? I am having a hard time understanding how I would transform simple data like the above to work with the model.
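For reference: in the paper's vocabulary, attributes are static per sample (metadata) while features vary per timestep (measurements). Below is a sketch of one plausible mapping for the table above, treating Amount plus the one-hot Tag as per-timestep features; this is an illustration, not the repository's documented pipeline.

import numpy as np
import pandas as pd

# Toy copy of the table above.
df = pd.DataFrame({
    "userId": ["A0", "A0", "A2", "A2", "A3"],
    "date": ["2020-01-01", "2020-01-02", "2020-01-01", "2020-01-02", "2020-01-04"],
    "Amount": [200, 300, 218, 242, 38],
    "Tag": ["green", "blue", "red", "pink", "red"],
})

one_hot = pd.get_dummies(df["Tag"])                  # one column per Tag value
per_step = pd.concat([df[["userId", "Amount"]], one_hot], axis=1)

users = sorted(df["userId"].unique())
max_len = int(df.groupby("userId").size().max())     # longest user history
n_feat = 1 + one_hot.shape[1]                        # Amount + one-hot Tag

data_feature = np.zeros((len(users), max_len, n_feat), dtype=np.float32)
data_gen_flag = np.zeros((len(users), max_len), dtype=np.float32)
for i, user in enumerate(users):
    rows = per_step.loc[per_step["userId"] == user].drop(columns="userId").to_numpy(dtype=np.float32)
    data_feature[i, :len(rows)] = rows
    data_gen_flag[i, :len(rows)] = 1.0
# Attributes would hold anything constant per user; this toy table has none.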

Thank you!

about mode-collapse

Thanks for your work!
I have noticed in the code that you tried to mitigate mode collapse by packing, but num_packing is 1 by default. Do you mean packing is unnecessary in this case?

Python 3 compatibility

The code is only compatible with Python 2.7. I would propose a pull request but could not get it to work with Python 3.

Here is a workaround for the time being:

import subprocess

# Launch the Python 2.7 interpreter in a subprocess and capture its output,
# so the model can be driven from a Python 3 environment.
process = subprocess.Popen(
    "python2.7 doppelganger.py",
    shell=True,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)
output, error = process.communicate()  # wait for completion, collect stdout/stderr

It is only a building block, but maybe this is useful for anybody who wants to use it from Python 3 now.

Runtime of DoppelGANger?

I'm running the experiment on the web dataset.

It has a shape of (50000, 550, 3), and the runtime looks like it is going to be 4 days on an RTX 2080 Ti, while using only about 10% of the GPU.

I am wondering if this is typical? It does seem like an awfully long time for that many datapoints.

Problem with tensorflow

Hello, I downloaded DoppelGANger, but I have problems when executing the code:

Traceback (most recent call last):
  File "main.py", line 7, in <module>
    from gan.doppelganger import DoppelGANger
  File "..\gan\doppelganger.py", line 1, in <module>
    import tensorflow as tf
  File "C:\Users\A-rep\AppData\Roaming\Python\Python38\site-packages\tensorflow\__init__.py", line 51, in <module>
    from ._api.v2 import compat
  File "C:\Users\A-rep\AppData\Roaming\Python\Python38\site-packages\tensorflow\_api\v2\compat\__init__.py", line 37, in <module>
    from . import v1
  File "C:\Users\A-rep\AppData\Roaming\Python\Python38\site-packages\tensorflow\_api\v2\compat\v1\__init__.py", line 30, in <module>
    from . import compat
  File "C:\Users\A-rep\AppData\Roaming\Python\Python38\site-packages\tensorflow\_api\v2\compat\v1\compat\__init__.py", line 37, in <module>
    from . import v1
  File "C:\Users\A-rep\AppData\Roaming\Python\Python38\site-packages\tensorflow\_api\v2\compat\v1\compat\v1\__init__.py", line 31, in <module>
    from tensorflow._api.v2.compat.v1 import config
  File "C:\Users\A-rep\AppData\Roaming\Python\Python38\site-packages\tensorflow\_api\v2\compat\v1\config\__init__.py", line 14, in <module>
    from tensorflow.python.eager.def_function import experimental_functions_run_eagerly
ImportError: cannot import name 'experimental_functions_run_eagerly' from 'tensorflow.python.eager.def_function'

I'm using tensorflow version 1.15.0 and python 3.9

Attribute problematic result

Hello! I have successfully run the DoppelGANger code on autonomous vehicle / pedestrian interaction data.
The problem is that the resulting samples are somewhat illogical. For example, a particular discrete attribute that ranged from 0 to around 10 in the input data has value=1 for every single generated sample.
I wonder if you have any input on what I could do (if anything) to fix that.
Thanks!

Inference from attributes

Dear @fjxmlzn,

Thank you for your great work! I was wondering whether I can use the feature generator to generate new features from given attributes. Based on what I've read in the paper it should be possible, and I have trained on my data, but I don't fully understand the data-generation function. Can you please tell me how I can use it for this sort of inference? I want to pass the attributes (in the same format as for training) and have it produce a feature vector conditioned on this set of attributes.

Please help.

The data generated ranges from 0 to 2

The trained model produces data in the range [0, 2], which I cannot restore to its original scale by multiplying by a value. Do you know what caused this? (I would guess the range should be [0, 1].)
image

unknown output type

Hello, thanks for your work, which fits my research needs perfectly! But I have encountered a minor problem, possibly in my "data_feature_output.pkl", because "worker_generate_data" told me:

File "D:\Anaconda\pythonProject14\DoppelGANger-TF2\example_generating_data\gan\network.py", line 262, in build
raise Exception("unknown output type")
Exception: unknown output type

How can I solve this?

My code for writing data_feature_output is in the following screenshot:

image
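For reference, a sketch of writing data_feature_output.pkl, assuming gan/output.py defines Output(type_, dim, normalization, is_gen_flag) together with the OutputType and Normalization enums; check output.py for the exact signature and import path.

import pickle

from gan.output import Output, OutputType, Normalization  # path relative to the repo root

# One continuous feature dimension normalized to [0, 1]; adjust to the data.
data_feature_output = [
    Output(type_=OutputType.CONTINUOUS, dim=1,
           normalization=Normalization.ZERO_ONE, is_gen_flag=False),
]

with open("data_feature_output.pkl", "wb") as f:
    pickle.dump(data_feature_output, f)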

How to pre-specify the length of features

Thank you for the great work again! Since I have only a small amount of training data with a fixed length, I would like to pre-specify the length of the features to reduce the training difficulty a bit, but I have no clue how to do that. I noticed in issue #7 (https://github.com/fjxmlzn/DoppelGANger/issues/7) you mentioned it could be achieved via a minor code change; could you please give me some suggestions on how to do so? Thank you so much in advance!

Incomplete training

Recently, I ran into another problem.
I tried to run main.py in the example_training folder and main_generate_data.py in the example_generating_data folder. However, the only result was that a folder named 'results' was created, and its subfolders contained only a worker_*.log.txt.
Q1: Why were no synthetic datasets for [web/google/FCC_MBA] generated?
Snipaste_2022-07-24_23-13-48
I looked for a place in the code to specify the dataset path, but found nothing.

Q2: When I know the attributes and features of my datasets, how do I generate the four files data_attribute_output.pkl, data_feature_output.pkl, data_test.npz, and data_train.npz? Does additional code need to be written to achieve this?

At last, thank you for your continued patient answers.

Originally posted by @chameleonzz in #3 (comment)

Large dimensionality of the attributes (metadata) and possible alternatives to realize synthetic time series generation "translated" from real measured time series?

Thank you so much for such great work! I have experimented with this novel GAN model and it works very well on some exemplary datasets. However, since my subsequent research on synthetic data generation is mainly about urban local climate prediction, the types of attributes (metadata) seem to be very different, and I am still a novice in the deep learning community, especially regarding time series, so I would like to ask:

  • Is it possible to use one image (maybe vectorized per pixel) as an attribute? I guess the major concern is that the size of the image might lead to very high dimensions (maybe hundreds to thousands of pixels). I've also noticed you mention in the paper, "When the total dimension of samples (measurements + metadata) is large, judging sample fidelity is hard," and that you therefore added an auxiliary discriminator for the generated attributes. Based on this, could I assume that the current architecture is already able to handle some high-dimensional inputs? Otherwise, is there any rule of thumb for the highest "safe" dimensionality? Then I could prepare the training data with respect to this.
  • Is it possible to generate synthetic time series based on some measured time series? In my case, the measured time series could be weather data from a nearby meteorological station, while the synthetic ones could be actual weather data in a specific scenario. In this task, these time series usually have the same length, so it might be comparable to "data augmentation" from the measured data, or maybe to the "image style translation" task but implemented on time series. One intuitive solution I could come up with is to use all these measured time series as one high-dimensional attribute (the length of an annual hourly time series is usually 8760), but that might again lead to the aforementioned high-dimensionality issue.

Since I currently need to prepare the training data according to the characteristics of DoppelGANger, and that will be a very nasty and time-consuming process, I could not yet answer these questions with my own experiments; sorry if I was too tedious with my case. Any suggestions or insights related to my questions would be appreciated. Thank you so much in advance!

DP_Training

Hey,
I am trying to train on the WWT dataset using "example_dp_training". I encountered the issue below, although I installed the TensorFlow differential-privacy library from https://github.com/tensorflow/privacy.
Thanks in advance for your help.

Error Log:
WARNING:tensorflow:From C:\Downloads\DoppelGANger-master\example_dp_training\gan_task.py:84: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

WARNING:tensorflow:From C:\Downloads\DoppelGANger-master\example_dp_training\gan_task.py:85: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2021-06-14 22:34:56.219546: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
(50000, 550, 1)
(50000, 14)
(50000, 550)
(50000, 550, 3)
2
Traceback (most recent call last):
File "C:\Anaconda3\envs\dopple_working\Scripts\start_gpu_task-script.py", line 33, in
sys.exit(load_entry_point('GPUTaskScheduler', 'console_scripts', 'start_gpu_task')())
File "c:\downloads\gputaskscheduler-master\gpu_task_scheduler\start_gpu_task.py", line 23, in main
worker.main()
File "C:\Downloads\DoppelGANger-master\example_dp_training\gan_task.py", line 121, in main
dp_l2_norm_clip=self._config["dp_l2_norm_clip"])
File "..\gan\doppelganger.py", line 191, in init
raise RuntimeError('tensorflow_privacy should be installed for'
RuntimeError: tensorflow_privacy should be installed for DP training

Error in summary.merge function

Hi Zinan Lin

I've been trying to run this on TF2 for a while now, after changing many lines from TF1 to TF2. I've reached this point... any help is appreciated...

Traceback (most recent call last):
  File "/opt/conda/bin/start_gpu_task", line 11, in <module>
    load_entry_point('GPUTaskScheduler', 'console_scripts', 'start_gpu_task')()
  File "/home/jovyan/GAN2/DoppelGANger/GPUTaskScheduler/gpu_task_scheduler/start_gpu_task.py", line 23, in main
    worker.main()
  File "/home/jovyan/GAN2/DoppelGANger/example_training/gan_task.py", line 120, in main
    gan.build()
  File "../gan/doppelganger.py", line 213, in build
    self.build_summary()
  File "../gan/doppelganger.py", line 516, in build_summary
    self.g_summary = tf.compat.v1.summary.merge(self.g_summary)
  File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/summary/summary.py", line 371, in merge
    val = _gen_logging_ops.merge_summary(inputs=inputs, name=name)
  File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/ops/gen_logging_ops.py", line 468, in merge_summary
    "MergeSummary", inputs=inputs, name=name)
  File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 436, in _apply_op_helper
    (prefix, dtype.name))
TypeError: Tensors in list passed to 'inputs' of 'MergeSummary' Op have types [bool, bool, bool] that do not match expected type string.

unreasonable output

I ran main.py under example_training(without_GPUTaskScheduler) and the training process was fine. However, when I run main.py under example_training on Windows with TensorFlow 1, the project produces no output and no error message. So I tried it in a TensorFlow 2 environment, which finally generated a results folder containing a lot of running information, but the worker.log and generate.log under each folder show the following: 'start_gpu_task' is not an internal or external command, nor is it a runnable program or batch file. Is there a reason for, or a solution to, this situation?
image

Packing & Differential privacy

Thanks a lot for your work!

I'm running DoppelGANger on my own use case, in which packing improves the results; here I use num_packing=10. I aim to do some experiments with differential privacy in the future. I was wondering, theoretically, does packing have an influence on this?

To explain what I mean: I suppose that when using num_packing=10, each sample is "seen" 10 times per epoch. Does this mean that we have to use a stricter noise multiplier to reach the same level of differential privacy? I do not see this in your code right now, but it could be that I misunderstand the impact of packing on the differential privacy level.
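For reference, one way to probe this hypothesis is to compare the privacy accountant's output for the plain batch size against a tenfold effective batch, e.g. with tensorflow_privacy's compute_dp_sgd_privacy helper. This is a sketch under that assumption only; whether packing actually changes the accounting is exactly the open question here, the helper's import path varies across tensorflow_privacy versions, and all numbers below are made up.

from tensorflow_privacy.privacy.analysis.compute_dp_sgd_privacy import (
    compute_dp_sgd_privacy,
)

n = 50000               # hypothetical number of training samples
noise_multiplier = 1.1
epochs = 400
delta = 1e-5

eps_plain, _ = compute_dp_sgd_privacy(n, 100, noise_multiplier, epochs, delta)
eps_packed, _ = compute_dp_sgd_privacy(n, 100 * 10, noise_multiplier, epochs, delta)
print(eps_plain, eps_packed)   # compare epsilon with and without the 10x assumption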

Generating time series with negative values

Hi,

I'm training a model on daily household electricity consumption. Because of solar power, negative values are present in the data set. Is there any experience with how the model behaves with negative values when using normalize_per_sample? It seems that the model shifts the mean value to a negative value when it should be positive.

So I'm mostly wondering whether there has been experience with negative values in time series, or whether I should rather make everything positive and apply some post-processing to the samples afterwards.

Unable to run main.py in example_training

Dear @fjxmlzn - I have created all the necessary files from my dataset (e.g. the .pkl files) and want to train DoppelGANger on it. Following the description, I use the main.py file in example_training; it exits very quickly, and after checking the results directory, I find the following error in worker.log:

Traceback (most recent call last):
  File "/home/arian.khorasani/scratch/Generative_models/generative-env/bin/start_gpu_task", line 33, in <module>
    sys.exit(load_entry_point('GPUTaskScheduler', 'console_scripts', 'start_gpu_task')())
  File "/home/arian.khorasani/GPUTaskScheduler/gpu_task_scheduler/start_gpu_task.py", line 23, in main
    worker.main()
  File "/home/arian.khorasani/DoppelGANger/DoppelGANGer/example_training/gan_task.py", line 12, in main
    from network import DoppelGANgerGenerator, Discriminator, \
  File "/home/arian.khorasani/DoppelGANger/DoppelGANGer/gan/network.py", line 2, in <module>
    from .op import linear, batch_norm, flatten
ImportError: attempted relative import with no known parent package

I think the issue is related to the network.py file, which has the following imports:

from .op import linear, batch_norm, flatten
from .output import OutputType, Normalization

but after removing the leading dot I still get that error, so I'm not sure how to solve it. Any clue or feedback would be appreciated!
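For reference, relative imports like from .op import ... resolve only when gan is imported as a package; a common workaround (a sketch, not the repository's documented fix) is to put the repository root on sys.path and import through the package:

import os
import sys

# Assuming this runs from a script inside example_training/, with gan/ one level up.
sys.path.insert(0, os.path.join(os.path.dirname(os.path.abspath(__file__)), ".."))

from gan.network import DoppelGANgerGenerator, Discriminator  # package import keeps ".op" resolvable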
Thank you!

module 'zmq.backend.cython.socket' has no attribute 'get'

I ran into some problems when running scheduler.start(). It says module 'zmq.backend.cython.socket' has no attribute 'get',
and Can't get attribute 'get' on <module 'zmq.backend.cython.socket' from 'E:\\Users\\shand\\anaconda3\\envs\\DoppelGANger\\lib\\site-packages\\zmq\\backend\\cython\\socket.cp35-win_amd64.pyd'>,
and Can't pickle <cyfunction Socket.get at 0x000001FCDFCC71B8>: it's not found as zmq.backend.cython.socket.get.

No Error but no completed Results of the web example

Hello Zinan,

I tried to run all the examples together ("google", "web", "FCC_MBA"), but it seems that they require an extremely large amount of RAM.
Then I tried to run only "web" with much less computationally demanding parameters, shown below.

However, as you can see in the attached file, the simulation stops at 12%...

Why am I not able to run it to the end? The simulation probably exits because the processing time was exceeded, but as you can see in the file, there is no error in the output (only warnings)...
My node has 32 GB RAM and 4 cores.

The less demanding parameters for the web example:
{
    "dataset": ["web"],
    "epoch": [50],  # default: [400]
    "run": [0],  # default: [0, 1, 2]
    "sample_len": [10],  # default: [1, 5, 10, 25, 50]
    "extra_checkpoint_freq": [5],
    "epoch_checkpoint_freq": [1],
    "aux_disc": [True],
    "self_norm": [True]
},

worker.log
