Comments (12)
Glad to hear that you found the problem! Thanks!
from doppelganger.
Would you mind providing more details: which training/generation code did you run, and which parameters did you use?
Thanks!
Thank you. I tried different settings; here is one of them. I used only 10,000 data points from the "web" dataset instead of 50,000 to reduce training time (currently I am also training on the full 50,000 data points with 400 epochs).
epoch = 400
batch_size = 20
vis_freq = 200
vis_num_sample = 5
d_rounds = 3
g_rounds = 1
d_gp_coe = 10.0
attr_d_gp_coe = 10.0
g_attr_d_coe = 1.0
extra_checkpoint_freq = 50
num_packing = 1
g_lr = 0.0001
d_lr = 0.0001
Here is my code. I trained it using JupyterLab on an Azure compute instance.
Training and gen data: https://colab.research.google.com/drive/15uDCfcBY7s-MxMTgJPwD8kc725Zzcs9S?usp=sharing
Load model and gen data: https://colab.research.google.com/drive/1CQQVoexeyJJhXp0vku9R6T78FvZoxPVg?usp=sharing
Thank you.
Thanks for the information!
First, about why the features have many zeros at the end: it's probably because those samples have a length shorter than 550, and the rest is padded with zeros. You can see the length of each generated sample from the data_gen_flag field in the generated data.
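To make the padding concrete, here is a minimal numpy sketch (the arrays are made up for illustration; in real generated output you would load data_feature and data_gen_flag from the saved file):

```python
import numpy as np

# Hypothetical generated batch: 3 samples, max length 5, 1 feature dim.
# data_gen_flag is 1 while the sample is "alive" and 0 after it ends.
data_feature = np.array([
    [[0.4], [0.7], [0.0], [0.0], [0.0]],   # true length 2, rest is padding
    [[0.1], [0.2], [0.3], [0.9], [0.5]],   # true length 5, no padding
    [[0.6], [0.8], [0.2], [0.0], [0.0]],   # true length 3
])
data_gen_flag = np.array([
    [1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [1, 1, 1, 0, 0],
])

# The effective length of each sample is the number of 1s in its
# gen-flag row; everything past that index is zero padding.
lengths = data_gen_flag.sum(axis=1).astype(int)
print(lengths)  # [2 5 3]
```

So trailing zeros in the features are expected whenever the corresponding gen flags are zero.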
Second, about why there are lengths shorter than 550: we let DoppelGANger learn the correct length of each sample instead of pre-specifying it, so that it can handle more general cases (although DG can also support specifying the lengths with minor code changes, discussed in Appendix B). From our experience, to ensure good fidelity (including the lengths), you need to make sure that the total number of training iterations is large enough. So if you decrease the size of the training set by 5x, you need to increase the number of epochs by about 5x as well (i.e., it's hard to reduce the training time if you want good fidelity). We did test how fidelity changes with the number of training samples (at approximately the same total number of training iterations) in Figure 11 of our paper. In general, you get better fidelity by training on more samples (as expected). So I would recommend using the default parameters in the code.
If you really want to train on 10,000 samples instead, the settings we used in Figure 11 kept all parameters at the code defaults except the number of epochs. I see that you tuned other parameters as well, which could give better results (but we haven't tried that). Empirically, the default parameters are good enough across different datasets and settings.
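To make the iteration-count arithmetic concrete (numbers taken from this thread, with the batch size of 20 from the settings above):

```python
# Keeping the total number of training iterations roughly constant
# when shrinking the dataset from 50,000 to 10,000 samples.
batch_size = 20

full_n, full_epochs = 50_000, 400
iters_full = full_epochs * full_n // batch_size    # iterations at 50k samples

small_n = 10_000
small_epochs = iters_full * batch_size // small_n  # epochs needed at 10k samples
print(iters_full, small_epochs)  # 1000000 2000
```

So a 5x smaller dataset needs about 5x more epochs (400 -> 2000) to see the same number of gradient updates.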
Hope this helps, and let me know if you have further questions : )
Thank you for your response.
1. Yes, I should check the features again.
2. I will train with more epochs. My deadline is quite tight, but I should definitely try more epochs.
About the parameters: at first I used your parameters, but the results were quite poor. I tried changing them following a blog post, but the results did not improve at all. I will use the default parameters in my future training.
I will try your suggestion and update the result here.
Thank you! Yeah I think training for more iterations should give much better results : )
Could you provide a pretrained model for one of your datasets (web or fcc_mba would be best)? I just want to check my generation code. Thank you.
Sure! Here is the pretrained model for web (sample_len=10): https://drive.google.com/drive/folders/1nZly-2G9h212bwzrSDcIeMqfqO9x1miv?usp=sharing
I think the problem is in the generation part because:
- At first, I loaded your pretrained model and found that the problem stayed the same. But I noticed that the length of the output equals sample_len (10 non-zero values at the start and all others zero; I checked with numpy).
- I tried generating output right after building the GAN (after gan.build(), without training). Even without training, it produces the same kind of output (10 non-zero values at the start, the rest zero).
- If I change the sample_len value, the number of non-zero values equals sample_len (I tried 50, 55, and 550; in the 550 case the output is filled entirely with non-zero values).
So I am not sure about the training part, but the generation part definitely has some issue. I am figuring it out. Do you have any ideas?
I see. This is weird. I have tried using this checkpoint to generate samples, and everything looks good. The example script I used for checking is at https://gist.github.com/fjxmlzn/fc61538ae69bf3633334a00401d5b3a6 (put it under DoppelGANger/example_generating_data/ and change mid_checkpoint_dir to the path of the checkpoint I uploaded to google drive).
I found the problem: I had added this line to doppelganger.py:
self.sess.run(tf.global_variables_initializer())
I added it to the sample_from() and train() functions while trying to fix some bugs during training.
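For context on why that line broke generation: in TF1, tf.global_variables_initializer() resets every variable to its initial value, so calling it inside sample_from() after the checkpoint has been restored silently throws away the trained weights. A toy numpy analogue of the failure mode (not the actual DoppelGANger code; the names are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a TF1 session: "variables" live in a dict.
variables = {"w": None}

def initializer():
    # Analogue of tf.global_variables_initializer(): overwrites every
    # variable with fresh initial values, discarding whatever was there.
    variables["w"] = rng.normal(size=3)

def restore_checkpoint():
    # Analogue of saver.restore(): loads the trained weights.
    variables["w"] = np.array([0.5, -1.2, 2.0])

restore_checkpoint()
trained = variables["w"].copy()

initializer()          # the extra call inside sample_from()
after = variables["w"]

print(np.allclose(trained, after))  # False: the trained weights are gone
```

Generation then runs on freshly initialized (i.e., untrained) weights, which matches the symptoms above.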
I am able to generate normal output (from your pretrained model) now.
Thank you very much for your support.
A small note: I cloned your repo 19 days ago, and I just cloned it again today (and it generates normal output).
Why did I add those lines? Because I wanted a separate session to view/debug some things. I often use these lines:
run_config = tf.ConfigProto()
run_config.gpu_options.allow_growth = True
sess = tf.Session(config=run_config)
sess.run(tf.global_variables_initializer())
Since 2019 I have switched to writing TF code with Keras, so this was quite hard for me to debug. Thank you again for your support.