Comments (12)
Would you mind providing more details, e.g., the config.py file you are using?
from doppelganger.
| a | b | c | d | e0 | e1 | f |
|---|---|---|---|----|----|---|
| 1618.0 | 2689.0 | 4615.0 | 0 | 0 | 1 | 1 |
I have data like this. `d` only contains 0 and `f` only contains 1.
```python
data_feature_output = [
    output.Output(type_=OutputType.CONTINUOUS, dim=1,
                  normalization=Normalization.ZERO_ONE, is_gen_flag=False),
    output.Output(type_=OutputType.CONTINUOUS, dim=1,
                  normalization=Normalization.ZERO_ONE, is_gen_flag=False),
    output.Output(type_=OutputType.CONTINUOUS, dim=1,
                  normalization=Normalization.ZERO_ONE, is_gen_flag=False),
]

data_attribute_output = [
    output.Output(type_=OutputType.DISCRETE, dim=1, normalization=None, is_gen_flag=False),
    output.Output(type_=OutputType.DISCRETE, dim=2, normalization=None, is_gen_flag=False),
    output.Output(type_=OutputType.DISCRETE, dim=1, normalization=None, is_gen_flag=False),
]
```
I used this format to prepare the data.
The data gen flag was created like this (the sequence length is 1 in my case):
```python
data_gen_flag = np.ones((data_feature.shape[0], SEQ_LEN), dtype="float32")
```
Shapes of all:
```python
print(data_feature.shape)    # (500, 1, 3)
print(data_attribute.shape)  # (500, 4)
print(data_gen_flag.shape)   # (500, 1)
```
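(As a sanity check, here is a small sketch of how these shapes relate to the declared output dims; the helper name is my own, not from the repo:)

```python
import numpy as np

def check_shapes(data_feature, data_attribute, data_gen_flag,
                 feature_outputs_dims, attribute_outputs_dims):
    """Verify the array shapes are consistent with the declared output dims."""
    n, seq_len, feat_dim = data_feature.shape
    # feature dim is the sum of the feature output dims: 1 + 1 + 1 = 3
    assert feat_dim == sum(feature_outputs_dims)
    # attribute dim is the sum of the attribute output dims: 1 + 2 + 1 = 4
    assert data_attribute.shape == (n, sum(attribute_outputs_dims))
    # gen flag has one entry per timestep of each sample
    assert data_gen_flag.shape == (n, seq_len)

check_shapes(
    np.zeros((500, 1, 3)), np.zeros((500, 4)), np.zeros((500, 1)),
    feature_outputs_dims=[1, 1, 1],
    attribute_outputs_dims=[1, 2, 1],
)
print("shapes consistent")
```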
Now when I run the training notebook, the data looks good before normalizing the samples. After normalizing, it looks as below:
Data feature (shape `(500, 1, 5)`):
```python
array([[[ 0., nan, nan,  0.,  1.]],
       [[ 0., nan, nan,  0.,  1.]],
       [[ 0., nan, nan,  0.,  1.]],
       ...,
       [[ 0., nan, nan,  0.,  1.]],
       [[ 0., nan, nan,  0.,  1.]],
       [[ 0., nan, nan,  0.,  1.]]])
```
Data attribute (shape `(500, 10)`), showing only one sample:
```python
array([0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 1.0000000e+00,
       1.6180000e+03, 1.2207031e-04, 2.6890000e+03, 0.0000000e+00,
       4.6150000e+03, 0.0000000e+00], dtype=float32)
```
I think the issue is in the normalize-sample part. Please let me know what you think.
Thanks for the details! I can see two potential issues here.
About the feature part, may I know which "training notebook" you are using? There was a bug in a very early version of the normalization code, which could cause this issue. But it was already fixed in 34efa73.
About the attribute part: the `dim` parameter in `output.Output` means the number of possible values for that field, and the corresponding dimensions in the data should be the one-hot encoding. So if a discrete field has only 1 possible value (i.e., "`d` only contains 0 and `f` only contains 1", as you said), then the corresponding dimension in the data should be 1 (the one-hot encoding of a 1-category variable is just a single 1). (Modeling fixed-value fields with GANs is not that useful, but I assume you are just doing an initial test of the data and code.)
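As an illustration of this encoding convention (my own sketch, not repo code): a discrete field with `dim` possibilities occupies `dim` columns of the attribute array, one-hot encoded, so a field with a single possible value becomes a single column of ones.

```python
import numpy as np

def one_hot(values, num_categories):
    """One-hot encode an integer array into num_categories columns."""
    out = np.zeros((len(values), num_categories), dtype="float32")
    out[np.arange(len(values)), values] = 1.0
    return out

# a field with two possibilities (e.g. the e column -> e0, e1)
e = np.array([0, 1, 1, 0])
print(one_hot(e, 2))   # two columns, one-hot

# a field with a single possibility (e.g. d, which only contains 0)
d = np.array([0, 0, 0, 0])
print(one_hot(d, 1))   # a single column of ones, never zeros
```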
I am using your training code without a GPU.
I haven't one-hot encoded the `d` and `f` columns. I have only one-hot encoded the `e` column into `e0` and `e1`, and I put `dim=2` in the output for it.
Yes, I am using this normalization code. I am unable to understand this issue.
About the attribute part: you should use one-hot encoding for all discrete fields. To fix this, it is as simple as changing the `d` column from zeros to ones.
For the feature part, I just realized another issue. If you are setting `normalization=Normalization.ZERO_ONE`, you should make sure that the corresponding dimensions have values between zero and one (by normalizing them before feeding them to DoppelGANger), but your values seem to be beyond that range. That is an independent issue from the nan problem, though; I cannot see why the nans happen if you are indeed using the latest code. Could you please share the code and data (a sub-sample that reproduces the problem is sufficient) somewhere (e.g., on Google Drive) so that I can take a look?
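A minimal sketch of such pre-scaling (my own, not from the repo): min-max scale each continuous column into [0, 1] before passing it in, and keep the min/max so generated output can be mapped back to the original scale.

```python
import numpy as np

def minmax_scale(x):
    """Scale an array into [0, 1]; return the scaled data plus (min, max) for the inverse."""
    lo, hi = x.min(), x.max()
    scaled = (x - lo) / (hi - lo)
    return scaled, (lo, hi)

def minmax_unscale(scaled, lo, hi):
    """Map [0, 1] values back to the original range."""
    return scaled * (hi - lo) + lo

col = np.array([1618.0, 2689.0, 4615.0])
scaled, (lo, hi) = minmax_scale(col)
print(scaled)  # values now lie in [0, 1]
```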
Yes, sure. Here is the link to a folder containing a sample of the data and the notebooks I am using to prepare the data and for training:
https://drive.google.com/drive/folders/1NGYyAIphe2MmowL5T32sDIM0OKVFeOO0?usp=sharing
It is time-series data.
Thanks for sharing! I just checked the training code. The normalization part is NOT from the latest version in this repo. Please incorporate 34efa73 into it.
I am using the same code.
```python
def normalize_per_sample(data_feature, data_attribute, data_feature_outputs,
                         data_attribute_outputs, eps=1e-4):
    # assume all samples have maximum length
    data_feature_min = np.amin(data_feature, axis=1)
    data_feature_max = np.amax(data_feature, axis=1)
    additional_attribute = []
    additional_attribute_outputs = []
    dim = 0
    for output in data_feature_outputs:
        if output.type_ == OutputType.CONTINUOUS:
            for _ in range(output.dim):
                max_ = data_feature_max[:, dim] + eps
                min_ = data_feature_min[:, dim] - eps
                additional_attribute.append((max_ + min_) / 2.0)
                additional_attribute.append((max_ - min_) / 2.0)
                additional_attribute_outputs.append(Output(
                    type_=OutputType.CONTINUOUS,
                    dim=1,
                    normalization=output.normalization,
                    is_gen_flag=False))
                additional_attribute_outputs.append(Output(
                    type_=OutputType.CONTINUOUS,
                    dim=1,
                    normalization=Normalization.ZERO_ONE,
                    is_gen_flag=False))
                max_ = np.expand_dims(max_, axis=1)
                min_ = np.expand_dims(min_, axis=1)
                data_feature[:, :, dim] = \
                    (data_feature[:, :, dim] - min_) / (max_ - min_)
                if output.normalization == Normalization.MINUSONE_ONE:
                    data_feature[:, :, dim] = \
                        data_feature[:, :, dim] * 2.0 - 1.0
                dim += 1
        else:
            dim += output.dim
    real_attribute_mask = ([True] * len(data_attribute_outputs) +
                           [False] * len(additional_attribute_outputs))
    additional_attribute = np.stack(additional_attribute, axis=1)
    data_attribute = np.concatenate(
        [data_attribute, additional_attribute], axis=1)
    data_attribute_outputs.extend(additional_attribute_outputs)
    return data_feature, data_attribute, data_attribute_outputs, \
        real_attribute_mask
```
I think the issue is in this part:
```python
data_feature[:, :, dim] = (data_feature[:, :, dim] - min_) / (max_ - min_)
```
Please have a look at the `normalize_per_sample` and `renormalize_per_sample` functions.
Your training_notebook.ipynb did NOT call `gan.util.normalize_per_sample` at all. It normalizes the samples directly inside training_notebook.ipynb, and it normalizes them in the old way, without 34efa73:
```python
data_feature_min = np.amin(data_feature, axis=1)
data_feature_max = np.amax(data_feature, axis=1)
additional_attribute = []
additional_attribute_outputs = []
dim = 0
for output in data_feature_outputs:
    if output.type_ == OutputType.CONTINUOUS:
        for _ in range(output.dim):
            max_ = data_feature_max[:, dim]
            min_ = data_feature_min[:, dim]
            additional_attribute.append((max_ + min_) / 2.0)
            additional_attribute.append((max_ - min_) / 2.0)
            additional_attribute_outputs.append(Output(
                type_=OutputType.CONTINUOUS,
                dim=1,
                normalization=output.normalization,
                is_gen_flag=False))
            additional_attribute_outputs.append(Output(
                type_=OutputType.CONTINUOUS,
                dim=1,
                normalization=Normalization.ZERO_ONE,
                is_gen_flag=False))
            max_ = np.expand_dims(max_, axis=1)
            min_ = np.expand_dims(min_, axis=1)
            data_feature[:, :, dim] = \
                (data_feature[:, :, dim] - min_) / (max_ - min_)
            if output.normalization == Normalization.MINUSONE_ONE:
                data_feature[:, :, dim] = \
                    data_feature[:, :, dim] * 2.0 - 1.0
            dim += 1
    else:
        dim += output.dim
```
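This old version is where the nan comes from when the sequence length is 1: `np.amin` and `np.amax` over `axis=1` then coincide for every sample, so `max_ - min_` is zero and the division produces nan. The `eps` added in 34efa73 keeps the denominator nonzero. A minimal reproduction (my own sketch, not repo code):

```python
import numpy as np

# one continuous feature, sequence length 1: per-sample min == max
data_feature = np.array([[[1618.0]], [[2689.0]], [[4615.0]]])
mn = np.amin(data_feature, axis=1)   # shape (3, 1); equals the single timestep
mx = np.amax(data_feature, axis=1)   # identical to mn when seq_len == 1

# old normalization: denominator is 0, so every entry becomes nan
with np.errstate(invalid="ignore"):
    old = (data_feature[:, :, 0] - mn) / (mx - mn)

# with the eps from 34efa73 the denominator is 2 * eps, so we get 0.5 instead
eps = 1e-4
new = (data_feature[:, :, 0] - (mn - eps)) / ((mx + eps) - (mn - eps))

print(np.isnan(old).all())   # True
print(np.isnan(new).any())   # False
```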
Moreover, I think you only uploaded the main script (training_notebook.ipynb) without the other Python files it depends on, so I have no idea which versions of those files you are using. (But that doesn't matter, since the script does not call `gan.util.normalize_per_sample` at all.)
I'll close the issue for now. Feel free to reopen it if you still experience issues :)