Comments (12)
Would you mind providing more details, e.g., the config.py file you are using?
from doppelganger.
| a | b | c | d | e0 | e1 | f |
|---|---|---|---|----|----|---|
| 1618.0 | 2689.0 | 4615.0 | 0 | 0 | 1 | 1 |
I have data like this. `d` only contains 0 and `f` only contains 1.
```python
data_feature_output = [
    output.Output(type_=OutputType.CONTINUOUS, dim=1,
                  normalization=Normalization.ZERO_ONE, is_gen_flag=False),
    output.Output(type_=OutputType.CONTINUOUS, dim=1,
                  normalization=Normalization.ZERO_ONE, is_gen_flag=False),
    output.Output(type_=OutputType.CONTINUOUS, dim=1,
                  normalization=Normalization.ZERO_ONE, is_gen_flag=False),
]

data_attribute_output = [
    output.Output(type_=OutputType.DISCRETE, dim=1, normalization=None, is_gen_flag=False),
    output.Output(type_=OutputType.DISCRETE, dim=2, normalization=None, is_gen_flag=False),
    output.Output(type_=OutputType.DISCRETE, dim=1, normalization=None, is_gen_flag=False),
]
```
I used this format to prepare the data.
The data gen flag was created like this (the sequence length is 1 in my case):
```python
data_gen_flag = np.ones((data_feature.shape[0], SEQ_LEN), dtype="float32")
```
Shapes of all:
```python
print(data_feature.shape)    # (500, 1, 3)
print(data_attribute.shape)  # (500, 4)
print(data_gen_flag.shape)   # (500, 1)
```
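(As a sanity check, here is a small sketch of how these shapes relate to the declared output dims; the helper name is my own, not from the repo:)

```python
import numpy as np

def check_shapes(data_feature, data_attribute, data_gen_flag,
                 feature_outputs_dims, attribute_outputs_dims):
    """Verify the array shapes are consistent with the declared output dims."""
    n, seq_len, feat_dim = data_feature.shape
    # feature dim is the sum of the feature output dims: 1 + 1 + 1 = 3
    assert feat_dim == sum(feature_outputs_dims)
    # attribute dim is the sum of the attribute output dims: 1 + 2 + 1 = 4
    assert data_attribute.shape == (n, sum(attribute_outputs_dims))
    # gen flag has one entry per timestep of each sample
    assert data_gen_flag.shape == (n, seq_len)

check_shapes(
    np.zeros((500, 1, 3)), np.zeros((500, 4)), np.zeros((500, 1)),
    feature_outputs_dims=[1, 1, 1],
    attribute_outputs_dims=[1, 2, 1],
)
print("shapes consistent")
```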
Now when I run the training notebook, the data looks good before normalizing the samples. After normalizing, it looks as below:
Data feature (shape `(500, 1, 5)`):
```python
array([[[ 0., nan, nan,  0.,  1.]],
       [[ 0., nan, nan,  0.,  1.]],
       [[ 0., nan, nan,  0.,  1.]],
       ...,
       [[ 0., nan, nan,  0.,  1.]],
       [[ 0., nan, nan,  0.,  1.]],
       [[ 0., nan, nan,  0.,  1.]]])
```
Data attribute (shape `(500, 10)`), showing only one sample:
```python
array([0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 1.0000000e+00,
       1.6180000e+03, 1.2207031e-04, 2.6890000e+03, 0.0000000e+00,
       4.6150000e+03, 0.0000000e+00], dtype=float32)
```
I think the issue is in the normalize-sample part. Please let me know what you think.
Thanks for the details! I can see two potential issues here.
About the feature part, may I know which "training notebook" you are using? There was a bug in a very early version of the normalization code, which could cause this issue. But it was already fixed in 34efa73.
About the attribute part: the `dim` parameter in `output.Output` means the number of possible values for that field, and the corresponding dimensions in the data should be the one-hot encoding. So if a discrete field has only 1 possible value (i.e., "`d` only contains 0 and `f` only contains 1", as you said), then the corresponding dimension in the data should be 1 (the one-hot encoding of a 1-category variable is just a single 1). (Modeling fixed-value fields with GANs is not that useful, but I assume you are just doing an initial test of the data and code.)
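As an illustration of this encoding convention (my own sketch, not repo code): a discrete field with `dim` possibilities occupies `dim` columns of the attribute array, one-hot encoded, so a field with a single possible value becomes a single column of ones.

```python
import numpy as np

def one_hot(values, num_categories):
    """One-hot encode an integer array into num_categories columns."""
    out = np.zeros((len(values), num_categories), dtype="float32")
    out[np.arange(len(values)), values] = 1.0
    return out

# a field with two possibilities (e.g. the e column -> e0, e1)
e = np.array([0, 1, 1, 0])
print(one_hot(e, 2))   # two columns, one-hot

# a field with a single possibility (e.g. d, which only contains 0)
d = np.array([0, 0, 0, 0])
print(one_hot(d, 1))   # a single column of ones, never zeros
```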
I am using your training code without a GPU.
I haven't one-hot encoded the `d` and `f` columns. I have only one-hot encoded the `e` column into `e0` and `e1`, and I put `dim=2` in the output for it.
Yes, I am using this normalization code. I am unable to understand this issue.
About the attribute part: you should use one-hot encoding for all discrete fields. To fix this, it is as simple as changing the `d` column from zeros to ones.
For the feature part, I just realized another issue. If you are setting `normalization=Normalization.ZERO_ONE`, you should make sure that the corresponding dimensions have values between zero and one (by normalizing them before feeding them to DoppelGANger), but your values seem to be beyond that range. That is an independent issue from the nan problem, though; I cannot see why the nans happen if you are indeed using the latest code. Could you please share the code and data (a sub-sample that reproduces the problem is sufficient) somewhere (e.g., on Google Drive) so that I can take a look?
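A minimal sketch of such pre-scaling (my own, not from the repo): min-max scale each continuous column into [0, 1] before passing it in, and keep the min/max so generated output can be mapped back to the original scale.

```python
import numpy as np

def minmax_scale(x):
    """Scale an array into [0, 1]; return the scaled data plus (min, max) for the inverse."""
    lo, hi = x.min(), x.max()
    scaled = (x - lo) / (hi - lo)
    return scaled, (lo, hi)

def minmax_unscale(scaled, lo, hi):
    """Map [0, 1] values back to the original range."""
    return scaled * (hi - lo) + lo

col = np.array([1618.0, 2689.0, 4615.0])
scaled, (lo, hi) = minmax_scale(col)
print(scaled)  # values now lie in [0, 1]
```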
Yes, sure. Here is the link to a folder containing a sample of the data and the notebooks I am using to prepare the data and for training:
https://drive.google.com/drive/folders/1NGYyAIphe2MmowL5T32sDIM0OKVFeOO0?usp=sharing
It is time-series data.
Thanks for sharing! I just checked the training code. The normalization part is NOT from the latest version in this repo. Please incorporate 34efa73 into it.
I am using the same code.
```python
def normalize_per_sample(data_feature, data_attribute, data_feature_outputs,
                         data_attribute_outputs, eps=1e-4):
    # assume all samples have maximum length
    data_feature_min = np.amin(data_feature, axis=1)
    data_feature_max = np.amax(data_feature, axis=1)
    additional_attribute = []
    additional_attribute_outputs = []
    dim = 0
    for output in data_feature_outputs:
        if output.type_ == OutputType.CONTINUOUS:
            for _ in range(output.dim):
                max_ = data_feature_max[:, dim] + eps
                min_ = data_feature_min[:, dim] - eps
                additional_attribute.append((max_ + min_) / 2.0)
                additional_attribute.append((max_ - min_) / 2.0)
                additional_attribute_outputs.append(Output(
                    type_=OutputType.CONTINUOUS,
                    dim=1,
                    normalization=output.normalization,
                    is_gen_flag=False))
                additional_attribute_outputs.append(Output(
                    type_=OutputType.CONTINUOUS,
                    dim=1,
                    normalization=Normalization.ZERO_ONE,
                    is_gen_flag=False))
                max_ = np.expand_dims(max_, axis=1)
                min_ = np.expand_dims(min_, axis=1)
                data_feature[:, :, dim] = \
                    (data_feature[:, :, dim] - min_) / (max_ - min_)
                if output.normalization == Normalization.MINUSONE_ONE:
                    data_feature[:, :, dim] = \
                        data_feature[:, :, dim] * 2.0 - 1.0
                dim += 1
        else:
            dim += output.dim
    real_attribute_mask = ([True] * len(data_attribute_outputs) +
                           [False] * len(additional_attribute_outputs))
    additional_attribute = np.stack(additional_attribute, axis=1)
    data_attribute = np.concatenate(
        [data_attribute, additional_attribute], axis=1)
    data_attribute_outputs.extend(additional_attribute_outputs)
    return data_feature, data_attribute, data_attribute_outputs, \
        real_attribute_mask
```
I think the issue is in this part:
```python
data_feature[:, :, dim] = (data_feature[:, :, dim] - min_) / (max_ - min_)
```
Please have a look at the `normalize_per_sample` and `renormalize_per_sample` functions.
Your training_notebook.ipynb did NOT call `gan.util.normalize_per_sample` at all. It normalizes the samples directly inside training_notebook.ipynb, and it normalizes them in the old way, without 34efa73:
```python
data_feature_min = np.amin(data_feature, axis=1)
data_feature_max = np.amax(data_feature, axis=1)
additional_attribute = []
additional_attribute_outputs = []
dim = 0
for output in data_feature_outputs:
    if output.type_ == OutputType.CONTINUOUS:
        for _ in range(output.dim):
            max_ = data_feature_max[:, dim]
            min_ = data_feature_min[:, dim]
            additional_attribute.append((max_ + min_) / 2.0)
            additional_attribute.append((max_ - min_) / 2.0)
            additional_attribute_outputs.append(Output(
                type_=OutputType.CONTINUOUS,
                dim=1,
                normalization=output.normalization,
                is_gen_flag=False))
            additional_attribute_outputs.append(Output(
                type_=OutputType.CONTINUOUS,
                dim=1,
                normalization=Normalization.ZERO_ONE,
                is_gen_flag=False))
            max_ = np.expand_dims(max_, axis=1)
            min_ = np.expand_dims(min_, axis=1)
            data_feature[:, :, dim] = \
                (data_feature[:, :, dim] - min_) / (max_ - min_)
            if output.normalization == Normalization.MINUSONE_ONE:
                data_feature[:, :, dim] = \
                    data_feature[:, :, dim] * 2.0 - 1.0
            dim += 1
    else:
        dim += output.dim
```
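This old version is where the nan comes from when the sequence length is 1: `np.amin` and `np.amax` over `axis=1` then coincide for every sample, so `max_ - min_` is zero and the division produces nan. The `eps` added in 34efa73 keeps the denominator nonzero. A minimal reproduction (my own sketch, not repo code):

```python
import numpy as np

# one continuous feature, sequence length 1: per-sample min == max
data_feature = np.array([[[1618.0]], [[2689.0]], [[4615.0]]])
mn = np.amin(data_feature, axis=1)   # shape (3, 1); equals the single timestep
mx = np.amax(data_feature, axis=1)   # identical to mn when seq_len == 1

# old normalization: denominator is 0, so every entry becomes nan
with np.errstate(invalid="ignore"):
    old = (data_feature[:, :, 0] - mn) / (mx - mn)

# with the eps from 34efa73 the denominator is 2 * eps, so we get 0.5 instead
eps = 1e-4
new = (data_feature[:, :, 0] - (mn - eps)) / ((mx + eps) - (mn - eps))

print(np.isnan(old).all())   # True
print(np.isnan(new).any())   # False
```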
Moreover, I think you only uploaded the main script (training_notebook.ipynb) without the other Python files it depends on, so I have no idea which versions of those files you are using. (But that doesn't matter, since the script does not call `gan.util.normalize_per_sample` at all.)
I'll close the issue for now. Feel free to reopen it if you still experience issues :)