
Comments (12)

fjxmlzn commented on July 29, 2024

Would you mind providing more details, e.g., the config.py file you are using?

from doppelganger.

fahadali127 commented on July 29, 2024

| a | b | c | d | e0 | e1 | f |
| --- | --- | --- | --- | --- | --- | --- |
| 1618.0 | 2689.0 | 4615.0 | 0 | 0 | 1 | 1 |

I have data like this. d only contains 0 and f only contains 1.

```python
data_feature_output = [
    output.Output(type_=OutputType.CONTINUOUS, dim=1,
                  normalization=Normalization.ZERO_ONE, is_gen_flag=False),
    output.Output(type_=OutputType.CONTINUOUS, dim=1,
                  normalization=Normalization.ZERO_ONE, is_gen_flag=False),
    output.Output(type_=OutputType.CONTINUOUS, dim=1,
                  normalization=Normalization.ZERO_ONE, is_gen_flag=False),
]

data_attribute_output = [
    output.Output(type_=OutputType.DISCRETE, dim=1,
                  normalization=None, is_gen_flag=False),
    output.Output(type_=OutputType.DISCRETE, dim=2,
                  normalization=None, is_gen_flag=False),
    output.Output(type_=OutputType.DISCRETE, dim=1,
                  normalization=None, is_gen_flag=False),
]
```

I used this format to prepare the data.

The data gen flag was created like this (the sequence length is 1 in my case):

```python
data_gen_flag = np.ones((data_feature.shape[0], SEQ_LEN), dtype="float32")
```

Shapes of all three arrays:

```python
print(data_feature.shape)    # (500, 1, 3)
print(data_attribute.shape)  # (500, 4)
print(data_gen_flag.shape)   # (500, 1)
```
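For reference, arrays with those shapes can be sketched like this (the values here are hypothetical placeholders; the real arrays come from the prepared CSV):

```python
import numpy as np

N, SEQ_LEN = 500, 1  # number of samples and sequence length, from the shapes above

# Hypothetical placeholder values; the real arrays come from the user's data.
data_feature = np.zeros((N, SEQ_LEN, 3), dtype="float32")  # columns a, b, c
data_attribute = np.zeros((N, 4), dtype="float32")         # columns d, e0, e1, f
data_attribute[:, 2] = 1.0                                 # e1 = 1, as in the sample row
data_attribute[:, 3] = 1.0                                 # f only contains 1
data_gen_flag = np.ones((N, SEQ_LEN), dtype="float32")

print(data_feature.shape, data_attribute.shape, data_gen_flag.shape)
# (500, 1, 3) (500, 4) (500, 1)
```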

Now when I run the training notebook, the data looks good before the samples are normalized. After normalization, it looks like this:

Data feature, shape (500, 1, 5):

```python
array([[[ 0., nan, nan,  0.,  1.]],

       [[ 0., nan, nan,  0.,  1.]],

       [[ 0., nan, nan,  0.,  1.]],

       ...,

       [[ 0., nan, nan,  0.,  1.]],

       [[ 0., nan, nan,  0.,  1.]],

       [[ 0., nan, nan,  0.,  1.]]])
```

Data attribute, shape (500, 10), showing only one sample:

```python
array([0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 1.0000000e+00,
       1.6180000e+03, 1.2207031e-04, 2.6890000e+03, 0.0000000e+00,
       4.6150000e+03, 0.0000000e+00], dtype=float32)
```

I think the issue is in the normalize-sample part. Please let me know what you think.


fjxmlzn commented on July 29, 2024

Thanks for the details! I can see two potential issues here.

About the feature part, may I know which "training notebook" you are using? There was a bug in a very early version of the normalization code, which could cause this issue. But it was already fixed in 34efa73.

About the attribute part, the dim parameter in output.Output means the number of possible values for that field, and the corresponding dimensions in the data should be its one-hot encoding. So if a discrete field has only 1 possible value (i.e., "d only contains 0 and f only contains 1", as you said), then the corresponding dimension in the data should be 1 (the one-hot encoding of a one-category variable is just 1). (Modeling dimensions with fixed values with a GAN is not that useful, but I assume you are just doing an initial test of the data and code.)
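As a sketch of what that means in practice (the `one_hot` helper below is hypothetical, not part of the DoppelGANger API): a field with k possible values becomes k one-hot columns and its Output gets dim=k, so a single-valued field contributes one column that is always 1.

```python
import numpy as np

def one_hot(values, categories):
    """One-hot encode a 1-D array of discrete values (hypothetical helper)."""
    values = np.asarray(values)
    out = np.zeros((len(values), len(categories)), dtype="float32")
    for j, cat in enumerate(categories):
        out[values == cat, j] = 1.0
    return out

# d only contains 0, so its encoding is a single column of ones -> dim=1
print(one_hot([0, 0, 0], categories=[0]))      # a (3, 1) column of ones

# e has two possible values, so it becomes two columns e0/e1 -> dim=2
print(one_hot([1, 0, 1], categories=[0, 1]))
```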


fahadali127 commented on July 29, 2024

I am using your training code without a GPU.
I haven't one-hot encoded the d and f columns. I have only one-hot encoded the e column into e0 and e1, and set dim=2 for it in the output.

Yes, I am using this normalization code. I am unable to understand this issue.


fjxmlzn commented on July 29, 2024

About the attribute part, you should use one-hot encoding for all discrete fields. To fix this, it is as simple as changing the d column from zeros to ones.

For the feature part, I just realized another issue: if you set normalization=Normalization.ZERO_ONE, you should make sure the corresponding dimensions have values between zero and one (by normalizing them before passing them to DoppelGANger), but the values appear to be outside that range. That seems to be an issue independent of the nan problem, though. I cannot see why it happens if you are indeed using the latest code. Could you please share the code and data (a sub-sample that reproduces the problem is sufficient) somewhere (e.g., on Google Drive) so that I can take a look?
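A minimal sketch of that pre-normalization step (a plain global min-max scale per feature dimension; the small eps guard against constant columns is my addition, not something the library requires):

```python
import numpy as np

def minmax_scale(data_feature, eps=1e-8):
    """Scale each feature dimension to [0, 1] across all samples and time steps."""
    lo = data_feature.min(axis=(0, 1), keepdims=True)
    hi = data_feature.max(axis=(0, 1), keepdims=True)
    return (data_feature - lo) / (hi - lo + eps)

x = np.array([[[1618.0, 2689.0, 4615.0]],
              [[1000.0, 2000.0, 3000.0]]])  # (2 samples, SEQ_LEN=1, 3 features)
scaled = minmax_scale(x)
print(scaled.min(), scaled.max())  # both within [0, 1]
```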


fahadali127 commented on July 29, 2024

Yes sure, here is the link to a folder containing a sample of the data and the notebooks I am using to prepare the data and for training.
https://drive.google.com/drive/folders/1NGYyAIphe2MmowL5T32sDIM0OKVFeOO0?usp=sharing


fahadali127 commented on July 29, 2024

It is time-series data.


fjxmlzn commented on July 29, 2024

Thanks for sharing! I just checked the training code. The normalization part is NOT from the latest version in this repo. Please incorporate 34efa73 into it.


fahadali127 commented on July 29, 2024

I am using the same code:

```python
def normalize_per_sample(data_feature, data_attribute, data_feature_outputs,
                         data_attribute_outputs, eps=1e-4):
    # assume all samples have maximum length
    data_feature_min = np.amin(data_feature, axis=1)
    data_feature_max = np.amax(data_feature, axis=1)

    additional_attribute = []
    additional_attribute_outputs = []
    dim = 0
    for output in data_feature_outputs:
        if output.type_ == OutputType.CONTINUOUS:
            for _ in range(output.dim):
                max_ = data_feature_max[:, dim] + eps
                min_ = data_feature_min[:, dim] - eps

                additional_attribute.append((max_ + min_) / 2.0)
                additional_attribute.append((max_ - min_) / 2.0)

                additional_attribute_outputs.append(Output(
                    type_=OutputType.CONTINUOUS,
                    dim=1,
                    normalization=output.normalization,
                    is_gen_flag=False))
                additional_attribute_outputs.append(Output(
                    type_=OutputType.CONTINUOUS,
                    dim=1,
                    normalization=Normalization.ZERO_ONE,
                    is_gen_flag=False))

                max_ = np.expand_dims(max_, axis=1)
                min_ = np.expand_dims(min_, axis=1)

                data_feature[:, :, dim] = \
                    (data_feature[:, :, dim] - min_) / (max_ - min_)

                if output.normalization == Normalization.MINUSONE_ONE:
                    data_feature[:, :, dim] = data_feature[:, :, dim] * 2.0 - 1.0

                dim += 1
        else:
            dim += output.dim

    real_attribute_mask = ([True] * len(data_attribute_outputs) +
                           [False] * len(additional_attribute_outputs))

    additional_attribute = np.stack(additional_attribute, axis=1)

    data_attribute = np.concatenate(
        [data_attribute, additional_attribute], axis=1)

    data_attribute_outputs.extend(additional_attribute_outputs)

    return data_feature, data_attribute, data_attribute_outputs, \
        real_attribute_mask
```
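For what it's worth, the eps padding in the version above is what keeps a constant (length-1) sequence from dividing by zero. A quick numeric check of that per-sample math, using a value from the sample row:

```python
import numpy as np

eps = 1e-4
x = np.array([[[1618.0]]])        # one sample, SEQ_LEN=1, one feature
max_ = np.amax(x, axis=1) + eps   # per-sample max, padded by eps
min_ = np.amin(x, axis=1) - eps   # per-sample min, padded by eps
scaled = (x - min_[:, None]) / (max_ - min_)[:, None]
print(scaled)  # ~0.5: a constant sequence maps to the midpoint, not nan
```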

> Thanks for sharing! I just checked the training code. The normalization part is NOT from the latest version in this repo. Please incorporate 34efa73 into it.


fahadali127 commented on July 29, 2024

I think the issue is in this part:

```python
data_feature[:, :, dim] = (data_feature[:, :, dim] - min_) / (max_ - min_)
```

Please have a look at the normalize-sample and renormalize-sample functions.


fjxmlzn commented on July 29, 2024

Your training_notebook.ipynb does NOT call gan.util.normalize_per_sample at all. It normalizes the samples directly inside training_notebook.ipynb, and it does so the old way, without 34efa73:

```python
data_feature_min = np.amin(data_feature, axis=1)
data_feature_max = np.amax(data_feature, axis=1)

additional_attribute = []
additional_attribute_outputs = []

dim = 0
for output in data_feature_outputs:
    if output.type_ == OutputType.CONTINUOUS:
        for _ in range(output.dim):
            max_ = data_feature_max[:, dim]
            min_ = data_feature_min[:, dim]

            additional_attribute.append((max_ + min_) / 2.0)
            additional_attribute.append((max_ - min_) / 2.0)
            additional_attribute_outputs.append(Output(
                type_=OutputType.CONTINUOUS,
                dim=1,
                normalization=output.normalization,
                is_gen_flag=False))
            additional_attribute_outputs.append(Output(
                type_=OutputType.CONTINUOUS,
                dim=1,
                normalization=Normalization.ZERO_ONE,
                is_gen_flag=False))

            max_ = np.expand_dims(max_, axis=1)
            min_ = np.expand_dims(min_, axis=1)

            data_feature[:, :, dim] = \
                (data_feature[:, :, dim] - min_) / (max_ - min_)
            if output.normalization == Normalization.MINUSONE_ONE:
                data_feature[:, :, dim] = \
                    data_feature[:, :, dim] * 2.0 - 1.0

            dim += 1
    else:
        dim += output.dim
```
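Without the eps padding from 34efa73, a length-1 sequence has a per-sample max equal to its min, so the division is 0/0 and yields exactly the nan values shown earlier. A minimal reproduction of that failure mode:

```python
import numpy as np

x = np.array([[[1618.0]]])     # SEQ_LEN = 1, so per-sample max == min
max_ = np.amax(x, axis=1)      # no eps padding in the old code
min_ = np.amin(x, axis=1)
with np.errstate(invalid="ignore"):  # silence the 0/0 RuntimeWarning
    scaled = (x - min_[:, None]) / (max_ - min_)[:, None]
print(scaled)  # [[[nan]]]
```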

Moreover, I think you only uploaded the main script (training_notebook.ipynb), without the other Python files it depends on, so I have no idea what versions of those files you are using. (But that doesn't matter, since the script does not call gan.util.normalize_per_sample at all.)


fjxmlzn commented on July 29, 2024

I'll close the issue for now. Feel free to reopen it if you still experience issues :)

