
Comments (21)

meiyor commented on September 3, 2024

Yes @fjxmlzn, we have made changes to the repo; I just downloaded it locally and added it to another GitHub repo. I think it is a problem if you have multiple DoppelGANger folders on your local machine, or it may be a problem with the Python version. I can run it on my end because I needed to modify load_data.py: you have not reported the way you created the npz and pkl files using pickle, so I changed to dill, and I also changed some tf calls to work with tf>=2.0, to make it usable for most people. Can you report how you created the files, or maybe let us know how to fix the error?


fjxmlzn commented on September 3, 2024

Which Python version are you using, and what command are you executing?


ArianKhorasani commented on September 3, 2024

Which Python version are you using, and what command are you executing?

Dear @fjxmlzn - My Python version is 3.10.14, and I am running the following commands:

cd example_training
python main.py


fjxmlzn commented on September 3, 2024

Have you made any changes to the code in the repo?


fjxmlzn commented on September 3, 2024

Hi @meiyor @ArianKhorasani, I just got the time to try it with the same Python version, but I did not get this error. This error should not be due to pickle or tf; it is simply an error from Python when it tries to locate and import the libraries.

Upon a closer look at the error message, it seems you have modified gan_task.py at line 12. The error message shows that you changed the line to

from network import DoppelGANgerGenerator, Discriminator

whereas the original code https://github.com/fjxmlzn/DoppelGANger/blob/05f36ec6c3850863751d4f3f88d180e9b12cb3eb/example_training/gan_task.py#L12C13-L12C27 was

from gan.network import DoppelGANgerGenerator, Discriminator, \

Why did you make this change?

Just to make it clear, the original code works like this:

The code first adds the parent folder to the system path (sys.path.append("..")) so that we can import any library relative to the repo root. For example, from gan.network import XXX means we import from the gan/network.py file. The syntax from .op import XXX is a relative import; it means we import from op.py, which sits in the same folder as network.py.
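To illustrate, here is a minimal sketch of that import setup as it would appear at the top of example_training/gan_task.py (the surrounding code is abbreviated; run it from inside example_training so the relative path resolves):

import sys

# Make the repo root importable, so that "gan" becomes a top-level package.
sys.path.append("..")

# Absolute import resolved against the repo root: this loads gan/network.py.
from gan.network import DoppelGANgerGenerator, Discriminator

# Inside gan/network.py itself, a relative import such as "from .op import ..."
# resolves to gan/op.py, i.e., a module in the same folder/package.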


meiyor commented on September 3, 2024

We only made some changes to make the code compatible with tf>=2.0. We have not made any changes to any import or any file from the original code. I think the problem appeared when we created the .npz and .pkl files associated with the output modules. As a preliminary fix, we substituted pickle.dump with dill.dump to save the output class environment and each value in the saved pkls, and we changed pickle.load to dill.load in the load_data.py code. For now the code is working, but if you believe some function from tf>=2.0 will affect the main code's functionality, please let us know. We are now evaluating the generated data; if we obtain results similar to those in your paper, we can share the working tf>=2.0 code. Let us know what you think.
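For anyone hitting the same serialization issue, this is the kind of substitution we mean; a minimal sketch with an illustrative placeholder object (the real call sites in load_data.py and the exact file names may differ):

import dill  # pip install dill; same dump/load interface as pickle, handles more object types

# Placeholder standing in for the repo's output-definition objects.
data_feature_outputs = {"type": "continuous", "dim": 1}

# Saving (previously: pickle.dump(data_feature_outputs, f))
with open("data_feature_output.pkl", "wb") as f:
    dill.dump(data_feature_outputs, f)

# Loading (previously: pickle.load(f))
with open("data_feature_output.pkl", "rb") as f:
    data_feature_outputs = dill.load(f)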


fjxmlzn commented on September 3, 2024

I see! I am sorry that I misunderstood what you said. It's great that it's working!

@yzion previously helped us create a TF2 branch here: https://github.com/fjxmlzn/DoppelGANger/tree/TF2; you can see the changes they made there. But that was a long time ago. If you have your TF2 code ready (which would be very helpful to the community and to us), feel free to make a pull request to the TF2 branch, or host it in your own repo and I can add a link to your code in the readme file.


meiyor commented on September 3, 2024

Hi @fjxmlzn, before sharing the new code with you, we want to know how you measured the autocorrelation and MSE for the generated data. Do you have code you can share with us, so we can check whether we obtain results similar to your evaluations? Or let us know, in pseudocode, how to compute your metrics once we have the generated data. We would be really glad if you could help us with that. Thank you again!


fjxmlzn commented on September 3, 2024

Please see the code for computing autocorrelation here: #20 (comment)

Thanks.


meiyor commented on September 3, 2024

By the way, @fjxmlzn, is the autocorrelation calculated over the entire 50000 samples of the generated data for the Google and Web datasets, for instance? Or is it only calculated over the first samples, representing the first 500 day-lags? How many samples are associated with the number of lags in days?

Which feature index from the generated dataset is used for this autocorrelation? For the Web dataset we have 550 features and for Google we have 2500. Is the feature selected here the one with the minimum MSE compared to the original?

I hope you can clarify this for me ASAP! Thank you!


fjxmlzn commented on September 3, 2024

We only computed autocorrelation for the Wiki Web dataset (Figure 1 of https://arxiv.org/pdf/1909.13403).

We computed the autocorrelation for the entire 50000 samples of the generated data.

In the Wiki Web dataset, there is only one feature, which refers to the daily page view (see Table 7 of https://arxiv.org/pdf/1909.13403; "measurements" means "features" there). The number 550 is the number of days, not the number of features.
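Since the linked snippet may not render for everyone, here is a rough numpy sketch of our reading of the description above (average autocorrelation over all samples of the single Wiki Web feature); an illustration, not necessarily the exact code from #20:

import numpy as np

def autocorr(x, max_lag):
    # Autocorrelation of a 1-D series for lags 0..max_lag-1.
    x = x - x.mean()
    n = len(x)
    denom = (x * x).sum() + 1e-12  # guard against constant series
    return np.array([(x[: n - k] * x[k:]).sum() / denom for k in range(max_lag)])

# data_feature has shape [50000, 550, 1] for the Wiki Web dataset
# (one feature: the daily page view).
data_feature = np.load("data_train.npz")["data_feature"]
series = data_feature[:, :, 0]  # [50000, 550]
avg_acf = np.mean([autocorr(s, 550) for s in series], axis=0)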

Hope that this is helpful!


meiyor commented on September 3, 2024

I'm really confused by the way you explain it on GitHub. You say: data_feature: Training features, in numpy float32 array format. The size is [(number of training samples) x (maximum length) x (total dimension of features)]. The first axis is the number of samples of the time series; what is the second axis? And is the third axis the number of features? It says it is the feature dimensionality, which is a different thing. In our approach we add our own dataset like this: number_of_samples (>1M) x number_of_features (34) x features_dimensions (1). Will this work? If not, please let us know as soon as possible; this would clarify a lot.


fjxmlzn commented on September 3, 2024

I see where the confusion is. It would be helpful to read the data formulation section (Section 3.1) of the paper https://arxiv.org/pdf/1909.13403. I explain it below.

Each dataset contains many samples; each sample is a list of features ordered by time (features are called measurements in the paper).

Let's take the Wiki Web Traffic dataset as an example (please take a look at Section 5.1 and Section A, where we have detailed discussions):

  • Each sample is the daily page views of a website (different samples correspond to different websites).
  • The daily page views of a website contain 550 numbers; each number corresponds to the page view of one day. In this case, we say there is only 1 feature (and total dimension of features=1), which is the daily page view. The length of the time series is 550 (i.e., maximum length=550).

I am confused when you say your dataset is of shape number_of_samples (>1M) x number_of_features (34) x features_dimensions (1). Which samples correspond to one time series? Are the samples (1M of them) independent?


meiyor commented on September 3, 2024

@fjxmlzn - this is what you wrote in Section 3.1: "We abstract the scope of our datasets as follows: A dataset $D = \{O^1, O^2, ..., O^n\}$ is defined as a set of samples $O^i$ (e.g., the clients). Each sample $O^i = (A^i, R^i)$ contains $m$ metadata $A^i = [A^i_1, A^i_2, ..., A^i_m]$. For example, metadata $A^i_1$ could represent client $i$'s physical location, and $A^i_2$ the client's ISP. Note that we can support datasets in which multiple samples have the same set of metadata. The second component of each sample is a time series of records $R^i = [R^i_1, R^i_2, ..., R^i_{T^i}]$, where $R^i_j$ means the $j$-th measurement of the $i$-th client. Different samples may contain a different number of measurements. The number of records for sample $O^i$ is given by $T^i$. Each record $R^i_j = (t^i_j, f^i_j)$ contains a timestamp $t^i_j$, and $K$ measurements $f^i_j = [f^i_{j,1}, f^i_{j,2}, ..., f^i_{j,K}]$. For example, $t^i_j$ represents the time when the measurement $f^i_j$ is taken, and $f^i_{j,1}$, $f^i_{j,2}$ represent the ping loss rate and traffic byte counter at this timestamp, respectively. Note that the timestamps are sorted, i.e. $t^i_j < t^i_{j+1}$ for all $1 \le j < T^i$."

This is extremely confusing, because here you relate measurements to feature dimensions, yet following your last explanation the measurements are the features themselves. You also refer to samples, but the samples are not related to the time-series sequence at all; they are independent variables.

So, following your last explanation: the samples are independent measurements/variables and are not related to the sequence at all, right? The max-length is the maximum sequence length of the time series, i.e., that axis represents the sequence itself, right? If so, please clarify on GitHub that this axis corresponds to the time-series sequence, and that the feature dimension is in fact the number of features. Can you confirm this for us, so we can be sure we are running our experiments correctly?


fjxmlzn commented on September 3, 2024

I am sorry, but I am not sure I fully understand your description; for example, what do "samples are not relating with the time-series sequence at all" and "it represents the sequence itself" mean? I know the terminology can be ambiguous, so let's make it clearer using a concrete example.

For the Wiki web traffic dataset, the shape of data_feature in data_train.npz is 50000x550x1. The [i, j, 0] element of this tensor is the total page view of the i-th website on the j-th day.
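In code, with the npz layout just described (key names as in the released dataset), a tiny sketch of that indexing:

import numpy as np

data_feature = np.load("data_train.npz")["data_feature"]
print(data_feature.shape)  # (50000, 550, 1): websites x days x features

i, j = 123, 42
page_view = data_feature[i, j, 0]  # page view of website i on day j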

What is your data? Once I know that, I can suggest how to format it.


meiyor commented on September 3, 2024

Our data is a clinical dataset; we currently have 40k subjects x 270 hours x 34 features (biological signals). We are trying to map our dataset the way you map the web traffic dataset in your experiments, and we are assuming each subject is an independent variable, as you clarified before. Let us know if this is valid; of course, I know the potential convergence of our data generation will depend on our data distribution anyway. If you can suggest something to make our dataset match your structure more closely, we would be glad. Thank you!


fjxmlzn commented on September 3, 2024

I see!

Re: "We are assuming each subject is an independent variable as you clarified before". You are right!

Re: format. If the features have one value per hour, and the features are numerical (instead of categorical), then making the data in the shape of 40k x 270 x 34 for data_feature should be good!
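For concreteness, a sketch of how such data could be packed, assuming numerical features with one value per hour and all 270 hours present for every subject. The array names follow the repo's data format; the random raw array and the empty-attribute placeholder are hypothetical stand-ins:

import numpy as np

n_subjects, n_hours, n_features = 40000, 270, 34

# Hypothetical raw array of hourly biological signals (replace with real data).
raw = np.random.rand(n_subjects, n_hours, n_features).astype(np.float32)

data_feature = raw                                            # [40k, 270, 34]
data_gen_flag = np.ones((n_subjects, n_hours), np.float32)    # every timestep is valid
data_attribute = np.zeros((n_subjects, 1), np.float32)        # placeholder if no metadata

np.savez("data_train.npz",
         data_feature=data_feature,
         data_attribute=data_attribute,
         data_gen_flag=data_gen_flag)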


meiyor commented on September 3, 2024

OK, perfect! Thank you very much!


meiyor commented on September 3, 2024

By the way, @fjxmlzn, we wanted to ask: do you have the code for plotting the distribution or histogram comparison between the generated and real datasets you evaluated? The Google and Web datasets don't have max/min values reported, but FCC_MBA has them here: https://drive.google.com/drive/folders/19hnyG8lN9_WWIac998rT6RtBB9Zit70X. If you can share the max and min for the Google and Web datasets, we would be glad.


fjxmlzn commented on September 3, 2024

For web: #27
For Google: there should be "data_feature_max" and "data_feature_min" embedded in the npz files. We used those values to linearly normalize the original data.
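A minimal sketch of undoing that linear normalization, assuming the max/min arrays broadcast against data_feature and that features were scaled to [0, 1] (adjust accordingly if [-1, 1] scaling was used):

import numpy as np

data = np.load("data_train.npz")
feat = data["data_feature"]        # normalized features
fmax = data["data_feature_max"]
fmin = data["data_feature_min"]

# Invert x_norm = (x - min) / (max - min) to recover original-scale values.
original = feat * (fmax - fmin) + fmin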


fjxmlzn commented on September 3, 2024

Closing the issue for now. Feel free to reopen it if it is still not resolved.
