world-models's Issues

issue about gmm_loss

In the file models/mdrnn.py, why subtract max_log_probs first (line 38) and then add max_log_probs.squeeze() back (line 43)?

batch = batch.unsqueeze(-2)
normal_dist = Normal(mus, sigmas)
g_log_probs = normal_dist.log_prob(batch)
g_log_probs = logpi + torch.sum(g_log_probs, dim=-1)
max_log_probs = torch.max(g_log_probs, dim=-1, keepdim=True)[0]
g_log_probs = g_log_probs - max_log_probs
g_probs = torch.exp(g_log_probs)
probs = torch.sum(g_probs, dim=-1)

log_prob = max_log_probs.squeeze() + torch.log(probs)
if reduce:
    return - torch.mean(log_prob)
return - log_prob

Is there a corresponding mathematical formula?
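
For context, my best guess is that this is the standard log-sum-exp trick for numerical stability. Writing g_k = logpi_k + log N(x; mu_k, sigma_k) and m = max_k g_k, the identity

log sum_k exp(g_k) = m + log sum_k exp(g_k - m)

holds, and subtracting m before exponentiating keeps the exp() calls from overflowing or underflowing. I would still appreciate confirmation that this is the intended reading.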

Possible error when predicting next action (class RolloutGenerator)

Hello!

Thanks, first of all, for the library. It has been of great help to me!

Now, I wanted to discuss a portion of the code that I believe to be erroneous. In class RolloutGenerator, function get_action_and_transition(), we have the following code:

def get_action_and_transition(self, obs, hidden):
    """ Get action and transition.
    Encode obs to latent using the VAE, then obtain estimation for next
    latent and next hidden state using the MDRNN and compute the controller
    corresponding action.
    :args obs: current observation (1 x 3 x 64 x 64) torch tensor
    :args hidden: current hidden state (1 x 256) torch tensor
    :returns: (action, next_hidden)
        - action: 1D np array
        - next_hidden (1 x 256) torch tensor
    """
    _, latent_mu, _ = self.vae(obs)
    action = self.controller(latent_mu, hidden[0])
    _, _, _, _, _, next_hidden = self.mdrnn(action, latent_mu, hidden)
    return action.squeeze().cpu().numpy(), next_hidden

I think this function description is quite clear. The problem is, it feeds latent_mu to both the controller and the mdrnn network. I would argue that we should use the real latent vector instead (let's call it z).

First, the current implementation is not what they do in the original World Models paper, as they describe the controller as, and I quote:

C is a simple single layer linear model that maps z_t and h_t directly to action a_t at each time step.

Second, we train the mdrnn network using the latent vector 'z' (see file trainmdrnn.py, function to_latent()). Therefore, why do we use latent_mu now?

This problem affects both the training and testing of the controller. It might be the reason why you report that the memory module is of little to no help in your experiments (https://ctallec.github.io/world-models/). However, I must say I haven't done any proper testing yet.
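
For concreteness, here is roughly the change I have in mind (only a sketch, not a tested patch; I am assuming the third value returned by self.vae(obs) is the log standard deviation, as used in trainmdrnn.py's to_latent()):

_, latent_mu, latent_logsigma = self.vae(obs)
# sample z ~ N(mu, sigma^2) instead of feeding the mean directly
z = latent_mu + latent_logsigma.exp() * torch.randn_like(latent_mu)
action = self.controller(z, hidden[0])
_, _, _, _, _, next_hidden = self.mdrnn(action, z, hidden)
return action.squeeze().cpu().numpy(), next_hidden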

I would like to hear your thoughts on this.

Invalid argument array size

RuntimeError: invalid argument 2: size '[-1 x 3 x 96 x 96]' is invalid for input with 6291456 elements at /pytorch/aten/src/TH/THStorage.c

This happens when trying to run python trainmdrnn.py --logdir exp_dir (after data generation and VAE training).

Not sure what is going on there.

a multi-process problem in the controller

Thanks to the authors for sharing their code.
I executed the controller on Ubuntu 18.04 with one GPU.
Unfortunately, the program couldn't run due to a multi-process problem with CUDA. To work around it, I moved all of the code except the function definitions into a main function and called it.
Now it works properly.
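
For anyone hitting the same thing, the shape of my workaround looks roughly like this (a sketch, not my exact code; my guess is that the if __name__ == '__main__' guard is what matters, so that nothing touching CUDA or multiprocessing runs at import time in the worker processes):

import torch.multiprocessing as mp

def main():
    # everything that used to sit at module level in the controller script:
    # building the models, spawning the workers, running CMA-ES, ...
    ...

if __name__ == '__main__':
    # optional, my own addition: 'spawn' tends to play better with CUDA than 'fork'
    mp.set_start_method('spawn')
    main()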

The definition of the GMM linear layer may be wrong? Or have I missed something?

Hi ctallec,
In the file mdrnn.py:
I noticed that the number of units in the gmm_linear layer seems too small. Why is the output size defined as (2 * latents + 1) * gaussians + 2? Shouldn't it be 3 * latents * gaussians + 2 (I have also seen that definition in other implementations of the MDN-RNN)? In your definition you seem to share the pis across all latent dimensions, which does not seem feasible under my understanding of GMMs. My understanding is that each element of the latent vector has its own GMM; for example, with 3 Gaussian components, each z_i would have 3 mus, 3 sigmas and 3 pis. Or have I misunderstood GMMs?
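
To make the size counting concrete, here is the difference for an example configuration (LSIZE = 32 and 5 Gaussians are just numbers I picked for illustration):

(2 * 32 + 1) * 5 + 2 = 327 outputs (current definition: one shared set of pis)
3 * 32 * 5 + 2 = 482 outputs (per-dimension pis, as I would have expected)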
Best,

problem about training VAE

Hi, I am trying to train the VAE (with the step 2 command). I generated the dataset with the first command (8 threads and 125 rollouts per thread), but after loading the file buffer it gives the following error:

File "trainvae.py", line 127, in
mkdir(vae_dir)
FileNotFoundError: [Errno 2] No such file or directory: 'exp_dir/vae'

Is anyone facing the same issue? It would be great if you could give me some suggestions.

Thanks!

Controller Input

Hi, I wonder whether the input to the controller should be the latent vector (z) from the VAE and the hidden vector from the RNN?

_, _, _, _, _, next_hidden = self.mdrnn(action, latent_mu, hidden)

But the code here shows that one of the inputs is the Gaussian mean instead.

requirements are wrong?

Installing the requirements with pip install -r requirements.txt makes the environment incorrect (apparently a bug in box2d, which is fixed by installing gym[all], which contains a forked version of box2d).

openai/gym#647

Training the controller and getting stuck in local minima

I'm currently trying to train the CMA controller and I keep getting stuck at a reward of around 250-300. After that, the controller just stops improving. I have restarted training multiple times, but I get the same result, and there are no errors during training. The longest single session was 30 hours, but the last (very small) improvement during that session came after around 10 hours. Is my controller getting stuck in a local minimum?
The GPU I have here is just a single GTX 970, and I'm only able to run 6 workers before running out of memory. Is further adjustment needed when running on slower hardware?

Multiprocessing very slow

I am running controllertrain.py on a Google Cloud VM headlessly with Python 3.7 and xvfb. Everything works, but I have noticed what seems to be a linear relationship between the number of workers I allow and the time each worker takes to execute its rollout.

If only one worker is allowed, it can run 200 steps of the environment in 5 seconds. With 10 workers, each worker only gets through 10 steps in that time, which means the 10 workers together are actually about 50% slower at getting through the iterations. (Each worker prints the iteration it is on in its rollout; I added a print statement inside misc.utils.py for this.)

Has anyone else observed a similar effect? What could be wrong with my server? I am not using any GPUs, just the CPU, to run the VAE and MDRNN.
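
One thing I still plan to try, in case the workers are oversubscribing the CPU (this is only a guess on my part), is pinning each worker process to a single intra-op thread:

import torch
torch.set_num_threads(1)  # at the top of each worker process
# equivalently, launch with OMP_NUM_THREADS=1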

Thank you.

untrained RNN

It is interesting to learn that an untrained RNN produces the same result as a trained one; I was having the same doubt about the value of predicting Z(t+1).
My question is: if the RNN is untrained, then h should be essentially random, so how does this h contribute to the training of the controller?

[Question] Some controller training questions

Hi,

I have some doubts regarding the controller training:

  • What is the meaning of the screen outputs?

[screenshot of the controller training console output, 2019-02-07 21:03]

  • How long does it usually take to train (e.g. README parameters, one worker, 1060 Ti GTX)? Does it converge to a specific error value?

  • I am not really sure how much of a difference increasing the population size and n-samples makes in the training.

Thanks!

trainmdrnn running only test: why does the test loss decrease?

I noticed a behavior which is a bit odd. If I comment out the line that runs training in trainmdrnn.py (see here), which means I am only running test, the test loss is still decreasing.
I am confused as to how this can be, since no gradients should be updating anything during test, right?

ETA:
I added this snippet of code in data_pass:

        # sum of parameter norms, as a quick fingerprint of the weights
        wsum = 0
        for w in list(mdrnn.parameters()):
            wsum += torch.norm(w)
        print(wsum.item())

and it looks like the mdrnn weights indeed aren't changing during test (only during train), but I am still not sure how the test loss can be decreasing.

Negative GMM loss. How to interpret?

Hi,

I am currently training the MDN-RNN with VAEs of different latent vector sizes (LSIZE). I have noticed that the smaller the size, the smaller the GMM loss (and total loss). Specifically, with an LSIZE of 4 (and the default RSIZE of 256), the loss goes below zero.

On the other hand, in the code comments of trainmdrnn.py, I saw: "The LSIZE + 2 factor is here to counteract the fact that the GMMLoss scales approximately linearily with LSIZE". So I suppose that the loss going below zero is somewhat expected when using a small LSIZE.
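
One more piece of reasoning behind that supposition (my own, not from the paper): the GMM loss is a negative log-density rather than a bounded error, and the per-dimension Gaussian log-density

log N(x; mu, sigma) = -log(sigma * sqrt(2 * pi)) - (x - mu)^2 / (2 * sigma^2)

is positive whenever sigma < 1 / sqrt(2 * pi) ≈ 0.399, so a negative GMM loss does not by itself indicate an error.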

Nonetheless, I wonder how we could interpret and compare the MDN-RNN losses in order to assess its performance. Also, do you know whether the original World Models implementation also does the "LSIZE + 2" scaling / has the loss-below-zero effect? I read their code but could not figure out if that was the case.

Thanks again!

Having trouble with generation_script.py

[screenshot of the error raised when executing generation_script.py]

I'm having this error while executing generation_script.py.

Could you help me resolve this error?

I ran it with python=3.5 and 3.6, pytorch=0.4.0, cuda=9.0, cudnn=7, and pip install -r requirements.txt.

train_controller always breaks off after training for about 15 minutes

When I run train_controller.py, it is normal at the start, but after running for about 15 minutes the program breaks off silently. I don't know why, because I have already trained the VAE and the MD-RNN. I have tried it both on my local computer and on the server, but the issue always exists.

Inconsistent MDRNN / MDRNNCell behavior

There is different behavior when running MDRNN vs. MDRNNCell. Specifically, I give MDRNN and MDRNNCell the same input (MDRNN takes batched sequences as input, so I take a single sequence from its output and compare it against the same sequence fed to MDRNNCell). I observe that the mus and sigmas match up, but the logpi does not. The issue is related to the dimension of the softmax.

Specifically, in MDRNN the softmax is applied along the last dimension (e.g., for a 32x16x5 input, along the dimension of size 5):

logpi = f.log_softmax(pi, dim=-1)

Whereas in MDRNNCell the softmax is applied along the first dimension (e.g., for a 16x5 input, the softmax is applied along the dimension of size 16):

logpi = f.log_softmax(pi, dim=-2)
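
Here is a minimal reproduction of the mismatch I am describing (the shapes are the ones from my example above, and f is torch.nn.functional):

import torch
import torch.nn.functional as f

pi = torch.randn(32, 16, 5)       # (seq_len, batch, n_gaussians), as in MDRNN
a = f.log_softmax(pi, dim=-1)[0]  # MDRNN: normalizes over the 5 gaussians
b = f.log_softmax(pi[0], dim=-2)  # MDRNNCell: normalizes over the batch of 16
print(torch.allclose(a, b))       # prints False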

I notice that this was changed in this commit:
5d4261e#diff-949b6b6e9db2dd11dbf333ec6fff33ed

[Error] Can not train VAE

Hi, I am trying to train the VAE (with the step 2 command), but when it tries to load the dataset (dataset/carracing) it gives the following error:

Traceback (most recent call last):
File "trainvae.py", line 63, in
dataset_train, batch_size=args.batch_size, shuffle=True, num_workers=2)
File "/home/s1881460/miniconda3/envs/mlp/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 802, in init
sampler = RandomSampler(dataset)
File "/home/s1881460/miniconda3/envs/mlp/lib/python3.7/site-packages/torch/utils/data/sampler.py", line 60, in init
self.num_samples = len(self.data_source)
File "/mnt/mscteach_home/s1881460/world-models/data/loaders.py", line 55, in len
self.load_next_buffer()
File "/mnt/mscteach_home/s1881460/world-models/data/loaders.py", line 34, in load_next_buffer
self._buffer_index = self._buffer_index % len(self._files)
ZeroDivisionError: integer division or modulo by zero

Thanks!

Error training MD-rnn

Hi,

I am facing this invalid input size error when training the mdrnn.

File "trainmdrnn.py", line 205, in
test_loss = test(e)
File "trainmdrnn.py", line 170, in data_pass
latent_obs, latent_next_obs = to_latent(obs, next_obs)
File "trainmdrnn.py", line 108, in to_latent
[(obs_mu, obs_logsigma), (next_obs_mu, next_obs_logsigma)]]
File "trainmdrnn.py", line 107, in
for x_mu, x_logsigma in
RuntimeError: shape '[16, 32, 32]' is invalid for input of size 11264
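
A small observation that might help with debugging (just my own arithmetic, I have not tracked down the actual cause): 11264 = 16 * 22 * 32, so the tensor being reshaped appears to have a dimension of 22 where a 32 is expected, possibly because the last batch or sequence is shorter than the rest.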

Different transform in trainvae.py & trainmdrnn.py

In trainvae.py:
transform_test = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((RED_SIZE, RED_SIZE)),
    transforms.ToTensor(),
])

While in trainmdrnn.py:
transform = transforms.Lambda(
    lambda x: np.transpose(x, (0, 3, 1, 2)) / 255)

I don't quite understand why there is such a difference, or what its consequences are for how the VAE encoder processes the observations.
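
To check what each transform actually produces, I ran a quick test (RED_SIZE = 64 is an assumption on my part, matching the 1 x 3 x 64 x 64 observations mentioned in the docstrings, and the frames here are fake data):

import numpy as np
from torchvision import transforms

RED_SIZE = 64  # assumed value
frame = np.random.randint(0, 255, (96, 96, 3), dtype=np.uint8)

# trainvae.py style: per-frame resize, channels-first float tensor in [0, 1]
transform_test = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((RED_SIZE, RED_SIZE)),
    transforms.ToTensor(),
])
print(transform_test(frame).shape)  # torch.Size([3, 64, 64])

# trainmdrnn.py style: a whole sequence of frames transposed and rescaled, no resize here
transform = transforms.Lambda(lambda x: np.transpose(x, (0, 3, 1, 2)) / 255)
sequence = np.stack([frame, frame])  # (2, 96, 96, 3)
print(transform(sequence).shape)     # (2, 3, 96, 96)

So as far as I can tell, the VAE is trained on resized 64 x 64 tensors, while this transform in trainmdrnn.py only rescales pixel values; whether and where the resizing happens later in that script is exactly the part I am unsure about.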

MDRNN losses extremely low due to numerical instability?

MDRNN training and GMM losses decrease abruptly to very low values, even with gradient clipping.
Was this observed in the originally tested repo, or is this a result of recent PyTorch versions?
The issue persists with a higher-precision PyTorch configuration as well.

Epoch 0: 2912it [00:18, 158.02it/s, loss=-7490883896016783802368.000000 bce=  0.022669 gmm=-7724973763404514721792.000000 mse=  0.000000]                                                

Epoch 0: 100%|██████████████████████████████| 1936/1936 [00:12<00:00, 157.29it/s, loss=-13451842733942434168832.000000 bce=  0.000828 gmm=-13872212352094781308928.000000 mse=  0.000000]

Epoch 1: 2912it [00:18, 157.59it/s, loss=-16901104332652949798912.000000 bce=  0.000793 gmm=-17429263292277607890944.000000 mse=  0.000000]                                              

Epoch 1: 100%|██████████████████████████████| 1936/1936 [00:12<00:00, 156.85it/s, loss=-19335289690015750160384.000000 bce=  0.000749 gmm=-19939516790304420134912.000000 mse=  0.000000]

Epoch 2: 2912it [00:18, 157.39it/s, loss=-20089711310459944042496.000000 bce=  0.000734 gmm=-20717514125435083948032.000000 mse=  0.000000]                                              

Epoch 2: 100%|███████████████████████████████| 1936/1936 [01:09<00:00, 27.85it/s, loss=-20316329081654105604096.000000 bce=  0.000709 gmm=-20951213785059046719488.000000 mse=  0.000000]

one question about the gmm_loss function

Your code is really helpful!
But I am confused about the gmm_loss function. Why is this loss function defined like this? Can anybody help me? I will be very grateful. Thank you.
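
To state my current understanding (this is only my reading of the code, please correct me if it is wrong): gmm_loss seems to be the negative log-likelihood of the target under a diagonal Gaussian mixture,

loss = -log sum_k pi_k * prod_i N(x_i; mu_{k,i}, sigma_{k,i}),

computed in log space with the usual max-subtraction (log-sum-exp) trick for numerical stability.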

Splitting of Train and Validation / Test set

[screenshot of the train/test split code]

Doesn't this splitting mean you're actually taking 400 for training and 600 for testing? I think files[:-600] takes the first 400 out of 1000, but I could be misinterpreting what your code is doing here.
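
Here is the quick check I did of the slicing semantics (with a dummy list standing in for the file list):

files = list(range(1000))
print(len(files[:-600]))  # 400 -> what I believe ends up as the train split
print(len(files[-600:]))  # 600 -> what I believe ends up as the test split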

Also, I thought the split should reduce the dataset size for the test_loader, but both the train and test loaders report the same dataset length and take the same amount of time to complete an epoch. Any idea why this is?

No init_hidden for mdrnn?

I do not see any hidden state initialization in trainmdrnn.py or mdrnn.py. Can you explain whether this is correct? I think you would need to initialize the hidden state of the RNN, since your training data may come from many different episodes. Thanks in advance if you can clarify.

Does MDRNN loss include reward?

Beautiful code guys!

Just a quick question: in the original blog/paper I didn't see the reward being used during the memory network training, as I see in your code here. Did you modify this on purpose? I am just wondering.

Reparameterization trick coefficient

I'm reimplementing world-models as an exercise in learning PyTorch, and I noticed that the reparameterization trick in the VAE is implemented slightly differently here and in the original TensorFlow code. Specifically, in the original they reparameterize with z = epsilon * exp(log_sigma / 2) + mu, while in this code we do z = epsilon * exp(log_sigma) + mu. I'm not very familiar with VAEs, so I wanted to make sure that this works out to the same result, maybe because the encoder weights double or something. Does this always come out the same? If we regularize the encoder weights, does it still turn out the same? Is one or the other more correct?
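
For what it is worth, my current reading (an assumption on my part, not something confirmed in either repo) is that the two forms differ only in what the encoder output is taken to represent:

z = mu + epsilon * exp(log_var / 2)    (original: output read as the log-variance)
z = mu + epsilon * exp(log_sigma)      (here: output read as the log-standard-deviation)

Both reduce to z = mu + epsilon * sigma, so the sampled z follows the same distribution in either case, provided the KL term uses the matching interpretation of the encoder output.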

traincontroller issue - Stuck waiting for r_queue

Hi, the train controller seems to be waiting forever for the results in the result queue.

Anything I should look for in particular?

The command I ran:
python traincontroller.py --logdir exp_dir --n-samples 4 --pop-size 4 --target-return 950 --display --max-workers 1 (I tried 4 workers with the same result; with 32 workers my RAM explodes)

Data generation script: No module named 'utils'

During the data generation phase, using the run command given in the README, I'm getting an error when importing the utils package. This is because utils lives one level above generation_script.py.

$ python data/generation_script.py --rollouts 1000 --rootdir datasets/carracing --threads 1
xvfb-run -s "-screen 0 1400x900x24" --server-num=1 python data/carracing.py --dir datasets/carracing/thread_0 --rollouts 1001 --policy brown
Traceback (most recent call last):
  File "data/carracing.py", line 10, in <module>
    from utils.misc import sample_continuous_policy
ModuleNotFoundError: No module named 'utils'

I can get it to run with:

PYTHONPATH='.' python data/generation_script.py --rollouts 1000 --rootdir datasets/carracing --threads 1

Is everyone modifying their paths to get it to run?

Worker dying issue with controller training

Hello, has anyone run into an issue where one or more of the workers created in python traincontroller.py dies without explanation, causing the entire script to hang because it is waiting for the dead workers to finish their evaluations?

I've checked the logs created in tmp for each of the worker processes and unfortunately the .err logs seem to be uninformative or empty.

I think it might be a GPU memory issue, or some issue related to modifications I made to the CarRacing environment, but the lack of any error logging is concerning. It also does not seem to happen consistently, which is also strange.

Thanks!
