
dreamerv2's Introduction

Hey 👋

  • I am interested in deep learning and reinforcement learning algorithms.
  • Love to be engrossed in research.
  • I like coding, cycling, cooking, cognitive science and cricket.
  • How to reach me: [email protected]

dreamerv2's People

Contributors

rajghugare19


dreamerv2's Issues

cannot find module dreamerV2

Hi, I have downloaded the dependencies, but eval.py imports from dreamerv2.utils.wrapper and the import fails with "dreamerv2 module not found". Why do I need to pip install dreamerv2 when this repo is meant to be an implementation of DreamerV2 in PyTorch?
Am I missing something?
Thanks for the great repo though, really handy.
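
For reference, a minimal sketch of one common workaround (illustrative only, not code from the repo), assuming eval.py sits in the repository root next to the dreamerv2 package:

# python
# Make the local dreamerv2 package importable without pip-installing it.
# Assumes this runs from a script located in the repository root.
import os
import sys

repo_root = os.path.dirname(os.path.abspath(__file__))
sys.path.insert(0, repo_root)      # put the repo root on the module search path

import dreamerv2.utils.wrapper     # should now resolve to the local package

Alternatively, if the repo ships a setup.py or pyproject.toml, running pip install -e . from the repository root installs the local package in editable mode.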

Why does the sequence of rewards start at t-1?

Thanks for sharing the code, but I have a question.
In buffer.py, here:

def _shift_sequences(self, obs, actions, rewards, terminals):
    obs = obs[1:]
    actions = actions[:-1]
    rewards = rewards[:-1]
    terminals = terminals[:-1]
    return obs, actions, rewards, terminals

I think you want to align states with rewards, but in trainer.py, here:

obs, actions, rewards, terms = self.buffer.sample()
obs = torch.tensor(obs, dtype=torch.float32).to(self.device)                         # t, t+seq_len
actions = torch.tensor(actions, dtype=torch.float32).to(self.device)                 # t-1, t+seq_len-1
rewards = torch.tensor(rewards, dtype=torch.float32).to(self.device).unsqueeze(-1)   # t-1 to t+seq_len-1
nonterms = torch.tensor(1-terms, dtype=torch.float32).to(self.device).unsqueeze(-1)  # t-1 to t+seq_len-1

Why does the sequence of rewards start at t-1?
When prefilling the buffer, a transition (s_t, a_t, r_t+1, d_t+1) is pushed into the buffer, but r_t+1 corresponds to s_t+1. So after calling _shift_sequences, the states and the rewards should already be aligned, and I think the rewards should start at t rather than t-1.
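
To make the indexing concrete, here is a small sketch (an illustration with hypothetical index arrays standing in for buffer contents, not code from the repo) of what _shift_sequences does to the time indices:

# python
import numpy as np

# Pretend entry i of each array was stored at environment step i during prefill.
idx = np.arange(5)
obs, actions, rewards, terminals = idx, idx, idx, idx

# Mirror of _shift_sequences: drop the first obs and the last action/reward/terminal.
obs_s, actions_s = obs[1:], actions[:-1]
rewards_s, terminals_s = rewards[:-1], terminals[:-1]

print(obs_s)      # [1 2 3 4] -> observations from steps 1..4
print(actions_s)  # [0 1 2 3] -> actions from steps 0..3, one step behind the obs
print(rewards_s)  # [0 1 2 3] -> whatever reward was stored at steps 0..3

Whether the reward stored at step i is r_i or r_i+1 depends on how the transition is pushed during prefill, which is exactly the alignment being questioned above.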

procgen env

Dumb question here, but how does this algorithm perform on the Procgen environments, especially compared to PPG?

Thank you

Why not align with the authors' original implementation when calculating the so-called `pcon`?

In the original implementation, there is

# tensorflow
weight = tf.stop_gradient(tf.math.cumprod(tf.concat([tf.ones_like(disc[:1]), disc[:-1]], 0), 0))

and in your code:

# pytorch
discount_arr = torch.cat([torch.ones_like(discount_arr[:1]), discount_arr[1:]])
discount = torch.cumprod(discount_arr[:-1], 0)

I've tested that they are different when using the pcon predictor. For example:

# tensorflow
x = np.arange(9).reshape(3,3)*0.1
y = tf.convert_to_tensor(x)
z = tf.math.cumprod(tf.concat([tf.ones_like(y[:1]), y[:-1]], 0), 0)
>>> z:
<tf.Tensor: shape=(3, 3), dtype=float64, numpy=
array([[1.  , 1.  , 1.  ],
       [0.  , 0.1 , 0.2 ],
       [0.  , 0.04, 0.1 ]])>
# pytorch
x = np.arange(9).reshape(3,3)*0.1
y = torch.as_tensor(x)
z = torch.cumprod(torch.cat([torch.ones_like(y[:1]), y[1:]]),0)
>>> z:
tensor([[1.0000, 1.0000, 1.0000],
        [0.3000, 0.4000, 0.5000],
        [0.1800, 0.2800, 0.4000]], dtype=torch.float64)

So why is the calculation different?

My guess at a reason: because the pcon predictor is a Bernoulli distribution, the samples are always either 0 or 1, so the two ways of calculating the discount weight end up giving the same result. Is that right?

But if we want the pcon predictor to output a "soft" label, which one is right?
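
For comparison, here is a sketch (my own illustration, not from the repo) of how a direct PyTorch translation of the TensorFlow line behaves next to the repo's version, using hypothetical soft pcon values with time as the leading dimension:

# pytorch
import torch

# Hypothetical soft pcon predictions, shape [T, 1], time first.
disc = torch.tensor([[0.9], [0.8], [0.7], [0.6]])

# Direct translation of the TensorFlow line: shift the discounts back one step,
# prepend a 1, take the cumulative product, and detach (like stop_gradient).
weight_tf_style = torch.cumprod(
    torch.cat([torch.ones_like(disc[:1]), disc[:-1]], 0), 0).detach()

# The repo's version: replace the first discount with 1, keep the rest in place,
# and drop the last entry before the cumulative product.
discount_arr = torch.cat([torch.ones_like(disc[:1]), disc[1:]], 0)
weight_repo = torch.cumprod(discount_arr[:-1], 0)

print(weight_tf_style.squeeze(-1))  # tensor([1.0000, 0.9000, 0.7200, 0.5040])
print(weight_repo.squeeze(-1))      # tensor([1.0000, 0.8000, 0.5600])

Even for these soft values the two weights clearly differ, which is the discrepancy being asked about.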
