
Comments (11)

vitchyr commented on August 24, 2024

Yes, it's not currently implemented, but your suggestions should work. After that, just call train (and you'll probably want to change it so that you can pass in the epoch to start at). I think the annoying part will just be loading an existing progress.csv rather than overriding it (assuming that's what you want to do).
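For anyone trying this, a minimal sketch of what resuming could look like, assuming the snapshot dict from _get_snapshot() was saved to disk; the file name, the snapshot key names, and the start_epoch argument are assumptions based on this thread, not an existing rlkit API:

    # Hedged sketch of resuming from a saved rlkit snapshot; key names may
    # differ by rlkit version, and start_epoch is a hypothetical argument.
    import torch

    snapshot = torch.load('params.pkl')  # dict saved earlier, e.g. via torch.save(_get_snapshot(), ...)
    policy = snapshot['trainer/policy']
    qf1, qf2 = snapshot['trainer/qf1'], snapshot['trainer/qf2']
    target_qf1 = snapshot['trainer/target_qf1']
    target_qf2 = snapshot['trainer/target_qf2']

    # Rebuild the SACTrainer / algorithm with these networks, point the logger
    # at the old log directory (appending to progress.csv rather than
    # overwriting it), then resume, e.g.:
    # algorithm.train(start_epoch=last_epoch + 1)  # start_epoch is hypothetical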


dwiel commented on August 24, 2024

@yusukeurakami did you ever figure this out?


yusukeurakami commented on August 24, 2024

@vitchyr
I have a question related to resuming training. I am trying to load a pre-trained SAC policy and re-train it. I loaded all five networks (qf1, qf2, target_qf1, target_qf2, policy) and started training, but policy_loss and qf_loss exploded (e+8 to e+10).
I thought it was because I forgot to save log_alpha, so I saved/loaded it and restarted training, but it still doesn't work. Do you have any thoughts about this phenomenon?
When I turn off the automatic entropy tuning, I can resume with no problem.


vitchyr commented on August 24, 2024

Hmmm, it's a bit hard to say without knowing more. I imagine if you look at the values of alpha itself, it blows up rather quickly. There can only really be three reasons, since the alpha loss depends on three quantities:

  1. log_pi. Are you loading highly off-policy data? Perhaps the log-likelihood of some old action is extremely low given the loaded policy.
  2. self.target_entropy. Any chance the target entropy is something really wonky? For example, if you're doing discrete actions, then it should be positive.
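For context, the automatic entropy tuning step looks roughly like this (paraphrased from rlkit's SACTrainer, so treat it as a sketch rather than the exact source):

    # Rough sketch of SAC's automatic entropy (alpha) tuning, paraphrased from
    # rlkit's SACTrainer; not copied verbatim from the repository.
    import torch
    from torch import optim

    log_alpha = torch.zeros(1, requires_grad=True)
    alpha_optimizer = optim.Adam([log_alpha], lr=3e-4)
    target_entropy = -6.0  # e.g. -action_dim for a 6-D continuous action space

    def entropy_tuning_step(log_pi):
        # log_pi: log-probabilities of actions sampled from the current policy
        alpha_loss = -(log_alpha * (log_pi + target_entropy).detach()).mean()
        alpha_optimizer.zero_grad()
        alpha_loss.backward()
        alpha_optimizer.step()
        # alpha grows whenever log_pi + target_entropy > 0 on average,
        # i.e. when the policy's entropy is below the target.
        return log_alpha.exp()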


richardrl commented on August 24, 2024

@yusukeurakami If you're using Adam, you'll also want to reload the optimizer parameters or set the learning rate lower.
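A standard PyTorch pattern for checkpointing and restoring that optimizer state alongside the network (the variable names and file name here are just illustrative):

    # Checkpoint the optimizer state together with the network so a resumed run
    # does not start with fresh Adam moment estimates; names are illustrative.
    import torch

    # saving
    torch.save({
        'policy': policy.state_dict(),
        'policy_optimizer': policy_optimizer.state_dict(),
    }, 'checkpoint.pt')

    # resuming
    checkpoint = torch.load('checkpoint.pt')
    policy.load_state_dict(checkpoint['policy'])
    policy_optimizer.load_state_dict(checkpoint['policy_optimizer'])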

Is the Q function output exploding?


yusukeurakami commented on August 24, 2024

@vitchyr @richardrl Thank you for your reply.
I'm now re-running the experiment to get those values.


yusukeurakami commented on August 24, 2024

@vitchyr @richardrl
This is how the reward and alpha actually look when I restart training. As you can see, alpha blows up immediately after I resume training, the reward drops and never recovers, and qf1_loss and qf2_loss go crazy as well.

[plots: reward, alpha, and qf loss curves before and after resuming training]

  1. log_pi: I am doing domain randomization in my original environment, but the environments used when I resumed training come from the same distribution as in the first training, so the data shouldn't be that off-policy. I even saved and loaded the entire replay buffer before resuming. It looks fine for 10 updates or so, but then it goes crazy as well.

  2. self.target_entropy: My environment is continuous and has 6 actuators. self.target_entropy is -6 at all times. Does that sound right? (See the sketch below this list.)

  3. optimizer: Yes, I am using Adam, but I've already tried reloading the optimizer parameters. It mitigated the symptom a little, but ended with the same phenomenon.

  4. Q-function output: As you can see in the graphs above, since qf1_loss is going crazy, I think the Q values are also exploding.
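Regarding point 2, the usual heuristic for continuous actions sets the target entropy to minus the action dimensionality, which matches -6 for six actuators (as far as I know, rlkit computes something close to this by default when no target entropy is given):

    # Common default for SAC's target entropy with continuous actions;
    # derived from the action space shape, analogous to rlkit's default.
    import numpy as np

    action_space_shape = (6,)  # six actuators
    target_entropy = -np.prod(action_space_shape).item()  # -6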

F.Y.I.

  1. I am saving the model with torch.save instead of pickle, as follows. I don't think this changes anything, but I'm reporting it just in case.

    snapshot = self._get_snapshot()
    torch.save(snapshot, save_path)  # save_path: output file path

  2. My other hyperparameters are as follows:
    if args.algo == 'sac':
        algorithm = "SAC"
    variant = dict(
        algorithm=algorithm,
        version="normal",
        layer_size=100,
        replay_buffer_size=int(1E6),
        algorithm_kwargs=dict(
            num_epochs=6000,
            num_eval_steps_per_epoch=512,  # 512
            num_trains_per_train_loop=1000,  # 1000
            num_expl_steps_per_train_loop=512,  # 512
            min_num_steps_before_training=512,  # 1000
            max_path_length=512,  # 512
            batch_size=128,
        ),
        trainer_kwargs=dict(
            discount=0.99,
            soft_target_tau=5e-3,
            target_update_period=1,
            policy_lr=1E-3,
            qf_lr=1E-3,
            reward_scale=0.1,
            use_automatic_entropy_tuning=True,
        ),
    )
  3. I've changed some parameters in batch_rl_algorithm.py:
            num_train_loops_per_epoch=10,
            min_num_steps_before_training=0,

It would be great if you have any advice.


vitchyr commented on August 24, 2024


yusukeurakami commented on August 24, 2024

> It looks like you set use_automatic_entropy_tuning=False in the original settings, but somehow the entropy tuning is set to True when you load it. How are you resuming training? Are you sure you're using the same hyperparameters for the restarted SAC?

Sorry, I pasted the wrong hyperparameters... I've edited them above, but they were as follows. I am sure I did not change my hyperparameters when I resumed training.

        reward_scale=0.1,
        use_automatic_entropy_tuning=True,


vitchyr commented on August 24, 2024


nanbaima commented on August 24, 2024

Does anyone know whether it would also be possible to resume an env and the dataset pre-trained on that env (with the same state size), but instead of using the previous reward, simply change it and continue training from the previous dataset? In other words: resume training, but with new rewards being added.
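One possible approach (not something rlkit provides out of the box, just a sketch): wrap the environment so it emits the new reward, then resume with the previously saved networks and replay buffer. Note that transitions already stored in the old replay buffer would still carry the old rewards unless you recompute them. The wrapper and new_reward_fn below are illustrative:

    # Hedged sketch: swapping in a new reward while reusing a pre-trained setup.
    # The wrapper and new_reward_fn are illustrative, not part of rlkit.
    import gym

    class NewRewardWrapper(gym.Wrapper):
        def __init__(self, env, new_reward_fn):
            super().__init__(env)
            self.new_reward_fn = new_reward_fn

        def step(self, action):
            obs, _, done, info = self.env.step(action)
            reward = self.new_reward_fn(obs, action, info)  # replace the old reward
            return obs, reward, done, info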
