Hi, I am wondering whether it is possible to resume training using the saved checkpoint.

Resume Training? (pytorch_gbw_lm, closed)

rdspring1 commented on July 3, 2024

from pytorch_gbw_lm.

Comments (2)

rdspring1 commented on July 3, 2024

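First, you need to save the optimizer state along with the model. You also need to compute the step variable, which is the current iteration counted over the whole training run rather than within a single epoch:

    step = train_corpus.batch_num * epoch + current_batch

Then save everything into a single checkpoint file: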

    torch.save({
            'epoch': epoch,
            'step': step,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            }, os.path.join(args.save, "optim_model.pt"))
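
To make the step bookkeeping concrete, here is a small sketch with made-up numbers. The names `batches_per_epoch` and `global_step` are hypothetical stand-ins (the first plays the role of `train_corpus.batch_num`), and it assumes that `epoch` and `current_batch` are zero-based counters with `current_batch` smaller than the number of batches per epoch.

```python
# Made-up values for illustration only.
batches_per_epoch = 1000   # stands in for train_corpus.batch_num
epoch = 2                  # zero-based epoch index
current_batch = 250        # zero-based batch index within the epoch


def global_step(batches_per_epoch, epoch, current_batch):
    """Count the batches that precede the current one across all epochs."""
    return batches_per_epoch * epoch + current_batch


step = global_step(batches_per_epoch, epoch, current_batch)

# Under the assumptions above, divmod inverts the formula, so the position
# inside the run can be recovered from the saved step alone.
restored_epoch, restored_batch = divmod(step, batches_per_epoch)
```

Because the mapping is invertible, saving `step` is enough to tell how far into an epoch the run had progressed, even if the checkpoint is written mid-epoch.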

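When resuming, load the checkpoint and restore the model state, the optimizer state, and the counters.
Note: The checkpoint is loaded onto the CPU (map_location="cpu"), so you need to move the model to the GPU after loading it from disk.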

    if args.resume:
        print("Loading model from checkpoint")
        sys.stdout.flush()
        # Load the same file that the torch.save call wrote (optim_model.pt)
        checkpoint = torch.load(os.path.join(args.save, "optim_model.pt"), map_location="cpu")
        model.load_state_dict(checkpoint['model_state_dict'])
        optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
        # Continue with the epoch after the last saved one
        start_epoch = checkpoint['epoch']+1
        step = checkpoint['step']
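
As a self-contained illustration of the save and resume round trip, here is a minimal sketch using a toy model. The helpers `save_checkpoint` and `load_checkpoint`, the `nn.Linear` model, and the SGD optimizer are assumptions made for the example and are not taken from pytorch_gbw_lm; only the checkpoint dictionary layout follows the snippets in this comment.

```python
import os
import tempfile

import torch
import torch.nn as nn


def save_checkpoint(path, model, optimizer, epoch, step):
    """Write model state, optimizer state, and counters to one file."""
    torch.save({
        'epoch': epoch,
        'step': step,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
    }, path)


def load_checkpoint(path, model, optimizer):
    """Restore both states in place; return (start_epoch, step)."""
    checkpoint = torch.load(path, map_location="cpu")
    model.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    return checkpoint['epoch'] + 1, checkpoint['step']


# Toy model and optimizer (hypothetical stand-ins for the real ones).
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Take one training step so the optimizer holds momentum buffers.
loss = model(torch.randn(8, 4)).sum()
loss.backward()
optimizer.step()

with tempfile.TemporaryDirectory() as save_dir:
    path = os.path.join(save_dir, "optim_model.pt")
    save_checkpoint(path, model, optimizer, epoch=2, step=2250)

    # Fresh objects, as they would exist right after a restart.
    new_model = nn.Linear(4, 2)
    new_optimizer = torch.optim.SGD(new_model.parameters(), lr=0.1, momentum=0.9)
    start_epoch, step = load_checkpoint(path, new_model, new_optimizer)

weights_match = torch.equal(model.weight, new_model.weight)
```

The important detail is that the save and load calls use the same file and the same dictionary keys; the optimizer state is restored as well, so momentum or accumulated gradient statistics are not reset.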

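Also, you need to modify the learning rate scheduler. Change last_iter from -1 to step - 1, using the step restored from the checkpoint, so that the schedule continues from where it stopped instead of restarting from the initial learning rate:

    scheduler = LinearLR(optimizer, base_lr=args.lr*args.scale, max_iters=train_corpus.batch_num*args.epochs, last_iter=checkpoint['step']-1, min_lr=1e-8)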
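
To see why the scheduler needs the restored step, here is a rough sketch of a linear decay schedule. The function `linear_lr` is hypothetical and only assumes a plain linear decay from `base_lr` down to a floor of `min_lr` over `max_iters` iterations; the repository's LinearLR class may differ in its details.

```python
def linear_lr(base_lr, max_iters, it, min_lr=1e-8):
    """Hypothetical linear decay: base_lr at iteration 0, floored at min_lr."""
    remaining = max(max_iters - it, 0) / max_iters
    return max(base_lr * remaining, min_lr)


# Made-up values for illustration only.
base_lr = 0.1
max_iters = 10000
resume_step = 2250

# Restarting the schedule from scratch would jump back to the full rate...
lr_fresh = linear_lr(base_lr, max_iters, 0)

# ...whereas continuing from the saved step picks up the decayed rate.
lr_resumed = linear_lr(base_lr, max_iters, resume_step)
```

With these numbers, a fresh schedule starts again at the full base rate, while the resumed one continues at a rate already decayed by 22.5 percent, which is what the interrupted run would have used next.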

Beyond that, a few minor variable tweaks are needed, but this is the general setup.