Comments (2)
I suspect the file was corrupted because multiple processes tried to write to it at the same time. I suggest adding the guard if comm.is_main_process():
right above this line: https://github.com/kakaobrain/scrl/blob/master/trainer/trainer.py#L140
This might be a bug we overlooked. (We always used the last checkpoint instead of the best one, since model selection on the test set was kind of cheating; note that we didn't use a held-out validation set.) We would be grateful if you could let us know whether the suggested fix solves your problem (or submit a PR).
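A minimal sketch of the guard suggested above, written so it runs without a process group. The helper name save_on_main_process and its rank and barrier parameters are hypothetical and not part of the scrl code base; in real training they would presumably be torch.distributed.get_rank() and torch.distributed.barrier, and pickle.dump stands in for torch.save.

```python
import os
import pickle
import tempfile


def save_on_main_process(state, path, rank, barrier=None):
    """Write `state` to `path` from rank 0 only, then synchronize all ranks.

    `rank` and `barrier` are injected (hypothetical parameters) so this sketch
    runs without torch.distributed being initialized.
    """
    wrote = False
    if rank == 0:
        with open(path, "wb") as f:
            pickle.dump(state, f)  # stand-in for torch.save(state, path)
        wrote = True
    # Every rank must reach the barrier, not only rank 0; otherwise the
    # ranks that skip it leave rank 0 waiting forever.
    if barrier is not None:
        barrier()
    return wrote


# Simulate two ranks writing to the same path: only rank 0 touches the file.
demo_path = os.path.join(tempfile.mkdtemp(), "checkpoint_best.pth")
for demo_rank in (0, 1):
    save_on_main_process({"epoch": 10}, demo_path, rank=demo_rank)
```

The point of the design is that the file has exactly one writer, while the barrier keeps the other ranks from racing ahead and reading the checkpoint before it is complete.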
from scrl.
Thank you for your reply and suggestion.
I think so too. The checkpoint file was corrupted because, in the multi-GPU setup, every process (all ranks) tried to write to it at the same time.
I solved the error by modifying the function _save_checkpoint(self, tag) as below.
Line 175 in f5bc426
    def _save_checkpoint(self, tag):
        save_path = f"{self.cfg.save_dir}/checkpoint_" + str(tag) + ".pth"
        state_dict = {
            'tag': str(tag),
            'epoch': self.cur_epoch,
            'max_eval_score': self.max_eval_score,
            'max_eval_epoch': self.max_eval_epoch,
        }
        for key, target in self._saving_targets.items():
            if self.cfg.fake_checkpoint:
                target = "fake_state_dict"
            else:
                target = utils.unwrap_if_distributed(target)
                target = target.state_dict()
            state_dict[f"{key}_state_dict"] = target
        ## Added ##
        if tag == 'best':
            # Only rank 0 writes the 'best' checkpoint.
            if torch.distributed.get_rank() == 0:
                torch.save(state_dict, save_path)
            # All ranks must call barrier(), so it sits outside the rank check.
            torch.distributed.barrier()
        else:
            torch.save(state_dict, save_path)
        suffix = (C.debug(" (fake_checkpoint)")
                  if self.cfg.fake_checkpoint else "")
        ###########
        return save_path + suffix
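As a complementary safeguard (not part of the fix above, and not something the scrl code does), a checkpoint can be written to a temporary file and then renamed into place, so a reader never sees a half-written file at the final path. This is only a sketch: atomic_save is a hypothetical helper, and pickle.dump stands in for torch.save, which also accepts an open file object.

```python
import os
import pickle
import tempfile


def atomic_save(state, path):
    """Serialize `state` to a temp file next to `path`, then rename it over `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the target directory so the final rename stays
    # on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)  # stand-in for torch.save(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # atomic rename on POSIX
    except BaseException:
        # Clean up the partial temp file and re-raise the original error.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

This does not replace the rank check: if several ranks call it, the last rename wins, but each version that lands at the final path is a complete file.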
"We always used the last checkpoint instead of the best one since model selection on the test set -note that we didn't use held-out validation set- was kind of cheating."
--> I think that is the correct way to report the model's performance.
from scrl.