Comments (8)

slin000111 commented on July 19, 2024

Try setting "workers_per_gpu" to 0 for both train and evaluation.
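
A minimal sketch of that change, assuming the training configuration is available as a plain nested dict with the same keys as the config printed later in this thread (`train.dataloader.workers_per_gpu` and `evaluation.dataloader.workers_per_gpu`). The helper name `disable_dataloader_workers` is hypothetical and not part of ModelScope; how the patched config is passed back to the trainer depends on your setup.

```python
import copy


def disable_dataloader_workers(cfg: dict) -> dict:
    """Return a copy of cfg with workers_per_gpu set to 0 for train and evaluation."""
    new_cfg = copy.deepcopy(cfg)  # leave the caller's config untouched
    for section in ("train", "evaluation"):
        # setdefault creates any missing level so the override is always applied
        dataloader = new_cfg.setdefault(section, {}).setdefault("dataloader", {})
        dataloader["workers_per_gpu"] = 0
    return new_cfg


# Values copied from the config printed in this thread
cfg = {
    "train": {"dataloader": {"batch_size_per_gpu": 180, "workers_per_gpu": 16}},
    "evaluation": {"dataloader": {"batch_size_per_gpu": 128, "workers_per_gpu": 16}},
}
patched = disable_dataloader_workers(cfg)
print(patched["train"]["dataloader"]["workers_per_gpu"])
print(patched["evaluation"]["dataloader"]["workers_per_gpu"])
```

With 0 workers, data loading typically runs in the main process instead of in worker subprocesses, which is slower but easier to debug.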

from modelscope.

jeffzhengye commented on July 19, 2024

2024-04-12 11:13:07,235 - modelscope - INFO - ==========================Training Config Start==========================
2024-04-12 11:13:07,235 - modelscope - INFO - {
    "framework": "pytorch",
    "task": "multi-modal-embedding",
    "pipeline": {
        "type": "multi-modal-embedding"
    },
    "pretrained_model": {
        "model_name": "damo/multi-modal_clip-vit-base-patch16_zh"
    },
    "dataset": {
        "column_map": {
            "img": "image",
            "text": "query"
        }
    },
    "train": {
        "work_dir": "./workspace/ckpts/clip",
        "max_epochs": 1,
        "use_fp16": true,
        "dataloader": {
            "batch_size_per_gpu": 180,
            "workers_per_gpu": 16,
            "shuffle": true,
            "drop_last": true
        },
        "lr_scheduler": {
            "warmup_proportion": 0.1
        },
        "lr_scheduler_hook": {
            "type": "LrSchedulerHook",
            "by_epoch": false
        },
        "optimizer": {
            "type": "AdamW"
        },
        "optimizer_hparams": {
            "lr": 2.5e-05,
            "weight_decay": 0.001,
            "beta1": 0.9,
            "beta2": 0.999,
            "eps": 1e-08
        },
        "optimizer_hook": {
            "type": "TorchAMPOptimizerHook",
            "cumulative_iters": 1,
            "loss_keys": "loss"
        },
        "loss_cfg": {
            "aggregate": true
        },
        "hooks": [
            {
                "type": "IterTimerHook"
            },
            {
                "type": "ClipClampLogitScaleHook"
            }
        ],
        "logging": {
            "interval": 1,
            "out_dir": "./workspace/ckpts/clip"
        },
        "checkpoint": {
            "best": {
                "metric_key": "inbatch_t2i_recall_at_1",
                "by_epoch": false,
                "interval": 200,
                "save_dir": "./workspace/ckpts/clip"
            },
            "period": {
                "interval": 1,
                "save_dir": "./workspace/ckpts/clip"
            }
        }
    },
    "evaluation": {
        "dataloader": {
            "batch_size_per_gpu": 128,
            "workers_per_gpu": 16,
            "shuffle": false,
            "drop_last": true
        },
        "metrics": [
            {
                "type": "inbatch_recall"
            }
        ],
        "period": {
            "by_epoch": true,
            "interval": 1
        }
    },
    "preprocessor": []
}
2024-04-12 11:13:07,236 - modelscope - INFO - ===========================Training Config End===========================
2024-04-12 11:13:07,239 - modelscope - INFO - Stage: before_run:
(ABOVE_NORMAL) OptimizerHook
(LOW ) LrSchedulerHook
(LOW ) BestCkptSaverHook
(LOW ) CheckpointHook
(VERY_LOW ) TextLoggerHook

Stage: before_train_epoch:
(LOW ) LrSchedulerHook

Stage: before_train_iter:
(ABOVE_NORMAL) OptimizerHook

Stage: after_train_iter:
(ABOVE_NORMAL) OptimizerHook
(NORMAL ) EvaluationHook
(NORMAL ) ClipClampLogitScaleHook
(LOW ) LrSchedulerHook
(LOW ) BestCkptSaverHook
(LOW ) CheckpointHook
(VERY_LOW ) TextLoggerHook

Stage: after_train_epoch:
(NORMAL ) EvaluationHook
(LOW ) LrSchedulerHook
(LOW ) BestCkptSaverHook
(LOW ) CheckpointHook
(VERY_LOW ) TextLoggerHook

Stage: after_val_epoch:
(VERY_LOW ) TextLoggerHook

Stage: after_run:
(LOW ) BestCkptSaverHook
(LOW ) CheckpointHook

2024-04-12 11:13:07,251 - modelscope - INFO - Checkpoints will be saved to ./workspace/ckpts/clip
2024-04-12 11:13:07,259 - modelscope - INFO - Checkpoints will be saved to ./workspace/ckpts/clip
2024-04-12 11:13:07,259 - modelscope - INFO - Text logs will be saved to ./workspace/ckpts/clip

wenmengzhou commented on July 19, 2024

pip install pystack-debugger
pstack $pid

Please use this command to check which part of the code it is stuck in.
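
If an external stack-dump tool is not available, Python's standard `faulthandler` module can also show where a process is stuck. This is a minimal sketch, assuming you can add a few lines to your own training script; the 0.2-second timeout and the `time.sleep` stand-in for the hanging call are illustrative only, and `fake_hanging_step` is a hypothetical name.

```python
import faulthandler
import tempfile
import time

# Write the dump to a real file so it can be inspected afterwards
# (faulthandler needs an object with a file descriptor).
dump_file = tempfile.TemporaryFile(mode="w+")

# If the program is still running after 0.2 s, dump the traceback of all threads.
# A real training script would use a much larger timeout, e.g. several minutes.
faulthandler.dump_traceback_later(0.2, file=dump_file)


def fake_hanging_step():
    # Hypothetical stand-in for the call that hangs (e.g. the training loop).
    time.sleep(0.5)


fake_hanging_step()
faulthandler.cancel_dump_traceback_later()  # no further dumps once the step returns

dump_file.seek(0)
report = dump_file.read()
print(report)
```

The dump lists each thread's stack, so the innermost frames show the call that was blocking. On Unix, calling `faulthandler.register(signal.SIGUSR1)` at startup instead lets `kill -USR1 <pid>` trigger the same dump on demand.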

jeffzhengye commented on July 19, 2024

[image attachment]

@wenmengzhou

wenmengzhou commented on July 19, 2024

It looks like the command did not run successfully. We will try to reproduce this error on our side.

jeffzhengye commented on July 19, 2024

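@wenmengzhou This probably does not happen on every machine, otherwise it could not have been released.
But since it hangs and cannot be debugged, I have no way to take the next step.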

jeffzhengye commented on July 19, 2024

@slin000111 It just worked, thanks.
But why does that fix it?

wenmengzhou commented on July 19, 2024

It is probably caused by some operation that hangs when it runs in a subprocess.
