Comments (8)
Try setting "workers_per_gpu" to 0 for both train and evaluation.
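A minimal sketch of that change, assuming the configuration is available as a plain nested dict shaped like the one in the training log in this thread. The helper name `disable_dataloader_workers` is hypothetical and not part of modelscope. How the patched config is handed to the trainer (for example through a config-modification callback) depends on your setup and is not shown here.

```python
import copy


def disable_dataloader_workers(cfg: dict) -> dict:
    """Return a copy of cfg with workers_per_gpu set to 0 for train and evaluation.

    Hypothetical helper for illustration; it only touches the two dataloader sections.
    """
    new_cfg = copy.deepcopy(cfg)
    for section in ("train", "evaluation"):
        # Create the nested sections if they are missing, then force single-process loading.
        dataloader = new_cfg.setdefault(section, {}).setdefault("dataloader", {})
        dataloader["workers_per_gpu"] = 0
    return new_cfg


# Values copied from the dataloader sections of the logged config.
example_cfg = {
    "train": {"dataloader": {"batch_size_per_gpu": 180, "workers_per_gpu": 16}},
    "evaluation": {"dataloader": {"batch_size_per_gpu": 128, "workers_per_gpu": 16}},
}

patched_cfg = disable_dataloader_workers(example_cfg)
print(patched_cfg["train"]["dataloader"]["workers_per_gpu"])
print(patched_cfg["evaluation"]["dataloader"]["workers_per_gpu"])
```

With zero workers, data loading runs in the main process, which is slower but avoids worker subprocesses entirely.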
from modelscope.
2024-04-12 11:13:07,235 - modelscope - INFO - ==========================Training Config Start==========================
2024-04-12 11:13:07,235 - modelscope - INFO - {
    "framework": "pytorch",
    "task": "multi-modal-embedding",
    "pipeline": {
        "type": "multi-modal-embedding"
    },
    "pretrained_model": {
        "model_name": "damo/multi-modal_clip-vit-base-patch16_zh"
    },
    "dataset": {
        "column_map": {
            "img": "image",
            "text": "query"
        }
    },
    "train": {
        "work_dir": "./workspace/ckpts/clip",
        "max_epochs": 1,
        "use_fp16": true,
        "dataloader": {
            "batch_size_per_gpu": 180,
            "workers_per_gpu": 16,
            "shuffle": true,
            "drop_last": true
        },
        "lr_scheduler": {
            "warmup_proportion": 0.1
        },
        "lr_scheduler_hook": {
            "type": "LrSchedulerHook",
            "by_epoch": false
        },
        "optimizer": {
            "type": "AdamW"
        },
        "optimizer_hparams": {
            "lr": 2.5e-05,
            "weight_decay": 0.001,
            "beta1": 0.9,
            "beta2": 0.999,
            "eps": 1e-08
        },
        "optimizer_hook": {
            "type": "TorchAMPOptimizerHook",
            "cumulative_iters": 1,
            "loss_keys": "loss"
        },
        "loss_cfg": {
            "aggregate": true
        },
        "hooks": [
            {
                "type": "IterTimerHook"
            },
            {
                "type": "ClipClampLogitScaleHook"
            }
        ],
        "logging": {
            "interval": 1,
            "out_dir": "./workspace/ckpts/clip"
        },
        "checkpoint": {
            "best": {
                "metric_key": "inbatch_t2i_recall_at_1",
                "by_epoch": false,
                "interval": 200,
                "save_dir": "./workspace/ckpts/clip"
            },
            "period": {
                "interval": 1,
                "save_dir": "./workspace/ckpts/clip"
            }
        }
    },
    "evaluation": {
        "dataloader": {
            "batch_size_per_gpu": 128,
            "workers_per_gpu": 16,
            "shuffle": false,
            "drop_last": true
        },
        "metrics": [
            {
                "type": "inbatch_recall"
            }
        ],
        "period": {
            "by_epoch": true,
            "interval": 1
        }
    },
    "preprocessor": []
}
2024-04-12 11:13:07,236 - modelscope - INFO - ===========================Training Config End===========================
2024-04-12 11:13:07,239 - modelscope - INFO - Stage: before_run:
(ABOVE_NORMAL) OptimizerHook
(LOW ) LrSchedulerHook
(LOW ) BestCkptSaverHook
(LOW ) CheckpointHook
(VERY_LOW ) TextLoggerHook
Stage: before_train_epoch:
(LOW ) LrSchedulerHook
Stage: before_train_iter:
(ABOVE_NORMAL) OptimizerHook
Stage: after_train_iter:
(ABOVE_NORMAL) OptimizerHook
(NORMAL ) EvaluationHook
(NORMAL ) ClipClampLogitScaleHook
(LOW ) LrSchedulerHook
(LOW ) BestCkptSaverHook
(LOW ) CheckpointHook
(VERY_LOW ) TextLoggerHook
Stage: after_train_epoch:
(NORMAL ) EvaluationHook
(LOW ) LrSchedulerHook
(LOW ) BestCkptSaverHook
(LOW ) CheckpointHook
(VERY_LOW ) TextLoggerHook
Stage: after_val_epoch:
(VERY_LOW ) TextLoggerHook
Stage: after_run:
(LOW ) BestCkptSaverHook
(LOW ) CheckpointHook
2024-04-12 11:13:07,251 - modelscope - INFO - Checkpoints will be saved to ./workspace/ckpts/clip
2024-04-12 11:13:07,259 - modelscope - INFO - Checkpoints will be saved to ./workspace/ckpts/clip
2024-04-12 11:13:07,259 - modelscope - INFO - Text logs will be saved to ./workspace/ckpts/clip
from modelscope.
pip install pystack-debugger
pstack $pid
Please use this command to check which piece of code it is stuck in.
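If attaching an external tool to the process is not possible, one alternative is Python's standard-library `faulthandler` module, which can dump the stack of every thread once a timeout expires. This is a sketch of a different technique from the command above, not something suggested in this thread; the helper names `arm_hang_dump` and `disarm_hang_dump` are hypothetical. Note that it only covers the process in which it is armed, so DataLoader worker subprocesses would need to be armed separately.

```python
import faulthandler
import sys
import tempfile
import time


def arm_hang_dump(timeout_s: float, file=sys.stderr) -> None:
    """Dump all thread tracebacks to `file` every timeout_s seconds until disarmed."""
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=file)


def disarm_hang_dump() -> None:
    """Cancel the pending traceback dump, e.g. once training has progressed."""
    faulthandler.cancel_dump_traceback_later()


# Demo: simulate a short "hang" with sleep and capture the dump in a temporary file.
with tempfile.TemporaryFile(mode="w+") as dump_file:
    arm_hang_dump(0.2, file=dump_file)
    time.sleep(0.6)  # stands in for the call that never returns
    disarm_hang_dump()
    dump_file.seek(0)
    dump_text = dump_file.read()

print(dump_text)
```

In a real training script you would call `arm_hang_dump` with a timeout of a few minutes before starting the trainer and read the traceback from stderr when the run stalls.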
from modelscope.
from modelscope.
It looks like the command did not run successfully. We will try to reproduce this error.
from modelscope.
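@wenmengzhou This probably does not happen on every machine, otherwise it could not have been released.
But once it hangs there is no way to debug it, so I cannot take the next step at all.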
from modelscope.
@slin000111 It just worked, thanks.
But why?
from modelscope.
It is probably caused by some operation that hangs when run in a subprocess.
from modelscope.