Comments (1)
Hi, I'm using the sample test dataset and trying to run through the README, but I've found that training hangs partway through and then times out:

[batch=23/3200]:
Train time/batch: 22
Train time/sample: 198
Train time/batch_in_epoch: 6
Train time/sample_in_epoch: 54
Train time/token: 811008
Train time/token_in_epoch: 221184
Train metrics/train/cc_weight: 0.6700
Train metrics/train/github_weight: 0.0450
Train metrics/train/book_weight: 0.0450
Train metrics/train/stackexchange_weight: 0.0200
Train metrics/train/wiki_weight: 0.0450
Train metrics/train/arxiv_weight: 0.0250
Train metrics/train/c4-rp_weight: 0.1500
Train memory/current_allocated_mem: 36.8820
Train memory/current_active_mem: 36.8820
Train memory/current_inactive_mem: 0.1744
Train memory/current_reserved_mem: 55.9060
Train memory/peak_allocated_mem: 42.9380
Train memory/peak_active_mem: 42.9380
Train memory/peak_inactive_mem: 7.8742
Train memory/peak_reserved_mem: 55.9060
Train memory/alloc_retries: 0
Train metrics/train/expected_head_sparsity: 0.0039
Train metrics/train/target_head_sparsity: 0.0129
Train metrics/train/expected_intermediate_sparsity: 0.0039
Train metrics/train/target_intermediate_sparsity: 0.0128
Train metrics/train/expected_layer_sparsity: 0.0039
Train metrics/train/target_layer_sparsity: 0.0000
Train metrics/train/expected_hidden_sparsity: 0.0039
Train metrics/train/target_hidden_sparsity: 0.0129
Train metrics/train/expected_sparsity: 0.0117
Train metrics/train/target_sparsity: 0.0209
Train trainer/device_train_microbatch_size: 3
Train loss/train/total: 1.4801
Train loss/train/ce_loss: 1.4716
Train loss/train/lag_loss: 0.0085
Train metrics/train/LanguageCrossEntropy: 1.4716
Train metrics/train/Perplexity: 4.3561
Train metrics/train/cc_LanguageCrossEntropy: 1.1558
Train metrics/train/cc_count: 65
Train metrics/train/github_LanguageCrossEntropy: nan
Train metrics/train/github_count: 7
Train metrics/train/book_LanguageCrossEntropy: nan
Train metrics/train/book_count: 7
Train metrics/train/stackexchange_LanguageCrossEntropy: 2.1491
Train metrics/train/stackexchange_count: 3
Train metrics/train/wiki_LanguageCrossEntropy: 1.5306
Train metrics/train/wiki_count: 8
Train metrics/train/arxiv_LanguageCrossEntropy: nan
Train metrics/train/arxiv_count: 6
Train metrics/train/c4-rp_LanguageCrossEntropy: 1.6471
Train metrics/train/c4-rp_count: 111
Train throughput/batches_per_sec: 0.0914
Train throughput/samples_per_sec: 0.8223
Train throughput/device/batches_per_sec: 0.0305
Train throughput/device/samples_per_sec: 0.2741
Train throughput/tokens_per_sec: 3368.2385
Train throughput/device/tokens_per_sec: 1122.7462
Train throughput/flops_per_sec: 157886485043818.8125
Train throughput/device/flops_per_sec: 52628828347939.6016
Train throughput/device/mfu: 0.1687
Train time/train: 0.0709
Train time/val: 0.0000
Train time/total: 0.0709
Train lr-DecoupledAdamW/group0: 0.0000
Train lr-DecoupledAdamW/group1: 0.0688
Train lr-DecoupledAdamW/group2: -0.0688

[E ProcessGroupNCCL.cpp:828] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3777, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1802129 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what(): [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3777, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1802129 milliseconds before timing out.
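The error itself is the textbook symptom of ranks falling out of step on a collective: once one rank stops issuing all_gather calls, the others block inside the collective until the watchdog fires. A minimal, self-contained sketch of that pattern (an illustration only, using the gloo backend and a short timeout so it fails fast; this is not code from this repo):

```python
# Minimal repro of the hang pattern: one rank stops issuing collectives
# early and the other rank blocks until the timeout. Illustration only.
import datetime
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    # Short timeout for the demo; the real run used Timeout(ms)=1800000.
    dist.init_process_group(
        "gloo", rank=rank, world_size=world_size,
        timeout=datetime.timedelta(seconds=10),
    )
    # Pretend rank 1's data loader runs dry two steps early.
    my_steps = 3 if rank == 1 else 5
    x = torch.ones(4)
    for step in range(my_steps):
        out = [torch.empty_like(x) for _ in range(world_size)]
        dist.all_gather(out, x)  # every rank must reach this call together
        print(f"rank {rank} finished step {step}")
    # Rank 0 blocks in its fourth all_gather and eventually times out,
    # mirroring the NCCL watchdog error in the log above.
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```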
This happens because the sample test dataset contains very little data: after 23 batches one of the domains is exhausted, training stops on one of the GPUs, and the remaining ranks hang waiting on it. You need to process the original RedPajama data to meet the data requirements.
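If you do prepare your own data, a quick sanity check along these lines can confirm each domain holds enough tokenized sequences for the configured run before you launch. Everything here (paths, sequence length, global batch size, shard format) is a hypothetical sketch rather than the repo's actual configuration; the mixture weights are the initial ones reported in the log above:

```python
# Sketch: sanity-check per-domain token counts before launching training.
# DATA_DIR, SEQ_LEN, and GLOBAL_BATCH are assumptions for illustration;
# adjust them to your own prepared-data layout.
from pathlib import Path

import numpy as np

DATA_DIR = Path("data/for_prune")  # hypothetical tokenized-data root
SEQ_LEN = 4096                     # assumed training sequence length
GLOBAL_BATCH = 32                  # hypothetical global batch size
MAX_BATCHES = 3200                 # matches [batch=23/3200] in the log

weights = {
    "cc": 0.67, "github": 0.045, "book": 0.045, "stackexchange": 0.02,
    "wiki": 0.045, "arxiv": 0.025, "c4-rp": 0.15,
}

for domain, w in weights.items():
    # Tokenized shards are assumed to be flat numpy arrays of token ids.
    n_tokens = sum(
        np.load(f, mmap_mode="r").size
        for f in (DATA_DIR / domain).glob("*.npy")
    )
    have = n_tokens // SEQ_LEN
    need = int(w * GLOBAL_BATCH * MAX_BATCHES)
    status = "OK " if have >= need else "LOW"
    print(f"[{status}] {domain}: {have} sequences available, ~{need} needed")
```

Any domain flagged LOW would run dry before the final batch, which is exactly the failure mode above.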
from llm-shearing.
Related Issues (20)
- Could you provide tokenized continue-pretraining dataset for reproduction? HOT 2
- Mismatched shape
- Start training but nothing continues HOT 6
- TypeError: buffer is too small for requested array
- Pruning fine-tuned model HOT 2
- Problem when saving the model HOT 1
- Instruction tuning dataset HOT 2
- If I can't configure Slurm on a cluster, does that mean I can't use multi-node multi-GPU setups? HOT 5
- Is there a way to run pruning without Slurm?
- Start training but only output config information HOT 3
- The Project is not implemented for 70B llama? HOT 7
- LlamaRMSNorm() layer differs from original llama HOT 1
- Problem converting a composer model to pythia
- The dtype of tokenized data should be uint32 HOT 1
- Why the rope params are ignored while converting hf checkpoint to composer checkpoint? HOT 3
- about shearing params config HOT 1
- Can LLM-Shearing be used on ViT models? HOT 1
- Support for Llama-3 / GQA? HOT 1
- Open source the pruning mask. HOT 2