Comments (1)

Forival commented on June 19, 2024

Hello, I am using the sample test set and trying to get the README running end to end. However, during training the run hangs and then times out:

    [batch=23/3200]:
    Train time/batch: 22
    Train time/sample: 198
    Train time/batch_in_epoch: 6
    Train time/sample_in_epoch: 54
    Train time/token: 811008
    Train time/token_in_epoch: 221184
    Train metrics/train/cc_weight: 0.6700
    Train metrics/train/github_weight: 0.0450
    Train metrics/train/book_weight: 0.0450
    Train metrics/train/stackexchange_weight: 0.0200
    Train metrics/train/wiki_weight: 0.0450
    Train metrics/train/arxiv_weight: 0.0250
    Train metrics/train/c4-rp_weight: 0.1500
    Train memory/current_allocated_mem: 36.8820
    Train memory/current_active_mem: 36.8820
    Train memory/current_inactive_mem: 0.1744
    Train memory/current_reserved_mem: 55.9060
    Train memory/peak_allocated_mem: 42.9380
    Train memory/peak_active_mem: 42.9380
    Train memory/peak_inactive_mem: 7.8742
    Train memory/peak_reserved_mem: 55.9060
    Train memory/alloc_retries: 0
    Train metrics/train/expected_head_sparsity: 0.0039
    Train metrics/train/target_head_sparsity: 0.0129
    Train metrics/train/expected_intermediate_sparsity: 0.0039
    Train metrics/train/target_intermediate_sparsity: 0.0128
    Train metrics/train/expected_layer_sparsity: 0.0039
    Train metrics/train/target_layer_sparsity: 0.0000
    Train metrics/train/expected_hidden_sparsity: 0.0039
    Train metrics/train/target_hidden_sparsity: 0.0129
    Train metrics/train/expected_sparsity: 0.0117
    Train metrics/train/target_sparsity: 0.0209
    Train trainer/device_train_microbatch_size: 3
    Train loss/train/total: 1.4801
    Train loss/train/ce_loss: 1.4716
    Train loss/train/lag_loss: 0.0085
    Train metrics/train/LanguageCrossEntropy: 1.4716
    Train metrics/train/Perplexity: 4.3561
    Train metrics/train/cc_LanguageCrossEntropy: 1.1558
    Train metrics/train/cc_count: 65
    Train metrics/train/github_LanguageCrossEntropy: nan
    Train metrics/train/github_count: 7
    Train metrics/train/book_LanguageCrossEntropy: nan
    Train metrics/train/book_count: 7
    Train metrics/train/stackexchange_LanguageCrossEntropy: 2.1491
    Train metrics/train/stackexchange_count: 3
    Train metrics/train/wiki_LanguageCrossEntropy: 1.5306
    Train metrics/train/wiki_count: 8
    Train metrics/train/arxiv_LanguageCrossEntropy: nan
    Train metrics/train/arxiv_count: 6
    Train metrics/train/c4-rp_LanguageCrossEntropy: 1.6471
    Train metrics/train/c4-rp_count: 111
    Train throughput/batches_per_sec: 0.0914
    Train throughput/samples_per_sec: 0.8223
    Train throughput/device/batches_per_sec: 0.0305
    Train throughput/device/samples_per_sec: 0.2741
    Train throughput/tokens_per_sec: 3368.2385
    Train throughput/device/tokens_per_sec: 1122.7462
    Train throughput/flops_per_sec: 157886485043818.8125
    Train throughput/device/flops_per_sec: 52628828347939.6016
    Train throughput/device/mfu: 0.1687
    Train time/train: 0.0709
    Train time/val: 0.0000
    Train time/total: 0.0709
    Train lr-DecoupledAdamW/group0: 0.0000
    Train lr-DecoupledAdamW/group1: 0.0688
    Train lr-DecoupledAdamW/group2: -0.0688
    [E ProcessGroupNCCL.cpp:828] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3777, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1802129 milliseconds before timing out.
    [E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
    [E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
    terminate called after throwing an instance of 'std::runtime_error'
      what(): [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3777, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1802129 milliseconds before timing out.

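This happens because the sample test set contains very little data. After 23 batches, one of the data sources is used up and training stops on one of the GPUs. The remaining ranks then presumably keep waiting in the all-gather until the NCCL watchdog timeout (1800000 ms, i.e. 30 minutes) fires, which is the error shown in the log. To meet the data requirements, you need to process the original RedPajama data instead of relying on the sample set.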
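A rough way to anticipate this kind of stall is to estimate how many batches each data source can supply before it runs dry. The sketch below is only an illustration, not part of llm-shearing: the function name `batches_until_exhaustion` and the per-source sample counts in `available` are hypothetical, the weights and the global batch size of 9 (198 samples over 22 batches) are read off the log above, and it assumes samples are drawn in proportion to fixed weights, which ignores any weight updates during training.

```python
import math


def batches_until_exhaustion(available, weights, global_batch_size):
    """Estimate which data source runs out first and after how many batches.

    Assumes (hypothetically) that each batch draws weight * global_batch_size
    samples from each source, with weights held fixed.
    """
    estimates = {}
    for domain, weight in weights.items():
        expected_per_batch = weight * global_batch_size
        if expected_per_batch == 0:
            continue  # a source with zero weight is never drawn from
        estimates[domain] = math.floor(available.get(domain, 0) / expected_per_batch)
    first = min(estimates, key=estimates.get)
    return first, estimates[first]


# Weights copied from the log above; the available counts are made up
# purely for illustration and do not describe the real sample set.
weights = {
    "cc": 0.67, "github": 0.045, "book": 0.045, "stackexchange": 0.02,
    "wiki": 0.045, "arxiv": 0.025, "c4-rp": 0.15,
}
available = {name: 50 for name in weights}  # hypothetical counts

domain, n_batches = batches_until_exhaustion(available, weights, global_batch_size=9)
print(f"{domain} is expected to run out first, after about {n_batches} batches")
```

If the estimate is far below the planned number of training batches (3200 in the log), the data for that source needs to be enlarged before starting a multi-GPU run.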

from llm-shearing.
