Comments (9)
Hi mengzhou,
Thank you very much for your quick fix. I will test the new code again tomorrow.
Problems in the code are normal, and fortunately I have now basically run through the whole framework pipeline. The process was a bit difficult, but your work is very interesting and meaningful, so I am happy to take part in getting it working.
Another piece of good news is that I found a solution to the problem that bothered me before, where "train.fit" would hang the multi-GPU processes without reporting any error. I raised an issue in the composer library and found the relevant fix with their help. When a run of the code in this repository is interrupted, shared memory can be left blocked by bugs in the streaming code; the stale (zombie) shared memory has to be cleaned up in time to avoid hangs. For details, please see the following two issues:
mosaicml/llm-foundry#436 (comment)
mosaicml/composer#2733
I wrote a cleanup script; if you need it, I can create a new branch and merge it in.
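For reference, a minimal sketch of what such a cleanup script can look like. It is only a sketch: it assumes the stale segments show up as files under /dev/shm, and the `streaming_` prefix is a placeholder to adjust after checking `ls /dev/shm` following an interrupted run.

```python
# Minimal cleanup sketch: remove stale POSIX shared-memory files left behind by
# an interrupted run so the next run does not hang.
# Assumptions: Linux, segments live under /dev/shm, and the prefix below matches
# what your interrupted run actually leaves there (placeholder value).
import glob
import os

SHM_DIR = "/dev/shm"
PREFIXES = ("streaming_",)  # placeholder prefix -- verify with `ls /dev/shm`

def clean_stale_shm(dry_run: bool = True) -> None:
    """Delete leftover shared-memory files matching the configured prefixes."""
    for prefix in PREFIXES:
        for path in glob.glob(os.path.join(SHM_DIR, prefix + "*")):
            print(("would remove: " if dry_run else "removing: ") + path)
            if not dry_run:
                os.remove(path)

if __name__ == "__main__":
    clean_stale_shm(dry_run=True)  # flip to False once the matches look right
```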
Thanks for catching this! Scripts are updated.
Hi mengzhou, I changed the code and it now prints that the weights are loaded from my path, so that part is OK.
However, a new problem has appeared: the loss stays exactly the same.
It seems the gradients are not being computed and the model is not training normally.
Also, the loss right after loading for the warm start (about 10) is much higher than the loss I had after pruning (about 2), on the same dataset.
Is this a normal phenomenon?
[batch=366/48000]:
Train time/batch: 365
Train time/sample: 93440
Train time/batch_in_epoch: 365
Train time/sample_in_epoch: 93440
Train time/token: 382730240
Train time/token_in_epoch: 382730240
Train metrics/train/cc_weight: 0.2192
Train metrics/train/github_weight: 0.0002
Train metrics/train/book_weight: 0.0791
Train metrics/train/stackexchange_weight: 0.0064
Train metrics/train/wiki_weight: 0.0096
Train metrics/train/arxiv_weight: 0.0010
Train metrics/train/c4-rp_weight: 0.6845
Train memory/current_allocated_mem: 9.7173
Train memory/current_active_mem: 9.7173
Train memory/current_inactive_mem: 0.6447
Train memory/current_reserved_mem: 51.3280
Train memory/peak_allocated_mem: 44.6420
Train memory/peak_active_mem: 44.8020
Train memory/peak_inactive_mem: 17.7940
Train memory/peak_reserved_mem: 51.3280
Train memory/alloc_retries: 0
Train trainer/device_train_microbatch_size: 16
Train loss/train/total: 10.3750
Train loss/train/ce_loss: 10.3750
Train metrics/train/LanguageCrossEntropy: 10.3750
Train metrics/train/Perplexity: 32048.3164
Train metrics/train/cc_LanguageCrossEntropy: 10.3750
Train metrics/train/cc_count: 746
Train metrics/train/github_LanguageCrossEntropy: nan
Train metrics/train/github_count: 0
Train metrics/train/book_LanguageCrossEntropy: 10.3750
Train metrics/train/book_count: 250
Train metrics/train/stackexchange_LanguageCrossEntropy: 10.3750
Train metrics/train/stackexchange_count: 18
Train metrics/train/wiki_LanguageCrossEntropy: 10.3750
Train metrics/train/wiki_count: 33
Train metrics/train/arxiv_LanguageCrossEntropy: nan
Train metrics/train/arxiv_count: 2
Train metrics/train/c4-rp_LanguageCrossEntropy: 10.3750
Train metrics/train/c4-rp_count: 2279
Train throughput/batches_per_sec: 0.0414
Train throughput/samples_per_sec: 10.5889
Train throughput/device/batches_per_sec: 0.0052
Train throughput/device/samples_per_sec: 1.3236
Train throughput/tokens_per_sec: 43372.1727
Train throughput/device/tokens_per_sec: 5421.5216
Train throughput/flops_per_sec: 877674723199506.0000
Train throughput/device/flops_per_sec: 109709340399938.2500
Train throughput/device/mfu: 0.3516
Train time/train: 2.4628
Train time/val: 0.0000
Train time/total: 2.4628
[batch=367/48000]:
Train time/batch: 366
Train time/sample: 93696
Train time/batch_in_epoch: 366
Train time/sample_in_epoch: 93696
Train time/token: 383778816
Train time/token_in_epoch: 383778816
Train metrics/train/cc_weight: 0.2192
Train metrics/train/github_weight: 0.0002
Train metrics/train/book_weight: 0.0791
Train metrics/train/stackexchange_weight: 0.0064
Train metrics/train/wiki_weight: 0.0096
Train metrics/train/arxiv_weight: 0.0010
Train metrics/train/c4-rp_weight: 0.6845
Train memory/current_allocated_mem: 9.7173
Train memory/current_active_mem: 9.7173
Train memory/current_inactive_mem: 0.6447
Train memory/current_reserved_mem: 51.3280
Train memory/peak_allocated_mem: 44.6420
Train memory/peak_active_mem: 44.8020
Train memory/peak_inactive_mem: 17.7940
Train memory/peak_reserved_mem: 51.3280
Train memory/alloc_retries: 0
Train trainer/device_train_microbatch_size: 16
Train loss/train/total: 10.3750
Train loss/train/ce_loss: 10.3750
Train metrics/train/LanguageCrossEntropy: 10.3750
Train metrics/train/Perplexity: 32048.3164
Train metrics/train/cc_LanguageCrossEntropy: 10.3750
Train metrics/train/cc_count: 803
Train metrics/train/github_LanguageCrossEntropy: nan
Train metrics/train/github_count: 0
Train metrics/train/book_LanguageCrossEntropy: 10.3750
Train metrics/train/book_count: 273
Train metrics/train/stackexchange_LanguageCrossEntropy: 10.3750
Train metrics/train/stackexchange_count: 20
Train metrics/train/wiki_LanguageCrossEntropy: 10.3750
Train metrics/train/wiki_count: 35
Train metrics/train/arxiv_LanguageCrossEntropy: 10.3750
Train metrics/train/arxiv_count: 3
Train metrics/train/c4-rp_LanguageCrossEntropy: 10.3750
Train metrics/train/c4-rp_count: 2450
Train throughput/batches_per_sec: 0.0413
Train throughput/samples_per_sec: 10.5839
Train throughput/device/batches_per_sec: 0.0052
Train throughput/device/samples_per_sec: 1.3230
Train throughput/tokens_per_sec: 43351.5384
Train throughput/device/tokens_per_sec: 5418.9423
Train throughput/flops_per_sec: 877257170181446.8750
Train throughput/device/flops_per_sec: 109657146272680.8594
Train throughput/device/mfu: 0.3515
Train time/train: 2.4695
Train time/val: 0.0000
Train time/total: 2.4695
[batch=368/48000]:
Train time/batch: 367
Train time/sample: 93952
Train time/batch_in_epoch: 367
Train time/sample_in_epoch: 93952
Train time/token: 384827392
Train time/token_in_epoch: 384827392
Train metrics/train/cc_weight: 0.2192
Train metrics/train/github_weight: 0.0002
Train metrics/train/book_weight: 0.0791
Train metrics/train/stackexchange_weight: 0.0064
Train metrics/train/wiki_weight: 0.0096
Train metrics/train/arxiv_weight: 0.0010
Train metrics/train/c4-rp_weight: 0.6845
Train memory/current_allocated_mem: 9.7173
Train memory/current_active_mem: 9.7173
Train memory/current_inactive_mem: 0.6447
Train memory/current_reserved_mem: 51.3280
Train memory/peak_allocated_mem: 44.6430
Train memory/peak_active_mem: 44.8020
Train memory/peak_inactive_mem: 17.7940
Train memory/peak_reserved_mem: 51.3280
Train memory/alloc_retries: 0
Train trainer/device_train_microbatch_size: 16
Train loss/train/total: 10.3750
Train loss/train/ce_loss: 10.3750
Train metrics/train/LanguageCrossEntropy: 10.3750
Train metrics/train/Perplexity: 32048.3164
Train metrics/train/cc_LanguageCrossEntropy: 10.3750
Train metrics/train/cc_count: 861
Train metrics/train/github_LanguageCrossEntropy: nan
Train metrics/train/github_count: 0
Train metrics/train/book_LanguageCrossEntropy: 10.3750
Train metrics/train/book_count: 293
Train metrics/train/stackexchange_LanguageCrossEntropy: 10.3750
Train metrics/train/stackexchange_count: 22
Train metrics/train/wiki_LanguageCrossEntropy: 10.3750
Train metrics/train/wiki_count: 40
Train metrics/train/arxiv_LanguageCrossEntropy: 10.3750
Train metrics/train/arxiv_count: 4
Train metrics/train/c4-rp_LanguageCrossEntropy: 10.3750
Train metrics/train/c4-rp_count: 2620
Train throughput/batches_per_sec: 0.0412
Train throughput/samples_per_sec: 10.5394
Train throughput/device/batches_per_sec: 0.0051
Train throughput/device/samples_per_sec: 1.3174
Train throughput/tokens_per_sec: 43169.4561
Train throughput/device/tokens_per_sec: 5396.1820
Train throughput/flops_per_sec: 873572571549509.1250
Train throughput/device/flops_per_sec: 109196571443688.6406
Train throughput/device/mfu: 0.3500
Train time/train: 2.4766
Train time/val: 0.0000
Train time/total: 2.4766
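Two hedged observations on the log above: a loss pinned at 10.3750 with perplexity ≈ 32048 is almost exactly ln(32000), i.e. a uniform prediction over the LLaMA vocabulary, which suggests the pretrained weights were never actually loaded rather than that the optimizer is broken. To rule out the gradient side anyway, a minimal sketch for checking gradient flow (assuming you can reach the underlying torch module of the composer model) is:

```python
# Minimal gradient-flow check (sketch): call right after one backward pass on
# the raw torch module. All-None or all-zero grads mean nothing will update.
import torch

def report_grad_norms(model: torch.nn.Module, top_k: int = 5) -> None:
    """Print the largest per-parameter gradient norms after loss.backward()."""
    norms = sorted(
        ((0.0 if p.grad is None else p.grad.detach().norm().item(), name)
         for name, p in model.named_parameters()
         if p.requires_grad),
        reverse=True,
    )
    for grad_norm, name in norms[:top_k]:
        print(f"{name}: grad_norm={grad_norm:.4e}")
    if norms and norms[0][0] == 0.0:
        print("All gradients are zero or None -- the weights will not update.")
```

If the printed norms are non-zero, the weight-loading path is the more likely culprit.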
I trained on a single GPU and it runs normally, so there still seems to be a compatibility issue between this code and composer in the multi-GPU case; there may be problems with distributed model loading or data sharding. Is it possible to run this code with the DeepSpeed framework?
Hey I think there is a bug here -- not sure what it is. I am working on it now.
Hi, the issue is resolved! It stemmed from the `init_device` setting in the yaml files. Originally it was set to `meta`, but that caused unexpected issues when loading the model, so it has now been switched to `cpu`. Ideally we would love to support `meta` loading since it is faster, but I am not sure yet how to integrate it with the current codebase. Thanks for spotting this!
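For anyone hitting the same symptom, the change is confined to the model yaml. A minimal sketch of the relevant field (the nesting shown here is illustrative and depends on which yaml you use; only the `init_device` value is the point):

```yaml
# Sketch of the relevant yaml change; the exact nesting depends on your config file.
model:
  init_device: cpu   # was: meta -- meta-device init caused issues when loading the model
```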
PS: The codebase has changed a lot since my runs for the paper (mostly to keep it compatible with the up-to-date composer package), so there could be issues here and there, as it is not fully tested. Thanks for your work on this!
Hi! Awesome :) Feel free to start a PR on it!
Hey @Longyichen , I am also facing issues with pruning getting stuck after 'Starting Training'.
Could you guide me with the changes you made to solve this issue?
> Hey @Longyichen , I am also facing issues with pruning getting stuck after 'Starting Training'.
> Could you guide me with the changes you made to solve this issue?
@argitrage see #30
Related Issues (20)
- missmatch shape
- Start training but nothing continue HOT 6
- TypeError: buffer is too small for requested array
- Pruning fine-tuned model HOT 2
- save model meet problem HOT 1
- Instruction tuning dataset HOT 2
- If I can't configure Slurm on a cluster, does that mean I can't use multi-node multi-GPU setups? HOT 5
- Is there a way to run pruning without Slurm?
- Start training but only output config information HOT 3
- The Project is not implemented for 70B llama? HOT 7
- LlamaRMSNorm() layer differs from original llama HOT 1
- composer model trans to pythia problem
- The dtype of tokenized data should be uint32 HOT 1
- Why the rope params are ignored while converting hf checkpoint to composer checkpoint? HOT 3
- about shearing params config HOT 1
- Can LLM-Shearing be used on ViT models? HOT 1
- Support for Llama-3 / GQA? HOT 1
- Open source the pruning mask. HOT 2
- Default Initialization of Lambda Parameters to Zero HOT 2