Comments (9)
Hi mengzhou,
Thank you very much for your quick fix. I will test the new code again tomorrow.
Problems in the code are normal, and fortunately I have now basically run through the whole framework pipeline. The process was a bit difficult, but your work is very interesting and meaningful, so I am happy to take part in getting it working.
Another piece of good news is that I found a solution to the problem that bothered me before, where "train.fit" would hang the multi-GPU processes without reporting any error. I raised an issue in the composer library and found the relevant fix with their help. When a run of the code in this repository is interrupted, shared memory can be left blocked by bugs in the streaming code; the stale (zombie) shared memory has to be cleaned up in time to avoid hangs. For details, please see the following two issues:
mosaicml/llm-foundry#436 (comment)
mosaicml/composer#2733
I wrote a cleanup script; if you need it, I can create a new branch and merge it in.
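For reference, a minimal sketch of what such a cleanup script can look like. It is only a sketch: it assumes the stale segments show up as files under /dev/shm, and the `streaming_` prefix is a placeholder to adjust after checking `ls /dev/shm` following an interrupted run.

```python
# Minimal cleanup sketch: remove stale POSIX shared-memory files left behind by
# an interrupted run so the next run does not hang.
# Assumptions: Linux, segments live under /dev/shm, and the prefix below matches
# what your interrupted run actually leaves there (placeholder value).
import glob
import os

SHM_DIR = "/dev/shm"
PREFIXES = ("streaming_",)  # placeholder prefix -- verify with `ls /dev/shm`

def clean_stale_shm(dry_run: bool = True) -> None:
    """Delete leftover shared-memory files matching the configured prefixes."""
    for prefix in PREFIXES:
        for path in glob.glob(os.path.join(SHM_DIR, prefix + "*")):
            print(("would remove: " if dry_run else "removing: ") + path)
            if not dry_run:
                os.remove(path)

if __name__ == "__main__":
    clean_stale_shm(dry_run=True)  # flip to False once the matches look right
```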
Thanks for catching this! Scripts are updated.
Hi mengzhou, I changed the code and it now prints that the weights are loaded from my path, so that part is OK.
However, a new problem has appeared: the loss stays exactly the same.
It seems the gradients are not being computed and the model is not training normally.
Also, the loss right after loading for the warm start (about 10) is much higher than the loss I had after pruning (about 2), on the same dataset.
Is this a normal phenomenon?
[batch=366/48000]:
Train time/batch: 365
Train time/sample: 93440
Train time/batch_in_epoch: 365
Train time/sample_in_epoch: 93440
Train time/token: 382730240
Train time/token_in_epoch: 382730240
Train metrics/train/cc_weight: 0.2192
Train metrics/train/github_weight: 0.0002
Train metrics/train/book_weight: 0.0791
Train metrics/train/stackexchange_weight: 0.0064
Train metrics/train/wiki_weight: 0.0096
Train metrics/train/arxiv_weight: 0.0010
Train metrics/train/c4-rp_weight: 0.6845
Train memory/current_allocated_mem: 9.7173
Train memory/current_active_mem: 9.7173
Train memory/current_inactive_mem: 0.6447
Train memory/current_reserved_mem: 51.3280
Train memory/peak_allocated_mem: 44.6420
Train memory/peak_active_mem: 44.8020
Train memory/peak_inactive_mem: 17.7940
Train memory/peak_reserved_mem: 51.3280
Train memory/alloc_retries: 0
Train trainer/device_train_microbatch_size: 16
Train loss/train/total: 10.3750
Train loss/train/ce_loss: 10.3750
Train metrics/train/LanguageCrossEntropy: 10.3750
Train metrics/train/Perplexity: 32048.3164
Train metrics/train/cc_LanguageCrossEntropy: 10.3750
Train metrics/train/cc_count: 746
Train metrics/train/github_LanguageCrossEntropy: nan
Train metrics/train/github_count: 0
Train metrics/train/book_LanguageCrossEntropy: 10.3750
Train metrics/train/book_count: 250
Train metrics/train/stackexchange_LanguageCrossEntropy: 10.3750
Train metrics/train/stackexchange_count: 18
Train metrics/train/wiki_LanguageCrossEntropy: 10.3750
Train metrics/train/wiki_count: 33
Train metrics/train/arxiv_LanguageCrossEntropy: nan
Train metrics/train/arxiv_count: 2
Train metrics/train/c4-rp_LanguageCrossEntropy: 10.3750
Train metrics/train/c4-rp_count: 2279
Train throughput/batches_per_sec: 0.0414
Train throughput/samples_per_sec: 10.5889
Train throughput/device/batches_per_sec: 0.0052
Train throughput/device/samples_per_sec: 1.3236
Train throughput/tokens_per_sec: 43372.1727
Train throughput/device/tokens_per_sec: 5421.5216
Train throughput/flops_per_sec: 877674723199506.0000
Train throughput/device/flops_per_sec: 109709340399938.2500
Train throughput/device/mfu: 0.3516
Train time/train: 2.4628
Train time/val: 0.0000
Train time/total: 2.4628
[batch=367/48000]:
Train time/batch: 366
Train time/sample: 93696
Train time/batch_in_epoch: 366
Train time/sample_in_epoch: 93696
Train time/token: 383778816
Train time/token_in_epoch: 383778816
Train metrics/train/cc_weight: 0.2192
Train metrics/train/github_weight: 0.0002
Train metrics/train/book_weight: 0.0791
Train metrics/train/stackexchange_weight: 0.0064
Train metrics/train/wiki_weight: 0.0096
Train metrics/train/arxiv_weight: 0.0010
Train metrics/train/c4-rp_weight: 0.6845
Train memory/current_allocated_mem: 9.7173
Train memory/current_active_mem: 9.7173
Train memory/current_inactive_mem: 0.6447
Train memory/current_reserved_mem: 51.3280
Train memory/peak_allocated_mem: 44.6420
Train memory/peak_active_mem: 44.8020
Train memory/peak_inactive_mem: 17.7940
Train memory/peak_reserved_mem: 51.3280
Train memory/alloc_retries: 0
Train trainer/device_train_microbatch_size: 16
Train loss/train/total: 10.3750
Train loss/train/ce_loss: 10.3750
Train metrics/train/LanguageCrossEntropy: 10.3750
Train metrics/train/Perplexity: 32048.3164
Train metrics/train/cc_LanguageCrossEntropy: 10.3750
Train metrics/train/cc_count: 803
Train metrics/train/github_LanguageCrossEntropy: nan
Train metrics/train/github_count: 0
Train metrics/train/book_LanguageCrossEntropy: 10.3750
Train metrics/train/book_count: 273
Train metrics/train/stackexchange_LanguageCrossEntropy: 10.3750
Train metrics/train/stackexchange_count: 20
Train metrics/train/wiki_LanguageCrossEntropy: 10.3750
Train metrics/train/wiki_count: 35
Train metrics/train/arxiv_LanguageCrossEntropy: 10.3750
Train metrics/train/arxiv_count: 3
Train metrics/train/c4-rp_LanguageCrossEntropy: 10.3750
Train metrics/train/c4-rp_count: 2450
Train throughput/batches_per_sec: 0.0413
Train throughput/samples_per_sec: 10.5839
Train throughput/device/batches_per_sec: 0.0052
Train throughput/device/samples_per_sec: 1.3230
Train throughput/tokens_per_sec: 43351.5384
Train throughput/device/tokens_per_sec: 5418.9423
Train throughput/flops_per_sec: 877257170181446.8750
Train throughput/device/flops_per_sec: 109657146272680.8594
Train throughput/device/mfu: 0.3515
Train time/train: 2.4695
Train time/val: 0.0000
Train time/total: 2.4695
[batch=368/48000]:
Train time/batch: 367
Train time/sample: 93952
Train time/batch_in_epoch: 367
Train time/sample_in_epoch: 93952
Train time/token: 384827392
Train time/token_in_epoch: 384827392
Train metrics/train/cc_weight: 0.2192
Train metrics/train/github_weight: 0.0002
Train metrics/train/book_weight: 0.0791
Train metrics/train/stackexchange_weight: 0.0064
Train metrics/train/wiki_weight: 0.0096
Train metrics/train/arxiv_weight: 0.0010
Train metrics/train/c4-rp_weight: 0.6845
Train memory/current_allocated_mem: 9.7173
Train memory/current_active_mem: 9.7173
Train memory/current_inactive_mem: 0.6447
Train memory/current_reserved_mem: 51.3280
Train memory/peak_allocated_mem: 44.6430
Train memory/peak_active_mem: 44.8020
Train memory/peak_inactive_mem: 17.7940
Train memory/peak_reserved_mem: 51.3280
Train memory/alloc_retries: 0
Train trainer/device_train_microbatch_size: 16
Train loss/train/total: 10.3750
Train loss/train/ce_loss: 10.3750
Train metrics/train/LanguageCrossEntropy: 10.3750
Train metrics/train/Perplexity: 32048.3164
Train metrics/train/cc_LanguageCrossEntropy: 10.3750
Train metrics/train/cc_count: 861
Train metrics/train/github_LanguageCrossEntropy: nan
Train metrics/train/github_count: 0
Train metrics/train/book_LanguageCrossEntropy: 10.3750
Train metrics/train/book_count: 293
Train metrics/train/stackexchange_LanguageCrossEntropy: 10.3750
Train metrics/train/stackexchange_count: 22
Train metrics/train/wiki_LanguageCrossEntropy: 10.3750
Train metrics/train/wiki_count: 40
Train metrics/train/arxiv_LanguageCrossEntropy: 10.3750
Train metrics/train/arxiv_count: 4
Train metrics/train/c4-rp_LanguageCrossEntropy: 10.3750
Train metrics/train/c4-rp_count: 2620
Train throughput/batches_per_sec: 0.0412
Train throughput/samples_per_sec: 10.5394
Train throughput/device/batches_per_sec: 0.0051
Train throughput/device/samples_per_sec: 1.3174
Train throughput/tokens_per_sec: 43169.4561
Train throughput/device/tokens_per_sec: 5396.1820
Train throughput/flops_per_sec: 873572571549509.1250
Train throughput/device/flops_per_sec: 109196571443688.6406
Train throughput/device/mfu: 0.3500
Train time/train: 2.4766
Train time/val: 0.0000
Train time/total: 2.4766
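Two hedged observations on the log above: a loss pinned at 10.3750 with perplexity ≈ 32048 is almost exactly ln(32000), i.e. a uniform prediction over the LLaMA vocabulary, which suggests the pretrained weights were never actually loaded rather than that the optimizer is broken. To rule out the gradient side anyway, a minimal sketch for checking gradient flow (assuming you can reach the underlying torch module of the composer model) is:

```python
# Minimal gradient-flow check (sketch): call right after one backward pass on
# the raw torch module. All-None or all-zero grads mean nothing will update.
import torch

def report_grad_norms(model: torch.nn.Module, top_k: int = 5) -> None:
    """Print the largest per-parameter gradient norms after loss.backward()."""
    norms = sorted(
        ((0.0 if p.grad is None else p.grad.detach().norm().item(), name)
         for name, p in model.named_parameters()
         if p.requires_grad),
        reverse=True,
    )
    for grad_norm, name in norms[:top_k]:
        print(f"{name}: grad_norm={grad_norm:.4e}")
    if norms and norms[0][0] == 0.0:
        print("All gradients are zero or None -- the weights will not update.")
```

If the printed norms are non-zero, the weight-loading path is the more likely culprit.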
I trained on a single GPU and it runs normally, so there still seems to be a compatibility issue between this code and composer in the multi-GPU case; there may be problems with distributed model loading or data sharding. Is it possible to run this code with the DeepSpeed framework?
Hey I think there is a bug here -- not sure what it is. I am working on it now.
Hi, the issue is resolved! It stemmed from the `init_device` setting in the yaml files. Originally it was set to `meta`, but that caused unexpected issues when loading the model, so it has now been switched to `cpu`. Ideally we would love to support `meta` loading since it is faster, but I am not sure yet how to integrate it with the current codebase. Thanks for spotting this!
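For anyone hitting the same symptom, the change is confined to the model yaml. A minimal sketch of the relevant field (the nesting shown here is illustrative and depends on which yaml you use; only the `init_device` value is the point):

```yaml
# Sketch of the relevant yaml change; the exact nesting depends on your config file.
model:
  init_device: cpu   # was: meta -- meta-device init caused issues when loading the model
```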
PS: The codebase has changed a lot since my runs for the paper (mostly to keep it compatible with the up-to-date composer package), so there could be issues here and there, as it is not fully tested. Thanks for your work on this!
Hi! Awesome :) Feel free to start a PR on it!
Hey @Longyichen , I am also facing issues with pruning getting stuck after 'Starting Training'.
Could you guide me with the changes you made to solve this issue?
> Hey @Longyichen , I am also facing issues with pruning getting stuck after 'Starting Training'.
> Could you guide me with the changes you made to solve this issue?
@argitrage see #30
Related Issues (20)
- missmatch shape
- Start training but nothing continue HOT 6
- TypeError: buffer is too small for requested array
- Pruning fine-tuned model HOT 2
- save model meet problem HOT 1
- Instruction tuning dataset HOT 2
- If I can't configure Slurm on a cluster, does that mean I can't use multi-node multi-GPU setups? HOT 5
- Is there a way to run pruning without Slurm?
- Start training but only output config information HOT 3
- The Project is not implemented for 70B llama? HOT 7
- LlamaRMSNorm() layer differs from original llama HOT 1
- composer model trans to pythia problem
- The dtype of tokenized data should be uint32 HOT 1
- Why the rope params are ignored while converting hf checkpoint to composer checkpoint? HOT 3
- about shearing params config HOT 1
- Can LLM-Shearing be used on ViT models? HOT 1
- Support for Llama-3 / GQA? HOT 1
- Open source the pruning mask. HOT 2
- Default Initialization of Lambda Parameters to Zero HOT 2