Comments (18)

mzbac commented on July 22, 2024

I managed to reproduce it using the script: https://github.com/mzbac/mlx-lora/blob/main/lora.py.
Here is how to reproduce:

  1. Modify the model path to use mzbac/Kunpeng-4x7B-mistral-hf-4bit-mlx in https://github.com/mzbac/mlx-lora/blob/main/lora.py#L68
  2. Modify mlx-lm's linear quantization predicate to m.weight.shape[0] != 4 so the MoE gate is not quantized, at https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/utils.py#L34 (a sketch of the change follows these steps)
  3. Download the code feedback dataset by running: python download.py m-a-p/Code-Feedback
  4. Run lora.py
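
For step 2, here is a minimal sketch of the tweak, assuming the predicate in mlx_lm/utils.py looks roughly like the upstream one at the time (it skips quantizing any Linear layer whose output dimension matches the expert count):

import mlx.nn as nn

# Hedged sketch of mlx_lm/utils.py around the linked line; the exact upstream
# code may differ between versions. Upstream skips quantizing Linear layers
# whose output dim is 8 (the 8-expert Mixtral gate). Kunpeng-4x7B has 4
# experts, so its gate has weight.shape[0] == 4 and the check becomes:
linear_class_predicate = (
    lambda m: isinstance(m, nn.Linear) and m.weight.shape[0] != 4
)

Running lora.py with those changes produces: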
Total parameters 4862.562M
Trainable parameters 977.289M
Starting training..., iters: 10000
Iter 1: Val loss 1.086, Val took 60.461s
...
Iter 180: Train loss 1.066, Learning Rate 9.992e-06, It/sec 0.141, Tokens/sec 108.412, Trained Tokens 127740
Iter 190: Train loss nan, Learning Rate 9.991e-06, It/sec 0.149, Tokens/sec 109.189, Trained Tokens 135075
Iter 200: Train loss nan, Learning Rate 9.990e-06, It/sec 0.161, Tokens/sec 115.795, Trained Tokens 142272
Iter 200: Val loss nan, Val took 55.185s
Iter 200: Saved adapter weights to checkpoints/200_adapters.npz.
Iter 210: Train loss nan, Learning Rate 9.989e-06, It/sec 0.158, Tokens/sec 103.672, Trained Tokens 148849
Iter 220: Train loss nan, Learning Rate 9.988e-06, It/sec 0.153, Tokens/sec 114.852, Trained Tokens 156380
Iter 230: Train loss nan, Learning Rate 9.987e-06, It/sec 0.150, Tokens/sec 115.601, Trained Tokens 164063

After changing the mlx version to 0.3.0 and the mlx-lm version to 0.0.10, the training loss seems to go back to normal.
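
For reference, that pin amounts to something like the following (assuming a pip-managed environment):

pip install mlx==0.3.0 mlx-lm==0.0.10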


awni commented on July 22, 2024

There was indeed a bug introduced between 0.3 and 0.4 which seems to have broken MoE training. Sorry about that! We'll try to do a better job of testing for these types of cases.

Fix is here ml-explore/mlx#821 and will be in the next release.


mzbac commented on July 22, 2024

I am not sure if it was caused by the custom MoE model or something to do with MLX. I will try fine-tuning the standard Mixtral model to see if I can reproduce it.


l0d0v1c commented on July 22, 2024

As mentioned by @mzbac, downgrading mlx to 0.3.0 fixes the issue.


awni commented on July 22, 2024

Seems to be working indeed:

Starting training..., iters: 3000
Iter 1: Val loss 2.480, Val took 0.732s
Iter 10: Train loss 2.363, Learning Rate 1.000e-05, It/sec 1.022, Tokens/sec 88.083, Trained Tokens 862
Iter 20: Train loss 1.508, Learning Rate 1.000e-05, It/sec 0.826, Tokens/sec 81.866, Trained Tokens 1853
Iter 30: Train loss 1.252, Learning Rate 1.000e-05, It/sec 0.797, Tokens/sec 84.166, Trained Tokens 2909
Iter 40: Train loss 1.372, Learning Rate 1.000e-05, It/sec 0.811, Tokens/sec 84.242, Trained Tokens 3948
Iter 50: Train loss 1.100, Learning Rate 1.000e-05, It/sec 0.803, Tokens/sec 79.219, Trained Tokens 4935
Iter 60: Train loss 1.113, Learning Rate 1.000e-05, It/sec 0.812, Tokens/sec 74.172, Trained Tokens 5848
Iter 70: Train loss 1.214, Learning Rate 1.000e-05, It/sec 0.824, Tokens/sec 87.602, Trained Tokens 6911
Iter 80: Train loss 0.996, Learning Rate 1.000e-05, It/sec 0.800, Tokens/sec 81.443, Trained Tokens 7929
Iter 90: Train loss 1.062, Learning Rate 1.000e-05, It/sec 0.776, Tokens/sec 85.283, Trained Tokens 9028
Iter 100: Train loss 1.111, Learning Rate 1.000e-05, It/sec 0.841, Tokens/sec 80.020, Trained Tokens 9979
Iter 100: Saved adapter weights to checkpoints/100_adapters.npz.
Iter 110: Train loss 1.188, Learning Rate 1.000e-05, It/sec 0.780, Tokens/sec 84.555, Trained Tokens 11063
Iter 120: Train loss 0.905, Learning Rate 1.000e-05, It/sec 0.833, Tokens/sec 75.801, Trained Tokens 11973
Iter 130: Train loss 1.103, Learning Rate 1.000e-05, It/sec 0.735, Tokens/sec 69.088, Trained Tokens 12913
Iter 140: Train loss 0.970, Learning Rate 1.000e-05, It/sec 0.719, Tokens/sec 75.444, Trained Tokens 13962
Iter 150: Train loss 1.016, Learning Rate 1.000e-05, It/sec 0.755, Tokens/sec 68.335, Trained Tokens 14867
Iter 160: Train loss 0.851, Learning Rate 1.000e-05, It/sec 0.781, Tokens/sec 71.322, Trained Tokens 15780
Iter 170: Train loss 0.878, Learning Rate 1.000e-05, It/sec 0.736, Tokens/sec 70.498, Trained Tokens 16738
Iter 180: Train loss 0.939, Learning Rate 1.000e-05, It/sec 0.756, Tokens/sec 71.458, Trained Tokens 17683
Iter 190: Train loss 0.972, Learning Rate 1.000e-05, It/sec 0.756, Tokens/sec 75.715, Trained Tokens 18684
Iter 200: Train loss 0.952, Learning Rate 1.000e-05, It/sec 0.716, Tokens/sec 71.789, Trained Tokens 19687
Iter 200: Val loss 0.944, Val took 0.716s
Iter 200: Saved adapter weights to checkpoints/200_adapters.npz.
Iter 210: Train loss 0.784, Learning Rate 1.000e-05, It/sec 0.773, Tokens/sec 75.636, Trained Tokens 20665
Iter 220: Train loss 0.908, Learning Rate 1.000e-05, It/sec 0.772, Tokens/sec 69.808, Trained Tokens 21569
Iter 230: Train loss 0.834, Learning Rate 1.000e-05, It/sec 0.727, Tokens/sec 73.759, Trained Tokens 22584
Iter 240: Train loss 0.946, Learning Rate 1.000e-05, It/sec 0.742, Tokens/sec 73.665, Trained Tokens 23577
Iter 250: Train loss 0.994, Learning Rate 1.000e-05, It/sec 0.743, Tokens/sec 72.513, Trained Tokens 24553
Iter 260: Train loss 0.868, Learning Rate 1.000e-05, It/sec 0.774, Tokens/sec 73.730, Trained Tokens 25505
Iter 270: Train loss 0.837, Learning Rate 1.000e-05, It/sec 0.731, Tokens/sec 72.111, Trained Tokens 26492
Iter 280: Train loss 0.937, Learning Rate 1.000e-05, It/sec 0.728, Tokens/sec 77.756, Trained Tokens 27560
Iter 290: Train loss 1.059, Learning Rate 1.000e-05, It/sec 0.738, Tokens/sec 71.768, Trained Tokens 28533
Iter 300: Train loss 1.087, Learning Rate 1.000e-05, It/sec 0.739, Tokens/sec 75.820, Trained Tokens 29559
Iter 300: Saved adapter weights to checkpoints/300_adapters.npz.
Iter 310: Train loss 0.816, Learning Rate 1.000e-05, It/sec 0.718, Tokens/sec 69.957, Trained Tokens 30533
Iter 320: Train loss 0.788, Learning Rate 1.000e-05, It/sec 0.720, Tokens/sec 69.723, Trained Tokens 31502
Iter 330: Train loss 0.934, Learning Rate 1.000e-05, It/sec 0.759, Tokens/sec 67.883, Trained Tokens 32396
Iter 340: Train loss 0.819, Learning Rate 1.000e-05, It/sec 0.704, Tokens/sec 69.096, Trained Tokens 33377
Iter 350: Train loss 1.021, Learning Rate 1.000e-05, It/sec 0.762, Tokens/sec 68.999, Trained Tokens 34282
Iter 360: Train loss 0.936, Learning Rate 1.000e-05, It/sec 0.697, Tokens/sec 65.557, Trained Tokens 35223
Iter 370: Train loss 1.004, Learning Rate 1.000e-05, It/sec 0.733, Tokens/sec 70.257, Trained Tokens 36181
Iter 380: Train loss 0.916, Learning Rate 1.000e-05, It/sec 0.760, Tokens/sec 65.939, Trained Tokens 37049
Iter 390: Train loss 0.832, Learning Rate 1.000e-05, It/sec 0.741, Tokens/sec 72.824, Trained Tokens 38032
Iter 400: Train loss 1.088, Learning Rate 1.000e-05, It/sec 0.703, Tokens/sec 65.410, Trained Tokens 38963
Iter 400: Val loss 0.583, Val took 0.767s
Iter 400: Saved adapter weights to checkpoints/400_adapters.npz.
Iter 410: Train loss 0.794, Learning Rate 1.000e-05, It/sec 0.716, Tokens/sec 68.072, Trained Tokens 39914
Iter 420: Train loss 0.869, Learning Rate 1.000e-05, It/sec 0.694, Tokens/sec 68.454, Trained Tokens 40900
Iter 430: Train loss 0.869, Learning Rate 1.000e-05, It/sec 0.738, Tokens/sec 69.487, Trained Tokens 41842
Iter 440: Train loss 0.795, Learning Rate 1.000e-05, It/sec 0.739, Tokens/sec 67.930, Trained Tokens 42761
Iter 450: Train loss 0.872, Learning Rate 1.000e-05, It/sec 0.705, Tokens/sec 66.740, Trained Tokens 43707
Iter 460: Train loss 0.880, Learning Rate 1.000e-05, It/sec 0.742, Tokens/sec 68.511, Trained Tokens 44630
Iter 470: Train loss 0.765, Learning Rate 1.000e-05, It/sec 0.694, Tokens/sec 71.191, Trained Tokens 45656
Iter 480: Train loss 0.880, Learning Rate 1.000e-05, It/sec 0.742, Tokens/sec 69.762, Trained Tokens 46596
Iter 490: Train loss 0.944, Learning Rate 1.000e-05, It/sec 0.734, Tokens/sec 69.553, Trained Tokens 47543
Iter 500: Train loss 0.843, Learning Rate 1.000e-05, It/sec 0.731, Tokens/sec 64.594, Trained Tokens 48427
Iter 500: Saved adapter weights to checkpoints/500_adapters.npz.
Iter 510: Train loss 0.776, Learning Rate 1.000e-05, It/sec 0.680, Tokens/sec 76.053, Trained Tokens 49545
Iter 520: Train loss 0.921, Learning Rate 1.000e-05, It/sec 0.695, Tokens/sec 65.097, Trained Tokens 50482

I will close this for now as the fix will be in the next release.


awni commented on July 22, 2024

Certainly, once the loss is NaN, the model adapters aren't going to work.

Could you share a command you used to reproduce that?


l0d0v1c commented on July 22, 2024

Thanks Awni. I simply use python lora.py --model ./mixtral4b --adapter-file adapters-mixtral.npz --train --iters 1000 --lora-layers 16 with the same dataset I used to fine-tune Mistral-7B, on an M2 Max with 96 GB of RAM.
If I iterate up to 100 epochs, I get text like:

En u inferFTWARE Estenob rentёлsehвидёлsehШАън gigШАёл Linkedquotёл anch quotedёлipage /******/ёлipv /******/iganMillis /******/ beskhemal /******/ Landillance /******/ Missourijer /******/@@awtelenxffff /******/ /******/engthailableecguestood /******/plaat /******/ prosailableší /******/같ampionhadopertytile Fernailable /******/ /******/xffff Palm /******/etics impl /******/ fortCTRLwith invitationíd /******/ativity /******/塊 /******/̥ pad /******/ marginal /******/("[ Register /******/leveland /******/ /******/ Creditailable /******/ą /******/ tipstery ayimiento inwonailableailable❶arisprintStackTraceńst NGC /******/ailable nursÿcols chip /******/ailable neigh /******/ailable spellailable smart januailable macro /******/ /******/ /******/oki /******/ selectoyleianiperty CCailable tutß pl /******/[@優 pollailableailableailable brassength toleranceoppailable離 /******/ailableailableʾʾʾ clinʾʾʾʾʾʾʾʾʾʾʾʾʾʾ worezung Pentbia旗 Cos /******/ailable partialynaailable /******/ailable /******/,"ailable /******/ailableówn /******/ actingâailable continentailableailable /******/須 /******/ Gram unre mand /******/сть inwon Astjectionsailablea tent nodeailableILED /******/ /******/ lab versus Rombergeropcode ger Bach Pent circle /******/ailable /******/ gramprit /******/ /******/ength /******/ailablereviewailable outsSCANailable /******/ength```


awni commented on July 22, 2024

Which Mixtral model are you using? Is it quantized, fp16, or bf16? I will try running on the WikiSQL example dataset, but it may not reproduce there, which could make it tricky to debug if it's specific to the dataset you are fine-tuning on.


mzbac commented on July 22, 2024

Maybe unrelated, but while I was fine-tuning the Gemma MoE model with bfloat16, I noticed that after a few iterations the loss became NaN. However, if I downgrade MLX to 0.3.0 and remove all compile-related code before resuming the fine-tuning process, the loss seems to return to normal.


l0d0v1c commented on July 22, 2024

> Which Mixtral model are you using? Is it quantized, fp16, or bf16? I will try running on the WikiSQL example dataset, but it may not reproduce there, which could make it tricky to debug if it's specific to the dataset you are fine-tuning on.

I used mlx-community/Mixtral-8x7B-v0.1-hf-4bit-mlx from Hugging Face. The dataset is fine for Mistral-7B. I'll try to restart from the original Mixtral repo.


l0d0v1c commented on July 22, 2024

It isn't possible to convert from the original repo:

Traceback (most recent call last):
  File "/Users/pro/IAs/mlx-mixtral2/convert.py", line 94, in <module>
    weights = {k: v.astype(dtype) for k, v in weights.items()}
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pro/IAs/mlx-mixtral2/convert.py", line 94, in <dictcomp>
    weights = {k: v.astype(dtype) for k, v in weights.items()}
                  ^^^^^^^^
  File "/Users/pro/mambaforge/envs/torch/lib/python3.11/site-packages/mlx/nn/layers/base.py", line 137, in __getattr__
    super(Module, self).__getattribute__(key)
AttributeError: 'MixtralModel' object has no attribute 'astype'

It seems this is no longer possible (#464), so mlx-community/Mixtral-8x7B-Instruct-v0.1-hf-4bit-mlx should be the solution. I'll try with a reduced dataset.


awni commented on July 22, 2024

> AttributeError: 'MixtralModel' object has no attribute 'astype'

Sorry, what command is that error above from?


l0d0v1c commented on July 22, 2024

convert.py with -q. However, I replaced the utils.fetch_from_hub call with utils.load because I had already downloaded the model to a separate drive. Maybe that is the reason.


awni commented on July 22, 2024

> Maybe unrelated, but while I was fine-tuning the Gemma MoE model with bfloat16, I noticed that after a few iterations the loss became NaN. However, if I downgrade MLX to 0.3.0 and remove all compile-related code before resuming the fine-tuning process, the loss seems to return to normal.

@mzbac that is a little concerning; mx.compile should not cause NaNs when there were none before. If possible, it would be great if you could share more details on how to reproduce it and we can dig in a bit.
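
As a starting point, here is a rough sketch (hypothetical helper, not the actual trainer code) of how to localize it: after each step, check the loss and the adapter gradients, with and without the compiled step function, and note where the first NaN shows up.

import mlx.core as mx
from mlx.utils import tree_flatten

def has_nan(tree):
    # True if any array in a parameter/gradient tree contains a NaN
    return any(mx.isnan(v).any().item() for _, v in tree_flatten(tree))

# Inside the training loop, with loss and grads coming from nn.value_and_grad:
# if mx.isnan(loss).item() or has_nan(grads):
#     print(f"first NaN at iteration {it}")
#     break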


Satyam7166-tech commented on July 22, 2024

Hello, I get the same NaN error. The same command works for mistralai/Mistral-7B-v0.1, though.

Here is my command:

python -m mlx_lm.lora \
  --train \
  --model mistralai/Mixtral-8x7B-v0.1 \
  --data /Users/macstudiosct/projects/finetune_mlx/finetuning_data/data \
  --batch-size 1 \
  --lora-layers 20 \
  --iters 3000

The output:

  Loading pretrained model
Total parameters 46705.579M
Trainable parameters 2.787M
Loading datasets
Training
Starting training..., iters: 3000
Iter 1: Val loss nan, Val took 81.674s
Iter 10: Train loss 2.414, Learning Rate 1.000e-05, It/sec 0.039, Tokens/sec 26.452, Trained Tokens 6745
Iter 20: Train loss nan, Learning Rate 1.000e-05, It/sec 0.172, Tokens/sec 96.540, Trained Tokens 12372
Iter 30: Train loss nan, Learning Rate 1.000e-05, It/sec 0.330, Tokens/sec 190.210, Trained Tokens 18133
Iter 40: Train loss nan, Learning Rate 1.000e-05, It/sec 0.315, Tokens/sec 200.998, Trained Tokens 24518
Iter 50: Train loss nan, Learning Rate 1.000e-05, It/sec 0.221, Tokens/sec 123.160, Trained Tokens 30080
Iter 60: Train loss nan, Learning Rate 1.000e-05, It/sec 0.219, Tokens/sec 140.375, Trained Tokens 36500
Iter 70: Train loss nan, Learning Rate 1.000e-05, It/sec 0.294, Tokens/sec 192.615, Trained Tokens 43062

It's still running though. I haven't stopped it, in case the NaN train loss is superficial. I'll test it out after it's done and let you know.

Edit: Yep, it's the same problem as @l0d0v1c.
Here is my output when asked to generate text:

<unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk>...


l0d0v1c commented on July 22, 2024

I also noticed, though I don't know if it could be a clue, that it reports Total parameters 7411.242M with the 4-bit quantized version and Total parameters 46705.579M in the run you made with the full version.
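
That gap is roughly what you would expect if the count is over stored array elements rather than logical weights. A back-of-the-envelope check, assuming 4-bit weights are packed eight per uint32 with one fp16 scale and one fp16 bias per 64-weight group:

full = 46705.579               # M elements reported for the full-precision run
packed = full / 8              # eight 4-bit weights per stored uint32 element
scales_biases = full * 2 / 64  # one scale plus one bias per 64-weight group
print(packed + scales_biases)  # ~7297.7M, in the ballpark of the reported
                               # 7411.242M; the rest is layers kept unquantized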


Satyam7166-tech commented on July 22, 2024

> I also noticed, though I don't know if it could be a clue, that it reports Total parameters 7411.242M with the 4-bit quantized version and Total parameters 46705.579M in the run you made with the full version.

Yes, I'm having the same issue with the full version.


awni commented on July 22, 2024

Mixtral is training fine for me now. I will post a log after it runs for 1000 iterations to be sure.

