
Comments (12)

josStorer commented on June 9, 2024
  1. Type wsl in the Windows console and see what error it reports, then search the web for a fix; usually you need to install a specific component or enable virtualization.

  2. On Windows you must install Ubuntu through WSL, because WSL supports GPU passthrough.

  3. If you want to train on a novel, put the whole novel's text on a single line, like the first line in your screenshot. To train dialogue, write it in conversation format like the third line. To train instructions or code, follow the fourth and fifth lines.
    In short, provide the data in whatever form you want the LoRA-finetuned model to be used; a rough sketch follows below.
    In addition, it is recommended to mix in some data unrelated to your fine-tuning goal, in various other forms (e.g. dialogue, continuation), to keep the model from overfitting and losing general ability.
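
For reference, a minimal sketch of what such a data file could look like, assuming (this is an assumption, not something confirmed in this thread) that the json2binidx tool consumes jsonl lines with a single "text" field; check the keys and exact prompt wording against the sample data shipped with RWKV-Runner:

import json

# Illustrative samples only: one JSON object per line, each carrying the raw training text.
# Mix continuation, dialogue and other forms so the model does not overfit to a single style.
samples = [
    # Continuation / novel data: the whole text as one sample
    {"text": "Chapter 1: The rain had not stopped for three days..."},
    # Dialogue data, written in conversation format
    {"text": "User: What's the weather like today?\n\nAssistant: Sunny, with a light breeze in the afternoon."},
]

# train.jsonl is a hypothetical output path; point it at the data directory you actually use
with open("train.jsonl", "w", encoding="utf-8") as f:
    for s in samples:
        f.write(json.dumps(s, ensure_ascii=False) + "\n")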


SerMs commented on June 9, 2024

{
"Instruction": "question ",
"Input": " background knowledge",
"Response": "answer",
}

Is this format OK?


SerMs commented on June 9, 2024

How can I see the training progress and results?
[screenshot]


josStorer commented on June 9, 2024

{ "Instruction": "question ", "Input": " background knowledge", "Response": "answer", }

Is this format OK?

That format won't work. Look at the fourth line: the base model only supports instructions in that one form, so LoRA fine-tuning on your format will not give good results. Either do a full fine-tune, or follow the format of the fourth line (a rough sketch is below).
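
For illustration, a sketch of how those fields could be folded into a single-text instruction record instead (assuming the "text"-field jsonl format sketched earlier; the exact labels and separators should be copied from the fourth line of the sample data rather than from this sketch):

import json

# Hypothetical record: one "text" field carrying the whole exchange,
# instead of separate "Instruction"/"Input"/"Response" keys.
record = {
    "text": "Instruction: question\n\nInput: background knowledge\n\nResponse: answer"
}
print(json.dumps(record, ensure_ascii=False))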

Your screenshot above shows the dependencies required for training being installed; once installation finishes and training starts, the progress will be displayed.


SerMs commented on June 9, 2024

How long does training usually take? I'm only doing a test run, using the 0.1B model.


SerMs commented on June 9, 2024

[screenshot]
It is stuck here and not moving. Is it still loading, or has it already stopped? I don't quite understand what this step is doing.


josStorer commented on June 9, 2024

Click Train again and check whether "gcc installed; requirements satisfied" appears.


SerMs commented on June 9, 2024

Building dependency tree...
Reading state information...
Package gcc is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source
However the following packages replace it:
gcc-11-doc gcc-9-doc gcc-12-doc gcc-10-doc
E: Package 'gcc' has no installation candidate
pip installed
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package ninja-build
--2024-01-12 13:59:55-- https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-wsl-ubuntu.pin
Resolving developer.download.nvidia.com (developer.download.nvidia.com)... 152.199.39.144
Connecting to developer.download.nvidia.com (developer.download.nvidia.com)|152.199.39.144|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://developer.download.nvidia.cn/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-wsl-ubuntu.pin [following]
--2024-01-12 14:00:11-- https://developer.download.nvidia.cn/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-wsl-ubuntu.pin
Resolving developer.download.nvidia.cn (developer.download.nvidia.cn)... 175.4.58.178, 175.4.58.179, 175.4.58.180, ...
Connecting to developer.download.nvidia.cn (developer.download.nvidia.cn)|175.4.58.178|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 190 [application/octet-stream]
Saving to: ‘cuda-wsl-ubuntu.pin’
0K 100% 26.6M=0s
2024-01-12 14:00:12 (26.6 MB/s) - ‘cuda-wsl-ubuntu.pin’ saved [190/190]
--2024-01-12 14:00:12-- https://developer.download.nvidia.com/compute/cuda/12.2.0/local_installers/cuda-repo-wsl-ubuntu-12-2-local_12.2.0-1_amd64.deb
Resolving developer.download.nvidia.com (developer.download.nvidia.com)... 152.199.39.144
Connecting to developer.download.nvidia.com (developer.download.nvidia.com)|152.199.39.144|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://developer.download.nvidia.cn/compute/cuda/12.2.0/local_installers/cuda-repo-wsl-ubuntu-12-2-local_12.2.0-1_amd64.deb [following]
--2024-01-12 14:00:16-- https://developer.download.nvidia.cn/compute/cuda/12.2.0/local_installers/cuda-repo-wsl-ubuntu-12-2-local_12.2.0-1_amd64.deb
Resolving developer.download.nvidia.cn (developer.download.nvidia.cn)... 175.4.58.178, 175.4.58.179, 175.4.58.180, ...
Connecting to developer.download.nvidia.cn (developer.download.nvidia.cn)|175.4.58.178|:443... connected.
HTTP request sent, awaiting response... 304 Not Modified
File ‘cuda-repo-wsl-ubuntu-12-2-local_12.2.0-1_amd64.deb’ not modified on server. Omitting download.
(Reading database ... 25419 files and directories currently installed.)
Preparing to unpack cuda-repo-wsl-ubuntu-12-2-local_12.2.0-1_amd64.deb ...
Unpacking cuda-repo-wsl-ubuntu-12-2-local (12.2.0-1) over (12.2.0-1) ...
Setting up cuda-repo-wsl-ubuntu-12-2-local (12.2.0-1) ...
Reading package lists...
E: Could not get lock /var/lib/apt/lists/lock. It is held by process 979 (apt-get)
E: Unable to lock directory /var/lib/apt/lists/
Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package cuda
requirements satisfied
loading models/RWKV-5-World-1B5-v2-20231025-ctx4096.pth
v5/train.py --vocab_size 65536 --n_layer 24 --n_embd 2048
INFO:torch.distributed.nn.jit.instantiator:Created a temporary directory at /tmp/tmpw37_4auo
INFO:torch.distributed.nn.jit.instantiator:Writing /tmp/tmpw37_4auo/_remote_module_non_scriptable.py
INFO:pytorch_lightning.utilities.rank_zero:########## work in progress ##########
[2024-01-12 14:01:48,105] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO:pytorch_lightning.utilities.rank_zero:
############################################################################

RWKV-5 BF16 on 1x1 GPU, bsz 1x1x1=1, deepspeed_stage_2 with grad_cp

Data = ./finetune/json2binidx_tool/data/test2_text_document (binidx), ProjDir = lora-models

Epoch = 0 to 19, save every 1 epoch

Each "epoch" = 200 steps, 200 samples, 204800 tokens

Model = 24 n_layer, 2048 n_embd, 1024 ctx_len

Adam = lr 5e-05 to 5e-05, warmup 0 steps, beta (0.9, 0.999), eps 1e-08

Found torch 1.13.1+cu117, recommend 1.13.1+cu117 or newer

Found deepspeed 0.11.2, recommend 0.7.0 (faster than newer versions)

Found pytorch_lightning 1.9.5, recommend 1.9.5

############################################################################
INFO:pytorch_lightning.utilities.rank_zero:{'load_model': 'models/RWKV-5-World-1B5-v2-20231025-ctx4096.pth', 'wandb': '', 'proj_dir': 'lora-models', 'random_seed': -1, 'data_file': './finetune/json2binidx_tool/data/test2_text_document', 'data_type': 'binidx', 'vocab_size': 65536, 'ctx_len': 1024, 'epoch_steps': 200, 'epoch_count': 20, 'epoch_begin': 0, 'epoch_save': 1, 'micro_bsz': 1, 'n_layer': 24, 'n_embd': 2048, 'dim_att': 2048, 'dim_ffn': 7168, 'pre_ffn': 1, 'head_qk': 1, 'tiny_att_dim': 0, 'tiny_att_layer': -999, 'lr_init': 5e-05, 'lr_final': 5e-05, 'warmup_steps': 0, 'beta1': 0.9, 'beta2': 0.999, 'adam_eps': 1e-08, 'grad_cp': 1, 'dropout': 0, 'weight_decay': 0, 'weight_decay_final': -1, 'my_pile_version': 1, 'my_pile_stage': 0, 'my_pile_shift': -1, 'my_pile_edecay': 0, 'layerwise_lr': 1, 'ds_bucket_mb': 200, 'my_sample_len': 0, 'my_ffn_shift': 1, 'my_att_shift': 1, 'head_size_a': 64, 'head_size_divisor': 8, 'my_pos_emb': 0, 'load_partial': 0, 'magic_prime': 0, 'my_qa_mask': 0, 'my_random_steps': 0, 'my_testing': '', 'my_exit': 99999999, 'my_exit_tokens': 0, 'emb': False, 'lora': True, 'lora_load': '', 'lora_r': 8, 'lora_alpha': 32.0, 'lora_dropout': 0.01, 'lora_parts': 'att,ffn,time,ln', 'logger': False, 'enable_checkpointing': False, 'default_root_dir': None, 'gradient_clip_val': 1.0, 'gradient_clip_algorithm': None, 'num_nodes': 1, 'num_processes': None, 'devices': '1', 'gpus': None, 'auto_select_gpus': None, 'tpu_cores': None, 'ipus': None, 'enable_progress_bar': True, 'overfit_batches': 0.0, 'track_grad_norm': -1, 'check_val_every_n_epoch': 100000000000000000000, 'fast_dev_run': False, 'accumulate_grad_batches': 8, 'max_epochs': 20, 'min_epochs': None, 'max_steps': -1, 'min_steps': None, 'max_time': None, 'limit_train_batches': None, 'limit_val_batches': None, 'limit_test_batches': None, 'limit_predict_batches': None, 'val_check_interval': None, 'log_every_n_steps': 100000000000000000000, 'accelerator': 'gpu', 'strategy': 'deepspeed_stage_2', 'sync_batchnorm': False, 'precision': 'bf16', 'enable_model_summary': True, 'num_sanity_val_steps': 0, 'resume_from_checkpoint': None, 'profiler': None, 'benchmark': None, 'reload_dataloaders_every_n_epochs': 0, 'auto_lr_find': False, 'replace_sampler_ddp': False, 'detect_anomaly': False, 'auto_scale_batch_size': False, 'plugins': None, 'amp_backend': None, 'amp_level': None, 'move_metrics_to_cpu': False, 'multiple_trainloader_mode': 'max_size_cycle', 'inference_mode': True, 'my_timestamp': '2024-01-12-14-01-50', 'betas': (0.9, 0.999), 'real_bsz': 1, 'run_name': '65536 ctx1024 L24 D2048'}
Using /root/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
RWKV_MY_TESTING
Traceback (most recent call last):
File "/mnt/d/LS/Rwkv/./finetune/lora/v5/train.py", line 308, in
from src.trainer import train_callback, generate_init_weight
File "/mnt/d/LS/Rwkv/finetune/lora/v5/src/trainer.py", line 6, in
from .model import LORA_CONFIG
File "/mnt/d/LS/Rwkv/finetune/lora/v5/src/model.py", line 56, in
wkv5_cuda = load(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 1284, in load
return _jit_compile(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 1508, in _jit_compile
_write_ninja_file_and_build_library(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 1597, in _write_ninja_file_and_build_library
get_compiler_abi_compatibility_and_version(compiler)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 336, in get_compiler_abi_compatibility_and_version
if not check_compiler_ok_for_platform(compiler):
File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 290, in check_compiler_ok_for_platform
which = subprocess.check_output(['which', compiler], stderr=subprocess.STDOUT)
File "/usr/lib/python3.10/subprocess.py", line 421, in check_output
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
File "/usr/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['which', 'c++']' returned non-zero exit status 1.

"gcc installed; requirements satisfied" did not appear.

Click Train again and check whether "gcc installed; requirements satisfied" appears.


SerMs commented on June 9, 2024

Several of the dependency requests show port 443 in the log; could that be the cause? Do I need to use a proxy?


josStorer commented on June 9, 2024

Run sudo apt update in WSL.


SerMs commented on June 9, 2024

LoRA additionally training parameter time_mix_r
LoRA additionally training module blocks.8.ffn.key
LoRA additionally training module blocks.8.ffn.receptance
INFO:pytorch_lightning.utilities.rank_zero:########## Loading models/RWKV-5-World-0.1B-v1-20230803-ctx4096.pth... ##########
LoRA additionally training module blocks.8.ffn.value
LoRA additionally training module blocks.9.ln1
LoRA additionally training module blocks.9.ln2
LoRA additionally training parameter time_mix_k
LoRA additionally training parameter time_mix_v
LoRA additionally training parameter time_mix_r
LoRA additionally training parameter time_mix_g
LoRA additionally training parameter time_decay
LoRA additionally training parameter time_faaaa
LoRA additionally training module blocks.9.att.receptance
LoRA additionally training module blocks.9.att.key
LoRA additionally training module blocks.9.att.value
LoRA additionally training module blocks.9.att.gate
LoRA additionally training module blocks.9.att.ln_x
LoRA additionally training parameter time_mix_k
LoRA additionally training parameter time_mix_r
LoRA additionally training module blocks.9.ffn.key
LoRA additionally training module blocks.9.ffn.receptance
LoRA additionally training module blocks.9.ffn.value
LoRA additionally training module blocks.10.ln1
LoRA additionally training module blocks.10.ln2
LoRA additionally training parameter time_mix_k
LoRA additionally training parameter time_mix_v
LoRA additionally training parameter time_mix_r
LoRA additionally training parameter time_mix_g
LoRA additionally training parameter time_decay
LoRA additionally training parameter time_faaaa
LoRA additionally training module blocks.10.att.receptance
LoRA additionally training module blocks.10.att.key
LoRA additionally training module blocks.10.att.value
LoRA additionally training module blocks.10.att.gate
LoRA additionally training module blocks.10.att.ln_x
LoRA additionally training parameter time_mix_k
LoRA additionally training parameter time_mix_r
LoRA additionally training module blocks.10.ffn.key
LoRA additionally training module blocks.10.ffn.receptance
LoRA additionally training module blocks.10.ffn.value
LoRA additionally training module blocks.11.ln1
LoRA additionally training module blocks.11.ln2
LoRA additionally training parameter time_mix_k
LoRA additionally training parameter time_mix_v
LoRA additionally training parameter time_mix_r
LoRA additionally training parameter time_mix_g
LoRA additionally training parameter time_decay
LoRA additionally training parameter time_faaaa
LoRA additionally training module blocks.11.att.receptance
LoRA additionally training module blocks.11.att.key
LoRA additionally training module blocks.11.att.value
LoRA additionally training module blocks.11.att.gate
LoRA additionally training module blocks.11.att.ln_x
LoRA additionally training parameter time_mix_k
LoRA additionally training parameter time_mix_r
LoRA additionally training module blocks.11.ffn.key
LoRA additionally training module blocks.11.ffn.receptance
LoRA additionally training module blocks.11.ffn.value
Traceback (most recent call last):
File "/mnt/d/LS/Rwkv/./finetune/lora/v5/train.py", line 379, in
model.load_state_dict(load_dict, strict=(not args.lora))
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1671, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for RWKV:
size mismatch for blocks.0.ffn.key.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([2688, 768]).
size mismatch for blocks.0.ffn.value.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([768, 2688]).
size mismatch for blocks.1.att.time_decay: copying a param with shape torch.Size([12]) from checkpoint, the shape in current model is torch.Size([12, 64]).
size mismatch for blocks.1.ffn.key.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([2688, 768]).
size mismatch for blocks.1.ffn.value.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([768, 2688]).
size mismatch for blocks.2.att.time_decay: copying a param with shape torch.Size([12]) from checkpoint, the shape in current model is torch.Size([12, 64]).
size mismatch for blocks.2.ffn.key.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([2688, 768]).
size mismatch for blocks.2.ffn.value.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([768, 2688]).
size mismatch for blocks.3.att.time_decay: copying a param with shape torch.Size([12]) from checkpoint, the shape in current model is torch.Size([12, 64]).
size mismatch for blocks.3.ffn.key.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([2688, 768]).
size mismatch for blocks.3.ffn.value.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([768, 2688]).
size mismatch for blocks.4.att.time_decay: copying a param with shape torch.Size([12]) from checkpoint, the shape in current model is torch.Size([12, 64]).
size mismatch for blocks.4.ffn.key.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([2688, 768]).
size mismatch for blocks.4.ffn.value.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([768, 2688]).
size mismatch for blocks.5.att.time_decay: copying a param with shape torch.Size([12]) from checkpoint, the shape in current model is torch.Size([12, 64]).
size mismatch for blocks.5.ffn.key.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([2688, 768]).
size mismatch for blocks.5.ffn.value.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([768, 2688]).
size mismatch for blocks.6.att.time_decay: copying a param with shape torch.Size([12]) from checkpoint, the shape in current model is torch.Size([12, 64]).
size mismatch for blocks.6.ffn.key.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([2688, 768]).
size mismatch for blocks.6.ffn.value.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([768, 2688]).
size mismatch for blocks.7.att.time_decay: copying a param with shape torch.Size([12]) from checkpoint, the shape in current model is torch.Size([12, 64]).
size mismatch for blocks.7.ffn.key.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([2688, 768]).
size mismatch for blocks.7.ffn.value.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([768, 2688]).
size mismatch for blocks.8.att.time_decay: copying a param with shape torch.Size([12]) from checkpoint, the shape in current model is torch.Size([12, 64]).
size mismatch for blocks.8.ffn.key.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([2688, 768]).
size mismatch for blocks.8.ffn.value.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([768, 2688]).
size mismatch for blocks.9.att.time_decay: copying a param with shape torch.Size([12]) from checkpoint, the shape in current model is torch.Size([12, 64]).
size mismatch for blocks.9.ffn.key.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([2688, 768]).
size mismatch for blocks.9.ffn.value.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([768, 2688]).
size mismatch for blocks.10.att.time_decay: copying a param with shape torch.Size([12]) from checkpoint, the shape in current model is torch.Size([12, 64]).
size mismatch for blocks.10.ffn.key.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([2688, 768]).
size mismatch for blocks.10.ffn.value.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([768, 2688]).
size mismatch for blocks.11.att.time_decay: copying a param with shape torch.Size([12]) from checkpoint, the shape in current model is torch.Size([12, 64]).
size mismatch for blocks.11.ffn.key.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([2688, 768]).
size mismatch for blocks.11.ffn.value.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([768, 2688]).

I tried again following your instructions, but it got stuck again; it has been sitting here for half an hour with no activity. Is this an error, or what is going on?


josStorer commented on June 9, 2024

Try RWKV5-1.5B. LoRA may not have been adapted for the small RWKV5 model sizes; there are several minor versions of RWKV5.

