visionshao / llmisallyouneed

A Comprehensive repository for LLM knowledge

llmisallyouneed's Issues

Why do some LLMs use eos_token as pad_token?

Hi all! There’s an interesting story here.

In general you are correct that causal LMs like Falcon are not trained with a pad token, and so the tokenizer does not have one set. This is true for a lot of causal LMs in the Hub. During training, these models are often fed sequences that have been concatenated together and truncated at the maximum sequence length, and so there is never any empty space that needs padding.

The reason we add one later is because a lot of downstream methods use padding and attention masks in some way. However, in many cases it doesn’t really matter what you set the padding token to! This is because the padded tokens will generally be masked by setting the attention_mask to 0, so those tokens will not be attended to by the rest of the sequence.

However, one place the choice of padding token can matter is in the labels when fine-tuning the model. This is because in standard CLM training, the labels are the inputs, shifted by a single position. This would mean that in the final position of the sequence before the padding at the end, the label at that position will be the padding token. When training models with shorter sequences (such as for chat), we generally want them to mark the end of the text they’ve generated, using a token like eos_token. As a result, we commonly just use eos_token as the padding token.

However, depending on your fine-tuning task, you may not want the model to learn to predict eos_token at the end of a sequence - if this is the case, simply change the label at that position to the token you do want, or set the label to -100 to mask the label at that position.

Does that answer the questions you had? Feel free to let me know if I missed anything here!

from https://discuss.huggingface.co/t/why-does-the-falcon-qlora-tutorial-code-use-eos-token-as-pad-token/45954/9
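
To make the label handling described above concrete, here is a minimal sketch in plain Python. It assumes right padding and uses made-up token ids; `pad_batch` and `learn_eos` are hypothetical names chosen for this illustration and are not part of any library. The labels are a copy of the inputs, on the assumption that the model shifts them by one position internally, as the answer describes.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def pad_batch(sequences, eos_token_id, learn_eos=True):
    """Right-pad lists of token ids, using eos_token_id as the pad token.

    Labels are a copy of the inputs (the shift is assumed to happen in the model).
    With learn_eos=True the first pad position keeps eos_token_id as its label,
    so the last real token is trained to predict EOS. All other pad positions
    are set to IGNORE_INDEX so they do not contribute to the loss.
    """
    max_len = max(len(seq) for seq in sequences)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids = list(seq) + [eos_token_id] * n_pad
        attention_mask = [1] * len(seq) + [0] * n_pad
        labels = list(input_ids)
        # First label position to mask: skip one pad slot if EOS should be learned.
        first_masked = len(seq) + (1 if learn_eos else 0)
        for i in range(first_masked, max_len):
            labels[i] = IGNORE_INDEX
        batch["input_ids"].append(input_ids)
        batch["attention_mask"].append(attention_mask)
        batch["labels"].append(labels)
    return batch


batch = pad_batch([[5, 6, 7], [8]], eos_token_id=2)
print(batch["input_ids"])       # [[5, 6, 7], [8, 2, 2]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 0, 0]]
print(batch["labels"])          # [[5, 6, 7], [8, 2, -100]]
```

Passing `learn_eos=False` masks every pad position, which corresponds to the case where you do not want the model to learn to predict eos_token at the end of a sequence.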

Why is a nohup background process killed when training with DDP?

https://unix.stackexchange.com/questions/446625/why-nohup-background-process-is-getting-killed

Mitigate this by running: setsid nohup python3 run.py > nohup.out 2>&1 &
or
use disown
https://discuss.pytorch.org/t/ddp-error-torch-distributed-elastic-agent-server-api-received-1-death-signal-shutting-down-workers/135720/13

For details, see pytorch/pytorch#67538
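
The same idea can be sketched from Python instead of the shell. This is an illustration only, not the method from the linked threads: `launch_detached` is a hypothetical helper, and it relies on `subprocess.Popen(..., start_new_session=True)`, which makes the child call `setsid()` on POSIX systems so that it no longer belongs to the terminal's session.

```python
import subprocess


def launch_detached(cmd, log_path):
    """Start cmd in a new session, appending stdout and stderr to log_path."""
    log_file = open(log_path, "ab")
    try:
        proc = subprocess.Popen(
            cmd,
            stdin=subprocess.DEVNULL,   # do not keep the terminal as stdin
            stdout=log_file,
            stderr=subprocess.STDOUT,   # same effect as 2>&1
            start_new_session=True,     # child calls setsid() before exec (POSIX only)
        )
    finally:
        log_file.close()  # the child holds its own copy of the file descriptor
    return proc


# Example (hypothetical entry point, mirroring the shell command above):
# proc = launch_detached(["python3", "run.py"], "nohup.out")
# print(proc.pid)
```

Because the child runs in its own session, closing the launching terminal should not deliver the hangup signal to it. Whether this is sufficient for a given torchrun setup is an assumption to verify against the issue referenced above.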

DeepSpeed Initialization Error

Detected CUDA files, patching ldflags
Emitting ninja build file /mnt/cache/weishao4/.cache/torch_extensions/py39_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /mnt/cache/weishao4/anaconda3/envs/toxicity/bin/nvcc -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/mnt/cache/weishao4/anaconda3/envs/toxicity/include -isystem /mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/torch/include -isystem /mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/torch/include/TH -isystem /mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/torch/include/THC -isystem /mnt/cache/weishao4/anaconda3/envs/toxicity/include -isystem /mnt/cache/weishao4/anaconda3/envs/toxicity/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -DBF16_AVAILABLE -c /mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o
FAILED: custom_cuda_kernel.cuda.o
/mnt/cache/weishao4/anaconda3/envs/toxicity/bin/nvcc -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/mnt/cache/weishao4/anaconda3/envs/toxicity/include -isystem /mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/torch/include -isystem /mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/torch/include/TH -isystem /mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/torch/include/THC -isystem /mnt/cache/weishao4/anaconda3/envs/toxicity/include -isystem /mnt/cache/weishao4/anaconda3/envs/toxicity/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -DBF16_AVAILABLE -c /mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o
cc1plus: fatal error: cuda_runtime.h: No such file or directory
compilation terminated.
[2/3] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/mnt/cache/weishao4/anaconda3/envs/toxicity/include -isystem /mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/torch/include -isystem /mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/torch/include/TH -isystem /mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/torch/include/THC -isystem /mnt/cache/weishao4/anaconda3/envs/toxicity/include -isystem /mnt/cache/weishao4/anaconda3/envs/toxicity/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++17 -g -Wno-reorder -L/mnt/cache/weishao4/anaconda3/envs/toxicity/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX512 -D__ENABLE_CUDA_ -DBF16_AVAILABLE -c /mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
FAILED: cpu_adam.o
c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/mnt/cache/weishao4/anaconda3/envs/toxicity/include -isystem /mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/torch/include -isystem /mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/torch/include/TH -isystem /mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/torch/include/THC -isystem /mnt/cache/weishao4/anaconda3/envs/toxicity/include -isystem /mnt/cache/weishao4/anaconda3/envs/toxicity/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++17 -g -Wno-reorder -L/mnt/cache/weishao4/anaconda3/envs/toxicity/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX512 -D__ENABLE_CUDA_ -DBF16_AVAILABLE -c /mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
In file included from /mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp:6:
/mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/deepspeed/ops/csrc/includes/cpu_adam.h:16:10: fatal error: cuda_fp16.h: No such file or directory
#include <cuda_fp16.h>
^~~~~~~~~~~~~
compilation terminated.
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1900, in _run_ninja_build
subprocess.run(
File "/mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/subprocess.py", line 528, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/mnt/cache/weishao4/Projects/Toxicity/LLM_fine_tune/ToxDetLLaMa/main.py", line 620, in <module>
main()
File "/mnt/cache/weishao4/Projects/Toxicity/LLM_fine_tune/ToxDetLLaMa/main.py", line 599, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/mnt/cache/weishao4/Projects/Toxicity/LLM_fine_tune/ToxDetLLaMa/transformers/src/transformers/trainer.py", line 1648, in train
return inner_training_loop(
File "/mnt/cache/weishao4/Projects/Toxicity/LLM_fine_tune/ToxDetLLaMa/transformers/src/transformers/trainer.py", line 1717, in _inner_training_loop
deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
File "/mnt/cache/weishao4/Projects/Toxicity/LLM_fine_tune/ToxDetLLaMa/transformers/src/transformers/deepspeed.py", line 378, in deepspeed_init
deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
File "/mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/deepspeed/__init__.py", line 165, in initialize
engine = DeepSpeedEngine(args=args,
File "/mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 309, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1174, in _configure_optimizer
basic_optimizer = self._configure_basic_optimizer(model_parameters)
File "/mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1230, in _configure_basic_optimizer
optimizer = DeepSpeedCPUAdam(model_parameters,
File "/mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/deepspeed/ops/adam/cpu_adam.py", line 94, in __init__
self.ds_opt_adam = CPUAdamBuilder().load()
File "/mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 454, in load
return self.jit_load(verbose)
File "/mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 497, in jit_load
op_module = load(name=self.name,
File "/mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1284, in load
return _jit_compile(
File "/mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1508, in _jit_compile
_write_ninja_file_and_build_library(
File "/mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1623, in _write_ninja_file_and_build_library
_run_ninja_build(
File "/mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1916, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'cpu_adam'
Loading extension module cpu_adam...
Traceback (most recent call last):
File "/mnt/cache/weishao4/Projects/Toxicity/LLM_fine_tune/ToxDetLLaMa/main.py", line 620, in <module>
main()
File "/mnt/cache/weishao4/Projects/Toxicity/LLM_fine_tune/ToxDetLLaMa/main.py", line 599, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/mnt/cache/weishao4/Projects/Toxicity/LLM_fine_tune/ToxDetLLaMa/transformers/src/transformers/trainer.py", line 1648, in train
return inner_training_loop(
File "/mnt/cache/weishao4/Projects/Toxicity/LLM_fine_tune/ToxDetLLaMa/transformers/src/transformers/trainer.py", line 1717, in _inner_training_loop
deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
File "/mnt/cache/weishao4/Projects/Toxicity/LLM_fine_tune/ToxDetLLaMa/transformers/src/transformers/deepspeed.py", line 378, in deepspeed_init
deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
File "/mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/deepspeed/__init__.py", line 165, in initialize
engine = DeepSpeedEngine(args=args,
File "/mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 309, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1174, in _configure_optimizer
basic_optimizer = self._configure_basic_optimizer(model_parameters)
File "/mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1230, in _configure_basic_optimizer
optimizer = DeepSpeedCPUAdam(model_parameters,
File "/mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/deepspeed/ops/adam/cpu_adam.py", line 94, in __init__
self.ds_opt_adam = CPUAdamBuilder().load()
File "/mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 454, in load
return self.jit_load(verbose)
File "/mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 497, in jit_load
op_module = load(name=self.name,
File "/mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1284, in load
return _jit_compile(
File "/mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1534, in _jit_compile
return _import_module_from_library(name, build_directory, is_python_module)
File "/mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1936, in _import_module_from_library
module = importlib.util.module_from_spec(spec)
File "", line 565, in module_from_spec
File "", line 1173, in create_module
File "", line 228, in _call_with_frames_removed
ImportError: /mnt/cache/weishao4/.cache/torch_extensions/py39_cu117/cpu_adam/cpu_adam.so: cannot open shared object file: No such file or directory
Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7faf8116e8b0>
Traceback (most recent call last):
File "/mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/deepspeed/ops/adam/cpu_adam.py", line 102, in __del__
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7fc8dcf6e8b0>
Traceback (most recent call last):
File "/mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/deepspeed/ops/adam/cpu_adam.py", line 102, in __del__
self.ds_opt_adam.destroy_adam(self.opt_id)
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 848830) of binary: /mnt/cache/weishao4/anaconda3/envs/toxicity/bin/python
Traceback (most recent call last):
File "/mnt/cache/weishao4/anaconda3/envs/toxicity/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/mnt/cache/weishao4/anaconda3/envs/toxicity/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/mnt/cache/weishao4/Projects/Toxicity/LLM_fine_tune/ToxDetLLaMa/main.py FAILED

Failures:
[1]:
time : 2023-09-27_14:00:16
host : xgcsdx-SYS-740GP-TNRT
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 848831)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2023-09-27_14:00:16
host : xgcsdx-SYS-740GP-TNRT
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 848830)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Environment:
A6000 80G
Pytorch 1.13.1
Python 3.9
CUDA 11.7
DeepSpeed 0.9.3

CUDA_HOME is not found when installing DeepSpeed

This error is common among heavy PyTorch users. For reasons I do not fully understand yet, the CUDA version of PyTorch automatically installs only part of the CUDA dependencies, and nvcc is not among them. As a result, CUDA_HOME cannot be found and installing DeepSpeed fails. A recommended fix is to install the following dependencies after installing the CUDA version of PyTorch:

conda install -c nvidia cudatoolkit

conda install -c "nvidia/label/cuda-11.7.0" cuda-nvcc
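
Before and after installing these packages, it can help to check what a build would actually find. The sketch below is a rough approximation written for this note, not the lookup code used by PyTorch or DeepSpeed: `find_cuda_home` and `missing_headers` are hypothetical helpers, and the assumption that headers live in `<CUDA_HOME>/include` is taken from the `-I.../envs/toxicity/include` flags in the build log above.

```python
import os
import shutil


def find_cuda_home(env=None, which=shutil.which):
    """Guess the CUDA root: environment variables first, then the location of nvcc."""
    env = os.environ if env is None else env
    for var in ("CUDA_HOME", "CUDA_PATH"):
        if env.get(var):
            return env[var]
    nvcc = which("nvcc")
    if nvcc:
        # nvcc normally lives in <CUDA_HOME>/bin/nvcc
        return os.path.dirname(os.path.dirname(nvcc))
    return None


def missing_headers(cuda_home, headers=("cuda_runtime.h", "cuda_fp16.h")):
    """Return the headers from the build log that are absent from <cuda_home>/include."""
    include_dir = os.path.join(cuda_home, "include")
    return [h for h in headers if not os.path.isfile(os.path.join(include_dir, h))]


cuda_home = find_cuda_home()
if cuda_home is None:
    print("nvcc not found and CUDA_HOME is unset")
else:
    print("CUDA root guess:", cuda_home)
    print("missing headers:", missing_headers(cuda_home))
```

If nvcc is found but the headers are still reported missing, the environment matches the `cuda_runtime.h: No such file or directory` failure shown in the DeepSpeed log, and further CUDA development packages may be needed.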

Reference links:
https://blog.csdn.net/muyao987/article/details/130426069

https://anaconda.org/nvidia/cuda-nvcc

https://www.zhihu.com/question/344950161

https://blog.csdn.net/weixin_44589524/article/details/131663046

No data is loaded when you pass a custom model to Trainer

When you write your own model for Trainer, its forward function must name the inputs it expects explicitly. For example:

def forward(self, input_ids, attention_mask, labels, **kwargs):

Here "input_ids" and "attention_mask" are the columns produced by the preprocess_function passed to data.map().

If the parameters of forward do not include the columns produced by preprocess_function, no data reaches the model, because the remove_unused_columns option of Trainer drops every dataset column that forward does not accept.
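
The column filtering can be approximated with `inspect.signature`. This is a simplified sketch, not Trainer's actual code: `kept_columns`, `GoodModel`, and `BadModel` are hypothetical names, and the extra label names ("label", "label_ids") are an assumption about what Trainer also keeps.

```python
import inspect


def kept_columns(model, dataset_columns, label_names=("label", "label_ids")):
    """Return the dataset columns whose names match parameters of model.forward."""
    accepted = set(inspect.signature(model.forward).parameters) | set(label_names)
    return [col for col in dataset_columns if col in accepted]


class GoodModel:
    def forward(self, input_ids, attention_mask, labels=None, **kwargs):
        return None


class BadModel:
    # Parameter names do not match any dataset column.
    def forward(self, x):
        return None


columns = ["input_ids", "attention_mask", "labels", "text"]
print(kept_columns(GoodModel(), columns))  # ['input_ids', 'attention_mask', 'labels']
print(kept_columns(BadModel(), columns))   # []
```

Note that `**kwargs` alone does not help here: the match is done on parameter names, so the columns must appear by name in the signature. Alternatively, passing `remove_unused_columns=False` to `TrainingArguments` turns the filtering off.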
