qlora's People

Contributors

abhilash1910, alpindale, artidoro, birch-san, bubundas17, dameikle, ffohturk, kkcorps, muelletm, pmysl, qubitium, ranchlai, steremma, timdettmers, tobi

qlora's Issues

How do you process oasst1 to get 9,209 examples?

Great work! In your paper you say: "In our experiments, we only use the top reply at each level in the conversation tree. This limits the dataset to 9,209 examples." Could you please tell me how you process the data? I get 10,364 examples from 2023-04-12_oasst_ready.trees.jsonl, so I don't see where 9,209 comes from.
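
For what it's worth, here is a minimal sketch of the kind of filtering the paper describes, i.e. keeping only the best-ranked reply at every level of each conversation tree. The field names ("prompt", "replies", "rank") are assumptions about the trees.jsonl schema rather than anything taken from this repo, so treat it as an illustration only:

import json

def count_top_reply_examples(path):
    # Walk each exported tree, keeping only the best-ranked reply at every level.
    n_examples = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            node = json.loads(line).get("prompt")  # assumed root key
            while node and node.get("replies"):
                # Assumed: lower "rank" is better; unranked replies sort last.
                best = min(
                    node["replies"],
                    key=lambda r: r["rank"] if r.get("rank") is not None else float("inf"),
                )
                n_examples += 1  # one (context, reply) pair per level
                node = best
    return n_examples

print(count_top_reply_examples("2023-04-12_oasst_ready.trees.jsonl"))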

undefined symbol: cquantize_blockwise_fp16_fp4

AttributeError Traceback (most recent call last)
Cell In[6], line 1
----> 1 model = LlamaForCausalLM.from_pretrained("../hf_llama", device_map="auto", torch_dtype=torch.float16, load_in_4bit=True )

File ~/anaconda3/envs/qlora/lib/python3.11/site-packages/transformers/modeling_utils.py:2829, in PreTrainedModel.from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
2819 if dtype_orig is not None:
2820 torch.set_default_dtype(dtype_orig)
2822 (
2823 model,
2824 missing_keys,
2825 unexpected_keys,
2826 mismatched_keys,
2827 offload_index,
2828 error_msgs,
-> 2829 ) = cls._load_pretrained_model(
2830 model,
2831 state_dict,
2832 loaded_state_dict_keys, # XXX: rename?
2833 resolved_archive_file,
2834 pretrained_model_name_or_path,
2835 ignore_mismatched_sizes=ignore_mismatched_sizes,
2836 sharded_metadata=sharded_metadata,
2837 _fast_init=_fast_init,
2838 low_cpu_mem_usage=low_cpu_mem_usage,
...
--> 394 func = self._FuncPtr((name_or_ordinal, self))
395 if not isinstance(name_or_ordinal, int):
396 func.name = name_or_ordinal

AttributeError: /home/server/anaconda3/envs/qlora/lib/python3.11/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cquantize_blockwise_fp16_fp4

Failed to load bloomz 7Bmt

Hi thank you very much for this great library! I am really excited about it!

I tried to fine-tune bloomz 7B with the 4-bit LoRA on Alpaca by running: python qlora.py --model_name_or_path my_bloomz_path. However, the job was killed halfway through loading the base model. My server has 40+ GB of RAM and a 3090 GPU, which should be large enough to load a 7B model.
It seems that some process during the 4-bit quantization drains all the memory. Can you please give me some advice on how to reduce the memory footprint so that the base model can be loaded?
Thank you very much !

Cannot merge LORA layers when the model is loaded in 8-bit mode

When I load the model as follows, it throws the error: Cannot merge LORA layers when the model is loaded in 8-bit mode.
How can I load the model in 4-bit for inference?
model_path = 'decapoda-research/llama-30b-hf'
adapter_path = 'timdettmers/guanaco-33b'

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type='nf4',
)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    low_cpu_mem_usage=True,
    load_in_4bit=True,
    quantization_config=quantization_config,
    torch_dtype=torch.float16,
    device_map='auto',
)
model = PeftModel.from_pretrained(model, adapter_path)
model = model.merge_and_unload()
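
As a stopgap until merging quantized models is supported, one workaround sketch is to skip merge_and_unload() entirely and generate through the PeftModel with the adapter still attached (this reuses model and model_path from the snippet above; the prompt format is only an example):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_path)
prompt = "### Human: Why is the sky blue?### Assistant:"
inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))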

Environment for running the code

Could the authors share the requirements and general environment for running this code? I am also hitting a few other issues and am currently trying to infer the right versions of the libraries.

guanaco-13b Model fails on Google Colab Free tier T4 GPU

Starting to load the model decapoda-research/llama-13b-hf into memory
Loading checkpoint shards: 90%
37/41 [04:09<00:27, 6.78s/it]
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ in <cell line: 14>:14 │
│ │
│ /usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py:472 in │
│ from_pretrained │
│ │
│ 469 │ │ │ ) │
│ 470 │ │ elif type(config) in cls._model_mapping.keys(): │
│ 471 │ │ │ model_class = _get_model_class(config, cls._model_mapping) │
│ ❱ 472 │ │ │ return model_class.from_pretrained( │
│ 473 │ │ │ │ pretrained_model_name_or_path, *model_args, config=config, **hub_kwargs, │
│ 474 │ │ │ ) │
│ 475 │ │ raise ValueError( │
│ │
│ /usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py:2829 in from_pretrained │
│ │
│ 2826 │ │ │ │ mismatched_keys, │
│ 2827 │ │ │ │ offload_index, │
│ 2828 │ │ │ │ error_msgs, │
│ ❱ 2829 │ │ │ ) = cls._load_pretrained_model( │
│ 2830 │ │ │ │ model, │
│ 2831 │ │ │ │ state_dict, │
│ 2832 │ │ │ │ loaded_state_dict_keys, # XXX: rename? │
│ │
│ /usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py:3172 in │
│ _load_pretrained_model │
│ │
│ 3169 │ │ │ │ ) │
│ 3170 │ │ │ │ │
│ 3171 │ │ │ │ if low_cpu_mem_usage: │
│ ❱ 3172 │ │ │ │ │ new_error_msgs, offload_index, state_dict_index = _load_state_dict_i │
│ 3173 │ │ │ │ │ │ model_to_load, │
│ 3174 │ │ │ │ │ │ state_dict, │
│ 3175 │ │ │ │ │ │ loaded_keys, │
│ │
│ /usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py:718 in │
│ _load_state_dict_into_meta_model │
│ │
│ 715 │ │ │ │ fp16_statistics = None │
│ 716 │ │ │ │
│ 717 │ │ │ if "SCB" not in param_name: │
│ ❱ 718 │ │ │ │ set_module_quantized_tensor_to_device( │
│ 719 │ │ │ │ │ model, param_name, param_device, value=param, fp16_statistics=fp16_s │
│ 720 │ │ │ │ ) │
│ 721 │
│ │
│ /usr/local/lib/python3.10/dist-packages/transformers/utils/bitsandbytes.py:88 in │
│ set_module_quantized_tensor_to_device │
│ │
│ 85 │ │ │ if is_8bit: │
│ 86 │ │ │ │ new_value = bnb.nn.Int8Params(new_value, requires_grad=False, **kwargs). │
│ 87 │ │ │ elif is_4bit: │
│ ❱ 88 │ │ │ │ new_value = bnb.nn.Params4bit(new_value, requires_grad=False, **kwargs). │
│ 89 │ │ │ │
│ 90 │ │ │ module._parameters[tensor_name] = new_value │
│ 91 │ │ │ if fp16_statistics is not None: │
│ │
│ /usr/local/lib/python3.10/dist-packages/bitsandbytes/nn/modules.py:176 in to │
│ │
│ 173 │ │ device, dtype, non_blocking, convert_to_format = torch._C._nn._parse_to(*args, * │
│ 174 │ │ │
│ 175 │ │ if (device is not None and device.type == "cuda" and self.data.device.type == "c │
│ ❱ 176 │ │ │ return self.cuda(device) │
│ 177 │ │ else: │
│ 178 │ │ │ s = self.quant_state │
│ 179 │ │ │ if s is not None: │
│ │
│ /usr/local/lib/python3.10/dist-packages/bitsandbytes/nn/modules.py:154 in cuda │
│ │
│ 151 │ │
│ 152 │ def cuda(self, device): │
│ 153 │ │ w = self.data.contiguous().half().cuda(device) │
│ ❱ 154 │ │ w_4bit, quant_state = bnb.functional.quantize_4bit(w, blocksize=self.blocksize, │
│ 155 │ │ self.data = w_4bit │
│ 156 │ │ self.quant_state = quant_state │
│ 157 │
│ │
│ /usr/local/lib/python3.10/dist-packages/bitsandbytes/functional.py:760 in quantize_4bit │
│ │
│ 757 │ │
│ 758 │ │
│ 759 │ if out is None: │
│ ❱ 760 │ │ out = torch.zeros(((n+1)//2, 1), dtype=torch.uint8, device=A.device) │
│ 761 │ │
│ 762 │ assert blocksize in [4096, 2048, 1024, 512, 256, 128, 64] │
│ 763 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
OutOfMemoryError: CUDA out of memory. Tried to allocate 34.00 MiB (GPU 0; 14.75 GiB total capacity; 13.85 GiB
already allocated; 832.00 KiB free; 13.88 GiB reserved in total by PyTorch) If reserved memory is >> allocated
memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and
PYTORCH_CUDA_ALLOC_CONF

[Question] Reason behind removing `lm_head` in `modules`

Hello,

Thank you for the amazing repo. I was curious about this code below.

qlora/qlora.py

Lines 221 to 222 in e381744

if 'lm_head' in lora_module_names: # needed for 16-bit
lora_module_names.remove('lm_head')

Why is lm_head removed? What does "needed for 16-bit" mean? Does it mean that targeting this module in fp16 is incorrect?

Fine-tuning Guanaco 65B...is it the same as in your fine-tuning notebook?

Thanks so much for making the fine-tuning notebook, super helpful!

I'm curious, if I want to fine-tune from Guanaco instead of the original LLaMA, are there any changes I'd need to make? E.g. would it go through this stage:

from peft import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

Or would anything else be different?
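
Not an official answer, but a sketch of what that path could look like when starting from a merged Guanaco checkpoint instead of raw LLaMA: the kbit-preparation step stays the same, and a fresh LoRA is attached on top. The checkpoint path and LoRA hyperparameters below are placeholders, not values taken from this repo:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = AutoModelForCausalLM.from_pretrained(
    "path/to/guanaco-merged",  # placeholder: merged Guanaco weights
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    device_map="auto",
)

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)  # same step as with the original LLaMA

lora_config = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.05,  # illustrative values only
    target_modules=["q_proj", "v_proj"],     # illustrative; qlora.py targets all linear layers
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)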

Compatibility with Deepspeed, Fairscale, or Torch zero-redundancy optimizer

Wonderful work!

May I ask about compatibility with ZeRO mechanisms, e.g. the Torch zero-redundancy optimizer, DeepSpeed ZeRO-1 through ZeRO-3, and Fairscale FSDP? I ask because I noticed that QLoRA relies on a specially implemented optimizer.

If the optimizer is not compatible with the tools mentioned above, can I use just the 4-bit tuning and LoRA with a ZeRO mechanism? Will this cost more memory?

Thanks very much!

Best

Training gets killed after eval due to ValueError

I have tried multiple LLaMA-based models on both my 4080 card and a Colab A100 GPU.

In both cases, training gets killed after eval with the following error:

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 1296.74it/s]
torch.float32 422326272 0.11537932153507864
torch.uint8 3238002688 0.8846206784649213
{'loss': 2.3365, 'learning_rate': 0.0002, 'epoch': 0.0}                                                                                                                                      
{'loss': 15.4705, 'learning_rate': 0.0002, 'epoch': 0.01}                                                                                                                                    
{'loss': 38.2581, 'learning_rate': 0.0002, 'epoch': 0.01}                                                                                                                                    
{'loss': 159.034, 'learning_rate': 0.0002, 'epoch': 0.01}                                                                                                                                    
{'loss': 777.5871, 'learning_rate': 0.0002, 'epoch': 0.02}                                                                                                                                   
{'eval_loss': 866.6826171875, 'eval_runtime': 85.0603, 'eval_samples_per_second': 11.756, 'eval_steps_per_second': 1.47, 'epoch': 0.02}                                                      
  0%|                                                                                                                                                                | 0/192 [00:01<?, ?it/s]

..... (stacktrace)
qlora.py:676 in <listcomp> 

│   673 │   │   │   │   │   │   logit_abcd = logit[label_non_zero_id-1][abcd_idx]                  │
│   674 │   │   │   │   │   │   preds.append(torch.argmax(logit_abcd).item())                      │
│   675 │   │   │   │   │   labels = labels[labels != IGNORE_INDEX].view(-1, 2)[:,0]               │
│ ❱ 676 │   │   │   │   │   refs += [abcd_idx.index(label) for label in labels.tolist()]           │
│   677 │   │   │   │   │                                                                          │
│   678 │   │   │   │   │   loss_mmlu += loss.item()                                               │
│   679 │   │   │   │   # Extract results by subject.                                              │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: 29879 is not in list

I am using the following command:

python3 qlora.py \
    --model_name_or_path "/home/LLaMA/HuggingFaceFormat/7B" \
    --optim paged_adamw_8bit\
    --output_dir ./output \
    --dataset alpaca \
    --do_train True \
    --do_eval True \
    --do_mmlu_eval True \
    --source_max_len 384 \
    --target_max_len 128 \
    --per_device_train_batch_size 4 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --logging_steps 10 \
    --max_steps 1000 \
    --save_strategy steps \
    --data_seed 42 \
    --save_steps 50 \
    --save_total_limit 10 \
    --evaluation_strategy steps \
    --eval_dataset_size 1024 \
    --max_eval_samples 1000 \
    --eval_steps 50 \

I have tried different models as well, e.g. https://huggingface.co/eachadea/vicuna-13b-1.1, but I get the same error.

OverflowError: out of range integral type conversion attempted while running python qlora.py

python qlora.py --model_name_or_path decapoda-research/llama-13b-hf

(I have updated tokenizer_config.json and config.json as per the various discussions here:
tokenizer_class: LlamaTokenizer and architectures: LlamaForCausalLM.)

==================================================================================

adding LoRA modules...
trainable params: 125173760.0 || all params: 6922327040 || trainable: 1.8082612866554193
loaded model

Using pad_token, but it is not set yet.
Traceback (most recent call last):
File "qlora.py", line 758, in
train()
File "qlora.py", line 620, in train
"unk_token": tokenizer.convert_ids_to_tokens(model.config.pad_token_id),
File "/home/envs/qlora_env/lib/python3.8/site-packages/transformers/tokenization_utils_fast.py", line 307, in convert_ids_to_tokens
return self._tokenizer.id_to_token(ids)
OverflowError: out of range integral type conversion attempted

LORA Merge fails in 4-bit mode

I trained a vicuna-13b-1.1 LoRA in 4-bit.

Now I am trying to merge it to run generations, but it fails with the following error:

python3.11/site-packages/peft/tuners/lora.py", line 352, in merge_and_unload
    raise ValueError("Cannot merge LORA layers when the model is loaded in 8-bit mode")
ValueError: Cannot merge LORA layers when the model is loaded in 8-bit mode

ValueError: Tokenizer class LLaMATokenizer does not exist or is not currently imported.

I followed the instructions.

CUDA SETUP: CUDA runtime path found: /usr/local/cuda-11.8/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 5.2
CUDA SETUP: Detected CUDA version 118
/home/developer/mambaforge/envs/Guanaco/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!
warn(msg)
CUDA SETUP: Loading binary /home/developer/mambaforge/envs/Guanaco/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118_nocublaslt.so...
loading base model decapoda-research/llama-7b-hf...
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 33/33 [00:51<00:00, 1.55s/it]
/home/developer/mambaforge/envs/Guanaco/lib/python3.10/site-packages/peft/utils/other.py:76: FutureWarning: prepare_model_for_int8_training is deprecated and will be removed in a future version. Use prepare_model_for_kbit_training instead.
warnings.warn(
adding LoRA modules...
trainable params: 159907840 || all params: 6898323456 || trainable: 2.3180681656919973
loaded model
Traceback (most recent call last):
File "/home/developer/qlora/qlora.py", line 763, in
train()
File "/home/developer/qlora/qlora.py", line 604, in train
tokenizer = AutoTokenizer.from_pretrained(
File "/home/developer/mambaforge/envs/Guanaco/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 691, in from_pretrained
raise ValueError(
ValueError: Tokenizer class LLaMATokenizer does not exist or is not currently imported.
(Guanaco) developer@ai:~/qlora$

Multi-gpu training example?

I am testing 4-bit QLoRA training on 33B LLaMA. Training runs fine on 1 GPU but fails with the following when using torchrun on 2 GPUs. I am referring to data-parallel training where each GPU holds a full copy of the model.

Anyone got multiple-gpu parallel training working yet?

WORLD_SIZE=2 torchrun --rdzv-endpoint=localhost:23456 --nproc_per_node=2
device_map = {"": "cuda:" + str(int(os.environ.get("LOCAL_RANK") or 0))}
 File "/root/miniconda3/lib/python3.10/site-packages/transformers-4.30.0.dev0-py3.10.egg/transformers/trainer.py", line 2804, in training_step
    loss.backward()
  File "/root/miniconda3/lib/python3.10/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/root/miniconda3/lib/python3.10/site-packages/torch/autograd/__init__.py", line 204, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/root/miniconda3/lib/python3.10/site-packages/torch/autograd/function.py", line 274, in apply
    return user_fn(self, *args)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 226, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/autograd/__init__.py", line 204, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared acro
ss multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example
, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready
 multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 557 has been marked as ready twice. This means that multiple autograd engine  hooks have fired for this particular parameter during this iteration. You can set the environment variable TORCH_DISTRIBUTED_DEBUG
 to either INFO or DETAIL to print parameter names for further debugging.
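
For data-parallel runs, one configuration that is often suggested (a sketch only, not guaranteed to resolve the error above) is to pin one full copy of the quantized model to each rank via device_map, as in the line above, and to disable DDP's unused-parameter scan, which interacts badly with gradient checkpointing. The checkpoint path is a placeholder:

import os
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments

local_rank = int(os.environ.get("LOCAL_RANK", 0))
device_map = {"": f"cuda:{local_rank}"}  # one full model copy per rank

model = AutoModelForCausalLM.from_pretrained(
    "path/to/llama-33b",  # placeholder checkpoint path
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_quant_type="nf4",
    ),
    device_map=device_map,
)

training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=1,
    gradient_checkpointing=True,
    ddp_find_unused_parameters=False,  # commonly suggested when combining DDP with gradient checkpointing
)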

OverflowError: out of range integral type conversion attempted

Description:

Hi, I'm trying to use the bitsandbytes library to finetune a 4-bit QLoRA model on Colab, but I encountered an error: OverflowError: out of range integral type conversion attempted. My code and the error trace are below.

My code:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q datasets
!git clone https://github.com/artidoro/qlora.git
%cd qlora
!pip install -r requirements.txt
!pip install bitsandbytes

!python qlora.py \
    --model_name_or_path decapoda-research/llama-7b-hf \
    --output_dir llama-7b-finetuned \
    --task_name chatbot \
    --dataset_name blended_skill_talk \
    --do_train \
    --do_eval \
    --num_train_epochs 3 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --learning_rate 1e-5

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

bin /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so
/usr/local/lib/python3.10/dist-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /usr/lib64-nvidia did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
warn(msg)
/usr/local/lib/python3.10/dist-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/sys/fs/cgroup/memory.events /var/colab/cgroup/jupyter-children/memory.events')}
warn(msg)
/usr/local/lib/python3.10/dist-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('8013'), PosixPath('//172.28.0.1'), PosixPath('http')}
warn(msg)
/usr/local/lib/python3.10/dist-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('//colab.research.google.com/tun/m/cc48301118ce562b961b3c22d803539adc1e0c19/gpu-t4-s-2m8e0braypyq7 --tunnel_background_save_delay=10s --tunnel_periodic_background_save_frequency=30m0s --enable_output_coalescing=true --output_coalescing_required=true'), PosixPath('--logtostderr --listen_host=172.28.0.12 --target_host=172.28.0.12 --tunnel_background_save_url=https')}
warn(msg)
/usr/local/lib/python3.10/dist-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/env/python')}
warn(msg)
/usr/local/lib/python3.10/dist-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('module'), PosixPath('//ipykernel.pylab.backend_inline')}
warn(msg)
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
/usr/local/lib/python3.10/dist-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0'), PosixPath('/usr/local/cuda/lib64/libcudart.so')}.. We'll flip a coin and try one of these, in order to fail forward.
Either way, this might cause trouble in the future:
If you get CUDA error: invalid device function errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so...
2023-05-27 11:21:37.135416: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
loading base model decapoda-research/llama-7b-hf...
Loading checkpoint shards: 100% 33/33 [01:38<00:00, 2.99s/it]
/usr/local/lib/python3.10/dist-packages/peft/utils/other.py:76: FutureWarning: prepare_model_for_int8_training is deprecated and will be removed in a future version. Use prepare_model_for_kbit_training instead.
warnings.warn(
adding LoRA modules...
trainable params: 79953920.0 || all params: 3660320768 || trainable: 2.184341894267557
loaded model
Downloading tokenizer.model: 100% 500k/500k [00:00<00:00, 12.3MB/s]
Downloading (…)cial_tokens_map.json: 100% 2.00/2.00 [00:00<00:00, 16.7kB/s]
Using pad_token, but it is not set yet.
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /content/qlora/qlora.py:758 in │
│ │
│ 755 │ │ │ fout.write(json.dumps(all_metrics)) │
│ 756 │
│ 757 if name == "main": │
│ ❱ 758 │ train() │
│ 759 │
│ │
│ /content/qlora/qlora.py:620 in train │
│ │
│ 617 │ │ │ { │
│ 618 │ │ │ │ "eos_token": tokenizer.convert_ids_to_tokens(model.con │
│ 619 │ │ │ │ "bos_token": tokenizer.convert_ids_to_tokens(model.con │
│ ❱ 620 │ │ │ │ "unk_token": tokenizer.convert_ids_to_tokens(model.con │
│ 621 │ │ │ } │
│ 622 │ │ ) │
│ 623 │
│ │
│ /usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_fast │
│ .py:307 in convert_ids_to_tokens │
│ │
│ 304 │ │ │ str or List[str]: The decoded token(s). │
│ 305 │ │ """ │
│ 306 │ │ if isinstance(ids, int): │
│ ❱ 307 │ │ │ return self._tokenizer.id_to_token(ids) │
│ 308 │ │ tokens = [] │
│ 309 │ │ for index in ids: │
│ 310 │ │ │ index = int(index) │
╰──────────────────────────────────────────────────────────────────────────────╯
OverflowError: out of range integral type conversion attempted

RecursionError: maximum recursion depth exceeded

I am getting a maximum recursion depth error after running the following command:
python qlora.py --model_name_or_path decapoda-research/llama-7b-hf

And this is the error I got:

  File "/home/atilla/miniconda3/envs/qlora/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 257, in _convert_token_to_id_with_added_voc
    return self.unk_token_id
  File "/home/atilla/miniconda3/envs/qlora/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1142, in unk_token_id
    return self.convert_tokens_to_ids(self.unk_token)
  File "/home/atilla/miniconda3/envs/qlora/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 250, in convert_tokens_to_ids
    return self._convert_token_to_id_with_added_voc(tokens)
  File "/home/atilla/miniconda3/envs/qlora/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 257, in _convert_token_to_id_with_added_voc
    return self.unk_token_id
  File "/home/atilla/miniconda3/envs/qlora/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1142, in unk_token_id
    return self.convert_tokens_to_ids(self.unk_token)
  File "/home/atilla/miniconda3/envs/qlora/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 250, in convert_tokens_to_ids
    return self._convert_token_to_id_with_added_voc(tokens)
  File "/home/atilla/miniconda3/envs/qlora/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 257, in _convert_token_to_id_with_added_voc
    return self.unk_token_id
  File "/home/atilla/miniconda3/envs/qlora/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1142, in unk_token_id
    return self.convert_tokens_to_ids(self.unk_token)
RecursionError: maximum recursion depth exceeded

TypeError: __init__() got an unexpected keyword argument 'load_in_4bit'

When I use the command below, I get an error:

python3 qlora.py --learning_rate 0.0001 --model_name_or_path <llama33b_model_path>

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/tiger/qlora/qlora/qlora.py:758 in │
│ │
│ 755 │ │ │ fout.write(json.dumps(all_metrics)) │
│ 756 │
│ 757 if name == "main": │
│ ❱ 758 │ train() │
│ 759 │
│ │
│ /home/tiger/qlora/qlora/qlora.py:590 in train │
│ │
│ 587 │ if completed_training: │
│ 588 │ │ print('Detected that training was already completed!') │
│ 589 │ │
│ ❱ 590 │ model = get_accelerate_model(args, checkpoint_dir) │
│ 591 │ training_args.skip_loading_checkpoint_weights=True │
│ 592 │ │
│ 593 │ model.config.use_cache = False │
│ │
│ /home/tiger/qlora/qlora/qlora.py:263 in get_accelerate_model │
│ │
│ 260 │ │
│ 261 │ print(f'loading base model {args.model_name_or_path}...') │
│ 262 │ compute_dtype = (torch.float16 if args.fp16 else (torch.bfloat16 if args.bf16 else t │
│ ❱ 263 │ model = AutoModelForCausalLM.from_pretrained( │
│ 264 │ │ args.model_name_or_path, │
│ 265 │ │ load_in_4bit=args.bits == 4, │
│ 266 │ │ load_in_8bit=args.bits == 8, │
│ │
│ /home/tiger/.local/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py:467 in │
│ from_pretrained │
│ │
│ 464 │ │ │ ) │
│ 465 │ │ elif type(config) in cls._model_mapping.keys(): │
│ 466 │ │ │ model_class = _get_model_class(config, cls._model_mapping) │
│ ❱ 467 │ │ │ return model_class.from_pretrained( │
│ 468 │ │ │ │ pretrained_model_name_or_path, *model_args, config=config, **hub_kwargs, │
│ 469 │ │ │ ) │
│ 470 │ │ raise ValueError( │
│ │
│ /home/tiger/.local/lib/python3.9/site-packages/transformers/modeling_utils.py:2611 in │
│ from_pretrained │
│ │
│ 2608 │ │ │ init_contexts.append(init_empty_weights()) │
│ 2609 │ │ │
│ 2610 │ │ with ContextManagers(init_contexts): │
│ ❱ 2611 │ │ │ model = cls(config, *model_args, **model_kwargs) │
│ 2612 │ │ │
│ 2613 │ │ # Check first if we are from_pt
│ 2614 │ │ if use_keep_in_fp32_modules: │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
TypeError: init() got an unexpected keyword argument 'load_in_4bit'

OverflowError: out of range integral type conversion attempted

Getting this error at line 620 on unk_token with this command:

qlora.py --model_name_or_path decapoda-research/llama-7b-hf --do_predict

│ ❱ 620 │ │ │ │ "unk_token": tokenizer.convert_ids_to_tokens(model.config.pad_token_id), │

/home/pwood/.local/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py:307 in │
│ convert_ids_to_tokens │
│ │
│ 304 │ │ │ str or List[str]: The decoded token(s). │
│ 305 │ │ """ │
│ 306 │ │ if isinstance(ids, int): │
│ ❱ 307 │ │ │ return self._tokenizer.id_to_token(ids) │
│ 308 │ │ tokens = [] │
│ 309 │ │ for index in ids: │
│ 310 │ │ │ index = int(index)

Can continue by commenting out line 620.
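
A hedged alternative to commenting the line out: guard the lookup so convert_ids_to_tokens is only called for a usable id, falling back to the tokenizer's own special tokens otherwise. This mirrors the add_special_tokens call shown in the tracebacks above, but it is a sketch, not a patch verified against the current qlora.py:

def token_or_default(tokenizer, token_id, default):
    # Only resolve ids that the fast tokenizer can actually map back to a token.
    if token_id is not None and 0 <= token_id < tokenizer.vocab_size:
        return tokenizer.convert_ids_to_tokens(token_id)
    return default

tokenizer.add_special_tokens({
    "eos_token": token_or_default(tokenizer, model.config.eos_token_id, tokenizer.eos_token or "</s>"),
    "bos_token": token_or_default(tokenizer, model.config.bos_token_id, tokenizer.bos_token or "<s>"),
    "unk_token": token_or_default(tokenizer, model.config.pad_token_id, tokenizer.unk_token or "<unk>"),
})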

Can't resume from checkpoint

I'm getting:

  File "/home/guest/ai/finetune/qlora/venv/lib/python3.10/site-packages/transformers/trainer.py", line 2159, in _load_from_checkpoint
  ValueError: Can't find a valid checkpoint at ./output/checkpoint-1000

Apparently there are several missing files from the checkpoint dir, like config.json.

Fine-tuning with unlabelled data? (Causal language modelling)

I'd like to fine-tune using unlabelled data, i.e. causal language modeling, for instance to adapt a model to a new domain or language.

Which parts of the training code need to be changed to use such a data source?

From what I can tell, it would probably be these:

  • DataCollatorForCausalLM (perhaps use DataCollatorForLanguageModeling from transformers)
  • make_data_module()
  • MMLUEvalCallback

Is that correct? Anything else?

Is there perhaps code from this or another repo that I can use?

Thanks!

Edit: Replaced "masked language modeling" with "causal language modeling".
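
Not an authoritative answer, but a minimal sketch of the plain causal-LM data path using stock transformers components; it replaces DataCollatorForCausalLM and make_data_module() rather than modifying them, and the corpus file and max length below are placeholders:

from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
tokenizer.pad_token = tokenizer.eos_token  # LLaMA tokenizers ship without a pad token

raw = load_dataset("text", data_files={"train": "domain_corpus.txt"})  # placeholder corpus

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_dataset = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

# mlm=False makes the collator copy input_ids into labels, which is what causal LM training needs.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

The resulting train_dataset and data_collator can then be handed to the Trainer in place of the supervised data module, and MMLUEvalCallback can simply be dropped for pure domain adaptation.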

Error invalid device ordinal at line 359 in file /mnt/c/Users/qwopq/Downloads/qlora/bitsandbytes/csrc/pythonInterface.c

I am using CUDA 11.8 and WSL2, with LLaMA 7B.
I installed a new version of bitsandbytes to use QLoRA, but it doesn't work.

(base) root@DESKTOP-DRKMHOU:/mnt/c/Users/qwopq/Downloads/qlora/qlora-main# bash scripts/finetune.sh

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /root/anaconda3/lib/python3.9/site-packages/bitsandbytes-0.39.0-py3.9.egg/bitsandbytes/libbitsandbytes_cuda117.so
CUDA SETUP: CUDA runtime path found: /root/anaconda3/lib/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /root/anaconda3/lib/python3.9/site-packages/bitsandbytes-0.39.0-py3.9.egg/bitsandbytes/libbitsandbytes_cuda117.so...
loading base model ./llama-7b...
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:22<00:00, 11.44s/it]
/root/anaconda3/lib/python3.9/site-packages/peft/utils/other.py:76: FutureWarning: prepare_model_for_int8_training is deprecated and will be removed in a future version. Use prepare_model_for_kbit_training instead.
  warnings.warn(
adding LoRA modules...
trainable params: 79953920.0 || all params: 3660320768 || trainable: 2.184341894267557
loaded model
Using pad_token, but it is not set yet.
Found cached dataset parquet (/root/.cache/huggingface/datasets/tatsu-lab___parquet/tatsu-lab--alpaca-2b32f0433506ef5f/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 947.44it/s]
Loading cached processed dataset at /root/.cache/huggingface/datasets/tatsu-lab___parquet/tatsu-lab--alpaca-2b32f0433506ef5f/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-95a00d56074ae5fa.arrow
Splitting train dataset in train and validation according to `eval_dataset_size`
Loading cached split indices for dataset at /root/.cache/huggingface/datasets/tatsu-lab___parquet/tatsu-lab--alpaca-2b32f0433506ef5f/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-aedf12585a19bef4.arrow and /root/.cache/huggingface/datasets/tatsu-lab___parquet/tatsu-lab--alpaca-2b32f0433506ef5f/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-2ff9ed40565a4b8c.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/tatsu-lab___parquet/tatsu-lab--alpaca-2b32f0433506ef5f/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-b83a3c430699603c.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/tatsu-lab___parquet/tatsu-lab--alpaca-2b32f0433506ef5f/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-e983acc2d14f99d2.arrow
Found cached dataset json (/root/.cache/huggingface/datasets/json/default-e1a2135500543221/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 1694.33it/s]
torch.float32 422326272 0.11537932153507864
torch.uint8 3238002688 0.8846206784649213
  0%|                                                                                                                                                    | 0/10000 [00:00<?, ?it/s]
Error invalid device ordinal at line 359 in file /mnt/c/Users/qwopq/Downloads/qlora/bitsandbytes/csrc/pythonInterface.c

Cannot resume from checkpoint because it is not detected as valid

I have problems resuming a checkpoint. What I did:

  1. python qlora.py --model_name_or_path huggyllama/llama-7b
  2. abort when a checkpoint has been written
  3. python qlora.py --model_name_or_path huggyllama/llama-7b

I expected fine-tuning to pick up where I aborted it, but instead I get the following error message:

...
torch.uint8 3238002688 0.8846206784649213
Traceback (most recent call last):
  File "/workspace/qlora/qlora.py", line 758, in <module>
    train()
  File "/workspace/qlora/qlora.py", line 720, in train
    train_result = trainer.train(resume_from_checkpoint=checkpoint_dir)
  File "/workspace/anaconda3/envs/qlora310/lib/python3.10/site-packages/transformers/trainer.py", line 1685, in train
    self._load_from_checkpoint(resume_from_checkpoint)
  File "/workspace/anaconda3/envs/qlora310/lib/python3.10/site-packages/transformers/trainer.py", line 2159, in _load_from_checkpoint
    raise ValueError(f"Can't find a valid checkpoint at {resume_from_checkpoint}")
ValueError: Can't find a valid checkpoint at ./output/checkpoint-500

Problems with bitsandbytes

Hey, when I tried to run the fine-tuning script I got the error '4 bit quantization requires bitsandbytes>=0.39.0 - please upgrade your bitsandbytes version'.
Since I need to install bitsandbytes from source due to CUDA compatibility issues, here is the mystery:
I can see the new version, 0.39.0, on PyPI (https://pypi.org/project/bitsandbytes/), but on the git repo (https://github.com/TimDettmers/bitsandbytes) the latest release is still 0.38.0, and the master branch does not seem to support what 4-bit requires.

So who built 0.39.0, and where can I find its source code?

Finetuned T5 checkpoints

Very exciting development - thanks for sharing your paper and this repo. Would it be possible for your team to release the T5 finetuned checkpoints (Super-NaturalInstructions), small to xxl? We can upload to HF hub.

Thank you.

The VRAM usage is more than 48GB.

The paper mentions that 48 GB of GPU memory is enough to train a 65B LLaMA:

We present QLORA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while
preserving full 16-bit finetuning task performance.

However, when using the following command to train a LLaMA 65B model, it actually consumed about 60 GB of VRAM:

python qlora.py 


4bit inference is slow

Thanks for the excellent work.
When I use 4-bit for inference, it is very slow, even slower than 8-bit inference.
Do you plan to address this? Thanks!

Trying to finetune guanaco-33b-merged with default params: some problems

loading base model /models/guanaco-33b-merged...
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████| 7/7 [01:12<00:00, 10.30s/it]
adding LoRA modules...

/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1614 in getattr
│ │
│ 1611 │ │ │ modules = self.dict['_modules'] │
│ 1612 │ │ │ if name in modules: │
│ 1613 │ │ │ │ return modules[name] │
│ ❱ 1614 │ │ raise AttributeError("'{}' object has no attribute '{}'".format( │
│ 1615 │ │ │ type(self).name, name)) │
│ 1616 │ │
│ 1617 │ def setattr(self, name: str, value: Union[Tensor, 'Module']) -> None: │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
AttributeError: 'CastOutputToFloat' object has no attribute 'weight'
root@5e0ba28fefc9:/wzh/qlora#
root@5e0ba28fefc9:/wzh/qlora# CUDA_VISIBLE_DEVICES=0 PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:24 sh scripts/finetune.sh

Getting error while trying to replicate

I was able to load the model and dataset successfully. However, when training starts, I get the error "self and mat2 must have the same dtype". Could you please help?

RuntimeError: self and mat2 must have the same dtype

---------pip package version
transformers-4.30.0.dev0
accelerate 0.20.0.dev0
bitsandbytes 0.39.0
peft-0.3.0.dev0

----------python cmd
python qlora.py --model_name_or_path /home/bmb/models/facebook/opt-125m

---------error
Traceback (most recent call last):
File "/home/bmb/projects/qlora/qlora.py", line 766, in
train()
File "/home/bmb/projects/qlora/qlora.py", line 728, in train
train_result = trainer.train(resume_from_checkpoint=checkpoint_dir)
File "/home/bmb/anaconda3/envs/qlora/lib/python3.10/site-packages/transformers/trainer.py", line 1696, in train
return inner_training_loop(
File "/home/bmb/anaconda3/envs/qlora/lib/python3.10/site-packages/transformers/trainer.py", line 1972, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/bmb/anaconda3/envs/qlora/lib/python3.10/site-packages/transformers/trainer.py", line 2786, in training_step
loss = self.compute_loss(model, inputs)
File "/home/bmb/anaconda3/envs/qlora/lib/python3.10/site-packages/transformers/trainer.py", line 2818, in compute_loss
outputs = model(**inputs)
File "/home/bmb/anaconda3/envs/qlora/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/bmb/anaconda3/envs/qlora/lib/python3.10/site-packages/peft/peft_model.py", line 678, in forward
return self.base_model(
File "/home/bmb/anaconda3/envs/qlora/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/bmb/anaconda3/envs/qlora/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/bmb/anaconda3/envs/qlora/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 945, in forward
outputs = self.model.decoder(
File "/home/bmb/anaconda3/envs/qlora/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/bmb/anaconda3/envs/qlora/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/bmb/anaconda3/envs/qlora/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 703, in forward
layer_outputs = torch.utils.checkpoint.checkpoint(
File "/home/bmb/anaconda3/envs/qlora/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
return CheckpointFunction.apply(function, preserve, *args)
File "/home/bmb/anaconda3/envs/qlora/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 107, in forward
outputs = run_function(*args)
File "/home/bmb/anaconda3/envs/qlora/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 699, in custom_forward
return module(*inputs, output_attentions, None)
File "/home/bmb/anaconda3/envs/qlora/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/bmb/anaconda3/envs/qlora/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/bmb/anaconda3/envs/qlora/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 331, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/bmb/anaconda3/envs/qlora/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/bmb/anaconda3/envs/qlora/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/bmb/anaconda3/envs/qlora/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 174, in forward
query_states = self.q_proj(hidden_states) * self.scaling
File "/home/bmb/anaconda3/envs/qlora/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/bmb/anaconda3/envs/qlora/lib/python3.10/site-packages/peft/tuners/lora.py", line 565, in forward
result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
RuntimeError: self and mat2 must have the same dtype

Which specific checkpoints are supported?

Are the original leaked FB checkpoints supported, or only some derivative ones? Per the qlora.py error message, the Meta checkpoints don't have a config.json.

If the latter, which repos would have them?

Thanks.

EleutherAI/gpt-j-6b not supported

It looks like EleutherAI/gpt-j-6b is not supported:

Env:

Running from docker:

FROM pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel

RUN apt-get update && apt-get install git -y

RUN pip install -q -U bitsandbytes
RUN pip install -q -U git+https://github.com/huggingface/transformers.git
RUN pip install -q -U git+https://github.com/huggingface/peft.git
RUN pip install -q -U git+https://github.com/huggingface/accelerate.git

WORKDIR /code

COPY qlora/requirements.txt qlora/requirements.txt

WORKDIR /code/qlora

RUN pip install -q -r requirements.txt

Cmd:

python qlora.py \
    --model_name_or_path EleutherAI/gpt-j-6b \
    --output_dir /output \
    --dataset alpaca \
    --do_train True \
    --do_eval True \
    --do_mmlu_eval True \
    --source_max_len 384 \
    --target_max_len 128 \
    --per_device_train_batch_size 4 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --logging_steps 10 \
    --max_steps 10000 \
    --save_strategy steps \
    --data_seed 42 \
    --save_steps 1000 \
    --save_total_limit 40 \
    --evaluation_strategy steps \
    --eval_dataset_size 1024 \
    --max_eval_samples 1000 \
    --eval_steps 1000 \
    --optim paged_adamw_32bit

Stacktrace:

CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.0
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /opt/conda/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117_nocublaslt.so...
loading base model EleutherAI/gpt-j-6b...
[...]
Traceback (most recent call last):
  File "/code/qlora/qlora.py", line 758, in <module>
    train()
  File "/code/qlora/qlora.py", line 720, in train
    train_result = trainer.train(resume_from_checkpoint=checkpoint_dir)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1696, in train
    return inner_training_loop(
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1973, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2787, in training_step
    loss = self.compute_loss(model, inputs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2819, in compute_loss
    outputs = model(**inputs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/peft/peft_model.py", line 686, in forward
    return self.base_model(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/gptj/modeling_gptj.py", line 870, in forward
    torch.cuda.set_device(self.transformer.first_device)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1614, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'GPTJModel' object has no attribute 'first_device'

guanaco-7B-demo-colab.ipynb breaks with 4bit

Uncommenting load_in_4bit in the colab demo causes an error upon calling m.merge_and_unload()

│ /usr/local/lib/python3.10/dist-packages/peft/tuners/lora.py:352 in merge_and_unload              │
│                                                                                                  │
│   349 │   │   │   raise ValueError("GPT2 models are not supported for merging LORA layers")      │
│   350 │   │                                                                                      │
│   351 │   │   if getattr(self.model, "is_loaded_in_8bit", False) or getattr(self.model, "is_lo   │
│ ❱ 352 │   │   │   raise ValueError("Cannot merge LORA layers when the model is loaded in 8-bit   │
│   353 │   │                                                                                      │
│   354 │   │   key_list = [key for key, _ in self.model.named_modules() if "lora" not in key]     │
│   355 │   │   for key in key_list:                                                               │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: Cannot merge LORA layers when the model is loaded in 8-bit mode

pip install -q -U bitsandbytes

ERROR: Exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/pip/_vendor/urllib3/response.py", line 438, in _error_catcher
yield
File "/usr/local/lib/python3.10/site-packages/pip/_vendor/urllib3/response.py", line 561, in read
data = self._fp_read(amt) if not fp_closed else b""
File "/usr/local/lib/python3.10/site-packages/pip/_vendor/urllib3/response.py", line 527, in _fp_read
return self._fp.read(amt) if amt is not None else self._fp.read()
File "/usr/local/lib/python3.10/site-packages/pip/_vendor/cachecontrol/filewrapper.py", line 90, in read
data = self.__fp.read(amt)
File "/usr/local/lib/python3.10/http/client.py", line 466, in read
s = self.fp.read(amt)
File "/usr/local/lib/python3.10/socket.py", line 705, in readinto
return self._sock.recv_into(b)
File "/usr/local/lib/python3.10/ssl.py", line 1274, in recv_into
return self.read(nbytes, buffer)
File "/usr/local/lib/python3.10/ssl.py", line 1130, in read
return self._sslobj.read(len, buffer)
TimeoutError: The read operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/pip/_internal/cli/base_command.py", line 169, in exc_logging_wrapper
status = run_func(*args)
File "/usr/local/lib/python3.10/site-packages/pip/_internal/cli/req_command.py", line 248, in wrapper
return func(self, options, args)
File "/usr/local/lib/python3.10/site-packages/pip/_internal/commands/install.py", line 377, in run
requirement_set = resolver.resolve(
File "/usr/local/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/resolver.py", line 92, in resolve
result = self._result = resolver.resolve(
File "/usr/local/lib/python3.10/site-packages/pip/_vendor/resolvelib/resolvers.py", line 546, in resolve
state = resolution.resolve(requirements, max_rounds=max_rounds)
File "/usr/local/lib/python3.10/site-packages/pip/_vendor/resolvelib/resolvers.py", line 397, in resolve
self._add_to_criteria(self.state.criteria, r, parent=None)
File "/usr/local/lib/python3.10/site-packages/pip/_vendor/resolvelib/resolvers.py", line 173, in _add_to_criteria
if not criterion.candidates:
File "/usr/local/lib/python3.10/site-packages/pip/_vendor/resolvelib/structs.py", line 156, in bool
return bool(self._sequence)
File "/usr/local/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/found_candidates.py", line 155, in bool
return any(self)
File "/usr/local/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/found_candidates.py", line 143, in
return (c for c in iterator if id(c) not in self._incompatible_ids)
File "/usr/local/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/found_candidates.py", line 97, in _iter_built_with_inserted
candidate = func()
File "/usr/local/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/factory.py", line 206, in _make_candidate_from_link
self._link_candidate_cache[link] = LinkCandidate(
File "/usr/local/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 293, in init
super().init(
File "/usr/local/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 156, in init
self.dist = self._prepare()
File "/usr/local/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 225, in _prepare
dist = self._prepare_distribution()
File "/usr/local/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 304, in _prepare_distribution
return preparer.prepare_linked_requirement(self._ireq, parallel_builds=True)
File "/usr/local/lib/python3.10/site-packages/pip/_internal/operations/prepare.py", line 516, in prepare_linked_requirement
return self._prepare_linked_requirement(req, parallel_builds)
File "/usr/local/lib/python3.10/site-packages/pip/_internal/operations/prepare.py", line 587, in _prepare_linked_requirement
local_file = unpack_url(
File "/usr/local/lib/python3.10/site-packages/pip/_internal/operations/prepare.py", line 166, in unpack_url
file = get_http_url(
File "/usr/local/lib/python3.10/site-packages/pip/_internal/operations/prepare.py", line 107, in get_http_url
from_path, content_type = download(link, temp_dir.path)
File "/usr/local/lib/python3.10/site-packages/pip/_internal/network/download.py", line 147, in call
for chunk in chunks:
File "/usr/local/lib/python3.10/site-packages/pip/_internal/network/utils.py", line 63, in response_chunks
for chunk in response.raw.stream(
File "/usr/local/lib/python3.10/site-packages/pip/_vendor/urllib3/response.py", line 622, in stream
data = self.read(amt=amt, decode_content=decode_content)
File "/usr/local/lib/python3.10/site-packages/pip/_vendor/urllib3/response.py", line 560, in read
with self._error_catcher():
File "/usr/local/lib/python3.10/contextlib.py", line 153, in exit
self.gen.throw(typ, value, traceback)
File "/usr/local/lib/python3.10/site-packages/pip/_vendor/urllib3/response.py", line 443, in _error_catcher
raise ReadTimeoutError(self._pool, None, "Read timed out.")
pip._vendor.urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out.

finetune.py 65b on A6000 48GB crashes with OOM

python qlora.py \
    --model_name_or_path /home/nap/llm_models/llamaOG-65B-hf/ \
    --output_dir ./output \
    --dataset alpaca \
    --do_train True \
    --do_eval True \
    --do_mmlu_eval True \
    --source_max_len 384 \
    --target_max_len 128 \
    --per_device_train_batch_size 4 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --logging_steps 10 \
    --max_steps 10000 \
    --save_strategy steps \
    --data_seed 42 \
    --save_steps 1000 \
    --save_total_limit 40 \
    --evaluation_strategy steps \
    --eval_dataset_size 1024 \
    --max_eval_samples 1000 \
    --eval_steps 1000 \
    --optim paged_adamw_32bit \
    --learning_rate 0.0001 \

CUDA_VISIBLE_DEVICES=1 ./finetune.sh

error:

  File "/home/nap/Documents/githubs/qlora/.qlora/lib/python3.11/site-packages/bitsandbytes/autograd/_functions.py", line 512, in forward
    output = torch.nn.functional.linear(A, F.dequantize_fp4(B, state).to(A.dtype).t(), bias)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 688.00 MiB (GPU 0; 47.54 GiB total capacity; 45.09 GiB already allocated; 79.12 MiB free; 46.45 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
  0%|                                                                                                                                 | 0/10000 [00:18<?, ?it/s] 

Do I need to change the batch_size or something? Thanks!
