qlora's People

Contributors

abhilash1910, alpindale, artidoro, birch-san, bubundas17, dameikle, ffohturk, kkcorps, muelletm, pmysl, qubitium, ranchlai, steremma, timdettmers, tobi

qlora's Issues

How do you process oasst1 to get 9,209 examples?

Great work! In your paper you say: "In our experiments, we only use the top reply at each level in the conversation tree. This limits the dataset to 9,209 examples." Could you please tell me how you process the data? I get 10,364 examples from 2023-04-12_oasst_ready.trees.jsonl, so I don't see where 9,209 comes from.
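
For what it's worth, here is a minimal sketch of the kind of filtering the paper describes, i.e. keeping only the best-ranked reply at every level of each conversation tree. The field names ("prompt", "replies", "rank") are assumptions about the trees.jsonl schema rather than anything taken from this repo, so treat it as an illustration only:

import json

def count_top_reply_examples(path):
    # Walk each exported tree, keeping only the best-ranked reply at every level.
    n_examples = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            node = json.loads(line).get("prompt")  # assumed root key
            while node and node.get("replies"):
                # Assumed: lower "rank" is better; unranked replies sort last.
                best = min(
                    node["replies"],
                    key=lambda r: r["rank"] if r.get("rank") is not None else float("inf"),
                )
                n_examples += 1  # one (context, reply) pair per level
                node = best
    return n_examples

print(count_top_reply_examples("2023-04-12_oasst_ready.trees.jsonl"))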

undefined symbol: cquantize_blockwise_fp16_fp4

AttributeError Traceback (most recent call last)
Cell In[6], line 1
----> 1 model = LlamaForCausalLM.from_pretrained("../hf_llama", device_map="auto", torch_dtype=torch.float16, load_in_4bit=True )

File ~/anaconda3/envs/qlora/lib/python3.11/site-packages/transformers/modeling_utils.py:2829, in PreTrainedModel.from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
2819 if dtype_orig is not None:
2820 torch.set_default_dtype(dtype_orig)
2822 (
2823 model,
2824 missing_keys,
2825 unexpected_keys,
2826 mismatched_keys,
2827 offload_index,
2828 error_msgs,
-> 2829 ) = cls._load_pretrained_model(
2830 model,
2831 state_dict,
2832 loaded_state_dict_keys, # XXX: rename?
2833 resolved_archive_file,
2834 pretrained_model_name_or_path,
2835 ignore_mismatched_sizes=ignore_mismatched_sizes,
2836 sharded_metadata=sharded_metadata,
2837 _fast_init=_fast_init,
2838 low_cpu_mem_usage=low_cpu_mem_usage,
...
--> 394 func = self._FuncPtr((name_or_ordinal, self))
395 if not isinstance(name_or_ordinal, int):
396 func.name = name_or_ordinal

AttributeError: /home/server/anaconda3/envs/qlora/lib/python3.11/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cquantize_blockwise_fp16_fp4

Failed to load bloomz 7Bmt

Hi thank you very much for this great library! I am really excited about it!

I tried to fine-tune bloomz 7B with the 4-bit LoRA on Alpaca by running: python qlora.py --model_name_or_path my_bloomz_path. However, the job was killed halfway through loading the base model. My server has 40+ GB of RAM and a 3090 GPU, which should be large enough to load a 7B model.
It seems that some process during the 4-bit quantization drains all the memory. Can you please give me some advice on how to reduce the memory footprint so that the base model can be loaded?
Thank you very much !

Cannot merge LORA layers when the model is loaded in 8-bit mode

When I load the model as follows, it throws the error: Cannot merge LORA layers when the model is loaded in 8-bit mode.
How can I load the model in 4-bit for inference?
model_path = 'decapoda-research/llama-30b-hf'
adapter_path = 'timdettmers/guanaco-33b'

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type='nf4',
)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    low_cpu_mem_usage=True,
    load_in_4bit=True,
    quantization_config=quantization_config,
    torch_dtype=torch.float16,
    device_map='auto',
)
model = PeftModel.from_pretrained(model, adapter_path)
model = model.merge_and_unload()
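
As a stopgap until merging quantized models is supported, one workaround sketch is to skip merge_and_unload() entirely and generate through the PeftModel with the adapter still attached (this reuses model and model_path from the snippet above; the prompt format is only an example):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_path)
prompt = "### Human: Why is the sky blue?### Assistant:"
inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))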

Environment for running the code

Could the authors share the requirements and general environment for running this code? I am also hitting a few other issues and am currently trying to infer the right versions of the libraries.

guanaco-13b Model fails on Google Colab Free tier T4 GPU

Starting to load the model decapoda-research/llama-13b-hf into memory
Loading checkpoint shards: 90%
37/41 [04:09<00:27, 6.78s/it]
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ in <cell line: 14>:14 │
│ │
│ /usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py:472 in │
│ from_pretrained │
│ │
│ 469 │ │ │ ) │
│ 470 │ │ elif type(config) in cls._model_mapping.keys(): │
│ 471 │ │ │ model_class = _get_model_class(config, cls._model_mapping) │
│ ❱ 472 │ │ │ return model_class.from_pretrained( │
│ 473 │ │ │ │ pretrained_model_name_or_path, *model_args, config=config, **hub_kwargs, │
│ 474 │ │ │ ) │
│ 475 │ │ raise ValueError( │
│ │
│ /usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py:2829 in from_pretrained │
│ │
│ 2826 │ │ │ │ mismatched_keys, │
│ 2827 │ │ │ │ offload_index, │
│ 2828 │ │ │ │ error_msgs, │
│ ❱ 2829 │ │ │ ) = cls._load_pretrained_model( │
│ 2830 │ │ │ │ model, │
│ 2831 │ │ │ │ state_dict, │
│ 2832 │ │ │ │ loaded_state_dict_keys, # XXX: rename? │
│ │
│ /usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py:3172 in │
│ _load_pretrained_model │
│ │
│ 3169 │ │ │ │ ) │
│ 3170 │ │ │ │ │
│ 3171 │ │ │ │ if low_cpu_mem_usage: │
│ ❱ 3172 │ │ │ │ │ new_error_msgs, offload_index, state_dict_index = _load_state_dict_i │
│ 3173 │ │ │ │ │ │ model_to_load, │
│ 3174 │ │ │ │ │ │ state_dict, │
│ 3175 │ │ │ │ │ │ loaded_keys, │
│ │
│ /usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py:718 in │
│ _load_state_dict_into_meta_model │
│ │
│ 715 │ │ │ │ fp16_statistics = None │
│ 716 │ │ │ │
│ 717 │ │ │ if "SCB" not in param_name: │
│ ❱ 718 │ │ │ │ set_module_quantized_tensor_to_device( │
│ 719 │ │ │ │ │ model, param_name, param_device, value=param, fp16_statistics=fp16_s │
│ 720 │ │ │ │ ) │
│ 721 │
│ │
│ /usr/local/lib/python3.10/dist-packages/transformers/utils/bitsandbytes.py:88 in │
│ set_module_quantized_tensor_to_device │
│ │
│ 85 │ │ │ if is_8bit: │
│ 86 │ │ │ │ new_value = bnb.nn.Int8Params(new_value, requires_grad=False, **kwargs). │
│ 87 │ │ │ elif is_4bit: │
│ ❱ 88 │ │ │ │ new_value = bnb.nn.Params4bit(new_value, requires_grad=False, **kwargs). │
│ 89 │ │ │ │
│ 90 │ │ │ module._parameters[tensor_name] = new_value │
│ 91 │ │ │ if fp16_statistics is not None: │
│ │
│ /usr/local/lib/python3.10/dist-packages/bitsandbytes/nn/modules.py:176 in to │
│ │
│ 173 │ │ device, dtype, non_blocking, convert_to_format = torch._C._nn._parse_to(*args, * │
│ 174 │ │ │
│ 175 │ │ if (device is not None and device.type == "cuda" and self.data.device.type == "c │
│ ❱ 176 │ │ │ return self.cuda(device) │
│ 177 │ │ else: │
│ 178 │ │ │ s = self.quant_state │
│ 179 │ │ │ if s is not None: │
│ │
│ /usr/local/lib/python3.10/dist-packages/bitsandbytes/nn/modules.py:154 in cuda │
│ │
│ 151 │ │
│ 152 │ def cuda(self, device): │
│ 153 │ │ w = self.data.contiguous().half().cuda(device) │
│ ❱ 154 │ │ w_4bit, quant_state = bnb.functional.quantize_4bit(w, blocksize=self.blocksize, │
│ 155 │ │ self.data = w_4bit │
│ 156 │ │ self.quant_state = quant_state │
│ 157 │
│ │
│ /usr/local/lib/python3.10/dist-packages/bitsandbytes/functional.py:760 in quantize_4bit │
│ │
│ 757 │ │
│ 758 │ │
│ 759 │ if out is None: │
│ ❱ 760 │ │ out = torch.zeros(((n+1)//2, 1), dtype=torch.uint8, device=A.device) │
│ 761 │ │
│ 762 │ assert blocksize in [4096, 2048, 1024, 512, 256, 128, 64] │
│ 763 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
OutOfMemoryError: CUDA out of memory. Tried to allocate 34.00 MiB (GPU 0; 14.75 GiB total capacity; 13.85 GiB
already allocated; 832.00 KiB free; 13.88 GiB reserved in total by PyTorch) If reserved memory is >> allocated
memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and
PYTORCH_CUDA_ALLOC_CONF

[Question] Reason behind removing `lm_head` in `modules`

Hello,

Thank you for the amazing repo. I was curious about this code below.

qlora/qlora.py

Lines 221 to 222 in e381744

if 'lm_head' in lora_module_names: # needed for 16-bit
lora_module_names.remove('lm_head')

Why is lm_head removed? What does "needed for 16-bit" mean? Does it mean that targeting this module in fp16 is incorrect?

Fine-tuning Guanaco 65B...is it the same as in your fine-tuning notebook?

Thanks so much for making the fine-tuning notebook, super helpful!

I'm curious, if I want to fine-tune from Guanaco instead of the original LLaMA, are there any changes I'd need to make? E.g. would it go through this stage:

from peft import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

Or would anything else be different?
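
Not an official answer, but a sketch of what that path could look like when starting from a merged Guanaco checkpoint instead of raw LLaMA: the kbit-preparation step stays the same, and a fresh LoRA is attached on top. The checkpoint path and LoRA hyperparameters below are placeholders, not values taken from this repo:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = AutoModelForCausalLM.from_pretrained(
    "path/to/guanaco-merged",  # placeholder: merged Guanaco weights
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    device_map="auto",
)

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)  # same step as with the original LLaMA

lora_config = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.05,  # illustrative values only
    target_modules=["q_proj", "v_proj"],     # illustrative; qlora.py targets all linear layers
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)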

Compatibility with Deepspeed, Fairscale, or Torch zero-redundancy optimizer

Wonderful work!

May I ask about compatibility with ZeRO mechanisms, e.g. the Torch zero-redundancy optimizer, DeepSpeed ZeRO-1 through ZeRO-3, and Fairscale FSDP? I ask because I noticed that QLoRA relies on a specially implemented optimizer.

If the optimizer is not compatible with the tools mentioned above, can I use just the 4-bit tuning and LoRA with a ZeRO mechanism? Will this cost more memory?

Thanks very much!

Best

Training gets killed after eval due to ValueError

I have tried multiple LLaMA-based models on both my 4080 card and a Colab A100 GPU.

In both cases, training gets killed after eval with the following error:

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 1296.74it/s]
torch.float32 422326272 0.11537932153507864
torch.uint8 3238002688 0.8846206784649213
{'loss': 2.3365, 'learning_rate': 0.0002, 'epoch': 0.0}                                                                                                                                      
{'loss': 15.4705, 'learning_rate': 0.0002, 'epoch': 0.01}                                                                                                                                    
{'loss': 38.2581, 'learning_rate': 0.0002, 'epoch': 0.01}                                                                                                                                    
{'loss': 159.034, 'learning_rate': 0.0002, 'epoch': 0.01}                                                                                                                                    
{'loss': 777.5871, 'learning_rate': 0.0002, 'epoch': 0.02}                                                                                                                                   
{'eval_loss': 866.6826171875, 'eval_runtime': 85.0603, 'eval_samples_per_second': 11.756, 'eval_steps_per_second': 1.47, 'epoch': 0.02}                                                      
  0%|                                                                                                                                                                | 0/192 [00:01<?, ?it/s]

..... (stacktrace)
qlora.py:676 in <listcomp> 

│   673 │   │   │   │   │   │   logit_abcd = logit[label_non_zero_id-1][abcd_idx]                  │
│   674 │   │   │   │   │   │   preds.append(torch.argmax(logit_abcd).item())                      │
│   675 │   │   │   │   │   labels = labels[labels != IGNORE_INDEX].view(-1, 2)[:,0]               │
│ ❱ 676 │   │   │   │   │   refs += [abcd_idx.index(label) for label in labels.tolist()]           │
│   677 │   │   │   │   │                                                                          │
│   678 │   │   │   │   │   loss_mmlu += loss.item()                                               │
│   679 │   │   │   │   # Extract results by subject.                                              │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: 29879 is not in list

I am using the following command:

python3 qlora.py \
    --model_name_or_path "/home/LLaMA/HuggingFaceFormat/7B" \
    --optim paged_adamw_8bit\
    --output_dir ./output \
    --dataset alpaca \
    --do_train True \
    --do_eval True \
    --do_mmlu_eval True \
    --source_max_len 384 \
    --target_max_len 128 \
    --per_device_train_batch_size 4 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --logging_steps 10 \
    --max_steps 1000 \
    --save_strategy steps \
    --data_seed 42 \
    --save_steps 50 \
    --save_total_limit 10 \
    --evaluation_strategy steps \
    --eval_dataset_size 1024 \
    --max_eval_samples 1000 \
    --eval_steps 50 \

I have tried different models as well, e.g. https://huggingface.co/eachadea/vicuna-13b-1.1, but I get the same error.

OverflowError: out of range integral type conversion attempted while running python qlora.py

python qlora.py --model_name_or_path decapoda-research/llama-13b-hf

(I have updated tokenizer_config.json and config.json as per the various discussions here:
tokenizer_class: LlamaTokenizer and architectures: LlamaForCausalLM.)

==================================================================================

adding LoRA modules...
trainable params: 125173760.0 || all params: 6922327040 || trainable: 1.8082612866554193
loaded model

Using pad_token, but it is not set yet.
Traceback (most recent call last):
File "qlora.py", line 758, in
train()
File "qlora.py", line 620, in train
"unk_token": tokenizer.convert_ids_to_tokens(model.config.pad_token_id),
File "/home/envs/qlora_env/lib/python3.8/site-packages/transformers/tokenization_utils_fast.py", line 307, in convert_ids_to_tokens
return self._tokenizer.id_to_token(ids)
OverflowError: out of range integral type conversion attempted

LORA Merge fails in 4-bit mode

I trained a vicuna-13b-1.1 LoRA in 4-bit.

Now I am trying to merge it to run generations, but it fails with the following error:

python3.11/site-packages/peft/tuners/lora.py", line 352, in merge_and_unload
    raise ValueError("Cannot merge LORA layers when the model is loaded in 8-bit mode")
ValueError: Cannot merge LORA layers when the model is loaded in 8-bit mode

ValueError: Tokenizer class LLaMATokenizer does not exist or is not currently imported.

I followed the instructions.

CUDA SETUP: CUDA runtime path found: /usr/local/cuda-11.8/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 5.2
CUDA SETUP: Detected CUDA version 118
/home/developer/mambaforge/envs/Guanaco/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!
warn(msg)
CUDA SETUP: Loading binary /home/developer/mambaforge/envs/Guanaco/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118_nocublaslt.so...
loading base model decapoda-research/llama-7b-hf...
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 33/33 [00:51<00:00, 1.55s/it]
/home/developer/mambaforge/envs/Guanaco/lib/python3.10/site-packages/peft/utils/other.py:76: FutureWarning: prepare_model_for_int8_training is deprecated and will be removed in a future version. Use prepare_model_for_kbit_training instead.
warnings.warn(
adding LoRA modules...
trainable params: 159907840 || all params: 6898323456 || trainable: 2.3180681656919973
loaded model
Traceback (most recent call last):
File "/home/developer/qlora/qlora.py", line 763, in
train()
File "/home/developer/qlora/qlora.py", line 604, in train
tokenizer = AutoTokenizer.from_pretrained(
File "/home/developer/mambaforge/envs/Guanaco/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 691, in from_pretrained
raise ValueError(
ValueError: Tokenizer class LLaMATokenizer does not exist or is not currently imported.
(Guanaco) developer@ai:~/qlora$

Multi-gpu training example?

I am testing 4-bit QLoRA training on 33B LLaMA. Training runs fine on 1 GPU but fails with the following when using torchrun on 2 GPUs. I am referring to data-parallel training where each GPU holds a full copy of the model.

Anyone got multiple-gpu parallel training working yet?

WORLD_SIZE=2 torchrun --rdzv-endpoint=localhost:23456 --nproc_per_node=2
device_map = {"": "cuda:" + str(int(os.environ.get("LOCAL_RANK") or 0))}
 File "/root/miniconda3/lib/python3.10/site-packages/transformers-4.30.0.dev0-py3.10.egg/transformers/trainer.py", line 2804, in training_step
    loss.backward()
  File "/root/miniconda3/lib/python3.10/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/root/miniconda3/lib/python3.10/site-packages/torch/autograd/__init__.py", line 204, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/root/miniconda3/lib/python3.10/site-packages/torch/autograd/function.py", line 274, in apply
    return user_fn(self, *args)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 226, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/autograd/__init__.py", line 204, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared acro
ss multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example
, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready
 multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 557 has been marked as ready twice. This means that multiple autograd engine  hooks have fired for this particular parameter during this iteration. You can set the environment variable TORCH_DISTRIBUTED_DEBUG
 to either INFO or DETAIL to print parameter names for further debugging.
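
For data-parallel runs, one configuration that is often suggested (a sketch only, not guaranteed to resolve the error above) is to pin one full copy of the quantized model to each rank via device_map, as in the line above, and to disable DDP's unused-parameter scan, which interacts badly with gradient checkpointing. The checkpoint path is a placeholder:

import os
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments

local_rank = int(os.environ.get("LOCAL_RANK", 0))
device_map = {"": f"cuda:{local_rank}"}  # one full model copy per rank

model = AutoModelForCausalLM.from_pretrained(
    "path/to/llama-33b",  # placeholder checkpoint path
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_quant_type="nf4",
    ),
    device_map=device_map,
)

training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=1,
    gradient_checkpointing=True,
    ddp_find_unused_parameters=False,  # commonly suggested when combining DDP with gradient checkpointing
)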

OverflowError: out of range integral type conversion attempted

Description:

Hi, I'm trying to use the bitsandbytes library to finetune a 4-bit QLoRA model on Colab, but I encountered an error: OverflowError: out of range integral type conversion attempted. My code and the error trace are below.

My code:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q datasets
!git clone https://github.com/artidoro/qlora.git
%cd qlora
!pip install -r requirements.txt
!pip install bitsandbytes

!python qlora.py \
    --model_name_or_path decapoda-research/llama-7b-hf \
    --output_dir llama-7b-finetuned \
    --task_name chatbot \
    --dataset_name blended_skill_talk \
    --do_train \
    --do_eval \
    --num_train_epochs 3 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --learning_rate 1e-5

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

bin /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so
/usr/local/lib/python3.10/dist-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /usr/lib64-nvidia did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
warn(msg)
/usr/local/lib/python3.10/dist-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/sys/fs/cgroup/memory.events /var/colab/cgroup/jupyter-children/memory.events')}
warn(msg)
/usr/local/lib/python3.10/dist-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('8013'), PosixPath('//172.28.0.1'), PosixPath('http')}
warn(msg)
/usr/local/lib/python3.10/dist-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('//colab.research.google.com/tun/m/cc48301118ce562b961b3c22d803539adc1e0c19/gpu-t4-s-2m8e0braypyq7 --tunnel_background_save_delay=10s --tunnel_periodic_background_save_frequency=30m0s --enable_output_coalescing=true --output_coalescing_required=true'), PosixPath('--logtostderr --listen_host=172.28.0.12 --target_host=172.28.0.12 --tunnel_background_save_url=https')}
warn(msg)
/usr/local/lib/python3.10/dist-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/env/python')}
warn(msg)
/usr/local/lib/python3.10/dist-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('module'), PosixPath('//ipykernel.pylab.backend_inline')}
warn(msg)
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
/usr/local/lib/python3.10/dist-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0'), PosixPath('/usr/local/cuda/lib64/libcudart.so')}.. We'll flip a coin and try one of these, in order to fail forward.
Either way, this might cause trouble in the future:
If you get CUDA error: invalid device function errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so...
2023-05-27 11:21:37.135416: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
loading base model decapoda-research/llama-7b-hf...
Loading checkpoint shards: 100% 33/33 [01:38<00:00, 2.99s/it]
/usr/local/lib/python3.10/dist-packages/peft/utils/other.py:76: FutureWarning: prepare_model_for_int8_training is deprecated and will be removed in a future version. Use prepare_model_for_kbit_training instead.
warnings.warn(
adding LoRA modules...
trainable params: 79953920.0 || all params: 3660320768 || trainable: 2.184341894267557
loaded model
Downloading tokenizer.model: 100% 500k/500k [00:00<00:00, 12.3MB/s]
Downloading (…)cial_tokens_map.json: 100% 2.00/2.00 [00:00<00:00, 16.7kB/s]
Using pad_token, but it is not set yet.
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /content/qlora/qlora.py:758 in │
│ │
│ 755 │ │ │ fout.write(json.dumps(all_metrics)) │
│ 756 │
│ 757 if name == "main": │
│ ❱ 758 │ train() │
│ 759 │
│ │
│ /content/qlora/qlora.py:620 in train │
│ │
│ 617 │ │ │ { │
│ 618 │ │ │ │ "eos_token": tokenizer.convert_ids_to_tokens(model.con │
│ 619 │ │ │ │ "bos_token": tokenizer.convert_ids_to_tokens(model.con │
│ ❱ 620 │ │ │ │ "unk_token": tokenizer.convert_ids_to_tokens(model.con │
│ 621 │ │ │ } │
│ 622 │ │ ) │
│ 623 │
│ │
│ /usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_fast │
│ .py:307 in convert_ids_to_tokens │
│ │
│ 304 │ │ │ str or List[str]: The decoded token(s). │
│ 305 │ │ """ │
│ 306 │ │ if isinstance(ids, int): │
│ ❱ 307 │ │ │ return self._tokenizer.id_to_token(ids) │
│ 308 │ │ tokens = [] │
│ 309 │ │ for index in ids: │
│ 310 │ │ │ index = int(index) │
╰──────────────────────────────────────────────────────────────────────────────╯
OverflowError: out of range integral type conversion attempted

RecursionError: maximum recursion depth exceeded

I am getting a maximum recursion depth error after running the following command:
python qlora.py --model_name_or_path decapoda-research/llama-7b-hf

And this is the error I got:

  File "/home/atilla/miniconda3/envs/qlora/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 257, in _convert_token_to_id_with_added_voc
    return self.unk_token_id
  File "/home/atilla/miniconda3/envs/qlora/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1142, in unk_token_id
    return self.convert_tokens_to_ids(self.unk_token)
  File "/home/atilla/miniconda3/envs/qlora/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 250, in convert_tokens_to_ids
    return self._convert_token_to_id_with_added_voc(tokens)
  File "/home/atilla/miniconda3/envs/qlora/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 257, in _convert_token_to_id_with_added_voc
    return self.unk_token_id
  File "/home/atilla/miniconda3/envs/qlora/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1142, in unk_token_id
    return self.convert_tokens_to_ids(self.unk_token)
  File "/home/atilla/miniconda3/envs/qlora/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 250, in convert_tokens_to_ids
    return self._convert_token_to_id_with_added_voc(tokens)
  File "/home/atilla/miniconda3/envs/qlora/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 257, in _convert_token_to_id_with_added_voc
    return self.unk_token_id
  File "/home/atilla/miniconda3/envs/qlora/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1142, in unk_token_id
    return self.convert_tokens_to_ids(self.unk_token)
RecursionError: maximum recursion depth exceeded

TypeError: __init__() got an unexpected keyword argument 'load_in_4bit'

When I use the command below, I get an error:

python3 qlora.py --learning_rate 0.0001 --model_name_or_path <llama33b_model_path>

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/tiger/qlora/qlora/qlora.py:758 in │
│ │
│ 755 │ │ │ fout.write(json.dumps(all_metrics)) │
│ 756 │
│ 757 if name == "main": │
│ ❱ 758 │ train() │
│ 759 │
│ │
│ /home/tiger/qlora/qlora/qlora.py:590 in train │
│ │
│ 587 │ if completed_training: │
│ 588 │ │ print('Detected that training was already completed!') │
│ 589 │ │
│ ❱ 590 │ model = get_accelerate_model(args, checkpoint_dir) │
│ 591 │ training_args.skip_loading_checkpoint_weights=True │
│ 592 │ │
│ 593 │ model.config.use_cache = False │
│ │
│ /home/tiger/qlora/qlora/qlora.py:263 in get_accelerate_model │
│ │
│ 260 │ │
│ 261 │ print(f'loading base model {args.model_name_or_path}...') │
│ 262 │ compute_dtype = (torch.float16 if args.fp16 else (torch.bfloat16 if args.bf16 else t │
│ ❱ 263 │ model = AutoModelForCausalLM.from_pretrained( │
│ 264 │ │ args.model_name_or_path, │
│ 265 │ │ load_in_4bit=args.bits == 4, │
│ 266 │ │ load_in_8bit=args.bits == 8, │
│ │
│ /home/tiger/.local/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py:467 in │
│ from_pretrained │
│ │
│ 464 │ │ │ ) │
│ 465 │ │ elif type(config) in cls._model_mapping.keys(): │
│ 466 │ │ │ model_class = _get_model_class(config, cls._model_mapping) │
│ ❱ 467 │ │ │ return model_class.from_pretrained( │
│ 468 │ │ │ │ pretrained_model_name_or_path, *model_args, config=config, **hub_kwargs, │
│ 469 │ │ │ ) │
│ 470 │ │ raise ValueError( │
│ │
│ /home/tiger/.local/lib/python3.9/site-packages/transformers/modeling_utils.py:2611 in │
│ from_pretrained │
│ │
│ 2608 │ │ │ init_contexts.append(init_empty_weights()) │
│ 2609 │ │ │
│ 2610 │ │ with ContextManagers(init_contexts): │
│ ❱ 2611 │ │ │ model = cls(config, *model_args, **model_kwargs) │
│ 2612 │ │ │
│ 2613 │ │ # Check first if we are from_pt
│ 2614 │ │ if use_keep_in_fp32_modules: │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
TypeError: init() got an unexpected keyword argument 'load_in_4bit'

OverflowError: out of range integral type conversion attempted

Getting this error at line 620 on unk_token with this command:

qlora.py --model_name_or_path decapoda-research/llama-7b-hf --do_predict

│ ❱ 620 │ │ │ │ "unk_token": tokenizer.convert_ids_to_tokens(model.config.pad_token_id), │

/home/pwood/.local/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py:307 in │
│ convert_ids_to_tokens │
│ │
│ 304 │ │ │ str or List[str]: The decoded token(s). │
│ 305 │ │ """ │
│ 306 │ │ if isinstance(ids, int): │
│ ❱ 307 │ │ │ return self._tokenizer.id_to_token(ids) │
│ 308 │ │ tokens = [] │
│ 309 │ │ for index in ids: │
│ 310 │ │ │ index = int(index)

Can continue by commenting out line 620.
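
A hedged alternative to commenting the line out: guard the lookup so convert_ids_to_tokens is only called for a usable id, falling back to the tokenizer's own special tokens otherwise. This mirrors the add_special_tokens call shown in the tracebacks above, but it is a sketch, not a patch verified against the current qlora.py:

def token_or_default(tokenizer, token_id, default):
    # Only resolve ids that the fast tokenizer can actually map back to a token.
    if token_id is not None and 0 <= token_id < tokenizer.vocab_size:
        return tokenizer.convert_ids_to_tokens(token_id)
    return default

tokenizer.add_special_tokens({
    "eos_token": token_or_default(tokenizer, model.config.eos_token_id, tokenizer.eos_token or "</s>"),
    "bos_token": token_or_default(tokenizer, model.config.bos_token_id, tokenizer.bos_token or "<s>"),
    "unk_token": token_or_default(tokenizer, model.config.pad_token_id, tokenizer.unk_token or "<unk>"),
})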

Can't resume from checkpoint

I'm getting:

  File "/home/guest/ai/finetune/qlora/venv/lib/python3.10/site-packages/transformers/trainer.py", line 2159, in _load_from_checkpoint
  ValueError: Can't find a valid checkpoint at ./output/checkpoint-1000

Apparently there are several missing files from the checkpoint dir, like config.json.

Fine-tuning with unlabelled data? (Causal language modelling)

I'd like to fine-tune using unlabelled data, i.e. causal language modeling, for instance to adapt a model to a new domain or language.

Which parts of the training code need to be changed to use such a data source?

From what I can tell, it would probably be these:

  • DataCollatorForCausalLM (perhaps use DataCollatorForLanguageModeling from transformers)
  • make_data_module()
  • MMLUEvalCallback

Is that correct? Anything else?

Is there perhaps code from this or another repo that I can use?

Thanks!

Edit: Replaced "masked language modeling" with "causal language modeling".
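
Not an authoritative answer, but a minimal sketch of the plain causal-LM data path using stock transformers components; it replaces DataCollatorForCausalLM and make_data_module() rather than modifying them, and the corpus file and max length below are placeholders:

from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
tokenizer.pad_token = tokenizer.eos_token  # LLaMA tokenizers ship without a pad token

raw = load_dataset("text", data_files={"train": "domain_corpus.txt"})  # placeholder corpus

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_dataset = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

# mlm=False makes the collator copy input_ids into labels, which is what causal LM training needs.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

The resulting train_dataset and data_collator can then be handed to the Trainer in place of the supervised data module, and MMLUEvalCallback can simply be dropped for pure domain adaptation.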

Error invalid device ordinal at line 359 in file /mnt/c/Users/qwopq/Downloads/qlora/bitsandbytes/csrc/pythonInterface.c

I am using CUDA 11.8 and WSL2, with LLaMA 7B.
I installed a new version of bitsandbytes to use QLoRA, but it doesn't work.

(base) root@DESKTOP-DRKMHOU:/mnt/c/Users/qwopq/Downloads/qlora/qlora-main# bash scripts/finetune.sh

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /root/anaconda3/lib/python3.9/site-packages/bitsandbytes-0.39.0-py3.9.egg/bitsandbytes/libbitsandbytes_cuda117.so
CUDA SETUP: CUDA runtime path found: /root/anaconda3/lib/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /root/anaconda3/lib/python3.9/site-packages/bitsandbytes-0.39.0-py3.9.egg/bitsandbytes/libbitsandbytes_cuda117.so...
loading base model ./llama-7b...
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:22<00:00, 11.44s/it]
/root/anaconda3/lib/python3.9/site-packages/peft/utils/other.py:76: FutureWarning: prepare_model_for_int8_training is deprecated and will be removed in a future version. Use prepare_model_for_kbit_training instead.
  warnings.warn(
adding LoRA modules...
trainable params: 79953920.0 || all params: 3660320768 || trainable: 2.184341894267557
loaded model
Using pad_token, but it is not set yet.
Found cached dataset parquet (/root/.cache/huggingface/datasets/tatsu-lab___parquet/tatsu-lab--alpaca-2b32f0433506ef5f/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 947.44it/s]
Loading cached processed dataset at /root/.cache/huggingface/datasets/tatsu-lab___parquet/tatsu-lab--alpaca-2b32f0433506ef5f/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-95a00d56074ae5fa.arrow
Splitting train dataset in train and validation according to `eval_dataset_size`
Loading cached split indices for dataset at /root/.cache/huggingface/datasets/tatsu-lab___parquet/tatsu-lab--alpaca-2b32f0433506ef5f/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-aedf12585a19bef4.arrow and /root/.cache/huggingface/datasets/tatsu-lab___parquet/tatsu-lab--alpaca-2b32f0433506ef5f/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-2ff9ed40565a4b8c.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/tatsu-lab___parquet/tatsu-lab--alpaca-2b32f0433506ef5f/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-b83a3c430699603c.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/tatsu-lab___parquet/tatsu-lab--alpaca-2b32f0433506ef5f/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-e983acc2d14f99d2.arrow
Found cached dataset json (/root/.cache/huggingface/datasets/json/default-e1a2135500543221/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 1694.33it/s]
torch.float32 422326272 0.11537932153507864
torch.uint8 3238002688 0.8846206784649213
  0%|                                                                                                                                                    | 0/10000 [00:00<?, ?it/s]
Error invalid device ordinal at line 359 in file /mnt/c/Users/qwopq/Downloads/qlora/bitsandbytes/csrc/pythonInterface.c

Cannot resume from checkpoint because it is not detected as valid

I have problems resuming a checkpoint. What I did:

  1. python qlora.py --model_name_or_path huggyllama/llama-7b
  2. abort when a checkpoint has been written
  3. python qlora.py --model_name_or_path huggyllama/llama-7b

I expected fine-tuning to pick up where I aborted it, but instead I get the following error message:

...
torch.uint8 3238002688 0.8846206784649213
Traceback (most recent call last):
  File "/workspace/qlora/qlora.py", line 758, in <module>
    train()
  File "/workspace/qlora/qlora.py", line 720, in train
    train_result = trainer.train(resume_from_checkpoint=checkpoint_dir)
  File "/workspace/anaconda3/envs/qlora310/lib/python3.10/site-packages/transformers/trainer.py", line 1685, in train
    self._load_from_checkpoint(resume_from_checkpoint)
  File "/workspace/anaconda3/envs/qlora310/lib/python3.10/site-packages/transformers/trainer.py", line 2159, in _load_from_checkpoint
    raise ValueError(f"Can't find a valid checkpoint at {resume_from_checkpoint}")
ValueError: Can't find a valid checkpoint at ./output/checkpoint-500

Problems with bitsandbytes

Hey, when I tried to run the fine-tuning script I got the error '4 bit quantization requires bitsandbytes>=0.39.0 - please upgrade your bitsandbytes version'.
Since I need to install bitsandbytes from source due to CUDA compatibility issues, here is the mystery:
I can see the new version, 0.39.0, on PyPI (https://pypi.org/project/bitsandbytes/), but on the git repo (https://github.com/TimDettmers/bitsandbytes) the latest release is still 0.38.0, and the master branch does not seem to support what 4-bit requires.

So who built 0.39.0, and where can I find its source code?

Finetuned T5 checkpoints

Very exciting development - thanks for sharing your paper and this repo. Would it be possible for your team to release the T5 finetuned checkpoints (Super-NaturalInstructions), small to xxl? We can upload to HF hub.

Thank you.

The VRAM usage is more than 48GB.

The paper mentions that 48 GB of GPU memory is enough to train a 65B LLaMA:

We present QLORA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while
preserving full 16-bit finetuning task performance.

However, when using the following command to train a LLaMA 65B model, it actually consumed about 60 GB of VRAM:

python qlora.py 


4bit inference is slow

Thanks for the excellent work.
When I use 4-bit for inference, it is very slow, even slower than 8-bit inference.
Do you plan to address this? Thanks!

Trying to finetune guanaco-33b-merged with default params: some problems

loading base model /models/guanaco-33b-merged...
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████| 7/7 [01:12<00:00, 10.30s/it]
adding LoRA modules...

/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1614 in getattr
│ │
│ 1611 │ │ │ modules = self.dict['_modules'] │
│ 1612 │ │ │ if name in modules: │
│ 1613 │ │ │ │ return modules[name] │
│ ❱ 1614 │ │ raise AttributeError("'{}' object has no attribute '{}'".format( │
│ 1615 │ │ │ type(self).name, name)) │
│ 1616 │ │
│ 1617 │ def setattr(self, name: str, value: Union[Tensor, 'Module']) -> None: │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
AttributeError: 'CastOutputToFloat' object has no attribute 'weight'
root@5e0ba28fefc9:/wzh/qlora#
root@5e0ba28fefc9:/wzh/qlora# CUDA_VISIBLE_DEVICES=0 PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:24 sh scripts/finetune.sh

Getting error while trying to replicate

I was able to load the model and dataset successfully. However, when training starts, I get the error "self and mat2 must have the same dtype". Could you please help?

RuntimeError: self and mat2 must have the same dtype

---------pip package version
transformers-4.30.0.dev0
accelerate 0.20.0.dev0
bitsandbytes 0.39.0
peft-0.3.0.dev0

----------python cmd
python qlora.py --model_name_or_path /home/bmb/models/facebook/opt-125m

---------error
Traceback (most recent call last):
File "/home/bmb/projects/qlora/qlora.py", line 766, in
train()
File "/home/bmb/projects/qlora/qlora.py", line 728, in train
train_result = trainer.train(resume_from_checkpoint=checkpoint_dir)
File "/home/bmb/anaconda3/envs/qlora/lib/python3.10/site-packages/transformers/trainer.py", line 1696, in train
return inner_training_loop(
File "/home/bmb/anaconda3/envs/qlora/lib/python3.10/site-packages/transformers/trainer.py", line 1972, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/bmb/anaconda3/envs/qlora/lib/python3.10/site-packages/transformers/trainer.py", line 2786, in training_step
loss = self.compute_loss(model, inputs)
File "/home/bmb/anaconda3/envs/qlora/lib/python3.10/site-packages/transformers/trainer.py", line 2818, in compute_loss
outputs = model(**inputs)
File "/home/bmb/anaconda3/envs/qlora/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/bmb/anaconda3/envs/qlora/lib/python3.10/site-packages/peft/peft_model.py", line 678, in forward
return self.base_model(
File "/home/bmb/anaconda3/envs/qlora/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/bmb/anaconda3/envs/qlora/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/bmb/anaconda3/envs/qlora/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 945, in forward
outputs = self.model.decoder(
File "/home/bmb/anaconda3/envs/qlora/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/bmb/anaconda3/envs/qlora/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/bmb/anaconda3/envs/qlora/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 703, in forward
layer_outputs = torch.utils.checkpoint.checkpoint(
File "/home/bmb/anaconda3/envs/qlora/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
return CheckpointFunction.apply(function, preserve, *args)
File "/home/bmb/anaconda3/envs/qlora/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 107, in forward
outputs = run_function(*args)
File "/home/bmb/anaconda3/envs/qlora/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 699, in custom_forward
return module(*inputs, output_attentions, None)
File "/home/bmb/anaconda3/envs/qlora/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/bmb/anaconda3/envs/qlora/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/bmb/anaconda3/envs/qlora/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 331, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/bmb/anaconda3/envs/qlora/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/bmb/anaconda3/envs/qlora/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/bmb/anaconda3/envs/qlora/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 174, in forward
query_states = self.q_proj(hidden_states) * self.scaling
File "/home/bmb/anaconda3/envs/qlora/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/bmb/anaconda3/envs/qlora/lib/python3.10/site-packages/peft/tuners/lora.py", line 565, in forward
result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
RuntimeError: self and mat2 must have the same dtype

Which specific checkpoints are supported?

Are the original leaked FB checkpoints supported, or only some derivative ones? Per the qlora.py error message, the Meta checkpoints don't have a config.json.

If the latter, which repos would have them?

Thanks.

EleutherAI/gpt-j-6b not supported

It looks like EleutherAI/gpt-j-6b is not supported:

Env:

Running from docker:

FROM pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel

RUN apt-get update && apt-get install git -y

RUN pip install -q -U bitsandbytes
RUN pip install -q -U git+https://github.com/huggingface/transformers.git
RUN pip install -q -U git+https://github.com/huggingface/peft.git
RUN pip install -q -U git+https://github.com/huggingface/accelerate.git

WORKDIR /code

COPY qlora/requirements.txt qlora/requirements.txt

WORKDIR /code/qlora

RUN pip install -q -r requirements.txt

Cmd:

python qlora.py \
    --model_name_or_path EleutherAI/gpt-j-6b \
    --output_dir /output \
    --dataset alpaca \
    --do_train True \
    --do_eval True \
    --do_mmlu_eval True \
    --source_max_len 384 \
    --target_max_len 128 \
    --per_device_train_batch_size 4 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --logging_steps 10 \
    --max_steps 10000 \
    --save_strategy steps \
    --data_seed 42 \
    --save_steps 1000 \
    --save_total_limit 40 \
    --evaluation_strategy steps \
    --eval_dataset_size 1024 \
    --max_eval_samples 1000 \
    --eval_steps 1000 \
    --optim paged_adamw_32bit

Stacktrace:

CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.0
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /opt/conda/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117_nocublaslt.so...
loading base model EleutherAI/gpt-j-6b...
[...]
Traceback (most recent call last):
  File "/code/qlora/qlora.py", line 758, in <module>
    train()
  File "/code/qlora/qlora.py", line 720, in train
    train_result = trainer.train(resume_from_checkpoint=checkpoint_dir)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1696, in train
    return inner_training_loop(
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1973, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2787, in training_step
    loss = self.compute_loss(model, inputs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2819, in compute_loss
    outputs = model(**inputs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/peft/peft_model.py", line 686, in forward
    return self.base_model(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/gptj/modeling_gptj.py", line 870, in forward
    torch.cuda.set_device(self.transformer.first_device)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1614, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'GPTJModel' object has no attribute 'first_device'

guanaco-7B-demo-colab.ipynb breaks with 4bit

Uncommenting load_in_4bit in the colab demo causes an error upon calling m.merge_and_unload()

│ /usr/local/lib/python3.10/dist-packages/peft/tuners/lora.py:352 in merge_and_unload              │
│                                                                                                  │
│   349 │   │   │   raise ValueError("GPT2 models are not supported for merging LORA layers")      │
│   350 │   │                                                                                      │
│   351 │   │   if getattr(self.model, "is_loaded_in_8bit", False) or getattr(self.model, "is_lo   │
│ ❱ 352 │   │   │   raise ValueError("Cannot merge LORA layers when the model is loaded in 8-bit   │
│   353 │   │                                                                                      │
│   354 │   │   key_list = [key for key, _ in self.model.named_modules() if "lora" not in key]     │
│   355 │   │   for key in key_list:                                                               │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: Cannot merge LORA layers when the model is loaded in 8-bit mode

pip install -q -U bitsandbytes

ERROR: Exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/pip/_vendor/urllib3/response.py", line 438, in _error_catcher
yield
File "/usr/local/lib/python3.10/site-packages/pip/_vendor/urllib3/response.py", line 561, in read
data = self._fp_read(amt) if not fp_closed else b""
File "/usr/local/lib/python3.10/site-packages/pip/_vendor/urllib3/response.py", line 527, in _fp_read
return self._fp.read(amt) if amt is not None else self._fp.read()
File "/usr/local/lib/python3.10/site-packages/pip/_vendor/cachecontrol/filewrapper.py", line 90, in read
data = self.__fp.read(amt)
File "/usr/local/lib/python3.10/http/client.py", line 466, in read
s = self.fp.read(amt)
File "/usr/local/lib/python3.10/socket.py", line 705, in readinto
return self._sock.recv_into(b)
File "/usr/local/lib/python3.10/ssl.py", line 1274, in recv_into
return self.read(nbytes, buffer)
File "/usr/local/lib/python3.10/ssl.py", line 1130, in read
return self._sslobj.read(len, buffer)
TimeoutError: The read operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/pip/_internal/cli/base_command.py", line 169, in exc_logging_wrapper
status = run_func(*args)
File "/usr/local/lib/python3.10/site-packages/pip/_internal/cli/req_command.py", line 248, in wrapper
return func(self, options, args)
File "/usr/local/lib/python3.10/site-packages/pip/_internal/commands/install.py", line 377, in run
requirement_set = resolver.resolve(
File "/usr/local/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/resolver.py", line 92, in resolve
result = self._result = resolver.resolve(
File "/usr/local/lib/python3.10/site-packages/pip/_vendor/resolvelib/resolvers.py", line 546, in resolve
state = resolution.resolve(requirements, max_rounds=max_rounds)
File "/usr/local/lib/python3.10/site-packages/pip/_vendor/resolvelib/resolvers.py", line 397, in resolve
self._add_to_criteria(self.state.criteria, r, parent=None)
File "/usr/local/lib/python3.10/site-packages/pip/_vendor/resolvelib/resolvers.py", line 173, in _add_to_criteria
if not criterion.candidates:
File "/usr/local/lib/python3.10/site-packages/pip/_vendor/resolvelib/structs.py", line 156, in bool
return bool(self._sequence)
File "/usr/local/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/found_candidates.py", line 155, in bool
return any(self)
File "/usr/local/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/found_candidates.py", line 143, in
return (c for c in iterator if id(c) not in self._incompatible_ids)
File "/usr/local/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/found_candidates.py", line 97, in _iter_built_with_inserted
candidate = func()
File "/usr/local/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/factory.py", line 206, in _make_candidate_from_link
self._link_candidate_cache[link] = LinkCandidate(
File "/usr/local/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 293, in init
super().init(
File "/usr/local/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 156, in init
self.dist = self._prepare()
File "/usr/local/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 225, in _prepare
dist = self._prepare_distribution()
File "/usr/local/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 304, in _prepare_distribution
return preparer.prepare_linked_requirement(self._ireq, parallel_builds=True)
File "/usr/local/lib/python3.10/site-packages/pip/_internal/operations/prepare.py", line 516, in prepare_linked_requirement
return self._prepare_linked_requirement(req, parallel_builds)
File "/usr/local/lib/python3.10/site-packages/pip/_internal/operations/prepare.py", line 587, in _prepare_linked_requirement
local_file = unpack_url(
File "/usr/local/lib/python3.10/site-packages/pip/_internal/operations/prepare.py", line 166, in unpack_url
file = get_http_url(
File "/usr/local/lib/python3.10/site-packages/pip/_internal/operations/prepare.py", line 107, in get_http_url
from_path, content_type = download(link, temp_dir.path)
File "/usr/local/lib/python3.10/site-packages/pip/_internal/network/download.py", line 147, in call
for chunk in chunks:
File "/usr/local/lib/python3.10/site-packages/pip/_internal/network/utils.py", line 63, in response_chunks
for chunk in response.raw.stream(
File "/usr/local/lib/python3.10/site-packages/pip/_vendor/urllib3/response.py", line 622, in stream
data = self.read(amt=amt, decode_content=decode_content)
File "/usr/local/lib/python3.10/site-packages/pip/_vendor/urllib3/response.py", line 560, in read
with self._error_catcher():
File "/usr/local/lib/python3.10/contextlib.py", line 153, in exit
self.gen.throw(typ, value, traceback)
File "/usr/local/lib/python3.10/site-packages/pip/_vendor/urllib3/response.py", line 443, in _error_catcher
raise ReadTimeoutError(self._pool, None, "Read timed out.")
pip._vendor.urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out.

finetune.py 65b on A6000 48GB crashes with OOM

python qlora.py \
    --model_name_or_path /home/nap/llm_models/llamaOG-65B-hf/ \
    --output_dir ./output \
    --dataset alpaca \
    --do_train True \
    --do_eval True \
    --do_mmlu_eval True \
    --source_max_len 384 \
    --target_max_len 128 \
    --per_device_train_batch_size 4 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --logging_steps 10 \
    --max_steps 10000 \
    --save_strategy steps \
    --data_seed 42 \
    --save_steps 1000 \
    --save_total_limit 40 \
    --evaluation_strategy steps \
    --eval_dataset_size 1024 \
    --max_eval_samples 1000 \
    --eval_steps 1000 \
    --optim paged_adamw_32bit \
    --learning_rate 0.0001 \

CUDA_VISIBLE_DEVICES=1 ./finetune.sh

error:

  File "/home/nap/Documents/githubs/qlora/.qlora/lib/python3.11/site-packages/bitsandbytes/autograd/_functions.py", line 512, in forward
    output = torch.nn.functional.linear(A, F.dequantize_fp4(B, state).to(A.dtype).t(), bias)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 688.00 MiB (GPU 0; 47.54 GiB total capacity; 45.09 GiB already allocated; 79.12 MiB free; 46.45 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
  0%|                                                                                                                                 | 0/10000 [00:18<?, ?it/s] 

Do I need to change the batch_size or something? Thanks!
