
litgpt's Introduction

⚡ LitGPT

20+ high-performance LLMs with recipes to pretrain, finetune, deploy at scale.

✅ From scratch implementations     ✅ No abstractions    ✅ Beginner friendly   
✅ Flash attention                  ✅ FSDP               ✅ LoRA, QLoRA, Adapter
✅ Reduce GPU memory (fp4/8/16/32)  ✅ 1-1000+ GPUs/TPUs  ✅ 20+ LLMs            


Lightning AI • Quick start • Models • Finetune • Deploy • All workflows • Features • Recipes (YAML) • Tutorials

 

Get started

 

Use, finetune, pretrain, deploy LLMs Lightning fast ⚡⚡

Every LLM is implemented from scratch with no abstractions and full control, making them blazing fast, minimal, and performant at enterprise scale.

Enterprise ready - Apache 2.0 for unlimited enterprise use.
Developer friendly - Easy debugging with no abstraction layers and single file implementations.
Optimized performance - Models designed to maximize performance, reduce costs, and speed up training.
Proven recipes - Highly-optimized training/finetuning recipes tested at enterprise scale.

 

Quick start

Install LitGPT

pip install 'litgpt[all]'

Load and use any of the 20+ LLMs:

from litgpt import LLM

llm = LLM.load("microsoft/phi-2")
text = llm.generate("Fix the spelling: Every fall, the familly goes to the mountains.")
print(text)
# Corrected Sentence: Every fall, the family goes to the mountains.       
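
The generate call also accepts common text-generation controls. The snippet below is a minimal sketch; the max_new_tokens, temperature, and stream arguments are assumptions about the Python API (check the Python API docs), not guaranteed signatures:

from litgpt import LLM

llm = LLM.load("microsoft/phi-2")

# Plain generation with sampling controls (argument names assumed; see the API docs)
text = llm.generate("What do Llamas eat?", max_new_tokens=64, temperature=0.8)
print(text)

# Streaming generation, printing text as it is produced
# (assumes `stream=True` returns an iterator of text chunks)
for chunk in llm.generate("What do Llamas eat?", stream=True, max_new_tokens=64):
    print(chunk, end="", flush=True)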

 

✅ Optimized for fast inference
✅ Quantization
✅ Runs on low-memory GPUs
✅ No layers of internal abstractions
✅ Optimized for production scale

Advanced install options

Install from source:

git clone https://github.com/Lightning-AI/litgpt
cd litgpt
pip install -e '.[all]'

Explore the full Python API docs.

 


Choose from 20+ LLMs

Every model is written from scratch to maximize performance and remove layers of abstraction:

Model | Model size | Author | Reference
Llama 3 & 3.1 | 8B, 70B, 405B | Meta AI | Meta AI 2024
Code Llama | 7B, 13B, 34B, 70B | Meta AI | Rozière et al. 2023
Mixtral MoE | 8x7B | Mistral AI | Mistral AI 2023
Mistral | 7B | Mistral AI | Mistral AI 2023
CodeGemma | 7B | Google | Google Team, Google Deepmind
Gemma 2 | 2B, 9B, 27B | Google | Google Team, Google Deepmind
Phi 3 | 3.8B | Microsoft | Abdin et al. 2024
... | ... | ... | ...
See full list of 20+ LLMs

 

All models

Model | Model size | Author | Reference
CodeGemma | 7B | Google | Google Team, Google Deepmind
Code Llama | 7B, 13B, 34B, 70B | Meta AI | Rozière et al. 2023
Danube2 | 1.8B | H2O.ai | H2O.ai
Dolly | 3B, 7B, 12B | Databricks | Conover et al. 2023
Falcon | 7B, 40B, 180B | TII UAE | TII 2023
FreeWilly2 (Stable Beluga 2) | 70B | Stability AI | Stability AI 2023
Function Calling Llama 2 | 7B | Trelis | Trelis et al. 2023
Gemma | 2B, 7B | Google | Google Team, Google Deepmind
Gemma 2 | 9B, 27B | Google | Google Team, Google Deepmind
Llama 2 | 7B, 13B, 70B | Meta AI | Touvron et al. 2023
Llama 3.1 | 8B, 70B | Meta AI | Meta AI 2024
LongChat | 7B, 13B | LMSYS | LongChat Team 2023
Mathstral | 7B | Mistral AI | Mistral AI 2024
MicroLlama | 300M | Ken Wang | MicroLlama repo
Mixtral MoE | 8x7B | Mistral AI | Mistral AI 2023
Mistral | 7B | Mistral AI | Mistral AI 2023
Nous-Hermes | 7B, 13B, 70B | NousResearch | Org page
OpenLLaMA | 3B, 7B, 13B | OpenLM Research | Geng & Liu 2023
Phi 1.5 & 2 | 1.3B, 2.7B | Microsoft Research | Li et al. 2023
Phi 3 | 3.8B | Microsoft Research | Abdin et al. 2024
Platypus | 7B, 13B, 70B | Lee et al. | Lee, Hunter, and Ruiz 2023
Pythia | {14,31,70,160,410}M, {1,1.4,2.8,6.9,12}B | EleutherAI | Biderman et al. 2023
RedPajama-INCITE | 3B, 7B | Together | Together 2023
StableCode | 3B | Stability AI | Stability AI 2023
StableLM | 3B, 7B | Stability AI | Stability AI 2023
StableLM Zephyr | 3B | Stability AI | Stability AI 2023
TinyLlama | 1.1B | Zhang et al. | Zhang et al. 2023
Vicuna | 7B, 13B, 33B | LMSYS | Li et al. 2023

Tip: You can list all available models by running the litgpt download list command.

 


Workflows

Finetune • Pretrain • Continued pretraining • Evaluate • Deploy • Test

 

Use the command line interface to run advanced workflows such as pretraining or finetuning on your own data.

All workflows

After installing LitGPT, select the model and workflow to run (finetune, pretrain, evaluate, deploy, etc...):

# litgpt [action] [model]
litgpt  serve     meta-llama/Meta-Llama-3.1-8B-Instruct
litgpt  finetune  meta-llama/Meta-Llama-3.1-8B-Instruct
litgpt  pretrain  meta-llama/Meta-Llama-3.1-8B-Instruct
litgpt  chat      meta-llama/Meta-Llama-3.1-8B-Instruct
litgpt  evaluate  meta-llama/Meta-Llama-3.1-8B-Instruct

 


Finetune an LLM

 

Finetuning is the process of taking a pretrained AI model and further training it on a smaller, specialized dataset tailored to a specific task or application.

 

# 0) setup your dataset
curl -L https://huggingface.co/datasets/ksaw008/finance_alpaca/resolve/main/finance_alpaca.json -o my_custom_dataset.json

# 1) Finetune a model (auto downloads weights)
litgpt finetune microsoft/phi-2 \
  --data JSON \
  --data.json_path my_custom_dataset.json \
  --data.val_split_fraction 0.1 \
  --out_dir out/custom-model

# 2) Test the model
litgpt chat out/custom-model/final

# 3) Deploy the model
litgpt serve out/custom-model/final
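
The finetuned checkpoint can also be used from the Python API. A minimal sketch, assuming LLM.load accepts a local checkpoint directory such as the out_dir written above (in addition to hub model IDs):

from litgpt import LLM

# Local path produced by the finetuning run above (out_dir + "/final")
llm = LLM.load("out/custom-model/final")
print(llm.generate("What is a mortgage rate?"))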

Read the full finetuning docs

 


Deploy an LLM

 

Deploy a pretrained or finetuned LLM to use it in real-world applications. Deploying automatically sets up a web server that can be accessed by a website or app.

# deploy an out-of-the-box LLM
litgpt serve microsoft/phi-2

# deploy your own trained model
litgpt serve path/to/microsoft/phi-2/checkpoint
Show code to query server:

 

Test the server in a separate terminal and integrate the model API into your AI product:

# 3) Use the server (in a separate Python session)
import requests, json
response = requests.post(
    "http://127.0.0.1:8000/predict",
    json={"prompt": "Fix typos in the following sentence: Exampel input"}
)
print(response.json()["output"])
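
For application code, the same request can be wrapped in a small helper. This sketch only reuses what the snippet above shows (the /predict route, the "prompt" field, and the "output" key); the local URL matches the address used above:

import requests

def query_llm(prompt: str, url: str = "http://127.0.0.1:8000/predict", timeout: float = 60.0) -> str:
    """Send a prompt to a running `litgpt serve` instance and return the generated text."""
    response = requests.post(url, json={"prompt": prompt}, timeout=timeout)
    response.raise_for_status()  # surface HTTP errors instead of failing silently
    return response.json()["output"]

print(query_llm("Fix typos in the following sentence: Exampel input"))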

Read the full deploy docs.

 


Evaluate an LLM

Evaluate an LLM to test its performance on various tasks and see how well it understands and generates text. Simply put, we can evaluate how well it would do on things like college-level chemistry, coding, and so on (MMLU, TruthfulQA, etc.).

litgpt evaluate microsoft/phi-2 --tasks 'truthfulqa_mc2,mmlu'

Read the full evaluation docs.

 


Test an LLM

 

Test how well the model works via an interactive chat. Use the chat command to chat, extract embeddings, etc...

Here's an example showing how to use the Phi-2 LLM:

litgpt chat microsoft/phi-2

>> Prompt: What do Llamas eat?
Full code:

 

# 1) List all supported LLMs
litgpt download list

# 2) Use a model (auto downloads weights)
litgpt chat microsoft/phi-2

>> Prompt: What do Llamas eat?

The download of certain models requires an additional access token. You can read more about this in the download documentation.

Read the full chat docs.

 


Pretrain an LLM

 

Pretraining is the process of teaching an AI model by exposing it to a large amount of data before it is fine-tuned for specific tasks.

Show code:

 

mkdir -p custom_texts
curl https://www.gutenberg.org/cache/epub/24440/pg24440.txt --output custom_texts/book1.txt
curl https://www.gutenberg.org/cache/epub/26393/pg26393.txt --output custom_texts/book2.txt

# 1) Download a tokenizer
litgpt download EleutherAI/pythia-160m \
  --tokenizer_only True

# 2) Pretrain the model
litgpt pretrain EleutherAI/pythia-160m \
  --tokenizer_dir EleutherAI/pythia-160m \
  --data TextFiles \
  --data.train_data_path "custom_texts/" \
  --train.max_tokens 10_000_000 \
  --out_dir out/custom-model

# 3) Test the model
litgpt chat out/custom-model/final

Read the full pretraining docs

 


Continue pretraining an LLM

 

Continued pretraining is another way of finetuning that specializes an already pretrained model by training on custom data:

Show code:

 

mkdir -p custom_texts
curl https://www.gutenberg.org/cache/epub/24440/pg24440.txt --output custom_texts/book1.txt
curl https://www.gutenberg.org/cache/epub/26393/pg26393.txt --output custom_texts/book2.txt

# 1) Continue pretraining a model (auto downloads weights)
litgpt pretrain EleutherAI/pythia-160m \
  --tokenizer_dir EleutherAI/pythia-160m \
  --initial_checkpoint_dir EleutherAI/pythia-160m \
  --data TextFiles \
  --data.train_data_path "custom_texts/" \
  --train.max_tokens 10_000_000 \
  --out_dir out/custom-model

# 2) Test the model
litgpt chat out/custom-model/final

Read the full continued pretraining docs

 


State-of-the-art features

✅  State-of-the-art optimizations: Flash Attention v2, multi-GPU support via fully-sharded data parallelism, optional CPU offloading, and TPU and XLA support.

✅  Pretrain, finetune, and deploy

✅  Reduce compute requirements with low-precision settings: FP16, BF16, and FP16/FP32 mixed.

✅  Lower memory requirements with quantization: 4-bit floats, 8-bit integers, and double quantization.

✅  Configuration files for great out-of-the-box performance.

✅  Parameter-efficient finetuning: LoRA, QLoRA, Adapter, and Adapter v2.

✅  Exporting to other popular model weight formats.

✅  Many popular datasets for pretraining and finetuning, and support for custom datasets.

✅  Readable and easy-to-modify code to experiment with the latest research ideas.

 


Training recipes

LitGPT comes with validated recipes (YAML configs) to train models under different conditions. We've generated these recipes based on the parameters we found to perform the best for different training conditions.

Browse all training recipes here.

Example

litgpt finetune \
  --config https://raw.githubusercontent.com/Lightning-AI/litgpt/main/config_hub/finetune/llama-2-7b/lora.yaml
✅ Use configs to customize training

Configs let you customize training down to granular parameters like:

# The path to the base model's checkpoint directory to load for finetuning. (type: <class 'Path'>, default: checkpoints/stabilityai/stablelm-base-alpha-3b)
checkpoint_dir: checkpoints/meta-llama/Llama-2-7b-hf

# Directory in which to save checkpoints and logs. (type: <class 'Path'>, default: out/lora)
out_dir: out/finetune/qlora-llama2-7b

# The precision to use for finetuning. Possible choices: "bf16-true", "bf16-mixed", "32-true". (type: Optional[str], default: null)
precision: bf16-true

...
✅ Example: LoRA finetuning config

 

# The path to the base model's checkpoint directory to load for finetuning. (type: <class 'Path'>, default: checkpoints/stabilityai/stablelm-base-alpha-3b)
checkpoint_dir: checkpoints/meta-llama/Llama-2-7b-hf

# Directory in which to save checkpoints and logs. (type: <class 'Path'>, default: out/lora)
out_dir: out/finetune/qlora-llama2-7b

# The precision to use for finetuning. Possible choices: "bf16-true", "bf16-mixed", "32-true". (type: Optional[str], default: null)
precision: bf16-true

# If set, quantize the model with this algorithm. See ``tutorials/quantize.md`` for more information. (type: Optional[Literal['nf4', 'nf4-dq', 'fp4', 'fp4-dq', 'int8-training']], default: null)
quantize: bnb.nf4

# How many devices/GPUs to use. (type: Union[int, str], default: 1)
devices: 1

# How many nodes to use. (type: int, default: 1)
num_nodes: 1

# The LoRA rank. (type: int, default: 8)
lora_r: 32

# The LoRA alpha. (type: int, default: 16)
lora_alpha: 16

# The LoRA dropout value. (type: float, default: 0.05)
lora_dropout: 0.05

# Whether to apply LoRA to the query weights in attention. (type: bool, default: True)
lora_query: true

# Whether to apply LoRA to the key weights in attention. (type: bool, default: False)
lora_key: false

# Whether to apply LoRA to the value weights in attention. (type: bool, default: True)
lora_value: true

# Whether to apply LoRA to the output projection in the attention block. (type: bool, default: False)
lora_projection: false

# Whether to apply LoRA to the weights of the MLP in the attention block. (type: bool, default: False)
lora_mlp: false

# Whether to apply LoRA to output head in GPT. (type: bool, default: False)
lora_head: false

# Data-related arguments. If not provided, the default is ``litgpt.data.Alpaca``.
data:
  class_path: litgpt.data.Alpaca2k
  init_args:
    mask_prompt: false
    val_split_fraction: 0.05
    prompt_style: alpaca
    ignore_index: -100
    seed: 42
    num_workers: 4
    download_dir: data/alpaca2k

# Training-related arguments. See ``litgpt.args.TrainArgs`` for details
train:

  # Number of optimizer steps between saving checkpoints (type: Optional[int], default: 1000)
  save_interval: 200

  # Number of iterations between logging calls (type: int, default: 1)
  log_interval: 1

  # Number of samples between optimizer steps across data-parallel ranks (type: int, default: 128)
  global_batch_size: 8

  # Number of samples per data-parallel rank (type: int, default: 4)
  micro_batch_size: 2

  # Number of iterations with learning rate warmup active (type: int, default: 100)
  lr_warmup_steps: 10

  # Number of epochs to train on (type: Optional[int], default: 5)
  epochs: 4

  # Total number of tokens to train on (type: Optional[int], default: null)
  max_tokens:

  # Limits the number of optimizer steps to run (type: Optional[int], default: null)
  max_steps:

  # Limits the length of samples (type: Optional[int], default: null)
  max_seq_length: 512

  # Whether to tie the embedding weights with the language modeling head weights (type: Optional[bool], default: null)
  tie_embeddings:

  #   (type: float, default: 0.0003)
  learning_rate: 0.0002

  #   (type: float, default: 0.02)
  weight_decay: 0.0

  #   (type: float, default: 0.9)
  beta1: 0.9

  #   (type: float, default: 0.95)
  beta2: 0.95

  #   (type: Optional[float], default: null)
  max_norm:

  #   (type: float, default: 6e-05)
  min_lr: 6.0e-05

# Evaluation-related arguments. See ``litgpt.args.EvalArgs`` for details
eval:

  # Number of optimizer steps between evaluation calls (type: int, default: 100)
  interval: 100

  # Number of tokens to generate (type: Optional[int], default: 100)
  max_new_tokens: 100

  # Number of iterations (type: int, default: 100)
  max_iters: 100

# The name of the logger to send metrics to. (type: Literal['wandb', 'tensorboard', 'csv'], default: csv)
logger_name: csv

# The random seed to use for reproducibility. (type: int, default: 1337)
seed: 1337
✅ Override any parameter in the CLI:
litgpt finetune \
  --config https://raw.githubusercontent.com/Lightning-AI/litgpt/main/config_hub/finetune/llama-2-7b/lora.yaml \
  --lora_r 4
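
Because recipes are plain YAML, they can also be inspected or edited programmatically before launching a run. A minimal sketch using requests and PyYAML (both third-party packages; the URL and keys mirror the LoRA example above):

import requests
import yaml  # pip install pyyaml

url = "https://raw.githubusercontent.com/Lightning-AI/litgpt/main/config_hub/finetune/llama-2-7b/lora.yaml"
config = yaml.safe_load(requests.get(url, timeout=30).text)

# Tweak a field, e.g. the LoRA rank overridden on the CLI above
config["lora_r"] = 4
print(config["precision"], config["lora_r"])

# Write a local copy that can be passed back via --config
with open("my_lora.yaml", "w") as f:
    yaml.safe_dump(config, f)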

 


Project highlights

LitGPT powers many great AI projects, initiatives, challenges, and of course, enterprises. Please submit a pull request to be considered for a feature.

📊 SAMBA: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling

The Samba project by researchers at Microsoft is built on top of the LitGPT code base and combines state space models with sliding window attention, which outperforms pure state space models.

🏆 NeurIPS 2023 Large Language Model Efficiency Challenge: 1 LLM + 1 GPU + 1 Day

The LitGPT repository was the official starter kit for the NeurIPS 2023 LLM Efficiency Challenge, which is a competition focused on finetuning an existing non-instruction tuned LLM for 24 hours on a single GPU.

🦙 TinyLlama: An Open-Source Small Language Model

LitGPT powered the TinyLlama project and TinyLlama: An Open-Source Small Language Model research paper.

🍪 MicroLlama: MicroLlama-300M

MicroLlama is a 300M Llama model pretrained on 50B tokens powered by TinyLlama and LitGPT.

🔬 Pre-training Small Base LMs with Fewer Tokens

The research paper "Pre-training Small Base LMs with Fewer Tokens", which utilizes LitGPT, develops smaller base language models by inheriting a few transformer blocks from larger models and training on a tiny fraction of the data used by the larger models. It demonstrates that these smaller models can perform comparably to larger models despite using significantly less training data and resources.

 


Community

We welcome all individual contributors, regardless of their level of experience or hardware. Your contributions are valuable, and we are excited to see what you can accomplish in this collaborative and supportive environment.

 

Tutorials

🚀 Get started
⚡️ Finetuning, incl. LoRA, QLoRA, and Adapters
🤖 Pretraining
💬 Model evaluation
📘 Supported and custom datasets
🧹 Quantization
🤯 Tips for dealing with out-of-memory (OOM) errors
🧑🏽‍💻 Using cloud TPUs

 


Acknowledgements

This implementation builds on Lit-LLaMA and nanoGPT, and it's powered by Lightning Fabric.

License

LitGPT is released under the Apache 2.0 license.

Citation

If you use LitGPT in your research, please cite the following work:

@misc{litgpt-2023,
  author       = {Lightning AI},
  title        = {LitGPT},
  howpublished = {\url{https://github.com/Lightning-AI/litgpt}},
  year         = {2023},
}

 

litgpt's People

Contributors

agmo1993, alealv, andrei-aksionov, aniketmaurya, apaz-cli, arturk-85, awaelchli, bkiat1123, borda, carmocca, codicespaghetti, davmacario, eltociear, gkroiz, iskandr, janebert, jxtngx, lantiga, likethecognac, lucas-ventura, m0saan, mf-foom, nkasmanoff, patrickhwood, rasbt, safurrier, salykova, shenxiangzhuang, t-vi, williamfalcon


litgpt's Issues

micro_batch_size, step run time, total training time

Hi,

Thanks a lot for this clear and fat-free code base!
I'm training Falcon-7B with adapters-v2 and an Alpaca-formatted dataset of mine.

As usual, I'm trying to max out the VRAM usage for the best training time, but in this case there is no significant gain, since the step time is almost proportional to the batch size.

step times:
micro_batch_size 1, 159ms
micro_batch_size 2, 293ms
micro_batch_size 4, 560ms

Is this expected, or can this be optimized?

Note:
I'll also open a new issue as advised with my attempt at batch inference, exhibiting the same lack of gain when batching at inference, see
Lightning-AI/lit-llama#188 (comment)

Pythia embedding dimension mismatch

I have encountered some errors while downloading these models and converting their weights from Hugging Face:

  • pythia-1b
  • pythia-1.4b
  • pythia-2.8b

Their embedding dimensions are not correctly specified in the Hugging Face repository. For example:

  • pythia-1b model expects n_embd=8192 but the actual weight dimension is 2048.
  • pythia-1.4b model expects n_embd=8192 but the actual weight dimension is 2048.
  • pythia-2.8b model expects n_embd=8192 but the actual weight dimension is 2560.

I also checked the original repository at EleutherAI/pythia, and the numbers are aligned with the hidden-size parameter in each model's configuration file. The configuration files in the Hugging Face repository might be wrong. Have you ever checked these models?

Here is an example error (screenshot omitted).

Note: It is the same for both deduped and original models, and I didn't try models larger than 2.8B from the Pythia repository.

ERROR: Could not find a version that satisfies the requirement torch>=2.1.0dev

Hello,

I have just cloned the repository and ran pip install -r requirements.txt as explained in the README.md file, but I get the following error:

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
ERROR: Could not find a version that satisfies the requirement torch>=2.1.0dev (from versions: 2.0.0)
ERROR: No matching distribution found for torch>=2.1.0dev

I already tried the following commands with the same result:

pip install --pre -r requirements.txt
pip install --pre -r requirements.txt -f https://download.pytorch.org/whl/nightly/cpu

The issue is that the nightly dev version of torch is not found and therefore not installed.
Am I missing something?
I am running on a MacBook Pro with an Apple M1 Max.

Thanks,
Nestor

Getting "Attempting to unscale FP16 gradients" while finetuning with float16

The initial script showed an error that the GPU doesn't support bfloat16 and asked me to use float16 instead.

I modified it as below.

fabric = L.Fabric(
        accelerator="cuda",
        devices=devices,
        strategy=(DeepSpeedStrategy(config=ds_config) if devices > 1 else "auto"),
        precision="16-mixed",
    )

with EmptyInitOnDevice(device=fabric.device, dtype=torch.float16):
        model = Parrot(config)

It shows errors when the optimizer tries to step:

File "finetune_adapter.py", line 117, in train
    optimizer.step()
  File "/disk3/ai/ml_experiments/finetune_llms/venv/lib/python3.8/site-packages/lightning/fabric/wrappers.py", line 72, in step
    return self._strategy.optimizer_step(
  File "/disk3/ai/ml_experiments/finetune_llms/venv/lib/python3.8/site-packages/lightning/fabric/strategies/strategy.py", line 193, in optimizer_step
    return self.precision.optimizer_step(optimizer, **kwargs)
  File "/disk3/ai/ml_experiments/finetune_llms/venv/lib/python3.8/site-packages/lightning/fabric/plugins/precision/amp.py", line 83, in optimizer_step
    step_output = self.scaler.step(optimizer, **kwargs)
  File "/disk3/ai/ml_experiments/finetune_llms/venv/lib/python3.8/site-packages/torch/cuda/amp/grad_scaler.py", line 370, in step
    self.unscale_(optimizer)
  File "/disk3/ai/ml_experiments/finetune_llms/venv/lib/python3.8/site-packages/torch/cuda/amp/grad_scaler.py", line 284, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(optimizer, inv_scale, found_inf, False)
  File "/disk3/ai/ml_experiments/finetune_llms/venv/lib/python3.8/site-packages/torch/cuda/amp/grad_scaler.py", line 212, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.

Model finetuned using finetune_adapter not directly usable in generate/chat... How to convert?

I used the finetune_adapter.py script to generate a tuned model. I tried loading that tuned model back into chat.py, and I get the following error upon load:

RuntimeError: Error(s) in loading state_dict for Parrot:
        Missing key(s) in state_dict: "lm_head.weight", "transformer.wte.weight", "transformer.h.0.norm_1.weight", "transformer.h.0.norm_1.bias", "transformer.h.0.attn.attn.weight", "transformer.h.0.attn.attn.bias", "transformer.h.0.attn.proj.weight",
"transformer.h.0.attn.proj.bias", "transformer.h.0.norm_2.weight", "transformer.h.0.norm_2.bias", "transformer.h.0.mlp.fc.weight", "transformer.h.0.mlp.fc.bias", "transformer.h.0.mlp.proj.weight", "transformer.h.0.mlp.proj.bias", "transformer.h.1.norm_1.weight",
"transformer.h.1.norm_1.bias", "transformer.h.1.attn.attn.weight", "transformer.h.1.attn.attn.bias", "transformer.h.1.attn.proj.weight", "transformer.h.1.attn.proj.bias", "transformer.h.1.norm_2.weight", "transformer.h.1.norm_2.bias", "transformer.h.1.mlp.fc.weight",
"transformer.h.1.mlp.fc.bias", "transformer.h.1.mlp.proj.weight", "transformer.h.1.mlp.proj.bias", "transformer.h.2.norm_1.weight", "transformer.h.2.norm_1.bias", "transformer.h.2.attn.attn.weight", "transformer.h.2.attn.attn.bias", "transformer.h.2.attn.proj.weight",
"transformer.h.2.attn.proj.bias", "transformer.h.2.norm_2.weight", "transformer.h.2.norm_2.bias", "transformer.h.2.mlp.fc.weight", "transformer.h.2.mlp.fc.bias", "transformer.h.2.mlp.proj.weight", "transformer.h.2.mlp.proj.bias", "transformer.h.3.norm_1.weight",
"transformer.h.3.norm_1.bias", "transformer.h.3.attn.attn.weight", "transformer.h.3.attn.attn.bias", "transformer.h.3.attn.proj.weight", "transformer.h.3.attn.proj.bias", "transformer.h.3.norm_2.weight", "transformer.h.3.norm_2.bias", "transformer.h.3.mlp.fc.weight",
"transformer.h.3.mlp.fc.bias", "transformer.h.3.mlp.proj.weight", "transformer.h.3.mlp.proj.bias", "transformer.h.4.norm_1.weight", "transformer.h.4.norm_1.bias", "transformer.h.4.attn.attn.weight", "transformer.h.4.attn.attn.bias", "transformer.h.4.attn.proj.weight",
"transformer.h.4.attn.proj.bias", "transformer.h.4.norm_2.weight", "transformer.h.4.norm_2.bias", "transformer.h.4.mlp.fc.weight", "transformer.h.4.mlp.fc.bias", "transformer.h.4.mlp.proj.weight", "transformer.h.4.mlp.proj.bias", "transformer.h.5.norm_1.weight",
"transformer.h.5.norm_1.bias", "transformer.h.5.attn.attn.weight", "transformer.h.5.attn.attn.bias", "transformer.h.5.attn.proj.weight", "transformer.h.5.attn.proj.bias", "transformer.h.5.norm_2.weight", "transformer.h.5.norm_2.bias", "transformer.h.5.mlp.fc.weight",
"transformer.h.5.mlp.fc.bias", "transformer.h.5.mlp.proj.weight", "transformer.h.5.mlp.proj.bias", "transformer.h.6.norm_1.weight", "transformer.h.6.norm_1.bias", "transformer.h.6.attn.attn.weight", "transformer.h.6.attn.attn.bias", "transformer.h.6.attn.proj.weight",
"transformer.h.6.attn.proj.bias", "transformer.h.6.norm_2.weight", "transformer.h.6.norm_2.bias", "transformer.h.6.mlp.fc.weight", "transformer.h.6.mlp.fc.bias", "transformer.h.6.mlp.proj.weight", "transformer.h.6.mlp.proj.bias", "transformer.h.7.norm_1.weight",
"transformer.h.7.norm_1.bias", "transformer.h.7.attn.attn.weight", "transformer.h.7.attn.attn.bias", "transformer.h.7.attn.proj.weight", "transformer.h.7.attn.proj.bias", "transformer.h.7.norm_2.weight", "transformer.h.7.norm_2.bias", "transformer.h.7.mlp.fc.weight",
"transformer.h.7.mlp.fc.bias", "transformer.h.7.mlp.proj.weight", "transformer.h.7.mlp.proj.bias", "transformer.h.8.norm_1.weight", "transformer.h.8.norm_1.bias", "transformer.h.8.attn.attn.weight", "transformer.h.8.attn.attn.bias", "transformer.h.8.attn.proj.weight",
"transformer.h.8.attn.proj.bias", "transformer.h.8.norm_2.weight", "transformer.h.8.norm_2.bias", "transformer.h.8.mlp.fc.weight", "transformer.h.8.mlp.fc.bias", "transformer.h.8.mlp.proj.weight", "transformer.h.8.mlp.proj.bias", "transformer.h.9.norm_1.weight",
"transformer.h.9.norm_1.bias", "transformer.h.9.attn.attn.weight", "transformer.h.9.attn.attn.bias", "transformer.h.9.attn.proj.weight", "transformer.h.9.attn.proj.bias", "transformer.h.9.norm_2.weight", "transformer.h.9.norm_2.bias", "transformer.h.9.mlp.fc.weight",
"transformer.h.9.mlp.fc.bias", "transformer.h.9.mlp.proj.weight", "transformer.h.9.mlp.proj.bias", "transformer.h.10.norm_1.weight", "transformer.h.10.norm_1.bias", "transformer.h.10.attn.attn.weight", "transformer.h.10.attn.attn.bias",
"transformer.h.10.attn.proj.weight", "transformer.h.10.attn.proj.bias", "transformer.h.10.norm_2.weight", "transformer.h.10.norm_2.bias", "transformer.h.10.mlp.fc.weight", "transformer.h.10.mlp.fc.bias", "transformer.h.10.mlp.proj.weight",
"transformer.h.10.mlp.proj.bias", "transformer.h.11.norm_1.weight", "transformer.h.11.norm_1.bias", "transformer.h.11.attn.attn.weight", "transformer.h.11.attn.attn.bias", "transformer.h.11.attn.proj.weight", "transformer.h.11.attn.proj.bias",
"transformer.h.11.norm_2.weight", "transformer.h.11.norm_2.bias", "transformer.h.11.mlp.fc.weight", "transformer.h.11.mlp.fc.bias", "transformer.h.11.mlp.proj.weight", "transformer.h.11.mlp.proj.bias", "transformer.h.12.norm_1.weight", "transformer.h.12.norm_1.bias",
"transformer.h.12.attn.attn.weight", "transformer.h.12.attn.attn.bias", "transformer.h.12.attn.proj.weight", "transformer.h.12.attn.proj.bias", "transformer.h.12.norm_2.weight", "transformer.h.12.norm_2.bias", "transformer.h.12.mlp.fc.weight",
"transformer.h.12.mlp.fc.bias", "transformer.h.12.mlp.proj.weight", "transformer.h.12.mlp.proj.bias", "transformer.h.13.norm_1.weight", "transformer.h.13.norm_1.bias", "transformer.h.13.attn.attn.weight", "transformer.h.13.attn.attn.bias",
"transformer.h.13.attn.proj.weight", "transformer.h.13.attn.proj.bias", "transformer.h.13.norm_2.weight", "transformer.h.13.norm_2.bias", "transformer.h.13.mlp.fc.weight", "transformer.h.13.mlp.fc.bias", "transformer.h.13.mlp.proj.weight",
"transformer.h.13.mlp.proj.bias", "transformer.h.14.norm_1.weight", "transformer.h.14.norm_1.bias", "transformer.h.14.attn.attn.weight", "transformer.h.14.attn.attn.bias", "transformer.h.14.attn.proj.weight", "transformer.h.14.attn.proj.bias",
"transformer.h.14.norm_2.weight", "transformer.h.14.norm_2.bias", "transformer.h.14.mlp.fc.weight", "transformer.h.14.mlp.fc.bias", "transformer.h.14.mlp.proj.weight", "transformer.h.14.mlp.proj.bias", "transformer.h.15.norm_1.weight", "transformer.h.15.norm_1.bias",
"transformer.h.15.attn.attn.weight", "transformer.h.15.attn.attn.bias", "transformer.h.15.attn.proj.weight", "transformer.h.15.attn.proj.bias", "transformer.h.15.norm_2.weight", "transformer.h.15.norm_2.bias", "transformer.h.15.mlp.fc.weight",
"transformer.h.15.mlp.fc.bias", "transformer.h.15.mlp.proj.weight", "transformer.h.15.mlp.proj.bias", "transformer.ln_f.weight", "transformer.ln_f.bias".
        Unexpected key(s) in state_dict: "transformer.h.2.attn.gating_factor", "transformer.h.2.attn.adapter_wte.weight", "transformer.h.3.attn.gating_factor", "transformer.h.3.attn.adapter_wte.weight", "transformer.h.4.attn.gating_factor",
"transformer.h.4.attn.adapter_wte.weight", "transformer.h.5.attn.gating_factor", "transformer.h.5.attn.adapter_wte.weight", "transformer.h.6.attn.gating_factor", "transformer.h.6.attn.adapter_wte.weight", "transformer.h.7.attn.gating_factor",
"transformer.h.7.attn.adapter_wte.weight", "transformer.h.8.attn.gating_factor", "transformer.h.8.attn.adapter_wte.weight", "transformer.h.9.attn.gating_factor", "transformer.h.9.attn.adapter_wte.weight", "transformer.h.10.attn.gating_factor",
"transformer.h.10.attn.adapter_wte.weight", "transformer.h.11.attn.gating_factor", "transformer.h.11.attn.adapter_wte.weight", "transformer.h.12.attn.gating_factor", "transformer.h.12.attn.adapter_wte.weight", "transformer.h.13.attn.gating_factor",
"transformer.h.13.attn.adapter_wte.weight", "transformer.h.14.attn.gating_factor", "transformer.h.14.attn.adapter_wte.weight", "transformer.h.15.attn.gating_factor", "transformer.h.15.attn.adapter_wte.weight".

What am I doing wrong? How do I convert a tuned model checkpoint to what is expected by generate / chat?

TypeError: BFloat16 is not supported on MPS

Getting this when running the Falcon 7B model on an M1 Pro. Is there a specific version that supports this on M1?

Command that was run:
python generate/base.py --prompt "Hello, my name is" --checkpoint_dir checkpoints/tiiuae/falcon-7b

Document that was referenced:
https://github.com/Lightning-AI/lit-parrot/blob/main/howto/download_falcon.md

too many values to unpack in Block forward

Failed to unpack the block forward results correctly.

File "/root/lit-parrot/lit_parrot/adapter.py", line 207, in forward
	if input_pos is None:  # proxy for use_cache=False
    for block in self.transformer.h:
	    x, _ = block(x, (cos, sin), mask, max_seq_length)
ValueError: too many values to unpack (expected 2)

Training time is unexpectedly very slow compared to lit-llama

Hello,

I'm using the pretrain code to train falcon-7B; I've already used lit-llama and trained llama-7B.
I noticed that falcon is very slow compared to llama, and it takes more memory.
In llama 7B:
iter 2: loss 11.0692, time: 5024.25ms, speed: 1705 toks/s/device
In falcon 7B:
iter 2: loss 11.0666, time: 26360.27ms, speed: 388 toks/s/device

Also, falcon consumes a lot of memory: I couldn't increase the batch size to more than 160 with a micro batch size of 5, while with llama I went up to 384 with a micro batch size of 6.
Is it normal?

Config cannot be overwritten through kwargs

Repro from finetune_adapter script:

from pathlib import Path
from lit_parrot.adapter import Config

max_seq_length = 256  # see scripts/prepare_alpaca.py
checkpoint_dir = Path("checkpoints/togethercomputer/RedPajama-INCITE-Base-3B-v1")
config = Config.from_name(name=checkpoint_dir.name, block_size=max_seq_length)


TypeError: lit_parrot.adapter.Config() got multiple values for keyword argument 'block_size'

https://github.com/Lightning-AI/lit-parrot/blob/0b5620de0c261a69298d39565d6e0f4b1e255fdb/lit_parrot/config.py#L25-L27

We can change it to the following so that user-specified kwargs will overwrite the configs:

@classmethod
def from_name(cls, name: str, **kwargs: Any) -> Self:
    return cls(**{**configs[name], **kwargs})
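
The merge works because later entries win when dictionaries are unpacked with **, so the user-supplied kwargs override the stored defaults. A tiny standalone illustration with made-up values:

# stand-ins for configs[name] and the caller's kwargs, respectively
defaults = {"block_size": 4096, "n_layer": 16}
overrides = {"block_size": 256}

merged = {**defaults, **overrides}
print(merged)  # {'block_size': 256, 'n_layer': 16} -- the override wins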

Which should be our default model?

Given that this repository supports a multitude of models, which one should be chosen when the user doesn't specify it?

This is important because concrete numbers are given for a specific model in the howtos and README.

Alternatively, should we force the user to choose one?

Text generation fails on --devices 2

Hi, I am trying to generate text predictions using falcon-7b-instruct on a machine with two A10 24GB GPUs. When I run generate with the default --devices option, which is 1, it runs successfully, but it fails with --devices 2.

python generate/base.py --prompt "Hello, my name is" --checkpoint_dir checkpoints/tiiuae/falcon-7b-instruct

default --devices
Loading model 'checkpoints/tiiuae/falcon-7b-instruct/lit_model.pth' with {'block_size': 2048, 'vocab_size': 50254, 'padding_multiple': 512, 'padded_vocab_size': 65024, 'n_layer': 32, 'n_head': 71, 'n_embd': 4544, 'rotary_percentage': 1.0, 'parallel_residual': True, 'bias': False, 'n_query_groups': 1, 'shared_attention_norm': True}
Time to instantiate model: 0.15 seconds.
Time to load the model weights: 15.32 seconds.
Global seed set to 1234
Hello, my name is Jack.
Some people think that having a blog is a great way to make money online and others insist that it is not. In my own view, I do agree with the latter one.
But in the end, it will have to depend
Time for inference 1: 2.13 sec total, 23.47 tokens/sec
Memory used: 14.56 GB

python generate/base.py --prompt "Hello, my name is" --checkpoint_dir checkpoints/tiiuae/falcon-7b-instruct --devices 2

--devices 2
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------

Loading model 'checkpoints/tiiuae/falcon-7b-instruct/lit_model.pth' with {'block_size': 2048, 'vocab_size': 50254, 'padding_multiple': 512, 'padded_vocab_size': 65024, 'n_layer': 32, 'n_head': 71, 'n_embd': 4544, 'rotary_percentage': 1.0, 'parallel_residual': True, 'bias': False, 'n_query_groups': 1, 'shared_attention_norm': True}
Time to instantiate model: 1.33 seconds.
Time to load the model weights: 16.37 seconds.
Traceback (most recent call last):
  File "/home/ubuntu/llm-repos/lit-parrot/generate/base.py", line 204, in <module>
    CLI(main)
  File "/home/ubuntu/miniconda3/envs/litpar/lib/python3.9/site-packages/jsonargparse/cli.py", line 85, in CLI
    return _run_component(component, cfg_init)
  File "/home/ubuntu/miniconda3/envs/litpar/lib/python3.9/site-packages/jsonargparse/cli.py", line 147, in _run_component
    return component(**cfg)
  File "/home/ubuntu/llm-repos/lit-parrot/generate/base.py", line 156, in main
    model = fabric.setup_module(model)
  File "/home/ubuntu/miniconda3/envs/litpar/lib/python3.9/site-packages/lightning/fabric/fabric.py", line 265, in setup_module
    module = self._strategy.setup_module(module)
  File "/home/ubuntu/miniconda3/envs/litpar/lib/python3.9/site-packages/lightning/fabric/strategies/ddp.py", line 121, in setup_module
    return DistributedDataParallel(module=module, device_ids=device_ids, **self._ddp_kwargs)
  File "/home/ubuntu/miniconda3/envs/litpar/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 805, in __init__
    self._ddp_init_helper(
  File "/home/ubuntu/miniconda3/envs/litpar/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1095, in _ddp_init_helper
Traceback (most recent call last):
  File "/home/ubuntu/llm-repos/lit-parrot/generate/base.py", line 204, in <module>
    CLI(main)
  File "/home/ubuntu/miniconda3/envs/litpar/lib/python3.9/site-packages/jsonargparse/cli.py", line 85, in CLI
    self.reducer = dist.Reducer(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 13.44 GiB. GPU 1 has a total capacty of 22.05 GiB of which 7.74 GiB is free. Including non-PyTorch memory, this process has 14.31 GiB memory in use. Of the allocated memory 13.49 GiB is allocated by PyTorch, and 49.67 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
    return _run_component(component, cfg_init)
  File "/home/ubuntu/miniconda3/envs/litpar/lib/python3.9/site-packages/jsonargparse/cli.py", line 147, in _run_component
    return component(**cfg)
  File "/home/ubuntu/llm-repos/lit-parrot/generate/base.py", line 156, in main
    model = fabric.setup_module(model)
  File "/home/ubuntu/miniconda3/envs/litpar/lib/python3.9/site-packages/lightning/fabric/fabric.py", line 265, in setup_module
    module = self._strategy.setup_module(module)
  File "/home/ubuntu/miniconda3/envs/litpar/lib/python3.9/site-packages/lightning/fabric/strategies/ddp.py", line 121, in setup_module
    return DistributedDataParallel(module=module, device_ids=device_ids, **self._ddp_kwargs)
  File "/home/ubuntu/miniconda3/envs/litpar/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 805, in __init__
    self._ddp_init_helper(
  File "/home/ubuntu/miniconda3/envs/litpar/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1095, in _ddp_init_helper
    self.reducer = dist.Reducer(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 13.44 GiB. GPU 0 has a total capacty of 22.05 GiB of which 7.74 GiB is free. Including non-PyTorch memory, this process has 14.31 GiB memory in use. Of the allocated memory 13.49 GiB is allocated by PyTorch, and 49.67 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Plans for integrating QLoRA 4-bit finetuning?

My understanding is that the repo currently provides 4-bit only for inference, not finetuning. If this is the case, is there a plan for integrating QLoRA-style 4-bit finetuning?

python3 chat.py --checkpoint_dir checkpoints/stabilityai/stablelm-tuned-alpha-7b --quantize "gptq.int4" fails

Loading without quantize succeeds, but the first generate fails with CUDA out of memory. Running with quantize fails on load...

RuntimeError: Error(s) in loading state_dict for Parrot:
Missing key(s) in state_dict: "lm_head.quant_weight", "lm_head.scales", "lm_head.zeros", "transformer.h.0.attn.attn.quant_weight",
"transformer.h.0.attn.attn.scales", "transformer.h.0.attn.attn.zeros", "transformer.h.0.attn.proj.quant_weight",
"transformer.h.0.attn.proj.scales", "transformer.h.0.attn.proj.zeros", "transformer.h.0.mlp.fc.quant_weight", "transformer.h.0.mlp.fc.scales",
"transformer.h.0.mlp.fc.zeros", "transformer.h.0.mlp.proj.quant_weight", "transformer.h.0.mlp.proj.scales", "transformer.h.0.mlp.proj.zeros",
"transformer.h.1.attn.attn.quant_weight", "transformer.h.1.attn.attn.scales", "transformer.h.1.attn.attn.zeros",
"transformer.h.1.attn.proj.quant_weight", "transformer.h.1.attn.proj.scales", "transformer.h.1.attn.proj.zeros",
"transformer.h.1.mlp.fc.quant_weight", "transformer.h.1.mlp.fc.scales", "transformer.h.1.mlp.fc.zeros", "transformer.h.1.mlp.proj.quant_weight",
"transformer.h.1.mlp.proj.scales", "transformer.h.1.mlp.proj.zeros", "transformer.h.2.attn.attn.quant_weight",
"transformer.h.2.attn.attn.scales", "transformer.h.2.attn.attn.zeros", "transformer.h.2.attn.proj.quant_weight",
"transformer.h.2.attn.proj.scales", "transformer.h.2.attn.proj.zeros", "transformer.h.2.mlp.fc.quant_weight", "transformer.h.2.mlp.fc.scales",
"transformer.h.2.mlp.fc.zeros", "transformer.h.2.mlp.proj.quant_weight", "transformer.h.2.mlp.proj.scales", "transformer.h.2.mlp.proj.zeros",
"transformer.h.3.attn.attn.quant_weight", "transformer.h.3.attn.attn.scales", "transformer.h.3.attn.attn.zeros",
"transformer.h.3.attn.proj.quant_weight", "transformer.h.3.attn.proj.scales", "transformer.h.3.attn.proj.zeros",
"transformer.h.3.mlp.fc.quant_weight", "transformer.h.3.mlp.fc.scales", "transformer.h.3.mlp.fc.zeros", "transformer.h.3.mlp.proj.quant_weight",
"transformer.h.3.mlp.proj.scales", "transformer.h.3.mlp.proj.zeros", "transformer.h.4.attn.attn.quant_weight",
"transformer.h.4.attn.attn.scales", "transformer.h.4.attn.attn.zeros", "transformer.h.4.attn.proj.quant_weight",
"transformer.h.4.attn.proj.scales", "transformer.h.4.attn.proj.zeros", "transformer.h.4.mlp.fc.quant_weight", "transformer.h.4.mlp.fc.scales",
"transformer.h.4.mlp.fc.zeros", "transformer.h.4.mlp.proj.quant_weight", "transformer.h.4.mlp.proj.scales", "transformer.h.4.mlp.proj.zeros",
"transformer.h.5.attn.attn.quant_weight", "transformer.h.5.attn.attn.scales", "transformer.h.5.attn.attn.zeros",
"transformer.h.5.attn.proj.quant_weight", "transformer.h.5.attn.proj.scales", "transformer.h.5.attn.proj.zeros",
"transformer.h.5.mlp.fc.quant_weight", "transformer.h.5.mlp.fc.scales", "transformer.h.5.mlp.fc.zeros", "transformer.h.5.mlp.proj.quant_weight",
"transformer.h.5.mlp.proj.scales", "transformer.h.5.mlp.proj.zeros", "transformer.h.6.attn.attn.quant_weight",
"transformer.h.6.attn.attn.scales", "transformer.h.6.attn.attn.zeros", "transformer.h.6.attn.proj.quant_weight",
"transformer.h.6.attn.proj.scales", "transformer.h.6.attn.proj.zeros", "transformer.h.6.mlp.fc.quant_weight", "transformer.h.6.mlp.fc.scales",
"transformer.h.6.mlp.fc.zeros", "transformer.h.6.mlp.proj.quant_weight", "transformer.h.6.mlp.proj.scales", "transformer.h.6.mlp.proj.zeros",
"transformer.h.7.attn.attn.quant_weight", "transformer.h.7.attn.attn.scales", "transformer.h.7.attn.attn.zeros",
"transformer.h.7.attn.proj.quant_weight", "transformer.h.7.attn.proj.scales", "transformer.h.7.attn.proj.zeros",
"transformer.h.7.mlp.fc.quant_weight", "transformer.h.7.mlp.fc.scales", "transformer.h.7.mlp.fc.zeros", "transformer.h.7.mlp.proj.quant_weight",
"transformer.h.7.mlp.proj.scales", "transformer.h.7.mlp.proj.zeros", "transformer.h.8.attn.attn.quant_weight",
"transformer.h.8.attn.attn.scales", "transformer.h.8.attn.attn.zeros", "transformer.h.8.attn.proj.quant_weight",
"transformer.h.8.attn.proj.scales", "transformer.h.8.attn.proj.zeros", "transformer.h.8.mlp.fc.quant_weight", "transformer.h.8.mlp.fc.scales",
"transformer.h.8.mlp.fc.zeros", "transformer.h.8.mlp.proj.quant_weight", "transformer.h.8.mlp.proj.scales", "transformer.h.8.mlp.proj.zeros",
"transformer.h.9.attn.attn.quant_weight", "transformer.h.9.attn.attn.scales", "transformer.h.9.attn.attn.zeros",
"transformer.h.9.attn.proj.quant_weight", "transformer.h.9.attn.proj.scales", "transformer.h.9.attn.proj.zeros",
"transformer.h.9.mlp.fc.quant_weight", "transformer.h.9.mlp.fc.scales", "transformer.h.9.mlp.fc.zeros", "transformer.h.9.mlp.proj.quant_weight",
"transformer.h.9.mlp.proj.scales", "transformer.h.9.mlp.proj.zeros", "transformer.h.10.attn.attn.quant_weight",
"transformer.h.10.attn.attn.scales", "transformer.h.10.attn.attn.zeros", "transformer.h.10.attn.proj.quant_weight",
"transformer.h.10.attn.proj.scales", "transformer.h.10.attn.proj.zeros", "transformer.h.10.mlp.fc.quant_weight",
"transformer.h.10.mlp.fc.scales", "transformer.h.10.mlp.fc.zeros", "transformer.h.10.mlp.proj.quant_weight", "transformer.h.10.mlp.proj.scales",
"transformer.h.10.mlp.proj.zeros", "transformer.h.11.attn.attn.quant_weight", "transformer.h.11.attn.attn.scales",
"transformer.h.11.attn.attn.zeros", "transformer.h.11.attn.proj.quant_weight", "transformer.h.11.attn.proj.scales",
"transformer.h.11.attn.proj.zeros", "transformer.h.11.mlp.fc.quant_weight", "transformer.h.11.mlp.fc.scales", "transformer.h.11.mlp.fc.zeros",
"transformer.h.11.mlp.proj.quant_weight", "transformer.h.11.mlp.proj.scales", "transformer.h.11.mlp.proj.zeros",
"transformer.h.12.attn.attn.quant_weight", "transformer.h.12.attn.attn.scales", "transformer.h.12.attn.attn.zeros",
"transformer.h.12.attn.proj.quant_weight", "transformer.h.12.attn.proj.scales", "transformer.h.12.attn.proj.zeros",
"transformer.h.12.mlp.fc.quant_weight", "transformer.h.12.mlp.fc.scales", "transformer.h.12.mlp.fc.zeros",
"transformer.h.12.mlp.proj.quant_weight", "transformer.h.12.mlp.proj.scales", "transformer.h.12.mlp.proj.zeros",
"transformer.h.13.attn.attn.quant_weight", "transformer.h.13.attn.attn.scales", "transformer.h.13.attn.attn.zeros",
"transformer.h.13.attn.proj.quant_weight", "transformer.h.13.attn.proj.scales", "transformer.h.13.attn.proj.zeros",
"transformer.h.13.mlp.fc.quant_weight", "transformer.h.13.mlp.fc.scales", "transformer.h.13.mlp.fc.zeros",
"transformer.h.13.mlp.proj.quant_weight", "transformer.h.13.mlp.proj.scales", "transformer.h.13.mlp.proj.zeros",
"transformer.h.14.attn.attn.quant_weight", "transformer.h.14.attn.attn.scales", "transformer.h.14.attn.attn.zeros",
"transformer.h.14.attn.proj.quant_weight", "transformer.h.14.attn.proj.scales", "transformer.h.14.attn.proj.zeros",
"transformer.h.14.mlp.fc.quant_weight", "transformer.h.14.mlp.fc.scales", "transformer.h.14.mlp.fc.zeros",
"transformer.h.14.mlp.proj.quant_weight", "transformer.h.14.mlp.proj.scales", "transformer.h.14.mlp.proj.zeros",
"transformer.h.15.attn.attn.quant_weight", "transformer.h.15.attn.attn.scales", "transformer.h.15.attn.attn.zeros",
"transformer.h.15.attn.proj.quant_weight", "transformer.h.15.attn.proj.scales", "transformer.h.15.attn.proj.zeros",
"transformer.h.15.mlp.fc.quant_weight", "transformer.h.15.mlp.fc.scales", "transformer.h.15.mlp.fc.zeros",
"transformer.h.15.mlp.proj.quant_weight", "transformer.h.15.mlp.proj.scales", "transformer.h.15.mlp.proj.zeros".
Unexpected key(s) in state_dict: "lm_head.weight", "transformer.h.0.attn.attn.weight", "transformer.h.0.attn.proj.weight",
"transformer.h.0.mlp.fc.weight", "transformer.h.0.mlp.proj.weight", "transformer.h.1.attn.attn.weight", "transformer.h.1.attn.proj.weight",
"transformer.h.1.mlp.fc.weight", "transformer.h.1.mlp.proj.weight", "transformer.h.2.attn.attn.weight", "transformer.h.2.attn.proj.weight",
"transformer.h.2.mlp.fc.weight", "transformer.h.2.mlp.proj.weight", "transformer.h.3.attn.attn.weight", "transformer.h.3.attn.proj.weight",
"transformer.h.3.mlp.fc.weight", "transformer.h.3.mlp.proj.weight", "transformer.h.4.attn.attn.weight", "transformer.h.4.attn.proj.weight",
"transformer.h.4.mlp.fc.weight", "transformer.h.4.mlp.proj.weight", "transformer.h.5.attn.attn.weight", "transformer.h.5.attn.proj.weight",
"transformer.h.5.mlp.fc.weight", "transformer.h.5.mlp.proj.weight", "transformer.h.6.attn.attn.weight", "transformer.h.6.attn.proj.weight",
"transformer.h.6.mlp.fc.weight", "transformer.h.6.mlp.proj.weight", "transformer.h.7.attn.attn.weight", "transformer.h.7.attn.proj.weight",
"transformer.h.7.mlp.fc.weight", "transformer.h.7.mlp.proj.weight", "transformer.h.8.attn.attn.weight", "transformer.h.8.attn.proj.weight",
"transformer.h.8.mlp.fc.weight", "transformer.h.8.mlp.proj.weight", "transformer.h.9.attn.attn.weight", "transformer.h.9.attn.proj.weight",
"transformer.h.9.mlp.fc.weight", "transformer.h.9.mlp.proj.weight", "transformer.h.10.attn.attn.weight", "transformer.h.10.attn.proj.weight",
"transformer.h.10.mlp.fc.weight", "transformer.h.10.mlp.proj.weight", "transformer.h.11.attn.attn.weight", "transformer.h.11.attn.proj.weight",
"transformer.h.11.mlp.fc.weight", "transformer.h.11.mlp.proj.weight", "transformer.h.12.attn.attn.weight", "transformer.h.12.attn.proj.weight",
"transformer.h.12.mlp.fc.weight", "transformer.h.12.mlp.proj.weight", "transformer.h.13.attn.attn.weight", "transformer.h.13.attn.proj.weight",
"transformer.h.13.mlp.fc.weight", "transformer.h.13.mlp.proj.weight", "transformer.h.14.attn.attn.weight", "transformer.h.14.attn.proj.weight",
"transformer.h.14.mlp.fc.weight", "transformer.h.14.mlp.proj.weight", "transformer.h.15.attn.attn.weight", "transformer.h.15.attn.proj.weight",
"transformer.h.15.mlp.fc.weight", "transformer.h.15.mlp.proj.weight".

Flash attention support

In PyTorch 2.0, torch.nn.functional.scaled_dot_product_attention takes the normalization factor from Q.size(-1): https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html

However, in our model implementation, this value is different from the head size because a rotary percentage of 0.25 is used by default, meaning that we cannot use it in that case:

if self.rotary_percentage != 1.0:
    self.register_buffer(
        "bias",
        torch.tril(torch.ones(config.block_size, config.block_size)).view(
            1, 1, config.block_size, config.block_size
        ),
    )

...

if hasattr(self, "bias"):
    # causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
    att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(head_size))
    att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float("-inf"))
    att = F.softmax(att, dim=-1)
    y = att @ v  # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
else:
    # efficient attention using Flash Attention CUDA kernels
    y = F.scaled_dot_product_attention(q, k, v, attn_mask=None, dropout_p=0.0, is_causal=True)

PyTorch nightly (to be released with 2.1) conveniently added a scale argument to scaled_dot_product_attention: https://pytorch.org/docs/main/generated/torch.nn.functional.scaled_dot_product_attention.html

My proposal would be to install a nightly version in our requirements.
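
For illustration, the scale argument lets the caller pass the 1/sqrt(head_size) normalization explicitly instead of relying on Q.size(-1). A hedged sketch of the idea with dummy tensors (requires PyTorch 2.1+; this is not the repository's actual attention code):

import math
import torch
import torch.nn.functional as F

B, n_head, T, head_size = 2, 8, 16, 64
q = torch.randn(B, n_head, T, head_size)
k = torch.randn(B, n_head, T, head_size)
v = torch.randn(B, n_head, T, head_size)

# Passing scale explicitly keeps the normalization tied to the head size,
# even in models where Q.size(-1) would yield a different value.
y = F.scaled_dot_product_attention(
    q, k, v, attn_mask=None, dropout_p=0.0, is_causal=True,
    scale=1.0 / math.sqrt(head_size),
)
print(y.shape)  # torch.Size([2, 8, 16, 64])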

/lit_parrot/model.py:201 in forward

Running:
python generate.py --prompt "Hello, my name is" --checkpoint_dir checkpoints/stabilityai/stablelm-tuned-alpha-3b

/lit_parrot/model.py:201 in forward:
k = cache_k.index_copy(2, input_pos, k)
RuntimeError: index_copy_(): self and source expected to have the same dtype, but got (self) Float and (source) BFloat16

Create a table of results for our supported checkpoints

We support a large number of checkpoints. And there's a multitude of scripts that can be run.

Users often ask questions like "can I run X script with Y model given Z memory?" or "is X (script, model) faster than Y (script, model)?"

The idea would be to collect data in a Markdown table that we can point to in order to answer these questions.

The data should always be collected from the same machine (our 8xA100 node).
Some scripts will have to specify the hparams used.
We can pick out a subset of the checkpoints to start with.

For example:

generate/base.py --precision bf16-true

Model | tokens/sec | Memory (GB)
pythia-6.9b | ... | ...
falcon-7b | ... | ...
stablelm-base-alpha-7b | ... | ...

Download documentation needs updating, --repo_id required

Document says:

python scripts/download.py stabilityai/stablelm-base-alpha-3b

Actually required

python scripts/download.py --repo_id stabilityai/stablelm-base-alpha-3b

I could just make edits as I go along and send a PR if you wish.

RuntimeError: handle_0 INTERNAL ASSERT FAILED at "../c10/cuda/driver_api.cpp":15, please report a bug to PyTorch.

I've been trying for some time now and always run into this error. Everything prior worked. What am I doing wrong?
RTX 3090 (24 GB)
Windows 10, but running Ubuntu via WSL; maybe that's the problem, but I don't want to install Ubuntu on a new partition.

python3 finetune/adapter_v2.py --data_dir data/alpaca --checkpoint_dir checkpoints/tiiuae/falcon-7b --out_dir out/adapter/alpaca
/usr/lib/python3/dist-packages/pkg_resources/__init__.py:116: PkgResourcesDeprecationWarning: 1.1build1 is an invalid version and will not be supported in a future release
warnings.warn(
/usr/lib/python3/dist-packages/pkg_resources/__init__.py:116: PkgResourcesDeprecationWarning: 1.1build1 is an invalid version and will not be supported in a future release
warnings.warn(
Global seed set to 1337
Loading model 'checkpoints/tiiuae/falcon-7b/lit_model.pth' with {'block_size': 2048, 'vocab_size': 50254, 'padding_multiple': 512, 'padded_vocab_size': 65024, 'n_layer': 32, 'n_head': 71, 'n_embd': 4544, 'rotary_percentage': 1.0, 'parallel_residual': True, 'bias': False, 'n_query_groups': 1, 'shared_attention_norm': True, 'adapter_prompt_length': 10, 'adapter_start_layer': 2}
Number of trainable parameters: 3839186
/usr/local/lib/python3.10/dist-packages/lightning/fabric/fabric.py:828: PossibleUserWarning: The model passed to Fabric.setup() has parameters on different devices. Since move_to_device=True, all parameters will be moved to the new device. If this is not desired, set Fabric.setup(..., move_to_device=False).
rank_zero_warn(
iter 0: loss 2.7154, time: 2929.28ms
Traceback (most recent call last):
File "/root/lit-parrot/finetune/adapter_v2.py", line 254, in
CLI(main)
File "/usr/local/lib/python3.10/dist-packages/jsonargparse/cli.py", line 85, in CLI
return _run_component(component, cfg_init)
File "/usr/local/lib/python3.10/dist-packages/jsonargparse/cli.py", line 147, in _run_component
return component(**cfg)
File "/root/lit-parrot/finetune/adapter_v2.py", line 90, in main
train(fabric, model, optimizer, train_data, val_data, checkpoint_dir, out_dir)
File "/root/lit-parrot/finetune/adapter_v2.py", line 126, in train
logits = model(input_ids)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/lightning/fabric/wrappers.py", line 115, in forward
output = self._forward_module(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in call_impl
return forward_call(*args, **kwargs)
File "/root/lit-parrot/lit_parrot/adapter.py", line 95, in forward
x, *
= block(x, (cos, sin), mask, max_seq_length)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/root/lit-parrot/lit_parrot/adapter.py", line 140, in forward
h, new_kv_cache, new_adapter_kv_cache = self.attn(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/root/lit-parrot/lit_parrot/adapter.py", line 241, in forward
y = y + self.gating_factor * ay
RuntimeError: handle_0 INTERNAL ASSERT FAILED at "../c10/cuda/driver_api.cpp":15, please report a bug to PyTorch.

Generation of text that is longer than the context window is no longer possible

#39 removed the ability for the generate function to handle longer sequences.
max_seq_length here is another name for "block size" or "context size" and is model specific. It does not express how long we want the newly generated text to be; that's handled by "max_new_tokens".

Due to this misunderstanding, the generate function can now no longer generate text longer than the context size. If you want to keep this limitation, I recommend removing one of the two size limits. But for correctness, I would revert the change.

Why have a default max_seq_length of 256?

I noticed that both the data prep / tokenization script (https://github.com/Lightning-AI/lit-parrot/blob/main/scripts/prepare_alpaca.py#L26) and the fine-tuning scripts (https://github.com/Lightning-AI/lit-parrot/blob/main/finetune/adapter.py#L41, https://github.com/Lightning-AI/lit-parrot/blob/main/finetune/adapter_v2.py#L46) have max_seq_length=256.

While this does seem to speed up tokenization, it has the unfortunate property of truncating fine-tuning inputs, and it also requires changing both scripts to use the full context length of a language model. I'm curious why this parameter was added and whether it might be possible to switch to a default of None or 4096.
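
One possible direction, sketched here under the assumption that the prepared samples are dicts with a tokenized "input_ids" field (function and argument names are illustrative): treat None as "fit the longest sample, capped at the model's block size", so nothing is silently truncated unless explicitly requested.

from typing import Optional

def resolve_max_seq_length(data: list, block_size: int, max_seq_length: Optional[int] = None) -> int:
    # data: list of dicts with a tokenized "input_ids" entry
    if max_seq_length is not None:
        return max_seq_length
    longest = max(len(sample["input_ids"]) for sample in data)
    # default: cover the longest sample without exceeding the model's context window
    return min(longest, block_size)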

NAN training loss after couple of steps

When running fine-tuning with stablelm-base-alpha-3b on Alpaca, the fine-tune works well in the first couple of iterations, but the training loss becomes NaN after some iterations. Could you please help me out with this issue? BTW, this was run on 1 GPU on a g5.16xlarge (AWS SageMaker).

Loading model 'checkpoints/stabilityai/stablelm-base-alpha-3b/lit_model.pth' with {'block_size': 4096, 'vocab_size': 50254, 'padding_multiple': 512, 'padded_vocab_size': 50688, 'n_layer': 16, 'n_head': 32, 'n_embd': 4096, 'rotary_percentage': 0.25, 'parallel_residual': True, 'bias': True, 'n_query_groups': 32, 'shared_attention_norm': False, 'adapter_prompt_length':10, 'adapter_start_layer': 2}
Number of trainable parameters: 2125248
/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/lightning/fabric/fabric.py:828: PossibleUserWarning: The model passed to Fabric.setup() has parameters on different devices. Since move_to_device=True, all parameters will be moved to the new device. If this is not desired, set Fabric.setup(..., move_to_device=False).
rank_zero_warn(
iter 0: loss 3.5421, time: 174.07ms
iter 1: loss 3.0288, time: 95.89ms
iter 2: loss 3.5571, time: 60.14ms
iter 3: loss 2.8494, time: 88.95ms
iter 4: loss 3.2140, time: 64.26ms
iter 5: loss 2.7726, time: 67.84ms
iter 6: loss 2.7332, time: 66.63ms
iter 7: loss 3.1365, time: 67.12ms
iter 8: loss 2.6164, time: 88.71ms
iter 9: loss 2.6239, time: 90.34ms
iter 10: loss 2.7440, time: 98.67ms
iter 11: loss 2.9421, time: 64.48ms
iter 12: loss 2.5184, time: 97.68ms
iter 13: loss 2.7282, time: 61.63ms
iter 14: loss 1.9213, time: 180.24ms
iter 15: loss 2.5665, time: 96.10ms
iter 16: loss 3.0199, time: 65.29ms
iter 17: loss 3.4083, time: 66.38ms
iter 18: loss 3.0120, time: 61.52ms
iter 19: loss 2.6137, time: 96.16ms
iter 20: loss 2.6338, time: 88.55ms
iter 21: loss 2.6259, time: 67.08ms
iter 22: loss 3.1457, time: 64.26ms
iter 23: loss 2.7812, time: 95.88ms
iter 24: loss 2.5923, time: 64.98ms
iter 25: loss 2.4579, time: 91.93ms
iter 26: loss 2.8956, time: 61.76ms
iter 27: loss 3.5309, time: 57.92ms
iter 28: loss 2.8725, time: 67.91ms
iter 29: loss 2.9909, time: 90.01ms
iter 30: loss 2.6652, time: 121.70ms
iter 31: loss 3.2488, time: 58.30ms
iter 32: loss 3.0665, time: 90.61ms
iter 33: loss 3.2830, time: 58.08ms
iter 34: loss 2.6600, time: 116.47ms
iter 35: loss 2.6636, time: 136.96ms
iter 36: loss 3.6505, time: 58.66ms
iter 37: loss 2.7473, time: 89.21ms
iter 38: loss 2.9823, time: 87.24ms
iter 39: loss 2.8799, time: 85.97ms
iter 40: loss 2.6276, time: 114.52ms
iter 41: loss 2.3663, time: 66.84ms
iter 42: loss 3.0142, time: 88.69ms
iter 43: loss 3.0303, time: 64.94ms
iter 44: loss 4.0041, time: 65.64ms
iter 45: loss 3.3370, time: 59.52ms
iter 46: loss 3.3909, time: 65.03ms
iter 47: loss 3.1888, time: 54.24ms
iter 48: loss 2.6625, time: 91.05ms
iter 49: loss 3.1856, time: 66.61ms
iter 50: loss 3.5569, time: 57.50ms
iter 51: loss 3.0958, time: 66.84ms
iter 52: loss 3.4789, time: 67.88ms
iter 53: loss 3.2668, time: 64.46ms
iter 54: loss 3.1411, time: 65.62ms
iter 55: loss 2.9815, time: 124.00ms
iter 56: loss 2.6963, time: 114.22ms
iter 57: loss 2.9008, time: 97.70ms
iter 58: loss 3.0037, time: 64.61ms
iter 59: loss 2.8624, time: 115.96ms
iter 60: loss 3.0150, time: 66.87ms
iter 61: loss 2.6633, time: 97.41ms
iter 62: loss 2.7912, time: 114.09ms
iter 63: loss 2.7428, time: 158.58ms
iter 64: loss nan, time: 86.94ms
iter 65: loss nan, time: 91.27ms
iter 66: loss nan, time: 84.82ms
iter 67: loss nan, time: 66.46ms
iter 68: loss nan, time: 65.96ms
iter 69: loss nan, time: 97.37ms
iter 70: loss nan, time: 115.41ms
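
Not a fix for the root cause, but a guard like the following (a sketch; loss and iter_num are assumed to be the variables already used inside the training loop) makes it easier to stop at the exact step where things blow up and inspect the offending batch:

import torch

# inside the training loop, right after the loss is computed
if not torch.isfinite(loss):
    raise RuntimeError(f"Loss became non-finite at iter {iter_num}: {loss.item()}")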

gptq quantization fails ModuleNotFoundError

Dear team,

Thanks a lot for reducing the barrier to entry for working with and using open-source LLMs. I was not able to quantize a 2.8B model with GPTQ on my modest RTX 2080.
I got the following error:

python quantize/gptq.py --checkpoint_dir checkpoints/EleutherAI/pythia-2.8b-deduped --dtype bfloat16
Loading model 'checkpoints/EleutherAI/pythia-2.8b-deduped/lit_model.pth' with {'block_size': 2048, 'vocab_size': 50254, 'padding_multiple': 128, 'padded_vocab_size': 50304, 'n_layer': 32, 'n_head': 32, 'n_embd': 2560, 'rotary_percentage': 0.25, 'parallel_residual': True, 'bias': True, 'n_query_groups': 32, 'shared_attention_norm': False}
Time to load model: 9.79 seconds.
Traceback (most recent call last):
  File "/awesome-project/lit-parrot/quantize/gptq.py", line 376, in <module>
    CLI(main)
  File "/env/lib/python3.10/site-packages/jsonargparse/cli.py", line 85, in CLI
    return _run_component(component, cfg_init)
  File "/env/on-device-llm/lib/python3.10/site-packages/jsonargparse/cli.py", line 147, in _run_component
    return component(**cfg)
  File "/awesome-project/lit-parrot/quantize/gptq.py", line 357, in main
    test_string = get_sample_data()
  File "/awesome-project/lit-parrot/quantize/gptq.py", line 214, in get_sample_data
    from datasets import load_dataset
ModuleNotFoundError: No module named 'datasets'

The issue seems to be related to the datasets package. Could you kindly provide a pointer to fix it?

Thanks in advance!
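
For what it's worth, the missing module is the Hugging Face datasets package, which is not installed by the base requirements; installing it should clear this particular error:

pip install datasets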

Caches should not persist across multiple generate calls.

When running the generate function twice on the same model, the caches from the first generation need to be torn down before another generation; otherwise, we get an error like the one below.

/content/lit-parrot/lit_parrot/model.py in forward(self, x, rope, mask, max_seq_length, input_pos, kv_cache)
    205 
    206         # efficient attention using Flash Attention CUDA kernels
--> 207         y = F.scaled_dot_product_attention(q, k, v, attn_mask=mask, dropout_p=0.0, scale=1.0 / math.sqrt(head_size))
    208 
    209         y = y.transpose(1, 2).contiguous().view(B, T, C)  # re-assemble all head outputs side by side

RuntimeError: The size of tensor a (7) must match the size of tensor b (8) at non-singleton dimension 3

We can add a context manager in the Model class and put the generation code under it.

from contextlib import contextmanager


class Parrot(Parrot):
    @contextmanager
    def cache(self):
        # reset the KV, RoPE, and mask caches once generation finishes
        try:
            yield
        finally:
            self.kv_caches = []
            self.rope_cache = None
            self.mask_cache = None


# inside generate function
...
with model.cache():
    # generate max_new_tokens tokens
    for _ in range(max_new_tokens):
        x = idx.index_select(0, input_pos).view(1, -1)

        # forward
        logits = model(x, max_seq_length, input_pos)
        logits = logits[0, -1] / temperature
...

Cached KVs not implemented on Adapter, causing errors.

Adapter inherits most methods from the BaseModel. The Adapter's __init__ method, Block's forward method, and CausalSelfAttention's forward method don't implement the cached-KV logic, causing errors.

The errors mostly come from the forward methods in the Adapter.
Example:

AttributeError: 'Parrot' object has no attribute 'rope_cache' 
TypeError: Block.forward() takes 2 positional arguments but 7 were given

We should add (see the signature sketch after this list):

  • rope_cache, mask_cache, and kv_caches attributes to the Adapter's __init__ method,
  • rope, mask, max_seq_length, input_pos, and kv_cache to the inputs of the Block and CausalSelfAttention forward methods,
  • a returned kv_cache from the Block and CausalSelfAttention forward methods.
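
A rough outline of the resulting signature (the argument names follow the base model's forward shown in other issues here; the type aliases and the extra adapter cache are illustrative, not the final implementation):

from typing import Optional, Tuple

import torch
import torch.nn as nn

RoPECache = Tuple[torch.Tensor, torch.Tensor]
KVCache = Tuple[torch.Tensor, torch.Tensor]

class Block(nn.Module):
    def forward(
        self,
        x: torch.Tensor,
        rope: RoPECache,
        mask: torch.Tensor,
        max_seq_length: int,
        input_pos: Optional[torch.Tensor] = None,
        kv_cache: Optional[KVCache] = None,
        adapter_kv_cache: Optional[KVCache] = None,
    ) -> Tuple[torch.Tensor, Optional[KVCache], Optional[KVCache]]:
        # thread the caches through the attention layer and return the updated ones
        ...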

RoPE precision issue

One of the CUDA tests is failing: pytest tests/test_model.py::test_bfloat16_llama_init

E       RuntimeError: Expected query, key, and value to have the same dtype, but got query.dtype: float key.dtype: float and value.dtype: c10::BFloat16 instead.

I think there's a bug in how the dtype is managed in the RoPE code.

Originally posted by @carmocca in #11 (comment)
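
A plausible mitigation sketch (not the lit-parrot implementation; names are illustrative): keep the cos/sin cache in float32 for accuracy, but cast it to the activation dtype at the point where it is applied, so q, k, and v end up with a consistent dtype.

import torch

def apply_rope_sketch(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # x: (..., head_size); cos/sin may be float32 even when x is bfloat16
    head_size = x.size(-1)
    x1, x2 = x[..., : head_size // 2], x[..., head_size // 2 :]
    rotated = torch.cat((-x2, x1), dim=-1)
    return x * cos.to(x.dtype) + rotated * sin.to(x.dtype)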

Assert in generate.py needs to go...

generate.py line 45: assert max_seq_length <= T_new breaks otherwise-working training code. There is no good reason for this assertion, IMHO.

I need to run with a long max_seq_length to learn from some longer passages in my instruction set. Just because the specific instruction used in validation is shorter than the longest one required is no reason to abort, and I discovered this assertion only after about an hour of training. With the assertion gone and training restarted, everything is working.

Out of memory issue for fine-tuning RedPajama-INCITE-7B-Base with 1 GPU

Hi, I faced an out-of-memory issue fine-tuning RedPajama-INCITE-7B-Base on Alpaca data with 1 GPU (g5.16xlarge, 24 GiB of GPU memory). With adapter_v2.py, I changed learning_rate = 3e-3 and micro_batch_size = 1. The fine-tuning works really well in the beginning and runs into an out-of-memory issue after 65498 iterations. Does anyone know how to solve it? Thanks!

iter 65496: loss 1.2029, time: 101.54ms
iter 65497: loss 1.5817, time: 184.24ms
iter 65498: loss 1.4716, time: 101.98ms
Traceback (most recent call last):
File "/home/ec2-user/SageMaker/lit-parrot/finetune/adapter_v2.py", line 281, in
CLI(setup)
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/jsonargparse/cli.py", line 85, in CLI
return _run_component(component, cfg_init)
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/jsonargparse/cli.py", line 147, in _run_component
return component(**cfg)
File "/home/ec2-user/SageMaker/lit-parrot/finetune/adapter_v2.py", line 71, in setup
fabric.launch(main, data_dir, checkpoint_dir, out_dir)
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 732,in launch
return self._wrap_and_launch(function, self, *args, **kwargs)
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 814,in _wrap_and_launch
return to_run(*args, **kwargs)
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 823,in _wrap_with_setup
return to_run(*args, **kwargs)
File "/home/ec2-user/SageMaker/lit-parrot/finetune/adapter_v2.py", line 105, in main
train(fabric, model, optimizer, train_data, val_data, checkpoint_dir, out_dir)
File "/home/ec2-user/SageMaker/lit-parrot/finetune/adapter_v2.py", line 148, in train
fabric.backward(loss / gradient_accumulation_iters)
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 387,in backward
self._strategy.backward(tensor, module, *args, **kwargs)
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/lightning/fabric/strategies/strategy.py", line 179, in backward
self.precision.backward(tensor, module, *args, **kwargs)
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/lightning/fabric/plugins/precision/precision.py", line 89, in backward
tensor.backward(*args, **kwargs)
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/torch/_tensor.py", line 491, in backward
torch.autograd.backward(
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/torch/autograd/init.py", line 204,in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 200.00 MiB. GPU 0 has a total capacty of 22.19 GiB of which 106.50 MiB is free. Including non-PyTorch memory, this process has 22.08 GiB memory in use. Of the allocated memory 20.42 GiB is allocated by PyTorch, and 1.36 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
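
Besides lowering micro_batch_size or max_seq_length, the allocator hint mentioned at the end of the error message itself is worth a try, since it can reduce fragmentation; for example:

PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 python finetune/adapter_v2.py ...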

Loss NaN while fine-tuning Falcon 7B

By following the instructions provided for fine-tuning Falcon 7B and leaving all parameters at their defaults, I could start fine-tuning, but after 60 iterations the loss is NaN. Could anyone explain to me what the issue might be? URGENT

finetune/adapter.py not loading the train_data from train.pt

I am getting the following error:

python finetune/adapter.py  \
   --data_dir data/dolly \
   --checkpoint_dir checkpoints/togethercomputer/RedPajama-INCITE-Base-3B-v1  \
    --out_dir out/adapter/dolly
Global seed set to 1337
Loading model 'checkpoints/togethercomputer/RedPajama-INCITE-Base-3B-v1/lit_model.pth' with {'block_size': 256, 'vocab_size': 50254, 'padding_multiple': 256, 'padded_vocab_size': 50432, 'n_layer': 32, 'n_head': 32, 'n_embd': 2560, 'rotary_percentage': 1.0, 'parallel_residual': False, 'bias': True, 'n_query_groups': 32, 'shared_attention_norm': False, 'adapter_prompt_length': 10, 'adapter_start_layer': 2}
Number of trainable parameters: 768960
Traceback (most recent call last):
  File "/workspace/lit-parrot/finetune/adapter.py", line 246, in <module>
    CLI(main)
  File "/workspace/lit-parrot/litparrot/lib/python3.10/site-packages/jsonargparse/cli.py", line 85, in CLI
    return _run_component(component, cfg_init)
  File "/workspace/lit-parrot/litparrot/lib/python3.10/site-packages/jsonargparse/cli.py", line 147, in _run_component
    return component(**cfg)
  File "/workspace/lit-parrot/finetune/adapter.py", line 85, in main
    train(fabric, model, optimizer, train_data, val_data, checkpoint_dir, out_dir)
  File "/workspace/lit-parrot/finetune/adapter.py", line 119, in train
    input_ids, targets = get_batch(fabric, train_data)
  File "/workspace/lit-parrot/finetune/adapter.py", line 184, in get_batch
    ix = torch.randint(len(data), (micro_batch_size,))
RuntimeError: random_ expects 'from' to be less than 'to', but got from=0 >= to=0

When load_datasets() is called, it correctly loads the data from the test.pt file, but for some reason it's not loading the data from train.pt, even though both files exist in the same directory (data/dolly).

These files were created by running:

python scripts/prepare_custom.py \
    --destination_path data/dolly \
    --checkpoint_dir checkpoints/togethercomputer/RedPajama-INCITE-Base-3B-v1

Avoid the `convert_hf_checkpoint` step

https://github.com/Lightning-AI/lit-parrot/blob/main/scripts/convert_hf_checkpoint.py is a script that converts a list of *.bin files into a single checkpoint file: lit_model.pth.

This has the following disadvantages:

  • it adds one extra step to get started,
  • the checkpoint weights are duplicated in the filesystem,
  • it takes time and memory to convert.

This is particularly interesting for inference. For training/fine-tuning, the generated checkpoints will still be single files. We would need to support loading both options.

Instead, we could write a function lazy_load_from(checkpoint_dir) that does the weight mapping on the fly.
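
A very rough sketch of what such a helper could look like, assuming a weight_map dict in the spirit of the one in convert_hf_checkpoint.py (Hugging Face parameter name -> lit-parrot parameter name); it still loads tensors into CPU memory, so a real implementation would want to be lazier than this:

from pathlib import Path
from typing import Dict

import torch

def lazy_load_from(checkpoint_dir: Path, weight_map: Dict[str, str]) -> Dict[str, torch.Tensor]:
    # walk the original *.bin shards and rename keys on the fly,
    # instead of materializing a converted lit_model.pth on disk
    state_dict: Dict[str, torch.Tensor] = {}
    for bin_file in sorted(checkpoint_dir.glob("*.bin")):
        shard = torch.load(bin_file, map_location="cpu")
        for hf_name, tensor in shard.items():
            state_dict[weight_map.get(hf_name, hf_name)] = tensor
    return state_dict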

DeepSpeed and bf16-true

In the finetuning scripts, we only allow

precision: Literal["bf16-true", "32-true"] = "bf16-true",

But we also use DeepSpeed when devices > 1. However, in this case, you'd get a

ValueError: `precision='bf16-true')` is not supported in DeepSpeed. `precision` must be one of: ('32-true', '16-mixed', 'bf16-mixed').

Should we allow bf16-mixed, or should we switch to FSDP? Or something else?
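
For reference, Fabric's FSDP strategy should accept bf16-true, so the multi-device branch could plausibly become something like the following (a sketch, with devices as in the existing script):

import lightning as L

fabric = L.Fabric(
    devices=devices,
    strategy="fsdp" if devices > 1 else "auto",  # swap DeepSpeed for FSDP on multi-GPU
    precision="bf16-true",
)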

Add chat script for adapter checkpoints

import json
import os
import sys
import time
import warnings
from pathlib import Path
from typing import Optional

import lightning as L
import torch

from generate import generate
from lit_parrot import Tokenizer
from lit_parrot.adapter import Parrot, Config
from lit_parrot.utils import EmptyInitOnDevice, lazy_load, check_valid_checkpoint_dir
sys.path.append(os.path.join(os.path.dirname(__file__), 'scripts'))
from prepare_alpaca import generate_prompt


def main(
    prompt: str = "What would be a good movie to see, and wy do you recommend it?",
    input_string: str = "",
    interactive: bool = False,
    adapter_path: Path = Path("out/adapter/alpaca/lit_model_adapter_finetuned.pth"),
    #checkpoint_dir: Path = Path(f"checkpoints/stabilityai/stablelm-base-alpha-3b"),
    checkpoint_dir: Path = Path(f"checkpoints/stabilityai/stablelm-tuned-alpha-3b"),
    quantize: Optional[str] = None,
    max_new_tokens: int = 100,
    top_k: int = 200,
    temperature: float = 0.8,
    max_seq_length: int = 1250  # set this to what you used during fine tuning
) -> None:
    """Generates a response based on a given instruction and an optional input.
    This script will only work with checkpoints from the instruction-tuned Parrot-Adapter model.
    See `finetune_adapter.py`.

    Args:
        prompt: The prompt/instruction (Alpaca style).
        adapter_path: Path to the checkpoint with trained adapter weights, which are the output of
            `finetune_adapter.py`.
        checkpoint_dir: The path to the checkpoint folder with pretrained Parrot weights.
        input_string: Optional input (Alpaca style).
        quantize: Whether to quantize the model and using which method:
            ``"llm.int8"``: LLM.int8() mode,
            ``"gptq.int4"``: GPTQ 4-bit mode.
        max_new_tokens: The number of generation steps to take.
        top_k: The number of top most probable tokens to consider in the sampling process.
        temperature: A value controlling the randomness of the sampling process. Higher values result in more random
            samples.
        max_seq_length: The maximum sequence length; defaults to 1250. Set this to the value you used during fine-tuning.
    """
    check_valid_checkpoint_dir(checkpoint_dir)

    fabric = L.Fabric(devices=1)
    dtype = torch.bfloat16 if fabric.device.type == "cuda" and torch.cuda.is_bf16_supported() else torch.float32

    with open(checkpoint_dir / "lit_config.json") as fp:
        config = Config(**json.load(fp))

    print("Loading model ...", file=sys.stderr)
    t0 = time.time()
    with EmptyInitOnDevice(device=fabric.device, dtype=dtype, quantization_mode=quantize):
        model = Parrot(config)
    with lazy_load(checkpoint_dir / "lit_model.pth") as pretrained_checkpoint, lazy_load(
        adapter_path
    ) as adapter_checkpoint:
        # 1. Load the pretrained weights
        model.load_state_dict(pretrained_checkpoint, strict=False)
        # 2. Load the fine-tuned adapter weights
        model.load_state_dict(adapter_checkpoint, strict=False)

    print(f"Time to load model: {time.time() - t0:.02f} seconds.", file=sys.stderr)

    model.eval()
    model = fabric.setup(model)

    tokenizer = Tokenizer(checkpoint_dir / "tokenizer.json", checkpoint_dir / "tokenizer_config.json")


    while True:
        if interactive:
            try:
                prompt = input(">> Prompt: ")
            except KeyboardInterrupt:
                break
            if not prompt:
                break
        else:
            print(f'Prompt: {prompt}')

        sample = {"instruction": prompt, "input": input_string}
        prompt = generate_prompt(sample)
        encoded = tokenizer.encode(prompt, device=model.device)
        prompt_length = encoded.size(0)

        t0 = time.perf_counter()
        y = generate(
           model, 
           idx=encoded, 
           max_new_tokens=max_new_tokens, 
           max_seq_length=max_seq_length,
           temperature=temperature, 
           top_k=top_k, 
           eos_id=tokenizer.eos_id
        )
        t = time.perf_counter() - t0

        output = tokenizer.decode(y)
        output = output.split("### Response:")[1].strip()
        print(output)

        tokens_generated = y.size(0) - prompt_length
        print(f"\n\nTime for inference: {t:.02f} sec total, {tokens_generated / t:.02f} tokens/sec", file=sys.stderr)
        if fabric.device.type == "cuda":
            print(f"Memory used: {torch.cuda.max_memory_reserved() / 1e9:.02f} GB", file=sys.stderr)

        if not interactive:
            break


if __name__ == "__main__":
    from jsonargparse import CLI

    torch.set_float32_matmul_precision("high")
    warnings.filterwarnings(
        # Triggered internally at ../aten/src/ATen/EmptyTensor.cpp:31
        "ignore",
        message="ComplexHalf support is experimental and many operators don't support it yet",
    )
    CLI(main)
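
Assuming the script above is saved as chat_adapter.py (a hypothetical name), it could be invoked like this:

python chat_adapter.py --interactive true \
    --checkpoint_dir checkpoints/stabilityai/stablelm-base-alpha-3b \
    --adapter_path out/adapter/alpaca/lit_model_adapter_finetuned.pth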

Fix CPU OOM on Windows

__________________________________ test_main __________________________________

_ = <MagicMock name='is_bf16_supported' id='1532430881920'>
tmp_path = WindowsPath('C:/Users/runneradmin/AppData/Local/Temp/pytest-of-runneradmin/pytest-0/test_main0')
monkeypatch = <_pytest.monkeypatch.MonkeyPatch object at 0x00000164CBFF8D00>

    @mock.patch("torch.cuda.is_bf16_supported", return_value=False)
    def test_main(_, tmp_path, monkeypatch):
        generate = load_generate_script()
    
        config_path = tmp_path / "config"
        config_path.write_text("{}")
    
        class FabricMock(Mock):
            @property
            def device(self):
                return torch.device("cpu")
    
        monkeypatch.setattr(generate.L, "Fabric", FabricMock)
        load_mock = Mock()
        load_mock.return_value = load_mock
        load_mock.__enter__ = Mock()
        load_mock.__exit__ = Mock()
        monkeypatch.setattr(generate, "lazy_load", load_mock)
        tokenizer_mock = Mock()
        tokenizer_mock.return_value.encode.return_value = torch.tensor([[1, 2, 3]])
        tokenizer_mock.return_value.decode.return_value = "foo bar baz"
        monkeypatch.setattr(generate, "Tokenizer", tokenizer_mock)
        generate_mock = Mock()
        generate_mock.return_value = torch.tensor([[3, 2, 1]])
        monkeypatch.setattr(generate, "generate", generate_mock)
    
        num_samples = 2
        out = StringIO()
        with redirect_stdout(out):
>           generate.main(temperature=2.0, top_k=2, num_samples=num_samples, config_path=config_path)

tests\test_generate.py:83: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
generate.py:122: in main
    model = StableLM(config)
lit_stablelm\model.py:58: in __init__
    h=nn.ModuleList(Block(config) for _ in range(config.n_layer)),
C:\hostedtoolcache\windows\Python\3.10.11\x64\lib\site-packages\torch\nn\modules\container.py:279: in __init__
    self += modules
C:\hostedtoolcache\windows\Python\3.10.11\x64\lib\site-packages\torch\nn\modules\container.py:320: in __iadd__
    return self.extend(modules)
C:\hostedtoolcache\windows\Python\3.10.11\x64\lib\site-packages\torch\nn\modules\container.py:401: in extend
    for i, module in enumerate(modules):
lit_stablelm\model.py:58: in <genexpr>
    h=nn.ModuleList(Block(config) for _ in range(config.n_layer)),
lit_stablelm\model.py:103: in __init__
    self.attn = CausalSelfAttention(config)
lit_stablelm\model.py:121: in __init__
    self.proj = nn.Linear(config.n_embd, config.n_embd, bias=True)
C:\hostedtoolcache\windows\Python\3.10.11\x64\lib\site-packages\torch\nn\modules\linear.py:96: in __init__
    self.weight = Parameter(torch.empty((out_features, in_features), **factory_kwargs))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <lit_stablelm.utils.EmptyInitOnDevice object at 0x00000164CBFFBA60>
func = <built-in method empty of type object at 0x00007FFD63CAC560>, types = ()
args = ((4096, 4096),)
kwargs = {'device': device(type='cpu'), 'dtype': torch.float32}

    def __torch_function__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        if getattr(func, "__module__", None) == "torch.nn.init":
            if "tensor" in kwargs:
                return kwargs["tensor"]
            else:
                return args[0]
        if (
            self.device is not None
            and func in torch.utils._device._device_constructors()
            and kwargs.get("device") is None
        ):
            kwargs["device"] = self.device
        if (
            self.dtype is not None
            and func in torch.utils._device._device_constructors()
            and kwargs.get("dtype") is None
        ):
            kwargs["dtype"] = self.dtype
>       return func(*args, **kwargs)
E       RuntimeError: [enforce fail at ..\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 67108864 bytes.

lit_stablelm\utils.py:120: RuntimeError
---------------------------- Captured stderr call -----------------------------
Loading model 'checkpoints\\lit-stablelm\\stablelm-base-alpha-3b\\lit-stablelm.pth' with {'block_size': 4096, 'vocab_size': 50254, 'padding_multiplier': 512, 'padded_vocab_size': 50688, 'n_layer': 16, 'n_head': 32, 'n_embd': 4096, 'rotary_percentage': 0.25}

If we cannot fix it, just skip the test on Windows
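
If skipping is the way to go, a standard marker would do it (a sketch):

import sys

import pytest

@pytest.mark.skipif(sys.platform == "win32", reason="CPU OOM on Windows CI runners")
def test_main(tmp_path, monkeypatch):
    ...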

Query Regarding Minimum Hardware Requirements for Fine-tuning and Inference

Hi there,

Firstly, I want to express my appreciation for the insightful tutorial and the fine-tuning repository. I've found them extremely useful. 🚀

I'm looking to clarify what the minimum computer hardware requirements are for fine-tuning and inference with the models supported in this repo. I encountered some out-of-memory (OOM) issues during quantization on a system with 8GB RAM running on a CPU only.

The reason I'm asking is that I'm considering using this repo for our open-source project (OpenBBTerminal). Understanding the minimum requirements will help us ensure the widest possible user accessibility.

Thanks in advance for your help on this matter.

Add adapter tests

We are currently lacking coverage for this. We can follow the pattern used by test_generate.py and maybe add a simple forward test.
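
A minimal forward smoke test could look roughly like this (a sketch; the tiny config values are made up to keep it cheap, and the exact Config fields may need adjusting):

import torch

from lit_parrot.adapter import Config, Parrot

def test_adapter_forward():
    config = Config(block_size=16, vocab_size=16, padded_vocab_size=16, n_layer=2, n_head=2, n_embd=16)
    model = Parrot(config)
    x = torch.randint(0, config.padded_vocab_size, (1, 8))
    logits = model(x)
    assert logits.shape == (1, 8, config.padded_vocab_size)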

Falcon Loss Not Decreasing During Training

I'm using the pretraining code with Falcon 7B. I've noticed that the loss hasn't changed for 400 iterations.

iter 1: loss 11.0666, time: 13381.00ms, speed: 306 toks/s/device
....
iter 400: loss 11.0666, time: 19090.34ms, speed: 214 toks/s/device

Generate should allow a seed, rather than hard-coding it to 1234.

The seeding can greatly change the results of generate.

$ python generate.py --seed 317 --prompt "What is the capital of England?"
Loading model 'checkpoints/stabilityai/stablelm-base-alpha-3b/lit_model.pth' with {'block_size': 4096, 'vocab_size': 50254, 'padding_multiple': 512, 'padded_vocab_size': 50688, 'n_layer': 16, 'n_head': 32, 'n_embd': 4096, 'rotary_percentage': 0.25, 'parallel_residual': True}
Time to load model: 8.11 seconds.
Global seed set to 317
What is the capital of England?

Wales

3 days

Total area: 2.2 million sqkm

Population in 2019 (July)

22,700,000

County: Wales

Capital: 2.2 million sqkm
Time for inference 1: 0.74 sec total, 76.98 tokens/sec
Memory used: 7.31 GB
$ python generate.py --seed 411 --prompt "What is the capital of England?"
Loading model 'checkpoints/stabilityai/stablelm-base-alpha-3b/lit_model.pth' with {'block_size': 4096, 'vocab_size': 50254, 'padding_multiple': 512, 'padded_vocab_size': 50688, 'n_layer': 16, 'n_head': 32, 'n_embd': 4096, 'rotary_percentage': 0.25, 'parallel_residual': True}
Time to load model: 8.17 seconds.
Global seed set to 411
What is the capital of England?

The capital of England is the county of Lincolnshire (in England, it's usually called Lincolnshire
User2: This is correct. In the US, we typically refer to the county of Lincoln as Lincoln County.

I hear
Time for inference 1: 0.74 sec total, 77.05 tokens/sec
Memory used: 7.31 GB
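
A sketch of the change (other arguments of main() omitted for brevity): expose the seed as an argument and pass it through instead of the hard-coded value.

import lightning as L

def main(prompt: str = "Hello, my name is", seed: int = 1234) -> None:
    # ... other existing arguments and setup unchanged ...
    L.seed_everything(seed)

With CLI(main) from jsonargparse, the new argument is then automatically exposed as --seed on the command line.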

Problem with finetune_adapter.py along with fix

AttributeError: 'Parrot' object has no attribute 'rope_cache'
lit-parrot/lit_parrot/model.py:67 in forward
    67        if self.rope_cache is None:
    68            self.rope_cache = self.build_rope_cache(idx)

The problem is that the initialization in lit_parrot/adapter.py initializes the super-super class instead of the super class:

It should be:

class CausalSelfAttention(BaseModel):
    """A modification of lit_parrot.model.CausalSelfAttention that adds the attention
    over the adaption prompt."""

    def __init__(self, config: Config, block_idx: int) -> None:
        super().__init__(config)

instead of:

class CausalSelfAttention(nn.Module):
    """A modification of lit_parrot.model.CausalSelfAttention that adds the attention
    over the adaption prompt."""

    def __init__(self, config: Config, block_idx: int) -> None:
        super().__init__()
