Comments (2)
Hi @dannikay, the solution is in the error itself: as the error message says, you can't train a model that has been loaded with device_map='auto' in distributed mode. You can train it either by putting your code in a script and launching it with --num_processes=1:

accelerate launch --num_processes 1 train.py

or by running the script directly with python myscript.py.
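For reference, here is a minimal sketch of what such a script could look like. The device_map={"": 0} placement is an assumption on my part (not from your code): it pins the entire model to GPU 0 instead of letting device_map="auto" shard it across devices, which is what the training error complains about.

# train.py -- minimal single-GPU sketch (hypothetical; adapt to your setup)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_id = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
# Pin the whole model to a single device instead of device_map="auto";
# automatic sharding across devices is incompatible with distributed training.
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    device_map={"": 0},
    torch_dtype=torch.float16,
)
# ... build your dataset and Trainer here, then call trainer.train() ...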
Also, if you still want to use a Jupyter notebook instead of a Python script, you can use Accelerate's notebook_launcher utility, which "allows for starting multi-gpu training based on code inside of a Jupyter Notebook". Just do it like this:
from accelerate import notebook_launcher

def train_accelerate():
    import pandas as pd
    from datasets import load_dataset
    from IPython.display import HTML, display

    dataset_name = "b-mc2/sql-create-context"
    dataset = load_dataset(dataset_name, split="train")
    def display_table(dataset_or_sample):
        # A helper function to nicely display a dataset or a single sample that contains multi-line strings
        pd.set_option("display.max_colwidth", None)
        pd.set_option("display.width", None)
        pd.set_option("display.max_rows", None)

        if isinstance(dataset_or_sample, dict):
            df = pd.DataFrame(dataset_or_sample, index=[0])
        else:
            df = pd.DataFrame(dataset_or_sample)

        html = df.to_html().replace("\n", "<br>")
        styled_html = f"""<style> .dataframe th, .dataframe tbody td {{ text-align: left; padding-right: 30px; }} </style> {html}"""
        display(HTML(styled_html))
    display_table(dataset.select(range(3)))

    split_dataset = dataset.train_test_split(test_size=0.2, seed=42)
    train_dataset = split_dataset["train"]
    test_dataset = split_dataset["test"]

    print(f"Training dataset contains {len(train_dataset)} text-to-SQL pairs")
    print(f"Test dataset contains {len(test_dataset)} text-to-SQL pairs")

    PROMPT_TEMPLATE = """You are a powerful text-to-SQL model. Given the SQL tables and a natural language question, your job is to write a SQL query that answers the question.

### Table:
{context}

### Question:
{question}

### Response:
{output}"""

    def apply_prompt_template(row):
        prompt = PROMPT_TEMPLATE.format(
            question=row["question"],
            context=row["context"],
            output=row["answer"],
        )
        return {"prompt": prompt}

    train_dataset = train_dataset.map(apply_prompt_template)
    display_table(train_dataset.select(range(1)))
    from transformers import AutoTokenizer
    from huggingface_hub import login

    # Replace with your own Hugging Face access token (never share a real token publicly).
    token = "YOUR_HF_TOKEN"
    login(token=token)

    base_model_id = "mistralai/Mistral-7B-v0.1"
    # You can use a different max length if your custom dataset has shorter/longer input sequences.
    MAX_LENGTH = 256
    tokenizer = AutoTokenizer.from_pretrained(
        base_model_id,
        model_max_length=MAX_LENGTH,
        padding_side="left",
        add_eos_token=True,
    )
    tokenizer.pad_token = tokenizer.eos_token
    def tokenize_and_pad_to_fixed_length(sample):
        result = tokenizer(
            sample["prompt"],
            truncation=True,
            max_length=MAX_LENGTH,
            padding="max_length",
        )
        result["labels"] = result["input_ids"].copy()
        return result

    tokenized_train_dataset = train_dataset.map(tokenize_and_pad_to_fixed_length)
    assert all(len(x["input_ids"]) == MAX_LENGTH for x in tokenized_train_dataset)
    display_table(tokenized_train_dataset.select(range(1)))
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quantization_config = BitsAndBytesConfig(
        # Load the model with 4-bit quantization
        load_in_4bit=True,
        # Use double quantization
        bnb_4bit_use_double_quant=True,
        # Use 4-bit Normal Float for storing the base model weights in GPU memory
        bnb_4bit_quant_type="nf4",
        # De-quantize the weights to 16-bit (Brain) float before the forward/backward pass
        bnb_4bit_compute_dtype=torch.bfloat16,
        # This allows CPU offload.
        llm_int8_enable_fp32_cpu_offload=True,
    )

    # https://huggingface.co/docs/accelerate/en/usage_guides/big_modeling
    # device_map="auto" offloads part of the model to CPU in case it does not fit on the GPU.
    model = AutoModelForCausalLM.from_pretrained(
        base_model_id,
        quantization_config=quantization_config,
        low_cpu_mem_usage=True,
        device_map="auto",
        torch_dtype=torch.float16,
    )
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    # Enable gradient checkpointing to make the training more memory-efficient
    model.gradient_checkpointing_enable()
    # Set up the model for quantization-aware training, e.g. casting layers, parameter freezing, etc.
    model = prepare_model_for_kbit_training(model)

    peft_config = LoraConfig(
        task_type="CAUSAL_LM",
        # The rank of the decomposed matrices A and B to be learned during fine-tuning. A smaller number saves more GPU memory but might result in worse performance.
        r=32,
        # The coefficient for the learned ΔW factor; a larger number will typically result in a larger behavior change after fine-tuning.
        lora_alpha=64,
        # Dropout ratio for the layers in the LoRA adapters A and B.
        lora_dropout=0.1,
        # We fine-tune all linear layers in the model. It might sound like a lot, but the trainable adapter is still only **1.16%** of the whole model.
        target_modules=[
            "q_proj",
            "k_proj",
            "v_proj",
            "o_proj",
            "gate_proj",
            "up_proj",
            "down_proj",
            "lm_head",
        ],
        # Bias parameters to train. "none" is recommended to keep the original model performing equally when the adapter is turned off.
        bias="none",
    )

    peft_model = get_peft_model(model, peft_config)
    peft_model.print_trainable_parameters()
    from datetime import datetime
    import transformers
    from transformers import TrainingArguments
    import mlflow

    # DeepSpeed requires a distributed environment even when only one process is used.
    # This emulates a launcher in the notebook.
    import os

    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "9994"  # modify if RuntimeError: Address already in use
    os.environ["RANK"] = "0"
    os.environ["LOCAL_RANK"] = "0"
    os.environ["WORLD_SIZE"] = "1"
    os.environ["NCCL_DEBUG"] = "INFO"

    training_args = TrainingArguments(
        # Set this to mlflow for logging your training
        report_to="mlflow",
        # Name the MLflow run
        run_name=f"Mistral-7B-SQL-QLoRA-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}",
        # Replace with your output destination
        output_dir="YOUR_OUTPUT_DIR",
        # For the following arguments, refer to https://huggingface.co/docs/transformers/main_classes/trainer
        per_device_train_batch_size=1,
        gradient_accumulation_steps=1,
        gradient_checkpointing=True,
        optim="paged_adamw_8bit",
        bf16=True,
        learning_rate=2e-5,
        lr_scheduler_type="constant",
        max_steps=500,
        save_steps=100,
        logging_steps=100,
        warmup_steps=5,
        # https://discuss.huggingface.co/t/training-llama-with-lora-on-multiple-gpus-may-exist-bug/47005/3
        ddp_find_unused_parameters=False,
        deepspeed="ds_zero3_config.json",
    )

    trainer = transformers.Trainer(
        model=peft_model,
        train_dataset=tokenized_train_dataset,
        data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
        args=training_args,
    )
    # use_cache=True is incompatible with gradient checkpointing.
    peft_model.config.use_cache = False
    trainer.train()
notebook_launcher(train_accelerate, args=(), num_processes=1)
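Note that the TrainingArguments above reference a ds_zero3_config.json that is not shown. The exact contents depend on your hardware, but as a rough sketch (the offload and "auto" settings below are my assumptions, not from your setup), a minimal ZeRO stage-3 config could be written out like this before launching:

import json

# Hypothetical minimal ZeRO stage-3 config; "auto" values are filled in by the
# HF Trainer integration. Tune the offload settings for your hardware.
ds_zero3_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"},
    },
    "bf16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}
with open("ds_zero3_config.json", "w") as f:
    json.dump(ds_zero3_config, f, indent=2)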
In the code above, your training code is wrapped in a function which is passed to notebook_launcher with the num_processes=1 argument (1 for using 1 GPU).
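If more GPUs become available later, the same pattern scales by raising num_processes. A hypothetical two-GPU launch would look like the line below, though note it requires loading the model per process rather than with device_map="auto":

notebook_launcher(train_accelerate, args=(), num_processes=2)  # one process per GPU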
Cheers!