
sagemaker-huggingface-llama-2-samples's Introduction

Getting Started with Llama 2 and Hugging Face

This repository contains instructions, examples, and tutorials for getting started with Llama 2 and Hugging Face libraries such as transformers and datasets.

Training

Requirements

Before we can start, make sure you have met the following requirements:

  • AWS account with sufficient service quota for GPU instances
  • AWS CLI installed
  • AWS IAM user configured in the CLI with permission to create and manage EC2 instances (a quick setup check is sketched below)
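
As a quick check that the account, CLI, and IAM setup work, something like the following can be run before kicking off a training job (a sketch; the fallback role name is an assumption and has to match your own execution role):

import sagemaker
import boto3

sess = sagemaker.Session()
try:
    # works inside SageMaker Studio / notebook instances
    role = sagemaker.get_execution_role()
except ValueError:
    # outside of SageMaker, look the execution role up by name (assumed name)
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")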

Commands

To monitor GPU utilization while a job is running:

watch -n0.1 nvidia-smi

sagemaker-huggingface-llama-2-samples's People

Contributors

mukundt, philschmid


sagemaker-huggingface-llama-2-samples's Issues

The model did not return a loss from the inputs

Hello!
I get this ValueError:

"ValueError: The model did not return a loss from the inputs, only the following keys: logits. For reference, the inputs it received are input_ids,attention_mask."

whereas my tokenized dataset is structured like this:
Dataset({
features: ['input_ids', 'attention_mask', 'labels'],
num_rows: 24702
})
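
A quick way to narrow this down (a sketch; lm_dataset stands for the tokenized dataset shown above) is to push a couple of examples through the same collator the Trainer uses and check whether "labels" survives collation, since the error says the model only received input_ids and attention_mask:

from transformers import default_data_collator

batch = default_data_collator([lm_dataset[i] for i in range(2)])
print(batch.keys())  # expected to include: input_ids, attention_mask, labels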

Fine-tuning Llama-2-13b on ml.g4dn.12xlarge is taking too much time.

So I was fine-tuning Llama 2 13B on a different dataset. I used the provided code and tweaked it a little just to preprocess that specific dataset, then ran it as a SageMaker training job.
The training ran fine but was very slow: even after 24 hours it had only reached 7% on an ml.g4dn.12xlarge instance.
Can anyone please guide me on how to increase the training speed?
Unfortunately I cannot use ml.g5.4xlarge, since that instance type is not available in the region I am working in right now.
Thanks.

Error running fine-tuning on SageMaker

Hi, I'm facing this issue when following the fine-tuning example for Llama-2-13b-chat. It happens while the model is being loaded:

╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /opt/ml/code/train_sagemaker.py:315 in │
│ │
│ 312 │ training_function(args) │
│ 313 │
│ 314 if __name__ == "__main__": │
│ ❱ 315 │ main() │
│ │
│ /opt/ml/code/train_sagemaker.py:312 in main │
│ │
│ 309 │
│ 310 def main(): │
│ 311 │ args = parse_args() │
│ ❱ 312 │ training_function(args) │
│ 313 │
│ 314 if __name__ == "__main__": │
│ 315 │ main() │
│ │
│ /opt/ml/code/train_sagemaker.py:195 in training_function │
│ │
│ 192 │ │ bnb_4bit_compute_dtype=torch.bfloat16, │
│ 193 │ ) │
│ 194 │ │
│ ❱ 195 │ model = AutoModelForCausalLM.from_pretrained( │
│ 196 │ │ args.model_id, │
│ 197 │ │ use_cache=False │
│ 198 │ │ if args.gradient_checkpointing │
│ │
│ /opt/conda/lib/python3.10/site-packages/transformers/models/auto/auto_factor │
│ y.py:471 in from_pretrained │
│ │
│ 468 │ │ │ ) │
│ 469 │ │ elif type(config) in cls._model_mapping.keys(): │
│ 470 │ │ │ model_class = _get_model_class(config, cls._model_mapping) │
│ ❱ 471 │ │ │ return model_class.from_pretrained( │
│ 472 │ │ │ │ pretrained_model_name_or_path, *model_args, config=con │
│ 473 │ │ │ ) │
│ 474 │ │ raise ValueError( │
│ │
│ /opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py:2795 │
│ in from_pretrained │
│ │
│ 2792 │ │ │ │ mismatched_keys, │
│ 2793 │ │ │ │ offload_index, │
│ 2794 │ │ │ │ error_msgs, │
│ ❱ 2795 │ │ │ ) = cls._load_pretrained_model( │
│ 2796 │ │ │ │ model, │
│ 2797 │ │ │ │ state_dict, │
│ 2798 │ │ │ │ loaded_state_dict_keys, # XXX: rename? │
│ │
│ /opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py:3109 │
│ in _load_pretrained_model │
│ │
│ 3106 │ │ │ │ # Skip the load for shards that only contain disk-off │
│ 3107 │ │ │ │ if shard_file in disk_only_shard_files: │
│ 3108 │ │ │ │ │ continue │
│ ❱ 3109 │ │ │ │ state_dict = load_state_dict(shard_file) │
│ 3110 │ │ │ │ │
│ 3111 │ │ │ │ # Mistmatched keys contains tuples key/shape1/shape2 │
│ 3112 │ │ │ │ # matching the weights in the model. │
│ │
│ /opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py:429 │
│ in load_state_dict │
│ │
│ 426 │ """ │
│ 427 │ if checkpoint_file.endswith(".safetensors") and is_safetensors_av │
│ 428 │ │ # Check format of the archive │
│ ❱ 429 │ │ with safe_open(checkpoint_file, framework="pt") as f: │
│ 430 │ │ │ metadata = f.metadata() │
│ 431 │ │ if metadata.get("format") not in ["pt", "tf", "flax"]: │
│ 432 │ │ │ raise OSError( │
╰──────────────────────────────────────────────────────────────────────────────╯
NameError: name 'safe_open' is not defined

pretrain llama-2

Hi, I want to pre-train Llama 2 on my custom dataset, starting from the published checkpoint, before fine-tuning. Can you recommend/add a script for this? Thank you!
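
These samples do not ship a pre-training script, but a minimal sketch of continued pre-training from the published checkpoint with the plain Trainer API could look like the following; the model id, corpus file, and hyperparameters are placeholders, not values from this repo:

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "meta-llama/Llama-2-7b-hf"            # assumed starting checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token        # Llama has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_id)

# hypothetical plain-text corpus
raw = load_dataset("text", data_files={"train": "my_corpus.txt"})["train"]
tokenized = raw.map(lambda s: tokenizer(s["text"], truncation=True, max_length=2048),
                    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="llama2-continued-pretraining",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        bf16=True,
        logging_steps=10,
    ),
    train_dataset=tokenized,
    # pads each batch and builds labels for causal language modeling
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()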

CUDA error encountered with both BS=2 and BS=3 for 7B and 13B.

File "/opt/conda/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/opt/conda/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 516, in forward
output = torch.nn.functional.linear(A, F.dequantize_4bit(B, state).to(A.dtype).t(), bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)
0%| | 0/2052 [00:00<?, ?it/s]
2023-07-20 15:52:05,435 sagemaker-training-toolkit INFO Waiting for the process to finish and give a return code.
2023-07-20 15:52:05,435 sagemaker-training-toolkit INFO Done waiting for a return code. Received 1 from exiting process.
2023-07-20 15:52:05,435 sagemaker-training-toolkit ERROR Reporting training FAILURE
2023-07-20 15:52:05,435 sagemaker-training-toolkit ERROR ExecuteUserScriptError:
ExitCode 1
ErrorMessage "RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)
0%| | 0/2052 [00:00<?, ?it/s]"
Command "/opt/conda/bin/python3.10 run_clm.py --dataset_path /opt/ml/input/data/training --epochs 3 --hf_token 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 --lr 0.0002 --merge_weights True --model_id meta-llama/Llama-2-13b-hf --per_device_train_batch_size 2"
2023-07-20 15:52:05,435 sagemaker-training-toolkit ERROR Encountered exit_code 1

Error running trained model on sagemaker endpoint

I followed the instructions for training, and it works great. However, I now want to deploy my fine-tuned model to a SageMaker endpoint, but I get the following error when I run predict() on the endpoint.

"Could not load model /opt/ml/model with any of the following classes: (\u003cclass \u0027transformers.models.auto.modeling_auto.AutoModelForCausalLM\u0027\u003e, \u003cclass \u0027transformers.models.llama.modeling_llama.LlamaForCausalLM\u0027\u003e)."

Here is my inference code:

from sagemaker.huggingface import HuggingFaceModel

llm_model = HuggingFaceModel(
  role=role,
  transformers_version = '4.28',
  pytorch_version      = '2.0',
  py_version           = 'py310',
  sagemaker_session    = sess,
  model_data           = <path to the model.tar.gz>,   # S3 URI of the fine-tuned model artifact
)

llm = llm_model.deploy(
  initial_instance_count=1,
  instance_type="ml.g5.12xlarge",
)

response = llm.predict(prompt)
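
One deployment pattern worth trying here (a sketch, not taken from this repo): serve the fine-tuned artifact with the Hugging Face LLM (TGI) container instead of the plain transformers container. The container version, S3 path, and environment values below are assumptions and need adjusting:

from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# retrieve the image URI of the Hugging Face LLM (TGI) container
llm_image = get_huggingface_llm_image_uri("huggingface", version="0.9.3")

llm_model = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  model_data="s3://<bucket>/<path>/model.tar.gz",   # fine-tuned, merged model artifact
  sagemaker_session=sess,
  env={
    "HF_MODEL_ID": "/opt/ml/model",   # load the model shipped inside model.tar.gz
    "SM_NUM_GPUS": "4",               # ml.g5.12xlarge has 4 GPUs
    "MAX_INPUT_LENGTH": "2048",
    "MAX_TOTAL_TOKENS": "4096",
  },
)

llm = llm_model.deploy(
  initial_instance_count=1,
  instance_type="ml.g5.12xlarge",
  container_startup_health_check_timeout=600,
)

response = llm.predict({"inputs": prompt})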

No modules to quantize

peft found no modules to quantize:

Found 0 modules to quantize: []

This was the invocation:

huggingface_estimator = HuggingFace(
    entry_point          = 'run_clm.py',      # train script
    source_dir           = 'scripts',         # directory which includes all the files needed for training
    instance_type        = 'ml.g5.4xlarge',   # instances type used for the training job
    instance_count       = 1,                 # the number of instances used for training
    base_job_name        = job_name,          # the name of the training job
    role                 = role,              # Iam role used in training job to access AWS ressources, e.g. S3
    volume_size          = 300,               # the size of the EBS volume in GB
    transformers_version = '4.28',            # the transformers version used in the training job
    pytorch_version      = '2.0',             # the pytorch_version version used in the training job
    py_version           = 'py310',           # the python version used in the training job
    hyperparameters      =  hyperparameters,  # the hyperparameters passed to the training job
    environment          = { "HUGGINGFACE_HUB_CACHE": "/tmp/.cache" }, # set env variable to cache models in /tmp
    keep_alive_period_in_seconds = 3600
)

I did not change run_clm.py at all
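
For reference, a sketch of how the model needs to be loaded so that there are 4-bit Linear modules to find at all; the quantization settings mirror the ones visible in the training script, the rest are assumptions. One common reason for finding 0 modules is that the model was not actually loaded with a 4-bit quantization config, so there are no bitsandbytes Linear4bit layers to discover:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",     # assumed model id
    device_map="auto",
    quantization_config=bnb_config,
)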

HF token leak

Hello!

I am running the Llama fine-tuning example on SageMaker and realized that this leaks the HF token (it is printed as output metadata in the Jupyter notebook after execution).
(screenshot of the notebook output showing the token)
Is there a better way to pass the secret to the sagemaker job?
As an example, the W&B lib writes the token to a file that then gets copied to the instance, without explicitly passing the token.
Thanks =)
Thomas
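
One possible pattern (a sketch, not from this repo): keep the token in AWS Secrets Manager and read it inside run_clm.py at training time, so it never appears as a plain-text hyperparameter in the job metadata. The secret name, key, and region are assumptions; the training job's IAM role then needs permission to read that secret:

import json
import boto3

def get_hf_token(secret_name="hf-hub-token", region="us-east-1"):
    # fetch the Hugging Face token from AWS Secrets Manager at runtime
    client = boto3.client("secretsmanager", region_name=region)
    response = client.get_secret_value(SecretId=secret_name)
    return json.loads(response["SecretString"])["HF_TOKEN"]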

Error in build_llama2_prompt()

@philschmid I think you might have a small typo in the build_llama2_prompt() function. In the final else block, the message.append should be message['content'].

The following then works to continue the chat conversation:

messages.append({"role": "bot",
                 "content": response[0]["generated_text"][len(prompt):]})

messages.append({"role": "user", 
                 "content": "Verify that the previous answer was accurate."})

No matching distribution found for peft==0.4.0

When I start the training job, I get this error. Here is my requirements.txt:
transformers==4.6
peft==0.6
accelerate==0.21.0
bitsandbytes==0.40.2
safetensors>=0.3.1
tokenizers>=0.13.3

I modified the version of transformers because the previous version was not supported.

Do we need prompt formatting during instruct model fine-tuning?

Hello!

I was reviewing the notebooks, and in the fine-tuning tutorials we do not apply the prompt formatting with [INST] and the other special tokens. I understand that we are free to choose any prompt template when fine-tuning the text-completion model, but when tuning the instruct model, should we follow the prompt formatting that Meta recommends?
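
For reference, this is roughly the chat format Meta uses for the instruct checkpoints (single turn with a system prompt); the concrete strings below are just an example:

system_prompt = "You are a helpful assistant."
instruction = "Summarize the dialogue below."
prompt = f"<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{instruction} [/INST]"

When fine-tuning the chat/instruct variant, keeping this format (or at least the [INST]/[/INST] markers) generally keeps the tuned model closer to how it was originally trained; for the base text-completion model any consistent template works.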

Confused about chunk and a potential concern that it hurts fine-tuning performance

from itertools import chain

# global remainder carries leftover tokens from one batch over to the next call
remainder = {"input_ids": [], "attention_mask": [], "token_type_ids": []}

def chunk(sample, chunk_length=2048):
    # define global remainder variable to save remainder from batches to use in next batch
    global remainder
    # Concatenate all texts and add remainder from previous batch
    concatenated_examples = {k: list(chain(*sample[k])) for k in sample.keys()}
    concatenated_examples = {k: remainder[k] + concatenated_examples[k] for k in concatenated_examples.keys()}
    # get total number of tokens for batch
    batch_total_length = len(concatenated_examples[list(sample.keys())[0]])

    # get max number of chunks for batch
    if batch_total_length >= chunk_length:
        batch_chunk_length = (batch_total_length // chunk_length) * chunk_length

    # Split by chunks of max_len.
    result = {
        k: [t[i : i + chunk_length] for i in range(0, batch_chunk_length, chunk_length)]
        for k, t in concatenated_examples.items()
    }
    # add remainder to global variable for next batch
    remainder = {k: concatenated_examples[k][batch_chunk_length:] for k in concatenated_examples.keys()}
    # prepare labels
    result["labels"] = result["input_ids"].copy()
    return result

I am fairly confused by this piece of code in the fine-tuning script you provided. If we are packing many samples into one data point to train the LLM, wouldn't performance be hurt because the model cannot figure out the original boundaries between different samples during fine-tuning?
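
For what it's worth, a sketch of a common mitigation (the variable names dataset/tokenizer and the chunk helper above are assumed from the surrounding script): append the EOS token to every sample before tokenizing and packing, so each 2048-token block still contains explicit document boundaries:

def add_eos(sample):
    sample["text"] = f"{sample['text']}{tokenizer.eos_token}"
    return sample

dataset = dataset.map(add_eos)
lm_dataset = (
    dataset.map(lambda s: tokenizer(s["text"]), batched=True, remove_columns=list(dataset.features))
           .map(chunk, batched=True)
)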

Inference generates only <unk> tokens with batch decoding, but produces correct results if I pass a single prompt.

Hi,

I fine-tuned the Llama-2-7b-chat-hf model with your code and it ran perfectly. Thanks for sharing such wonderful blogs.

When I run inference, loading the fine-tuned model either directly with HF (merged_model) or via the AutoPeft class (adapter_model_path), I get the expected results as long as I don't use batching. But with batching, the results turn mostly into unknown tokens for the same prompts that gave correct results without batching.

I am wondering if the issue is caused by the adapter layers or something else. If you have any insights or suggestions, please share.

Note: I fine-tuned the model directly on an A10G server, not using SageMaker.

Here is the generation code block I am using.

encodings = tokenizer(prompt, return_tensors="pt", truncation=True, padding=True, max_length=512).to(model.device)
outputs = model.generate(**encodings, max_new_tokens=128, do_sample=False, num_beams=1, eos_token_id=tokenizer.eos_token_id, early_stopping=True)
op = tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=False)
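
One thing worth checking (a sketch, not a confirmed diagnosis): decoder-only models usually need left padding and an explicit pad token for batched generation, otherwise the right-padded positions can corrupt the continuation for the shorter prompts in the batch:

tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

encodings = tokenizer(prompts, return_tensors="pt", truncation=True, padding=True, max_length=512).to(model.device)
outputs = model.generate(**encodings, max_new_tokens=128, do_sample=False,
                         eos_token_id=tokenizer.eos_token_id, pad_token_id=tokenizer.pad_token_id)
texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)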

OSError: meta-llama/Llama-2-13b-hf is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'

OSError: meta-llama/Llama-2-13b-hf is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo with use_auth_token or log in with huggingface-cli login and pass use_auth_token=True

Where do I use use_auth_token?

I am facing this error even though I have been granted access by both Hugging Face and Meta. Can someone please help?

I have also logged in using the CLI with a read token.
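
For reference, a sketch of where the token can be passed when loading the gated model directly with transformers (the sample's training script instead accepts it as the hf_token hyperparameter):

from huggingface_hub import HfFolder
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-13b-hf"
hf_token = HfFolder.get_token()   # token stored by `huggingface-cli login`

tokenizer = AutoTokenizer.from_pretrained(model_id, use_auth_token=hf_token)
model = AutoModelForCausalLM.from_pretrained(model_id, use_auth_token=hf_token)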

ValueError: expected sequence of length 2048 at dim 1 (got 0)

Found 7 modules to quantize: ['k_proj', 'gate_proj', 'o_proj', 'down_proj', 'q_proj', 'up_proj', 'v_proj']
trainable params: 250,347,520 || all params: 6,922,327,040 || trainable%: 3.6165225733108386
/opt/conda/lib/python3.10/site-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set no_deprecation_warning=True to disable this warning
warnings.warn(
0%| | 0/1872 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/opt/ml/code/run_clm.py", line 253, in
main()
File "/opt/ml/code/run_clm.py", line 249, in main
training_function(args)
File "/opt/ml/code/run_clm.py", line 212, in training_function
trainer.train()
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
return inner_training_loop(
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1787, in _inner_training_loop
for step, inputs in enumerate(epoch_iterator):
File "/opt/conda/lib/python3.10/site-packages/accelerate/data_loader.py", line 394, in iter
next_batch = next(dataloader_iter)
File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 634, in next
data = self._next_data()
File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 678, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
return self.collate_fn(data)
File "/opt/conda/lib/python3.10/site-packages/transformers/data/data_collator.py", line 70, in default_data_collator
return torch_default_data_collator(features)
File "/opt/conda/lib/python3.10/site-packages/transformers/data/data_collator.py", line 136, in torch_default_data_collator
batch[k] = torch.tensor([f[k] for f in features])

ValueError: expected sequence of length 2048 at dim 1 (got 0)
0%| | 0/1872 [00:00<?, ?it/s]
2023-11-01 10:02:48,388 sagemaker-training-toolkit INFO Waiting for the process to finish and give a return code.
2023-11-01 10:02:48,388 sagemaker-training-toolkit INFO Done waiting for a return code. Received 1 from exiting process.
2023-11-01 10:02:48,388 sagemaker-training-toolkit ERROR Reporting training FAILURE
2023-11-01 10:02:48,388 sagemaker-training-toolkit ERROR ExecuteUserScriptError:
ExitCode 1
ErrorMessage "ValueError: expected sequence of length 2048 at dim 1 (got 0)
0%| | 0/1872 [00:00<?, ?it/s]"
Command "/opt/conda/bin/python3.10 run_clm.py --dataset_path /opt/ml/input/data/training --epochs 24 --hf_token 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 --lr 0.0002 --merge_weights True --model_id meta-llama/Llama-2-13b-hf --per_device_train_batch_size 2"
2023-11-01 10:02:48,388 sagemaker-training-toolkit ERROR Encountered exit_code 1

I was trying to fine-tune the model with my own data. I followed the same steps and stored the data in the S3 bucket as well, but when the training started I got this error.
I checked my input sequence length, and it is 2048; I am not able to figure out what's wrong.

print("Shape of processed data:", llm_dataset.shape)
# Assuming llm_dataset["input_ids"] contains the tokenized sequences
print("Length of sequences:", len(llm_dataset["input_ids"][0]))

@philschmid, any help would be appreciated.
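
A sketch of a possible check (lm_dataset stands for the chunked dataset that gets uploaded to S3): the collator fails because at least one example has length 0, so filtering empty rows before training should either avoid the error or show which rows are broken:

print("rows before:", lm_dataset.num_rows)
lm_dataset = lm_dataset.filter(lambda example: len(example["input_ids"]) > 0)
print("rows after:", lm_dataset.num_rows)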

RuntimeError: mat1 and mat2 shapes cannot be multiplied (4096x5120 and 1x2560)

I ran all the cells of the notebook to fine-tune Llama 2 and got this error.

  2023-07-20T16:08:06.067+05:30 return forward_call(*args, **kwargs) File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
  2023-07-20T16:08:06.068+05:30 output = old_forward(*args, **kwargs) File "/opt/conda/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 408, in forward
  2023-07-20T16:08:06.068+05:30 hidden_states, self_attn_weights, present_key_value = self.self_attn( File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
  2023-07-20T16:08:06.068+05:30 return forward_call(*args, **kwargs) File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
  2023-07-20T16:08:06.068+05:30 output = old_forward(*args, **kwargs) File "/opt/conda/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 295, in forward
  2023-07-20T16:08:06.068+05:30 query_states = [F.linear(hidden_states, query_slices[i]) for i in range(self.pretraining_tp)] File "/opt/conda/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 295, in
  2023-07-20T16:08:06.068+05:30 query_states = [F.linear(hidden_states, query_slices[i]) for i in range(self.pretraining_tp)]
  2023-07-20T16:08:06.068+05:30 RuntimeError: mat1 and mat2 shapes cannot be multiplied (4096x5120 and 1x2560)
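
A workaround that has been reported for this kind of shape mismatch (a hedged sketch, not a confirmed fix for this exact run): the failing matmul goes through the tensor-parallel slicing path controlled by config.pretraining_tp, which can be forced back to 1 after loading:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",   # assumed model id
    device_map="auto",
)
model.config.pretraining_tp = 1    # disable the pretraining tensor-parallel slicing path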

RuntimeError: cannot reshape tensor

Thanks so much for providing scripts for Llama 2 on SageMaker.

I'm running the code from:
https://www.philschmid.de/sagemaker-llama2-qlora

When fitting the model I get a RuntimeError:

ErrorMessage "RuntimeError: cannot reshape tensor of 0 elements into shape [-1, 0] because the unspecified dimension size -1 can be any value and is ambiguous
 0%|          | 0/276 [00:00<?, ?it/s]"
Command "/opt/conda/bin/python3.10 run_clm.py --dataset_path /opt/ml/input/data/training --epochs 3 --hf_token <REDACTED> --lr 0.0002 --merge_weights True --model_id meta-llama/Llama-2-13b-hf --per_device_train_batch_size 2"
2023-07-28 01:56:33,101 sagemaker-training-toolkit ERROR    Encountered exit_code 1

2023-07-28 01:56:48 Uploading - Uploading generated training model
2023-07-28 01:56:48 Failed - Training job failed
---------------------------------------------------------------------------
UnexpectedStatusException                 Traceback (most recent call last)
<ipython-input-18-26cb6fd8d084> in <cell line: 5>()
      3 
      4 # starting the train job with our uploaded datasets as input
----> 5 huggingface_estimator.fit(data, wait=True)

5 frames
/usr/local/lib/python3.10/dist-packages/sagemaker/session.py in _check_job_status(job, desc, status_key_name)
   6734                 actual_status=status,
   6735             )
-> 6736         raise exceptions.UnexpectedStatusException(
   6737             message=message,
   6738             allowed_statuses=["Completed", "Stopped"],

UnexpectedStatusException: Error for Training job huggingface-qlora-2023-07-28-01-46-14-2023-07-28-01-46-20-353: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage "RuntimeError: cannot reshape tensor of 0 elements into shape [-1, 0] because the unspecified dimension size -1 can be any value and is ambiguous
 0%|          | 0/276 [00:00<?, ?it/s]"
Command "/opt/conda/bin/python3.10 run_clm.py --dataset_path /opt/ml/input/data/training --epochs 3 --hf_token <REDACTED> --lr 0.0002 --merge_weights True --model_id meta-llama/Llama-2-13b-hf --per_device_train_batch_size 2", exit code: 1

missing "requirements.txt"

The bitsandbytes library is not installed in the container, so it should be listed in a requirements.txt under ./scripts/. A possible version set is sketched below.
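
A possible scripts/requirements.txt, with versions aligned to the pins quoted elsewhere in these issues (treat the exact versions as assumptions):

transformers==4.31.0
peft==0.4.0
accelerate==0.21.0
bitsandbytes==0.40.2
safetensors>=0.3.1
tokenizers>=0.13.3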

Cannot save checkpoints to the provided S3 bucket

Hi, I am using your guide to fine-tune a Llama 7B model on SageMaker.
I added the following parameters to the Hugging Face estimator to enable checkpointing, but no checkpoints are saved to the provided S3 bucket: checkpoint_s3_uri=s3buck and checkpoint_local_path="/opt/ml/checkpoints/".
Below is the configuration of the estimator, as well as the TrainingArguments configuration in the entry script.

import time
from sagemaker.huggingface import HuggingFace
from huggingface_hub import HfFolder

# define Training Job Name 
job_name = f'huggingface-qlora-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'
s3buck='s3://mys3bucket/checks/'
# hyperparameters, which are passed into the training job
hyperparameters = {
  'model_id': model_id,                             # pre-trained model
  'dataset_path': '/opt/ml/input/data/training',    # path where sagemaker will save training dataset
  'epochs': 3,                                      # number of training epochs
  'per_device_train_batch_size': 2,                 # batch size for training
  'lr': 2e-4,                                       # learning rate
  'merge_weights': True,                            # whether to merge LoRA into the model (needs more memory)
}

metric_definitions=[
    {'Name': 'loss', 'Regex': "'loss': ([0-9]+(.|e\-)[0-9]+),?"},
     {'Name': 'eval_loss', 'Regex': "'eval_loss': ([0-9]+(.|e\-)[0-9]+),?"}]

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point          = 'run_clm.py',       # training script
    source_dir           = 'scripts',          # directory which includes all the files needed for training
    instance_type        = 'ml.g5.12xlarge',   # instance type used for the training job
    instance_count       = 1,                  # the number of instances used for training
    base_job_name        = job_name,           # the name of the training job
    role                 = role,               # IAM role used in the training job to access AWS resources, e.g. S3
    sagemaker_session    = sess,
    volume_size          = 300,                # the size of the EBS volume in GB
    transformers_version = '4.28',             # the transformers version used in the training job
    pytorch_version      = '2.0',              # the pytorch version used in the training job
    py_version           = 'py310',            # the python version used in the training job
    hyperparameters      = hyperparameters,    # the hyperparameters passed to the training job
    metric_definitions   = metric_definitions,
    checkpoint_s3_uri    = s3buck,
    checkpoint_local_path = "/opt/ml/checkpoints/",
    max_run              = max_run,
    environment          = {"HUGGINGFACE_HUB_CACHE": "/tmp/.cache"},  # set env variable to cache models in /tmp
)

################# in run_clm.py script ########################

output_dir = "/opt/ml/model/"
  training_args = TrainingArguments(
      overwrite_output_dir=True if get_last_checkpoint(output_dir) is not None else False,
      output_dir=output_dir,
      per_device_train_batch_size=args.per_device_train_batch_size,
      bf16=args.bf16,  # Use BF16 if available
      learning_rate=args.lr,
      num_train_epochs=args.epochs,
      gradient_checkpointing=args.gradient_checkpointing,
      # logging strategies
      logging_dir=f"{output_dir}/logs",
      logging_strategy="epoch",
      save_strategy="epoch",
      evaluation_strategy="epoch"        
  )

  # Create Trainer instance
  trainer = Trainer(
      model=model,
      args=training_args,
      train_dataset=dataset,
      eval_dataset=dataset_val,
      data_collator=default_data_collator
  )
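
A sketch of a possible fix, assuming the standard SageMaker checkpoint behaviour: only checkpoint_local_path ("/opt/ml/checkpoints/") is synced continuously to checkpoint_s3_uri, while /opt/ml/model is uploaded once at the end of the job, so the Trainer has to write its checkpoints under /opt/ml/checkpoints for them to reach the bucket:

checkpoint_dir = "/opt/ml/checkpoints"
training_args = TrainingArguments(
    output_dir=checkpoint_dir,   # checkpoints land where SageMaker syncs them to S3
    overwrite_output_dir=get_last_checkpoint(checkpoint_dir) is not None,
    per_device_train_batch_size=args.per_device_train_batch_size,
    bf16=args.bf16,
    learning_rate=args.lr,
    num_train_epochs=args.epochs,
    gradient_checkpointing=args.gradient_checkpointing,
    logging_dir=f"{checkpoint_dir}/logs",
    logging_strategy="epoch",
    save_strategy="epoch",
    evaluation_strategy="epoch",
)

The final model can still be copied to /opt/ml/model at the end of training so the regular model artifact upload keeps working.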
