
sagemaker-huggingface-llama-2-samples's Introduction

Getting Started with Llama 2 and Hugging Face

This repository contains instructions, examples, and tutorials for getting started with Llama 2 and Hugging Face libraries such as transformers and datasets.

Training

Requirements

Before we can start, make sure you have met the following requirements:

  • AWS account with sufficient service quota for GPU instances
  • AWS CLI installed
  • AWS IAM user configured in the CLI with permission to create and manage EC2 instances (a quick setup check is sketched below)
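
As a quick check that the account, CLI, and IAM setup work, something like the following can be run before kicking off a training job (a sketch; the fallback role name is an assumption and has to match your own execution role):

import sagemaker
import boto3

sess = sagemaker.Session()
try:
    # works inside SageMaker Studio / notebook instances
    role = sagemaker.get_execution_role()
except ValueError:
    # outside of SageMaker, look the execution role up by name (assumed name)
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")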

Commands

To monitor GPU utilization while a job is running:

watch -n0.1 nvidia-smi

sagemaker-huggingface-llama-2-samples's People

Contributors

mukundt, philschmid


sagemaker-huggingface-llama-2-samples's Issues

The model did not return a loss from the inputs

Hello!
I get this ValueError:

"ValueError: The model did not return a loss from the inputs, only the following keys: logits. For reference, the inputs it received are input_ids,attention_mask."

whereas my tokenized dataset is structured like this:
Dataset({
features: ['input_ids', 'attention_mask', 'labels'],
num_rows: 24702
})
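
A quick way to narrow this down (a sketch; lm_dataset stands for the tokenized dataset shown above) is to push a couple of examples through the same collator the Trainer uses and check whether "labels" survives collation, since the error says the model only received input_ids and attention_mask:

from transformers import default_data_collator

batch = default_data_collator([lm_dataset[i] for i in range(2)])
print(batch.keys())  # expected to include: input_ids, attention_mask, labels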

Fine-tuning Llama-2-13b on ml.g4dn.12xlarge is taking too much time.

So I was fine-tuning Llama 2 13B on a different dataset. I used the provided code and tweaked it a little just to preprocess that specific dataset, then ran it as a SageMaker training job.
The training ran fine but was very slow: even after 24 hours it had only reached 7% on an ml.g4dn.12xlarge instance.
Can anyone please guide me on how to increase the training speed?
Unfortunately I cannot use ml.g5.4xlarge, since that instance type is not available in the region I am working in right now.
Thanks.

Error running fine-tuning on SageMaker

Hi, I'm facing this issue when following the fine-tuning example for Llama-2-13b-chat. It happens while the model is being loaded:

╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /opt/ml/code/train_sagemaker.py:315 in │
│ │
│ 312 │ training_function(args) │
│ 313 │
│ 314 if __name__ == "__main__": │
│ ❱ 315 │ main() │
│ │
│ /opt/ml/code/train_sagemaker.py:312 in main │
│ │
│ 309 │
│ 310 def main(): │
│ 311 │ args = parse_args() │
│ ❱ 312 │ training_function(args) │
│ 313 │
│ 314 if __name__ == "__main__": │
│ 315 │ main() │
│ │
│ /opt/ml/code/train_sagemaker.py:195 in training_function │
│ │
│ 192 │ │ bnb_4bit_compute_dtype=torch.bfloat16, │
│ 193 │ ) │
│ 194 │ │
│ ❱ 195 │ model = AutoModelForCausalLM.from_pretrained( │
│ 196 │ │ args.model_id, │
│ 197 │ │ use_cache=False │
│ 198 │ │ if args.gradient_checkpointing │
│ │
│ /opt/conda/lib/python3.10/site-packages/transformers/models/auto/auto_factor │
│ y.py:471 in from_pretrained │
│ │
│ 468 │ │ │ ) │
│ 469 │ │ elif type(config) in cls._model_mapping.keys(): │
│ 470 │ │ │ model_class = _get_model_class(config, cls._model_mapping) │
│ ❱ 471 │ │ │ return model_class.from_pretrained( │
│ 472 │ │ │ │ pretrained_model_name_or_path, *model_args, config=con │
│ 473 │ │ │ ) │
│ 474 │ │ raise ValueError( │
│ │
│ /opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py:2795 │
│ in from_pretrained │
│ │
│ 2792 │ │ │ │ mismatched_keys, │
│ 2793 │ │ │ │ offload_index, │
│ 2794 │ │ │ │ error_msgs, │
│ ❱ 2795 │ │ │ ) = cls._load_pretrained_model( │
│ 2796 │ │ │ │ model, │
│ 2797 │ │ │ │ state_dict, │
│ 2798 │ │ │ │ loaded_state_dict_keys, # XXX: rename? │
│ │
│ /opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py:3109 │
│ in _load_pretrained_model │
│ │
│ 3106 │ │ │ │ # Skip the load for shards that only contain disk-off │
│ 3107 │ │ │ │ if shard_file in disk_only_shard_files: │
│ 3108 │ │ │ │ │ continue │
│ ❱ 3109 │ │ │ │ state_dict = load_state_dict(shard_file) │
│ 3110 │ │ │ │ │
│ 3111 │ │ │ │ # Mistmatched keys contains tuples key/shape1/shape2 │
│ 3112 │ │ │ │ # matching the weights in the model. │
│ │
│ /opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py:429 │
│ in load_state_dict │
│ │
│ 426 │ """ │
│ 427 │ if checkpoint_file.endswith(".safetensors") and is_safetensors_av │
│ 428 │ │ # Check format of the archive │
│ ❱ 429 │ │ with safe_open(checkpoint_file, framework="pt") as f: │
│ 430 │ │ │ metadata = f.metadata() │
│ 431 │ │ if metadata.get("format") not in ["pt", "tf", "flax"]: │
│ 432 │ │ │ raise OSError( │
╰──────────────────────────────────────────────────────────────────────────────╯
NameError: name 'safe_open' is not defined

pretrain llama-2

Hi, I want to pre-train Llama 2 on my custom dataset, starting from the published checkpoint, before fine-tuning. Can you recommend/add a script for this? Thank you!
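
These samples do not ship a pre-training script, but a minimal sketch of continued pre-training from the published checkpoint with the plain Trainer API could look like the following; the model id, corpus file, and hyperparameters are placeholders, not values from this repo:

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "meta-llama/Llama-2-7b-hf"            # assumed starting checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token        # Llama has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_id)

# hypothetical plain-text corpus
raw = load_dataset("text", data_files={"train": "my_corpus.txt"})["train"]
tokenized = raw.map(lambda s: tokenizer(s["text"], truncation=True, max_length=2048),
                    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="llama2-continued-pretraining",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        bf16=True,
        logging_steps=10,
    ),
    train_dataset=tokenized,
    # pads each batch and builds labels for causal language modeling
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()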

CUDA error encountered with both BS=2 and BS=3 for 7B and 13B.

File "/opt/conda/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/opt/conda/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 516, in forward
output = torch.nn.functional.linear(A, F.dequantize_4bit(B, state).to(A.dtype).t(), bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)
0%| | 0/2052 [00:00<?, ?it/s]
2023-07-20 15:52:05,435 sagemaker-training-toolkit INFO Waiting for the process to finish and give a return code.
2023-07-20 15:52:05,435 sagemaker-training-toolkit INFO Done waiting for a return code. Received 1 from exiting process.
2023-07-20 15:52:05,435 sagemaker-training-toolkit ERROR Reporting training FAILURE
2023-07-20 15:52:05,435 sagemaker-training-toolkit ERROR ExecuteUserScriptError:
ExitCode 1
ErrorMessage "RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)
0%| | 0/2052 [00:00<?, ?it/s]"
Command "/opt/conda/bin/python3.10 run_clm.py --dataset_path /opt/ml/input/data/training --epochs 3 --hf_token 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 --lr 0.0002 --merge_weights True --model_id meta-llama/Llama-2-13b-hf --per_device_train_batch_size 2"
2023-07-20 15:52:05,435 sagemaker-training-toolkit ERROR Encountered exit_code 1

Error running trained model on sagemaker endpoint

I followed the instructions for training, and it works great. However, I now want to deploy my fine-tuned model to a SageMaker endpoint, but I get the following error when I run predict() on the endpoint.

"Could not load model /opt/ml/model with any of the following classes: (\u003cclass \u0027transformers.models.auto.modeling_auto.AutoModelForCausalLM\u0027\u003e, \u003cclass \u0027transformers.models.llama.modeling_llama.LlamaForCausalLM\u0027\u003e)."

Here is my inference code:

from sagemaker.huggingface import HuggingFaceModel

llm_model = HuggingFaceModel(
  role=role,
  transformers_version = '4.28',
  pytorch_version      = '2.0',
  py_version           = 'py310',
  sagemaker_session    = sess,
  model_data           = <path to the model.tar.gz>,   # S3 URI of the fine-tuned model artifact
)

llm = llm_model.deploy(
  initial_instance_count=1,
  instance_type="ml.g5.12xlarge",
)

response = llm.predict(prompt)
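
One deployment pattern worth trying here (a sketch, not taken from this repo): serve the fine-tuned artifact with the Hugging Face LLM (TGI) container instead of the plain transformers container. The container version, S3 path, and environment values below are assumptions and need adjusting:

from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# retrieve the image URI of the Hugging Face LLM (TGI) container
llm_image = get_huggingface_llm_image_uri("huggingface", version="0.9.3")

llm_model = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  model_data="s3://<bucket>/<path>/model.tar.gz",   # fine-tuned, merged model artifact
  sagemaker_session=sess,
  env={
    "HF_MODEL_ID": "/opt/ml/model",   # load the model shipped inside model.tar.gz
    "SM_NUM_GPUS": "4",               # ml.g5.12xlarge has 4 GPUs
    "MAX_INPUT_LENGTH": "2048",
    "MAX_TOTAL_TOKENS": "4096",
  },
)

llm = llm_model.deploy(
  initial_instance_count=1,
  instance_type="ml.g5.12xlarge",
  container_startup_health_check_timeout=600,
)

response = llm.predict({"inputs": prompt})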

No modules to quantize

peft found no modules to quantize:

Found 0 modules to quantize: []

This was the invocation:

huggingface_estimator = HuggingFace(
    entry_point          = 'run_clm.py',      # train script
    source_dir           = 'scripts',         # directory which includes all the files needed for training
    instance_type        = 'ml.g5.4xlarge',   # instances type used for the training job
    instance_count       = 1,                 # the number of instances used for training
    base_job_name        = job_name,          # the name of the training job
    role                 = role,              # Iam role used in training job to access AWS ressources, e.g. S3
    volume_size          = 300,               # the size of the EBS volume in GB
    transformers_version = '4.28',            # the transformers version used in the training job
    pytorch_version      = '2.0',             # the pytorch_version version used in the training job
    py_version           = 'py310',           # the python version used in the training job
    hyperparameters      =  hyperparameters,  # the hyperparameters passed to the training job
    environment          = { "HUGGINGFACE_HUB_CACHE": "/tmp/.cache" }, # set env variable to cache models in /tmp
    keep_alive_period_in_seconds = 3600
)

I did not change run_clm.py at all
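
For reference, a sketch of how the model needs to be loaded so that there are 4-bit Linear modules to find at all; the quantization settings mirror the ones visible in the training script, the rest are assumptions. One common reason for finding 0 modules is that the model was not actually loaded with a 4-bit quantization config, so there are no bitsandbytes Linear4bit layers to discover:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",     # assumed model id
    device_map="auto",
    quantization_config=bnb_config,
)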

HF token leak

Hello!

I am running the Llama fine-tuning example on SageMaker and realized that this leaks the HF token (it is printed as output metadata in the Jupyter notebook after execution).
(screenshot of the notebook output showing the token)
Is there a better way to pass the secret to the sagemaker job?
As an example, the W&B lib writes the token to a file that then gets copied to the instance, without explicitly passing the token.
Thanks =)
Thomas
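
One possible pattern (a sketch, not from this repo): keep the token in AWS Secrets Manager and read it inside run_clm.py at training time, so it never appears as a plain-text hyperparameter in the job metadata. The secret name, key, and region are assumptions; the training job's IAM role then needs permission to read that secret:

import json
import boto3

def get_hf_token(secret_name="hf-hub-token", region="us-east-1"):
    # fetch the Hugging Face token from AWS Secrets Manager at runtime
    client = boto3.client("secretsmanager", region_name=region)
    response = client.get_secret_value(SecretId=secret_name)
    return json.loads(response["SecretString"])["HF_TOKEN"]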

Error in build_llama2_prompt()

@philschmid I think you might have a small typo in the build_llama2_prompt() function. In the final else block, the message.append should be message['content'].

The following then works to continue the chat conversation:

messages.append({"role": "bot",
                 "content": response[0]["generated_text"][len(prompt):]})

messages.append({"role": "user", 
                 "content": "Verify that the previous answer was accurate."})

No matching distribution found for peft==0.4.0

When I start the training job, I get this error. Here is my requirements.txt:
transformers==4.6
peft==0.6
accelerate==0.21.0
bitsandbytes==0.40.2
safetensors>=0.3.1
tokenizers>=0.13.3

I modified the version of transformers because the previous version was not supported.

Do we need prompt formatting during instruct model fine-tuning?

Hello!

I was reviewing the notebooks, and in the fine-tuning tutorials we do not apply the prompt formatting with [INST] and the other special tokens. I understand that we are free to choose any prompt template when fine-tuning the text-completion model, but when tuning the instruct model, should we follow the prompt formatting that Meta recommends?
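
For reference, this is roughly the chat format Meta uses for the instruct checkpoints (single turn with a system prompt); the concrete strings below are just an example:

system_prompt = "You are a helpful assistant."
instruction = "Summarize the dialogue below."
prompt = f"<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{instruction} [/INST]"

When fine-tuning the chat/instruct variant, keeping this format (or at least the [INST]/[/INST] markers) generally keeps the tuned model closer to how it was originally trained; for the base text-completion model any consistent template works.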

Confused about chunk and a potential concern that it hurts fine-tuning performance

from itertools import chain

# global remainder carries leftover tokens from one batch over to the next call
remainder = {"input_ids": [], "attention_mask": [], "token_type_ids": []}

def chunk(sample, chunk_length=2048):
    # define global remainder variable to save remainder from batches to use in next batch
    global remainder
    # Concatenate all texts and add remainder from previous batch
    concatenated_examples = {k: list(chain(*sample[k])) for k in sample.keys()}
    concatenated_examples = {k: remainder[k] + concatenated_examples[k] for k in concatenated_examples.keys()}
    # get total number of tokens for batch
    batch_total_length = len(concatenated_examples[list(sample.keys())[0]])

    # get max number of chunks for batch
    if batch_total_length >= chunk_length:
        batch_chunk_length = (batch_total_length // chunk_length) * chunk_length

    # Split by chunks of max_len.
    result = {
        k: [t[i : i + chunk_length] for i in range(0, batch_chunk_length, chunk_length)]
        for k, t in concatenated_examples.items()
    }
    # add remainder to global variable for next batch
    remainder = {k: concatenated_examples[k][batch_chunk_length:] for k in concatenated_examples.keys()}
    # prepare labels
    result["labels"] = result["input_ids"].copy()
    return result

I am fairly confused by this piece of code in the fine-tuning script you provided. If we are packing many samples into one data point to train the LLM, wouldn't performance be hurt because the model cannot figure out the original boundaries between different samples during fine-tuning?
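
For what it's worth, a sketch of a common mitigation (the variable names dataset/tokenizer and the chunk helper above are assumed from the surrounding script): append the EOS token to every sample before tokenizing and packing, so each 2048-token block still contains explicit document boundaries:

def add_eos(sample):
    sample["text"] = f"{sample['text']}{tokenizer.eos_token}"
    return sample

dataset = dataset.map(add_eos)
lm_dataset = (
    dataset.map(lambda s: tokenizer(s["text"]), batched=True, remove_columns=list(dataset.features))
           .map(chunk, batched=True)
)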

Inference generates only <unk> tokens with batch decoding, but produces correct results if I pass a single prompt.

Hi,

I fine-tuned the Llama-2-7b-chat-hf model with your code and it ran perfectly. Thanks for sharing such wonderful blogs.

When I run inference, loading the fine-tuned model either directly with HF (merged_model) or via the AutoPeft class (adapter_model_path), I get the expected results as long as I don't use batching. But with batching, the results turn mostly into unknown tokens for the same prompts that gave correct results without batching.

I am wondering if the issue is caused by the adapter layers or something else. If you have any insights or suggestions, please share.

Note: I fine-tuned the model directly on an A10G server, not using SageMaker.

Here is the generation code block I am using.

encodings = tokenizer(prompt, return_tensors="pt", truncation=True, padding=True, max_length=512).to(model.device)
outputs = model.generate(**encodings, max_new_tokens=128, do_sample=False, num_beams=1, eos_token_id=tokenizer.eos_token_id, early_stopping=True)
op = tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=False)
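
One thing worth checking (a sketch, not a confirmed diagnosis): decoder-only models usually need left padding and an explicit pad token for batched generation, otherwise the right-padded positions can corrupt the continuation for the shorter prompts in the batch:

tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

encodings = tokenizer(prompts, return_tensors="pt", truncation=True, padding=True, max_length=512).to(model.device)
outputs = model.generate(**encodings, max_new_tokens=128, do_sample=False,
                         eos_token_id=tokenizer.eos_token_id, pad_token_id=tokenizer.pad_token_id)
texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)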

OSError: meta-llama/Llama-2-13b-hf is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'

OSError: meta-llama/Llama-2-13b-hf is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo with use_auth_token or log in with huggingface-cli login and pass use_auth_token=True

Where do I use use_auth_token?

I am facing this error even though I have been granted access by both Hugging Face and Meta. Can someone please help?

I have also logged in using the CLI with a read token.
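
For reference, a sketch of where the token can be passed when loading the gated model directly with transformers (the sample's training script instead accepts it as the hf_token hyperparameter):

from huggingface_hub import HfFolder
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-13b-hf"
hf_token = HfFolder.get_token()   # token stored by `huggingface-cli login`

tokenizer = AutoTokenizer.from_pretrained(model_id, use_auth_token=hf_token)
model = AutoModelForCausalLM.from_pretrained(model_id, use_auth_token=hf_token)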

ValueError: expected sequence of length 2048 at dim 1 (got 0)

Found 7 modules to quantize: ['k_proj', 'gate_proj', 'o_proj', 'down_proj', 'q_proj', 'up_proj', 'v_proj']
trainable params: 250,347,520 || all params: 6,922,327,040 || trainable%: 3.6165225733108386
/opt/conda/lib/python3.10/site-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set no_deprecation_warning=True to disable this warning
warnings.warn(
0%| | 0/1872 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/opt/ml/code/run_clm.py", line 253, in
main()
File "/opt/ml/code/run_clm.py", line 249, in main
training_function(args)
File "/opt/ml/code/run_clm.py", line 212, in training_function
trainer.train()
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
return inner_training_loop(
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1787, in _inner_training_loop
for step, inputs in enumerate(epoch_iterator):
File "/opt/conda/lib/python3.10/site-packages/accelerate/data_loader.py", line 394, in iter
next_batch = next(dataloader_iter)
File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 634, in next
data = self._next_data()
File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 678, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
return self.collate_fn(data)
File "/opt/conda/lib/python3.10/site-packages/transformers/data/data_collator.py", line 70, in default_data_collator
return torch_default_data_collator(features)
File "/opt/conda/lib/python3.10/site-packages/transformers/data/data_collator.py", line 136, in torch_default_data_collator
batch[k] = torch.tensor([f[k] for f in features])

ValueError: expected sequence of length 2048 at dim 1 (got 0)
0%| | 0/1872 [00:00<?, ?it/s]
2023-11-01 10:02:48,388 sagemaker-training-toolkit INFO Waiting for the process to finish and give a return code.
2023-11-01 10:02:48,388 sagemaker-training-toolkit INFO Done waiting for a return code. Received 1 from exiting process.
2023-11-01 10:02:48,388 sagemaker-training-toolkit ERROR Reporting training FAILURE
2023-11-01 10:02:48,388 sagemaker-training-toolkit ERROR ExecuteUserScriptError:
ExitCode 1
ErrorMessage "ValueError: expected sequence of length 2048 at dim 1 (got 0)
0%| | 0/1872 [00:00<?, ?it/s]"
Command "/opt/conda/bin/python3.10 run_clm.py --dataset_path /opt/ml/input/data/training --epochs 24 --hf_token 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 --lr 0.0002 --merge_weights True --model_id meta-llama/Llama-2-13b-hf --per_device_train_batch_size 2"
2023-11-01 10:02:48,388 sagemaker-training-toolkit ERROR Encountered exit_code 1

I was trying to fine-tune the model with my own data. I followed the same steps and stored the data in the S3 bucket as well, but when the training started I got this error.
I checked my input sequence length, and it is 2048; I am not able to figure out what's wrong.

print("Shape of processed data:", llm_dataset.shape)
# Assuming llm_dataset["input_ids"] contains the tokenized sequences
print("Length of sequences:", len(llm_dataset["input_ids"][0]))

@philschmid, any help would be appreciated.
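
A sketch of a possible check (lm_dataset stands for the chunked dataset that gets uploaded to S3): the collator fails because at least one example has length 0, so filtering empty rows before training should either avoid the error or show which rows are broken:

print("rows before:", lm_dataset.num_rows)
lm_dataset = lm_dataset.filter(lambda example: len(example["input_ids"]) > 0)
print("rows after:", lm_dataset.num_rows)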

RuntimeError: mat1 and mat2 shapes cannot be multiplied (4096x5120 and 1x2560)

I ran all the cells of the notebook to fine-tune Llama 2 and got this error.

  2023-07-20T16:08:06.067+05:30 return forward_call(*args, **kwargs) File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
  2023-07-20T16:08:06.068+05:30 output = old_forward(*args, **kwargs) File "/opt/conda/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 408, in forward
  2023-07-20T16:08:06.068+05:30 hidden_states, self_attn_weights, present_key_value = self.self_attn( File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
  2023-07-20T16:08:06.068+05:30 return forward_call(*args, **kwargs) File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
  2023-07-20T16:08:06.068+05:30 output = old_forward(*args, **kwargs) File "/opt/conda/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 295, in forward
  2023-07-20T16:08:06.068+05:30 query_states = [F.linear(hidden_states, query_slices[i]) for i in range(self.pretraining_tp)] File "/opt/conda/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 295, in
  2023-07-20T16:08:06.068+05:30 query_states = [F.linear(hidden_states, query_slices[i]) for i in range(self.pretraining_tp)]
  2023-07-20T16:08:06.068+05:30 RuntimeError: mat1 and mat2 shapes cannot be multiplied (4096x5120 and 1x2560)
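
A workaround that has been reported for this kind of shape mismatch (a hedged sketch, not a confirmed fix for this exact run): the failing matmul goes through the tensor-parallel slicing path controlled by config.pretraining_tp, which can be forced back to 1 after loading:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",   # assumed model id
    device_map="auto",
)
model.config.pretraining_tp = 1    # disable the pretraining tensor-parallel slicing path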

RuntimeError: cannot reshape tensor

Thanks so much for providing scripts for Llama 2 on SageMaker.

I'm running the code from:
https://www.philschmid.de/sagemaker-llama2-qlora

When fitting the model I get a RuntimeError:

ErrorMessage "RuntimeError: cannot reshape tensor of 0 elements into shape [-1, 0] because the unspecified dimension size -1 can be any value and is ambiguous
 0%|          | 0/276 [00:00<?, ?it/s]"
Command "/opt/conda/bin/python3.10 run_clm.py --dataset_path /opt/ml/input/data/training --epochs 3 --hf_token <REDACTED> --lr 0.0002 --merge_weights True --model_id meta-llama/Llama-2-13b-hf --per_device_train_batch_size 2"
2023-07-28 01:56:33,101 sagemaker-training-toolkit ERROR    Encountered exit_code 1

2023-07-28 01:56:48 Uploading - Uploading generated training model
2023-07-28 01:56:48 Failed - Training job failed
---------------------------------------------------------------------------
UnexpectedStatusException                 Traceback (most recent call last)
<ipython-input-18-26cb6fd8d084> in <cell line: 5>()
      3 
      4 # starting the train job with our uploaded datasets as input
----> 5 huggingface_estimator.fit(data, wait=True)

5 frames
/usr/local/lib/python3.10/dist-packages/sagemaker/session.py in _check_job_status(job, desc, status_key_name)
   6734                 actual_status=status,
   6735             )
-> 6736         raise exceptions.UnexpectedStatusException(
   6737             message=message,
   6738             allowed_statuses=["Completed", "Stopped"],

UnexpectedStatusException: Error for Training job huggingface-qlora-2023-07-28-01-46-14-2023-07-28-01-46-20-353: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage "RuntimeError: cannot reshape tensor of 0 elements into shape [-1, 0] because the unspecified dimension size -1 can be any value and is ambiguous
 0%|          | 0/276 [00:00<?, ?it/s]"
Command "/opt/conda/bin/python3.10 run_clm.py --dataset_path /opt/ml/input/data/training --epochs 3 --hf_token <REDACTED> --lr 0.0002 --merge_weights True --model_id meta-llama/Llama-2-13b-hf --per_device_train_batch_size 2", exit code: 1

missing "requirements.txt"

The bitsandbytes library is not installed in the container, so it should be listed in a requirements.txt under ./scripts/. A possible version set is sketched below.
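
A possible scripts/requirements.txt, with versions aligned to the pins quoted elsewhere in these issues (treat the exact versions as assumptions):

transformers==4.31.0
peft==0.4.0
accelerate==0.21.0
bitsandbytes==0.40.2
safetensors>=0.3.1
tokenizers>=0.13.3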

Cannot save checkpoints to the provided S3 bucket

Hi, I am using your guide to fine-tune a Llama 7B model on SageMaker.
I added the following parameters to the Hugging Face estimator to enable checkpointing, but no checkpoints are saved to the provided S3 bucket: checkpoint_s3_uri=s3buck and checkpoint_local_path="/opt/ml/checkpoints/".
Below is the configuration of the estimator, as well as the TrainingArguments configuration in the entry script.

import time
from sagemaker.huggingface import HuggingFace
from huggingface_hub import HfFolder

# define Training Job Name 
job_name = f'huggingface-qlora-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'
s3buck='s3://mys3bucket/checks/'
# hyperparameters, which are passed into the training job
hyperparameters = {
  'model_id': model_id,                             # pre-trained model
  'dataset_path': '/opt/ml/input/data/training',    # path where sagemaker will save training dataset
  'epochs': 3,                                      # number of training epochs
  'per_device_train_batch_size': 2,                 # batch size for training
  'lr': 2e-4,                                       # learning rate
  'merge_weights': True,                            # whether to merge LoRA into the model (needs more memory)
}

metric_definitions=[
    {'Name': 'loss', 'Regex': "'loss': ([0-9]+(.|e\-)[0-9]+),?"},
     {'Name': 'eval_loss', 'Regex': "'eval_loss': ([0-9]+(.|e\-)[0-9]+),?"}]

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point          = 'run_clm.py',       # training script
    source_dir           = 'scripts',          # directory which includes all the files needed for training
    instance_type        = 'ml.g5.12xlarge',   # instance type used for the training job
    instance_count       = 1,                  # the number of instances used for training
    base_job_name        = job_name,           # the name of the training job
    role                 = role,               # IAM role used in the training job to access AWS resources, e.g. S3
    sagemaker_session    = sess,
    volume_size          = 300,                # the size of the EBS volume in GB
    transformers_version = '4.28',             # the transformers version used in the training job
    pytorch_version      = '2.0',              # the pytorch version used in the training job
    py_version           = 'py310',            # the python version used in the training job
    hyperparameters      = hyperparameters,    # the hyperparameters passed to the training job
    metric_definitions   = metric_definitions,
    checkpoint_s3_uri    = s3buck,
    checkpoint_local_path = "/opt/ml/checkpoints/",
    max_run              = max_run,
    environment          = {"HUGGINGFACE_HUB_CACHE": "/tmp/.cache"},  # set env variable to cache models in /tmp
)

################# in run_clm.py script ########################

output_dir = "/opt/ml/model/"
  training_args = TrainingArguments(
      overwrite_output_dir=True if get_last_checkpoint(output_dir) is not None else False,
      output_dir=output_dir,
      per_device_train_batch_size=args.per_device_train_batch_size,
      bf16=args.bf16,  # Use BF16 if available
      learning_rate=args.lr,
      num_train_epochs=args.epochs,
      gradient_checkpointing=args.gradient_checkpointing,
      # logging strategies
      logging_dir=f"{output_dir}/logs",
      logging_strategy="epoch",
      save_strategy="epoch",
      evaluation_strategy="epoch"        
  )

  # Create Trainer instance
  trainer = Trainer(
      model=model,
      args=training_args,
      train_dataset=dataset,
      eval_dataset=dataset_val,
      data_collator=default_data_collator
  )
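
A sketch of a possible fix, assuming the standard SageMaker checkpoint behaviour: only checkpoint_local_path ("/opt/ml/checkpoints/") is synced continuously to checkpoint_s3_uri, while /opt/ml/model is uploaded once at the end of the job, so the Trainer has to write its checkpoints under /opt/ml/checkpoints for them to reach the bucket:

checkpoint_dir = "/opt/ml/checkpoints"
training_args = TrainingArguments(
    output_dir=checkpoint_dir,   # checkpoints land where SageMaker syncs them to S3
    overwrite_output_dir=get_last_checkpoint(checkpoint_dir) is not None,
    per_device_train_batch_size=args.per_device_train_batch_size,
    bf16=args.bf16,
    learning_rate=args.lr,
    num_train_epochs=args.epochs,
    gradient_checkpointing=args.gradient_checkpointing,
    logging_dir=f"{checkpoint_dir}/logs",
    logging_strategy="epoch",
    save_strategy="epoch",
    evaluation_strategy="epoch",
)

The final model can still be copied to /opt/ml/model at the end of training so the regular model artifact upload keeps working.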
