Comments (17)

baptistejamin avatar baptistejamin commented on May 16, 2024 2

Yes, keep the split. I strongly recommend using data_dir rather than data_files. Keep the same config, but replace data_files with data_dir.
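
For example, a minimal sketch of that change (the paths and module names here are placeholders, not from the repo):

dataset:
  hf_name: json
  hf_kwargs:
    data_dir: /path/to/my/dataset  # directory containing your .jsonl file(s)
  preprocessing_fn: my.import.path:my_preprocessing_fn
  split: train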

from llm-foundry.

singhalshikha518 avatar singhalshikha518 commented on May 16, 2024 2

I tried fine-tuning MPT-7B on the Dolly dataset, using the command below:
composer train.py yamls/finetune/mpt-7b_dolly_sft.yaml

Before training starts, I get the error below:

[Eval batch=321/321] Eval on eval data:
Eval metrics/eval/LanguageCrossEntropy: 9.1594
Eval metrics/eval/LanguagePerplexity: 9503.6523
/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/torch/utils/data/dataloader.py:554: UserWarning: This DataLoader will create 4 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
warnings.warn(_create_warning_msg(
Traceback (most recent call last):
File "", line 21, in _bwd_kernel
KeyError: ('2-.-0-.-0-842f0fbd42a6607893f7134cdd9d16f2-2b0c5161c53c71b37ae20a9996ee4bb8-c1f92808b4e4644c1732e8338187ac87-f24b6aa9b101a518b6a4a6bddded372e-12f7ac1ca211e037f62a7c0c323d9990-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.bfloat16, torch.bfloat16, torch.bfloat16, torch.bfloat16, torch.bfloat16, torch.float32, torch.bfloat16, torch.bfloat16, torch.float32, torch.float32, 'fp32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), ('vector', True, 128, False, True, True, True, 128, 128), (True, True, True, True, True, True, True, True, True, True, (False,), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False)))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/stsingha/LLM/llm-foundry/scripts/train/train.py", line 254, in
main(cfg)
File "/home/stsingha/LLM/llm-foundry/scripts/train/train.py", line 243, in main
trainer.fit()
File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/composer/trainer/trainer.py", line 1766, in fit
self._train_loop()
File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/composer/trainer/trainer.py", line 1940, in _train_loop
total_loss_dict = self._train_batch(use_grad_scaling)
File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/composer/trainer/trainer.py", line 2115, in _train_batch
optimizer.step(closure=lambda **kwargs: self._train_microbatches(
File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/torch/optim/lr_scheduler.py", line 68, in wrapper
return wrapped(*args, **kwargs)
File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/torch/optim/optimizer.py", line 140, in wrapper
out = func(*args, **kwargs)
File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/composer/optim/decoupled_weight_decay.py", line 288, in step
loss = closure()
File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/composer/trainer/trainer.py", line 2115, in
optimizer.step(closure=lambda **kwargs: self._train_microbatches(
File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/composer/trainer/trainer.py", line 2213, in _train_microbatches
microbatch_loss_dict = self._train_microbatch(use_grad_scaling, current_batch_size, is_final_microbatch)
File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/composer/trainer/trainer.py", line 2340, in _train_microbatch
microbatch_loss.backward(create_graph=self._backwards_create_graph)
File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/torch/autograd/init.py", line 197, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/torch/autograd/function.py", line 267, in apply
return user_fn(self, *args)
File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/flash_attn/flash_attn_triton.py", line 827, in backward
_flash_attn_backward(do, q, k, v, o, lse, dq, dk, dv,
File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/flash_attn/flash_attn_triton.py", line 694, in _flash_attn_backward
_bwd_kernel[grid](
File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/triton/runtime/jit.py", line 106, in launcher
return self.run(*args, grid=grid, **kwargs)
File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 73, in run
timings = {config: self._bench(*args, config=config, **kwargs)
File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 73, in
timings = {config: self._bench(*args, config=config, **kwargs)
File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 63, in _bench
return do_bench(kernel_call)
File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/triton/testing.py", line 140, in do_bench
fn()
File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 62, in kernel_call
self.fn.run(*args, num_warps=config.num_warps, num_stages=config.num_stages, **current)
File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 200, in run
return self.fn.run(*args, **kwargs)
File "", line 43, in _bwd_kernel
RuntimeError: Triton Error [CUDA]: invalid argument

Could you please help with this issue? @arpitkk @baptistejamin @alextrott16

from llm-foundry.

arpitkk avatar arpitkk commented on May 16, 2024 1

Hi team, I have identified the issue; it was a problem with the batch size. It's working fine now. Thanks for the support!

from llm-foundry.

baptistejamin avatar baptistejamin commented on May 16, 2024

I tried doing the same, and I do agree the instructions are not clear at all.

from llm-foundry.

alextrott16 avatar alextrott16 commented on May 16, 2024

Sorry for the lack of clarity here, we'll update the docs.

Going off of @arpitkk's example, you would want to use train_loader.dataset.hf_name: json and keep the rest the same.

Just to clarify, that config will eventually influence the behavior of this dataset-building method here:

def build_from_hf(self, cfg: DictConfig, tokenizer: Tokenizer):

So, by setting your config to

train_loader:
   name: finetuning
   dataset:
      hf_name: json
      hf_kwargs:
         data_files:
            train: /path/to/train.jsonl
      preprocessing_fn: my.import.path:my_preprocessing_fn
      split: train

The build_from_hf method will effectively execute the following code:

import datasets
from typing import Dict

# placeholder import path, matching the preprocessing_fn in the YAML above
from my.import.path import my_preprocessing_fn

dataset = datasets.load_dataset('json', split='train', data_files={'train': '/path/to/train.jsonl'})

def dataset_mapper(example: Dict):
    example = my_preprocessing_fn(example)
    return _tokenize_formatted_example(example, tokenizer)

columns_to_remove = list(dataset[0].keys())
tokenized_dataset = dataset.map(
    dataset_mapper,
    batched=False,
    remove_columns=columns_to_remove,
)
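
In case a concrete example helps, here is a hypothetical sketch of what my_preprocessing_fn could look like. The 'prompt'/'response' output keys are what the finetuning tokenization step consumes; the input column names ('instruction', 'output') are just assumptions about your own JSONL schema:

def my_preprocessing_fn(example: dict) -> dict:
    # Map the raw JSONL columns (assumed names) to the prompt/response
    # format expected by _tokenize_formatted_example.
    return {
        'prompt': example['instruction'],
        'response': example['output'],
    }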

Does that help to clear things up? If so, I can work a similar explanation into the README. If not, please let me know what still feels unclear!

from llm-foundry.

baptistejamin avatar baptistejamin commented on May 16, 2024

There might be an issue then. Here is the config I am using:

train_loader:
  name: finetuning

  dataset:
    hf_name: json
    hf_kwargs:
       data_files:
          train: /mnt/training/mylocaldataset/train.jsonl
    preprocessing_fn: mylocaldataset.utils:prep_fn
    split: train
    max_seq_len: ${max_seq_len}
    allow_pad_trimming: false
    decoder_only_format: true
    shuffle: true

I am running composer llm-foundry/scripts/train/train.py mpt.yml save_folder=mpt-tuned from the directory /mnt/training.

wc -l /mnt/training/mylocaldataset/train.jsonl: 1212821

I will debug this today and let you know if I make some progress :)

from llm-foundry.

baptistejamin avatar baptistejamin commented on May 16, 2024

Issue found!

It seems there are a couple of bugs in the Hugging Face datasets library. First, due to a regex problem, it mixes up the file name train.jsonl and the split name train.

So the dataset file should not be named train.jsonl; name it, for instance, prompts.jsonl.

Also, I now use data_dir instead of data_files, and it works fine this way:

dataset:
    hf_name: json
    hf_kwargs:
      keep_in_memory: true
      data_dir: /mnt/training/mylocaldataset
    preprocessing_fn: mylocaldataset.utils:prep_fn
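
For anyone curious, the effective load under this config should be roughly the following (a sketch, assuming the json builder's default behavior of collecting every data file under data_dir into a single train split):

import datasets

# With data_dir, the json builder scans the directory for data files
# (e.g. prompts.jsonl), so the file name no longer collides with the split name.
dataset = datasets.load_dataset(
    'json',
    data_dir='/mnt/training/mylocaldataset',
    keep_in_memory=True,
    split='train',
)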

from llm-foundry.

wj210 avatar wj210 commented on May 16, 2024

@baptistejamin, do you still have to specify the split? Is data_dir just the directory? How would it find prompts.jsonl then?

from llm-foundry.

wj210 avatar wj210 commented on May 16, 2024

I do not know why, but I keep getting the error "FileNotFoundError: Unable to find '/workspace/scripts/train/train' at /workspace/scripts/train"
at dataset = datasets.load_dataset(dataset_name, split=split, **kwargs).

My YAML is:

# Dataloaders
train_loader:
  name: finetuning
  dataset:
    hf_name: json
    hf_kwargs:
      data_files:
        train: data/train.json
    preprocessing_fn: preprocess_investopedia:preprocess_investopedia

The data folder is inside the current working directory.

from llm-foundry.

baptistejamin avatar baptistejamin commented on May 16, 2024

The global batch size is too high. Try a batch size of 1, then increase it until you hit OOM.
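
Something along these lines (the values are illustrative only, not a recommendation):

global_train_batch_size: 1        # start small...
device_train_microbatch_size: 1   # ...then raise these until you run out of memory
# device_train_microbatch_size: auto  # or let Composer pick the microbatch size automatically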

from llm-foundry.

alextrott16 avatar alextrott16 commented on May 16, 2024

Thanks for helping to surface these issues!!

@baptistejamin @arpitkk Did the explanation I posted above provide a useful intuition for how to set up the YAML? I want to make sure our README instructions are clear. I'll use your feedback to update them.

I'll also aim to include some of the gotchas that you have caught, e.g., data_dir vs data_files, relative pathing.
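
For example, the relative-pathing gotcha usually disappears if the dataset paths in the YAML are absolute (a sketch with placeholder paths, since relative paths resolve against wherever the run is launched from):

dataset:
  hf_name: json
  hf_kwargs:
    data_files:
      train: /abs/path/to/data/train.jsonl  # absolute path avoids depending on the launch directory
  split: train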

from llm-foundry.

arpitkk avatar arpitkk commented on May 16, 2024

Hi, after running a few batches, the code fails with the error below:

IndexError: Caught IndexError in DataLoader worker process 6.
Original Traceback (most recent call last):
File "/home/anaconda3/envs/mpt-train/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
data = fetcher.fetch(index)
File "/home/anaconda3/envs/mpt-train/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 61, in fetch
return self.collate_fn(data)
File "/home/llm-foundry/llmfoundry/data/finetuning/collator.py", line 116, in call
batch = self._process_and_batch_decoder_only(examples)
File "/home/llm-foundry/llmfoundry/data/finetuning/collator.py", line 222, in _process_and_batch_decoder_only
batch = self.tokenizer.pad(
File "/home/anaconda3/envs/mpt-train/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2949, in pad
if isinstance(encoded_inputs, (list, tuple)) and isinstance(encoded_inputs[0], Mapping):
IndexError: list index out of range

I am using the YAML file below:

max_seq_len: 2048
global_seed: 17

# Run Name
run_name: # If left blank, will be read from env var $COMPOSER_RUN_NAME

# Model
# These must match pretraining
model:
  name: hf_causal_lm
  device: cuda:0
  pretrained: true
  pretrained_model_name_or_path: mosaicml/mpt-7b

# Tokenizer
tokenizer:
  name: EleutherAI/gpt-neox-20b
  kwargs:
    model_max_length: ${max_seq_len}

dataset: &hf_dataset
  hf_name: json
  hf_kwargs:
    data_files:
      train: /home/MPT-7B/mpt_dataset/mpt-train.jsonl
      test: /home/MPT-7B/mpt_dataset/mpt-test.jsonl

# Dataloaders
train_loader: &train_loader
  name: finetuning
  dataset:
    <<: *hf_dataset
    split: train
    max_seq_len: ${max_seq_len}
    allow_pad_trimming: false
    decoder_only_format: true
    shuffle: true
    # Use python llmfoundry/data/packing.py --yaml-path /path/to/this/yaml/ ... to profile
    # this run's optimal packing_ratio
    # packing_ratio:
  drop_last: true
  num_workers: 8
  pin_memory: false
  prefetch_factor: 2
  persistent_workers: true
  timeout: 0

eval_loader:
  <<: *train_loader
  dataset:
    <<: *hf_dataset
    split: test
    max_seq_len: ${max_seq_len}
    allow_pad_trimming: false
    decoder_only_format: true
    shuffle: false

# Optimization
scheduler:
  name: linear_decay_with_warmup # linear no warmup is HF default which dolly used
  t_warmup: 0ba
  alpha_f: 0

optimizer:
  # mimic HF defaults to replicate dolly
  name: decoupled_adamw
  lr: 1.0e-5
  betas:
  - 0.9
  - 0.999
  eps: 1.0e-8
  weight_decay: 0

algorithms:
  gradient_clipping:
    clipping_type: norm
    clipping_threshold: 1.0

max_duration: 1ep
eval_interval: 500ba
eval_first: false
eval_subset_num_batches: -1
global_train_batch_size: 2

# System
seed: ${global_seed}
device_eval_batch_size: 1
device_train_microbatch_size: 1
# device_train_microbatch_size: auto

precision: amp_bf16

# FSDP
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: PURE
  activation_checkpointing: false
  activation_checkpointing_reentrant: false
  activation_cpu_offload: false
  limit_all_gathers: true
  verbose: false

# Logging
progress_bar: false
log_to_console: true
console_log_interval: 1ba

callbacks:
  speed_monitor:
    window_size: 10
  lr_monitor: {}
  memory_monitor: {}
  runtime_estimator: {}

# loggers:
#   wandb: {}

# Checkpoint to local filesystem or remote object store
save_interval: 2000ba
save_num_checkpoints_to_keep: 1 # Important, this cleans up checkpoints saved to DISK
save_folder: ./llm_local_finetune/checkpoints
# save_folder: s3://my-bucket/my-folder/{run_name}/checkpoints

# Load from remote object store
# REPLACE THE BELOW with your own checkpoint!
# load_path: oci://my-bucket/my-folder/mpt-7b/checkpoints/some_checkpoint.pt

The training fails at 486ba irrespective of which dataset I use. I checked whether any empty inputs are getting passed to collator.py, but the batch has data:

[{'input_ids': [30003, 310, 271, 9775, 326, 8631, 247, 4836, 15, 19566, 247, 2380, 326, 20420, 29141, 253, 2748, 15, 187, 187, 4118, 41959, 27, 187, 2513, 627, 667, 1039, 281, 1721, 555, 14, 24382, 247, 40315, 8393, 32, 535, 187, 4118, 19371, 27, 187], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'labels': [8469, 20444, 20798, 476, 320, 1892, 281, 1721, 555, 14, 24382, 13, 1580, 597, 10748, 1453, 616, 2133, 3276, 285, 18676, 407, 1469, 281, 253, 15336, 347, 3058, 13, 2299, 352, 310, 1896, 949, 247, 9950, 3733, 1232, 15, 50276, 4943, 403, 690, 8521, 7259, 323, 5547, 3733, 27, 187, 187, 18, 15, 50276, 12864, 634, 40315, 8393, 715, 247, 1355, 13, 6537, 2317, 342, 521, 390, 617, 7583, 2739, 285, 23908, 13, 285, 1918, 634, 40315, 8393, 5044, 3733, 15, 187, 19, 15, 50276, 13502, 38529, 9848, 3057, 342, 634, 40315, 8393, 4768, 253, 3733, 1232, 15, 187, 20, 15, 50276, 30802, 472, 1721, 555, 3879, 342, 1355, 26574, 13, 2739, 285, 1132, 2606, 15, 9225, 504, 3081, 26574, 846, 1016, 5547, 3733, 6874, 15, 209, 187, 21, 15, 50276, 5279, 673, 13, 13237, 2572, 253, 673, 634, 40315, 8393, 310, 7591, 281, 2289, 521, 390, 617, 1721, 555, 2170, 15, 50276, 21914, 326, 253, 3733, 1232, 3936, 673, 13, 594, 22450, 285, 5185, 49495, 403, 253, 2234, 281, 2323, 15, 209, 187, 22, 15, 50276, 29146, 1230, 4575, 634, 40315, 8393, 434, 6196, 281, 359, 266, 779, 390, 617, 432, 3081, 26574, 285, 6558, 253, 6799, 3879, 15, 187, 23, 15, 50276, 16628, 5277, 634, 40315, 8393, 342, 24443, 4158, 30653, 281, 1361, 731, 755, 9848, 342, 970, 247, 1721, 555, 275, 253, 987, 4328, 15, 50276, 187, 187, 1231, 671, 971, 281, 22175, 326, 40315, 20798, 403, 1355, 285, 28304, 13, 285, 2430, 2714, 1557, 285, 10885, 15, 50276, 6693, 187]}]
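
For reference, a debug loop along these lines can surface which batch index fails to collate (a rough sketch only, not part of llm-foundry; train_dataloader stands for the loader built from the YAML above, re-created with num_workers: 0 so the failure happens in the main process):

import itertools

it = iter(train_dataloader)
for i in itertools.count():
    try:
        batch = next(it)  # collation happens here in the main process
    except StopIteration:
        break
    except IndexError:
        print(f'Collation failed at batch index {i}')
        raise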

from llm-foundry.

vchiley avatar vchiley commented on May 16, 2024

see #143 (comment)

from llm-foundry.

alextrott16 avatar alextrott16 commented on May 16, 2024

Since this issue was originally made to address the lack of clarity around finetuning from a local dataset, I just want to let folks know that we just pushed a PR that includes a much more concrete example of this workflow.

In the scripts/train directory, you'll find finetune_example, which includes:

  • a detailed README
  • an example local training dataset
  • an implementation of a preprocessing function for that dataset
  • a YAML which puts it all together and can be run locally via train.py

To help us stay on top of other issues, I'll close this one. If things remain unclear, feel free to add another comment and I'll re-open the issue if necessary. Thank you!

from llm-foundry.

zacharyblank avatar zacharyblank commented on May 16, 2024

> Hi team, I have identified the issue; it was a problem with the batch size. It's working fine now. Thanks for the support!

@arpitkk can you explain what you changed to get this working? I am running into the same issue.

from llm-foundry.

varunnathan avatar varunnathan commented on May 16, 2024

> > Hi team, I have identified the issue; it was a problem with the batch size. It's working fine now. Thanks for the support!
>
> @arpitkk can you explain what you changed to get this working? I am running into the same issue.

I am facing the same issue as well. @arpitkk, could you please share what you changed to get this working?

from llm-foundry.
