tag-llm's Introduction

Tag-LLM: Repurposing General-Purpose LLMs for Specialized Domains

PyTorch implementation of Tag-LLM proposed in the paper "Tag-LLM: Repurposing General-Purpose LLMs for Specialized Domains".

Requirements

Run pip install -r requirements.txt to install the dependencies.

Experiments

Download pretrained LLMs

Get a copy of LLaMA-7B into ./llama-7b with the following commands:

apt-get install git-lfs 
git lfs install 
git clone https://huggingface.co/huggyllama/llama-7b

Prepare datasets

The language-, SMILES-, and protein-related datasets are downloaded automatically from Hugging Face when the code runs. To download the TDC benchmark data, run the following Python code.

Binding affinity prediction:

from tdc import BenchmarkGroup
group = BenchmarkGroup(name = 'DTI_DG_Group', path = './data/')
benchmark = group.get('BindingDB_Patent')
benchmark['train_val'].to_csv("data/dti_dg_group/bindingdb_patent/train_val.csv")
benchmark['test'].to_csv("data/dti_dg_group/bindingdb_patent/test.csv")

Drug combination:

from tdc.benchmark_group import drugcombo_group
group = drugcombo_group(path = './data/')
benchmark = group.get('Drugcomb_CSS')
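
If you also want these splits on disk, the following continuation of the snippet above is a sketch that mirrors the binding-affinity example; it assumes the drug-combination benchmark returns the same 'train_val'/'test' DataFrames, and the CSV paths are illustrative, not paths the training code is known to require.

# Sketch continuing the snippet above; the output paths below are illustrative.
benchmark['train_val'].to_csv("data/drugcombo_group/train_val.csv")
benchmark['test'].to_csv("data/drugcombo_group/test.csv")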

Run experiments

We provide all config files to reproduce our experiments under src/conf. Rename the config file of interest to src/conf/config.yaml, then run the following command:

python3 -m src.train
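
Concretely, preparing the config before running the command above could look like the following sketch; the file name domain_tag.yaml is an example taken from the issue reports below, so substitute whichever config under src/conf you want to reproduce.

# Sketch only: copy the chosen config to the name the trainer expects.
# "domain_tag.yaml" is an example name taken from the issues below.
import shutil
shutil.copy("src/conf/domain_tag.yaml", "src/conf/config.yaml")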

Evaluation

For generation tasks:

python3 -m src.generation

For regression tasks:

python3 -m src.infer_regression

Citation

If you find this project helpful, please consider citing our paper:

@misc{shen2024tagllm,
      title={Tag-LLM: Repurposing General-Purpose LLMs for Specialized Domains}, 
      author={Junhong Shen and Neil Tenenholtz and James Brian Hall and David Alvarez-Melis and Nicolo Fusi},
      year={2024},
      eprint={2402.05140},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

Thanks!

tag-llm's Issues

ValueError: Attention mask should be of size (1, 1, 1, 25), but is torch.Size([1, 1, 25, 25])

Training works fine, but during evaluation I get the following attention mask dimension mismatch:

[2024-03-15 08:24:03,106][src.trainer_seq2seq][INFO] - ***** Running Evaluation *****
[2024-03-15 08:24:03,107][src.trainer_seq2seq][INFO] -   Num examples = 3
[2024-03-15 08:24:03,107][src.trainer_seq2seq][INFO] -   Batch size = 1
eval step 0
[2024-03-15 08:24:03,110][src.trainer_seq2seq][WARNING] - Overwriting existing generation config due to DeepSpeed bug. If model is not LLAMA, check this.
Error executing job with overrides: []
Traceback (most recent call last):
  File "/Users/kevin/Tag-LLM/src/train.py", line 376, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/transformers/trainer.py", line 1633, in train
    return inner_training_loop(
  File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/transformers/trainer.py", line 1979, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/transformers/trainer.py", line 2236, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/Users/kevin/Tag-LLM/src/trainer_seq2seq.py", line 282, in evaluate
    output = eval_loop(
  File "/Users/kevin/Tag-LLM/src/trainer_seq2seq.py", line 417, in evaluation_loop
    loss, logits, labels = self.prediction_step(
  File "/Users/kevin/Tag-LLM/src/trainer_seq2seq.py", line 158, in prediction_step
    generated_tokens = self.model.generate(
  File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/transformers/generation/utils.py", line 1406, in generate
    return self.greedy_search(
  File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/transformers/generation/utils.py", line 2201, in greedy_search
    outputs = self(
  File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/kevin/Tag-LLM/src/tag_llama.py", line 738, in forward
    outputs = self.model(
  File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/kevin/Tag-LLM/src/tag_llama.py", line 627, in forward
    layer_outputs = decoder_layer(
  File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/kevin/Tag-LLM/src/tag_llama.py", line 312, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/kevin/Tag-LLM/src/tag_llama.py", line 228, in forward
    raise ValueError(
ValueError: Attention mask should be of size (1, 1, 1, 25), but is torch.Size([1, 1, 25, 25])
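
For context on the two shapes in the error, here is generic KV-cache shape arithmetic (an illustration, not the repository's code): with a cache of past tokens and a single new query token, the expanded mask is expected to cover (batch, 1, q_len, kv_len), whereas the mask that was passed is a full square over all positions.

# Generic shape arithmetic for cached decoding (illustration only).
batch, past_len, q_len = 1, 24, 1
kv_len = past_len + q_len                # 25 keys/values visible to the new token
expected = (batch, 1, q_len, kv_len)     # (1, 1, 1, 25) -- what the check expects
passed = (batch, 1, kv_len, kv_len)      # (1, 1, 25, 25) -- a full square mask
print(expected, passed)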

Checkpoint saving error: AttributeError: 'str' object has no attribute 'numel'

Domain Task: Language
Model: llama7b (huggyllama)
Python 3.10 (same on 3.9)
GPU: RTX A6000 Ada

This happens on every checkpoint save and breaks training.

I can only get around this by patching _save_checkpoint in transformers/trainer.py with a try/except block; the layer causing this is col_ampere, and all other layers save with no issue.

AttributeError: 'str' object has no attribute 'numel'

Traceback (most recent call last):
  File "/workspace/Tag-LLM/src/train.py", line 378, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1633, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1979, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2240, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2297, in _save_checkpoint
    self.save_model(output_dir, _internal_call=True)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2775, in save_model
    self._save(output_dir)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2827, in _save
    self.model.save_pretrained(output_dir, state_dict=state_dict)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 1729, in save_pretrained
    shards, index = shard_checkpoint(state_dict, max_shard_size=max_shard_size, weights_name=weights_name)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 302, in shard_checkpoint
    weight_size = weight.numel() * dtype_byte_size(weight.dtype)
AttributeError: 'str' object has no attribute 'numel'

Unusable augmented embedder: values are NaN

Task: Language (domain_tag.yaml)
Model: llama7b (huggyllama)
Python 3.10 (same on 3.9)
GPU: RTX A6000 Ada

When training the model (I have tried many training runs and different datasets), the embeddings always turn out NaN (not a number). The trained model then only outputs garbage: \x04 tokens (token id 7) instead of text, with a single \x02 token (token id 5) at the end. I'm using TagLLamaForCausalLM.from_pretrained to load the model, but I also see the NaNs directly in the exp/Language/[...]/embedding_weights.npy file:

[screenshot: embedding weights containing NaN values]
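
A quick way to confirm the NaNs in the saved weights (a generic check, not code from the repo; <run_dir> stands for the elided run directory above):

import numpy as np

# Replace <run_dir> with the actual run directory.
weights = np.load("exp/Language/<run_dir>/embedding_weights.npy")
print("any NaN:", np.isnan(weights).any(), "fraction NaN:", np.isnan(weights).mean())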

Update:
Interestingly, it seems to happen gradually.

Step 0 (initial weights) to 2
[screenshot: embedding weights at steps 0 through 2]

Loss step 0: tensor(9.0469, device='cuda:0', dtype=torch.float16, grad_fn=<AddBackward0>)
Loss step 1: tensor(nan, device='cuda:0', dtype=torch.float16, grad_fn=<AddBackward0>)

When registering a backward hook on the augmented embedder, I can see that no input gradients reach it; I'm unsure whether that is what is supposed to happen.
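
For reference, a toy version of that check in standalone PyTorch (not the repository's model): register a full backward hook on an embedding layer and print the gradient that reaches it.

import torch
import torch.nn as nn

# Toy embedding standing in for the augmented embedder (illustration only).
emb = nn.Embedding(10, 4)

def report_grads(module, grad_input, grad_output):
    # grad_output holds the gradient flowing into the embedding's output.
    print("grad_output norm:", grad_output[0].norm().item())

emb.register_full_backward_hook(report_grads)
loss = emb(torch.tensor([1, 2, 3])).sum()
loss.backward()  # the hook should fire and print a finite norm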

Getting an error while running the train.py file

After installing Git LFS, installing the dependencies from requirements.txt, and downloading the datasets, I tried to run train.py from the src folder. Line 160 of train.py is:

train_dataset, eval_dataset, tag_name_dict, num_new_tokens, tags_to_update, domain_tags = get_dataset(task_name, num_existing_tokens, tag_name_dict, args.model.num_token_per_tag, args.model.use_domain_tag, args.model.use_function_tag, args.model.regression, freeze, True)

I get an error at the line idx = tag_name_dict[tname].find(">"), which comes from line 358 of get_dataset.py, where the code is:
if task_name == "BA":
    existing_tokens = ["", ""] if use_domain_tag else []

    for tname in existing_tokens:
        idx = tag_name_dict[tname].find(">")
        domain_tags.append(int(tag_name_dict[tname][5:idx]))

    tags_to_update = ["<BA>"]
    for tname in tags_to_update:
        tag_name_dict[tname] = "".join(["<TAG " + str(i) + ">" for i in range(num_existing_tokens, num_existing_tokens + num_token_per_tag)])
        num_existing_tokens += num_token_per_tag

My question is: when I initialise tag_name_dict = {} during domain-tag training, why do we afterwards still try to look up the existing domain tags (whose names appear stripped in the snippet above) in that dictionary?

TypeError: first argument must be callable or None

I got the following error during training:

Error executing job with overrides: []
Traceback (most recent call last):
  File "/Users/kevin/Tag-LLM/src/train.py", line 376, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/transformers/trainer.py", line 1633, in train
    return inner_training_loop(
  File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/transformers/trainer.py", line 1902, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/transformers/trainer.py", line 2638, in training_step
    inputs = self._prepare_inputs(inputs)
  File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/transformers/trainer.py", line 2583, in _prepare_inputs
    inputs = self._prepare_input(inputs)
  File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/transformers/trainer.py", line 2565, in _prepare_input
    return type(data)({k: self._prepare_input(v) for k, v in data.items()})
TypeError: first argument must be callable or None

The error is triggered in transformers/trainer.py: type(data) gets the type of data, which is a defaultdict, and tries to reconstruct a new defaultdict after mapping its values with _prepare_input. However, the defaultdict constructor cannot be called this way:

def _prepare_input(self, data: Union[torch.Tensor, Any]) -> Union[torch.Tensor, Any]:
    """
    Prepares one `data` before feeding it to the model, be it a tensor or a nested list/dictionary of tensors.
    """
    if isinstance(data, Mapping):
        return type(data)({k: self._prepare_input(v) for k, v in data.items()})
    elif isinstance(data, (tuple, list)):
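
A minimal standalone reproduction of that constraint (an illustration, not code from transformers or this repo):

from collections import defaultdict

d = defaultdict(list, {"a": [1, 2]})
# Rebuilding via type(d)(...) passes a plain dict as default_factory,
# which must be callable or None:
type(d)({k: v for k, v in d.items()})
# TypeError: first argument must be callable or None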

A fix is to convert the defaultdict to a dict at the end of the DataCollatorForTagLLM in collator.py:

...
if len(reg_idx) > 0:
    model_inputs["reg_idx"] = reg_idx
    model_inputs["reg_dim"] = reg_dim
    model_inputs["reg_pred_idx"] = reg_pred_idx
if len(clf_idx) > 0:
    model_inputs["clf_idx"] = clf_idx

return dict(model_inputs)

TypeError: compute_metrics() got an unexpected keyword argument 'special_tokens'

In metrics.py, get_compute_metrics_fn does a partial application with a keyword argument called special_tokens. However, compute_metrics does not accept this argument; it only has a gist_token parameter, which is not even used and is deleted on the first line of the function.

def get_compute_metrics_fn(
    special_tokens: int, tokenizer: PreTrainedTokenizer, args: Arguments
) -> Callable:
    return functools.partial(
        compute_metrics, special_tokens=special_tokens, tokenizer=tokenizer, args=args
    )


def compute_metrics(
    eval_preds: EvalPrediction,
    gist_token: int,
    tokenizer: PreTrainedTokenizer,
    args: Arguments,
    output_file: Optional[str] = None,
) -> Metrics:
    del gist_token
    ...
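
A standalone illustration of the mismatch (stand-in functions, not the repository's metrics code):

import functools

def compute_metrics(eval_preds, gist_token=None, tokenizer=None):
    del gist_token
    return {"n": len(eval_preds)}

fn = functools.partial(compute_metrics, special_tokens=0, tokenizer=None)
fn([1, 2, 3])
# TypeError: compute_metrics() got an unexpected keyword argument 'special_tokens'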

Clarify Stage 2 & 3

Hi! Thanks for sharing your code. I'd like to evaluate Tag-LLM according to the paper. However, I cannot see how the code runs the three stages: looking at e.g. Translate, there is a domain training and a cross-domain function training, so it seems as if stage 2 is skipped.

Would I still train stage 2, or does supplying cross-domain data in stage 3 (autoregressive=False) achieve the same goal?
There are 3 config files, but only two of them show e.g. the Translate/Language training.

Would love a quick clarification. Thanks!
