Comments (6)
Hi, thanks for your feedback🤗! The prompt template that uses a system prompt and "[/INST]" is specifically designed for the chat model.
We highly recommend switching from 'AdaptLLM/finance-LLM' to 'AdaptLLM/finance-chat' for improved response quality.
Regarding your use-case, here's an example using the recommended 'finance-chat' model:
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("AdaptLLM/finance-chat")
tokenizer = AutoTokenizer.from_pretrained("AdaptLLM/finance-chat", use_fast=False)
# Put your query here
query_str = 'xxx'
your_system_prompt = 'Please, check if the answer can be inferred from the pieces of context provided. If the answer cannot be inferred from the context, just state that the question is out of scope and do not provide any answer.'
# Integrate your system prompt into the input instruction, following our system prompt.
query_prompt = f"<s>[INST] <<SYS>>\nYou are a helpful, respectful, and honest assistant. Always answer as helpfully as possible, while being safe. Your responses should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\n<</SYS>>\n\n{your_system_prompt}\n{query_str} [/INST]"
# NOTE: alternatively, you can skip our system prompt and start directly from your own:
# query_prompt = f"{your_system_prompt}\n{query_str} [/INST]"
inputs = tokenizer(query_prompt, return_tensors="pt", add_special_tokens=False).input_ids.to(model.device)
outputs = model.generate(input_ids=inputs, max_length=4096)[0]
answer_start = int(inputs.shape[-1])
pred = tokenizer.decode(outputs[answer_start:], skip_special_tokens=True)
print(f'### User Query:\n{query_str}\n\n### Assistant Output:\n{pred}')
Feel free to let us know if you have any more questions🤗.
from lmops.
Thanks for the advice, I will try that immediately.
But that configuration has been tested with many models. I have also tried going with the defaults and many other combinations.
Hi,
I am trying 'AdaptLLM/finance-chat' as suggested and it seems to work fine.
However, the generation configuration does not seem to be taken into account.
First, with transformers 4.36.2, I receive the following warning twice:
/home/emoman/.local/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:389: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
It seems that the generation kwargs from the script are completely ignored and that the settings are read directly from 'generation_config.json'.
So, if I alter that file to:
{
  "_from_model_config": true,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 32000,
  "do_sample": true,
  "temperature": 0.0000001,
  "top_p": 0.0000001,
  "top_p": 1,
  "repetition_penalty": 0.1,
  "transformers_version": "4.31.0.dev0"
}
The warnings disappear, but the model keeps repeating itself, which would seem to indicate that 'repetition_penalty' is being ignored.
Some people suggest setting '_from_model_config' to false, but it does not change anything.
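For what it's worth, a workaround that avoids editing the file on disk might be to build a `GenerationConfig` in code and pass it at call time; according to the transformers documentation, call-time settings take precedence over the defaults loaded from `generation_config.json`. A minimal sketch (the parameter values here are purely illustrative, not recommendations):

```python
from transformers import GenerationConfig

# Build a generation config in code instead of editing
# generation_config.json on disk; values are illustrative only.
gen_config = GenerationConfig(
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.2,
)

# Passing it to generate should override the file-based defaults:
# outputs = model.generate(input_ids=inputs, generation_config=gen_config)
```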
Hi, thanks for the feedback. I think we can resolve this warning by unsetting `temperature` and `top_p`.
Remove `temperature` and `top_p` from the `generation_config.json` file, making it look like this:
{
  "_from_model_config": true,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 32000,
  "transformers_version": "4.31.0.dev0"
}
I've tested this with transformers version 4.36.2, and it works fine now.
Yes, thank you. It works.
But, generally speaking, I believe that the generation kwargs explicitly set in the script should override the default configuration file.
It would seem that, if they don't, transformers has switched to some sort of legacy mode.
Finally, the model does respond to repetition_penalty and other generation parameters.
But it is extremely capricious and I haven't found a way to consistently avoid repetition other than post-processing.
It is a pity because the model seems very good for my purposes.
I believe that this volatility may be intrinsic to vanilla Llama-2 and is not a consequence of the "reading comprehension" adaptation.
That being the case, perhaps the best solution would be to replace vanilla Llama with something better stabilised, such as Mistral. Tulu also shows very steady behaviour.
Hi,
Thanks for your recommendation to switch our base models to Mistral and Tulu. Mistral is indeed in our future plans.
Regarding this issue:
> But, generally speaking, I believe that the generation kwargs explicitly set on the script should override the default configuration file
I completely agree that "generation kwargs explicitly set on the script should override the default configuration file".
But there might be some conflicts in your config settings.
{
  "_from_model_config": true,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 32000,
  "do_sample": true,
  "temperature": 0.0000001,
  "top_p": 0.0000001,
  "top_p": 1,
  "repetition_penalty": 0.1,
  "transformers_version": "4.31.0.dev0"
}
According to the official documentation: https://huggingface.co/docs/transformers/v4.36.1/en/main_classes/text_generation#generation
Firstly, setting the `temperature` to an extremely small value near 0 (0.0000001) creates a highly concentrated token distribution, behaving similarly to `do_sample=False`. This contradicts your setting of `"do_sample": true`.
Secondly, there are conflicting values for `top_p` in your configuration: the key appears twice, first as 0.0000001 and then as 1.
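To illustrate why a near-zero temperature effectively disables sampling, here is a small self-contained sketch (plain Python, not transformers code) of softmax with temperature: as the temperature approaches 0, almost all probability mass collapses onto the highest logit, so sampling degenerates into greedy decoding.

```python
import math

# Softmax with temperature: lower temperature sharpens the distribution.
def softmax_with_temperature(logits, temperature):
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax_with_temperature([2.0, 1.0, 0.1], 1e-7)
# With a near-zero temperature, virtually all probability mass
# lands on the highest logit, mimicking greedy decoding.
```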
Then, the `repetition_penalty` value of 0.1 would make the problem even worse; a value higher than 1, such as 1.2, is recommended to reduce repetition.
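As a rough illustration of why a penalty below 1 backfires, here is a self-contained sketch modelled on (but not identical to) the rule used by transformers' `RepetitionPenaltyLogitsProcessor`: positive scores of already-generated tokens are divided by the penalty and negative ones multiplied by it, so a penalty of 0.1 boosts repeated tokens instead of suppressing them.

```python
# Sketch of a repetition penalty applied to raw token scores; assumes
# the divide-if-positive / multiply-if-negative rule of transformers'
# RepetitionPenaltyLogitsProcessor.
def apply_repetition_penalty(scores, generated_ids, penalty):
    scores = list(scores)
    for tok in set(generated_ids):
        if scores[tok] > 0:
            scores[tok] /= penalty   # penalty > 1 lowers positive scores
        else:
            scores[tok] *= penalty   # penalty > 1 pushes negatives lower
    return scores

scores = [2.0, -1.0, 0.5]
# With penalty 1.2, previously generated tokens 0 and 1 are discouraged.
discouraged = apply_repetition_penalty(scores, [0, 1], 1.2)
# With penalty 0.1, the same tokens are strongly *encouraged* instead.
encouraged = apply_repetition_penalty(scores, [0, 1], 0.1)
```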
The simplest setting for your config is the following; you may refer to the official documentation for your specific use case:
{
  "_from_model_config": true,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 32000,
  "repetition_penalty": 1.2,
  "transformers_version": "4.31.0.dev0"
}