kennethleungty / llama-2-open-source-llm-cpu-inference

927 stars · 13 watchers · 205 forks · 4.63 MB

Running Llama 2 and other Open-Source LLMs on CPU Inference Locally for Document Q&A

Home Page: https://towardsdatascience.com/running-llama-2-on-cpu-inference-for-document-q-a-3d636037a3d8

License: MIT License

Language: Python (100.00%)
Topics: cpu, cpu-inference, deep-learning, faiss, langchain, large-language-models, llm, machine-learning, natural-language-processing, nlp

llama-2-open-source-llm-cpu-inference's Introduction

Running Llama 2 and other Open-Source LLMs on CPU Inference Locally for Document Q&A

Clearly explained guide for running quantized open-source LLM applications on CPUs using Llama 2, C Transformers, GGML, and LangChain

Step-by-step guide on TowardsDataScience: https://towardsdatascience.com/running-llama-2-on-cpu-inference-for-document-q-a-3d636037a3d8


Context

  • Third-party commercial large language model (LLM) providers like OpenAI (GPT-4) have democratized LLM use via simple API calls.
  • However, there are instances where teams require self-managed or private model deployment, for reasons such as data privacy and residency rules.
  • The proliferation of open-source LLMs has opened up a vast range of options, reducing our reliance on these third-party providers.
  • When we host open-source LLMs locally on-premise or in the cloud, dedicated compute capacity becomes a key consideration. While GPU instances may seem the obvious choice, costs can easily skyrocket beyond budget.
  • In this project, we discover how to run quantized versions of open-source LLMs locally on CPU for document question-and-answer (Q&A).


Quickstart

  • Ensure you have downloaded the GGML binary file from https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML and placed it into the models/ folder
  • To start parsing user queries into the application, launch the terminal from the project directory and run the following command: poetry run python main.py "<user query>"
  • For example, poetry run python main.py "What is the minimum guarantee payable by Adidas?"
  • Note: Omit the prepended poetry run if you are NOT using Poetry


Tools

  • LangChain: Framework for developing applications powered by language models. (See the sketch after this list for how these tools fit together.)
  • C Transformers: Python bindings for Transformer models implemented in C/C++ using the GGML library.
  • FAISS: Open-source library for efficient similarity search and clustering of dense vectors.
  • Sentence-Transformers (all-MiniLM-L6-v2): Open-source pre-trained transformer model for embedding text into a 384-dimensional dense vector space, for tasks like clustering or semantic search.
  • Llama-2-7B-Chat: Open-source fine-tuned Llama 2 model designed for chat dialogue. Leverages publicly available instruction datasets and over 1 million human annotations.
  • Poetry: Tool for dependency management and Python packaging.
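
A minimal sketch of how these tools fit together, loosely mirroring src/llm.py and src/utils.py (the paths and parameter values below are illustrative assumptions, not the repo's exact code):

    # Quantized Llama 2 on CPU via C Transformers, wired into a LangChain
    # retrieval QA chain over a FAISS store built with all-MiniLM-L6-v2
    # embeddings. Paths and parameters are illustrative assumptions.
    from langchain.llms import CTransformers
    from langchain.embeddings import HuggingFaceEmbeddings
    from langchain.vectorstores import FAISS
    from langchain.chains import RetrievalQA

    llm = CTransformers(
        model="models/llama-2-7b-chat.ggmlv3.q8_0.bin",
        model_type="llama",
        config={"max_new_tokens": 256, "temperature": 0.01},
    )
    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    )
    vectordb = FAISS.load_local("vectorstore/db_faiss", embeddings)
    dbqa = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=vectordb.as_retriever(search_kwargs={"k": 2}),
        return_source_documents=True,
    )
    print(dbqa({"query": "What is the minimum guarantee payable by Adidas?"}))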

Files and Content

  • /assets: Images relevant to the project
  • /config: Configuration files for the LLM application
  • /data: Dataset used for this project (i.e., Manchester United FC 2022 Annual Report, a 177-page PDF document)
  • /models: Binary file of the GGML-quantized LLM (i.e., Llama-2-7B-Chat)
  • /src: Python code for the key components of the LLM application, namely llm.py, utils.py, and prompts.py
  • /vectorstore: FAISS vector store for documents
  • db_build.py: Python script to ingest the dataset and generate the FAISS vector store (see the sketch after this list)
  • main.py: Main Python script to launch the application and pass the user query via the command line
  • pyproject.toml: TOML file specifying the versions of the dependencies used (Poetry)
  • requirements.txt: List of Python dependencies (and versions)
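
As a rough illustration of the ingestion step in db_build.py (the loader choice and chunking parameters here are assumptions, not necessarily the script's exact values):

    # Load the annual-report PDF, split it into overlapping chunks, embed the
    # chunks with all-MiniLM-L6-v2, and persist a FAISS index to vectorstore/.
    # Requires the pypdf package for PyPDFLoader.
    from langchain.document_loaders import PyPDFLoader
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain.embeddings import HuggingFaceEmbeddings
    from langchain.vectorstores import FAISS

    docs = PyPDFLoader("data/manu-20f-2022-09-24.pdf").load()
    chunks = RecursiveCharacterTextSplitter(
        chunk_size=500, chunk_overlap=50
    ).split_documents(docs)
    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    )
    FAISS.from_documents(chunks, embeddings).save_local("vectorstore/db_faiss")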


llama-2-open-source-llm-cpu-inference's People

Contributors

eltociear, kennethleungty, msfasha, seyedsaeidmasoumzadeh


llama-2-open-source-llm-cpu-inference's Issues

Support for 70b by updating ctransformers

You can use the 70b parameter model now as well, here is how I accomplished it:

  1. Downloaded the 70b parameter model I wanted from https://huggingface.co/TheBloke/Llama-2-70B-Chat-GGML/tree/main. In my case, I chose 'llama-2-70b-chat.ggmlv3.q5_K_M.bin'. None of my runs so far have used much more than 6-8GB of RAM. You need to modify the 'config/config.yml' to point to your newly downloaded model.

  2. Updated the CTransformers package to the latest version which adds support for 70b (ctransformers-0.2.15 or higher):
    poetry run pip install ctransformers --upgrade

  3. I also updated langchain (and I had done this first but I'm not sure it's required):
    poetry run pip install langchain --upgrade

Now it runs! Much slower, though (<1 minute became almost 10 minutes).
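
For reference, a quick smoke test that the upgraded build can load the 70B file directly (a sketch; the path assumes the download from step 1):

    # Load the 70B GGML file with ctransformers 0.2.15+ and generate a few
    # tokens. The file path assumes the q5_K_M download from step 1 above.
    from ctransformers import AutoModelForCausalLM

    llm = AutoModelForCausalLM.from_pretrained(
        "models/llama-2-70b-chat.ggmlv3.q5_K_M.bin",
        model_type="llama",
    )
    print(llm("Hello,", max_new_tokens=8))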

how to change data files?

The first time, I ran the main program successfully. Then I tried changing the data file to my own PDF and asking related questions.

But the results always look like they are based on the original document.

results:

D:\Llama-2-Open-Source-LLM-CPU-Inference>python main.py "how many years of experience in banking technology practitioners"

Answer: Mr. Woodward has 16 years of experience in banking and finance, including 9 years as a senior investment banker within J.P . Morgan’s international mergers and acquisitions team and 7 years in various senior finance roles at Ladbrokes.

Source Document 1

Source Text: worked as a senior investment banker within J.P . Morgan’s international mergers and acquisitions teambetween 1999 and 2005. Prior to joining J.P . Morgan, Mr. Woodward worked for PricewaterhouseCoopersLLP in the Accounting and Tax advisory department between 1993 and 1999. He received a Bachelor ofScience degree in physics from Bristol University in 1993 and qualified for his Chartered Accountancyin 1996.
Document Name: data\manu-20f-2022-09-24.pdf
Page Number: 87

In fact, the file is data/resume_cn1_en.pdf.

How can I handle this issue?
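
A plausible cause (an assumption, not confirmed in this thread): the FAISS index under vectorstore/ is generated once by db_build.py, so swapping files in data/ has no effect until the store is rebuilt. A minimal sketch of the rebuild:

    # Remove the stale index, then regenerate it from the new PDF by rerunning
    # the ingestion script. The index path is an assumption based on the
    # /vectorstore folder described in the README above.
    import shutil
    import subprocess

    shutil.rmtree("vectorstore/db_faiss", ignore_errors=True)
    subprocess.run(["python", "db_build.py"], check=True)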

Aborted (core dumped) when executing dbqa()

Hi,
Thanks for the great article. I've a question here.

When executing dbqa(), it returned:
free(): invalid next size (normal)
Aborted (core dumped)

Any idea?

==
Here are the steps to reproduce the problem.

  1. Modify the embedding model to "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"
  2. Run python db_build.py to build the vector store
  3. Run python main.py "MY QUERY STRING IS HERE"

license

Could you please add a LICENSE file to this repo? Are you releasing the code under Apache 2 license, for example?

[Feature Request] Support InternLM Deploy

Dear Llama-2-Open-Source-LLM-CPU-Inference developer,

Greetings! I am vansinhu, a community developer and volunteer at InternLM. InternLM is a large language model similar to llama2, and we look forward to InternLM being supported in Llama-2-Open-Source-LLM-CPU-Inference. If there are any challenges or inquiries regarding support for InternLM, please feel free to join our Discord discussion at https://discord.gg/gF9ezcmtM3.

Best regards,
vansinhu

config customization

Hi!
Thanks - awesome job.

I have a question: why does changing the config (bigger chunks, higher vector counts) lead to broken output? For example:
VECTOR_COUNT: 3
CHUNK_SIZE: 600
CHUNK_OVERLAP: 50
gives me illogical output.
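
One plausible explanation (an assumption, not confirmed here): with three 600-character chunks stuffed into the prompt, the total can exceed the model's context window, and generation degrades. C Transformers exposes the window size through its config, so a sketch of a possible fix:

    # Enlarge the context window so VECTOR_COUNT x CHUNK_SIZE plus the prompt
    # still fits. The model path and context_length value are illustrative.
    from langchain.llms import CTransformers

    llm = CTransformers(
        model="models/llama-2-7b-chat.ggmlv3.q8_0.bin",
        model_type="llama",
        config={"max_new_tokens": 256, "context_length": 2048},
    )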

git lfs error

"You seem to have cloned a repository without having git-lfs installed. Please install git-lfs and run git lfs install followed by git lfs pull in the folder you cloned." Please tell me how to rectify this.

"error loading model: unrecognized tensor type 11"

I was intrigued by your Medium article, so I downloaded and ran this repo, and the following error appears:
"error loading model: unrecognized tensor type 11"
Please help.
Also, there is a code snippet that loads an environment variable; where and how do I declare it?
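
For what it's worth, this error typically means the GGML file uses a quantization format (e.g., the newer k-quant variants) that the installed library predates; upgrading ctransformers is a plausible fix, as in the 70B issue above (an assumption, not confirmed in this thread). To check the installed version:

    # Print the installed ctransformers version; the 70B issue above needed
    # 0.2.15 or higher for newer GGML tensor types.
    from importlib.metadata import version

    print(version("ctransformers"))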

A Question.

Great project here, I am wondering:

What is a technique one could use to add conversation history to this method of QA Chain? I have not found any subject material with respect to calling the LLM from chain type.
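
One approach (a sketch, not part of this repo): swap the plain retrieval QA chain for LangChain's ConversationalRetrievalChain, which threads chat history through the same LLM and retriever. llm and vectordb below are assumed to be built as in src/:

    # Add conversation memory via ConversationalRetrievalChain. Assumes llm
    # and vectordb are constructed as elsewhere in this project.
    from langchain.chains import ConversationalRetrievalChain
    from langchain.memory import ConversationBufferMemory

    memory = ConversationBufferMemory(
        memory_key="chat_history", return_messages=True
    )
    chat_qa = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=vectordb.as_retriever(search_kwargs={"k": 2}),
        memory=memory,
    )
    print(chat_qa({"question": "What is the minimum guarantee payable by Adidas?"}))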

ModuleNotFoundError: No module named 'langchain' occurs

Even though I installed langchain, the error still occurs:

File "D:\ML\test\Llama-2-Open-Source-LLM-CPU-Inference\main.py", line 3, in <module>
from src.utils import setup_dbqa
File "D:\ML\test\Llama-2-Open-Source-LLM-CPU-Inference\src\utils.py", line 7, in <module>
from langchain import PromptTemplate
ModuleNotFoundError: No module named 'langchain'

I ran pip install -r requirements.txt and installed everything.

What should I do?


401 Client Error in loading Llama2-7b model

Hi,
I tried to run main.py and got the following error messages.
It looks like it failed to load the Llama-2-7b model.
Any help will be appreciated.

'''
$ python main.py 'How much is the minimum guarantee payable by adidas?'

==
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_errors.py", line 261, in hf_raise_for_status
response.raise_for_status()
File "/usr/local/lib/python3.10/dist-packages/requests/models.py", line 1021, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/api/models/models/llama-2-7b-chat.ggmlv3.q8_0.bin/revision/main

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/workspace/test/Llama-2-Open-Source-LLM-CPU-Inference/main.py", line 26, in
dbqa = setup_dbqa()
File "/workspace/test/Llama-2-Open-Source-LLM-CPU-Inference/src/utils.py", line 44, in setup_dbqa
llm = build_llm()
File "/workspace/test/Llama-2-Open-Source-LLM-CPU-Inference/src/llm.py", line 21, in build_llm
llm = CTransformers(model=cfg.MODEL_BIN_PATH,
File "/usr/local/lib/python3.10/dist-packages/langchain/load/serializable.py", line 74, in init
super().init(**kwargs)
File "pydantic/main.py", line 339, in pydantic.main.BaseModel.init
File "pydantic/main.py", line 1102, in pydantic.main.validate_model
File "/usr/local/lib/python3.10/dist-packages/langchain/llms/ctransformers.py", line 70, in validate_environment
values["client"] = AutoModelForCausalLM.from_pretrained(
File "/usr/local/lib/python3.10/dist-packages/ctransformers/hub.py", line 130, in from_pretrained
config = config or AutoConfig.from_pretrained(
File "/usr/local/lib/python3.10/dist-packages/ctransformers/hub.py", line 47, in from_pretrained
cls._update_from_repo(
File "/usr/local/lib/python3.10/dist-packages/ctransformers/hub.py", line 69, in _update_from_repo
path = snapshot_download(
File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/_snapshot_download.py", line 186, in snapshot_download
repo_info = api.repo_info(repo_id=repo_id, repo_type=repo_type, revision=revision, token=token)
File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/hf_api.py", line 1868, in repo_info
return method(
File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/hf_api.py", line 1678, in model_info
hf_raise_for_status(r)
File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_errors.py", line 293, in hf_raise_for_status
raise RepositoryNotFoundError(message, response) from e
huggingface_hub.utils._errors.RepositoryNotFoundError: 401 Client Error. (Request ID: Root=1-64b9e8a9-40a275255e2a52783534bb88;5565df67-c3e7-4849-ad18-f760c8a653fa)

Repository Not Found for url: https://huggingface.co/api/models/models/llama-2-7b-chat.ggmlv3.q8_0.bin/revision/main.
Please make sure you specified the correct repo_id and repo_type.
If you are trying to access a private or gated repo, make sure you are authenticated.
Invalid username or password.
'''
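
The traceback suggests the cause: the local file models/llama-2-7b-chat.ggmlv3.q8_0.bin was not found, so ctransformers fell back to treating the path as a Hugging Face repo id (note the .../api/models/models/... URL), which then failed with a 401. A simple guard (a sketch; the path matches the Quickstart download step):

    # If the GGML binary is missing locally, ctransformers tries to resolve
    # the path as a Hugging Face repo id, producing the 401 /
    # RepositoryNotFoundError above. Verify the file before building the LLM.
    from pathlib import Path

    model_path = Path("models/llama-2-7b-chat.ggmlv3.q8_0.bin")
    if not model_path.exists():
        raise FileNotFoundError(
            f"{model_path} not found; download it from "
            "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML into models/"
        )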

Is there a way to make the system answer in a specific language?

First of all, my compliments on the project. I understood a lot, and I succeeded in creating my own private LLM instance. Thanks a lot.
I loaded many Italian-language PDF files and the search seems to work, but it randomly answers in English or Italian. Do you know a way to force the output language?
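
One option (a sketch, not this repo's actual code): pin the answer language in the QA prompt template, which in this project lives in src/prompts.py. The wording below is illustrative:

    # Force the output language through the prompt template. The variable
    # name and wording are illustrative, not the repo's actual template.
    qa_template = """Use the following pieces of context to answer the question.
    Context: {context}
    Question: {question}
    Answer in Italian, regardless of the language of the context:
    """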

Any ideas what "Illegal Instruction" indicates?

  • Using llama-2-7b-chat.ggmlv3.q8_0.bin downloaded today
  • I've set up everything running on a Dell PowerEdge 16-processor machine with 128GB of RAM.
  • Pointed it at a directory that only had 4 PDFs in it.
  • Every question, using your standard poetry run python main.py "question text?", returns just one line:

"Illegal instruction"

Some questions take longer to return the same.

Any clues?
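
One common culprit (an assumption, not confirmed here): prebuilt GGML/ctransformers wheels are compiled with AVX/AVX2 instructions, and a CPU that lacks them dies with exactly "Illegal instruction". A quick check using the third-party py-cpuinfo package:

    # See whether the CPU reports AVX/AVX2, which prebuilt GGML wheels
    # commonly require ("Illegal instruction" when they are absent).
    import cpuinfo  # pip install py-cpuinfo

    flags = cpuinfo.get_cpu_info().get("flags", [])
    print("avx:", "avx" in flags, "| avx2:", "avx2" in flags)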
