kennethleungty / llama-2-open-source-llm-cpu-inference

927 stars · 13 watchers · 205 forks · 4.63 MB

Running Llama 2 and other Open-Source LLMs on CPU Inference Locally for Document Q&A

Home Page: https://towardsdatascience.com/running-llama-2-on-cpu-inference-for-document-q-a-3d636037a3d8

License: MIT License

Language: Python (100.00%)
Topics: cpu, cpu-inference, deep-learning, faiss, langchain, large-language-models, llm, machine-learning, natural-language-processing, nlp

llama-2-open-source-llm-cpu-inference's Introduction

Running Llama 2 and other Open-Source LLMs on CPU Inference Locally for Document Q&A

Clearly explained guide for running quantized open-source LLM applications on CPUs using Llama 2, C Transformers, GGML, and LangChain

Step-by-step guide on TowardsDataScience: https://towardsdatascience.com/running-llama-2-on-cpu-inference-for-document-q-a-3d636037a3d8


Context

  • Third-party commercial large language model (LLM) providers like OpenAI (GPT-4) have democratized LLM use via simple API calls.
  • However, there are instances where teams require self-managed or private model deployment, for reasons such as data privacy and residency rules.
  • The proliferation of open-source LLMs has opened up a vast range of options, reducing our reliance on these third-party providers.
  • When we host open-source LLMs locally on-premise or in the cloud, dedicated compute capacity becomes a key consideration. While GPU instances may seem the obvious choice, costs can easily skyrocket beyond budget.
  • In this project, we discover how to run quantized versions of open-source LLMs locally on CPU for document question-and-answer (Q&A).


Quickstart

  • Ensure you have downloaded the GGML binary file from https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML and placed it into the models/ folder
  • To start parsing user queries into the application, launch the terminal from the project directory and run the following command: poetry run python main.py "<user query>"
  • For example, poetry run python main.py "What is the minimum guarantee payable by Adidas?"
  • Note: Omit the prepended poetry run if you are NOT using Poetry


Tools

  • LangChain: Framework for developing applications powered by language models. (See the sketch after this list for how these tools fit together.)
  • C Transformers: Python bindings for Transformer models implemented in C/C++ using the GGML library.
  • FAISS: Open-source library for efficient similarity search and clustering of dense vectors.
  • Sentence-Transformers (all-MiniLM-L6-v2): Open-source pre-trained transformer model for embedding text into a 384-dimensional dense vector space, for tasks like clustering or semantic search.
  • Llama-2-7B-Chat: Open-source fine-tuned Llama 2 model designed for chat dialogue. Leverages publicly available instruction datasets and over 1 million human annotations.
  • Poetry: Tool for dependency management and Python packaging.
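
A minimal sketch of how these tools fit together, loosely mirroring src/llm.py and src/utils.py (the paths and parameter values below are illustrative assumptions, not the repo's exact code):

    # Quantized Llama 2 on CPU via C Transformers, wired into a LangChain
    # retrieval QA chain over a FAISS store built with all-MiniLM-L6-v2
    # embeddings. Paths and parameters are illustrative assumptions.
    from langchain.llms import CTransformers
    from langchain.embeddings import HuggingFaceEmbeddings
    from langchain.vectorstores import FAISS
    from langchain.chains import RetrievalQA

    llm = CTransformers(
        model="models/llama-2-7b-chat.ggmlv3.q8_0.bin",
        model_type="llama",
        config={"max_new_tokens": 256, "temperature": 0.01},
    )
    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    )
    vectordb = FAISS.load_local("vectorstore/db_faiss", embeddings)
    dbqa = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=vectordb.as_retriever(search_kwargs={"k": 2}),
        return_source_documents=True,
    )
    print(dbqa({"query": "What is the minimum guarantee payable by Adidas?"}))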

Files and Content

  • /assets: Images relevant to the project
  • /config: Configuration files for the LLM application
  • /data: Dataset used for this project (i.e., Manchester United FC 2022 Annual Report, a 177-page PDF document)
  • /models: Binary file of the GGML-quantized LLM (i.e., Llama-2-7B-Chat)
  • /src: Python code for the key components of the LLM application, namely llm.py, utils.py, and prompts.py
  • /vectorstore: FAISS vector store for documents
  • db_build.py: Python script to ingest the dataset and generate the FAISS vector store (see the sketch after this list)
  • main.py: Main Python script to launch the application and pass the user query via the command line
  • pyproject.toml: TOML file specifying the versions of the dependencies used (Poetry)
  • requirements.txt: List of Python dependencies (and versions)
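
As a rough illustration of the ingestion step in db_build.py (the loader choice and chunking parameters here are assumptions, not necessarily the script's exact values):

    # Load the annual-report PDF, split it into overlapping chunks, embed the
    # chunks with all-MiniLM-L6-v2, and persist a FAISS index to vectorstore/.
    # Requires the pypdf package for PyPDFLoader.
    from langchain.document_loaders import PyPDFLoader
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain.embeddings import HuggingFaceEmbeddings
    from langchain.vectorstores import FAISS

    docs = PyPDFLoader("data/manu-20f-2022-09-24.pdf").load()
    chunks = RecursiveCharacterTextSplitter(
        chunk_size=500, chunk_overlap=50
    ).split_documents(docs)
    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    )
    FAISS.from_documents(chunks, embeddings).save_local("vectorstore/db_faiss")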


llama-2-open-source-llm-cpu-inference's People

Contributors

eltociear, kennethleungty, msfasha, seyedsaeidmasoumzadeh


llama-2-open-source-llm-cpu-inference's Issues

Support for 70b by updating ctransformers

You can use the 70b parameter model now as well, here is how I accomplished it:

  1. Downloaded the 70b parameter model I wanted from https://huggingface.co/TheBloke/Llama-2-70B-Chat-GGML/tree/main. In my case, I chose 'llama-2-70b-chat.ggmlv3.q5_K_M.bin'. None of my runs so far have used much more than 6-8GB of RAM. You need to modify the 'config/config.yml' to point to your newly downloaded model.

  2. Updated the CTransformers package to the latest version which adds support for 70b (ctransformers-0.2.15 or higher):
    poetry run pip install ctransformers --upgrade

  3. I also updated langchain (and I had done this first but I'm not sure it's required):
    poetry run pip install langchain --upgrade

Now it runs! Much slower, though (<1 minute became almost 10 minutes).
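
For reference, a quick smoke test that the upgraded build can load the 70B file directly (a sketch; the path assumes the download from step 1):

    # Load the 70B GGML file with ctransformers 0.2.15+ and generate a few
    # tokens. The file path assumes the q5_K_M download from step 1 above.
    from ctransformers import AutoModelForCausalLM

    llm = AutoModelForCausalLM.from_pretrained(
        "models/llama-2-70b-chat.ggmlv3.q5_K_M.bin",
        model_type="llama",
    )
    print(llm("Hello,", max_new_tokens=8))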

how to change data files?

The first time, I ran the main program successfully. Then I tried changing the data file to my own PDF and asking related questions.

But the results always look like they are based on the original document.

results:

D:\Llama-2-Open-Source-LLM-CPU-Inference>python main.py "how many years of experience in banking technology practitioners"

Answer: Mr. Woodward has 16 years of experience in banking and finance, including 9 years as a senior investment banker within J.P . Morgan’s international mergers and acquisitions team and 7 years in various senior finance roles at Ladbrokes.

Source Document 1

Source Text: worked as a senior investment banker within J.P . Morgan’s international mergers and acquisitions teambetween 1999 and 2005. Prior to joining J.P . Morgan, Mr. Woodward worked for PricewaterhouseCoopersLLP in the Accounting and Tax advisory department between 1993 and 1999. He received a Bachelor ofScience degree in physics from Bristol University in 1993 and qualified for his Chartered Accountancyin 1996.
Document Name: data\manu-20f-2022-09-24.pdf
Page Number: 87

In fact, the file is data/resume_cn1_en.pdf.

How can I handle this issue?
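
A plausible cause (an assumption, not confirmed in this thread): the FAISS index under vectorstore/ is generated once by db_build.py, so swapping files in data/ has no effect until the store is rebuilt. A minimal sketch of the rebuild:

    # Remove the stale index, then regenerate it from the new PDF by rerunning
    # the ingestion script. The index path is an assumption based on the
    # /vectorstore folder described in the README above.
    import shutil
    import subprocess

    shutil.rmtree("vectorstore/db_faiss", ignore_errors=True)
    subprocess.run(["python", "db_build.py"], check=True)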

Aborted (core dumped) when executing dbqa()

Hi,
Thanks for the great article. I've a question here.

When executing dbqa(), it returned:
free(): invalid next size (normal)
Aborted (core dumped)

Any idea?

==
Here are the steps to reproduce the problem.

  1. Modify the embedding model to "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"
  2. Run python db_build.py to build the vector store
  3. Run python main.py "MY QUERY STRING IS HERE"

license

Could you please add a LICENSE file to this repo? Are you releasing the code under Apache 2 license, for example?

[Feature Request] Support InternLM Deploy

Dear Llama-2-Open-Source-LLM-CPU-Inference developer,

Greetings! I am vansinhu, a community developer and volunteer at InternLM. InternLM is a large language model similar to llama2, and we look forward to InternLM being supported in Llama-2-Open-Source-LLM-CPU-Inference. If there are any challenges or inquiries regarding support for InternLM, please feel free to join our Discord discussion at https://discord.gg/gF9ezcmtM3.

Best regards,
vansinhu

config customization

Hi!
Thanks - awesome job.

I have a question: why does changing the config (bigger chunks, higher vector counts) lead to broken output? For example:
VECTOR_COUNT: 3
CHUNK_SIZE: 600
CHUNK_OVERLAP: 50
gives me illogical output.
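
One plausible explanation (an assumption, not confirmed here): with three 600-character chunks stuffed into the prompt, the total can exceed the model's context window, and generation degrades. C Transformers exposes the window size through its config, so a sketch of a possible fix:

    # Enlarge the context window so VECTOR_COUNT x CHUNK_SIZE plus the prompt
    # still fits. The model path and context_length value are illustrative.
    from langchain.llms import CTransformers

    llm = CTransformers(
        model="models/llama-2-7b-chat.ggmlv3.q8_0.bin",
        model_type="llama",
        config={"max_new_tokens": 256, "context_length": 2048},
    )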

git lfs error

"You seem to have cloned a repository without having git-lfs installed. Please install git-lfs and run git lfs install followed by git lfs pull in the folder you cloned." Please tell me how to rectify this.

"error loading model: unrecognized tensor type 11"

I was intrigued by your Medium article, so I downloaded and ran this repo, and the following error appears:
"error loading model: unrecognized tensor type 11"
Please help.
Also, there is a code snippet that loads an environment variable; where and how do I declare it?
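
For what it's worth, this error typically means the GGML file uses a quantization format (e.g., the newer k-quant variants) that the installed library predates; upgrading ctransformers is a plausible fix, as in the 70B issue above (an assumption, not confirmed in this thread). To check the installed version:

    # Print the installed ctransformers version; the 70B issue above needed
    # 0.2.15 or higher for newer GGML tensor types.
    from importlib.metadata import version

    print(version("ctransformers"))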

A Question.

Great project here, I am wondering:

What is a technique one could use to add conversation history to this method of QA Chain? I have not found any subject material with respect to calling the LLM from chain type.
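
One approach (a sketch, not part of this repo): swap the plain retrieval QA chain for LangChain's ConversationalRetrievalChain, which threads chat history through the same LLM and retriever. llm and vectordb below are assumed to be built as in src/:

    # Add conversation memory via ConversationalRetrievalChain. Assumes llm
    # and vectordb are constructed as elsewhere in this project.
    from langchain.chains import ConversationalRetrievalChain
    from langchain.memory import ConversationBufferMemory

    memory = ConversationBufferMemory(
        memory_key="chat_history", return_messages=True
    )
    chat_qa = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=vectordb.as_retriever(search_kwargs={"k": 2}),
        memory=memory,
    )
    print(chat_qa({"question": "What is the minimum guarantee payable by Adidas?"}))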

ModuleNotFoundError: No module named 'langchain' occurs

Even though I installed langchain, the error still occurs:

File "D:\ML\test\Llama-2-Open-Source-LLM-CPU-Inference\main.py", line 3, in <module>
from src.utils import setup_dbqa
File "D:\ML\test\Llama-2-Open-Source-LLM-CPU-Inference\src\utils.py", line 7, in <module>
from langchain import PromptTemplate
ModuleNotFoundError: No module named 'langchain'

I ran pip install -r requirements.txt and installed everything.

What should I do?


401 Client Error in loading Llama2-7b model

Hi,
I tried to run main.py and got the following error messages.
It looks like it failed to load the Llama-2-7b model.
Any help will be appreciated.

'''
$ python main.py 'How much is the minimum guarantee payable by adidas?'

==
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_errors.py", line 261, in hf_raise_for_status
response.raise_for_status()
File "/usr/local/lib/python3.10/dist-packages/requests/models.py", line 1021, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/api/models/models/llama-2-7b-chat.ggmlv3.q8_0.bin/revision/main

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/workspace/test/Llama-2-Open-Source-LLM-CPU-Inference/main.py", line 26, in
dbqa = setup_dbqa()
File "/workspace/test/Llama-2-Open-Source-LLM-CPU-Inference/src/utils.py", line 44, in setup_dbqa
llm = build_llm()
File "/workspace/test/Llama-2-Open-Source-LLM-CPU-Inference/src/llm.py", line 21, in build_llm
llm = CTransformers(model=cfg.MODEL_BIN_PATH,
File "/usr/local/lib/python3.10/dist-packages/langchain/load/serializable.py", line 74, in init
super().init(**kwargs)
File "pydantic/main.py", line 339, in pydantic.main.BaseModel.init
File "pydantic/main.py", line 1102, in pydantic.main.validate_model
File "/usr/local/lib/python3.10/dist-packages/langchain/llms/ctransformers.py", line 70, in validate_environment
values["client"] = AutoModelForCausalLM.from_pretrained(
File "/usr/local/lib/python3.10/dist-packages/ctransformers/hub.py", line 130, in from_pretrained
config = config or AutoConfig.from_pretrained(
File "/usr/local/lib/python3.10/dist-packages/ctransformers/hub.py", line 47, in from_pretrained
cls._update_from_repo(
File "/usr/local/lib/python3.10/dist-packages/ctransformers/hub.py", line 69, in _update_from_repo
path = snapshot_download(
File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/_snapshot_download.py", line 186, in snapshot_download
repo_info = api.repo_info(repo_id=repo_id, repo_type=repo_type, revision=revision, token=token)
File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/hf_api.py", line 1868, in repo_info
return method(
File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/hf_api.py", line 1678, in model_info
hf_raise_for_status(r)
File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_errors.py", line 293, in hf_raise_for_status
raise RepositoryNotFoundError(message, response) from e
huggingface_hub.utils._errors.RepositoryNotFoundError: 401 Client Error. (Request ID: Root=1-64b9e8a9-40a275255e2a52783534bb88;5565df67-c3e7-4849-ad18-f760c8a653fa)

Repository Not Found for url: https://huggingface.co/api/models/models/llama-2-7b-chat.ggmlv3.q8_0.bin/revision/main.
Please make sure you specified the correct repo_id and repo_type.
If you are trying to access a private or gated repo, make sure you are authenticated.
Invalid username or password.
'''
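
The traceback suggests the cause: the local file models/llama-2-7b-chat.ggmlv3.q8_0.bin was not found, so ctransformers fell back to treating the path as a Hugging Face repo id (note the .../api/models/models/... URL), which then failed with a 401. A simple guard (a sketch; the path matches the Quickstart download step):

    # If the GGML binary is missing locally, ctransformers tries to resolve
    # the path as a Hugging Face repo id, producing the 401 /
    # RepositoryNotFoundError above. Verify the file before building the LLM.
    from pathlib import Path

    model_path = Path("models/llama-2-7b-chat.ggmlv3.q8_0.bin")
    if not model_path.exists():
        raise FileNotFoundError(
            f"{model_path} not found; download it from "
            "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML into models/"
        )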

Is there a way to make the system answer in a specific language?

First of all, my compliments on the project. I understood a lot, and I succeeded in creating my own private LLM instance. Thanks a lot.
I loaded many Italian-language PDF files and the search seems to work, but it randomly answers in English or Italian. Do you know a way to force the output language?
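
One option (a sketch, not this repo's actual code): pin the answer language in the QA prompt template, which in this project lives in src/prompts.py. The wording below is illustrative:

    # Force the output language through the prompt template. The variable
    # name and wording are illustrative, not the repo's actual template.
    qa_template = """Use the following pieces of context to answer the question.
    Context: {context}
    Question: {question}
    Answer in Italian, regardless of the language of the context:
    """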

Any ideas what "Illegal Instruction" indicates?

  • Using llama-2-7b-chat.ggmlv3.q8_0.bin downloaded today
  • I've set up everything running on a Dell PowerEdge 16-processor machine with 128GB of RAM.
  • Pointed it at a directory that only had 4 PDFs in it.
  • Every question, using your standard poetry run python main.py "question text?", returns just one line:

"Illegal instruction"

Some questions take longer to return the same.

Any clues?
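
One common culprit (an assumption, not confirmed here): prebuilt GGML/ctransformers wheels are compiled with AVX/AVX2 instructions, and a CPU that lacks them dies with exactly "Illegal instruction". A quick check using the third-party py-cpuinfo package:

    # See whether the CPU reports AVX/AVX2, which prebuilt GGML wheels
    # commonly require ("Illegal instruction" when they are absent).
    import cpuinfo  # pip install py-cpuinfo

    flags = cpuinfo.get_cpu_info().get("flags", [])
    print("avx:", "avx" in flags, "| avx2:", "avx2" in flags)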
