
chatpdf's Introduction

ChatPDF

Retrieval-augmented question answering (RAG) over documents with a locally deployed LLM


Answer questions from your files / open-source models / locally deployed LLMs

Animation Demo

  • Supports multiple open-source LLMs, including ChatGLM3-6B, Chinese-LLaMA-Alpaca-2, Baichuan, and Yi
  • Supports multiple file formats, including PDF, docx, markdown, and txt
  • Optimizes RAG accuracy
    • Chinese chunking optimized, adapted to mixed Chinese/English documents
    • Embedding optimized: uses text2vec sentence embeddings; supports both sentence-embedding and literal (lexical) similarity matching
    • Retrieval optimized: introduces rank_BM25 with jieba tokenization to strengthen literal matching of query keywords; the candidate corpus set is selected by a weighted combination of literal similarity and sentence-embedding vector similarity (see the sketch after this list)
    • New reranker module reranks the literal + semantic candidate set, shrinking it and improving hit accuracy; set the rerank model with the rerank_model_name_or_path parameter
    • New context expansion for retrieved chunks; set the expansion window size with the num_expand_context_chunk parameter
    • RAG base model optimized: a 200K-context LLM fine-tuned for RAG can be used, and custom RAG models are supported via the generate_model_name_or_path parameter
  • A gradio-based RAG chat page with streaming conversations
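A minimal sketch of the weighted literal + semantic retrieval described above, assuming the jieba, rank_bm25, numpy, and text2vec packages; the weight alpha, the helper names, and the toy corpus are illustrative, not ChatPDF's actual API:

import jieba
import numpy as np
from rank_bm25 import BM25Okapi
from text2vec import SentenceModel

corpus = ["段落一 ...", "paragraph two ...", "段落三 ..."]
# Literal index: BM25 over jieba-tokenized chunks
bm25 = BM25Okapi([list(jieba.cut(doc)) for doc in corpus])
# Semantic index: text2vec sentence embeddings
encoder = SentenceModel("shibing624/text2vec-base-multilingual")
corpus_emb = encoder.encode(corpus)

def retrieve(query, alpha=0.5, top_k=3):
    # Literal scores from BM25 over the tokenized query
    bm25_scores = np.array(bm25.get_scores(list(jieba.cut(query))))
    # Semantic scores from cosine similarity of sentence embeddings
    q_emb = encoder.encode([query])[0]
    sem_scores = corpus_emb @ q_emb / (
        np.linalg.norm(corpus_emb, axis=1) * np.linalg.norm(q_emb) + 1e-8)
    # Min-max normalize each signal, then combine with a weighted sum
    norm = lambda x: (x - x.min()) / (x.max() - x.min() + 1e-8)
    scores = alpha * norm(bm25_scores) + (1 - alpha) * norm(sem_scores)
    return [corpus[i] for i in np.argsort(-scores)[:top_k]]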

How it works

Usage

Install dependencies

Type the following command in a terminal and press Enter.

pip install -r requirements.txt

If you are on Windows, installing on Linux via WSL is recommended. If you do not have CUDA installed and do not want to run the model on CPU only, install CUDA first.

If downloads are slow, configure a PyPI mirror such as the Douban mirror.
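For example (the mirror URL here is an assumption; substitute any PyPI mirror you trust):

pip install -r requirements.txt -i https://pypi.doubanio.com/simple/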

Run locally

Use the command below. Depending on your system, you may need the python or python3 command. Make sure Python is installed.

CUDA_VISIBLE_DEVICES=0 python chatpdf.py --gen_model_type auto --gen_model_name 01-ai/Yi-6B-Chat --corpus_files sample.pdf
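Judging from the argument list the script prints at startup (see the issues below), it also accepts optional flags such as --rerank_model_name, --device, --int4/--int8, --chunk_size, and --num_expand_context_chunk. For example, to enable the reranker:

CUDA_VISIBLE_DEVICES=0 python chatpdf.py --gen_model_type auto --gen_model_name 01-ai/Yi-6B-Chat --corpus_files sample.pdf --rerank_model_name maidalun1020/bce-reranker-base_v1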

Start the web service

CUDA_VISIBLE_DEVICES=0 python webui.py --gen_model_type auto --gen_model_name 01-ai/Yi-6B-Chat --corpus_files sample.pdf --share

If everything went well, you can now open http://localhost:7860 in your browser to view and use ChatPDF.

Contact

  • Issues (suggestions): GitHub issues
  • Email me: xuming: [email protected]
  • WeChat: add me at xuming624 with the note "Name-Company-NLP" to join the NLP discussion group.

License

Licensed under The Apache License 2.0, free for commercial use. Please include a link to ChatPDF and the license agreement in your product documentation.

Contribute

The project code is still rough; if you improve it, contributions back to this project are welcome.


Related projects

  • shibing624/MedicalGPT: train your own GPT-style large model, implementing incremental pre-training, supervised fine-tuning, RLHF (reward modeling and reinforcement learning training), and DPO (direct preference optimization).

chatpdf's People

Contributors

shibing624, zhongpei


chatpdf's Issues

module 'gradio' has no attribute 'LikeData'

Traceback (most recent call last):
  File "/Users/hao/pythonProject/ChatPDF/webui.py", line 68, in <module>
    def vote(data: gr.LikeData):
AttributeError: module 'gradio' has no attribute 'LikeData'

Chat fails to complete after the files and model are loaded

It hangs here and stops:

Generating outputs: 0%| | 0/1 [00:00<?, ?it/s]C:\Users\liufe\anaconda3\envs\chatglm\lib\site-packages\transformers\tokenization_utils_base.py:717: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ..\torch\csrc\utils\tensor_new.cpp:248.)
  tensor = as_tensor(value)

Hoping to get in touch

Dear ChatPDF developer, I am 尖米, a developer and volunteer in the InternLM community. Your open-source work has been very inspiring to me, and I would like to discuss the feasibility and implementation path of building ChatPDF with InternLM. My WeChat ID is mzm312; I hope we can get in touch for a deeper exchange.

Changing parameters

Undergrad here: how do I change the parameters? For example, mine keeps showing that it is running on the CPU.

Command in the README is wrong

CUDA_VISIBLE_DEVICES=0 python chatpdf.py --gen_model_type auto --gen_model_model 01-ai/Yi-6B-Chat --corpus_files sample.pdf --rerank_model_name maidalun1020/bce-reranker-base_v1

Here gen_model_model is wrong; it should be gen_model_name.

Error after inferring 3 items; rerunning succeeds again

Traceback (most recent call last):
  File "/home/ubuntu/chatpdf.py", line 212, in <module>
    response1 = m.query('Please indicate the empirical research data provided in the literature on vaccine '
  File "/home/ubuntu/chatpdf.py", line 163, in query
    reference_results.append(self.sim_model.corpus[corpus_id])
KeyError: 184

context_len controls both the string length before tokenization and the token length after tokenization

As the title says, the context_len parameter controls both the token length in stream_generate_answer:

@torch.inference_mode()
def stream_generate_answer(
        self,
        max_new_tokens=512,
        temperature=0.7,
        repetition_penalty=1.0,
        context_len=8192
):
    streamer = TextIteratorStreamer(self.tokenizer, timeout=60.0, skip_prompt=True, skip_special_tokens=True)
    input_ids = self._get_chat_input()
    max_src_len = context_len - max_new_tokens - 8

and the string length in predict_stream:

def predict_stream(
        self,
        query: str,
        max_length: int = 512,
        context_len: int = 8192,
        temperature: float = 0.7,
):
    """Generate predictions stream."""
    stop_str = self.tokenizer.eos_token if self.tokenizer.eos_token else "</s>"
    if not self.enable_history:
        self.history = []
    if self.sim_model.corpus:
        reference_results = self.get_reference_results(query)
        if not reference_results:
            yield '没有提供足够的相关信息', reference_results
        reference_results = self._add_source_numbers(reference_results)
        context_str = '\n'.join(reference_results)[:(context_len - len(PROMPT_TEMPLATE))]
        prompt = PROMPT_TEMPLATE.format(context_str=context_str, query_str=query)
        logger.debug(f"prompt: {prompt}")

As a result, the knowledge injected into the prompt never matches the model's actual maximum context length.

If it helps, I can open a pull request.
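A minimal sketch of one possible fix, budgeting the retrieved context in tokens instead of characters so truncation matches the model's real context window; the helper below is illustrative, not ChatPDF's actual API:

def truncate_context_by_tokens(tokenizer, reference_results, prompt_template,
                               query, context_len=8192, max_new_tokens=512):
    # Tokens already committed to the template, the query, and generation.
    reserved = (len(tokenizer.encode(prompt_template))
                + len(tokenizer.encode(query))
                + max_new_tokens + 8)
    budget = max(context_len - reserved, 0)
    # Truncate the joined references at a token boundary, not a char boundary.
    context_tokens = tokenizer.encode('\n'.join(reference_results))
    return tokenizer.decode(context_tokens[:budget])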

TypeError: BFloat16 is not supported on MPS

(.venv) lizhong@lizhongdeMac-mini ChatPDF % PYTORCH_ENABLE_MPS_FALLBACK=1 CUDA_VISIBLE_DEVICES=0 python chatpdf.py --gen_model_type auto --gen_model_name 01-ai/Yi-6B-Chat --corpus_files sample.pdf
Namespace(sim_model_name='shibing624/text2vec-base-multilingual', gen_model_type='auto', gen_model_name='01-ai/Yi-6B-Chat', lora_model=None, rerank_model_name='', corpus_files='sample.pdf', device=None, int4=False, int8=False, chunk_size=220, chunk_overlap=0, num_expand_context_chunk=1)
2024-03-14 11:11:22.449 | DEBUG | text2vec.sentence_model:__init__:80 - Use device: cpu
Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/Volumes/LZ_Storage/workspace/ai/ChatPDF/chatpdf.py", line 528, in <module>
    m = ChatPDF(
  File "/Volumes/LZ_Storage/workspace/ai/ChatPDF/chatpdf.py", line 179, in __init__
    self.gen_model, self.tokenizer = self._init_gen_model(
  File "/Volumes/LZ_Storage/workspace/ai/ChatPDF/chatpdf.py", line 221, in _init_gen_model
    model = model_class.from_pretrained(
  File "/Volumes/LZ_Storage/workspace/ai/ChatPDF/.venv/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 561, in from_pretrained
    return model_class.from_pretrained(
  File "/Volumes/LZ_Storage/workspace/ai/ChatPDF/.venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3502, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/Volumes/LZ_Storage/workspace/ai/ChatPDF/.venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3926, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/Volumes/LZ_Storage/workspace/ai/ChatPDF/.venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 805, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/Volumes/LZ_Storage/workspace/ai/ChatPDF/.venv/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 387, in set_module_tensor_to_device
    new_value = value.to(device)
TypeError: BFloat16 is not supported on MPS

Device: Mac mini M2
Python version: 3.10

Missing a file: similarities

Hello!
I get an error when running: chatpdf.py references a file that is missing. Where can I find this file?
from similarities import Similarity
from textgen import ChatGlmModel, LlamaModel

About sparse retrieval

Hello! I would like to ask whether the BM25 retrieval here can handle larger-scale data (say, a few hundred documents), and whether any vector index is used?

Is there a minimum GPU memory requirement?

Running chatpdf.py raises:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 12.00 MiB (GPU 0; 5.79 GiB total capacity; 5.05 GiB already allocated; 13.88 MiB free; 5.11 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

GPU: 2060, 6 GB VRAM
LLM :chatglm-6b-int4
Embedding : text2vec-base

Tried changing batch_size, to no avail.

M1 pro support?

I'm running on M1 pro and the command python chatpdf.py returned the following error.
AssertionError: Torch not compiled with CUDA enabled

I tried to just let bf16_is_supported() return false, but then this came up:
RuntimeError: Unknown platform: darwin

I've seen this error mentioned elsewhere by Mac users trying to work with chatglm. I think it has something to do with macOS not supporting the cpm kernel; their solution was to run the model locally, but I assume I'm already doing that, since I entered the path instead of the model name?

A question about RAG-based retrieval

Does your approach run RAG retrieval for every question, or is there an adaptive condition so that questions unknown to the model trigger retrieval while simple ones do not? How exactly do you do it? Also, how efficient is your RAG retrieval, and what accuracy can it reach?

Model fails to load: "No compiled kernel found"

I'm using WSL; loading the model always fails and then the program stops.

/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (1.26.16) or chardet (3.0.4) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
2023-05-28 21:02:00.484 | INFO | __main__:<module>:16 - CONTENT_DIR: /mnt/c/Users/MEIP-users/Desktop/ChatPDF-main/ChatPDF-main/content
Running on local URL: http://0.0.0.0:7860

To create a public link, set share=True in launch().
2023-05-28 21:02:25.002 | DEBUG | text2vec.sentence_model:__init__:74 - Use device: cuda
2023-05-28 21:02:28.328 | DEBUG | textgen.chatglm.chatglm_model:__init__:94 - Device: cuda
No compiled kernel found.
Compiling kernels : /home/meip/.cache/huggingface/modules/transformers_modules/THUDM/chatglm-6b-int4/02a065cf2797029c036a02cac30f1da1a9bc49a3/quantization_kernels_parallel.c
Compiling gcc -O3 -fPIC -pthread -fopenmp -std=c99 /home/meip/.cache/huggingface/modules/transformers_modules/THUDM/chatglm-6b-int4/02a065cf2797029c036a02cac30f1da1a9bc49a3/quantization_kernels_parallel.c -shared -o /home/meip/.cache/huggingface/modules/transformers_modules/THUDM/chatglm-6b-int4/02a065cf2797029c036a02cac30f1da1a9bc49a3/quantization_kernels_parallel.so
Load kernel : /home/meip/.cache/huggingface/modules/transformers_modules/THUDM/chatglm-6b-int4/02a065cf2797029c036a02cac30f1da1a9bc49a3/quantization_kernels_parallel.so
Setting CPU quantization kernel threads to 4
Using quantization cache
Applying quantization to glm layers
Killed

Running chatpdf.py crashes my session

I had already downloaded everything the model run needs, but when I continued running, the current terminal window on the node suddenly could not log in and showed "Remote side unexpectedly closed network connection". That window could not log back in, but a newly opened window could.
Since I do not have CUDA installed, I am running the program on the CPU.
Looking forward to a reply.

Error when running webui.py

Hello, I get an error when running webui.py; could you help me see what is going on?
The error:
Traceback (most recent call last):
  File "/data/anaconda3/envs/laillm/lib/python3.10/site-packages/gradio/routes.py", line 437, in run_predict
    output = await app.get_blocks().process_api(
  File "/data/anaconda3/envs/laillm/lib/python3.10/site-packages/gradio/blocks.py", line 1352, in process_api
    result = await self.call_function(
  File "/data/anaconda3/envs/laillm/lib/python3.10/site-packages/gradio/blocks.py", line 1077, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/data/anaconda3/envs/laillm/lib/python3.10/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
  File "/data/anaconda3/envs/laillm/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 2106, in run_sync_in_worker_thread
    return await future
  File "/data/anaconda3/envs/laillm/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 833, in run
    result = context.run(func, *args)
  File "/data/lailai/ChatPDF/webui.py", line 211, in get_vector_store
    model.save_index(local_index_path)
  File "/data/lailai/ChatPDF/chatpdf.py", line 261, in save_index
    self.sim_model.save_index(index_path)
AttributeError: 'BertSimilarity' object has no attribute 'save_index'

Can PDFs be uploaded?

I would like personalized uploads: being able to upload any PDF, rather than specifying the PDF on the command line.
Thanks a lot!

Error when loading the model

LLM model: chatglm-6b-int4
Embedding model: sentence-transformers
Error message:
2023-06-29 15:10:36.834 | ERROR | __main__:reinit_model:145 - Only Tensors of floating point and complex dtype can require gradients

Changed both chatpdf and webui to llama-7b; it won't run: NameError: name 'atJatJ' is not defined

ubuntu@VM-0-2-ubuntu:~/ChatPDF$ python3 chatpdf.py
/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (1.26.16) or chardet (3.0.4) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
Traceback (most recent call last):
  File "chatpdf.py", line 1, in <module>
    atJatJ# -*- coding: utf-8 -*-
NameError: name 'atJatJ' is not defined

What are the Python and CUDA requirements for llama-7b?
