
chatpdf's Introduction

ChatPDF

Retrieval-augmented question answering (RAG) over documents with a locally deployed LLM


Answer questions from your files / open-source models / locally deployed LLMs

Animation Demo

  • Supports multiple open-source LLMs, including ChatGLM3-6B, Chinese-LLaMA-Alpaca-2, Baichuan, and Yi
  • Supports multiple file formats, including PDF, docx, markdown, and txt
  • Optimizes RAG accuracy
    • Chinese chunking optimized, adapted to mixed Chinese/English documents
    • Embedding optimized: uses text2vec sentence embeddings; supports both sentence-embedding and literal (lexical) similarity matching
    • Retrieval optimized: introduces rank_BM25 with jieba tokenization to strengthen literal matching of query keywords; the candidate corpus set is selected by a weighted combination of literal similarity and sentence-embedding vector similarity (see the sketch after this list)
    • New reranker module reranks the literal + semantic candidate set, shrinking it and improving hit accuracy; set the rerank model with the rerank_model_name_or_path parameter
    • New context expansion for retrieved chunks; set the expansion window size with the num_expand_context_chunk parameter
    • RAG base model optimized: a 200K-context LLM fine-tuned for RAG can be used, and custom RAG models are supported via the generate_model_name_or_path parameter
  • A gradio-based RAG chat page with streaming conversations
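A minimal sketch of the weighted literal + semantic retrieval described above, assuming the jieba, rank_bm25, numpy, and text2vec packages; the weight alpha, the helper names, and the toy corpus are illustrative, not ChatPDF's actual API:

import jieba
import numpy as np
from rank_bm25 import BM25Okapi
from text2vec import SentenceModel

corpus = ["段落一 ...", "paragraph two ...", "段落三 ..."]
# Literal index: BM25 over jieba-tokenized chunks
bm25 = BM25Okapi([list(jieba.cut(doc)) for doc in corpus])
# Semantic index: text2vec sentence embeddings
encoder = SentenceModel("shibing624/text2vec-base-multilingual")
corpus_emb = encoder.encode(corpus)

def retrieve(query, alpha=0.5, top_k=3):
    # Literal scores from BM25 over the tokenized query
    bm25_scores = np.array(bm25.get_scores(list(jieba.cut(query))))
    # Semantic scores from cosine similarity of sentence embeddings
    q_emb = encoder.encode([query])[0]
    sem_scores = corpus_emb @ q_emb / (
        np.linalg.norm(corpus_emb, axis=1) * np.linalg.norm(q_emb) + 1e-8)
    # Min-max normalize each signal, then combine with a weighted sum
    norm = lambda x: (x - x.min()) / (x.max() - x.min() + 1e-8)
    scores = alpha * norm(bm25_scores) + (1 - alpha) * norm(sem_scores)
    return [corpus[i] for i in np.argsort(-scores)[:top_k]]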

How it works

Usage

Install dependencies

Type the following command in a terminal and press Enter.

pip install -r requirements.txt

If you are on Windows, installing on Linux via WSL is recommended. If you do not have CUDA installed and do not want to run the model on CPU only, install CUDA first.

If downloads are slow, configure a PyPI mirror such as the Douban mirror.
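For example (the mirror URL here is an assumption; substitute any PyPI mirror you trust):

pip install -r requirements.txt -i https://pypi.doubanio.com/simple/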

Run locally

Use the command below. Depending on your system, you may need the python or python3 command. Make sure Python is installed.

CUDA_VISIBLE_DEVICES=0 python chatpdf.py --gen_model_type auto --gen_model_name 01-ai/Yi-6B-Chat --corpus_files sample.pdf
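Judging from the argument list the script prints at startup (see the issues below), it also accepts optional flags such as --rerank_model_name, --device, --int4/--int8, --chunk_size, and --num_expand_context_chunk. For example, to enable the reranker:

CUDA_VISIBLE_DEVICES=0 python chatpdf.py --gen_model_type auto --gen_model_name 01-ai/Yi-6B-Chat --corpus_files sample.pdf --rerank_model_name maidalun1020/bce-reranker-base_v1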

Start the web service

CUDA_VISIBLE_DEVICES=0 python webui.py --gen_model_type auto --gen_model_name 01-ai/Yi-6B-Chat --corpus_files sample.pdf --share

If everything went well, you can now open http://localhost:7860 in your browser to view and use ChatPDF.

Contact

  • Issues (suggestions): GitHub issues
  • Email me: xuming: [email protected]
  • WeChat: add me at xuming624 with the note "Name-Company-NLP" to join the NLP discussion group.

License

Licensed under The Apache License 2.0, free for commercial use. Please include a link to ChatPDF and the license agreement in your product documentation.

Contribute

The project code is still rough; if you improve it, contributions back to this project are welcome.


Related projects

  • shibing624/MedicalGPT: train your own GPT-style large model, implementing incremental pre-training, supervised fine-tuning, RLHF (reward modeling and reinforcement learning training), and DPO (direct preference optimization).

chatpdf's People

Contributors

shibing624, zhongpei


chatpdf's Issues

module 'gradio' has no attribute 'LikeData'

Traceback (most recent call last):
  File "/Users/hao/pythonProject/ChatPDF/webui.py", line 68, in <module>
    def vote(data: gr.LikeData):
AttributeError: module 'gradio' has no attribute 'LikeData'

Chat fails to complete after the files and model are loaded

It hangs here and stops:

Generating outputs: 0%| | 0/1 [00:00<?, ?it/s]C:\Users\liufe\anaconda3\envs\chatglm\lib\site-packages\transformers\tokenization_utils_base.py:717: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ..\torch\csrc\utils\tensor_new.cpp:248.)
  tensor = as_tensor(value)

Hoping to get in touch

Dear ChatPDF developer, I am 尖米, a developer and volunteer in the InternLM community. Your open-source work has been very inspiring to me, and I would like to discuss the feasibility and implementation path of building ChatPDF with InternLM. My WeChat ID is mzm312; I hope we can get in touch for a deeper exchange.

Changing parameters

Undergrad here: how do I change the parameters? For example, mine keeps showing that it is running on the CPU.

Command in the README is wrong

CUDA_VISIBLE_DEVICES=0 python chatpdf.py --gen_model_type auto --gen_model_model 01-ai/Yi-6B-Chat --corpus_files sample.pdf --rerank_model_name maidalun1020/bce-reranker-base_v1

Here gen_model_model is wrong; it should be gen_model_name.

Error after inferring 3 items; rerunning succeeds again

Traceback (most recent call last):
  File "/home/ubuntu/chatpdf.py", line 212, in <module>
    response1 = m.query('Please indicate the empirical research data provided in the literature on vaccine '
  File "/home/ubuntu/chatpdf.py", line 163, in query
    reference_results.append(self.sim_model.corpus[corpus_id])
KeyError: 184

context_len controls both the string length before tokenization and the token length after tokenization

As the title says, the context_len parameter controls both the token length in stream_generate_answer:

@torch.inference_mode()
def stream_generate_answer(
        self,
        max_new_tokens=512,
        temperature=0.7,
        repetition_penalty=1.0,
        context_len=8192
):
    streamer = TextIteratorStreamer(self.tokenizer, timeout=60.0, skip_prompt=True, skip_special_tokens=True)
    input_ids = self._get_chat_input()
    max_src_len = context_len - max_new_tokens - 8

and the string length in predict_stream:

def predict_stream(
        self,
        query: str,
        max_length: int = 512,
        context_len: int = 8192,
        temperature: float = 0.7,
):
    """Generate predictions stream."""
    stop_str = self.tokenizer.eos_token if self.tokenizer.eos_token else "</s>"
    if not self.enable_history:
        self.history = []
    if self.sim_model.corpus:
        reference_results = self.get_reference_results(query)
        if not reference_results:
            yield '没有提供足够的相关信息', reference_results
        reference_results = self._add_source_numbers(reference_results)
        context_str = '\n'.join(reference_results)[:(context_len - len(PROMPT_TEMPLATE))]
        prompt = PROMPT_TEMPLATE.format(context_str=context_str, query_str=query)
        logger.debug(f"prompt: {prompt}")

As a result, the knowledge injected into the prompt never matches the model's actual maximum context length.

If it helps, I can open a pull request.
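A minimal sketch of one possible fix, budgeting the retrieved context in tokens instead of characters so truncation matches the model's real context window; the helper below is illustrative, not ChatPDF's actual API:

def truncate_context_by_tokens(tokenizer, reference_results, prompt_template,
                               query, context_len=8192, max_new_tokens=512):
    # Tokens already committed to the template, the query, and generation.
    reserved = (len(tokenizer.encode(prompt_template))
                + len(tokenizer.encode(query))
                + max_new_tokens + 8)
    budget = max(context_len - reserved, 0)
    # Truncate the joined references at a token boundary, not a char boundary.
    context_tokens = tokenizer.encode('\n'.join(reference_results))
    return tokenizer.decode(context_tokens[:budget])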

TypeError: BFloat16 is not supported on MPS

(.venv) lizhong@lizhongdeMac-mini ChatPDF % PYTORCH_ENABLE_MPS_FALLBACK=1 CUDA_VISIBLE_DEVICES=0 python chatpdf.py --gen_model_type auto --gen_model_name 01-ai/Yi-6B-Chat --corpus_files sample.pdf
Namespace(sim_model_name='shibing624/text2vec-base-multilingual', gen_model_type='auto', gen_model_name='01-ai/Yi-6B-Chat', lora_model=None, rerank_model_name='', corpus_files='sample.pdf', device=None, int4=False, int8=False, chunk_size=220, chunk_overlap=0, num_expand_context_chunk=1)
2024-03-14 11:11:22.449 | DEBUG | text2vec.sentence_model:__init__:80 - Use device: cpu
Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/Volumes/LZ_Storage/workspace/ai/ChatPDF/chatpdf.py", line 528, in <module>
    m = ChatPDF(
  File "/Volumes/LZ_Storage/workspace/ai/ChatPDF/chatpdf.py", line 179, in __init__
    self.gen_model, self.tokenizer = self._init_gen_model(
  File "/Volumes/LZ_Storage/workspace/ai/ChatPDF/chatpdf.py", line 221, in _init_gen_model
    model = model_class.from_pretrained(
  File "/Volumes/LZ_Storage/workspace/ai/ChatPDF/.venv/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 561, in from_pretrained
    return model_class.from_pretrained(
  File "/Volumes/LZ_Storage/workspace/ai/ChatPDF/.venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3502, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/Volumes/LZ_Storage/workspace/ai/ChatPDF/.venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3926, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/Volumes/LZ_Storage/workspace/ai/ChatPDF/.venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 805, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/Volumes/LZ_Storage/workspace/ai/ChatPDF/.venv/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 387, in set_module_tensor_to_device
    new_value = value.to(device)
TypeError: BFloat16 is not supported on MPS

Device: Mac mini M2
Python version: 3.10

Missing a file: similarities

Hello!
I get an error when running: chatpdf.py references a file that is missing. Where can I find this file?
from similarities import Similarity
from textgen import ChatGlmModel, LlamaModel

About sparse retrieval

Hello! I would like to ask whether the BM25 retrieval here can handle larger-scale data (say, a few hundred documents), and whether any vector index is used?

Is there a minimum GPU memory requirement?

Running chatpdf.py raises:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 12.00 MiB (GPU 0; 5.79 GiB total capacity; 5.05 GiB already allocated; 13.88 MiB free; 5.11 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

GPU: 2060, 6 GB VRAM
LLM :chatglm-6b-int4
Embedding : text2vec-base

Tried changing batch_size, to no avail.

M1 pro support?

I'm running on M1 pro and the command python chatpdf.py returned the following error.
AssertionError: Torch not compiled with CUDA enabled

I tried to just let bf16_is_supported() return false, but then this came up:
RuntimeError: Unknown platform: darwin

I've seen this error mentioned elsewhere by Mac users trying to work with chatglm. I think it has something to do with macOS not supporting the cpm kernel; their solution was to run the model locally, but I assume I'm already doing that, since I entered the path instead of the model name?

A question about RAG-based retrieval

Does your approach run RAG retrieval for every question, or is there an adaptive condition so that questions unknown to the model trigger retrieval while simple ones do not? How exactly do you do it? Also, how efficient is your RAG retrieval, and what accuracy can it reach?

Model fails to load: "No compiled kernel found"

I'm using WSL; loading the model always fails and then the program stops.

/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (1.26.16) or chardet (3.0.4) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
2023-05-28 21:02:00.484 | INFO | __main__:<module>:16 - CONTENT_DIR: /mnt/c/Users/MEIP-users/Desktop/ChatPDF-main/ChatPDF-main/content
Running on local URL: http://0.0.0.0:7860

To create a public link, set share=True in launch().
2023-05-28 21:02:25.002 | DEBUG | text2vec.sentence_model:__init__:74 - Use device: cuda
2023-05-28 21:02:28.328 | DEBUG | textgen.chatglm.chatglm_model:__init__:94 - Device: cuda
No compiled kernel found.
Compiling kernels : /home/meip/.cache/huggingface/modules/transformers_modules/THUDM/chatglm-6b-int4/02a065cf2797029c036a02cac30f1da1a9bc49a3/quantization_kernels_parallel.c
Compiling gcc -O3 -fPIC -pthread -fopenmp -std=c99 /home/meip/.cache/huggingface/modules/transformers_modules/THUDM/chatglm-6b-int4/02a065cf2797029c036a02cac30f1da1a9bc49a3/quantization_kernels_parallel.c -shared -o /home/meip/.cache/huggingface/modules/transformers_modules/THUDM/chatglm-6b-int4/02a065cf2797029c036a02cac30f1da1a9bc49a3/quantization_kernels_parallel.so
Load kernel : /home/meip/.cache/huggingface/modules/transformers_modules/THUDM/chatglm-6b-int4/02a065cf2797029c036a02cac30f1da1a9bc49a3/quantization_kernels_parallel.so
Setting CPU quantization kernel threads to 4
Using quantization cache
Applying quantization to glm layers
Killed

Running chatpdf.py crashes my session

I had already downloaded everything the model run needs, but when I continued running, the current terminal window on the node suddenly could not log in and showed "Remote side unexpectedly closed network connection". That window could not log back in, but a newly opened window could.
Since I do not have CUDA installed, I am running the program on the CPU.
Looking forward to a reply.

Error when running webui.py

Hello, I get an error when running webui.py; could you help me see what is going on?
The error:
Traceback (most recent call last):
  File "/data/anaconda3/envs/laillm/lib/python3.10/site-packages/gradio/routes.py", line 437, in run_predict
    output = await app.get_blocks().process_api(
  File "/data/anaconda3/envs/laillm/lib/python3.10/site-packages/gradio/blocks.py", line 1352, in process_api
    result = await self.call_function(
  File "/data/anaconda3/envs/laillm/lib/python3.10/site-packages/gradio/blocks.py", line 1077, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/data/anaconda3/envs/laillm/lib/python3.10/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
  File "/data/anaconda3/envs/laillm/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 2106, in run_sync_in_worker_thread
    return await future
  File "/data/anaconda3/envs/laillm/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 833, in run
    result = context.run(func, *args)
  File "/data/lailai/ChatPDF/webui.py", line 211, in get_vector_store
    model.save_index(local_index_path)
  File "/data/lailai/ChatPDF/chatpdf.py", line 261, in save_index
    self.sim_model.save_index(index_path)
AttributeError: 'BertSimilarity' object has no attribute 'save_index'

Can PDFs be uploaded?

I would like personalized uploads: being able to upload any PDF, rather than specifying the PDF on the command line.
Thanks a lot!

Error when loading the model

LLM model: chatglm-6b-int4
Embedding model: sentence-transformers
Error message:
2023-06-29 15:10:36.834 | ERROR | __main__:reinit_model:145 - Only Tensors of floating point and complex dtype can require gradients

Changed both chatpdf and webui to llama-7b; it won't run: NameError: name 'atJatJ' is not defined

ubuntu@VM-0-2-ubuntu:~/ChatPDF$ python3 chatpdf.py
/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (1.26.16) or chardet (3.0.4) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
Traceback (most recent call last):
  File "chatpdf.py", line 1, in <module>
    atJatJ# -*- coding: utf-8 -*-
NameError: name 'atJatJ' is not defined

What are the Python and CUDA requirements for llama-7b?
