fudandisc / DISC-LawLLM
DISC-LawLLM, an intelligent legal system utilizing large language models (LLMs) to provide a wide range of legal services
License: Apache License 2.0
Hello, in the dataset you provide, DISC-Law-SFT-Triplet contains three parts: input, output, and reference. When fine-tuning with LLaMA Efficient Tuning, how is reference incorporated into training? Right now I pass it in as the system input; or should this part be concatenated directly into input?
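For context, one plausible way to incorporate the reference field, offered here only as an illustrative assumption and not as the authors' confirmed preprocessing, is to prepend the retrieved passages to the user input before applying the chat template:

```python
# Hedged sketch: prepend "reference" passages to the user input so the
# model conditions on them. The template wording is an assumption, not
# the one actually used by DISC-LawLLM.
def build_prompt(example: dict) -> str:
    refs = "\n".join(example.get("reference", []))
    if refs:
        return f"Reference:\n{refs}\n\nQuestion: {example['input']}"
    return example["input"]
```

With LLaMA Efficient Tuning this corresponds to merging the references into the instruction column rather than the system prompt, but only the authors can confirm which variant they trained with.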
Traceback (most recent call last):
File "/Users/yansir/Code/DISC-LawLLM/cli_demo.py", line 81, in <module>
main()
File "/Users/yansir/Code/DISC-LawLLM/cli_demo.py", line 38, in main
model, tokenizer = init_model()
^^^^^^^^^^^^
File "/Users/yansir/Code/DISC-LawLLM/cli_demo.py", line 17, in init_model
tokenizer = AutoTokenizer.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/anaconda3/envs/disc-lawllm/lib/python3.11/site-packages/transformers/models/auto/tokenization_auto.py", line 774, in from_pretrained
return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/anaconda3/envs/disc-lawllm/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2028, in from_pretrained
return cls._from_pretrained(
^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/anaconda3/envs/disc-lawllm/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2260, in _from_pretrained
tokenizer = cls(*init_inputs, **init_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/yansir/.cache/huggingface/modules/transformers_modules/models/tokenization_baichuan.py", line 55, in __init__
super().__init__(
File "/opt/homebrew/anaconda3/envs/disc-lawllm/lib/python3.11/site-packages/transformers/tokenization_utils.py", line 367, in __init__
self._add_tokens(
File "/opt/homebrew/anaconda3/envs/disc-lawllm/lib/python3.11/site-packages/transformers/tokenization_utils.py", line 467, in _add_tokens
current_vocab = self.get_vocab().copy()
^^^^^^^^^^^^^^^^
File "/Users/yansir/.cache/huggingface/modules/transformers_modules/models/tokenization_baichuan.py", line 89, in get_vocab
vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
^^^^^^^^^^^^^^^
File "/Users/yansir/.cache/huggingface/modules/transformers_modules/models/tokenization_baichuan.py", line 85, in vocab_size
return self.sp_model.get_piece_size()
^^^^^^^^^^^^^
AttributeError: 'BaichuanTokenizer' object has no attribute 'sp_model'
And here is pip list:
Package Version Editable project location
accelerate 0.25.0
altair 5.2.0
attrs 23.2.0
blinker 1.7.0
cachetools 5.3.2
certifi 2023.11.17
charset-normalizer 3.3.2
click 8.1.7
colorama 0.4.6
cpm-kernels 1.0.11
cvxopt 1.3.2
filelock 3.13.1
fsspec 2023.12.2
gguf 0.5.2 /Users/yansir/Code/PowerInfer/gguf-py
gitdb 4.0.11
GitPython 3.1.40
huggingface-hub 0.20.1
idna 3.6
importlib-metadata 6.11.0
Jinja2 3.1.2
jsonschema 4.20.0
jsonschema-specifications 2023.12.1
markdown-it-py 3.0.0
MarkupSafe 2.1.3
mdurl 0.1.2
mpmath 1.3.0
networkx 3.2.1
numpy 1.26.3
packaging 23.2
pandas 2.1.4
pillow 10.2.0
pip 23.3.1
powerinfer 0.0.1 /Users/yansir/Code/PowerInfer/powerinfer-py
protobuf 4.25.1
psutil 5.9.7
pyarrow 14.0.2
pydeck 0.8.1b0
Pygments 2.17.2
python-dateutil 2.8.2
pytz 2023.3.post1
PyYAML 6.0.1
referencing 0.32.0
regex 2023.12.25
requests 2.31.0
rich 13.7.0
rpds-py 0.16.2
safetensors 0.4.1
sentencepiece 0.1.99
setuptools 68.2.2
six 1.16.0
smmap 5.0.1
streamlit 1.29.0
sympy 1.12
tenacity 8.2.3
tokenizers 0.15.0
toml 0.10.2
toolz 0.12.0
torch 2.1.2
torchaudio 2.1.2
torchvision 0.16.2
tornado 6.4
tqdm 4.66.1
transformers 4.36.2
transformers-stream-generator 0.0.4
typing_extensions 4.9.0
tzdata 2023.4
tzlocal 5.2
urllib3 2.1.0
validators 0.22.0
wheel 0.41.2
zipp 3.17.0
Here is conda info:
active environment : disc-lawllm
active env location : /opt/homebrew/anaconda3/envs/disc-lawllm
shell level : 2
user config file : /Users/yansir/.condarc
populated config files :
conda version : 23.11.0
conda-build version : 3.28.1
python version : 3.11.5.final.0
solver : libmamba (default)
virtual packages : __archspec=1=m1
__conda=23.11.0=0
__osx=14.1.2=0
__unix=0=0
base environment : /opt/homebrew/anaconda3 (writable)
conda av data dir : /opt/homebrew/anaconda3/etc/conda
conda av metadata url : None
channel URLs : https://repo.anaconda.com/pkgs/main/osx-arm64
https://repo.anaconda.com/pkgs/main/noarch
https://repo.anaconda.com/pkgs/r/osx-arm64
https://repo.anaconda.com/pkgs/r/noarch
package cache : /opt/homebrew/anaconda3/pkgs
/Users/yansir/.conda/pkgs
envs directories : /opt/homebrew/anaconda3/envs
/Users/yansir/.conda/envs
platform : osx-arm64
user-agent : conda/23.11.0 requests/2.31.0 CPython/3.11.5 Darwin/23.1.0 OSX/14.1.2 solver/libmamba conda-libmamba-solver/23.11.1 libmambapy/1.5.3 aau/0.4.2 c/46E3x2d6f2VlX4gv7DYsuw s/sxEnXN6WjIEInKrIgszajQ e/-Xx3es9J-DigXAfp0lDX7A
UID:GID : 501:20
netrc file : None
offline mode : False
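The AttributeError above ('BaichuanTokenizer' object has no attribute 'sp_model') typically appears when a custom tokenizer calls super().__init__() before assigning self.sp_model, because newer transformers releases access the vocabulary during __init__ (the pip list shows transformers 4.36.2). A commonly reported workaround, offered as an assumption to verify rather than an official fix, is to pin an older transformers:

```shell
# Version pin is an assumption; check the repository's requirements.
pip install "transformers==4.33.2"
```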
Would you mind providing more information about the Knowledge Expansion method?
My question is: how do you ensure correctness when asking ChatGPT to give full explanations of the correct or wrong options?
In my experience, ChatGPT is currently not very good at knowledge of Chinese law and the related analysis. And it is often hard to tell when ChatGPT will give unreliable results, because it is very good at making up stories.
When will the evaluation framework be released?
I would like to know how the subjective questions are evaluated. Is there a prompt template we can refer to?
Will the legal Q&A portion of the SFT dataset be made public?
Hi, a question: on the evaluation branch, why do the results of evaluating DISC-LawLLM on the evaluation set you provide differ slightly from the results reported at the bottom of the repository?
Is it related to the decoding strategy? Also, is the external knowledge base used anywhere in the code? Thanks.
I read the technical report carefully and did not find the LoRA training details mentioned in the repository, especially the learning rate. Why is the full-parameter fine-tuning learning rate 5e-5, far higher than the 1e-5 used for LoRA training? I am curious what effect this brings, and I hope to get a reply.
"These candidate documents, along with the user input,
are formulated using our designed template and
then fed into the DISC-LawLLM"
Will you release the template used here and the concrete SFT training method?
I tried the product demo; it is surprisingly good.
But I see the Q&A dataset has not been released, so I would like to ask how it was constructed.
The technical report says it was built with the Behavior Shaping, Knowledge Expansion, and Thinking Development methods, but I do not understand how these three methods were concretely used to construct the Q&A dataset.
Has train_bash.py been open-sourced?
src/train_bash.py
Is the multiple-choice evaluation wrong? I see cases where the answer is A and the model also answers A, yet the score is 0.
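A score of 0 despite a matching letter usually points at overly strict string comparison. As a neutral illustration, not the benchmark's actual scorer, a lenient pass can extract standalone option letters from the reply and compare them as sets:

```python
import re

def extract_options(reply: str) -> set:
    """Collect standalone option letters A-D from a free-form reply."""
    return set(re.findall(r"\b[A-D]\b", reply.upper()))

def score(reply: str, gold: str) -> float:
    """1.0 when the predicted option set matches the gold answer exactly."""
    return 1.0 if extract_options(reply) == set(gold) else 0.0
```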
How is retrieval implemented concretely, and which embedding model is used?
Great work! However, I ran into some OpenAI errors while using it.
Which version of the openai package do you use?
{"time": "2024-01-26 11:25:02.948113", "index": 6, "iter": 1, "eval_scores": null, "norm_msg": "None", "err_msg": "Traceback (most recent call last):\n File \"/root/.pyenv/versions/3.10.0/lib/python3.10/site-packages/ml3m/base/eval.py\", line 378, in process_func\n scores = self._get_score(data_item, **kwargs)\n File \"/root/.pyenv/versions/3.10.0/lib/python3.10/site-packages/ml3m/base/eval.py\", line 866, in _get_score\n completion = openai.ChatCompletion.create(\n File \"/root/.pyenv/versions/3.10.0/lib/python3.10/site-packages/openai/lib/_old_api.py\", line 39, in __call__\n raise APIRemovedInV1(symbol=self._symbol)\nopenai.lib._old_api.APIRemovedInV1: \n\nYou tried to access openai.ChatCompletion, but this is no longer supported in openai>=1.0.0 - see the README at https://github.com/openai/openai-python for the API.\n\nYou can run `openai migrate` to automatically upgrade your codebase to use the 1.0.0 interface. \n\nAlternatively, you can pin your installation to the old version, e.g. `pip install openai==0.28`\n\nA detailed migration guide is available here: https://github.com/openai/openai-python/discussions/742\n\n"}
I do not see the src directory.
from datasets import load_dataset
dataset = load_dataset("ShengbinYue/DISC-Law-SFT")
----------------------------------------------------------------------------------
error:
Generating train split: 166758 examples [00:00, 184286.58 examples/s]
Traceback (most recent call last):
File "/home/ysx/miniconda3/lib/python3.11/site-packages/datasets/builder.py", line 1940, in _prepare_split_single
writer.write_table(table)
File "/home/ysx/miniconda3/lib/python3.11/site-packages/datasets/arrow_writer.py", line 572, in write_table
pa_table = table_cast(pa_table, self._schema)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ysx/miniconda3/lib/python3.11/site-packages/datasets/table.py", line 2328, in table_cast
return cast_table_to_schema(table, schema)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ysx/miniconda3/lib/python3.11/site-packages/datasets/table.py", line 2286, in cast_table_to_schema
raise ValueError(f"Couldn't cast\n{table.schema}\nto\n{features}\nbecause column names don't match")
ValueError: Couldn't cast
id: string
reference: list<item: string>
child 0, item: string
input: string
output: string
to
{'id': Value(dtype='string', id=None), 'input': Value(dtype='string', id=None), 'output': Value(dtype='string', id=None)}
because column names don't match
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ysx/miniconda3/lib/python3.11/site-packages/datasets/load.py", line 2153, in load_dataset
builder_instance.download_and_prepare(
File "/home/ysx/miniconda3/lib/python3.11/site-packages/datasets/builder.py", line 954, in download_and_prepare
self._download_and_prepare(
File "/home/ysx/miniconda3/lib/python3.11/site-packages/datasets/builder.py", line 1049, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/home/ysx/miniconda3/lib/python3.11/site-packages/datasets/builder.py", line 1813, in _prepare_split
for job_id, done, content in self._prepare_split_single(
File "/home/ysx/miniconda3/lib/python3.11/site-packages/datasets/builder.py", line 1958, in _prepare_split_single
raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset
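The cast error above arises because the repository mixes two schemas: Pair records carry id/input/output while Triplet records add a reference list, and load_dataset cannot unify them automatically. Two workarounds seem plausible (my assumptions, not an official loader): pass data_files to load only files that share one schema, or normalize every record to the union schema before building the dataset, as in this sketch:

```python
def normalize(record: dict) -> dict:
    """Pad a record to the union schema so all rows cast uniformly."""
    return {
        "id": str(record.get("id", "")),
        "input": record.get("input", ""),
        "output": record.get("output", ""),
        "reference": record.get("reference", []),  # [] for Pair records
    }

pair = {"id": 1, "input": "q", "output": "a"}
triplet = {"id": 2, "input": "q", "output": "a", "reference": ["statute"]}
rows = [normalize(r) for r in (pair, triplet)]
assert set(rows[0]) == set(rows[1])  # identical column sets now
```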
Environment: a single A6000 GPU.
When running LoRA fine-tuning with LLaMA Efficient Tuning, the following error is reported.
The script used is as follows (copied from the LoRA fine-tuning example; parameters not yet modified):
torchrun --nproc_per_node 1 src/train_bash.py \
--stage sft \
--model_name_or_path ShengbinYue/DISC-LawLLM \
--do_train \
--dataset alpaca_gpt4_zh \
--template baichuan \
--finetuning_type lora \
--lora_rank 8 \
--lora_target W_pack \
--output_dir path_to_your_sft_checkpoint \
--overwrite_cache \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 8 \
--preprocessing_num_workers 16 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--save_steps 100 \
--eval_steps 100 \
--learning_rate 1e-5 \
--max_grad_norm 0.5 \
--num_train_epochs 2.0 \
--evaluation_strategy steps \
--load_best_model_at_end \
--plot_loss \
--fp16
Did your team run into this problem at the time, and how did you solve it?
Isn't DISC-Law-Eval-Benchmark open-sourced?
Could you provide the evaluation datasets with answers?
How much GPU memory does this 13B model actually need to run? Is a 40 GB card enough?
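As a back-of-envelope estimate (a rule of thumb, not a measured figure): fp16 weights take roughly 2 bytes per parameter, so a 13B model's weights alone occupy about 24 GiB, leaving room on a 40 GB card for activations and the KV cache during inference:

```python
# Rough sizing only; real usage also depends on sequence length,
# batch size and framework overhead.
params = 13e9                  # 13B parameters
fp16_gib = params * 2 / 1024**3
assert 24 < fp16_gib < 25      # about 24.2 GiB of weights
```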
Where can I download the knowledge base?
Much of Hugging Face cannot be downloaded now; even with a proxy IP there are restrictions. Is there a netdisk or ModelScope link?
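One workaround often mentioned for restricted Hugging Face access (an assumption about your network, not an official distribution channel) is pointing huggingface_hub at a mirror endpoint:

```shell
# HF_ENDPOINT is honored by huggingface_hub; hf-mirror.com is a
# community-run mirror, so verify its availability yourself.
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download ShengbinYue/DISC-LawLLM --local-dir DISC-LawLLM
```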
__init__
self.post_init()
File "D:\transformers\src\transformers\modeling_utils.py", line 1160, in post_init
self._backward_compatibility_gradient_checkpointing()
File "D:\transformers\src\transformers\modeling_utils.py", line 1164, in _backward_compatibility_gradient_checkpointing
self.gradient_checkpointing_enable()
File "D:\transformers\src\transformers\modeling_utils.py", line 1873, in gradient_checkpointing_enable
self._set_gradient_checkpointing(enable=True, gradient_checkpointing_func=gradient_checkpointing_func)
TypeError: BaichuanPreTrainedModel._set_gradient_checkpointing() got an unexpected keyword argument 'enable'
Is this a transformers version problem? Which version is the matching one?
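This TypeError usually indicates a version mismatch: transformers changed the gradient-checkpointing hook around 4.35 so that _set_gradient_checkpointing receives enable and gradient_checkpointing_func keywords, which the cached Baichuan modeling code predates. A hedged workaround, again a version assumption to verify against the repo's requirements, is to downgrade:

```shell
# Pin below the gradient-checkpointing API change (assumed 4.35).
pip install "transformers<4.35"
```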
Regarding evaluation, could you describe the concrete evaluation procedure?
For example, for the single-choice questions in the objective set, how is the few-shot setting configured? Is it four example items concatenated before the current question?
And for the model's answer, does it count as correct if it contains only the option letter, or must it also include the option text?
For the subjective questions, could you share the GPT-3.5 evaluation prompt template?
Many thanks!
Really looking forward to this project continuing to release new models and publishing more training details!
What role does each type of data in DISC-Law-SFT-Pair play during model training, and what information is each mainly meant to provide to the model?
DISC-Law-SFT-Pair:
Id                  Count   Category
'jud_doc_sum'        8,234  judicial document summarization
'jud_read_compre'   38,530  judicial reading comprehension
'leg_case_cls'      20,563  case classification
'leg_ele_extra'     32,042  legal element extraction
'leg_eve_detec'     21,289  legal event detection
'op_sum'             5,251  public-opinion summarization
'exam'              21,054  judicial examination
'sent_pred'         11,657  judgment prediction
'sim_case_match'     8,138  similar-case matching
How should this part be set up? Thanks.
Hello, I want to fine-tune on my own data. I am currently using "who am I" as the test data.
After fine-tuning, there seems to be no effect at all.
Could you take a look at what the problem is?
The dataset is as follows:
0.json
The fine-tuning script is as follows:
torchrun --nproc_per_node 1 src/train_bash.py \
--stage sft \
--model_name_or_path ShengbinYue/DISC-LawLLM \
--do_train \
--dataset yiqi \
--template baichuan2 \
--finetuning_type lora \
--lora_rank 8 \
--lora_target W_pack \
--output_dir output_checkpoint \
--overwrite_cache \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 8 \
--preprocessing_num_workers 16 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--save_steps 100 \
--eval_steps 100 \
--learning_rate 1e-4 \
--max_grad_norm 0.5 \
--num_train_epochs 3000.0 \
--evaluation_strategy steps \
--load_best_model_at_end \
--plot_loss \
--fp16 \
--val_size 0.01
The export script is as follows:
python src/export_model.py \
--model_name_or_path ShengbinYue/DISC-LawLLM \
--template baichuan2 \
--finetuning_type lora \
--checkpoint_dir output_checkpoint \
--export_dir export_model
The web version suddenly stopped loading recently. What happened?
Hello, could you open-source the retrieval augmentation module and the knowledge base, or do you plan to open-source them later?
Hello, would it be possible to provide the code for the evaluation process?
Why does the record with id 270 have no input field? {
"output": "期间是在刑事诉讼过程中,各个诉讼阶段、各种诉讼行为所用的法定的时间。期间是法律为保护当事人的合法权益不被侵犯所确定的司法机关必须遵守的强制性规定。\n期间以小时、日、月计算,开始的小时、日不计算在内,也就是说,期间应当从诉讼行为开始后的第二个小时或者第二天开始计算。而且为了保证实际的诉讼期间,计算期间时,不包括路途上的时间。对于上诉状或者其他诉讼文件而言,只要在期间届满前交给邮局寄出的,就不算过期。路途上的时间是司法机关邮寄送达文书及当事人向司法机关邮寄诉讼文书在路途上所占用的时间。这是为了便于当事人行使诉讼权利,如果不扣除邮寄诉讼文书在路途上的时间,当事人的诉讼权利难以得到保障,由于路途较远,有的当事人可能还没有接到司法机关送达的诉讼文书,期间就已经届满了,当事人就会因此失去相应的诉讼权利。当然,期间届满前交给邮局寄出的,必须以邮戳为证。\n当期间的最后一天为节假日的情况,要分以下两种情况处理:第一是为了切实保障当事人的诉讼权利,期间的最后一天为节假日的,以节假日后的第一天为期间届满日期。如计算上诉、抗诉等进行刑事诉讼活动的期间,而期间的最后一天为节假日的,以节假日后的第一天为期间届满日期。第二是为了保护当事人的人身权益,对于犯罪嫌疑人、被告人、罪犯的在押期间,应当计算到期间届满时止,不能因为节假日而延长在押期间。",
"id": 270
}, thanks!
"We added a retrieval module, based on the open-source retrieval framework Langchain-Chatchat, on top of DISC-LawLLM."
However, the Langchain-Chatchat retrieval module is not actually provided.
Hello, on the evaluation branch you use the multiple-choice few-shot file to evaluate single-choice questions, and the single-choice few-shot file to evaluate multiple-choice questions. In the src/few_shot folder, the single-choice and multiple-choice questions in the two CSV files do not match the CSV file titles, which causes the problem above in the code.
I greatly admire your work and would like to run some tests online. Could you check whether the online demo server has a bug? The product can no longer be tried out on the web page. Hoping for your reply, thanks!
Could you give me some examples of the QA training set, or more precisely, explain how I can construct it? I am curious about it.
Could you release a version based on Baichuan2-13B, so it can be adapted to domestically produced (Chinese) hardware?
Hello, great job! Thank you for your contributions!
Could you provide some more details on how the legal element extraction dataset was constructed? If so, I would highly appreciate it!
By the way, I would like to know whether you encountered a decrease in general ability while fine-tuning.
After training for 10,000 epochs, the final output for the same question is still unchanged. Which step went wrong?
Dataset:
新闻Q&A.json
Training script:
torchrun --nproc_per_node 1 src/train_bash.py \
--stage sft \
--model_name_or_path ShengbinYue/DISC-LawLLM \
--do_train \
--dataset yiqi5-fun \
--template baichuan2 \
--finetuning_type lora \
--lora_rank 8 \
--lora_target W_pack \
--output_dir /home/DISC-output-checkpoint \
--overwrite_output_dir \
--overwrite_cache \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 8 \
--preprocessing_num_workers 16 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--save_steps 100 \
--eval_steps 100 \
--learning_rate 1e-5 \
--max_grad_norm 0.5 \
--num_train_epochs 10000.0 \
--evaluation_strategy steps \
--load_best_model_at_end \
--plot_loss \
--fp16 \
--val_size 0.01
Is there any evaluation of multi-turn dialogue?