fudandisc / DISC-LawLLM
DISC-LawLLM, an intelligent legal system utilizing large language models (LLMs) to provide a wide range of legal services
License: Apache License 2.0
Hello, in the dataset you provide, DISC-Law-SFT-Triplet contains three parts: input, output, and reference. When fine-tuning with LLaMA Efficient Tuning, how is reference incorporated into training? Right now I pass it in as the system input; or should this part be concatenated directly into input?
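For context, one plausible way to incorporate the reference field, offered here only as an illustrative assumption and not as the authors' confirmed preprocessing, is to prepend the retrieved passages to the user input before applying the chat template:

```python
# Hedged sketch: prepend "reference" passages to the user input so the
# model conditions on them. The template wording is an assumption, not
# the one actually used by DISC-LawLLM.
def build_prompt(example: dict) -> str:
    refs = "\n".join(example.get("reference", []))
    if refs:
        return f"Reference:\n{refs}\n\nQuestion: {example['input']}"
    return example["input"]
```

With LLaMA Efficient Tuning this corresponds to merging the references into the instruction column rather than the system prompt, but only the authors can confirm which variant they trained with.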
Traceback (most recent call last):
File "/Users/yansir/Code/DISC-LawLLM/cli_demo.py", line 81, in <module>
main()
File "/Users/yansir/Code/DISC-LawLLM/cli_demo.py", line 38, in main
model, tokenizer = init_model()
^^^^^^^^^^^^
File "/Users/yansir/Code/DISC-LawLLM/cli_demo.py", line 17, in init_model
tokenizer = AutoTokenizer.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/anaconda3/envs/disc-lawllm/lib/python3.11/site-packages/transformers/models/auto/tokenization_auto.py", line 774, in from_pretrained
return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/anaconda3/envs/disc-lawllm/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2028, in from_pretrained
return cls._from_pretrained(
^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/anaconda3/envs/disc-lawllm/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2260, in _from_pretrained
tokenizer = cls(*init_inputs, **init_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/yansir/.cache/huggingface/modules/transformers_modules/models/tokenization_baichuan.py", line 55, in __init__
super().__init__(
File "/opt/homebrew/anaconda3/envs/disc-lawllm/lib/python3.11/site-packages/transformers/tokenization_utils.py", line 367, in __init__
self._add_tokens(
File "/opt/homebrew/anaconda3/envs/disc-lawllm/lib/python3.11/site-packages/transformers/tokenization_utils.py", line 467, in _add_tokens
current_vocab = self.get_vocab().copy()
^^^^^^^^^^^^^^^^
File "/Users/yansir/.cache/huggingface/modules/transformers_modules/models/tokenization_baichuan.py", line 89, in get_vocab
vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
^^^^^^^^^^^^^^^
File "/Users/yansir/.cache/huggingface/modules/transformers_modules/models/tokenization_baichuan.py", line 85, in vocab_size
return self.sp_model.get_piece_size()
^^^^^^^^^^^^^
AttributeError: 'BaichuanTokenizer' object has no attribute 'sp_model'
And here is pip list:
Package Version Editable project location
accelerate 0.25.0
altair 5.2.0
attrs 23.2.0
blinker 1.7.0
cachetools 5.3.2
certifi 2023.11.17
charset-normalizer 3.3.2
click 8.1.7
colorama 0.4.6
cpm-kernels 1.0.11
cvxopt 1.3.2
filelock 3.13.1
fsspec 2023.12.2
gguf 0.5.2 /Users/yansir/Code/PowerInfer/gguf-py
gitdb 4.0.11
GitPython 3.1.40
huggingface-hub 0.20.1
idna 3.6
importlib-metadata 6.11.0
Jinja2 3.1.2
jsonschema 4.20.0
jsonschema-specifications 2023.12.1
markdown-it-py 3.0.0
MarkupSafe 2.1.3
mdurl 0.1.2
mpmath 1.3.0
networkx 3.2.1
numpy 1.26.3
packaging 23.2
pandas 2.1.4
pillow 10.2.0
pip 23.3.1
powerinfer 0.0.1 /Users/yansir/Code/PowerInfer/powerinfer-py
protobuf 4.25.1
psutil 5.9.7
pyarrow 14.0.2
pydeck 0.8.1b0
Pygments 2.17.2
python-dateutil 2.8.2
pytz 2023.3.post1
PyYAML 6.0.1
referencing 0.32.0
regex 2023.12.25
requests 2.31.0
rich 13.7.0
rpds-py 0.16.2
safetensors 0.4.1
sentencepiece 0.1.99
setuptools 68.2.2
six 1.16.0
smmap 5.0.1
streamlit 1.29.0
sympy 1.12
tenacity 8.2.3
tokenizers 0.15.0
toml 0.10.2
toolz 0.12.0
torch 2.1.2
torchaudio 2.1.2
torchvision 0.16.2
tornado 6.4
tqdm 4.66.1
transformers 4.36.2
transformers-stream-generator 0.0.4
typing_extensions 4.9.0
tzdata 2023.4
tzlocal 5.2
urllib3 2.1.0
validators 0.22.0
wheel 0.41.2
zipp 3.17.0
Here is conda info:
active environment : disc-lawllm
active env location : /opt/homebrew/anaconda3/envs/disc-lawllm
shell level : 2
user config file : /Users/yansir/.condarc
populated config files :
conda version : 23.11.0
conda-build version : 3.28.1
python version : 3.11.5.final.0
solver : libmamba (default)
virtual packages : __archspec=1=m1
__conda=23.11.0=0
__osx=14.1.2=0
__unix=0=0
base environment : /opt/homebrew/anaconda3 (writable)
conda av data dir : /opt/homebrew/anaconda3/etc/conda
conda av metadata url : None
channel URLs : https://repo.anaconda.com/pkgs/main/osx-arm64
https://repo.anaconda.com/pkgs/main/noarch
https://repo.anaconda.com/pkgs/r/osx-arm64
https://repo.anaconda.com/pkgs/r/noarch
package cache : /opt/homebrew/anaconda3/pkgs
/Users/yansir/.conda/pkgs
envs directories : /opt/homebrew/anaconda3/envs
/Users/yansir/.conda/envs
platform : osx-arm64
user-agent : conda/23.11.0 requests/2.31.0 CPython/3.11.5 Darwin/23.1.0 OSX/14.1.2 solver/libmamba conda-libmamba-solver/23.11.1 libmambapy/1.5.3 aau/0.4.2 c/46E3x2d6f2VlX4gv7DYsuw s/sxEnXN6WjIEInKrIgszajQ e/-Xx3es9J-DigXAfp0lDX7A
UID:GID : 501:20
netrc file : None
offline mode : False
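The AttributeError above ('BaichuanTokenizer' object has no attribute 'sp_model') typically appears when a custom tokenizer calls super().__init__() before assigning self.sp_model, because newer transformers releases access the vocabulary during __init__ (the pip list shows transformers 4.36.2). A commonly reported workaround, offered as an assumption to verify rather than an official fix, is to pin an older transformers:

```shell
# Version pin is an assumption; check the repository's requirements.
pip install "transformers==4.33.2"
```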
Would you mind providing more information about the Knowledge Expansion method?
My question is: how do you ensure correctness when asking ChatGPT to give full explanations of the correct or wrong options?
In my experience, ChatGPT is currently not very good at knowledge of Chinese law and the related analysis. And it is often hard to tell when ChatGPT will give unreliable results, because it is very good at making up stories.
When will the evaluation framework be released?
I would like to know how the subjective questions are evaluated. Is there a prompt template we can refer to?
Will the legal Q&A portion of the SFT dataset be made public?
Hi, a question: on the evaluation branch, why do the results of evaluating DISC-LawLLM on the evaluation set you provide differ slightly from the results reported at the bottom of the repository?
Is it related to the decoding strategy? Also, is the external knowledge base used anywhere in the code? Thanks.
I read the technical report carefully and did not find the LoRA training details mentioned in the repository, especially the learning rate. Why is the full-parameter fine-tuning learning rate 5e-5, far higher than the 1e-5 used for LoRA training? I am curious what effect this brings, and I hope to get a reply.
"These candidate documents, along with the user input,
are formulated using our designed template and
then fed into the DISC-LawLLM"
Will you release the template used here and the concrete SFT training method?
I tried the product demo; it is surprisingly good.
But I see the Q&A dataset has not been released, so I would like to ask how it was constructed.
The technical report says it was built with the Behavior Shaping, Knowledge Expansion, and Thinking Development methods, but I do not understand how these three methods were concretely used to construct the Q&A dataset.
Has train_bash.py been open-sourced?
src/train_bash.py
Is the multiple-choice evaluation wrong? I see cases where the answer is A and the model also answers A, yet the score is 0.
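A score of 0 despite a matching letter usually points at overly strict string comparison. As a neutral illustration, not the benchmark's actual scorer, a lenient pass can extract standalone option letters from the reply and compare them as sets:

```python
import re

def extract_options(reply: str) -> set:
    """Collect standalone option letters A-D from a free-form reply."""
    return set(re.findall(r"\b[A-D]\b", reply.upper()))

def score(reply: str, gold: str) -> float:
    """1.0 when the predicted option set matches the gold answer exactly."""
    return 1.0 if extract_options(reply) == set(gold) else 0.0
```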
How is retrieval implemented concretely, and which embedding model is used?
Great work! However, I ran into some OpenAI errors while using it.
Which version of the openai package do you use?
{"time": "2024-01-26 11:25:02.948113", "index": 6, "iter": 1, "eval_scores": null, "norm_msg": "None", "err_msg": "Traceback (most recent call last):\n File \"/root/.pyenv/versions/3.10.0/lib/python3.10/site-packages/ml3m/base/eval.py\", line 378, in process_func\n scores = self._get_score(data_item, **kwargs)\n File \"/root/.pyenv/versions/3.10.0/lib/python3.10/site-packages/ml3m/base/eval.py\", line 866, in _get_score\n completion = openai.ChatCompletion.create(\n File \"/root/.pyenv/versions/3.10.0/lib/python3.10/site-packages/openai/lib/_old_api.py\", line 39, in __call__\n raise APIRemovedInV1(symbol=self._symbol)\nopenai.lib._old_api.APIRemovedInV1: \n\nYou tried to access openai.ChatCompletion, but this is no longer supported in openai>=1.0.0 - see the README at https://github.com/openai/openai-python for the API.\n\nYou can run `openai migrate` to automatically upgrade your codebase to use the 1.0.0 interface. \n\nAlternatively, you can pin your installation to the old version, e.g. `pip install openai==0.28`\n\nA detailed migration guide is available here: https://github.com/openai/openai-python/discussions/742\n\n"}
I do not see the src directory.
from datasets import load_dataset
dataset = load_dataset("ShengbinYue/DISC-Law-SFT")
----------------------------------------------------------------------------------
error:
Generating train split: 166758 examples [00:00, 184286.58 examples/s]
Traceback (most recent call last):
File "/home/ysx/miniconda3/lib/python3.11/site-packages/datasets/builder.py", line 1940, in _prepare_split_single
writer.write_table(table)
File "/home/ysx/miniconda3/lib/python3.11/site-packages/datasets/arrow_writer.py", line 572, in write_table
pa_table = table_cast(pa_table, self._schema)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ysx/miniconda3/lib/python3.11/site-packages/datasets/table.py", line 2328, in table_cast
return cast_table_to_schema(table, schema)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ysx/miniconda3/lib/python3.11/site-packages/datasets/table.py", line 2286, in cast_table_to_schema
raise ValueError(f"Couldn't cast\n{table.schema}\nto\n{features}\nbecause column names don't match")
ValueError: Couldn't cast
id: string
reference: list<item: string>
child 0, item: string
input: string
output: string
to
{'id': Value(dtype='string', id=None), 'input': Value(dtype='string', id=None), 'output': Value(dtype='string', id=None)}
because column names don't match
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ysx/miniconda3/lib/python3.11/site-packages/datasets/load.py", line 2153, in load_dataset
builder_instance.download_and_prepare(
File "/home/ysx/miniconda3/lib/python3.11/site-packages/datasets/builder.py", line 954, in download_and_prepare
self._download_and_prepare(
File "/home/ysx/miniconda3/lib/python3.11/site-packages/datasets/builder.py", line 1049, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/home/ysx/miniconda3/lib/python3.11/site-packages/datasets/builder.py", line 1813, in _prepare_split
for job_id, done, content in self._prepare_split_single(
File "/home/ysx/miniconda3/lib/python3.11/site-packages/datasets/builder.py", line 1958, in _prepare_split_single
raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset
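The cast error above arises because the repository mixes two schemas: Pair records carry id/input/output while Triplet records add a reference list, and load_dataset cannot unify them automatically. Two workarounds seem plausible (my assumptions, not an official loader): pass data_files to load only files that share one schema, or normalize every record to the union schema before building the dataset, as in this sketch:

```python
def normalize(record: dict) -> dict:
    """Pad a record to the union schema so all rows cast uniformly."""
    return {
        "id": str(record.get("id", "")),
        "input": record.get("input", ""),
        "output": record.get("output", ""),
        "reference": record.get("reference", []),  # [] for Pair records
    }

pair = {"id": 1, "input": "q", "output": "a"}
triplet = {"id": 2, "input": "q", "output": "a", "reference": ["statute"]}
rows = [normalize(r) for r in (pair, triplet)]
assert set(rows[0]) == set(rows[1])  # identical column sets now
```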
Environment: a single A6000 GPU.
When running LoRA fine-tuning with LLaMA Efficient Tuning, the following error is reported.
The script used is as follows (copied from the LoRA fine-tuning example; parameters not yet modified):
torchrun --nproc_per_node 1 src/train_bash.py \
--stage sft \
--model_name_or_path ShengbinYue/DISC-LawLLM \
--do_train \
--dataset alpaca_gpt4_zh \
--template baichuan \
--finetuning_type lora \
--lora_rank 8 \
--lora_target W_pack \
--output_dir path_to_your_sft_checkpoint \
--overwrite_cache \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 8 \
--preprocessing_num_workers 16 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--save_steps 100 \
--eval_steps 100 \
--learning_rate 1e-5 \
--max_grad_norm 0.5 \
--num_train_epochs 2.0 \
--evaluation_strategy steps \
--load_best_model_at_end \
--plot_loss \
--fp16
Did your team run into this problem at the time, and how did you solve it?
Isn't DISC-Law-Eval-Benchmark open-sourced?
Could you provide the evaluation datasets with answers?
How much GPU memory does this 13B model actually need to run? Is a 40 GB card enough?
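As a back-of-envelope estimate (a rule of thumb, not a measured figure): fp16 weights take roughly 2 bytes per parameter, so a 13B model's weights alone occupy about 24 GiB, leaving room on a 40 GB card for activations and the KV cache during inference:

```python
# Rough sizing only; real usage also depends on sequence length,
# batch size and framework overhead.
params = 13e9                  # 13B parameters
fp16_gib = params * 2 / 1024**3
assert 24 < fp16_gib < 25      # about 24.2 GiB of weights
```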
Where can I download the knowledge base?
Much of Hugging Face cannot be downloaded now; even with a proxy IP there are restrictions. Is there a netdisk or ModelScope link?
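One workaround often mentioned for restricted Hugging Face access (an assumption about your network, not an official distribution channel) is pointing huggingface_hub at a mirror endpoint:

```shell
# HF_ENDPOINT is honored by huggingface_hub; hf-mirror.com is a
# community-run mirror, so verify its availability yourself.
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download ShengbinYue/DISC-LawLLM --local-dir DISC-LawLLM
```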
__init__
self.post_init()
File "D:\transformers\src\transformers\modeling_utils.py", line 1160, in post_init
self._backward_compatibility_gradient_checkpointing()
File "D:\transformers\src\transformers\modeling_utils.py", line 1164, in _backward_compatibility_gradient_checkpointing
self.gradient_checkpointing_enable()
File "D:\transformers\src\transformers\modeling_utils.py", line 1873, in gradient_checkpointing_enable
self._set_gradient_checkpointing(enable=True, gradient_checkpointing_func=gradient_checkpointing_func)
TypeError: BaichuanPreTrainedModel._set_gradient_checkpointing() got an unexpected keyword argument 'enable'
Is this a transformers version problem? Which version is the matching one?
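This TypeError usually indicates a version mismatch: transformers changed the gradient-checkpointing hook around 4.35 so that _set_gradient_checkpointing receives enable and gradient_checkpointing_func keywords, which the cached Baichuan modeling code predates. A hedged workaround, again a version assumption to verify against the repo's requirements, is to downgrade:

```shell
# Pin below the gradient-checkpointing API change (assumed 4.35).
pip install "transformers<4.35"
```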
Regarding evaluation, could you describe the concrete evaluation procedure?
For example, for the single-choice questions in the objective set, how is the few-shot setting configured? Is it four example items concatenated before the current question?
And for the model's answer, does it count as correct if it contains only the option letter, or must it also include the option text?
For the subjective questions, could you share the GPT-3.5 evaluation prompt template?
Many thanks!
Really looking forward to this project continuing to release new models and publishing more training details!
What role does each type of data in DISC-Law-SFT-Pair play during model training, and what information is each mainly meant to provide to the model?
DISC-Law-SFT-Pair:
Id                  Count   Category
'jud_doc_sum'        8,234  judicial document summarization
'jud_read_compre'   38,530  judicial reading comprehension
'leg_case_cls'      20,563  case classification
'leg_ele_extra'     32,042  legal element extraction
'leg_eve_detec'     21,289  legal event detection
'op_sum'             5,251  public-opinion summarization
'exam'              21,054  judicial examination
'sent_pred'         11,657  judgment prediction
'sim_case_match'     8,138  similar-case matching
How should this part be set up? Thanks.
Hello, I want to fine-tune on my own data. I am currently using "who am I" as the test data.
After fine-tuning, there seems to be no effect at all.
Could you take a look at what the problem is?
The dataset is as follows:
0.json
The fine-tuning script is as follows:
torchrun --nproc_per_node 1 src/train_bash.py \
--stage sft \
--model_name_or_path ShengbinYue/DISC-LawLLM \
--do_train \
--dataset yiqi \
--template baichuan2 \
--finetuning_type lora \
--lora_rank 8 \
--lora_target W_pack \
--output_dir output_checkpoint \
--overwrite_cache \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 8 \
--preprocessing_num_workers 16 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--save_steps 100 \
--eval_steps 100 \
--learning_rate 1e-4 \
--max_grad_norm 0.5 \
--num_train_epochs 3000.0 \
--evaluation_strategy steps \
--load_best_model_at_end \
--plot_loss \
--fp16 \
--val_size 0.01
The export script is as follows:
python src/export_model.py \
--model_name_or_path ShengbinYue/DISC-LawLLM \
--template baichuan2 \
--finetuning_type lora \
--checkpoint_dir output_checkpoint \
--export_dir export_model
The web version suddenly stopped loading recently. What happened?
Hello, could you open-source the retrieval augmentation module and the knowledge base, or do you plan to open-source them later?
Hello, would it be possible to provide the code for the evaluation process?
Why does the record with id 270 have no input field? {
"output": "期间是在刑事诉讼过程中,各个诉讼阶段、各种诉讼行为所用的法定的时间。期间是法律为保护当事人的合法权益不被侵犯所确定的司法机关必须遵守的强制性规定。\n期间以小时、日、月计算,开始的小时、日不计算在内,也就是说,期间应当从诉讼行为开始后的第二个小时或者第二天开始计算。而且为了保证实际的诉讼期间,计算期间时,不包括路途上的时间。对于上诉状或者其他诉讼文件而言,只要在期间届满前交给邮局寄出的,就不算过期。路途上的时间是司法机关邮寄送达文书及当事人向司法机关邮寄诉讼文书在路途上所占用的时间。这是为了便于当事人行使诉讼权利,如果不扣除邮寄诉讼文书在路途上的时间,当事人的诉讼权利难以得到保障,由于路途较远,有的当事人可能还没有接到司法机关送达的诉讼文书,期间就已经届满了,当事人就会因此失去相应的诉讼权利。当然,期间届满前交给邮局寄出的,必须以邮戳为证。\n当期间的最后一天为节假日的情况,要分以下两种情况处理:第一是为了切实保障当事人的诉讼权利,期间的最后一天为节假日的,以节假日后的第一天为期间届满日期。如计算上诉、抗诉等进行刑事诉讼活动的期间,而期间的最后一天为节假日的,以节假日后的第一天为期间届满日期。第二是为了保护当事人的人身权益,对于犯罪嫌疑人、被告人、罪犯的在押期间,应当计算到期间届满时止,不能因为节假日而延长在押期间。",
"id": 270
}, thanks!
"We added a retrieval module, based on the open-source retrieval framework Langchain-Chatchat, on top of DISC-LawLLM."
However, the Langchain-Chatchat retrieval module is not actually provided.
Hello, on the evaluation branch you use the multiple-choice few-shot file to evaluate single-choice questions, and the single-choice few-shot file to evaluate multiple-choice questions. In the src/few_shot folder, the single-choice and multiple-choice questions in the two CSV files do not match the CSV file titles, which causes the problem above in the code.
I greatly admire your work and would like to run some tests online. Could you check whether the online demo server has a bug? The product can no longer be tried out on the web page. Hoping for your reply, thanks!
Could you give me some examples of the QA training set, or more precisely, explain how I can construct it? I am curious about it.
Could you release a version based on Baichuan2-13B, so it can be adapted to domestically produced (Chinese) hardware?
Hello, great job! Thank you for your contributions!
Could you provide some more details on how the legal element extraction dataset was constructed? If so, I would highly appreciate it!
By the way, I would like to know whether you encountered a decrease in general ability while fine-tuning.
After training for 10,000 epochs, the final output for the same question is still unchanged. Which step went wrong?
Dataset:
新闻Q&A.json
Training script:
torchrun --nproc_per_node 1 src/train_bash.py \
--stage sft \
--model_name_or_path ShengbinYue/DISC-LawLLM \
--do_train \
--dataset yiqi5-fun \
--template baichuan2 \
--finetuning_type lora \
--lora_rank 8 \
--lora_target W_pack \
--output_dir /home/DISC-output-checkpoint \
--overwrite_output_dir \
--overwrite_cache \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 8 \
--preprocessing_num_workers 16 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--save_steps 100 \
--eval_steps 100 \
--learning_rate 1e-5 \
--max_grad_norm 0.5 \
--num_train_epochs 10000.0 \
--evaluation_strategy steps \
--load_best_model_at_end \
--plot_loss \
--fp16 \
--val_size 0.01
Is there any evaluation of multi-turn dialogue?