
disc-lawllm's People

Contributors

benson114, charlie-xiao, eltociear, lemuria-wchen, lsjlsj35, scc-bit, yueshengbin


disc-lawllm's Issues

DISC-Law-SFT-Triplet dataset structure

Hello, in the dataset you provide, DISC-Law-SFT-Triplet contains three parts: input, output, and reference. When fine-tuning with LLaMA Efficient Tuning, how is reference incorporated into training? At the moment I pass it as the system input — or should this part instead be concatenated directly into input?

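A minimal sketch, not the authors' method (how reference enters training is exactly what this issue asks): one way to fold the reference passages of a Triplet record into the input so that a pair-style trainer such as LLaMA Efficient Tuning can consume it. The instruction/output field names follow the alpaca-style convention; the sample content is made up.

import json

def triplet_to_pair(example: dict) -> dict:
    # Prepend all retrieved reference passages before the user question.
    refs = "\n".join(example.get("reference", []))
    return {
        "instruction": f"参考资料:\n{refs}\n\n问题:{example['input']}",
        "output": example["output"],
    }

sample = {
    "reference": ["(retrieved statute text)"],
    "input": "(user question)",
    "output": "(reference answer)",
}
print(json.dumps(triplet_to_pair(sample), ensure_ascii=False, indent=2))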

'BaichuanTokenizer' object has no attribute 'sp_model'

Traceback (most recent call last):
File "/Users/yansir/Code/DISC-LawLLM/cli_demo.py", line 81, in
main()
File "/Users/yansir/Code/DISC-LawLLM/cli_demo.py", line 38, in main
model, tokenizer = init_model()
^^^^^^^^^^^^
File "/Users/yansir/Code/DISC-LawLLM/cli_demo.py", line 17, in init_model
tokenizer = AutoTokenizer.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/anaconda3/envs/disc-lawllm/lib/python3.11/site-packages/transformers/models/auto/tokenization_auto.py", line 774, in from_pretrained
return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/anaconda3/envs/disc-lawllm/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2028, in from_pretrained
return cls._from_pretrained(
^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/anaconda3/envs/disc-lawllm/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2260, in _from_pretrained
tokenizer = cls(*init_inputs, **init_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/yansir/.cache/huggingface/modules/transformers_modules/models/tokenization_baichuan.py", line 55, in init
super().init(
File "/opt/homebrew/anaconda3/envs/disc-lawllm/lib/python3.11/site-packages/transformers/tokenization_utils.py", line 367, in init
self._add_tokens(
File "/opt/homebrew/anaconda3/envs/disc-lawllm/lib/python3.11/site-packages/transformers/tokenization_utils.py", line 467, in _add_tokens
current_vocab = self.get_vocab().copy()
^^^^^^^^^^^^^^^^
File "/Users/yansir/.cache/huggingface/modules/transformers_modules/models/tokenization_baichuan.py", line 89, in get_vocab
vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
^^^^^^^^^^^^^^^
File "/Users/yansir/.cache/huggingface/modules/transformers_modules/models/tokenization_baichuan.py", line 85, in vocab_size
return self.sp_model.get_piece_size()
^^^^^^^^^^^^^
AttributeError: 'BaichuanTokenizer' object has no attribute 'sp_model'

And here is pip list:
Package Version Editable project location

accelerate 0.25.0
altair 5.2.0
attrs 23.2.0
blinker 1.7.0
cachetools 5.3.2
certifi 2023.11.17
charset-normalizer 3.3.2
click 8.1.7
colorama 0.4.6
cpm-kernels 1.0.11
cvxopt 1.3.2
filelock 3.13.1
fsspec 2023.12.2
gguf 0.5.2 /Users/yansir/Code/PowerInfer/gguf-py
gitdb 4.0.11
GitPython 3.1.40
huggingface-hub 0.20.1
idna 3.6
importlib-metadata 6.11.0
Jinja2 3.1.2
jsonschema 4.20.0
jsonschema-specifications 2023.12.1
markdown-it-py 3.0.0
MarkupSafe 2.1.3
mdurl 0.1.2
mpmath 1.3.0
networkx 3.2.1
numpy 1.26.3
packaging 23.2
pandas 2.1.4
pillow 10.2.0
pip 23.3.1
powerinfer 0.0.1 /Users/yansir/Code/PowerInfer/powerinfer-py
protobuf 4.25.1
psutil 5.9.7
pyarrow 14.0.2
pydeck 0.8.1b0
Pygments 2.17.2
python-dateutil 2.8.2
pytz 2023.3.post1
PyYAML 6.0.1
referencing 0.32.0
regex 2023.12.25
requests 2.31.0
rich 13.7.0
rpds-py 0.16.2
safetensors 0.4.1
sentencepiece 0.1.99
setuptools 68.2.2
six 1.16.0
smmap 5.0.1
streamlit 1.29.0
sympy 1.12
tenacity 8.2.3
tokenizers 0.15.0
toml 0.10.2
toolz 0.12.0
torch 2.1.2
torchaudio 2.1.2
torchvision 0.16.2
tornado 6.4
tqdm 4.66.1
transformers 4.36.2
transformers-stream-generator 0.0.4
typing_extensions 4.9.0
tzdata 2023.4
tzlocal 5.2
urllib3 2.1.0
validators 0.22.0
wheel 0.41.2
zipp 3.17.0

Here is conda info:
active environment : disc-lawllm
active env location : /opt/homebrew/anaconda3/envs/disc-lawllm
shell level : 2
user config file : /Users/yansir/.condarc
populated config files :
conda version : 23.11.0
conda-build version : 3.28.1
python version : 3.11.5.final.0
solver : libmamba (default)
virtual packages : __archspec=1=m1
__conda=23.11.0=0
__osx=14.1.2=0
__unix=0=0
base environment : /opt/homebrew/anaconda3 (writable)
conda av data dir : /opt/homebrew/anaconda3/etc/conda
conda av metadata url : None
channel URLs : https://repo.anaconda.com/pkgs/main/osx-arm64
https://repo.anaconda.com/pkgs/main/noarch
https://repo.anaconda.com/pkgs/r/osx-arm64
https://repo.anaconda.com/pkgs/r/noarch
package cache : /opt/homebrew/anaconda3/pkgs
/Users/yansir/.conda/pkgs
envs directories : /opt/homebrew/anaconda3/envs
/Users/yansir/.conda/envs
platform : osx-arm64
user-agent : conda/23.11.0 requests/2.31.0 CPython/3.11.5 Darwin/23.1.0 OSX/14.1.2 solver/libmamba conda-libmamba-solver/23.11.1 libmambapy/1.5.3 aau/0.4.2 c/46E3x2d6f2VlX4gv7DYsuw s/sxEnXN6WjIEInKrIgszajQ e/-Xx3es9J-DigXAfp0lDX7A
UID:GID : 501:20
netrc file : None
offline mode : False
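A hedged observation rather than an official fix: the pip list above shows transformers 4.36.2, and the Baichuan remote tokenizer code predates the init-order change in transformers 4.34+, where the base class queries get_vocab() before the custom __init__ has assigned self.sp_model. Downgrading, e.g. pip install "transformers<4.34", is the workaround most often reported for this exact AttributeError; a guard like the following makes the assumption explicit:

from packaging import version
import transformers

# Assumption: BaichuanTokenizer remote code breaks on transformers >= 4.34
# because PreTrainedTokenizer.__init__ now calls get_vocab() early.
assert version.parse(transformers.__version__) < version.parse("4.34"), (
    "downgrade transformers (e.g. 4.33.2) before loading BaichuanTokenizer"
)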

Details on Knowledge Expansion

Would you mind providing more information about the Knowledge Expansion method?
My question is: how do you ensure correctness when asking ChatGPT to give full explanations of the correct or wrong options?
In my experience, ChatGPT is currently not very good at knowledge of Chinese law and the related analysis, and it is often hard to tell when ChatGPT will give unreliable results, because it is very good at making up stories.

Different evaluation results

Hello, may I ask why, on the evaluation branch, the results of evaluating the benchmark with the DISC-LawLLM you provide differ slightly from the results reported at the bottom of the repository?

Is this related to the decoding strategy? Also, is the external knowledge base actually used anywhere in the code? Thanks.

Details on LoRA

I read the technical report carefully and could not find the details of the LoRA training mentioned in the repository, especially the learning rate. Why is the learning rate for full-parameter fine-tuning 5e-5, far higher than the 1e-5 used for LoRA training? I am curious what effect this has; I hope to get a reply.

Consultation on the designed template

"These candidate documents, along with the user input,
are formulated using our designed template and
then fed into the DISC-LawLLM"
Will the template format used here and the concrete SFT training procedure be released?

Construction of QA training set

I tried the product demo and it is unexpectedly good.
However, I see that the QA dataset has not been released, so I would like to ask how exactly the QA dataset was constructed.
The technical report says it was built with the Behavior Shaping, Knowledge Expansion, and Thinking Development methods, but I cannot figure out how these three methods are concretely used to construct the QA dataset.

Which version of the openai package is used?

Great work! However, I ran into some openai errors while using it.
Which version of the openai package did you use?

{"time": "2024-01-26 11:25:02.948113", "index": 6, "iter": 1, "eval_scores": null, "norm_msg": "None", "err_msg": "Traceback (most recent call last):\n  File \"/root/.pyenv/versions/3.10.0/lib/python3.10/site-packages/ml3m/base/eval.py\", line 378, in process_func\n    scores = self._get_score(data_item, **kwargs)\n  File \"/root/.pyenv/versions/3.10.0/lib/python3.10/site-packages/ml3m/base/eval.py\", line 866, in _get_score\n    completion = openai.ChatCompletion.create(\n  File \"/root/.pyenv/versions/3.10.0/lib/python3.10/site-packages/openai/lib/_old_api.py\", line 39, in __call__\n    raise APIRemovedInV1(symbol=self._symbol)\nopenai.lib._old_api.APIRemovedInV1: \n\nYou tried to access openai.ChatCompletion, but this is no longer supported in openai>=1.0.0 - see the README at https://github.com/openai/openai-python for the API.\n\nYou can run `openai migrate` to automatically upgrade your codebase to use the 1.0.0 interface. \n\nAlternatively, you can pin your installation to the old version, e.g. `pip install openai==0.28`\n\nA detailed migration guide is available here: https://github.com/openai/openai-python/discussions/742\n\n"}

SFT datasets error

from datasets import load_dataset

dataset = load_dataset("ShengbinYue/DISC-Law-SFT")

----------------------------------------------------------------------------------
error:
Generating train split: 166758 examples [00:00, 184286.58 examples/s]
Traceback (most recent call last):
  File "/home/ysx/miniconda3/lib/python3.11/site-packages/datasets/builder.py", line 1940, in _prepare_split_single
    writer.write_table(table)
  File "/home/ysx/miniconda3/lib/python3.11/site-packages/datasets/arrow_writer.py", line 572, in write_table
    pa_table = table_cast(pa_table, self._schema)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ysx/miniconda3/lib/python3.11/site-packages/datasets/table.py", line 2328, in table_cast
    return cast_table_to_schema(table, schema)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ysx/miniconda3/lib/python3.11/site-packages/datasets/table.py", line 2286, in cast_table_to_schema
    raise ValueError(f"Couldn't cast\n{table.schema}\nto\n{features}\nbecause column names don't match")
ValueError: Couldn't cast
id: string
reference: list<item: string>
  child 0, item: string
input: string
output: string
to
{'id': Value(dtype='string', id=None), 'input': Value(dtype='string', id=None), 'output': Value(dtype='string', id=None)}
because column names don't match

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ysx/miniconda3/lib/python3.11/site-packages/datasets/load.py", line 2153, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/ysx/miniconda3/lib/python3.11/site-packages/datasets/builder.py", line 954, in download_and_prepare
    self._download_and_prepare(
  File "/home/ysx/miniconda3/lib/python3.11/site-packages/datasets/builder.py", line 1049, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/ysx/miniconda3/lib/python3.11/site-packages/datasets/builder.py", line 1813, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/home/ysx/miniconda3/lib/python3.11/site-packages/datasets/builder.py", line 1958, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset
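A hedged workaround: the repository mixes Pair-style records (id/input/output) with Triplet-style records that carry an extra reference column, so the datasets builder cannot cast everything to a single schema. Loading one subset at a time via data_files avoids the cast error; the file name below is an assumption, so check the actual names in the Hub repo:

from datasets import load_dataset

pair = load_dataset(
    "ShengbinYue/DISC-Law-SFT",
    data_files="DISC-Law-SFT-Pair.jsonl",  # hypothetical file name
    split="train",
)
print(pair)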

ValueError in finetuning

Environment: a single A6000 GPU.
When running LoRA fine-tuning with LLaMA Efficient Tuning, an error is reported (screenshot not preserved).
The script used is as follows; it was copied from the LoRA fine-tuning example and the parameters have not been modified yet:
torchrun --nproc_per_node 1 src/train_bash.py \
    --stage sft \
    --model_name_or_path ShengbinYue/DISC-LawLLM \
    --do_train \
    --dataset alpaca_gpt4_zh \
    --template baichuan \
    --finetuning_type lora \
    --lora_rank 8 \
    --lora_target W_pack \
    --output_dir path_to_your_sft_checkpoint \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --preprocessing_num_workers 16 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 100 \
    --eval_steps 100 \
    --learning_rate 1e-5 \
    --max_grad_norm 0.5 \
    --num_train_epochs 2.0 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --plot_loss \
    --fp16

Did your team run into this problem at the time, and how was it solved?

HuggingFace download issue

Many files on Hugging Face can no longer be downloaded now; even proxy IPs are restricted. Is there a netdisk or ModelScope (魔搭) link?
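Not a netdisk link, but one workaround commonly used on restricted networks (assuming it applies here): route downloads through the hf-mirror.com endpoint. The environment variable must be set before huggingface_hub is imported:

import os

os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

from huggingface_hub import snapshot_download

snapshot_download("ShengbinYue/DISC-LawLLM", local_dir="DISC-LawLLM")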

TypeError in local deployment (possibly due to `transformers` version)

__init__
    self.post_init()
  File "D:\transformers\src\transformers\modeling_utils.py", line 1160, in post_init
    self._backward_compatibility_gradient_checkpointing()
  File "D:\transformers\src\transformers\modeling_utils.py", line 1164, in _backward_compatibility_gradient_checkpointing
    self.gradient_checkpointing_enable()
  File "D:\transformers\src\transformers\modeling_utils.py", line 1873, in gradient_checkpointing_enable
    self._set_gradient_checkpointing(enable=True, gradient_checkpointing_func=gradient_checkpointing_func)
TypeError: BaichuanPreTrainedModel._set_gradient_checkpointing() got an unexpected keyword argument 'enable'

Is this a transformers version problem? Which version is compatible?
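A hedged guess at the cause: transformers 4.35 changed gradient_checkpointing_enable() to call the model's _set_gradient_checkpointing() with new keyword arguments, which the Baichuan remote code does not accept. Pinning an older release, e.g. pip install "transformers<4.35", is the workaround most often reported:

from packaging import version
import transformers

# Assumption: Baichuan remote modeling code breaks on transformers >= 4.35
# because of the reworked gradient-checkpointing API.
assert version.parse(transformers.__version__) < version.parse("4.35"), (
    "downgrade transformers before loading the Baichuan remote code"
)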

Eval details

Could you provide the concrete evaluation procedure?
For example, for the single-choice questions in the objective set, how is the few-shot setting configured? Are four example items prepended to the current question?
And for the model's answer, is it counted as correct as long as it contains the option letter, or must the option text be included as well?
For the subjective questions, could you provide the GPT-3.5 evaluation prompt template?
Many thanks!
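Purely illustrative and not the authors' template (their template is exactly what this issue asks for): one plausible layout for a 4-shot single-choice prompt, with example items prepended before the test question.

few_shot = [
    ("下列哪一项属于民事法律行为?A... B... C... D...", "A"),
    # ...three more (question, answer) pairs drawn from a held-out split
]
test_question = "下列关于诉讼时效的说法,正确的是?A... B... C... D..."

prompt = "".join(f"问题:{q}\n答案:{a}\n\n" for q, a in few_shot)
prompt += f"问题:{test_question}\n答案:"
print(prompt)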

Role of each data type in DISC-Law-SFT-Pair

What role does each type of data in DISC-Law-SFT-Pair play during model training? What information is each type mainly meant to provide to the model?

DISC-Law-SFT-Pair:

Id                 Count    Category
jud_doc_sum         8,234   judicial document summarization
jud_read_compre    38,530   judicial reading comprehension
leg_case_cls       20,563   case classification
leg_ele_extra      32,042   legal element extraction
leg_eve_detec      21,289   legal event detection
op_sum              5,251   public opinion summarization
exam               21,054   judicial examination
sent_pred          11,657   judgment prediction
sim_case_match      8,138   similar case matching

Finetuning self-cognition not working

Hello, I want to fine-tune on my own data, and I am currently testing with a "who am I" (self-cognition) sample.
After fine-tuning, there seems to be no effect at all.
Could you take a look at what the problem might be?

The dataset is as follows:
0.json

The fine-tuning script is as follows:

torchrun --nproc_per_node 1 src/train_bash.py \
    --stage sft \
    --model_name_or_path ShengbinYue/DISC-LawLLM \
    --do_train \
    --dataset yiqi \
    --template baichuan2 \
    --finetuning_type lora \
    --lora_rank 8 \
    --lora_target W_pack \
    --output_dir output_checkpoint \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --preprocessing_num_workers 16 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 100 \
    --eval_steps 100 \
    --learning_rate 1e-4 \
    --max_grad_norm 0.5 \
    --num_train_epochs 3000.0 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --plot_loss \
    --fp16 \
    --val_size 0.01

The export script is as follows:

python src/export_model.py \
    --model_name_or_path ShengbinYue/DISC-LawLLM \
    --template baichuan2 \
    --finetuning_type lora \
    --checkpoint_dir output_checkpoint \
    --export_dir export_model
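A hedged debugging hint rather than a confirmed cause: LLaMA Efficient Tuning only trains on the fields declared for the dataset in data/dataset_info.json, so a self-cognition record in 0.json would typically need the alpaca-style layout sketched below (the field names are the tuner's convention; the content is a made-up example). If the fields in 0.json do not match the registration, the self-cognition answers may never reach the loss.

# Hypothetical self-cognition record in alpaca format.
record = {
    "instruction": "你是谁?",
    "input": "",
    "output": "我是基于 DISC-LawLLM 微调得到的法律智能助手。",
}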

Code

Hello, would it be possible to provide the code for the evaluation process?

An error in the subjective_eval dataset (id 270)

Why does the data item with id 270 have no input? {
"output": "期间是在刑事诉讼过程中,各个诉讼阶段、各种诉讼行为所用的法定的时间。期间是法律为保护当事人的合法权益不被侵犯所确定的司法机关必须遵守的强制性规定。\n期间以小时、日、月计算,开始的小时、日不计算在内,也就是说,期间应当从诉讼行为开始后的第二个小时或者第二天开始计算。而且为了保证实际的诉讼期间,计算期间时,不包括路途上的时间。对于上诉状或者其他诉讼文件而言,只要在期间届满前交给邮局寄出的,就不算过期。路途上的时间是司法机关邮寄送达文书及当事人向司法机关邮寄诉讼文书在路途上所占用的时间。这是为了便于当事人行使诉讼权利,如果不扣除邮寄诉讼文书在路途上的时间,当事人的诉讼权利难以得到保障,由于路途较远,有的当事人可能还没有接到司法机关送达的诉讼文书,期间就已经届满了,当事人就会因此失去相应的诉讼权利。当然,期间届满前交给邮局寄出的,必须以邮戳为证。\n当期间的最后一天为节假日的情况,要分以下两种情况处理:第一是为了切实保障当事人的诉讼权利,期间的最后一天为节假日的,以节假日后的第一天为期间届满日期。如计算上诉、抗诉等进行刑事诉讼活动的期间,而期间的最后一天为节假日的,以节假日后的第一天为期间届满日期。第二是为了保护当事人的人身权益,对于犯罪嫌疑人、被告人、罪犯的在押期间,应当计算到期间届满时止,不能因为节假日而延长在押期间。",
"id": 270
}
Thanks!

Wrong few-shot example files used in the evaluation code

Hello, on the evaluation branch, the multiple-choice few-shot file is used to evaluate the single-choice questions, and the single-choice few-shot file is used to evaluate the multiple-choice questions. In the src/few_shot folder, the single-choice/multiple-choice content of the two csv files does not match the csv file titles, which causes the problem above in the code.

Online demo link inaccessible?

I really admire your work and would like to do some testing online. Could you check whether the online demo server is having problems? The product can no longer be tried out on the web page. I hope to hear back from you, thanks!

QA training set construction

Could you give me some examples of the QA training set, or more precisely, of how I can construct it? I am curious about it.

Legal Element Extraction

Hello, great job! Thank you for your contributions!
Could you provide some more details on how the legal element extraction dataset was constructed? I would highly appreciate it!
By the way, I would also like to know whether you encountered a decrease in general ability while fine-tuning.

Same output after finetuning

After training for 10,000 epochs, the same question still produces exactly the same output. Which step went wrong?
Dataset:
新闻Q&A.json
Training script:

torchrun --nproc_per_node 1 src/train_bash.py \
    --stage sft \
    --model_name_or_path ShengbinYue/DISC-LawLLM \
    --do_train \
    --dataset yiqi5-fun \
    --template baichuan2 \
    --finetuning_type lora \
    --lora_rank 8 \
    --lora_target W_pack \
    --output_dir /home/DISC-output-checkpoint \
    --overwrite_output_dir \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --preprocessing_num_workers 16 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 100 \
    --eval_steps 100 \
    --learning_rate 1e-5 \
    --max_grad_norm 0.5 \
    --num_train_epochs 10000.0 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --plot_loss \
    --fp16 \
    --val_size 0.01
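A hedged sanity check (assumption: the identical outputs come from greedy decoding, which always returns the same text for the same prompt). Loading the merged checkpoint and sampling makes it easier to tell whether the LoRA weights changed the behavior at all; the path below is hypothetical:

from transformers import AutoModelForCausalLM, AutoTokenizer

path = "merged_model_dir"  # hypothetical: the exported/merged checkpoint
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(path, trust_remote_code=True)

inputs = tokenizer("你是谁?", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=128, do_sample=True,
                     temperature=0.7, top_p=0.9)
print(tokenizer.decode(out[0], skip_special_tokens=True))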

@Charlie-XIAO
