
harderthenharder / transformers_tasks

2.1K stars · 16 watching · 367 forks · 72.82 MB

⭐️ NLP algorithms with the transformers lib, supporting Text-Classification, Text-Generation, Information-Extraction, Text-Matching, RLHF, SFT, etc.

Home Page: https://www.zhihu.com/column/c_1451236880973426688

Python 24.41% Shell 0.18% Makefile 0.02% Jupyter Notebook 74.69% MDX 0.70%
nlp text-classification text-matching information-extraction reinforcement-learning transformers text-generation

transformers_tasks's Introduction



This project brings together implementations of a variety of NLP tasks built on the transformers library.

huggingface transformers is an excellent open-source framework that makes loading and training transformer models very convenient. You can find the library's installation instructions and introductory usage here, and it also makes it easy to fine-tune a model of your own.

This project integrates a number of mainstream NLP tasks: find the task you need, replace the training dataset in the code with your own, and train a model tailored to your task.
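
As a quick illustration of how little code this takes (a generic transformers usage sketch, not this repo's code; the checkpoint name is just an example):

```python
from transformers import AutoTokenizer, AutoModel

# Any Hugging Face model id works the same way; bert-base-chinese is an example.
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModel.from_pretrained("bert-base-chinese")

inputs = tokenizer("今天天气真好!", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # [1, seq_len, 768]
```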


The NLP tasks implemented so far are listed below (more on the way):

1. Text Matching

Computes the similarity between texts; commonly used for search recall, text retrieval, textual entailment, and similar tasks. A minimal similarity sketch follows the table.

Model | Link
------|------
[Supervised] Overview | [here]
[Supervised] PointWise (single-tower) | [here]
[Supervised] DSSM (dual-tower) | [here]
[Supervised] Sentence-BERT (dual-tower) | [here]
[Unsupervised] SimCSE | [here]
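
To give a flavor of what these models compute, here is a minimal stand-in sketch (mean-pooled embeddings from an untrained-for-matching bert-base-chinese plus cosine similarity), not the repo's PointWise/DSSM/Sentence-BERT/SimCSE code:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModel.from_pretrained("bert-base-chinese")

def embed(texts):
    # Mean-pool the last hidden states, ignoring padding positions.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state          # [B, L, H]
    mask = batch["attention_mask"].unsqueeze(-1)           # [B, L, 1]
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)    # [B, H]

a, b = embed(["如何办理信用卡", "怎么申请信用卡"])
print(torch.cosine_similarity(a, b, dim=-1))  # similarity score in [-1, 1]
```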

2. Information Extraction

Extracts target information from a given passage; commonly used for named entity recognition (NER), relation extraction (RE), and similar tasks. A small stand-in example follows the table.

Model | Link
------|------
Universal Information Extraction (UIE) | [here]
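
As a stand-in for the task (UIE itself is a prompt-driven span-extraction model; see the linked code), a plain token-classification pipeline already shows the input/output shape of information extraction. The checkpoint below is an example public English NER model:

```python
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",        # example checkpoint, not the repo's UIE
    aggregation_strategy="simple",      # merge word pieces into whole entities
)
print(ner("Hugging Face is based in New York City."))
# e.g. [{'entity_group': 'ORG', 'word': 'Hugging Face', ...},
#       {'entity_group': 'LOC', 'word': 'New York City', ...}]
```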

3. Prompt Tasks

Designs prompt templates so that a pretrained model reaches better results with less data; commonly used for few-shot and zero-shot tasks. A cloze-style sketch follows the table.

Model | Link
------|------
PET (manually defined prompt patterns) | [here]
p-tuning (machine-learned prompt patterns) | [here]
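
The core idea behind PET, shown as a cloze sketch (the pattern and label words here are illustrative, not the repo's verbalizer): recast classification as filling a [MASK] slot and compare the scores of the label words, so no task-specific head is needed:

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-chinese")

# "好评" (positive review) vs. "差评" (negative review): the [MASK]
# character decides the class.
pattern = "这是一条[MASK]评:味道很不错。"
for pred in fill(pattern, targets=["好", "差"]):
    print(pred["token_str"], round(pred["score"], 4))
```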

4. Text Classification

Classifies a given text; commonly used for sentiment recognition, article categorization, and similar tasks. A minimal sketch follows the table.

Model | Link
------|------
BERT-CLS (BERT-based classifier) | [here]
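
A minimal BERT-CLS sketch (the checkpoint and num_labels are placeholders; the classification head is randomly initialized until fine-tuned):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=2)

inputs = tokenizer("这部电影太好看了!", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits        # [1, num_labels]
print(logits.softmax(dim=-1))              # class probabilities (head untrained)
```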

5. Reinforcement Learning & Language Model

RLHF (Reinforcement Learning from Human Feedback) uses human feedback to update a language generation model (LM) with reinforcement learning (RL), producing better generations (flagship example: ChatGPT). It usually consists of two stages: reward model training and reinforcement learning training. A sketch of the reward-model ranking loss follows the table.

Model | Link
------|------
RLHF (reward model training, PPO updates to GPT2) | [here]
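
The reward-model stage typically optimizes a pairwise (or listwise) ranking objective; here is a minimal pairwise sketch, assuming the reward model outputs one scalar per response (see the repo's code for the exact rank-list variant it uses):

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor):
    # loss = -log sigmoid(r_chosen - r_rejected): pushes the reward of the
    # human-preferred response above that of the rejected one.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

loss = reward_ranking_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.4, 0.9]))
print(loss)  # smaller when chosen responses consistently outscore rejected ones
```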

6. Text Generation

Natural language generation (NLG); commonly used for story continuation, question answering, chatbots, and similar tasks. A generic generation sketch follows the table.

Model | Link
------|------
Chinese QA model (T5-based) | [here]
Filling model (T5-based) | [here]
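
A generic T5 generation sketch (the checkpoint is a public example, not the repo's fine-tuned QA or filling model):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

inputs = tokenizer("translate English to German: The weather is nice today.",
                   return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```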

7. LLM Application

Builds the prompt pattern(s) a large language model (LLM) needs to solve a variety of tasks zero-shot. A hypothetical pattern sketch follows the table.

Model | Link
------|------
Text classification (chatglm-6b-based) | [here]
Text matching (chatglm-6b-based) | [here]
Information extraction (chatglm-6b-based) | [here]
LLM personality test (LLMs MBTI) | [here]
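
A hypothetical zero-shot classification pattern, just to show the shape such prompts take (the wording and labels are illustrative, not the repo's patterns; chatglm-6b itself exposes model.chat(tokenizer, query)):

```python
# Illustrative prompt pattern for zero-shot text classification with an LLM.
PATTERN = (
    "你是一个文本分类助手。请将下面的句子归入以下类别之一:{labels}。"
    "只输出类别名。\n句子:{text}\n类别:"
)

query = PATTERN.format(labels="体育 / 财经 / 科技",
                       text="央行今日宣布降准0.5个百分点。")
# response, history = model.chat(tokenizer, query)   # chatglm-6b style call
print(query)
```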

8. LLM Training

Everything related to large-model training: pretraining, instruction fine-tuning, reward models, and reinforcement learning. A LoRA configuration sketch follows the table.

Model | Link
------|------
ChatGLM-6B Finetune | [here]
Training a large model from scratch | [here]
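
A minimal LoRA configuration sketch with the peft library (all values are illustrative; for ChatGLM-6B the fused attention projection is named "query_key_value", which is why peft asks you to specify target_modules explicitly, as one of the issues below also reports):

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=32,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["query_key_value"],   # ChatGLM-6B's fused QKV projection
    task_type="CAUSAL_LM",
)
# model = get_peft_model(base_model, lora_config)  # base_model loaded elsewhere
# model.print_trainable_parameters()               # only LoRA weights train
```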

9. Tools

A collection of commonly used tools. A tokenizer-inspection sketch follows the table.

Tool | Link
-----|------
Tokenizer Viewer | [here]
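
The core calls behind any tokenizer viewer, as a tiny sketch (the repo's tool is a full web UI; the checkpoint is an example):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")

text = "Transformers 分词示例"
tokens = tokenizer.tokenize(text)
ids = tokenizer.convert_tokens_to_ids(tokens)
for i, tok in zip(ids, tokens):
    print(f"{i:>6}  {tok}")   # id -> surface token
```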

transformers_tasks's People

Contributors

eltociear · harderthenharder


transformers_tasks's Issues

Inference code

Could you provide inference code for the GPT model in RLHF? The code does not seem to include model saving and reloading.

A little bug in RLHF/train_reward_model.py

Thank you for your excellent work.
When I run train_reward_model.py, the evaluation result is always the same. The cause I found:

"batch_rank_rewards = []" should be at line 70, like below:

import torch

def evaluate_model(model, data_loader):
    """
    Evaluate how well the current model performs on the test set.

    Args:
        model: the current model
        data_loader: dataloader for the test set
    """
    model.eval()
    with torch.no_grad():
        batch_rank_rewards = []   # fix: initialize here so each evaluation starts fresh
        for batch in data_loader:
            for batch_idx in range(len(batch['input_ids'])):
                rank_texts_count = len(batch['input_ids'][batch_idx])
                rank_rewards = []
...

UIE information extraction

My results are not as high as you reported: the F1 score is only around 70%.

A question about the tokenizer

In my self-built jsonl, symbols like () cannot be recognized.
My understanding is that this repo follows the BERT token format; could you explain the specific logic?
Thanks.

Training error: need your help

Hi author, while working with your LLM fine-tuning code I ran into the following errors and could not get training to run:
ValueError: Please specify target_modules in peft_config
When running train, the following methods were also reported missing:
'ChatGLMForConditionalGeneration' object has no attribute 'enable_input_require_grads'
'ChatGLMForConditionalGeneration' object has no attribute 'gradient_checkpointing_enable'

Generation quality keeps degrading during PPO training

I trained a reward model with this project's RM code on my own annotated dialogue-ranking data, then ran PPO training. Early in training the base model generated sentences normally, but after a while it started producing gibberish. Which step might have gone wrong?

Help with multi-GPU training?

RuntimeError: :0: cudaFuncSetAttribute(kernel_entry, cudaFuncAttributeMaxDynamicSharedMemorySize, integer_cast<int32_t>(launch_configs[0].smemSizeInBytes)): out of memory
/data1/software/anaconda3/envs/llm_env/lib/python3.8/site-packages/torch/autograd/__init__.py:200: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [8192, 2, 1], strides() = [2, 1, 2]
bucket_view.sizes() = [8192, 2, 1], strides() = [2, 1, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:323.)
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[16:26:30] WARNING Sending process 17571 closing signal SIGTERM api.py:698
[16:26:31] ERROR failed (exitcode: 1) local_rank: 1 (pid: 17572) of binary: /data1/software/anaconda3/envs/llm_env/bin/python

Single-GPU runs fine; why does multi-GPU report out-of-memory?

LLM fine-tuning

Does this project fine-tune all layers of the network, or only some of them?

UIE entity/event extraction raises _pickle.UnpicklingError: invalid load key, '<'.

File "train.py", line 328, in
train()
File "train.py", line 249, in train
model = torch.load(os.path.join(args.pretrained_model, 'pytorch_model.bin')) # 加载预训练好的UIE模型,模型结构见:model.UIE()
File "D:\Dev_Utils\Anaconda3\lib\site-packages\torch\serialization.py", line 713, in load
return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File "D:\Dev_Utils\Anaconda3\lib\site-packages\torch\serialization.py", line 920, in _legacy_load
magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, '<'.

A question about RLHF algorithm details

PPO normally collects an episode of data and computes the discounted return / advantage / GAE over the whole episode to update the critic.
In a sentiment-analysis or dialogue task, what counts as an episode?

run problem

Hello, while running your code I ran into the problem shown in the screenshot below. The error message points to a problem inside PyTorch; how should I go about resolving it?
[screenshot]

Fine-tuning a large model for reading comprehension

Hey, is there an example of fine-tuning a large model for reading comprehension? For instance, given a long medical text, how do I determine whether it mentions smoking, whether there is a history of present illness, whether the physical examination is normal, and so on?

Playground inference is too slow

Does the playground reload the model on every generate call? What can I change to speed it up?

Questions about running RLHF/train_reward_model.sh

1. During training, the loss steadily drops from around -0.6 to below -0.9, but why does the accuracy, although it jumps around, always land on an integer multiple of 0.1?
2. Skimming train_reward_model.py: at the end of train(), evaluation is done via acc = evaluate_model(model, eval_dataloader). But inside evaluate_model(model, data_loader) there is a model.train() line; what is it for?

Might it be better to change the loss for PET if one label corresponds to multiple label words?

In the verbalizer, if one label (e.g. Fruit) corresponds to multiple label words (e.g. apple, pear, watermelon), and the prompted sentence is:

This is a [MASK] comment: it tasted good.

the loss function in your current code will boost the probability of predicting all three words {apple, watermelon, pear} at the [MASK] position. For a PLM, different contexts result in different probabilities for the three words; for example, "red" is more likely to appear near "apple" than near "pear". If you boost the probability of generating "pear" near "red", it might corrupt the knowledge in the PLM to some extent, or, conversely, make prompt-tuning harder.

Maybe you can consult this paper when designing the loss: https://aclanthology.org/2022.acl-long.158.pdf (a sketch of one alternative follows).
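
A minimal sketch of one such alternative (my own illustration, not from the paper or the repo): marginalize over each label's word set, so any one of its label words can carry the probability mass instead of all of them being boosted. It assumes mask_logits are the logits at the [MASK] position:

```python
import torch
import torch.nn.functional as F

def label_marginal_loss(mask_logits, label_word_ids, gold_labels):
    """mask_logits: [batch, vocab] logits at the [MASK] position.
    label_word_ids: one list of vocab ids per label,
        e.g. Fruit -> ids of {apple, pear, watermelon}.
    gold_labels: [batch] gold label indices."""
    log_probs = F.log_softmax(mask_logits, dim=-1)          # [batch, vocab]
    per_label = []
    for word_ids in label_word_ids:
        ids = torch.tensor(word_ids, device=mask_logits.device)
        # log p(label) = logsumexp over that label's word log-probs
        per_label.append(torch.logsumexp(log_probs[:, ids], dim=-1))
    label_log_probs = torch.stack(per_label, dim=-1)        # [batch, n_labels]
    return F.nll_loss(label_log_probs, gold_labels)
```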

A question about training

With multi-GPU training, the model sometimes loads and trains successfully, but running it a second time fails with an error (nothing was changed; screenshot below).
Could you explain what causes this?

[screenshot]

Error when fine-tuning ChatGLM

Traceback (most recent call last):
  File "/home/mo/chatglm/transformers_tasks/LLM/finetune/train.py", line 352, in <module>
    main()
  File "/home/mo/chatglm/transformers_tasks/LLM/finetune/train.py", line 230, in main
    model.gradient_checkpointing_enable()
  File "/home/mo/miniconda3/envs/llm_env/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1584, in gradient_checkpointing_enable
    raise ValueError(f"{self.__class__.__name__} does not support gradient check...
ValueError: ChatGLMForConditionalGeneration does not support gradient checkpointing.

Given the token-length limit, how can more labeled examples fit into the prompt to improve classification accuracy?

transformers_tasks/LLM/llm_classification.py

The author provides one example per class, but in a real business scenario one example per class is definitely not enough. In theory, the more examples in the input, the better the accuracy.

However, an LLM's token length is limited (and the longer the input, the slower the inference); ChatGLM-6B's recommended token length is 2048. How can the examples already available in a business scenario be used more efficiently?

Need normalization for reward model?

Hi, is any normalization needed when training the reward model and when training PPO with it? Without it, the reward model's loss seems to decrease without bound, and the value model's loss is larger than the policy model's.

On UIE training results

Why does the model perform much worse after fine-tuning on the project's training data than it did before fine-tuning?
