
ark-nlp's Introduction

ark-nlp

ark-nlp mainly collects and reproduces NLP models commonly used in academic and industrial work.

Environment

  • python 3
  • torch >= 1.0.0, <1.10.0
  • tqdm >= 4.56.0
  • jieba >= 0.42.1
  • transformers >= 3.0.0
  • zhon >= 1.1.5
  • scipy >= 1.2.0
  • scikit-learn >= 0.17.0

pip installation

pip install --upgrade ark-nlp

Project Structure

ark_nlp            the open-source natural language processing library itself
ark_nlp.dataset    data loading, processing, and conversion
ark_nlp.nn         complete neural network models
ark_nlp.processor  tokenizers, vocabularies, graph builders, etc.
ark_nlp.factory    loss functions, optimizers, training, and prediction utilities
ark_nlp.model      commonly used models packaged by NLP task for easy use
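
The task-level wrappers under ark_nlp.model and the helpers under ark_nlp.factory are imported directly from their sub-packages, as the usage examples below do, for instance:

    from ark_nlp.model.tc.bert import Bert, Tokenizer, Dataset, Task
    from ark_nlp.factory.loss_function import CasrelLoss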

Implemented Models

Pretrained Models

Model       Reference
BERT        BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
ERNIE1.0    ERNIE: Enhanced Representation through Knowledge Integration
NEZHA       NEZHA: Neural Contextualized Representation for Chinese Language Understanding
RoFormer    RoFormer: Enhanced Transformer with Rotary Position Embedding
ERNIE-CTM   ERNIE-CTM (ERNIE for Chinese Text Mining)

Text Classification

Model              Description
RNN/CNN/GRU/LSTM   Classic text classification architectures such as RNN, CNN, GRU, and LSTM
BERT/ERNIE         Text classification with commonly used pretrained models

Text Matching

Model                Description
BERT/ERNIE           Text matching with commonly used pretrained models
UnsupervisedSimcse   Unsupervised SimCSE matching algorithm
CoSENT               CoSENT: a more effective sentence-embedding scheme than Sentence-BERT

Named Entity Recognition

Model                           Reference (official code)
CRF BERT
Biaffine BERT
Span BERT
Global Pointer BERT             GlobalPointer: handling nested and non-nested NER in a unified way
Efficient Global Pointer BERT   Efficient GlobalPointer: fewer parameters, better results
W2NER BERT                      Unified Named Entity Recognition as Word-Word Relation Classification (github)

Relation Extraction

Model    Reference (official code)
Casrel   A Novel Cascade Binary Tagging Framework for Relational Triple Extraction (github)
PRGC     PRGC: Potential Relation and Global Correspondence Based Joint Relational Triple Extraction (github)

Information Extraction

Model       Reference (official code)
PromptUie   UIE: Universal Information Extraction (github)

Few-Shot Learning

Model        Reference
PromptBert   Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing

Applications

Usage Examples

Complete code examples can be found in the test folder.

  • Text Classification

    import torch
    import pandas as pd
    
    from ark_nlp.model.tc.bert import Bert
    from ark_nlp.model.tc.bert import BertConfig
    from ark_nlp.model.tc.bert import Dataset
    from ark_nlp.model.tc.bert import Task
    from ark_nlp.model.tc.bert import get_default_model_optimizer
    from ark_nlp.model.tc.bert import Tokenizer
    
    # Load the datasets
    # train_data_df's columns must include "text" and "label"
    # the text column holds the text; the label column holds the classification label
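    # An illustrative layout (hypothetical toy data):
    # train_data_df = pd.DataFrame({
    #     'text':  ['这家餐厅的菜非常好吃', '物流太慢了,体验很差'],
    #     'label': ['positive', 'negative']
    # })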
    tc_train_dataset = Dataset(train_data_df)
    tc_dev_dataset = Dataset(dev_data_df)
    
    # Load the tokenizer
    tokenizer = Tokenizer(vocab='nghuyong/ernie-1.0', max_seq_len=30)
    
    # Tokenize the text and convert it to IDs
    tc_train_dataset.convert_to_ids(tokenizer)
    tc_dev_dataset.convert_to_ids(tokenizer)
    
    # Load the pretrained model
    config = BertConfig.from_pretrained('nghuyong/ernie-1.0',
                                       num_labels=len(tc_train_dataset.cat2id))
    dl_module = Bert.from_pretrained('nghuyong/ernie-1.0', 
                                     config=config)
    
    # Build the task
    num_epoches = 10
    batch_size = 32
    optimizer = get_default_model_optimizer(dl_module)
    model = Task(dl_module, optimizer, 'ce', cuda_device=0)
    
    # Train
    model.fit(tc_train_dataset, 
              tc_dev_dataset,
              lr=2e-5,
              epochs=num_epoches,
              batch_size=batch_size
             )
    
    # Inference
    from ark_nlp.model.tc.bert import Predictor
    
    tc_predictor_instance = Predictor(model.module, tokenizer, tc_train_dataset.cat2id)
    
    tc_predictor_instance.predict_one_sample(text_to_predict)
  • Text Matching

    import torch
    import pandas as pd
    
    from ark_nlp.model.tm.bert import Bert
    from ark_nlp.model.tm.bert import BertConfig
    from ark_nlp.model.tm.bert import Dataset
    from ark_nlp.model.tm.bert import Task
    from ark_nlp.model.tm.bert import get_default_model_optimizer
    from ark_nlp.model.tm.bert import Tokenizer
    
    # Load the datasets
    # train_data_df's columns must include "text_a", "text_b" and "label"
    # text_a and text_b hold the two texts; the label column holds the matching label
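    # An illustrative layout (hypothetical toy data):
    # train_data_df = pd.DataFrame({
    #     'text_a': ['怎么修改登录密码'],
    #     'text_b': ['如何更改账号密码'],
    #     'label':  ['1']
    # })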
    tm_train_dataset = Dataset(train_data_df)
    tm_dev_dataset = Dataset(dev_data_df)
    
    # Load the tokenizer
    tokenizer = Tokenizer(vocab='nghuyong/ernie-1.0', max_seq_len=30)
    
    # Tokenize the text and convert it to IDs
    tm_train_dataset.convert_to_ids(tokenizer)
    tm_dev_dataset.convert_to_ids(tokenizer)
    
    # Load the pretrained model
    config = BertConfig.from_pretrained('nghuyong/ernie-1.0', 
                                       num_labels=len(tm_train_dataset.cat2id))
    dl_module = Bert.from_pretrained('nghuyong/ernie-1.0', 
                                     config=config)
    
    # Build the task
    num_epoches = 10
    batch_size = 32
    optimizer = get_default_model_optimizer(dl_module)
    model = Task(dl_module, optimizer, 'ce', cuda_device=0)
    
    # Train
    model.fit(tm_train_dataset, 
              tm_dev_dataset,
              lr=2e-5,
              epochs=num_epoches,
              batch_size=batch_size
             )
    
    # Inference
    from ark_nlp.model.tm.bert import Predictor
    
    tm_predictor_instance = Predictor(model.module, tokenizer, tm_train_dataset.cat2id)
    
    tm_predictor_instance.predict_one_sample([text_a_to_predict, text_b_to_predict])
  • Named Entity Recognition

    import torch
    import pandas as pd
    
    from ark_nlp.model.ner.crf_bert import CRFBert
    from ark_nlp.model.ner.crf_bert import CRFBertConfig
    from ark_nlp.model.ner.crf_bert import Dataset
    from ark_nlp.model.ner.crf_bert import Task
    from ark_nlp.model.ner.crf_bert import get_default_model_optimizer
    from ark_nlp.model.ner.crf_bert import Tokenizer
    
    # Load the datasets
    # train_data_df's columns must include "text" and "label"
    # the text column holds the text
    # the label column is a list; each element of the list is a dict organized as follows
    # {'start_idx': index of the entity's first character in the text, 'end_idx': index of the entity's last character in the text, 'type': entity type label, 'entity': entity string}
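    # An illustrative row (hypothetical toy data) for the text "李明在北京工作":
    # label = [{'start_idx': 0, 'end_idx': 1, 'type': 'PER', 'entity': '李明'},
    #          {'start_idx': 3, 'end_idx': 4, 'type': 'LOC', 'entity': '北京'}]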
    ner_train_dataset = Dataset(train_data_df)
    ner_dev_dataset = Dataset(dev_data_df)
    
    # Load the tokenizer
    tokenizer = Tokenizer(vocab='nghuyong/ernie-1.0', max_seq_len=30)
    
    # Tokenize the text and convert it to IDs
    ner_train_dataset.convert_to_ids(tokenizer)
    ner_dev_dataset.convert_to_ids(tokenizer)
    
    # Load the pretrained model
    config = CRFBertConfig.from_pretrained('nghuyong/ernie-1.0', 
                                      num_labels=len(ner_train_dataset.cat2id))
    dl_module = CRFBert.from_pretrained('nghuyong/ernie-1.0', 
                                        config=config)
    
    # Build the task
    num_epoches = 10
    batch_size = 32
    optimizer = get_default_model_optimizer(dl_module)
    model = Task(dl_module, optimizer, 'ce', cuda_device=0)
    
    # Train
    model.fit(ner_train_dataset, 
              ner_dev_dataset,
              lr=2e-5,
              epochs=num_epoches,
              batch_size=batch_size
             )
    
    # Inference
    from ark_nlp.model.ner.crf_bert import Predictor
    
    ner_predictor_instance = Predictor(model.module, tokenizer, ner_train_dataset.cat2id)
    
    ner_predictor_instance.predict_one_sample(text_to_extract)
  • CasRel Relation Extraction

    import torch
    import pandas as pd
    
    from ark_nlp.model.re.casrel_bert import CasRelBert
    from ark_nlp.model.re.casrel_bert import CasRelBertConfig
    from ark_nlp.model.re.casrel_bert import Dataset
    from ark_nlp.model.re.casrel_bert import Task
    from ark_nlp.model.re.casrel_bert import get_default_model_optimizer
    from ark_nlp.model.re.casrel_bert import Tokenizer
    from ark_nlp.factory.loss_function import CasrelLoss
    
    # Load the datasets
    # train_data_df's columns must include "text" and "label"
    # the text column holds the text
    # the label column is a list; each element of the list is itself a list organized as follows
    # [subject entity, index of the subject's first character in the text, index of the subject's last character, relation type, object entity, index of the object's first character, index of the object's last character]
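    # An illustrative row (hypothetical toy data) for the text "周杰伦出生于台湾":
    # label = [['周杰伦', 0, 2, '出生地', '台湾', 6, 7]]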
    re_train_dataset = Dataset(train_data_df)
    re_dev_dataset = Dataset(dev_data_df,
                             categories = re_train_dataset.categories,
                             is_train=False)
    
    # Load the tokenizer
    tokenizer = Tokenizer(vocab='nghuyong/ernie-1.0', max_seq_len=100)
    
    # Tokenize the text and convert it to IDs
    # Note: for CasRel this step does not actually tokenize or convert to IDs; it only attaches the tokenizer to the dataset object
    re_train_dataset.convert_to_ids(tokenizer)
    re_dev_dataset.convert_to_ids(tokenizer)
    
    # Load the pretrained model
    config = CasRelBertConfig.from_pretrained('nghuyong/ernie-1.0',
                                              num_labels=len(re_train_dataset.cat2id))
    dl_module = CasRelBert.from_pretrained('nghuyong/ernie-1.0', 
                                           config=config)
    
    # Build the task
    num_epoches = 40
    batch_size = 16
    optimizer = get_default_model_optimizer(dl_module)
    model = Task(dl_module, optimizer, CasrelLoss(), cuda_device=0)
    
    # Train
    model.fit(re_train_dataset, 
              re_dev_dataset,
              lr=2e-5,
              epochs=num_epoches,
              batch_size=batch_size
             )
    
    # Inference
    from ark_nlp.model.re.casrel_bert import Predictor
    
    casrel_re_predictor_instance = Predictor(model.module, tokenizer, re_train_dataset.cat2id)
    
    casrel_re_predictor_instance.predict_one_sample(text_to_extract)
  • PRGC Relation Extraction

    import torch
    import pandas as pd
    
    from ark_nlp.model.re.prgc_bert import PRGCBert
    from ark_nlp.model.re.prgc_bert import PRGCBertConfig
    from ark_nlp.model.re.prgc_bert import Dataset
    from ark_nlp.model.re.prgc_bert import Task
    from ark_nlp.model.re.prgc_bert import get_default_model_optimizer
    from ark_nlp.model.re.prgc_bert import Tokenizer
    
    # Load the datasets
    # train_data_df's columns must include "text" and "label"
    # the text column holds the text
    # the label column is a list; each element of the list is itself a list organized as follows
    # [subject entity, index of the subject's first character in the text, index of the subject's last character, relation type, object entity, index of the object's first character, index of the object's last character]
    re_train_dataset = Dataset(train_data_df, is_retain_dataset=True)
    re_dev_dataset = Dataset(dev_data_df,
                             categories = re_train_dataset.categories,
                             is_train=False)
    
    # Load the tokenizer
    tokenizer = Tokenizer(vocab='nghuyong/ernie-1.0', max_seq_len=100)
    
    # Tokenize the text and convert it to IDs
    re_train_dataset.convert_to_ids(tokenizer)
    re_dev_dataset.convert_to_ids(tokenizer)
    
    # Load the pretrained model
    config = PRGCBertConfig.from_pretrained('nghuyong/ernie-1.0',
                                              num_labels=len(re_train_dataset.cat2id))
    dl_module = PRGCBert.from_pretrained('nghuyong/ernie-1.0', 
                                           config=config)
    
    # Build the task
    num_epoches = 40
    batch_size = 16
    optimizer = get_default_model_optimizer(dl_module)
    model = Task(dl_module, optimizer, None, cuda_device=0)
    
    # Train
    model.fit(re_train_dataset, 
              re_dev_dataset,
              lr=2e-5,
              epochs=num_epoches,
              batch_size=batch_size
             )
    
    # Inference
    from ark_nlp.model.re.prgc_bert import Predictor
    
    prgc_re_predictor_instance = Predictor(model.module, tokenizer, re_train_dataset.cat2id)
    
    prgc_re_predictor_instance.predict_one_sample(text_to_extract)
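
  • Saving a trained model

    After training, the underlying PyTorch module is available as model.module (this is what the predictors above are given). A minimal sketch of saving and reloading its weights with standard PyTorch, where the checkpoint path is purely illustrative:

    import torch

    # save the fine-tuned weights (illustrative path)
    torch.save(model.module.state_dict(), 'checkpoint.pth')

    # later, load the weights back into a freshly built module of the same class
    dl_module.load_state_dict(torch.load('checkpoint.pth'))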

Discussion Group

  • WeChat official account: DataArk

wechat

  • WeChat ID: fk95624

Main contributors

  • xiangking
  • Jimme
  • Zrealshadow

Acknowledgements

This project collects and reproduces NLP models commonly used in academic and industrial work and packages them in an easy-to-call form. It therefore draws on many open-source implementations found online; if anything here is inappropriate, please get in touch with criticism and corrections. Many thanks to the authors of those open-source implementations.


ark-nlp's Issues

Input data format

What exactly does the input data format for relation extraction look like?

"Each element of the list is a dict organized as follows" ([subject entity, index of the subject's first character in the text, index of the subject's last character in the text, relation type, object entity, index of the object's first character in the text, index of the object's last character in the text])

What is the "dict" referred to above? I don't quite understand it.

Fix: SpanTokenizer uses '[blank]' to represent spaces, but does not raise an error when the pretrained model's vocabulary does not contain that token

Environment info

Python 3.8.10
ark-nlp 0.0.6

Information

/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [94,0,0], thread: [32,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [94,0,0], thread: [33,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [94,0,0], thread: [34,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [94,0,0], thread: [35,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [94,0,0], thread: [36,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [94,0,0], thread: [37,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [94,0,0], thread: [38,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [94,0,0], thread: [39,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [94,0,0], thread: [40,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [94,0,0], thread: [41,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [94,0,0], thread: [42,0,0] Assertion `srcIndex < srcSelectDimSize` failed.

How to add PGD adversarial training properly. Reference: https://github.com/xiangking/ark-nlp/issues/46

Reference: #46
Investigation showed that the computation graph of the loss had already been freed, so loss.backward(retain_graph=True) was adopted: replacing every loss.backward() with loss.backward(retain_graph=True) makes training run normally.
However, the F1 score is lower than with FGM. How can PGD adversarial training be added properly?

def _on_backward(
    self,
    inputs,
    outputs,
    logits,
    loss,
    gradient_accumulation_steps=1,
    **kwargs
):

    # If more than one GPU is used
    if self.n_gpu > 1:
        loss = loss.mean()
    # If gradient accumulation is used, divide by the number of accumulation steps
    if gradient_accumulation_steps > 1:
        loss = loss / gradient_accumulation_steps

    loss.backward()
    self.pgd.backup_grad()
    # Adversarial training
    for t in range(self.K):
        self.pgd.attack(is_first_attack=(t==0)) # add an adversarial perturbation to the embeddings; back up param.data on the first attack
        if t != self.K-1:
            self.module.zero_grad()
        else:
            self.pgd.restore_grad()
        logits = self.module(**inputs)
        logits, loss_adv = self._get_train_loss(inputs, outputs, **kwargs)
        # If more than one GPU is used
        if self.n_gpu > 1:
            loss_adv = loss_adv.mean()
        # If gradient accumulation is used, divide by the number of accumulation steps
        if gradient_accumulation_steps > 1:
            loss_adv = loss_adv / gradient_accumulation_steps
        loss_adv.backward()
    self.pgd.restore() # restore the embedding parameters

    self._on_backward_record(loss, **kwargs)

    return loss

Adding PGD adversarial training after loss.backward() raises the error:

RuntimeError: Trying to backward through the graph a second time, but the saved intermediate results have already been freed. Specify retain_graph=True when calling backward the first time.

Error when adding PGD adversarial training

def _on_backward(
        self,
        inputs,
        outputs,
        logits,
        loss,
        gradient_accumulation_steps=1,
        **kwargs
    ):

        # If more than one GPU is used
        if self.n_gpu > 1:
            loss = loss.mean()
        # If gradient accumulation is used, divide by the number of accumulation steps
        if gradient_accumulation_steps > 1:
            loss = loss / gradient_accumulation_steps

        loss.backward()
        self.pgd.backup_grad()
        # Adversarial training
        for t in range(self.K):
            self.pgd.attack(is_first_attack=(t==0)) # add an adversarial perturbation to the embeddings; back up param.data on the first attack
            if t != self.K-1:
                self.module.zero_grad()
            else:
                self.pgd.restore_grad()
            logits = self.module(**inputs)
            logits, loss_adv = self._get_train_loss(inputs, outputs, **kwargs)
            # If more than one GPU is used
            if self.n_gpu > 1:
                loss_adv = loss_adv.mean()
            # If gradient accumulation is used, divide by the number of accumulation steps
            if gradient_accumulation_steps > 1:
                loss_adv = loss_adv / gradient_accumulation_steps
            loss_adv.backward()
        self.pgd.restore() # restore the embedding parameters

        self._on_backward_record(loss, **kwargs)

        return loss

Adding PGD adversarial training after loss.backward() raises the error:

RuntimeError: Trying to backward through the graph a second time, but the saved intermediate results have already been freed. Specify retain_graph=True when calling backward the first time.

Could the author please explain what is going on here?

Fix: ValueError: The data format does not exist

Environment info

Python 3.8.10
ark-nlp 0.0.6

Information

Reading a local data file fails

from ark_nlp.dataset import SentenceClassificationDataset

train_dataset = SentenceClassificationDataset('../data/task_datasets/cMedTC/train_data.csv')

Error message

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-16-52c3ab3bcf08> in <module>
----> 1 train_dataset = SentenceClassificationDataset('../data/task_datasets/cMedTC/train_data.csv')
      2 # dev_dataset = SentenceClassificationDataset('../data/source_datasets/cMedTC/dev_data.csv')

~/anaconda3/lib/python3.7/site-packages/ark_nlp/dataset/base/_dataset.py in __init__(self, data, categories, is_retain_df, is_retain_dataset, is_train, is_test)

~/anaconda3/lib/python3.7/site-packages/ark_nlp/dataset/base/_dataset.py in _load_dataset(self, data_path)

~/anaconda3/lib/python3.7/site-packages/ark_nlp/dataset/base/_dataset.py in _read_data(self, data_path, data_format, skiprows)

ValueError: The data format does not exist

BERT model parameter dimensions do not match

After saving a model trained with ark_nlp's from ark_nlp.model.tc.bert import Bert, I tried to load it into transformers' BertModel via load_state_dict and found that the parameter dimensions do not match. The error is:
size mismatch for pooler.dense.bias: copying a param with shape torch.Size([4]) from checkpoint, the shape in current model is torch.Size([768]).

PS: I also noticed that, compared with transformers' BertModel, the model trained with ark_nlp's Bert is missing two parameters: "classifier.weight" and "classifier.bias".

Fix: calling CrfBert through model actually invokes bert + softmax

Environment info

Python 3.8.10
ark-nlp 0.0.7

Information

from ark_nlp.dataset import BIONERDataset as Dataset
from ark_nlp.dataset import BIONERDataset as CrfBertNERDataset

from ark_nlp.processor.tokenizer.transfomer import TokenTokenizer as Tokenizer
from ark_nlp.processor.tokenizer.transfomer import TokenTokenizer as CrfBertNERTokenizer

from ark_nlp.nn import BertConfig as CrfBertConfig
from ark_nlp.nn import BertConfig as ModuleConfig

from ark_nlp.model.ner.crf_bert.crf_bert import CrfBert
from ark_nlp.model.ner.crf_bert.crf_bert import CrfBert as Module

from ark_nlp.factory.optimizer import get_default_crf_bert_optimizer as get_default_model_optimizer
from ark_nlp.factory.optimizer import get_default_crf_bert_optimizer as get_default_crf_bert_optimizer

from ark_nlp.factory.task import BIONERTask as Task
from ark_nlp.factory.task import BIONERTask as CrfBertNERTask

from ark_nlp.factory.predictor import BIONERPredictor as Predictor
from ark_nlp.factory.predictor import BIONERPredictor as CrfBertNERPredictor

The convert_to_ids function consumes a large amount of CPU

Hello, following the example and using GlobalPointerBert for an NER task, I found that once the code reaches convert_to_ids while loading data, CPU usage on the server becomes extremely heavy: htop shows about 5000% usage on a 64-core machine. What could be the reason?

RoPE implementation details

# RoPE encoding
if self.RoPE:
    pos = SinusoidalPositionEmbedding(self.head_size, 'zero')(inputs)
    # cos_pos = pos[..., 1::2].repeat(1, 1, 2)
    # sin_pos = pos[..., ::2].repeat(1, 1, 2)
    cos_pos = pos[..., 1::2].repeat_interleave(2, dim=-1)  # after the fix
    sin_pos = pos[..., ::2].repeat_interleave(2, dim=-1)  # after the fix

Hi, isn't there a small problem with your RoPE implementation? According to Su Jianlin's blog post, it should be the modified code above.

Discontinuous entities

Hello, is there a good algorithm available here that supports discontinuous entities, and what should the input data format look like?

Samples without entities

How can queries that contain no entities be included in training?

span_mask

Hi, during decoding, where is the span_mask produced in _convert_to_transfomer_ids actually used?

Tokenizer error

Yesterday I ran into a problem: the error shown in the screenshot below appeared while loading the pretrained embedding files.
[error screenshot]

It looks like a network issue, but I can access https://huggingface.co/models normally from my machine. I then downloaded nghuyong/ernie-1.0-base-zh from that site and pointed the code at its absolute local path; the program then ran normally, but the accuracy of my previous models is now completely off. Has the embedding file been changed, or is there some other reason?

SpanTokenizer produces incorrect token_mapping indices

Environment info:
ark-nlp 0.0.9
python 3.9

Information:
When using a BERT model, SpanTokenizer produces incorrect token_mapping indices.
For example, for the following input (underscores denote spaces):
input:B o s e _ S o u n d S p o r t _ F r e e _ 真 无 线 蓝 牙 耳 机
tokens:['[UNK]', '[unused1]', '[UNK]', '[unused1]', '[UNK]', '[unused1]', '真', '无', '线', '蓝', '牙', '耳', '机']
token_mapping:[[0], [1], [2], [3], [4], [5], [21], [22], [23], [24], [25], [26], [27]]
The correct token_mapping should be:
[[0,1,2,3], [4], [5,6,7,8,9,10,11,12,13,14], [15], [16,17,18,19], [20], [21], [22], [23], [24], [25], [26], [27]]

No CUDA

CUDA is not installed; how can I train on the CPU? Currently I get the following error:
AssertionError: Torch not compiled with CUDA enabled

Fix: TokenTokenizer ignores spaces during tokenization

Environment info

Python 3.8.10
ark-nlp 0.0.7

Information

tokenizer.tokenize('森麥康 小米3 M4 M5 5C 5X 5S 5Splus mi 6 6X电源开机音量按键排线侧键 小米5C 开机音量排线')

>>> 
['森',
 '麥',
 '康',
 '小',
 '米',
 '3',
 'm',
 '4',
 'm',
 '5',
 '5',
 'c',
 '5',
 'x',
 '5',
 's',
 '5',
 's',
 'p',
 'l',
 'u',
 's',
 'm',
 'i',
 '6',
 '6',
 'x',
 '电',
 '源',
 '开',
 '机',
 '音',
 '量',
 '按',
 '键',
 '排',
 '线',
 '侧',
 '键',
 '小',
 '米',
 '5',
 'c',
 '开',
 '机',
 '音',
 '量',
 '排',
 '线']

New feature: add a Pipeline and refine the Introduction of the docs

Description

The example in the Introduction docs is too long and complex.
Many settings can be wrapped into a default configuration, such as epochs, batch size, optimizer, etc.
Users could customize their own configuration through string options instead of declaring a specific class.

The whole process can also be wrapped into a default class (called a pipeline in Hugging Face).
Here is an example:

>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='bert-base-uncased')
>>> unmasker("Hello I'm a [MASK] model.")

[{'sequence': "[CLS] hello i'm a fashion model. [SEP]",
  'score': 0.1073106899857521,
  'token': 4827,
  'token_str': 'fashion'},
 {'sequence': "[CLS] hello i'm a role model. [SEP]",
  'score': 0.08774490654468536,
  'token': 2535,
  'token_str': 'role'},
 {'sequence': "[CLS] hello i'm a new model. [SEP]",
  'score': 0.05338378623127937,
  'token': 2047,
  'token_str': 'new'},
 {'sequence': "[CLS] hello i'm a super model. [SEP]",
  'score': 0.04667217284440994,
  'token': 3565,
  'token_str': 'super'},
 {'sequence': "[CLS] hello i'm a fine model. [SEP]",
  'score': 0.027095865458250046,
  'token': 2986,
  'token_str': 'fine'}]

In this way we can refine the README docs and make them clear and easy to understand.
For users who want to further customize their models, we can provide more complex example scripts under the test directory.
