yongzhuo / pytorch-nlu

Pytorch-NLU is a Chinese text classification and sequence labeling toolkit. It supports multi-class and multi-label classification of long and short Chinese texts, and sequence labeling tasks such as Chinese named entity recognition, part-of-speech tagging, word segmentation, and extractive text summarization.

Home Page: https://blog.csdn.net/rensihui

License: Apache License 2.0

Python 100.00%

python3 pytorch text-classification sequence-labeling named-entity-recognition word-segmentation pos-tagging chinese-text-segmentation chinese-text-classification transformers

pytorch-nlu's Issues

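Request: a working link to the MiningZhiDaoQACorpus dataset (the original link is dead)

Could someone please share a link to the MiningZhiDaoQACorpus dataset? The original link no longer works.
Many thanks.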

Please reply: multi-label classification with bert-tiny trains fine and saves tc.config and tc.model, but inference fails while loading the model

self.pretrain_model = pretrained_model(self.pretrained_config) # at inference time only the hyperparameters need to be loaded, not the pretrained model weights

The pretrained model is bert-tiny, and it is loaded with AutoModel rather than BertModel.
The line above, in tcGraph.py, raises the following error:

OSError: AutoModel is designed to be instantiated using the AutoModel.from_pretrained(pretrained_model_name_or_path) or AutoModel.from_config(config) methods.
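
The error message itself names the two supported entry points for auto classes. Below is a minimal sketch of a loader that builds a model from its configuration alone, whichever kind of class it is given. `build_from_config` and the two stand-in classes are hypothetical names for illustration and are not taken from tcGraph.py; with the real library, `AutoModel.from_config(config)` would take the place of the stand-in factory call.

```python
class AutoLikeModel:
    """Stand-in that mimics AutoModel: direct instantiation is refused."""

    def __init__(self, *args, **kwargs):
        raise OSError("use from_pretrained(...) or from_config(config)")

    @classmethod
    def from_config(cls, config):
        # A real auto class dispatches on the config type; this stand-in
        # simply returns a concrete model built from the same config.
        return ConcreteModel(config)


class ConcreteModel:
    """Stand-in that mimics a concrete class such as BertModel."""

    def __init__(self, config):
        self.config = config


def build_from_config(model_cls, config):
    """Create a model from hyperparameters only, without pretrained weights."""
    from_config = getattr(model_cls, "from_config", None)
    if callable(from_config):
        # Auto-style classes must go through their factory method.
        return from_config(config)
    # Concrete classes accept the config in their constructor.
    return model_cls(config)
```

Routing through a helper like this keeps the inference path the same whether the configured model class is an auto class or a concrete one.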

Is joint training of classification and entity recognition supported?

self.do_lower_case and self.vocab are not defined, and the code raises an error when run

Pytorch-NLU/pytorch_nlu/pytorch_textclassification/tcData.py

Line 169 in 864fb9a

if self.do_lower_case:

Pytorch-NLU/pytorch_nlu/pytorch_textclassification/tcData.py

Line 171 in 864fb9a

if t in self.vocab:

Where are these two class variables defined? The code raises an error when I run it.

Is English supported?

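Excessive memory usage when loading a dataset

I am doing multi-label text classification. The dataset has over 900,000 samples and the txt file is under 200 MB, yet loading it consumes far too much memory. I cannot tell whether this is a bug or expected behavior: a machine with 32 GB of RAM cannot load even a quarter of the data.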
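
One common way to bound memory in this situation is to stream the file in batches instead of materializing every sample at once. The sketch below is not the repository's loader; it assumes, purely for illustration, that the file holds one JSON object per line, and `iter_batches` is a hypothetical helper name.

```python
import json
from typing import Dict, Iterator, List


def iter_batches(path: str, batch_size: int = 1024) -> Iterator[List[Dict]]:
    """Yield lists of parsed samples without loading the whole file."""
    batch: List[Dict] = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            batch.append(json.loads(line))
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch  # final, possibly smaller, batch
```

Only one batch of parsed samples is alive at a time, so peak memory depends on `batch_size` rather than on the size of the file.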

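How can I classify the news corpus provided by the Natural Language Processing Group of the International Database Center, Department of Computer Information and Technology, Fudan University?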

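Could you publish a step-by-step training and testing tutorial for complete beginners? I find the project a bit confusing.

For example, where can these pretrained models be downloaded?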

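Does the current code include FLAT's relative position matrix handling?

First of all, thank you for your contribution; the consolidated implementation of losses for Chinese NER was especially helpful to me.

Hello, I would like to ask whether this repository includes an implementation of FLAT's relative position matrix. I may have overlooked it.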

Tokenizer fails to load when the albert model is selected

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'BertTokenizer'.
The class this function is called from is 'AlbertTokenizer'.
Traceback (most recent call last):
  File "/home/efsz/localCode/Pytorch-NLU/test/tc/tet_tc_base_multi_label.py", line 73, in <module>
    lc.process()
  File "/home/efsz/localCode/Pytorch-NLU/pytorch_nlu/pytorch_textclassification/tcRun.py", line 32, in process
    self.corpus = Corpus(self.config, self.logger)
  File "/home/efsz/localCode/Pytorch-NLU/pytorch_nlu/pytorch_textclassification/tcData.py", line 25, in __init__
    self.tokenizer = self.load_tokenizer(self.config)
  File "/home/efsz/localCode/Pytorch-NLU/pytorch_nlu/pytorch_textclassification/tcData.py", line 206, in load_tokenizer
    tokenizer = PRETRAINED_MODEL_CLASSES[config.model_type][1].from_pretrained(config.pretrained_model_name_or_path)
  File "/home/efsz/anaconda3/envs/nlu/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1854, in from_pretrained
    return cls._from_pretrained(
  File "/home/efsz/anaconda3/envs/nlu/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2017, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/efsz/anaconda3/envs/nlu/lib/python3.10/site-packages/transformers/models/albert/tokenization_albert.py", line 183, in __init__
    self.sp_model.Load(vocab_file)
  File "/home/efsz/anaconda3/envs/nlu/lib/python3.10/site-packages/sentencepiece/__init__.py", line 905, in Load
    return self.LoadFromFile(model_file)
  File "/home/efsz/anaconda3/envs/nlu/lib/python3.10/site-packages/sentencepiece/__init__.py", line 310, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
TypeError: not a string

Other models seem to work fine, but loading the albert tokenizer raises this error.
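
The warning at the top of the log says the checkpoint declares `BertTokenizer` while `AlbertTokenizer` is being used. One plausible cause, which is an assumption and not confirmed in this thread, is that the checkpoint ships a WordPiece `vocab.txt` and no SentencePiece `spiece.model`, so the SentencePiece loader receives no file path. The sketch below picks a tokenizer class name from the vocabulary files actually present; `pick_tokenizer_name` is a hypothetical helper, not part of Pytorch-NLU or transformers.

```python
import os


def pick_tokenizer_name(checkpoint_dir: str) -> str:
    """Guess which tokenizer class matches the vocabulary files present."""
    files = set(os.listdir(checkpoint_dir))
    if "spiece.model" in files:
        return "AlbertTokenizer"  # SentencePiece vocabulary
    if "vocab.txt" in files:
        return "BertTokenizer"  # WordPiece vocabulary
    raise FileNotFoundError(f"no known vocabulary file in {checkpoint_dir}")
```

If the helper returns `"BertTokenizer"` for an albert checkpoint, loading that checkpoint's tokenizer with `BertTokenizer.from_pretrained(...)` while keeping the albert model class may be worth trying.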

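How are multi-label samples handled? On a multi-label dataset, the support values seem to sum to the number of single-label samples only

Taking the provided school dataset as an example: the test set has 132 samples, of which 12 have multiple labels and 120 have a single label, and the support values sum to 120. Stepping through with a debugger, I found that the one-hot vector for a multi-label sample is all zeros at input. What is the rationale for this step?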
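
An all-zero vector is what results when a combined label string is looked up as a single label name and not found. Whether that is what happens in this repository is not established here. For comparison, a minimal sketch of multi-hot encoding follows; the separator `"|"`, the helper name `multi_hot`, and the `label2id` mapping are assumptions for illustration only.

```python
from typing import Dict, List


def multi_hot(label_field: str, label2id: Dict[str, int], sep: str = "|") -> List[int]:
    """Encode a separator-joined label string as a 0/1 vector."""
    vec = [0] * len(label2id)
    for name in label_field.split(sep):
        name = name.strip()
        if name in label2id:
            vec[label2id[name]] = 1  # mark every label the sample carries
    return vec
```

With `label2id = {"a": 0, "b": 1, "c": 2}`, `multi_hot("a|c", label2id)` returns `[1, 0, 1]`, while passing a separator that does not match the data, such as `sep=","`, returns `[0, 0, 0]`, the same symptom described above.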
