
ckip-transformers's Introduction

CKIP Transformers

This project provides traditional Chinese transformers models (including ALBERT, BERT, GPT2) and NLP tools (including word segmentation, part-of-speech tagging, and named entity recognition).


Models

You may also use our pretrained models directly with the HuggingFace transformers library: https://huggingface.co/ckiplab/.

Model Usage

You may use our models directly through HuggingFace's transformers library.
pip install -U transformers
Please use BertTokenizerFast as the tokenizer, and replace ckiplab/albert-tiny-chinese and ckiplab/albert-tiny-chinese-ws in the following example with any model you need.
from transformers import (
   BertTokenizerFast,
   AutoModelForMaskedLM,
   AutoModelForCausalLM,
   AutoModelForTokenClassification,
)

# masked language model (ALBERT, BERT)
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModelForMaskedLM.from_pretrained('ckiplab/albert-tiny-chinese') # or other models above

# causal language model (GPT2)
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModelForCausalLM.from_pretrained('ckiplab/gpt2-base-chinese') # or other models above

# nlp task model
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModelForTokenClassification.from_pretrained('ckiplab/albert-tiny-chinese-ws') # or other models above
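As a quick sanity check that the masked LM above loads and runs, here is a minimal sketch of masked-token prediction; the sample sentence and top-5 decoding are illustrative and not part of the official examples.

import torch
from transformers import BertTokenizerFast, AutoModelForMaskedLM

tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModelForMaskedLM.from_pretrained('ckiplab/albert-tiny-chinese')

# Mask one character and let the model fill it in.
inputs = tokenizer("台北是台灣的[MASK]都。", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and print the five most likely tokens.
mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top_ids = logits[0, mask_pos].topk(5).indices[0].tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))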

Model Fine-Tuning

To fine-tune our models on your own datasets, please refer to the following examples from HuggingFace's transformers.
Remember to set --tokenizer_name bert-base-chinese in order to use the Chinese tokenizer.
# replace ckiplab/albert-tiny-chinese with any model above
python run_mlm.py \
   --model_name_or_path ckiplab/albert-tiny-chinese \
   --tokenizer_name bert-base-chinese \
   ...

# replace ckiplab/albert-tiny-chinese-ws with any model above
python run_ner.py \
   --model_name_or_path ckiplab/albert-tiny-chinese-ws \
   --tokenizer_name bert-base-chinese \
   ...
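For reference, a fuller run_mlm.py invocation might look like the following; the data files and output directory are placeholders, and the remaining flags are standard options of HuggingFace's example scripts.

python run_mlm.py \
   --model_name_or_path ckiplab/albert-tiny-chinese \
   --tokenizer_name bert-base-chinese \
   --train_file path/to/train.txt \
   --validation_file path/to/dev.txt \
   --do_train \
   --do_eval \
   --output_dir path/to/output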

Model Performance

The following is a performance comparison between our model and other models.
The results are tested on a traditional Chinese corpus.
以下是我們的模型與其他的模型之性能比較。
各個任務皆測試於繁體中文的測試集。
Model                         #Parameters   Perplexity†   WS (F1)‡   POS (ACC)‡   NER (F1)‡
ckiplab/albert-tiny-chinese   4M            4.80          96.66%     94.48%       71.17%
ckiplab/albert-base-chinese   11M           2.65          97.33%     95.30%       79.47%
ckiplab/bert-tiny-chinese     12M           8.07          96.98%     95.11%       74.21%
ckiplab/bert-base-chinese     102M          1.88          97.60%     95.67%       81.18%
ckiplab/gpt2-tiny-chinese     4M            16.94         --         --           --
ckiplab/gpt2-base-chinese     102M          8.36          --         --           --
voidful/albert_chinese_tiny   4M            74.93         --         --           --
voidful/albert_chinese_base   11M           22.34         --         --           --
bert-base-chinese             102M          2.53          --         --           --
† Perplexity; the smaller the better.
‡ WS: word segmentation; POS: part-of-speech tagging; NER: named entity recognition; the larger the better.

Training Corpus

The language models are trained on the ZhWiki and CNA datasets; the WS and POS models are trained on the ASBC dataset; the NER models are trained on the OntoNotes dataset.
Here is a summary of each corpus.
Dataset     #Documents   #Lines       #Characters     Line Type
CNA         2,559,520    13,532,445   1,219,029,974   Paragraph
ZhWiki      1,106,783    5,918,975    495,446,829     Paragraph
ASBC        19,247       1,395,949    17,572,374      Clause
OntoNotes   1,911        48,067       1,568,491       Sentence
Here is the dataset split used for the language models.
CNA+ZhWiki   #Documents   #Lines       #Characters
Train        3,606,303    18,986,238   4,347,517,682
Dev          30,000       148,077      32,888,978
Test         30,000       151,241      35,216,818
Here is the dataset split used for the word segmentation and part-of-speech tagging models.
ASBC    #Documents   #Lines      #Words      #Characters
Train   15,247       1,183,260   9,480,899   14,724,250
Dev     2,000        52,677      448,964     741,323
Test    2,000        160,012     1,315,129   2,106,799
Here is the dataset split used for the named entity recognition models.
OntoNotes   #Documents   #Lines   #Characters   #Named-Entities
Train       1,511        43,362   1,367,658     68,947
Dev         200          2,304    93,535        7,186
Test        200          2,401    107,298       6,977

NLP Tools

The package also provides the following NLP tools.
  • (WS) Word Segmentation
  • (POS) Part-of-Speech Tagging
  • (NER) Named Entity Recognition

Installation

pip install -U ckip-transformers


NLP Tools Usage

See the documentation for API details.

1. Import module

from ckip_transformers.nlp import CkipWordSegmenter, CkipPosTagger, CkipNerChunker

2. Load models

We provide several pretrained models for the NLP tools.
# Initialize drivers
ws_driver  = CkipWordSegmenter(model="bert-base")
pos_driver = CkipPosTagger(model="bert-base")
ner_driver = CkipNerChunker(model="bert-base")
One may also load one's own checkpoints with our drivers.
# Initialize drivers with custom checkpoints
ws_driver  = CkipWordSegmenter(model_name="path_to_your_model")
pos_driver = CkipPosTagger(model_name="path_to_your_model")
ner_driver = CkipNerChunker(model_name="path_to_your_model")
To use the GPU, one may specify the device ID while initializing the drivers. Set it to -1 (the default) to disable the GPU.
# Use CPU
ws_driver = CkipWordSegmenter(device=-1)

# Use GPU:0
ws_driver = CkipWordSegmenter(device=0)

3. Run pipeline

The input for word segmentation and named entity recognition must be a list of sentences.
The input for part-of-speech tagging must be a list of lists of words (the output of word segmentation).
# Input text
text = [
   "傅達仁今將執行安樂死,卻突然爆出自己20年前遭緯來體育台封殺,他不懂自己哪裡得罪到電視台。",
   "美國參議院針對今天總統布什所提名的勞工部長趙小蘭展開認可聽證會,預料她將會很順利通過參議院支持,成為該國有史以來第一位的華裔女性內閣成員。",
   "空白 也是可以的~",
]

# Run pipeline
ws  = ws_driver(text)
pos = pos_driver(ws)
ner = ner_driver(text)
While running the model, the POS driver automatically segments each sentence on the characters ',,。::;;!!??' (the output sentences are concatenated back together). You may set delim_set to any characters you want.
You may set use_delim=False to disable this feature, or set use_delim=True in the WS and NER drivers to enable it.
# Enable sentence segmentation
ws  = ws_driver(text, use_delim=True)
ner = ner_driver(text, use_delim=True)

# Disable sentence segmentation
pos = pos_driver(ws, use_delim=False)

# Use new line characters and tabs for sentence segmentation
pos = pos_driver(ws, delim_set='\n\t')
You may specify batch_size and max_length to better utilize your machine's resources.
# Sets the batch size and maximum sentence length
ws = ws_driver(text, batch_size=256, max_length=128)

4. Show results

# Pack word segmentation and part-of-speech results
def pack_ws_pos_sentece(sentence_ws, sentence_pos):
   assert len(sentence_ws) == len(sentence_pos)
   res = []
   for word_ws, word_pos in zip(sentence_ws, sentence_pos):
      res.append(f"{word_ws}({word_pos})")
   return "\u3000".join(res)

# Show results
for sentence, sentence_ws, sentence_pos, sentence_ner in zip(text, ws, pos, ner):
   print(sentence)
   print(pack_ws_pos_sentece(sentence_ws, sentence_pos))
   for entity in sentence_ner:
      print(entity)
   print()
傅達仁今將執行安樂死,卻突然爆出自己20年前遭緯來體育台封殺,他不懂自己哪裡得罪到電視台。
傅達仁(Nb) 今(Nd) 將(D) 執行(VC) 安樂死(Na) ,(COMMACATEGORY) 卻(D) 突然(D) 爆出(VJ) 自己(Nh) 20(Neu) 年(Nd) 前(Ng) 遭(P) 緯來(Nb) 體育台(Na) 封殺(VC) ,(COMMACATEGORY) 他(Nh) 不(D) 懂(VK) 自己(Nh) 哪裡(Ncd) 得罪到(VC) 電視台(Nc) 。(PERIODCATEGORY)
NerToken(word='傅達仁', ner='PERSON', idx=(0, 3))
NerToken(word='20年', ner='DATE', idx=(18, 21))
NerToken(word='緯來體育台', ner='ORG', idx=(23, 28))

美國參議院針對今天總統布什所提名的勞工部長趙小蘭展開認可聽證會,預料她將會很順利通過參議院支持,成為該國有史以來第一位的華裔女性內閣成員。
美國(Nc) 參議院(Nc) 針對(P) 今天(Nd) 總統(Na) 布什(Nb) 所(D) 提名(VC) 的(DE) 勞工部長(Na) 趙小蘭(Nb) 展開(VC) 認可(VC) 聽證會(Na) ,(COMMACATEGORY) 預料(VE) 她(Nh) 將(D) 會(D) 很(Dfa) 順利(VH) 通過(VC) 參議院(Nc) 支持(VC) ,(COMMACATEGORY) 成為(VG) 該(Nes) 國(Nc) 有史以來(D) 第一(Neu) 位(Nf) 的(DE) 華裔(Na) 女性(Na) 內閣(Na) 成員(Na) 。(PERIODCATEGORY)
NerToken(word='美國參議院', ner='ORG', idx=(0, 5))
NerToken(word='今天', ner='LOC', idx=(7, 9))
NerToken(word='布什', ner='PERSON', idx=(11, 13))
NerToken(word='勞工部長', ner='ORG', idx=(17, 21))
NerToken(word='趙小蘭', ner='PERSON', idx=(21, 24))
NerToken(word='認可聽證會', ner='EVENT', idx=(26, 31))
NerToken(word='參議院', ner='ORG', idx=(42, 45))
NerToken(word='第一', ner='ORDINAL', idx=(56, 58))
NerToken(word='華裔', ner='NORP', idx=(60, 62))

空白 也是可以的~
空白(VH)  (WHITESPACE) 也(D) 是(SHI) 可以(VH) 的(T) ~(FW)

NLP Tools Performance

The following is a performance comparison between our tool and other tools.
以下是我們的工具與其他的工具之性能比較。

CKIP Transformers vs. Monpa & Jieba

Tool               WS (F1)   POS (Acc)   WS+POS (F1)   NER (F1)
CKIP BERT Base     97.60%    95.67%      94.19%        81.18%
CKIP ALBERT Base   97.33%    95.30%      93.52%        79.47%
CKIP BERT Tiny     96.98%    95.08%      93.13%        74.20%
CKIP ALBERT Tiny   96.66%    94.48%      92.25%        71.17%
Monpa†             92.58%    --          83.88%        --
Jieba              81.18%    --          --            --

† Monpa provides only 3 types of tags in NER.

CKIP Transformers vs. CkipTagger

The following results are tested on a different dataset.†
Tool             WS (F1)   POS (Acc)   WS+POS (F1)   NER (F1)
CKIP BERT Base   97.84%    96.46%      94.91%        79.20%
CkipTagger       97.33%    97.20%      94.75%        77.87%

† Here we retrained and tested our BERT model on the same dataset as CkipTagger.

License

GPL-3.0

Copyright (c) 2023 CKIP Lab under the GPL-3.0 License.

ckip-transformers's People

Contributors: emfomy, pan93412, qqaatw


ckip-transformers's Issues

max_length in README is not correct?

Hi, thanks for your great work.
I found a tiny error in your example: when executing the code
ws = ws_driver(text, batch_size=256, max_length=512)
it shows the error message
"AssertionError: Sequence length is longer than the maximum sequence length for this model (512 > 510)."
Setting max_length lower than 510 fixes this.
Other than that, everything works fine. It's an excellent and convenient tool for extracting information from data.
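For instance, the same call from the README with a length under the limit:

ws = ws_driver(text, batch_size=256, max_length=508)  # stays below the model's 510-token limit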

Fine-tune Model

Hello,
I would like to ask about fine-tuning the following WS, POS, and NER models:
ckiplab/bert-base-chinese-ws
ckiplab/bert-base-chinese-pos
ckiplab/bert-base-chinese-ner

Following the example, I run run_ner.py from HuggingFace and replace model_name_or_path with the three models above for training.
When fine-tuning these three models, can my training data only use the labels B and I? Can I not label entity types as well, e.g. "B-PRODUCT" and "I-PRODUCT"? And can there be no O label? I ask because an earlier issue said that B and I are used.

Thanks

Import nlp tools package error

Hello,

When I use:
from ckip_transformers.nlp import CkipWordSegmenter, CkipPosTagger, CkipNerChunker
It reports ValueError: source code string cannot contain null bytes

How to fix it?

Thanks a lot!

How to cite?

If I use your bert-base-chinese, which reference should I cite?

Implement custom Chinese tokenizer.

We may implement our own tokenizer rather than using BertTokenizerFast.
Our own tokenizer should have the following features:

  • Disable word piece: convert text to token IDs character by character (e.g. tokenizer.convert_tokens_to_ids(list(input_text))); see the sketch after this list.
  • Reimplement the clean_up_tokenization method. The default method is implemented for English only. Our method may remove whitespaces and convert half-width punctuation to full-width.
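A minimal sketch of the character-by-character encoding idea, using the existing BertTokenizerFast; the helper name is hypothetical.

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')

def encode_char_by_char(text):
    # Hypothetical helper: bypass word-piece by splitting the text into
    # single characters before converting them to token IDs.
    return tokenizer.convert_tokens_to_ids(list(text))

print(encode_char_by_char("傅達仁今將執行安樂死"))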

Speed up tokenization.

HuggingFace's fast tokenizers can also return the original character indices.
We may rewrite the tokenization step using this feature instead of tokenizing character by character, as sketched below.
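A minimal sketch of that idea; return_offsets_mapping is a standard feature of HuggingFace's fast tokenizers.

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
enc = tokenizer("傅達仁今將執行安樂死", return_offsets_mapping=True)

# Each (start, end) pair maps a token back to its span in the original text.
for token, span in zip(tokenizer.convert_ids_to_tokens(enc.input_ids), enc.offset_mapping):
    print(token, span)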

Getting the embeddings of the output

Hello:
When using CkipWordSegmenter, CkipPosTagger, and CkipNerChunker,
is it possible to obtain the embedding of each output token from the results?

For example, for the 45-character sentence in the example,
傅達仁今將執行安樂死,卻突然爆出自己20年前遭緯來體育台封殺,他不懂自己哪裡得罪到電視台。
is there somewhere in the final output where a 45x768 result can be obtained? Thanks
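As far as the README shows, the drivers return only strings and tags, but one plausible route is to load the underlying checkpoint with the plain HuggingFace API and read the last hidden layer; a sketch under that assumption:

import torch
from transformers import BertTokenizerFast, AutoModel

tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModel.from_pretrained('ckiplab/bert-base-chinese-ws')

inputs = tokenizer("傅達仁今將執行安樂死", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# (batch, sequence_length, hidden_size), e.g. 1 x N x 768 for BERT base;
# N includes the [CLS] and [SEP] special tokens.
print(out.last_hidden_state.shape)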

Unable to load weights from pytorch checkpoint.

When running this:
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModel.from_pretrained('ckiplab/albert-tiny-chinese-pos')

I get this error:
f"Unable to load weights from pytorch checkpoint file for '{pretrained_model_name_or_path}' "
OSError: Unable to load weights from pytorch checkpoint file for 'ckiplab/albert-tiny-chinese-pos' at

I have
transformers==4.2.2
ckip-transformers==0.2.1
torch==1.4.0

Originally posted by @WachaIPSOS in #3 (comment)

Several questions about model fine-tuning

Hello,

  1. How are the NLP task models trained? Are they fine-tuned from the language models?
  2. Can I fine-tune the ckiplab/albert-base-chinese-ws model on my own dataset (I hope to train a new model to segment new data)? If so, does the dataset need to be labeled (tokens) in advance, or is raw data enough?
  3. Can a model trained this way be used through the NLP tools? The current package seems to select models by number (one cannot plug in one's own fine-tuned model).

Simplified Chinese

Is this model applicable to Simplified Chinese? Are there any experimental results for Simplified Chinese?

Question about the BERT-base-chinese pretraining procedure

I would like to ask:
does your BERT-base-chinese pretraining follow the original BERT procedure exactly,
with only the dataset replaced by traditional Chinese and the tokenizer changed?

Thanks

Chinese text classification model usage example

Hello,

Can you please share an example of how to use your model to split Chinese text into separate words?

At this moment this code:

from transformers import (
   BertTokenizerFast,
   AutoModelForMaskedLM,
   AutoModelForCausalLM,
   AutoModelForTokenClassification,
)

# causal language model (GPT2)
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModelForCausalLM.from_pretrained('ckiplab/gpt2-base-chinese') # or other models above

encoded_input = tokenizer.encode(sample_input, return_tensors="pt")
# batch = []
# batch.append(encoded_input)
predictions = model.generate(encoded_input)

tokenizer.batch_decode(predictions)

gives ['[CLS] 之 后 你 看 看 了 我 的 出 版 请 告 诉 我 你 认 为 什 么 [SEP] 我'] for 之后你看看了我的出版请告诉我你认为什么 input.

At the same time your example in example.py in the repo gives correct output for my input:

之后你看看了我的出版请告诉我你认为什么
之后(Nd) 你(Nh) 看看(VE) 了(Di) 我(Nh) 的(DE) 出版(Nv) 请(VF) 告诉(VE) 我(Nh) 你(Nh) 认为(VE) 什么(Nep)

(in case you are interested in context of this issue, here is google doc with my R&D information on this task)

function pack_ws_pos_sentece() was not defined

In README.rst, item
4. Show results
shows this line:
print(pack_ws_pos_sentece(sentence_ws, sentence_pos))
It gave an error since the function pack_ws_pos_sentece() was not defined in that block of code.

How to use a fine-tuned model

Hello:

I have fine-tuned a model on my own dataset using the run_ner.py example you mentioned:
https://github.com/huggingface/transformers/tree/main/examples/pytorch/token-classification

It produced two files, config.json and tf_model.h5.

But when I try to use my fine-tuned model, this line
ws_driver = CkipNerChunker(model="tmp/tf_model.h5")
raises the following error:
Traceback (most recent call last):
File "/root/miniconda3/envs/chatbot/lib/python3.6/site-packages/ckip_transformers/nlp/util.py", line 89, in _get_model_name
model_name = self._model_names[model]
KeyError: './tmp/tf_model.h5'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "Transformers_pretrained.py", line 12, in
ws_driver = CkipWordSegmenter(model="./tmp/tf_model.h5")
File "/root/miniconda3/envs/chatbot/lib/python3.6/site-packages/ckip_transformers/nlp/driver.py", line 52, in init
model_name = kwargs.pop("model_name", self._get_model_name(model))
File "/root/miniconda3/envs/chatbot/lib/python3.6/site-packages/ckip_transformers/nlp/util.py", line 91, in _get_model_name
raise KeyError(f"Invalid model {model}") from exc
KeyError: 'Invalid model ./tmp/tf_model.h5'

How do I correctly use my own fine-tuned model with CkipWordSegmenter, CkipPosTagger, and CkipNerChunker?
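For what it's worth, the README's custom-checkpoint interface takes a model_name pointing at a model directory or name rather than a bare weights file, so something along the following lines may work once the checkpoint is saved in PyTorch format; the path is a placeholder.

# Point model_name at the directory containing config.json and the weights
ws_driver = CkipWordSegmenter(model_name="./tmp")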

Some traditional Chinese characters mapped to UNK

Thanks for the great library! Not sure if this is the correct place to ask, but I think I was using your tokenizer in HuggingFace transformers. I found that some traditional Chinese characters are mapped to UNK; see the example below.

The code I used was

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
input_ids = tokenizer.encode("重刋道藏輯要高上玉皇本行集經天樞上相(臣)張良校正三淸勅門下湛寂常道信擬議之至難恢漠神通豈形容之可盡", return_tensors='pt')
print ('encoded ids: ', input_ids)
print ('map encoded ids back to words: ', tokenizer.decode(input_ids[0]))


Thanks in advance!

Loading model error

When I tried to use ckip-transformers to perform a Chinese NER task in PyTorch and loaded the level-3 model, the following error occurred:
Traceback (most recent call last):
File "ner.py", line 3, in
ner_driver = CkipNerChunker(level=3, device=0)
File "/home/nieyang/anaconda3/envs/huggingface/lib/python3.6/site-packages/ckip_transformers/nlp/driver.py", line 224, in init
super().init(model_name=model_name, **kwargs)
File "/home/nieyang/anaconda3/envs/huggingface/lib/python3.6/site-packages/ckip_transformers/nlp/util.py", line 64, in init
self.model = AutoModelForTokenClassification.from_pretrained(model_name)
File "/home/nieyang/anaconda3/envs/huggingface/lib/python3.6/site-packages/transformers/models/auto/auto_factory.py", line 360, in from_pretrained
pretrained_model_name_or_path, *model_args, config=config, **kwargs
File "/home/nieyang/anaconda3/envs/huggingface/lib/python3.6/site-packages/transformers/modeling_utils.py", line 1066, in from_pretrained
f"Unable to load weights from pytorch checkpoint file for '{pretrained_model_name_or_path}' "
OSError: Unable to load weights from pytorch checkpoint file for 'ckiplab/bert-base-chinese-ner' at '/home/nieyang/.cache/huggingface/transformers/46785b95696d8e6a5004a6a73fcee887d60745a5872af82ca7599b9470554ce3.bdaa5056a5c748eca59fe2c7eef8fa2d034f5092fc84ce6b008c27ddf6f0025c'If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.

So I added the flag from_tf=True to self.model = AutoModelForTokenClassification.from_pretrained(model_name) in ckip_transformers/nlp/util.py, but then it reports that the model name is wrong.

So can you help me with this?

Albert-tiny English support for NLU tasks

Is there a way to get an equivalent albert-tiny english language model to perform downstream tasks like intent and entity classification. I'm afraid there is no albert-tiny model present hence any lead on this regards or guide to create one from scratch, would be highly appreciated.
Thanks

Is it possible to provide a demo code for bert-base-chinese-qa?

Hi, I am new in this field. Is it possible to provide a demo code for bert-base-chinese-qa?
I tried the following code, following the book "Getting Started with Google BERT":

import torch
from transformers import BertTokenizerFast, BertForQuestionAnswering

tokenizer = BertTokenizerFast.from_pretrained("ckiplab/bert-base-chinese")
model = BertForQuestionAnswering.from_pretrained("ckiplab/bert-base-chinese-qa")

paragraph = "李同 也 沒有 在意 , 大廈 中 , 几乎 每 天 都 有 人 搬進 搬出 , 原 不足為奇 。 \
             可是 , 當 李同 走進 大廈 時 , 卻 看見 了 那 個 老者 , 那 老者 是 倒退 著 身子 走出來 的 , \
             在 那 老者 的 面前 , 兩 個 搬運 工人 , 正 抬 著 一 只 箱子 。 那 是 一 只 木 箱子 , \
             很 殘舊 了 , 箱子 并 不 大 , 但是 兩 個 搬運 工人 抬 著 , 看來 十分 吃力 。[SEP]".strip(" ")

question = "[CLS]老者怎麼走出來的?[SEP]"

question_tokens = tokenizer.tokenize(question)
paragraph_tokens = tokenizer.tokenize(paragraph)

tokens = question_tokens + paragraph_tokens
input_ids = tokenizer.convert_tokens_to_ids(tokens)

segment_ids = [0] * len(question_tokens)
segment_ids += [1] * len(paragraph_tokens)

input_ids = torch.tensor([input_ids])
segment_ids = torch.tensor([segment_ids])

# Getting the answer

res = model(input_ids, token_type_ids=segment_ids)

start_scores, end_scores = res['start_logits'], res['end_logits']

start_index = torch.argmax(start_scores)
end_index = torch.argmax(end_scores)

print(" ".join(tokens[start_index:end_index+1]))

But I got [CLS]. Could you provide sample code showing how this Chinese QA model can work properly?
Thank you!

About dependency parsing

Hello!
May I ask whether a dependency parsing tool might be developed in the future?

Thank you for your answer.

Set device = -1 but still using GPU

Hi @emfomy , thank you for your attention 🙏

ckip_transformers version

0.2.7

What happened

Set device = -1, but the model still uses GPU.

script:

from ckip_transformers.nlp import CkipNerChunker
ner_driver = CkipNerChunker(level=3, device=-1)
res = ner_driver(text_list)

(Screenshots: GPU memory usage before and after running the script; the GPU is occupied after the run.)

What do you think should happen instead

It should not consume GPU resources.

How to reproduce

Run the script in a GPU-enabled environment:

from ckip_transformers.nlp import CkipNerChunker
ner_driver = CkipNerChunker(level=3, device=-1)
res = ner_driver(text_list)

Operating System

Ubuntu 20.04.2 LTS

Development Environment

  • Python 3.8.12
  • PyTorch 1.9.0+cu111
  • Transformers 4.7.0
  • Tensorflow 2.11.0

Anything else

I've checked the source code: self.device is set to "cpu", and both the model and the data tensors have .to(self.device) applied, so it's weird to have this problem.
Also, if the environment has no GPU, the script still runs.
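One way to double-check where the weights actually ended up, assuming the driver exposes its underlying model attribute as the traceback above suggests:

from ckip_transformers.nlp import CkipNerChunker

ner_driver = CkipNerChunker(level=3, device=-1)
# The placement of the first parameter tells us which device holds the weights.
print(next(ner_driver.model.parameters()).device)  # expected: cpu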

how to compare ckiplab/bert-base-chinese with bert-base-chinese?

Thanks so much for this excellent model and for making it accessible on HuggingFace.

I would like to know why ckiplab/bert-base-chinese seems a bit strange compared to the usual bert-base-chinese, which I believe is mainly trained on Simplified Chinese. For instance, when I masked a word of the phrase 颱風預測。, the usual bert-base-chinese gave the masked word back with a high probability of 0.992; in contrast, ckiplab/bert-base-chinese did not return the masked word in its top 5, and its top candidate only had a probability of around 0.3, which puzzles me.

Are we supposed to fine-tune this MLM first? Or perhaps I have interpreted it wrongly (I'm very new to this field). Would you mind sharing your thoughts? Thanks very much in advance.

How to get the NER labels from the output

from transformers import (
BertTokenizerFast,
AutoModel,
)

tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModel.from_pretrained('ckiplab/bert-base-chinese-ner')

...

What should come after this so that the output gives the corresponding NER labels?
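A plausible continuation uses the token-classification head together with its id2label mapping instead of the bare AutoModel; a sketch, not an official example:

import torch
from transformers import BertTokenizerFast, AutoModelForTokenClassification

tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModelForTokenClassification.from_pretrained('ckiplab/bert-base-chinese-ner')

inputs = tokenizer("傅達仁今將執行安樂死", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Map each token's highest-scoring class index to its label string.
pred_ids = logits.argmax(-1)[0].tolist()
print([model.config.id2label[i] for i in pred_ids])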

Pinning memory issue

Hi,

I'm currently using ckip-transformers-ws as a preprocessing tool in my project, and I noticed that the DataLoader's pin_memory flag is hard-coded to True in util.py.

As pinning memory is incompatible with multiprocessing (multiple workers) [1], when users leverage ckip-transformers inside the collate_fn of a DataLoader with multiple workers, a CUDA error occurs as shown in [1], even when only the CPU is used for inference.

Therefore, I think it would be better to:

  1. Pin memory only when the device is a GPU (see the sketch after this list).
  2. Add an option to decide whether or not to enable memory pinning.
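A minimal sketch of suggestion 1 (conditional pinning), with hypothetical names, independent of the actual ckip-transformers internals:

import torch
from torch.utils.data import DataLoader

def make_loader(dataset, device, batch_size=32):
    # Pin host memory only when batches will actually be copied to a GPU.
    return DataLoader(dataset, batch_size=batch_size,
                      pin_memory=(device.type == "cuda"))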

Regards.

[1] https://discuss.pytorch.org/t/pin-memory-vs-sending-direct-to-gpu-from-dataset/33891/2
