
ckip-transformers's Introduction

CKIP Transformers

This project provides traditional Chinese transformers models (including ALBERT, BERT, GPT2) and NLP tools (including word segmentation, part-of-speech tagging, and named entity recognition).


Models

You may also use our pretrained models directly with the HuggingFace transformers library: https://huggingface.co/ckiplab/.

Model Usage

You may use our models directly through HuggingFace's transformers library.
pip install -U transformers
Please use BertTokenizerFast as the tokenizer, and replace ckiplab/albert-tiny-chinese and ckiplab/albert-tiny-chinese-ws in the following example with any model you need.
from transformers import (
   BertTokenizerFast,
   AutoModelForMaskedLM,
   AutoModelForCausalLM,
   AutoModelForTokenClassification,
)

# masked language model (ALBERT, BERT)
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModelForMaskedLM.from_pretrained('ckiplab/albert-tiny-chinese') # or other models above

# causal language model (GPT2)
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModelForCausalLM.from_pretrained('ckiplab/gpt2-base-chinese') # or other models above

# nlp task model
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModelForTokenClassification.from_pretrained('ckiplab/albert-tiny-chinese-ws') # or other models above
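As a quick sanity check that the masked LM above loads and runs, here is a minimal sketch of masked-token prediction; the sample sentence and top-5 decoding are illustrative and not part of the official examples.

import torch
from transformers import BertTokenizerFast, AutoModelForMaskedLM

tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModelForMaskedLM.from_pretrained('ckiplab/albert-tiny-chinese')

# Mask one character and let the model fill it in.
inputs = tokenizer("台北是台灣的[MASK]都。", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and print the five most likely tokens.
mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top_ids = logits[0, mask_pos].topk(5).indices[0].tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))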

Model Fine-Tuning

To fine-tune our models on your own datasets, please refer to the following examples from HuggingFace's transformers.
Remember to set --tokenizer_name bert-base-chinese in order to use the Chinese tokenizer.
# replace ckiplab/albert-tiny-chinese with any model above
python run_mlm.py \
   --model_name_or_path ckiplab/albert-tiny-chinese \
   --tokenizer_name bert-base-chinese \
   ...

# replace ckiplab/albert-tiny-chinese-ws with any model above
python run_ner.py \
   --model_name_or_path ckiplab/albert-tiny-chinese-ws \
   --tokenizer_name bert-base-chinese \
   ...
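For reference, a fuller run_mlm.py invocation might look like the following; the data files and output directory are placeholders, and the remaining flags are standard options of HuggingFace's example scripts.

python run_mlm.py \
   --model_name_or_path ckiplab/albert-tiny-chinese \
   --tokenizer_name bert-base-chinese \
   --train_file path/to/train.txt \
   --validation_file path/to/dev.txt \
   --do_train \
   --do_eval \
   --output_dir path/to/output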

Model Performance

The following is a performance comparison between our model and other models.
The results are tested on a traditional Chinese corpus.
以下是我們的模型與其他的模型之性能比較。
各個任務皆測試於繁體中文的測試集。
Model                         #Parameters   Perplexity†   WS (F1)‡   POS (ACC)‡   NER (F1)‡
ckiplab/albert-tiny-chinese   4M            4.80          96.66%     94.48%       71.17%
ckiplab/albert-base-chinese   11M           2.65          97.33%     95.30%       79.47%
ckiplab/bert-tiny-chinese     12M           8.07          96.98%     95.11%       74.21%
ckiplab/bert-base-chinese     102M          1.88          97.60%     95.67%       81.18%
ckiplab/gpt2-tiny-chinese     4M            16.94         --         --           --
ckiplab/gpt2-base-chinese     102M          8.36          --         --           --
voidful/albert_chinese_tiny   4M            74.93         --         --           --
voidful/albert_chinese_base   11M           22.34         --         --           --
bert-base-chinese             102M          2.53          --         --           --
† Perplexity; the smaller the better.
‡ WS: word segmentation; POS: part-of-speech tagging; NER: named entity recognition; the larger the better.

Training Corpus

The language models are trained on the ZhWiki and CNA datasets; the WS and POS models are trained on the ASBC dataset; the NER models are trained on the OntoNotes dataset.
Here is a summary of each corpus.
Dataset     #Documents   #Lines       #Characters     Line Type
CNA         2,559,520    13,532,445   1,219,029,974   Paragraph
ZhWiki      1,106,783    5,918,975    495,446,829     Paragraph
ASBC        19,247       1,395,949    17,572,374      Clause
OntoNotes   1,911        48,067       1,568,491       Sentence
Here is the dataset split used for the language models.
CNA+ZhWiki   #Documents   #Lines       #Characters
Train        3,606,303    18,986,238   4,347,517,682
Dev          30,000       148,077      32,888,978
Test         30,000       151,241      35,216,818
Here is the dataset split used for the word segmentation and part-of-speech tagging models.
ASBC    #Documents   #Lines      #Words      #Characters
Train   15,247       1,183,260   9,480,899   14,724,250
Dev     2,000        52,677      448,964     741,323
Test    2,000        160,012     1,315,129   2,106,799
Here is the dataset split used for the named entity recognition models.
OntoNotes   #Documents   #Lines   #Characters   #Named-Entities
Train       1,511        43,362   1,367,658     68,947
Dev         200          2,304    93,535        7,186
Test        200          2,401    107,298       6,977

NLP Tools

The package also provides the following NLP tools.
  • (WS) Word Segmentation
  • (POS) Part-of-Speech Tagging
  • (NER) Named Entity Recognition

Installation

pip install -U ckip-transformers


NLP Tools Usage

See the documentation for API details.

1. Import module

from ckip_transformers.nlp import CkipWordSegmenter, CkipPosTagger, CkipNerChunker

2. Load models

We provide several pretrained models for the NLP tools.
# Initialize drivers
ws_driver  = CkipWordSegmenter(model="bert-base")
pos_driver = CkipPosTagger(model="bert-base")
ner_driver = CkipNerChunker(model="bert-base")
One may also load one's own checkpoints with our drivers.
# Initialize drivers with custom checkpoints
ws_driver  = CkipWordSegmenter(model_name="path_to_your_model")
pos_driver = CkipPosTagger(model_name="path_to_your_model")
ner_driver = CkipNerChunker(model_name="path_to_your_model")
To use the GPU, one may specify the device ID while initializing the drivers. Set it to -1 (the default) to disable the GPU.
# Use CPU
ws_driver = CkipWordSegmenter(device=-1)

# Use GPU:0
ws_driver = CkipWordSegmenter(device=0)

3. Run pipeline

The input for word segmentation and named entity recognition must be a list of sentences.
The input for part-of-speech tagging must be a list of lists of words (the output of word segmentation).
# Input text
text = [
   "傅達仁今將執行安樂死,卻突然爆出自己20年前遭緯來體育台封殺,他不懂自己哪裡得罪到電視台。",
   "美國參議院針對今天總統布什所提名的勞工部長趙小蘭展開認可聽證會,預料她將會很順利通過參議院支持,成為該國有史以來第一位的華裔女性內閣成員。",
   "空白 也是可以的~",
]

# Run pipeline
ws  = ws_driver(text)
pos = pos_driver(ws)
ner = ner_driver(text)
While running the model, the POS driver automatically segments each sentence on the characters ',,。::;;!!??' (the output sentences are concatenated back together). You may set delim_set to any characters you want.
You may set use_delim=False to disable this feature, or set use_delim=True in the WS and NER drivers to enable it.
# Enable sentence segmentation
ws  = ws_driver(text, use_delim=True)
ner = ner_driver(text, use_delim=True)

# Disable sentence segmentation
pos = pos_driver(ws, use_delim=False)

# Use new line characters and tabs for sentence segmentation
pos = pos_driver(ws, delim_set='\n\t')
You may specify batch_size and max_length to better utilize your machine's resources.
# Sets the batch size and maximum sentence length
ws = ws_driver(text, batch_size=256, max_length=128)

4. Show results

# Pack word segmentation and part-of-speech results
def pack_ws_pos_sentece(sentence_ws, sentence_pos):
   assert len(sentence_ws) == len(sentence_pos)
   res = []
   for word_ws, word_pos in zip(sentence_ws, sentence_pos):
      res.append(f"{word_ws}({word_pos})")
   return "\u3000".join(res)

# Show results
for sentence, sentence_ws, sentence_pos, sentence_ner in zip(text, ws, pos, ner):
   print(sentence)
   print(pack_ws_pos_sentece(sentence_ws, sentence_pos))
   for entity in sentence_ner:
      print(entity)
   print()
傅達仁今將執行安樂死,卻突然爆出自己20年前遭緯來體育台封殺,他不懂自己哪裡得罪到電視台。
傅達仁(Nb) 今(Nd) 將(D) 執行(VC) 安樂死(Na) ,(COMMACATEGORY) 卻(D) 突然(D) 爆出(VJ) 自己(Nh) 20(Neu) 年(Nd) 前(Ng) 遭(P) 緯來(Nb) 體育台(Na) 封殺(VC) ,(COMMACATEGORY) 他(Nh) 不(D) 懂(VK) 自己(Nh) 哪裡(Ncd) 得罪到(VC) 電視台(Nc) 。(PERIODCATEGORY)
NerToken(word='傅達仁', ner='PERSON', idx=(0, 3))
NerToken(word='20年', ner='DATE', idx=(18, 21))
NerToken(word='緯來體育台', ner='ORG', idx=(23, 28))

美國參議院針對今天總統布什所提名的勞工部長趙小蘭展開認可聽證會,預料她將會很順利通過參議院支持,成為該國有史以來第一位的華裔女性內閣成員。
美國(Nc) 參議院(Nc) 針對(P) 今天(Nd) 總統(Na) 布什(Nb) 所(D) 提名(VC) 的(DE) 勞工部長(Na) 趙小蘭(Nb) 展開(VC) 認可(VC) 聽證會(Na) ,(COMMACATEGORY) 預料(VE) 她(Nh) 將(D) 會(D) 很(Dfa) 順利(VH) 通過(VC) 參議院(Nc) 支持(VC) ,(COMMACATEGORY) 成為(VG) 該(Nes) 國(Nc) 有史以來(D) 第一(Neu) 位(Nf) 的(DE) 華裔(Na) 女性(Na) 內閣(Na) 成員(Na) 。(PERIODCATEGORY)
NerToken(word='美國參議院', ner='ORG', idx=(0, 5))
NerToken(word='今天', ner='LOC', idx=(7, 9))
NerToken(word='布什', ner='PERSON', idx=(11, 13))
NerToken(word='勞工部長', ner='ORG', idx=(17, 21))
NerToken(word='趙小蘭', ner='PERSON', idx=(21, 24))
NerToken(word='認可聽證會', ner='EVENT', idx=(26, 31))
NerToken(word='參議院', ner='ORG', idx=(42, 45))
NerToken(word='第一', ner='ORDINAL', idx=(56, 58))
NerToken(word='華裔', ner='NORP', idx=(60, 62))

空白 也是可以的~
空白(VH)  (WHITESPACE) 也(D) 是(SHI) 可以(VH) 的(T) ~(FW)

NLP Tools Performance

The following is a performance comparison between our tool and other tools.
以下是我們的工具與其他的工具之性能比較。

CKIP Transformers vs. Monpa & Jieba

Tool               WS (F1)   POS (Acc)   WS+POS (F1)   NER (F1)
CKIP BERT Base     97.60%    95.67%      94.19%        81.18%
CKIP ALBERT Base   97.33%    95.30%      93.52%        79.47%
CKIP BERT Tiny     96.98%    95.08%      93.13%        74.20%
CKIP ALBERT Tiny   96.66%    94.48%      92.25%        71.17%
Monpa†             92.58%    --          83.88%        --
Jieba              81.18%    --          --            --

† Monpa provides only 3 types of tags in NER.

CKIP Transformers vs. CkipTagger

The following results are tested on a different dataset.†
Tool             WS (F1)   POS (Acc)   WS+POS (F1)   NER (F1)
CKIP BERT Base   97.84%    96.46%      94.91%        79.20%
CkipTagger       97.33%    97.20%      94.75%        77.87%

† Here we retrained and tested our BERT model on the same dataset as CkipTagger.

License

GPL-3.0

Copyright (c) 2023 CKIP Lab under the GPL-3.0 License.

ckip-transformers's People

Contributors: emfomy, pan93412, qqaatw


ckip-transformers's Issues

max_length in README is not correct?

Hi, thanks for your great work.
I found a tiny error in your example: when executing the code
ws = ws_driver(text, batch_size=256, max_length=512)
it shows the error message
"AssertionError: Sequence length is longer than the maximum sequence length for this model (512 > 510)."
Setting max_length lower than 510 fixes this.
Other than that, everything works fine. It's an excellent and convenient tool for extracting information from data.
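For instance, the same call from the README with a length under the limit:

ws = ws_driver(text, batch_size=256, max_length=508)  # stays below the model's 510-token limit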

Fine-tune Model

Hello,
I would like to ask about fine-tuning the following WS, POS, and NER models:
ckiplab/bert-base-chinese-ws
ckiplab/bert-base-chinese-pos
ckiplab/bert-base-chinese-ner

Following the example, I run run_ner.py from HuggingFace and replace model_name_or_path with the three models above for training.
When fine-tuning these three models, can my training data only use the labels B and I? Can I not label entity types as well, e.g. "B-PRODUCT" and "I-PRODUCT"? And can there be no O label? I ask because an earlier issue said that B and I are used.

Thanks

Import nlp tools package error

Hello,

When I use:
from ckip_transformers.nlp import CkipWordSegmenter, CkipPosTagger, CkipNerChunker
It reports ValueError: source code string cannot contain null bytes

How to fix it?

Thanks a lot!

How to cite?

If I use your bert-base-chinese, which reference should I cite?

Implement custom Chinese tokenizer.

We may implement our own tokenizer rather than using BertTokenizerFast.
Our own tokenizer should have the following features:

  • Disable word piece: convert text to token IDs character by character (e.g. tokenizer.convert_tokens_to_ids(list(input_text))); see the sketch after this list.
  • Reimplement the clean_up_tokenization method. The default method is implemented for English only. Our method may remove whitespaces and convert half-width punctuation to full-width.
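A minimal sketch of the character-by-character encoding idea, using the existing BertTokenizerFast; the helper name is hypothetical.

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')

def encode_char_by_char(text):
    # Hypothetical helper: bypass word-piece by splitting the text into
    # single characters before converting them to token IDs.
    return tokenizer.convert_tokens_to_ids(list(text))

print(encode_char_by_char("傅達仁今將執行安樂死"))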

Speed up tokenization.

HuggingFace's fast tokenizers can also return the original character indices.
We may rewrite the tokenization step using this feature instead of tokenizing character by character, as sketched below.
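A minimal sketch of that idea; return_offsets_mapping is a standard feature of HuggingFace's fast tokenizers.

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
enc = tokenizer("傅達仁今將執行安樂死", return_offsets_mapping=True)

# Each (start, end) pair maps a token back to its span in the original text.
for token, span in zip(tokenizer.convert_ids_to_tokens(enc.input_ids), enc.offset_mapping):
    print(token, span)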

Getting the embeddings of the output

Hello:
When using CkipWordSegmenter, CkipPosTagger, and CkipNerChunker,
is it possible to obtain the embedding of each output token from the results?

For example, for the 45-character sentence in the example,
傅達仁今將執行安樂死,卻突然爆出自己20年前遭緯來體育台封殺,他不懂自己哪裡得罪到電視台。
is there somewhere in the final output where a 45x768 result can be obtained? Thanks
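As far as the README shows, the drivers return only strings and tags, but one plausible route is to load the underlying checkpoint with the plain HuggingFace API and read the last hidden layer; a sketch under that assumption:

import torch
from transformers import BertTokenizerFast, AutoModel

tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModel.from_pretrained('ckiplab/bert-base-chinese-ws')

inputs = tokenizer("傅達仁今將執行安樂死", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# (batch, sequence_length, hidden_size), e.g. 1 x N x 768 for BERT base;
# N includes the [CLS] and [SEP] special tokens.
print(out.last_hidden_state.shape)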

Unable to load weights from pytorch checkpoint.

When running this:
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModel.from_pretrained('ckiplab/albert-tiny-chinese-pos')

I get this error:
f"Unable to load weights from pytorch checkpoint file for '{pretrained_model_name_or_path}' "
OSError: Unable to load weights from pytorch checkpoint file for 'ckiplab/albert-tiny-chinese-pos' at

I have
transformers==4.2.2
ckip-transformers==0.2.1
torch==1.4.0

Originally posted by @WachaIPSOS in #3 (comment)

Several questions about model fine-tuning

Hello,

  1. How are the NLP task models trained? Are they fine-tuned from the language models?
  2. Can I fine-tune the ckiplab/albert-base-chinese-ws model on my own dataset (I hope to train a new model to segment new data)? If so, does the dataset need to be labeled (tokens) in advance, or is raw data enough?
  3. Can a model trained this way be used through the NLP tools? The current package seems to select models by number (one cannot plug in one's own fine-tuned model).

Simplified Chinese

Is this model applicable to Simplified Chinese? Are there any experimental results for Simplified Chinese?

Question about the BERT-base-chinese pretraining procedure

I would like to ask:
does your BERT-base-chinese pretraining follow the original BERT procedure exactly,
with only the dataset replaced by traditional Chinese and the tokenizer changed?

Thanks

Chinese text classification model usage example

Hello,

Can you please share an example of how to use your model to split Chinese text into separate words?

At this moment this code:

from transformers import (
   BertTokenizerFast,
   AutoModelForMaskedLM,
   AutoModelForCausalLM,
   AutoModelForTokenClassification,
)

# causal language model (GPT2)
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModelForCausalLM.from_pretrained('ckiplab/gpt2-base-chinese') # or other models above

encoded_input = tokenizer.encode(sample_input, return_tensors="pt")
# batch = []
# batch.append(encoded_input)
predictions = model.generate(encoded_input)

tokenizer.batch_decode(predictions)

gives ['[CLS] 之 后 你 看 看 了 我 的 出 版 请 告 诉 我 你 认 为 什 么 [SEP] 我'] for 之后你看看了我的出版请告诉我你认为什么 input.

At the same time your example in example.py in the repo gives correct output for my input:

之后你看看了我的出版请告诉我你认为什么
之后(Nd) 你(Nh) 看看(VE) 了(Di) 我(Nh) 的(DE) 出版(Nv) 请(VF) 告诉(VE) 我(Nh) 你(Nh) 认为(VE) 什么(Nep)

(in case you are interested in context of this issue, here is google doc with my R&D information on this task)

function pack_ws_pos_sentece() was not defined

In README.rst, item
4. Show results
shows this line:
print(pack_ws_pos_sentece(sentence_ws, sentence_pos))
It gave an error since the function pack_ws_pos_sentece() was not defined in that block of code.

How to use a fine-tuned model

Hello:

I have fine-tuned a model on my own dataset using the run_ner.py example you mentioned:
https://github.com/huggingface/transformers/tree/main/examples/pytorch/token-classification

It produced two files, config.json and tf_model.h5.

But when I try to use my fine-tuned model, this line
ws_driver = CkipNerChunker(model="tmp/tf_model.h5")
raises the following error:
Traceback (most recent call last):
File "/root/miniconda3/envs/chatbot/lib/python3.6/site-packages/ckip_transformers/nlp/util.py", line 89, in _get_model_name
model_name = self._model_names[model]
KeyError: './tmp/tf_model.h5'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "Transformers_pretrained.py", line 12, in
ws_driver = CkipWordSegmenter(model="./tmp/tf_model.h5")
File "/root/miniconda3/envs/chatbot/lib/python3.6/site-packages/ckip_transformers/nlp/driver.py", line 52, in init
model_name = kwargs.pop("model_name", self._get_model_name(model))
File "/root/miniconda3/envs/chatbot/lib/python3.6/site-packages/ckip_transformers/nlp/util.py", line 91, in _get_model_name
raise KeyError(f"Invalid model {model}") from exc
KeyError: 'Invalid model ./tmp/tf_model.h5'

How do I correctly use my own fine-tuned model with CkipWordSegmenter, CkipPosTagger, and CkipNerChunker?
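For what it's worth, the README's custom-checkpoint interface takes a model_name pointing at a model directory or name rather than a bare weights file, so something along the following lines may work once the checkpoint is saved in PyTorch format; the path is a placeholder.

# Point model_name at the directory containing config.json and the weights
ws_driver = CkipWordSegmenter(model_name="./tmp")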

Some traditional Chinese characters mapped to UNK

Thanks for the great library! Not sure if this is the correct place to ask, but I think I was using your tokenizer in HuggingFace transformers. I found that some traditional Chinese characters are mapped to UNK; see the example below.

The code I used was

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
input_ids = tokenizer.encode("重刋道藏輯要高上玉皇本行集經天樞上相(臣)張良校正三淸勅門下湛寂常道信擬議之至難恢漠神通豈形容之可盡", return_tensors='pt')
print ('encoded ids: ', input_ids)
print ('map encoded ids back to words: ', tokenizer.decode(input_ids[0]))


Thanks in advance!

Loading model error

When I tried to use ckip-transformers to perform a Chinese NER task in PyTorch and loaded the level-3 model, the following error occurred:
Traceback (most recent call last):
File "ner.py", line 3, in
ner_driver = CkipNerChunker(level=3, device=0)
File "/home/nieyang/anaconda3/envs/huggingface/lib/python3.6/site-packages/ckip_transformers/nlp/driver.py", line 224, in init
super().init(model_name=model_name, **kwargs)
File "/home/nieyang/anaconda3/envs/huggingface/lib/python3.6/site-packages/ckip_transformers/nlp/util.py", line 64, in init
self.model = AutoModelForTokenClassification.from_pretrained(model_name)
File "/home/nieyang/anaconda3/envs/huggingface/lib/python3.6/site-packages/transformers/models/auto/auto_factory.py", line 360, in from_pretrained
pretrained_model_name_or_path, *model_args, config=config, **kwargs
File "/home/nieyang/anaconda3/envs/huggingface/lib/python3.6/site-packages/transformers/modeling_utils.py", line 1066, in from_pretrained
f"Unable to load weights from pytorch checkpoint file for '{pretrained_model_name_or_path}' "
OSError: Unable to load weights from pytorch checkpoint file for 'ckiplab/bert-base-chinese-ner' at '/home/nieyang/.cache/huggingface/transformers/46785b95696d8e6a5004a6a73fcee887d60745a5872af82ca7599b9470554ce3.bdaa5056a5c748eca59fe2c7eef8fa2d034f5092fc84ce6b008c27ddf6f0025c'If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.

So I added the flag from_tf=True to self.model = AutoModelForTokenClassification.from_pretrained(model_name) in ckip_transformers/nlp/util.py, but then it reports that the model name is wrong.

So can you help me with this?

Albert-tiny English support for NLU tasks

Is there a way to get an equivalent albert-tiny english language model to perform downstream tasks like intent and entity classification. I'm afraid there is no albert-tiny model present hence any lead on this regards or guide to create one from scratch, would be highly appreciated.
Thanks

Is it possible to provide a demo code for bert-base-chinese-qa?

Hi, I am new in this field. Is it possible to provide a demo code for bert-base-chinese-qa?
I tried the following code, following the book "Getting Started with Google BERT":

import torch
from transformers import BertTokenizerFast, BertForQuestionAnswering

tokenizer = BertTokenizerFast.from_pretrained("ckiplab/bert-base-chinese")
model = BertForQuestionAnswering.from_pretrained("ckiplab/bert-base-chinese-qa")

paragraph = "李同 也 沒有 在意 , 大廈 中 , 几乎 每 天 都 有 人 搬進 搬出 , 原 不足為奇 。 \
             可是 , 當 李同 走進 大廈 時 , 卻 看見 了 那 個 老者 , 那 老者 是 倒退 著 身子 走出來 的 , \
             在 那 老者 的 面前 , 兩 個 搬運 工人 , 正 抬 著 一 只 箱子 。 那 是 一 只 木 箱子 , \
             很 殘舊 了 , 箱子 并 不 大 , 但是 兩 個 搬運 工人 抬 著 , 看來 十分 吃力 。[SEP]".strip(" ")

question = "[CLS]老者怎麼走出來的?[SEP]"

question_tokens = tokenizer.tokenize(question)
paragraph_tokens = tokenizer.tokenize(paragraph)

tokens = question_tokens + paragraph_tokens
input_ids = tokenizer.convert_tokens_to_ids(tokens)

segment_ids = [0] * len(question_tokens)
segment_ids += [1] * len(paragraph_tokens)

input_ids = torch.tensor([input_ids])
segment_ids = torch.tensor([segment_ids])

# Getting the answer

res = model(input_ids, token_type_ids=segment_ids)

start_scores, end_scores = res['start_logits'], res['end_logits']

start_index = torch.argmax(start_scores)
end_index = torch.argmax(end_scores)

print(" ".join(tokens[start_index:end_index+1]))

But I got [CLS]. Could you provide sample code showing how this Chinese QA model can work properly?
Thank you!

About dependency parsing

Hello!
May I ask whether a dependency parsing tool might be developed in the future?

Thank you for your answer.

Set device = -1 but still using GPU

Hi @emfomy , thank you for your attention 🙏

ckip_transformers version

0.2.7

What happened

Set device = -1, but the model still uses GPU.

script:

from ckip_transformers.nlp import CkipNerChunker
ner_driver = CkipNerChunker(level=3, device=-1)
res = ner_driver(text_list)

(Screenshots: GPU memory usage before and after running the script; the GPU is occupied after the run.)

What do you think should happen instead

It should not consume GPU resources.

How to reproduce

Run the script in a GPU-enabled environment:

from ckip_transformers.nlp import CkipNerChunker
ner_driver = CkipNerChunker(level=3, device=-1)
res = ner_driver(text_list)

Operating System

Ubuntu 20.04.2 LTS

Development Environment

  • Python 3.8.12
  • PyTorch 1.9.0+cu111
  • Transformers 4.7.0
  • Tensorflow 2.11.0

Anything else

I've checked the source code: self.device is set to "cpu", and both the model and the data tensors have .to(self.device) applied, so it's weird to have this problem.
Also, if the environment has no GPU, the script still runs.
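One way to double-check where the weights actually ended up, assuming the driver exposes its underlying model attribute as the traceback above suggests:

from ckip_transformers.nlp import CkipNerChunker

ner_driver = CkipNerChunker(level=3, device=-1)
# The placement of the first parameter tells us which device holds the weights.
print(next(ner_driver.model.parameters()).device)  # expected: cpu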

how to compare ckiplab/bert-base-chinese with bert-base-chinese?

Thanks so much for this excellent model and for making it accessible on HuggingFace.

I would like to know why ckiplab/bert-base-chinese seems a bit strange compared to the usual bert-base-chinese, which I believe is mainly trained on Simplified Chinese. For instance, when I masked a word of the phrase 颱風預測。, the usual bert-base-chinese gave the masked word back with a high probability of 0.992; in contrast, ckiplab/bert-base-chinese did not return the masked word in its top 5, and its top candidate only had a probability of around 0.3, which puzzles me.

Are we supposed to fine-tune this MLM first? Or perhaps I have interpreted it wrongly (I'm very new to this field). Would you mind sharing your thoughts? Thanks very much in advance.

How to get the NER labels from the output

from transformers import (
BertTokenizerFast,
AutoModel,
)

tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModel.from_pretrained('ckiplab/bert-base-chinese-ner')

...

What should come after this so that the output gives the corresponding NER labels?
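A plausible continuation uses the token-classification head together with its id2label mapping instead of the bare AutoModel; a sketch, not an official example:

import torch
from transformers import BertTokenizerFast, AutoModelForTokenClassification

tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModelForTokenClassification.from_pretrained('ckiplab/bert-base-chinese-ner')

inputs = tokenizer("傅達仁今將執行安樂死", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Map each token's highest-scoring class index to its label string.
pred_ids = logits.argmax(-1)[0].tolist()
print([model.config.id2label[i] for i in pred_ids])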

Pinning memory issue

Hi,

I'm currently using ckip-transformers-ws as a preprocessing tool in my project, and I noticed that the DataLoader's pin_memory flag is hard-coded to True in util.py.

As pinning memory is incompatible with multiprocessing (multiple workers) [1], when users leverage ckip-transformers inside the collate_fn of a DataLoader with multiple workers, a CUDA error occurs as shown in [1], even when only the CPU is used for inference.

Therefore, I think it would be better to:

  1. Pin memory only when the device is a GPU (see the sketch after this list).
  2. Add an option to decide whether or not to enable memory pinning.
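A minimal sketch of suggestion 1 (conditional pinning), with hypothetical names, independent of the actual ckip-transformers internals:

import torch
from torch.utils.data import DataLoader

def make_loader(dataset, device, batch_size=32):
    # Pin host memory only when batches will actually be copied to a GPU.
    return DataLoader(dataset, batch_size=batch_size,
                      pin_memory=(device.type == "cuda"))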

Regards.

[1] https://discuss.pytorch.org/t/pin-memory-vs-sending-direct-to-gpu-from-dataset/33891/2
