
bert-japanese's People

Contributors

dependabot[bot], reiyw, singletongue


bert-japanese's Issues

Fine tune

Do you have a guide for fine-tuning bert-japanese?

I tried fine-tuning, but the results are not good; it seems I did something wrong.
Since GPU training is a bit expensive, I would like your opinion before fine-tuning again.

Do I need to segment words using mecab-neologd?
Do I need to do anything to the tokenizer before fine-tuning?
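A minimal sketch (not an official guide) of how fine-tuning is usually set up with this checkpoint; the model name, task head, and num_labels are assumptions for illustration. The tokenizer runs MeCab segmentation followed by WordPiece internally, so raw sentences can be passed to it directly.

import torch
from transformers import BertJapaneseTokenizer, BertForSequenceClassification

# Assumed checkpoint and label count; adjust to your task.
tokenizer = BertJapaneseTokenizer.from_pretrained("cl-tohoku/bert-base-japanese")
model = BertForSequenceClassification.from_pretrained(
    "cl-tohoku/bert-base-japanese", num_labels=2)

inputs = tokenizer("青葉山で植物の研究をしています。", return_tensors="pt")
outputs = model(**inputs)
print(outputs[0])  # classification logits (head is randomly initialized before fine-tuning)

From here, fine-tuning is ordinary supervised training of BertForSequenceClassification, e.g. with the Trainer API or a manual PyTorch loop.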

Error when initializing from the transformers pipeline

Hello,

I get an error when trying to initialize models that rely on your tokenizer from the transformers package's pipeline. Here is code that yields the error as well as the traceback.

from transformers import pipeline 

sentiment_analyzer = pipeline(
    "sentiment-analysis", model="cl-tohoku/bert-base-japanese", tokenizer="cl-tohoku/bert-base-japanese")
Traceback (most recent call last):
  File "<input>", line 3, in <module>
  File "C:\Users\gagno\Anaconda3\envs\japanese_admin_scrape\lib\site-packages\transformers\pipelines\__init__.py", line 377, in pipeline
    tokenizer = AutoTokenizer.from_pretrained(tokenizer, revision=revision, use_fast=use_fast)
  File "C:\Users\gagno\Anaconda3\envs\japanese_admin_scrape\lib\site-packages\transformers\models\auto\tokenization_auto.py", line 391, in from_pretrained
    tokenizer_class = tokenizer_class_from_name(tokenizer_class_candidate)
  File "C:\Users\gagno\Anaconda3\envs\japanese_admin_scrape\lib\site-packages\transformers\models\auto\tokenization_auto.py", line 294, in tokenizer_class_from_name
    if c.__name__ == class_name:
AttributeError: 'NoneType' object has no attribute '__name__'
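One possible workaround, assuming the failure comes from the Japanese tokenizer class not being importable in this environment (it needs the MeCab bindings): install the tokenizer dependencies and pass an explicit tokenizer object instead of a string, which bypasses the class-name lookup inside pipeline().

# pip install fugashi ipadic   (dependency names are an assumption; match your transformers version)
from transformers import pipeline, BertJapaneseTokenizer

tokenizer = BertJapaneseTokenizer.from_pretrained("cl-tohoku/bert-base-japanese")
sentiment_analyzer = pipeline(
    "sentiment-analysis",
    model="cl-tohoku/bert-base-japanese",
    tokenizer=tokenizer,
)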

How to use my own pretrained model?

I followed "Details of pretraining" and have pretrained a model using my own dataset. Could you tell me how to convert the model files (model.ckpt*) so that the model can be used in Transformers library? I will appreciate your help with my problem.

Please tell us how to cite your model in a paper

Thank you for your great work. I am writing an international workshop paper and would like
to cite your pretrained Japanese model, which I used in it. If you have not yet decided on a citation format, please check whether the sample below is OK.
Suzuki Masatoshi (2019) Pretrained Japanese BERT models, GitHub repository, https://github.com/cl-tohoku/bert-japanese
Thank you for your help.

Does tokenization.py need to be uploaded to GCP?

Hi, I'm following your scripts to train a BERT model on my own dataset. I trained the tokenizer and created the pretraining data locally, and I am preparing to upload the tfrecord files to a Google Cloud Storage (GCS) bucket for training. Do I need to upload your tokenization.py to replace the one provided by the git-cloned google-bert repository when training the model? Thanks for your help.

Past pretrained model license changed from CC-BY-SA 3.0 to Apache 2.0

Hi, I am helping to develop a large language model, and defining the license of a model has become important recently. Therefore,
I would like to inquire how the license is defined for Tohoku BERT.

I found the following issue, which defines the license of the pretrained model as CC-BY-SA, inherited from the Wikipedia license:
#7

However, the current pretrained model license is Apache 2.0, and the rationale for this change is not disclosed in the commit:
89e406e

What triggered this change?

Is it unnecessary to use neologd in segmentation?

I'm using your pretrained model via Hugging Face Transformers.
In "Details of pretraining", neologd is used, whereas "Requirements" only asks for a mecab-python3 install, i.e. the original ipadic dictionary.
The difference may be small thanks to WordPiece/BPE, but the segmentations are not exactly the same, I think.
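If you do want the segmentation to follow a neologd setup, recent transformers versions let you pass MeCab options through mecab_kwargs; the dictionary path below is an assumption and should point at your own mecab-ipadic-neologd installation.

from transformers import BertJapaneseTokenizer

tokenizer = BertJapaneseTokenizer.from_pretrained(
    "cl-tohoku/bert-base-japanese-whole-word-masking",
    mecab_kwargs={"mecab_option": "-d /usr/lib/mecab/dic/mecab-ipadic-neologd"},
)
print(tokenizer.tokenize("青葉山で植物の研究をしています。"))

Comparing this output with the default tokenizer's output shows how large the segmentation difference actually is for your data.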

[Question] How to mask tokens

Hello,

Please tell me how to apply masking.

In masked_lm_example.ipynb, the input is
['[CLS]', '青葉', '##山', 'で', '[MASK]', 'の', '研究', 'を', 'し', 'て', 'い', 'ます', '。', '[SEP]']
If I instead change it to
['[CLS]', '青葉', '##山', 'で', '植物', 'の', '研究', 'を', 'し', 'て', 'い', 'ます', '。', '[SEP]']
and want the model to predict the part corresponding to 青葉山, should I mask the two positions, as in
['[CLS]', '[MASK]', '[MASK]', 'で', '植物', 'の', '研究', 'を', 'し', 'て', 'い', 'ます', '。', '[SEP]']
?

Thank you in advance.
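A sketch of how masking two positions could be tried; each [MASK] is predicted independently from its own output distribution (there is no joint decoding here), and the checkpoint name is an assumption.

import torch
from transformers import BertJapaneseTokenizer, BertForMaskedLM

name = "cl-tohoku/bert-base-japanese-whole-word-masking"
tokenizer = BertJapaneseTokenizer.from_pretrained(name)
model = BertForMaskedLM.from_pretrained(name)

# Two adjacent [MASK] tokens standing in for 青葉 / ##山
text = f"{tokenizer.mask_token}{tokenizer.mask_token}で植物の研究をしています。"
input_ids = tokenizer.encode(text, return_tensors="pt")
masked_positions = torch.where(input_ids == tokenizer.mask_token_id)[1]

with torch.no_grad():
    logits = model(input_ids)[0]

for pos in masked_positions.tolist():
    top_ids = logits[0, pos].topk(5).indices.tolist()
    print(pos, tokenizer.convert_ids_to_tokens(top_ids))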

Add new vocabs

What is the easiest way to add domain-specific content to bert-japanese?
Thanks in advance.

BertJapaneseTokenizer can find 'cl-tohoku/bert-base-japanese-whole-word-masking' but BertModel cannot

During preprocessing, the following line has no problem.

    self.tokenizer = BertJapaneseTokenizer.from_pretrained('cl-tohoku/bert-base-japanese-whole-word-masking')

However, during training, I get the following error

Model name 'cl-tohoku/bert-base-japanese-whole-word-masking' was not found in model name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc).

from

BertModel.from_pretrained('cl-tohoku/bert-base-japanese-whole-word-masking')

Any idea?

In both cases, I installed pytorch-transformers with pip. Thanks in advance for your help.
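A hedged guess at the cause: the cl-tohoku checkpoints are registered in the newer transformers package, while the model-name list in the error message is the one shipped with the older pytorch-transformers. Upgrading and importing from transformers is the usual fix; the version pin below is an assumption.

# pip install "transformers>=2.9.0"
from transformers import BertJapaneseTokenizer, BertModel

name = "cl-tohoku/bert-base-japanese-whole-word-masking"
tokenizer = BertJapaneseTokenizer.from_pretrained(name)
model = BertModel.from_pretrained(name)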

[Question] About the Char model

Hi, thank you for sharing this project.
I want to ask the reason for the MeCab tokenization in the Char model.
Is there any difference between "directly split into characters" and "first MeCab tokenization and then split into characters"?

Swap MeCab tokenizer with SentencePiece: possible?

Hi @cl-tohoku, I wanted to get my hands dirty with your model to fine-tune a POS model. Looking at your model card, I wanted to test the model with the recently released Hosted Inference API from Hugging Face when I got this error: ⚠️ This model could not be loaded by the inference API. ⚠️ Error loading tokenizer: No module named 'MeCab' (ModuleNotFoundError("No module named 'MeCab'")). Correct me if I'm wrong, but wouldn't it be possible to swap out the MeCab-based tokenizer for SentencePiece using these pretrained weights?

Unable to find .ckpt file

This is a repository of pretrained Japanese BERT models. The pretrained models are available along with the source code of pretraining.

Hi Team,
As mentioned above, I am unable to find the .ckpt file. My intention is to host the model as a service, and I need the files below to do so. Could you let me know about this?

├── model.ckpt.data-00000-of-00001
├── model.ckpt.index
├── model.ckpt.meta

Help on using the model for finetuning

The part about tokenizing with MeCab is clear, but what about the sub-word tokenization? And what if there are words in the fine-tuning data that were not in the pretraining data? Some guide on using your pretrained model would be great.
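On the out-of-vocabulary question, a small hedged illustration: words unseen during pretraining are not an error, because WordPiece backs off to subword pieces, and only characters missing from the vocabulary become [UNK]. The example word is arbitrary.

from transformers import BertJapaneseTokenizer

tokenizer = BertJapaneseTokenizer.from_pretrained("cl-tohoku/bert-base-japanese")
# A word unlikely to be a single vocabulary entry; it will likely be split into subword pieces.
print(tokenizer.tokenize("新型コロナウイルス感染症"))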

How to add new vocabulary to vocab.txt

Hi Team,

I want to add new domain-specific words to the tokenizer vocabulary so that I can get better word segmentation (wakachi-gaki) for words that are not in the default vocab.txt.

Is this the correct way?
1: Manually add the words at the bottom of vocab.txt (after the last line).
2: Initialize the tokenizer as below:
tokenizer = BertJapaneseTokenizer.from_pretrained("{Directory Path to vocab.txt and config.json etc...}")
Thanks,
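A hedged alternative to editing vocab.txt by hand: register the new words as added tokens and resize the embedding matrix. The new embedding rows are randomly initialized, so some further training is usually needed before they become useful; the example words are placeholders.

from transformers import BertJapaneseTokenizer, BertModel

name = "cl-tohoku/bert-base-japanese-whole-word-masking"
tokenizer = BertJapaneseTokenizer.from_pretrained(name)
model = BertModel.from_pretrained(name)

num_added = tokenizer.add_tokens(["新語A", "新語B"])  # hypothetical domain-specific words
model.resize_token_embeddings(len(tokenizer))
print(num_added, len(tokenizer))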

About Pre-Training times

Hello.
I will be using the code in this repository to pre-train with a Cloud TPU (v3-8).
I have only run 1,000 steps so far, and at that rate it would take about 4 days to complete 1,000,000 steps.
How many hours (or days) did your pre-training take using a Cloud TPU (v3-8)?

strange tokenizer results with self-pretrained model

Hi, I trained a new vocabulary and BERT model on my own dataset following your scripts, with the MeCab dictionary changed.
But when I examine it, quite strange results come back every time. Would you please help me check this and give me some advice?

Details as below:
My code:

import torch  # needed for torch.where below
from transformers import BertJapaneseTokenizer, BertForMaskedLM

model_name_or_path = "/content/drive/MyDrive/bert/new_bert/" 
tokenizer = BertJapaneseTokenizer.from_pretrained(model_name_or_path, mecab_kwargs={"mecab_option": "-d /content/drive/MyDrive/UniDic"})

model = BertForMaskedLM.from_pretrained(model_name_or_path)
input_ids = tokenizer.encode(f"青葉山で{tokenizer.mask_token}の研究をしています。", return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(input_ids[0].tolist()))

masked_index = torch.where(input_ids == tokenizer.mask_token_id)[1][0].tolist()
print(masked_index)

result = model(input_ids)
pred_ids = result[0][:, masked_index].topk(5).indices.tolist()[0]
for pred_id in pred_ids:
    output_ids = input_ids.tolist()[0]
    output_ids[masked_index] = pred_id
    print(tokenizer.decode(output_ids))

the result:

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'BertTokenizer'. 
The class this function is called from is 'BertJapaneseTokenizer'.
Some weights of the model checkpoint at /content/drive/MyDrive/bert/new_bert/ were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
['[CLS]', '青葉', '##山', '##で', '[MASK]', 'の', '##研', '##究', '##を', '##して', '##いま', '##す', '。', '[SEP]']
4
[CLS] 青葉山で ヒダ の研究をしています 。 [SEP]
[CLS] 青葉山で 宿つ の研究をしています 。 [SEP]
[CLS] 青葉山で 法外 の研究をしています 。 [SEP]
[CLS] 青葉山で 頑丈 の研究をしています 。 [SEP]
[CLS] 青葉山で弱 の研究をしています 。 [SEP]

The tokenization result is quite odd, as shown below, followed by the prediction results.

['[CLS]', '青葉', '##山', '##で', '[MASK]', 'の', '##研', '##究', '##を', '##して', '##いま', '##す', '。', '[SEP]']

But when I switch to your pretrained tokenizer bert-base-v2 (still using my model), the results change a lot.

Some weights of the model checkpoint at /content/drive/MyDrive/kindai_bert/kindai_bert/ were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
['[CLS]', '青葉', '山', 'で', '[MASK]', 'の', '研究', 'を', 'し', 'て', 'い', 'ます', '。', '[SEP]']
4
[CLS] 青葉 山 で 宮司 の 研究 を し て い ます 。 [SEP]
[CLS] 青葉 山 で 飛翔 の 研究 を し て い ます 。 [SEP]
[CLS] 青葉 山 で 旧来 の 研究 を し て い ます 。 [SEP]
[CLS] 青葉 山 で 生野 の 研究 を し て い ます 。 [SEP]
[CLS] 青葉 山 でד の 研究 を し て い ます 。 [SEP]

My local BERT folder is laid out as follows:
[screenshot of the model directory omitted]

Thank you in advance.
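One hedged thing to check, suggested by the "tokenizer class ... is 'BertTokenizer'" warning: the tokenizer_config.json saved with the new model may not record BertJapaneseTokenizer or its MeCab settings. Re-saving the tokenizer with explicit options writes a config that matches how the vocabulary was built; the option values below are assumptions and should match your pretraining setup.

from transformers import BertJapaneseTokenizer

tokenizer = BertJapaneseTokenizer(
    vocab_file="/content/drive/MyDrive/bert/new_bert/vocab.txt",
    word_tokenizer_type="mecab",
    subword_tokenizer_type="wordpiece",
    mecab_kwargs={"mecab_option": "-d /content/drive/MyDrive/UniDic"},
)
tokenizer.save_pretrained("/content/drive/MyDrive/bert/new_bert/")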

SSL error

Max retries exceeded with url: //home/my_username/JapaneseBERTModel/cl-tohoku/bert-base-japanese-whole-word-masking//resolve/main/vocab.txt (Caused by SSLError(SSLError(1, '[SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:1123)')))

The results seem different from Hugging Face...

Thank you for the great model. I tried this model on our lab experiment machine, but the results seem different from those on Hugging Face.

I used this model:
https://huggingface.co/cl-tohoku/bert-base-japanese-whole-word-masking?text=%E3%83%AA%E3%83%B3%E3%82%B4%5BMASK%5D%E9%A3%9F%E3%81%B9%E3%82%8B%E3%80%82

And I wrote:
リンゴ[MASK]食べる。

The model on the web gives that:
リンゴ を 食べる 。
0.870
リンゴ も 食べる 。
0.108
リンゴ は 食べる 。
0.009
リンゴ のみ 食べる 。
0.005
リンゴ とともに 食べる 。
0.001

Then I downloaded the model and ran it locally. The output is:
['リンゴ', '[MASK]', '食べる', '。']
Some weights of the model checkpoint at /home/Xu_Zhenyu/JapaneseBERTModel/cl-tohoku/bert-base-japanese-whole-word-masking/ were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']

  • This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    0 を
    1 、
    2 も
    3 野菜
    4 で

The results [を も は のみ とともに] and [を 、 も 野菜 で] are different. Why?

And I have another question: the web demo shows scores such as 0.870, 0.108, 0.009, etc.
How can I get those numbers locally?

Thank you for your time.
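A hedged sketch of how the probabilities shown by the hosted widget can be reproduced locally: take the logits at the masked position, apply a softmax, and read off the top-k. Note also that the local token list above has no [CLS]/[SEP], which by itself can change the top predictions; tokenizer.encode adds the special tokens by default.

import torch
from transformers import BertJapaneseTokenizer, BertForMaskedLM

name = "cl-tohoku/bert-base-japanese-whole-word-masking"
tokenizer = BertJapaneseTokenizer.from_pretrained(name)
model = BertForMaskedLM.from_pretrained(name)

input_ids = tokenizer.encode(f"リンゴ{tokenizer.mask_token}食べる。", return_tensors="pt")
masked_index = torch.where(input_ids == tokenizer.mask_token_id)[1][0]

with torch.no_grad():
    logits = model(input_ids)[0]

probs = torch.softmax(logits[0, masked_index], dim=-1)
top = probs.topk(5)
for p, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(tokenizer.convert_ids_to_tokens([idx])[0], round(p, 3))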

Will tokenizer remove stopwords?

I'm using Hugging Face's Japanese tokenizer,
'cl-tohoku/bert-base-japanese-whole-word-masking'. Will it remove stopwords automatically in the tokenizer and model?

AttributeError: 'MecabBertTokenizer' object has no attribute 'vocab'

Traceback (most recent call last):
  File "C:\Users\jsx.p2\pool\plugins\org.python.pydev.core_7.7.0.202008021154\pysrc\pydevd.py", line 3206, in <module>
    main()
  File "C:\Users\jsx.p2\pool\plugins\org.python.pydev.core_7.7.0.202008021154\pysrc\pydevd.py", line 3199, in main
    globals = debugger.run(setup['file'], None, None, is_module)
  File "C:\Users\jsx.p2\pool\plugins\org.python.pydev.core_7.7.0.202008021154\pysrc\pydevd.py", line 2273, in run
    return self._exec(is_module, entry_point_fn, module_name, file, globals, locals)
  File "C:\Users\jsx.p2\pool\plugins\org.python.pydev.core_7.7.0.202008021154\pysrc\pydevd.py", line 2280, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "C:\Users\jsx.p2\pool\plugins\org.python.pydev.core_7.7.0.202008021154\pysrc\_pydev_imps\_pydev_execfile.py", line 25, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "D:\coding\NLP_AI\BERT\japanese_bert_1_0\masked_lm_main.py", line 32, in <module>
    tokenizer = MecabBertTokenizer(vocab_file=f'{BERT_BASE_DIR}/vocab.txt')
  File "D:\coding\NLP_AI\BERT\japanese_bert_1_0\tokenization.py", line 53, in __init__
    self.max_len_single_sentence = self.max_len - 2  # take into account special tokens
  File "C:\ProgramData\Anaconda3\lib\site-packages\transformers\tokenization_utils_base.py", line 1066, in max_len_single_sentence
    if value == self.model_max_length - self.num_special_tokens_to_add(pair=False) and self.verbose:
  File "C:\ProgramData\Anaconda3\lib\site-packages\transformers\tokenization_utils.py", line 254, in num_special_tokens_to_add
    return len(self.build_inputs_with_special_tokens(token_ids_0, token_ids_1 if pair else None))
  File "C:\ProgramData\Anaconda3\lib\site-packages\transformers\tokenization_bert.py", line 256, in build_inputs_with_special_tokens
    return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
  File "C:\ProgramData\Anaconda3\lib\site-packages\transformers\tokenization_utils_base.py", line 858, in cls_token_id
    return self.convert_tokens_to_ids(self.cls_token)
  File "C:\ProgramData\Anaconda3\lib\site-packages\transformers\tokenization_utils.py", line 384, in convert_tokens_to_ids
    return self._convert_token_to_id_with_added_voc(tokens)
  File "C:\ProgramData\Anaconda3\lib\site-packages\transformers\tokenization_utils.py", line 397, in _convert_token_to_id_with_added_voc
    return self._convert_token_to_id(token)
  File "C:\ProgramData\Anaconda3\lib\site-packages\transformers\tokenization_bert.py", line 224, in _convert_token_to_id
    return self.vocab.get(token, self.vocab.get(self.unk_token))
AttributeError: 'MecabBertTokenizer' object has no attribute 'vocab'

=============
(base) C:\Users\jsx>pip list
Package Version


absl-py 0.9.0
alabaster 0.7.12
amqp 1.4.9
anaconda-client 1.7.2
anaconda-navigator 1.9.12
anaconda-project 0.8.3
antiorm 1.2.1
anyjson 0.3.3
APScheduler 3.6.3
argh 0.26.2
arrow 0.13.0
asari 0.0.4
asgiref 3.2.10
asn1crypto 1.3.0
astroid 2.4.2
astropy 4.0.1.post1
astunparse 1.6.3
atomicwrites 1.4.0
attrs 19.3.0
Automat 20.2.0
autopep8 1.5.3
Babel 2.8.0
backcall 0.2.0
backports.functools-lru-cache 1.6.1
backports.shutil-get-terminal-size 1.0.0
backports.tempfile 1.0
backports.weakref 1.0.post1
bcrypt 3.1.7
beautifulsoup4 4.9.1
billiard 3.3.0.23
bitarray 1.4.0
bkcharts 0.2
bleach 3.1.5
blis 0.4.1
bokeh 2.1.1
boto 2.49.0
boto3 1.14.35
botocore 1.17.35
Bottleneck 1.3.2
breadability 0.1.20
brotlipy 0.7.0
cachetools 4.1.1
catalogue 1.0.0
celery 3.1.26.post2
certifi 2020.6.20
cffi 1.14.0
chardet 3.0.4
click 7.1.2
cloudpickle 1.5.0
clyent 1.2.2
colorama 0.4.3
comtypes 1.1.7
conda 4.8.3
conda-build 3.18.11
conda-package-handling 1.7.0
conda-verify 3.4.2
constantly 15.1.0
contextlib2 0.6.0.post1
contextvars 2.4
crispy-forms-gds 0.2.1
cryptography 2.9.2
cssselect 1.1.0
cycler 0.10.0
cymem 2.0.3
Cython 0.29.14
cytoolz 0.10.1
dartsclone 0.9.0
dask 2.20.0
dataclasses 0.7
db 0.1.1
db-sqlite3 0.0.1
decorator 4.4.2
defusedxml 0.6.0
diff-match-patch 20200713
dill 0.2.9
distributed 2.20.0
Django 3.1
django-bootstrap3 14.1.0
django-celery 3.3.1
django-crispy-forms 1.9.2
django-crontab 0.7.1
django-debug-toolbar 2.2
django-filter 2.3.0
django-import-export 2.3.0
django-mathfilters 1.0.0
django-paypal 1.0.0
docker 4.2.2
docopt 0.6.2
docutils 0.15.2
en-core-web-md 2.2.5
en-core-web-sm 2.2.5
entrypoints 0.3
et-xmlfile 1.0.1
fake-useragent 0.1.11
fastcache 1.1.0
filelock 3.0.12
fix-yahoo-finance 0.1.37
flake8 3.8.3
Flask 1.1.2
Flask-APScheduler 1.11.0
Flask-SQLAlchemy 2.4.4
Flask-WTF 0.14.3
fsspec 0.7.4
future 0.18.2
gast 0.3.3
gensim 3.6.0
gensim-sum-ext 0.1.2
gevent 20.6.2
ginza 3.1.2
glob2 0.7
gmpy2 2.0.8
google-auth 1.20.0
google-auth-oauthlib 0.4.1
google-pasta 0.2.0
greenlet 0.4.16
grpcio 1.31.0
h5py 2.10.0
HeapDict 1.0.1
helpdev 0.7.1
html5lib 1.1
hyperlink 20.0.1
idna 2.10
imageio 2.9.0
imagesize 1.2.0
immutables 0.14
importlib-metadata 1.7.0
incremental 17.5.0
intervaltree 3.0.2
ipykernel 5.3.2
ipython 7.16.1
ipython-genutils 0.2.0
ipywidgets 7.5.1
isort 4.3.21
itsdangerous 1.1.0
ja-ginza 3.1.0
ja-ginza-dict 3.1.0
Janome 0.3.10
jdcal 1.4.1
jedi 0.17.1
Jinja2 2.11.2
jmespath 0.10.0
joblib 0.16.0
json5 0.9.5
jsonschema 3.2.0
jupyter 1.0.0
jupyter-client 6.1.6
jupyter-console 6.1.0
jupyter-core 4.6.3
jupyterlab 2.1.5
jupyterlab-server 1.2.0
Keras 2.4.3
Keras-Preprocessing 1.1.2
keyring 21.2.1
kiwisolver 1.2.0
kombu 3.0.37
lazy-object-proxy 1.4.3
libarchive-c 2.9
llvmlite 0.33.0+1.g022ab0f
locket 0.2.0
lxml 4.5.2
Markdown 3.2.2
MarkupPy 1.14
MarkupSafe 1.1.1
matplotlib 3.2.2
mccabe 0.6.1
mecab 0.996.2
mecab-python 1.0.0
mecab-python3 1.0.1
menuinst 1.4.16
mistune 0.8.4
mkl-fft 1.0.14
mkl-random 1.0.4
mkl-service 2.3.0
mock 4.0.2
more-itertools 8.4.0
mpmath 1.1.0
msgpack 0.5.6
multipledispatch 0.6.0
multitasking 0.0.9
murmurhash 1.0.2
mysql-connector 2.2.9
mysqlclient 2.0.1
navigator-updater 0.2.1
nbconvert 5.6.1
nbformat 5.0.7
networkx 2.4
nltk 3.5
nose 1.3.7
notebook 6.0.3
numba 0.50.1
numexpr 2.7.1
numpy 1.18.5
numpydoc 1.1.0
oauthlib 3.1.0
odfpy 1.4.1
olefile 0.46
openpyxl 3.0.4
opt-einsum 3.3.0
packaging 20.4
pandas 1.0.5
pandas-datareader 0.9.0
pandocfilters 1.4.2
paramiko 2.7.1
parso 0.7.0
partd 1.1.0
path 13.1.0
pathlib2 2.3.5
pathtools 0.1.2
patsy 0.5.1
pep8 1.7.1
pexpect 4.8.0
pickleshare 0.7.5
Pillow 7.2.0
pip 20.1.1
pkginfo 1.5.0.1
plac 0.9.6
pluggy 0.13.1
ply 3.11
preshed 3.0.2
prometheus-client 0.8.0
prompt-toolkit 3.0.5
protobuf 3.12.4
psutil 5.7.0
ptyprocess 0.6.0
py 1.9.0
pyasn1 0.4.8
pyasn1-modules 0.2.8
pycodestyle 2.6.0
pycorenlp 0.3.0
pycosat 0.6.3
pycountry 20.7.3
pycparser 2.20
pycrypto 2.6.1
pycurl 7.43.0.5
pydocstyle 5.0.2
pyflakes 2.2.0
Pygments 2.6.1
PyHamcrest 2.0.2
pylint 2.5.3
PyMySQL 0.10.0
PyNaCl 1.4.0
pyodbc 4.0.0-unsupported
pyOpenSSL 19.1.0
pyparsing 2.4.7
pypiwin32 223
PyQt5 5.12.3
PyQt5-sip 12.8.0
PyQtWebEngine 5.12.1
pyquery 1.4.1
pyreadline 2.1
pyrsistent 0.16.0
PySocks 1.7.1
pystan 2.19.1.1
pytest 5.4.3
python-dateutil 2.8.1
python-jsonrpc-server 0.3.4
python-language-server 0.34.1
pytz 2020.1
PyWavelets 1.1.1
pywin32 227
pywin32-ctypes 0.2.0
pywinpty 0.5.7
PyYAML 5.3.1
pyzmq 19.0.2
QDarkStyle 2.8.1
QtAwesome 0.7.2
qtconsole 4.7.5
QtPy 1.9.0
redis 3.2.1
regex 2020.6.8
requests 2.24.0
requests-oauthlib 1.3.0
rope 0.17.0
rsa 4.6
Rtree 0.9.4
ruamel-yaml 0.15.87
s3transfer 0.3.3
sacremoses 0.0.43
scikit-image 0.16.2
scikit-learn 0.23.1
scipy 1.4.1
seaborn 0.10.1
Send2Trash 1.5.0
sentencepiece 0.1.91
setuptools 49.2.0.post20200714
simplegeneric 0.8.1
simplejson 3.17.2
singledispatch 3.4.0.3
six 1.15.0
smart-open 2.1.0
snowballstemmer 2.0.0
sortedcollections 1.2.1
sortedcontainers 2.1.0
soupsieve 2.0.1
spacy 2.2.3
Sphinx 3.1.2
sphinxcontrib-applehelp 1.0.2
sphinxcontrib-devhelp 1.0.2
sphinxcontrib-htmlhelp 1.0.3
sphinxcontrib-jsmath 1.0.1
sphinxcontrib-qthelp 1.0.3
sphinxcontrib-serializinghtml 1.1.4
sphinxcontrib-websupport 1.2.3
spyder 4.1.4
spyder-kernels 1.9.2
SQLAlchemy 1.3.18
sqlparse 0.3.1
srsly 1.0.2
statsmodels 0.11.1
SudachiDict-core 20191030
SudachiPy 0.4.5
sumy 0.8.1
sympy 1.6.1
tables 3.6.1
tablib 2.0.0
tblib 1.6.0
tensorboard 2.3.0
tensorboard-plugin-wit 1.7.0
tensorflow 2.3.0
tensorflow-estimator 2.3.0
tensorflow-hub 0.8.0
termcolor 1.1.0
terminado 0.8.3
testpath 0.4.4
textblob 0.15.3
thinc 7.3.1
threadpoolctl 2.1.0
tokenization 1.0.7
tokenizers 0.8.1rc1
toml 0.10.1
toolz 0.10.0
torch 1.1.0
torchvision 0.3.0
tornado 6.0.4
tqdm 4.47.0
traitlets 4.3.3
transformers 3.0.2
Twisted 20.3.0
typed-ast 1.4.1
typing-extensions 3.7.4.2
tzlocal 2.1
ujson 1.35
unicodecsv 0.14.1
unidic-lite 1.0.7
urllib3 1.25.9
vine 1.3.0
wasabi 0.7.1
watchdog 0.10.3
wcwidth 0.2.5
webencodings 0.5.1
websocket-client 0.57.0
Werkzeug 1.0.1
wheel 0.34.2
widgetsnbextension 3.5.1
win-inet-pton 1.1.0
win-unicode-console 0.5
wincertstore 0.2
wrapt 1.12.1
WTForms 2.3.3
xlrd 1.2.0
XlsxWriter 1.2.9
xlwings 0.19.5
xlwt 1.3.0
xmltodict 0.12.0
yahoo-finance 1.4.0
YahooFinanceSpider 0.3
yapf 0.30.0
yfinance 0.1.54
zict 2.0.0
zipp 3.1.0
zmq 0.0.0
zope.event 4.4
zope.interface 4.7.1

'Can't convert ['test.txt'] to Trainer' when training a BertWordPieceTokenizer

Hi Team,

I'm trying to train a Japanese BERT with my own data based on yours, and didn't modify the structure.
But when I pass the training data path to train a tokenizer, it goes wrong every time;
the error is "Can't convert ['test.txt'] to Trainer".

Here is what I tried:

  1. Passing a single filename, or the content of a single file (within the same folder as the train_tokenizers.py file); the error appears.
  2. Passing a list of filenames like
    ['data_file/0.txt', 'data_file/1.txt', 'data_file/2.txt', 'data_file/3.txt', 'data_file/4.txt']
    or a list of single sentences; the same error also occurs.

Can you give any advice on this situation? Thanks a lot.
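A hedged guess: this error usually appears when the installed tokenizers version expects a Trainer object where a list of file paths was passed. The high-level BertWordPieceTokenizer accepts the file paths directly; the settings below are assumptions for illustration.

from tokenizers import BertWordPieceTokenizer

files = ["data_file/0.txt", "data_file/1.txt"]  # your corpus files
tokenizer = BertWordPieceTokenizer(lowercase=False, handle_chinese_chars=False)
tokenizer.train(files=files, vocab_size=32000, min_frequency=2)
tokenizer.save_model(".")  # writes vocab.txt (the save method name may differ in very old tokenizers versions)

If the error persists, checking that the installed tokenizers version matches the one pinned in this repository's requirements is another thing to try.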

Cannot run the example masked_lm_example.ipynb

As the title says,

I am trying to re-run the example file from the README on Google Colab and keep getting this error.

input_ids = tokenizer.encode(f'''
    青葉山で{tokenizer.mask_token}の研究をしています。
''', return_tensors='pt')
ERROR:root:An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line string', (1, 0))

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-7-f8582275f4db> in <module>()
      1 input_ids = tokenizer.encode(f'''
      2     青葉山で{tokenizer.mask_token}の研究をしています。
----> 3 ''', return_tensors='pt')

8 frames
/usr/local/lib/python3.6/dist-packages/transformers/tokenization_bert_japanese.py in tokenize(self, text, never_split, **kwargs)
    205                 break
    206 
--> 207             token, _ = line.split("\t")
    208             token_start = text.index(token, cursor)
    209             token_end = token_start + len(token)

ValueError: too many values to unpack (expected 2)

Get the last output of the model 'cl-tohoku/bert-base-japanese-char-whole-word-masking'

The tokenizer is good for Japanese, but I want to get the last hidden layer output of the model above.
I am following the instructions on Hugging Face:

tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-whole-word-masking")
model = AutoModelWithLMHead.from_pretrained("cl-tohoku/bert-base-japanese-whole-word-masking")
input_ids = torch.tensor(tokenizer.encode(text, add_special_tokens=True)).unsqueeze(0) # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]

Then I got len(outputs) = 1.
The expected last_hidden_states shape is (batch, seq_len, d_model), but I got (batch, seq_len, vocab_size).

How can I get the shape (batch, seq_len, d_model) from your model?
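A hedged sketch: AutoModelWithLMHead returns vocabulary logits, which is why the last dimension equals the vocab size. Loading the bare encoder with AutoModel (or BertModel) gives the last hidden states directly, with shape (batch, seq_len, hidden_size).

import torch
from transformers import AutoTokenizer, AutoModel

name = "cl-tohoku/bert-base-japanese-char-whole-word-masking"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

input_ids = torch.tensor(
    tokenizer.encode("青葉山で植物の研究をしています。", add_special_tokens=True)
).unsqueeze(0)  # batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]
print(last_hidden_states.shape)  # (1, seq_len, 768) for the base-size model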

Getting some weights not used warning

Hi, I get the following warning when loading a checkpoint trained using your pretrained model. Is there anything wrong with it?

Some weights of the model checkpoint at /content/drive/My Drive/pretrainedBertJa/bert-base-japanese-whole-word-masking were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']

  • This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
  • This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    Some weights of BertForSequenceClassification were not initialized from the model checkpoint at /content/drive/My Drive/pretrainedBertJa/bert-base-japanese-whole-word-masking and are newly initialized: ['classifier.weight', 'classifier.bias']
    You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

tensorflow version

Hi,
I have a question about the TensorFlow version. When installing tensorflow-gpu==1.1.14, I cannot find that version; I can use version 1.14.0, but then the GPU is not used. Is the required version really 1.1.14? Thanks for your help.

License for models

Hi!

First off, thanks a lot for making and sharing these models.

Second, is there a license that can be applied to the pre-trained models themselves?

Thanks,
Josh

Could not download jawiki-20230102

I was about to do the comparison experiment;
however, I could not download the jawiki-20230102 dump.

Could you kindly share this wiki data with me?

Much appreciated.

transformers' Japanese vocab doesn't have "ゑ", but has "ヱ"

The following is the reproduction code:

import transformers
tokenizer = transformers.GPT2JapaneseTokenizer.from_pretrained(
    'gpt2-japanese-vocab.txt')
# the sentence comes from the ja-wikipedia dump (20191201)
text = "また、「紙(かみ)」「絵/画(ゑ)」など、もともと音であるが和語と認識されているものもある。" 
tokens = tokenizer.tokenize(text)
print(tokens)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

and the output is:

['また', '、', '「', '紙', '(', 'かみ', ')', '」', '「', '絵', '/', '画', '(', '[UNK]', ')', '」', 'など', '、', 'もともと', '##音', '##で', '##ある', '##が', '##和', '##語', '##と', '##認', '##識', '##さ', '##れ', '##てい', '##る', '##もの', '##も', '##ある', '。']
[105, 6, 36, 2100, 23, 12060, 24, 38, 36, 1681, 460, 1031, 23, 1, 24, 38, 65, 6, 4830, 28833, 28453, 15162, 28456, 28693, 28702, 28450, 28948, 29369, 28473, 28459, 17033, 28447, 3729, 28478, 15162, 8]

Is this intentional?
