cl-tohoku / bert-japanese
BERT models for Japanese text.
License: Apache License 2.0
Do you have a guide for fine-tuning bert-japanese?
I tried fine-tuning, but the results were not good. It seems I did something wrong.
Since GPU training is a bit expensive, I would like your opinion before I fine-tune again.
Do I need to segment words with mecab-neologd?
Do I need to do anything to the tokenizer before fine-tuning?
Hello,
I get an error when trying to initialize models that rely on your tokenizer from the transformers package's pipeline. Here is code that yields the error as well as the traceback.
from transformers import pipeline
sentiment_analyzer = pipeline(
"sentiment-analysis", model="cl-tohoku/bert-base-japanese", tokenizer="cl-tohoku/bert-base-japanese")
Traceback (most recent call last):
File "<input>", line 3, in <module>
File "C:\Users\gagno\Anaconda3\envs\japanese_admin_scrape\lib\site-packages\transformers\pipelines\__init__.py", line 377, in pipeline
tokenizer = AutoTokenizer.from_pretrained(tokenizer, revision=revision, use_fast=use_fast)
File "C:\Users\gagno\Anaconda3\envs\japanese_admin_scrape\lib\site-packages\transformers\models\auto\tokenization_auto.py", line 391, in from_pretrained
tokenizer_class = tokenizer_class_from_name(tokenizer_class_candidate)
File "C:\Users\gagno\Anaconda3\envs\japanese_admin_scrape\lib\site-packages\transformers\models\auto\tokenization_auto.py", line 294, in tokenizer_class_from_name
if c.__name__ == class_name:
AttributeError: 'NoneType' object has no attribute '__name__'
I followed "Details of pretraining" and have pretrained a model using my own dataset. Could you tell me how to convert the model files (model.ckpt*) so that the model can be used in Transformers library? I will appreciate your help with my problem.
Thank you for your great work. I am writing an international workshop paper and would like to cite your pretrained Japanese model, which I used in it. If you have not yet decided how the model should be cited, please check whether the sample below is OK.
Suzuki Masatoshi(2019) Pretrained Japanese BERT models, GitHub, GitHub repository, https://github.com/cl-tohoku/bert-japanese
Thank you for your help.
The part about tokenizing with Mecab is clear but what about the sub-word tokenization? And what if there are words found in the data used for finetuning but not found in the data used for pretraining? Some guide on using your pretrained model would be great.
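As background for the sub-word question, BERT-style vocabularies are usually applied with WordPiece, a greedy longest-match-first split of each word that MeCab has already separated. The sketch below is a minimal pure-Python illustration of that idea, not the tokenizer shipped with these models; the function name `wordpiece` and the toy vocabulary are assumptions made for the example. It shows the two cases asked about: a word that splits into known pieces, and a word with no matching pieces, which falls back to the unknown token.

```python
def wordpiece(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first split of one pre-segmented word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # mark word-internal pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # Nothing matches at this position: the whole word becomes unknown.
            return [unk_token]
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary, invented for illustration only.
toy_vocab = {"青葉", "##山", "研究", "で", "[UNK]"}

seen = wordpiece("青葉山", toy_vocab)    # splits into known pieces
unseen = wordpiece("東北", toy_vocab)    # no piece matches
print(seen)
print(unseen)
```

With a real vocabulary, most words unseen during pretraining still decompose into shorter pieces or single characters, so the unknown token tends to appear only for characters missing from the vocabulary.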
Thank you @singletongue for releasing new BERT models at Hugging Face, but their config.json does not include "tokenizer_class": "BertJapaneseTokenizer", thus Transformers' AutoTokenizer will use BertTokenizerFast. Please compare the new config.json with the old one, and please check the blog here written in Japanese.
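For a locally downloaded copy of a model, the missing key can be added by hand. The sketch below is only an illustration of that edit: the helper name `ensure_tokenizer_class` and the shortened config content are assumptions, and whether AutoTokenizer then picks the intended class may depend on the installed Transformers version.

```python
import json


def ensure_tokenizer_class(config_text, tokenizer_class="BertJapaneseTokenizer"):
    """Return config JSON text that declares tokenizer_class, leaving other keys alone."""
    config = json.loads(config_text)
    # setdefault keeps an existing value and only fills in a missing one.
    config.setdefault("tokenizer_class", tokenizer_class)
    return json.dumps(config, ensure_ascii=False, indent=2)


# Hypothetical, heavily shortened config.json content for illustration.
new_config = '{"model_type": "bert", "vocab_size": 32768}'
patched = json.loads(ensure_tokenizer_class(new_config))
print(patched["tokenizer_class"])
```

Loading the tokenizer explicitly with BertJapaneseTokenizer.from_pretrained, as other posts on this page do, avoids relying on AutoTokenizer's class lookup at all.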
Hi, I'm following your scripts to train a BERT model on my own datasets. I trained the tokenizer and created the pretraining data locally, and I am about to upload the tfrecord files to a Google Cloud Storage (GCS) bucket for training. Do I need to upload your [ tokenization.py ] to replace the one provided by the git-cloned google-bert when training the model? Thanks for your help.
For cl-tohoku/bert-base-japanese-v2, I guess you pretrained a TensorFlow 2 model and then converted it into a PyTorch model.
But how did you do that?
I ask because I can successfully convert a TensorFlow v1 model into a PyTorch model, but not a TensorFlow v2 model.
Hi, I am helping to develop a large language model, and defining the license of a model has become important recently. Therefore,
I would like to ask how the license is defined for Tohoku BERT.
I found the following issue, which defines the license of the pretrained models as CC-BY-SA, following the Wikipedia license.
#7
However, the current license of the pretrained models is Apache 2.0, and the rationale for this change was not disclosed in the commit.
89e406e
What prompted this change?
I couldn't find the ja-wiki dump used to train the language model. Is it accessible anywhere and how large is it (in GB, tokens, etc.)?
I'm using your pretrained model via huggingface transformers.
In "Details of pretraining", neologd is used, whereas "Requirements" asks only for a mecab-python3 install, which means just the original ipadic.
The difference may be small after WordPiece/BPE, but I think the results are not exactly the same.
Nice to meet you.
Please tell me how to apply masks.
In masked_lm_example.ipynb, where the tokens are
['[CLS]', '青葉', '##山', 'で', '[MASK]', 'の', '研究', 'を', 'し', 'て', 'い', 'ます', '。', '[SEP]']
if I change them to
['[CLS]', '青葉', '##山', 'で', '植物', 'の', '研究', 'を', 'し', 'て', 'い', 'ます', '。', '[SEP]']
and want the model to predict the part corresponding to 青葉山, should I mask two positions, like
['[CLS]', '[MASK]', '[MASK]', 'で', '植物', 'の', '研究', 'を', 'し', 'て', 'い', 'ます', '。', '[SEP]']
?
Thank you in advance.
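Masking every sub-word piece of a word can be sketched as follows. This is a minimal pure-Python illustration, with the helper name `mask_word` invented for the example; it only builds the masked token list and does not run the model. Note that a masked language model scores each [MASK] position separately, so the top predictions at the two positions are not guaranteed to combine into one natural word, and the number of masks fixes the number of pieces in advance.

```python
def mask_word(tokens, start, mask_token="[MASK]", prefix="##"):
    """Replace the token at `start` and its following ## pieces with mask tokens."""
    masked = list(tokens)
    masked[start] = mask_token
    i = start + 1
    # Continuation pieces of the same word carry the ## prefix.
    while i < len(masked) and masked[i].startswith(prefix):
        masked[i] = mask_token
        i += 1
    return masked


tokens = ['[CLS]', '青葉', '##山', 'で', '植物', 'の', '研究', 'を',
          'し', 'て', 'い', 'ます', '。', '[SEP]']
masked = mask_word(tokens, 1)
print(masked)
```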
What is the easiest way to add specific domain content to bert-japanese?
Thanks in advance.
During preprocessing, the following line has no problem.
self.tokenizer = BertJapaneseTokenizer.from_pretrained('cl-tohoku/bert-base-japanese-whole-word-masking')
However, during training, I get the following error
Model name 'cl-tohoku/bert-base-japanese-whole-word-masking' was not found in model name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc).
from
BertModel.from_pretrained('cl-tohoku/bert-base-japanese-whole-word-masking')
Any idea?
In both cases, I installed pytorch-transformers with pip. Thanks in advance for your help.
Hi, thank you for sharing this project.
I want to ask the reason for the MeCab tokenization in the Char model.
Is there any difference between "directly split into characters" and "first MeCab tokenization and then split into characters"?
Hi @cl-tohoku, I wanted to get my hands dirty with your model and fine-tune a POS model. On your model card I tried to test the model with the recently released Hosted inference API from Hugging Face, and I got this error: ⚠️ This model could not be loaded by the inference API. ⚠️ Error loading tokenizer No module named 'MeCab' ModuleNotFoundError("No module named 'MeCab'"). Correct me if I'm wrong, but wouldn't it be possible to swap out the MeCab-based tokenizer for sentencepiece while using these pretrained weights?
This is a repository of pretrained Japanese BERT models. The pretrained models are available along with the source code of pretraining.
Hi Team,
As mentioned above, I am unable to find the .ckpt files. My intention is to host the model as a service, and I need the files below for that. Could you let me know where to find them?
├── model.ckpt.data-00000-of-00001
├── model.ckpt.index
├── model.ckpt.meta
Hi Team,
I want to add new domain-specific words to the tokenizer vocabulary so that I get better word segmentation (wakachi-gaki) for words that are not in the default vocab.txt.
Is this the correct way?
1: Manually add the words at the bottom of vocab.txt (after the last line)
2: Initialize the tokenizer as below
tokenizer = BertJapaneseTokenizer.from_pretrained("{Directory Path to vocab.txt and config.json etc...}")
Thanks,
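The vocabulary side of this can be sketched in plain Python. The helper name `extend_vocab`, the toy vocabulary, and the domain terms below are assumptions for illustration. Appending at the end matters because a token's id is its line number in vocab.txt, so existing ids stay unchanged.

```python
def extend_vocab(vocab_lines, new_words):
    """Append unseen words to the end of a vocab list, keeping existing ids stable."""
    known = set(vocab_lines)
    extended = list(vocab_lines)
    for word in new_words:
        if word not in known:  # skip words that already have an id
            extended.append(word)
            known.add(word)
    return extended


# Toy vocabulary and hypothetical domain terms, for illustration only.
vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "研究", "植物"]
extended = extend_vocab(vocab, ["植物", "光合成", "葉緑体"])
token_to_id = {token: idx for idx, token in enumerate(extended)}
print(token_to_id["光合成"], len(extended))
```

Two caveats apply when doing this with a real model. The pretrained embedding matrix has rows only for the original vocabulary, so the model has to be resized (in Transformers, `model.resize_token_embeddings(len(tokenizer))`), and the new rows start untrained until further fine-tuning or pretraining. Also, MeCab segments the text before the vocabulary lookup, so a new entry only helps if MeCab emits that word as a single unit.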
Nice to meet you.
I will be using this gitlab code to pre-train on Cloud TPU (v3-8).
I have only run 1,000 steps so far, and at that rate it would take about 4 days to complete 1,000,000 steps.
How many hours (or days) did pre-training with this code take on Cloud TPU (v3-8)?
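The extrapolation in this question is simple arithmetic, sketched below. It uses only the figures quoted in the post (4 days for 1,000,000 steps) and assumes a constant per-step time; the function names are invented for the example.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def seconds_per_step(total_days, total_steps):
    """Average step time implied by a total duration."""
    return total_days * SECONDS_PER_DAY / total_steps


def days_for(steps, sec_per_step):
    """Wall-clock days needed for `steps` at a constant step time."""
    return steps * sec_per_step / SECONDS_PER_DAY


rate = seconds_per_step(total_days=4, total_steps=1_000_000)
print(f"{rate:.4f} s/step")
print(f"{1000 * rate / 60:.2f} min per 1000 steps")
```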
Hi, I trained a new vocab and BERT model on my own datasets following your scripts, with the MeCab dictionary changed.
But when I examine it, quite strange results are returned every time. Would you please help me check this and give me some advice?
Details are below.
My code:
import torch
from transformers import BertJapaneseTokenizer, BertForMaskedLM
model_name_or_path = "/content/drive/MyDrive/bert/new_bert/"
tokenizer = BertJapaneseTokenizer.from_pretrained(model_name_or_path, mecab_kwargs={"mecab_option": "-d /content/drive/MyDrive/UniDic"})
model = BertForMaskedLM.from_pretrained(model_name_or_path)
input_ids = tokenizer.encode(f"青葉山で{tokenizer.mask_token}の研究をしています。", return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(input_ids[0].tolist()))
# Position of the first [MASK] token in the sequence
masked_index = torch.where(input_ids == tokenizer.mask_token_id)[1][0].tolist()
print(masked_index)
result = model(input_ids)
# Ids of the five highest-scoring tokens at the masked position
pred_ids = result[0][:, masked_index].topk(5).indices.tolist()[0]
for pred_id in pred_ids:
    output_ids = input_ids.tolist()[0]
    output_ids[masked_index] = pred_id
    print(tokenizer.decode(output_ids))
the result:
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'BertTokenizer'.
The class this function is called from is 'BertJapaneseTokenizer'.
Some weights of the model checkpoint at /content/drive/MyDrive/bert/new_bert/ were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
['[CLS]', '青葉', '##山', '##で', '[MASK]', 'の', '##研', '##究', '##を', '##して', '##いま', '##す', '。', '[SEP]']
4
[CLS] 青葉山で ヒダ の研究をしています 。 [SEP]
[CLS] 青葉山で 宿つ の研究をしています 。 [SEP]
[CLS] 青葉山で 法外 の研究をしています 。 [SEP]
[CLS] 青葉山で 頑丈 の研究をしています 。 [SEP]
[CLS] 青葉山で弱 の研究をしています 。 [SEP]
First, the tokenization result is quite odd, as shown below, and so are the predictions.
['[CLS]', '青葉', '##山', '##で', '[MASK]', 'の', '##研', '##究', '##を', '##して', '##いま', '##す', '。', '[SEP]']
But when I switch to your pre-trained tokenizer bert-base-v2 (still using my model), the result changes a lot.
Some weights of the model checkpoint at /content/drive/MyDrive/kindai_bert/kindai_bert/ were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
['[CLS]', '青葉', '山', 'で', '[MASK]', 'の', '研究', 'を', 'し', 'て', 'い', 'ます', '。', '[SEP]']
4
[CLS] 青葉 山 で 宮司 の 研究 を し て い ます 。 [SEP]
[CLS] 青葉 山 で 飛翔 の 研究 を し て い ます 。 [SEP]
[CLS] 青葉 山 で 旧来 の 研究 を し て い ます 。 [SEP]
[CLS] 青葉 山 で 生野 の 研究 を し て い ます 。 [SEP]
[CLS] 青葉 山 でד の 研究 を し て い ます 。 [SEP]
Thank you in advance.
Max retries exceeded with url: //home/my_username/JapaneseBERTModel/cl-tohoku/bert-base-japanese-whole-word-masking//resolve/main/vocab.txt (Caused by SSLError(SSLError(1, '[SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:1123)')))
Thank you for the great model. I tried it on our lab's experiment machine, but the results seem different from those on Hugging Face.
I used this model:
https://huggingface.co/cl-tohoku/bert-base-japanese-whole-word-masking?text=%E3%83%AA%E3%83%B3%E3%82%B4%5BMASK%5D%E9%A3%9F%E3%81%B9%E3%82%8B%E3%80%82
And I wrote:
リンゴ[MASK]食べる。
The model on the web gives:
リンゴ を 食べる 。
0.870
リンゴ も 食べる 。
0.108
リンゴ は 食べる 。
0.009
リンゴ のみ 食べる 。
0.005
リンゴ とともに 食べる 。
0.001
Then I downloaded the model and ran it locally. The output is:
['リンゴ', '[MASK]', '食べる', '。']
Some weights of the model checkpoint at /home/Xu_Zhenyu/JapaneseBERTModel/cl-tohoku/bert-base-japanese-whole-word-masking/ were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
The results [を も は のみ とともに] and [を 、 も 野菜 で] are different. Why?
I have another question: the web version shows scores such as 0.870, 0.108, and 0.009.
How can I get those numbers locally?
Thank you for your time.
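Scores of this kind are typically probabilities obtained by applying softmax to the model's output scores (logits) over the vocabulary at the [MASK] position; that the web widget computes them this way is an assumption here. The sketch below shows the computation in plain Python with made-up logits over a toy five-token vocabulary, so the numbers it prints are not the model's.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, id_to_token, k=5):
    """Return the k most probable (token, probability) pairs."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id_to_token[i], probs[i]) for i in ranked[:k]]


# Made-up logits over a toy vocabulary, for illustration only.
toy_tokens = ["を", "も", "は", "のみ", "で"]
toy_logits = [5.0, 3.0, 0.5, 0.0, -1.0]

ranking = top_k(toy_logits, toy_tokens, k=3)
for token, prob in ranking:
    print(f"{token} {prob:.3f}")
```

With PyTorch, the same step is `torch.softmax(logits, dim=-1)` applied to the logits at the masked index. The Transformers "fill-mask" pipeline also returns a `score` field for each candidate.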
I'm using Hugging Face's Japanese tokenizer.
Its name is 'cl-tohoku/bert-base-japanese-whole-word-masking'. Will the tokenizer and model remove stopwords automatically?
Traceback (most recent call last):
File "C:\Users\jsx.p2\pool\plugins\org.python.pydev.core_7.7.0.202008021154\pysrc\pydevd.py", line 3206, in
main()
File "C:\Users\jsx.p2\pool\plugins\org.python.pydev.core_7.7.0.202008021154\pysrc\pydevd.py", line 3199, in main
globals = debugger.run(setup['file'], None, None, is_module)
File "C:\Users\jsx.p2\pool\plugins\org.python.pydev.core_7.7.0.202008021154\pysrc\pydevd.py", line 2273, in run
return self._exec(is_module, entry_point_fn, module_name, file, globals, locals)
File "C:\Users\jsx.p2\pool\plugins\org.python.pydev.core_7.7.0.202008021154\pysrc\pydevd.py", line 2280, in _exec
pydev_imports.execfile(file, globals, locals) # execute the script
File "C:\Users\jsx.p2\pool\plugins\org.python.pydev.core_7.7.0.202008021154\pysrc_pydev_imps_pydev_execfile.py", line 25, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "D:\coding\NLP_AI\BERT\japanese_bert_1_0\masked_lm_main.py", line 32, in
tokenizer = MecabBertTokenizer(vocab_file=f'{BERT_BASE_DIR}/vocab.txt')
File "D:\coding\NLP_AI\BERT\japanese_bert_1_0\tokenization.py", line 53, in init
self.max_len_single_sentence = self.max_len - 2 # take into account special tokens
File "C:\ProgramData\Anaconda3\lib\site-packages\transformers\tokenization_utils_base.py", line 1066, in max_len_single_sentence
if value == self.model_max_length - self.num_special_tokens_to_add(pair=False) and self.verbose:
File "C:\ProgramData\Anaconda3\lib\site-packages\transformers\tokenization_utils.py", line 254, in num_special_tokens_to_add
return len(self.build_inputs_with_special_tokens(token_ids_0, token_ids_1 if pair else None))
File "C:\ProgramData\Anaconda3\lib\site-packages\transformers\tokenization_bert.py", line 256, in build_inputs_with_special_tokens
return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
File "C:\ProgramData\Anaconda3\lib\site-packages\transformers\tokenization_utils_base.py", line 858, in cls_token_id
return self.convert_tokens_to_ids(self.cls_token)
File "C:\ProgramData\Anaconda3\lib\site-packages\transformers\tokenization_utils.py", line 384, in convert_tokens_to_ids
return self._convert_token_to_id_with_added_voc(tokens)
File "C:\ProgramData\Anaconda3\lib\site-packages\transformers\tokenization_utils.py", line 397, in _convert_token_to_id_with_added_voc
return self._convert_token_to_id(token)
File "C:\ProgramData\Anaconda3\lib\site-packages\transformers\tokenization_bert.py", line 224, in _convert_token_to_id
return self.vocab.get(token, self.vocab.get(self.unk_token))
AttributeError: 'MecabBertTokenizer' object has no attribute 'vocab'
=============
(base) C:\Users\jsx>pip list
Package Version
absl-py 0.9.0
alabaster 0.7.12
amqp 1.4.9
anaconda-client 1.7.2
anaconda-navigator 1.9.12
anaconda-project 0.8.3
antiorm 1.2.1
anyjson 0.3.3
APScheduler 3.6.3
argh 0.26.2
arrow 0.13.0
asari 0.0.4
asgiref 3.2.10
asn1crypto 1.3.0
astroid 2.4.2
astropy 4.0.1.post1
astunparse 1.6.3
atomicwrites 1.4.0
attrs 19.3.0
Automat 20.2.0
autopep8 1.5.3
Babel 2.8.0
backcall 0.2.0
backports.functools-lru-cache 1.6.1
backports.shutil-get-terminal-size 1.0.0
backports.tempfile 1.0
backports.weakref 1.0.post1
bcrypt 3.1.7
beautifulsoup4 4.9.1
billiard 3.3.0.23
bitarray 1.4.0
bkcharts 0.2
bleach 3.1.5
blis 0.4.1
bokeh 2.1.1
boto 2.49.0
boto3 1.14.35
botocore 1.17.35
Bottleneck 1.3.2
breadability 0.1.20
brotlipy 0.7.0
cachetools 4.1.1
catalogue 1.0.0
celery 3.1.26.post2
certifi 2020.6.20
cffi 1.14.0
chardet 3.0.4
click 7.1.2
cloudpickle 1.5.0
clyent 1.2.2
colorama 0.4.3
comtypes 1.1.7
conda 4.8.3
conda-build 3.18.11
conda-package-handling 1.7.0
conda-verify 3.4.2
constantly 15.1.0
contextlib2 0.6.0.post1
contextvars 2.4
crispy-forms-gds 0.2.1
cryptography 2.9.2
cssselect 1.1.0
cycler 0.10.0
cymem 2.0.3
Cython 0.29.14
cytoolz 0.10.1
dartsclone 0.9.0
dask 2.20.0
dataclasses 0.7
db 0.1.1
db-sqlite3 0.0.1
decorator 4.4.2
defusedxml 0.6.0
diff-match-patch 20200713
dill 0.2.9
distributed 2.20.0
Django 3.1
django-bootstrap3 14.1.0
django-celery 3.3.1
django-crispy-forms 1.9.2
django-crontab 0.7.1
django-debug-toolbar 2.2
django-filter 2.3.0
django-import-export 2.3.0
django-mathfilters 1.0.0
django-paypal 1.0.0
docker 4.2.2
docopt 0.6.2
docutils 0.15.2
en-core-web-md 2.2.5
en-core-web-sm 2.2.5
entrypoints 0.3
et-xmlfile 1.0.1
fake-useragent 0.1.11
fastcache 1.1.0
filelock 3.0.12
fix-yahoo-finance 0.1.37
flake8 3.8.3
Flask 1.1.2
Flask-APScheduler 1.11.0
Flask-SQLAlchemy 2.4.4
Flask-WTF 0.14.3
fsspec 0.7.4
future 0.18.2
gast 0.3.3
gensim 3.6.0
gensim-sum-ext 0.1.2
gevent 20.6.2
ginza 3.1.2
glob2 0.7
gmpy2 2.0.8
google-auth 1.20.0
google-auth-oauthlib 0.4.1
google-pasta 0.2.0
greenlet 0.4.16
grpcio 1.31.0
h5py 2.10.0
HeapDict 1.0.1
helpdev 0.7.1
html5lib 1.1
hyperlink 20.0.1
idna 2.10
imageio 2.9.0
imagesize 1.2.0
immutables 0.14
importlib-metadata 1.7.0
incremental 17.5.0
intervaltree 3.0.2
ipykernel 5.3.2
ipython 7.16.1
ipython-genutils 0.2.0
ipywidgets 7.5.1
isort 4.3.21
itsdangerous 1.1.0
ja-ginza 3.1.0
ja-ginza-dict 3.1.0
Janome 0.3.10
jdcal 1.4.1
jedi 0.17.1
Jinja2 2.11.2
jmespath 0.10.0
joblib 0.16.0
json5 0.9.5
jsonschema 3.2.0
jupyter 1.0.0
jupyter-client 6.1.6
jupyter-console 6.1.0
jupyter-core 4.6.3
jupyterlab 2.1.5
jupyterlab-server 1.2.0
Keras 2.4.3
Keras-Preprocessing 1.1.2
keyring 21.2.1
kiwisolver 1.2.0
kombu 3.0.37
lazy-object-proxy 1.4.3
libarchive-c 2.9
llvmlite 0.33.0+1.g022ab0f
locket 0.2.0
lxml 4.5.2
Markdown 3.2.2
MarkupPy 1.14
MarkupSafe 1.1.1
matplotlib 3.2.2
mccabe 0.6.1
mecab 0.996.2
mecab-python 1.0.0
mecab-python3 1.0.1
menuinst 1.4.16
mistune 0.8.4
mkl-fft 1.0.14
mkl-random 1.0.4
mkl-service 2.3.0
mock 4.0.2
more-itertools 8.4.0
mpmath 1.1.0
msgpack 0.5.6
multipledispatch 0.6.0
multitasking 0.0.9
murmurhash 1.0.2
mysql-connector 2.2.9
mysqlclient 2.0.1
navigator-updater 0.2.1
nbconvert 5.6.1
nbformat 5.0.7
networkx 2.4
nltk 3.5
nose 1.3.7
notebook 6.0.3
numba 0.50.1
numexpr 2.7.1
numpy 1.18.5
numpydoc 1.1.0
oauthlib 3.1.0
odfpy 1.4.1
olefile 0.46
openpyxl 3.0.4
opt-einsum 3.3.0
packaging 20.4
pandas 1.0.5
pandas-datareader 0.9.0
pandocfilters 1.4.2
paramiko 2.7.1
parso 0.7.0
partd 1.1.0
path 13.1.0
pathlib2 2.3.5
pathtools 0.1.2
patsy 0.5.1
pep8 1.7.1
pexpect 4.8.0
pickleshare 0.7.5
Pillow 7.2.0
pip 20.1.1
pkginfo 1.5.0.1
plac 0.9.6
pluggy 0.13.1
ply 3.11
preshed 3.0.2
prometheus-client 0.8.0
prompt-toolkit 3.0.5
protobuf 3.12.4
psutil 5.7.0
ptyprocess 0.6.0
py 1.9.0
pyasn1 0.4.8
pyasn1-modules 0.2.8
pycodestyle 2.6.0
pycorenlp 0.3.0
pycosat 0.6.3
pycountry 20.7.3
pycparser 2.20
pycrypto 2.6.1
pycurl 7.43.0.5
pydocstyle 5.0.2
pyflakes 2.2.0
Pygments 2.6.1
PyHamcrest 2.0.2
pylint 2.5.3
PyMySQL 0.10.0
PyNaCl 1.4.0
pyodbc 4.0.0-unsupported
pyOpenSSL 19.1.0
pyparsing 2.4.7
pypiwin32 223
PyQt5 5.12.3
PyQt5-sip 12.8.0
PyQtWebEngine 5.12.1
pyquery 1.4.1
pyreadline 2.1
pyrsistent 0.16.0
PySocks 1.7.1
pystan 2.19.1.1
pytest 5.4.3
python-dateutil 2.8.1
python-jsonrpc-server 0.3.4
python-language-server 0.34.1
pytz 2020.1
PyWavelets 1.1.1
pywin32 227
pywin32-ctypes 0.2.0
pywinpty 0.5.7
PyYAML 5.3.1
pyzmq 19.0.2
QDarkStyle 2.8.1
QtAwesome 0.7.2
qtconsole 4.7.5
QtPy 1.9.0
redis 3.2.1
regex 2020.6.8
requests 2.24.0
requests-oauthlib 1.3.0
rope 0.17.0
rsa 4.6
Rtree 0.9.4
ruamel-yaml 0.15.87
s3transfer 0.3.3
sacremoses 0.0.43
scikit-image 0.16.2
scikit-learn 0.23.1
scipy 1.4.1
seaborn 0.10.1
Send2Trash 1.5.0
sentencepiece 0.1.91
setuptools 49.2.0.post20200714
simplegeneric 0.8.1
simplejson 3.17.2
singledispatch 3.4.0.3
six 1.15.0
smart-open 2.1.0
snowballstemmer 2.0.0
sortedcollections 1.2.1
sortedcontainers 2.1.0
soupsieve 2.0.1
spacy 2.2.3
Sphinx 3.1.2
sphinxcontrib-applehelp 1.0.2
sphinxcontrib-devhelp 1.0.2
sphinxcontrib-htmlhelp 1.0.3
sphinxcontrib-jsmath 1.0.1
sphinxcontrib-qthelp 1.0.3
sphinxcontrib-serializinghtml 1.1.4
sphinxcontrib-websupport 1.2.3
spyder 4.1.4
spyder-kernels 1.9.2
SQLAlchemy 1.3.18
sqlparse 0.3.1
srsly 1.0.2
statsmodels 0.11.1
SudachiDict-core 20191030
SudachiPy 0.4.5
sumy 0.8.1
sympy 1.6.1
tables 3.6.1
tablib 2.0.0
tblib 1.6.0
tensorboard 2.3.0
tensorboard-plugin-wit 1.7.0
tensorflow 2.3.0
tensorflow-estimator 2.3.0
tensorflow-hub 0.8.0
termcolor 1.1.0
terminado 0.8.3
testpath 0.4.4
textblob 0.15.3
thinc 7.3.1
threadpoolctl 2.1.0
tokenization 1.0.7
tokenizers 0.8.1rc1
toml 0.10.1
toolz 0.10.0
torch 1.1.0
torchvision 0.3.0
tornado 6.0.4
tqdm 4.47.0
traitlets 4.3.3
transformers 3.0.2
Twisted 20.3.0
typed-ast 1.4.1
typing-extensions 3.7.4.2
tzlocal 2.1
ujson 1.35
unicodecsv 0.14.1
unidic-lite 1.0.7
urllib3 1.25.9
vine 1.3.0
wasabi 0.7.1
watchdog 0.10.3
wcwidth 0.2.5
webencodings 0.5.1
websocket-client 0.57.0
Werkzeug 1.0.1
wheel 0.34.2
widgetsnbextension 3.5.1
win-inet-pton 1.1.0
win-unicode-console 0.5
wincertstore 0.2
wrapt 1.12.1
WTForms 2.3.3
xlrd 1.2.0
XlsxWriter 1.2.9
xlwings 0.19.5
xlwt 1.3.0
xmltodict 0.12.0
yahoo-finance 1.4.0
YahooFinanceSpider 0.3
yapf 0.30.0
yfinance 0.1.54
zict 2.0.0
zipp 3.1.0
zmq 0.0.0
zope.event 4.4
zope.interface 4.7.1
Hi Team,
I'm trying to train a Japanese BERT on my own data based on yours, and I didn't modify the structure.
But whenever I pass the training data path to train a tokenizer, it goes wrong.
The error is "Can't convert ['test.txt'] to Trainer".
Here's something I tried:
Can you give any advice on this situation? Thanks a lot.
As the title said,
I am trying to re-run the example file from README on Google Colab and keep getting this error.
input_ids = tokenizer.encode(f'''
青葉山で{tokenizer.mask_token}の研究をしています。
''', return_tensors='pt')
ERROR:root:An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line string', (1, 0))
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-7-f8582275f4db> in <module>()
1 input_ids = tokenizer.encode(f'''
2 青葉山で{tokenizer.mask_token}の研究をしています。
----> 3 ''', return_tensors='pt')
8 frames
/usr/local/lib/python3.6/dist-packages/transformers/tokenization_bert_japanese.py in tokenize(self, text, never_split, **kwargs)
205 break
206
--> 207 token, _ = line.split("\t")
208 token_start = text.index(token, cursor)
209 token_end = token_start + len(token)
ValueError: too many values to unpack (expected 2)
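The failing line unpacks the result of `line.split("\t")` into exactly two names, so it raises this ValueError whenever a line contains more than one tab. One possible cause, which is an assumption and not confirmed in this post, is a MeCab dictionary or output format in the Colab environment that emits several tab-separated fields per line. The sketch below reproduces the mechanics in plain Python; the sample lines and function names are hypothetical.

```python
def parse_line_strict(line):
    """Mirrors the failing pattern: requires exactly one tab in the line."""
    token, _ = line.split("\t")
    return token


def parse_line_tolerant(line):
    """Splits at the first tab only, so extra tab-separated fields are harmless."""
    token, _, _features = line.partition("\t")
    return token


two_fields = "青葉\t名詞,固有名詞"                # hypothetical one-tab output line
many_fields = "青葉\tアオバ\tアオバ\t青葉\t名詞"  # hypothetical multi-tab output line

print(parse_line_strict(two_fields))
print(parse_line_tolerant(many_fields))

try:
    parse_line_strict(many_fields)
    strict_failed = False
except ValueError as err:
    strict_failed = True
    print("strict parser failed:", err)
```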
The tokenizer works well for Japanese, but I want to get the output of the last layer of the model above.
I am following the instructions on Hugging Face:
import torch
from transformers import AutoTokenizer, AutoModelWithLMHead
tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-whole-word-masking")
model = AutoModelWithLMHead.from_pretrained("cl-tohoku/bert-base-japanese-whole-word-masking")
input_ids = torch.tensor(tokenizer.encode(text, add_special_tokens=True)).unsqueeze(0) # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]
Then I get len(outputs) = 1.
The expected shape of last_hidden_states is (batch, seq len, dmodel), but I got (batch, seq len, vocab size).
How can I get the shape (batch, seq len, dmodel) from your model?
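A model loaded with a language-modeling head returns vocabulary scores as its first output, which explains the (batch, seq len, vocab size) shape. The sketch below shows two ways to reach the encoder's last hidden states instead. To run without downloading weights it builds a tiny, randomly initialised BERT whose sizes are arbitrary assumptions; with the released checkpoint the same calls would apply after `from_pretrained`.

```python
import torch
from transformers import BertConfig, BertForMaskedLM

# Tiny random model for illustration; sizes are not those of the released models.
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=2,
                    num_attention_heads=2, intermediate_size=64)
model = BertForMaskedLM(config)
model.eval()  # disable dropout so repeated calls give the same values

input_ids = torch.randint(0, config.vocab_size, (1, 6))  # batch of 1, 6 tokens

with torch.no_grad():
    # Option 1: keep the LM head but also ask for all hidden states.
    outputs = model(input_ids, output_hidden_states=True)
    logits = outputs.logits                          # (batch, seq_len, vocab_size)
    last_hidden_state = outputs.hidden_states[-1]    # (batch, seq_len, hidden_size)

    # Option 2: call the underlying encoder directly, bypassing the LM head.
    encoder_only = model.bert(input_ids).last_hidden_state

print(tuple(logits.shape), tuple(last_hidden_state.shape))
```

Loading the checkpoint with `AutoModel` or `BertModel` instead of an LM-head class is a third route: the first output is then the last hidden state directly.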
Hi, I get the following warning when loading a checkpoint trained using your pretrained model. Is there anything wrong with it?
Some weights of the model checkpoint at /content/drive/My Drive/pretrainedBertJa/bert-base-japanese-whole-word-masking were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
hi,
I have a question about the TensorFlow version. When I try to install tensorflow-gpu==1.1.14, that version cannot be found. I can install version 1.14.0, but it does not use the GPU. Is 1.1.14 really the intended version? Thanks for your help.
Hi!
First off, thanks a lot for making and sharing these models.
Second, is there a license that can be applied to the pre-trained models themselves?
Thanks,
Josh
I was about to run a comparison experiment,
but I could not download the jawiki-20230102 version.
Could you kindly share this wiki data with me?
Much appreciated.
The following code reproduces the issue:
import transformers
tokenizer = transformers.GPT2JapaneseTokenizer.from_pretrained(
'gpt2-japanese-vocab.txt')
# the sentence comes from the ja-wikipedia dump (20191201)
text = "また、「紙(かみ)」「絵/画(ゑ)」など、もともと音であるが和語と認識されているものもある。"
tokens = tokenizer.tokenize(text)
print(tokens)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)
and the output is:
['また', '、', '「', '紙', '(', 'かみ', ')', '」', '「', '絵', '/', '画', '(', '[UNK]', ')', '」', 'など', '、', 'もともと', '##音', '##で', '##ある', '##が', '##和', '##語', '##と', '##認', '##識', '##さ', '##れ', '##てい', '##る', '##もの', '##も', '##ある', '。']
[105, 6, 36, 2100, 23, 12060, 24, 38, 36, 1681, 460, 1031, 23, 1, 24, 38, 65, 6, 4830, 28833, 28453, 15162, 28456, 28693, 28702, 28450, 28948, 29369, 28473, 28459, 17033, 28447, 3729, 28478, 15162, 8]
Is this intentional?