himkt / awesome-bert-japanese

📝 A list of pre-trained BERT models for Japanese with word/subword tokenization + vocabulary construction algorithm information

bert bert-models japanese natural-language-processing nlp


awesome-bert-japanese's Issues

Japanese ALBERT

https://zenn.dev/ken_11/articles/a36fa7fc59367d

Raw text segmentation or punctuation

Hello,

Thank you for collecting links to the BERT-based models for Japanese.

I just wanted to ask whether you know of any models or investigations regarding segmentation of raw text (after automatic speech recognition the text is not split at all; it is just a stream of characters). It could be something simple, like splitting the text into sentences, or something more complicated, like adding punctuation to the text. For example, NVIDIA provides punctuation models based on BERT and DistilBERT: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/punctuation_and_capitalization.html

It would be great if there were something for splitting raw text in Japanese.

Japanese BART

I am not sure whether we should include BART, though.
https://tech.stockmark.co.jp/blog/bart-japanese-base-news/

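Tohoku University and NICT

I believe the "vocabulary construction algorithm for subword tokenization" for Tohoku University (a) is SentencePiece.
The script below first trains a vocabulary with SentencePiece and then converts it into the format of BERT's vocab.txt.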

https://github.com/cl-tohoku/bert-japanese/blob/master/build_vocab.py
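To illustrate what "converting a SentencePiece vocabulary into the vocab.txt format" can look like, here is a minimal sketch in plain Python. It is not the logic of the linked build_vocab.py, which may differ. The mapping rule (pieces starting with "▁" become word-initial tokens, all other pieces get a "##" prefix), the special-token list, and the function name `sentencepiece_to_bert_vocab` are assumptions made for this example.

```python
# Hypothetical special tokens placed at the top of vocab.txt (an assumption).
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
WORD_BOUNDARY = "\u2581"  # "▁", the marker SentencePiece puts at the start of a word
SP_CONTROL = {"<unk>", "<s>", "</s>"}  # SentencePiece's default control pieces


def sentencepiece_to_bert_vocab(pieces):
    """Map SentencePiece pieces to BERT-style vocab entries (assumed convention)."""
    vocab = list(SPECIAL_TOKENS)
    seen = set(vocab)
    for piece in pieces:
        if piece in SP_CONTROL:
            continue  # replaced by the BERT special tokens above
        if piece.startswith(WORD_BOUNDARY):
            token = piece[len(WORD_BOUNDARY):]  # word-initial piece: drop the marker
        else:
            token = "##" + piece  # word-internal piece: add continuation prefix
        if token and token not in seen:  # skip the bare "▁" piece and duplicates
            seen.add(token)
            vocab.append(token)
    return vocab


# Toy piece list standing in for the first column of a SentencePiece .vocab file.
pieces = ["<unk>", "<s>", "</s>", "▁東北", "大学", "▁", "▁の", "の"]
bert_vocab = sentencepiece_to_bert_vocab(pieces)
print("\n".join(bert_vocab))  # one token per line, as in vocab.txt
```

With real data, `pieces` would be read from the vocabulary file that SentencePiece training produces, and the printed lines would be written to vocab.txt.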

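For Tohoku University (b), "word -> subword" is character-level, so wouldn't something like Character be a better label?
(Strictly speaking, the "vocabulary construction algorithm for subword tokenization" is SentencePiece trained with the --model_type=char option, but since that is effectively character-level, I think Character is fine.)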

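For NICT (a), I think WordPiece is correct for "word -> subword".
I believe NICT (b) is the "without BPE" model. What "BPE" refers to varies from person to person, but here "without BPE" means "morpheme-level units with no subword splitting", so I think "--" is correct for both "word -> subword" and "vocabulary construction algorithm for subword tokenization".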
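The three "word -> subword" settings discussed in this issue (WordPiece, Character, and "--") can be contrasted with a small sketch. The WordPiece function follows the well-known greedy longest-match-first procedure. The toy vocabulary, the example word, and the function names are hypothetical, and the character variant here emits plain characters with no continuation prefix, which is a simplification.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word against a WordPiece vocab."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # non-initial pieces carry "##"
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # shrink the candidate until it is found in the vocab
        if piece is None:
            return [unk]  # no piece matched: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens


def characters(word):
    """Character-level split: every character is its own token."""
    return list(word)


def no_subword(word):
    """The "--" setting: the word (morpheme) is kept as a single token."""
    return [word]


toy_vocab = {"自然", "##言語", "##処理", "言語"}  # hypothetical vocabulary
word = "自然言語処理"

wp_tokens = wordpiece(word, toy_vocab)
char_tokens = characters(word)
word_tokens = no_subword(word)
print(wp_tokens)
print(char_tokens)
print(word_tokens)
```

In all three cases the input is assumed to be a single unit already produced by a morphological analyzer; only the step after that differs.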
