
phobert's Introduction

Table of contents

  1. Introduction
  2. Using PhoBERT with transformers
  3. Using PhoBERT with fairseq
  4. Notes

PhoBERT: Pre-trained language models for Vietnamese

Pre-trained PhoBERT models are the state-of-the-art language models for Vietnamese (Pho, i.e. "Phở", is a popular food in Vietnam):

  • Two PhoBERT versions, "base" and "large", are the first public large-scale monolingual language models pre-trained for Vietnamese. The PhoBERT pre-training approach is based on RoBERTa, which optimizes the BERT pre-training procedure for more robust performance.
  • PhoBERT outperforms previous monolingual and multilingual approaches, obtaining new state-of-the-art performance on four downstream Vietnamese NLP tasks: part-of-speech tagging, dependency parsing, named-entity recognition, and natural language inference.

The general architecture and experimental results of PhoBERT can be found in our paper:

@inproceedings{phobert,
title     = {{PhoBERT: Pre-trained language models for Vietnamese}},
author    = {Dat Quoc Nguyen and Anh Tuan Nguyen},
booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2020},
year      = {2020},
pages     = {1037--1042}
}

Please CITE our paper when PhoBERT is used to help produce published results or is incorporated into other software.

Using PhoBERT with transformers

Installation

  • Install transformers with pip: pip install transformers, or install transformers from source.
    Note that we merged a slow tokenizer for PhoBERT into the main transformers branch. Merging a fast tokenizer for PhoBERT is still under discussion, as mentioned in this pull request. If you would like to use the fast tokenizer, install transformers from the following branch instead (a quick check is sketched after this list):
git clone --single-branch --branch fast_tokenizers_BARTpho_PhoBERT_BERTweet https://github.com/datquocnguyen/transformers.git
cd transformers
pip3 install -e .
  • Install tokenizers with pip: pip3 install tokenizers
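
If you install the branch above, a quick way to check which tokenizer class gets picked up is shown below (a minimal sketch, not from the original README; the model name is just an example):

import transformers
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base-v2", use_fast=True)
print(transformers.__version__)
print(type(tokenizer).__name__)  # a class name ending in "Fast" indicates the Rust-backed fast tokenizer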

Pre-trained models

Model | #params | Arch. | Max length | Pre-training data | License
vinai/phobert-base-v2 | 135M | base | 256 | 20GB of Wikipedia and News texts + 120GB of texts from OSCAR-2301 | GNU Affero GPL v3
vinai/phobert-base | 135M | base | 256 | 20GB of Wikipedia and News texts | MIT License
vinai/phobert-large | 370M | large | 256 | 20GB of Wikipedia and News texts | MIT License

Example usage

import torch
from transformers import AutoModel, AutoTokenizer

phobert = AutoModel.from_pretrained("vinai/phobert-base-v2")
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base-v2")

# INPUT TEXT MUST BE ALREADY WORD-SEGMENTED!
sentence = 'Chúng_tôi là những nghiên_cứu_viên .'  

input_ids = torch.tensor([tokenizer.encode(sentence)])

with torch.no_grad():
    features = phobert(input_ids)  # features.last_hidden_state holds the contextualized token embeddings

## With TensorFlow 2.0+:
# from transformers import TFAutoModel
# phobert = TFAutoModel.from_pretrained("vinai/phobert-base")
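
The snippet above encodes one sentence at a time. For batches, the tokenizer can also pad and truncate several word-segmented sentences at once and return an attention mask (a minimal sketch reusing tokenizer and phobert from above; the second sentence is just an illustrative, already word-segmented example):

sentences = ['Chúng_tôi là những nghiên_cứu_viên .', 'Hà_Nội là thủ_đô của Việt_Nam .']
batch = tokenizer(sentences, padding=True, truncation=True, max_length=256, return_tensors="pt")

with torch.no_grad():
    outputs = phobert(**batch)

embeddings = outputs.last_hidden_state  # (batch_size, sequence_length, hidden_size)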

Using PhoBERT with fairseq

Please see details HERE!

Notes

In case the input texts are raw, i.e. without word segmentation, a word segmenter must be applied to produce word-segmented texts before feeding them into PhoBERT. As PhoBERT employed the RDRSegmenter from VnCoreNLP to pre-process the pre-training data (including Vietnamese tone normalization and word and sentence segmentation), it is recommended to use the same word segmenter for PhoBERT-based downstream applications on raw input texts.

Installation

pip install py_vncorenlp

Example usage

import py_vncorenlp

# Automatically download VnCoreNLP components from the original repository
# and save them in some local machine folder
py_vncorenlp.download_model(save_dir='/absolute/path/to/vncorenlp')

# Load the word and sentence segmentation component
rdrsegmenter = py_vncorenlp.VnCoreNLP(annotators=["wseg"], save_dir='/absolute/path/to/vncorenlp')

text = "Ông Nguyễn Khắc Chúc  đang làm việc tại Đại học Quốc gia Hà Nội. Bà Lan, vợ ông Chúc, cũng làm việc tại đây."

output = rdrsegmenter.word_segment(text)

print(output)
# ['Ông Nguyễn_Khắc_Chúc đang làm_việc tại Đại_học Quốc_gia Hà_Nội .', 'Bà Lan , vợ ông Chúc , cũng làm_việc tại đây .']
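
Putting the two steps together (a minimal sketch, assuming torch, tokenizer and phobert from the transformers example above are already loaded): segment the raw text first, then feed each word-segmented sentence to PhoBERT.

segmented_sentences = rdrsegmenter.word_segment(text)  # list of word-segmented sentences
inputs = tokenizer(segmented_sentences, padding=True, truncation=True, max_length=256, return_tensors="pt")

with torch.no_grad():
    features = phobert(**inputs).last_hidden_state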

phobert's People

Contributors

datquocnguyen


phobert's Issues

Fine-tune POS tagging

May I ask: after splitting words into subwords, what labels do you assign to the subwords? E.g. (liên_cảng_A5, N) is split into (liên_@@ cảng_@@ A5). I would also like to ask whether you used any additional techniques; I just want to re-run the experiments to check whether the results match the paper.

Semantic search with Sentence Transformers

Currently, I'm a student working on a project related to semantic search over a Vietnamese corpus. My goal is to build a model which can process a user's query and return the sentences or documents most related to that query. My strategy is to use PhoBERT with sentence-transformers to embed sentences & documents into vectors and then store those in a FAISS index for querying.

I have conducted the semantic search experiment described above with both PhoBERT versions (base and large), but it produced very poor results for the Vietnamese corpus, while performing pretty well on the English corpus. After digging in, I realized the problem lay in the quality of the encoding results. Hence, I have a couple of questions:

  1. Have you ever tried PhoBERT with sentence-transformers, or any similar embedding technique, on a Vietnamese corpus for semantic search? What was your approach, and how were the outcomes?
  2. Does sentence embedding with PhoBERT on Vietnamese text require any further preprocessing besides word segmentation?
  3. Is it effective to embed a whole Vietnamese document (~200 words) with sentence-transformers and PhoBERT?

Additionally, I preprocessed the Vietnamese corpus using only word segmentation before passing it to the embedding process.

I am new to NLP, so I would really appreciate it if you could give me some advice on this project.
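
For reference, a minimal sketch of the embedding step described above, using mean pooling over PhoBERT's last hidden state (an illustrative approach, not an official recipe; the input sentences are assumed to be word-segmented already):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
phobert = AutoModel.from_pretrained("vinai/phobert-base")

def embed(sentences):
    # sentences: list of word-segmented Vietnamese strings
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = phobert(**batch).last_hidden_state       # (batch, seq_len, hidden)
    mask = batch["attention_mask"].unsqueeze(-1).float()  # (batch, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # mean-pooled sentence vectors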

Synergy Jina <> PhoBERT

hi VinAI team,

Great work 👍 I'm the founder & CEO of Jina AI; you may have used or heard of my previous work on Fashion-MNIST and bert-as-service. I'm the creator of these two OSS projects.

I'm asking if we can build a synergy between Jina <> PhoBERT (& BERTweet, which I will post about separately). https://github.com/jina-ai/jina

Simply put, Jina is a universal neural search engine: it is search infrastructure that can be used for searching text2text, image2image, audio2audio, etc. We already have examples using Jina for QA and semantic text search here; full examples can be found here.

Potential synergy

  1. I see great potential to apply this in production. Therefore I kindly ask whether you are interested in porting it into jina or jina-hub, so that people can use it as one of their search components in Jina?

  2. If you are interested in long-term collaboration, we also have a Slack channel, where we can invite you to have more discussions. We also welcome your thoughts on this.

phoBERT model output is not compatible with neural machine translation

from transformers import AutoModel, AutoTokenizer
import os
model_choice = 'vinai/phobert-large'
tokenizer = AutoTokenizer.from_pretrained(model_choice)
model = AutoModel.from_pretrained(model_choice)
text = 'Trái_Đất'
batch = tokenizer.prepare_seq2seq_batch(src_texts = [text], return_tensors = 'pt')
translation = model.generate(**batch) # error happens here
tokenizer.batch_decode(translation, skip_special_tokens=True)

The error message 'BaseModelOutputWithPoolingAndCrossAttentions' object has no attribute 'logits' indicates that the phoBERT output is a BaseModelOutputWithPoolingAndCrossAttentions, while the generation utility only works with outputs such as CausalLMOutput, which have a logits attribute.

From reading the official phoBERT pages, I do not assume that machine translation is a supported feature. I just want to ask the authors to verify whether the real reason for the error above is that phoBERT does not support machine translation yet, or whether I am missing something here.

Thanks a lot.
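
As an aside (not part of the original issue), the distinction behind this error can be seen directly in transformers: AutoModel returns a headless encoder whose outputs carry no logits, whereas the task-specific classes add a head. A minimal sketch (PhoBERT was pre-trained with a masked-LM objective, not a translation head):

from transformers import AutoModel, AutoModelForMaskedLM

encoder = AutoModel.from_pretrained("vinai/phobert-base")         # outputs expose last_hidden_state, no .logits
mlm = AutoModelForMaskedLM.from_pretrained("vinai/phobert-base")  # outputs expose .logits over the vocabulary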

Batch_size in downstream task

You chose a batch size of 32 for the NER task. May I ask whether that is the value that gives the best model, or was it chosen just for comparison with other models? I tried batch size = 64 and got better results, so I would like to know the reason. Thank you.

Professor, may I ask more about PhoBERT?

I have read that PhoBERT is mainly intended for tasks such as translation and entity recognition. I am researching sentiment analysis of social media for my organization; can PhoBERT be used for that? Since I am new to this and do not understand it well yet, I would like to ask: do I use PhoBERT in place of word2vec and then use neural networks such as CNN or LSTM to classify positive, negative and neutral, or can PhoBERT alone do the classification?
Thank you!

What's the meaning of position_ids = {0, 1}?

Hi, I notice that at the line here, the position id of the first input token is 2. So what do the first two position ids stand for, and where are they used? Thanks for your patience, and looking forward to your response!

Question about useFast=False when loading PhoBERT Tokenizer

Hi @datquocnguyen ,
I remember you mentioned in some documents that when we use transformers v4, we should add use_fast=False when loading the tokenizer.
Is that still true now? I couldn't find that document again, and your latest README seems to have been updated too.

Another question: what is the difference between use_fast=True and use_fast=False in this case? Does anything change in the output?
Thank you very much.

Named Entity Recognition by PhoBert

Can I ask how to use PhoBERT to recognize named entities? I am trying to do that but I don't know how. I need to get results without having to train it. Is that possible? Many thanks.

Missing `config.json` in `PhoBERT_large_fairseq`

First things first, thank you for publishing the source code of PhoBERT.

The issue is just a minor one: I can't find the config.json file in the PhoBERT_large_fairseq archive downloaded here, so I had to download the base pre-trained model to grab the file.

Have a nice day!

error tokenize

After checking the code of alignment_utils.py, I noticed that bpe_tokens and other_tokens differ for the phrase "gì vậy". In the screenshot:
the first line is bpe_tokens,
the second line is other_tokens,
the third line is ''.join(bpe_tokens),
the fourth line is ''.join(other_tokens).
The phrase "gì vậy" gets tokenized into the two tokens "g" and "<unk>", which leads to the "cannot align" error.

Other phrases, for example "gì thế" or "gì cơ", do not trigger this error.
I would appreciate any help resolving this.

Can we use PhoBERT-base Tokenizer for PhoBERT-large model and vice versa?

Hi @datquocnguyen ,
Both the PhoBERT-base tokenizer and the PhoBERT-large tokenizer have the same vocab size of 64001. So the question is: can we use the PhoBERT-base tokenizer for the PhoBERT-large model and vice versa? In other words, can we use just one of them to tokenize the dataset and then use that prepared tokenized tensor dataset to fine-tune both PhoBERT-base and PhoBERT-large on downstream tasks, to save preparation time? :)

AssertionError when running with fairseq RobertaModel

doc = phoBERT.extract_features_aligned_to_words('Nghe nhiều về ông nhưng đến hôm_nay tôi mới có dịp về ấp Long_Châu 1 xã Thạnh_Mỹ_Tây Châu_Phú_An_Giang để gặp ông .')
OK, but
doc = phoBERT.extract_features_aligned_to_words('nghe nhiều về ông nhưng đến hôm_nay tôi mới có dịp về ấp long_châu 1 xã thạnh_mỹ_tây châu_phú_an_giang để gặp ông .')
Traceback (most recent call last):
File "", line 1, in
File "/home/hoang/anaconda3/lib/python3.6/site-packages/fairseq/models/roberta/hub_interface.py", line 133, in extract_features_aligned_to_words
alignment = alignment_utils.align_bpe_to_words(self, bpe_toks, spacy_toks_ws)
File "/home/hoang/anaconda3/lib/python3.6/site-packages/fairseq/models/roberta/alignment_utils.py", line 39, in align_bpe_to_words
assert "".join(bpe_tokens) == "".join(other_tokens)
AssertionError

and

doc = phoBERT.extract_features_aligned_to_words('chuyên_môn_hoá là xu_hướng của phát_triển việc tốt cũng chuyên_môn_hoá thì quả là tốt quá .')
OK, but in
doc = phoBERT.extract_features_aligned_to_words('Chuyên_môn_hoá là xu_hướng của phát_triển việc tốt cũng chuyên_môn_hoá thì quả là tốt quá.')
Traceback (most recent call last):
File "", line 1, in
File "/home/hoang/anaconda3/lib/python3.6/site-packages/fairseq/models/roberta/hub_interface.py", line 133, in extract_features_aligned_to_words
alignment = alignment_utils.align_bpe_to_words(self, bpe_toks, spacy_toks_ws)
File "/home/hoang/anaconda3/lib/python3.6/site-packages/fairseq/models/roberta/alignment_utils.py", line 39, in align_bpe_to_words
assert "".join(bpe_tokens) == "".join(other_tokens)
AssertionError

Non-consecutive added token '{token}' found.

As the title says, I get the error below when using PhobertTokenizer for a Vietnamese question answering task. Could you please help me fix it? Thank you.
f"Non-consecutive added token '{token}' found. " AssertionError: Non-consecutive added token '<mask>' found. Should have index 5 but has index 64000 in saved vocabulary.
Btw, I have tried setting self.encoder[self.mask_token] = 4; the training process then runs normally, but it doesn't seem like the right way.

Can anyone pass the VLSP 2018 datasets application form to me?

Hi,
I am working on a Vietnamese NER project and I need this dataset. However, the official VLSP site seems to have been down for days. Kindly share with me the application form and the e-mail address to send it to, if possible. Sorry, I don't know whom else to ask (I am a newbie to Vietnamese NLP).

Confirming the POS tagging results

  • I experimented with fine-tuning PhoBERT on the VLSP 2013 POS tagging data.
  • However, the data used in the paper does not seem to match the original VLSP data; for example, the VLSP test set has 2131 examples while the paper reports only 2120.
  • Could you share the VLSP POS tagging data after your modifications, so that I can compare the models as accurately as possible?

Error tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")

I get the following error when running pre-trained PhoBERT:
ValueError Traceback (most recent call last)
in ()
3
4 phobert = AutoModel.from_pretrained("vinai/phobert-base")
----> 5 tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
6
7 # INPUT TEXT MUST BE ALREADY WORD-SEGMENTED!

/usr/local/lib/python3.6/dist-packages/transformers/models/auto/tokenization_auto.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
323 if tokenizer_class is None:
324 raise ValueError(
--> 325 "Tokenizer class {} does not exist or is not currently imported.".format(tokenizer_class_candidate)
326 )
327 return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)

ValueError: Tokenizer class PhobertTokenizerFast does not exist or is not currently imported.

Connection Error when loading rdrsegmenter

Using this code

from vncorenlp import VnCoreNLP
rdrsegmenter = VnCoreNLP("/Absolute-path-to/vncorenlp/VnCoreNLP-1.1.1.jar", annotators="wseg", max_heap_size='-Xmx500m') 

and get the error

ConnectionError: HTTPConnectionPool(host='127.0.0.1', port=51159): Max retries exceeded with url: /annotators (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x10c11f250>: Failed to establish a new connection: [Errno 61] Connection refused'))

Convert PhoBert (RoBerta) to BERT

Hello brother,

I am working on a project that needs PhoBERT, but I cannot adapt it to my model, which previously used BERT. How can I convert between them?

Thank you for your time.

creating a pre-trained model

Hi,

Thank you for releasing the language-specific model along with the instructions.

I want to create a similar language-specific pre-trained model. I was wondering if you could share the pre-training scripts and toy data (and maybe a short write-up) so that it is easier to pre-train similar BERT-based models in another language.

I just have one important question: how do you chunk documents whose text is longer than 512 tokens? Do you simply split at 512 tokens even if the sentence has not ended, and start the next 512-token chunk where the previous chunk ended? Does this take a lot of memory?

Thanks!

transformers

Can you publish this in the Hugging Face transformers library? It's easy to use and easy to deploy.

Question Answering for PhoBert.

I have run the question answering demo of djl library, https://djl.ai/, and it worked well for English.
I have found that VinAI has released PhoBERT, but I don't know how to use it in the djl library.
Can you help me?
Thanks in advance.

DL4J-Bert

POS tagging?

Hi,

I could download the PhoBERT package with transformers; would you know how to do POS tagging with it afterwards?

Thanks!
Pierre

Revert Tokenization nonsense!

Dear,

Before applying the model, I want to check how tokenization works.
Firstly, I run

import transformers
from transformers import AutoModel, AutoTokenizer
embedder = AutoModel.from_pretrained("vinai/phobert-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base", use_fast=False)

to download the artifacts into C:\Users\ADMIN.cache\huggingface\transformers.

I found the vocab file in this folder (already checked against the file on Hugging Face). I built two dictionaries as below:
with open("C:/Users/ADMIN/.cache/huggingface/transformers/970c6224b2713c8b52a7bcfc4d5a951c9bb88302e4523388b50f28284e87ac44.26ba0c8945e559c68d0bc35d24fea16f5463a49fe8f134e0c32261d590b577fa", 'r', encoding='utf-8') as f:
vocab = f.readlines()
vocab = [p[:-1] for p in vocab]
vocab_forward = dict()
vocab_reverse = dict()
for pair in tqdm(vocab):
k, v = pair.split(' ')[:2]
if k in vocab_forward.keys():
vocab_forward[k].append(v)
else:
vocab_forward[k] = [v]
if v in vocab_reverse.keys():
vocab_reverse[v].append(k)
else:
vocab_reverse[v] = [k]

I used VnCoreNLP to segment the sentence into words, as below:
['Công_ty', 'cổ_phần', 'Tập_đoàn', 'FLC', '(', 'FLC', ',', 'mã', 'chứng_khoán', ':', 'FLC', ')', 'vừa', 'công_bố', 'một', 'loạt', 'giao_dịch', 'với', 'các', 'bên', 'liên_quan', 'trong', 'giai_đoạn', '2019-2021', 'theo', 'yêu_cầu', 'của', 'Uỷ_ban', 'Chứng_khoán', 'Nhà_nước', '.']
However, after running tokenization

tokens = tokenizer.encode(sentence)[1:-1]

I cannot recover the original tokens. Below is the code:
for ti, token in enumerate(tokens):
    token = str(token)
    print(sentence[ti], token, vocab_reverse[token] if token in vocab_reverse.keys() else None)
the result is:

Original token: Công_ty
token: 290 - reverted token: ['Yeah@@', 'ngứa_@@', 'ĩ_vãng', 'Lolli@@', 'enei', 'đẽ', 'ト', 'ốc_liệt', 'ng_tín', 'ành_viên', 'thành_niên', 'Ai_C@@']
Original token: cổ_phần
token: 1272 - reverted token: ['asha', 'ưa@@']
Original token: Tập_đoàn
token: 1183 - reverted token: ['khép@@', 'NAM']
Original token: FLC
token: 6505 - reverted token: ['chất_chứa', 'Quốc_Hùng', 'Đoài', 'Liveshow']
Original token: (
token: 20 - reverted token: None
Original token: FLC
token: 6505 - reverted token: ['chất_chứa', 'Quốc_Hùng', 'Đoài', 'Liveshow']
Original token: ,
token: 4 - reverted token: None
Original token: mã
token: 1624 - reverted token: ['Gò_Đ@@', 'động_vật@@']
Original token: chứng_khoán
token: 2476 - reverted token: ['7,6%', 'Hi_Lạp', 'shor@@', 'Tân_Hương', '21-3', 'Fantas@@', 'Lớp_@@', 'Allardyce', 'Y_Tế', 'Liên_Bộ', 'ính_phòng', 'văn_tự', 'isi@@', 'ay_tr@@', 'Hui', 'va_vấp']
Original token: :
token: 27 - reverted token: None
Original token: FLC
token: 6505 - reverted token: ['chất_chứa', 'Quốc_Hùng', 'Đoài', 'Liveshow']
Original token: )
token: 19 - reverted token: None
Original token: vừa
token: 164 - reverted token: ['水', 'Kêu_@@', 'át_nước', 'nh_diện', '真', 'Heyn@@', '府', '구@@', 'trao_@@', 'Tẻ@@', 'TNĐ', '色@@', '_South@@', 'イ', '靜@@', '紅@@', '會@@', '格@@']
Original token: công_bố
token: 576 - reverted token: ['Middles@@', '_trương', '花@@', 'nhộn', 'Dong_Gun', 'Lị@@', 'Don@@']
Original token: một
token: 16 - reverted token: None
Original token: loạt
token: 1406 - reverted token: ['Fin@@', 'ẩm_tra', 'Né', 'giải_th@@']
Original token: giao_dịch
token: 786 - reverted token: ['Wiscon@@', 'thù@@']
Original token: với
token: 15 - reverted token: None
Original token: các
token: 9 - reverted token: None
Original token: bên
token: 145 - reverted token: ['ż', 'ア', 'ghi_nợ', 'Tàm', 'tài_sản@@', '汉@@', '義', 'n_đơn', 'ǒ@@']
Original token: liên_quan
token: 314 - reverted token: ['nhân_hoá', 'ž', 'triển@@', '拉@@', 'Hmeym@@', 'n_lút', 'π', 'ở_thành', 'hiếp', 'ôn_phu', 'phạm_pháp_luật']
Original token: trong
token: 12 - reverted token: None
Original token: giai_đoạn
token: 609 - reverted token: ['n_thuyết', 'Lò_G@@', 'nh_thiêng', 'tư_th@@']
Original token: 2019-2021
token: 3 - reverted token: None
Original token: theo
token: 63 - reverted token: None
Original token: yêu_cầu
token: 285 - reverted token: ['Cocker@@', 'Raphael@@', 'bánh_tr@@', '良@@', 'súng_l@@', 'sổ@@', 'DAR@@']
Original token: của
token: 7 - reverted token: None
Original token: Uỷ_ban
token: 871 - reverted token: ['phương_hoá', 'acca', 'xiêu', 'yol']
Original token: Chứng_khoán
token: 5149 - reverted token: ['Bích_Hằng', 'ack@@', 'Aleksand@@', 'hoang_dại']
Original token: Nhà_nước
token: 544 - reverted token: ['nh_thường_quân', 'Bảo_t@@', 'Cheon@@', 'ambon', 'ững_sờ']
Original token: .
token: 5 - reverted token: None

My question is: is there a problem with the tokenizer, or is the vocab file wrong?
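
As a side note (not part of the original issue): in the output above, many different subwords share the same "reverted token" key, which suggests the second column of that vocab file is a frequency count rather than a token id. A minimal sketch of mapping ids back to subwords with the tokenizer itself, assuming the tokenizer and sentence from the snippets above:

ids = tokenizer.encode(sentence)[1:-1]
print(tokenizer.convert_ids_to_tokens(ids))  # subword pieces, e.g. 'Công_ty', 'cổ_phần', ...
print(tokenizer.decode(ids))                 # the detokenized string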

Wrong URL in the installation of VnCoreNLP's word segmenter

I followed the installation instructions in the "Using VnCoreNLP's word segmenter to pre-process input raw texts" section and then could not start the server.
It turned out that the wordsegmenter.rdr downloaded from the command

wget https://github.com/vncorenlp/VnCoreNLP/blob/master/models/wordsegmenter/wordsegmenter.rdr

is actually an HTML file.
So I replaced it with

wget https://raw.githubusercontent.com/vncorenlp/VnCoreNLP/master/models/wordsegmenter/wordsegmenter.rdr

and it worked!

How to load a trained model

Hello, previously, to load my trained model I loaded it with
self.roberta = RobertaModel(config)
instead of
self.roberta = RobertaModel.from_pretrained("PhoBERT_base_transformers/model.bin", config=config)
(I wrapped it in my own class.)
And I saved the model with torch.save(state_dict).
The problem is that I now don't know where that config file is, or whether there is another way to load the trained model. I hope you can help.
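
A minimal sketch of the usual PyTorch pattern for this situation (the state-dict path is hypothetical, and the config is re-created from the published checkpoint in case the local config.json is lost):

import torch
from transformers import RobertaConfig, RobertaModel

config = RobertaConfig.from_pretrained("vinai/phobert-base")
model = RobertaModel(config)
state_dict = torch.load("my_trained_model.pt", map_location="cpu")  # file previously written with torch.save(state_dict)
model.load_state_dict(state_dict)
model.eval()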

Question about the training process

Hi @datquocnguyen
Can you share some information about the training process?
For example:

  • How many samples per batch?
  • Which optimizer did you use, and with what parameters?
  • Did you use a learning rate schedule? With what parameters?
  • What were the results at the end, e.g. training loss and accuracy?

Thank you so much.

NER task

Please, could you provide a script to run NER tasks? Thanks.
Thanks

BPE Tokenizer

Hi, I'm having a problem with your BPE tokenizer. Given the token 'Trần_văn_thời', I used AutoTokenizer.from_pretrained('vinai/phobert-base', use_fast=False) to convert it into ids and got [1359, 3, 8915], which is ['Trần_@@', '<unk>', 'ời']. However, when I changed the token to 'Trần_văn_Thời', I got [1359, 16398, 4834], which is ['Trần_@@', 'văn_@@', 'Thời']. Another example: the token 'Lê_văn_Tám', when tokenized, gives [1475, 16398, 6813], which is ['Lê_@@', 'văn_@@', 'Tám'].
So the result for 'Trần_văn_thời' is probably wrong. Can you give me an explanation for this? Thank you very much.

HFValidationError

Hi, when I run my training code, I get an HFValidationError. I have made sure that I point to the right path of the model on my server. Can you help me, please?
