
phobert's Introduction

Table of contents

  1. Introduction
  2. Using PhoBERT with transformers
  3. Using PhoBERT with fairseq
  4. Notes

PhoBERT: Pre-trained language models for Vietnamese

Pre-trained PhoBERT models are the state-of-the-art language models for Vietnamese (Pho, i.e. "Phở", is a popular food in Vietnam):

  • Two PhoBERT versions, "base" and "large", are the first public large-scale monolingual language models pre-trained for Vietnamese. The PhoBERT pre-training approach is based on RoBERTa, which optimizes the BERT pre-training procedure for more robust performance.
  • PhoBERT outperforms previous monolingual and multilingual approaches, obtaining new state-of-the-art performance on four downstream Vietnamese NLP tasks: part-of-speech tagging, dependency parsing, named-entity recognition, and natural language inference.

The general architecture and experimental results of PhoBERT can be found in our paper:

@inproceedings{phobert,
title     = {{PhoBERT: Pre-trained language models for Vietnamese}},
author    = {Dat Quoc Nguyen and Anh Tuan Nguyen},
booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2020},
year      = {2020},
pages     = {1037--1042}
}

Please CITE our paper when PhoBERT is used to help produce published results or is incorporated into other software.

Using PhoBERT with transformers

Installation

  • Install transformers with pip: pip install transformers, or install transformers from source.
    Note that we merged a slow tokenizer for PhoBERT into the main transformers branch. Merging a fast tokenizer for PhoBERT is still under discussion, as mentioned in this pull request. If you would like to use the fast tokenizer, install transformers from the following branch instead (a quick check is sketched after this list):
git clone --single-branch --branch fast_tokenizers_BARTpho_PhoBERT_BERTweet https://github.com/datquocnguyen/transformers.git
cd transformers
pip3 install -e .
  • Install tokenizers with pip: pip3 install tokenizers
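
If you install the branch above, a quick way to check which tokenizer class gets picked up is shown below (a minimal sketch, not from the original README; the model name is just an example):

import transformers
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base-v2", use_fast=True)
print(transformers.__version__)
print(type(tokenizer).__name__)  # a class name ending in "Fast" indicates the Rust-backed fast tokenizer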

Pre-trained models

Model | #params | Arch. | Max length | Pre-training data | License
vinai/phobert-base-v2 | 135M | base | 256 | 20GB of Wikipedia and News texts + 120GB of texts from OSCAR-2301 | GNU Affero GPL v3
vinai/phobert-base | 135M | base | 256 | 20GB of Wikipedia and News texts | MIT License
vinai/phobert-large | 370M | large | 256 | 20GB of Wikipedia and News texts | MIT License

Example usage

import torch
from transformers import AutoModel, AutoTokenizer

phobert = AutoModel.from_pretrained("vinai/phobert-base-v2")
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base-v2")

# INPUT TEXT MUST BE ALREADY WORD-SEGMENTED!
sentence = 'Chúng_tôi là những nghiên_cứu_viên .'  

input_ids = torch.tensor([tokenizer.encode(sentence)])

with torch.no_grad():
    features = phobert(input_ids)  # features.last_hidden_state holds the contextualized token embeddings

## With TensorFlow 2.0+:
# from transformers import TFAutoModel
# phobert = TFAutoModel.from_pretrained("vinai/phobert-base")
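
The snippet above encodes one sentence at a time. For batches, the tokenizer can also pad and truncate several word-segmented sentences at once and return an attention mask (a minimal sketch reusing tokenizer and phobert from above; the second sentence is just an illustrative, already word-segmented example):

sentences = ['Chúng_tôi là những nghiên_cứu_viên .', 'Hà_Nội là thủ_đô của Việt_Nam .']
batch = tokenizer(sentences, padding=True, truncation=True, max_length=256, return_tensors="pt")

with torch.no_grad():
    outputs = phobert(**batch)

embeddings = outputs.last_hidden_state  # (batch_size, sequence_length, hidden_size)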

Using PhoBERT with fairseq

Please see details HERE!

Notes

In case the input texts are raw, i.e. without word segmentation, a word segmenter must be applied to produce word-segmented texts before feeding them into PhoBERT. As PhoBERT employed the RDRSegmenter from VnCoreNLP to pre-process the pre-training data (including Vietnamese tone normalization and word and sentence segmentation), it is recommended to use the same word segmenter for PhoBERT-based downstream applications on raw input texts.

Installation

pip install py_vncorenlp

Example usage

import py_vncorenlp

# Automatically download VnCoreNLP components from the original repository
# and save them in some local machine folder
py_vncorenlp.download_model(save_dir='/absolute/path/to/vncorenlp')

# Load the word and sentence segmentation component
rdrsegmenter = py_vncorenlp.VnCoreNLP(annotators=["wseg"], save_dir='/absolute/path/to/vncorenlp')

text = "Ông Nguyễn Khắc Chúc  đang làm việc tại Đại học Quốc gia Hà Nội. Bà Lan, vợ ông Chúc, cũng làm việc tại đây."

output = rdrsegmenter.word_segment(text)

print(output)
# ['Ông Nguyễn_Khắc_Chúc đang làm_việc tại Đại_học Quốc_gia Hà_Nội .', 'Bà Lan , vợ ông Chúc , cũng làm_việc tại đây .']
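
Putting the two steps together (a minimal sketch, assuming torch, tokenizer and phobert from the transformers example above are already loaded): segment the raw text first, then feed each word-segmented sentence to PhoBERT.

segmented_sentences = rdrsegmenter.word_segment(text)  # list of word-segmented sentences
inputs = tokenizer(segmented_sentences, padding=True, truncation=True, max_length=256, return_tensors="pt")

with torch.no_grad():
    features = phobert(**inputs).last_hidden_state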

phobert's People

Contributors

datquocnguyen


phobert's Issues

Fine-tune POS tagging

May I ask: after splitting words into subwords, what labels do you assign to the subwords? E.g. (liên_cảng_A5, N) is split into (liên_@@ cảng_@@ A5). I would also like to ask whether you used any additional techniques; I just want to re-run the experiments to check whether the results match the paper.

Semantic search with Sentence Transformers

Currently, I'm a student working on a project related to semantic search over a Vietnamese corpus. My goal is to build a model which can process a user's query and return the sentences or documents most related to that query. My strategy is to use PhoBERT with sentence-transformers to embed sentences & documents into vectors and then store those in a FAISS index for querying.

I have conducted the semantic search experiment described above with both PhoBERT versions (base and large), but it produced very poor results for the Vietnamese corpus, while performing pretty well on the English corpus. After digging in, I realized the problem lay in the quality of the encoding results. Hence, I have a couple of questions:

  1. Have you ever tried PhoBERT with sentence-transformers, or any similar embedding technique, on a Vietnamese corpus for semantic search? What was your approach, and how were the outcomes?
  2. Does sentence embedding with PhoBERT on Vietnamese text require any further preprocessing besides word segmentation?
  3. Is it effective to embed a whole Vietnamese document (~200 words) with sentence-transformers and PhoBERT?

Additionally, I preprocessed the Vietnamese corpus using only word segmentation before passing it to the embedding process.

I am new to NLP, so I would really appreciate it if you could give me some advice on this project.
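
For reference, a minimal sketch of the embedding step described above, using mean pooling over PhoBERT's last hidden state (an illustrative approach, not an official recipe; the input sentences are assumed to be word-segmented already):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
phobert = AutoModel.from_pretrained("vinai/phobert-base")

def embed(sentences):
    # sentences: list of word-segmented Vietnamese strings
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = phobert(**batch).last_hidden_state       # (batch, seq_len, hidden)
    mask = batch["attention_mask"].unsqueeze(-1).float()  # (batch, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # mean-pooled sentence vectors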

Synergy Jina <> PhoBERT

hi VinAI team,

Great work 👍 I'm the founder & CEO of Jina AI; you may have used or heard of my previous work on Fashion-MNIST and bert-as-service. I'm the creator of these two OSS projects.

I'm asking if we can build a synergy between Jina <> PhoBERT (& BERTweet, which I will post about separately). https://github.com/jina-ai/jina

Simply put, Jina is a universal neural search engine: it is search infrastructure that can be used for searching text2text, image2image, audio2audio, etc. We already have examples using Jina for QA and semantic text search here; full examples can be found here.

Potential synergy

  1. I see great potential to apply this in production. Therefore I kindly ask whether you are interested in porting it into jina or jina-hub, so that people can use it as one of their search components in Jina?

  2. If you are interested in long-term collaboration, we also have a Slack channel, where we can invite you to have more discussions. We also welcome your thoughts on this.

phoBERT model output is not compatible with neural machine translation

from transformers import AutoModel, AutoTokenizer
import os
model_choice = 'vinai/phobert-large'
tokenizer = AutoTokenizer.from_pretrained(model_choice)
model = AutoModel.from_pretrained(model_choice)
text = 'Trái_Đất'
batch = tokenizer.prepare_seq2seq_batch(src_texts = [text], return_tensors = 'pt')
translation = model.generate(**batch) # error happens here
tokenizer.batch_decode(translation, skip_special_tokens=True)

The error message 'BaseModelOutputWithPoolingAndCrossAttentions' object has no attribute 'logits' indicates that the phoBERT output is a BaseModelOutputWithPoolingAndCrossAttentions, while the generation utility only works with outputs such as CausalLMOutput, which have a logits attribute.

From reading the official phoBERT pages, I do not assume that machine translation is a supported feature. I just want to ask the authors to verify whether the real reason for the error above is that phoBERT does not support machine translation yet, or whether I am missing something here.

Thanks a lot.
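
As an aside (not part of the original issue), the distinction behind this error can be seen directly in transformers: AutoModel returns a headless encoder whose outputs carry no logits, whereas the task-specific classes add a head. A minimal sketch (PhoBERT was pre-trained with a masked-LM objective, not a translation head):

from transformers import AutoModel, AutoModelForMaskedLM

encoder = AutoModel.from_pretrained("vinai/phobert-base")         # outputs expose last_hidden_state, no .logits
mlm = AutoModelForMaskedLM.from_pretrained("vinai/phobert-base")  # outputs expose .logits over the vocabulary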

Batch_size in downstream task

You chose a batch size of 32 for the NER task. May I ask whether that is the value that gives the best model, or was it chosen just for comparison with other models? I tried batch size = 64 and got better results, so I would like to know the reason. Thank you.

Professor, may I ask more about PhoBERT?

I have read that PhoBERT is mainly intended for tasks such as translation and entity recognition. I am researching sentiment analysis of social media for my organization; can PhoBERT be used for that? Since I am new to this and do not understand it well yet, I would like to ask: do I use PhoBERT in place of word2vec and then use neural networks such as CNN or LSTM to classify positive, negative and neutral, or can PhoBERT alone do the classification?
Thank you!

What's the meaning of position_ids = {0, 1}?

Hi, I notice that at the line here, the position id of the first input token is 2. So what do the first two position ids stand for, and where are they used? Thanks for your patience, and looking forward to your response!

Question about useFast=False when loading PhoBERT Tokenizer

Hi @datquocnguyen ,
I remember you mentioned in some documents that when we use transformers v4, we should add use_fast=False when loading the tokenizer.
Is that still true now? I couldn't find that document again, and your latest README seems to have been updated too.

Another question: what is the difference between use_fast=True and use_fast=False in this case? Does anything change in the output?
Thank you very much.

Named Entity Recognition by PhoBert

Can I ask how to use PhoBERT to recognize named entities? I am trying to do that but I don't know how. I need to get results without having to train it. Is that possible? Many thanks.

Missing `config.json` in `PhoBERT_large_fairseq`

First things first, thank you for publishing the source code of PhoBERT.

The issue is just a minor one: I can't find the config.json file in the PhoBERT_large_fairseq archive downloaded here, so I had to download the base pre-trained model to grab the file.

Have a nice day!

error tokenize

After checking the code of alignment_utils.py, I noticed that bpe_tokens and other_tokens differ for the phrase "gì vậy". In the screenshot:
the first line is bpe_tokens,
the second line is other_tokens,
the third line is ''.join(bpe_tokens),
the fourth line is ''.join(other_tokens).
The phrase "gì vậy" gets tokenized into the two tokens "g" and "<unk>", which leads to the "cannot align" error.

Other phrases, for example "gì thế" or "gì cơ", do not trigger this error.
I would appreciate any help resolving this.

Can we use PhoBERT-base Tokenizer for PhoBERT-large model and vice versa?

Hi @datquocnguyen ,
Both the PhoBERT-base tokenizer and the PhoBERT-large tokenizer have the same vocab size of 64001. So the question is: can we use the PhoBERT-base tokenizer for the PhoBERT-large model and vice versa? In other words, can we use just one of them to tokenize the dataset and then use that prepared tokenized tensor dataset to fine-tune both PhoBERT-base and PhoBERT-large on downstream tasks, to save preparation time? :)

AssertionError when running with fairseq RobertaModel

doc = phoBERT.extract_features_aligned_to_words('Nghe nhiều về ông nhưng đến hôm_nay tôi mới có dịp về ấp Long_Châu 1 xã Thạnh_Mỹ_Tây Châu_Phú_An_Giang để gặp ông .')
OK, but
doc = phoBERT.extract_features_aligned_to_words('nghe nhiều về ông nhưng đến hôm_nay tôi mới có dịp về ấp long_châu 1 xã thạnh_mỹ_tây châu_phú_an_giang để gặp ông .')
Traceback (most recent call last):
File "", line 1, in
File "/home/hoang/anaconda3/lib/python3.6/site-packages/fairseq/models/roberta/hub_interface.py", line 133, in extract_features_aligned_to_words
alignment = alignment_utils.align_bpe_to_words(self, bpe_toks, spacy_toks_ws)
File "/home/hoang/anaconda3/lib/python3.6/site-packages/fairseq/models/roberta/alignment_utils.py", line 39, in align_bpe_to_words
assert "".join(bpe_tokens) == "".join(other_tokens)
AssertionError

and

doc = phoBERT.extract_features_aligned_to_words('chuyên_môn_hoá là xu_hướng của phát_triển việc tốt cũng chuyên_môn_hoá thì quả là tốt quá .')
OK, but in
doc = phoBERT.extract_features_aligned_to_words('Chuyên_môn_hoá là xu_hướng của phát_triển việc tốt cũng chuyên_môn_hoá thì quả là tốt quá.')
Traceback (most recent call last):
File "", line 1, in
File "/home/hoang/anaconda3/lib/python3.6/site-packages/fairseq/models/roberta/hub_interface.py", line 133, in extract_features_aligned_to_words
alignment = alignment_utils.align_bpe_to_words(self, bpe_toks, spacy_toks_ws)
File "/home/hoang/anaconda3/lib/python3.6/site-packages/fairseq/models/roberta/alignment_utils.py", line 39, in align_bpe_to_words
assert "".join(bpe_tokens) == "".join(other_tokens)
AssertionError

Non-consecutive added token '{token}' found.

As the title says, I get the error below when using PhobertTokenizer for a Vietnamese question answering task. Could you please help me fix it? Thank you.
f"Non-consecutive added token '{token}' found. " AssertionError: Non-consecutive added token '<mask>' found. Should have index 5 but has index 64000 in saved vocabulary.
Btw, I have tried setting self.encoder[self.mask_token] = 4; the training process then runs normally, but it doesn't seem like the right way.

Can anyone pass the VLSP 2018 datasets application form to me?

Hi,
I am working on a Vietnamese NER project and I need this dataset. However, the official VLSP site seems to have been down for days. Kindly share with me the application form and the e-mail address to send it to, if possible. Sorry, I don't know whom else to ask (I am a newbie to Vietnamese NLP).

Confirming the POS tagging results

  • I experimented with fine-tuning PhoBERT on the VLSP 2013 POS tagging data.
  • However, the data used in the paper does not seem to match the original VLSP data; for example, the VLSP test set has 2131 examples while the paper reports only 2120.
  • Could you share the VLSP POS tagging data after your modifications, so that I can compare the models as accurately as possible?

Error tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")

I get the following error when running pre-trained PhoBERT:
ValueError Traceback (most recent call last)
in ()
3
4 phobert = AutoModel.from_pretrained("vinai/phobert-base")
----> 5 tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
6
7 # INPUT TEXT MUST BE ALREADY WORD-SEGMENTED!

/usr/local/lib/python3.6/dist-packages/transformers/models/auto/tokenization_auto.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
323 if tokenizer_class is None:
324 raise ValueError(
--> 325 "Tokenizer class {} does not exist or is not currently imported.".format(tokenizer_class_candidate)
326 )
327 return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)

ValueError: Tokenizer class PhobertTokenizerFast does not exist or is not currently imported.

Connection Error when loading rdrsegmenter

Using this code

from vncorenlp import VnCoreNLP
rdrsegmenter = VnCoreNLP("/Absolute-path-to/vncorenlp/VnCoreNLP-1.1.1.jar", annotators="wseg", max_heap_size='-Xmx500m') 

and get the error

ConnectionError: HTTPConnectionPool(host='127.0.0.1', port=51159): Max retries exceeded with url: /annotators (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x10c11f250>: Failed to establish a new connection: [Errno 61] Connection refused'))

Convert PhoBert (RoBerta) to BERT

Hello brother,

I am working on a project that needs PhoBERT, but I cannot adapt it to my model, which previously used BERT. How can I convert between them?

Thank you for your time.

creating a pre-trained model

Hi,

Thank you for releasing the language-specific model along with the instructions.

I want to create a similar language-specific pre-trained model. I was wondering if you could share the pre-training scripts and toy data (and maybe a short write-up) so that it is easier to pre-train similar BERT-based models in another language.

I just have one important question: how do you chunk documents whose text is longer than 512 tokens? Do you simply split at 512 tokens even if the sentence has not ended, and start the next 512-token chunk where the previous chunk ended? Does this take a lot of memory?

Thanks!

transformers

Can you publish this in the Hugging Face transformers library? It's easy to use and easy to deploy.

Question Answering for PhoBert.

I have run the question answering demo of djl library, https://djl.ai/, and it worked well for English.
I have found that VinAI has released PhoBERT, but I don't know how to use it in the djl library.
Can you help me?
Thanks in advance.

DL4J-Bert

POS tagging?

Hi,

I could download the PhoBERT package with transformers; would you know how to do POS tagging with it afterwards?

Thanks!
Pierre

Revert Tokenization nonsense!

Dear,

Before applying the model, I want to check how tokenization works.
Firstly, I run

import transformers
from transformers import AutoModel, AutoTokenizer
embedder = AutoModel.from_pretrained("vinai/phobert-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base", use_fast=False)

to download the artifacts into C:\Users\ADMIN.cache\huggingface\transformers.

I found the vocab file in this folder (already checked against the file on Hugging Face). I built two dictionaries as below:
with open("C:/Users/ADMIN/.cache/huggingface/transformers/970c6224b2713c8b52a7bcfc4d5a951c9bb88302e4523388b50f28284e87ac44.26ba0c8945e559c68d0bc35d24fea16f5463a49fe8f134e0c32261d590b577fa", 'r', encoding='utf-8') as f:
vocab = f.readlines()
vocab = [p[:-1] for p in vocab]
vocab_forward = dict()
vocab_reverse = dict()
for pair in tqdm(vocab):
k, v = pair.split(' ')[:2]
if k in vocab_forward.keys():
vocab_forward[k].append(v)
else:
vocab_forward[k] = [v]
if v in vocab_reverse.keys():
vocab_reverse[v].append(k)
else:
vocab_reverse[v] = [k]

I used VnCoreNLP to segment the sentence into words, as below:
['Công_ty', 'cổ_phần', 'Tập_đoàn', 'FLC', '(', 'FLC', ',', 'mã', 'chứng_khoán', ':', 'FLC', ')', 'vừa', 'công_bố', 'một', 'loạt', 'giao_dịch', 'với', 'các', 'bên', 'liên_quan', 'trong', 'giai_đoạn', '2019-2021', 'theo', 'yêu_cầu', 'của', 'Uỷ_ban', 'Chứng_khoán', 'Nhà_nước', '.']
However, after running tokenization

tokens = tokenizer.encode(sentence)[1:-1]

I cannot recover the original tokens. Below is the code:
for ti, token in enumerate(tokens):
    token = str(token)
    print(sentence[ti], token, vocab_reverse[token] if token in vocab_reverse.keys() else None)
the result is:

Original token: Công_ty
token: 290 - reverted token: ['Yeah@@', 'ngứa_@@', 'ĩ_vãng', 'Lolli@@', 'enei', 'đẽ', 'ト', 'ốc_liệt', 'ng_tín', 'ành_viên', 'thành_niên', 'Ai_C@@']
Original token: cổ_phần
token: 1272 - reverted token: ['asha', 'ưa@@']
Original token: Tập_đoàn
token: 1183 - reverted token: ['khép@@', 'NAM']
Original token: FLC
token: 6505 - reverted token: ['chất_chứa', 'Quốc_Hùng', 'Đoài', 'Liveshow']
Original token: (
token: 20 - reverted token: None
Original token: FLC
token: 6505 - reverted token: ['chất_chứa', 'Quốc_Hùng', 'Đoài', 'Liveshow']
Original token: ,
token: 4 - reverted token: None
Original token: mã
token: 1624 - reverted token: ['Gò_Đ@@', 'động_vật@@']
Original token: chứng_khoán
token: 2476 - reverted token: ['7,6%', 'Hi_Lạp', 'shor@@', 'Tân_Hương', '21-3', 'Fantas@@', 'Lớp_@@', 'Allardyce', 'Y_Tế', 'Liên_Bộ', 'ính_phòng', 'văn_tự', 'isi@@', 'ay_tr@@', 'Hui', 'va_vấp']
Original token: :
token: 27 - reverted token: None
Original token: FLC
token: 6505 - reverted token: ['chất_chứa', 'Quốc_Hùng', 'Đoài', 'Liveshow']
Original token: )
token: 19 - reverted token: None
Original token: vừa
token: 164 - reverted token: ['水', 'Kêu_@@', 'át_nước', 'nh_diện', '真', 'Heyn@@', '府', '구@@', 'trao_@@', 'Tẻ@@', 'TNĐ', '色@@', '_South@@', 'イ', '靜@@', '紅@@', '會@@', '格@@']
Original token: công_bố
token: 576 - reverted token: ['Middles@@', '_trương', '花@@', 'nhộn', 'Dong_Gun', 'Lị@@', 'Don@@']
Original token: một
token: 16 - reverted token: None
Original token: loạt
token: 1406 - reverted token: ['Fin@@', 'ẩm_tra', 'Né', 'giải_th@@']
Original token: giao_dịch
token: 786 - reverted token: ['Wiscon@@', 'thù@@']
Original token: với
token: 15 - reverted token: None
Original token: các
token: 9 - reverted token: None
Original token: bên
token: 145 - reverted token: ['ż', 'ア', 'ghi_nợ', 'Tàm', 'tài_sản@@', '汉@@', '義', 'n_đơn', 'ǒ@@']
Original token: liên_quan
token: 314 - reverted token: ['nhân_hoá', 'ž', 'triển@@', '拉@@', 'Hmeym@@', 'n_lút', 'π', 'ở_thành', 'hiếp', 'ôn_phu', 'phạm_pháp_luật']
Original token: trong
token: 12 - reverted token: None
Original token: giai_đoạn
token: 609 - reverted token: ['n_thuyết', 'Lò_G@@', 'nh_thiêng', 'tư_th@@']
Original token: 2019-2021
token: 3 - reverted token: None
Original token: theo
token: 63 - reverted token: None
Original token: yêu_cầu
token: 285 - reverted token: ['Cocker@@', 'Raphael@@', 'bánh_tr@@', '良@@', 'súng_l@@', 'sổ@@', 'DAR@@']
Original token: của
token: 7 - reverted token: None
Original token: Uỷ_ban
token: 871 - reverted token: ['phương_hoá', 'acca', 'xiêu', 'yol']
Original token: Chứng_khoán
token: 5149 - reverted token: ['Bích_Hằng', 'ack@@', 'Aleksand@@', 'hoang_dại']
Original token: Nhà_nước
token: 544 - reverted token: ['nh_thường_quân', 'Bảo_t@@', 'Cheon@@', 'ambon', 'ững_sờ']
Original token: .
token: 5 - reverted token: None

My question is: is there a problem with the tokenizer, or is the vocab file wrong?
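
As a side note (not part of the original issue): in the output above, many different subwords share the same "reverted token" key, which suggests the second column of that vocab file is a frequency count rather than a token id. A minimal sketch of mapping ids back to subwords with the tokenizer itself, assuming the tokenizer and sentence from the snippets above:

ids = tokenizer.encode(sentence)[1:-1]
print(tokenizer.convert_ids_to_tokens(ids))  # subword pieces, e.g. 'Công_ty', 'cổ_phần', ...
print(tokenizer.decode(ids))                 # the detokenized string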

Wrong URL in the installation of VnCoreNLP's word segmenter

I followed the installation instructions in the "Using VnCoreNLP's word segmenter to pre-process input raw texts" section and then could not start the server.
It turned out that the wordsegmenter.rdr downloaded from the command

wget https://github.com/vncorenlp/VnCoreNLP/blob/master/models/wordsegmenter/wordsegmenter.rdr

is actually an HTML file.
So I replaced it with

wget https://raw.githubusercontent.com/vncorenlp/VnCoreNLP/master/models/wordsegmenter/wordsegmenter.rdr

and it worked!

How to load a trained model

Hello, previously, to load my trained model I loaded it with
self.roberta = RobertaModel(config)
instead of
self.roberta = RobertaModel.from_pretrained("PhoBERT_base_transformers/model.bin", config=config)
(I wrapped it in my own class.)
And I saved the model with torch.save(state_dict).
The problem is that I now don't know where that config file is, or whether there is another way to load the trained model. I hope you can help.
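
A minimal sketch of the usual PyTorch pattern for this situation (the state-dict path is hypothetical, and the config is re-created from the published checkpoint in case the local config.json is lost):

import torch
from transformers import RobertaConfig, RobertaModel

config = RobertaConfig.from_pretrained("vinai/phobert-base")
model = RobertaModel(config)
state_dict = torch.load("my_trained_model.pt", map_location="cpu")  # file previously written with torch.save(state_dict)
model.load_state_dict(state_dict)
model.eval()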

Question about the training process

Hi @datquocnguyen
Can you share some information about the training process?
For example:

  • How many samples per batch?
  • Which optimizer did you use, and with what parameters?
  • Did you use a learning rate schedule? With what parameters?
  • What were the results at the end, e.g. training loss and accuracy?

Thank you so much.

NER task

Please, could you provide a script to run NER tasks? Thanks.
Thanks

BPE Tokenizer

Hi, I'm having a problem with your BPE tokenizer. Given the token 'Trần_văn_thời', I used AutoTokenizer.from_pretrained('vinai/phobert-base', use_fast=False) to convert it into ids and got [1359, 3, 8915], which is ['Trần_@@', '<unk>', 'ời']. However, when I changed the token to 'Trần_văn_Thời', I got [1359, 16398, 4834], which is ['Trần_@@', 'văn_@@', 'Thời']. Another example: the token 'Lê_văn_Tám', when tokenized, gives [1475, 16398, 6813], which is ['Lê_@@', 'văn_@@', 'Tám'].
So the result for 'Trần_văn_thời' is probably wrong. Can you give me an explanation for this? Thank you very much.

HFValidationError

Hi, when I run my training code, I get an HFValidationError. I have made sure that I point to the right path of the model on my server. Can you help me, please?
