hooshvare / parsbert
🤗 ParsBERT: Transformer-based Model for Persian Language Understanding
Home Page: https://doi.org/10.1007/s11063-021-10528-4
License: Apache License 2.0
How can I use the ParsBERT model for text generation?
How can I install ParsBERT on Ubuntu 20.04? Where is its source code?
How can I fine-tune the ParsBERT model on our own dataset?
Please help me.
Thanks.
Hello,
Is there any sample code (for example, in Google Colab) that uses ParsBERT for the fill-in-the-blank task? How do I fine-tune the model, and what are the input and output formats?
Thanks
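For the fill-in-the-blank question above, a minimal sketch using the Transformers `fill-mask` pipeline might look like this. The checkpoint name `HooshvareLab/bert-fa-base-uncased` is taken from another post on this page. The `___` blank marker and the helper names `make_masked_text`, `top_candidates`, and `demo` are made up for illustration. The input is a plain string with exactly one mask token; the output is a list of candidate tokens with scores.

```python
def make_masked_text(text, blank="___", mask_token="[MASK]"):
    """Replace the single blank marker in `text` with the model's mask token."""
    if text.count(blank) != 1:
        raise ValueError("expected exactly one blank marker in the text")
    return text.replace(blank, mask_token)


def top_candidates(fill_mask, masked_text, top_k=5):
    """Return (token, score) pairs from a fill-mask pipeline-like callable."""
    results = fill_mask(masked_text, top_k=top_k)
    return [(item["token_str"], item["score"]) for item in results]


def demo():
    """Run against the real checkpoint (downloads weights on first use)."""
    from transformers import pipeline  # imported lazily: heavy dependency

    # Assumption: this checkpoint name, copied from another post on this page.
    fill_mask = pipeline("fill-mask", model="HooshvareLab/bert-fa-base-uncased")
    text = make_masked_text(
        "تهران پایتخت ___ است.", mask_token=fill_mask.tokenizer.mask_token
    )
    for token, score in top_candidates(fill_mask, text):
        print(f"{token}\t{score:.4f}")
```

Calling `demo()` prints one candidate per line. The two helpers work with any callable that behaves like the pipeline, which makes them easy to try without downloading a model.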
How can we use ParsBERT on an unlabeled set of texts (e.g. online form titles) to cluster them automatically by similarity or by topic?
Hi,
First, thanks for this great contribution to NLP for the Persian language.
I would like to know under which license your pretrained models are published.
Thanks.
I wanted to use Hooshvare's sentiment analysis, text classification, and text summarization models, but I have not seen a script that explains how to use them. That is, after loading the model and tokenizer, how should Persian text be passed in, and what does the output look like? Fortunately, such a script exists for named entity recognition, but I have not seen one for the others. I would be grateful for any guidance.
Hi - nice work on this!
Sorry if I'm not looking hard enough, but have you released the code/details for preprocessing sequences used in pre-training?
I.e. Steps 1 & 2 in the paper:
(1) removing all the trivial and junk characters and (2) standardizing the corpus with respect to Persian characters.
Is this at all handled by the included tokenizer, or should one recreate the preprocessing step for fine-tuning?
Thanks
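To make the two quoted steps concrete, here is a rough sketch of that kind of cleanup. It is not the authors' preprocessing code, which this thread does not show. The character mapping (Arabic yeh and kaf to their Persian forms) and the decision to keep the zero-width non-joiner are assumptions chosen for illustration, as is the function name `normalize_persian`.

```python
import re
import unicodedata

# Assumption: a minimal Arabic-to-Persian mapping (ي -> ی, ك -> ک).
ARABIC_TO_PERSIAN = str.maketrans({"\u064A": "\u06CC", "\u0643": "\u06A9"})

# Characters kept even though Unicode classes them as control/format:
# newline, tab, and the zero-width non-joiner used in Persian spelling.
KEEP = {"\n", "\t", "\u200c"}


def normalize_persian(text):
    """Illustrative cleanup: map Arabic letters, drop junk, squeeze spaces."""
    text = text.translate(ARABIC_TO_PERSIAN)
    # Remove control and format characters (Unicode categories starting with "C").
    text = "".join(
        ch for ch in text if ch in KEEP or unicodedata.category(ch)[0] != "C"
    )
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()
```

Whether the released tokenizer already applies an equivalent normalization is exactly what the question asks, so treat this only as a starting point to compare against.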
Hi, there is an issue you should handle. It is caused by running the v3 code against the Hugging Face Transformers v4 package.
As explained at https://stackoverflow.com/questions/65082243/dropout-argument-input-position-1-must-be-tensor-not-str-when-using-bert, you should add return_dict=False when loading the model in the SentimentModel class, i.e. replace:
self.bert = BertModel.from_pretrained(MODEL_NAME_OR_PATH)
With:
self.bert = BertModel.from_pretrained(MODEL_NAME_OR_PATH, return_dict=False)
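Some background on why this works: in Transformers v4 a model's forward call returns a dict-like output object by default, so v3-style code that tuple-unpacks the result ends up with string keys instead of tensors. As an alternative to `return_dict=False`, the forward pass can read the pooled output in a way that tolerates both forms. The helper name `get_pooled_output` below is hypothetical and not part of the repository.

```python
def get_pooled_output(outputs):
    """Return the pooled [CLS] representation from a BertModel forward result.

    Handles both the plain tuple (v3 default, or return_dict=False) and the
    attribute-style output object (v4 default).
    """
    if isinstance(outputs, tuple):
        # Tuple layout: (last_hidden_state, pooler_output, ...)
        return outputs[1]
    return outputs.pooler_output
```

Inside a model's `forward`, this would be used as `pooled = get_pooled_output(self.bert(input_ids=..., attention_mask=...))` before the dropout layer.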
It seems the model on Hugging Face returns permutations of some queries (mostly when the [MASK] is at the end of the sentence). For instance, when the query is "I have an [MASK]" (in Farsi), the model returns two sentences as below, plus the candidates for the masked word (sometimes with no word candidates at all).
Hi
I've been curious about the pretraining part. More specifically, I saw that in your paper, you did not report the accuracy of the masked language modeling and the next sentence prediction tasks. Would you please provide some details on these parts?
For instance, what was the accuracy for these tasks (MLM & NSP) in the pretraining? Did you use the whole corpus (Miras+Wiki+etc.) for these parts too?
Hi there,
Thanks a lot for this repo and models you have trained.
Would it be possible for you to train an ELECTRA model too?
It seems like one of the more interesting models.
Thanks in advance.
I want to run the sentiment analysis code in the PyCharm IDE, but I get an error from the cleantext library. How do I fix it?
Error:
TypeError: clean() got an unexpected keyword argument 'fix_unicode'
(The same error occurs for all arguments.)
(I tried cleantext==1.1.4 and 1.1.3.)
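One possible explanation, offered here as an assumption rather than a confirmed diagnosis: PyPI hosts two different projects that both install a module named `cleantext`. The distribution named `clean-text` provides a `clean()` function that accepts `fix_unicode`, while the distribution named `cleantext` is a separate project with a different signature. The sketch below, with a hypothetical function name, reports which of the two is installed in the current environment.

```python
from importlib import metadata


def installed_cleantext_distributions():
    """Map each candidate distribution name to its version, or None if absent."""
    found = {}
    for name in ("clean-text", "cleantext"):
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found
```

If the result shows only `cleantext` installed, uninstalling it and installing `clean-text` instead would be the thing to try.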
Hello, thank you for developing ParsBERT. I have a question: I downloaded and ran the model, but how can I take text from users as input and use the model to analyse its sentiment? Thanks.
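A minimal sketch of reading user text and classifying it with the Transformers `text-classification` pipeline could look like this. The checkpoint name `HooshvareLab/bert-fa-base-uncased-sentiment-digikala` is an assumption (the post does not say which model was downloaded), and `predict_sentiment` and `demo` are hypothetical helper names.

```python
def predict_sentiment(classifier, text):
    """Classify one user-supplied string and return (label, score)."""
    text = text.strip()
    if not text:
        raise ValueError("input text is empty")
    # A text-classification pipeline returns a list with one dict per input.
    result = classifier(text)[0]
    return result["label"], result["score"]


def demo():
    """Interactive loop against a real checkpoint (downloads weights)."""
    from transformers import pipeline  # imported lazily: heavy dependency

    # Assumption: swap in whichever sentiment checkpoint you actually use.
    classifier = pipeline(
        "text-classification",
        model="HooshvareLab/bert-fa-base-uncased-sentiment-digikala",
    )
    while True:
        text = input("Text (empty line to quit): ")
        if not text.strip():
            break
        label, score = predict_sentiment(classifier, text)
        print(f"{label} ({score:.3f})")
```

Calling `demo()` starts the prompt loop; the label names depend on the checkpoint's configuration.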
Hi.
I haven't been able to find a link to the corpus you used to pretrain the language model. Have you released it publicly? If not, do you plan to?
Thanks.
Thanks for preparing and sharing a great tutorial notebook. However, in the tutorial section of the README, the Sentiment Analysis and Text Classification links both point to the same URL. I know this tutorial could be filed under either topic, but listing it as two separate links with two separate buttons in two separate table rows is misleading.
My recommendation is to merge the two rows into one, with a title that mentions both categories or, preferably, just Sentiment Analysis.
If someone has the time, it would be ideal to add a simple text classification tutorial notebook to replace the link in the first row (Text Classification).
Thanks for developing ParsBERT. Is there any plan to update it?
Hello,
Thank you for your work.
I would like to implement a question answering system using the ParsBERT model.
My data is in Persian and consists of a set of questions and answers.
Can I use this model? Which model is suitable? And if it can be done, how should I go about it?
My second question is whether this model can also be used for unlabeled data, for example if I want to do clustering.
I would appreciate a reply.
`
from transformers import AutoConfig, AutoTokenizer, AutoModel
# v3.0
model_name_or_path = "HooshvareLab/bert-fa-zwnj-base"
config = AutoConfig.from_pretrained(model_name_or_path)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
`
#---------------------------------------------------------------------------
#OSError Traceback (most recent call last)
~/Apps/anaconda3/envs/ml/lib/python3.9/site-packages/transformers/configuration_utils.py in get_config_dict(cls, pretrained_model_name_or_path, **kwargs)
358 if resolved_config_file is None:
--> 359 raise EnvironmentError
360 config_dict = cls._dict_from_json_file(resolved_config_file)
OSError:
During handling of the above exception, another exception occurred:
OSError Traceback (most recent call last)
/tmp/ipykernel_708815/1307679045.py in
3 # v3.0
4 model_name_or_path = "HooshvareLab/bert-fa-zwnj-base"
----> 5 config = AutoConfig.from_pretrained(model_name_or_path)
6 tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
~/Apps/anaconda3/envs/ml/lib/python3.9/site-packages/transformers/configuration_auto.py in from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
308 {'foo': False}
309 """
--> 310 config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
311
312 if "model_type" in config_dict:
~/Apps/anaconda3/envs/ml/lib/python3.9/site-packages/transformers/configuration_utils.py in get_config_dict(cls, pretrained_model_name_or_path, **kwargs)
366 f"- or '{pretrained_model_name_or_path}' is the correct path to a directory containing a {CONFIG_NAME} file\n\n"
367 )
--> 368 raise EnvironmentError(msg)
369
370 except json.JSONDecodeError:
OSError: Can't load config for 'HooshvareLab/bert-fa-zwnj-base'. Make sure that:
'HooshvareLab/bert-fa-zwnj-base' is a correct model identifier listed on 'https://huggingface.co/models'
or 'HooshvareLab/bert-fa-zwnj-base' is the correct path to a directory containing a config.json file
Hi,
I want to use your pretrained model for the NER task, but there is a problem: the tutorial notebook feeds documents one by one, which takes too long for my dataset. How can I use it more efficiently? Can I use padding to feed larger batches to the model?
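As a sketch of the padding idea raised here, texts can be grouped into batches, padded to the longest item in each batch, and run through a token classification model in one forward pass per batch. The helper names, batch size, and maximum length below are illustrative assumptions. The `tokenizer` and `model` arguments are expected to be a Transformers tokenizer and a PyTorch token classification model loaded by the caller.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def predict_label_ids(texts, tokenizer, model, batch_size=16, max_length=128):
    """Return one list of predicted label ids per text, padding positions removed."""
    import torch  # imported lazily: heavy dependency

    model.eval()
    predictions = []
    for batch in batched(texts, batch_size):
        # padding=True pads every text to the longest one in this batch.
        encoded = tokenizer(
            batch,
            padding=True,
            truncation=True,
            max_length=max_length,
            return_tensors="pt",
        )
        with torch.no_grad():
            logits = model(**encoded).logits
        label_ids = logits.argmax(dim=-1)
        # Keep only positions that correspond to real tokens, not padding.
        for ids, mask in zip(label_ids, encoded["attention_mask"]):
            predictions.append(ids[mask.bool()].tolist())
    return predictions
```

The returned ids still include special tokens and sub-word pieces, so mapping them back to words (for example through `model.config.id2label` and the tokenizer's word alignment) remains a separate step.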
How can I train your model on another dataset?
Salaam @m3hrdadfi and others,
Very nice to hear about your work!
Could you please provide some more information about your pre-training?
Given your batch size (32), 1.9M steps seems quite small (around 10 epochs).
Did you monitor the loss on downstream tasks during training, and had it stopped changing much by around step 1.9M?
I assume you stopped at 1.9M steps because of computation costs. Am I right?
Did you use Google's scripts or Hugging Face's to run the pre-training?
Did you use TPUs or GPUs for pre-training? Which type?
Did you use distributed training? If so, how many TPUs or GPUs in parallel?
Again, if so, was the batch size of 32 reported in the article per GPU/TPU, or is it the total across devices?
Did you use mixed precision (fp16) or gradient accumulation?
I am trying to run a model that builds on pre-trained BERT, and I would prefer to use ParsBERT as the pre-trained model because my data is in Persian.
The model's predictions contain some [UNK] tokens in place of numbers, which makes the predictions count as wrong (the model is scored by exact-match accuracy, and the [UNK] tokens make the prediction differ from the gold value).
As I described here: https://stackoverflow.com/users/10623022/maryami-najafi
Can anybody help?
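A small diagnostic can help narrow this down by listing which words in a text the tokenizer maps to its unknown token, for example to check whether the digits in the data (Persian ۰-۹ versus ASCII 0-9) are present in the vocabulary. This is only a sketch: the cause of the [UNK] tokens is not established in this thread, and the function name is hypothetical. It expects any tokenizer object exposing `unk_token` and `tokenize`, as Transformers tokenizers do.

```python
def find_unknown_words(tokenizer, text):
    """Return whitespace-separated words whose tokenization contains the unknown token."""
    unknown = []
    for word in text.split():
        # Tokenizing word by word keeps the original surface form available.
        if tokenizer.unk_token in tokenizer.tokenize(word):
            unknown.append(word)
    return unknown
```

If the reported words turn out to be numbers written with one digit set, converting them to the other before tokenization is one thing to try.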
The Colab link for the Classification model is wrong!
It points to the sentiment analysis Colab. Please correct it. Thanks a lot.
I guess that in your first example about using the model with TensorFlow, AutoModel should be TFAutoModel.
I get the error below when using the BERT model on my data.
The code runs without error on Colab, but on my own system this happens:
OSError: Unable to load weights from the PyTorch checkpoint file. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.
The tokenizer and config load fine, but loading the model fails with this error.
Can anybody help?
Hi,
Is there any way to disable all or some of the normalization rules in the ParsBERT tokenizer?
For example, so that it does not convert "آ" to "ا" or "ئ" to "ی".
The tokenizer also removes all half-spaces and concatenates the words.
Setting the do_lower_case and strip_accents parameters to false does not help.
I would be very grateful if you could let me know whether there is a solution to my problem.
Greetings,
I used your model in a GitHub project, https://github.com/MaartenGr/KeyBERT#toc, which extracts keywords from text. I used your model in the code below:
kw_model = KeyBERT('HooshvareLab/bert-base-parsbert-peymaner-uncased')
However, I ran into the errors below. Does your model have a sentence-transformers version?
No sentence-transformers model found with name /root/.cache/torch/sentence_transformers/HooshvareLab_bert-base-parsbert-peymaner-uncased. Creating a new one with MEAN pooling.
Some weights of the model checkpoint at /root/.cache/torch/sentence_transformers/HooshvareLab_bert-base-parsbert-peymaner-uncased were not used when initializing BertModel: ['classifier.bias', 'classifier.weight']
Hi,
I'm getting this error when trying to load embeddings using flair. Any idea what's going on?
Am I using the right model? I just need to use the embedding vectors.
The code:
from flair.data import Sentence
from flair.models import SequenceTagger
from flair.embeddings import TransformerWordEmbeddings
bert_embedding = TransformerWordEmbeddings("HooshvareLab/bert-fa-base-uncased")
sentence = Sentence('علی اکبر به شهر تهران رفت')
bert_embedding.embed(sentence)
The error:
Traceback (most recent call last):
File "/Users/reza/code/parsbert/playground.py", line 8, in <module>
bert_embedding.embed(sentence)
File "/Users/reza/.pyenv/versions/parsbert/lib/python3.9/site-packages/flair/embeddings/base.py", line 60, in embed
self._add_embeddings_internal(sentences)
File "/Users/reza/.pyenv/versions/parsbert/lib/python3.9/site-packages/flair/embeddings/token.py", line 923, in _add_embeddings_internal
self._add_embeddings_to_sentence(sentence)
File "/Users/reza/.pyenv/versions/parsbert/lib/python3.9/site-packages/flair/embeddings/token.py", line 995, in _add_embeddings_to_sentence
encoded_inputs = self.tokenizer.encode_plus(tokenized_string,
File "/Users/reza/.pyenv/versions/parsbert/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2378, in encode_plus
return self._encode_plus(
File "/Users/reza/.pyenv/versions/parsbert/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 458, in _encode_plus
batched_output = self._batch_encode_plus(
File "/Users/reza/.pyenv/versions/parsbert/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 377, in _batch_encode_plus
self.set_truncation_and_padding(
File "/Users/reza/.pyenv/versions/parsbert/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 335, in set_truncation_and_padding
self._tokenizer.enable_truncation(max_length, stride=stride, strategy=truncation_strategy.value)
OverflowError: int too big to convert
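One plausible cause, not confirmed in this thread: when a checkpoint's tokenizer configuration does not define `model_max_length`, Transformers falls back to a very large sentinel integer, and the fast tokenizer cannot accept that value as a truncation length. The sketch below clamps the value on a tokenizer object. The limit of 512 is an assumption based on the usual BERT position limit, and `cap_model_max_length` is a hypothetical helper name.

```python
def cap_model_max_length(tokenizer, limit=512):
    """Clamp tokenizer.model_max_length to `limit` and return the resulting value."""
    if tokenizer.model_max_length > limit:
        tokenizer.model_max_length = limit
    return tokenizer.model_max_length
```

With plain Transformers this would be applied right after `AutoTokenizer.from_pretrained(...)`. Whether flair picks up a value changed after the embedding object is constructed depends on the flair version and is not verified here.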