
parsbert's People

Contributors

baarsaam · m3hrdadfi · marziehphi · mmanthouri

parsbert's Issues

Using ParsBERT for fill-in-the-blank

Hello,
Is there any sample code (for example, a Google Colab notebook) for using ParsBERT on the fill-in-the-blank (masked language modeling) task? How should the model be fine-tuned, and what are the input and output formats?

Thanks
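
A minimal sketch of the fill-mask task with the transformers pipeline API; no separate fine-tuning is needed for plain fill-in-the-blank, since masked language modeling is ParsBERT's pretraining objective (the checkpoint name is the ParsBERT model referenced elsewhere in these issues):

    from transformers import pipeline

    # Hedged sketch: fill-mask with ParsBERT; substitute the checkpoint you use.
    fill_mask = pipeline("fill-mask", model="HooshvareLab/bert-fa-base-uncased")

    # The pipeline scores vocabulary tokens for the [MASK] position.
    for candidate in fill_mask("او به [MASK] رفت."):  # "He went to [MASK]."
        print(candidate["token_str"], candidate["score"])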

Question: Unsupervised Learning

How can we use ParsBERT to learn from an unlabeled set of texts (e.g. online form titles) and cluster them automatically by similarity or by topic?
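
A hedged sketch of one common approach (not an official recipe from this repo): embed each title with ParsBERT, mean-pool the token vectors into a sentence vector, and cluster with k-means:

    import torch
    from sklearn.cluster import KMeans
    from transformers import AutoModel, AutoTokenizer

    name = "HooshvareLab/bert-fa-base-uncased"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name)
    model.eval()

    titles = ["عنوان فرم اول", "عنوان فرم دوم"]  # your unlabeled form titles (needs at least n_clusters items)

    with torch.no_grad():
        enc = tokenizer(titles, padding=True, truncation=True, return_tensors="pt")
        hidden = model(**enc).last_hidden_state          # (batch, seq_len, hidden)
        mask = enc["attention_mask"].unsqueeze(-1)       # zero out padding positions
        embeddings = (hidden * mask).sum(1) / mask.sum(1)  # mean pooling

    labels = KMeans(n_clusters=5, n_init=10).fit_predict(embeddings.numpy())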

License

Hi,
First, thanks for this great contribution to NLP tasks in the Persian language.
I wanted to know under which license your pretrained models are published.
Thanks.

Using the text_classification, sentiment, and text_summarization models

I wanted to use Hooshvare's sentiment analysis, text classification, and text summarization models, but I couldn't find a script that explains how to use them: after loading the model and the tokenizer, how should Persian text be passed in, and what does the output look like? Fortunately such a script exists for named entity recognition, but I haven't seen one for the others. I'd appreciate any guidance.
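
As a hedged sketch (the checkpoint name below is an assumption; check the HooshvareLab page on the Hugging Face Hub for the actual fine-tuned checkpoints), the fine-tuned models can usually be driven through the pipeline API without extra glue code:

    from transformers import pipeline

    # Checkpoint name is an assumption; substitute the sentiment /
    # classification / summarization checkpoint you actually use.
    classifier = pipeline(
        "sentiment-analysis",
        model="HooshvareLab/bert-fa-base-uncased-sentiment-snappfood",
    )

    # Pass raw Persian text in; the output is a label plus a confidence score.
    print(classifier("کیفیت غذا عالی بود"))
    # e.g. [{'label': '...', 'score': 0.99}]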

Preprocessing code/details

Hi - nice work on this!

Sorry if I'm not looking hard enough, but have you released the code/details for preprocessing the sequences used in pre-training?

I.e. Steps 1 & 2 in the paper:

(1) removing all the trivial and junk characters and (2) standardizing the corpus with respect to Persian characters.

Is this at all handled by the included tokenizer, or should one recreate the preprocessing step for fine-tuning?

Thanks
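
For what it's worth, a rough sketch of the kind of standardization steps 1 and 2 describe (not the authors' released code) might look like this:

    import re

    # Hedged sketch of Persian character standardization, NOT the paper's code:
    # map Arabic-script variants to their canonical Persian forms.
    ARABIC_TO_PERSIAN = str.maketrans({
        "ي": "ی",  # Arabic yeh -> Persian yeh
        "ك": "ک",  # Arabic kaf -> Persian kaf
        "ة": "ه",
        "أ": "ا",
        "إ": "ا",
    })

    def standardize(text: str) -> str:
        text = text.translate(ARABIC_TO_PERSIAN)
        # Drop bidi/control marks but keep ZWNJ (\u200c), which is meaningful
        # in Persian orthography.
        text = re.sub(r"[\u200e\u200f\u202a-\u202e]", "", text)
        return re.sub(r"\s+", " ", text).strip()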

TypeError: dropout(): argument 'input' (position 1) must be Tensor, not str

Hi, there is an issue you should handle: this problem comes from running the v3 code against a Hugging Face v4 package.
As described at https://stackoverflow.com/questions/65082243/dropout-argument-input-position-1-must-be-tensor-not-str-when-using-bert, you should pass return_dict=False when loading the model in the SentimentModel class, i.e. replace:

        self.bert = BertModel.from_pretrained(MODEL_NAME_OR_PATH)

With:

        self.bert = BertModel.from_pretrained(MODEL_NAME_OR_PATH, return_dict=False)
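
For context, a hedged reconstruction of where that change lands; only the from_pretrained call is taken from the snippet above, the rest of the class is assumed from the usual tutorial structure:

    import torch.nn as nn
    from transformers import BertModel

    class SentimentModel(nn.Module):
        # Hedged reconstruction for illustration, not the tutorial's exact code.
        def __init__(self, model_name_or_path, n_classes):
            super().__init__()
            # return_dict=False restores the v3-style tuple output that the
            # tutorial's forward() unpacks, avoiding the dropout() TypeError.
            self.bert = BertModel.from_pretrained(model_name_or_path, return_dict=False)
            self.drop = nn.Dropout(p=0.3)
            self.out = nn.Linear(self.bert.config.hidden_size, n_classes)

        def forward(self, input_ids, attention_mask):
            _, pooled_output = self.bert(input_ids=input_ids, attention_mask=attention_mask)
            return self.out(self.drop(pooled_output))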

model output

It seems the model on huggingface returns permutations of some queries (mostly when the [MASK] is at the end of the sentence). For instance, when the query is "I have an [MASK]" (in Farsi), the model returns the two sentences below plus candidates for the masked word (and sometimes no word candidates at all).

  • "[CLS] I have an! [SEP]"
  • "[CLS] I have an. [SEP]"

Here is another example.

The accuracy of the masked LM and NSP tasks

Hi
I've been curious about the pretraining part. Specifically, I noticed that your paper does not report the accuracy of the masked language modeling and next sentence prediction tasks. Could you please provide some details on these parts?
For instance, what accuracy did these tasks (MLM & NSP) reach during pretraining? Did you use the whole corpus (Miras + Wiki + etc.) for these parts too?

Electra Model

Hi there,
Thanks a lot for this repo and the models you have trained.
Would it be possible for you to train an ELECTRA model too?
It seems very interesting compared to the other ones.

Thanks in advance.

TypeError: clean() got an unexpected keyword argument 'fix_unicode'

I want to run the sentiment analysis code in the PyCharm IDE, but I get an error from the cleantext library. How do I fix it?

error:
TypeError: clean() got an unexpected keyword argument 'fix_unicode'
(the same happens for all the arguments)

(I installed cleantext == 1.1.4 and 1.1.3)
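
One hedged guess: PyPI hosts two similarly named packages, cleantext and clean-text, and only the latter's clean() accepts fix_unicode (both are imported as cleantext, which makes the mix-up easy). Installing clean-text instead may fix it:

    # pip install clean-text     <- note the hyphen; "pip install cleantext"
    # installs a different package whose clean() has no fix_unicode argument.
    from cleantext import clean

    cleaned = clean(
        "متن خام ...",
        fix_unicode=True,
        to_ascii=False,  # keep Persian characters intact
        lower=True,
    )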

Give a text as an input

Hello, thank you for developing ParsBERT. I have a question: I downloaded and ran the model, but how can I get a text from users as input and use the model to analyse the sentiment of that text? Thanks.
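
A hedged sketch of reading user text in a loop and scoring it (the checkpoint name is an assumption; any fine-tuned ParsBERT sentiment checkpoint should slot in):

    from transformers import pipeline

    # Checkpoint name is an assumption; use the sentiment model you downloaded.
    classifier = pipeline(
        "sentiment-analysis",
        model="HooshvareLab/bert-fa-base-uncased-sentiment-snappfood",
    )

    while True:
        text = input("Enter Persian text (empty line to quit): ").strip()
        if not text:
            break
        print(classifier(text))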

Request for access to the corpus

Hi.

I haven't been able to find a link to the corpus you have used to pretrain the language model. Have you published it to the public? If not, are you going to do so?

Thanks.

Tutorial notebook links seem to be wrong

Thanks for preparing and sharing a great tutorial notebook, but in the tutorial section of the Readme file, both the Sentiment Analysis and Text Classification links point to the same URL. I know this tutorial could be categorized under both topics, but presenting it as two separate links, with two separate buttons, in two separate rows of a table is misleading.
My recommendation is to merge those two rows into one, with a title mentioning both categories, or preferably just Sentiment Analysis.
If someone has enough time, it would be ideal to write a simple text classification tutorial notebook to replace the link in the first row (Text Classification).

parsBERT for Q&A

Hello,
Thank you for your work.
I want to implement a question answering system using the parsBERT model.
My data is in Persian and consists of a set of questions and answers.
Can I use this model? Which model is appropriate? And if this can be done, how should I do it?

My second question is whether this model can also be used on unlabeled data, for example if I want to do clustering.
I'd appreciate a reply.
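
If a ParsBERT checkpoint fine-tuned for extractive QA is available (the identifier below is a placeholder, not a real model name), it can be driven through the question-answering pipeline; a hedged sketch:

    from transformers import pipeline

    # PLACEHOLDER identifier: extractive QA needs a ParsBERT checkpoint
    # fine-tuned on a Persian SQuAD-style dataset; the base (pretraining-only)
    # models will not answer questions out of the box.
    qa = pipeline("question-answering", model="path/to/parsbert-qa-checkpoint")

    result = qa(
        question="پایتخت ایران کجاست؟",      # "What is the capital of Iran?"
        context="تهران پایتخت ایران است.",   # "Tehran is the capital of Iran."
    )
    print(result["answer"], result["score"])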

Can't download model "bert-fa-zwnj-base" from huggingface

    from transformers import AutoConfig, AutoTokenizer, AutoModel

    # v3.0
    model_name_or_path = "HooshvareLab/bert-fa-zwnj-base"
    config = AutoConfig.from_pretrained(model_name_or_path)
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
    ---------------------------------------------------------------------------
    OSError                                   Traceback (most recent call last)
    ~/Apps/anaconda3/envs/ml/lib/python3.9/site-packages/transformers/configuration_utils.py in get_config_dict(cls, pretrained_model_name_or_path, **kwargs)
        358         if resolved_config_file is None:
    --> 359             raise EnvironmentError
        360         config_dict = cls._dict_from_json_file(resolved_config_file)

    OSError:

    During handling of the above exception, another exception occurred:

    OSError                                   Traceback (most recent call last)
    /tmp/ipykernel_708815/1307679045.py in <module>
          3 # v3.0
          4 model_name_or_path = "HooshvareLab/bert-fa-zwnj-base"
    ----> 5 config = AutoConfig.from_pretrained(model_name_or_path)
          6 tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

    ~/Apps/anaconda3/envs/ml/lib/python3.9/site-packages/transformers/configuration_auto.py in from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
        308         {'foo': False}
        309         """
    --> 310         config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
        311
        312         if "model_type" in config_dict:

    ~/Apps/anaconda3/envs/ml/lib/python3.9/site-packages/transformers/configuration_utils.py in get_config_dict(cls, pretrained_model_name_or_path, **kwargs)
        366             f"- or '{pretrained_model_name_or_path}' is the correct path to a directory containing a {CONFIG_NAME} file\n\n"
        367         )
    --> 368         raise EnvironmentError(msg)
        369
        370         except json.JSONDecodeError:

    OSError: Can't load config for 'HooshvareLab/bert-fa-zwnj-base'. Make sure that:

      • 'HooshvareLab/bert-fa-zwnj-base' is a correct model identifier listed on 'https://huggingface.co/models'

      • or 'HooshvareLab/bert-fa-zwnj-base' is the correct path to a directory containing a config.json file

How to use NER for a large dataset?

Hi,

I want to use your pretrained model for the NER task, but there is a problem: the tutorial notebook feeds documents one by one, and this takes too long for my dataset. How can I use it more efficiently? Can I use padding to feed larger batches to the model?
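
A hedged sketch of batched inference with padding; the checkpoint name follows the PEYMA NER model mentioned elsewhere in these issues, so adjust it to the one you actually use:

    import torch
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    name = "HooshvareLab/bert-base-parsbert-peymaner-uncased"  # your NER checkpoint
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForTokenClassification.from_pretrained(name)
    model.eval()

    def ner_batched(texts, batch_size=32):
        # Yield (token, label) pairs per document, padding each batch together.
        # Padding tokens are included in the output; filter them with
        # enc["attention_mask"] if needed.
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            enc = tokenizer(batch, padding=True, truncation=True, return_tensors="pt")
            with torch.no_grad():
                logits = model(**enc).logits        # (batch, seq_len, num_labels)
            preds = logits.argmax(dim=-1)
            for j in range(len(batch)):
                tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][j])
                labels = [model.config.id2label[int(p)] for p in preds[j]]
                yield list(zip(tokens, labels))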

question

How can I train your model on another dataset?

Details of your pre-training

Salaam @m3hrdadfi and others,
very nice to hear about your work!
Could you please provide some more information about your pre-training?

Regarding your batch size (32), it seems that 1.9M steps is rather small (around 10 epochs).
Did you monitor the loss of downstream tasks during training, and was the loss no longer changing much around step 1.9M?
I suppose you stopped at 1.9M steps because of computation costs; am I right?

Did you use Google's scripts for the run (or Hugging Face's)?

Did you use TPUs (or GPUs) for your pre-training? Which type?

Did you use distributed training? (If yes, how many TPUs or GPUs in parallel?
And, again if yes, was the batch size of 32 announced in the article per GPU/TPU, or summed across them?)

Did you use mixed-precision floating point (fp16) or gradient accumulation steps?

Why does ParsBERT turn English numbers into ['UNK']?

I'm trying to run a model that uses a pretrained BERT, and I'd prefer to use ParsBERT as the pretrained model because my data is in Persian.
The model's predictions contain some ["UNK"] tokens instead of numbers, which makes the predictions count as false (since the model uses exact-match accuracy, the ["UNK"] tokens make the prediction differ from the gold value).
As I said here: https://stackoverflow.com/users/10623022/maryami-najafi
Is there anybody who can help?
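
One hedged workaround when specific characters (e.g. digits) come out as [UNK]: check how the tokenizer splits them, and if they are genuinely out of vocabulary, register them as new tokens and resize the embedding matrix before fine-tuning:

    from transformers import AutoModel, AutoTokenizer

    name = "HooshvareLab/bert-fa-base-uncased"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name)

    # Inspect: do digits survive tokenization, or become [UNK]?
    print(tokenizer.tokenize("1234 ۱۲۳۴"))

    # If some symbols are missing from the vocab, register them and resize the
    # embeddings; the new rows are randomly initialized, so fine-tune afterwards.
    missing = [c for c in "0123456789" if tokenizer.unk_token in tokenizer.tokenize(c)]
    if missing:
        tokenizer.add_tokens(missing)
        model.resize_token_embeddings(len(tokenizer))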

OSError: Unable to load weights from pytorch checkpoint file. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.

I encounter this error when using the BERT model on my data.
The code runs without errors on Colab, but on my system it fails with:
OSError: Unable to load weights from the PyTorch checkpoint file. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.
The tokenizer and config load fine, but loading the model raises this error and breaks.
Is there anybody who can help?
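
This message often indicates a truncated or corrupted cached download rather than an actual TF checkpoint, so one hedged first step is to bypass the local cache and fetch the weights again:

    from transformers import AutoModel

    # force_download=True re-fetches the weights instead of reusing a
    # possibly corrupted local cache.
    model = AutoModel.from_pretrained(
        "HooshvareLab/bert-fa-base-uncased",
        force_download=True,
    )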

A way to disable normalization rules in parsbert Tokenizer

Hi,
Is there any way to disable all or some of the normalization rules in the ParsBERT tokenizer?
For example, to not convert "آ" to "ا" or "ئ" to "ی".
The tokenizer also removes all half-spaces (ZWNJ) and concatenates the words.
Setting the do_lower_case and strip_accents parameters to False does not work.
I would be grateful to know whether there is any solution to my problem.
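
A hedged, partial workaround: the slow (Python) BertTokenizer accepts do_lower_case and strip_accents at load time, while the fast tokenizer bakes its normalizer into tokenizer.json and is harder to override. Note that the vocabulary itself was built from normalized text, so unnormalized characters may simply tokenize to [UNK]:

    from transformers import BertTokenizer

    # Load the slow tokenizer so the constructor arguments actually take effect.
    tokenizer = BertTokenizer.from_pretrained(
        "HooshvareLab/bert-fa-base-uncased",
        do_lower_case=False,
        strip_accents=False,
    )
    print(tokenizer.tokenize("آئین نیم‌فاصله"))  # check whether "آ" and "ئ" survive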

Does your model have a sentence-transformers version?

Greetings and respects.
I used your model with the GitHub project https://github.com/MaartenGr/KeyBERT#toc, which extracts keywords from a text. I used your model in the following code:
kw_model = KeyBERT('HooshvareLab/bert-base-parsbert-peymaner-uncased')
But I ran into the errors below. Does your model have a sentence-transformers version?
No sentence-transformers model found with name /root/.cache/torch/sentence_transformers/HooshvareLab_bert-base-parsbert-peymaner-uncased. Creating a new one with MEAN pooling.
Some weights of the model checkpoint at /root/.cache/torch/sentence_transformers/HooshvareLab_bert-base-parsbert-peymaner-uncased were not used when initializing BertModel: ['classifier.bias', 'classifier.weight']

  • This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

/usr/local/lib/python3.7/dist-packages/sklearn/feature_extraction/text.py:401: UserWarning: Your stop_words may be inconsistent with your preprocessing.
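
A hedged workaround (essentially what KeyBERT's fallback did automatically, but made explicit): wrap a ParsBERT checkpoint in a sentence-transformers model with mean pooling and hand that to KeyBERT. A plain checkpoint such as bert-fa-base-uncased is probably a better fit than the NER-fine-tuned one above:

    from keybert import KeyBERT
    from sentence_transformers import SentenceTransformer, models

    # Build a sentence encoder from a plain ParsBERT checkpoint + mean pooling.
    word = models.Transformer("HooshvareLab/bert-fa-base-uncased")
    pooling = models.Pooling(word.get_word_embedding_dimension(), pooling_mode="mean")
    encoder = SentenceTransformer(modules=[word, pooling])

    kw_model = KeyBERT(model=encoder)
    keywords = kw_model.extract_keywords("متن فارسی برای استخراج کلمات کلیدی")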

parsbert with flair

Hi,
I'm getting this error when trying to load embeddings using flair. Any idea what's going on?
Am I using the right model? I just need the embedding vectors.

The code:

from flair.data import Sentence
from flair.models import SequenceTagger
from flair.embeddings import TransformerWordEmbeddings


bert_embedding = TransformerWordEmbeddings("HooshvareLab/bert-fa-base-uncased")
sentence = Sentence('علی اکبر به شهر تهران رفت')
bert_embedding.embed(sentence)

The error:

Traceback (most recent call last):
  File "/Users/reza/code/parsbert/playground.py", line 8, in <module>
    bert_embedding.embed(sentence)
  File "/Users/reza/.pyenv/versions/parsbert/lib/python3.9/site-packages/flair/embeddings/base.py", line 60, in embed
    self._add_embeddings_internal(sentences)
  File "/Users/reza/.pyenv/versions/parsbert/lib/python3.9/site-packages/flair/embeddings/token.py", line 923, in _add_embeddings_internal
    self._add_embeddings_to_sentence(sentence)
  File "/Users/reza/.pyenv/versions/parsbert/lib/python3.9/site-packages/flair/embeddings/token.py", line 995, in _add_embeddings_to_sentence
    encoded_inputs = self.tokenizer.encode_plus(tokenized_string,
  File "/Users/reza/.pyenv/versions/parsbert/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2378, in encode_plus
    return self._encode_plus(
  File "/Users/reza/.pyenv/versions/parsbert/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 458, in _encode_plus
    batched_output = self._batch_encode_plus(
  File "/Users/reza/.pyenv/versions/parsbert/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 377, in _batch_encode_plus
    self.set_truncation_and_padding(
  File "/Users/reza/.pyenv/versions/parsbert/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 335, in set_truncation_and_padding
    self._tokenizer.enable_truncation(max_length, stride=stride, strategy=truncation_strategy.value)
OverflowError: int too big to convert
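
One hedged guess about the cause: if the checkpoint's config does not pin model_max_length, the tokenizer falls back to a huge sentinel value that overflows when flair passes it to enable_truncation(). Capping it manually before embedding is a common workaround:

    from flair.data import Sentence
    from flair.embeddings import TransformerWordEmbeddings

    bert_embedding = TransformerWordEmbeddings("HooshvareLab/bert-fa-base-uncased")
    # Hedged workaround: pin the tokenizer's max length to BERT's 512 positions,
    # so flair doesn't pass the huge sentinel default to enable_truncation().
    bert_embedding.tokenizer.model_max_length = 512

    sentence = Sentence("علی اکبر به شهر تهران رفت")
    bert_embedding.embed(sentence)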
