AraT5: Text-to-Text Transformers for Arabic Language Generation


This is the repository accompanying our paper AraT5: Text-to-Text Transformers for Arabic Language Generation. In this repository, we:

  • Introduce AraT5MSA, AraT5Tweet, and AraT5: three powerful Arabic-specific text-to-text Transformer-based models;
  • Introduce ARGEN: a new benchmark for Arabic language generation and evaluation covering seven Arabic NLP tasks, namely machine translation, summarization, news title generation, question generation, paraphrasing, transliteration, and code-switched translation;
  • Evaluate AraT5 models on ARGEN and compare against available language models.

Our models establish new state-of-the-art (SOTA) results on several publicly available datasets. Our language models are publicly available for research (see below).

The rest of this repository provides more information about our new language models, benchmark, and experiments.

🔆 Breaking News! 🔆

We're excited to announce the next version of AraT5.

🔥 What's new? 🔥

  • More Data. AraT5v2 is trained on larger and more diverse Arabic data.
  • Larger Sequence Length. We increase the sequence length from 512 to 1024 in this version.
  • Faster Convergence. During fine-tuning, AraT5v2 converges ~10x faster than the previous version (AraT5-base).
  • Extra IDs. AraT5v2 supports 100 sentinel tokens (a.k.a. unique mask tokens).

🤗 Hugging Face: https://huggingface.co/UBC-NLP/AraT5v2-base-1024

We recommend using AraT5v2-base-1024 instead of the previous version (AraT5-base).
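
As a quick sanity check, AraT5v2 can be loaded through the standard Hugging Face transformers API. The snippet below is a minimal sketch of ours (not from the paper or repository): it loads the model listed above and looks up one of the T5-style sentinel tokens; the Arabic example sentence is only there to illustrate the API, not to demonstrate output quality.

# Minimal sketch (ours): load AraT5v2 and inspect a sentinel (mask) token.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "UBC-NLP/AraT5v2-base-1024"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# AraT5v2 supports 100 sentinel tokens; with a T5-style tokenizer they are
# exposed as <extra_id_0> ... <extra_id_99>.
print("id of <extra_id_0>:", tokenizer.convert_tokens_to_ids("<extra_id_0>"))

# Span-infilling style input using a sentinel token (illustrative only).
text = "عاصمة المملكة العربية السعودية هي <extra_id_0> ."
input_ids = tokenizer(text, return_tensors="pt").input_ids
outputs = model.generate(input_ids, max_length=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))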

Table of Contents

  1. Our Language Models
  2. ARGEN Benchmark and AraT5 Evaluation
  3. How to Use the AraT5 Model
  4. Ethics
  5. AraT5 Models Checkpoints
  6. Citation
  7. Acknowledgments

1. Our Language Models

1.1 Training Data

  • MSA Training Data: We use 70GB of MSA text (7.1B tokens) from the following sources: AraNews, El-Khair, Gigaword, OSCAR, OSIAN, Wikipedia Arabic, and Hindawi Books.

  • Twitter Training Data: We randomly sample 1.5B Arabic tweets from a large in-house dataset of about 10B tweets. We use string matching to include only tweets with at least 3 Arabic words, regardless of whether the tweet contains non-Arabic strings (see the illustrative filter sketch below). The dataset makes up 178GB of text (21B tokens).
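
The sketch below is ours and is not the exact pre-processing pipeline used for AraT5; it only illustrates the kind of filter described above, keeping a tweet if it contains at least three words written in Arabic script, regardless of any non-Arabic content.

import re

# Illustrative sketch (ours) of the tweet filter described above; the actual
# AraT5 pre-processing may differ. A "word" here is any maximal run of
# characters in the basic Arabic Unicode block.
ARABIC_WORD = re.compile(r"[\u0600-\u06FF]+")

def keep_tweet(text: str, min_arabic_words: int = 3) -> bool:
    return len(ARABIC_WORD.findall(text)) >= min_arabic_words

print(keep_tweet("صباح الخير يا أصدقاء"))      # True: four Arabic words
print(keep_tweet("good morning صباح الخير"))    # False: only two Arabic words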

1.2 Model Architecture

To train our AraT5 models, we use the same architectures as T5-base and T5-small (Raffel et al., 2019). For the base models, both the encoder and the decoder have 12 layers, each with 12 attention heads, and 768 hidden units.
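
These dimensions can also be read directly from the released checkpoints. The snippet below is a small sketch of ours using the Hugging Face configuration of AraT5-base; for the base model it should report 12 encoder/decoder layers, 12 attention heads, and a hidden size of 768.

# Small sketch (ours): inspect the architecture hyper-parameters of AraT5-base
# via its Hugging Face configuration (T5Config).
from transformers import AutoConfig

config = AutoConfig.from_pretrained("UBC-NLP/AraT5-base")
print("encoder layers:", config.num_layers)
print("decoder layers:", config.num_decoder_layers)
print("attention heads:", config.num_heads)
print("hidden size (d_model):", config.d_model)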

1.3 AraT5 Models

We pre-train three powerful variants of the text-to-text transformer (T5) model dedicated to Modern Standard Arabic (MSA) and Arabic dialects. AraT5 comes in three flavors:

  • AraT5MSA: trained on MSA data exclusively;
  • AraT5Tweet: trained on Twitter data (a mix of MSA and dialectal Arabic);
  • AraT5: trained on both Twitter and MSA data.

2. ARGEN Benchmark and AraT5 Evaluation

To evaluate our models, we also introduce ARGEN, a new benchmark for Arabic language generation and evaluation. ARGEN is composed of seven tasks, namely machine translation, summarization, news title generation, question generation, paraphrasing, transliteration, and code-switched translation. ARGEN is collected from a number of existing datasets as well as new datasets proposed in this work.

2.1 Machine Translation

2.1.1 MSA To English

Dataset Test Split mT5 AraT5Tweet AraT5MSA AraT5
Bible II Sajjad et al. (2020) Test 1 15.58 13.04 16.38 15.71
Bible II Sajjad et al. (2020) Test 2 12.1 9.2 12.53 11.64
MADAR Bouamor et al. (2018) MSA-EN 11.84 11.11 11.42 10.57
IWSLT Cettolo et al. (2016) TED15 29.39 28.2 30.37 30.45
IWSLT Cettolo et al. (2016) TED16 28.39 27.03 29.37 29.18
IWSLT Cettolo et al. (2016) QED16 21.09 18.55 20.98 19.11
UN Ziemski et al. (2016) AR-EN 52.38 51.48 53.29 52.96

Metric is BLEU. MADAR Bouamor et al. (2018) (25 datasets) results are shown in Table 6 of the paper.
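
For reference, BLEU can be computed with a standard toolkit such as sacreBLEU. The snippet below is a generic sketch of ours and does not necessarily match the exact tokenization or evaluation settings used for the numbers reported in the paper.

# Generic BLEU sketch (ours) using sacrebleu; not necessarily the exact
# configuration used for the results above.
import sacrebleu

hypotheses = ["the committee approved the new budget"]
references = [["the committee has approved the new budget"]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")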

2.1.2 Dialectal Arabic To English

Dataset Test Split mT5 AraT5Tweet AraT5MSA AraT5
ADPT Zbib et al. (2012) Lev 8.33 8.32 8.52 8.42
ADPT Zbib et al. (2012) Egy 12.57 11.25 12.38 12.92
Bible I Sajjad et al. (2020) Tun 8.08 5.86 8.52 7.94
Bible I Sajjad et al. (2020) Mor 7.21 4.69 7.83 6.82
QAraCy Sajjad et al. (2020) Qat 11.84 11.11 11.42 10.57

Metric is BLEU.

2.1.3 Foreign Languages To MSA

Split mT5 AraT5MSA
EN → MSA 17.80 18.58
DE → MSA 11.92 12.80
FR → MSA 18.61 18.99
RU → MSA 26.63 28.01

Metric is BLEU. All the splits are from the UN corpus (Ziemski et al., 2016).

2.2 Text Summarization

Dataset Metric mT5 AraT5Tweet AraT5MSA AraT5
EASC El-Haj et al. (2010) Rouge1 62.98 60.74 59.54 54.61
EASC El-Haj et al. (2010) Rouge2 51.93 48.89 47.37 43.58
EASC El-Haj et al. (2010) RougeL 62.98 60.73 59.55 54.55
WikiLin Alami et al. (2021) Rouge1 71.63 74.61 72.64 73.48
WikiLin Alami et al. (2021) Rouge2 63.60 67.00 64.21 65.09
WikiLin Alami et al. (2021) RougeL 71.56 74.52 72.57 73.37

2.3 News Title and Question Generation

Dataset Metric mT5 AraT5Tweet AraT5MSA AraT5
ARGENNTG Nagoudi et al. (2020) BLEU 19.49 20.00 20.61 20.51
ARGENQG Nagoudi et al. (2021) BLEU 15.29 12.06 14.18 16.99

2.4 Paraphrasing and Transliteration

Dataset Metric mT5 AraT5Tweet AraT5MSA AraT5
ARGENPPH I Cer et al. (2017) BLEU 19.32 18.17 19.38 19.03
ARGENPPH II Alian et al. (2021) BLEU 19.25 17.34 19.43 18.42
ARGENTR Song et al. (2014) BLEU 60.81 59.55 65.88 62.51

2.5 Code-Switched Translation

Dataset Type mT5 AraT5Tweet AraT5MSA AraT5
ALG-FR → FR Natural 23.83 28.19 26.27 26.17
JOR-EN → EN Natural 23.06 21.60 21.58 20.45
MSA-FR → FR Synthetic 11.06 8.99 11.53 11.42
MSA-EN → EN Synthetic 19.25 17.34 19.43 18.42
MSA-FR → MSA Synthetic 12.93 12.14 14.39 13.92
MSA-EN → MSA Synthetic 19.82 18.43 23.89 24.37

Metric is BLEU. All the ARGENCS datasets are from Nagoudi et al. (2021).

3. How to Use the AraT5 Model

Below is an example of fine-tuning AraT5-base for News Title Generation on the AraNews dataset:

!python run_trainier_seq2seq_huggingface.py \
        --learning_rate 5e-5 \
        --max_target_length 128 --max_source_length 128 \
        --per_device_train_batch_size 8 --per_device_eval_batch_size 8 \
        --model_name_or_path "UBC-NLP/AraT5-base" \
        --output_dir "/content/AraT5_FT_title_generation" --overwrite_output_dir \
        --num_train_epochs 3 \
        --train_file "/content/ARGEn_title_genration_sample_train.tsv" \
        --validation_file "/content/ARGEn_title_genration_sample_valid.tsv" \
        --task "title_generation" --text_column "document" --summary_column "title" \
        --load_best_model_at_end --metric_for_best_model "eval_bleu" --greater_is_better True --evaluation_strategy epoch --logging_strategy epoch --predict_with_generate \
        --do_train --do_eval

For more details about the fine-tuning example, please read the accompanying notebook (Open In Colab).
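
The command above reads the training and validation files as TSV and selects the columns passed via --text_column and --summary_column. The snippet below is a hedged sketch of ours that writes a tiny file in that two-column layout; the file name is hypothetical, and the exact format expected by run_trainier_seq2seq_huggingface.py may differ slightly, so check the notebook for the authoritative format.

# Hedged sketch (ours): build a tiny tab-separated training file with the two
# columns selected above ("document" and "title"). The file name is
# hypothetical; check the script/notebook for the exact expected format.
import csv

rows = [
    {"document": "نص الخبر الأول ...", "title": "عنوان الخبر الأول"},
    {"document": "نص الخبر الثاني ...", "title": "عنوان الخبر الثاني"},
]

with open("my_title_generation_train.tsv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["document", "title"], delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)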

In addition, we release the fine-tuned checkpoint for News Title Generation (NTG) described in the paper. The model is available on Hugging Face (UBC-NLP/AraT5-base-title-generation).

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/AraT5-base-title-generation")  
model = AutoModelForSeq2SeqLM.from_pretrained("UBC-NLP/AraT5-base-title-generation")

Document = "تحت رعاية صاحب السمو الملكي الأمير سعود بن نايف بن عبدالعزيز أمير المنطقة الشرقية اختتمت غرفة الشرقية مؤخرا، الثاني من مبادرتها لتأهيل وتدريب أبناء وبنات المملكة ضمن مبادرتها المجانية للعام 2019 حيث قدمت 6 برامج تدريبية نوعية. وثمن رئيس مجلس إدارة الغرفة، عبدالحكيم العمار الخالدي، رعاية سمو أمير المنطقة الشرقية للمبادرة، مؤكدا أن دعم سموه لجميع أنشطة ."

encoding = tokenizer.encode_plus(Document,pad_to_max_length=True, return_tensors="pt")
input_ids, attention_masks = encoding["input_ids"], encoding["attention_mask"]


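# Sample 5 candidate titles with top-k / nucleus (top-p) sampling.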
outputs = model.generate(
    input_ids=input_ids, attention_mask=attention_masks,
    max_length=256,
    do_sample=True,
    top_k=120,
    top_p=0.95,
    early_stopping=True,
    num_return_sequences=5
)

for id, output in enumerate(outputs):
    title = tokenizer.decode(output, skip_special_tokens=True,clean_up_tokenization_spaces=True)
    print("title#"+str(id), title)

The input news document

تحت رعاية صاحب السمو الملكي الأمير سعود بن نايف بن عبدالعزيز أمير المنطقة الشرقية اختتمت غرفة الشرقية مؤخرا، الثاني من مبادرتها لتأهيل وتدريب أبناء وبنات المملكة ضمن مبادرتها المجانية للعام 2019 حيث قدمت 6 برامج تدريبية نوعية. وثمن رئيس مجلس إدارة الغرفة، عبدالحكيم العمار الخالدي، رعاية سمو أمير المنطقة الشرقية للمبادرة، مؤكدا أن دعم سموه لجميع أنشطة .

The generated titles

title#0 غرفة الشرقية تختتم المرحلة الثانية من مبادرتها لتأهيل وتدريب أبناء وبنات المملكة
title#1 غرفة الشرقية تختتم الثاني من مبادرة تأهيل وتأهيل أبناء وبناتنا
title#2 سعود بن نايف يختتم ثانى مبادراتها لتأهيل وتدريب أبناء وبنات المملكة
title#3 أمير الشرقية يرعى اختتام برنامج برنامج تدريب أبناء وبنات المملكة
title#4 سعود بن نايف يرعى اختتام مبادرة تأهيل وتدريب أبناء وبنات المملكة

4. Ethics

Our models are developed using data from the public domain. We provide access to our models to accelerate scientific research with no liability on our part. Please use our models and benchmark only ethically. This includes, for example, respecting and protecting people's privacy. We encourage all researchers who decide to use our models to adhere to the highest standards. For example, if you apply our models to Twitter data, we encourage you to review the Twitter developer policy, which provides the following guidance around the use of sensitive information:

Sensitive information

You should be careful about using Twitter data to derive or infer potentially sensitive characteristics about Twitter users. Never derive or infer, or store derived or inferred, information about a Twitter user’s:

  • Health (including pregnancy)
  • Negative financial status or condition
  • Political affiliation or beliefs
  • Racial or ethnic origin
  • Religious or philosophical affiliation or beliefs
  • Sex life or sexual orientation
  • Trade union membership
  • Alleged or actual commission of a crime

Aggregate analysis of Twitter content that does not store any personal data (for example, user IDs, usernames, and other identifiers) is permitted, provided that the analysis also complies with applicable laws and all parts of the Developer Agreement and Policy.

5. AraT5 Models Checkpoints

AraT5 PyTorch and TensorFlow checkpoints are available on the Hugging Face website for direct download and use, exclusively for research. For commercial use, please contact the authors via email (muhammad.mageed[at]ubc[dot]ca).

Model Link
AraT5-base https://huggingface.co/UBC-NLP/AraT5-base
AraT5-msa-base https://huggingface.co/UBC-NLP/AraT5-msa-base
AraT5-tweet-base https://huggingface.co/UBC-NLP/AraT5-tweet-base
AraT5-msa-small https://huggingface.co/UBC-NLP/AraT5-msa-small
AraT5-tweet-small https://huggingface.co/UBC-NLP/AraT5-tweet-small
Title generation model https://huggingface.co/UBC-NLP/AraT5-base-title-generation
🔥AraT5v2-base-1024🔥 https://huggingface.co/UBC-NLP/AraT5v2-base-1024

6. Citation

If you use our AraT5 models for your scientific publication, or if you find the resources in this repository useful, please cite our papers as follows:

AraT5v1 Models

@inproceedings{nagoudi-etal-2022-arat5,
    title = "{A}ra{T}5: Text-to-Text Transformers for {A}rabic Language Generation",
    author = "Nagoudi, El Moatez Billah  and
      Elmadany, AbdelRahim  and
      Abdul-Mageed, Muhammad",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.47",
    pages = "628--647",
}

AraT5v2 Models

@inproceedings{elmadany-etal-2023-octopus,
    title = "Octopus: A Multitask Model and Toolkit for {A}rabic Natural Language Generation",
    author = "Elmadany, AbdelRahim  and
      Nagoudi, El Moatez Billah  and
      Abdul-Mageed, Muhammad",
    booktitle = "Proceedings of ArabicNLP 2023",
    month = dec,
    year = "2023",
    address = "Singapore (Hybrid)",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.arabicnlp-1.20",
    doi = "10.18653/v1/2023.arabicnlp-1.20",
    pages = "232--243",
}

7. Acknowledgments

We gratefully acknowledge support from the Natural Sciences and Engineering Research Council of Canada, the Social Sciences and Humanities Research Council of Canada, Canadian Foundation for Innovation, ComputeCanada and UBC ARC-Sockeye. We also thank the Google TensorFlow Research Cloud (TFRC) program for providing us with free TPU access.

arat5's People

Contributors

elmadany, mageed, nagoudi


arat5's Issues

Arabic to English MT

Hello, I am trying to fine-tune this model on Arabic-to-English MT, but I'm not getting any good results. The paper says it works for this specific task, but it didn't work for me and I don't know why; the BLEU score didn't surpass 10.

OSError: Can't load config for 'UBC-NLP/AraT5-base'.

Hi,

I want to run your script run_trainier_seq2seq_huggingface.py but it gives me the following error:

last_checkpoint None
03/26/2022 14:49:11 - WARNING - main - Process rank: -1, device: cuda:0, n_gpu: 1distributed training: False, 16-bits training: False
03/26/2022 14:49:11 - INFO - main - Training/evaluation parameters Seq2SeqTrainingArguments(output_dir='/content/AraT5_FT_title_generation', overwrite_output_dir=True, do_train=True, do_eval=True, do_predict=False, evaluation_strategy=<IntervalStrategy.EPOCH: 'epoch'>, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, lr_scheduler_type=<SchedulerType.LINEAR: 'linear'>, warmup_ratio=0.0, warmup_steps=0, logging_dir='runs/Mar26_14-49-11_1f6d9e124699', logging_strategy=<IntervalStrategy.EPOCH: 'epoch'>, logging_first_step=False, logging_steps=500, save_strategy=<IntervalStrategy.STEPS: 'steps'>, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level='O1', fp16_backend='auto', fp16_full_eval=False, local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=0, past_index=-1, run_name='/content/AraT5_FT_title_generation', disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=True, metric_for_best_model='eval_bleu', greater_is_better=True, ignore_data_skip=False, sharded_ddp=[], deepspeed=None, label_smoothing_factor=0.0, adafactor=False, group_by_length=False, length_column_name='length', report_to=['tensorboard'], ddp_find_unused_parameters=None, dataloader_pin_memory=True, skip_memory_metrics=False, mp_parameters='', sortish_sampler=False, predict_with_generate=True)
[INFO] loading from TSV
03/26/2022 14:49:11 - WARNING - datasets.builder - Using custom data configuration default-942a41af4b2c6152
Downloading and preparing dataset csv/default to /tmp/AraT5_cache_dir/csv/default-942a41af4b2c6152/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519...
Downloading data files: 100% 2/2 [00:00<00:00, 9446.63it/s]
Extracting data files: 100% 2/2 [00:00<00:00, 985.16it/s]
Dataset csv downloaded and prepared to /tmp/AraT5_cache_dir/csv/default-942a41af4b2c6152/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519. Subsequent calls will reuse this data.
100% 2/2 [00:00<00:00, 803.74it/s]
Cannot find the requested files in the cached path and outgoing traffic has been disabled. To enable model look-ups and downloads online, set 'local_files_only' to False.
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/transformers/configuration_utils.py", line 466, in get_config_dict
user_agent=user_agent,
File "/usr/local/lib/python3.7/dist-packages/transformers/file_utils.py", line 1173, in cached_path
local_files_only=local_files_only,
File "/usr/local/lib/python3.7/dist-packages/transformers/file_utils.py", line 1383, in get_from_cache
"Cannot find the requested files in the cached path and outgoing traffic has been"
FileNotFoundError: Cannot find the requested files in the cached path and outgoing traffic has been disabled. To enable model look-ups and downloads online, set 'local_files_only' to False.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "run_trainier_seq2seq_huggingface.py", line 807, in
main()
File "run_trainier_seq2seq_huggingface.py", line 365, in main
local_files_only = True
File "/usr/local/lib/python3.7/dist-packages/transformers/models/auto/configuration_auto.py", line 398, in from_pretrained
config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/transformers/configuration_utils.py", line 478, in get_config_dict
raise EnvironmentError(msg)
OSError: Can't load config for 'UBC-NLP/AraT5-base'. Make sure that:

  • 'UBC-NLP/AraT5-base' is a correct model identifier listed on 'https://huggingface.co/models'

  • or 'UBC-NLP/AraT5-base' is the correct path to a directory containing a config.json file

How to fix it please?

English-to-Arabic Translation Issue

Hi @Nagoudi @elmadany ,

Thank you so much for open-sourcing your awesome models. I have a question, please: I want to use AraT5 or AraT5v2 for machine translation from English to Arabic. Could you please share an example of how to do that? I tried to use your models with the following code, but the output does not make sense. Here is the code:

from transformers import T5Tokenizer, AutoModelForSeq2SeqLM, AutoTokenizer, pipeline


model = AutoModelForSeq2SeqLM.from_pretrained("UBC-NLP/AraT5-msa-base")
tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/AraT5-msa-base")
tokenizer.src_lang="English"
tokenizer.tgt_lang="Arabic"

ar_prompt="The scene displays a group of people gathered around a wooden dining table in an indoor setting."
input_ids = tokenizer(ar_prompt, return_tensors="pt").input_ids


outputs = model.generate(input_ids)
print("Tokenized input:", tokenizer.tokenize(ar_prompt))
print("Decoded output:", tokenizer.decode(outputs[0], skip_special_tokens=True))

This is the current output:

Tokenized input: ['▁The', '▁scene', '▁display', 's', '▁a', '▁group', '▁of', '▁people', '▁gathered', '▁around', '▁a', '▁wooden', '▁di', 'ning', '▁table', '▁in', '▁an', '▁indoor', '▁setting', '.']

Decoded output: هحب هحب هحب هحب هحب هحب هحب هحب هحب هحب هحب هحب هحب هحب هحب هحب هحب هحب هحب

Please let me know what the issue is in the above code or share an example of how to use your model for translation from English to Arabic.

ARGEN datasets

I'm not able to find the ARGEN benchmark datasets directly. I can see links to the sources for both pretraining and most other tasks, but I was hoping to be able to access the complete benchmark at one location. Is it possible to make ARGEN_CST available? @Nagoudi @elmadany @mageed

Rouge scores are not well calculated

Hi,

Thanks for sharing the code and models of your great paper.

I think you miscalculated the ROUGE scores for the text summarization task in your paper.
The bug lies in these lines:

preds = ["\\n \\n ".join(nltk.sent_tokenize(pred)) if len(nltk.sent_tokenize(pred))> 1 else pred+"\\n" for pred in preds]
labels = ["\\n \\n ".join(nltk.sent_tokenize(label)) if len(nltk.sent_tokenize(label))> 1 else label+"\\n" for label in labels]

Let me explain. First, rouge_score (which is embedded in the HF datasets library) doesn't work on Arabic text. Here is a simple example where the reference and prediction are exactly the same:

from rouge_score import rouge_scorer

gold = "اختر العمر المستقبلي."
pred = "اختر العمر المستقبلي."
rouge_types = ["rouge1", "rouge2", "rougeL"]
scorer = rouge_scorer.RougeScorer(rouge_types=rouge_types, use_stemmer=False)
score = scorer.score(gold, pred)
print({key: value.fmeasure * 100 for key, value in score.items()}) #{'rouge1': 0.0, 'rouge2': 0.0, 'rougeL': 0}

This happens because the default tokenizer of the Google rouge wrapper deletes all non-alphanumeric characters (see comment 2 for a solution).

However, rouge works well on English:

from rouge_score import rouge_scorer

gold = "police kill the gunman"
pred = "police kill the gunman"
rouge_types = ["rouge1", "rouge2", "rougeL"]
scorer = rouge_scorer.RougeScorer(rouge_types=rouge_types, use_stemmer=False)
score = scorer.score(gold, pred)
print({key: value.fmeasure * 100 for key, value in score.items()}) #{'rouge1': 100.0, 'rouge2': 100.0, 'rougeL': 100.0}

gold = "police kill the gunman"
pred = "police killed the gunman"
rouge_types = ["rouge1", "rouge2", "rougeL"]
scorer = rouge_scorer.RougeScorer(rouge_types=rouge_types, use_stemmer=False)
score = scorer.score(gold, pred)
print({key: value.fmeasure * 100 for key, value in score.items()}) #{'rouge1': 75.0, 'rouge2': 33.33333333333333, 'rougeL': 75.0}

In your code, you comment out these 2 lines because they give scores around 1%-2% (I will explain why later):

# preds = ["\n".join(nltk.sent_tokenize(pred)) for pred in preds]
# labels = ["\n".join(nltk.sent_tokenize(label)) for label in labels]

and you replace them with these 2 lines:

preds = ["\\n \\n ".join(nltk.sent_tokenize(pred)) if len(nltk.sent_tokenize(pred))> 1 else pred+"\\n" for pred in preds]
labels = ["\\n \\n ".join(nltk.sent_tokenize(label)) if len(nltk.sent_tokenize(label))> 1 else label+"\\n" for label in labels]

What you actually did is, in effect, compute the ratio between the number of \\n \\n spans in the reference and in the prediction.
Here is a simple example where the gold reference has 2 sentences and the prediction has 4 sentences:

gold = "اختر العمر المستقبلي. كن عفويا."
pred = "ابحث عن العمر الذي تريد أن تكونه في المستقبل. تحدث عن نفسك الحالية. فكر في قيمك. فكر في الأشياء التي تجيدها."

# No line break between sentences
score = scorer.score(gold, pred)
print({key: value.fmeasure * 100 for key, value in score.items()}) # {'rouge1': 0.0, 'rouge2': 0.0, 'rougeL': 0}

# Correct way to add line breaks between sentences
gold_2 = gold.replace(". ", ".\n")
pred_2 = pred.replace(". ", ".\n")
score = scorer.score(gold_2, pred_2)
print({key: value.fmeasure * 100 for key, value in score.items()}) # {'rouge1': 0.0, 'rouge2': 0.0, 'rougeL': 0}


# Adding line breaks between sentences with your method
gold_3 = gold.replace(". ", ".\\n \\n ")
pred_3 = pred.replace(". ", ".\\n \\n ")
score = scorer.score(gold_3, pred_3)
print({key: value.fmeasure * 100 for key, value in score.items()}) # {'rouge1': 50.0, 'rouge2': 33.333333333333336, 'rougeL': 50.0}

print(gold_3) # اختر العمر المستقبلي.\n \n كن عفويا.
print(pred_3)# ابحث عن العمر الذي تريد أن تكونه في المستقبل.\n \n تحدث عن نفسك الحالية.\n \n فكر في قيمك.\n \n فكر في الأشياء التي تجيدها.

As you can see, in the example <gold_3, pred_3> the 50 ROUGE is there because you predicted 4 sentences (actually \\n \\n spans) while the reference contains only 2. The 33% is because you have 1 and 3 \\n \\n n-grams in the reference and prediction respectively (1/3).

In fact, I re-ran the text summarization experiments internally using your model and mine, and found that the results are comparable with your paper when using your method. On the other hand, when adding \n correctly, the scores are between 1% and 2%. The ROUGE scores are non-zero only when the reference and prediction contain the same English words, which happens rarely.

In fact, the results in tables 7 and B2 are essentially just count(sent_num_ref) / count(sent_num_pred).
Don't get me wrong, your models are good; they just need to be evaluated correctly (see comment 2).

It would be great if you could fix your code and adjust the numbers in tables 7 and B2 in your paper.

Thanks

What is the mask token in AraT5-base?

I can't find any token like <extra_id> or <mask> in the vocab. What is the mask token in AraT5-base, and how do I get the mask id with the Hugging Face API?

title generation output on fine-tuned AraT5

Hello contributors!
Thank you for the amazing project. Wonderful work.
I am working on a text classification task, and found AraT5 to be very useful for my case. I was going through Fine_tuning_AraT5.ipynb featured in the examples directory. I trained the model, following your "best results" instructions (22 epochs...) on my own data, and got class predictions not featured in the original classification classes. I then tried training on the sample data provided in the notebook (ARGEn_title_genration_sample_train.tsv), and got the following results:
Screenshot_1

Please note the prediction results were mostly the same, regardless of the training dataset (mine or the one provided).

Am I missing something? Could you please help with this? It would be much appreciated.

How to finetune your pretrained model for QA task?

Hi,

Thank you for sharing your great work!

Can you please tell me how to fine-tune one of your pretrained LMs for a question-answering (QA) task?

As input, I have a question and a context. As output, one or multiple answers.

It's very urgent, please!

Thank you so much.

operative config gin file for arat5 base msa

hello

Where can we find the operative config gin file of the AraT5 base MSA model? It is needed to continue training the model using a TPU.

The operative config gin files of the English T5 are available in the folder of every pretrained model, such as
gs://t5-data/pretrained_models/small/operative_config.gin

Thank you

Error while trying the sample code.

Hello, Thanks for the models.
I was trying to run the provided sample code in the README, but I got the following error:

ValueError: Couldn't instantiate the backend tokenizer from one of:
(1) a `tokenizers` library serialization file,
(2) a slow tokenizer instance to convert or
(3) an equivalent slow tokenizer class to instantiate and convert.
You need to have sentencepiece installed to convert a slow tokenizer to a fast one.

Any ideas what the issue is?

Question about the tokenizer and vocabulary list for the AraT5 model

Hello,

I am working on a project that requires me to use the AraT5 model, and I am wondering what tokenizer was used to train the model and where I can find the vocabulary list.

I have searched the documentation and code for the model, but I haven't been able to find this information. If someone could provide me with this information or point me to where I can find it, I would greatly appreciate it.

Thank you!

Shouldn't we use Prefixes?

Thanks a lot for the great code. Just wondering: if I want to do closed-book question answering fine-tuning, do I need to specify a particular prefix before fine-tuning, or will examples of questions and answers be enough?
