
xtreme's Introduction

XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization

Tasks | Download | Baselines | Leaderboard | Website | Paper | Translations

This repository contains information about XTREME, code for downloading data, and implementations of baseline systems for the benchmark.

Introduction

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark evaluates the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages (spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks, and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil (spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the Niger-Congo languages Swahili and Yoruba, spoken in Africa.

For a full description of the benchmark, see the paper.

Tasks and Languages

The tasks included in XTREME cover a range of standard paradigms in natural language processing, including sentence classification, structured prediction, sentence retrieval and question answering. The full list of tasks can be seen in the image below.

The datasets used in XTREME

In order for models to be successful on the XTREME benchmark, they must learn representations that generalize across many tasks and languages. Each of the tasks covers a subset of the 40 languages included in XTREME (shown here with their ISO 639-1 codes): af, ar, bg, bn, de, el, en, es, et, eu, fa, fi, fr, he, hi, hu, id, it, ja, jv, ka, kk, ko, ml, mr, ms, my, nl, pt, ru, sw, ta, te, th, tl, tr, ur, vi, yo, and zh. The languages were selected among the top 100 languages with the most Wikipedia articles to maximize language diversity, task coverage, and availability of training data. They include members of the Afro-Asiatic, Austro-Asiatic, Austronesian, Dravidian, Indo-European, Japonic, Kartvelian, Kra-Dai, Niger-Congo, Sino-Tibetan, Turkic, and Uralic language families as well as of two isolates, Basque and Korean.

Download the data

In order to run experiments on XTREME, the first step is to install the dependencies. We assume you have installed Anaconda and use Python 3.7+. The additional requirements, including transformers, seqeval (for sequence labelling evaluation), tensorboardx, jieba, kytea, and pythainlp (for text segmentation in Chinese, Japanese, and Thai), and sacremoses, can be installed by running the following script:

bash install_tools.sh

The next step is to download the data. To this end, first create a download folder with mkdir -p download in the root of this project. You then need to manually download panx_dataset (for NER) from here (note that it will download as AmazonPhotos.zip) to the download directory. Finally, run the following command to download the remaining datasets:

bash scripts/download_data.sh

Note that in order to prevent accidental evaluation on the test sets while running experiments, we remove labels of the test data during pre-processing and change the order of the test sentences for cross-lingual sentence retrieval.

Build a baseline system

The evaluation setting in XTREME is zero-shot cross-lingual transfer from English. We fine-tune models that were pre-trained on multilingual data on the labelled data of each XTREME task in English. Each fine-tuned model is then applied to the test data of the same task in other languages to obtain predictions.

For every task, we provide a single script scripts/train.sh that fine-tunes pre-trained models implemented in the Transformers repo (https://github.com/huggingface/transformers). To fine-tune a different model, simply pass a different MODEL argument to the script. The currently supported models are bert-base-multilingual-cased, xlm-mlm-100-1280, and xlm-roberta-large.
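If you want to sweep over all three supported models for a given task, a minimal sketch along these lines should work (the task name udpos is only an example; scripts/train.sh is invoked exactly as in the commands below):

import subprocess

# Hypothetical sweep over the supported models for a single task.
MODELS = ["bert-base-multilingual-cased", "xlm-mlm-100-1280", "xlm-roberta-large"]
TASK = "udpos"  # any task name accepted by scripts/train.sh

for model in MODELS:
    # Equivalent to: bash scripts/train.sh <MODEL> <TASK>
    subprocess.run(["bash", "scripts/train.sh", model, TASK], check=True)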

Universal dependencies part-of-speech tagging

For part-of-speech tagging, we use data from the Universal Dependencies v2.5. You can fine-tune a pre-trained multilingual model on the English POS tagging data with the following command:

bash scripts/train.sh [MODEL] udpos

Wikiann named entity recognition

For named entity recognition (NER), we use data from the Wikiann (panx) dataset. You can fine-tune a pre-trained multilingual model on the English NER data with the following command:

bash scripts/train.sh [MODEL] panx

PAWS-X sentence classification

For sentence classification, we use the Cross-lingual Paraphrase Adversaries from Word Scrambling (PAWS-X) dataset. You can fine-tune a pre-trained multilingual model on the English PAWS data with the following command:

bash scripts/train.sh [MODEL] pawsx

XNLI sentence classification

The second sentence classification dataset is the Cross-lingual Natural Language Inference (XNLI) dataset. You can fine-tune a pre-trained multilingual model on the English MNLI data with the following command:

bash scripts/train.sh [MODEL] xnli

XQuAD, MLQA, TyDiQA-GoldP question answering

For question answering, we use the data from the XQuAD, MLQA, and TyDiQA-Gold Passage datasets. For XQuAD and MLQA, the model should be trained on the English SQuAD training set. For TyDiQA-Gold Passage, the model is trained on the English TyDiQA-GoldP training set. Using the following command, you can first fine-tune a pre-trained multilingual model on the corresponding English training data, and then you can obtain predictions on the test data of all tasks.

bash scripts/train.sh [MODEL] [xquad,mlqa,tydiqa]

BUCC sentence retrieval

For cross-lingual sentence retrieval, we use the data from the Building and Using Parallel Corpora (BUCC) shared task. As the models are not trained for this task but the representations of the pre-trained models are directly used to obtain similarity judgements, you can directly apply the model to obtain predictions on the test data of the task:

bash scripts/train.sh [MODEL] bucc2018
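To make the retrieval setup concrete, the sketch below illustrates nearest-neighbour mining with cosine similarity over pre-computed sentence embeddings; it is only an illustration and does not reproduce the candidate scoring and threshold optimization implemented in third_party/evaluate_retrieval.py (the embeddings are assumed to come from a pre-trained encoder):

import numpy as np

def mine_nearest(src_emb: np.ndarray, trg_emb: np.ndarray):
    """Return, for each source sentence, the index and cosine similarity
    of its nearest target sentence. Both inputs are (n, d) arrays of
    sentence embeddings from a pre-trained encoder (placeholder here)."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    trg = trg_emb / np.linalg.norm(trg_emb, axis=1, keepdims=True)
    sims = src @ trg.T                      # pairwise cosine similarities
    best = sims.argmax(axis=1)              # nearest target per source sentence
    return best, sims[np.arange(len(best)), best]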

Tatoeba sentence retrieval

The second cross-lingual sentence retrieval dataset we use is the Tatoeba dataset. Similarly to BUCC, you can directly apply the model to obtain predictions on the test data of the task:

bash scripts/train.sh [MODEL] tatoeba

Leaderboard Submission

Submissions

To submit your predictions to XTREME, please create one single folder that contains 9 sub-folders named after the tasks, i.e., udpos, panx, xnli, pawsx, xquad, mlqa, tydiqa, bucc2018, tatoeba. Inside each sub-folder, create a file containing the predicted labels of the test set for all languages. Name the file using the format test-{language}.{extension}, where language is the 2-character language code and extension is json for QA tasks and tsv for the other tasks. You can see an example of the folder structure in mock_test_data/predictions.
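As an illustration, the following sketch creates the expected folder layout (the base folder name and language codes are placeholders; mock_test_data/predictions remains the authoritative example):

import os

TASKS = ["udpos", "panx", "xnli", "pawsx", "xquad", "mlqa", "tydiqa", "bucc2018", "tatoeba"]
QA_TASKS = {"xquad", "mlqa", "tydiqa"}

def make_submission_skeleton(base_dir="my_submission", languages=("en", "de")):
    # One sub-folder per task, one prediction file per language.
    for task in TASKS:
        ext = "json" if task in QA_TASKS else "tsv"
        task_dir = os.path.join(base_dir, task)
        os.makedirs(task_dir, exist_ok=True)
        for lang in languages:
            open(os.path.join(task_dir, f"test-{lang}.{ext}"), "a").close()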

Evaluation

We will compare your submissions with our label files using the following command:

python evaluate.py --prediction_folder [path] --label_folder [path]

Translations

As part of training translate-train and translate-test baselines, we have automatically translated English training sets to other languages and test sets to English. Translations are available for the following datasets: SQuAD v1.1 (only train and dev), MLQA, PAWS-X, TyDiQA-GoldP, XNLI, and XQuAD.

For PAWS-X and XNLI, the translations are in the following format:
Column 1 and Column 2: original sentence pair
Column 3 and Column 4: translated sentence pair
Column 5: label

This format preserves the association between the original data and their translations.
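For example, such a row could be read as follows (a sketch; the file name is hypothetical and the files are assumed to be tab-separated without a header):

import csv

with open("translate-train.de.tsv", encoding="utf-8") as f:
    # QUOTE_NONE: treat the file as plain tab-separated text, no quote handling.
    for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
        orig_sent1, orig_sent2, trans_sent1, trans_sent2, label = row[:5]
        # orig_* hold the original sentence pair, trans_* the translated pair.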

For XNLI and XQuAD, we have furthermore created pseudo test sets by automatically translating the English test set to the remaining languages in XTREME so that test data for all 40 languages is available. Note that these translations are noisy and should not be treated as ground truth.

All translations are available here.

Paper

If you use our benchmark or the code in this repo, please cite our paper \cite{hu2020xtreme}.

@article{hu2020xtreme,
      author    = {Junjie Hu and Sebastian Ruder and Aditya Siddhant and Graham Neubig and Orhan Firat and Melvin Johnson},
      title     = {XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization},
      journal   = {CoRR},
      volume    = {abs/2003.11080},
      year      = {2020},
      archivePrefix = {arXiv},
      eprint    = {2003.11080}
}

Please also consider including a note similar to the one below in your paper to make sure you cite all the individual datasets.

We experiment on the XTREME benchmark \cite{hu2020xtreme}, a composite benchmark for multilingual learning consisting of data from the XNLI \cite{Conneau2018xnli}, PAWS-X \cite{Yang2019paws-x}, UD-POS \cite{nivre2018universal}, Wikiann NER \cite{Pan2017}, XQuAD \cite{artetxe2020cross}, MLQA \cite{Lewis2020mlqa}, TyDiQA-GoldP \cite{Clark2020tydiqa}, BUCC 2018 \cite{zweigenbaum2018overview}, and Tatoeba \cite{Artetxe2019massively} tasks. We provide their BibTeX information as follows.

@inproceedings{Conneau2018xnli,
    title = "{XNLI}: Evaluating Cross-lingual Sentence Representations",
    author = "Conneau, Alexis  and
      Rinott, Ruty  and
      Lample, Guillaume  and
      Williams, Adina  and
      Bowman, Samuel  and
      Schwenk, Holger  and
      Stoyanov, Veselin",
    booktitle = "Proceedings of EMNLP 2018",
    year = "2018",
    pages = "2475--2485",
}

@inproceedings{Yang2019paws-x,
    title = "{PAWS-X}: A Cross-lingual Adversarial Dataset for Paraphrase Identification",
    author = "Yang, Yinfei  and
      Zhang, Yuan  and
      Tar, Chris  and
      Baldridge, Jason",
    booktitle = "Proceedings of EMNLP 2019",
    year = "2019",
    pages = "3685--3690",
}

@article{nivre2018universal,
  title={Universal Dependencies 2.2},
  author={Nivre, Joakim and Abrams, Mitchell and Agi{\'c}, {\v{Z}}eljko and Ahrenberg, Lars and Antonsen, Lene and Aranzabe, Maria Jesus and Arutie, Gashaw and Asahara, Masayuki and Ateyah, Luma and Attia, Mohammed and others},
  year={2018}
}

@inproceedings{Pan2017,
author = {Pan, Xiaoman and Zhang, Boliang and May, Jonathan and Nothman, Joel and Knight, Kevin and Ji, Heng},
booktitle = {Proceedings of ACL 2017},
pages = {1946--1958},
title = {{Cross-lingual name tagging and linking for 282 languages}},
year = {2017}
}

@inproceedings{artetxe2020cross,
author = {Artetxe, Mikel and Ruder, Sebastian and Yogatama, Dani},
booktitle = {Proceedings of ACL 2020},
title = {{On the Cross-lingual Transferability of Monolingual Representations}},
year = {2020}
}

@inproceedings{Lewis2020mlqa,
author = {Lewis, Patrick and Oğuz, Barlas and Rinott, Ruty and Riedel, Sebastian and Schwenk, Holger},
booktitle = {Proceedings of ACL 2020},
title = {{MLQA: Evaluating Cross-lingual Extractive Question Answering}},
year = {2020}
}

@article{Clark2020tydiqa,
author = {Jonathan H. Clark and Eunsol Choi and Michael Collins and Dan Garrette and Tom Kwiatkowski and Vitaly Nikolaev and Jennimaria Palomaki},
journal = {Transactions of the Association for Computational Linguistics},
title = {{TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages}},
year = {2020}
}

@inproceedings{zweigenbaum2018overview,
  title={Overview of the third BUCC shared task: Spotting parallel sentences in comparable corpora},
  author={Zweigenbaum, Pierre and Sharoff, Serge and Rapp, Reinhard},
  booktitle={Proceedings of 11th Workshop on Building and Using Comparable Corpora},
  pages={39--42},
  year={2018}
}

@article{Artetxe2019massively,
author = {Artetxe, Mikel and Schwenk, Holger},
journal = {Transactions of the Association for Computational Linguistics},
title = {{Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond}},
year = {2019}
}

xtreme's People

Contributors

amineabdaoui, blazejdolicki, dhgarrette, junjiehu, jvamvas, keremzaman, liangtaiwan, maksymdel, melvinjosej, nconstant-google, orhanf, ritterng, sebastianruder, stefan-it, tonytan48, zphang


xtreme's Issues

Unable to reproduce the reported numbers for XNLI and PAWS-X with mBERT.

Hi,

We are trying to reproduce the numbers reported for mBERT on the XNLI and PAWS-X tasks in Table 12 and Table 15 of the XTREME paper (https://arxiv.org/pdf/2003.11080.pdf).

The hyperparameter settings we used are the same as those reported in the paper:

07/09/2021 11:46:56 - INFO - __main__ - Training/evaluation parameters Namespace(adam_epsilon=1e-08, cache_dir='', config_name='', data_dir='/home/ec2-user/xtreme/download//xnli', device=device(type='cuda'), do_eval=True, do_lower_case=False, do_predict=True, do_predict_dev=False, do_train=True, eval_all_checkpoints=True, eval_test_set=True, evaluate_during_training=False, fp16=False, fp16_opt_level='O1', gradient_accumulation_steps=4, init_checkpoint=None, learning_rate=2e-05, local_rank=-1, log_file='train', logging_steps=50, max_grad_norm=1.0, max_seq_length=128, max_steps=-1, model_name_or_path='bert-base-multilingual-cased', model_type='bert', n_gpu=1, no_cuda=False, num_train_epochs=2.0, output_dir='/home/ec2-user/xtreme/outputs-temp//xnli/bert-base-multilingual-cased-LR2e-5-epoch2-MaxLen128//', output_mode='classification', overwrite_cache=False, overwrite_output_dir=True, per_gpu_eval_batch_size=8, per_gpu_train_batch_size=32, predict_languages='ar,bg,de,el,en,es,fr,hi,ru,sw,th,tr,ur,vi,zh', save_only_best_checkpoint=True, save_steps=100, seed=42, server_ip='', server_port='', task_name='xnli', test_split='test', tokenizer_name='', train_language='en', train_split='train', warmup_steps=0, weight_decay=0.0)

 ======= Predict using the model from /home/ec2-user/xtreme/outputs-temp//xnli/bert-base-multilingual-cased-LR2e-5-epoch2-MaxLen128/checkpoint-best for test:
ar=0.3854291417165669
bg=0.39500998003992016
de=0.39401197604790417
el=0.38582834331337323
en=0.36966067864271457
es=0.38862275449101796
fr=0.3852295409181637
hi=0.4688622754491018
ru=0.3972055888223553
sw=0.37684630738522956
th=0.5409181636726547
tr=0.418562874251497
ur=0.47345309381237527
vi=0.40738522954091816
zh=0.4405189620758483
total=0.4151696606786427

System configuration:

Linux machine 1 GPUs, x86_64architecture, 250 GB storage, 61 GB RAM.

On XNLI our average accuracy is off by roughly 37 points compared to the 79.2 reported in the paper. We saw a similar discrepancy on the PAWS-X results. Further, with the xlm-roberta model we are getting a test performance of 100% while the validation numbers are in the range of 50-60%.

Can you please suggest what could be the reason for such a discrepancy?

Tatoeba dataset download fails with scripts/download_data.sh

The download of the Tatoeba dataset fails when running bash scripts/download_data.sh with the following output:

--2021-11-20 14:02:31--  https://github.com/facebookresearch/LASER/archive/master.zip
Resolving github.com (github.com)... 20.205.243.166
Connecting to github.com (github.com)|20.205.243.166|:443... connected.
HTTP request sent, awaiting response... No data received.
Retrying.

--2021-11-20 14:04:32--  (try: 2)  https://github.com/facebookresearch/LASER/archive/master.zip
Connecting to github.com (github.com)|20.205.243.166|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/facebookresearch/LASER/zip/master [following]
--2021-11-20 14:05:40--  https://codeload.github.com/facebookresearch/LASER/zip/master
Resolving codeload.github.com (codeload.github.com)... 20.205.243.165
Connecting to codeload.github.com (codeload.github.com)|20.205.243.165|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2021-11-20 14:05:41 ERROR 404: Not Found.

unzip:  cannot find or open master.zip, master.zip.zip or master.zip.ZIP.
mv: cannot stat '<redacted>/xtreme/download//tatoeba-tmp//LASER-master/data/tatoeba/v1/*': No such file or directory
Traceback (most recent call last):
  File "<redacted>/xtreme/utils_preprocess.py", line 533, in <module>
    tatoeba_preprocess(args)
  File "<redacted>/xtreme/utils_preprocess.py", line 397, in tatoeba_preprocess
    shutil.copy(src_file, src_out)
  File "/opt/conda/lib/python3.8/shutil.py", line 418, in copy
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/opt/conda/lib/python3.8/shutil.py", line 264, in copyfile
    with open(src, 'rb') as fsrc, open(dst, 'wb') as fdst:
FileNotFoundError: [Errno 2] No such file or directory: '<redacted>/xtreme/download//tatoeba-tmp//tatoeba.afr-eng.afr'

Error while using xlm-mlm-100-1280 in udpos

When I run xlm-mlm-100-1280 on udpos, test_de always produces the error "IndexError: pop from empty list" in save_predictions. If someone has encountered this kind of problem and solved it, could you help me? Thanks a lot.

Specifying pool_type doesn't work for retrieval evaluation

For the retrieval evaluation tasks (BUCC and Tatoeba), the command-line argument --pool_type isn't passed to the extract_embeddings function in evaluate_retrieval.py. So passing cls or mean as the argument doesn't change the pooling strategy, which might cause serious confusion during evaluation.
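For context, the two pooling strategies mentioned here are typically implemented along the following lines (a generic PyTorch sketch, not the repo's extract_embeddings code):

import torch

def pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor, pool_type: str):
    # hidden_states: (batch, seq_len, dim); attention_mask: (batch, seq_len)
    if pool_type == "cls":
        return hidden_states[:, 0]                      # embedding of the first ([CLS]) token
    mask = attention_mask.unsqueeze(-1).float()
    return (hidden_states * mask).sum(1) / mask.sum(1)  # mean over non-padding tokens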

Bug in tydiqa training/evaluation file?

Hello,
When I try to train zero-shot (train on English data, test on the others) tydiqa-goldp following the script and command (bash scripts/train.sh tydiqa), it gives 0 on all evaluation datasets. I think there is definitely a bug, but I couldn't find it. I did not change anything except that in run_qa.sh it should be:
TRAIN_FILE=${TASK_DATA_DIR}/tydiqa-goldp-v1.1-train/tydiqa.goldp.en.train.json
instead of
TRAIN_FILE=${TASK_DATA_DIR}/tydiqa-goldp-v1.1-train/tydiqa.en.train.json

Here is my training log: https://paste.ee/p/e6sgs

Kindly let me know if I am missing something here

Missing languages in udpos task

The UDPOS_LANGUAGES variable in utils_preprocessing.py is still the old list of languages, which means that five new languages (lt, pl, uk, wo, ro) will not be pre-processed. A similar issue also occurs in train_udpos.sh, where the language wo is missing from the evaluation language list.

kytea dependency

Hi,

I've just seen that kytea is needed as a dependency and will be cloned/installed by the installation script.

But I couldn't find any usage of it in the Python scripts (or any command-line usage), so is kytea really needed for the XTREME benchmark? 🤔

Issue with panx preprocessing

Hi,

I just wanted to reproduce the NER pipeline with the bert-base-multilingual-cased model using the installation instructions provided in the readme.

Model training is working, but the prediction step did not work for some languages, because an exception is thrown here:

assert len(langs) == max_seq_length

I did a bit of debugging and found the following:

For some languages, there are two consecutive empty lines in the test files, and then the utils_tag.py script produces non-working features.

E.g. The test set for Vietnamese:

$ wc -l download/panx/test-vi.tsv
74967

With cat -s (squeeze option) consecutive empty lines are replaced by only one empty line:

$ cat -s download/panx/test-vi.tsv | wc -l
74966

With that command, consecutive empty lines can be detected. I then wrote a script that replaces these consecutive empty lines with only one:

for file in *.tsv
do
  sed -i 'N;/^\n$/D;P;D;' $file
done

and placed it into the download/panx folder.

After running this script, all test sets could be processed in the prediction step.

The problem with these consecutive empty lines only occurs for NER (the PoS pipeline is fine)! I'm not sure if this workaround has a negative impact on the submission evaluation script 🤔
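For reference, the same blank-line squeezing can be done in Python (a sketch; the path pattern is illustrative and files are overwritten in place):

import glob

for path in glob.glob("download/panx/*.tsv"):
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()
    out, prev_blank = [], False
    for line in lines:
        blank = line.strip() == ""
        if not (blank and prev_blank):   # drop a blank line that follows another blank line
            out.append(line)
        prev_blank = blank
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(out)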

Reproducibility question

Hi, I found that I can't achieve the same results as the paper, especially on XNLI. I just ran the scripts without any other changes.

Adapting xtreme for dependency parsing

I would like to use xtreme on a dependency parsing task (using UD). I already adjusted a custom text classification dataset to use with xtreme, but the preprocessing scripts for sequence tagging seem much more complicated. Do you think it would require a lot of changes to adapt the code for this new task or not really? Could you maybe point me in the right direction?

LaReQA (XQuAD-R) results reproducibility for mBERT

@sebastianruder @nconstant-google

Sorry to bother you. There are some details that are not clear to me. Could you answer the following questions?

========================================================

I found the training and evaluating scripts for LaReQA on

https://github.com/google-research/xtreme/blob/master/scripts/train_lareqa.sh
https://github.com/google-research/xtreme/blob/master/scripts/run_eval_lareqa.sh

Are these scripts and the source code associated with this paper?
Is there any difference compared to the implementation described in the original paper?

=========================================================

For XQuAD-R, I would like to know the K value of mAP@K.

Is it equal to the number of correct candidates, i.e. 11 relevant answers ==> mAP@11?

However, the metric in the source code of this repo is set to 20, i.e. mAP@20.

Actually, I would like to reproduce the En-EN results of mBERT (mAP=0.29).

I do not know the original K value in the paper.

=========================================================

Thanks a lot ~

Bug in conll.py

On line 211, the conll.py script calls DependencyTree.add_node with 2 positional arguments. However, the documentation specifies that it accepts only 1 positional argument. I couldn't get this script to work with networkx v2.3 or v2.5.

# conll.py
           T.add_edge(new_index_dict[h],new_index_dict[d],deprel=self[h][d]["deprel"])
[ins] In [3]: T.add_node?
Signature: T.add_node(node_for_adding, **attr)
Docstring:
Add a single node `node_for_adding` and update node attributes.

Parameters
----------
node_for_adding : node
    A node can be any hashable Python object except None.
attr : keyword arguments, optional
    Set or change node attributes using key=value.

Typo in README

In README there's this sentence: "For every task, we provide a single script scripts/train.sh that fine-tunes pre-trained models implemented in the [Transformers] repo." It looks like you wanted to provide a link to https://github.com/huggingface/transformers but forgot to do it. Is that the case?

Error while loading pre-trained model for PAWSX and XNLI

When we try to load a pre-trained model from model_name_or_path in this line of run_classify.py, it throws an error (as expected).

I think we should instead use args.init_checkpoint as:

if args.init_checkpoint and os.path.exists(args.init_checkpoint):
    # set global_step to gobal_step of last saved checkpoint from model path
    global_step = int(args.init_checkpoint.split("-")[-1].split("/")[0])

Am I missing anything?

Evaluation of bucc2018

Hi,

I notice that a threshold is needed for the evaluation of bucc2018 on the dev or test sets, and it is determined from the gold file. We can get the threshold for the dev set, but we cannot get it for the test set. So I can only generate the prediction files without the score-filtering step and submit them to the leaderboard, since I cannot find the test gold file of bucc2018. Is that right? I guess it will affect the bucc2018 performance.

According to the code in third_party/utils_retrieve.py, we should determine the threshold before generating the prediction file:

def bucc_eval(candidates_file, gold_file, src_file, trg_file, src_id_file, trg_id_file, predict_file, mode, threshold=None, encoding='utf-8'):
    candidate2score = read_candidate2score(candidates_file, src_file, trg_file, src_id_file, trg_id_file, encoding)
    threshold = bucc_optimize(candidate2score, gold)
    bitexts = bucc_extract(candidate2score, threshold, predict_file)

line 65: 25580 Segmentation fault

Hi

I am trying to build the baseline training model for the udpos task. When I run the bash scripts/train.sh [model] udpos command, I always get the following error, which seems to indicate a problem with the last line of parameters passed to $REPO/third_party/run_tag.py.

Changing action='store_true' to default=True in parser.add_argument or adjusting train_batch_size does not solve it.
How can I fix it? Thanks!

Here is the output file:

Fine-tuning bert-base-multilingual-cased on udpos using GPU 0
Load data from /home/mwp141/Working/xtreme/download, and save models to /home/mwp141/Working/xtreme/outputs-temp
12/04/2020 11:27:19 - INFO - root - Input args: Namespace(adam_epsilon=1e-08, cache_dir=None, config_name='', data_dir='/home/mwp141/Working/xtreme/download/udpos/udpos_processed_maxlen128', device=device(type='cuda'), do_eval=True, do_lower_case=False, do_predict=True, do_predict_dev=True, do_train=True, eval_all_checkpoints=True, eval_patience=-1, evaluate_during_training=True, few_shot=-1, fp16=False, fp16_opt_level='O1', gradient_accumulation_steps=4, init_checkpoint=None, labels='/home/mwp141/Working/xtreme/download/udpos/udpos_processed_maxlen128/labels.txt', learning_rate=2e-05, local_rank=-1, log_file='/home/mwp141/Working/xtreme/outputs-temp/udpos/bert-base-multilingual-cased-LR2e-5-epoch2-MaxLen128/train.log', logging_steps=50, max_grad_norm=1.0, max_seq_length=128, max_steps=-1, model_name_or_path='bert-base-multilingual-cased', model_type='bert', n_gpu=1, no_cuda=False, num_train_epochs=2.0, output_dir='/home/mwp141/Working/xtreme/outputs-temp/udpos/bert-base-multilingual-cased-LR2e-5-epoch2-MaxLen128', overwrite_cache=False, overwrite_output_dir=True, per_gpu_eval_batch_size=8, per_gpu_train_batch_size=5, predict_langs='af,ar,bg,de,el,en,es,et,eu,fa,fi,fr,he,hi,hu,id,it,ja,kk,ko,mr,nl,pt,ru,ta,te,th,tl,tr,ur,vi,yo,zh', save_only_best_checkpoint=True, save_steps=500, seed=1, server_ip='', server_port='', tokenizer_name='', train_langs='en', warmup_steps=0, weight_decay=0.0)
12/04/2020 11:27:19 - WARNING - main - Process rank: -1, device: cuda, n_gpu: 1, distributed training: False, 16-bits training: False
/home/mwp141/Working/xtreme/scripts/train_udpos.sh: line 65: 25580 Segmentation fault (core dumped) python3 $REPO/third_party/run_tag.py
--data_dir $DATA_DIR
--model_type $MODEL_TYPE
--labels $DATA_DIR/labels.txt
--model_name_or_path $MODEL
--output_dir $OUTPUT_DIR
--max_seq_length $MAX_LENGTH
--num_train_epochs $NUM_EPOCHS
--gradient_accumulation_steps $GRAD_ACC
--per_gpu_train_batch_size $BATCH_SIZE
--save_steps 500
--seed 1
--predict_langs $LANGS
--log_file $OUTPUT_DIR/train.log
--learning_rate $LR

Tatoeba Labels

Hi, are the labels for Tatoeba just their line numbers starting from 0? I tried this and only managed to get an accuracy of 19.97 for mBERT. Thanks!

New tasks in xtreme-r

Hi folks,

Could we get training and evaluation scripts for the newly added tasks in XTREME-R, such as the Mewsli-X dataset? Or do you have a plan to share these scripts and data?

Thanks.

Bug in indexing dev set examples UD POS

Something is wrong with the indexing of examples in the dev set of the POS tagging task. In this file, every index is assigned to one sentence. However, we can see that index 772 is assigned to two sentences:
[screenshot]

This results in the predictions of two separate sentences being merged together, as in the following table, which ruins the alignment between a sentence and its prediction. You can see that the highlighted ending in row 772 should actually be in the next row.
[screenshot]

Row 772 is an example, but there are two more such erroneous rows.

Training/test splits of the BUCC dataset

Hi,

Congratulations on your paper!

I want to know the training/test splits of the BUCC dataset.
In the paper, it says "we evaluate representations on the test sets directly", but the training data are renamed as the test set here:

for f in $base_dir/*training*; do mv $f ${f/training/test}; done

So which split is used as the test set of BUCC in XTREME? The training set or the test set?

Thanks.

Tatoeba code is broken

The file scripts/run_tatoeba.sh uses the following command to call the Tatoeba-related Python code:

  python $REPO/third_party/run_retrieval.py \
  ...

However, the file run_retrieval.py has been renamed to evaluate_retrieval.py in commit c168376

XQuAD can have a better evaluation

When evaluating XQuAD, it uses the original SQuAD evaluation scripts.
However, the script does not consider tokenization for Chinese (MLQA's script does).
E.g. "天氣很好" (The weather is good.) should be tokenized into "天", "氣", "很", "好" or "天氣", "很好".

I'm sure that it will give a more convincing score on Chinese, but I'm not sure whether it's better for other languages.

I modified the original code as follows:

def is_english_num(s):
    try:
        s.encode(encoding='utf-8').decode('ascii')
    except UnicodeDecodeError:
        return False
    else:
        return True

def get_tokens(s):
    if not s:
        return []
    if is_english_num(s):
        return normalize_answer(s).split()
    return [w for w in normalize_answer(s)]

def f1_score(prediction, ground_truth):
    prediction_tokens = get_tokens(prediction)
    ground_truth_tokens = get_tokens(ground_truth)
    common = Counter(prediction_tokens) & Counter(ground_truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0
    precision = 1.0 * num_same / len(prediction_tokens)
    recall = 1.0 * num_same / len(ground_truth_tokens)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1

Maybe you can consider it?
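As a quick illustration of the proposed change (assuming the functions above live in the SQuAD evaluation script, which already provides Counter and normalize_answer, and assuming normalize_answer leaves these strings unchanged):

# With the character-level get_tokens above:
#   prediction   = "天氣很好" -> ["天", "氣", "很", "好"]
#   ground truth = "天氣不好" -> ["天", "氣", "不", "好"]
# Three characters overlap, so precision = recall = 3/4 and F1 = 0.75,
# whereas whitespace tokenization treats each string as a single token and yields F1 = 0.
print(f1_score("天氣很好", "天氣不好"))  # expected: 0.75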

TypeError: 'method' object is not subscriptable on running download_data.sh

I see the following error when I run the

scripts/download_data.sh

Traceback (most recent call last):
  File "/home/ec2-user/xtreme/third_party/ud-conversion-tools/conllu_to_conll.py", line 53, in <module>
    main()
  File "/home/ec2-user/xtreme/third_party/ud-conversion-tools/conllu_to_conll.py", line 41, in main
    orig_treebank = cio.read_conll_u(args.input)#, args.keep_fused_forms, args.lang, POSRANKPRECEDENCEDICT)
  File "/home/ec2-user/xtreme/third_party/ud-conversion-tools/lib/conll.py", line 382, in read_conll_u
    sent.nodes[token_dict['id']].update({k: v for (k, v) in token_dict.items()
TypeError: 'method' object is not subscriptable

Test run for panx failing on subset of langs

Thanks for open sourcing the code!

I was trying to run NER for en and de, with training and testing on both, using mBERT-cased.
I set $LANG to 'en,de' and passed it to both train_langs and predict_langs. The training and eval work fine, but then I get the same error as in #16.

Posting the trace here :

06/04/2020 20:42:13 - INFO - transformers.modeling_utils -   loading weights file /content/xtreme-dev/outputs-temp//panx/bert-base-multilingual-cased-LR2e-5-epoch-MaxLen128/checkpoint-best/pytorch_model.bin
06/04/2020 20:42:18 - INFO - __main__ -   all languages = en
06/04/2020 20:42:18 - INFO - __main__ -   Creating features from dataset file at /content/xtreme-dev/download//panx/panx_processed_maxlen128/en/test.bert-base-multilingual-cased in language en
06/04/2020 20:42:18 - INFO - utils_tag -   lang_id=0, lang=en, lang2id=None
06/04/2020 20:42:18 - INFO - utils_tag -   Writing example 0 of 10634
06/04/2020 20:42:18 - INFO - utils_tag -   *** Example ***
06/04/2020 20:42:18 - INFO - utils_tag -   guid: en-1
06/04/2020 20:42:18 - INFO - utils_tag -   tokens: [CLS] Shortly after ##ward , an en ##cou ##raging response influenced him to go to India ; he arrived at Ad ##yar in 1884 . [SEP]
06/04/2020 20:42:18 - INFO - utils_tag -   input_ids: 101 50752 10662 16988 117 10151 10110 30656 108545 21001 31377 10957 10114 11783 10114 11098 132 10261 22584 10160 25474 22953 10106 13366 119 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
06/04/2020 20:42:18 - INFO - utils_tag -   input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
06/04/2020 20:42:18 - INFO - utils_tag -   segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
06/04/2020 20:42:18 - INFO - utils_tag -   label_ids: -100 6 6 -100 6 6 6 -100 -100 6 6 6 6 6 6 6 6 6 6 6 6 -100 6 6 6 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100
06/04/2020 20:42:18 - INFO - utils_tag -   langs: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
example.langs [] [] 0
ex_index 1 10634
Traceback (most recent call last):
  File "/content/xtreme-dev/third_party/run_tag.py", line 698, in <module>
    main()
  File "/content/xtreme-dev/third_party/run_tag.py", line 637, in main
    result, predictions = evaluate(args, model, tokenizer, labels, pad_token_label_id, mode="test", lang=lang, lang2id=lang2id)
  File "/content/xtreme-dev/third_party/run_tag.py", line 247, in evaluate
    eval_dataset = load_and_cache_examples(args, tokenizer, labels, pad_token_label_id, mode=mode, lang=lang, lang2id=lang2id)
  File "/content/xtreme-dev/third_party/run_tag.py", line 358, in load_and_cache_examples
    lang=lg
  File "/content/xtreme-dev/third_party/utils_tag.py", line 218, in convert_examples_to_features
    assert len(langs) == max_seq_length
TypeError: object of type 'NoneType' has no len()

Test sets in downloaded UD POS don't have labels

The description pretty much says it all: the test labels are not downloaded, which makes evaluation on the test set impossible. Is there a valid reason for that, or is it just an oversight? I checked the original link and the test labels are available in the .conllu files.

Possible error in structure prediction tasks

Hi xtreme team,
Thank you for your work on the leaderboard. However, the evaluation metric reported for UDPOS seems inconsistent with the current code release. According to the Table 20 POS accuracy results in the paper (https://arxiv.org/pdf/2003.11080.pdf), the evaluation metric for POS is accuracy, and the average result for XLM-R is 73.8. However, the code in third_party/run_tag.py only imports F1-related measurements from seqeval, and the default evaluation for UDPOS is actually the F1 score. I reproduced the UDPOS experiment and used different measurements on the test set (sorry, I used the leaked test set on my local machine for quicker evaluation). With the default script and XLM-R large, I get an average F1 score of 74.2, which is in line with the reported 73.8; for English, the F1 score is 96.15. However, if I evaluate with accuracy, I get 96.7 for English and 78.23 on average. Hence I suspect the evaluation on the leaderboard and in the paper for UDPOS is actually the F1 score. Could you help address this issue? I have reproduced experiment results here: https://docs.google.com/spreadsheets/d/16Cv0IIdZGOyx6xUawcKScb38Cl3ofy0tHJSdWrt07LI/edit?usp=sharing
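For readers following this discussion, token-level accuracy can be computed independently of seqeval along these lines (a toy sketch, not the repo's evaluation code):

# Token-level accuracy (the metric the paper reports for POS), computed over
# gold/predicted tag sequences; run_tag.py instead reports seqeval's f1_score,
# which is why the two numbers discussed above can disagree.
def token_accuracy(y_true, y_pred):
    total = sum(len(s) for s in y_true)
    correct = sum(t == p for ts, ps in zip(y_true, y_pred) for t, p in zip(ts, ps))
    return correct / total

print(token_accuracy([["NOUN", "VERB", "DET", "NOUN"]],
                     [["NOUN", "VERB", "DET", "ADJ"]]))  # 0.75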

XTREME-R Availability

Hi Guys,

Great work extending the XTREME benchmark to the much harder XTREME-R. I wanted to understand the availability of the benchmarks and the leaderboard. In particular, I'm looking at Mewsli-X availability; I was able to fetch the other datasets from the sources mentioned in the paper.

Thanks

Translation data

Hi, thank you for releasing the translation data along with the benchmarks! I'm looking at the translation data for SQuAD train that you provide: https://console.cloud.google.com/storage/browser/xtreme_translations/SQuAD/translate-train/ . Based on my understanding of the paper, it should cover all the languages in MLQA and XQuAD in the translate-train setting, but several are missing in this folder. I'm wondering if you plan to provide data for the rest of the languages? Thanks!

out of memory

XNLI
I use all the XNLI translation data (5,890,530 examples) with Distributed Data Parallel and run into what looks like a memory leak: RAM usage reaches 300 GB before the job goes out of memory.

some lib version:
seqeval
tensorboardx
tokenizer
sacremoses
pythainlp
jieba
torch==1.4.0

Number of examples clarification for DE in POS

Hi, I saw that there are 22,357 examples in the DE test set for POS, even though the paper says there are between 47 and 20,436 examples in the test sets. Is that correct?
Also, it would be helpful if you could list the number of examples in the test set for each language in each task so we can check if what we have is correct after preprocessing.
Thank you!

Question regarding release of best current model

Hello Google Research Team,

Thank you for this awesome repo and for the baseline code. As part of a downstream task in machine translation, I require a well-performing model on the PAWS-X dataset. I have been attempting to fine-tune some models using the code here, but my test accuracies on PAWS-X are still in the mid 50's.

I was wondering when the current best performing XLM-R model would be released for downstream usage?

Thank you.

Tatoeba translation sentence pair alignment

Firstly thanks for the great dataset 👍

I was looking through it and was just a little confused about the Tatoeba dataset layout.

After running the download_data.sh script, the Tatoeba dataset seems to be misaligned. I am not sure if this is an issue for the evaluation scripts, but it is different from how the data appear at the source.

I manually ran the steps in utils_preprocess.py for Tatoeba and downloaded the master data from LASER, and when comparing these files they are correctly aligned between sentence pairs.

However, after running utils_preprocess.py the files appear to be out of alignment.

For example, comparing the Spanish files es-en.es and es-en.en, they are aligned like this:
[screenshot]

When comparing the files from LASER directly, namely tatoeba.spa-eng and tatoeba.spa-eng.spa:
[screenshot]

Like I said, I am not sure if this is intentional for testing or not, and I can't see any other way to align them at the moment. So maybe this is just for testing? Either way, it might be a good idea to add some clarification as to whether the files are supposed to be aligned or not.

thanks again

Cathal

MLQA results reproducibility for XLM-R

Hi, thanks for the benchmark and the accompanied code!

I am trying to replicate the MLQA scores from the XTREME paper using this repo's code. I use the evaluate_mlqa.py file to evaluate the results of the XLM-R model, but the score for the zh language is wrong while the others are right. Here are my results:

en  {"exact_match": 70.62985332182916, "f1": 83.3786980958141}                                                                                                                       
es  {"exact_match": 56.10127546164097, "f1": 73.58575605485278}                                                                                                                      
de  {"exact_match": 54.85941996900598, "f1": 69.47592323594147}                                                                                                                      
ar  {"exact_match": 45.7357075913777, "f1": 65.1863063068997}                                                                                                                        
hi  {"exact_match": 51.830012200081335, "f1": 68.98355428605254}                                                                                                                     
vi  {"exact_match": 51.30118289353958, "f1": 72.50031191585738}                                                                                                                      
zh  {"exact_match": 5.723184738174032, "f1": 19.154113403670923}

XQuAD results reproducibility for mBERT

Hi, thanks for the benchmark and the accompanied code!

I am trying to replicate the XQuAD scores from the XTREME paper using this repo's code.
I ran the mBERT cased model with default parameters and strictly followed the instructions in the README file.

However, the results for some languages are much lower than the scores from the paper.
In particular, for vi and th the gap is two-fold. There is also a significant drop for hi and el.
The en, es, and de results, on the other hand, are comparable.

Below I provide a table with the scores I just obtained from running the code, together with the corresponding numbers from the paper. @sebastianruder, could I please ask you to take a look at it?

paper: {"f1", "exact_match"}

XQuAD 

  en {"exact_match": 71.76470588235294, "f1": 83.86480699632085} paper: 83.5 / 72.2 
  es {"exact_match": 53.94957983193277, "f1": 73.27239623706365} paper: 75.5 / 56.9 
  de {"exact_match": 52.35294117647059, "f1": 69.47398743963343} paper: 70.6 / 54.0 
  el {"exact_match": 33.61344537815126, "f1": 48.94642083187724} paper: 62.6 / 44.9 
  ru {"exact_match": 52.10084033613445, "f1": 69.82661430981189} paper: 71.3 / 53.3
  tr {"exact_match": 32.35294117647059, "f1": 46.14441800236999} paper: 55.4 / 40.1
  ar {"exact_match": 42.52100840336134, "f1": 59.72583892569921} paper: 61.5 / 45.1 
  vi {"exact_match": 15.210084033613445, "f1": 33.112047090752164} paper: 69.5 / 49.6 
  th {"exact_match": 15.294117647058824, "f1": 24.87707204093759} paper: 42.7 / 33.5 
  zh {"exact_match": 48.99159663865546, "f1": 58.654625486558196} paper: 58.0 / 48.3 
  hi {"exact_match": 22.436974789915965, "f1": 38.31058195464005} paper: 59.2 / 46.0 

XNLI Processor alignment issue

In the preprocessing script for XNLI, xtreme/third_party/processors/xnli.py, lines 42-43 make the XNLI processor skip the first line. I noted that this processor is inherited from the transformers processor class; skipping the first line is useful for the raw XNLI 1.0 dataset, as it contains a header (premise, label, etc.). However, the xtreme preprocessing script already removes the original XNLI header, so for the preprocessed XNLI files {split}-{lang}, such as dev-en.tsv, the first actual example is skipped. Maybe lines 42-43 of xtreme/third_party/processors/xnli.py can be deleted, since the preprocessing script already removes the XNLI header.

Adding languages

I am working on evaluating the performance of a minority language (Norwegian, not included in XTREME). Unfortunately, there are no good benchmarks available in the target language, so we are building our own. Such single-language benchmarks will not be useful for comparing against, for instance, English.

Could translating XTREME (or some of its tests) be an alternative for benchmarking individual languages as well? How much work should be estimated for such a job?

Don't remove_qa_test_annotations

In utils_qa_test_annotations, all the QA test data annotations are removed.
However, the annotations are needed when running scripts/eval_qa.sh.

I think that there is a better way to prevent cheating since it's not hard to get the original testing data.

download_udpos error

I got the following error when downloading udpos:

*** AttributeError: 'DependencyTree' object has no attribute 'node'

The error occurs in https://github.com/google-research/xtreme/blob/master/third_party/ud-conversion-tools/lib/conll.py.

It seems that in lines 380-381, sent only calls add_edge but never add_node.
As a result, there is no node in the graph (DependencyTree).

def read_conll_u(self, filename, keepFusedForm=False, lang=None, posPreferenceDict=None):
    sentences = []
    sent = DependencyTree()
    multi_tokens = {}
    for line_no, line in enumerate(open(filename).readlines()):
        line = line.strip("\n")
        if not line:
            # Add extra properties to ROOT node if exists
            if 0 in sent:
                for key in ('form', 'lemma', 'cpostag', 'postag'):
                    sent.node[0][key] = 'ROOT'
            # Handle multi-tokens
            sent.graph['multi_tokens'] = multi_tokens
            multi_tokens = {}
            sentences.append(sent)
            sent = DependencyTree()
        elif line.startswith("#"):
            if 'comment' not in sent.graph:
                sent.graph['comment'] = [line]
            else:
                sent.graph['comment'].append(line)
        else:
            parts = line.split("\t")
            if len(parts) != len(self.CONLL_U_COLUMNS):
                error_msg = 'Invalid number of columns in line {} (found {}, expected {})'.format(line_no, len(parts), len(CONLL_U_COLUMNS))
                raise Exception(error_msg)
            token_dict = {key: conv_fn(val) for (key, conv_fn), val in zip(self.CONLL_U_COLUMNS, parts)}
            if isinstance(token_dict['id'], int):
                sent.add_edge(token_dict['head'], token_dict['id'], deprel=token_dict['deprel'])
                sent.node[token_dict['id']].update({k: v for (k, v) in token_dict.items()
                                                    if k not in ('head', 'id', 'deprel', 'deps')})
                for head, deprel in token_dict['deps']:
                    sent.add_edge(head, token_dict['id'], deprel=deprel, secondary=True)
            elif token_dict['id'] is not None:
                # print(token_dict['id'])
                first_token_id = int(token_dict['id'][0])
                multi_tokens[first_token_id] = token_dict
    return sentences

Plans for making codebase compatible with newer versions of Huggingface transformers

Is there any plan to update the codebase to make it compatible with newer versions of HF transformers? When I try to run an evaluation for my custom model, which is trained with a newer version, the older version of HF cannot load the weights.

Since there are major differences in the transformers API between the two versions, I cannot use the same code with the newer versions. Considering the wide usage of HF, it would be great to make a transition to the new versions.

If this sounds okay, I can send a PR for this purpose, but it will take time to make sure everything is working correctly.

Typo in conll.py

While running the script to extract data, this is the error generated for the POS task data.

in read_conll_u
sent.node[token_dict['id']].update({k: v for (k, v) in token_dict.items()
AttributeError: 'DependencyTree' object has no attribute 'node'

How to use the tydiqa dataset

I'm confused by this because it doesn't seem to have a test set, and the paper doesn't say which other dataset was used as the test set. The dev set also does not seem to be labelled. Furthermore, I don't see how to use tydiqa-goldp-v1.1-train.json together with the other training set files. Can someone help, please?

Tatoeba evaluation scripts

Hi! Thanks for the code.
While downloading the Tatoeba dataset, it is processed and saved in the format shown below in utils_preprocess.py:
[screenshot]

However, during evaluation, the data files are retrieved using the original name format (in run_retrieval.py):
[screenshot]

mistake in evaluate_retrieval.py line 517

Hello,

the line is:

results = bucc_eval(cand2score_file, gold_file, f'{prefix}.{SL}.txt', f'{prefix}.{TL}.txt', f'{prefix}.{SL}.id', f'{prefix}.id', predict_file, threshold)

I think it should be:

results = bucc_eval(cand2score_file, gold_file, f'{prefix}.{SL}.txt', f'{prefix}.{TL}.txt', f'{prefix}.{SL}.id', f'{prefix}.{TL}.id', predict_file, threshold)

Typo in bucc_extract() causing errors

In utils_retrieve.py, the bucc_extract() function tries to use args.encoding from the evaluate_retrieval.py arguments, but args isn't passed to this function. Although bucc_eval(), which calls bucc_extract(), has a parameter for encoding, it is not passed on to bucc_extract(). Also, args.encoding isn't passed to bucc_eval().

As a quick fix, adding an optional encoding parameter to bucc_extract() will solve the issue without changing many things.

Mewsli-X scripts

Hi Guys,

Is it possible to get training and evaluation scripts for the Mewsli-X dataset? We are trying to reproduce the baseline results from the XTREME-R paper (Table 4), and it would be helpful if you could share the scripts used for training the models.

Thanks
