turkish-bert's Introduction

🇹🇷 BERTurk

We present community-driven BERT, DistilBERT, ELECTRA and ConvBERT models for Turkish 🎉

Some of the datasets used for pretraining and evaluation were contributed by the awesome Turkish NLP community, which also chose the name of the BERT model: BERTurk.

Logo is provided by Merve Noyan.

Changelog

  • 23.09.2021: Release of uncased ELECTRA and ConvBERT models and cased ELECTRA model, all trained on mC4 corpus.
  • 24.06.2021: Release of new ELECTRA model, trained on Turkish part of mC4 dataset. Repository got new awesome logo from Merve Noyan.
  • 16.03.2021: Release of ConvBERTurk model and more evaluations on different downstream tasks.
  • 12.05.2020: Release of ELECTRA (small and base) models, see here.
  • 25.03.2020: Release of BERTurk uncased model and BERTurk models with larger vocab size (128k, cased and uncased).
  • 11.03.2020: Release of the cased distilled BERTurk model: DistilBERTurk. Available on the Hugging Face model hub.
  • 17.02.2020: Release of the cased BERTurk model. Available on the Hugging Face model hub.
  • 10.02.2020: Training corpus update, new TensorBoard links, new results for cased model.
  • 02.02.2020: Initial version of this repo.

Stats

The current version of the model is trained on a filtered and sentence segmented version of the Turkish OSCAR corpus, a recent Wikipedia dump, various OPUS corpora and a special corpus provided by Kemal Oflazer.

The final training corpus has a size of 35GB and 4,404,976,662 tokens.

Thanks to Google's TensorFlow Research Cloud (TFRC), we were able to train both cased and uncased models on a TPU v3-8. The TensorBoard outputs for the training can be found here.

We also provide cased and uncased models that use a larger vocab size (128k instead of 32k).

A detailed cheatsheet of how the models were trained can be found here.

C4 Multilingual dataset (mC4)

We've also trained an ELECTRA (cased) model on the recently released Turkish part of the multilingual C4 (mC4) corpus from the AI2 team.

After filtering documents with a broken encoding, the training corpus has a size of 242GB resulting in 31,240,963,926 tokens.

We used the original 32k vocab (instead of creating a new one).

Turkish Model Zoo

Here's an overview of all available models, incl. their training corpus size:

| Model name                 | Model hub link | Pre-training corpus size |
|----------------------------|----------------|--------------------------|
| ELECTRA Small (cased)      | here           | 35GB                     |
| ELECTRA Base (cased)       | here           | 35GB                     |
| ELECTRA Base mC4 (cased)   | here           | 242GB                    |
| ELECTRA Base mC4 (uncased) | here           | 242GB                    |
| BERTurk (cased, 32k)       | here           | 35GB                     |
| BERTurk (uncased, 32k)     | here           | 35GB                     |
| BERTurk (cased, 128k)      | here           | 35GB                     |
| BERTurk (uncased, 128k)    | here           | 35GB                     |
| DistilBERTurk (cased)      | here           | 35GB                     |
| ConvBERTurk (cased)        | here           | 35GB                     |
| ConvBERTurk mC4 (cased)    | here           | 242GB                    |
| ConvBERTurk mC4 (uncased)  | here           | 242GB                    |

DistilBERTurk

The distilled version of the cased model, so-called DistilBERTurk, was trained on 7GB of the original training data, using the cased version of BERTurk as the teacher model.

DistilBERTurk was trained with the official Hugging Face implementation from here.

The cased model was trained for 5 days on 4 RTX 2080 TI.

More details about distillation can be found in the "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter" paper by Sanh et al. (2019).

ELECTRA

In addition to the BERTurk models, we also trained ELECTRA small and base models. A detailed overview can be found in the ELECTRA section.

ConvBERTurk

In addition to the BERT and ELECTRA based models, we also trained a ConvBERT model. The ConvBERT architecture is presented in the "ConvBERT: Improving BERT with Span-based Dynamic Convolution" paper.

We follow a different training procedure: instead of using a two-phase approach that pre-trains the model for 90% of the steps with a sequence length of 128 and for the remaining 10% with a sequence length of 512, we pre-train the model with a sequence length of 512 for 1M steps on a v3-32 TPU.

More details about the pre-training can be found here.

mC4 ELECTRA

In addition to the ELECTRA base model, we also trained an ELECTRA model on the Turkish part of the mC4 corpus. We use a sequence length of 512 over the full training time and train the model for 1M steps on a v3-32 TPU.

Evaluation

For evaluation we use the latest Flair version (0.8.1) with a fine-tuning approach for the PoS tagging and NER downstream tasks. To evaluate models on a Turkish question answering dataset, we use the question answering example from the awesome 🤗 Transformers library.

We use the following hyperparameters for training PoS and NER models with Flair:

| Parameter     | Value |
|---------------|-------|
| batch_size    | 16    |
| learning_rate | 5e-5  |
| num_epochs    | 10    |

For the question answering task, we use the same hyperparameters as used in the "How Good Is Your Tokenizer?" paper.

The script train_flert_model.py in this repository can be used to fine-tune models on PoS tagging and NER datasets.
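
For reference, a minimal Flair fine-tuning sketch along these lines (an illustration only, not the train_flert_model.py script itself; the corpus loader, tag type and output path are assumptions):

import torch
from flair.datasets import UD_TURKISH  # assumption: Flair's bundled UD Turkish (IMST) loader
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

corpus = UD_TURKISH()
tag_type = "upos"
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)

# FLERT-style fine-tuning: the transformer weights are updated, no extra RNN/CRF on top
embeddings = TransformerWordEmbeddings(
    "dbmdz/bert-base-turkish-cased",
    fine_tune=True,
    layers="-1",
    subtoken_pooling="first",
)

tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type=tag_type,
    use_crf=False,
    use_rnn=False,
    reproject_embeddings=False,
)

trainer = ModelTrainer(tagger, corpus, optimizer=torch.optim.AdamW)
trainer.train(
    "resources/taggers/berturk-upos",  # output path (placeholder)
    learning_rate=5e-5,
    mini_batch_size=16,
    max_epochs=10,
)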

We fine-tune models with 5 different seeds and report the averaged accuracy (PoS tagging), F1-score (NER) or EM/F1 (question answering).

For some downstream tasks, we perform "Almost Stochastic Order" tests as proposed in the "Deep Dominance - How to Properly Compare Deep Neural Models" paper. The heatmap figures are heavily inspired by the "CharacterBERT" paper.

PoS tagging

We use two different PoS tagging datasets for Turkish from the Universal Dependencies project:

  • UD Turkish IMST
  • UD Turkish BOUN

We use the dev branch for training/dev/test splits.

Evaluation on IMST dataset

| Model                      | Development Accuracy | Test Accuracy |
|----------------------------|----------------------|---------------|
| BERTurk (cased, 128k)      | 96.614 ± 0.58        | 96.846 ± 0.42 |
| BERTurk (cased, 32k)       | 97.138 ± 0.18        | 97.096 ± 0.07 |
| BERTurk (uncased, 128k)    | 96.964 ± 0.11        | 97.060 ± 0.07 |
| BERTurk (uncased, 32k)     | 97.080 ± 0.05        | 97.088 ± 0.05 |
| ConvBERTurk                | 97.208 ± 0.10        | 97.346 ± 0.07 |
| ConvBERTurk mC4 (cased)    | 97.148 ± 0.07        | 97.426 ± 0.03 |
| ConvBERTurk mC4 (uncased)  | 97.308 ± 0.09        | 97.338 ± 0.08 |
| DistilBERTurk              | 96.362 ± 0.05        | 96.560 ± 0.05 |
| ELECTRA Base               | 97.122 ± 0.06        | 97.232 ± 0.09 |
| ELECTRA Base mC4 (cased)   | 97.166 ± 0.07        | 97.380 ± 0.05 |
| ELECTRA Base mC4 (uncased) | 97.058 ± 0.12        | 97.210 ± 0.11 |
| ELECTRA Small              | 95.196 ± 0.09        | 95.578 ± 0.10 |
| XLM-R (base)               | 96.618 ± 0.10        | 96.492 ± 0.06 |
| mBERT (cased)              | 95.504 ± 0.10        | 95.754 ± 0.05 |

UD IMST Development Results - PoS tagging

UD IMST Test Results - PoS tagging

Almost Stochastic Order tests (using the default alpha of 0.05) on test set:

UD IMST Almost Stochastic Order tests - Test set

Evaluation on BOUN dataset

| Model                      | Development Accuracy | Test Accuracy |
|----------------------------|----------------------|---------------|
| BERTurk (cased, 128k)      | 90.828 ± 0.71        | 91.016 ± 0.60 |
| BERTurk (cased, 32k)       | 91.460 ± 0.10        | 91.490 ± 0.10 |
| BERTurk (uncased, 128k)    | 91.010 ± 0.15        | 91.286 ± 0.09 |
| BERTurk (uncased, 32k)     | 91.322 ± 0.19        | 91.544 ± 0.09 |
| ConvBERTurk                | 91.250 ± 0.14        | 91.524 ± 0.07 |
| ConvBERTurk mC4 (cased)    | 91.552 ± 0.10        | 91.724 ± 0.07 |
| ConvBERTurk mC4 (uncased)  | 91.202 ± 0.16        | 91.484 ± 0.12 |
| DistilBERTurk              | 91.166 ± 0.10        | 91.044 ± 0.09 |
| ELECTRA Base               | 91.354 ± 0.04        | 91.534 ± 0.11 |
| ELECTRA Base mC4 (cased)   | 91.402 ± 0.14        | 91.746 ± 0.11 |
| ELECTRA Base mC4 (uncased) | 91.100 ± 0.13        | 91.178 ± 0.15 |
| ELECTRA Small              | 91.020 ± 0.11        | 90.850 ± 0.12 |
| XLM-R (base)               | 91.828 ± 0.08        | 91.862 ± 0.16 |
| mBERT (cased)              | 91.286 ± 0.07        | 91.492 ± 0.11 |

UD BOUN Development Results - PoS tagging

UD BOUN Test Results - PoS tagging

NER

We use the Turkish dataset split from the XTREME Benchmark.

These training/dev/test splits were introduced in the "Massively Multilingual Transfer for NER" paper and are based on the famous WikiANN dataset, which is presented in the "Cross-lingual Name Tagging and Linking for 282 Languages" paper.

| Model                      | Development F1-score | Test F1-score  |
|----------------------------|----------------------|----------------|
| BERTurk (cased, 128k)      | 93.796 ± 0.07        | 93.8960 ± 0.16 |
| BERTurk (cased, 32k)       | 93.470 ± 0.11        | 93.4706 ± 0.09 |
| BERTurk (uncased, 128k)    | 93.604 ± 0.12        | 93.4686 ± 0.08 |
| BERTurk (uncased, 32k)     | 92.962 ± 0.08        | 92.9086 ± 0.14 |
| ConvBERTurk                | 93.822 ± 0.14        | 93.9286 ± 0.07 |
| ConvBERTurk mC4 (cased)    | 93.778 ± 0.15        | 93.6426 ± 0.15 |
| ConvBERTurk mC4 (uncased)  | 93.586 ± 0.07        | 93.6206 ± 0.13 |
| DistilBERTurk              | 92.012 ± 0.09        | 91.5966 ± 0.06 |
| ELECTRA Base               | 93.572 ± 0.08        | 93.4826 ± 0.17 |
| ELECTRA Base mC4 (cased)   | 93.600 ± 0.13        | 93.6066 ± 0.12 |
| ELECTRA Base mC4 (uncased) | 93.092 ± 0.15        | 92.8606 ± 0.36 |
| ELECTRA Small              | 91.278 ± 0.08        | 90.8306 ± 0.09 |
| XLM-R (base)               | 92.986 ± 0.05        | 92.9586 ± 0.14 |
| mBERT (cased)              | 93.308 ± 0.09        | 93.2306 ± 0.07 |

XTREME Development Results - NER

XTREME Test Results - NER

Question Answering

We use the Turkish Question Answering dataset from this website and report EM and F1-score on the development set (as reported by the Transformers evaluation script).

| Model                      | Development EM | Development F1-score |
|----------------------------|----------------|----------------------|
| BERTurk (cased, 128k)      | 60.38 ± 0.61   | 78.21 ± 0.24         |
| BERTurk (cased, 32k)       | 58.79 ± 0.81   | 76.70 ± 1.04         |
| BERTurk (uncased, 128k)    | 59.60 ± 1.02   | 77.24 ± 0.59         |
| BERTurk (uncased, 32k)     | 58.92 ± 1.06   | 76.22 ± 0.42         |
| ConvBERTurk                | 60.11 ± 0.72   | 77.64 ± 0.59         |
| ConvBERTurk mC4 (cased)    | 60.65 ± 0.51   | 78.06 ± 0.34         |
| ConvBERTurk mC4 (uncased)  | 61.28 ± 1.27   | 78.63 ± 0.96         |
| DistilBERTurk              | 43.52 ± 1.63   | 62.56 ± 1.44         |
| ELECTRA Base               | 59.24 ± 0.70   | 77.70 ± 0.51         |
| ELECTRA Base mC4 (cased)   | 61.28 ± 0.94   | 78.17 ± 0.33         |
| ELECTRA Base mC4 (uncased) | 59.28 ± 0.87   | 76.88 ± 0.61         |
| ELECTRA Small              | 38.05 ± 1.83   | 57.79 ± 1.22         |
| XLM-R (base)               | 58.27 ± 0.53   | 76.80 ± 0.39         |
| mBERT (cased)              | 56.70 ± 0.43   | 75.20 ± 0.61         |

TSQuAD Development Results EM - Question Answering

TSQuAD Development Results F1 - Question Answering

Model usage

All trained models can be used from the DBMDZ Hugging Face model hub page using their model name. The following models are available:

  • BERTurk models with 32k vocabulary: dbmdz/bert-base-turkish-cased and dbmdz/bert-base-turkish-uncased
  • BERTurk models with 128k vocabulary: dbmdz/bert-base-turkish-128k-cased and dbmdz/bert-base-turkish-128k-uncased
  • ELECTRA small and base cased models (discriminator): dbmdz/electra-small-turkish-cased-discriminator and dbmdz/electra-base-turkish-cased-discriminator
  • ELECTRA base cased and uncased models, trained on the Turkish part of the mC4 corpus (discriminator): dbmdz/electra-base-turkish-mc4-cased-discriminator and dbmdz/electra-base-turkish-mc4-uncased-discriminator
  • ConvBERTurk model with 32k vocabulary: dbmdz/convbert-base-turkish-cased
  • ConvBERTurk base cased and uncased models, trained on Turkish part of mC4 corpus: dbmdz/convbert-base-turkish-mc4-cased and dbmdz/convbert-base-turkish-mc4-uncased

Example usage with 🤗/Transformers:

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")
model = AutoModel.from_pretrained("dbmdz/bert-base-turkish-cased")

This loads the BERTurk cased model. The recently introduced ELECTRA base model can be loaded with:

from transformers import AutoModelWithLMHead, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/electra-base-turkish-cased-discriminator")
model = AutoModelWithLMHead.from_pretrained("dbmdz/electra-base-turkish-cased-discriminator")
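
As a quick sanity check, the loaded BERTurk model can be run on a short Turkish sentence (a minimal sketch; the example sentence and variable names are ours, not part of the original README):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")
model = AutoModel.from_pretrained("dbmdz/bert-base-turkish-cased")

inputs = tokenizer("Merhaba dünya!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch_size, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)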

Citation

You can use the following BibTeX entry for citation:

@software{stefan_schweter_2020_3770924,
  author       = {Stefan Schweter},
  title        = {BERTurk - BERT models for Turkish},
  month        = apr,
  year         = 2020,
  publisher    = {Zenodo},
  version      = {1.0.0},
  doi          = {10.5281/zenodo.3770924},
  url          = {https://doi.org/10.5281/zenodo.3770924}
}

Acknowledgments

Thanks to Kemal Oflazer for providing us with additional large corpora for Turkish. Many thanks to Reyyan Yeniterzi for providing us with the Turkish NER dataset used for evaluation.

We would like to thank Merve Noyan for the awesome logo!

Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC). Thanks for providing access to the TFRC ❤️

turkish-bert's Issues

Cannot find the NER Model

It seems the Hugging Face repository contains only the base model; I couldn't find the model and tokenizer for named entity recognition. Where can I find the trained NER model, and, if it is not too much to ask, how can I load and use it easily?

Any plan for Turkish T5?

Hi @stefan-it, thanks a lot again for the great job you're doing here. Now that I'm planning to train a Turkish T5 from scratch, I just want to ask whether you have any such plans, so that we don't duplicate the resources to be spent.

retrieval langchain with turkish dataset

I have the following code:

# import
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.document_loaders import TextLoader
from transformers import AutoTokenizer, AutoModel

from silly import no_ssl_verification
from langchain.embeddings.huggingface import HuggingFaceEmbeddings


with no_ssl_verification():
    # load the document and split it into chunks
    loader = TextLoader("paul_graham/paul_graham_essay_tr.txt")
    documents = loader.load()

    # split it into chunks
    text_splitter = CharacterTextSplitter(chunk_size=2000, chunk_overlap=0)
    docs = text_splitter.split_documents(documents)

    # create the Turkish embedding function
    # tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")
    # model = AutoModel.from_pretrained("dbmdz/bert-base-turkish-cased")
    embedding_function = SentenceTransformerEmbeddings(model_name="dbmdz/bert-base-turkish-cased")

    # load it into Chroma
    db = Chroma.from_documents(docs, embedding_function)

    # query it
    query = "Yazarın üniversiteden önce üzerinde çalıştığı iki ana şey neydi?"
    docs = db.similarity_search(query)

    # print results
    print(docs[0].page_content)

How can I fix my code to do QA retrieval with LangChain using turkish-bert embeddings? Please help me.

Commands

Stefan,

Is it possible to share the commands that you've used for generating the model?
I'll add some domain-specific data to the model, and for retraining I need your commands.

Thanks a lot.

BERTurk Training Dataset Preparation

Hello Stefan,

I'm going to train another BERT model with different pre-training object from scratch. Then I will use it to compare with BERTurk and other Turkish pre-trained language models. In order to evaluate pre-training task impact properly, the model should be trained with similar data and parameters.

In the README file it was stated that:

The current version of the model is trained on a filtered and sentence segmented version of the Turkish OSCAR corpus, a recent Wikipedia dump, various OPUS corpora and a special corpus provided by Kemal Oflazer.

I've already collected Kemal Oflazer's and OSCAR's corpus. But there are things I'm curious about. If you can answer them, I will be happy 🙂

  1. Did you apply filtering and sentence segmentation only to the OSCAR corpus, or did you apply them to the others too?
  2. What kind of filtering did you apply? Was it something like removing sentences with fewer than 5 tokens from the corpus?
  3. Did you use only the full stop for sentence segmentation?
  4. Do you remember which Wikipedia dump was used?
  5. Which OPUS corpora did you use? There are plenty of datasets in OPUS, even datasets from Wikipedia such as WikiMatrix v1, Wikipedia and wikimedia v20210402. Did you use them too?
  6. Did you apply any extra pre-processing methods apart from BertTokenizer's?

Also, if you have the public datasets' corpora, do you mind sharing them? It would make things a lot easier for me and save me a lot of trouble 🙂

Thanks in advance 🙂

Question

Stefan, what were the MLM and NSP accuracies (for the cased 32k model)?
Could you share them?

BibTeX

Hello Stefan,

How can we cite turkish-bert in our academic papers?

Thanks,

How is this model bilingual?

Hi Stefan,

When I use the Turkish model on an English dataset for classification, it works surprisingly well. So, I have two questions:

  1. Does the training corpus contain English texts?
  2. Is it trained from scratch or on the English model's weights?

Thanks!

Example for inference

Hi, is there a guide to use the cased model for inference?

I'm doing:

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")
model = AutoModel.from_pretrained("dbmdz/bert-base-turkish-cased")

but I can't find how to use the model from here. The tutorials on HuggingFace do not work with this model.

Thanks

How to get all hidden layers' output of pre-trained BERTurk model in HuggingFace Transformers library?

Hi Stefan,
I have a problem getting all the hidden layers' outputs of BERTurk. I tried the following:

from transformers import AutoModel
import torch

model = AutoModel.from_pretrained("dbmdz/bert-base-turkish-uncased")

# Convert inputs (length 20) to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])

model.eval()

with torch.no_grad():
    outputs = model(tokens_tensor, segments_tensors)

The outputs contain two tensors.

print(outputs[0])
print(len(outputs[0][0]))     # 20 entries, one per token of the sentence
print(outputs[0][0][0])       # outputs[0][0][i] is the vector for token i; index 0 is [CLS]
print(len(outputs[0][0][0]))  # 768, the embedding size

I am not sure whether outputs[0] is the final hidden state or not.

And outputs[1] is as follows:

print(outputs[1][0])
print(len(outputs[1][0]))  # has 768 entries

Also, I tried what is described in https://huggingface.co/transformers/model_doc/bert.html#tfbertmodel, but I got an error when I set output_hidden_states = True.
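
For what it's worth, with recent Transformers versions the usual way to get all hidden layers is to request them explicitly when loading the model (a sketch, assuming a current transformers release; the example sentence is ours):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-uncased")
model = AutoModel.from_pretrained("dbmdz/bert-base-turkish-uncased", output_hidden_states=True)
model.eval()

inputs = tokenizer("Merhaba dünya!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.hidden_states is a tuple: the embedding output plus one tensor per layer,
# each of shape (batch_size, sequence_length, hidden_size)
print(len(outputs.hidden_states))       # 13 for a 12-layer base model
print(outputs.hidden_states[-1].shape)  # final hidden states (same as outputs.last_hidden_state)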

ValueError: Must specify max_steps > 0, given: 0

$python3 electra_small/run_finetuning.py \
--data-dir $DATA_DIR \
--model-name "ELECTRA-small" \
--hparams '{"model_size": "small", "task_names": ["<task_name>"], "num_trials": 5, "learning_rate": 3e-4, "train_batch_size": 16, "use_tpu": "True", "num_tpu_cores": 8, "tpu_name": "<tpu_name>", "tpu_zone": "europe-west4-a", "gcp_project": "<gcp_name>", "vocab_size": 50000, "num_train_epochs": 10}'

I am getting the following error. Is there something I am missing?

Training for 0 steps
ERROR:tensorflow:Error recorded from training_loop: Must specify max_steps > 0, given: 0
Traceback (most recent call last):
  File "electra_small/run_finetuning.py", line 323, in <module>
    main()
  File "electra_small/run_finetuning.py", line 319, in main
    args.model_name, args.data_dir, **hparams))
  File "electra_small/run_finetuning.py", line 270, in run_finetuning
    model_runner.train()
  File "electra_small/run_finetuning.py", line 183, in train
    input_fn=self._train_input_fn, max_steps=self.train_steps)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3035, in train
    rendezvous.raise_errors()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 136, in raise_errors
    six.reraise(typ, value, traceback)
  File "/usr/local/lib/python3.5/dist-packages/six.py", line 703, in reraise
    raise value
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3030, in train
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train
    'Must specify max_steps > 0, given: {}'.format(max_steps))
ValueError: Must specify max_steps > 0, given: 0

Electra Model

Stefan, where can I download the ELECTRA model?
Again, I need the TF checkpoints; I'll fine-tune for MRQA.
Thanks a lot!

Training Duration

Hi, how long does it take to train a BERT base model with a Cloud TPU v3-8, 4.4B words, a 32k vocabulary and a sequence length of 512?

Q: Using as a classifier

Hey there, first of all, I really appreciate your work. My question is whether the model can be used as a classifier, as in the BERT classifier, or is it only a tokenizer at the moment? Thanks in advance.

CHEATSHEET for ConvBERT?

Hi guys, impressive results with ConvBERT. Is there any cheatsheet on how to train it from scratch?

Your BERT and ELECTRA cheatsheets are very helpful.

fill-mask

When I apply fill-mask with bert-base-turkish as follows:

from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
model = AutoModelForSequenceClassification.from_pretrained("dbmdz/bert-base-turkish-cased")
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")
fm = pipeline("fill-mask", model=model, tokenizer=tokenizer)
fm("merhaba ben <mask> iyiyim")

I get following error

ValueError Traceback (most recent call last)
in ()
----> 1 fm("merhaba ben <mask> iyiyim")

/usr/local/lib/python3.6/dist-packages/transformers/pipelines.py in call(self, *args, **kwargs)
795 values, predictions = topk.values.numpy(), topk.indices.numpy()
796 else:
--> 797 masked_index = (input_ids == self.tokenizer.mask_token_id).nonzero().item()
798 logits = outputs[i, masked_index, :]
799 probs = logits.softmax(dim=0)

ValueError: only one element tensors can be converted to Python scalars
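
For reference, a working fill-mask call with a BERT checkpoint usually looks like the sketch below (note that BERT tokenizers use [MASK] rather than <mask>, and a masked-LM head class rather than a sequence-classification head; whether masked-LM weights ship with this checkpoint is an assumption):

from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")
model = AutoModelForMaskedLM.from_pretrained("dbmdz/bert-base-turkish-cased")

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask("merhaba ben [MASK] iyiyim"))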

Evaluation Methodology

F1 scores are reported for the evaluation; however, I'd like to know whether you used macro or weighted F1 scores for the downstream tasks (such as NER). Would it also be possible to learn the hyperparameters you set for fine-tuning, like the maximum sequence length?

Multiclass Classification

Hi,
I could not find any argument to pass the number of classes in the AutoModel classes to apply sentiment analysis with BERTurk. Am I missing something? Thanks in advance.

from transformers import AutoModelForSequenceClassification, AdamW, AutoConfig,AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "dbmdz/bert-base-turkish-cased")

PoS tagging

I managed to run the NER example with custom data using run_ner.py from transformers. The data looks like this after JSON formatting:

{"tokens":["Yıldız","Savaşları",":","Bölüm","II","-","Klonların","Saldırısı","''"], "ner_tags":["B-ORG","I-ORG","I-ORG","I-ORG","I-ORG","I-ORG","I-ORG","I-ORG","O"]}
{"tokens":["1998-2004",":","Kombassan","Holding"], "ner_tags":["O","O","B-ORG","I-ORG"]}
{"tokens":["Avustralya'da","25","numaraya","çıkmış",",","ayrıca","Yeni","Zelanda","listesine","32","numaradan","giriş","yapmış","ve","8","numaraya","çıkmıştır","."], "ner_tags":["B-LOC","O","O","O","O","O","B-LOC","I-LOC","O","O","O","O","O","O","O","O","O","O"]}
{"tokens":["Piet","Mondrian","(","1872-1944",")"], "ner_tags":["B-PER","I-PER","O","O","O"]}

However, for PoS tagging the IMST data looks like this:

16	,	,	PUNCT	Punc	_	24	punct	_	_
17	yetmiş	yetmiş	NUM	ANum	NumType=Card	18	nummod	_	_
18	yaşlarında	yaş	ADJ	NAdj	Case=Loc|Number=Plur|Number[psor]=Sing|Person=3|Person[psor]=3	23	amod	_	_
19	şık	şık	ADJ	Adj	_	20	amod	_	_
20-21	giyimli	_	_	_	_	_	_	_	_
20	giyim	giyim	NOUN	Noun	Case=Nom|Number=Sing|Person=3	23	obl	_	_
21	li	li	ADP	With	_	20	case	_	_

A word might have multiple PoS tags, such as giyimli. However, in the case of NER, one word matches only one NER tag. So, how can we create a JSON file that can be correctly parsed by run_ner.py for PoS tagging?

Thank you!
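
One common workaround (a sketch, not the authors' pipeline; the output key pos_tags and the file paths are placeholders) is to keep only the syntactic-word lines and skip the multiword-token ranges (IDs such as 20-21) and empty nodes when converting CoNLL-U to the JSON format above:

import json

def conllu_to_json(conllu_path, json_path):
    sentences, tokens, tags = [], [], []
    with open(conllu_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                if tokens:
                    sentences.append({"tokens": tokens, "pos_tags": tags})
                    tokens, tags = [], []
                continue
            if line.startswith("#"):
                continue
            cols = line.split("\t")
            # skip multiword token ranges ("20-21") and empty nodes ("8.1")
            if "-" in cols[0] or "." in cols[0]:
                continue
            tokens.append(cols[1])  # FORM column
            tags.append(cols[3])    # UPOS column
    if tokens:
        sentences.append({"tokens": tokens, "pos_tags": tags})
    with open(json_path, "w", encoding="utf-8") as out:
        for sentence in sentences:
            out.write(json.dumps(sentence, ensure_ascii=False) + "\n")

conllu_to_json("tr_imst-ud-train.conllu", "train.json")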

Vocab Generation

Hello, I am trying to generate a vocab to train an ELECTRA model.
I am using the following code:

from tokenizers import BertWordPieceTokenizer

# Initialize an empty BERT tokenizer
tokenizer = BertWordPieceTokenizer(
  clean_text=False,
  handle_chinese_chars=False,
  strip_accents=False,
  lowercase=True,
)
# prepare text files to train vocab on them
files = ['data.txt']

tokenizer.train(
  files,
  vocab_size=100000,
  min_frequency=2,
  show_progress=True,
  #special_tokens=['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]'],
  limit_alphabet=1000,
  wordpieces_prefix="##"
)
tokenizer.save('vocabs.txt')

When I use tokenizer.save('./') I get Exception: Is a directory (os error 21).
When I save it as in the code above and then run build_pretraining_dataset.py, I get the following error; I suspect there's something wrong with the vocab format:

output.append(vocab[item])
KeyError: '[UNK]'
What do you think is missing?
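
For reference, the KeyError: '[UNK]' usually comes from training the tokenizer without the BERT special tokens; a sketch that keeps them and writes a plain vocab.txt might look like this (the data file name and output directory are placeholders, and save_model is assumed to be available in your tokenizers version):

from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer(
    clean_text=False,
    handle_chinese_chars=False,
    strip_accents=False,
    lowercase=True,
)

tokenizer.train(
    ["data.txt"],
    vocab_size=100000,
    min_frequency=2,
    show_progress=True,
    # keep the special tokens so that [UNK], [CLS], [SEP], [MASK] end up in the vocab
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
    limit_alphabet=1000,
    wordpieces_prefix="##",
)

# save_model writes a plain vocab.txt into the given directory
tokenizer.save_model(".")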

Fine tune the turkish-bert

Hello,

How can we add a linear classifier at the end of the model and fine-tune it for the sentiment classification task? Is there any guide for it?

Thank you so much !
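
There is no BERTurk-specific guide in this repo, but a minimal fine-tuning sketch with the Transformers Trainer could look like this (the toy texts, labels and output directory are placeholders, not a recommended setup):

import torch
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

texts = ["harika bir film", "berbat bir deneyim"]  # toy examples
labels = [1, 0]                                    # 1 = positive, 0 = negative

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")
encodings = tokenizer(texts, truncation=True, padding=True)

class SentimentDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item
    def __len__(self):
        return len(self.labels)

# AutoModelForSequenceClassification adds a linear classifier on top of the pooled output
model = AutoModelForSequenceClassification.from_pretrained(
    "dbmdz/bert-base-turkish-cased", num_labels=2
)

training_args = TrainingArguments(
    output_dir="berturk-sentiment",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=5e-5,
)

trainer = Trainer(model=model, args=training_args, train_dataset=SentimentDataset(encodings, labels))
trainer.train()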

Training Data

Hello Stefan,

I would like to ask if the exact training data for training Turkish Bert can be released?

We would like to do some analysis of BERTs in different languages.

Thank you very much!

Using turkish bert with tensorflow or tf.keras

I want to combine this model with a CNN in TensorFlow or tensorflow.keras, but I couldn't figure out how to use these checkpoint files. Can someone help me with how to use this model in tensorflow.keras?

The file contains:
config.json
model.ckpt.data-00000-of-00001
model.ckpt.index
model.ckpt.meta
vocab.txt

@balki7 Sure, here are the original TF checkpoints (best checkpoint, incl. config and vocab):

wget https://schweter.eu/cloud/bert-base-turkish-cased/bert-base-turkish-cased-tf.tar.gz

# sha256 bert-base-turkish-cased-tf.tar.gz
# 8113d0aeb32a2e7bcd00027195a13622387ca9e2132d7f9a1b27389a0db26b96

Thank u so much Stefan.

Originally posted by @balki7 in #2 (comment)
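
An alternative to handling the raw checkpoints yourself (a sketch, assuming the Transformers TF classes are acceptable for your setup) is to load the model as a tf.keras model directly from the hub and feed its outputs into your CNN head; if only PyTorch weights exist for a checkpoint, from_pt=True converts them on the fly:

from transformers import AutoTokenizer, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")
model = TFAutoModel.from_pretrained("dbmdz/bert-base-turkish-cased")  # add from_pt=True if only PyTorch weights exist

inputs = tokenizer("Merhaba dünya!", return_tensors="tf")
outputs = model(inputs)

# outputs.last_hidden_state is a tf.Tensor of shape (batch, seq_len, hidden)
# and can be fed into a tf.keras CNN head
print(outputs.last_hidden_state.shape)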

DistilBERTurk training for question answering failed

Hey, I tried to train the DistilBERTurk model for question answering using the run_squad.py script. After training, I got the following error during the evaluation stage:

Traceback (most recent call last):
  File "run_squad.py", line 838, in <module>
    main()
  File "run_squad.py", line 827, in main
    result = evaluate(args, model, tokenizer, prefix=global_step)
  File "run_squad.py", line 344, in evaluate
    start_logits, end_logits = output
ValueError: too many values to unpack (expected 2)

When I tried to discard the last value as "start_logits, end_logits, _ = output" the error became

Traceback (most recent call last):
  File "run_squad.py", line 839, in <module>
    main()
  File "run_squad.py", line 828, in main
    result = evaluate(args, model, tokenizer, prefix=global_step)
  File "run_squad.py", line 323, in evaluate
    output = [to_list(output[i]) for output in outputs.to_tuple()]
  File "run_squad.py", line 323, in <listcomp>
    output = [to_list(output[i]) for output in outputs.to_tuple()]
IndexError: tuple index out of range

I checked the model with samples from the dataset and the confidence levels were really low, mostly below 0.001. I assume the training wasn't done right either.

I tried to train the original DistilBERT with the same script and the same dataset; it trained without error and the confidence levels were high.
I compared the layers, but both models looked the same. I also tried to load the model as a QA model and save it, but the error occurred again.

Thank you so much.

ValueError: Tensor conversion requested dtype string for Tensor with dtype float32: <tf.Tensor 'args_0:0' shape=() dtype=float32>

I have built my pretraining data and stored it, along with the vocab file and config file, in my GCS bucket.
But when I run the pretraining step:

  --data-dir $DATA_DIR \
  --model-name $MODEL_NAME \
  --hparams $HPARAMS_FILE

I keep getting the following

Running training
================================================================================
2020-11-17 08:09:31.456003: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:370] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.
ERROR:tensorflow:Error recorded from training_loop: in converted code:
    relative to /home/etetteh/anaconda3/lib/python3.6/site-packages/tensorflow_core/python:

    data/ops/readers.py:336 __init__
        filenames, compression_type, buffer_size, num_parallel_reads)
    data/ops/readers.py:296 __init__
        filenames = _create_or_validate_filenames_dataset(filenames)
    data/ops/readers.py:56 _create_or_validate_filenames_dataset
        filenames = ops.convert_to_tensor(filenames, dtype=dtypes.string)
    framework/ops.py:1184 convert_to_tensor
        return convert_to_tensor_v2(value, dtype, preferred_dtype, name)
    framework/ops.py:1242 convert_to_tensor_v2
        as_ref=False)
    framework/ops.py:1273 internal_convert_to_tensor
        (dtype.name, value.dtype.name, value))

    ValueError: Tensor conversion requested dtype string for Tensor with dtype float32: <tf.Tensor 'args_0:0' shape=() dtype=float32>

Traceback (most recent call last):
  File "electra/run_pretraining.py", line 385, in <module>
    main()
  File "electra/run_pretraining.py", line 381, in main
    args.model_name, args.data_dir, **hparams))
  File "electra/run_pretraining.py", line 344, in train_or_eval
    max_steps=config.num_train_steps)
  File "/home/etetteh/anaconda3/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3035, in train
    rendezvous.raise_errors()
  File "/home/etetteh/anaconda3/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 136, in raise_errors
    six.reraise(typ, value, traceback)
  File "/home/etetteh/anaconda3/lib/python3.6/site-packages/six.py", line 703, in reraise
    raise value
  File "/home/etetteh/anaconda3/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3030, in train
    saving_listeners=saving_listeners)
  File "/home/etetteh/anaconda3/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/home/etetteh/anaconda3/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/home/etetteh/anaconda3/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1191, in _train_model_default
    features, labels, ModeKeys.TRAIN, self.config)
  File "/home/etetteh/anaconda3/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2857, in _call_model_fn
    config)
  File "/home/etetteh/anaconda3/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1149, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/home/etetteh/anaconda3/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3148, in _model_fn
    input_holders.generate_infeed_enqueue_ops_and_dequeue_fn())
  File "/home/etetteh/anaconda3/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1428, in generate_infeed_enqueue_ops_and_dequeue_fn
    self._invoke_input_fn_and_record_structure())
  File "/home/etetteh/anaconda3/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1525, in _invoke_input_fn_and_record_structure
    host_device, host_id))
  File "/home/etetteh/anaconda3/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 899, in generate_per_host_enqueue_ops_fn_for_host
    inputs = _Inputs.from_input_fn(input_fn(user_context))
  File "/home/etetteh/anaconda3/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3001, in _input_fn
    return input_fn(**kwargs)
  File "/home/etetteh/electra/pretrain/pretrain_data.py", line 63, in input_fn
    cycle_length=cycle_length))
  File "/home/etetteh/anaconda3/lib/python3.6/site-packages/tensorflow_core/python/data/ops/dataset_ops.py", line 1990, in apply
    return DatasetV1Adapter(super(DatasetV1, self).apply(transformation_func))
  File "/home/etetteh/anaconda3/lib/python3.6/site-packages/tensorflow_core/python/data/ops/dataset_ops.py", line 1378, in apply
    dataset = transformation_func(self)
  File "/home/etetteh/anaconda3/lib/python3.6/site-packages/tensorflow_core/python/data/experimental/ops/interleave_ops.py", line 94, in _apply_fn
    buffer_output_elements, prefetch_input_elements)
  File "/home/etetteh/anaconda3/lib/python3.6/site-packages/tensorflow_core/python/data/ops/readers.py", line 226, in __init__
    map_func, self._transformation_name(), dataset=input_dataset)
  File "/home/etetteh/anaconda3/lib/python3.6/site-packages/tensorflow_core/python/data/ops/dataset_ops.py", line 2713, in __init__
    self._function = wrapper_fn._get_concrete_function_internal()
  File "/home/etetteh/anaconda3/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1853, in _get_concrete_function_internal
    *args, **kwargs)
  File "/home/etetteh/anaconda3/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1847, in _get_concrete_function_internal_garbage_collected
    graph_function, _, _ = self._maybe_define_function(args, kwargs)
  File "/home/etetteh/anaconda3/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 2147, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "/home/etetteh/anaconda3/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 2038, in _create_graph_function
    capture_by_value=self._capture_by_value),
  File "/home/etetteh/anaconda3/lib/python3.6/site-packages/tensorflow_core/python/framework/func_graph.py", line 915, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/home/etetteh/anaconda3/lib/python3.6/site-packages/tensorflow_core/python/data/ops/dataset_ops.py", line 2707, in wrapper_fn
    ret = _wrapper_helper(*args)
  File "/home/etetteh/anaconda3/lib/python3.6/site-packages/tensorflow_core/python/data/ops/dataset_ops.py", line 2652, in _wrapper_helper
    ret = autograph.tf_convert(func, ag_ctx)(*nested_args)
  File "/home/etetteh/anaconda3/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py", line 237, in wrapper
    raise e.ag_error_metadata.to_exception(e)
ValueError: in converted code:
    relative to /home/etetteh/anaconda3/lib/python3.6/site-packages/tensorflow_core/python:

    data/ops/readers.py:336 __init__
        filenames, compression_type, buffer_size, num_parallel_reads)
    data/ops/readers.py:296 __init__
        filenames = _create_or_validate_filenames_dataset(filenames)
    data/ops/readers.py:56 _create_or_validate_filenames_dataset
        filenames = ops.convert_to_tensor(filenames, dtype=dtypes.string)
    framework/ops.py:1184 convert_to_tensor
        return convert_to_tensor_v2(value, dtype, preferred_dtype, name)
    framework/ops.py:1242 convert_to_tensor_v2
        as_ref=False)
    framework/ops.py:1273 internal_convert_to_tensor
        (dtype.name, value.dtype.name, value))

    ValueError: Tensor conversion requested dtype string for Tensor with dtype float32: <tf.Tensor 'args_0:0' shape=() dtype=float32>

AssertionError: ('Pointer shape torch.Size([256]) and array shape (64,) mismatched', torch.Size([256]), (64,))

I am getting the following error when converting my checkpoint to Hugging Face's PyTorch format. I am using the same config file I used for pretraining.

Traceback (most recent call last):
  File "/home/enoch/dl_repos/transformers/src/transformers/models/electra/convert_electra_original_tf_checkpoint_to_pytorch.py", line 78, in <module>
    args.tf_checkpoint_path, args.config_file, args.pytorch_dump_path, args.discriminator_or_generator
  File "/home/enoch/dl_repos/transformers/src/transformers/models/electra/convert_electra_original_tf_checkpoint_to_pytorch.py", line 43, in convert_tf_checkpoint_to_pytorch
    model, config, tf_checkpoint_path, discriminator_or_generator=discriminator_or_generator
  File "/home/enoch/dl_repos/transformers/src/transformers/models/electra/modeling_electra.py", line 140, in load_tf_weights_in_electra
    ), f"Pointer shape {pointer.shape} and array shape {array.shape} mismatched"
AssertionError: ('Pointer shape torch.Size([256]) and array shape (64,) mismatched', torch.Size([256]), (64,))

Also, converting the other checkpoints does not start at all, except for the 1M-step checkpoint, which also fails as shown here.
