conll03_spacy_v3's Introduction

Train spaCy v3.0 models with CoNLL-2003 data

I trained a series of language-specific spaCy v3.0 NER models (English and German) with different configurations to achieve the best possible F-score. The best English NER model (the benchmark model) reached an F-score of 89.22 and the best German NER model reached 83.29, each evaluated on the respective testb data.

paper

With no access to a GPU, all models, including the transformer-based model, were trained on CPU. Training a transformer-based model on CPU is generally discouraged, however, since training on GPU is 3-4X faster.

Software

pip install -U pip setuptools wheel
pip install -U spacy
python -m spacy download en_core_web_sm
python -m spacy download de_core_news_sm

Data

The CoNLL-2003 datasets include corpora in two languages: English and German.

  • The English data was obtained through the official channel (requested from NIST and labeled with the bin file provided by the shared-task organizers). It will not be uploaded since the data is not publicly available, though there are plenty of other sources/versions available on GitHub.
  • The original German corpus (the Frankfurter Rundschau) is no longer available on LDC, so the project uses a publicly available dataset from here.
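
For readers unfamiliar with the data layout, here is a minimal sketch of reading CoNLL-style files in plain Python. It assumes the usual shared-task layout: one token per line, whitespace-separated columns with the token first and the NER tag last, blank lines between sentences, and `-DOCSTART-` lines between documents. The `read_conll` helper and the sample lines are made up for illustration and are not part of this project or of the real corpus.

```python
from typing import Iterable, List, Tuple

# One sentence is a list of (token, NER tag) pairs.
Sentence = List[Tuple[str, str]]


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Group CoNLL-style lines into sentences of (token, ner_tag) pairs."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        # A blank line or a -DOCSTART- marker closes the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        # Assumption: first column is the token, last column is the NER tag.
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample in the four-column English layout (token, POS, chunk, NER).
sample = """-DOCSTART- -X- -X- O

Alice NNP B-NP B-PER
visited VBD B-VP O
Berlin NNP B-NP B-LOC

She PRP B-NP O
left VBD B-VP O
""".splitlines()

sentences = read_conll(sample)
print(len(sentences))
print(sentences[0])
```

Because only the first and last columns are read, the same sketch should also cope with the German files, which carry an extra column, as long as the NER tag stays in the last position.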

Models

  • The English models were trained with the CoNLL-2003 English data on a local machine on CPU. Part of the experiment was also performed on Google Colab (the benchmark model cnn_glove_small was trained both on Colab and on my computer; training on Colab was slower, so it is not recommended).

The benchmark model showed the highest F-score during training. (The best model I configured in the experiments also reached a relatively high F-score during training, around 0.1 lower than that of the benchmark model.) Evaluation results of the benchmark model on testb:

TOK     100.00
NER P   89.20 
NER R   89.24 
NER F   89.22 
SPEED   13745 

Go to eng

  • The German models were trained with the CoNLL-2003 German data, locally on CPU, although training transformer models on CPU is discouraged: training them on GPU can be 3-4X faster.

The model with the best performance during training was the transformer-based model transformer, which achieved an F-score as high as 0.8721599832 during training.

Evaluation on testb:

TOK     100.00
NER P   84.10 
NER R   82.49 
NER F   83.29 
SPEED   117   

Go to deu
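
As a sanity check on the two result tables, the NER F values can be recomputed from the reported precision and recall, assuming NER F is the harmonic mean of NER P and NER R (the standard F1 definition). The `f_score` helper below is written for illustration and is not taken from this project.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the evaluation tables.
english_f = f_score(89.20, 89.24)
german_f = f_score(84.10, 82.49)

print(f"English NER F: {english_f:.2f}")  # 89.22
print(f"German NER F:  {german_f:.2f}")   # 83.29
```

Both values round to the NER F figures reported in the tables.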

The fe folder includes the report of a first experiment in training an English model with data from an unofficial source, in which the annotations differ from the official data.
