gado2's Introduction

What is this?

This is a fork of: https://github.com/kamalkraj/BERT-NER

It contains some local modifications to run it on a different set of NER classes:

https://github.com/KBNLresearch/gado2/blob/main/run_ner.py#L157

https://github.com/KBNLresearch/gado2/blob/main/ner_paper.py#L21

The data/train.txt file is a combination of CoNLL-2002 and input from various Indonesian / Dutch newspapers. This is not the final set used for evaluation; more info soon.

The transformer used to make training files from (Prima) page-xml is also included (pagexml_to_bio.py). In this project we used page-xml files exported from: https://transkribus.eu/.

Trained models are available here: https://huggingface.co/willemjan/

Training using page files.

Remove old training data and model:

$ rm data/train.txt # Remove old training data.

$ rm -rf out_base # Remove old trained model.

Export the page-xml from Transkribus as described here: https://readcoop.eu/transkribus/howto/how-to-export-documents-from-transkribus/

Download and unpack the export zip, and look for the 'page' directory.

Convert the page-xml to the bio format using the following command:

$ ./pagexml_to_bio.py --page_dir <<Path to Pagefiles dir>> --output_filename data/train.txt --debug 1

More info on .bio format can be found here: https://natural-language-understanding.fandom.com/wiki/Named_entity_recognition#BIO
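
To check a generated data/train.txt before training, the file can be read back into sentences. The sketch below is a minimal illustration and not part of this repository. It assumes a CoNLL-style layout: one token per line, whitespace-separated columns with the BIO tag in the last column, and a blank line (or a -DOCSTART- line) between sentences. The function name read_bio is hypothetical.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]


def read_bio(lines: Iterable[str]) -> List[Sentence]:
    """Group (token, tag) pairs into sentences.

    Assumption: CoNLL-style input, token in the first column and
    BIO tag in the last column, blank line between sentences.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        # A blank line or a -DOCSTART- marker closes the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        if len(columns) < 2:
            continue  # skip lines that carry no tag column
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


# Small in-memory sample; with a real file, pass open("data/train.txt").
sample = [
    "Willem B-per",
    "jan I-per",
    "is O",
    "",
    "Gado-gado B-misc",
]
sentences = read_bio(sample)
```

Counting the tags in the returned sentences is a quick way to see whether the expected NER classes are present in the training data.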

Make sure a GPU is available, then run:

$ ./train.sh

The model will be written to the directory 'out_base'.

Running HTTP API.

The trained model can be used by running: ./api.py

This starts an HTTP server that can be queried like so:

$ curl -s -G 'http://localhost:8000/predict/' --data-urlencode 'text=Willem jan is een liefhebber van Gado-gado.' --data-urlencode 'model=nl2'

{
  "result": [
    {
      "confidence": 0.9999511241912842,
      "tag": "B-per",
      "word": "Willem"
    },
    {
      "confidence": 0.9999241828918457,
      "tag": "I-per",
      "word": "jan"
    },
    {
      "confidence": 0.9999983310699463,
      "tag": "O",
      "word": "is"
    },
    {
      "confidence": 0.9999984502792358,
      "tag": "O",
      "word": "een"
    },
    {
      "confidence": 0.9999983310699463,
      "tag": "O",
      "word": "liefhebber"
    },
    {
      "confidence": 0.9999983310699463,
      "tag": "O",
      "word": "van"
    },
    {
      "confidence": 0.9998682737350464,
      "tag": "B-misc",
      "word": "Gado-gado"
    },
    {
      "confidence": 0.9999983310699463,
      "tag": "O",
      "word": "."
    }
  ]
}
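
The per-token response above can be turned into whole entities by merging each B- tag with the I- tags that follow it. The sketch below is a minimal client-side illustration using only the Python standard library. It assumes the endpoint, parameter names and response shape shown in the example above, and that the server decodes standard URL-encoded query strings. The function names build_predict_url, group_entities and fetch_entities are hypothetical and not part of api.py.

```python
import json
from typing import Dict, List, Tuple
from urllib.parse import urlencode
from urllib.request import urlopen


def build_predict_url(text: str, model: str,
                      base: str = "http://localhost:8000/predict/") -> str:
    """URL-encode the query so spaces and punctuation are sent safely."""
    return base + "?" + urlencode({"text": text, "model": model})


def group_entities(result: List[Dict]) -> List[Tuple[str, str]]:
    """Merge B-/I- tagged tokens into (entity text, class) pairs."""
    entities: List[Tuple[str, str]] = []
    words: List[str] = []
    label = ""
    for item in result:
        prefix, _, kind = item["tag"].partition("-")
        if prefix == "I" and words and kind == label:
            words.append(item["word"])  # continue the open entity
            continue
        if words:  # any other tag closes the open entity
            entities.append((" ".join(words), label))
            words, label = [], ""
        if prefix in ("B", "I"):  # start a new entity
            words, label = [item["word"]], kind
    if words:
        entities.append((" ".join(words), label))
    return entities


def fetch_entities(text: str, model: str) -> List[Tuple[str, str]]:
    """Query a running api.py instance (assumed to be on localhost:8000)."""
    with urlopen(build_predict_url(text, model)) as response:
        return group_entities(json.load(response)["result"])


# Offline check against a shortened version of the response shown above.
example = [
    {"tag": "B-per", "word": "Willem"},
    {"tag": "I-per", "word": "jan"},
    {"tag": "O", "word": "is"},
    {"tag": "B-misc", "word": "Gado-gado"},
    {"tag": "O", "word": "."},
]
entities = group_entities(example)
```

With the server running, fetch_entities("Willem jan is een liefhebber van Gado-gado.", "nl2") would perform the same request as the curl command and return the grouped entities.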

Contributors: stitchplus, willemjan
