Portuguese Named Entity Extractor

This repository contains a Named Entity Extractor: an implementation of the Named Entity Recognition (NER) task for one or more Portuguese-language documents in PDF format. The model is based on the BERTimbau project [1], in which BERT word embeddings are combined with a CRF layer and trained on a Portuguese corpus. The base code is taken from the original project.
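
To illustrate what a CRF layer adds on top of per-token scores, here is a minimal sketch of Viterbi decoding in plain Python. It is not the code used by this repository or by BERTimbau; the function name, the score values, and the "B-Nome"/"I-Nome" tag strings are all hypothetical. The point is that transition scores can rule out invalid tag sequences that per-token scoring alone would pick.

```python
from typing import Dict, List, Tuple


def viterbi_decode(
    emissions: List[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
    tags: List[str],
) -> List[str]:
    """Return the highest-scoring tag sequence.

    emissions[t][tag] is the per-token score (e.g. from BERT);
    transitions[(prev, cur)] is the score of moving from prev to cur
    (missing pairs count as 0.0).
    """
    if not emissions:
        return []
    # best[t][tag]: best score of any path that ends in `tag` at position t
    best = [{tag: emissions[0][tag] for tag in tags}]
    back = []  # back[t - 1][tag]: previous tag on that best path
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in tags:
            prev = max(
                tags,
                key=lambda p: best[-1][p] + transitions.get((p, cur), 0.0),
            )
            scores[cur] = (
                best[-1][prev]
                + transitions.get((prev, cur), 0.0)
                + emissions[t][cur]
            )
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Walk the back-pointers from the best final tag
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical scores for a two-token name
tags = ["O", "B-Nome", "I-Nome"]
emissions = [
    {"O": 1.0, "B-Nome": 0.9, "I-Nome": 0.0},
    {"O": 0.0, "B-Nome": 0.0, "I-Nome": 2.0},
]
# Heavily penalize an "inside" tag that directly follows "outside"
transitions = {("O", "I-Nome"): -100.0}
print(viterbi_decode(emissions, transitions, tags))
```

With an empty transition table the same emissions would decode to "O" followed by "I-Nome", which is not a well-formed entity; the penalty makes the decoder prefer "B-Nome" followed by "I-Nome".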

The entity classes are:

  • Place ("Local")
  • Name ("Nome")
  • Organization ("Organização")
  • Time ("Tempo")
  • Value ("Valor")

The final output is composed of:

  • A list of all extracted named entities, with their counts, for each class
  • A corresponding word cloud for each named entity class
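
The per-class entity list can be pictured with the sketch below. It assumes that predictions arrive as (token, tag) pairs in a BIO scheme and that the tag suffixes are the Portuguese class names listed above; neither the tag format nor the function names are confirmed by this repository.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def group_entities(tagged_tokens: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Merge BIO-tagged tokens into (class, text) pairs (assumed tag format)."""
    entities = []
    current_class, current_words = None, []
    for token, tag in tagged_tokens:
        # An I- tag of the same class extends the entity in progress
        if tag.startswith("I-") and current_class == tag[2:]:
            current_words.append(token)
            continue
        # Any other tag closes the entity in progress, if there is one
        if current_class is not None:
            entities.append((current_class, " ".join(current_words)))
            current_class, current_words = None, []
        # Only a B- tag opens a new entity; a stray I- tag is ignored
        if tag.startswith("B-"):
            current_class, current_words = tag[2:], [token]
    if current_class is not None:
        entities.append((current_class, " ".join(current_words)))
    return entities


def count_by_class(entities: Iterable[Tuple[str, str]]) -> Dict[str, Counter]:
    """Count how often each entity text occurs within each class."""
    counts = defaultdict(Counter)
    for entity_class, text in entities:
        counts[entity_class][text] += 1
    return dict(counts)


# Hypothetical predictions for a short sentence
sample = [
    ("Maria", "B-Nome"), ("Silva", "I-Nome"), ("visitou", "O"),
    ("Lisboa", "B-Local"), ("em", "O"), ("2019", "B-Tempo"),
    ("com", "O"), ("Maria", "B-Nome"), ("Silva", "I-Nome"),
]
for entity_class, counter in count_by_class(group_entities(sample)).items():
    print(entity_class, dict(counter))
```

A per-class mapping from entity text to count is also a convenient input for drawing a word cloud, although how this repository builds its word clouds is not shown here.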

Requirements

The required Python packages are listed in utils/requirements.txt. To install them in the active environment, run the command below (adjust the path to the file if you run it from a different directory):

pip install -r requirements.txt

The pre-trained models can be downloaded from this link. The default path for storing the models is: data/input/model_checkpoint/.

The test environment was configured as follows:

  • Operating system: Ubuntu 18.04
  • Python version: 3.6
  • Java version: 8

A GPU is recommended for faster execution.

Note: Java is required because the tika package relies on it to handle PDF files.
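
Before running the program, a quick check along the following lines may help catch a missing prerequisite. It is only a sketch: the function name is hypothetical, the minimum Python version is taken from the test environment above rather than from a documented requirement, and it only checks that a java executable is on the PATH, not its version.

```python
import shutil
import sys
from typing import List, Tuple


def check_environment(min_python: Tuple[int, int] = (3, 6)) -> List[str]:
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    if sys.version_info < min_python:
        found = "{}.{}".format(sys.version_info[0], sys.version_info[1])
        wanted = "{}.{}".format(min_python[0], min_python[1])
        problems.append("Python {} or newer expected, found {}".format(wanted, found))
    # tika needs a Java runtime to parse PDF files
    if shutil.which("java") is None:
        problems.append("java executable not found on PATH")
    return problems


for problem in check_environment():
    print("WARNING:", problem)
```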

Usage

To run the program with the default arguments, place the PDF document(s) in the input folder (default: data/input/raw/) and run:

python main.py

The default output folders are:

  • data/output/wordcloud/ (for wordclouds)
  • data/output/entities/ (for list of entities)
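
The folder layout above can be checked before a run with a small helper such as the one below. This is a sketch under assumptions: the helper names are hypothetical, matching the .pdf extension case-insensitively is a choice made here, and creating the output folders by hand may be unnecessary if main.py already does so.

```python
from pathlib import Path
from typing import Iterable, List

DEFAULT_INPUT = "data/input/raw/"
DEFAULT_OUTPUTS = ("data/output/wordcloud/", "data/output/entities/")


def find_pdfs(input_path: str = DEFAULT_INPUT) -> List[Path]:
    """List the PDF files directly inside the input folder, sorted by path."""
    folder = Path(input_path)
    if not folder.is_dir():
        return []
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def ensure_dirs(paths: Iterable[str] = DEFAULT_OUTPUTS) -> None:
    """Create the given folders if they do not exist yet."""
    for path in paths:
        Path(path).mkdir(parents=True, exist_ok=True)


pdfs = find_pdfs()
print("Found {} PDF file(s) in {}".format(len(pdfs), DEFAULT_INPUT))
```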

List of command-line parameters

# Set --verbose to 0 for a silent run
python main.py \
    --verbose 1 \
    --input_path data/input/raw/ \
    --predictions_file data/input/temp/predictions.txt \
    --entities_path data/output/entities/ \
    --wordcloud_path data/output/wordcloud/ \
    --labels_file data/input/labels/classes.txt \
    --bert_model data/input/model_checkpoint/
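
For readers who want to see how such a parameter list is commonly wired up, the sketch below mirrors the flags and values above with argparse. It is an assumption about the structure, not a copy of main.py: the help texts, the integer type for --verbose, and the build_parser name are all illustrative.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the flags listed above (help texts are illustrative)."""
    parser = argparse.ArgumentParser(
        description="Extract named entities from Portuguese PDF files."
    )
    parser.add_argument("--verbose", type=int, choices=[0, 1], default=1,
                        help="1 for progress messages, 0 for a silent run")
    parser.add_argument("--input_path", default="data/input/raw/",
                        help="folder containing the PDF document(s)")
    parser.add_argument("--predictions_file",
                        default="data/input/temp/predictions.txt",
                        help="file holding the model predictions")
    parser.add_argument("--entities_path", default="data/output/entities/",
                        help="output folder for the lists of entities")
    parser.add_argument("--wordcloud_path", default="data/output/wordcloud/",
                        help="output folder for the word clouds")
    parser.add_argument("--labels_file",
                        default="data/input/labels/classes.txt",
                        help="file listing the entity classes")
    parser.add_argument("--bert_model", default="data/input/model_checkpoint/",
                        help="folder containing the pre-trained model")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration
args = build_parser().parse_args(["--verbose", "0"])
print(args.verbose, args.input_path)
```

Passing an explicit list to parse_args, as in the last lines, makes the parser easy to exercise without touching the real command line.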
    

References

[1] SOUZA, Fábio; NOGUEIRA, Rodrigo; LOTUFO, Roberto. Portuguese named entity recognition using BERT-CRF. arXiv preprint arXiv:1909.10649, 2019.

Contributors

  • gustavomccoelho
