ner-dataset-modified-dee's Introduction

Dataset for Building Indonesian NER

(Dataset for Building a Named Entity Recognizer (NER) for the Indonesian Language)

This repository contains resources from a project named Modified DBpedia Entities Expansion (MDEE) (Alfina et al., 2017).
We share:

  • Three NER datasets used in the experiments explained in the paper (in the main folder), each consisting of 20,000 sentences, along with the gold standard.
  • Revised versions of the three NER datasets in the main folder (in the "revised-20k" folder).
  • The original names in Indonesian DBpedia (in the "original-dbpedia" folder).
  • Two versions of DBpedia explained in the paper (in the "expanded-dbpedia" folder): MDEE and MDEE_Gazetteer.
  • A dataset of 48,957 sentences named SINGGALANG (in the "singgalang" folder). We used the MDEE_Gazetteer expanded DBpedia to label this dataset.

The NER Dataset

The datasets conform to the Stanford NER dataset format.

Four named entity classes are used:

  • "Person" for person names
  • "Place" for place names
  • "Organisation" for organization names
  • "O" for others
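As a quick sanity check, the labels in a dataset file can be counted with a few lines of Python. This is only a minimal sketch: it assumes one token per line with the label in the last whitespace-separated column and a blank line between sentences, which is the usual Stanford NER training layout but is not spelled out in this README. The function names and the sample lines are made up for illustration.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple

Sentence = List[Tuple[str, str]]


def read_sentences(lines: Iterable[str]) -> Iterator[Sentence]:
    """Yield sentences as lists of (token, label) pairs.

    Assumption: one token per line, label in the last column,
    blank line between sentences.
    """
    sentence: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence, if any.
            if sentence:
                yield sentence
                sentence = []
            continue
        parts = line.rsplit(maxsplit=1)
        if len(parts) != 2:
            raise ValueError(f"expected 'token label', got: {line!r}")
        sentence.append((parts[0], parts[1]))
    if sentence:
        yield sentence


def label_counts(sentences: Iterable[Sentence]) -> Counter:
    """Count how often each label occurs over all tokens."""
    return Counter(label for sent in sentences for _, label in sent)


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from the dataset.
    sample = [
        "Joko\tPerson", "Widodo\tPerson", "lahir\tO", "di\tO",
        "Surakarta\tPlace", "",
        "Universitas\tOrganisation", "Indonesia\tOrganisation",
    ]
    sentences = list(read_sentences(sample))
    print(len(sentences), "sentences")
    print(label_counts(sentences))
```

To run it on a real file, pass an open file handle instead of the sample list, for example `read_sentences(open("20k-mdee.txt", encoding="utf-8"))`.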


Datasets in the main folder:

  1. Dataset created using the original DEE (Alfina et al., 2016), file name: 20k-dee.txt, with properties file: 20k-dee.prop
  2. Dataset created using Modified DEE (Alfina et al., 2017), file name: 20k-mdee.txt, with properties file: 20k-mdee.prop
  3. Dataset created using Modified DEE plus gazetteer (Alfina et al., 2017), file name: 20k-mdee-gazz.txt, with properties file: 20k-mdee-gazz.prop
  4. A gold standard created by Luthfi et al. (2014)

Each version of the NER dataset consists of 20,000 sentences from Indonesian-language Wikipedia articles that were labeled automatically.
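Each dataset comes with a properties file that tells Stanford NER how to train on it. Before training, it can help to check which training file and output model a .prop file points at. The sketch below is a simplified reader that only handles plain `key = value` lines (no line continuations or `:` separators). The sample contents are an assumption about what such a file typically contains (`trainFile`, `serializeTo`, and `map` are standard Stanford NER properties), not a copy of the files in this repository.

```python
from typing import Dict, Iterable


def read_props(lines: Iterable[str]) -> Dict[str, str]:
    """Parse simple 'key = value' lines from a Java-style properties file.

    Simplification: comments (# or !) and blank lines are skipped;
    line continuations and ':' separators are not supported.
    """
    props: Dict[str, str] = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        # Split on the first '=' only, so values like 'word=0,answer=1' survive.
        key, sep, value = line.partition("=")
        if not sep:
            continue
        props[key.strip()] = value.strip()
    return props


if __name__ == "__main__":
    # Hypothetical contents; the real 20k-mdee.prop may differ.
    sample = [
        "# training data and output model",
        "trainFile = 20k-mdee.txt",
        "serializeTo = idner-model-20k-mdee.ser.gz",
        "map = word=0,answer=1",
    ]
    props = read_props(sample)
    print("train file:", props.get("trainFile"))
    print("model file:", props.get("serializeTo"))
```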

The SINGGALANG dataset

We provide a new NER dataset in this repository, named SINGGALANG. The specifications of this dataset are:

  • Number of sentences: 48,957
  • Generated using the MDEE_Gazetteer expanded DBpedia (the best of the three expanded DBpedia versions)

References

The dataset is free to use, but if you publish a paper or other publication that uses it, please cite these publications:

How to create an NER model using this dataset

We suggest using the Stanford NER library.
The steps to create an NER model with the Stanford NER library are as follows:

  1. Download Stanford-NER.

  2. Download the dataset and its properties file (the file with the .prop extension).

  3. Use the Stanford NER classifier to train the model.
    For example:
    java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop 20k-mdee.prop

    I recommend increasing the Java heap size so that training does not run out of memory. Add an option such as "-Xmx1024m" to the command, for example:

    java -Xmx1024m -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop 20k-mdee.prop

    If this still doesn't work, increase the value, for example "-Xmx8000m". This works for me :)

    Let's say this step creates an NER model file named "idner-model-20k-mdee.ser.gz".

  4. Create or use a testing dataset. Let's say the file name is "testing.txt".

  5. Evaluate the NER model using the Stanford NER library.
    For example:
    java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier idner-model-20k-mdee.ser.gz -testFile testing.txt
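When run with -testFile, Stanford NER prints one line per token with the word, the gold label, and the predicted label, plus its own summary scores. If you save the token lines to a file, you can also compute scores yourself. The sketch below is a rough token-level calculation, assuming three whitespace-separated columns per line (word, gold, guess); it is not the library's own scoring, which may be computed differently, so the numbers need not match. The function name and the sample lines are made up for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def token_scores(
    lines: Iterable[str], ignore: str = "O"
) -> Dict[str, Tuple[float, float, float]]:
    """Return {label: (precision, recall, f1)} computed per token.

    Assumption: each non-blank line ends with 'gold guess' columns.
    Tokens labeled `ignore` are not scored as a class of their own.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for raw in lines:
        parts = raw.split()
        if len(parts) < 3:
            continue  # skip blank lines and anything that is not a token row
        gold, guess = parts[-2], parts[-1]
        if gold == guess:
            if gold != ignore:
                tp[gold] += 1
        else:
            if guess != ignore:
                fp[guess] += 1
            if gold != ignore:
                fn[gold] += 1

    scores: Dict[str, Tuple[float, float, float]] = {}
    for label in sorted(set(tp) | set(fp) | set(fn)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        p = tp[label] / predicted if predicted else 0.0
        r = tp[label] / actual if actual else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = (p, r, f1)
    return scores


if __name__ == "__main__":
    # Hypothetical output lines, not real results.
    sample = [
        "Joko\tPerson\tPerson",
        "Widodo\tPerson\tO",
        "di\tO\tO",
        "Depok\tPlace\tPlace",
        "Bogor\tO\tPlace",
    ]
    for label, (p, r, f1) in token_scores(sample).items():
        print(f"{label}: P={p:.2f} R={r:.2f} F1={f1:.2f}")
```

To use it, redirect the standard output of the evaluation command to a file and pass the opened file to `token_scores`.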
