DataNER

This repository contains the code used to create the DataNER corpus, a corpus annotated for named entity recognition (NER) using Wikipedia and WikiData.

Dependencies

You need to install MongoDB in order to use this repository.

Description

The code in this repository processes a Wikipedia XML dump and a WikiData JSON dump and uses them to build a corpus annotated with named entities.

It was developed as part of a master's thesis (written in French) at Université de Montréal. Unfortunately, the thesis found that this process yields a lower-quality corpus than other similar corpora. The code is still published because it remains a useful starting point for anyone exploring this method.

How to use

  1. Download the Wikipedia and WikiData dumps of your choice.
  2. Download the WikiExtractor repository and the NECKAr tool and move each into its respective folder in this repository (see the README files).
  3. Add the path to the WikiData dump to the NECKAr.cfg file in the NECKAr folder.
  4. Run the process_wikidata_dump.sh script.
  5. Run the process_wikipedia_dump.sh script.
  6. (Optional) Run the augment_mentions.py script to create more named entities in your corpus.
  7. Run the extract_collection.py script to create the corpus.
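
Steps 4 to 7 can be chained from a small driver script. The following is only a sketch and is not part of this repository: it assumes the four scripts sit in the directory passed as `cwd`, that they take no command-line arguments, and that `bash` and a Python interpreter are available. The names `build_pipeline` and `run_pipeline` are hypothetical helpers introduced for illustration.

```python
import subprocess
import sys
from typing import List


def build_pipeline(augment: bool = False) -> List[List[str]]:
    """Return the commands for steps 4 to 7, in the order listed above.

    Assumption: every script is run without arguments.
    """
    steps = [
        ["bash", "process_wikidata_dump.sh"],   # step 4
        ["bash", "process_wikipedia_dump.sh"],  # step 5
    ]
    if augment:
        # Step 6 is optional.
        steps.append([sys.executable, "augment_mentions.py"])
    steps.append([sys.executable, "extract_collection.py"])  # step 7
    return steps


def run_pipeline(augment: bool = False, cwd: str = ".") -> None:
    """Run each step in turn and stop at the first one that fails."""
    for command in build_pipeline(augment):
        print("Running:", " ".join(command))
        # check=True raises CalledProcessError on a non-zero exit code,
        # so later steps never run on incomplete data.
        subprocess.run(command, cwd=cwd, check=True)
```

Calling `run_pipeline(augment=True)` from the repository root would run all four scripts. Steps 1 to 3 (downloads and NECKAr.cfg) still have to be done by hand first.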

Disclaimer

This code was developed and run on a 24-thread machine, so it may be very slow on more modest hardware. If the scripts take an unreasonable amount of time to run, I recommend using a subset of Wikipedia so that you can still produce a corpus.
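
One way to obtain such a subset is to keep only the first pages of the XML dump. The sketch below is not part of this repository. It assumes the usual dump layout, in which the root `</mediawiki>` tag and each closing `</page>` tag sit on their own line, and it has not been checked against the scripts above, so treat the truncated file as something to try. `take_first_pages` is a hypothetical helper name.

```python
import io
from typing import TextIO


def take_first_pages(src: TextIO, dst: TextIO, max_pages: int) -> int:
    """Copy the dump header and the first `max_pages` pages from src to dst.

    Returns the number of pages actually copied.
    """
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")
    pages = 0
    for line in src:
        stripped = line.strip()
        if stripped == "</mediawiki>":
            # End of the dump reached before max_pages pages were seen.
            break
        dst.write(line)
        if stripped == "</page>":
            pages += 1
            if pages >= max_pages:
                break
    # Close the root element so the output is still well-formed XML.
    dst.write("</mediawiki>\n")
    return pages
```

For a compressed dump, the streams could be opened with `bz2.open(dump_path, "rt", encoding="utf-8")` for reading and `open(subset_path, "w", encoding="utf-8")` for writing, where both paths are placeholders for your own files.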

References

This code is based on two other works:

  • Giuseppe Attardi. WikiExtractor. https://github.com/attardi/wikiextractor, 2015.

  • Johanna Geiß, Andreas Spitz, and Michael Gertz. NECKAr: A named entity classifier for Wikidata. In Georg Rehm and Thierry Declerck, editors, Language Technologies for the Challenges of the Digital Age, pages 115–129, Cham, 2018. Springer International Publishing. ISBN 978-3-319-73706-5.

Contact

Please reach out to [email protected] if you have any questions about this repository.

Contributors

lucaspages
