
Age-detector-dropwizard

An experiment in serving a Python model from Java.

Follow the instructions in the detector folder, then those in the image-loader folder, and finally those in the empty_chair folder.

Entity disambiguation progress

  1. Download the Spanish Wikimedia XML dump.

  2. Extract the clean content of each page with the Attardi extractor into a single file of approximately 6 GB.

  3. Extract the named entities from this single file using spaCy. A named entity is a "real-world object" that is assigned a name, for example a person, a country, a product, or a book title. spaCy can recognize various types of named entities in a document.

  4. Save the extracted named entities into a SQLite database for quick access in later processing tasks.

  5. Calculate the distance between each document's title and every entity in its content, using a custom vector representation built from 1-grams and 2-grams.

  6. Cluster the entities of each document into 5 groups using Jenks natural breaks on the vector distances; the entities in the 4th and 5th groups are marked for disambiguation with respect to the document's title.

  7. Replace the entities in the 5th group with the ones in the 4th group in the paragraphs where they appear.

  8. Save the modified paragraphs, with the correct and incorrect entities, into a JSON file of approximately 40 GB.
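Steps 5 and 6 can be sketched roughly as follows. This is a minimal illustration, not the repository's code: it assumes the "1 and 2 n-grams" are character n-grams, uses cosine distance between count vectors, and implements Jenks natural breaks as a small pure-Python dynamic program. The names `ngram_vector`, `cosine_distance` and `jenks_groups` are hypothetical.

```python
from collections import Counter
from math import sqrt


def ngram_vector(text, sizes=(1, 2)):
    """Count character n-grams of the given sizes (an assumed representation)."""
    s = text.lower()
    return Counter(s[i:i + n] for n in sizes for i in range(len(s) - n + 1))


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return 1.0 - dot / norm if norm else 1.0


def jenks_groups(items, k=5):
    """Split (distance, label) pairs into k contiguous groups that minimise
    the within-group sum of squared deviations of the distances."""
    items = sorted(items)
    n = len(items)
    if n < k:
        raise ValueError("need at least k items")
    # Prefix sums give the squared-deviation cost of any slice in O(1).
    s1, s2 = [0.0], [0.0]
    for d, _ in items:
        s1.append(s1[-1] + d)
        s2.append(s2[-1] + d * d)

    def sse(i, j):  # cost of grouping items[i:j] together
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                value = cost[c - 1][i] + sse(i, j)
                if value < cost[c][j]:
                    cost[c][j], cut[c][j] = value, i
    # Walk the stored cut points backwards to recover the groups.
    groups, j = [], n
    for c in range(k, 0, -1):
        i = cut[c][j]
        groups.append(items[i:j])
        j = i
    return groups[::-1]


def flag_for_disambiguation(title, entities, k=5):
    """Return the entities that land in the two most distant groups."""
    title_vec = ngram_vector(title)
    scored = [(cosine_distance(title_vec, ngram_vector(e)), e) for e in entities]
    groups = jenks_groups(scored, k)
    return [label for group in groups[-2:] for _, label in group]
```

Calling `flag_for_disambiguation(title, entities)` with a document title and at least five entity strings returns the entities in the 4th and 5th groups, which are the ones marked for disambiguation in step 6.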

Database version 2

  1. Download the MariaDB Wikipedia database.

  2. Create a script to get all the pages of a category and its subcategories' pages recursively, up to the 10th level of the spanning tree.

  3. With the spanning tree generated, find disambiguation pages to create a new dataset.
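The depth-limited category walk in step 2 could look roughly like the sketch below. It is an illustration under stated assumptions, not the repository's script: it uses an iterative breadth-first traversal instead of recursion, and the two dictionaries `subcategories` and `pages` are hypothetical in-memory stand-ins for lookups that would, in a MediaWiki database, typically be queries against the `categorylinks` table.

```python
from collections import deque


def collect_pages(root, subcategories, pages, max_depth=10):
    """Breadth-first walk of the category graph starting at `root`.

    `subcategories` maps a category to its direct subcategories and
    `pages` maps a category to its direct pages (both hypothetical).
    Returns (parent, found): `parent` is the spanning tree as a
    child -> parent mapping, `found` is the set of pages reached.
    """
    parent = {root: None}
    found = set(pages.get(root, ()))
    queue = deque([(root, 0)])
    while queue:
        category, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not descend below the depth limit
        for sub in subcategories.get(category, ()):
            if sub in parent:
                continue  # already visited; category graphs can contain cycles
            parent[sub] = category
            found.update(pages.get(sub, ()))
            queue.append((sub, depth + 1))
    return parent, found
```

Tracking visited categories in `parent` matters because category graphs can loop back on themselves; without that check the walk would only stop at the depth limit and would revisit the same categories many times.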

Models to use

The datasets created are meant to be used with the model proposed here. A knowledge base could also be created for direct use with spaCy and its entity linking API.

An example of an entity recognition demo based on the content of the text can be seen here.

Contributors

jccaleroe
