Experiment for serving a Python model on Java.
Follow the instructions in the detector folder, then in the image-loader folder, and lastly in the empty_chair folder.
- Download the Wikimedia Spanish XML dump.
- Extract each page's clean content with the Attardi extractor into a single file of roughly 6 GB.
- Extract named entities from this file using spaCy. A named entity is a "real-world object" that's assigned a name – for example, a person, a country, a product or a book title. spaCy can recognize various types of named entities in a document.
- Save the extracted named entities into an SQLite database for quick access in later processing tasks.
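A minimal sketch of the SQLite storage step, using only the standard library. The table layout and column names here are assumptions for illustration, not the project's actual schema:

```python
import sqlite3

def save_entities(db_path, entities):
    """Store (doc_title, entity_text, entity_label) tuples in SQLite.

    Schema and names are illustrative, not the project's actual ones.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS entities (
               doc_title TEXT,
               entity    TEXT,
               label     TEXT
           )"""
    )
    # An index on doc_title keeps the later per-document lookups fast.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_doc ON entities (doc_title)")
    conn.executemany("INSERT INTO entities VALUES (?, ?, ?)", entities)
    conn.commit()
    return conn

conn = save_entities(":memory:", [
    ("Madrid", "Manzanares", "LOC"),
    ("Madrid", "Felipe II", "PER"),
])
rows = conn.execute(
    "SELECT entity FROM entities WHERE doc_title = ?", ("Madrid",)
).fetchall()
```

An in-memory database is used above for the demo; the real pipeline would point `db_path` at a file.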
- Calculate the distance between each document's title entity and all the entities in its content, using a custom vector representation built from 1- and 2-grams.
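The source does not spell out the custom vector representation, so the sketch below assumes word-level 1- and 2-grams with cosine distance; the actual project may tokenize or weight differently:

```python
from collections import Counter
from math import sqrt

def ngram_vector(text):
    """Bag of word 1-grams and 2-grams as a sparse Counter vector."""
    tokens = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(tokens + bigrams)

def cosine_distance(a, b):
    """1 minus the cosine similarity of two sparse Counter vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

title_vec = ngram_vector("Miguel de Cervantes")
same = cosine_distance(title_vec, ngram_vector("Miguel de Cervantes"))
other = cosine_distance(title_vec, ngram_vector("Rio de la Plata"))
```

Identical strings get distance 0, while unrelated entity names land close to 1, which is the spread the clustering step below relies on.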
- Cluster the entities of each document into 5 groups using Jenks natural breaks over the vector distances, so that the entities in the 4th and 5th groups are marked for disambiguation with respect to the document's title.
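The clustering step can be sketched with a plain-Python version of the standard Jenks natural breaks dynamic program (the project may well use a library such as jenkspy instead; the distance values below are made up for the demo):

```python
def jenks_breaks(values, n_classes):
    """Classic Jenks natural breaks: minimize within-class variance over 1-D data."""
    data = sorted(values)
    n = len(data)
    # lower_class_limits[l][j]: 1-based index of the first element of class j
    # when the first l elements are split into j classes.
    lower_class_limits = [[0] * (n_classes + 1) for _ in range(n + 1)]
    variance_combos = [[0.0] * (n_classes + 1) for _ in range(n + 1)]
    for i in range(1, n_classes + 1):
        lower_class_limits[1][i] = 1
        for j in range(2, n + 1):
            variance_combos[j][i] = float("inf")
    for l in range(2, n + 1):
        s = ss = 0.0
        w = 0
        variance = 0.0
        for m in range(1, l + 1):
            lower = l - m + 1          # trial class starts at element `lower`
            val = data[lower - 1]
            w += 1
            s += val
            ss += val * val
            variance = ss - (s * s) / w
            if lower != 1:
                for j in range(2, n_classes + 1):
                    cand = variance + variance_combos[lower - 1][j - 1]
                    if variance_combos[l][j] >= cand:
                        lower_class_limits[l][j] = lower
                        variance_combos[l][j] = cand
        lower_class_limits[l][1] = 1
        variance_combos[l][1] = variance
    # Walk the limits back to recover the break values.
    breaks = [0.0] * (n_classes + 1)
    breaks[0], breaks[n_classes] = data[0], data[-1]
    k = n
    for j in range(n_classes, 1, -1):
        breaks[j - 1] = data[lower_class_limits[k][j] - 2]
        k = lower_class_limits[k][j] - 1
    return breaks

distances = [0.05, 0.07, 0.10, 0.30, 0.32, 0.55, 0.57, 0.80, 0.82, 0.95]
breaks = jenks_breaks(distances, 5)
```

Each entity then falls into the group whose break interval contains its distance; the two upper groups hold the entities farthest from the title.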
- Replace the entities in the 5th group with the ones in the 4th group in the paragraphs where they appear.
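A sketch of the substitution step. The source does not say how 5th-group entities are paired with 4th-group replacements, so a round-robin pairing is assumed here:

```python
from itertools import cycle

def swap_entities(paragraph, group4, group5):
    """Replace each 5th-group entity mention with a 4th-group entity.

    The round-robin pairing is an assumption; the original pipeline
    may match entities differently.
    """
    replacements = cycle(group4)
    for wrong in group5:
        if wrong in paragraph:
            paragraph = paragraph.replace(wrong, next(replacements))
    return paragraph

text = "Cervantes escribió el Quijote en Esquivias."
modified = swap_entities(text, ["Lope de Vega"], ["Esquivias"])
```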
- Save the modified paragraphs with the correct and incorrect entities into a JSON file of approximately 40 GB.
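One way this could be written out is as JSON Lines, which keeps a ~40 GB file streamable record by record; the field names below are assumptions, not the project's actual layout:

```python
import json
import os
import tempfile

def write_records(path, records):
    """Write one JSON object per line (JSON Lines) so readers can stream the file."""
    with open(path, "w", encoding="utf-8") as fh:
        for rec in records:
            fh.write(json.dumps(rec, ensure_ascii=False) + "\n")

records = [{
    "paragraph": "Cervantes escribió el Quijote en Lope de Vega.",
    "correct_entity": "Esquivias",
    "incorrect_entity": "Lope de Vega",
}]
path = os.path.join(tempfile.gettempdir(), "dataset.jsonl")
write_records(path, records)
with open(path, encoding="utf-8") as fh:
    loaded = [json.loads(line) for line in fh]
```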
- Download the MariaDB Wikipedia database.
- Create a script to get all the pages of a category and its subcategories' pages recursively, up to the 10th level of the spanning tree.
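The traversal can be sketched as a breadth-first walk with a depth cap. In the real script the subcategory and page lookups would be queries against the MariaDB database; here a small dict stands in so the logic is runnable:

```python
from collections import deque

# A dict stands in for the category-links table: category -> (subcategories, pages).
CATEGORY_GRAPH = {
    "Ciencia":  (["Física", "Química"], ["Ciencia (página)"]),
    "Física":   (["Mecánica"], ["Física (página)"]),
    "Química":  ([], ["Química (página)"]),
    "Mecánica": ([], ["Mecánica (página)"]),
}

def pages_in_category(root, max_depth=10):
    """Collect the pages of a category and its subcategories up to max_depth
    levels, walking the spanning tree breadth-first and visiting each
    category once (Wikipedia's category graph contains cycles)."""
    pages, seen = [], {root}
    queue = deque([(root, 0)])
    while queue:
        cat, depth = queue.popleft()
        subcats, cat_pages = CATEGORY_GRAPH.get(cat, ([], []))
        pages.extend(cat_pages)
        if depth < max_depth:
            for sub in subcats:
                if sub not in seen:
                    seen.add(sub)
                    queue.append((sub, depth + 1))
    return pages

result = pages_in_category("Ciencia")
shallow = pages_in_category("Ciencia", max_depth=1)
```

The `seen` set is what turns the cyclic category graph into a spanning tree: each category is attached to the first path that reaches it.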
- With the generated spanning tree, find disambiguation pages to create a new dataset.
The datasets created are meant to be used with the model proposed here; additionally, a knowledge base could be built for direct use with spaCy and its entity-linking API.
An example of an entity-recognition demo based on the content of the text can be seen here.