Git Product home page Git Product logo

wcnpruning's Introduction

Wikipedia Category Network Pruning

A framework to clean the Wikipedia category network.

Getting Started

The category network in Wikipedia is used by editors as a way to label articles and organize them in a hierarchical structure. This manually created and curated network of 1.6 million nodes in English Wikipedia generated by arranging the categories in a child-parent relation (i.e., Scientists-People, Cities-Human Settlement) allows researchers to infer valuable relations between concepts. A clean structure in this format would be a valuable resource for a variety of tools and application including automatic reasoning tools. Unfortunately, Wikipedia category network contains some "noise" since in many cases the association as subcategory does not define an is-a relation (Scientists is-a People vs. Billionaires‎ is-a Wealth). Inspired to develop a model for recommending sections to be added to the already existing Wikipedia articles, we developed a method to clean this network and to keep only the categories that have a high chance to be associated with their children by an is-a relation. The strategy is based on the concept of "pure" categories, and the algorithm uses the types of the attached articles to determine how homogenous the category is. The approach does not rely on any linguistic feature and therefore is suitable for all Wikipedia languages. In this talk, we will discuss the high-level overview of the algorithm and some of the possible applications for the generated network beyond article section recommendations.

Video

Slides

Prerequisites

To run our code, you need:

  • Java 8 (JDK)
  • Maven
  • The three input files: category network, articles-categories association, the article types

Data requirements

The graph of the categories MUST be a Directed Acyclic Graph.

We recommend removing cycles with this implementation of the paper: "A fast and effective heuristic for the feedback arc set problem".

Build informations:

The code is distributed as a Maven project and it can be compiled with the following command:

mvn clean compile assembly:single

This will generate a runnable jar file available in the folder target.

Input format:

Three files in this format:

  • Category network:
node1⇥parent_node1
node1⇥parent_node2
node1⇥parent_node3
  • Articles-Categories:
article_id1⇥title1⇥category1
article_id2⇥title2⇥category1
article_id3⇥title3⇥category1

Title is not required (useful for debug).

  • Articles type
article_id1⇥type
article_id2⇥type

Note: the format can customized by changing the load functions in Importer.java

Output format:

{"score":0.9804195804195804,"category":"Waikato_Tainui","articles":[342863,2728182,...,51201311]}
{"score":1.0,"category":"People_from_Chistopol","articles":[384682,18482502,479899]}
{"score":0.9734265734265735,"category":"1913_establishments_in_Spain","articles":[8312955,22525255,...,22648141]}

Files description:

Check the comments in the source code and the javadoc for more information.

Graph.java

It defines the category network nodes: articles and categories. Both have the concept of virtual ID to compress the lookup table (remove gaps).

Importer.java

This file loads the data into memory. It can be changed accordingly with the input files.

Main.java

The entry point of the application. It provides some implementations of possible PurityScoreFunction (Gini coefficient, Gini impurity, etc)

PurityScoreFunction.java

Interface for score generators

GraphPruner.java

This file contains the logic of the algorithm. The constructor requires: the graph, the scoring function, the threshold (to flag the category as pure), the number of threads, and the output file.

Data

We provide as a sample the dataset used as input and the output generated for the paper "Structuring Wikipedia Articles with Section Recommendations".

Download here

wcnpruning's People

Contributors

tizianopiccardi avatar sergiygolov avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.