
wikiloop-datasets's Introduction

Hi Wikipedians,

This is Victor from Google, a software engineer on Google's WikiKnowledge team.

Being a Wikipedian myself, I have always wished to contribute to Google's efforts to support an open knowledge ecosystem, including projects like Wikipedia, through its content. At the same time, we want to make sure these efforts follow the rules established by the Wikipedia community, such as [[WP:COI]] and [[WP:NPOV]]. I am happy to report that, thanks to the efforts of my co-workers and myself, we are able to produce some data, based purely on what is publicly accessible on Wikipedia.org, that we believe might be useful for Wikipedia and its sister projects. Here is the first dataset: Conflicting Birthdates across Wikipedias.

Please note: (1) We won't write to the (article) namespace of Wikipedia or change anything on Wikipedia. We are only releasing datasets to the Wikipedia community and the public so that, when Wikipedians find the data helpful, they can decide whether and how to use these datasets in their editing workflow. (2) There is nothing in this dataset that is not already on Wikipedia, so no new information is exposed.

We all know that an individual can only be born once. Therefore, if different language editions of Wikipedia show a person as born on different dates, sometimes off by a few years, at least one of the sources must be wrong. We built internal big data pipelines to process content from Wikipedia and find conflicting birthdates for the same individual across language editions.

We also want to bring to the community's attention that some of the data might be wrong. As with any big data project, challenges lie in many aspects. For example, birthdates that we thought belonged to the same individual might actually come from two different individuals who happen to be linked to each other via Wikipedia interlanguage links. Data mistakes can also be introduced by differences in time zones, calendar systems, and other causes.

While we won't be writing or changing any Wikipedia content directly, in order to follow Wikipedia policy, we are happy to brainstorm with the editor community about how these datasets can be used to improve Wikipedia's data quality. For example:

  1. Manual fixing.
  2. Building a bot to fix the data in bulk.
  3. Identifying pages that are more prone to vandalism.
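
To make the idea of a "conflicting birthdate" concrete, here is a minimal sketch of the kind of check described above. It is not our internal pipeline. It assumes a hypothetical input of (entity ID, language, birthdate) rows, and the IDs and dates are made up for illustration. It groups rows by individual and reports anyone with more than one distinct birthdate, which could serve as a worklist for manual review.

```python
from collections import defaultdict
from datetime import date

# Hypothetical input rows: (entity_id, language, birthdate).
# These values are invented for illustration and are not from the dataset.
RECORDS = [
    ("person-1", "en", date(1950, 3, 14)),
    ("person-1", "de", date(1950, 3, 14)),
    ("person-1", "fr", date(1951, 3, 14)),
    ("person-2", "en", date(1980, 7, 1)),
    ("person-2", "es", date(1980, 7, 1)),
]


def find_conflicts(records):
    """Return {entity_id: {birthdate: [languages]}} for every entity
    that has more than one distinct birthdate across languages."""
    by_entity = defaultdict(lambda: defaultdict(list))
    for entity_id, language, birthdate in records:
        by_entity[entity_id][birthdate].append(language)
    return {
        entity_id: {d: sorted(langs) for d, langs in dates.items()}
        for entity_id, dates in by_entity.items()
        if len(dates) > 1
    }


conflicts = find_conflicts(RECORDS)
for entity_id, dates in conflicts.items():
    for birthdate, languages in sorted(dates.items()):
        print(entity_id, birthdate.isoformat(), ",".join(languages))
```

A strict equality check like this one would also flag differences caused by time zones or calendar systems, so its output is best treated as a list of candidates to verify, not as confirmed errors.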

Please note that these datasets (and our infrastructure for generating them) are still in beta. We will keep improving the precision and recall of the data going forward.

If you have any questions, comments or advice, please let us know by leaving a message here on the discussion page.

Licensing

Licensed under CC BY-SA 3.0.
