Hi Wikipedians,
This is Victor from Google, a software engineer on Google's WikiKnowledge team.
As a Wikipedian myself, I have long wanted to contribute to Google's efforts to support an open knowledge ecosystem, including projects like Wikipedia, through its content. At the same time, we want to make sure this work follows the rules established by the Wikipedia community, such as [[WP:COI]] and [[WP:NPOV]]. I am happy to report that, thanks to the efforts of my co-workers and myself, we have been able to produce some data, based purely on what is publicly accessible on Wikipedia.org, that we believe might be useful for Wikipedia and its sister projects. Here is the first dataset: Conflicting Birthdates across Wikipedias.
Please note: (1) We won't write to the article namespace of Wikipedia or change anything on Wikipedia. We are only releasing datasets to the Wikipedia community and the public, so that Wikipedians who find the data helpful can decide whether and how to use these datasets in their editing workflow. (2) Everything in this dataset is already on Wikipedia, so no new information is exposed.
We all know that an individual can only be born once. Therefore, if different language editions of Wikipedia show the same person as born on different dates, sometimes several years apart, at least one of them must be wrong. We built internal big data pipelines that process content from Wikipedia and find conflicting birthdates for the same individual across the different language editions.
We also want to bring to the community's attention that some of the data might be wrong. As with any big data project, there are challenges in many areas. For example, two birthdates that we attributed to the same individual might actually come from two different individuals who happen to be linked to each other via Wikipedia's interlanguage links. Mistakes can also be introduced by differing time zones, calendar systems, and other causes.
While we won't be writing or changing any Wikipedia content directly, in order to follow Wikipedia policy, we are happy to brainstorm with the editor community about how these datasets can be used to improve Wikipedia's data quality. For example:
- Fixing entries manually.
- Building a bot to fix the data in bulk.
- Identifying pages that are more prone to vandalism.
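For anyone who wants to experiment with this kind of check, the basic idea of comparing birthdates across language editions can be sketched in a few lines of Python. This is only a minimal illustration and not our actual pipeline: the record layout, the `find_conflicts` function, and the sample entries (`person_a`, `person_b`) are all hypothetical, and it assumes that birthdates have already been extracted and that interlanguage links have already been resolved to a single identifier per person.

```python
from collections import defaultdict
from datetime import date


def find_conflicts(records):
    """Return {entity_id: {language: birthdate}} for entities whose
    language editions disagree on the birthdate.

    `records` is an iterable of (entity_id, language, birthdate) tuples.
    This layout is an assumption made for illustration only.
    """
    by_entity = defaultdict(dict)
    for entity_id, language, born in records:
        by_entity[entity_id][language] = born

    conflicts = {}
    for entity_id, dates in by_entity.items():
        # More than one distinct date means at least one edition is wrong
        # (or the pages were wrongly linked to the same person).
        if len(set(dates.values())) > 1:
            conflicts[entity_id] = dates
    return conflicts


# Made-up sample data: person_a has one edition that is off by a year.
sample = [
    ("person_a", "en", date(1950, 3, 14)),
    ("person_a", "de", date(1950, 3, 14)),
    ("person_a", "fr", date(1951, 3, 14)),
    ("person_b", "en", date(1980, 7, 1)),
    ("person_b", "es", date(1980, 7, 1)),
]

conflicts = find_conflicts(sample)
for entity_id, dates in conflicts.items():
    print(entity_id, {lang: d.isoformat() for lang, d in dates.items()})
```

A simple equality check like this reports every disagreement, including the false positives mentioned above (wrongly linked pages, calendar or time-zone differences), so its output still needs human review before anything is changed.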
Please note that these datasets (and the infrastructure that generates them) are still in beta. We will keep improving the precision and recall of the data going forward.
If you have any questions, comments, or advice, please let us know by leaving a message here on the discussion page.
Licensing
Copyright is hereby granted under CC BY-SA 3.0