
wikiloop-analysis's Introduction

Wikipedia Cross Edit Pattern Exploration

This repository hosts exploratory code for the Wikipedia Revision Dataset, as part of the Cross Edit Pattern Detection Project.
It is forked from a Google corporate repository, and changes are pushed to it regularly.
Author: Haoran Fei ([email protected])
Host: Zainan Zhou ([email protected])
Date: June 8th, 2020

Open-Source Dependencies and Licensing

Python3: GPL-Compatible License. GPL-compatible doesn’t mean that we’re distributing Python under the GPL. All Python licenses, unlike the GPL, let you distribute a modified version without making your changes open source.
Pandas: New BSD License.
Matplotlib: License based on the PSF license.

Usage Example

Load the first JSON data file and run the article-based analysis:
$ python3 article_analytics.py --path ./data/cross_edits_tmp_ttl=72_revisioninfo_20200605_1023_segment-000##-of-00037.json --start 0 --stop 1
Load all 37 JSON data files and run the article-based analysis:
$ python3 article_analytics.py --path ./data/1023/segment-000##-of-00037.json --start 0 --stop 37
Load the first JSON data file and run the author-based analysis:
$ python3 author_analytics.py --path ./data/cross_edits_tmp_ttl=72_revisioninfo_20200605_1023_segment-000##-of-00037.json --start 0 --stop 1
Load all 37 JSON data files and run the author-based analysis:
$ python3 author_analytics.py --path ./data/1023/segment-000##-of-00037.json --start 0 --stop 37
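
The commands above pass a path containing a `##` placeholder together with `--start` and `--stop`. How the scripts resolve that placeholder is not documented here. The sketch below shows one plausible reading, assuming `##` is replaced by a zero-padded segment index and that `--stop` is exclusive (so `--start 0 --stop 37` covers 37 files). The helper name `expand_segment_paths` is hypothetical and not taken from the repository.

```python
from typing import List


def expand_segment_paths(pattern: str, start: int, stop: int,
                         placeholder: str = "##") -> List[str]:
    """Expand a path pattern into one concrete path per segment index.

    Assumption: the placeholder stands for a zero-padded index as wide as
    the placeholder itself, and `stop` is exclusive.
    """
    width = len(placeholder)
    return [
        pattern.replace(placeholder, f"{index:0{width}d}")
        for index in range(start, stop)
    ]


paths = expand_segment_paths("./data/1023/segment-000##-of-00037.json", 0, 37)
print(len(paths))
print(paths[0])
print(paths[-1])
```

Under these assumptions the first path ends in `segment-00000-of-00037.json` and the last in `segment-00036-of-00037.json`.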

Formula for Sliding Window Anomaly Detection

A window is flagged as an anomaly when it satisfies a condition expressed in terms of the following quantities:

M: the metric considered. Currently supports mean and median.
W: the window frame under consideration.
S: the complete dataset for the given key. This can be all edits on the same article or by the same author, depending on the key used.
k: either 1 or -1. It is 1 if we are concerned with abnormally high values only, and -1 if we are concerned with abnormally low values only.
t: a percentage threshold for flagging anomalies. Currently set at 50%.
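
The definitions above can be illustrated with a small sketch. It assumes the condition has the form k * (M(W) - M(S)) / |M(S)| > t, that is, the window metric differs from the baseline over the complete dataset by more than t in the direction selected by k. This form is an assumption inferred from the definitions and from the "percent difference from baseline" wording in the log format, not the project's confirmed formula. The function names and the index-based window (rather than a time-based one) are likewise hypothetical.

```python
from statistics import mean, median
from typing import List, Sequence, Tuple

# Supported metrics M, matching the two named in the text.
METRICS = {"mean": mean, "median": median}


def is_anomalous(window: Sequence[float], dataset: Sequence[float],
                 metric: str = "mean", k: int = 1, t: float = 0.5) -> bool:
    """Assumed condition: k * (M(W) - M(S)) / |M(S)| > t."""
    m = METRICS[metric]
    baseline = m(dataset)
    if baseline == 0:
        # The relative difference is undefined; this sketch does not flag.
        return False
    relative_diff = (m(window) - baseline) / abs(baseline)
    return k * relative_diff > t


def flag_windows(values: Sequence[float], size: int, metric: str = "mean",
                 k: int = 1, t: float = 0.5) -> List[Tuple[int, int]]:
    """Slide a fixed-size window over `values` and return the
    (start, end) index pairs, end exclusive, of flagged windows."""
    flagged = []
    for start in range(len(values) - size + 1):
        window = values[start:start + size]
        if is_anomalous(window, values, metric, k, t):
            flagged.append((start, start + size))
    return flagged


edits = [1, 1, 1, 1, 10, 10, 1, 1]
print(flag_windows(edits, size=2, k=1))   # abnormally high windows
print(flag_windows(edits, size=2, k=-1))  # abnormally low windows
```

With k = 1 only windows above the baseline can be flagged, and with k = -1 only those below it, which matches the one-sided behavior described for k.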

Log Files and Format

All log files are located in the cross-edits-analysis/log directory. Each subdirectory holds the logs for the corresponding analysis script.
Format of log line: Anomaly of (metric name) of (column name) detected for (key: this can be article/author or article/author pair) during period from (starting time of window) to (ending time of window), with a () percent difference from baseline.
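
As a rough illustration of the log line format, the sketch below builds a line from its fields and parses it back with a regular expression. The function names, the two-decimal rendering of the percentage, and the sample values (including the column name `edit_size`) are assumptions made for the example. The actual scripts may format timestamps and numbers differently.

```python
import re
from typing import Dict, Optional

LOG_PATTERN = re.compile(
    r"Anomaly of (?P<metric>.+?) of (?P<column>.+?) detected for (?P<key>.+?) "
    r"during period from (?P<start>.+?) to (?P<end>.+?), "
    r"with a (?P<percent>-?\d+(?:\.\d+)?) percent difference from baseline\."
)


def format_anomaly_log(metric: str, column: str, key: str,
                       start: str, end: str, percent: float) -> str:
    """Render one log line in the documented shape (two decimals assumed)."""
    return (
        f"Anomaly of {metric} of {column} detected for {key} "
        f"during period from {start} to {end}, "
        f"with a {percent:.2f} percent difference from baseline."
    )


def parse_anomaly_log(line: str) -> Optional[Dict[str, str]]:
    """Return the fields of a log line as strings, or None if it does not match."""
    match = LOG_PATTERN.fullmatch(line.strip())
    return match.groupdict() if match else None


line = format_anomaly_log("mean", "edit_size", "Example article",
                          "2020-06-01 00:00:00", "2020-06-04 00:00:00", 69.2307)
print(line)
print(parse_anomaly_log(line))
```

Parsing the lines back into fields makes it straightforward to load a log directory into a table for further filtering, for example by key or by window start time.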
