
wikiloop-analysis's Introduction

Wikipedia Cross Edit Pattern Exploration

This repository hosts exploratory code for the Wikipedia Revision Dataset, as part of the Cross Edit Pattern Detection Project.
It is forked from a Google corporate repository, and changes from that repository will be pushed here regularly.
Author: Haoran Fei ([email protected])
Host: Zainan Zhou ([email protected])
Date: June 8th, 2020

Open-Source Dependencies and Licensing

Python3: GPL-Compatible License. GPL-compatible doesn’t mean that we’re distributing Python under the GPL. All Python licenses, unlike the GPL, let you distribute a modified version without making your changes open source.
Pandas: New BSD License. Matplotlib: License based on PSF license.

Usage Example

Load the first JSON data file and run the article-based analysis:
$ python3 article_analytics.py --path ./data/cross_edits_tmp_ttl=72_revisioninfo_20200605_1023_segment-000##-of-00037.json --start 0 --stop 1
Load all 37 JSON data files and run the article-based analysis:
$ python3 article_analytics.py --path ./data/1023/segment-000##-of-00037.json --start 0 --stop 37
Load the first JSON data file and run the author-based analysis:
$ python3 author_analytics.py --path ./data/cross_edits_tmp_ttl=72_revisioninfo_20200605_1023_segment-000##-of-00037.json --start 0 --stop 1
Load all 37 JSON data files and run the author-based analysis:
$ python3 author_analytics.py --path ./data/1023/segment-000##-of-00037.json --start 0 --stop 37
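
The commands above suggest that the `##` in `--path` stands for a segment index and that `--stop` is exclusive (0 to 1 loads one file, 0 to 37 loads all 37). The sketch below shows one way such a pattern could be expanded into file names. It is an assumption about the behaviour, not the scripts' actual code, and `expand_segment_paths` is a hypothetical helper name.

```python
from typing import List


def expand_segment_paths(pattern: str, start: int, stop: int) -> List[str]:
    """Replace the contiguous run of '#' with a zero-padded index in [start, stop)."""
    width = pattern.count("#")
    placeholder = "#" * width
    if width == 0 or placeholder not in pattern:
        raise ValueError("pattern must contain one contiguous run of '#'")
    # Assumption: the index is zero-padded to the width of the placeholder.
    return [pattern.replace(placeholder, str(i).zfill(width))
            for i in range(start, stop)]


paths = expand_segment_paths("./data/1023/segment-000##-of-00037.json", 0, 37)
print(len(paths))   # 37
print(paths[0])     # ./data/1023/segment-00000-of-00037.json
print(paths[-1])    # ./data/1023/segment-00036-of-00037.json
```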

Formula for Sliding Window Anomaly Detection

A window is flagged as an anomaly according to a condition defined in terms of the following quantities:

M: the metric considered. Mean and median are currently supported.
W: the window frame under consideration.
S: the complete dataset for the given key. This can be all edits on the same article or all edits by the same author, depending on the key used.
k: either 1 or -1. It is 1 if only abnormally high values are of interest, and -1 if only abnormally low values are of interest.
t: a percentage threshold for flagging anomalies. Currently set at 50%.
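
The condition itself is not reproduced in this text. One reading consistent with the definitions above and with the "percent difference from baseline" wording in the logs is k * (M(W) - M(S)) / |M(S)| > t. The sketch below implements that reading with the standard library. The exact form of the condition, the handling of a zero baseline, and the name `is_anomalous` are all assumptions.

```python
import statistics
from typing import Sequence

# The two metrics the README says are supported.
METRICS = {"mean": statistics.mean, "median": statistics.median}


def is_anomalous(window: Sequence[float], full: Sequence[float],
                 metric: str = "mean", k: int = 1, t: float = 0.5) -> bool:
    """Flag `window` if k * (M(W) - M(S)) / |M(S)| exceeds t (assumed condition)."""
    if k not in (1, -1):
        raise ValueError("k must be 1 or -1")
    m = METRICS[metric]
    baseline = m(full)
    if baseline == 0:
        # Relative difference is undefined; not flagging is an assumption.
        return False
    relative_diff = (m(window) - baseline) / abs(baseline)
    return k * relative_diff > t


edits = [1, 1, 1, 1, 9, 9]                         # S: hypothetical per-window counts
print(is_anomalous(edits[4:6], edits, k=1))        # True: window mean is far above baseline
print(is_anomalous(edits[4:6], edits, k=-1))       # False: only low values are of interest
print(is_anomalous(edits[0:2], edits, k=-1))       # True: window mean is far below baseline
```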

Log Files and Format

All log files are located in the cross-edits-analysis/log directory. Each subdirectory holds the logs for the corresponding analysis script.
Format of a log line: Anomaly of (metric name) of (column name) detected for (key: this can be article/author or article/author pair) during period from (starting time of window) to (ending time of window), with a (percent value) percent difference from baseline.
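
A log line in this format can be split back into its fields with a regular expression. The sketch below is illustrative only: the sample line, the column name `edit_count`, the key `Example_Article`, and the timestamp layout are hypothetical, and the assumption that metric and column names contain no spaces may not hold for the real logs.

```python
import re
from typing import Dict, Optional

LOG_PATTERN = re.compile(
    r"Anomaly of (?P<metric>\S+) of (?P<column>\S+) detected for (?P<key>.+?) "
    r"during period from (?P<start>.+?) to (?P<end>.+?), "
    r"with a (?P<percent>-?\d+(?:\.\d+)?) percent difference from baseline\."
)


def parse_log_line(line: str) -> Optional[Dict[str, str]]:
    """Return the named fields of one anomaly log line, or None if it does not match."""
    match = LOG_PATTERN.search(line)
    return match.groupdict() if match else None


# Hypothetical sample line following the documented format.
sample = ("Anomaly of mean of edit_count detected for Example_Article "
          "during period from 2020-06-01 00:00 to 2020-06-04 00:00, "
          "with a 72.5 percent difference from baseline.")
fields = parse_log_line(sample)
print(fields["metric"], fields["key"], fields["percent"])
```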

wikiloop-analysis's People

Contributors

haoranfei, xinbenlv

wikiloop-analysis's Issues

Undefined names in article_analytics.py

% flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics

./wikiloop-analysis/cross-edits-analysis/article_analytics.py:58:50: F821 undefined name 'article'
            articles_with_non_zero_scores.append(article)
                                                 ^
./wikiloop-analysis/cross-edits-analysis/article_analytics.py:59:27: F821 undefined name 'columns_to_count'
            for column in columns_to_count:
                          ^
2     F821 undefined name 'article'
2

https://flake8.pycqa.org/en/latest/user/error-codes.html

On the flake8 test selection, this PR does not focus on "style violations" (the majority of flake8 error codes, which psf/black can autocorrect). Instead, these tests focus on runtime safety and correctness:

  • E9 tests are about Python syntax errors, usually raised because flake8 cannot build an Abstract Syntax Tree (AST). Often these issues are a sign of unused code or code that has not been ported to Python 3. These would be compile-time errors in a compiled language, but in a dynamic language like Python they result in the script halting/crashing on the user.
  • F63 tests are usually about the confusion between identity and equality in Python. Using ==/!= to compare str, bytes, and int literals is the classic case. These are areas where a == b is True but a is b is False (or vice versa). Python >= 3.8 will raise SyntaxWarnings on these instances.
  • F7 tests are about logic errors and syntax errors in type hints.
  • F82 tests are almost always undefined names which are usually a sign of a typo, missing imports, or code that has not been ported to Python 3. These also would be compile-time errors in a compiled language but in Python, a NameError is raised which will halt/crash the script on the user.
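
To illustrate why an F821 finding matters at runtime, the sketch below uses a hypothetical function (not the repository's code) that refers to an undefined name inside a branch. Python only raises the NameError when that line actually executes, so input that never reaches the branch hides the bug.

```python
def collect_nonzero(scores):
    """Return the names with a non-zero score; contains a deliberate F821-style bug."""
    result = []
    for name, score in scores.items():
        if score != 0:
            # Bug on purpose: 'article' is never defined (it should be 'name').
            result.append(article)
    return result


# All-zero input never reaches the faulty line, so the call succeeds.
hidden = collect_nonzero({"Example_Article": 0})

# A non-zero score reaches the faulty line and raises NameError.
try:
    collect_nonzero({"Example_Article": 3})
    error_message = None
except NameError as exc:
    error_message = str(exc)

print(hidden)          # []
print(error_message)   # mentions the undefined name 'article'
```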

@HaoranFei
