
revscoring's Introduction


Revision Scoring

⚠️ Warning: As of late 2023, the ORES infrastructure is being deprecated by the WMF Machine Learning team. Please check https://wikitech.wikimedia.org/wiki/ORES for more info.

While the code in this repository may still work, it is unmaintained and may break at any time. Be aware, too, that the machine learning models may drift, with prediction quality degrading over time.

The replacement for ORES and associated infrastructure is Lift Wing: https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing

Some Revscoring models from ORES run on the Lift Wing infrastructure, but they are otherwise unsupported (no new training or code updates).

They can be downloaded from the links documented at: https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing#Revscoring_models_(migrated_from_ORES)

In the long term, some or all of these models may be replaced by newer models specifically tailored to run on modern ML infrastructure like Lift Wing.

If you have any questions, contact the WMF Machine Learning team: https://wikitech.wikimedia.org/wiki/Machine_Learning

A generic, machine learning-based revision scoring system designed to help automate critical wiki-work — for example, vandalism detection and removal. This library powers ORES.

Example

Using a scorer_model to score a revision:

  import mwapi
  from revscoring import Model
  from revscoring.extractors.api.extractor import Extractor

  with open("models/enwiki.damaging.linear_svc.model") as f:
       scorer_model = Model.load(f)

  extractor = Extractor(mwapi.Session(host="https://en.wikipedia.org",
                                          user_agent="revscoring demo"))

  feature_values = list(extractor.extract(123456789, scorer_model.features))

  print(scorer_model.score(feature_values))
  {'prediction': True, 'probability': {False: 0.4694409344514984, True: 0.5305590655485017}}

Installation

The easiest way to install is via the Python package installer (pip).

pip install revscoring

You may find that some of the dependencies fail to compile (namely scipy, numpy, and sklearn). In that case, you'll need to install some system-level dependencies using your operating system's package manager.

Ubuntu & Debian:

  • Run sudo apt-get install python3-dev g++ gfortran liblapack-dev libopenblas-dev enchant
  • Run sudo apt-get install aspell-ar aspell-bn aspell-el aspell-id aspell-is aspell-pl aspell-ro aspell-sv aspell-ta aspell-uk myspell-cs myspell-de-at myspell-de-ch myspell-de-de myspell-es myspell-et myspell-fa myspell-fr myspell-he myspell-hr myspell-hu myspell-lv myspell-nb myspell-nl myspell-pt-pt myspell-pt-br myspell-ru myspell-hr hunspell-bs hunspell-ca hunspell-en-au hunspell-en-us hunspell-en-gb hunspell-eu hunspell-gl hunspell-it hunspell-hi hunspell-sr hunspell-vi voikko-fi

MacOS:

Using Homebrew and pip, installing revscoring and enchant can be accomplished as follows:

brew install aspell --with-all-languages
brew install enchant
pip install --no-binary pyenchant revscoring

Adding languages in aspell (MacOS only)

cd /tmp
wget http://ftp.gnu.org/gnu/aspell/dict/pt/aspell-pt-0.50-2.tar.bz2
bzip2 -dc aspell-pt-0.50-2.tar.bz2 | tar xvf -
cd aspell-pt-0.50-2
./configure
make
sudo make install

Caveat: differences between the aspell and myspell dictionaries can cause some of the tests to fail.

Finally, in order to make use of language features, you'll need to download some NLTK data. The following command will get the necessary corpora.

python -m nltk.downloader omw sentiwordnet stopwords wordnet

You'll also need to install enchant-compatible dictionaries of the languages you'd like to use. We recommend the following:

  • languages.arabic: aspell-ar
  • languages.basque: hunspell-eu
  • languages.bengali: aspell-bn
  • languages.bosnian: hunspell-bs
  • languages.catalan: myspell-ca
  • languages.czech: myspell-cs
  • languages.croatian: myspell-hr
  • languages.dutch: myspell-nl
  • languages.english: myspell-en-us myspell-en-gb myspell-en-au
  • languages.estonian: myspell-et
  • languages.finnish: voikko-fi
  • languages.french: myspell-fr
  • languages.galician: hunspell-gl
  • languages.german: myspell-de-at myspell-de-ch myspell-de-de
  • languages.greek: aspell-el
  • languages.hebrew: myspell-he
  • languages.hindi: aspell-hi
  • languages.hungarian: myspell-hu
  • languages.icelandic: aspell-is
  • languages.indonesian: aspell-id
  • languages.italian: myspell-it
  • languages.latvian: myspell-lv
  • languages.norwegian: myspell-nb
  • languages.persian: myspell-fa
  • languages.polish: aspell-pl
  • languages.portuguese: myspell-pt-pt myspell-pt-br
  • languages.serbian: hunspell-sr
  • languages.spanish: myspell-es
  • languages.swedish: aspell-sv
  • languages.tamil: aspell-ta
  • languages.russian: myspell-ru
  • languages.ukrainian: aspell-uk
  • languages.vietnamese: hunspell-vi

Development

To contribute, first install the dependencies:

$ pip install -r requirements.txt

Install necessary NLTK data:

python -m nltk.downloader omw sentiwordnet stopwords wordnet

Running tests

Make sure you install test dependencies:

$ pip install -r test-requirements.txt

Then run:

$ pytest . -vv

Reporting bugs

To report a bug, please use Phabricator.

Authors

Contributors

5uperpalo, accraze, adamwight, aikochou, chrisalbon, chtnnh, codez266, custozza, elukey, eranroz, haksoat, halfak, he7d3r, isaranto, jonasagx, kenrick95, kevinbazira, kizule, ladsgroup, marcoaureliowm, mariushoch, mdew192837, nealmcb, pix1234, sahethi, seanchen, toarushiroineko, urstrulykkr, xinbenlv, yuvipanda


revscoring's Issues

TypeError when extracting features from deleted revisions

As reported by Danilo on Trello, if we attempt to extract features from a deleted revision such as
https://pt.wikipedia.org/wiki/?diff=40837381&uselang=en
an error occurs:

Extracting features for http://pt.wikipedia.org/wiki/?oldid=40837381&diff=prev
Traceback (most recent call last):
  File "demonstrate_extractor.py", line 72, in <module>
    features = api_extractor.extract(40837381, extractors)
  File "mypath/Revision-Scoring/revscores/api_extractor.py", line 19, in extract
    return [solve(feature, cache) for feature in features]
  File "mypath/Revision-Scoring/revscores/api_extractor.py", line 19, in <listcomp>
    return [solve(feature, cache) for feature in features]
  File "mypath/Revision-Scoring/revscores/util/dependencies.py", line 111, in solve
    for dependency in dependencies]
  File "mypath/Revision-Scoring/revscores/util/dependencies.py", line 111, in <listcomp>
    for dependency in dependencies]
  File "mypath/Revision-Scoring/revscores/util/dependencies.py", line 111, in solve
    for dependency in dependencies]
  File "mypath/Revision-Scoring/revscores/util/dependencies.py", line 111, in <listcomp>
    for dependency in dependencies]
  File "mypath/Revision-Scoring/revscores/util/dependencies.py", line 111, in solve
    for dependency in dependencies]
  File "mypath/Revision-Scoring/revscores/util/dependencies.py", line 111, in <listcomp>
    for dependency in dependencies]
  File "mypath/Revision-Scoring/revscores/util/dependencies.py", line 111, in solve
    for dependency in dependencies]
  File "mypath/Revision-Scoring/revscores/util/dependencies.py", line 111, in <listcomp>
    for dependency in dependencies]
  File "mypath/Revision-Scoring/revscores/util/dependencies.py", line 115, in solve
    value = dependent(*args)
  File "mypath/Revision-Scoring/revscores/util/dependencies.py", line 20, in __call__
    return self.f(*args, **kwargs)
  File "mypath/Revision-Scoring/revscores/datasources/revision_diff.py", line 16, in revision_diff
    b = tokenizer.tokenize(revision_text)
  File "/home/helder/.mypyvenv/lib/python3.4/site-packages/deltas/tokenizers/wikitext_split.py", line 10, in tokenize
    text
  File "/usr/lib/python3.4/re.py", line 206, in findall
    return _compile(pattern, flags).findall(string)
TypeError: expected string or buffer

Probability returned by SVC models is useless

The probability returned by SVC models seems to be weighted by the rate of the input labels. This causes problems: when predicting low-frequency events, it produces implausibly low probability scores.

For example, this doesn't make sense:

642215410: {
    'prediction': True, 
    'probabilities': [0.75884553,  0.24115447]
}

Note that the second probability, which corresponds to True, is 0.24 even though the prediction is True; it should be above 50%.
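
One possible mitigation (a sketch, not necessarily the project's fix) is to recalibrate the classifier's probability estimates, e.g. with scikit-learn's CalibratedClassifierCV; the dataset below is synthetic and illustrative:

from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Imbalanced two-class data, mimicking a low-rate event like "damaging".
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Sigmoid (Platt) calibration fits a logistic model over the SVC's scores.
clf = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=5)
clf.fit(X, y)
print(clf.predict_proba(X[:1]))  # calibrated [P(False), P(True)]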

Many Portuguese badwords are not recognized anymore

Since we are no longer using stemming, and the badwords list generated for Portuguese was originally created from stems (21e410f), excluding variations of each word whose stem matched the stem of a word already kept in the list, those variations are no longer recognized as badwords.

For example:

>>> from revscoring.languages import portuguese
>>> [ portuguese.is_badword(x) for x in [ "mentiroso", "mentirosa", "mentirosos", "mentirosas" ] ]
[True, False, True, False]

This was the raw badwords list used in the filtering process:
https://gist.github.com/he7d3r/7e3718a43f5ce65e0dab
This was the script used for filtering:
https://gist.github.com/he7d3r/de5af63ac04338c3bfbf#file-stemtomostfrequentword-py

model_info utility

There should be a model_info utility for revscoring that allows you to view the name/version of a model and any test statistics that were generated.

Something like this.

Reads metadata from a model file. 

Usage:
    model_info -h | --help
    model_info <model-file> [--output=<path>]

Options:
    -h --help        Prints this documentation
    <model-file>     The path to a serialized model file from which to read statistics
    --output=<path>  The path to a file to write metadata [default: <stdout>]

Such that if you ran revscoring model_info models/my_wp10_rf.model, you'd get something like

Name: wp10
Version: 0.3.0
Type: RandomForest(n_estimators=501, min_samples_leaf=8)

Accuracy: 0.6086513418638886

ROC-AUC:
-----  --------
B      0.817992
C      0.836208
FA     0.942814
GA     0.900372
Start  0.89841
Stub   0.983154
-----  --------

          B     C    FA    GA    Start    Stub
-----  ----  ----  ----  ----  -------  ------
B      1121   526   156   153      203       3
C       774  1380    21   216      483      17
FA      285   118  1971   810        6       0
GA      435   404   480  1803       46       0
Start   340   532     1    16     1931     411
Stub     35    42     3     0      396    2544

Set-based badwords and misspellings detection

It occurs to me that using set operations to detect the introduction of new types of badwords, informals, and misspellings might yield a stronger signal than using the content of the diff.

E.g. revision.badwords_set - parent_revision.badwords_set = new_badwords_set

If the last revision contained an instance of the curse "shit", and the editor added a new instance of the word "shit", the set difference would be empty. But if the editor added an instance of the word "fuck", that would show up, since it wasn't in the article before.

These types of features should be pretty easy to add to the SpaceDelimited metafeatures. See https://github.com/wiki-ai/revscoring/blob/master/revscoring/languages/space_delimited/space_delimited.py#L14
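
A minimal sketch of the idea with plain Python sets (the *_badwords_set datasources above are proposals, so plain literals stand in for them here):

# Badword types present in each revision (hypothetical values).
parent_badwords = {"shit"}
revision_badwords = {"shit", "fuck"}

# Only badword types that are new in this revision survive the difference.
new_badwords = revision_badwords - parent_badwords
print(new_badwords)  # {'fuck'}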

"words_added.py" doesn't detect words containing accented characters

A user added three words in the edit
https://pt.wikipedia.org/w/index.php?diff=40692203
However, api_extractor.extract(40692203, [words_added]) returns [2]. I suspected this was due to the accented letter "É" and replicated the edit on
https://pt.wikipedia.org/w/index.php?diff=40697353
using "E" instead of "É". As expected, api_extractor.extract(40697353, [words_added]) returns [3].

I believe this is because of the regex used to detect words: [a-zA-Z]+.
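
A quick sketch of the problem and a Unicode-aware alternative:

import re

text = "É uma palavra"
print(re.findall(r"[a-zA-Z]+", text))        # ['uma', 'palavra'] ('É' is lost)
print(re.findall(r"\w+", text, re.UNICODE))  # ['É', 'uma', 'palavra']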

Constructing a model and scorer is a pain

I was just demonstrating the construction of a scorer and realized that this is too complicated.

    from mw.api import Session

    from revscoring.extractors import APIExtractor
    from revscoring.languages import english
    from revscoring.scorers import MLScorerModel

    api_session = Session("https://en.wikipedia.org/w/api.php")
    extractor = APIExtractor(api_session, english)

    filename = "models/reverts.halfak_mix.trained.model"
    model = MLScorerModel.load(open(filename, 'rb'))

    rev_ids = [105, 642215410, 638307884]
    feature_values = [extractor.extract(id, model.features) for id in rev_ids]
    scores = model.score(feature_values, probabilities=True)
    for rev_id, score in zip(rev_ids, scores):
        print("{0}: {1}".format(rev_id, score))
  1. We have to import the language. This should be captured in the model. The model is useless without knowing which language was used to train it.
  2. model.score() takes a 'probabilities' argument. This is silly and it should be the default. We should just let the scoring model decide what it does and how it outputs.
  3. There should be a simple, default MLScorer that is used to wrap all MLScoringModels so that we don't need to know that this should be wrapped in a LinearSVC(Scorer).

Multi-token badword detection (words with spaces in them)

Some badwords require ngrams. E.g. "cock" and "sucker" are not really bad words on their own. This seems to be much more common outside of English.

Right now, we handle badword detection by running the is_badword function on the output of revscoring.datasources.revision.words and revscoring.datasources.diff.words_added. This limits us to matching one word-token at a time against a regex. We need multi-token word support.
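
A rough sketch of n-gram matching over the token stream (BAD_PHRASES is a made-up example list, not the project's data):

# Multi-token badwords are matched as n-grams rather than single tokens.
BAD_PHRASES = {("cock", "sucker")}

def ngrams(tokens, n):
    return zip(*(tokens[i:] for i in range(n)))

tokens = ["you", "cock", "sucker"]
print([ng for ng in ngrams(tokens, 2) if ng in BAD_PHRASES])
# [('cock', 'sucker')]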

Feature user.is_bot errors out with None 'groups'

Jul 21 19:31:31 ores-worker-02 celery[10062]: RuntimeError: Failed to process <user.is_bot>: 'NoneType' object has no attribute 'groups'
Jul 21 20:55:17 ores-worker-02 celery[10062]: [2015-07-21 20:55:17,011: ERROR/MainProcess] Task ores.score_processors.celery._process[ptwiki:re
Jul 21 20:55:17 ores-worker-02 celery[10062]: Traceback (most recent call last):
Jul 21 20:55:17 ores-worker-02 celery[10062]: File "/srv/ores/venv/lib/python3.4/site-packages/celery/app/trace.py", line 240, in trace_task
Jul 21 20:55:17 ores-worker-02 celery[10062]: R = retval = fun(*args, **kwargs)
Jul 21 20:55:17 ores-worker-02 celery[10062]: File "/srv/ores/venv/lib/python3.4/site-packages/celery/app/trace.py", line 438, in __protected_c
Jul 21 20:55:17 ores-worker-02 celery[10062]: return self.run(*args, **kwargs)
Jul 21 20:55:17 ores-worker-02 celery[10062]: File "/srv/ores/venv/lib/python3.4/site-packages/ores/score_processors/celery.py", line 32, in _p
Jul 21 20:55:17 ores-worker-02 celery[10062]: score = scoring_context.score(model, cache)
Jul 21 20:55:17 ores-worker-02 celery[10062]: File "/srv/ores/venv/lib/python3.4/site-packages/ores/scoring_contexts/scoring_context.py", line 
Jul 21 20:55:17 ores-worker-02 celery[10062]: feature_values = list(self.solve(model, cache))
Jul 21 20:55:17 ores-worker-02 celery[10062]: File "/srv/ores/venv/lib/python3.4/site-packages/revscoring/dependencies/functions.py", line 240,
Jul 21 20:55:17 ores-worker-02 celery[10062]: value, cache, history = _solve(dependent, context, cache)
Jul 21 20:55:17 ores-worker-02 celery[10062]: File "/srv/ores/venv/lib/python3.4/site-packages/revscoring/dependencies/functions.py", line 231,
Jul 21 20:55:17 ores-worker-02 celery[10062]: .format(dependent, e), str(e))
Jul 21 20:55:17 ores-worker-02 celery[10062]: RuntimeError: Failed to process <user.is_bot>: 'NoneType' object has no attribute 'groups'
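
A hypothetical defensive sketch (the actual datasource code may differ): guard against a missing user document before reading its groups.

def is_bot(user_doc):
    # user_doc can be None (e.g. anonymous or suppressed users), which is the
    # likely source of the "'NoneType' object has no attribute 'groups'" error.
    if user_doc is None:
        return False
    return "bot" in (user_doc.get("groups") or [])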

revscoring-requests version issue.

Even running the plain ores binary does not work:

root@ores-worker-01:/srv/ores# ./venv/bin/ores 
Traceback (most recent call last):
  File "/srv/ores/venv/lib/python3.4/site-packages/pkg_resources/__init__.py", line 639, in _build_master
    ws.require(__requires__)
  File "/srv/ores/venv/lib/python3.4/site-packages/pkg_resources/__init__.py", line 940, in require
    needed = self.resolve(parse_requirements(requirements))
  File "/srv/ores/venv/lib/python3.4/site-packages/pkg_resources/__init__.py", line 832, in resolve
    raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.ContextualVersionConflict: (requests 2.7.0 (/srv/ores/venv/lib/python3.4/site-packages), Requirement.parse('requests==2.5.3'), {'revscoring'})

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./venv/bin/ores", line 5, in <module>
    from pkg_resources import load_entry_point
  File "/srv/ores/venv/lib/python3.4/site-packages/pkg_resources/__init__.py", line 3057, in <module>
    working_set = WorkingSet._build_master()
  File "/srv/ores/venv/lib/python3.4/site-packages/pkg_resources/__init__.py", line 641, in _build_master
    return cls._build_from_requirements(__requires__)
  File "/srv/ores/venv/lib/python3.4/site-packages/pkg_resources/__init__.py", line 654, in _build_from_requirements
    dists = ws.resolve(reqs, Environment())
  File "/srv/ores/venv/lib/python3.4/site-packages/pkg_resources/__init__.py", line 832, in resolve
    raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.ContextualVersionConflict: (requests 2.7.0 (/srv/ores/venv/lib/python3.4/site-packages), Requirement.parse('requests==2.5.3'), {'revscoring'})

List of features from ORES

diff.added_badwords_ratio,
diff.added_markup_chars_ratio,
diff.added_misspellings_ratio,
diff.added_number_chars_ratio,
diff.added_symbolic_chars_ratio,
diff.added_uppercase_chars_ratio,
diff.badwords_added,
diff.badwords_removed,
diff.chars_added,
diff.chars_removed,
diff.longest_repeated_char_added,
diff.longest_token_added,
diff.markup_chars_added,
diff.markup_chars_removed,
diff.misspellings_added,
diff.misspellings_removed,
diff.numeric_chars_added,
diff.numeric_chars_removed,
diff.proportion_of_badwords_added,
diff.proportion_of_badwords_removed,
diff.proportion_of_chars_added,
diff.proportion_of_chars_removed,
diff.proportion_of_markup_chars_added,
diff.proportion_of_misspellings_added,
diff.proportion_of_misspellings_removed,
diff.proportion_of_numeric_chars_added,
diff.proportion_of_symbolic_chars_added,
diff.proportion_of_uppercase_chars_added,
diff.removed_badwords_ratio,
diff.removed_misspellings_ratio,
diff.segments_added,
diff.segments_removed,
diff.symbolic_chars_added,
diff.symbolic_chars_removed,
diff.uppercase_chars_added,
diff.uppercase_chars_removed,
diff.words_added,
diff.words_removed,
diff.bytes_changed,
diff.bytes_changed_ratio,
page.age,
page.is_mainspace,
page.is_content_namespace,
parent_revision.badwords,
parent_revision.bytes,
parent_revision.chars,
parent_revision.markup_chars,
parent_revision.misspellings,
parent_revision.numeric_chars,
parent_revision.proportion_of_badwords,
parent_revision.proportion_of_markup_chars,
parent_revision.proportion_of_misspellings,
parent_revision.proportion_of_numeric_chars,
parent_revision.proportion_of_symbolic_chars,
parent_revision.proportion_of_uppercase_chars,
parent_revision.revision_bytes,
parent_revision.seconds_since,
parent_revision.symbolic_chars,
parent_revision.uppercase_chars,
parent_revision.was_same_user,
parent_revision.words,
previous_user_revision.seconds_since,
revision.badwords,
revision.bytes,
revision.category_links,
revision.chars,
revision.cite_templates,
revision.day_of_week,
revision.has_custom_comment,
revision.has_section_comment,
revision.hour_of_day,
revision.image_links,
revision.infobox_templates,
revision.infonoise,
revision.internal_links,
revision.level_1_headings,
revision.level_2_headings,
revision.level_3_headings,
revision.level_4_headings,
revision.level_5_headings,
revision.level_6_headings,
revision.markup_chars,
revision.misspellings,
revision.numeric_chars,
revision.proportion_of_badwords,
revision.proportion_of_markup_chars,
revision.proportion_of_misspellings,
revision.proportion_of_numeric_chars,
revision.proportion_of_symbolic_chars,
revision.proportion_of_templated_references,
revision.proportion_of_uppercase_chars,
revision.ref_tags,
revision.symbolic_chars,
revision.templates,
revision.uppercase_chars,
revision.words,
user.age,
user.is_anon,
user.is_bot

All language utilities imported by 'revscoring.languages' (with proposal)

This causes a lot of trouble.

  1. A lot of stuff is loaded even if only one LanguageUtility is needed.
  2. All aspell and myspell packages must be installed for the system to work at all.

So, these language utilities should be imported on demand. This is difficult to accomplish due to the structure of the 'languages' module.

Right now, languages/__init__.py imports all of the languages. For example:

from .english import english
from .french import french
# etc.

I propose that we replace this pattern with a dynamic import module. I haven't been able to get that one to work in my python 3.4 environment yet though -- so I expect we'll need to do a bit of research.
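
A minimal sketch of what such on-demand loading could look like with importlib (module layout assumed from the description above):

import importlib

def load_language(name):
    # Import revscoring.languages.<name> only when it is first requested.
    module = importlib.import_module("revscoring.languages." + name)
    return getattr(module, name)

english = load_language("english")  # nothing else gets imported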

Is the wp10 model stable?

Can you build a weighted mean of the predicted classes and check for stability/slow progression towards quality over time? That would suggest that the weighted mean is a believable measure of sub-class quality improvements.
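
For concreteness, one way such a weighted mean could be computed (the ordinal class weights and probabilities here are illustrative):

# Assign ordinal weights to the wp10 assessment classes.
CLASS_WEIGHTS = {"Stub": 0, "Start": 1, "C": 2, "B": 3, "GA": 4, "FA": 5}

def weighted_mean(probabilities):
    return sum(CLASS_WEIGHTS[c] * p for c, p in probabilities.items())

print(weighted_mean({"Stub": 0.05, "Start": 0.15, "C": 0.4,
                     "B": 0.25, "GA": 0.1, "FA": 0.05}))  # ~2.35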

Booleans returned by numpy are not JSON serializable.

>>> from revscoring.scorers import MLScorerModel
>>> import json
>>> scorer = MLScorerModel.load(open("models/reverts.halfak_mix.trained.model", 'rb'))
>>> score_doc = next(scorer.score([[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21]]))
>>> score_doc
{'prediction': True, 'probability': {False: 0.91828775500977355, True: 0.081712244990226446}}
>>> json.dumps(score_doc)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.4/json/__init__.py", line 230, in dumps
    return _default_encoder.encode(obj)
  File "/usr/lib/python3.4/json/encoder.py", line 192, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib/python3.4/json/encoder.py", line 250, in iterencode
    return _iterencode(o, 0)
  File "/usr/lib/python3.4/json/encoder.py", line 173, in default
    raise TypeError(repr(o) + " is not JSON serializable")
TypeError: True is not JSON serializable

See http://bugs.python.org/issue18303
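
A workaround sketch: coerce numpy scalars to native Python types with a custom encoder. (Note that non-string dict keys, like the numpy booleans in the probability dict above, still need converting separately; default() only sees values.)

import json
import numpy

class NumpyJSONEncoder(json.JSONEncoder):
    def default(self, obj):
        # numpy scalar types are not subclasses of Python's bool/int,
        # so the stock encoder rejects them.
        if isinstance(obj, numpy.bool_):
            return bool(obj)
        if isinstance(obj, numpy.integer):
            return int(obj)
        if isinstance(obj, numpy.floating):
            return float(obj)
        return super().default(obj)

print(json.dumps({"prediction": numpy.bool_(True)}, cls=NumpyJSONEncoder))
# {"prediction": true}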

AttributeError: type object 'Namespace' has no attribute 'from_doc'

I just updated my local copy of the repo and got this when running the tests:

$ nosetests
..E...........................................................
======================================================================
ERROR: revscores.datasources.tests.test_namespaces.test_namespaces
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/helder/.mypyvenv/lib/python3.4/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/helder/Revision-Scoring/revscores/datasources/tests/test_namespaces.py", line 39, in test_namespaces
    nses = namespaces(fake_si_doc)
  File "/home/helder/Revision-Scoring/revscores/dependent.py", line 18, in __call__
    return self.process(*args, **kwargs)
  File "/home/helder/Revision-Scoring/revscores/datasources/namespaces.py", line 18, in process
    namespaces[ns_id] = mw.Namespace.from_doc(ns_doc, aliases=aliases)
nose.proxy.AttributeError: type object 'Namespace' has no attribute 'from_doc'
-------------------- >> begin captured logging << --------------------
revscores.dependent: DEBUG: Executing <namespaces>.
--------------------- >> end captured logging << ---------------------

----------------------------------------------------------------------
Ran 62 tests in 2.547s

FAILED (errors=1)

ImportError (cannot import name 'WikitextSplit')

(3.4) helder@std:~/projects/revscoring
$nosetests
EE....EE...
======================================================================
ERROR: Failure: ImportError (cannot import name 'WikitextSplit')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/helder/env/3.4/lib/python3.4/site-packages/nose/failure.py", line 39, in runTest
    raise self.exc_val.with_traceback(self.tb)
  File "/home/helder/env/3.4/lib/python3.4/site-packages/nose/loader.py", line 414, in loadTestsFromName
    addr.filename, addr.module)
  File "/home/helder/env/3.4/lib/python3.4/site-packages/nose/importer.py", line 47, in importFromPath
    return self.importFromDir(dir_path, fqname)
  File "/home/helder/env/3.4/lib/python3.4/site-packages/nose/importer.py", line 94, in importFromDir
    mod = load_module(part_fqname, fh, filename, desc)
  File "/usr/lib/python3.4/imp.py", line 245, in load_module
    return load_package(name, filename)
  File "/usr/lib/python3.4/imp.py", line 217, in load_package
    return methods.load()
  File "<frozen importlib._bootstrap>", line 1220, in load
  File "<frozen importlib._bootstrap>", line 1200, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 1129, in _exec
  File "<frozen importlib._bootstrap>", line 1471, in exec_module
  File "<frozen importlib._bootstrap>", line 321, in _call_with_frames_removed
  File "/home/helder/projects/revscoring/revscoring/datasources/__init__.py", line 1, in <module>
    from .contiguous_segments_added import contiguous_segments_added
  File "/home/helder/projects/revscoring/revscoring/datasources/contiguous_segments_added.py", line 4, in <module>
    from .revision_diff import revision_diff
  File "/home/helder/projects/revscoring/revscoring/datasources/revision_diff.py", line 4, in <module>
    from deltas.tokenizers import WikitextSplit
ImportError: cannot import name 'WikitextSplit'

======================================================================
ERROR: Failure: ImportError (cannot import name 'WikitextSplit')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/helder/env/3.4/lib/python3.4/site-packages/nose/failure.py", line 39, in runTest
    raise self.exc_val.with_traceback(self.tb)
  File "/home/helder/env/3.4/lib/python3.4/site-packages/nose/loader.py", line 414, in loadTestsFromName
    addr.filename, addr.module)
  File "/home/helder/env/3.4/lib/python3.4/site-packages/nose/importer.py", line 47, in importFromPath
    return self.importFromDir(dir_path, fqname)
  File "/home/helder/env/3.4/lib/python3.4/site-packages/nose/importer.py", line 94, in importFromDir
    mod = load_module(part_fqname, fh, filename, desc)
  File "/usr/lib/python3.4/imp.py", line 245, in load_module
    return load_package(name, filename)
  File "/usr/lib/python3.4/imp.py", line 217, in load_package
    return methods.load()
  File "<frozen importlib._bootstrap>", line 1220, in load
  File "<frozen importlib._bootstrap>", line 1200, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 1129, in _exec
  File "<frozen importlib._bootstrap>", line 1471, in exec_module
  File "<frozen importlib._bootstrap>", line 321, in _call_with_frames_removed
  File "/home/helder/projects/revscoring/revscoring/features/__init__.py", line 2, in <module>
    from .added_badwords_ratio import added_badwords_ratio
  File "/home/helder/projects/revscoring/revscoring/features/added_badwords_ratio.py", line 2, in <module>
    from .proportion_of_badwords_added import proportion_of_badwords_added
  File "/home/helder/projects/revscoring/revscoring/features/proportion_of_badwords_added.py", line 2, in <module>
    from .badwords_added import badwords_added
  File "/home/helder/projects/revscoring/revscoring/features/badwords_added.py", line 3, in <module>
    from ..datasources import contiguous_segments_added
  File "/home/helder/projects/revscoring/revscoring/datasources/__init__.py", line 1, in <module>
    from .contiguous_segments_added import contiguous_segments_added
  File "/home/helder/projects/revscoring/revscoring/datasources/contiguous_segments_added.py", line 4, in <module>
    from .revision_diff import revision_diff
  File "/home/helder/projects/revscoring/revscoring/datasources/revision_diff.py", line 4, in <module>
    from deltas.tokenizers import WikitextSplit
ImportError: cannot import name 'WikitextSplit'

======================================================================
ERROR: Failure: ImportError (cannot import name 'WikitextSplit')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/helder/env/3.4/lib/python3.4/site-packages/nose/failure.py", line 39, in runTest
    raise self.exc_val.with_traceback(self.tb)
  File "/home/helder/env/3.4/lib/python3.4/site-packages/nose/loader.py", line 414, in loadTestsFromName
    addr.filename, addr.module)
  File "/home/helder/env/3.4/lib/python3.4/site-packages/nose/importer.py", line 47, in importFromPath
    return self.importFromDir(dir_path, fqname)
  File "/home/helder/env/3.4/lib/python3.4/site-packages/nose/importer.py", line 94, in importFromDir
    mod = load_module(part_fqname, fh, filename, desc)
  File "/usr/lib/python3.4/imp.py", line 235, in load_module
    return load_source(name, filename, file)
  File "/usr/lib/python3.4/imp.py", line 171, in load_source
    module = methods.load()
  File "<frozen importlib._bootstrap>", line 1220, in load
  File "<frozen importlib._bootstrap>", line 1200, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 1129, in _exec
  File "<frozen importlib._bootstrap>", line 1471, in exec_module
  File "<frozen importlib._bootstrap>", line 321, in _call_with_frames_removed
  File "/home/helder/projects/revscoring/revscoring/scorers/tests/test_scorer.py", line 5, in <module>
    from ...features import badwords_added, misspellings_added
  File "/home/helder/projects/revscoring/revscoring/features/__init__.py", line 2, in <module>
    from .added_badwords_ratio import added_badwords_ratio
  File "/home/helder/projects/revscoring/revscoring/features/added_badwords_ratio.py", line 2, in <module>
    from .proportion_of_badwords_added import proportion_of_badwords_added
  File "/home/helder/projects/revscoring/revscoring/features/proportion_of_badwords_added.py", line 2, in <module>
    from .badwords_added import badwords_added
  File "/home/helder/projects/revscoring/revscoring/features/badwords_added.py", line 3, in <module>
    from ..datasources import contiguous_segments_added
  File "/home/helder/projects/revscoring/revscoring/datasources/__init__.py", line 1, in <module>
    from .contiguous_segments_added import contiguous_segments_added
  File "/home/helder/projects/revscoring/revscoring/datasources/contiguous_segments_added.py", line 4, in <module>
    from .revision_diff import revision_diff
  File "/home/helder/projects/revscoring/revscoring/datasources/revision_diff.py", line 4, in <module>
    from deltas.tokenizers import WikitextSplit
ImportError: cannot import name 'WikitextSplit'

======================================================================
ERROR: Failure: ImportError (cannot import name 'WikitextSplit')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/helder/env/3.4/lib/python3.4/site-packages/nose/failure.py", line 39, in runTest
    raise self.exc_val.with_traceback(self.tb)
  File "/home/helder/env/3.4/lib/python3.4/site-packages/nose/loader.py", line 414, in loadTestsFromName
    addr.filename, addr.module)
  File "/home/helder/env/3.4/lib/python3.4/site-packages/nose/importer.py", line 47, in importFromPath
    return self.importFromDir(dir_path, fqname)
  File "/home/helder/env/3.4/lib/python3.4/site-packages/nose/importer.py", line 94, in importFromDir
    mod = load_module(part_fqname, fh, filename, desc)
  File "/usr/lib/python3.4/imp.py", line 235, in load_module
    return load_source(name, filename, file)
  File "/usr/lib/python3.4/imp.py", line 171, in load_source
    module = methods.load()
  File "<frozen importlib._bootstrap>", line 1220, in load
  File "<frozen importlib._bootstrap>", line 1200, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 1129, in _exec
  File "<frozen importlib._bootstrap>", line 1471, in exec_module
  File "<frozen importlib._bootstrap>", line 321, in _call_with_frames_removed
  File "/home/helder/projects/revscoring/revscoring/scorers/tests/test_svc.py", line 8, in <module>
    from ...features import Feature
  File "/home/helder/projects/revscoring/revscoring/features/__init__.py", line 2, in <module>
    from .added_badwords_ratio import added_badwords_ratio
  File "/home/helder/projects/revscoring/revscoring/features/added_badwords_ratio.py", line 2, in <module>
    from .proportion_of_badwords_added import proportion_of_badwords_added
  File "/home/helder/projects/revscoring/revscoring/features/proportion_of_badwords_added.py", line 2, in <module>
    from .badwords_added import badwords_added
  File "/home/helder/projects/revscoring/revscoring/features/badwords_added.py", line 3, in <module>
    from ..datasources import contiguous_segments_added
  File "/home/helder/projects/revscoring/revscoring/datasources/__init__.py", line 1, in <module>
    from .contiguous_segments_added import contiguous_segments_added
  File "/home/helder/projects/revscoring/revscoring/datasources/contiguous_segments_added.py", line 4, in <module>
    from .revision_diff import revision_diff
  File "/home/helder/projects/revscoring/revscoring/datasources/revision_diff.py", line 4, in <module>
    from deltas.tokenizers import WikitextSplit
ImportError: cannot import name 'WikitextSplit'

----------------------------------------------------------------------
Ran 11 tests in 2.422s

FAILED (errors=4)

Add documentation for features

Right now, finding out which features are implemented and what they return is very difficult. Add documentation for features and get it up on pythonhosted.org.

Allow scoring the diff between two arbitrary revisions

A vandal may edit the same page many times, with each edit having a low probability of being reverted, while the whole set of edits, viewed as a single edit (e.g. via the enhanced recent changes preference), would have a high probability of being reverted.

In order to have predictions for these sequential edits, it seems necessary to be able to score a revision by comparing it with an older revision than the previous (parent) revision.

E.g.: I want to know the probability of this diff being reverted:
https://pt.wikipedia.org/w/index.php?diff=42204427&oldid=42203059
instead of the probabilities of each of the intermediate diffs for that page.

Numpy install required before scipy dep. can be built

For some reason, scipy doesn't express a dependency on numpy -- yet it will not install without numpy being installed first.

Either document that this is required or figure out how to upstream a fix to scipy (and close this as invalid).
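
In the meantime, the workaround is simply to install numpy before scipy:

pip install numpy
pip install scipy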

revscoring.dependent.DependencyError: Failed to process <is_stopword>

Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 102, in _solve
    value = dependent(*args)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 25, in __call__
    return self.process(*args, **kwargs)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/languages/language.py", line 22, in not_implemented_processor
    raise NotImplementedError()
NotImplementedError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/utilities/extract_features.py", line 89, in run
    for v in list(values) + [label]))
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 39, in solve_many
    value, cache, history = _solve(dependent, cache)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 96, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 96, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 96, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 105, in _solve
    .format(dependent, e), e)
revscoring.dependent.DependencyError: Failed to process <is_stopword>:

`RevisionDocumentNotFound` when scoring new pages

It seems revscoring is not able to score new pages:
http://ores.wmflabs.org/scores/ptwiki/?models=reverted&revids=42835908|40200979

{
  "40200979": {
    "reverted": {
      "error": {
        "message": "Failed to process <parent_revision.metadata>: RevisionDocumentNotFound",
        "type": "<class 'revscoring.dependencies.errors.DependencyError'>"
      }
    }
  },
  "42835908": {
    "reverted": {
      "error": {
        "message": "Failed to process <parent_revision.metadata>: RevisionDocumentNotFound",
        "type": "<class 'revscoring.dependencies.errors.DependencyError'>"
      }
    }
  }
}

Content replacement features -- use removed content to inform measures of added content in diffs

revscoring and AbuseFilter (and other tools) make it easy to catch vandalism that matches some "bad regex"/badword list. However, the existing tools have no ability to identify word replacements.
E.g. "Barack Obama is president" => "Barack Obama is terrorist". While "terrorist" is not a bad word on its own, replacing some other word with "terrorist" is most probably bad.

While an "alignment" between words in the previous and the new revisions isn't always obvious, the tool can make use of one where it exists.
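
A small sketch of finding replacements from an alignment, using the standard library's difflib over whitespace tokens (a simplification of whatever tokenizer revscoring would actually use):

from difflib import SequenceMatcher

old = "Barack Obama is president".split()
new = "Barack Obama is terrorist".split()

# get_opcodes() yields an alignment; 'replace' spans are word substitutions.
for op, i1, i2, j1, j2 in SequenceMatcher(None, old, new).get_opcodes():
    if op == "replace":
        print(old[i1:i2], "->", new[j1:j2])  # ['president'] -> ['terrorist']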

Bug from refactoring: max() arg is an empty sequence

....Traceback (most recent call last):
  File "/home/eva/Github/Objective-Revision-Evaluation-Service/ores/features_reverted.py", line 94, in run
    print('\t'.join(str(v) for v in (list(values) + [reverted])))
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 34, in solve_many
    value, cache, history = _solve(dependent, cache)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 91, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 91, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 95, in _solve
    value = dependent(*values)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/features/feature.py", line 31, in __call__
    value = super().__call__(*args, **kwargs)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 20, in __call__
    return self.process(*args, **kwargs)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/features/diff.py", line 137, in process_longest_repeated_char_added
    for segment in diff_added_segments
ValueError: max() arg is an empty sequence

Can't pickle languages

$ python
Python 3.4.0 (default, Apr 11 2014, 13:05:11) 
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from revscoring.languages import english
>>> import pickle
>>> foo = pickle.dumps(english)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
_pickle.PicklingError: Can't pickle <function is_misspelled_process at 0x7f85a844f378>: attribute lookup is_misspelled_process on revscoring.languages.english failed

PEP 8 issues in revscoring/languages/persian.py

I'd just solve these issues myself, but it's really hard to work in LTR and know that I'm not breaking any of the regexes.

$ flake8 revscoring
revscoring/languages/persian.py:117:80: E501 line too long (88 > 79 characters)
revscoring/languages/persian.py:118:80: E501 line too long (88 > 79 characters)
revscoring/languages/persian.py:118:88: E225 missing whitespace around operator
revscoring/languages/persian.py:119:80: E501 line too long (85 > 79 characters)
revscoring/languages/persian.py:120:80: E501 line too long (86 > 79 characters)
revscoring/languages/persian.py:120:86: E225 missing whitespace around operator
revscoring/languages/persian.py:123:80: E501 line too long (80 > 79 characters)
revscoring/languages/persian.py:145:80: E501 line too long (80 > 79 characters)

turkish.py uses STEMMER before defining it

Since a32214a I'm getting this:

(3.4) helder@std:~/projects/revscoring$
nosetests
.............................................................E...
======================================================================
ERROR: Failure: NameError (name 'STEMMER' is not defined)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/helder/env/3.4/lib/python3.4/site-packages/nose/failure.py", line 39, in runTest
    raise self.exc_val.with_traceback(self.tb)
  File "/home/helder/env/3.4/lib/python3.4/site-packages/nose/loader.py", line 414, in loadTestsFromName
    addr.filename, addr.module)
  File "/home/helder/env/3.4/lib/python3.4/site-packages/nose/importer.py", line 47, in importFromPath
    return self.importFromDir(dir_path, fqname)
  File "/home/helder/env/3.4/lib/python3.4/site-packages/nose/importer.py", line 94, in importFromDir
    mod = load_module(part_fqname, fh, filename, desc)
  File "/usr/lib/python3.4/imp.py", line 235, in load_module
    return load_source(name, filename, file)
  File "/usr/lib/python3.4/imp.py", line 171, in load_source
    module = methods.load()
  File "<frozen importlib._bootstrap>", line 1220, in load
  File "<frozen importlib._bootstrap>", line 1200, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 1129, in _exec
  File "<frozen importlib._bootstrap>", line 1471, in exec_module
  File "<frozen importlib._bootstrap>", line 321, in _call_with_frames_removed
  File "/home/helder/projects/revscoring/revscoring/languages/tests/test_turkish.py", line 3, in <module>
    from ..turkish import turkish
  File "/home/helder/projects/revscoring/revscoring/languages/turkish.py", line 56, in <module>
    "yarrak"
  File "/home/helder/projects/revscoring/revscoring/languages/turkish.py", line 10, in <genexpr>
    BADWORDS = set(STEMMER.stem(w) for w in [
NameError: name 'STEMMER' is not defined

----------------------------------------------------------------------
Ran 65 tests in 19.920s

FAILED (errors=1)

Proposal: Multi-lingual feature sets

Right now, a feature extraction is limited to the use of a single language. For example, revscoring.features.diff.badwords_added depends on the language utility languages.is_badword. As a result, a feature list can only have a count of "badwords_added" as identified by one "language". The result is that we have a lot of mixture in our badwords sets, and we're not poised to support multi-lingual wikis like Commons and Wikidata.

I propose that we convert the concept of a languages from a context (in which feature extraction happens) to a feature set with the necessary context baked in. This would mean that we can use multiple language features in parallel. E.g.

badwords = [
    revision.bytes,
    diff.bytes_changed,
    english.diff.badwords_added,
    portuguese.diff.badwords_added,
    persian.diff.badwords_added,
    ...
]

This would also mean that we wouldn't need to associate a revscoring.languages.Language with a model -- just the set of features that were used to build the model. That would substantially reduce the complication and potential mistakes involved in generating and using model files.

Math domain error when processing imported revisions (user.age)

Error when processing rev_id 408030634 in enwiki. It looks like the revision is an import with a very old timestamp.

RuntimeError('Failed to process <log((user.age + 1))>: math domain error',)
Traceback (most recent call last):
  File "/home/halfak/env/3.4/lib/python3.4/site-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/home/halfak/env/3.4/lib/python3.4/site-packages/celery/app/trace.py", line 438, in __protected_call__
    return self.run(*args, **kwargs)
  File "/home/halfak/env/3.4/lib/python3.4/site-packages/ores-0.2.0-py3.4.egg/ores/score_processors/celery.py", line 33, in _process
  File "/home/halfak/env/3.4/lib/python3.4/site-packages/ores-0.2.0-py3.4.egg/ores/scoring_contexts/scoring_context.py", line 46, in score
  File "/home/halfak/env/3.4/lib/python3.4/site-packages/revscoring-0.4.0-py3.4.egg/revscoring/dependencies/functions.py", line 240, in _solve_many
  File "/home/halfak/env/3.4/lib/python3.4/site-packages/revscoring-0.4.0-py3.4.egg/revscoring/dependencies/functions.py", line 231, in _solve
RuntimeError: Failed to process <log((user.age + 1))>: math domain error

See the original report: https://github.com/wiki-ai/ores/issues/60
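
A hypothetical sketch of the obvious guard (the real feature code may differ): clamp the age at zero so imported revisions with timestamps predating the user's registration can't hand log() a value below 1.

import math

def log_age(registration_ts, revision_ts):
    # revision_ts - registration_ts can be negative for imported revisions.
    age = max(0, revision_ts - registration_ts)
    return math.log(age + 1)

print(log_age(1_400_000_000, 1_300_000_000))  # 0.0 instead of a math domain error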

Make all datasources/features (dependencies generally) and their values JSONable

Right now, many datasources return values that cannot be encoded in JSON.

This is a bummer because it would be better if we could use the JSON serializer within ORES's celery.

This is the error we get when trying to use the JSON serializer within ORES for non-JSON serializable datasources:

3784623 HTTP/1.1" 500 -
Traceback (most recent call last):
  File "/home/halfak/env/3.4/lib/python3.4/site-packages/flask/app.py", line 1836, in __call__
    return self.wsgi_app(environ, start_response)
  File "/home/halfak/env/3.4/lib/python3.4/site-packages/flask/app.py", line 1820, in wsgi_app
    response = self.make_response(self.handle_exception(e))
  File "/home/halfak/env/3.4/lib/python3.4/site-packages/flask/app.py", line 1403, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "/home/halfak/env/3.4/lib/python3.4/site-packages/flask/_compat.py", line 33, in reraise
    raise value
  File "/home/halfak/env/3.4/lib/python3.4/site-packages/flask/app.py", line 1817, in wsgi_app
    response = self.full_dispatch_request()
  File "/home/halfak/env/3.4/lib/python3.4/site-packages/flask/app.py", line 1477, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/home/halfak/env/3.4/lib/python3.4/site-packages/flask/app.py", line 1381, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/home/halfak/env/3.4/lib/python3.4/site-packages/flask/_compat.py", line 33, in reraise
    raise value
  File "/home/halfak/env/3.4/lib/python3.4/site-packages/flask/app.py", line 1475, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/halfak/env/3.4/lib/python3.4/site-packages/flask/app.py", line 1461, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/home/halfak/projects/ores/ores/wsgi/routes/scores.py", line 102, in score_revisions
    precache=precache)
  File "/home/halfak/projects/ores/ores/score_processors/score_processor.py", line 25, in score
    scores = self._score(context, model, rev_ids, caches=caches)
  File "/home/halfak/projects/ores/ores/score_processors/celery.py", line 146, in _score
    caches=caches))
  File "/home/halfak/projects/ores/ores/score_processors/celery.py", line 97, in _score_in_celery
    task_id=id_string
  File "/home/halfak/env/3.4/lib/python3.4/site-packages/celery/app/task.py", line 559, in apply_async
    **dict(self._get_exec_options(), **options)
  File "/home/halfak/env/3.4/lib/python3.4/site-packages/celery/app/base.py", line 353, in send_task
    reply_to=reply_to or self.oid, **options
  File "/home/halfak/env/3.4/lib/python3.4/site-packages/celery/app/amqp.py", line 305, in publish_task
    **kwargs
  File "/home/halfak/env/3.4/lib/python3.4/site-packages/kombu/messaging.py", line 165, in publish
    compression, headers)
  File "/home/halfak/env/3.4/lib/python3.4/site-packages/kombu/messaging.py", line 241, in _prepare
    body) = dumps(body, serializer=serializer)
  File "/home/halfak/env/3.4/lib/python3.4/site-packages/kombu/serialization.py", line 164, in dumps
    payload = encoder(data)
  File "/usr/lib/python3.4/contextlib.py", line 77, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/halfak/env/3.4/lib/python3.4/site-packages/kombu/serialization.py", line 59, in _reraise_errors
    reraise(wrapper, wrapper(exc), sys.exc_info()[2])
  File "/home/halfak/env/3.4/lib/python3.4/site-packages/kombu/five.py", line 132, in reraise
    raise value.with_traceback(tb)
  File "/home/halfak/env/3.4/lib/python3.4/site-packages/kombu/serialization.py", line 55, in _reraise_errors
    yield
  File "/home/halfak/env/3.4/lib/python3.4/site-packages/kombu/serialization.py", line 164, in dumps
    payload = encoder(data)
  File "/home/halfak/env/3.4/lib/python3.4/site-packages/anyjson/__init__.py", line 141, in dumps
    return implementation.dumps(value)
  File "/home/halfak/env/3.4/lib/python3.4/site-packages/anyjson/__init__.py", line 89, in dumps
    raise TypeError(TypeError(*exc.args)).with_traceback(sys.exc_info()[2])
  File "/home/halfak/env/3.4/lib/python3.4/site-packages/anyjson/__init__.py", line 87, in dumps
    return self._encode(data)
  File "/usr/lib/python3.4/json/__init__.py", line 230, in dumps
    return _default_encoder.encode(obj)
  File "/usr/lib/python3.4/json/encoder.py", line 192, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib/python3.4/json/encoder.py", line 250, in iterencode
    return _iterencode(o, 0)
kombu.exceptions.EncodeError: keys must be a string
