fako / datascope
Data Scope - a framework for data mashups
License: GNU General Public License v3.0
See how the EU looks in 2017 regarding the 6 standard topics.
To manage database load, the MediaWiki API supports a special parameter (maxlag). DataScope should use this parameter so that it does not block normal users when Wikipedia is under high load.
Implementation details can be found here: https://www.mediawiki.org/wiki/Manual:Maxlag_parameter
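A minimal sketch of how the maxlag parameter could be attached to API requests, per the manual page above. The function and error-check names are assumptions for illustration; only the `maxlag` parameter itself and the `"maxlag"` error code come from the MediaWiki documentation.

```python
def with_maxlag(params, maxlag=5):
    """Return a copy of API params with maxlag set, so the API can
    reject our request when replication lag exceeds `maxlag` seconds."""
    lagged = dict(params)
    lagged["maxlag"] = maxlag
    return lagged


def is_maxlag_error(response_json):
    """Detect the maxlag rejection, which arrives as error code 'maxlag'.
    On such a rejection the client should back off and retry later."""
    return response_json.get("error", {}).get("code") == "maxlag"
```

With this in place, DataScope would retry lagged requests after a pause instead of piling load onto a struggling replica.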
Contributions are now generated by a generator method on Growth. Somewhere this generator seems to cause a 300% memory increase. Preparing contributions will become more problematic as more and more types of contributions are needed. An example is contributions from the ShellResource, which almost certainly needs to contribute per newline instead of as a single JSON structure.
To make contributions future-proof we need to write a contributions generator class that generates contributions from a given QuerySet. The QuerySet will come from a ResourceProcessor. The QuerySet should be iterated over, and resources should be extracted correctly using ExtractorProcessor (resulting in generators). These generators are then emptied through the ContributionsGenerator.
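The flow above could be sketched as follows. The class name comes from the issue; the constructor signature and the `split_lines` extractor (standing in for per-newline ShellResource output) are assumptions.

```python
class ContributionsGenerator:
    """Empties extractor generators lazily, one resource at a time,
    so memory stays flat regardless of QuerySet size."""

    def __init__(self, queryset, extract):
        self.queryset = queryset  # typically produced by a ResourceProcessor
        self.extract = extract    # e.g. an ExtractorProcessor method returning a generator

    def __iter__(self):
        for resource in self.queryset:
            # each extractor yields contributions one by one
            yield from self.extract(resource)


def split_lines(resource):
    """Hypothetical extractor for ShellResource-like output:
    contribute per non-empty newline instead of as one JSON structure."""
    for line in resource.splitlines():
        if line.strip():
            yield line
```

A consumer simply iterates the ContributionsGenerator; nothing is materialised as a list.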
The first step the community should take is to gather candidate opinion pages. In the first version this can be done through the Custom Search API.
Add a command that modifies the individuals in a collective to include a written tag. The command should keep a list of tags in memory. Tagging happens on the basis of a single property. Identical properties receive identical tags. Tags are stored in the individual and are not persistent across multiple growths. "unknown" should always be available as a tag.
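A minimal sketch of the command's core logic, under the rules stated above. The function name, the `tag-N` naming scheme, and the dict-shaped individuals are assumptions; the in-memory tag list, the single-property basis, identical tags for identical values, and the "unknown" fallback come from the issue.

```python
def tag_individuals(individuals, prop):
    """Tag each individual based on one property. Identical property
    values receive identical tags; individuals missing the property
    get the always-available "unknown" tag."""
    tags = {}  # kept in memory only for the duration of the command
    for individual in individuals:
        value = individual.get(prop)
        if value is None:
            individual["tag"] = "unknown"
            continue
        # reuse the tag already handed out for this value, or mint a new one
        tags.setdefault(value, "tag-{}".format(len(tags)))
        individual["tag"] = tags[value]
    return individuals
```

Because tags are written into the individuals themselves, they are naturally not persistent across growths.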
To be able to invert the boolean logic of a module (for instance: make all women unimportant) it is necessary to be able to set weights like 0.5, in order to halve a page's points when the module matches (e.g. when a page is about a woman).
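One possible shape of that weighting, sketched with an assumed interface (the real module signature may differ):

```python
def apply_weight(matches, points, weight=1.0):
    """Scale a page's points by the module weight when the module matches.
    A weight of 0.5 halves the points, turning a boolean 'boost' module
    into a de-emphasising one."""
    return points * weight if matches else points
```

With weight 1.0 the module behaves as before; with 0.5 matched pages sink in the ranking.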
This is the list of tests that should at least have a clear specification in the code:
In order to support a range of business models some details should remain hidden, while the core and some projects like wiki_feed, open_data and nautilus can remain open source.
In order to test the system with real users we'll add a few modules that people find interesting, so that they start using the system for their own purposes.
Currently the task runs out of memory on ranking tasks that involve many modules. There are three possible optimizations.
Instead of making users wait, WikiFeed should inform users by creating an edit on their talk pages.
This resource should connect to an email box where "public" emails are sent, meaning newsletters and advertisement emails. The resource can be used to process these emails and build custom digest communities.
A popular metric is pageview data. We should add this to page Individuals.
The following example explanations should link to relevant WikiData entries and filter definitions.
Paradise Papers
Document leak related to offshore investments

This is a scandal.                        100  (is_scandal filter)
This involves 25 politicians.              50  (politicians filter)
This happened in London.                   20  (location filter)
This happened yesterday.                   18  (how_recent filter)
This article has more than 3 sources.       3  (citation filter)
This article was created by 43 editors.     1  (editors filter)

WikiFeed Score                            192
When we become a bot we can speed up data aggregation by about 10x.
A service can now return an API response or HTML. It would be interesting if services could also output CSV files. The first implementation would be the Locafora alphabetical list.
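A sketch of what the CSV output path could look like, assuming the service hands over rows as dicts; the function name and the sort-by-first-column behaviour (to produce an alphabetical list) are assumptions.

```python
import csv
import io


def to_csv(rows, fieldnames):
    """Render a result set as CSV text, sorted alphabetically on the
    first field, e.g. for the Locafora alphabetical list."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    for row in sorted(rows, key=lambda r: r[fieldnames[0]]):
        writer.writerow(row)
    return buffer.getvalue()
```

The returned string can then be served with a text/csv content type alongside the existing API and HTML responses.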
At least the following packages need to be upgraded:
Add Twitter handles and Wikipedia pages if available. It would be interesting if related topics could be derived as well, from articles and the Twitter feed.
Include regional newspapers.
The following should be clear from the code:
Things that should be done:
The current implementations, especially for the Wikidata modules, are a bit messy. We should inject the Wikidata separately into the modules. This way the modules can assume that the Wikidata is sound. We can also use next() to write the search through the claims a little more cleanly.
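The next()-based claim search could look like this. The `claims` dict shape mirrors Wikidata's claims-keyed-by-property layout; the function name and the injected-dict interface are assumptions.

```python
def first_claim(wikidata, prop, default=None):
    """Return the first claim for a property, or a default.
    Because the Wikidata dict is injected and assumed sound, no
    defensive parsing is needed here; next() replaces a manual loop."""
    return next(iter(wikidata.get("claims", {}).get(prop, [])), default)
```

A module then reads `first_claim(wikidata, "P31")` instead of looping over the claims structure itself.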
Make sure that it is OK to return a float from a module. This allows inversions like 1/x, which will for instance surface the least edited pages.
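A small sketch of why float returns matter, with an assumed module shape (the real module interface may differ):

```python
def edits_module(page):
    """Hypothetical module: score a page by its number of edits."""
    return float(page.get("edits", 0))


def invert(score):
    """1/score flips the ranking, so the least edited pages score
    highest; zero scores map to zero to avoid division errors."""
    return 1.0 / score if score else 0.0
```

With integer-only returns, 1/x would truncate to 0 under integer division; floats keep the inverted ranking meaningful.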
When people want to place a feed on their user page they should transclude the User:Wiki Feed Bot/include-feed page.
Once a day WikiFeed will do the following:
Part of the template page should be a link that refreshes the feed. This allows users to make changes to the templates and see the results. Steps 4 and 5 should be executed before updates triggered in this way take place.
Any maliciously placed feeds should be reported on the talk page of the user. This gives the user the opportunity to remove the template and/or report abuse.
Any problems that occur during parsing of the template should be reported on the feed page, provided the feed was approved by the user.
Currently the tasks related to resource processors are bound to the processor classes. This is not future-proof, as Celery will no longer support methods as tasks. The tasks also do not work well with requests.Session objects. Some rewriting needs to happen to facilitate all use cases and be ready for Celery 4.
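A sketch of the direction the rewrite could take: module-level tasks that receive only serialisable arguments and rebuild any session inside the task. In the real project the decorator would be celery's `shared_task`; a no-op stand-in keeps this sketch self-contained, and the task name and argument shapes are assumptions.

```python
def shared_task(func):
    """Stand-in for celery.shared_task, so the sketch runs standalone."""
    return func


@shared_task
def process_resource(resource_id, session_kwargs=None):
    """Module-level task, compatible with Celery 4's function-only tasks.
    requests.Session objects are not serialisable task arguments, so the
    session would be rebuilt here from plain kwargs instead of passed in."""
    session_kwargs = session_kwargs or {}
    # ...look up the resource by id and process it with a fresh session...
    return {"resource": resource_id, "session": session_kwargs}
```

Processor classes would then enqueue `process_resource.delay(resource.id, ...)` rather than binding the task to one of their own methods.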
Instead of pushing data from Datascope to a wiki, the wikis should pull from Datascope. This isolates the two much better and makes a compromise of Datascope less problematic.
Instead of k-means, the future fashion prediction should use Naive Bayes (NB), as the classes are known from the start. This should simplify things a lot.
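To illustrate why known classes simplify things, here is a minimal multinomial Naive Bayes written from scratch; in practice something like sklearn's naive_bayes module would be used. The token-list feature shape and class names are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Minimal multinomial NB: supervised, so it uses the known class
    labels directly instead of discovering clusters like k-means."""

    def fit(self, samples, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for tokens, label in zip(samples, labels):
            self.token_counts[label].update(tokens)
        return self

    def predict(self, tokens):
        total_samples = sum(self.class_counts.values())

        def log_prob(label):
            counts = self.token_counts[label]
            total = sum(counts.values())
            prior = math.log(self.class_counts[label] / total_samples)
            # Laplace smoothing so unseen tokens don't zero the probability
            return prior + sum(
                math.log((counts[t] + 1) / (total + len(counts))) for t in tokens
            )

        return max(self.class_counts, key=log_prob)
```

Training is a single counting pass and prediction is a closed-form argmax, with no iterative cluster assignment to tune.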
Take a bunch of Tweets originating from Rotterdam and collect them in a Community.
Currently Tools is killing the task, probably because the task uses too much memory. We should try giving the task more memory (this will lower its priority on the task queue) and see if we can make the growth of a project iterator-based instead of list-based, which would have tremendous benefits for memory usage.
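The iterator-based idea in a nutshell (function names are illustrative): the list-based version materialises every result before anything downstream runs, while the generator version holds one item at a time.

```python
def grow(resources, process):
    """Iterator-based growth: yields processed results lazily, so peak
    memory is one item instead of the whole result list."""
    for resource in resources:  # resources can itself be a generator
        yield process(resource)

# list-based equivalent (current approach, memory-heavy):
#   results = [process(r) for r in resources]
```

Downstream consumers iterate the generator directly, and only call list() if they genuinely need everything at once.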
To safely allow users to create modules we need to sandbox the Python environment of Datascope to make sure that no code is running that we don't want to.
Here's an article that talks about such venv sandboxes: https://www.logilab.org/blogentry/22498#virtualenvwrapper
When creating the header for the page, instead of printing the header into the page (which is a security risk), we should print the current page variable. This approach is also less error-prone for editors who want to implement a feed.
Currently no user agent is set for requests to the API, but this should be a custom agent. See details: https://www.mediawiki.org/wiki/API:Etiquette#User-Agent_header
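Following the etiquette page, the agent string should identify the tool and give a contact address. The version number and contact address below are placeholders, and the helper name is an assumption.

```python
# Placeholder values: real version and contact address to be filled in.
USER_AGENT = "DataScope/0.1 (https://github.com/fako/datascope; contact@example.com)"


def api_headers(extra=None):
    """Build request headers with the custom User-Agent, merging in any
    request-specific extras."""
    headers = {"User-Agent": USER_AGENT}
    headers.update(extra or {})
    return headers
```

Every API call would then pass `headers=api_headers()` instead of relying on the HTTP library's default agent.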
On the left you see the data. On the right you see a list of tags. Tags can be added or selected. Every post brings you to the next Individual. Perhaps a "characteristic" model should be created to facilitate permanent storage of tags and to apply tags to similar content automatically. "unknown" should always be available as a tag.
To debug a bit better it makes sense that a module returns its weight and its value separately. It could also return how often it rejected a page altogether, to get a sense of its reliability.
Make sure that the server can only be reached from a VPN.