datascope's People

Contributors

dan-kwiat, denisexifaras, dependabot[bot], edsaperia, fako, janbaykara, peymanity

datascope's Issues

ContributionsGenerator class

Contributions are currently generated by a generator method on Growth. Somewhere this generator seems to cause a 300% memory increase. Preparing contributions will also become more problematic as more and more different types of contributions are needed. An example is contributions from the ShellResource, which almost certainly need to be made per newline instead of as a JSON structure.

To make contributions future-proof we need to write a contributions generator class that generates contributions based on a given QuerySet. The QuerySet will come from a ResourceProcessor. The QuerySet should be iterated over and resources should be extracted correctly using ExtractorProcessor (resulting in generators). These generators are then emptied through the ContributionsGenerator.
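A minimal sketch of what such a class could look like, in plain Python without Django. The constructor arguments are assumptions for illustration: `resources` stands in for the QuerySet coming from a ResourceProcessor, and `extract` stands in for ExtractorProcessor. `extract_lines` is a hypothetical extractor that contributes per newline, as suggested for the ShellResource.

```python
from typing import Any, Callable, Iterable, Iterator


class ContributionsGenerator:
    """Lazily yields contributions for every resource in an iterable.

    `resources` and `extract` are stand-ins (assumptions) for the QuerySet
    and the ExtractorProcessor mentioned in this issue.
    """

    def __init__(self, resources: Iterable[Any],
                 extract: Callable[[Any], Iterator[Any]]) -> None:
        self.resources = resources
        self.extract = extract

    def __iter__(self) -> Iterator[Any]:
        for resource in self.resources:
            # The extractor generator for one resource is emptied before the
            # next resource is touched, so nothing is collected in a list.
            yield from self.extract(resource)


def extract_lines(resource: str) -> Iterator[str]:
    """Hypothetical extractor: one contribution per non-empty line."""
    for line in resource.splitlines():
        if line:
            yield line


contributions = ContributionsGenerator(["a\nb", "c\n"], extract_lines)
```

With a real Django QuerySet, passing `queryset.iterator()` as `resources` should additionally avoid the QuerySet result cache, which may help with the memory increase described above.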

Add management command to supervise learning

Add a command that modifies the individuals in a collective to include a written tag. The command should keep a list of tags in memory. Tagging happens on the basis of a single property. Identical properties receive identical tags. Tags are stored in the individual and are not persistent across multiple growths. "unknown" should always be available as a tag.
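The tagging rules can be sketched as follows. This is not the management command itself, only the core loop under some assumptions: individuals are plain dicts, the tag is stored under a hypothetical `"tag"` key, property values are hashable, and `ask` is a hypothetical callback that prompts the supervisor for a tag.

```python
from typing import Callable, Dict, Iterable, List

UNKNOWN = "unknown"


def supervise(individuals: Iterable[dict], key: str,
              ask: Callable[[object, List[str]], str]) -> List[str]:
    """Tag individuals in place based on the single property `key`.

    `ask` receives the property value and the tags known so far and returns
    the tag written by the supervisor (an empty answer means "unknown").
    """
    tags: List[str] = [UNKNOWN]        # "unknown" is always available
    by_value: Dict[object, str] = {}   # kept in memory only, never persisted
    for individual in individuals:
        value = individual.get(key)
        if value not in by_value:
            tag = ask(value, tags) or UNKNOWN
            by_value[value] = tag
            if tag not in tags:
                tags.append(tag)
        # Identical property values receive identical tags.
        individual["tag"] = by_value[value]
    return tags
```

Because `by_value` lives only inside the function call, tags do not carry over to a later growth, which matches the requirement above.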

Allow floats for module weights

To invert the boolean logic of a module (for instance, to make all women unimportant), it must be possible to set weights like 0.5, which halves the points if a page is about a woman.

Add all tests

This is the list of tests that should at least have a clear specification in the code:

  • community service view
  • community HTML view
  • setup_growth input sets identifier
  • update_by_key for dicts and lists
  • manifestation callbacks
  • manifestation model
  • manifestation processor
  • manifest config injection
  • HTTP resource edge cases
  • load session
  • inject a session provider processor
  • send HTTP series GET and POST
  • private resource processor
  • individual update and validate
  • individual output_from_content (refactor of content test)
  • get_growth
  • sample mode
  • delete_manifestations_by_signature
  • utils
  • lexicon parser
  • image grid with panorama images at border

Create a bunch of interesting modules

In order to test the system with real users we'll add a few modules that people find interesting in order for them to start using it for their own purposes.

Make the rank processor more efficient in memory use

Currently the task runs out of memory with ranking tasks that involve many modules. There are three ways to optimize this.

  1. Make the rank processor use iterators instead of lists to avoid loading complete lists into memory
  2. Sort in batches of 1,000 and take the first 20 of each batch to sort further. For 40,000 individuals that means at most 40 * 20 = 800 individuals are carried forward in memory, an improvement of 50x :)
  3. Instead of copying individuals and keeping the reference around we should hash the individual and re-hash when it comes back from the module. If hashes differ we should probably emit a warning.
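Options 1 and 2 can be sketched together with the standard library. The function names, the `score` callable, and the dict-shaped individuals are assumptions for illustration and are not taken from the rank processor. The result is still exact, because any individual in the overall top 20 is necessarily in the top 20 of its own batch.

```python
import heapq
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batches(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of at most `size` items without loading everything."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def rank(individuals: Iterable[T], score: Callable[[T], float],
         batch_size: int = 1000, keep: int = 20) -> List[T]:
    """Return the `keep` highest scoring individuals, best first."""
    finalists: List[T] = []
    for batch in batches(individuals, batch_size):
        # Only the best `keep` of each batch survive; the rest of the batch
        # can be garbage collected before the next batch is read.
        finalists.extend(heapq.nlargest(keep, batch, key=score))
    return heapq.nlargest(keep, finalists, key=score)
```

At any moment this holds one batch plus the finalists collected so far, so for 40,000 individuals the finalists list peaks at the 800 mentioned above. Passing a generator or iterator as `individuals` keeps the input from being materialized as a list.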

Implement PublicEmailResource

This resource should connect to an email box where "public" emails are sent, meaning newsletters and advertisement emails. The resource can be used to process these emails and build custom digest communities.

Add pageview data

A popular metric is pageview data. We should add this to page Individuals.

Nice Filter Explanations

The following example explanations should link to relevant WikiData entries and filter definitions.

Paradise Papers
Documents leak related to offshore investment

Explanation                                  Points  Filter
This is a scandal.                              100  is_scandal filter
This involves 25 politicians.                    50  politicians filter
This happened in London.                         20  location filter
This happened yesterday.                         18  how_recent filter
This article has more than 3 sources.             3  citation filter
This article was created by 43 editors.           1  editors filter

WikiFeed Score 192
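In this example the WikiFeed Score is the sum of the points contributed by each filter. A small sketch of that arithmetic follows; the `Explanation` type is hypothetical and the filter names are simply the ones used in the example above.

```python
from typing import List, NamedTuple


class Explanation(NamedTuple):
    text: str
    points: int
    filter_name: str


explanations: List[Explanation] = [
    Explanation("This is a scandal.", 100, "is_scandal"),
    Explanation("This involves 25 politicians.", 50, "politicians"),
    Explanation("This happened in London.", 20, "location"),
    Explanation("This happened yesterday.", 18, "how_recent"),
    Explanation("This article has more than 3 sources.", 3, "citation"),
    Explanation("This article was created by 43 editors.", 1, "editors"),
]

# 100 + 50 + 20 + 18 + 3 + 1 = 192
score = sum(explanation.points for explanation in explanations)
```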

Create website to promote feeds

  • An illustrative example feed on the Homepage
  • Possible to switch to some other example feeds
  • About, team, contact, etc
  • Make a Feed / Filter Selector
  • Feed Pages

Become a bot

When we become a bot we can speed up data aggregation about 10x.

Allow a service to output CSV

A service can currently return an API response or HTML. It would be interesting if services could also output CSV files. The first implementation would be the Locafora alphabetical list.
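A sketch of how rows could be rendered as CSV using only the standard library. The function name, the column names, and the dict-shaped rows are assumptions; how the result is wired into a service view is left open.

```python
import csv
import io
from typing import Iterable, Mapping, Sequence


def alphabetical_csv(rows: Iterable[Mapping[str, object]],
                     columns: Sequence[str], sort_by: str) -> str:
    """Render `rows` as CSV text, sorted case-insensitively on `sort_by`."""
    buffer = io.StringIO()
    # Keys that are not listed in `columns` are silently left out.
    writer = csv.DictWriter(buffer, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    ordered = sorted(rows, key=lambda row: str(row.get(sort_by, "")).casefold())
    for row in ordered:
        writer.writerow(row)
    return buffer.getvalue()
```

In a Django view the returned text could be served with the `text/csv` content type so that browsers offer it as a download.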

System update

At least the following packages need to be upgraded:

  • Django (1.11)
  • Celery
  • Ubuntu server (16.04)
  • Python 3.6
  • Pandas

Write basic documentation

The following should be clear from the code:

  • Which local variables to set and what effect they have (purge_immediately locally)
  • How to download data and start testing locally (quick? clone source/sample!)
  • How to manage the installation on Tools

Configure SSHD & Firewall

Things that should be done:

  • No root login
  • No password login
  • Only allow home access
  • Configure VPN at home

Clean module implementations

The current implementations, especially those of the Wikidata modules, are a bit messy. We should inject the Wikidata separately into the modules. This way the modules can assume that the Wikidata is sound. We can also use next() to write the search through the claims a little more cleanly.
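A sketch of the next() idea. The shape of a claim (a dict with `"property"` and `"value"` keys) and the helper name are assumptions about how the injected Wikidata might look, not the actual structure used by the modules.

```python
from typing import Iterable, Optional


def find_claim(claims: Iterable[dict], property_id: str,
               default: Optional[dict] = None) -> Optional[dict]:
    """Return the first claim for `property_id`, or `default` if none exists."""
    return next(
        (claim for claim in claims if claim.get("property") == property_id),
        default,
    )


claims = [
    {"property": "P31", "value": "Q5"},        # instance of: human
    {"property": "P21", "value": "Q6581072"},  # sex or gender: female
]
gender = find_claim(claims, "P21")
```

Passing a default to next() avoids a StopIteration when the property is missing, so a module can handle absent claims with a simple `is None` check.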

Return floats from the modules

Make sure that it is OK to return a float from a module. This makes it possible to do 1/ and will give, for instance, the least edited pages.

Use transclusion to include feeds on pages

When people want to place a feed on their user page they should transclude the User:Wiki Feed Bot/include-feed page.

Once a day WikiFeed will do the following:

  1. Find all pages that transclude the include-feed page in the user namespace
  2. Parse the template to see which parameters should get used
  3. Store which page wants to include which feed with which parameters if this hasn't been stored yet, or update the parameters if they have changed
  4. Check revisions of each page where the template was added or changed to make sure that the template was added/changed by the user owning that page
  5. For each template that is approved (i.e. the user wants this template) place the feed on the page where the template is used.

Part of the template page should be a link that refreshes the feed. This allows users to make changes to the templates and see the results. Steps 4 and 5 should be executed before updates triggered this way take place.

Any maliciously placed feeds should be reported on the talk page of the user. This gives the user the opportunity to remove the template and/or report abuse.

Any problems that occur while parsing the template should be reported on the feed page, provided the feed was approved by the user.
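Steps 2 and 3 of the daily run can be sketched roughly as below. This is a simplification under stated assumptions: the regular expression only handles a flat template call without nested templates, `store` is a plain dict standing in for whatever storage is used, and the `feed` and `limit` parameter names in the usage example are hypothetical.

```python
import re
from typing import Dict, Optional

TEMPLATE = "User:Wiki Feed Bot/include-feed"
PATTERN = re.compile(r"\{\{\s*" + re.escape(TEMPLATE) + r"\s*(\|[^{}]*)?\}\}")


def parse_parameters(wikitext: str) -> Optional[Dict[str, str]]:
    """Step 2: return the template parameters, or None if not transcluded."""
    match = PATTERN.search(wikitext)
    if match is None:
        return None
    parameters: Dict[str, str] = {}
    # The captured group starts with "|", so the first split part is empty.
    for part in (match.group(1) or "").split("|")[1:]:
        name, _, value = part.partition("=")
        parameters[name.strip()] = value.strip()
    return parameters


def sync(store: Dict[str, Dict[str, str]], page: str, wikitext: str) -> str:
    """Step 3: store new parameters or update them when they changed."""
    parameters = parse_parameters(wikitext)
    if parameters is None:
        return "skipped"
    if page not in store:
        store[page] = parameters
        return "stored"
    if store[page] != parameters:
        store[page] = parameters
        return "updated"
    return "unchanged"
```

For example, syncing a page containing `{{User:Wiki Feed Bot/include-feed|feed=scandals|limit=10}}` stores `{"feed": "scandals", "limit": "10"}` for that page. The revision check from step 4 is deliberately left out of this sketch.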

Refactor tasks section of resource processors

Currently the tasks related to resource processors are bound to the processor classes. This is not future-proof, as Celery will not support methods as tasks. The tasks also do not currently work well with requests.Session objects. Some rewriting needs to happen to facilitate all use cases and be ready for Celery 4.

Pull data from Wikipedia

Instead of pushing data from Datascope to a wiki, the wikis should pull from Datascope. This isolates the two systems a lot better and makes a compromise of Datascope less problematic.

Future Fashion classification

Instead of k-means, the future fashion prediction should use Naive Bayes (NB), as the classes are known from the start. This should simplify things a lot.

Allow aggregation of data on Tools

Currently Tools is killing the task, probably because the task uses too much memory. We should experiment with giving the task more memory (this will give it lower priority on the task queue) and see if we can make the growth project iterator-based instead of list-based, which will have tremendous benefits for memory usage.

Write a supervise HTML view that helps people to tag content in collectives

On the left you see the data. On the right you see a list of tags. Tags can be added or selected. Every post brings you to the next Individual. Perhaps a "characteristic" model should be created to facilitate permanent storage of tags and applying tags to similar content automatically. "unknown" should always be available as a tag.

Separate weight and value coming from a module

To make debugging easier, it makes sense for a module to return its weight and its value separately. It could also return how often it rejected a page altogether, to give a sense of its reliability.
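One possible shape for this, as a sketch. Everything here is an assumption for illustration: modules are plain callables, a return value of `None` means the module rejected the page, and points are computed as value times weight.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class ModuleResult:
    value: Optional[float]  # None means the module rejected the page
    weight: float

    @property
    def points(self) -> float:
        """Assumed combination rule: value multiplied by weight."""
        return 0.0 if self.value is None else self.value * self.weight


def run_module(module: Callable[[dict], Optional[float]], weight: float,
               pages: Iterable[dict]) -> Tuple[List[ModuleResult], int]:
    """Return a result per page plus the number of rejected pages."""
    results = [ModuleResult(value=module(page), weight=weight) for page in pages]
    rejected = sum(1 for result in results if result.value is None)
    return results, rejected


def editors(page: dict) -> Optional[float]:
    """Hypothetical module: the number of editors, if known."""
    return page.get("editors")


results, rejected = run_module(editors, 0.5, [{"editors": 43}, {}, {"editors": 2}])
```

Keeping `value` and `weight` apart means a surprising score can be traced back to either the raw module output or the configured weight, and `rejected` gives a rough reliability signal per module.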

Install VPN

Make sure that the server can only be reached from a VPN.
