fako / datascope
Data Scope - a framework for data mashups
License: GNU General Public License v3.0
See how the EU looks in 2017 regarding the 6 standard topics.
To manage database load, the MediaWiki API supports a special parameter (maxlag). DataScope should use this parameter so that it does not block normal users when Wikipedia is under high load.
Implementation details can be found here: https://www.mediawiki.org/wiki/Manual:Maxlag_parameter
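A minimal sketch of how the maxlag parameter could be attached to API requests, per the manual page above. The function and error-check names are assumptions for illustration; only the `maxlag` parameter itself and the `"maxlag"` error code come from the MediaWiki documentation.

```python
def with_maxlag(params, maxlag=5):
    """Return a copy of API params with maxlag set, so the API can
    reject our request when replication lag exceeds `maxlag` seconds."""
    lagged = dict(params)
    lagged["maxlag"] = maxlag
    return lagged


def is_maxlag_error(response_json):
    """Detect the maxlag rejection, which arrives as error code 'maxlag'.
    On such a rejection the client should back off and retry later."""
    return response_json.get("error", {}).get("code") == "maxlag"
```

With this in place, DataScope would retry lagged requests after a pause instead of piling load onto a struggling replica.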
Contributions are now generated by a generator method on Growth. Somewhere this generator seems to cause a 300% memory increase. Preparing contributions will become more problematic as more and more types of contributions are needed. An example is contributions from the ShellResource, which almost certainly needs to contribute per newline instead of as a single JSON structure.
To make contributions future-proof we need to write a contributions generator class that generates contributions from a given QuerySet. The QuerySet will come from a ResourceProcessor. The QuerySet should be iterated over, and resources should be extracted correctly using ExtractorProcessor (resulting in generators). These generators are then emptied through the ContributionsGenerator.
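The flow above could be sketched as follows. The class name comes from the issue; the constructor signature and the `split_lines` extractor (standing in for per-newline ShellResource output) are assumptions.

```python
class ContributionsGenerator:
    """Empties extractor generators lazily, one resource at a time,
    so memory stays flat regardless of QuerySet size."""

    def __init__(self, queryset, extract):
        self.queryset = queryset  # typically produced by a ResourceProcessor
        self.extract = extract    # e.g. an ExtractorProcessor method returning a generator

    def __iter__(self):
        for resource in self.queryset:
            # each extractor yields contributions one by one
            yield from self.extract(resource)


def split_lines(resource):
    """Hypothetical extractor for ShellResource-like output:
    contribute per non-empty newline instead of as one JSON structure."""
    for line in resource.splitlines():
        if line.strip():
            yield line
```

A consumer simply iterates the ContributionsGenerator; nothing is materialised as a list.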
The first step the community should take is to gather candidate opinion pages. In the first version this can be done through the Custom Search API.
Add a command that modifies the individuals in a collective to include a written tag. The command should keep a list of tags in memory. Tagging happens on the basis of a single property. Identical properties receive identical tags. Tags are stored in the individual and are not persistent across multiple growths. "unknown" should always be available as a tag.
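A minimal sketch of the command's core logic, under the rules stated above. The function name, the `tag-N` naming scheme, and the dict-shaped individuals are assumptions; the in-memory tag list, the single-property basis, identical tags for identical values, and the "unknown" fallback come from the issue.

```python
def tag_individuals(individuals, prop):
    """Tag each individual based on one property. Identical property
    values receive identical tags; individuals missing the property
    get the always-available "unknown" tag."""
    tags = {}  # kept in memory only for the duration of the command
    for individual in individuals:
        value = individual.get(prop)
        if value is None:
            individual["tag"] = "unknown"
            continue
        # reuse the tag already handed out for this value, or mint a new one
        tags.setdefault(value, "tag-{}".format(len(tags)))
        individual["tag"] = tags[value]
    return individuals
```

Because tags are written into the individuals themselves, they are naturally not persistent across growths.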
To be able to invert the boolean logic of a module (for instance: make all women unimportant) it is necessary to be able to set weights like 0.5, in order to halve a page's points when the module matches (e.g. when a page is about a woman).
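One possible shape of that weighting, sketched with an assumed interface (the real module signature may differ):

```python
def apply_weight(matches, points, weight=1.0):
    """Scale a page's points by the module weight when the module matches.
    A weight of 0.5 halves the points, turning a boolean 'boost' module
    into a de-emphasising one."""
    return points * weight if matches else points
```

With weight 1.0 the module behaves as before; with 0.5 matched pages sink in the ranking.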
This is the list of tests that should at least have a clear specification in the code:
In order to support a range of business models some details should remain hidden, while the core and some projects like wiki_feed, open_data and nautilus can remain open source.
In order to test the system with real users we'll add a few modules that people find interesting, so that they start using the system for their own purposes.
Currently the task runs out of memory on ranking tasks that involve many modules. There are three possible optimizations.
Instead of making users wait, WikiFeed should inform users by creating an edit on their talk pages.
This resource should connect to an email box where "public" emails are sent, meaning newsletters and advertisement emails. The resource can be used to process these emails and build custom digest communities.
A popular metric is pageview data. We should add this to page Individuals.
The following example explanations should link to relevant WikiData entries and filter definitions.
Paradise Papers
Document leak related to offshore investments

This is a scandal.                        100  (is_scandal filter)
This involves 25 politicians.              50  (politicians filter)
This happened in London.                   20  (location filter)
This happened yesterday.                   18  (how_recent filter)
This article has more than 3 sources.       3  (citation filter)
This article was created by 43 editors.     1  (editors filter)

WikiFeed Score                            192
When we become a bot we can speed up data aggregation by about 10x.
A service can now return an API response or HTML. It would be interesting if services could also output CSV files. The first implementation would be the Locafora alphabetical list.
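A sketch of what the CSV output path could look like, assuming the service hands over rows as dicts; the function name and the sort-by-first-column behaviour (to produce an alphabetical list) are assumptions.

```python
import csv
import io


def to_csv(rows, fieldnames):
    """Render a result set as CSV text, sorted alphabetically on the
    first field, e.g. for the Locafora alphabetical list."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    for row in sorted(rows, key=lambda r: r[fieldnames[0]]):
        writer.writerow(row)
    return buffer.getvalue()
```

The returned string can then be served with a text/csv content type alongside the existing API and HTML responses.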
At least the following packages need to be upgraded:
Add Twitter handles and Wikipedia pages if available. It would be interesting if related topics could be derived as well, from articles and the Twitter feed.
Include regional newspapers.
The following should be clear from the code:
Things that should be done:
The current implementations, especially for the Wikidata modules, are a bit messy. We should inject the Wikidata separately into the modules. This way the modules can assume that the Wikidata is sound. We can also use next() to write the search through the claims a little more cleanly.
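The next()-based claim search could look like this. The `claims` dict shape mirrors Wikidata's claims-keyed-by-property layout; the function name and the injected-dict interface are assumptions.

```python
def first_claim(wikidata, prop, default=None):
    """Return the first claim for a property, or a default.
    Because the Wikidata dict is injected and assumed sound, no
    defensive parsing is needed here; next() replaces a manual loop."""
    return next(iter(wikidata.get("claims", {}).get(prop, [])), default)
```

A module then reads `first_claim(wikidata, "P31")` instead of looping over the claims structure itself.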
Make sure that it is OK to return a float from a module. This allows inversions like 1/x, which will for instance surface the least edited pages.
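A small sketch of why float returns matter, with an assumed module shape (the real module interface may differ):

```python
def edits_module(page):
    """Hypothetical module: score a page by its number of edits."""
    return float(page.get("edits", 0))


def invert(score):
    """1/score flips the ranking, so the least edited pages score
    highest; zero scores map to zero to avoid division errors."""
    return 1.0 / score if score else 0.0
```

With integer-only returns, 1/x would truncate to 0 under integer division; floats keep the inverted ranking meaningful.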
When people want to place a feed on their user page they should transclude the User:Wiki Feed Bot/include-feed page.
Once a day WikiFeed will do the following:
Part of the template page should be a link that refreshes the feed. This allows users to make changes to the templates and see the results. Steps 4 and 5 should be executed before updates triggered in this way take place.
Any maliciously placed feeds should be reported on the talk page of the user. This gives the user the opportunity to remove the template and/or report abuse.
Any problems that occur during parsing of the template should be reported on the feed page, provided the feed was approved by the user.
Currently the tasks related to resource processors are bound to the processor classes. This is not future-proof, as Celery will no longer support methods as tasks. The tasks also do not work well with requests.Session objects. Some rewriting needs to happen to facilitate all use cases and be ready for Celery 4.
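A sketch of the direction the rewrite could take: module-level tasks that receive only serialisable arguments and rebuild any session inside the task. In the real project the decorator would be celery's `shared_task`; a no-op stand-in keeps this sketch self-contained, and the task name and argument shapes are assumptions.

```python
def shared_task(func):
    """Stand-in for celery.shared_task, so the sketch runs standalone."""
    return func


@shared_task
def process_resource(resource_id, session_kwargs=None):
    """Module-level task, compatible with Celery 4's function-only tasks.
    requests.Session objects are not serialisable task arguments, so the
    session would be rebuilt here from plain kwargs instead of passed in."""
    session_kwargs = session_kwargs or {}
    # ...look up the resource by id and process it with a fresh session...
    return {"resource": resource_id, "session": session_kwargs}
```

Processor classes would then enqueue `process_resource.delay(resource.id, ...)` rather than binding the task to one of their own methods.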
Instead of pushing data from Datascope to a wiki, the wikis should pull from Datascope. This isolates the two much better and makes a compromise of Datascope less problematic.
Instead of k-means, the future fashion prediction should use Naive Bayes (NB), as the classes are known from the start. This should simplify things a lot.
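To illustrate why known classes simplify things, here is a minimal multinomial Naive Bayes written from scratch; in practice something like sklearn's naive_bayes module would be used. The token-list feature shape and class names are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Minimal multinomial NB: supervised, so it uses the known class
    labels directly instead of discovering clusters like k-means."""

    def fit(self, samples, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for tokens, label in zip(samples, labels):
            self.token_counts[label].update(tokens)
        return self

    def predict(self, tokens):
        total_samples = sum(self.class_counts.values())

        def log_prob(label):
            counts = self.token_counts[label]
            total = sum(counts.values())
            prior = math.log(self.class_counts[label] / total_samples)
            # Laplace smoothing so unseen tokens don't zero the probability
            return prior + sum(
                math.log((counts[t] + 1) / (total + len(counts))) for t in tokens
            )

        return max(self.class_counts, key=log_prob)
```

Training is a single counting pass and prediction is a closed-form argmax, with no iterative cluster assignment to tune.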
Take a bunch of Tweets originating from Rotterdam and collect them in a Community.
Currently Tools is killing the task, probably because the task uses too much memory. We should try giving the task more memory (this will lower its priority on the task queue) and see if we can make the growth of a project iterator-based instead of list-based, which would have tremendous benefits for memory usage.
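The iterator-based idea in a nutshell (function names are illustrative): the list-based version materialises every result before anything downstream runs, while the generator version holds one item at a time.

```python
def grow(resources, process):
    """Iterator-based growth: yields processed results lazily, so peak
    memory is one item instead of the whole result list."""
    for resource in resources:  # resources can itself be a generator
        yield process(resource)

# list-based equivalent (current approach, memory-heavy):
#   results = [process(r) for r in resources]
```

Downstream consumers iterate the generator directly, and only call list() if they genuinely need everything at once.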
To safely allow users to create modules we need to sandbox the Python environment of Datascope to make sure that no code is running that we don't want to.
Here's an article that talks about such venv sandboxes: https://www.logilab.org/blogentry/22498#virtualenvwrapper
When creating the header for the page, instead of printing the header into the page (which is a security risk), we should print the current page variable. This approach is also less error-prone for editors who want to implement a feed.
Currently no user agent is set for requests to the API, but this should be a custom agent. See details: https://www.mediawiki.org/wiki/API:Etiquette#User-Agent_header
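Following the etiquette page, the agent string should identify the tool and give a contact address. The version number and contact address below are placeholders, and the helper name is an assumption.

```python
# Placeholder values: real version and contact address to be filled in.
USER_AGENT = "DataScope/0.1 (https://github.com/fako/datascope; contact@example.com)"


def api_headers(extra=None):
    """Build request headers with the custom User-Agent, merging in any
    request-specific extras."""
    headers = {"User-Agent": USER_AGENT}
    headers.update(extra or {})
    return headers
```

Every API call would then pass `headers=api_headers()` instead of relying on the HTTP library's default agent.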
On the left you see the data. On the right you see a list of tags. Tags can be added or selected. Every post brings you to the next Individual. Perhaps a "characteristic" model should be created to facilitate permanent storage of tags and to apply tags to similar content automatically. "unknown" should always be available as a tag.
To debug a bit better it makes sense that a module returns its weight and its value separately. It could also return how often it rejected a page altogether, to get a sense of its reliability.
Make sure that the server can only be reached from a VPN.