datagrowth's Introduction

DATAGROWTH

Datagrowth is a Django application that helps you gather data in an organized way. With it you can declare pipelines for data gathering and preprocessing, as well as pipelines for filtering and redistribution.

Installation

You can install Datagrowth into your Django project by running:

pip install datagrowth
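After installing, the app typically needs to be registered in your Django settings so that its models and commands are picked up. A minimal sketch (your project will have its own apps and settings):

```python
# settings.py (sketch): register datagrowth alongside your own apps
INSTALLED_APPS = [
    "django.contrib.contenttypes",
    "django.contrib.auth",
    # ... your own apps ...
    "datagrowth",
]
```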

Getting started

Currently there are two major use cases. Resources provide a uniform way to gather data from very different sources. Configurations are a way to store and transfer key-value pairs, which your code can read in order to act differently in different contexts.
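The idea behind Configurations can be illustrated with a plain-Python sketch. Note that this is not the actual Datagrowth API; it only demonstrates the pattern of defaults plus per-context overrides that downstream code reads:

```python
# Hypothetical sketch of the configuration pattern: defaults that callers
# can override per context, then pass along to the code doing the work.
class Config:
    def __init__(self, defaults, **overrides):
        self._values = {**defaults, **overrides}

    def __getattr__(self, key):
        try:
            return self._values[key]
        except KeyError:
            raise AttributeError(key)


DEFAULTS = {"batch_size": 100, "asynchronous": True}


def gather(config):
    # The gathering code acts on whatever context the configuration describes.
    return f"gathering in batches of {config.batch_size}"


config = Config(DEFAULTS, batch_size=10)
print(gather(config))  # gathering in batches of 10
```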

Follow these guides to get an idea of how you can use Datagrowth:

Running the tests

There is a Django mock project inside the tests directory of the repository. You can run its tests by executing the following inside that directory:

python manage.py test

Alternatively, you can run the tests against multiple Django versions with:

tox

datagrowth's People

Contributors

fako, dependabot[bot], peymanity, dan-kwiat, janbaykara, edsaperia, denisexifaras


datagrowth's Issues

Migrate commands

The following commands need a legacy mode which should be the default:

  • grow command
  • dump dataset
  • load dataset
  • copy dataset

The following commands can be removed:

  • migrate_absolute_media_paths

Note that it would be good if inheriting commands could decide for themselves whether legacy should be their default. Switching between legacy and modern mode should probably happen in DatasetCommand, as this is the base class for all commands that need a migration.
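One possible shape for this toggle (a hypothetical sketch; only the name DatasetCommand comes from the issue, the attribute and method names are invented for illustration) is a class-level default that inheriting commands can override:

```python
# Hypothetical sketch: DatasetCommand carries a class-level legacy default
# that inheriting commands may override to opt into modern behaviour.
class DatasetCommand:
    cast_as_legacy = True  # legacy is the default, per the issue

    def handle(self, **options):
        if self.cast_as_legacy:
            return self.handle_legacy(**options)
        return self.handle_modern(**options)

    def handle_legacy(self, **options):
        return "legacy"

    def handle_modern(self, **options):
        return "modern"


class GrowCommand(DatasetCommand):
    pass  # inherits the legacy default


class ModernCopyCommand(DatasetCommand):
    cast_as_legacy = False  # opts into modern behaviour
```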

Implement growth error handling

Error summaries can be created in the evaluate_dataset_version method. Results can be stored under the "seeding_errors" and "task_errors" keys of the task_results field. The method should use task_definitions and seeding_phases to track down erroneous Resources.

Additionally, HttpSeedingProcessor should allow errors to be raised when configured to do so (raising by default would be a breaking change). These errors should propagate as task failures, which the grow method can then pick up.
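A rough sketch of how such an error summary could be assembled (hypothetical function body and document structure; only the key names "seeding_errors" and "task_errors" and the field name task_results come from the issue):

```python
# Hypothetical sketch: summarize per-document task failures into the
# "seeding_errors" and "task_errors" keys of a task_results summary.
def evaluate_dataset_version(documents):
    task_results = {"seeding_errors": [], "task_errors": []}
    for doc in documents:
        for task, result in doc.get("task_results", {}).items():
            if not result.get("success", True):
                entry = {"id": doc["id"], "task": task}
                if task == "seeding":
                    task_results["seeding_errors"].append(entry)
                else:
                    task_results["task_errors"].append(entry)
    return task_results


docs = [
    {"id": 1, "task_results": {"seeding": {"success": False}}},
    {"id": 2, "task_results": {"tika": {"success": False}}},
]
summary = evaluate_dataset_version(docs)
```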

Minute differences in modified_at may randomly trigger extra queries

PyTest output from GitHub Actions is shown below. Queries 4 and 5 are identical except for a few microseconds in modified_at. It is unclear why this second update takes place.

self = <datatypes.tests.test_collection.TestCollection testMethod=test_update_no_duplicates>
influence_method =

  @patch('datatypes.models.Collection.influence')
  def test_update_no_duplicates(self, influence_method):
      docs, doc_ids = self.get_docs_list_and_ids()
      docs.append(docs[-2])  # adds a Document instance that doesn't exist yet as a duplicate
      today = date.today()
      created_at = self.instance.created_at
      with self.assertNumQueries(4):
          # Query 1: fetch targets
          # Query 2: update sources
          # Query 3: add sources
          # Query 4: update modified_at
          self.instance.update(docs, "value")

datatypes/tests/test_collection.py:328:


../.tox/py/lib/python3.8/site-packages/django/test/testcases.py:99: in __exit__
self.test_case.assertEqual(
E AssertionError: 5 != 4 : 5 queries executed, 4 expected
E Captured queries were:
E 1. SELECT datatypes_document.id, datatypes_document.tasks, datatypes_document.task_results, datatypes_document.derivatives, datatypes_document.created_at, datatypes_document.modified_at, datatypes_document.pending_at, datatypes_document.finished_at, datatypes_document.properties, datatypes_document.dataset_version_id, datatypes_document.collection_id, datatypes_document.identity, datatypes_document.reference FROM datatypes_document WHERE (datatypes_document.collection_id = 1 AND (JSON_EXTRACT(datatypes_document.properties, '$."value"') = JSON_EXTRACT('"1"', '$') OR JSON_EXTRACT(datatypes_document.properties, '$."value"') = JSON_EXTRACT('"2"', '$') OR JSON_EXTRACT(datatypes_document.properties, '$."value"') = JSON_EXTRACT('"3"', '$') OR JSON_EXTRACT(datatypes_document.properties, '$."value"') = JSON_EXTRACT('"4"', '$'))) ORDER BY datatypes_document.id ASC
E 2. UPDATE datatypes_document SET properties = CASE WHEN (datatypes_document.id = 2) THEN '{"value": "1", "nested": "nested value 1", "context": "nested value", "id": "BE-pensioen", "word": "pensioen", "country": "BE", "language": "nl"}' WHEN (datatypes_document.id = 3) THEN '{"value": "2", "nested": "nested value 2", "context": "nested value", "id": "NL-ouderdom", "word": "ouderdom", "country": "NL", "language": "nl"}' ELSE NULL END, derivatives = CASE WHEN (datatypes_document.id = 2) THEN '{}' WHEN (datatypes_document.id = 3) THEN '{}' ELSE NULL END, task_results = CASE WHEN (datatypes_document.id = 2) THEN '{}' WHEN (datatypes_document.id = 3) THEN '{}' ELSE NULL END, identity = CASE WHEN (datatypes_document.id = 2) THEN NULL WHEN (datatypes_document.id = 3) THEN NULL ELSE NULL END, reference = CASE WHEN (datatypes_document.id = 2) THEN NULL WHEN (datatypes_document.id = 3) THEN NULL ELSE NULL END, modified_at = CASE WHEN (datatypes_document.id = 2) THEN '2024-01-31 16:09:00.001162' WHEN (datatypes_document.id = 3) THEN '2024-01-31 16:09:00.001186' ELSE NULL END, pending_at = CASE WHEN (datatypes_document.id = 2) THEN '2024-01-31 16:08:59.727047' WHEN (datatypes_document.id = 3) THEN '2024-01-31 16:08:59.728231' ELSE NULL END, finished_at = CASE WHEN (datatypes_document.id = 2) THEN NULL WHEN (datatypes_document.id = 3) THEN NULL ELSE NULL END WHERE datatypes_document.id IN (2, 3)
E 3. INSERT INTO datatypes_document (tasks, task_results, derivatives, created_at, modified_at, pending_at, finished_at, properties, dataset_version_id, collection_id, identity, reference) VALUES ('{}', '{}', '{}', '2024-01-31 16:09:00.004864', '2024-01-31 16:09:00.004882', '2024-01-31 16:09:00.004600', NULL, '{"id": "GB-pension", "word": "pension", "value": "3", "country": "GB", "language": "en"}', NULL, 1, NULL, NULL), ('{}', '{}', '{}', '2024-01-31 16:09:00.004936', '2024-01-31 16:09:00.004946', '2024-01-31 16:09:00.004664', NULL, '{"id": "GB-age", "word": "age", "value": "4", "country": "GB", "language": "en"}', NULL, 1, NULL, NULL)
E 4. UPDATE datatypes_collection SET tasks = '{}', task_results = '{}', derivatives = '{}', created_at = '2015-06-04 11:31:27.940000', modified_at = '2024-01-31 16:09:00.005457', pending_at = '2024-01-31 16:08:59.724555', finished_at = NULL, dataset_version_id = 1, name = NULL, identifier = NULL, referee = NULL WHERE datatypes_collection.id = 1
E 5. UPDATE datatypes_collection SET tasks = '{}', task_results = '{}', derivatives = '{}', created_at = '2015-06-04 11:31:27.940000', modified_at = '2024-01-31 16:09:00.006119', pending_at = '2024-01-31 16:08:59.724555', finished_at = NULL, dataset_version_id = 1, name = NULL, identifier = NULL, referee = NULL WHERE datatypes_collection.id = 1
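The microsecond drift itself is unsurprising: a field with auto_now=True is restamped on every save(), so two back-to-back saves necessarily write slightly different timestamps. The open question is why the collection is saved twice at all. A plain-Python illustration of the drift (no Django involved):

```python
import time
from datetime import datetime

# Each save() with auto_now=True stamps "now"; consecutive calls differ
# by microseconds, which matches the delta between queries 4 and 5.
first = datetime.now()
time.sleep(0.001)  # stands in for the work done between the two saves
second = datetime.now()
delta = second - first
assert second > first  # the two timestamps are never identical here
```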
