datagrowth's Introduction

DATAGROWTH

Datagrowth is a Django application that helps you gather data in an organized way. With it you can declare pipelines for data gathering and preprocessing, as well as pipelines for filtering and redistribution.

Installation

You can install Datagrowth into your Django project by running:

pip install datagrowth
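After installing, the app typically needs to be registered in your Django settings so that its models and commands are picked up. A minimal sketch (your project will have its own apps and settings):

```python
# settings.py (sketch): register datagrowth alongside your own apps
INSTALLED_APPS = [
    "django.contrib.contenttypes",
    "django.contrib.auth",
    # ... your own apps ...
    "datagrowth",
]
```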

Getting started

Currently there are two major use cases. Resources provide a uniform way to gather data from very different sources. Configurations are a way to store and transfer key-value pairs, which your code can read in order to act differently in different contexts.
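The idea behind Configurations can be illustrated with a plain-Python sketch. Note that this is not the actual Datagrowth API; it only demonstrates the pattern of defaults plus per-context overrides that downstream code reads:

```python
# Hypothetical sketch of the configuration pattern: defaults that callers
# can override per context, then pass along to the code doing the work.
class Config:
    def __init__(self, defaults, **overrides):
        self._values = {**defaults, **overrides}

    def __getattr__(self, key):
        try:
            return self._values[key]
        except KeyError:
            raise AttributeError(key)


DEFAULTS = {"batch_size": 100, "asynchronous": True}


def gather(config):
    # The gathering code acts on whatever context the configuration describes.
    return f"gathering in batches of {config.batch_size}"


config = Config(DEFAULTS, batch_size=10)
print(gather(config))  # gathering in batches of 10
```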

Follow these guides to get an idea of how you can use Datagrowth:

Running the tests

There is a Django mock project inside the tests directory of the repository. You can run its tests by executing the following inside that directory:

python manage.py test

Alternatively, you can run the tests against multiple Django versions with:

tox

datagrowth's People

Contributors

fako, dependabot[bot], peymanity, dan-kwiat, janbaykara, edsaperia, denisexifaras


datagrowth's Issues

Migrate commands

The following commands need a legacy mode which should be the default:

  • grow command
  • dump dataset
  • load dataset
  • copy dataset

The following commands can be removed:

  • migrate_absolute_media_paths

Note that it would be good if inheriting commands could decide for themselves whether legacy should be their default. Switching between legacy and modern mode should probably happen in DatasetCommand, as this is the base class for all commands that need a migration.
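One possible shape for this toggle (a hypothetical sketch; only the name DatasetCommand comes from the issue, the attribute and method names are invented for illustration) is a class-level default that inheriting commands can override:

```python
# Hypothetical sketch: DatasetCommand carries a class-level legacy default
# that inheriting commands may override to opt into modern behaviour.
class DatasetCommand:
    cast_as_legacy = True  # legacy is the default, per the issue

    def handle(self, **options):
        if self.cast_as_legacy:
            return self.handle_legacy(**options)
        return self.handle_modern(**options)

    def handle_legacy(self, **options):
        return "legacy"

    def handle_modern(self, **options):
        return "modern"


class GrowCommand(DatasetCommand):
    pass  # inherits the legacy default


class ModernCopyCommand(DatasetCommand):
    cast_as_legacy = False  # opts into modern behaviour
```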

Implement growth error handling

Error summaries can be created in the evaluate_dataset_version method. Results can be stored under the "seeding_errors" and "task_errors" keys of the task_results field. The method should use task_definitions and seeding_phases to track down erroneous Resources.

Additionally, HttpSeedingProcessor should allow errors to be raised when configured to do so (raising by default would be a breaking change). These errors should propagate as task failures, which the grow method can then pick up.
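A rough sketch of how such an error summary could be assembled (hypothetical function body and document structure; only the key names "seeding_errors" and "task_errors" and the field name task_results come from the issue):

```python
# Hypothetical sketch: summarize per-document task failures into the
# "seeding_errors" and "task_errors" keys of a task_results summary.
def evaluate_dataset_version(documents):
    task_results = {"seeding_errors": [], "task_errors": []}
    for doc in documents:
        for task, result in doc.get("task_results", {}).items():
            if not result.get("success", True):
                entry = {"id": doc["id"], "task": task}
                if task == "seeding":
                    task_results["seeding_errors"].append(entry)
                else:
                    task_results["task_errors"].append(entry)
    return task_results


docs = [
    {"id": 1, "task_results": {"seeding": {"success": False}}},
    {"id": 2, "task_results": {"tika": {"success": False}}},
]
summary = evaluate_dataset_version(docs)
```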

Minute differences in modified_at may randomly trigger extra queries

PyTest output from GitHub Actions is shown below. Queries 4 and 5 are identical except for a few microseconds in modified_at. It is unclear why this second update takes place.

self = <datatypes.tests.test_collection.TestCollection testMethod=test_update_no_duplicates>
influence_method =

  @patch('datatypes.models.Collection.influence')
  def test_update_no_duplicates(self, influence_method):
      docs, doc_ids = self.get_docs_list_and_ids()
      docs.append(docs[-2])  # adds a Document instance that doesn't exist yet as a duplicate
      today = date.today()
      created_at = self.instance.created_at
      with self.assertNumQueries(4):
          # Query 1: fetch targets
          # Query 2: update sources
          # Query 3: add sources
          # Query 4: update modified_at
          self.instance.update(docs, "value")

datatypes/tests/test_collection.py:328:


../.tox/py/lib/python3.8/site-packages/django/test/testcases.py:99: in __exit__
self.test_case.assertEqual(
E AssertionError: 5 != 4 : 5 queries executed, 4 expected
E Captured queries were:
E 1. SELECT datatypes_document.id, datatypes_document.tasks, datatypes_document.task_results, datatypes_document.derivatives, datatypes_document.created_at, datatypes_document.modified_at, datatypes_document.pending_at, datatypes_document.finished_at, datatypes_document.properties, datatypes_document.dataset_version_id, datatypes_document.collection_id, datatypes_document.identity, datatypes_document.reference FROM datatypes_document WHERE (datatypes_document.collection_id = 1 AND (JSON_EXTRACT(datatypes_document.properties, '$."value"') = JSON_EXTRACT('"1"', '$') OR JSON_EXTRACT(datatypes_document.properties, '$."value"') = JSON_EXTRACT('"2"', '$') OR JSON_EXTRACT(datatypes_document.properties, '$."value"') = JSON_EXTRACT('"3"', '$') OR JSON_EXTRACT(datatypes_document.properties, '$."value"') = JSON_EXTRACT('"4"', '$'))) ORDER BY datatypes_document.id ASC
E 2. UPDATE datatypes_document SET properties = CASE WHEN (datatypes_document.id = 2) THEN '{"value": "1", "nested": "nested value 1", "context": "nested value", "id": "BE-pensioen", "word": "pensioen", "country": "BE", "language": "nl"}' WHEN (datatypes_document.id = 3) THEN '{"value": "2", "nested": "nested value 2", "context": "nested value", "id": "NL-ouderdom", "word": "ouderdom", "country": "NL", "language": "nl"}' ELSE NULL END, derivatives = CASE WHEN (datatypes_document.id = 2) THEN '{}' WHEN (datatypes_document.id = 3) THEN '{}' ELSE NULL END, task_results = CASE WHEN (datatypes_document.id = 2) THEN '{}' WHEN (datatypes_document.id = 3) THEN '{}' ELSE NULL END, identity = CASE WHEN (datatypes_document.id = 2) THEN NULL WHEN (datatypes_document.id = 3) THEN NULL ELSE NULL END, reference = CASE WHEN (datatypes_document.id = 2) THEN NULL WHEN (datatypes_document.id = 3) THEN NULL ELSE NULL END, modified_at = CASE WHEN (datatypes_document.id = 2) THEN '2024-01-31 16:09:00.001162' WHEN (datatypes_document.id = 3) THEN '2024-01-31 16:09:00.001186' ELSE NULL END, pending_at = CASE WHEN (datatypes_document.id = 2) THEN '2024-01-31 16:08:59.727047' WHEN (datatypes_document.id = 3) THEN '2024-01-31 16:08:59.728231' ELSE NULL END, finished_at = CASE WHEN (datatypes_document.id = 2) THEN NULL WHEN (datatypes_document.id = 3) THEN NULL ELSE NULL END WHERE datatypes_document.id IN (2, 3)
E 3. INSERT INTO datatypes_document (tasks, task_results, derivatives, created_at, modified_at, pending_at, finished_at, properties, dataset_version_id, collection_id, identity, reference) VALUES ('{}', '{}', '{}', '2024-01-31 16:09:00.004864', '2024-01-31 16:09:00.004882', '2024-01-31 16:09:00.004600', NULL, '{"id": "GB-pension", "word": "pension", "value": "3", "country": "GB", "language": "en"}', NULL, 1, NULL, NULL), ('{}', '{}', '{}', '2024-01-31 16:09:00.004936', '2024-01-31 16:09:00.004946', '2024-01-31 16:09:00.004664', NULL, '{"id": "GB-age", "word": "age", "value": "4", "country": "GB", "language": "en"}', NULL, 1, NULL, NULL)
E 4. UPDATE datatypes_collection SET tasks = '{}', task_results = '{}', derivatives = '{}', created_at = '2015-06-04 11:31:27.940000', modified_at = '2024-01-31 16:09:00.005457', pending_at = '2024-01-31 16:08:59.724555', finished_at = NULL, dataset_version_id = 1, name = NULL, identifier = NULL, referee = NULL WHERE datatypes_collection.id = 1
E 5. UPDATE datatypes_collection SET tasks = '{}', task_results = '{}', derivatives = '{}', created_at = '2015-06-04 11:31:27.940000', modified_at = '2024-01-31 16:09:00.006119', pending_at = '2024-01-31 16:08:59.724555', finished_at = NULL, dataset_version_id = 1, name = NULL, identifier = NULL, referee = NULL WHERE datatypes_collection.id = 1
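The microsecond drift itself is unsurprising: a field with auto_now=True is restamped on every save(), so two back-to-back saves necessarily write slightly different timestamps. The open question is why the collection is saved twice at all. A plain-Python illustration of the drift (no Django involved):

```python
import time
from datetime import datetime

# Each save() with auto_now=True stamps "now"; consecutive calls differ
# by microseconds, which matches the delta between queries 4 and 5.
first = datetime.now()
time.sleep(0.001)  # stands in for the work done between the two saves
second = datetime.now()
delta = second - first
assert second > first  # the two timestamps are never identical here
```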
