
django-devdata

django-devdata provides a convenient workflow for creating development databases seeded with anonymised production data. The result is a development database that contains useful data and is fast to create and keep up to date.

As of 1.x, django-devdata is ready for use in real-world projects. See releases for more details.

Elevator pitch

# blog/models.py

class Post(models.Model):
    content = models.TextField()
    published = models.DateTimeField()


class Comment(models.Model):
    user = models.ForeignKey(User, on_delete=models.CASCADE)
    post = models.ForeignKey(Post, on_delete=models.CASCADE)
    text = models.TextField()

# settings.py
DEVDATA_STRATEGIES = {
    'auth.User': [
        # We want all internal users
        InternalUsersStrategy(name='internal_users'),
        # Get some random other users, we don't need everyone
        RandomSampleQuerySetStrategy(name='random_users', count=10),
    ],
    'blog.Post': [
        # Only the latest blog posts necessary for testing...
        LatestSampleQuerySetStrategy(name='latest_posts', count=3, order='-published'),
        # Except that one weird edge case
        ExactQuerySetStrategy(name='edge_case', pks=(42,)),
    ],
    'blog.Comment': [
        # Get all the comments – devdata will automatically restrict to only
        # those that maintain referential integrity, i.e. comments from users
        # not selected, or on posts not selected, will be skipped.
        QuerySetStrategy(name='all'),
    ],
}
(prod)$ python manage.py devdata_export devdata
(prod)$ tar -czf devdata.tar.gz devdata/
(local)$ scp prod:~/devdata.tar.gz devdata.tar.gz
(local)$ tar -xzf devdata.tar.gz
(local)$ python manage.py devdata_import devdata/

Problem

In the same way that it's important for development environments to be close in configuration to production environments, it's important that the data in the databases we use for development is a realistic representation of the data in production.

We could use a dump of a production database, but there are several problems with this:

  1. It's bad for user privacy and a security risk. It may not be allowed in some organisations.
  2. Production databases can be too big to copy around, making them impractical or unusable for development.
  3. Test data is limited to that available in production.
  4. Preserving referential integrity for a sample of data is hard.
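Point 4 is worth dwelling on: when only a sample of parent rows is taken, any child rows that reference unsampled parents must be dropped. A minimal pure-Python sketch of the kind of filtering involved (the data here is hypothetical, not devdata's internal representation):

```python
# Hypothetical primary keys already selected for the parent models.
exported_users = {1, 2}
exported_posts = {10, 11}

comments = [
    {"pk": 100, "user_id": 1, "post_id": 10},
    {"pk": 101, "user_id": 3, "post_id": 10},  # user 3 was not exported
    {"pk": 102, "user_id": 2, "post_id": 12},  # post 12 was not exported
]

# Keep only comments whose foreign keys point at exported rows, so the
# sampled dataset stays referentially intact.
kept = [
    c for c in comments
    if c["user_id"] in exported_users and c["post_id"] in exported_posts
]
```

Doing this by hand across hundreds of models is what makes sampling hard; automating it is a core part of what django-devdata offers.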

Another option is to use factories or fake data to generate the entire development database. This is mostly desirable, but...

  • It can be a burden to maintain factories once there are hundreds or thousands of them.
  • It can be hard to retroactively add these to a Django site of a significant size.

Solution

django-devdata defines a three-step workflow:

  1. Exporting data, with customisable export strategies per model.
  2. Anonymising data, with customisable anonymisation per field/model.
  3. Importing data, with customisable importing per model.

django-devdata ships with built-in support for:

  • Exporting full tables
  • Exporting subsets (random, latest, specified primary keys)
  • Anonymising data with faker
  • Importing exported data
  • Importing data from factory-boy factories

In addition to this, the structure provided by django-devdata can be extended to support extraction from other data sources, to import/export Django fixtures, or to work with other factory libraries.

Exporting, anonymising, and importing are all configurable, and django-devdata's base classes will help do this without much work.

Workflow

Exporting

$ python manage.py devdata_export [dest] [app_label.ModelName ...]

This step allows a sync strategy to persist some data that will be used to create a new development database. For example, the QuerySetStrategy can export data from a table to a filesystem for later import.
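As an illustration, an export destination might end up looking something like the following (the exact file names and layout here are an assumption for illustration, keyed by model label and strategy name as configured in DEVDATA_STRATEGIES):

```
devdata/
  auth.User/
    internal_users.json
    random_users.json
  blog.Post/
    latest_posts.json
    edge_case.json
```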

This can be used for:

  • Exporting a manually created database for other developers to use.
  • Exporting realistic data from a production database.
  • A cron job to maintain a development dataset hosted on cloud storage.

This step is optional (the built-in factory strategy doesn't do this).

Anonymisation

This step is critical when using django-devdata to export from production sources. It's not a distinct step, but rather an opt-out part of the export step.

Importing

$ python manage.py devdata_import [src]

This step is responsible for preparing the database and filling it. If any exporting strategies have been used those must have run first, or their outputs must have been downloaded if they are being shared/hosted somewhere.

Factory-based strategies generate data during this process.

Reset modes
$ python manage.py devdata_import --reset-mode=$MODE [src]

By default any existing database will be removed, ensuring that a fresh database is created for the imported data. This is expected to be the most common case for local development, but may not always be suitable.

The following modes are offered:

  • drop-database: the default; drops the database & re-creates it.
  • drop-tables: drops the tables the Django codebase is aware of, useful if the Django database user doesn't have access to drop the entire database.
  • none: no attempt to reset the database, useful if the user has already manually configured the database or otherwise wants more control over setup.

See the docstrings in src/devdata/reset_modes.py for more details.

Customising

Strategies

The django-devdata strategies define how an import and optionally an export happen. Each model is configured with a list of Strategies to use.

Classes are provided to inherit from for customising this behaviour:

  • Strategy – the base class of all strategies.
  • Exportable – a mixin that opts this strategy in to the export step.
  • QuerySetStrategy – the base of all strategies that export production data to a filesystem. Handles referential integrity, serialisation, and anonymisation of the data pre-export.
  • FactoryStrategy – the base of all strategies that create data based on factory-boy factories.

The API necessary for classes to implement is small, and there are customisation points provided for common patterns.

In our experience most models can be exported with just the un-customised QuerySetStrategy, some will need to use other pre-provided strategies, and a small number will need custom exporters based on the classes provided.

Extra Strategies

Sometimes it can be useful to export and import data from the database which lives outside the tables which Django manages via models.

The "extra" strategies provide hooks which support transferring these data.

Classes are provided to inherit from for customising this behaviour:

  • ExtraExport – defines how to get data out of the database.
  • ExtraImport – defines how to get data into a database.

The API necessary for classes to implement is small and reminiscent of those for Strategy and Exportable.

The following "extra" strategies are provided out of the box:

  • PostgresSequences – transfers data about Postgres sequences which are not attached to tables.

Anonymisers

Anonymisers are configured by field name, and by model and field name.

Each anonymiser is a function that takes a number of kwargs with useful context and returns a new value, compatible with the Django JSON encoder/decoder.

The signature for an anonymiser is:

def anonymise(*, obj: Model, field: str, pii_value: Any, fake: Faker) -> Any:
    ...

There are several anonymisers provided to use or to build off:

  • faker_anonymise – Use faker to anonymise this field with the provided generator, e.g. faker_anonymise('pyint', min_value=15, max_value=85).
  • const – anonymise to a constant value, e.g. const('ch_XXXXXXXX').
  • random_foreign_key – anonymise to a random foreign key.

django-devdata's anonymisation is not intended to be perfect, but rather to be a reasonable default for creating useful data. Structure in data can in some cases be used to de-anonymise users with advanced techniques; django-devdata does not attempt to solve for this, as most attackers, users, and legislators are more concerned with obviously personally identifiable information such as names and email addresses. This anonymisation is also no replacement for at-rest encryption on development machines with tools like FileVault or BitLocker.

An example of this pragmatism in anonymisation is the preserve_nulls argument taken by some built-in anonymisers. This goes against true anonymisation, but the absence of data is typically not of much use to attackers (or concern for users), if the actual data is anonymised, while this can be of huge benefit to developers in maintaining data consistency.
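As a sketch, a custom anonymiser following the documented signature might hash an email to a stable fake address while preserving nulls (this function is hypothetical, not part of django-devdata):

```python
import hashlib

def anonymise_email(*, obj, field, pii_value, fake):
    # Preserve missing data, mirroring the preserve_nulls pragmatism above.
    if pii_value is None:
        return None
    # Hash to a stable, non-identifying address: the same input always maps
    # to the same fake value, which keeps data consistent across rows.
    digest = hashlib.sha256(pii_value.encode()).hexdigest()[:8]
    return f"user-{digest}@example.com"
```

Because the output is deterministic, rows that shared an email before anonymisation still share one afterwards, which helps keep joins and deduplication logic working in development.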

Settings

django-devdata makes heavy use of Django settings for both defining how it should act for your site, and also for configuring how you'll use your workflow.

"""
django-devdata default settings, with documentation on usage.
"""

# Required
# A mapping of app model label to list of strategies to be used.
DEVDATA_STRATEGIES = ...
# {'auth.User': [QuerySetStrategy(name='all')], 'sessions.Session': []}

# Optional
# A list of strategies for transferring data about a database which are not
# captured in the tables themselves.
DEVDATA_EXTRA_STRATEGIES = ...
# [
#   ('devdata.extras.PostgresSequences', {}),
# ]

# Optional
# A mapping of field name to an anonymiser to be used for all fields with that
# name.
DEVDATA_FIELD_ANONYMISERS = {}
# {'first_name': faker_anonymise('first_name'), 'ip': const('127.0.0.1')}

# Optional
# A mapping of app model label to a mapping of fields and anonymisers to be
# scoped to just that model.
DEVDATA_MODEL_ANONYMISERS = {}
# {'auth.User': {'first_name': faker_anonymise('first_name')}}

# Optional
# List of locales to be used for Faker in generating anonymised data.
DEVDATA_FAKER_LOCALES = None
# ['en_GB', 'en_AU']

# Optional
# In many codebases, only a few models do most of the work to restrict the
# total export size – only taking a few users, or a few comments – while for
# many other models a default behaviour of taking everything (subject to the
# restrictions imposed by other models) would be sufficient. This setting
# allows specifying that default strategy.
# Important:
# - When using this, no errors will be raised if a model is missed from the list
#   of strategies.
# - This strategy is not added to all models, and it does not override an empty
#   list of strategies. It is only used when a model is not defined in the
#   strategy config at all.
DEVDATA_DEFAULT_STRATEGY = None

Strategies can be defined either as a strategy instance, or as a tuple of dotted path and kwargs; for example, the following are equivalent:

DEVDATA_STRATEGIES = {
    'auth.User': [
        QuerySetStrategy(name='all_users'),
    ],
}

DEVDATA_STRATEGIES = {
    'auth.User': [
        ('devdata.strategies.QuerySetStrategy', {'name': 'all_users'}),
    ],
}

This alternate configuration format is provided in cases of extensive use of custom strategies, as strategies often import models, but due to the Django startup process models can't be imported until the settings have been imported.
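The deferral works because the dotted path is only resolved when the strategy is constructed, after Django has started. Django ships this resolution as django.utils.module_loading.import_string; a minimal equivalent looks like:

```python
import importlib

def import_string(dotted_path):
    # Split "pkg.module.Attr" into the module path and the attribute name,
    # import the module, then pull the attribute off it.
    module_path, _, attr_name = dotted_path.rpartition(".")
    module = importlib.import_module(module_path)
    return getattr(module, attr_name)
```

With this in place, a tuple like ('devdata.strategies.QuerySetStrategy', {'name': 'all_users'}) can be turned into a strategy instance lazily, avoiding model imports at settings-load time.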

django-devdata's People

Contributors: danpalmer, peterjclaw

django-devdata's Issues

Support JSONL and incremental importing

Currently the memory usage of devdata during an import can be fairly large -- it scales with the size of the export being imported. This can result in an import failing part way through due to the process being OOMKilled. While ideally devdata would only have smallish data to work with, that isn't always practical.

It would be great if devdata supported exporting to JSONL (i.e. newline-separated JSON), as this can be parsed incrementally and thus should allow imports to happen incrementally too.
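The memory benefit comes from JSONL being decodable one record at a time. A small sketch of the streaming side, independent of devdata:

```python
import io
import json

def iter_jsonl(fh):
    # Yield one decoded record per line; memory use is bounded by the
    # largest single record, not by the whole export file.
    for line in fh:
        line = line.strip()
        if line:
            yield json.loads(line)

# Stand-in for an exported file on disk.
export = io.StringIO('{"pk": 1}\n{"pk": 2}\n{"pk": 3}\n')
pks = [record["pk"] for record in iter_jsonl(export)]
```

An importer consuming such a generator could insert rows in batches and release each batch before reading the next, keeping the process's footprint flat.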

Support for database sequences

It would be great if django-devdata provided support in some form for database sequences. (It would be great if Django did too!)
A couple of ways I think this could work:

  • a list of them configured in settings so that the export command knows which ones to pull from the database
  • autodetection of them from the source database during export

Either way, it would be great if they were present after doing an import. I suspect we'd need to preserve the current value of the sequence too, where possible.

Maybe support use within ephemeral environments

Currently the summary of usage for this package looks like:

(prod)$ python manage.py devdata_export devdata
(prod)$ tar -czf devdata.tar devdata/
(local)$ scp prod:~/devdata.tar devdata.tar.gz
(local)$ tar -xzf devdata.tar.gz
(local)$ python manage.py devdata_import devdata/

This requires that the user can manage the data shuffling themselves, and in turn implicitly requires that both sides of the export & import exist for longer than the Django command. For environments such as Kubernetes, where a pod may only exist while a command is running, this pattern is a little harder to use. It's definitely still possible, though it can require a bit more juggling, especially if trying to script the usage rather than use it interactively.

I was wondering whether it would make sense for devdata to provide some more direct support for this sort of use-case.

Approaches I can think of which might help:

  • let devdata export to/import from a (compressed) tar file directly, removing the need to have tar in the target environment (as an implementation, this might e.g. use a local temporary directory rather than reading the archive for each file)
  • maybe allow for that archive to come from a Django storage backed location
  • and/or allow for the import/export folders to be universal paths

Universal paths could be used on their own, though given the number of files which devdata tends to create that's likely to be quite inefficient -- I suspect that upaths would probably be more useful as a path to an archive file rather than to a directory.

Open to other ideas too, including if there's an easy way to use devdata in kubernetes which I've missed (if so, perhaps we could document this too?)
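Reading an export directly from a compressed archive is straightforward with the standard library; a self-contained sketch (the archive contents here are illustrative, not devdata's real layout):

```python
import io
import json
import tarfile

# Build a tiny in-memory "export" archive standing in for devdata's output.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    payload = json.dumps([{"pk": 1}]).encode()
    info = tarfile.TarInfo(name="devdata/auth.User/all.json")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

# Read members straight out of the archive: no extraction to disk needed.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r:gz") as tar:
    records = {
        member.name: json.load(tar.extractfile(member))
        for member in tar.getmembers()
        if member.isfile()
    }
```

The same approach works against a file-like object fetched from cloud storage, which is what would make the ephemeral-environment flow a single command on each side.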

Support running from an empty database without dropping/re-creating

Devdata is a very useful tool, however in some cases we don't want to drop the underlying database. This happens for example when using devdata to manage a hosted staging environment, where the creation of the database itself is (deliberately) outside of the permissions set available to the user being used to run migrations & devdata.

It would be great if devdata supported this.

Maybe a flag to devdata_import which, instead of dropping the database, checks that there are no tables present (and complains if there are)?

Support Django 4

Discovered while adding CI versions for #9.
Currently the tests fail with: django.db.utils.ProgrammingError: relation "django_migrations" does not exist during export_migration_state.
I don't know if that's just a test setup issue or a genuine change in Django 4. A quick look at the release notes didn't reveal anything obvious unfortunately.

Document use of the testsite for manual testing

This appears to be possible and somewhat useful, though a little non-obvious given that it doesn't have migrations already generated.

In case it's useful, the steps I used were (approximately):

createdb testsite # (ask Postgres to create a bootstrap DB for the site)
export TEST_DATABASE_NAME=testsite

tests/testsite/manage.py makemigrations photofeed polls turtles
tests/testsite/manage.py migrate

tests/testsite/manage.py shell --command '
import datetime
from polls.models import Question
Question.objects.create(question_text="Who are you?", pub_date=datetime.datetime.now())
'

tests/testsite/manage.py devdata_export

TEST_DATABASE_NAME=testsite_import tests/testsite/manage.py devdata_import

Update primary key sequences after import

I think there's a bug where the primary key sequences of imported models aren't updated, and thus users get errors when trying to create rows in their new database due to conflicts with existing rows.

I haven't completely confirmed that this happens with devdata, however I did hit something similar when doing a similar import/export. (I then wondered how devdata got around the issue and couldn't find where it does anything that would sidestep the issue, so I'm guessing the random sampling from the source database is what means this isn't hit frequently).

Fails to fetch model instances when model is restricted to primary keys of a related model.

Models related by an FK to another model fail to fetch records on the first run. For example, where a 'student' has an FK to a 'class', the class is fetched first but then none of the 'student' records are fetched. Debugging shows that even when we have just fetched all 'class' records, when we process the strategy for the 'student' model and run 'get_exported_pks_for_model' for the related 'class' model, no primary keys are returned, and so the 'student' queryset returned by get_queryset contains no records.

Removing the '@functools.lru_cache' decorator from 'get_exported_pks_for_model' and 'get_exported_objects_for_model' fixes the issue.
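The failure mode described is a classic caching-order bug: if the cached lookup runs before the parent model's export has populated anything, the empty result is memoised and returned on every later call. A minimal reproduction of the mechanism, independent of devdata:

```python
import functools

# Mutable state standing in for "what has been exported so far".
exported_pks = {"class": set()}

@functools.lru_cache(maxsize=None)
def get_exported_pks(model):
    # Snapshot whatever has been exported so far; with lru_cache, the
    # result of the first call is frozen in for all subsequent calls.
    return frozenset(exported_pks[model])

first = get_exported_pks("class")   # called before the 'class' export runs
exported_pks["class"].add(1)        # the 'class' export now populates data
second = get_exported_pks("class")  # stale: still the cached empty set
```

Any fix needs either to drop the cache, to invalidate it after each model's export, or to ensure the lookup only ever runs after all relevant exports have completed.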

Improve feedback when misconfiguring a strategy

Currently if you misconfigure a strategy, perhaps by omitting a name or by forgetting to call super().__init__() (passing everything through) in a custom strategy's initialiser, devdata doesn't give much useful feedback.
The error you do get is:

  File ".../devdata/engine.py", line 72, in export_data
    model_strategies = sort_model_strategies(settings.strategies)
  File ".../devdata/utils.py", line 72, in sort_model_strategies
    for dep in strategy.depends_on:
AttributeError: 'tuple' object has no attribute 'depends_on'

Which is unfortunately considerably later in devdata's processing.

I think the fix is likely to change the handling in settings.py, so that the strategy construction logic is more explicitly able to tell the difference between the available valid options and those which are erroneous:

try:
    klass_path, kwargs = strategy
    klass = import_string(klass_path)
    ret[app_model_label].append(klass(**kwargs))
except (ValueError, TypeError, IndexError):
    ret[app_model_label].append(strategy)

Drop legacy Python support

Can we drop support for Python < 3.6? I'd even be tempted to drop < 3.7 to be honest.

I also noticed that this project doesn't declare a dependency on Django(!), however when adding one poetry wants to update the Python requirement to at least 3.6.
