data-making-guidelines's Introduction

Making Data, the DataMade Way

This is DataMade's guide to extracting, transforming and loading (ETL) data using Make, a common command line utility.

This guide is part of a body of technical and process documentation maintained by DataMade. Head over to datamade/how-to for other guides on topics ranging from AWS to work practices!

What is ETL?

ETL refers to the general process of:

  1. taking raw source data ("Extract")
  2. doing some stuff to get the data in shape, possibly involving intermediate derived files ("Transform")
  3. producing final output in a more usable form (for "Loading" into something that consumes the data - be it an app, a system, a visualization, etc.)
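
As a minimal sketch, the three steps above might map onto a Makefile like the one below. All file names (`raw/schools.csv`, `build/schools_deduped.csv`, `output/schools.csv`) are hypothetical, and the "extract" step is assumed to be a source file already committed to the repository. Recipe lines must begin with a tab.

```make
# Hypothetical file names throughout; recipe lines must begin with a tab.

# "Load": the final file that an app or visualization would consume.
# It is the first target, so running plain `make` builds it.
output/schools.csv: build/schools_deduped.csv
	mkdir -p $(dir $@)
	cp $< $@

# "Transform": an intermediate derived file, built from the raw source.
# Keeps the header row and drops duplicate data rows.
build/schools_deduped.csv: raw/schools.csv
	mkdir -p $(dir $@)
	(head -n 1 $<; tail -n +2 $< | sort -u) > $@

# "Extract": raw/schools.csv is assumed to be checked in as-is,
# so no rule is needed to produce it.
```

Because each rule names its input as a prerequisite, Make rebuilds only the files that are older than their sources.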

Having a standard ETL workflow helps us make sure that our work is clean, consistent, and easy to reproduce. By following these guidelines you'll be able to keep your work up to date and share it with the world in a standard format - all with as few headaches as possible.

Basic Principles

These five principles inform all of our data work:

  1. Never destroy data - treat source data as immutable, and show your work when you modify it
  2. Be able to deterministically produce the final data with one command
  3. Write as little custom code as possible
  4. Use standard tools whenever possible
  5. Keep source data under version control
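
One way these principles could look in practice is sketched below. The file names and the `GENERATED` variable are hypothetical, and the `cut`-based column split assumes a simple CSV with no quoted commas. Recipe lines must begin with a tab.

```make
# Hypothetical file names; recipe lines must begin with a tab.
GENERATED = build/counts.txt

.PHONY: all clean

# Principle 2: plain `make` rebuilds every final file.
all: $(GENERATED)

# Principles 3 and 4: standard tools (tail, cut, sort, uniq) instead of
# a custom script. Counts how often each value appears in column 2.
build/counts.txt: raw/schools.csv
	mkdir -p $(dir $@)
	tail -n +2 $< | cut -d, -f2 | sort | uniq -c > $@

# Principle 1: clean deletes derived files only; raw/ is never touched.
# Principle 5: raw/schools.csv is assumed to be committed to the repo.
clean:
	rm -f $(GENERATED)
```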

Unsure how to follow these principles? Read on!

The Guide

  1. Make & Makefile Overview
  2. ETL Styleguide

Code examples

Further reading

data-making-guidelines's People

Contributors

cathydeng, derekeder, fgregg, gitter-badger, hancush, jeancochrane, jsvine

data-making-guidelines's Issues

Standard Toolkit Item for JSON?

I'd suggest adding a tool like jq to the standard toolkit for working with JSON. It's my go-to utility for working with JSON and getting data from APIs that I need to load into another system or tool.

It fits very nicely into my command line toolset.

update recommended makefile directives

our prior recommendation, removed in 4bb8d76, caused issues with makefiles on ubuntu 16.x (xenial), i.e., on our staging server, such that all recipes failed with the error "No such file or directory". do some research on directives, identify the culprit, and add a new, portable set of default directives.

Handling Remote Data

Was wondering if you have been able to integrate make into workflows that require working with remote services/servers? I've generally had to turn to Drake or Luigi in these use cases, but being able to use make in this situation would be great...

Add patterns for making databases

We've got great documentation for generating files, but we could use more detail about a concise and Make-ish approach for building databases, e.g., writing SQL-heavy Makefiles. A really great example lives over in datamade/school-report-cards.

Most notably, it involves this handy lil' Make function that checks whether a table exists before trying to create it. That lets you include database-level recipes in your overall flow without using dummy .table files or having things break all the time.
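
A rough sketch of this kind of table-existence check follows. It is not the function from datamade/school-report-cards, which is not reproduced here. It assumes PostgreSQL's `psql` client and that `psql -c "\d name"` exits non-zero when no relation with that name exists. The database name, table name, and columns are all hypothetical.

```make
# Hypothetical database name.
PG_DB = etl_sketch

# Expands to "psql ... ||", so the command that follows it in a recipe
# runs only when the relation named by the target ($@) is not found.
define check_relation
psql -d $(PG_DB) -c "\d $@" > /dev/null 2>&1 ||
endef

# The target is a table, not a file, so it is declared phony and the
# existence check decides whether any work is done.
.PHONY: schools
schools:
	$(check_relation) psql -d $(PG_DB) -c "CREATE TABLE $@ (id INTEGER, name TEXT)"
```

With a guard like this, running `make schools` a second time should skip the `CREATE TABLE` statement rather than fail on an existing table.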

Let's add a Makefile 301 (since database-building seems to be a non-standard application of Make), or a database-specific appendix, documenting some good patterns. (I'm happy to take the lead!)

(We should link to the school-report-cards repo in the code examples, too!)

add making data 001

writing a makefile is really hard if you don't know bash. so is knowing when to write custom processors.

writing custom processors is hard if you aren't familiar with our patterns for doing so.

working with a database is hard if you don't know how to get into it, or navigate it once you're in.

this repo assumes knowledge of most of the above, but it doesn't have to. i started drafts of quick guides to all of these things. let's finish them and add them for the reference of others!

does make run on windows?

macs + linux come with it, per our tutorial. does it always need to be installed on windows? if yes, might be worth adding a line to the README or makefile 101 about how to get make if you don't have it. using this link?

Add a pointer to how-to

This repo is the OG DataMade technical documentation, and as such, it's the entry point for a lot of folks! We maintain a growing body of similar documentation of topics ranging from Airflow to UX. Let's add a pointer to https://github.com/datamade/how-to in the README, so that work is more visible to interested parties.

Add a tutorial

These guidelines need more interactive, full-immersion fun. Let's create a didactic tutorial for a rounder learning experience.
