datamade / data-making-guidelines

:blue_book: Making Data, the DataMade Way

License: MIT License

data-making-guidelines's Introduction

Making Data, the DataMade Way

This is DataMade's guide to extracting, transforming and loading (ETL) data using Make, a common command-line utility.

This guide is part of a body of technical and process documentation maintained by DataMade. Head over to datamade/how-to for other guides on topics ranging from AWS to work practices!

What is ETL?

ETL refers to the general process of:

taking raw source data ("Extract")
doing some stuff to get the data in shape, possibly involving intermediate derived files ("Transform")
producing final output in a more usable form (for "Loading" into something that consumes the data - be it an app, a system, a visualization, etc.)

Having a standard ETL workflow helps us make sure that our work is clean, consistent, and easy to reproduce. By following these guidelines you'll be able to keep your work up to date and share it with the world in a standard format - all with as few headaches as possible.

Basic Principles

These five principles inform all of our data work:

Never destroy data - treat source data as immutable, and show your work when you modify it
Be able to deterministically produce the final data with one command
Write as little custom code as possible
Use standard tools whenever possible
Keep source data under version control

Unsure how to follow these principles? Read on!
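
As a rough illustration of how the principles can fit together, here is a minimal Makefile sketch. The file names (raw/inspections.csv, finished/inspections.csv) are hypothetical and not taken from any DataMade project; the sketch assumes the raw file is already committed to the repository.

```make
# Hypothetical file names, for illustration only.
# Running `make` rebuilds the final output from the raw data in one command.

.PHONY: all clean
all: finished/inspections.csv

# Transform: keep the header row, then sort and de-duplicate the remaining rows.
# The raw file is only ever read, never modified.
finished/inspections.csv: raw/inspections.csv
	mkdir -p $(dir $@)
	(head -n 1 $<; tail -n +2 $< | sort -u) > $@

# Derived files can always be thrown away and rebuilt; raw/ is left alone.
clean:
	rm -rf finished
```

Because the recipe uses only standard tools (head, tail, sort), there is no custom code to maintain, and anyone with the repository can reproduce the output.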

The Guide

Code examples

Some Annotated ETL Code Examples with Make
Recipes for Common Makefile Operations
Chicago Lead - data work with a clear README and Makefile
EITC Works - adding data attributes to Illinois House and Senate district shapefiles and outputting as GeoJSON

data-making-guidelines's Issues

Add recipe for handling permutations of the same thing to common recipes

patsubst rules!
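
One possible shape for such a recipe, sketched under the assumption of a hypothetical layout with one raw CSV per year in raw/ and one cleaned CSV per year in finished/:

```make
# Hypothetical layout: raw/2015.csv, raw/2016.csv, ... (names are assumptions).
RAW := $(wildcard raw/*.csv)

# Map raw/2015.csv -> finished/2015.csv, and so on for every raw file found.
FINISHED := $(patsubst raw/%.csv,finished/%.csv,$(RAW))

.PHONY: all
all: $(FINISHED)

# A single pattern rule covers every permutation of the same operation.
finished/%.csv: raw/%.csv
	mkdir -p $(dir $@)
	(head -n 1 $<; tail -n +2 $< | sort -u) > $@
```

Adding another year then only requires dropping a new file into raw/; the target list is recomputed each time make runs.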

add NPR All Songs Considered poll makefile to examples

We should add this, and possibly other makefiles by outside groups to our list of examples

repo: https://github.com/nprapps/allsongsconsidered-poll/
blog post: http://blog.apps.npr.org/2018/01/03/all-songs-considered-poll.html

Standard Toolkit Item for JSON?

I'd suggest adding a tool like jq to the standard toolkit for working with JSON. It's my go-to utility for working with JSON and getting data from APIs that I need to load into another system or tool.

Fits in very nicely to my command line toolset.
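
For instance, jq could slot into a Makefile roughly like this. The URL and the field names (id, name) are placeholders, and the sketch assumes the API returns a JSON array of objects:

```make
# Hypothetical API URL and field names, for illustration only.
raw/stations.json:
	mkdir -p $(dir $@)
	curl -sL -o $@ "https://example.com/api/stations.json"

# Flatten the array of objects into CSV rows: one row per object.
finished/stations.csv: raw/stations.json
	mkdir -p $(dir $@)
	jq -r '.[] | [.id, .name] | @csv' $< > $@
```

The -r flag makes jq emit raw strings rather than JSON-quoted ones, which is what the @csv formatter needs to produce usable CSV lines.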

consider updating suggested directory structure

i'm not sure if it's just me, but it seems like we typically use a file structure more like this than the one we suggest in this repo. is this true? and if so, we oughta update this!

update recommended makefile directives

our prior recommendation, removed in 4bb8d76, caused issues with makefiles on ubuntu 16.x (xenial), i.e., on our staging server, such that all recipes failed with the error "No such file or directory". do some research on directives, identify the culprit, and add a new, portable set of default directives.

Handling Remote Data

Was wondering if you have been able to integrate make into workflows that require working with remote services/servers? I've generally had to turn to Drake or Luigi in these use cases, but being able to use make in this situation would be great...

Let's split these into three files

Intro to Make
DataMade style
Common Recipes

Add patterns for making databases

We've got great documentation for generating files, but we could use more detail about a concise and Make-ish approach for building databases, e.g., writing SQL-heavy Makefiles. A really great example lives over in datamade/school-report-cards.

Most notably, it involves this handy lil' Make function that checks whether a table exists before trying to create it. That lets you include database-level recipes in your overall flow without using dummy .table files or having things break all the time.
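
A sketch of what such a function might look like follows. This is not the function from datamade/school-report-cards; the database name, table name, and columns are assumptions, and it relies on PostgreSQL's to_regclass (available in 9.4 and later).

```make
# Assumed database name; override with `make PG_DB=other_db`.
PG_DB ?= example_db

# Exits 0 if a relation named after the current target exists.
# to_regclass returns NULL (an empty line under -tA) when it does not.
define check_relation
psql -d $(PG_DB) -tAc "SELECT to_regclass('$@')" | grep -q .
endef

# Hypothetical table: the target name doubles as the table name.
# The table is created and loaded only when it does not exist yet.
.PHONY: schools
schools: raw/schools.csv
	$(check_relation) || ( \
	  psql -d $(PG_DB) -c "CREATE TABLE $@ (id TEXT, name TEXT)" && \
	  psql -d $(PG_DB) -c "\copy $@ FROM '$<' WITH (FORMAT csv, HEADER)" )
```

Because the check runs against the database itself, the target can be phony and still safe to run repeatedly, with no placeholder file needed to track whether the table was built.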

Let's add a Makefile 301 (since database-building seems to be a non-standard application of Make), or a database-specific appendix, documenting some good patterns. (I'm happy to take the lead!)

(We should link to the school-report-cards repo in the code examples, too!)

Some good ideas in here to steal

https://docs.google.com/presentation/d/18KE-VO9T6V1I_aGyekdDtFhYP4K0Saph7aBuBS3N8tc/edit#slide=id.p

Add PDF tools to standard toolkit

poppler-utils for single-value extraction, PDF info + uniting + separating, tabula-java for full PDF scraping.
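
A few hedged examples of how the poppler-utils commands could appear in a Makefile (file names are hypothetical; tabula-java is left out of this sketch):

```make
# Hypothetical file names, for illustration only.

# Plain-text dump that preserves layout, handy for pulling out single values with grep.
finished/report.txt: raw/report.pdf
	mkdir -p $(dir $@)
	pdftotext -layout $< $@

# PDF info: page count and other metadata.
finished/report-info.txt: raw/report.pdf
	mkdir -p $(dir $@)
	pdfinfo $< > $@

# Uniting: combine several PDFs into one, in prerequisite order.
finished/combined.pdf: raw/part-1.pdf raw/part-2.pdf
	mkdir -p $(dir $@)
	pdfunite $^ $@
```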

add making data 001

writing a makefile is really hard if you don't know bash. so is knowing when to write custom processors.

writing custom processors is hard if you aren't familiar with our patterns for doing so.

working with a database is hard if you don't know how to get into it, or navigate it once you're in.

this repo assumes knowledge of most of the above, but it doesn't have to. i started drafts of quick guides to all of these things. let's finish them and add them for the reference of others!

Review this with fresh eyes.

Regina and Jack will walk through these guidelines and provide feedback.

does make run on windows?

macs + linux come with it, per our tutorial. does it always need to be installed on windows? if yes, might be worth adding a line to the README or makefile 101 about how to get make if you don't have it. using this link?

Add a pointer to how-to

This repo is the OG DataMade technical documentation, and as such, it's the entry point for a lot of folks! We maintain a growing body of similar documentation on topics ranging from Airflow to UX. Let's add a pointer to https://github.com/datamade/how-to in the README, so that work is more visible to interested parties.

Add a tutorial

These guidelines need more interactive, full-immersion fun. Let's create a didactic tutorial for a rounder learning experience.

fix annotated example

the hover functionality isn't working as expected