Census Looker

This is a set of scripts to make public Census data downloaded from the U.S. Census Bureau's DataFerrett tool explorable in Looker and Google BigQuery.

There are two scripts, one for uploading the data files that come from DataFerrett into BigQuery and one for transforming the DataFerrett codebook(s) into LookML.

Schema Generator

Because Census datafiles can be extremely wide, it's impractical for users to write out the schemas by hand. When working with multiple datafiles that often differ in subtle ways, users also need a programmatic way to compare the datafiles and combine them into a single table in BigQuery.

The Schema Generator takes the location of one or more datafiles and the names of an equal number of BigQuery tables to upload those files to. Call the script from the command line as follows:

python schema_generator.py -f /Users/Documents/datafile1.csv /Users/Documents/datafile12.csv -t table1 table2

It examines a sample of rows in each file and returns a text file containing a schema for each datafile in the format BigQuery expects. If you're combining multiple datafiles, it also returns a query that unions the datafiles into a single unified table (with NULLs filling in columns that do not exist in all datafiles) so that you can create one table containing all your data.
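To make the sampling step concrete, here is a minimal sketch of how a schema could be inferred from the first rows of a CSV file. It is not the code in schema_generator.py: the function names, the sample size, and the PESEX and GESTFIPS columns are assumptions chosen for illustration, and the output uses BigQuery's JSON schema layout (name, type, mode).

```python
import csv
import io
import json
from itertools import islice


def infer_type(values):
    """Pick the narrowest of INTEGER, FLOAT, STRING that fits every non-empty value."""
    non_empty = [v for v in values if v.strip() != ""]
    if not non_empty:
        return "STRING"  # nothing to go on, so fall back to the safest type
    for bq_type, cast in (("INTEGER", int), ("FLOAT", float)):
        try:
            for v in non_empty:
                cast(v)
            return bq_type
        except ValueError:
            continue
    return "STRING"


def infer_schema(handle, sample_size=100):
    """Read the header plus up to sample_size rows and guess a type per column."""
    reader = csv.reader(handle)
    header = next(reader)
    sample = list(islice(reader, sample_size))
    schema = []
    for i, name in enumerate(header):
        column = [row[i] for row in sample if i < len(row)]
        schema.append({"name": name, "type": infer_type(column), "mode": "NULLABLE"})
    return schema


# Hypothetical two-row datafile; real DataFerrett extracts are far wider.
example = io.StringIO("PESEX,PWCMPWGT,GESTFIPS\n1,1523.75,6\n2,,36\n")
schema = infer_schema(example)
print(json.dumps(schema, indent=2))
```

Because only a sample is inspected, a column that looks numeric in the first rows may still contain text further down, so a larger sample size trades speed for safety.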

You can find sample input and output files in the samples directory.
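The union query described above can be sketched as follows. This is an illustration rather than the script's actual output: it assumes BigQuery standard SQL (UNION ALL, INT64/FLOAT64 type names), and the function name, column names, and unqualified table names are hypothetical.

```python
def build_union_query(schemas):
    """Build a UNION ALL query over tables whose columns only partly overlap.

    schemas maps a table name to a list of (column, type) pairs.
    """
    # Collect every column seen in any table, keeping first-seen order.
    all_columns = {}
    for columns in schemas.values():
        for name, col_type in columns:
            all_columns.setdefault(name, col_type)

    selects = []
    for table, columns in schemas.items():
        present = {name for name, _ in columns}
        parts = []
        for name, col_type in all_columns.items():
            if name in present:
                parts.append(name)
            else:
                # Column is missing from this table: fill it with a typed NULL.
                parts.append(f"CAST(NULL AS {col_type}) AS {name}")
        selects.append(f"SELECT {', '.join(parts)} FROM {table}")
    return "\nUNION ALL\n".join(selects)


query = build_union_query({
    "table1": [("PESEX", "INT64"), ("PWCMPWGT", "FLOAT64")],
    "table2": [("PESEX", "INT64"), ("PES1", "INT64")],
})
print(query)
```

Every SELECT lists the columns in the same order, which UNION ALL requires since it matches columns by position rather than by name.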

Codebook Parser

DataFerrett dynamically creates a codebook based on which data source and columns you download. This codebook is in plaintext (see the example in samples), and if you download data from multiple data sources, you'll end up with multiple codebooks.

Cross-referencing the data values against each codebook by hand quickly becomes unwieldy for larger datasets. That's where this parser comes in. It takes the location(s) of the codebook(s) you've downloaded, the name(s) of the table(s) the underlying data lives in on BigQuery, the name of the weighting variable (since this data is generally sampled and each respondent is assigned a "weight" to represent some number of Americans like them), and an optional flag to merge the codebooks into one set of LookML files.

Calling the script from the command line looks like this:

python codebook_parser.py -f "/Users/Documents/codebook1.txt" "/Users/Documents/codebook2.txt" -t "table1" "table2" -m "PWCMPWGT" -o merge

When you run this script, the output is three LookML view files (if you've merged the codebooks) or three view files for each input file (if you haven't merged them). The first contains all the variables from the codebook(s), rewritten as LookML dimensions. The second (denoted with _filters) contains all the variables from the codebook(s), rewritten as LookML filter-only dimensions. The third (denoted with _measures) contains measures to calculate the weighted population of the cohort and the weighted population of the group.
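As a rough idea of what "rewritten as LookML dimensions" can mean, the sketch below renders one coded variable as a dimension whose SQL is a CASE expression mapping codes to labels. It is an assumption-laden illustration, not the parser's real output: the function name is hypothetical, the PESEX variable and its codes are example values, the syntax is the brace-based LookML format, and labels are assumed to contain no quote characters.

```python
def render_dimension(variable, label, codes):
    """Render one coded variable as a LookML dimension (illustrative only).

    codes maps a numeric code from the codebook to its human-readable label.
    """
    whens = "\n".join(
        f"      WHEN ${{TABLE}}.{variable} = {code} THEN '{text}'"
        for code, text in sorted(codes.items())
    )
    return (
        f"dimension: {variable.lower()} {{\n"
        f'  label: "{label}"\n'
        f"  type: string\n"
        f"  sql: CASE\n{whens}\n      END ;;\n"
        f"}}"
    )


lookml = render_dimension("PESEX", "Sex", {1: "Male", 2: "Female"})
print(lookml)
```

Generating the CASE expression from the codebook is what removes the manual cross-referencing: the code-to-label mapping lives in the view file instead of in the analyst's head.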

Once loaded into Looker, these files let you dimensionalize the data and select any combination of dimensions to define your cohort. They also let you set the filter-only fields to independently define your group's characteristics. You can then see how many people meet the filters of your cohort and your group, which allows you to ask questions of the form "How many of [cohort] are in [group]?"

This is useful because it allows you to, for example, define your cohort as all women and your group as voters, allowing you to ask "What percentage of women are voters?" To see the power of this approach in action, visit http://census.looker.com
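The arithmetic behind a question like "What percentage of women are voters?" can be sketched in plain Python. The rows and weights below are made up for illustration, and the field and function names are hypothetical; in practice Looker computes these sums in BigQuery using the weighting variable passed to the parser.

```python
def weighted_share(rows, in_cohort, in_group, weight_key="weight"):
    """Return the weighted fraction of the cohort that also belongs to the group."""
    cohort_total = sum(r[weight_key] for r in rows if in_cohort(r))
    both_total = sum(r[weight_key] for r in rows if in_cohort(r) and in_group(r))
    # An empty cohort has no meaningful share.
    return both_total / cohort_total if cohort_total else None


# Made-up respondents: each weight is the number of people the row represents.
respondents = [
    {"sex": "female", "voted": True, "weight": 1500.0},
    {"sex": "female", "voted": False, "weight": 1000.0},
    {"sex": "male", "voted": True, "weight": 2000.0},
    {"sex": "female", "voted": True, "weight": 1500.0},
]

share = weighted_share(
    respondents,
    in_cohort=lambda r: r["sex"] == "female",  # cohort: all women
    in_group=lambda r: r["voted"],             # group: voters
)
print(f"{share:.1%}")
```

Here the cohort's weighted population is 4000 and the weighted population in both cohort and group is 3000, so the share is 75.0%. Counting rows instead of summing weights would give a different and misleading answer whenever weights vary across respondents.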

Contributors

dwmintz, grahamkenville
