Git Product home page Git Product logo

census's Introduction

U.S. Census and American Community Survey

This is a simple Python library to download, parse, and store data from the U.S. Census and American Community Survey. The library can be run stand-alone but is also designed to work with Cassandra or Hadoop (via Ferry).

Prequisites

This application is much more interesting if you have Ferry installed. Ferry lets you create "big data" clusters on your local machine using high-level commands. That means you can install Hadoop, Cassandra, or OpenMPI without already being an expert in those systems. If you want to try out Ferry, you can grab the Vagrant box like this:

$ vagrant box add opencore/ferry https://s3.amazonaws.com/opencore/ferry.box
$ vagrant init opencore/ferry
$ vagrant up

(If you don't have Vagrant installed, you should probably install that first). At this point, I suggest taking a detour to the Ferry installation page.

Installation

Whether you're using Ferry or not, you'll need to download the library:

$ git clone https://github.com/jhorey/census

Now if you want to run the library with Hadoop, type:

$ cd census
$ ./make.sh hadoop

Similarly, if you want to run the library with Cassandra, type:

$ cd census
$ ./make.sh cassandra

Of course if you really don't want to use Ferry and would rather install the package locally, type:

$ cd census
$ sudo python setup.py install

Now time to actually populate our database!

U.S. Census API

First you'll need an API key from the U.S. Census. You can request one here. If you're using Ferry for either Hadoop or Cassandra, log into the connector:

$ ferry ssh sa-xxyyzz

The sa-xxyyzz is the unique ID of your Hadoop (or Cassandra) cluster. You should have seen that output from the ./make.sh command.

After logging into your connector, place the API key in ~/.censuskey. Now you can start downloading some data. Here are a sequence of commands to download data and store the data in Hadoop. a

$ census download pop Tennessee
$ census download pop Kentucy
$ census download econ Tennessee
$ census download econ Kentucky
$ census upload hive Tennessee
$ census upload hive Kentucky

If you chose the Cassandra stack, replace hive with cassandra. You can see that the census command is pretty simple. You can download either pop (population) or econ (economic) data on a state-by-state basis.

After uploading the data into Hive, you can invoke hive to play around with the dataset.

$ hive -e 'select * from economic;'

Similarly, the Cassandra client comes with cqlsh which is very similar to a SQL prompt.

Status

  • Right now, the package is limited to two datasets. The population data from the U.S. Census (organized by race), and economic data from the American Community Survey (mean, median, and per capita). The datasets are organized at the county level.
  • These datasets are currently stored in either Cassandra or Hadoop (accessible via Hive). I'm planning on supporting other backends as they become availabe in Ferry.

Why?

Although storing these particular datasets in either Cassandra or Hadoop seems a bit overkill, it's still useful for the following reasons:

  • Educational | the Census and ACS datasets are interesting and you can use the Cassandra backend as part of a web application
  • Integration | the datasets can be combined with other datasets (e.g., sales, weather, etc.) in Hadoop as part of a data warehouse

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.