census-data-aggregator

Combine U.S. census data responsibly

Installation

pipenv install census-data-aggregator

Usage

Import the library.

import census_data_aggregator

Approximating sums

Sum together estimates from the U.S. Census Bureau and approximate the combined margin of error. Follows the bureau's official guidelines for how to calculate a new margin of error when totaling multiple values. Useful for aggregating census categories and geographies.

Accepts an open-ended set of paired lists, each expected to provide an estimate followed by its margin of error.

males_under_5, males_under_5_moe = 10154024, 3778
females_under_5, females_under_5_moe = 9712936, 3911
census_data_aggregator.approximate_sum(
    (males_under_5, males_under_5_moe), (females_under_5, females_under_5_moe)
)
19866960, 5437.757350231803

Approximating means

Estimate a mean and approximate the margin of error.

The Census Bureau guidelines do not provide instructions for approximating a mean using data from the ACS. Instead, we implement our own simulation-based approach.

Expects a list of dictionaries that divide the full range of data values into continuous categories. Each dictionary should have four keys:

key   value
min   The minimum value of the range
max   The maximum value of the range
n     The number of people, households or other units in the range
moe   The margin of error for the number of units in the range

income = [
    dict(min=0, max=9999, n=7942251, moe=17662),
    dict(min=10000, max=14999, n=5768114, moe=16409),
    dict(min=15000, max=19999, n=5727180, moe=16801),
    dict(min=20000, max=24999, n=5910725, moe=17864),
    dict(min=25000, max=29999, n=5619002, moe=16113),
    dict(min=30000, max=34999, n=5711286, moe=15891),
    dict(min=35000, max=39999, n=5332778, moe=16488),
    dict(min=40000, max=44999, n=5354520, moe=15415),
    dict(min=45000, max=49999, n=4725195, moe=16890),
    dict(min=50000, max=59999, n=9181800, moe=20965),
    dict(min=60000, max=74999, n=11818514, moe=30723),
    dict(min=75000, max=99999, n=14636046, moe=49159),
    dict(min=100000, max=124999, n=10273788, moe=47842),
    dict(min=125000, max=149999, n=6428069, moe=37952),
    dict(min=150000, max=199999, n=6931136, moe=37236),
    dict(min=200000, max=1000000, n=7465517, moe=42206),
]
census_data_aggregator.approximate_mean(income)
(98045.44530685373, 194.54892406267754)

Note that this function expects you to submit a lower bound for the smallest bin and an upper bound for the largest bin. This is often not available for ACS datasets like income. We recommend experimenting with different lower and upper bounds to assess their effect on the resulting mean.
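For example, you can cap the open-ended top bin at a different value and rerun the estimate to see how sensitive the result is. The 2,000,000 cap below is arbitrary.

import copy

wider = copy.deepcopy(income)
wider[-1]["max"] = 2000000  # try a higher cap on the top bin
census_data_aggregator.approximate_mean(wider)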

By default the simulation is run 50 times, which can take as long as a minute. The number of simulations can be changed by setting the simulations keyword argument.

census_data_aggregator.approximate_mean(income, simulations=10)

The simulation assumes a uniform distribution of values within each bin. In some cases, like income, it is common to assume the Pareto distribution in the highest bin. You can employ it here by passing True to the pareto keyword argument.

census_data_aggregator.approximate_mean(income, pareto=True)
(60364.96525340687, 58.60735554621351)

Also, due to the stochastic nature of the simulation approach, you will need to set a seed before running this function to ensure replicability.

import numpy

numpy.random.seed(711355)
census_data_aggregator.approximate_mean(income, pareto=True)
(60364.96525340687, 58.60735554621351)
numpy.random.seed(711355)
census_data_aggregator.approximate_mean(income, pareto=True)
(60364.96525340687, 58.60735554621351)

Approximating medians

Estimate a median and approximate the margin of error. Follows the U.S. Census Bureau's official guidelines for estimation. Useful for generating medians for measures like household income and age when aggregating census geographies.

Expects a list of dictionaries that divide the full range of data values into continuous categories. Each dictionary should have three keys:

key   value
min   The minimum value of the range
max   The maximum value of the range
n     The number of people, households or other units in the range

household_income_la_2013_acs1 = [
    dict(min=2499, max=9999, n=1382),
    dict(min=10000, max=14999, n=2377),
    dict(min=15000, max=19999, n=1332),
    dict(min=20000, max=24999, n=3129),
    dict(min=25000, max=29999, n=1927),
    dict(min=30000, max=34999, n=1825),
    dict(min=35000, max=39999, n=1567),
    dict(min=40000, max=44999, n=1996),
    dict(min=45000, max=49999, n=1757),
    dict(min=50000, max=59999, n=3523),
    dict(min=60000, max=74999, n=4360),
    dict(min=75000, max=99999, n=6424),
    dict(min=100000, max=124999, n=5257),
    dict(min=125000, max=149999, n=3485),
    dict(min=150000, max=199999, n=2926),
    dict(min=200000, max=250001, n=4215),
]

For a margin of error to be returned, a sampling percentage must be provided to calculate the standard error. The sampling percentage represents the proportion of the population that participated in the survey. Here are the values for some common census surveys.

survey           sampling percentage
One-year PUMS    1
One-year ACS     2.5
Three-year ACS   7.5
Five-year ACS    12.5

census_data_aggregator.approximate_median(
    household_income_la_2013_acs1, sampling_percentage=2.5
)
70065.84266055046, 3850.680465234964

If you do not provide a sampling percentage, no margin of error will be returned.

census_data_aggregator.approximate_median(household_income_la_2013_acs1)
70065.84266055046, None

If the data being approximated comes from PUMS, an additional design factor must also be provided. The design factor is a statistical input used to tailor the estimate to the variance of the dataset. Find the value for the dataset you are estimating by referring to the bureau's reference material.
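For example, assuming a one-year PUMS sampling percentage of 1 and an illustrative design factor of 1.5 (this value is made up for demonstration; look up the real one in the bureau's reference material):

census_data_aggregator.approximate_median(
    household_income_la_2013_acs1, sampling_percentage=1, design_factor=1.5
)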

Approximating percent change

Calculates the percent change between two estimates and approximates its margin of error. Follows the bureau's ACS handbook.

Accepts two paired lists, each expected to provide an estimate followed by its margin of error. The first input should be the earlier estimate in the comparison. The second input should be the later estimate.

Returns both values as percentages (i.e., proportions multiplied by 100).

single_women_in_fairfax_before = 135173, 3860
single_women_in_fairfax_after = 139301, 4047
census_data_aggregator.approximate_percentchange(
    single_women_in_fairfax_before, single_women_in_fairfax_after
)
3.0538643072211165, 4.198069852261231

Approximating products

Calculates the product of two estimates and approximates its margin of error. Follows the bureau's ACS handbook.

Accepts two paired lists, each expected to provide an estimate followed by its margin of error.

owner_occupied_units = 74506512, 228238
single_family_percent = 0.824, 0.001
census_data_aggregator.approximate_product(owner_occupied_units, single_family_percent)
61393366, 202289

Approximating proportions

Calculate an estimate's proportion of another estimate and approximate the margin of error. Follows the bureau's ACS handbook. Simply multiply the result by 100 for a percentage. Recommended when the first value is smaller than the second.

Accepts two paired lists, each expected to provide an estimate followed by its margin of error. The numerator goes in first. The denominator goes in second. In cases where the numerator is not a subset of the denominator, the bureau recommends using the approximate_ratio method instead.

single_women_in_virginia = 203119, 5070
total_women_in_virginia = 630498, 831
census_data_aggregator.approximate_proportion(
    single_women_in_virginia, total_women_in_virginia
)
0.322, 0.008

Approximating ratios

Calculate the ratio between two estimates and approximate its margin of error. Follows the bureau's ACS handbook.

Accepts two paired lists, each expected to provide an estimate followed by its margin of error. The numerator goes in first. The denominator goes in second. In cases where the numerator is a subset of the denominator, the bureau recommends using the approximate_proportion method instead.

single_men_in_virginia = (226840, 5556)
single_women_in_virginia = (203119, 5070)
census_data_aggregator.approximate_ratio(
    single_men_in_virginia, single_women_in_virginia
)
1.117, 0.039

A note from the experts

The California State Data Center's Demographic Research Unit notes:

The user should be aware that the formulas are actually approximations that overstate the MOE compared to the more precise methods based on the actual survey returns that the Census Bureau uses. Therefore, the calculated MOEs will be higher, or more conservative, than those found in published tabulations for similarly-sized areas. This knowledge may affect the level of error you are willing to accept.

The American Community Survey's handbook adds:

As the number of estimates involved in a sum or difference increases, the results of the approximation formula become increasingly different from the [standard error] derived directly from the ACS microdata. Users are encouraged to work with the fewest number of estimates possible.

References

This module was designed to conform with the Census Bureau's April 18, 2018, presentation "Using American Community Survey Estimates and Margin of Error," the bureau's PUMS accuracy statement, the California State Data Center's 2016 edition of "Recalculating medians and their margins of error for aggregated ACS data," and Chapter 8 of the Census Bureau's ACS 2018 General Handbook, "Calculating Measures of Error for Derived Estimates."

census-data-aggregator's People

Contributors

dependabot[bot], irisslee, nkrishnaswami, palewire, sastoudt

census-data-aggregator's Issues

Correct handling of jam values in median approximation

Thanks to some clarification from our Census friends:

The jam value represents a result from a median calculation when the median can't actually be calculated because it lies in the lowest or highest bin. The jam value is not used in the median calculation itself as a lower or upper bound for the end bins.

This information doesn't impact the calculations of the examples we have now (we've treated the jam value as a bound), but we need to update the median function to handle the scenario where the lower and upper bins don't have concrete bounds (plus add examples of this scenario).

We may want to include an optional input jam_value to use in the case that the median occurs in the highest/lowest bin.

An "aggregation" tool

Accept a list of values and margins and, using the approximation methods in this library, return the combined value with its estimated margin of error.
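A minimal sketch of what such a helper might look like, built on the library's existing approximate_sum function; the aggregate name is hypothetical:

import census_data_aggregator

def aggregate(pairs):
    # Combine a list of (estimate, moe) pairs into a single estimate
    # and an approximated margin of error.
    return census_data_aggregator.approximate_sum(*pairs)

aggregate([(10154024, 3778), (9712936, 3911)])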

negative values from numpy.random.normal

For smaller values or with large margins of error, the numpy.random.normal call in approximate_mean may return a negative number, which won't make sense in context. We should probably just use max(0, simulated_value) instead.
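A sketch of the proposed fix, with simulate_count standing in for the draw inside approximate_mean (this is not the library's actual code):

import numpy

def simulate_count(estimate, moe):
    # moe / 1.645 converts a 90 percent margin of error to a standard error.
    simulated = numpy.random.normal(estimate, moe / 1.645)
    # Clip at zero so small estimates with large MOEs can't go negative.
    return max(0, simulated)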

disaggregation functions

Functions for breaking geographic units into different geographic units and recalculating quantities of interest, with and without margins of error. A minimal sketch for the sums case follows the list below.

  • sums
  • medians
  • means
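One simple approach for sums, assuming the split proportions are known exactly (which real applications rarely enjoy), scales both the estimate and its margin of error by each sub-unit's share, since multiplying a value by a constant multiplies its margin of error by the same constant:

def disaggregate_sum(estimate, moe, shares):
    # shares maps each target unit to its fraction of the source unit.
    return {unit: (estimate * w, moe * w) for unit, w in shares.items()}

disaggregate_sum(19866960, 5437.76, {"tract_a": 0.6, "tract_b": 0.4})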

provide check that spatial aggregation doesn't induce spurious patterns

From this paper:

"one can induce geographic patterns in the aggregate data that do not
exist in the input data"

Create a diagnostic to check for this (equations 2 and 3 in paper):

"The statistic S_j measures whether the region-level estimates for a given variable are within the margins of error of their constituent tracts. If a region-level estimate is within the margin of error of all its constituent tracts, then there is no information lost through aggregation; information loss increases as the 90 percent confidence intervals of more and more tract-level estimates do not overlap with the region’s estimate."

deal with annotations

If using the aggregator outside of the downloader, the aggregator needs to know what to do with annotated values.

Source data for approximating median household income

Just to make sure I understand this correctly, to calculate median household income for an aggregate geography using the ACS, as shown in this example, would I use data from a table like ACS table B19001 to get the n (household counts), and min/max incomes for the ranges?

It looks like the wording of the top range of that table is "$200,000 or more". Should I just set an artificial upper bound for that? It looks like in the example and the linked PDF, they use $250,001.

Off the top of my head, this seems like it would be correct for many (most?) cases, but incorrect for very high income areas?

optional MOE input for approximate_median

We may want an optional moe input field for approximate_median to handle the case when the n values are estimates themselves (e.g., outputs of approximate_sum). The approximate_median function would then need a simulation aspect to account for the uncertainty in the n values.
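A very rough sketch of the idea, perturbing each bin's n by its margin of error before recomputing the median; nothing here is implemented in the library, and the simulated_median name is hypothetical:

import numpy
import census_data_aggregator

def simulated_median(bins, sampling_percentage, simulations=50):
    # bins is a list of dicts with min, max, n and moe keys.
    medians = []
    for _ in range(simulations):
        perturbed = [
            # moe / 1.645 converts a 90 percent margin of error to a standard error.
            dict(b, n=max(0, round(numpy.random.normal(b["n"], b["moe"] / 1.645))))
            for b in bins
        ]
        estimate, _ = census_data_aggregator.approximate_median(
            perturbed, sampling_percentage=sampling_percentage
        )
        medians.append(estimate)
    return numpy.mean(medians), numpy.std(medians)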
