Git Product home page Git Product logo

bifsg's Introduction

Bayesian Improved First Name Surname Geocoding (BIFSG)

Documentation of BIFSG method, as used for imputing race/ethnicity in the American Families Cohort (AFC) dataset hosted by Stanford Population Health Sciences.

Step 1: Produce conditional probability tables using create_tables.R

All data can be retrieved from the U.S. Census Bureau using public API calls, with the exception of the dataset on first names by race which comes from Tzioumis, Konstantinos (2018). A copy has been provided in this repo. The list of counties in county_list.tsv can be generated using the counties() function in the tigris package, looped over all states, but is provided in this repo for convenience. Otherwise, the R script demonstrates how to use the censusapi package to retrieve all other needed datasets, and generates all seven required tables. Completed tables are provided in this repo. Note that the script uses 2018 5-yr summary data from the American Community Survey, but the choice of year can be modified depending on the 5-yr span that best reflects your population. The script checks for the existence of the tables in the directory (DIR_PATH) of your choice and either generates or loads them. You can set USE_CACHED to FALSE if you'd like to regenerate the tables.

Step 2: Prepare your data

Your data requires a field for firstname, a field for surname, and a field for CBG (census block group). Individual records can be missing one or more of these fields. If you have address information but not CBG, then the most straightforward method is to geocode the addresses into latitude longitude coordinates and then spatial join the coordinates to CBG polygons from TIGER. Provide the full 12-character GEOID as a character type.

Step 3: Calculate posterior probabilities using create_BISG_model.R

After loading the conditional probability tables into your environment, they are converted into data.table format. Then, predict_race() is the key function that takes first name, surname, and CBG from your data and outputs the results, which are posterior probabilities that the individual is each of the six race/ethnicity options: Hispanic/Latino, White, Black or African American, Asian American or Pacific Islander, American Indian or Alaska Native, or Other Race. The probabilities add up to 1. The script can handle any combination of missingness in first name, surname, or CBG. If you lack first name, then the result is BISG. If you only have CBG, then your posterior probabilities will be the race/ethnicity distribution of your CBG. If you have nothing at all, then your posterior probabilities will be the race/ethnicity distribution of the whole U.S. Intermediate geometry cases (e.g., tract, county, ZIP Code, state) are not provided but can be added with the same principles. If you are dealing with a large dataset, then predict_parallel() can be used to speed up the process.

This code and data were prepared by Cameron Raymond. If you have questions, reach out to Derek Ouyang at [email protected].

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.