theonaunheim / surgeo

Open Source Proxy Demographic module written in Python

License: MIT License

Python 100.00%

surgeo's Issues

Reconciling prob_race_given_first_name_harvard and prob_first_name_given_race_harvard probabilities

Hi, I've been trying to reconcile the two files mentioned in the title and work out how to convert one into the other. Can you confirm that the formulation used in your implementation, taking the AARON/white entry as an example, is

prob(first_name = AARON | race = WHITE) = prob(race = WHITE | first_name = AARON) / sum[prob(race = WHITE | first_name = i)]?

I was able to replicate moving from one file to the other using the above formula, and wanted to make sure it is consistent with what you did.

I'm asking because the Harvard file provides the number of observations for each first name, and when I used those counts in my own calculations I arrived at different probabilities. In particular, my formulation is

obs(first name = AARON)*prob(race = WHITE | first name = AARON) / sum[obs(first name = i) * prob(race = WHITE | first name = i)].

Have you considered this alternative formulation that includes the observation count information? If the choice not to use the observation counts is deliberate, I would love to learn the rationale behind the decision.
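The two formulations in this issue can be compared side by side. The following is a minimal sketch, not surgeo's actual code: the function names are hypothetical, and the probabilities and observation counts are made-up numbers rather than values from the Harvard file.

```python
def invert_unweighted(p_race_given_name):
    """P(name | race), treating every name as equally common (no counts)."""
    total = sum(p_race_given_name.values())
    return {name: p / total for name, p in p_race_given_name.items()}


def invert_weighted(p_race_given_name, obs):
    """P(name | race) via Bayes' rule, weighting each name by its count."""
    joint = {name: obs[name] * p for name, p in p_race_given_name.items()}
    total = sum(joint.values())
    return {name: j / total for name, j in joint.items()}


# Illustrative values only: P(race = WHITE | first_name) and observation counts.
p_white_given_name = {"AARON": 0.9, "MARIA": 0.3}
observations = {"AARON": 1000, "MARIA": 3000}

unweighted = invert_unweighted(p_white_given_name)
weighted = invert_weighted(p_white_given_name, observations)
```

With these toy numbers the unweighted version gives AARON a share of 0.75 and the count-weighted version gives 0.5. The two agree only when every name has the same number of observations.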

Returns multiple records for some ZCTAs

This appears to affect ZCTAs that cross state lines. Examples: 69201, 51360, 59270
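If the duplicates do come from one row per state for the same ZCTA, one possible fix is to sum the counts for each ZCTA before computing any probabilities. The sketch below assumes a hypothetical row layout (state, zcta5, per-race counts); the state labels and counts are placeholders, not real census figures.

```python
from collections import defaultdict


def collapse_by_zcta(rows):
    """Sum per-race counts over all rows that share a ZCTA."""
    totals = defaultdict(lambda: defaultdict(int))
    for row in rows:
        for race, count in row["counts"].items():
            totals[row["zcta5"]][race] += count
    # Convert nested defaultdicts to plain dicts for easier comparison.
    return {zcta: dict(counts) for zcta, counts in totals.items()}


# Placeholder data: ZCTA 69201 appears once per (hypothetical) state part.
rows = [
    {"state": "XX", "zcta5": "69201", "counts": {"white": 10, "black": 2}},
    {"state": "YY", "zcta5": "69201", "counts": {"white": 5, "black": 1}},
    {"state": "ZZ", "zcta5": "90210", "counts": {"white": 7, "black": 3}},
]
collapsed = collapse_by_zcta(rows)
```

After collapsing, each ZCTA maps to exactly one record, so a lookup returns a single row.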

Fill NA data with population level statistics?

Hey - great idea and implementation, thanks for putting this together!

I'm noticing that when I get probabilities with the BIFSG model, I get null probabilities if any one of my input features is either null or doesn't show up in the census data. It would be great if there were an option to override the null probabilities that get introduced in these intermediate steps.

For example: if the ZCTA is absent but the first name and last name are present in the census data, then before combining the probabilities we could fill the null ZIP code data with the population-level statistics and calculate the combined probability from that (alternatively, we could leave that component out of the calculation; I'm not sure which is preferable). Perhaps a 'backfill with aggregate statistics' flag parameter for each of the components would be good.

The output currently looks like this, but we could definitely eke out some information here instead of leaving it null:
   zcta5 first_name surname     white     black       api    native  multiple  hispanic
0  90210    RANDALL  ZZZZZZ       NaN       NaN       NaN       NaN       NaN       NaN
1  90210     QQQQQQ   AARON       NaN       NaN       NaN       NaN       NaN       NaN
2  99999    RANDALL   AARON       NaN       NaN       NaN       NaN       NaN       NaN
3  90210    RANDALL   AARON  0.972583  0.004928  0.000934  0.000053  0.020869  0.000633
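The backfill idea can be sketched as follows. This is an illustration under stated assumptions, not surgeo's implementation: it uses the commonly stated BIFSG form, P(race | surname) x P(first name | race) x P(geography | race), normalized over races. The function name, the `prior` argument, and all numbers are hypothetical.

```python
def combine_bifsg(prior, p_race_given_surname=None,
                  p_first_given_race=None, p_geo_given_race=None):
    """Combine BIFSG components, tolerating missing (None) inputs.

    A missing surname component falls back to the population prior P(race);
    a missing first-name or geography likelihood is simply skipped.
    """
    base = prior if p_race_given_surname is None else p_race_given_surname
    scores = {}
    for race in prior:
        score = base[race]
        for likelihood in (p_first_given_race, p_geo_given_race):
            if likelihood is not None:
                score *= likelihood[race]
        scores[race] = score
    total = sum(scores.values())
    if total == 0:
        raise ValueError("all races have zero probability")
    return {race: score / total for race, score in scores.items()}


# Made-up two-race example; the ZCTA lookup failed, so its component is None.
prior = {"white": 0.6, "black": 0.4}
result = combine_bifsg(
    prior,
    p_race_given_surname={"white": 0.5, "black": 0.5},
    p_first_given_race={"white": 0.002, "black": 0.001},
    p_geo_given_race=None,
)
fallback_only = combine_bifsg(prior)
```

Skipping a missing likelihood has the same effect as treating it as equal across races, since any constant factor cancels in the normalization. Substituting population-level values for the missing component is the other option raised above, and it would give a different result whenever those values differ by race.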

Census tract or block group BISG?

Hey, thanks for putting this together! Any plans to modify this to work at the census block group or tract level? Thanks!

Build out ReadTheDocs documentation

Including:

Overview Section
BISG from scratch
Autodoc

Error result should keep certain data

On an error, the ZIP code and surname get overwritten with the error value. Instead, every field except ZIP code and surname should be set to the error value.

Build out unit tests

Build out test_runner.py script for TravisCI usage.

Create common entry point for GUI and CLI.

Like it says.

Copy edit documentation

Iterative proportional fitting misapplied

Misapplied to surnames rather than "other race". Minor skew, but requires fix.

Surgeo Model used wrong probability for tract

In the "TRACT" version of the Surgeo model, the official documentation calls for the "probability of tract given race". In the model code, however, I see that it uses "_get_prob_race_given_tract()". The variable name is right, but the data used is wrong.

self.geo_level = geo_level.upper()
if geo_level == "TRACT":
    self._PROB_GEO_GIVEN_RACE = self._get_prob_race_given_tract()
else:
    self._PROB_GEO_GIVEN_RACE = self._get_prob_zcta_given_race()

For ZCTA the correct file is used, but for tract the wrong probability file is pulled.
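The two quantities confused in this issue are different normalizations of the same count table. The sketch below uses made-up tract counts and hypothetical function names; it is meant only to show that the two can differ noticeably, not to reproduce surgeo's data loading.

```python
def prob_race_given_geo(counts):
    """Normalize within each geography: P(race | geo)."""
    out = {}
    for geo, by_race in counts.items():
        total = sum(by_race.values())
        out[geo] = {race: n / total for race, n in by_race.items()}
    return out


def prob_geo_given_race(counts):
    """Normalize within each race across geographies: P(geo | race)."""
    race_totals = {}
    for by_race in counts.values():
        for race, n in by_race.items():
            race_totals[race] = race_totals.get(race, 0) + n
    return {
        geo: {race: n / race_totals[race] for race, n in by_race.items()}
        for geo, by_race in counts.items()
    }


# Made-up population counts for two tracts.
counts = {
    "tract_a": {"white": 90, "black": 10},
    "tract_b": {"white": 30, "black": 170},
}
race_given_geo = prob_race_given_geo(counts)
geo_given_race = prob_geo_given_race(counts)
```

Here P(white | tract_a) is 0.9 while P(tract_a | white) is 0.75, so using one table in place of the other changes the result.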

Census Data 2000 rather than 2010

The CFPB Model uses 2010 geocode data. This uses 2000. Minimal skew, but requires update.

Make "ALL OTHER NAMES" and forename equivalent available

Currently returns NaN because it is normalized to "ALLOTHERNAMES".
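One way to make the catch-all row reachable is to normalize the table keys with the same function used on the input, so that "ALL OTHER NAMES" and "ALLOTHERNAMES" match. The sketch below is an assumption about how this could be done, not the package's code: the normalization rule, the function names, and the table contents are all hypothetical, and falling back to the catch-all row for unseen names is a design choice rather than existing behavior.

```python
import re

FALLBACK_LABEL = "ALL OTHER NAMES"  # label taken from the issue title


def normalize(name):
    """Uppercase and drop everything except A-Z, so spaces vanish too."""
    return re.sub(r"[^A-Z]", "", name.upper())


def build_index(table):
    """Key the lookup table by normalized name so both sides match."""
    return {normalize(label): probs for label, probs in table.items()}


def lookup(name, index):
    """Return the row for a name, else the catch-all row if present."""
    fallback = index.get(normalize(FALLBACK_LABEL))
    return index.get(normalize(name), fallback)


# Made-up probabilities for illustration.
table = {"SMITH": {"white": 0.7}, "ALL OTHER NAMES": {"white": 0.6}}
index = build_index(table)
explicit = lookup("all other names", index)
unseen = lookup("ZZZZZZ", index)
known = lookup("Smith", index)
```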

Implement BIFSG

Would there be any interest in attempting to implement the improved BIFSG model that includes first name data as well?

See https://www.tandfonline.com/doi/full/10.1080/2330443X.2018.1427012

While the overall magnitude of the improvement associated with BIFSG is somewhat modest, the largest improvements occur for NH Blacks, which is the group for which BISG is least accurate. Moreover, the improvement for NH Blacks is much higher where geography has low ability to distinguish NH Blacks. This aspect is particularly important as much of the research on the topic of racial/ethnic differences focuses on specific geographic areas rather than the entire United States. It is also worthwhile to note that the improvements of BIFSG over BISG are generally comparable to the improvements of BISG over simpler methods. Last but not least, when assessing the degree of improvement from BIFSG, one should consider that even the most advanced methods are likely to result in incremental improvements for Hispanics and NH Asians, given that surnames alone are highly predictive for these particular groups.

I wouldn't mind submitting a PR on this if there is interest.

Internal Renaming For Next Version

Now that surgeo has both first name and surname models, it makes sense to disambiguate between these names in the data and source code. Every variable/column that is specific to a surname should be styled "surname" and every one that is a first name should be styled "first_name".
Now that surgeo has both BISG and BIFSG models, the class of SurgeoModel should become BISGModel for consistency.

Update with 2020 Census data

Hi all! This is an awesome tool, thanks for building this.

Now that 2020 Census data is available, is it possible to update the data this pulls from? I'm happy to help in any way, including data cleaning and making it an optional keyword to prevent people from having their surgeo predictions change unexpectedly.

Any information you have about where you sourced the data/any special data cleaning you needed to format it would be helpful, and I can open a pull request with full test coverage as well.