surgeo's People

Contributors

nicanor-b, thecleric, theonaun, theonaunheim

surgeo's Issues

Reconciling prob_race_given_first_name_harvard and prob_first_name_given_race_harvard probabilities

Hi, I've been trying to reconcile how to convert between the two files mentioned in the title. Can you confirm that the formulation used in your implementation, using the AARON/white entry as an example, is

prob(first_name = AARON | race = WHITE) = prob(race = WHITE | first_name = AARON) / sum_i[prob(race = WHITE | first_name = i)]?

Using the formula above, I was able to reproduce one file from the other, and I want to make sure it is consistent with what you did.

I'm asking because the Harvard file provides the number of observations for each first name. I used those counts in my own calculations and arrived at different probabilities. In particular, my formulation is

obs(first_name = AARON) * prob(race = WHITE | first_name = AARON) / sum_i[obs(first_name = i) * prob(race = WHITE | first_name = i)].

Have you considered this alternative formulation that includes the observation count information? If the choice not to use the observation counts is deliberate, I would love to learn the rationale behind the decision.
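
To make the difference between the two formulations concrete, here is a minimal sketch on made-up numbers. The names, counts, and probabilities in `toy` are hypothetical and are not taken from the Harvard file, and the function names are illustrative rather than part of surgeo. The weighted version is Bayes' rule with prob(first_name) estimated from each name's share of observations; the unweighted version implicitly treats every first name as equally common.

```python
# Hypothetical toy data: name -> (observation count, P(race = WHITE | first_name)).
# These values are invented for illustration only.
toy = {
    "AARON": (1000, 0.90),
    "MARIA": (3000, 0.50),
    "JAMAL": (500, 0.10),
}


def invert_unweighted(table):
    """P(name | race) = P(race | name) / sum_i P(race | i)."""
    total = sum(p for _, p in table.values())
    return {name: p / total for name, (_, p) in table.items()}


def invert_weighted(table):
    """P(name | race) = obs(name) * P(race | name) / sum_i obs(i) * P(race | i)."""
    total = sum(n * p for n, p in table.values())
    return {name: n * p / total for name, (n, p) in table.items()}


unweighted = invert_unweighted(toy)
weighted = invert_weighted(toy)
for name in toy:
    print(f"{name}: unweighted={unweighted[name]:.4f} weighted={weighted[name]:.4f}")
```

On this toy data the two versions disagree noticeably for names whose observation counts differ a lot, which is the kind of gap described above.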

Fill NA data with population level statistics?

Hey - great idea and implementation, thanks for putting this together!

I'm noticing that when I get probabilities with the BIFSG model, I get null probabilities if any one of my input features is either null or doesn't show up in the census data. It would be great if there were an option to override the null probabilities that get introduced in these intermediate steps.

For example, if the ZCTA is absent but the first name and last name are present in the census data, then before we combine the probabilities, we fill the null ZIP code data with the population-level statistics and calculate the combined probability from that (alternatively, we could just leave it out of the calculation; I'm not sure which is preferable). Perhaps a 'backfill with aggregate statistics' flag parameter for each of the components would be good.

The output currently looks like this, but we could definitely eke out some information here instead of leaving it null:
   zcta5 first_name surname     white     black       api    native  multiple  hispanic
0  90210    RANDALL  ZZZZZZ       NaN       NaN       NaN       NaN       NaN       NaN
1  90210     QQQQQQ   AARON       NaN       NaN       NaN       NaN       NaN       NaN
2  99999    RANDALL   AARON       NaN       NaN       NaN       NaN       NaN       NaN
3  90210    RANDALL   AARON  0.972583  0.004928  0.000934  0.000053  0.020869  0.000633
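
A rough sketch of how such a backfill could work, assuming the combined score is proportional to P(race | surname) times one per-race likelihood term for each remaining component. In this sketch an unknown surname falls back to population-level race shares, and a missing first-name or geography term is simply left out of the product. The function `combine`, its parameters, and every number below are hypothetical; none of this is surgeo's actual API or data.

```python
RACES = ("white", "black", "api", "native", "multiple", "hispanic")


def combine(surname_prior, likelihoods, population_prior):
    """Multiply a race prior by per-race likelihood terms and normalize.

    surname_prior: dict of P(race | surname), or None if the surname is unknown.
    likelihoods: list of dicts such as P(first_name | race) or P(zcta | race);
        a None entry marks a component with no match in the reference data.
    population_prior: dict of population-level race shares used as the fallback.
    """
    scores = dict(surname_prior if surname_prior is not None else population_prior)
    for term in likelihoods:
        if term is None:
            # Skipping a term is the same as multiplying every race by 1.
            continue
        for race in RACES:
            scores[race] *= term[race]
    total = sum(scores.values())
    if total == 0:
        return {race: float("nan") for race in RACES}
    return {race: scores[race] / total for race in RACES}


# Made-up numbers for illustration only.
population = {"white": 0.60, "black": 0.12, "api": 0.06,
              "native": 0.01, "multiple": 0.03, "hispanic": 0.18}
first_name_given_race = {"white": 0.004, "black": 0.001, "api": 0.0005,
                         "native": 0.001, "multiple": 0.002, "hispanic": 0.0002}

# Surname unknown, first name known, ZCTA unknown.
result = combine(None, [first_name_given_race, None], population)
print(result)
```

With every component missing, this sketch returns the population-level shares unchanged rather than NaN, which is one possible reading of the 'backfill' flag suggested above.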

Surgeo Model used wrong probability for tract

In the "TRACT" version of the surgeo model, the probability that should be used according to the official documentation is "probability of tract given race". But in the model code I see that it calls "_get_prob_race_given_tract()". The variable name is right, but the data assigned to it is wrong.

self.geo_level = geo_level.upper()
if geo_level == "TRACT":
    self._PROB_GEO_GIVEN_RACE = self._get_prob_race_given_tract()
else:
    self._PROB_GEO_GIVEN_RACE = self._get_prob_zcta_given_race()

For ZCTA this is right, but for tract the wrong probability file is pulled.
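
A sketch of what the corrected selection might look like, using a stand-in class so it runs on its own. The class `GeoModelSketch` and the loader name `_get_prob_tract_given_race` are hypothetical and may not exist in surgeo; the loaders here return placeholder strings instead of real data. The sketch also compares against the upper-cased value, on the assumption that lower-case input such as "tract" should be accepted.

```python
class GeoModelSketch:
    """Stand-in for illustration only; not surgeo's actual model class."""

    def __init__(self, geo_level="ZCTA"):
        self.geo_level = geo_level.upper()
        # Compare the normalized value so "tract" and "TRACT" behave the same.
        if self.geo_level == "TRACT":
            self._PROB_GEO_GIVEN_RACE = self._get_prob_tract_given_race()
        else:
            self._PROB_GEO_GIVEN_RACE = self._get_prob_zcta_given_race()

    def _get_prob_tract_given_race(self):
        # Hypothetical loader name; a real loader would return a table.
        return "prob_tract_given_race"

    def _get_prob_zcta_given_race(self):
        # Placeholder for the loader named in the snippet above.
        return "prob_zcta_given_race"


print(GeoModelSketch("tract")._PROB_GEO_GIVEN_RACE)
print(GeoModelSketch("zcta")._PROB_GEO_GIVEN_RACE)
```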

Implement BIFSG

Would there be any interest in attempting to implement the improved BIFSG model that includes first name data as well?

See https://www.tandfonline.com/doi/full/10.1080/2330443X.2018.1427012

While the overall magnitude of the improvement associated with BIFSG is somewhat modest, the largest improvements occur for NH Blacks, which is the group for which BISG is least accurate. Moreover, the improvement for NH Blacks is much higher where geography has low ability to distinguish NH Blacks. This aspect is particularly important as much of the research on the topic of racial/ethnic differences focuses on specific geographic areas rather than the entire United States. It is also worthwhile to note that the improvements of BIFSG over BISG are generally comparable to the improvements of BISG over simpler methods. Last but not least, when assessing the degree of improvement from BIFSG, one should consider that even the most advanced methods are likely to result in incremental improvements for Hispanics and NH Asians, given that surnames alone are highly predictive for these particular groups.

I wouldn't mind submitting a PR on this if there is interest.
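
For reference, the BIFSG update can be written roughly as follows. The symbols r, s, f, and g (race, surname, first name, geography) are notation chosen here rather than the paper's own, and the form assumes that first name and geography are each conditionally independent of the other inputs given race:

```latex
P(r \mid s, f, g) =
  \frac{P(r \mid s)\, P(f \mid r)\, P(g \mid r)}
       {\sum_{r'} P(r' \mid s)\, P(f \mid r')\, P(g \mid r')}
```

Dropping the P(f | r) factor from the numerator and denominator gives the corresponding BISG form.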

Internal Renaming For Next Version

  1. Now that surgeo has both first name and surname models, it makes sense to disambiguate between these names in the data and source code. Every variable/column that is specific to a surname should be styled "surname" and every one that is a first name should be styled "first_name".

  2. Now that surgeo has both BISG and BIFSG models, the SurgeoModel class should be renamed BISGModel for consistency.

Update with 2020 Census data

Hi all! This is an awesome tool, thanks for building this.

Now that 2020 Census data is available, is it possible to update the data this pulls from? I'm happy to help in any way, including data cleaning and making it an optional keyword to prevent people from having their surgeo predictions change unexpectedly.

Any information you have about where you sourced the data/any special data cleaning you needed to format it would be helpful, and I can open a pull request with full test coverage as well.

First Name / Last Name / Geocode Mixins

People are more likely to have forename and surname data together than they are to have surname and geo data together.

Since the data is already there, examine the possibility of updating the surname probability with the forename.
