
name-gender-guesser's Introduction

Copying: Name Gender Guesser is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. Name Gender Guesser is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with Name Gender Guesser. If not, see <http://www.gnu.org/licenses/>.

Introduction: Name Gender Guesser helps you find out the gender associated with a given name. You can either use the two provided datasets (or your own, if you have one), which consist of common American names with their frequencies in the male and female populations, or you can use the Yahoo! BOSS API to guess the gender of an unknown name by carrying out some pattern-based searches.

Quick Start: Check out the code and run example.py

Less Quick Start: This project contains two datasets of gender associations for common American names and two scripts: one to handle these datasets, and another to carry out a web-based search to guess the gender of unknown names.

The first dataset, us_census, comes from the US Census Bureau and is constructed as follows:

The names are fetched from the Bureau's web site (http://www.census.gov/genealogy/www/data/1990surnames/names_files.html) and put in two files, us_census_males and us_census_females, which contain the frequency of names in the sample male and female populations respectively (according to the 1990 census).

The second dataset, popular_baby_names, comes from the US Social Security Administration's statistics for popular baby names for every year between 1960 and 2010. The dataset is constructed as follows:

1) Fetch the 100 most popular female and male names for every year between 1960 and 2010 from http://www.ssa.gov/cgi-bin/popularnames.cgi
2) For each male and female name, calculate the average probability of usage between 1960 and 2010. Years in which a name is missing from the list are left out of the average. This means that if a name was in the top-100 list for only one year in the given period, its final score is its probability for that year.
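
The averaging in step 2 can be sketched as follows. This is a minimal illustration, not the script that produced the dataset files: the function name average_usage, the shape of its input (a dict mapping each year to a {name: probability} dict for that year's top-100 list), and the sample numbers are all assumptions made for the example.

```python
from collections import defaultdict


def average_usage(yearly_tables):
    """Average each name's usage probability over the years it appears in.

    yearly_tables maps a year to a {name: probability} dict (hypothetical
    input shape). Years in which a name is absent are skipped rather than
    counted as zero.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for table in yearly_tables.values():
        for name, prob in table.items():
            key = name.lower()
            totals[key] += prob
            counts[key] += 1
    return {name: totals[name] / counts[name] for name in totals}


# Made-up numbers for illustration only, not real SSA figures.
sample = {
    1960: {"mary": 0.025, "lisa": 0.015},
    1961: {"mary": 0.023, "lisa": 0.020},
    1962: {"lisa": 0.022, "tina": 0.004},
}
scores = average_usage(sample)
```

In this sample, "tina" appears in a single year, so its score is simply that year's probability, while "mary" is averaged over two years and "lisa" over three.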

The class NameGender (contained in name_gender.py) handles these datasets. If you have your own dataset, you can use it as well. The format is trivial (really, check the files yourself).
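
To illustrate the idea of a frequency-based lookup, here is a minimal sketch that compares a name's male and female frequencies. It is not the NameGender class or its API: the function guess_gender, the in-memory dict inputs, the tie handling, and the sample frequencies are assumptions made for the example, and the on-disk file format is deliberately left out.

```python
def guess_gender(name, male_freq, female_freq):
    """Return (label, confidence) from a name's male and female frequencies.

    male_freq and female_freq map lower-cased names to frequencies
    (hypothetical in-memory form of the two dataset files).
    Returns (None, 0.0) if the name is in neither table and (None, 0.5)
    if both frequencies are equal.
    """
    key = name.strip().lower()
    m = male_freq.get(key, 0.0)
    f = female_freq.get(key, 0.0)
    total = m + f
    if total == 0:
        return None, 0.0
    if m > f:
        return "male", m / total
    if f > m:
        return "female", f / total
    return None, 0.5


# Made-up frequencies for illustration only.
males = {"john": 3.2, "leslie": 0.05}
females = {"mary": 2.6, "leslie": 0.15}

print(guess_gender("John", males, females))
print(guess_gender("Leslie", males, females))
print(guess_gender("Xyzzy", males, females))
```

A name that is missing from both tables yields no guess, which is the case where a web-based fallback becomes useful.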

The class WebNameGender does not use any dataset to guess the gender of a name. It simply carries out several web searches via the Yahoo! BOSS API and calculates a gender score according to the hit counts. It provides a fallback mechanism for names that are not contained in the datasets. It also works fairly well for common names in languages other than English (a proper evaluation is yet to be done). You will need a BOSS Application ID to use this class. Two example patterns that WebNameGender uses for a given name X are:

* "X himself", "X herself"
* "husband of X", "wife of X"

In the first case, "X himself" provides evidence that X is a he. In the second case, "husband of X" provides evidence that X is a she. By comparing several pattern pairs like these, WebNameGender computes a gender score for X.
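
The pattern comparison can be sketched roughly as below. This is an illustration under stated assumptions, not the WebNameGender implementation: the scoring formula (the mean of (male - female) / (male + female) over pattern pairs that return any hits) is one plausible choice, and hit_count is a hypothetical callable standing in for a search API call. No real Yahoo! BOSS requests are made here.

```python
# Each pair is (phrase suggesting male, phrase suggesting female).
PATTERN_PAIRS = [
    ("{name} himself", "{name} herself"),
    ("wife of {name}", "husband of {name}"),
]


def gender_score(name, hit_count):
    """Return a score in [-1, 1]; positive leans male, negative leans female.

    hit_count is any callable that takes a quoted query string and returns
    the number of search hits (hypothetical stand-in for a search API).
    Pairs with no hits at all are ignored; 0.0 is returned if every pair
    is empty.
    """
    ratios = []
    for male_tpl, female_tpl in PATTERN_PAIRS:
        male_hits = hit_count('"%s"' % male_tpl.format(name=name))
        female_hits = hit_count('"%s"' % female_tpl.format(name=name))
        total = male_hits + female_hits
        if total > 0:
            ratios.append((male_hits - female_hits) / total)
    if not ratios:
        return 0.0
    return sum(ratios) / len(ratios)


# Fake hit counts for illustration only.
fake_hits = {
    '"Alex himself"': 90,
    '"Alex herself"': 10,
    '"wife of Alex"': 30,
    '"husband of Alex"': 10,
}
score = gender_score("Alex", lambda query: fake_hits.get(query, 0))
```

Normalizing each pair separately keeps one very frequent phrase from dominating the result, at the cost of giving rare patterns the same weight as common ones.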

name-gender-guesser's People

Contributors

amacinho

name-gender-guesser's Issues

Popular baby names - methodology

Hi,

I've been trying to recreate the popular baby name files following the procedure outlined in the README file. For this, I first fetched the most popular 100 female and male names for every year between 1960 and 2010 from http://www.ssa.gov/cgi-bin/popularnames.cgi using the following command:

for year in $(seq 1960 2010); do echo $year; wget --quiet --no-check-certificate -O "${year}.html" --post-data="year=${year}&top=100&number=p" https://www.ssa.gov/cgi-bin/popularnames.cgi; done

However, that's where I already run into some questions.

Looking at your file popular_1960_2010_females, the first entry is for the name "fawn". However, I cannot find any mention of that name in the files downloaded above:

grep -i fawn *.html

This command finds no matches.

Could you please elaborate your method, or point out what I should have done differently?
