Git Product home page Git Product logo

namsor-python-tools-v2's Introduction

namsor-python-tools

NamSor Python command line tools, to append gender, origin, diaspora or us 'race'/ethnicity to a CSV file. The CSV file should in UTF-8 encoding, pipe-| demimited. It can be very large. Check out also the Java CLT (https://github.com/namsor/namsor-tools-v2)

Please install https://github.com/namsor/namsor-python-sdk2 and synchronized-set dependencies,

pip install git+https://github.com/namsor/namsor-python-sdk2.git
pip install synchronized-set

NB: we use Unix conventions for file paths, ex. samples/some_fnln.txt but on MS Windows that would be samples\some_fnln.txt

Then clone this project and start from the base directory.

git clone https://github.com/namsor/namsor-python-tools-v2/
cd namsor-python-tools-v2

Running

python namsor_tools.py
python3 namsor_tools.py  	[-h] -apiKey APIKEY -i INPUTFILE [-countryIso2 COUNTRYISO2] [-o OUTPUTFILE] [-w] 
							[-r] -f INPUTDATAFORMAT [-header] [-uid] [-digest] -service SERVICE [-e ENCODING]
							[-usraceethnicityoption USRACEETHNICITYOPTION]                   

Detailed usage

python3
usage: namsor_tools.py 	[-h] -apiKey APIKEY -i INPUTFILE [-countryIso2 COUNTRYISO2] [-o OUTPUTFILE] [-w] 
						[-r] -f INPUTDATAFORMAT [-header] [-uid] [-digest] -service SERVICE [-e ENCODING]
						[-usraceethnicityoption USRACEETHNICITYOPTION]

Main parser for namsor_commandline_tool

options:
  -h, --help            show this help message and exit
  -apiKey APIKEY, --apiKey APIKEY
                        NamSor API Key
  -i INPUTFILE, --inputFile INPUTFILE
                        input file name
  -countryIso2 COUNTRYISO2, --countryIso2 COUNTRYISO2
                        countryIso2 default
  -o OUTPUTFILE, --outputFile OUTPUTFILE
                        output file name
  -w, --overwrite       overwrite existing output file
  -r, --recover         continue from a job (requires uid)
  -f INPUTDATAFORMAT, --inputDataFormat INPUTDATAFORMAT
                        input data format : first name, last name (fnln) / first name, last name, geo country iso2
                        (fnlngeo) / / first name, last name, geo country iso2, subdivision (fnlngeosub) / full name
                        (name) / full name, geo country iso2 (namegeo) / full name, geo country iso2, subdivision
                        (namegeosub)
  -header, --header     output header
  -uid, --uid           input data has an ID prefix
  -digest, --digest     SHA-256 digest names in output
  -service SERVICE, --endpoint SERVICE
                        service : parse / gender / origin / country / diaspora / usraceethnicity / religion /
                        castegroup
  -e ENCODING, --encoding ENCODING
                        encoding : UTF-8 by default
  -usraceethnicityoption USRACEETHNICITYOPTION, --usraceethnicityoption USRACEETHNICITYOPTION
                        extra usraceethnicity option USRACEETHNICITY-4CLASSES USRACEETHNICITY-4CLASSES-CLASSIC
                        USRACEETHNICITY-6CLASSES

Examples

To append the likely name gender to a list of first and last names : John|Smith

python namsor_tools.py -apiKey <yourAPIKey> -w -header -f fnln -i samples/some_fnln.txt -service gender

To append the likely name origin to a list of first and last names : John|Smith

python namsor_tools.py -apiKey <yourAPIKey> -w -header -f fnln -i samples/some_fnln.txt -service origin

To append the likely country of residence to a list of full names : John Smith

python namsor_tools.py -apiKey <yourAPIKey> -w -header -f name -i samples/some_name.txt -service country

To parse names into first and last name components (John Smith or Smith, John -> John|Smith)

python namsor_tools.py -apiKey <yourAPIKey> -w -header -f name -i samples/some_name.txt -service parse

The recommended input format is to specify a unique ID and a geographic context (if known) as a countryIso2 code.

To append gender to a list of id, first and last names, geographic context : id12|John|Smith|US

python namsor_tools.py -apiKey <yourAPIKey> -w -header -uid -f fnlngeo -i samples/some_idfnlngeo.txt -service gender

To append the ethnicity (in the sense of cultural heritage / country of origin of ascendents) from a list of id, first and last names, geographic context : id12|John|Smith|US

python namsor_tools.py -apiKey <yourAPIKey> -w -header -uid -f fnlngeo -i samples/some_idfnlngeo.txt -service diaspora

To append US'race'/ethnicity to a list of id, first and last names, geographic context : id12|John|Smith|US

python namsor_tools.py -apiKey <yourAPIKey> -w -header -uid -f fnlngeo -i samples/some_idfnlnUS.txt -service usraceethnicity

To append US'race'/ethnicity to a list of id, first and last names, geographic context : id12|John|Smith|US with option USRACEETHNICITY-6CLASSES

python namsor_tools.py -apiKey <yourAPIKey> -w -header -uid -f fnlngeo -i samples/some_idfnlnUS.txt -service usraceethnicity --usraceethnicityoption USRACEETHNICITY-6CLASSES

To parse name into first and last name components, a geographic context is recommended (esp. for Latam names) : id12|John Smith|US

python namsor_tools.py -apiKey <yourAPIKey> -w -header -uid -f namegeo -i samples/some_idnamegeo.txt -service parse

On large input files with a unique ID, it is possible to recover from where the process crashed and append to the existint output file, for example :

python namsor_tools.py -apiKey <yourAPIKey> -r -header -uid -f fnlngeo -i samples/some_idfnlngeo.txt -service gender

For Indian names (for now), you can infer the likely india state or union territory (ie. a subdivision of the country as per ISO 3166-2:IN)

python namsor_tools.py -apiKey <yourAPIKey>  -r -header -uid -f fnlngeo -i samples/some_indian_idfnlngeo.txt -service subdivision

For Indian names (for now), you can infer the likely religion (provided the IN country code and state/union territory as per ISO 3166-2:IN)

python namsor_tools.py -apiKey <yourAPIKey>  -r -header -uid -f namegeosub -i samples/some_indian_idnamegeosub.txt -service subdivision

For Indian names (for now), you can infer the likely caste group (provided the IN country code and state/union territory as per ISO 3166-2:IN)

python namsor_tools.py -apiKey <yourAPIKey>  -r -header -uid -f namegeosub -i samples/some_indian_idnamegeosub.txt -service castegroup

Anonymizing output data

The -digest option will digest personal names in file outpus, using a non reversible MD-5 hash. For example, John Smith will become 6117323d2cabbc17d44c2b44587f682c. Please note that this doesn't apply to the PARSE output.

Understanding output

Please read and contribute to the WIKI https://github.com/namsor/namsor-tools-v2/wiki/NamSor-Tools-V2

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

namsor-python-tools-v2's People

Contributors

namsor avatar occhristian avatar uncopied avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

namsor-python-tools-v2's Issues

-R recover option not working

The -R option works fine with the Java CLT but not working with the Python Command Line Tool

error as CRITICAL:type:('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',)) occurred several times. I used the recovery method (by changing from -w to -r) which was mentioned in your github page to continue running instead of running over again.

Python 3.8 error

With Python 3.8, we get the error
namsor_tools.py: error: the following arguments are required: -apiKey/--apiKey, -service/--endpoint

No issue with Python 3.7.2

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.