embeddingdynamicstereotypes's Introduction

This repository contains code and data associated with "Word embeddings quantify 100 years of gender and ethnic stereotypes." PDF available here.

If you use the content in this repository, please cite:

Garg, N., Schiebinger, L., Jurafsky, D. & Zou, J. Word embeddings quantify 100 years of gender and ethnic stereotypes. PNAS 201720347 (2018). doi:10.1073/pnas.1720347115

To re-run all analyses and plots:

  1. download vectors from online sources and normalize each vector by its L2 norm (links in paper and below)
  2. set up parameters to run as in run_params.csv
  3. run changes_over_time.py
  4. run create_final_plots_all.py
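
The normalization in step 1 can be sketched as follows. This is a minimal illustration, not the repository's normalize_vectors.py: the function name `l2_normalize_rows` and the `eps` threshold are hypothetical, and it assumes the vectors are already loaded as a NumPy matrix with one word per row.

```python
import numpy as np


def l2_normalize_rows(vectors, eps=1e-12):
    """Scale each row to unit L2 norm; all-zero rows are left as zeros.

    Hypothetical helper for illustration; not part of the repository.
    """
    vectors = np.asarray(vectors, dtype=float)
    # One norm per row, kept as a column so it broadcasts over the row.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Divide zero-norm rows by 1 instead of 0 to avoid producing NaN.
    safe_norms = np.where(norms > eps, norms, 1.0)
    return vectors / safe_norms


example = np.array([[3.0, 4.0], [0.0, 0.0]])
normalized = l2_normalize_rows(example)
print(normalized)
```

Leaving all-zero rows untouched is a design choice made here so that the division never yields NaN; whether such rows should instead be dropped depends on the downstream analysis.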

dataset_utilities/ contains helper scripts to preprocess files and create word vectors. Given a corpus such as LDC95T21-North-American-News, which contains many text files (one article each) for a given year, first run create_yrly_datasets.py to create a single text file per year (with only valid words). Then run pipeline.py on each of these files to create vectors, potentially combining multiple years into a single training set. normalize_vectors.py contains utilities to standardize the vectors.

We uploaded the New York Times embeddings generated for this paper to http://stanford.edu/~nkgarg/NYTembeddings/. 2021/04/05 update: Unfortunately, the files are no longer available. (The links died upon my graduation, before I was able to back them up.) However, the original text data is still available as the New York Times Annotated Corpus, so the vectors can be trained as described in the paper.

We use the following embeddings publicly available online. If you use these embeddings, please cite the associated papers.

  1. Google News, word2vec
  2. Genre-Balanced American English (1830s-2000s), SGNS and SVD
  3. Wikipedia, GloVe

Note: the paper mistakenly indicates that the Genre-Balanced American English embeddings contain data from both Google Books and the Corpus of Historical American English (COHA). They contain only data from COHA, though the same website also provides embeddings trained using Google Books.

embeddingdynamicstereotypes's People

Contributors

nikhgarg

embeddingdynamicstereotypes's Issues

Input Data

Hello, thanks for the repository!

I have followed the instructions in the README file but cannot obtain the vectors as described there.

Basically these files:

filenames_sgns = [folder + 'vectors_sgns{}.txt'.format(x) for x in range(1910, 2000, 10)]
filenames_svd = [folder + 'vectors_svd{}.txt'.format(x) for x in range(1910, 2000, 10)]
filenames_nyt = [folder + 'vectors{}-{}.txt'.format(x, x+5) for x in range(1987, 2000, 1)]
filenames_coha = [folder + 'vectorscoha{}-{}.txt'.format(x, x+20) for x in range(1910, 2000, 10)]

Can you please let us know how to generate them? For example, I have downloaded the SGNS embeddings from here, but the download contains only -vocab.pkl and -w.npy files and no .txt files.
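
One possible way to turn a -vocab.pkl / -w.npy pair into a text file is sketched below. Everything here is an assumption rather than the repository's documented method: the function names are hypothetical, the vocabulary pickle is assumed to hold a list of words aligned with the rows of the matrix, and the output layout (one word per line followed by space-separated values) is a guess at what the analysis scripts read.

```python
import pickle

import numpy as np


def load_vocab_and_matrix(vocab_path, matrix_path):
    """Load a pickled word list and its matching embedding matrix."""
    # encoding="latin1" lets Python 3 read pickles written by Python 2;
    # that the downloaded files need this is an assumption.
    with open(vocab_path, "rb") as handle:
        vocab = pickle.load(handle, encoding="latin1")
    matrix = np.load(matrix_path)
    if len(vocab) != matrix.shape[0]:
        raise ValueError("vocabulary size and matrix row count differ")
    return list(vocab), matrix


def write_text_vectors(vocab, matrix, out_path):
    """Write one line per word: the word, then its vector components."""
    with open(out_path, "w", encoding="utf-8") as out:
        for word, row in zip(vocab, matrix):
            values = " ".join(format(float(v), ".6f") for v in row)
            out.write(f"{word} {values}\n")
```

A call such as `write_text_vectors(*load_vocab_and_matrix(vocab_path, matrix_path), out_path)` would then produce the text file; the paths are whatever the downloaded files and the expected `vectors_sgns{}.txt` names are on your machine.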

How to normalize vectors by l2 norm?

download vectors from online sources and normalize by l2 norm (links in paper and below)

I am not very clear about this.

Any help would be appreciated.
Thank You.

Download all word embeddings

I am trying to re-run your code and have been struggling to find all the word embeddings you used in your experiments. Could you please let me know where I can download them? Also, could you explain exactly how to run your code? The README does not help me much.

Thank You

Getting NaN after running changes_over_time.py

I have been trying to run the code for a long time and keep getting NaNs in finalrun.csv, which is generated by changes_over_time.py. I have downloaded all the embeddings twice (I thought there had been an error while downloading the vectors the first time), normalized all the vectors, and then run changes_over_time.py.

Any help would be appreciated.

Thank You.
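
A quick check that may help narrow this down is sketched below. It assumes, without confirmation from the repository, that NaNs could originate in the input vectors themselves, for example from all-zero rows that were divided by a zero norm during normalization. The function name `find_problem_words` is hypothetical, and the words and matrix are assumed to be already loaded in memory.

```python
import numpy as np


def find_problem_words(words, matrix):
    """Report words whose vectors are all-zero or contain NaN/inf.

    Hypothetical diagnostic helper; not part of the repository.
    """
    matrix = np.asarray(matrix, dtype=float)
    # Rows with any NaN or infinite component.
    non_finite = ~np.isfinite(matrix).all(axis=1)
    # Finite rows whose components are all exactly zero.
    all_zero = ~non_finite & (np.abs(matrix).sum(axis=1) == 0)
    return {
        "non_finite": [w for w, bad in zip(words, non_finite) if bad],
        "all_zero": [w for w, bad in zip(words, all_zero) if bad],
    }


report = find_problem_words(
    ["a", "b", "c"],
    [[1.0, 2.0], [0.0, 0.0], [np.nan, 1.0]],
)
print(report)
```

Running it on the vectors before and after normalization shows whether the NaNs are already present in the inputs or are introduced later in the pipeline.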
