Digital Ethnic Futures Lab - SCOTUS College Statement Text Analysis

Description

This repository contains multiple programs intended to analyze the statements released by select colleges on SCOTUS's ruling on affirmative action.

  • 'statement_to_csv.py' uses the Google Sheets API to read data from a column and write its contents to individual csv files, stored in folder 'csv_files'
  • 'region_finder.py' tags region information and campus size using data from folder 'data_directory' for specified colleges and transforms it into a csv file 'locations_results.csv'
  • the 'tfidf' directory contains programs that perform term frequency-inverse document frequency (TF-IDF) analysis on our corpus, while the 'sentiment' directory contains programs that perform sentiment analysis
  • the 'ngram' directory performs n-gram analysis on the corpus of text files. It defines functions to preprocess and tokenize the texts, then find the top n-grams. It also contains functions to compare n-grams
  • the 'response_comparison' directory contains programs that measure the similarity between different responses, as well as between the responses and a GPT-generated response, using Jaccard similarity and cosine similarity
  • 'word_analysis.py' calculates average word count, lexical diversity, and most frequent words for each response in the corpus and writes the results to 'word_analysis_results.csv', while 'word_analysis_plot.py' plots those results
  • 'word_phrase.py' finds the percentage of texts that contain certain words or phrases out of the entire corpus
  • 'identify_category.py' categorizes college responses according to specific lexicons
  • 'jbdelta_average.py' tokenizes responses, calculates word frequency statistics, computes the deviation of each text from the corpus average using z-scores, and visualizes these deviations in a bar chart
  • 'jbdelta_reference.py' is similar to 'jbdelta_average.py', but instead calculates the deviation between a single test text and the rest of the corpus
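
As a rough illustration of the measures named above (top n-grams, lexical diversity, Jaccard similarity, and cosine similarity), here is a minimal sketch using only the Python standard library. It is not the code in this repository: the function names, the regex tokenizer, and the use of raw term counts for cosine similarity are all assumptions made for the example, and the repository's scripts may preprocess and weight text differently (for instance with 'nltk' or 'sklearn').

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (hypothetical tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def top_ngrams(tokens, n=2, k=3):
    """Return the k most common n-grams as (tuple_of_words, count) pairs."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams).most_common(k)


def lexical_diversity(tokens):
    """Type-token ratio: number of unique words divided by total words."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def jaccard_similarity(tokens_a, tokens_b):
    """Shared vocabulary size divided by combined vocabulary size."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between the two raw term-count vectors."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


if __name__ == "__main__":
    # Two made-up sentences, used only to exercise the functions.
    first = tokenize("We value diversity")
    second = tokenize("We value inclusion")
    print(top_ngrams(first, n=2, k=1))
    print(lexical_diversity(first))
    print(jaccard_similarity(first, second))
    print(cosine_similarity(first, second))
```

Jaccard similarity looks only at which words appear, while cosine similarity on counts also reflects how often they appear, so the two can rank the same pair of responses differently.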

Getting Started

  • 'statement_to_csv' depends on a 'credentials.json' file, which is not included in this repository for security reasons. This code does not need to be run, as its results are already stored in 'csv_files'

  • 'region_finder' can be run from the home directory

  • 'tfidf_analysis' needs to be run from the tfidf directory, and 'vader_sentiment' needs to be run from the sentiment directory

Dependencies

  • This repository uses the third-party packages 'pandas', 'numpy', 'sklearn', 'nltk', 'vaderSentiment', 'altair', and 'googleapiclient', along with the standard-library modules 'os' and 'csv'.
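
Before running the scripts, it can help to check which of these packages are importable in the current environment. The sketch below is an illustration only and is not part of the repository: the helper name `missing_modules` and the `REQUIRED_MODULES` list are hypothetical, built from the import names listed above. Note that 'sklearn' is provided by the scikit-learn distribution and 'googleapiclient' by google-api-python-client.

```python
import importlib.util

# Import names of the third-party packages listed above (assumed list).
REQUIRED_MODULES = [
    "pandas",
    "numpy",
    "sklearn",
    "nltk",
    "altair",
    "vaderSentiment",
    "googleapiclient",
]


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("Missing modules:", ", ".join(missing) if missing else "none")
```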

Contributors

mattdcurrie, roopikarisam
