s2t2 / openai-embeddings-2023

Classifying users on social media, using text embeddings from OpenAI and others

Home Page: https://s2t2.github.io/openai-embeddings-2023/

Languages: HTML 99.75%, Jupyter Notebook 0.24%, Python 0.01%
Topics: classification, dimensionality-reduction, machine-learning, twitter-dataset, bot-classification, sentiment-analysis, toxicity-classification

openai-embeddings-2023's Introduction

openai-embeddings-2023

OpenAI Text Embeddings for User Classification in Social Networks

Setup

Virtual Environment

Create and/or activate virtual environment:

conda create -n openai-env python=3.10
conda activate openai-env

Install package dependencies:

pip install -r requirements.txt

OpenAI API

Obtain an OpenAI API Key (to be set as the OPENAI_API_KEY environment variable). We initially fetched embeddings from the OpenAI API via the notebooks, but the service code has since been re-implemented in this repo, in case you want to experiment with obtaining your own embeddings.
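
For reference, here is a minimal sketch of fetching an embedding directly with the openai Python package (v1+); the model name and example text are illustrative, and the repo's own service code lives in app.openai_service:

# a minimal sketch, assuming the openai package (v1+) is installed
# and OPENAI_API_KEY is set in the environment
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-ada-002",
    input=["example tweet text to embed"],
)
embedding = response.data[0].embedding  # a list of 1536 floats for ada-002
print(len(embedding))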

Users Sample

Obtain a copy of the "botometer_sample_openai_tweet_embeddings_20230724.csv.gz" CSV file, and store it in the "data/text-embedding-ada-002" directory in this repo. This file was generated by the notebooks, and is excluded from version control because it contains user identifiers.
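
A minimal sketch of loading this file with pandas (the exact columns are not documented here, so inspect them after loading):

# a minimal sketch, assuming pandas is installed
import pandas as pd

csv_filepath = "data/text-embedding-ada-002/botometer_sample_openai_tweet_embeddings_20230724.csv.gz"
df = pd.read_csv(csv_filepath, compression="gzip")
print(df.shape)
print(df.columns.tolist())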

Cloud Storage

We are saving trained models to Google Cloud Storage. You will need to create a project on Google Cloud and enable the Cloud Storage API. Then create a service account, download its JSON credentials file, and store it in the root directory of this repo as "google-credentials.json". This file is excluded from version control.

From the Cloud Storage console, create a new bucket, and note its name (to be set as the BUCKET_NAME environment variable).
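
A hedged sketch of saving and loading a trained model via the bucket, using the google-cloud-storage package (the blob path and local filenames are illustrative assumptions):

# a minimal sketch, assuming google-cloud-storage and joblib are installed, and
# GOOGLE_APPLICATION_CREDENTIALS and BUCKET_NAME are set in the environment
import os
from google.cloud import storage

client = storage.Client()  # authenticates via GOOGLE_APPLICATION_CREDENTIALS
bucket = client.bucket(os.getenv("BUCKET_NAME"))

# upload a locally saved model (illustrative path)
blob = bucket.blob("models/logistic_regression/model.joblib")
blob.upload_from_filename("results/model.joblib")

# ... and later, download it back
blob.download_to_filename("downloaded_model.joblib")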

Environment Variables

Create a local ".env" file and add contents like the following:

# this is the ".env" file...

OPENAI_API_KEY="sk__________"

GOOGLE_APPLICATION_CREDENTIALS="/path/to/openai-embeddings-2023/google-credentials.json"
BUCKET_NAME="my-bucket"

DATASET_ADDRESS="my_project.my_dataset"
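
The scripts presumably read these values at runtime; here is a minimal sketch of loading them with python-dotenv (an assumption about how the config is loaded, not necessarily the repo's exact approach):

# a minimal sketch, assuming python-dotenv is installed
import os
from dotenv import load_dotenv

load_dotenv()  # reads the local ".env" file

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
BUCKET_NAME = os.getenv("BUCKET_NAME", "my-bucket")
DATASET_ADDRESS = os.getenv("DATASET_ADDRESS", "my_project.my_dataset")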

Usage

OpenAI Service

Fetch some example embeddings from the OpenAI API:

python -m app.openai_service

Embeddings per User (v1)

Demonstrate ability to load the dataset:

python -m app.dataset

Perform machine learning and other analyses on the data:

OpenAI Embeddings:

Word2Vec Embeddings:

Embeddings per Tweet (v1)

OpenAI Embeddings:

Testing

pytest --disable-warnings

openai-embeddings-2023's People

Contributors: s2t2

openai-embeddings-2023's Issues

Toxicity and News Quality

Perform classification on additional targets: binarized versions of the toxicity and news quality scores.
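
A hedged sketch of what the binarization might look like (the column names and the 0.5 threshold are illustrative assumptions, not the project's chosen cutoffs):

# a minimal sketch, assuming a pandas DataFrame with continuous score columns;
# column names and thresholds are illustrative assumptions
import pandas as pd

def binarize(scores: pd.Series, threshold: float = 0.5) -> pd.Series:
    """Convert continuous scores to 0/1 labels at a given cutoff."""
    return (scores >= threshold).astype(int)

# df = pd.read_csv("data/users_sample.csv")  # hypothetical file with "avg_toxicity" and "avg_news_quality" columns
# df["is_toxic"] = binarize(df["avg_toxicity"])
# df["is_high_quality_news"] = binarize(df["avg_news_quality"])  # label direction depends on the score's scale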

Single Results File

Loop through all of the classification results JSON files (including the reduced-dimensionality classification results) and generate a single CSV file of the results (see the sketch after the column list below).

The CSV results file should have the following columns:

  • Dataset (OpenAI Embeddings, PCA-2, PCA-3, TSNE-2, TSNE-3, UMAP-2, UMAP-3, etc.)
  • Classification Target (Bot Status, Opinion Community, Toxicity, Fourway Label, etc.)
  • Classification Method (Logistic Regression, Random Forest, XGBoost)
  • Accuracy Score
  • ROC AUC Score
  • F1 Macro Score
  • F1 Weighted Score
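
A hedged sketch of the aggregation script (the results directory layout and JSON key names are assumptions; adapt them to the actual files):

# a minimal sketch, assuming results are stored as JSON files under a "results/" directory
# and that each file contains the listed metrics; paths and key names are assumptions
import json
from glob import glob
import pandas as pd

records = []
for json_filepath in glob("results/**/*.json", recursive=True):
    with open(json_filepath) as f:
        result = json.load(f)
    records.append({
        "dataset": result.get("dataset"),    # e.g. "OpenAI Embeddings", "PCA-2", "UMAP-3"
        "target": result.get("target"),      # e.g. "Bot Status", "Opinion Community"
        "method": result.get("method"),      # e.g. "Logistic Regression", "XGBoost"
        "accuracy": result.get("accuracy"),
        "roc_auc": result.get("roc_auc"),
        "f1_macro": result.get("f1_macro"),
        "f1_weighted": result.get("f1_weighted"),
    })

pd.DataFrame(records).to_csv("results/all_results.csv", index=False)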

Applying Trained Models to other Datasets

Goal: use the models we trained on the OpenAI text embeddings to perform user classification on other datasets of political discussion on Twitter.

Datasets: use the combined Election 2020 + Transition 2021 dataset in the shared environment ("election_2020_transition_2021_combined").

Models: stored on Google Cloud Storage (see the example notebook for how to load them).

Success criteria: store the scores back in Google BigQuery (in new tables in that combined dataset).

Process / Steps:

  1. Assemble a table of user timeline tweet texts, with a row per user and all of their tweet texts concatenated together. See the notebooks for an example approach (FYI: consider all of notebooks 1-3 and try to consolidate them into a single process as applicable). Whether we take only the unique tweets for each user or leave duplicates in remains to be seen; first let's query the dataset to see how common this is (for each user, how many times do they repeat a given tweet verbatim?). We will likely use all of a user's tweets (hopefully we sufficiently de-duped during the collection process, but we should check for duplicates before moving on). So now we have a table with user_id and tweet_texts columns. Note: this table will only have a sample of each user's tweets (for example, max 50 or maybe 100 tweets per user, selected at random). For the script, let's parameterize the tweet limit as an environment variable, perhaps called TWEETS_MAX or TWEETS_LIMIT, and let's also consider including this number in the name of the BigQuery table used to store the embeddings (like "openai_user_timeline_embeddings_max_50" or "openai_user_timeline_embeddings_max_100", using different tables for different limits). Let's start with 50 only for now, as it matches the approach we used when training the models. Note that for all subsequent tables derived from this data, we might want to include "_max_50" in the table names as well, to allow us to differentiate later.
  2. For each user in that new table, we will loop through them (maybe in batches) and obtain OpenAI text embeddings for each user's texts. Let's use the same ada text embedding model that we used when training the models, and leverage the existing code to obtain the embeddings. Specifically, when fetching the embeddings, we will "fetch in dynamic batches" to get around API limits. Let's store the embeddings in a separate table, in a way that lets us know which embeddings belong to which users (still a row per user). When we save the embeddings, let's save all ~1500 values into a single "embeddings" column with an array datatype (we may need to run BigQuery migrations first to set up that table structure). The script for obtaining embeddings should only attempt to obtain embeddings for users we haven't already processed, so at the top of the script we may want to first fetch only the list of users that don't currently have records in the embeddings table (which might require joining the user timeline texts table to the embeddings table). For this script, let's use an environment variable called something like USERS_LIMIT, for testing and developing with small batches of 5-10 users at a time.
  3. Demonstrate the ability to load the trained models from cloud storage. We can use the existing storage service. NOTE: Logistic Regression may be the most reliably loadable (there may be issues with the random forest models; we need to revisit how they were saved).
  4. Let's use the text embeddings as inputs to perform classification for each task (bot detection, opinion classification, toxicity, news quality). We may have separate tables for each task, or a column denoting which type of scores are being stored. We should also keep track of which model was used to produce the scores. We'll wind up with a table (or tables) of scores and probabilities for each user for each classification task (see the sketch after this list):
    + Bot Status
    + Opinion Community
    + Lang Toxicity
    + News Quality
    + Fourway label (multiclass bot status x opinion community)
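
A hedged sketch of steps 3 and 4, loading a trained model from cloud storage and scoring stored embeddings (the blob path, table names, and column names are illustrative assumptions; models are assumed to have been saved with joblib):

# a minimal sketch, assuming google-cloud-storage, google-cloud-bigquery,
# joblib, numpy, and scikit-learn are installed; all names are illustrative
import os
import joblib
import numpy as np
from google.cloud import bigquery, storage

# step 3: load a trained model (logistic regression) from cloud storage
storage_client = storage.Client()
bucket = storage_client.bucket(os.getenv("BUCKET_NAME"))
bucket.blob("models/bot_status/logistic_regression/model.joblib").download_to_filename("model.joblib")
model = joblib.load("model.joblib")

# step 4: fetch stored embeddings and score them
bq = bigquery.Client()
rows = bq.query("""
    SELECT user_id, embeddings
    FROM `my_project.election_2020_transition_2021_combined.openai_user_timeline_embeddings_max_50`
""").result()

user_ids, embeddings = [], []
for row in rows:
    user_ids.append(row["user_id"])
    embeddings.append(row["embeddings"])

X = np.array(embeddings)
preds = model.predict(X)
probas = model.predict_proba(X)[:, 1]  # positive-class probability (binary tasks)
# ... then write user_ids / preds / probas back to a new scores table in BigQuery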

Build from Docs Branch

Right now there are a lot of HTML files in the repo, and these files are linked to via the GitHub Pages deployment. However, they are increasing the repo size, so instead we would like to build GitHub Pages from a docs branch: keep the HTML files on the docs branch and ignore them on the main branch. Basically, we still want to see the HTML files on the website, just without them cluttering up the repo.

Election 2020 - More Embeddings Models

For next steps in the research, we would like to fetch embeddings using OpenAI's two newer models: text-embedding-3-small and text-embedding-3-large.

It would be great if we could revise our current approach with the election 2020 data, to use whichever model has been set as the MODEL_ID.

This means we need to operationalize the migration queries through the bq service, instead of running them manually via the README. We will need to construct the queries according to which model has been set as the MODEL_ID.

We can use a dictionary mapping of model IDs to corresponding table names (see the sketch below). Let's use different tables for the different models, because some models' embeddings have more dimensions than others, and there aren't that many models.
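
A hedged sketch of such a mapping and a parameterized migration query (the table names are illustrative assumptions; 1536 / 1536 / 3072 are the published embedding dimensions for these models):

# a minimal sketch, assuming google-cloud-bigquery is installed; in practice the
# migration would run through the existing bq service rather than manually
import os
from google.cloud import bigquery

EMBEDDINGS_TABLES = {
    "text-embedding-ada-002": {"table": "openai_embeddings_ada_002", "dims": 1536},
    "text-embedding-3-small": {"table": "openai_embeddings_3_small", "dims": 1536},
    "text-embedding-3-large": {"table": "openai_embeddings_3_large", "dims": 3072},
}

MODEL_ID = os.getenv("MODEL_ID", "text-embedding-3-small")
DATASET_ADDRESS = os.getenv("DATASET_ADDRESS", "my_project.my_dataset")

table_name = EMBEDDINGS_TABLES[MODEL_ID]["table"]
migration_sql = f"""
    CREATE TABLE IF NOT EXISTS `{DATASET_ADDRESS}.{table_name}` (
        user_id STRING,
        embeddings ARRAY<FLOAT64>
    )
"""
bigquery.Client().query(migration_sql).result()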
