reddit_llm_clustering

Clusters reddit threads using an LLM API. As of 2023/04/10, the LLM in use is Gemini.

Prerequisites

You MUST provide an LLM API key (e.g. Gemini) to cluster. Check out this official website for the how-to.

(OPTIONAL) If you want to get live hot reddit posts, set up your own reddit API access. This is a good repo on GitHub to help you set it up. You might also want to check out the official reddit developer website for how to set up your own API.

Instructions:

  1. Copy this repository:

git clone https://github.com/0ethel0zhang/reddit_llm_clustering.git

  2. Create a virtual environment by running the following command on the command line (replace NAMEYOULIKE with whatever name you want for your virtual environment):

conda env create -f environment.yml -n NAMEYOULIKE

  3. Activate the environment using:

conda activate NAMEYOULIKE

  4. Create a .env file in the main directory with the following access information (use your own keys and tokens):

export user_name="whatever"
export client_id="whatever"
export client_secret="whatever"
export redirect_uri="whatever"
export app_name="whatever"
export access_token="whatever"
export API_KEY="whatever"
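
As a rough illustration of how a script could read the values above, here is a minimal sketch using only the standard library. It is not taken from this repository: the helper name `parse_env_lines` is hypothetical, and the repository's own scripts may load the .env file differently (for example through a dotenv package).

```python
def parse_env_lines(lines):
    """Parse KEY=value lines, tolerating an optional leading 'export'.

    Hypothetical helper for illustration; not part of this repository.
    """
    values = {}
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        if line.startswith("export "):
            line = line[len("export "):].lstrip()
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a KEY=value line
        # Drop surrounding whitespace and one layer of quotes.
        values[key.strip()] = value.strip().strip("\"'")
    return values


# Made-up sample content in the same shape as the .env file above.
sample = [
    'export client_id="abc123"',
    'export API_KEY="secret"',
    "# a comment",
    "",
]
config = parse_env_lines(sample)

# API_KEY is the only key the README marks as mandatory.
missing = [name for name in ("API_KEY",) if name not in config]
print(config, missing)
```

To read a real file, pass an open file handle, e.g. `parse_env_lines(open(".env"))`.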

  5. There are two programs that you can run:

    5.1 (OPTIONAL) Fetches live data through your own reddit API credentials (a prerequisite for running this program). The default subreddit is r/yoga. You can edit the subreddit based on your interest.
    python reddit_r_yoga.py

    5.2 Clusters subreddit titles into groups and prints each cluster with its underlying titles. The methodology is covered in more detail in my GitHub tutorial dedicated to it.
    python get_titles.py
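
To give a feel for the clustering step, here is a hedged sketch of one way titles could be sent to an LLM and the reply turned into printed clusters. Everything in it is an assumption for illustration: the prompt wording, the JSON reply format, and the function names `build_prompt` and `group_titles` are hypothetical and are not the method used by get_titles.py. The LLM call itself is left out, and a canned reply stands in for it.

```python
import json
from collections import defaultdict


def build_prompt(titles):
    """Build a prompt asking the model to label each numbered title."""
    numbered = "\n".join(f"{i}: {title}" for i, title in enumerate(titles))
    return (
        "Group the following Reddit thread titles into topical clusters. "
        "Reply with a JSON object mapping each title number to a short "
        "cluster name.\n" + numbered
    )


def group_titles(titles, response_text):
    """Turn a JSON reply of {index: cluster name} into {cluster: [titles]}."""
    assignments = json.loads(response_text)
    clusters = defaultdict(list)
    for index, title in enumerate(titles):
        # Titles the model skipped fall into an "unassigned" bucket.
        label = assignments.get(str(index), "unassigned")
        clusters[label].append(title)
    return dict(clusters)


# Made-up titles and a canned reply standing in for a real LLM response.
titles = [
    "Best mat for hot yoga?",
    "Wrist pain in downward dog",
    "Mat recommendations under $50",
]
canned_response = '{"0": "gear", "1": "injuries", "2": "gear"}'

clusters = group_titles(titles, canned_response)
for label, members in clusters.items():
    print(label)
    for title in members:
        print("  -", title)
```

In a real run, the string returned by `build_prompt(titles)` would be sent to the LLM API and its reply passed to `group_titles` in place of the canned text.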

OUTPUT

After everything has finished running, voilà, you have:

  1. a document called output.py with the top hot Reddit threads in JSON format. (If you ran the reddit program, you'd have the latest Reddit hot threads.)
  2. a CSV file with the clustering results in the following columns: cluster, title, permalink.
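
The CSV layout described above can be written and read back with Python's standard csv module. The sketch below is illustrative only: the rows are made up, an in-memory buffer stands in for the real file (whose name is not stated here), and the helper names are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Column order as listed in the OUTPUT section.
FIELDNAMES = ["cluster", "title", "permalink"]


def write_rows(handle, rows):
    """Write a header plus one line per clustered title."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


def read_by_cluster(handle):
    """Read the CSV back and group (title, permalink) pairs by cluster."""
    grouped = defaultdict(list)
    for row in csv.DictReader(handle):
        grouped[row["cluster"]].append((row["title"], row["permalink"]))
    return dict(grouped)


# Made-up rows; the permalinks are placeholders, not real threads.
rows = [
    {"cluster": "gear", "title": "Mat recommendations, cheap?",
     "permalink": "/r/yoga/comments/example1/"},
    {"cluster": "injuries", "title": "Wrist pain in downward dog",
     "permalink": "/r/yoga/comments/example2/"},
    {"cluster": "gear", "title": "Best mat for hot yoga?",
     "permalink": "/r/yoga/comments/example3/"},
]

buffer = io.StringIO()  # stands in for the real CSV file
write_rows(buffer, rows)
buffer.seek(0)
grouped = read_by_cluster(buffer)

for cluster, items in grouped.items():
    print(cluster, len(items))
```

Using `csv.DictWriter` rather than joining strings by hand keeps titles that contain commas intact, since they are quoted automatically.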
