Clustering Reddit threads using an LLM API. As of 2023/04/10, the LLM in use is Gemini.
You MUST provide an LLM API key (i.e. a Gemini key) to cluster. Check out this official website for the how-to.
(OPTIONAL) If you want to get live hot Reddit posts, you also need to set up a Reddit API. This is a good repo on GitHub to help you set it up. You might also want to check out the official Reddit developer website for how to set up your own API.
- Copy this repository:
git clone https://github.com/0ethel0zhang/reddit_llm_clustering.git
- Create a virtual environment by running the following command on the command line (replace NAMEYOULIKE with whatever name you want to call your virtual environment):
conda env create -f environment.yml -n NAMEYOULIKE
- Activate the environment using:
conda activate NAMEYOULIKE
- Create a .env file in the main directory with the following access information (use your own keys and tokens):
export user_name="whatever"
export client_id="whatever"
export client_secret="whatever"
export redirect_uri="whatever"
export app_name="whatever"
export access_token="whatever"
export API_KEY="whatever"
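How the scripts in this repo read the .env file is not shown here, so the following is only a minimal sketch of one way to load these values in Python using the standard library. The helper names `parse_env_line` and `load_env` are hypothetical and not part of this repository.

```python
import os
from pathlib import Path


def parse_env_line(line):
    """Return (key, value) for one .env line, or None for blanks and comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    # Tolerate the shell-style "export " prefix used in the listing above.
    if line.startswith("export "):
        line = line[len("export "):]
    key, sep, value = line.partition("=")
    if not sep:
        return None
    # Strip surrounding whitespace and one pair of matching quotes.
    key, value = key.strip(), value.strip()
    if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
        value = value[1:-1]
    return key, value


def load_env(path=".env"):
    """Read a .env file into os.environ and return the parsed pairs."""
    pairs = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        parsed = parse_env_line(raw)
        if parsed:
            pairs[parsed[0]] = parsed[1]
    os.environ.update(pairs)
    return pairs
```

After calling `load_env()`, a value such as the LLM key would be available as `os.environ["API_KEY"]`.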
- There are two programs that you can run:
5.1 (OPTIONAL) Uses your Reddit API (a prerequisite for running this program) to get live data. The default subreddit is r/yoga. You can edit the thread based on your interest.
python reddit_r_yoga.py
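For readers curious what fetching hot posts involves, here is a rough sketch using only the standard library and Reddit's OAuth endpoint for a subreddit's hot listing. It is an assumption about the general approach, not a copy of what reddit_r_yoga.py does (the script may well use a client library instead). The function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_hot_request(subreddit, access_token, user_agent, limit=25):
    """Build an authenticated request for a subreddit's hot listing."""
    query = urllib.parse.urlencode({"limit": limit})
    url = f"https://oauth.reddit.com/r/{subreddit}/hot?{query}"
    headers = {
        "Authorization": f"bearer {access_token}",
        "User-Agent": user_agent,  # Reddit expects a descriptive user agent
    }
    return urllib.request.Request(url, headers=headers)


def extract_posts(listing):
    """Pull title and permalink out of a Reddit listing payload."""
    return [
        {"title": child["data"]["title"], "permalink": child["data"]["permalink"]}
        for child in listing["data"]["children"]
    ]


def fetch_hot_posts(subreddit, access_token, user_agent, limit=25):
    """Download the hot listing and return simplified post records."""
    request = build_hot_request(subreddit, access_token, user_agent, limit)
    with urllib.request.urlopen(request, timeout=30) as response:
        return extract_posts(json.load(response))
```

With the values from the .env file, a call might look like `fetch_hot_posts("yoga", access_token, app_name)`.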
5.2 Clusters subreddit titles into groups and prints out each cluster and its underlying titles. You can read more about the methodology in my GitHub tutorial dedicated to it.
python get_titles.py
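The actual methodology lives in the tutorial mentioned above and is not reproduced here. Purely as an illustration of what prompt-based clustering can look like, the sketch below builds a prompt from a list of titles and parses a JSON reply. Everything in it is an assumption: the prompt wording, the reply format, and the `ask_llm` callable (any function that takes a prompt string and returns the model's text) are hypothetical and not taken from get_titles.py.

```python
import json


def build_cluster_prompt(titles, n_clusters=5):
    """Number the titles and ask the model for a JSON grouping."""
    numbered = "\n".join(f"{i}: {title}" for i, title in enumerate(titles))
    return (
        f"Group the following Reddit post titles into at most {n_clusters} "
        "topical clusters. Reply with JSON only: a list of objects with "
        'keys "cluster" (a short label) and "indices" (a list of title numbers).'
        f"\n\n{numbered}"
    )


def parse_cluster_reply(reply, titles):
    """Turn the model's JSON reply into {cluster label: [titles]}."""
    clusters = {}
    for group in json.loads(reply):
        # Ignore anything that is not a valid index into the title list.
        members = [
            titles[i]
            for i in group["indices"]
            if isinstance(i, int) and 0 <= i < len(titles)
        ]
        if members:
            clusters.setdefault(group["cluster"], []).extend(members)
    return clusters


def cluster_titles(titles, ask_llm, n_clusters=5):
    """Run one prompt/parse round trip through the supplied LLM callable."""
    reply = ask_llm(build_cluster_prompt(titles, n_clusters))
    return parse_cluster_reply(reply, titles)
```

Passing the LLM call in as a function keeps the sketch independent of any particular client library and makes it easy to try out with a canned reply.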
After everything has finished running, voilà, you have:
- a document called output.py with the top hot Reddit threads in JSON format. (If you ran the Reddit program, you'd have the latest hot Reddit threads.)
- a CSV file that has the result of the clustering, with the following columns of data: cluster, title, permalink.
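To browse the clustering result, something like the following sketch could be used. It assumes only the three columns listed above (cluster, title, permalink). The CSV file's name is not stated here, so the path is left as a parameter, and the function names are hypothetical.

```python
import csv
from collections import defaultdict


def group_by_cluster(csv_lines):
    """Group (title, permalink) pairs by their cluster column.

    csv_lines can be an open file or any iterable of CSV text lines.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(csv_lines):
        groups[row["cluster"]].append((row["title"], row["permalink"]))
    return dict(groups)


def print_clusters(path):
    """Print each cluster followed by its titles and permalinks."""
    with open(path, newline="", encoding="utf-8") as handle:
        groups = group_by_cluster(handle)
    for cluster, posts in groups.items():
        print(f"Cluster {cluster} ({len(posts)} posts)")
        for title, permalink in posts:
            print(f"  {title} -> {permalink}")
```

Call `print_clusters(...)` with the path to the CSV file produced by get_titles.py.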