Using machine learning to analyze subreddit comments for hate/toxicity.
RedDictio is, overall, a test of our ability to create a webpage, connect it to a hosted database, scrape data from Reddit, and classify that data using a neural network. It touches several fields of computing, including Database Design, Data Engineering, Data Science, Machine Learning, Cloud Computing, and Web Development.
- Data for the neural network was one of the biggest issues. Deciding what counts as hateful language is a serious matter, so it is important to detect hate with the highest accuracy possible. We tested two different datasets. The first was from Reddit, but the network could only reach roughly 75% accuracy on its own validation set. The second was generated by another neural network, but it could only reach 80% accuracy on its validation set. We tried combining the two to see whether accuracy would improve, but it did not. Finally, we used a Twitter dataset, which reached 95% accuracy. While this is by no means perfect and a higher accuracy should be the goal, it was a good choice for the project given the time constraints.
- We considered hosting the neural network online on Google Vertex AI or on a virtual machine. Both were great options, but we could not get either to work well at a low enough price. Vertex AI would have cost us over $100 a month, while a virtual machine would have cost a few dollars a day. Since we wanted the project to be hosted permanently, not just for this semester, we looked for another solution. We ended up hosting the neural network on Google Drive. This is certainly not the best solution, but it is the cheapest and the most effective for the price.
- We had the option of handling all of the processing in the cloud, but because of our inexperience and the cost of Google Cloud, we ended up running it on Google Colab. Google Colab is a cloud service provided by Google. Its first premium tier costs roughly $10 a month, which is significantly cheaper than the other options Google Cloud offered. The only drawback is that it has to be run and monitored manually, but it still uses Google GPUs and storage instead of our own computers.
- We tried using Google Query, but it would not work with our data and its setup tutorials were very confusing. We then tried Cloud SQL, but it was far too expensive. As a result, we went for the safer option of using SQLite3 and hosting the DB directly with the webpage. This was not a good choice: it would slow the server and make the DB incredibly hard to edit, but we wanted an option that worked. Luckily, we discovered Amazon RDS and migrated all the data and code to MySQL. Now all the processing connects to the RDS DB and updates it remotely, while the webpage accesses the same DB remotely. We can therefore update the DB dynamically and change the data whenever we want, all for free.
- At first we hosted the webpage through Google App Engine. However, we were being charged ~$10/day, which is out of the question for broke college students. Google Cloud has many similarly named products, so it was confusing to figure out which one to use. After researching the cost of each option, we switched to Google Cloud Run and have only been charged a few cents since then.
- Since we are remotely accessing our database and auto-deploying to Google Cloud Run, it would be a terrible idea to make those credentials public. To keep the credentials private, we used GitHub Secrets and environment variables. GitHub Secrets is a built-in GitHub feature that lets the user store credentials and other ‘secrets’ and use them in workflows. This is what enables the auto-deployment to Google Cloud Run. The environment variables had to be configured within Google Cloud Run to supply the credentials needed to access the remote database.
- Rework the neural network
- Limit the number of comments displayed on one page
- Add more neural networks
- Allow users to vote on whether a comment is hateful or not
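The credential handling described earlier (database credentials supplied through environment variables configured in Google Cloud Run, rather than committed to the repository) can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the variable names `DB_HOST`, `DB_USER`, `DB_PASSWORD`, `DB_NAME`, and `DB_PORT`, and the helper `load_db_config`, are all assumptions made for the example.

```python
import os
from typing import Mapping

# Hypothetical variable names; the names the project really uses are not
# given in this README.
REQUIRED_VARS = ("DB_HOST", "DB_USER", "DB_PASSWORD", "DB_NAME")


def load_db_config(env: Mapping[str, str] = os.environ) -> dict:
    """Collect database settings from environment variables.

    Raises RuntimeError naming every missing variable, so a misconfigured
    deployment fails at startup instead of on the first query.
    """
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "host": env["DB_HOST"],
        "user": env["DB_USER"],
        "password": env["DB_PASSWORD"],
        "database": env["DB_NAME"],
        # 3306 is the default MySQL port; DB_PORT is an optional,
        # hypothetical override.
        "port": int(env.get("DB_PORT", "3306")),
    }
```

The dictionary keys are chosen to match the keyword arguments commonly accepted by MySQL client libraries such as `mysql-connector-python`; which client the project uses is not stated here. Passing the environment in as a parameter keeps the function easy to test without touching real credentials.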
At this time we aren't looking for any contributors. If you feel like you have an idea that would benefit this project, please feel free to contact either Jairo or Ryan.
Jairo Garciga - LinkedIn | GitHub
Ryan Smith - LinkedIn | GitHub
Project Link: https://github.com/rpsmith77/RedDictio
Thank you to everyone who helped with this. Special shout out to:
- Professor Greenwell
- This "Deploy To Google Cloud Run Using Github Actions" tutorial
- Caffeine
- Viewers Like You