florida-man-headline-generator

Welcome to my first personal project! I initially took on this project during the summer after my freshman year at UC Berkeley, but refactored and rewrote nearly all of the code just before the start of my junior year.

This is a "Florida Man" headline generator. Using headlines scraped from various news sources as training data, an n-gram language model generates fake headlines that sound like they could belong to real articles. To see this in action, I encourage you to clone this repository and run shell.py! This starts a terminal program that lets you add custom headlines to the training data, generate headlines in bulk, or play a guessing game to decide whether a presented headline is generated or genuine. The packages required to run the files are listed in requirements.txt.

Project Breakdown

There are three main components to this project:

  1. Web Scrapers
  2. N-gram Language Model
  3. Interactive Shell

Web Scrapers

I gathered the "Florida Man" headlines from three different news sources: Local 10 News, CBS Miami, and a dedicated Florida Man site. All scrapers were built with Selenium and BeautifulSoup: Selenium let the scrapers load and interact with the sites (for example, by pressing the "Next Page" buttons), while BeautifulSoup parsed each page's contents. Running Selenium also required a Chrome WebDriver. I last scraped for headlines on 8/8/2020, using ChromeDriver version 84.0.4147.30.

While all scrapers used Selenium and BeautifulSoup, each site required its own scraper, as the sites were all built with different HTML templates. To decide how to scrape each site's headlines, I had to study the site's source and find selectors I could use to parse the information.

When going back to rewrite the scraper code, I realized that I had initially taken unnecessarily roundabout or even unreliable approaches to finding some webpage elements. One example was my original implementation of the Local 10 News scraper: to locate the "Next Page" button, I had originally used the button's XPath. However, the button's XPath is not guaranteed to remain the same between executions, which resulted in inconsistent behavior. My current implementation instead uses the button's class name, which is always fixed.
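The difference between the two lookup strategies can be illustrated without a browser. The sketch below is not the repository's scraper code: it uses the standard library's ElementTree as a stand-in for Selenium, and the page markup, the `next-page` class name, and both function names are hypothetical. It only shows why a positional path can break when the page layout shifts while a class-based lookup keeps working.

```python
import xml.etree.ElementTree as ET

# Two hypothetical snapshots of the same page. The second one has an
# extra banner <div> at the top, which shifts every positional index.
PAGE_A = """
<html><body>
  <div class="headlines"><p>headline one</p></div>
  <div class="pager"><a class="next-page" href="/page/2">Next Page</a></div>
</body></html>
"""

PAGE_B = """
<html><body>
  <div class="banner">ad</div>
  <div class="headlines"><p>headline one</p></div>
  <div class="pager"><a class="next-page" href="/page/2">Next Page</a></div>
</body></html>
"""


def find_by_position(page):
    """Positional path: 'the link inside the second <div> of <body>'."""
    root = ET.fromstring(page.strip())
    return root.find("./body/div[2]/a")


def find_by_class(page):
    """Class-based lookup: 'any <a> whose class is next-page'."""
    root = ET.fromstring(page.strip())
    return root.find(".//a[@class='next-page']")


# The positional path finds the button on PAGE_A but returns None on PAGE_B;
# the class-based lookup finds it on both.
for name, page in (("A", PAGE_A), ("B", PAGE_B)):
    print(name, find_by_position(page) is not None, find_by_class(page) is not None)
```

The same reasoning applies to Selenium's class-name locators: they depend on an attribute of the element itself rather than on where the element happens to sit in the page tree.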

N-gram Language Model

The N-gram language model is a predictive language model used for applications like producing Shakespeare-like text. The model works by splitting a text corpus into grams of a fixed length (in words) and using the grams to form a conditional probability distribution that maps a text history to possible next words. As an example, if the text history were "Florida man..." the language model might predict that the next word is "arrested" with 30% probability, "assaults" with 25% probability, "reported" with 10% probability, etc. The model then randomly chooses a word based on that distribution and updates the text history; if "arrested" were chosen, the new text history would be "man arrested...", and then another word would be chosen. The value of N in the name N-gram language model is the length of each gram, or phrase. In the example above the text history is two words long; under the usual convention, where a gram consists of the history plus the predicted word, that makes it a trigram model (N = 3).
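The conditional distribution described above can be written down explicitly. This is a sketch that assumes the probabilities are plain relative frequencies with no smoothing, which this README does not confirm. Here h is the text history, w is a candidate next word, and C(h, w) is the number of times w follows h in the corpus. The counts in the worked line are hypothetical and were chosen only to reproduce the 30% figure from the example.

```latex
\[
P(w \mid h) = \frac{C(h, w)}{\sum_{w'} C(h, w')}
\]
% Hypothetical counts: the history "Florida man" occurs 20 times in the corpus
% and is followed by "arrested" in 6 of those occurrences.
\[
P(\text{arrested} \mid \text{Florida man}) = \frac{6}{20} = 0.30
\]
```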

My implementation of the N-gram language model uses defaultdict and Counter, from the collections module in Python's standard library, to build a conditional frequency distribution with the scraped headlines as the text corpus. The language model code is relatively well encapsulated after refactoring: the interactive shell's code in shell.py only needs to call generate_grams to produce the distribution from the training data, and passing that distribution into generate_headline returns a new headline as a string.
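A minimal sketch of how such a pair of functions could look is shown below. Only the names generate_grams and generate_headline come from this README; the signatures, the `n`, `max_words`, and `rng` parameters, and the `<s>` / `</s>` boundary tokens are assumptions made for illustration and may differ from the actual code. In this sketch, n counts the history plus the predicted word, so n=3 conditions on a two-word history.

```python
import random
from collections import Counter, defaultdict

START, END = "<s>", "</s>"  # hypothetical boundary tokens, not from the repository


def generate_grams(headlines, n=3):
    """Map each (n-1)-word history to a Counter of the words that follow it."""
    grams = defaultdict(Counter)
    for headline in headlines:
        # Pad so the first real word has a full history and the end is marked.
        words = [START] * (n - 1) + headline.split() + [END]
        for i in range(len(words) - n + 1):
            history = tuple(words[i:i + n - 1])
            grams[history][words[i + n - 1]] += 1
    return grams


def generate_headline(grams, n=3, max_words=30, rng=random):
    """Sample one word at a time, sliding the history window after each pick."""
    history = (START,) * (n - 1)
    output = []
    while len(output) < max_words:
        counts = grams.get(history)
        if not counts:  # unseen history: stop rather than guess
            break
        words, weights = zip(*counts.items())
        # Draw the next word in proportion to how often it followed this history.
        word = rng.choices(words, weights=weights)[0]
        if word == END:
            break
        output.append(word)
        history = (history + (word,))[1:]
    return " ".join(output)
```

Calling `generate_headline(generate_grams(headlines))` on a list of headline strings returns one sampled headline. Passing a seeded `random.Random` instance as `rng` makes the output repeatable.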

Interactive Shell

The interactive shell allows a user to interact with the language model. The full list of commands is as follows:

  • Add custom headlines to training dataset/text corpus
  • Clear all custom headlines (remove all user added headlines)
  • Add/remove files from training data, view each .csv file individually
  • Change the value of n and retrain the model
  • Generate a batch of headlines, option to save them to a .txt file
  • Play a guessing quiz to determine whether headlines are real or generated
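A few of these commands can be sketched with the standard library's cmd module. This is not the code in shell.py: the class name, the command names (`add`, `clear`, `n`, `quit`), and the simplified `train` helper are all hypothetical, and the sketch only covers adding and clearing custom headlines and changing n.

```python
import cmd
from collections import Counter, defaultdict


def train(headlines, n):
    """Stand-in for the real training call: count each history's next words."""
    grams = defaultdict(Counter)
    for headline in headlines:
        words = headline.split()
        for i in range(len(words) - n + 1):
            grams[tuple(words[i:i + n - 1])][words[i + n - 1]] += 1
    return grams


class HeadlineShell(cmd.Cmd):
    """Hypothetical command loop; the command names are illustrative only."""

    prompt = "(florida-man) "

    def __init__(self, headlines, n=3, **kwargs):
        super().__init__(**kwargs)
        self.scraped = list(headlines)  # training data from the scrapers
        self.custom = []                # headlines added by the user
        self.n = n
        self.retrain()

    def retrain(self):
        """Rebuild the distribution from scraped plus custom headlines."""
        self.grams = train(self.scraped + self.custom, self.n)

    def do_add(self, line):
        """add <headline>: add a custom headline and retrain."""
        if line.strip():
            self.custom.append(line.strip())
            self.retrain()

    def do_clear(self, line):
        """clear: remove all user-added headlines and retrain."""
        self.custom.clear()
        self.retrain()

    def do_n(self, line):
        """n <int>: change the gram length and retrain."""
        if line.strip().isdigit() and int(line) >= 1:
            self.n = int(line)
            self.retrain()
        else:
            self.stdout.write("n must be a positive integer\n")

    def do_quit(self, line):
        """quit: leave the shell."""
        return True
```

Running `HeadlineShell(headlines).cmdloop()` with a list of headline strings would start the prompt. Keeping custom headlines in their own list is one simple way to make the "clear all custom headlines" command cheap to implement.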

While I added no new functionality when I refactored the code, I sped up the runtime of multiple functions, improved the consistency of the text prompts to users, and made the code more concise.

Authors

Kevin Hsu
