507_final's Introduction

SI 507 Final Project: Extracting Videos and Podcasts from VOX Media

Purpose: to gather data from VOX Media's video and podcast content and to compare which topics each medium covers

- This project uses BeautifulSoup to capture each URL's HTML, so no API keys are necessary.
- This analysis can provide a basis for further exploring how Vox chooses topics for different mediums. I hypothesize that there may be a discrepancy in the proportion of topic mentions due to intrinsic restrictions of each medium (video vs. podcasts).

This code requires Python 3 to run. SI507F17_finalproject.py includes code that gathers HTML, sets class attributes, and inserts the data into a PostgreSQL database. SI507F17_finalproject_tests.py tests the code file.

NOTE: SI507F17_finalproject.py uses the Selenium package to scroll through the podcast iframes (5 iframes in total, one per podcast show). This takes time on every run and opens 5 additional browser windows, but it should all run without errors the first time. Keep this in mind if you plan to run the file several times. The same applies to running the test file.

Please use the config_example.py file as a template for the database code (which starts on line 213 of the code file).
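
As a rough illustration of what a config file built from that template might contain, here is a minimal sketch. The variable names below are hypothetical and are not taken from config_example.py; use the names that file actually defines.

```python
# config.py: hypothetical sketch only. The real field names are whatever
# config_example.py defines; these three are assumptions for illustration.
db_name = "vox_media"      # assumed: name of the Postgres database that will hold the two tables
db_user = "your_username"  # assumed: Postgres role used to connect
db_password = ""           # assumed: leave empty if your local setup needs no password
```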

Before running the files, please install all items in the requirements.txt file.

Summary of code file:

- Set up caching system functions
- Set up HTML scraping functions 
	- Use Selenium to capture HTML
- Set up HTML text mining functions
- Run HTML functions on video URLs
- Run HTML functions on podcast URLs
- Create class definitions
- Create class instances
	- Class attributes will capture data elements needed for database
- Create database connection (I used TeamSQL)
- Create 2 database tables (videos, podcasts)
- Insert data using class definitions
- Query data to get counts of keywords
- Configure chart in Plotly
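
The caching and keyword-counting steps in the summary above can be sketched roughly as follows. This is not the code from SI507F17_finalproject.py: the file name, the function names, and the `fetch` parameter are all assumptions made for illustration, and only the standard library is used.

```python
import json
import os
import re
from collections import Counter

CACHE_FILE = "cache_sketch.json"  # hypothetical cache file name


def load_cache(path=CACHE_FILE):
    """Return the cached {url: html} mapping, or an empty dict if no cache exists."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    return {}


def get_html(url, fetch, path=CACHE_FILE):
    """Return HTML for url, calling fetch(url) only on a cache miss."""
    cache = load_cache(path)
    if url not in cache:
        cache[url] = fetch(url)  # fetch is any callable that downloads the page
        with open(path, "w", encoding="utf-8") as f:
            json.dump(cache, f)
    return cache[url]


def count_keywords(titles, keywords):
    """Count how many titles mention each keyword (case-insensitive, whole words)."""
    counts = Counter()
    for title in titles:
        words = set(re.findall(r"[a-z']+", title.lower()))
        for kw in keywords:
            if kw.lower() in words:
                counts[kw] += 1
    return counts
```

Caching the raw HTML matters most for the Selenium step, since a cache hit avoids reopening the browser windows on repeated runs.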

At the end of the file processing, you should see:

- Two tables in the database (which you named using config_example.py as a template)
- A Plotly visualization that compares keywords between both Vox platforms (use the URL printed on the command line)

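
To give a feel for the two-table layout and the keyword-count query, here is a small self-contained sketch. It uses Python's built-in sqlite3 with an in-memory database as a stand-in for Postgres so that it runs without any setup. The `title` column, the sample rows, and the `keyword_count` helper are assumptions, not the project's actual schema or code.

```python
import sqlite3

# In-memory SQLite stands in for the Postgres database used by the project.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two tables, named as in the summary; the columns are hypothetical.
for table in ("videos", "podcasts"):
    cur.execute(f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, title TEXT NOT NULL)")

# Made-up sample rows for illustration only.
cur.executemany(
    "INSERT INTO videos (title) VALUES (?)",
    [("Health care, explained",), ("Why taxes matter",)],
)
cur.executemany(
    "INSERT INTO podcasts (title) VALUES (?)",
    [("The health care fight",)],
)
conn.commit()


def keyword_count(cur, table, keyword):
    """Count rows in one of the two tables whose title contains keyword."""
    if table not in ("videos", "podcasts"):
        raise ValueError("unknown table: " + table)
    cur.execute(
        f"SELECT COUNT(*) FROM {table} WHERE LOWER(title) LIKE ?",
        (f"%{keyword.lower()}%",),
    )
    return cur.fetchone()[0]
```

Counts gathered this way for each table are the kind of per-medium numbers that the Plotly chart compares.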