Git Product home page Git Product logo

imdb's Introduction

Travis-CI Build Status

Getting Started

imdb is a light package that leverages the the Python module IMDbPy and the etl framework to make mirroring the IMDB in SQL painless, with user interaction taking place entirely within R.

Prerequisities

You must install the Python module IMDbPy, which also has external dependencies. For Ubuntu, the following command should install everything you need. Binaries for Mac OS X and Windows are available from the project's Download page. [You may want to consult the .travis.yml file for a list of those dependencies.]

sudo apt-get install python-imdbpy python-sqlalchemy python-sqlobject python-psycopg2 python-mysqldb

You will also need to install the etl package from GitHub.

install.packages("etl")

Installation

Similarly, imdb must be installed from GitHub.

devtools::install_github("beanumber/imdb")
library(imdb)

Instantiate an object

Since the IMDB is very large (many gigabytes), it is best to store the data in a persistent SQL database. By default, etl will create an RSQLite for you in a temp directory -- but this is not a very safe place to store these data. Instead, we will connect to an existing (but empty) MySQL database using a local option file.

# must have pre-existing database "imdb"
db <- src_mysql_cnf(dbname = "imdb")

Since you will be downloading lots of data, you will probably want to specify a directory to store the raw data (which will take up several gigabytes on disk). Again, etl will create a directory for you if you don't, but that directory will be in a temp directory that is not safe.

imdb <- etl("imdb", db = db, dir = "~/dumps/imdb/")

Performing the ETL steps

The first phase is to Extract the data from IMDB. This may take a while. There are 47 files that take up approximately 2 GB on disk. By default, only the movies, actors, actresses, and directors files will be downloaded, but even these take up more then 500 MB of disk space.

imdb %>%
  etl_extract()

Mercifully, there is no Transform phase for these data. However, the Load phase can take a loooooong time.

The load phase leverages the Python module IMDbPy, which also has external dependencies. Please see the .travis.yml file for a list of those dependencies (on Ubuntu -- your configuration may be different).

You may want to leave this running. To load the full set of files it took about 90 minutes and occupied about 9.5 gigabytes on disk.

imdb %>%
  etl_load()
# TIME FINAL : 88min, 20sec (wall) 25min, 46sec (user) 0min, 11sec (system)
summary(imdb)
## files:
##    n     size                          path
## 1 47 1.915 GB  /home/bbaumer/dumps/imdb/raw
## 2 19 7.037 GB /home/bbaumer/dumps/imdb/load

##       Length Class           Mode       
## con   1      MySQLConnection S4         
## info  8      -none-          list       
## disco 3      -none-          environment

Query the database

Once everything is completed, you can query your fresh copy of the IMDB to find all of the Star Wars movies:

movies <- imdb %>%
  tbl("title")
movies %>%
  filter(title == "Star Wars" & kind_id == 1)
## Source:   query [?? x 12]
## Database: mysql 5.7.17-0ubuntu0.16.04.1 [bbaumer@localhost:/imdb]
## 
##        id     title imdb_index kind_id production_year imdb_id
##     <int>     <chr>      <chr>   <int>           <int>   <int>
## 1 3850123 Star Wars       <NA>       1            1977      NA
## # ... with 6 more variables: phonetic_code <chr>, episode_of_id <int>,
## #   season_nr <int>, episode_nr <int>, series_years <chr>, md5sum <chr>

imdb's People

Contributors

beanumber avatar

Watchers

James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.