
imdb-data-parser's People

Contributors

aykutakin, destan, mairbek, xaph, yasakbulut

imdb-data-parser's Issues

logs must be printed to files

we just need 2 log files as info.log and error.log

their paths can be configured from settings.py

printing logs to the console consumes time
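
A minimal sketch of how the two log files could be set up with Python's standard logging module. The function name `configure_logging`, the logger name `idp`, and the two path parameters are assumptions for illustration; in the project the paths would come from settings.py. Here info.log receives INFO and above, while error.log receives only ERROR and above.

```python
import logging


def configure_logging(info_path, error_path):
    """Send log records to two files instead of the console."""
    logger = logging.getLogger("idp")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep records away from any console handler on the root logger

    # Drop handlers from an earlier call so records are not written twice.
    for old_handler in list(logger.handlers):
        logger.removeHandler(old_handler)
        old_handler.close()

    formatter = logging.Formatter("%(levelname)s - %(message)s")

    info_handler = logging.FileHandler(info_path, encoding="utf-8")
    info_handler.setLevel(logging.INFO)  # INFO, WARNING, ERROR, CRITICAL
    info_handler.setFormatter(formatter)

    error_handler = logging.FileHandler(error_path, encoding="utf-8")
    error_handler.setLevel(logging.ERROR)  # ERROR and CRITICAL only
    error_handler.setFormatter(formatter)

    logger.addHandler(info_handler)
    logger.addHandler(error_handler)
    return logger
```

Calling `configure_logging("info.log", "error.log")` once at startup and then using the returned logger everywhere would keep the console quiet.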

determine which parsers are required

There are many files in the IMDb dump, but we don't need most of them for now. We need to decide which ones are really required.

  • actors.list.gz [DONE]
  • actresses.list.gz [DONE]
  • aka-names.list.gz
  • aka-titles.list.gz
  • alternate-versions.list.gz
  • biographies.list.gz
  • business.list.gz
  • certificates.list.gz
  • cinematographers.list.gz
  • color-info.list.gz
  • complete-cast.list.gz
  • complete-crew.list.gz
  • composers.list.gz
  • costume-designers.list.gz
  • countries.list.gz
  • crazy-credits.list.gz
  • directors.list.gz [DONE]
  • distributors.list.gz
  • editors.list.gz
  • filesizes
  • filesizes.old
  • genres.list.gz [DONE]
  • german-aka-titles.list.gz
  • goofs.list.gz
  • iso-aka-titles.list.gz
  • italian-aka-titles.list.gz
  • keywords.list.gz
  • language.list.gz
  • laserdisc.list.gz
  • literature.list.gz
  • locations.list.gz
  • miscellaneous-companies.list.gz
  • miscellaneous.list.gz
  • movie-links.list.gz
  • movies.list.gz [DONE]
  • mpaa-ratings-reasons.list.gz
  • plot.list.gz [DONE]
  • producers.list.gz
  • production-companies.list.gz
  • production-designers.list.gz
  • quotes.list.gz
  • ratings.list.gz [DONE]
  • release-dates.list.gz
  • running-times.list.gz
  • sound-mix.list.gz
  • soundtracks.list.gz
  • special-effects-companies.list.gz
  • taglines.list.gz
  • technical.list.gz
  • trivia.list.gz [DONE]
  • writers.list.gz

FileNotFoundError: [Errno 2] No such file or directory: '/home/xaph/imdb.sql'

Getting this error, path related.

Traceback (most recent call last):
  File "./imdbparser.py", line 85, in <module>
    ParsingHelper.parse_all(preferencesMap)
  File "/Users/sollarman/imdb-data-parser/idp/parser/parsinghelper.py", line 60, in parse_all
    ParsingHelper.parse_one(item, preferencesMap)
  File "/Users/sollarman/imdb-data-parser/idp/parser/parsinghelper.py", line 49, in parse_one
    parser = ParserClass(preferencesMap)
  File "/Users/sollarman/imdb-data-parser/idp/parser/moviesparser.py", line 56, in __init__
    self.f = open("/home/xaph/imdb.sql", "w", encoding='utf-8')
FileNotFoundError: [Errno 2] No such file or directory: '/home/xaph/imdb.sql'
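
The traceback shows moviesparser.py opening the hardcoded path /home/xaph/imdb.sql, which presumably exists only on another developer's machine. One possible fix, sketched under the assumption that the parser can be handed an output directory from its preferences (the name `open_output_file` and its parameters are hypothetical, not taken from the project):

```python
import os


def open_output_file(output_dir, filename="imdb.sql"):
    """Open `filename` for writing inside `output_dir`.

    The directory is created first, so the call does not fail with
    FileNotFoundError when the folder is missing.
    """
    os.makedirs(output_dir, exist_ok=True)
    return open(os.path.join(output_dir, filename), "w", encoding="utf-8")
```

With something like this, line 56 of moviesparser.py could pass the configured output directory instead of a fixed home folder.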

"No such file or directory" error on windows 8.1

I'm trying to run this on windows 8.1 and I am having problems. Here is my settings.py file:

    INPUT_DIR = "/d/apps/temp/inputs"
    OUTPUT_DIR = "/d/apps/temp/outputs"

    INTERFACES_SERVER = "ftp.fu-berlin.de"
    INTERFACES_DIRECTORY = "pub/misc/movies/database/"
    #alternative servers:
    #ftp://ftp.funet.fi/pub/mirrors/ftp.imdb.com/pub/
    #ftp://ftp.sunet.se/pub/tv+movies/imdb/

    LISTS = [
            "directors",
            "genres",
            "movies",
            "plot",
            "actors", 
            "actresses",
            "aka-names",
            "aka-titles",
            "ratings"
    ]

"/d/apps/temp/inputs" and "/d/apps/temp/outputs" folders exist in my machine. When I run the imdbparser.py -u, I'm getting the following output with errors:

INFO - mode:TSV
INFO - input_dir:/d/apps/temp/inputs
INFO - output_dir:/d/apps/temp/outputs\2014-06-29_135758_ImdbParserOutput
INFO - update_lists:True
INFO - Downloading IMDB dumps, this may take a while depending on your connection speed
INFO - Lists will downloaded from server:ftp.fu-berlin.de
INFO - Started to download list:directors
ERROR - There is a problem when downloading list directors
                [Errno 2] No such file or directory: '/d/apps/temp/inputs\\directors.list.gz'
INFO - Started to download list:genres
ERROR - There is a problem when downloading list genres
                [Errno 2] No such file or directory: '/d/apps/temp/inputs\\genres.list.gz'
INFO - Started to download list:movies
ERROR - There is a problem when downloading list movies
                [Errno 2] No such file or directory: '/d/apps/temp/inputs\\movies.list.gz'
INFO - Started to download list:plot
ERROR - There is a problem when downloading list plot
                [Errno 2] No such file or directory: '/d/apps/temp/inputs\\plot.list.gz'
INFO - Started to download list:actors
ERROR - There is a problem when downloading list actors
                [Errno 2] No such file or directory: '/d/apps/temp/inputs\\actors.list.gz'
INFO - Started to download list:actresses
ERROR - There is a problem when downloading list actresses
                [Errno 2] No such file or directory: '/d/apps/temp/inputs\\actresses.list.gz'
INFO - Started to download list:aka-names
ERROR - There is a problem when downloading list aka-names
                [Errno 2] No such file or directory: '/d/apps/temp/inputs\\aka-names.list.gz'
INFO - Started to download list:aka-titles
ERROR - There is a problem when downloading list aka-titles
                [Errno 2] No such file or directory: '/d/apps/temp/inputs\\aka-titles.list.gz'
INFO - Started to download list:ratings
ERROR - There is a problem when downloading list ratings
                [Errno 2] No such file or directory: '/d/apps/temp/inputs\\ratings.list.gz'
INFO - 0 lists are downloaded
INFO - Parsing, please wait. This may take very long time...
INFO - ___________________
INFO - Parsing directors...
INFO - Trying to find file: /d/apps/temp/inputs\directors.list
ERROR - File cannot be found: /d/apps/temp/inputs\directors.list
INFO - Trying to find file: /d/apps/temp/inputs\directors.list.gz
ERROR - File cannot be found: /d/apps/temp/inputs\directors.list.gz
Traceback (most recent call last):
    File "D:\Apps\imdb-data-parser\imdbparser.py", line 85, in <module>
        ParsingHelper.parse_all(preferences_map)
    File "D:\Apps\imdb-data-parser\idp\parser\parsinghelper.py", line 61, in parse_all
        ParsingHelper.parse_one(item, preferences_map)
    File "D:\Apps\imdb-data-parser\idp\parser\parsinghelper.py", line 50, in parse_one
        parser = ParserClass(preferences_map)
    File "D:\Apps\imdb-data-parser\idp\parser\directorsparser.py", line 59, in __init__
        super(DirectorsParser, self).__init__(preferences_map)
    File "D:\Apps\imdb-data-parser\idp\parser\baseparser.py", line 50, in __init__
        self.input_file = self.filehandler.get_input_file()
    File "D:\Apps\imdb-data-parser\idp\utils\filehandler.py", line 57, in get_input_file
        raise RuntimeError("FileNotFoundError: %s", full_file_path)
RuntimeError: ('FileNotFoundError: %s', '/d/apps/temp/inputs\\directors.list')

BTW, I'm running this on Python 3.4.1.
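
One possible explanation, offered as an assumption rather than a confirmed diagnosis: Python on Windows does not treat /d/... as drive D:, so "/d/apps/temp/inputs" points at a \d\apps\temp\inputs folder on the current drive. Writing the settings with a drive letter (for example "D:/apps/temp/inputs") avoids that. The hypothetical helper below sketches the conversion; it is not part of the project.

```python
import re
from pathlib import PureWindowsPath


def msys_to_windows(path):
    """Convert an MSYS-style path such as '/d/apps/temp' to 'D:/apps/temp'.

    Paths that do not start with a single-letter drive component are
    returned unchanged.
    """
    match = re.match(r"^/([a-zA-Z])(/.*)?$", path)
    if not match:
        return path
    drive = match.group(1).upper()
    rest = match.group(2) or "/"
    # PureWindowsPath only manipulates the string, so this runs on any OS.
    return PureWindowsPath(f"{drive}:{rest}").as_posix()
```

For the settings above, `msys_to_windows("/d/apps/temp/inputs")` gives `D:/apps/temp/inputs`, which could be used for INPUT_DIR directly.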

Import TSV into MySQL instead of SQL

Thanks for the script!

Maybe useful to someone else too: the SQL import gave me some problems (at least for a Windows MySQL instance), so I used the following to import the TSV files instead (I did have to convert the resulting TSV files to Unix line endings first):

CREATE TABLE actors(name VARCHAR(127), surname VARCHAR(127), title VARCHAR(255) NOT NULL, info_1 VARCHAR(127), info_2 VARCHAR(127), role VARCHAR(127), PRIMARY KEY(title)) CHARACTER SET utf8 COLLATE utf8_bin;
CREATE TABLE genres(title VARCHAR(255) NOT NULL, genre VARCHAR(127), PRIMARY KEY(title)) CHARACTER SET utf8 COLLATE utf8_bin;
CREATE TABLE plot(title VARCHAR(255) NOT NULL, plot VARCHAR(4000), PRIMARY KEY(title)) CHARACTER SET utf8 COLLATE utf8_bin;
CREATE TABLE directors(name VARCHAR(127), surname VARCHAR(127), title VARCHAR(255) NOT NULL, info VARCHAR(127), PRIMARY KEY(title)) CHARACTER SET utf8 COLLATE utf8_bin;
CREATE TABLE actresses(name VARCHAR(127), surname VARCHAR(127), title VARCHAR(255) NOT NULL, info_1 VARCHAR(127), info_2 VARCHAR(127), role VARCHAR(127), PRIMARY KEY(title)) CHARACTER SET utf8 COLLATE utf8_bin;
CREATE TABLE movies(title VARCHAR(255) NOT NULL, full_name VARCHAR(127), type VARCHAR(20), ep_name VARCHAR(127), ep_num VARCHAR(20), suspended VARCHAR(20), year VARCHAR(20), PRIMARY KEY(title)) CHARACTER SET utf8 COLLATE utf8_bin;
CREATE TABLE ratings(distribution VARCHAR(127) NOT NULL, votes VARCHAR(127), `rank` VARCHAR(127), title VARCHAR(255) NOT NULL, PRIMARY KEY(title)) CHARACTER SET utf8 COLLATE utf8_bin;

LOAD DATA LOCAL INFILE 'D:/actors.list.tsv' INTO TABLE actors FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INFILE 'D:/genres.list.tsv' INTO TABLE genres FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INFILE 'D:/plot.list.tsv' INTO TABLE plot FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INFILE 'D:/directors.list.tsv' INTO TABLE directors FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INFILE 'D:/actresses.list.tsv' INTO TABLE actresses FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INFILE 'D:/movies.list.tsv' INTO TABLE movies FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INFILE 'D:/ratings.list.tsv' INTO TABLE ratings FIELDS TERMINATED BY '\t';
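
The line-ending conversion mentioned above can be done with any dos2unix-style tool. As a small sketch in Python (the function name `to_unix_line_endings` is made up for illustration):

```python
def to_unix_line_endings(src_path, dst_path):
    """Copy src_path to dst_path, turning Windows CRLF line endings into LF."""
    # Binary mode keeps Python from translating newlines on its own.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:
            dst.write(line.replace(b"\r\n", b"\n"))
```

The file is processed line by line, so large TSV files do not need to fit in memory.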

Implement custom filter feature

In the process of parsing, extra filtering should be done if desired. Extra filtering shall be flexible and customizable.

The following paragraph from @xaph further describes this need:

To strip series and video games, I apply the following to the final movies TSV file.

grep -v '(""' imdb.sql | grep -v '(VG)' | grep -v "(TV)" | grep -v "(V)" | grep -v "{" | grep -v "????" > imdb_son.sql

This grep should not be necessary at all, since a custom filter should be enough for this job.
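
A rough sketch of what such a filter could look like, mirroring the grep chain above. The names `EXCLUDED_MARKERS`, `keep_line`, and `filter_lines` are assumptions, not existing project code; the markers are the same literal strings the grep commands exclude.

```python
# Literal substrings taken from the grep -v chain above.
EXCLUDED_MARKERS = ('(""', "(VG)", "(TV)", "(V)", "{", "????")


def keep_line(line, markers=EXCLUDED_MARKERS):
    """Return True when the line contains none of the excluded markers."""
    return not any(marker in line for marker in markers)


def filter_lines(lines, predicate=keep_line):
    """Keep only the lines accepted by the predicate.

    Passing a different predicate is what would make the filter customizable.
    """
    return [line for line in lines if predicate(line)]
```

A user-supplied predicate (for example, one that keeps only titles from a given year) could be passed in place of `keep_line` without touching the parsers.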

Helper class/function to generate SQL files

We shall generate SQL files instead of inserting rows directly into the DB, as per the conversation on the devel list.

This issue covers:

  • Implementation of a helper class/function to generate SQL files
  • abandoning and clearing DB configuration and related helpers
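
As an illustration of the first item, a helper along these lines could write INSERT statements to a file object. The names `sql_literal` and `write_inserts` are hypothetical, and the escaping is a simplified sketch that assumes MySQL's default string-literal rules; it is not a complete or authoritative escaper.

```python
def sql_literal(value):
    """Render a Python value as a SQL literal (simplified, MySQL-style)."""
    if value is None:
        return "NULL"
    # Double backslashes and single quotes so they survive inside '...'.
    escaped = str(value).replace("\\", "\\\\").replace("'", "''")
    return f"'{escaped}'"


def write_inserts(out, table, columns, rows):
    """Write one INSERT statement per row to `out`; return the row count."""
    column_list = ", ".join(columns)
    count = 0
    for row in rows:
        values = ", ".join(sql_literal(v) for v in row)
        out.write(f"INSERT INTO {table} ({column_list}) VALUES ({values});\n")
        count += 1
    return count
```

Because the helper only needs an object with a `write` method, the parsers would not need any DB configuration at all.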

implement a function that makes DB-inserts

Preferably in another package. It shall be invoked by a separate command-line argument like --toDb. Also, parsing into CSV shall be the default mode, and this (--toDb) shall be another.
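
A small sketch of how the flag could be wired up with argparse. Only --toDb comes from the description above; the function names and the "CSV"/"DB" mode strings are assumptions for illustration.

```python
import argparse


def build_arg_parser():
    """Build a parser with an optional --toDb switch."""
    parser = argparse.ArgumentParser(description="Parse IMDb list files.")
    parser.add_argument(
        "--toDb",
        dest="to_db",
        action="store_true",
        help="insert parsed rows into the DB instead of writing CSV",
    )
    return parser


def select_mode(argv):
    """Return 'DB' when --toDb is given, otherwise the default 'CSV'."""
    args = build_arg_parser().parse_args(argv)
    return "DB" if args.to_db else "CSV"
```

The DB-insert code itself could then live in its own package and be imported only when the mode is "DB".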

NameError: name 'extract' is not defined

I get this message after running the script in Python 3.10.6:

C:\Users\admin\Desktop\imdb-data-parser-master>imdbparser.py
INFO - mode:TSV
INFO - input_dir:C:/Users/admin/Desktop/imdb-data-parser-master/samples
INFO - output_dir:C:/Users/admin/Desktop/imdb-data-parser-master/output\2022-09-04_131707_ImdbParserOutput
INFO - update_lists:False
INFO - Parsing, please wait. This may take very long time...
INFO - ___________________
INFO - Parsing directors...
INFO - Trying to find file: C:/Users/admin/Desktop/imdb-data-parser-master/samples\directors.list
ERROR - File cannot be found: C:/Users/admin/Desktop/imdb-data-parser-master/samples\directors.list
INFO - Trying to find file: C:/Users/admin/Desktop/imdb-data-parser-master/samples\directors.list.gz
INFO - File found: C:/Users/admin/Desktop/imdb-data-parser-master/samples\directors.list.gz
Traceback (most recent call last):
  File "C:\Users\admin\Desktop\imdb-data-parser-master\imdbparser.py", line 85, in <module>
    ParsingHelper.parse_all(preferences_map)
  File "C:\Users\admin\Desktop\imdb-data-parser-master\idp\parser\parsinghelper.py", line 61, in parse_all
    ParsingHelper.parse_one(item, preferences_map)
  File "C:\Users\admin\Desktop\imdb-data-parser-master\idp\parser\parsinghelper.py", line 50, in parse_one
    parser = ParserClass(preferences_map)
  File "C:\Users\admin\Desktop\imdb-data-parser-master\idp\parser\directorsparser.py", line 59, in __init__
    super(DirectorsParser, self).__init__(preferences_map)
  File "C:\Users\admin\Desktop\imdb-data-parser-master\idp\parser\baseparser.py", line 50, in __init__
    self.input_file = self.filehandler.get_input_file()
  File "C:\Users\admin\Desktop\imdb-data-parser-master\idp\utils\filehandler.py", line 50, in get_input_file
    if extract(full_file_path + ".gz") == 0:
NameError: name 'extract' is not defined

C:\Users\admin\Desktop\imdb-data-parser-master>

Please advise.
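
The traceback shows filehandler.py calling an `extract` function that is not defined or imported in that module; where the project intended it to come from is not shown here. As a stopgap sketch, assuming `extract` is meant to decompress the .gz file next to itself and return 0 on success (which is what the `== 0` check suggests), a stand-in could look like this:

```python
import gzip
import shutil


def extract(gz_path):
    """Decompress `gz_path` to the same path without '.gz'.

    Returns 0 on success and 1 on failure. This return convention is an
    assumption based on the `== 0` comparison in the traceback.
    """
    if not gz_path.endswith(".gz"):
        return 1
    target_path = gz_path[: -len(".gz")]
    try:
        with gzip.open(gz_path, "rb") as src, open(target_path, "wb") as dst:
            shutil.copyfileobj(src, dst)  # streams in chunks, fine for large lists
    except OSError:
        return 1
    return 0
```

Defining or importing a function like this in filehandler.py should at least get past the NameError; whether it matches the original intent is not confirmed.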

MySQL dumps won't import

Hi, thank you for the awesome script!

I created MySQL dumps and I'm trying to import them into my local MySQL instance on an InnoDB database:

 mysql -u root -p imdb < movies.list.sql

But the command is stuck there and won't proceed... If I try to import it with Sequel Pro it gets stuck on:

image

Am I doing something wrong?


edit: If I open the whole query in Sequel Pro and then I run it, I get (after a while):

Unknown column title in fields list
