dedeler / imdb-data-parser
This project is forked from destan/imdb-data-parser.
Parses the IMDB dumps into TSV and Relational Database insert queries
License: GNU General Public License v3.0
We need to implement a trivia parser.
We need just two log files, info.log and error.log, whose paths can be configured from settings.py. Printing logs to the console costs time.
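A minimal sketch of how that setup could look with Python's standard logging module. The logger name, default paths, and format string here are assumptions (the real paths would be read from settings.py); the format mirrors the "INFO - ..." style seen in the tool's output.

```python
import logging

def configure_logging(info_path="info.log", error_path="error.log"):
    """Route INFO-and-above records to info.log and ERROR-and-above to error.log."""
    logger = logging.getLogger("idp")  # hypothetical logger name
    logger.setLevel(logging.INFO)

    info_handler = logging.FileHandler(info_path, encoding="utf-8")
    info_handler.setLevel(logging.INFO)

    error_handler = logging.FileHandler(error_path, encoding="utf-8")
    error_handler.setLevel(logging.ERROR)

    fmt = logging.Formatter("%(levelname)s - %(message)s")
    for handler in (info_handler, error_handler):
        handler.setFormatter(fmt)
        logger.addHandler(handler)
    return logger
```

With this, errors land in both files (info.log is a superset), and nothing is written to the console.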
There are many list files in the IMDB dumps, but we don't need most of them for now. We need to decide which ones are really required.
we need to implement this parser
We should know which lines are fucked in each parse, so we need to put them into a file.
movies-140120130745.tsv parse must create a file movies-140120130745.fucked
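A sketch of how a parser could divert unparseable lines into such a companion file. The regex here is a made-up placeholder (each real parser has its own list-specific pattern); only the split into a TSV output plus a reject file is the point.

```python
import re

# Hypothetical pattern standing in for a real list parser's regex.
MOVIE_LINE = re.compile(r'^(?P<title>.+?)\t+(?P<year>\d{4}|\?\?\?\?)$')

def parse_with_reject_file(input_lines, tsv_path, reject_path):
    """Write parseable lines to the TSV; dump everything else verbatim
    into a companion reject file for later inspection."""
    with open(tsv_path, "w", encoding="utf-8") as out, \
         open(reject_path, "w", encoding="utf-8") as rejects:
        for line in input_lines:
            m = MOVIE_LINE.match(line.rstrip("\n"))
            if m:
                out.write(f"{m.group('title')}\t{m.group('year')}\n")
            else:
                rejects.write(line if line.endswith("\n") else line + "\n")
```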
I'm getting this error; it looks path-related.
Traceback (most recent call last):
File "./imdbparser.py", line 85, in <module>
ParsingHelper.parse_all(preferencesMap)
File "/Users/sollarman/imdb-data-parser/idp/parser/parsinghelper.py", line 60, in parse_all
ParsingHelper.parse_one(item, preferencesMap)
File "/Users/sollarman/imdb-data-parser/idp/parser/parsinghelper.py", line 49, in parse_one
parser = ParserClass(preferencesMap)
File "/Users/sollarman/imdb-data-parser/idp/parser/moviesparser.py", line 56, in __init__
self.f = open("/home/xaph/imdb.sql", "w", encoding='utf-8')
FileNotFoundError: [Errno 2] No such file or directory: '/home/xaph/imdb.sql'
I'm trying to run this on windows 8.1 and I am having problems. Here is my settings.py file:
INPUT_DIR = "/d/apps/temp/inputs"
OUTPUT_DIR = "/d/apps/temp/outputs"
INTERFACES_SERVER = "ftp.fu-berlin.de"
INTERFACES_DIRECTORY = "pub/misc/movies/database/"
#alternative servers:
#ftp://ftp.funet.fi/pub/mirrors/ftp.imdb.com/pub/
#ftp://ftp.sunet.se/pub/tv+movies/imdb/
LISTS = [
"directors",
"genres",
"movies",
"plot",
"actors",
"actresses",
"aka-names",
"aka-titles",
"ratings"
]
The "/d/apps/temp/inputs" and "/d/apps/temp/outputs" folders exist on my machine. When I run imdbparser.py -u, I get the following output with errors:
INFO - mode:TSV
INFO - input_dir:/d/apps/temp/inputs
INFO - output_dir:/d/apps/temp/outputs\2014-06-29_135758_ImdbParserOutput
INFO - update_lists:True
INFO - Downloading IMDB dumps, this may take a while depending on your connection speed
INFO - Lists will downloaded from server:ftp.fu-berlin.de
INFO - Started to download list:directors
ERROR - There is a problem when downloading list directors
[Errno 2] No such file or directory: '/d/apps/temp/inputs\\directors.list.gz'
INFO - Started to download list:genres
ERROR - There is a problem when downloading list genres
[Errno 2] No such file or directory: '/d/apps/temp/inputs\\genres.list.gz'
INFO - Started to download list:movies
ERROR - There is a problem when downloading list movies
[Errno 2] No such file or directory: '/d/apps/temp/inputs\\movies.list.gz'
INFO - Started to download list:plot
ERROR - There is a problem when downloading list plot
[Errno 2] No such file or directory: '/d/apps/temp/inputs\\plot.list.gz'
INFO - Started to download list:actors
ERROR - There is a problem when downloading list actors
[Errno 2] No such file or directory: '/d/apps/temp/inputs\\actors.list.gz'
INFO - Started to download list:actresses
ERROR - There is a problem when downloading list actresses
[Errno 2] No such file or directory: '/d/apps/temp/inputs\\actresses.list.gz'
INFO - Started to download list:aka-names
ERROR - There is a problem when downloading list aka-names
[Errno 2] No such file or directory: '/d/apps/temp/inputs\\aka-names.list.gz'
INFO - Started to download list:aka-titles
ERROR - There is a problem when downloading list aka-titles
[Errno 2] No such file or directory: '/d/apps/temp/inputs\\aka-titles.list.gz'
INFO - Started to download list:ratings
ERROR - There is a problem when downloading list ratings
[Errno 2] No such file or directory: '/d/apps/temp/inputs\\ratings.list.gz'
INFO - 0 lists are downloaded
INFO - Parsing, please wait. This may take very long time...
INFO - ___________________
INFO - Parsing directors...
INFO - Trying to find file: /d/apps/temp/inputs\directors.list
ERROR - File cannot be found: /d/apps/temp/inputs\directors.list
INFO - Trying to find file: /d/apps/temp/inputs\directors.list.gz
ERROR - File cannot be found: /d/apps/temp/inputs\directors.list.gz
Traceback (most recent call last):
File "D:\Apps\imdb-data-parser\imdbparser.py", line 85, in <module>
ParsingHelper.parse_all(preferences_map)
File "D:\Apps\imdb-data-parser\idp\parser\parsinghelper.py", line 61, in parse_all
ParsingHelper.parse_one(item, preferences_map)
File "D:\Apps\imdb-data-parser\idp\parser\parsinghelper.py", line 50, in parse_one
parser = ParserClass(preferences_map)
File "D:\Apps\imdb-data-parser\idp\parser\directorsparser.py", line 59, in __init__
super(DirectorsParser, self).__init__(preferences_map)
File "D:\Apps\imdb-data-parser\idp\parser\baseparser.py", line 50, in __init__
self.input_file = self.filehandler.get_input_file()
File "D:\Apps\imdb-data-parser\idp\utils\filehandler.py", line 57, in get_input_file
raise RuntimeError("FileNotFoundError: %s", full_file_path)
RuntimeError: ('FileNotFoundError: %s', '/d/apps/temp/inputs\\directors.list')
BTW, I'm running this on Python 3.4.1.
it has 'fucked up lines'
Thanks for the script!
Maybe useful to someone else too: the SQL import gave me some problems (at least on a Windows MySQL instance), so I used the following to import the TSV files instead (I did have to convert the resulting TSV files to Unix line endings first):
CREATE TABLE actors(name VARCHAR(127), surname VARCHAR(127), title VARCHAR(255) NOT NULL, info_1 VARCHAR(127), info_2 VARCHAR(127), role VARCHAR(127), PRIMARY KEY(title)) CHARACTER SET utf8 COLLATE utf8_bin;
CREATE TABLE genres(title VARCHAR(255) NOT NULL, genre VARCHAR(127), PRIMARY KEY(title)) CHARACTER SET utf8 COLLATE utf8_bin;
CREATE TABLE plot(title VARCHAR(255) NOT NULL, plot VARCHAR(4000), PRIMARY KEY(title)) CHARACTER SET utf8 COLLATE utf8_bin;
CREATE TABLE directors(name VARCHAR(127), surname VARCHAR(127), title VARCHAR(255) NOT NULL, info VARCHAR(127), PRIMARY KEY(title)) CHARACTER SET utf8 COLLATE utf8_bin;
CREATE TABLE actresses(name VARCHAR(127), surname VARCHAR(127), title VARCHAR(255) NOT NULL, info_1 VARCHAR(127), info_2 VARCHAR(127), role VARCHAR(127), PRIMARY KEY(title)) CHARACTER SET utf8 COLLATE utf8_bin;
CREATE TABLE movies(title VARCHAR(255) NOT NULL, full_name VARCHAR(127), type VARCHAR(20), ep_name VARCHAR(127), ep_num VARCHAR(20), suspended VARCHAR(20), year VARCHAR(20), PRIMARY KEY(title)) CHARACTER SET utf8 COLLATE utf8_bin;
CREATE TABLE ratings(distribution VARCHAR(127) NOT NULL, votes VARCHAR(127), rank VARCHAR(127), title VARCHAR(255) NOT NULL, PRIMARY KEY(title)) CHARACTER SET utf8 COLLATE utf8_bin;
LOAD DATA LOCAL INFILE 'D:/actors.list.tsv' INTO TABLE actors FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INFILE 'D:/genres.list.tsv' INTO TABLE genres FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INFILE 'D:/plot.list.tsv' INTO TABLE plot FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INFILE 'D:/directors.list.tsv' INTO TABLE directors FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INFILE 'D:/actresses.list.tsv' INTO TABLE actresses FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INFILE 'D:/movies.list.tsv' INTO TABLE movies FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INFILE 'D:/ratings.list.tsv' INTO TABLE ratings FIELDS TERMINATED BY '\t';
During parsing, extra filtering should be possible if desired. The extra filtering shall be flexible and customizable.
The following paragraph from @xaph further describes this need:
To strip series and video games, I apply the following to the final movies TSV file.
grep -v '(""' imdb.sql | grep -v '(VG)' | grep -v "(TV)" | grep -v "(V)" | grep -v "{" | grep -v "????" > imdb_son.sql
This grep should not be necessary at all, since a custom filter should be enough for this job.
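A sketch of what such a configurable filter could look like in Python. The exclude markers are taken directly from @xaph's grep chain above; the function and variable names are my own invention, not part of the codebase.

```python
# Substring markers mirroring the grep -v chain: quoted series titles,
# video games, TV movies, direct-to-video titles, episodes, unknown years.
EXCLUDE_SUBSTRINGS = ['(""', '(VG)', '(TV)', '(V)', '{', '????']

def keep_line(line, excludes=EXCLUDE_SUBSTRINGS):
    """Return True when none of the exclude markers occur in the line."""
    return not any(marker in line for marker in excludes)

def filter_file(src_path, dst_path, excludes=EXCLUDE_SUBSTRINGS):
    """Copy src to dst, dropping excluded lines -- the whole grep pipeline in one pass."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(line for line in src if keep_line(line, excludes))
```

Because the marker list is plain data, users could override it from settings.py instead of post-processing with grep.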
We need to stamp the current date onto output files to revision them.
movies.tsv -> movies-140120130745.tsv
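A small sketch of the stamping. The stamp layout (day-month-year-hour-minute) is an assumption read off the example name 140120130745, i.e. 14 Jan 2013, 07:45.

```python
import os
from datetime import datetime

def stamp_filename(path, now=None):
    """movies.tsv -> movies-140120130745.tsv (assumed ddmmyyyyHHMM stamp)."""
    now = now or datetime.now()
    stamp = now.strftime("%d%m%Y%H%M")
    root, ext = os.path.splitext(path)
    return f"{root}-{stamp}{ext}"
```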
... to avoid the burden of creating a new instance in every loop
We shall generate SQL files instead of inserting SQL rows directly into the DB, as per the conversation on the devel list.
This issue covers: when we run the app with the -u parameter, it downloads all of the lists again. We need to compare the creation date and file size of our local files against the server's, and download a file only when the server has a newer version.
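One possible shape for that check, comparing file sizes via FTP's SIZE command (ftplib exposes it as FTP.size). This is a sketch, not the project's code; a stricter version would also compare the server's MDTM timestamp against the local mtime, and SIZE may require binary mode (ftp.voidcmd('TYPE I')) on some servers.

```python
import os

def needs_download(ftp, remote_name, local_path):
    """Return True when the list should be (re-)downloaded: the local copy
    is missing, the server refuses SIZE, or the sizes differ."""
    if not os.path.exists(local_path):
        return True
    try:
        remote_size = ftp.size(remote_name)  # ftplib.FTP.size wraps SIZE
    except Exception:
        return True  # can't tell; be safe and download
    return remote_size != os.path.getsize(local_path)
```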
Preferably in another package. It shall be invoked by a separate command-line argument such as --toDb. Parsing into TSV shall remain the default mode, and this (--toDb) shall be an alternative.
We need to add the GPL header to every code file.
I get this message after running the script in Python 3.10.6:
C:\Users\admin\Desktop\imdb-data-parser-master>imdbparser.py
INFO - mode:TSV
INFO - input_dir:C:/Users/admin/Desktop/imdb-data-parser-master/samples
INFO - output_dir:C:/Users/admin/Desktop/imdb-data-parser-master/output\2022-09-04_131707_ImdbParserOutput
INFO - update_lists:False
INFO - Parsing, please wait. This may take very long time...
INFO - ___________________
INFO - Parsing directors...
INFO - Trying to find file: C:/Users/admin/Desktop/imdb-data-parser-master/samples\directors.list
ERROR - File cannot be found: C:/Users/admin/Desktop/imdb-data-parser-master/samples\directors.list
INFO - Trying to find file: C:/Users/admin/Desktop/imdb-data-parser-master/samples\directors.list.gz
INFO - File found: C:/Users/admin/Desktop/imdb-data-parser-master/samples\directors.list.gz
Traceback (most recent call last):
File "C:\Users\admin\Desktop\imdb-data-parser-master\imdbparser.py", line 85, in <module>
ParsingHelper.parse_all(preferences_map)
File "C:\Users\admin\Desktop\imdb-data-parser-master\idp\parser\parsinghelper.py", line 61, in parse_all
ParsingHelper.parse_one(item, preferences_map)
File "C:\Users\admin\Desktop\imdb-data-parser-master\idp\parser\parsinghelper.py", line 50, in parse_one
parser = ParserClass(preferences_map)
File "C:\Users\admin\Desktop\imdb-data-parser-master\idp\parser\directorsparser.py", line 59, in __init__
super(DirectorsParser, self).__init__(preferences_map)
File "C:\Users\admin\Desktop\imdb-data-parser-master\idp\parser\baseparser.py", line 50, in __init__
self.input_file = self.filehandler.get_input_file()
File "C:\Users\admin\Desktop\imdb-data-parser-master\idp\utils\filehandler.py", line 50, in get_input_file
if extract(full_file_path + ".gz") == 0:
NameError: name 'extract' is not defined
C:\Users\admin\Desktop\imdb-data-parser-master>
Please advise.
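The missing name in filehandler.py is presumably a helper that decompresses the downloaded .gz list. The following is a guess at its intended behavior, not the original implementation (the original may have shelled out to a gzip binary, which stock Windows lacks); it returns 0 on success to match the `== 0` check in the traceback.

```python
import gzip
import shutil

def extract(gz_path):
    """Decompress foo.list.gz to foo.list next to it.
    Returns 0 on success, 1 on failure (shell-style exit codes,
    matching the `extract(...) == 0` call site in filehandler.py)."""
    try:
        with gzip.open(gz_path, "rb") as src, \
             open(gz_path[:-len(".gz")], "wb") as dst:
            shutil.copyfileobj(src, dst)
        return 0
    except OSError:
        return 1
```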
we need to implement this parser
we need to implement this parser
Hi, thank you for the awesome script!
I created MySQL dumps and I'm trying to import them into my local MySQL instance on an InnoDB database:
mysql -u root -p imdb < movies.list.sql
But the command is stuck there and won't proceed... If I try to import it with Sequel Pro it gets stuck on:
Am I doing something wrong?
edit: If I open the whole query in Sequel Pro and then run it, I get (after a while):
Unknown column 'title' in 'field list'
If I run the application with the -m SQL option, all the TSV files are also created. We don't need them.