dedeler / imdb-data-parser
This project is forked from destan/imdb-data-parser.
Parses the IMDB dumps into TSV and Relational Database insert queries
License: GNU General Public License v3.0
We need to implement a trivia parser.
We need just two log files, info.log and error.log, whose paths can be configured from settings.py. Printing logs to the console costs time.
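A minimal sketch of how that setup could look with Python's standard logging module. The logger name, default paths, and format string here are assumptions (the real paths would be read from settings.py); the format mirrors the "INFO - ..." style seen in the tool's output.

```python
import logging

def configure_logging(info_path="info.log", error_path="error.log"):
    """Route INFO-and-above records to info.log and ERROR-and-above to error.log."""
    logger = logging.getLogger("idp")  # hypothetical logger name
    logger.setLevel(logging.INFO)

    info_handler = logging.FileHandler(info_path, encoding="utf-8")
    info_handler.setLevel(logging.INFO)

    error_handler = logging.FileHandler(error_path, encoding="utf-8")
    error_handler.setLevel(logging.ERROR)

    fmt = logging.Formatter("%(levelname)s - %(message)s")
    for handler in (info_handler, error_handler):
        handler.setFormatter(fmt)
        logger.addHandler(handler)
    return logger
```

With this, errors land in both files (info.log is a superset), and nothing is written to the console.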
There are many list files in the IMDB dumps, but we don't need most of them for now. We need to decide which ones are really required.
we need to implement this parser
We should know which lines are fucked in each parse, so we need to put them into a file.
movies-140120130745.tsv parse must create a file movies-140120130745.fucked
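A sketch of how a parser could divert unparseable lines into such a companion file. The regex here is a made-up placeholder (each real parser has its own list-specific pattern); only the split into a TSV output plus a reject file is the point.

```python
import re

# Hypothetical pattern standing in for a real list parser's regex.
MOVIE_LINE = re.compile(r'^(?P<title>.+?)\t+(?P<year>\d{4}|\?\?\?\?)$')

def parse_with_reject_file(input_lines, tsv_path, reject_path):
    """Write parseable lines to the TSV; dump everything else verbatim
    into a companion reject file for later inspection."""
    with open(tsv_path, "w", encoding="utf-8") as out, \
         open(reject_path, "w", encoding="utf-8") as rejects:
        for line in input_lines:
            m = MOVIE_LINE.match(line.rstrip("\n"))
            if m:
                out.write(f"{m.group('title')}\t{m.group('year')}\n")
            else:
                rejects.write(line if line.endswith("\n") else line + "\n")
```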
I'm getting this error; it looks path-related.
Traceback (most recent call last):
File "./imdbparser.py", line 85, in <module>
ParsingHelper.parse_all(preferencesMap)
File "/Users/sollarman/imdb-data-parser/idp/parser/parsinghelper.py", line 60, in parse_all
ParsingHelper.parse_one(item, preferencesMap)
File "/Users/sollarman/imdb-data-parser/idp/parser/parsinghelper.py", line 49, in parse_one
parser = ParserClass(preferencesMap)
File "/Users/sollarman/imdb-data-parser/idp/parser/moviesparser.py", line 56, in __init__
self.f = open("/home/xaph/imdb.sql", "w", encoding='utf-8')
FileNotFoundError: [Errno 2] No such file or directory: '/home/xaph/imdb.sql'
I'm trying to run this on windows 8.1 and I am having problems. Here is my settings.py file:
INPUT_DIR = "/d/apps/temp/inputs"
OUTPUT_DIR = "/d/apps/temp/outputs"
INTERFACES_SERVER = "ftp.fu-berlin.de"
INTERFACES_DIRECTORY = "pub/misc/movies/database/"
#alternative servers:
#ftp://ftp.funet.fi/pub/mirrors/ftp.imdb.com/pub/
#ftp://ftp.sunet.se/pub/tv+movies/imdb/
LISTS = [
"directors",
"genres",
"movies",
"plot",
"actors",
"actresses",
"aka-names",
"aka-titles",
"ratings"
]
The "/d/apps/temp/inputs" and "/d/apps/temp/outputs" folders exist on my machine. When I run imdbparser.py -u, I get the following output with errors:
INFO - mode:TSV
INFO - input_dir:/d/apps/temp/inputs
INFO - output_dir:/d/apps/temp/outputs\2014-06-29_135758_ImdbParserOutput
INFO - update_lists:True
INFO - Downloading IMDB dumps, this may take a while depending on your connection speed
INFO - Lists will downloaded from server:ftp.fu-berlin.de
INFO - Started to download list:directors
ERROR - There is a problem when downloading list directors
[Errno 2] No such file or directory: '/d/apps/temp/inputs\\directors.list.gz'
INFO - Started to download list:genres
ERROR - There is a problem when downloading list genres
[Errno 2] No such file or directory: '/d/apps/temp/inputs\\genres.list.gz'
INFO - Started to download list:movies
ERROR - There is a problem when downloading list movies
[Errno 2] No such file or directory: '/d/apps/temp/inputs\\movies.list.gz'
INFO - Started to download list:plot
ERROR - There is a problem when downloading list plot
[Errno 2] No such file or directory: '/d/apps/temp/inputs\\plot.list.gz'
INFO - Started to download list:actors
ERROR - There is a problem when downloading list actors
[Errno 2] No such file or directory: '/d/apps/temp/inputs\\actors.list.gz'
INFO - Started to download list:actresses
ERROR - There is a problem when downloading list actresses
[Errno 2] No such file or directory: '/d/apps/temp/inputs\\actresses.list.gz'
INFO - Started to download list:aka-names
ERROR - There is a problem when downloading list aka-names
[Errno 2] No such file or directory: '/d/apps/temp/inputs\\aka-names.list.gz'
INFO - Started to download list:aka-titles
ERROR - There is a problem when downloading list aka-titles
[Errno 2] No such file or directory: '/d/apps/temp/inputs\\aka-titles.list.gz'
INFO - Started to download list:ratings
ERROR - There is a problem when downloading list ratings
[Errno 2] No such file or directory: '/d/apps/temp/inputs\\ratings.list.gz'
INFO - 0 lists are downloaded
INFO - Parsing, please wait. This may take very long time...
INFO - ___________________
INFO - Parsing directors...
INFO - Trying to find file: /d/apps/temp/inputs\directors.list
ERROR - File cannot be found: /d/apps/temp/inputs\directors.list
INFO - Trying to find file: /d/apps/temp/inputs\directors.list.gz
ERROR - File cannot be found: /d/apps/temp/inputs\directors.list.gz
Traceback (most recent call last):
File "D:\Apps\imdb-data-parser\imdbparser.py", line 85, in <module>
ParsingHelper.parse_all(preferences_map)
File "D:\Apps\imdb-data-parser\idp\parser\parsinghelper.py", line 61, in parse_all
ParsingHelper.parse_one(item, preferences_map)
File "D:\Apps\imdb-data-parser\idp\parser\parsinghelper.py", line 50, in parse_one
parser = ParserClass(preferences_map)
File "D:\Apps\imdb-data-parser\idp\parser\directorsparser.py", line 59, in __init__
super(DirectorsParser, self).__init__(preferences_map)
File "D:\Apps\imdb-data-parser\idp\parser\baseparser.py", line 50, in __init__
self.input_file = self.filehandler.get_input_file()
File "D:\Apps\imdb-data-parser\idp\utils\filehandler.py", line 57, in get_input_file
raise RuntimeError("FileNotFoundError: %s", full_file_path)
RuntimeError: ('FileNotFoundError: %s', '/d/apps/temp/inputs\\directors.list')
BTW, I'm running this on Python 3.4.1.
it has 'fucked up lines'
Thanks for the script!
Maybe useful to someone else too: the SQL import gave me some problems (at least on a Windows MySQL instance), so I used the following to import the TSV files instead (I did have to convert the resulting TSV files to Unix line endings first):
CREATE TABLE actors(name VARCHAR(127), surname VARCHAR(127), title VARCHAR(255) NOT NULL, info_1 VARCHAR(127), info_2 VARCHAR(127), role VARCHAR(127), PRIMARY KEY(title)) CHARACTER SET utf8 COLLATE utf8_bin;
CREATE TABLE genres(title VARCHAR(255) NOT NULL, genre VARCHAR(127), PRIMARY KEY(title)) CHARACTER SET utf8 COLLATE utf8_bin;
CREATE TABLE plot(title VARCHAR(255) NOT NULL, plot VARCHAR(4000), PRIMARY KEY(title)) CHARACTER SET utf8 COLLATE utf8_bin;
CREATE TABLE directors(name VARCHAR(127), surname VARCHAR(127), title VARCHAR(255) NOT NULL, info VARCHAR(127), PRIMARY KEY(title)) CHARACTER SET utf8 COLLATE utf8_bin;
CREATE TABLE actresses(name VARCHAR(127), surname VARCHAR(127), title VARCHAR(255) NOT NULL, info_1 VARCHAR(127), info_2 VARCHAR(127), role VARCHAR(127), PRIMARY KEY(title)) CHARACTER SET utf8 COLLATE utf8_bin;
CREATE TABLE movies(title VARCHAR(255) NOT NULL, full_name VARCHAR(127), type VARCHAR(20), ep_name VARCHAR(127), ep_num VARCHAR(20), suspended VARCHAR(20), year VARCHAR(20), PRIMARY KEY(title)) CHARACTER SET utf8 COLLATE utf8_bin;
CREATE TABLE ratings(distribution VARCHAR(127) NOT NULL, votes VARCHAR(127), rank VARCHAR(127), title VARCHAR(255) NOT NULL, PRIMARY KEY(title)) CHARACTER SET utf8 COLLATE utf8_bin;
LOAD DATA LOCAL INFILE 'D:/actors.list.tsv' INTO TABLE actors FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INFILE 'D:/genres.list.tsv' INTO TABLE genres FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INFILE 'D:/plot.list.tsv' INTO TABLE plot FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INFILE 'D:/directors.list.tsv' INTO TABLE directors FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INFILE 'D:/actresses.list.tsv' INTO TABLE actresses FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INFILE 'D:/movies.list.tsv' INTO TABLE movies FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INFILE 'D:/ratings.list.tsv' INTO TABLE ratings FIELDS TERMINATED BY '\t';
During parsing, extra filtering should be possible if desired. The extra filtering shall be flexible and customizable.
The following paragraph from @xaph further describes this need:
To strip series and video games, I apply the following to the final movies TSV file.
grep -v '(""' imdb.sql | grep -v '(VG)' | grep -v "(TV)" | grep -v "(V)" | grep -v "{" | grep -v "????" > imdb_son.sql
This grep should not be necessary at all, since a custom filter should be enough for this job.
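A sketch of what such a configurable filter could look like in Python. The exclude markers are taken directly from @xaph's grep chain above; the function and variable names are my own invention, not part of the codebase.

```python
# Substring markers mirroring the grep -v chain: quoted series titles,
# video games, TV movies, direct-to-video titles, episodes, unknown years.
EXCLUDE_SUBSTRINGS = ['(""', '(VG)', '(TV)', '(V)', '{', '????']

def keep_line(line, excludes=EXCLUDE_SUBSTRINGS):
    """Return True when none of the exclude markers occur in the line."""
    return not any(marker in line for marker in excludes)

def filter_file(src_path, dst_path, excludes=EXCLUDE_SUBSTRINGS):
    """Copy src to dst, dropping excluded lines -- the whole grep pipeline in one pass."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(line for line in src if keep_line(line, excludes))
```

Because the marker list is plain data, users could override it from settings.py instead of post-processing with grep.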
We need to stamp the current date onto output files to revision them.
movies.tsv -> movies-140120130745.tsv
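A small sketch of the stamping. The stamp layout (day-month-year-hour-minute) is an assumption read off the example name 140120130745, i.e. 14 Jan 2013, 07:45.

```python
import os
from datetime import datetime

def stamp_filename(path, now=None):
    """movies.tsv -> movies-140120130745.tsv (assumed ddmmyyyyHHMM stamp)."""
    now = now or datetime.now()
    stamp = now.strftime("%d%m%Y%H%M")
    root, ext = os.path.splitext(path)
    return f"{root}-{stamp}{ext}"
```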
... to avoid the burden of creating a new instance in every loop
We shall generate SQL files instead of inserting SQL rows directly into the DB, as per the conversation on the devel list.
This issue covers: when we run the app with the -u parameter, it downloads all of the lists again. We need to compare the creation date and file size of our local files against the server's, and download a file only when the server has a newer version.
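One possible shape for that check, comparing file sizes via FTP's SIZE command (ftplib exposes it as FTP.size). This is a sketch, not the project's code; a stricter version would also compare the server's MDTM timestamp against the local mtime, and SIZE may require binary mode (ftp.voidcmd('TYPE I')) on some servers.

```python
import os

def needs_download(ftp, remote_name, local_path):
    """Return True when the list should be (re-)downloaded: the local copy
    is missing, the server refuses SIZE, or the sizes differ."""
    if not os.path.exists(local_path):
        return True
    try:
        remote_size = ftp.size(remote_name)  # ftplib.FTP.size wraps SIZE
    except Exception:
        return True  # can't tell; be safe and download
    return remote_size != os.path.getsize(local_path)
```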
Preferably in another package. It shall be invoked by a separate command-line argument such as --toDb. Parsing into TSV shall remain the default mode, and this (--toDb) shall be an alternative.
We need to add the GPL header to every code file.
I get this message after running the script in Python 3.10.6:
C:\Users\admin\Desktop\imdb-data-parser-master>imdbparser.py
INFO - mode:TSV
INFO - input_dir:C:/Users/admin/Desktop/imdb-data-parser-master/samples
INFO - output_dir:C:/Users/admin/Desktop/imdb-data-parser-master/output\2022-09-04_131707_ImdbParserOutput
INFO - update_lists:False
INFO - Parsing, please wait. This may take very long time...
INFO - ___________________
INFO - Parsing directors...
INFO - Trying to find file: C:/Users/admin/Desktop/imdb-data-parser-master/samples\directors.list
ERROR - File cannot be found: C:/Users/admin/Desktop/imdb-data-parser-master/samples\directors.list
INFO - Trying to find file: C:/Users/admin/Desktop/imdb-data-parser-master/samples\directors.list.gz
INFO - File found: C:/Users/admin/Desktop/imdb-data-parser-master/samples\directors.list.gz
Traceback (most recent call last):
File "C:\Users\admin\Desktop\imdb-data-parser-master\imdbparser.py", line 85, in <module>
ParsingHelper.parse_all(preferences_map)
File "C:\Users\admin\Desktop\imdb-data-parser-master\idp\parser\parsinghelper.py", line 61, in parse_all
ParsingHelper.parse_one(item, preferences_map)
File "C:\Users\admin\Desktop\imdb-data-parser-master\idp\parser\parsinghelper.py", line 50, in parse_one
parser = ParserClass(preferences_map)
File "C:\Users\admin\Desktop\imdb-data-parser-master\idp\parser\directorsparser.py", line 59, in __init__
super(DirectorsParser, self).__init__(preferences_map)
File "C:\Users\admin\Desktop\imdb-data-parser-master\idp\parser\baseparser.py", line 50, in __init__
self.input_file = self.filehandler.get_input_file()
File "C:\Users\admin\Desktop\imdb-data-parser-master\idp\utils\filehandler.py", line 50, in get_input_file
if extract(full_file_path + ".gz") == 0:
NameError: name 'extract' is not defined
C:\Users\admin\Desktop\imdb-data-parser-master>
Please advise.
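The missing name in filehandler.py is presumably a helper that decompresses the downloaded .gz list. The following is a guess at its intended behavior, not the original implementation (the original may have shelled out to a gzip binary, which stock Windows lacks); it returns 0 on success to match the `== 0` check in the traceback.

```python
import gzip
import shutil

def extract(gz_path):
    """Decompress foo.list.gz to foo.list next to it.
    Returns 0 on success, 1 on failure (shell-style exit codes,
    matching the `extract(...) == 0` call site in filehandler.py)."""
    try:
        with gzip.open(gz_path, "rb") as src, \
             open(gz_path[:-len(".gz")], "wb") as dst:
            shutil.copyfileobj(src, dst)
        return 0
    except OSError:
        return 1
```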
we need to implement this parser
we need to implement this parser
Hi, thank you for the awesome script!
I created MySQL dumps and I'm trying to import them into my local MySQL instance on an InnoDB database:
mysql -u root -p imdb < movies.list.sql
But the command is stuck there and won't proceed... If I try to import it with Sequel Pro it gets stuck on:
Am I doing something wrong?
edit: If I open the whole query in Sequel Pro and then run it, I get (after a while):
Unknown column 'title' in 'field list'
If I run the application with the -m SQL option, all the TSV files are also created. We don't need them.