discogs-xml2db's Introduction

discogs-xml2db v2

discogs-xml2db is a python program for importing discogs data dumps into several databases.

Version 2 is a rewrite of the original discogs-xml2db (referred to here as the classic version).
It is based on a branch by RedApple and is several times faster.

It currently supports MySQL and PostgreSQL as target databases. Instructions for importing into MongoDB are also included, though these are untested.
Let us know how it goes!

Experimental version

In parallel to the original Python codebase, we're working on a parser/exporter that's even faster. This is a complete rewrite in C# and initial results are highly promising:

File                               Record Count    Python      C#
discogs_20200806_artists.xml.gz       7,046,615      6:22    2:35
discogs_20200806_labels.xml.gz        1,571,873      1:15    0:22
discogs_20200806_masters.xml.gz       1,734,371      3:56    1:57
discogs_20200806_releases.xml.gz     12,867,980   1:45:16   42:38

If you're interested in testing this version, read more about it in the .NET Parser README or grab the appropriate binaries from the Releases page.

While this version does not yet have complete feature parity with the Python version, the core export-to-csv functionality is there and it will likely replace the Python version eventually.

Running discogs-xml2db

Requirements

discogs-xml2db requires python3 (minimum 3.6) and some python modules.
Additionally, the bash shell is used for automating some tasks.

Importing to some databases may require additional dependencies, see the documentation for your target database below.

It's best to create a Python virtual environment so the required modules are installed in an isolated location, without needing elevated permissions:

# Create a virtual environment
$ python3 -m venv .discogsenv

# Activate virtual environment
# On Linux/macOS:
$ source .discogsenv/bin/activate
# On Windows, in PowerShell:
$ .discogsenv\Scripts\Activate.ps1

# Install requirements:
(.discogsenv) $ pip3 install -r requirements.txt

Installation instructions for other platforms can be found in the pip documentation.

Downloading discogs dumps

Download the latest dump files manually from discogs or run get_latest_dumps.sh.

To check the files' integrity, download the appropriate checksum file from https://data.discogs.com/, place it in the same directory as the dumps, and compare the checksums.

# run in folder where the data dump files have been downloaded
$ sha256sum -c discogs_*_CHECKSUM.txt
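If sha256sum isn't available on your platform, the same check can be done with Python's standard library; a minimal sketch (file paths are illustrative):

```python
# Compute a file's SHA-256 with only the standard library; compare the
# result against the value listed in the downloaded CHECKSUM file.
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # read in chunks so multi-gigabyte dumps don't need to fit in memory
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

For example, sha256_of("discogs_20200806_artists.xml.gz") should match the corresponding line in discogs_*_CHECKSUM.txt.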

Converting dumps to CSV

Run run.py to convert the dump files to csv.

There are two run modes:

  1. You can point it to a directory containing the discogs dump files and use one or more --export options to indicate which files to process:

# ensure the virtual environment is active
(.discogsenv) $ python3 run.py \
  --bz2 \
  --apicounts \
  --export artist --export label --export master --export release \
  --output csv-dir \
  dump-dir

Here --bz2 compresses the resulting csv files, --apicounts provides more accurate progress counts, --output names the folder where the csv files are written, and dump-dir is the folder containing the data dumps.

  2. Alternatively, you can specify the individual files:

# ensure the virtual environment is active
(.discogsenv) $ python3 run.py \
  --bz2 \
  --apicounts \
  --output csv-dir \
  path/to/discogs_20200806_artists.xml.gz path/to/discogs_20200806_labels.xml.gz

run.py takes the following arguments:

  • --export: the type of dump file to export: "artist", "label", "master", or "release".
    This matches the names of the dump files, e.g. "discogs_20200806_artists.xml.gz". Not needed if the individual files are specified.
  • --bz2: compresses the output csv files using the bz2 compression library.
  • --limit=<lines>: limits the export to the given number of records.
  • --apicounts: makes the progress report more accurate by getting total counts from the Discogs API.
  • --output: the folder in which to store the csv files; defaults to the current directory.

The exporter provides progress information in real time:

Processing      labels:  99%|█████████████████████████████████████████▊| 1523623/1531339 [01:41<00:00, 14979.04labels/s]
Processing     artists: 100%|████████████████████████████████████████▊| 6861991/6894139 [09:02<00:02, 12652.23artists/s]
Processing    releases:  78%|█████████████████████████████▌        | 9757740/12560177 [2:02:15<36:29, 1279.82releases/s]

The totals and percentages might be slightly off, as the exact number of records is not known while reading the file.
Specifying --apicounts will provide more accurate predictions by getting the latest amounts from the Discogs API.

Importing

If pv is available it will be used to display progress during import.
To install it run $ sudo apt-get install pv on Ubuntu and Debian or check the installation instructions for other platforms.

Example output if using pv:

$ mysql/importcsv.sh 2020-05-01/csv/*
artist_alias.csv.bz2: 12,5MiB 0:00:03 [3,75MiB/s] [===================================>] 100%
artist.csv.bz2:  121MiB 0:00:29 [4,09MiB/s] [=========================================>] 100%
artist_image.csv.bz2:  7,3MiB 0:00:01 [3,72MiB/s] [===================================>] 100%
artist_namevariation.csv.bz2: 2,84MiB 0:00:01 [2,76MiB/s] [==>                         ] 12% ETA 0:00:07

Importing into PostgreSQL

# install PostgreSQL libraries (might be required for next step)
$ sudo apt-get install libpq-dev

# install the PostgreSQL package for python
# ensure the virtual environment has been activated
(.discogsenv) $ pip3 install -r postgresql/requirements.txt

# Configure PostgreSQL username, password, database, ...
$ nano postgresql/postgresql.conf

# Create database tables
(.discogsenv) $ python3 postgresql/psql.py < postgresql/sql/CreateTables.sql

# Import CSV files
(.discogsenv) $ python3 postgresql/importcsv.py /csvdir/*

# Configure primary keys and constraints, build indexes
(.discogsenv) $ python3 postgresql/psql.py < postgresql/sql/CreatePrimaryKeys.sql
(.discogsenv) $ python3 postgresql/psql.py < postgresql/sql/CreateFKConstraints.sql
(.discogsenv) $ python3 postgresql/psql.py < postgresql/sql/CreateIndexes.sql

Importing into MySQL

# Configure MySQL username, password, database, ...
$ nano mysql/mysql.conf

# Create database tables
$ mysql/exec_sql.sh < mysql/CreateTables.sql

# Import CSV files
$ mysql/importcsv.sh /csvdir/*

# Configure primary keys and build indexes
$ mysql/exec_sql.sh < mysql/AssignPrimaryKeys.sql

Importing into MongoDB

The CSV files can be imported into MongoDB using mongoimport.

mongoimport --db=discogs --collection=releases --type=csv --headerline --file=release.csv

Importing into CouchDB

CouchDB only supports importing JSON files.
couchimport can be used to convert the CSV files to JSON and import them into CouchDB, as explained in this tutorial.

Comparison to classic discogs-xml2db

The speedup version is many times faster than classic because it uses a different approach:

  1. The discogs xml dumps are first converted into one csv file per database table.
  2. These csv files are then imported into the different target databases (bulk load).
    This differs from classic discogs-xml2db, which loads records into the database one by one while parsing the xml file, waiting on the database after every row.
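The difference can be sketched with a toy example. This uses the stdlib sqlite3 module as a stand-in for the real target database; the table and data are illustrative, not the actual discogs-xml2db schema:

```python
# Contrast the classic row-by-row pattern with staged bulk loading.
import csv, io, sqlite3

csv_data = "id,name\n1,Blue Note\n2,Stax\n3,Motown\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE label (id INTEGER, name TEXT)")

# classic: one INSERT (and one round-trip) per parsed record
for row in csv.DictReader(io.StringIO(csv_data)):
    conn.execute("INSERT INTO label VALUES (?, ?)", (row["id"], row["name"]))
    conn.commit()  # waits on the database after every row

# speedup: stage everything as CSV first, then load in one bulk operation
conn.execute("DELETE FROM label")
rows = [(r["id"], r["name"]) for r in csv.DictReader(io.StringIO(csv_data))]
conn.executemany("INSERT INTO label VALUES (?, ?)", rows)  # COPY-style bulk load
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM label").fetchone()[0])  # 3
```

With PostgreSQL the bulk step maps to COPY, and with MySQL to LOAD DATA INFILE, which is why the CSV intermediate pays off.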

The speedup version also requires less disk space than classic, as it can work on the dump files while they are still compressed. While the uncompressed dumps for May 2020 take up 57GB of space, the compressed dumps are only 8.8GB. The dumps can be deleted after converting them to compressed CSV files (6.1GB).

As many databases can import CSV files out of the box, it should be easy to add support for more databases to discogs-xml2db speedup in the future.

Database schema changes

The database schema was changed in v2.0 to be more consistent and normalize some more data. The following things changed compared to classic discogs-xml2db:

  • renamed table: releases_labels => release_label
  • renamed table: releases_formats => release_format
  • renamed table: releases_artists => release_artist
  • renamed table: tracks_artists => release_track_artist
  • renamed table: track => release_track
  • renamed column: release_artists.join_relation => release_artist.join_string
  • renamed column: release_track_artist.join_relation => release_track_artist.join_string
  • renamed column: release_format.format_name => release_format.name
  • renamed column: label.contactinfo => label.contact_info
  • renamed column: label.parent_label => label.parent_name
  • added: label has new parent_id field
  • added: release_label has extra fields
  • moved: aliases now in artist_alias table
  • moved: tracks_extra_artists now in track_artist table with extra flag
  • moved: releases_extra_artists now in release_track_artist table with extra flag
  • moved: release.genres now in own release_genre table
  • moved: release.styles now in own release_style table
  • moved: release.barcode now in release_identifier table
  • moved: artist.anv fields now in artist_namevariation table
  • moved: artist.url fields now in artist_url table
  • removed: release_format.position no longer exists, but the id field can be used to preserve order when a release has multiple formats.
  • changed: release_track_artist now uses tmp_track_id to match tmp_track in release_track

Running discogs-xml2db classic

To run the classic version of discogs-xml2db, check out the v1.99 git tag.
It contains both the classic and the speedup version.

Please be aware that the classic version is no longer maintained.

discogs-xml2db's People

Contributors

albertfougy, berz, dpnova, ijabz, kenanm, kpomorski, mche, orthographic-pedant, philipmat, qwesda, redapple


discogs-xml2db's Issues

Release formats table needs a position column

In the original xml the formats are listed in order, so for a multi-format release the first format listed relates to the first tracks on the release. This order is lost once imported into the postgres database, because there is no order column in the release_formats table. That makes it impossible to accurately map the track list to a series of mediums and set the medium format accordingly.

Image Type is a function of the relation

Currently the image type is stored in the images table, but it is actually a function of the relationship between the image and each release/label/artist it is linked to. There are two problems with the current approach:

  1. When an image is added to the image table, the code checks whether it has already been added and skips it if so. This means that if the image was first added for a release with type primary, and the same image later appears with type secondary, the secondary entry is never added, so a later query on that relation will incorrectly report the relationship as primary when in fact it was secondary.
  2. The releases_images (and artists_images, labels_images) tables have only two columns, release_id and uri. In some cases the same uri is added to a release as both a secondary and a primary type, so the code adds exactly the same row twice, making it impossible to put a primary key on the table (because it cannot be unique).

Solution: Move the image type column from the image table to the releases_images, artist_images, and labels_images tables.

Modify the image code because no longer have image urls

The Discogs data dump now has only image metadata but no urls, so with the current code the only data we put into the image table is height and width; with no image uri there is no way to link releases_images and artists_images to the image table. We therefore need to modify releases_images (artists_images etc.) to drop the uri columns and add height and width columns, and then drop the image table, which is no longer useful.

File Import Error

When I try to import artists' xml file with

discogsparser.py -i -o mongo -p "file:///home/ubuntu/discogs/?uniq=md5" discogs_20120501_artists.xml

it returns the following output.

Namespace(data_quality=None, date=None, file=['discogs_20120501_artists.xml'], ignore_unknown_tags=True, n=None, output='mongo', params='file:///home/ubuntu/discogs/?uniq=md5')
Traceback (most recent call last):
  File "discogsparser.py", line 223, in <module>
    main(sys.argv[1:])
  File "discogsparser.py", line 215, in main
    parseArtists(parser, exporter)
  File "discogsparser.py", line 73, in parseArtists
    parser.parse(artist_file)
  File "/usr/lib/python2.7/xml/sax/expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/lib/python2.7/xml/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/usr/lib/python2.7/xml/sax/expatreader.py", line 207, in feed
    self._parser.Parse(data, isFinal)
  File "/usr/lib/python2.7/xml/sax/expatreader.py", line 304, in end_element
    self._cont_handler.endElement(name)
  File "/home/ubuntu/discogs-xml2db/discogsartistparser.py", line 117, in endElement
    self.exporter.storeArtist(self.artist)
  File "/home/ubuntu/discogs-xml2db/mongodbexporter.py", line 240, in storeArtist
    self.execute('artists', artist)
  File "/home/ubuntu/discogs-xml2db/mongodbexporter.py", line 206, in execute
    doc.updated_on = "%s" % date.today()
AttributeError: 'dict' object has no attribute 'updated_on'

artist and artist_joins table should be merged

(Apologies if I've misunderstood this, but this is how I'm seeing it.)

Shouldn't the artist join field be stored as a column in track_artists and release_artists, rather than in the separate tracks_artists_joins and release_artist_joins tables?

Certainly, looking at the xml returned by the webservice http://api.discogs.com/release/3, join data is stored with the artist itself. With the database tables, I'm not sure how I'm meant to retrieve the data: I can look up the artists for a track using track_artists, and then, if the track has more than one artist, I have to look up track_artists_joins by trackid and find the right row by comparing artist1 and artist2 with the rows from the first query. This isn't too bad when there are just two artists on a song, but if the song has four artists, how do I know whether artist1/artist2 goes with the 1st and 2nd, 2nd and 3rd, or 1st and 3rd, given that the artist order is not defined anywhere?

It would be better if track_artists had a position column (to signify the position of the artist on the track) and a join column (always empty for the last artist, or when the track has only one artist). The same logic applies to release artists.
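To illustrate the proposal (the column layout and data here are hypothetical, not the actual discogs-xml2db schema), a position plus a join string per artist row is enough to rebuild the full credit line without a separate joins table:

```python
# Rebuild a track's artist credit from (position, name, join_string) rows;
# the join string is empty for the last artist.
track_artists = [
    (1, "Artist A", "feat."),
    (2, "Artist B", "&"),
    (3, "Artist C", ""),
]

def credit_line(rows):
    parts = []
    for _, name, join in sorted(rows):  # sort by position column
        parts.append(name)
        if join:
            parts.append(join)
    return " ".join(parts)

print(credit_line(track_artists))  # Artist A feat. Artist B & Artist C
```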

RtD style documentation

The project should have better documentation and be on https://readthedocs.org.

  • Trim README.rst to only cover a brief description, features, installation, and quick way to run.
  • Documentation for each exporter
  • Documentation for tests

Unable to load master images into table because type null

in fix_db.sql

--Remove duplicate master rows
insert into masters_images
select distinct t1.image_uri, t1.type, t1.master_id
from tmp_masters_images t1
left join masters_images t2
on t1.image_uri = t2.image_uri
and t1.type = t2.type
and t1.master_id = t2.master_id
where t2.image_uri is null
;
fails with

ERROR: null value in column "type" violates not-null constraint
DETAIL: Failing row contains (http://api.discogs.com/image/R-1-1193812031.jpeg, null, 5427).

For Postgres Track table needs track number that represents the track order

The Track table has a position column, but for many releases this is not a simple ascending number but something much harder to parse (e.g. vinyl could be A1 A2 B1). What we really need is an additional number column that is set to 1 for the first track, 2 for the second track, and so on. I assume tracks are in the correct order in the original xml dump file, so this would be a relatively easy fix to postgresimporter.py (but I'm not a python programmer myself, so I'm not sure where to start).
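A minimal sketch of the requested fix, assuming tracks appear in release order in the dump file (the position labels below are illustrative):

```python
# Derive a sequential track number from the order in which tracks appear,
# regardless of how the position label is formatted (A1, B2, ...).
positions = ["A1", "A2", "B1", "B2"]  # positions as read from the xml dump

numbered = list(enumerate(positions, start=1))
print(numbered)  # [(1, 'A1'), (2, 'A2'), (3, 'B1'), (4, 'B2')]
```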

It takes more than 24 hours to export as a JSON file

I'm trying to import data into mongodb. As you mention in the README, direct import takes longer than the other option (XML to JSON, then mongoimport). However, even with the other option I couldn't easily import it. The releases.xml file is ~9GB and it took more than 24 hours to export the JSON file. After that process the releases.json file was corrupted, so I couldn't import it at all.

I tried the script on an Amazon EC2 micro instance and on another VPS with a dual-core processor. The results were the same: the files were ~8GB and corrupted. The other problem is that the script causes CPU overload, so it's not an option to leave discogsparser.py running for 24 hours. I couldn't figure out what the problem is. Is xml.sax slow, or do the systems I used not have enough resources?

Error when importing discogs data into PostgreSQL

The process crashed when it hit the "masters" section. Here's the output:

  File "discogsparser.py", line 241, in <module>
    main(sys.argv[1:])
  File "discogsparser.py", line 236, in main
    parseMasters(parser, exporter)
  File "discogsparser.py", line 171, in parseMasters
    parser.parse(master_file)
  File "/usr/local/lib/python2.7/xml/sax/expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/local/lib/python2.7/xml/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/usr/local/lib/python2.7/xml/sax/expatreader.py", line 210, in feed
    self._parser.Parse(data, isFinal)
  File "/usr/local/lib/python2.7/xml/sax/expatreader.py", line 307, in end_element
    self._cont_handler.endElement(name)
  File "/home/thephltr/webapps/who_pro/discogs_importer/discogsmasterparser.py", line 173, in endElement
    self.exporter.storeMaster(self.master)
  File "/home/thephltr/webapps/who_pro/discogs_importer/postgresexporter.py", line 323, in storeMaster
    (img.uri, img.image_type, master.id))
AttributeError: ImageInfo instance has no attribute 'image_type'

Release artists not matching artists if using artist name variation

Rows in the release_artist table do not match artists in the artist table when the release artist is an artist name variation. This query against the postgres db shows the problem (Various is filtered out because of a different problem, raised in a separate issue):

select distinct r1.release_id, r1.artist_name from
releases_artists r1
left join artist r2
on r1.artist_name= r2.name
where r2.name is null
and r1.artist_name!='Various'
order by r1.artist_name

An example here, this release:

http://api.discogs.com/release/2294510

has artist Jürgen Von Manger

but the actual artist name uses small 'v' in von
http://www.discogs.com/artist/566712
Jürgen von Manger

The way to fix this would be, when loading the database from the release dump file, to populate the release_artists table with artist_id instead of artist_name (both are included in the dump). This clearly requires database changes, which I can do, but I'm struggling to fix the python code.

I guess we have the same problem with track_artists as well
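A toy illustration of why matching by artist_id is more robust than matching by name (the data is taken from the example above; the dict stands in for the artist table):

```python
# Matching by name fails on name variations; matching by id does not.
artists = {566712: "Jürgen von Manger"}    # artist table: id -> canonical name
release_artist_name = "Jürgen Von Manger"  # name variation used on the release
release_artist_id = 566712                 # artist id, also present in the dump

by_name = release_artist_name in artists.values()  # False: "Von" vs "von"
by_id = release_artist_id in artists               # True: ids are stable
print(by_name, by_id)  # False True
```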

Package the project

Turn the existing structure into a package.

  • Option to run as a script (python -m)
  • Option to download the files
  • Option to run tests on the downloaded files.
  • Option to create the storage structure (DB tables)

ANV not assigned to the proper parent

The script currently overrides the release.anv property for each anv node it finds - last one wins.

So, first, the Extraartist should support anv (change the PGSQL scripts).

That would allow the anv property to be attached to the proper nodes: release.extraartists or track.extraartist.

MongoDb import failed

I want to do a direct (or indirect) import into MongoDb.
I have downloaded and extracted the Discogs releases file.

Executing this, I receive an error:

$ ./discogsparser.py -i -o mongo -p "mongodb://user:pass@localhost/discogs?uniq=md5" -d 20120901
  File "./discogsparser.py", line 74
    except ParserStopError as pse:
                            ^
SyntaxError: invalid syntax

load data into PostgreSQL

I was stuck at the step "python discogsparser.py". The following is the run information:

python discogsparser.py -o pgsql -p "dbname=discogs user=discogs password=discogs" -d 20140803
Namespace(data_quality=None, date='20140803', file=[], ignore_unknown_tags=False, n=None, output='pgsql', params='dbname=discogs user=discogs password=discogs')

python discogsparser.py -o pgsql -p "dbname=discogs user=discogs password=discogs" discogs_20140803_artists.xml
Namespace(data_quality=None, date=None, file=['discogs_20140803_artists.xml'], ignore_unknown_tags=False, n=None, output='pgsql', params='dbname=discogs user=discogs password=discogs')

It looks like it didn't execute the command. I don't know what's wrong with it. I am a python newbie.

Thank you,
Ying

Release.artist is missing from some datasets

When exporting the 20131201_releases.xml file I've found that some entries don't have the artist attribute.

I'm not sure if this issue is unique to this 20131201_releases.xml file, and therefore whether it should be considered a bug in the data or in the code.

Stacktrace

Namespace(data_quality=None, date='20131201', file=[], ignore_unknown_tags=True, n=None, output='mongo', params='file://output')
Traceback (most recent call last):
  File "discogsparser.py", line 223, in <module>
    main(sys.argv[1:])
  File "discogsparser.py", line 217, in main
    parseReleases(parser, exporter)
  File "discogsparser.py", line 128, in parseReleases
    parser.parse(release_file)
  File "/usr/lib/python2.7/xml/sax/expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/lib/python2.7/xml/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/usr/lib/python2.7/xml/sax/expatreader.py", line 207, in feed
    self._parser.Parse(data, isFinal)
  File "/usr/lib/python2.7/xml/sax/expatreader.py", line 304, in end_element
    self._cont_handler.endElement(name)
  File "/home/kenan/workspace/discogs-xml2db/discogsreleaseparser.py", line 309, in endElement
    self.exporter.storeRelease(self.release)
  File "/home/kenan/workspace/discogs-xml2db/mongodbexporter.py", line 243, in storeRelease
    if release.artist:
AttributeError: Release instance has no attribute 'artist'

error parsing masters file

Parsing the 20121201 masters file, I got the following stack trace:

Traceback (most recent call last):
  File "discogsparser.py", line 223, in <module>
    main(sys.argv[1:])
  File "discogsparser.py", line 218, in main
    parseMasters(parser, exporter)
  File "discogsparser.py", line 153, in parseMasters
    parser.parse(master_file)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/sax/expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/sax/expatreader.py", line 207, in feed
    self._parser.Parse(data, isFinal)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/sax/expatreader.py", line 304, in end_element
    self._cont_handler.endElement(name)
  File "/Users/burc/Documents/Development/discogs-xml2db/discogsmasterparser.py", line 170, in endElement
    self.exporter.storeMaster(self.master)
  File "/Users/burc/Documents/Development/discogs-xml2db/mongodbexporter.py", line 250, in storeMaster
    self.execute('masters', master)
  File "/Users/burc/Documents/Development/discogs-xml2db/mongodbexporter.py", line 203, in execute
    uniq, md5 = self._is_uniq(collection, what.id, json_string)
  File "/Users/burc/Documents/Development/discogs-xml2db/mongodbexporter.py", line 181, in _is_uniq
    return self._quick_uniq.is_uniq(collection, id, json_string)
  File "/Users/burc/Documents/Development/discogs-xml2db/mongodbexporter.py", line 110, in is_uniq
    self._load(collection)
  File "/Users/burc/Documents/Development/discogs-xml2db/mongodbexporter.py", line 88, in _load
    if self._hashes[name] is None:
KeyError: 'masters'

Without studying the code in detail, I changed line 84 of mongodbexporter.py to

self._hashes = {'artists': None, 'labels': None, 'releases': None, 'masters': None }

which fixed the issue.

Restoring from md5 not working as expected

Hello,

While playing with md5, the following was discovered:

  • upserting in mongo takes a very long time and never finishes (perhaps a mongo issue? an upsert per line is a heavy operation)
  • the md5 data count is too large (~25% diff over a 4-month range) - maybe this is expected

I've tested with the following data files:

discogs_20141001_artists.xml (main file)
discogs_20150301_artists.xml (md5 for upsert)
 ~]# du -h artists.*
1.2G     discogs_20141001_artists.json
315M discogs_20150301_artists.json
147M artists.md5
 ~]# wc -l *
   3523086 discogs_20141001_artists.json
    799269 discogs_20150301_artists.json
   3769950 artists.md5
~]# time mongoimport -d dataset discogs_20150301_artists.json

2015-06-17T15:38:46.106+0300    no collection specified
2015-06-17T15:38:46.106+0300    using filename 'artists' as collection
2015-06-17T15:38:46.108+0300    connected to: localhost
2015-06-17T15:38:49.107+0300    [#.......................] dataset.artists  69.1 MB/1.1 GB (6.0%)
2015-06-17T15:38:52.107+0300    [##......................] dataset.artists  136.4 MB/1.1 GB (11.9%)
2015-06-17T15:38:55.109+0300    [####....................] dataset.artists  197.2 MB/1.1 GB (17.3%)
2015-06-17T15:38:58.107+0300    [#####...................] dataset.artists  248.6 MB/1.1 GB (21.8%)
2015-06-17T15:39:01.107+0300    [######..................] dataset.artists  308.1 MB/1.1 GB (27.0%)
2015-06-17T15:39:04.108+0300    [#######.................] dataset.artists  359.5 MB/1.1 GB (31.5%)
2015-06-17T15:39:07.107+0300    [########................] dataset.artists  397.7 MB/1.1 GB (34.8%)
2015-06-17T15:39:10.107+0300    [#########...............] dataset.artists  439.6 MB/1.1 GB (38.5%)
2015-06-17T15:39:13.107+0300    [##########..............] dataset.artists  496.0 MB/1.1 GB (43.4%)
2015-06-17T15:39:16.107+0300    [###########.............] dataset.artists  546.8 MB/1.1 GB (47.8%)
2015-06-17T15:39:19.108+0300    [############............] dataset.artists  601.0 MB/1.1 GB (52.6%)
2015-06-17T15:39:22.107+0300    [#############...........] dataset.artists  638.6 MB/1.1 GB (55.9%)
2015-06-17T15:39:25.107+0300    [##############..........] dataset.artists  694.1 MB/1.1 GB (60.7%)
2015-06-17T15:39:28.107+0300    [###############.........] dataset.artists  741.0 MB/1.1 GB (64.8%)
2015-06-17T15:39:31.107+0300    [################........] dataset.artists  766.8 MB/1.1 GB (67.1%)
2015-06-17T15:39:34.107+0300    [#################.......] dataset.artists  815.8 MB/1.1 GB (71.4%)
2015-06-17T15:39:37.107+0300    [##################......] dataset.artists  873.0 MB/1.1 GB (76.4%)
2015-06-17T15:39:40.107+0300    [###################.....] dataset.artists  920.1 MB/1.1 GB (80.5%)
2015-06-17T15:39:43.107+0300    [####################....] dataset.artists  971.5 MB/1.1 GB (85.0%)
2015-06-17T15:39:46.107+0300    [#####################...] dataset.artists  1.0 GB/1.1 GB (89.8%)
2015-06-17T15:39:49.108+0300    [######################..] dataset.artists  1.1 GB/1.1 GB (94.7%)
2015-06-17T15:39:52.107+0300    [#######################.] dataset.artists  1.1 GB/1.1 GB (98.5%)
2015-06-17T15:39:53.895+0300    imported 3523086 documents

real    1m7.806s
~]# time mongoimport --upsert --upsertFields 'id' -d dataset discogs_20150301_artists.json

2015-06-17T15:40:35.177+0300    no collection specified
2015-06-17T15:40:35.177+0300    using filename 'artists' as collection
2015-06-17T15:40:35.179+0300    connected to: localhost
2015-06-17T15:40:38.178+0300    [........................] dataset.artists  7.6 MB/314.7 MB (2.4%)
2015-06-17T15:40:41.178+0300    [........................] dataset.artists  7.6 MB/314.7 MB (2.4%)
2015-06-17T15:40:44.178+0300    [........................] dataset.artists  7.6 MB/314.7 MB (2.4%)
2015-06-17T15:40:47.178+0300    [........................] dataset.artists  7.6 MB/314.7 MB (2.4%)
2015-06-17T15:40:50.178+0300    [........................] dataset.artists  7.6 MB/314.7 MB (2.4%)
2015-06-17T15:40:53.178+0300    [........................] dataset.artists  7.6 MB/314.7 MB (2.4%)
2015-06-17T15:40:56.178+0300    [........................] dataset.artists  7.6 MB/314.7 MB (2.4%)
2015-06-17T15:40:59.178+0300    [........................] dataset.artists  7.6 MB/314.7 MB (2.4%)
2015-06-17T15:41:02.178+0300    [........................] dataset.artists  7.6 MB/314.7 MB (2.4%)
2015-06-17T15:41:05.178+0300    [........................] dataset.artists  7.6 MB/314.7 MB (2.4%)
2015-06-17T15:41:08.178+0300    [........................] dataset.artists  7.6 MB/314.7 MB (2.4%)
2015-06-17T15:41:11.178+0300    [........................] dataset.artists  7.6 MB/314.7 MB (2.4%)
2015-06-17T15:41:44.178+0300    [#.......................] dataset.artists  14.0 MB/314.7 MB (4.4%)
2015-06-17T15:45:14.178+0300    [#.......................] dataset.artists  20.1 MB/314.7 MB (6.4%)
[… same progress line repeats every 3 seconds, stalling for minutes at each step …]
^C
real    9m31.152s

New id field for artists

The November 11 2011 dump of artists contains a new numeric id field, e.g. <artist>...<id>123</id>...</artist>.

If this becomes a permanent fixture, consider adding an artist_id to the PGSQL artist table.

updated_on field for when a record was imported or updated

To be able to compute which records MongoDB needs to re-index, I need an updated_on field that should reflect the date (or the dump file) the record originated from, either as a new import or as an update to a previous import.

Need to make sure it doesn't interfere with MD5 calculations.
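One way to keep the MD5 stable is to hash only the content fields and skip bookkeeping columns. This is just a sketch under two assumptions that are not from the codebase: that records are available as plain dicts, and that the hash currently covers all fields.

```python
import hashlib
import json

def record_md5(record, volatile=("updated_on",)):
    """Hash only the stable fields so bookkeeping columns don't change the MD5."""
    stable = {k: v for k, v in record.items() if k not in volatile}
    payload = json.dumps(stable, sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest()

a = record_md5({"id": 1, "name": "Some Artist", "updated_on": "2011-11-01"})
b = record_md5({"id": 1, "name": "Some Artist", "updated_on": "2012-05-01"})
assert a == b  # updated_on does not affect the hash
```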

Error when trying to import discogs master file

Got this output, and the import stopped, after trying to import the 20141001 release data.

Traceback (most recent call last):
  File "discogsparser.py", line 241, in <module>
    main(sys.argv[1:])
  File "discogsparser.py", line 236, in main
    parseMasters(parser, exporter)
  File "discogsparser.py", line 171, in parseMasters
    parser.parse(master_file)
  File "/usr/lib/python2.7/xml/sax/expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/lib/python2.7/xml/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/usr/lib/python2.7/xml/sax/expatreader.py", line 210, in feed
    self._parser.Parse(data, isFinal)
  File "/usr/lib/python2.7/xml/sax/expatreader.py", line 307, in end_element
    self._cont_handler.endElement(name)
  File "/home//Discogs_Importer/discogs-xml2db-master/discogsmasterparser.py", line 173, in endElement
    self.exporter.storeMaster(self.master)
  File "/home//Discogs_Importer/discogs-xml2db-master/postgresexporter.py", line 323, in storeMaster
    (img.uri, img.image_type, master.id))
AttributeError: ImageInfo instance has no attribute 'image_type'
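A defensive workaround, sketched here with a hypothetical helper rather than the real postgresexporter code, is to default the missing attribute instead of letting it raise:

```python
class ImageInfo:
    def __init__(self, uri):
        self.uri = uri  # note: no image_type attribute, as in the crash above

def image_rows(images, master_id):
    """Build parameter tuples, defaulting image_type to None when absent."""
    return [(getattr(img, "image_type", None), img.uri, master_id)
            for img in images]

rows = image_rows([ImageInfo("http://api.discogs.com/image/R-1.jpeg")], 5427)
assert rows == [(None, "http://api.discogs.com/image/R-1.jpeg", 5427)]
```

Whether None is an acceptable value depends on the table's NOT NULL constraints, so the real fix may instead be to track down why the parser leaves the attribute unset.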

track extraartists and release extraartists should normalize roles

Currently, if the xml feed contains a release extra artist with multiple roles, they are added as one row in the release_extraartist table with an array of roles. Fair enough; however, the same artist is often listed as a separate artist for one release (e.g. http://api.discogs.com/release/2), so we end up with multiple rows in the table for the same artist and release anyway, negating the advantage of putting the roles in an array.

(I have changed my code so that the release_extraartist table defines role as a simple text field and adds a new row for each release/artist/role combination; same logic for track_extraartists.)
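The normalization described above might look like this sketch (artist and role values are illustrative, and the input shape is assumed, not taken from the parser):

```python
def normalized_roles(extraartists):
    """Flatten (artist, [roles]) pairs into one (artist, role) row per role."""
    rows = []
    for artist, roles in extraartists:
        for role in roles:
            rows.append((artist, role))
    return sorted(set(rows))  # dedupe repeated artist/role combinations

rows = normalized_roles([
    ("Artist A", ["Written-By", "Producer"]),
    ("Artist A", ["Producer"]),  # same artist listed again on the release
])
assert rows == [("Artist A", "Producer"), ("Artist A", "Written-By")]
```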

Filenames of sql scripts are unwieldy

No need to have 'discogs' in the name; everything here is to do with discogs.
No need for 'pgsql' in the name; the files already have a .sql suffix.
Longer names mean more typing.

Renamed as follows:
discogs-indexes-pgsql.sql -> create_indexes.sql
discogs-pgsql.sql -> create_tables.sql
discogs-fixdb-pgsql.sql -> fix_db.sql

Fix_db.sql only working for releases because of db constraints

Output as follows

INSERT 0 1
INSERT 0 9522448
ERROR: insert or update on table "artists_images" violates foreign key constraint "artists_images_image_uri_fkey"
DETAIL: Key (image_uri)=(http://api.discogs.com/image/A-1-1138987958.jpeg) is not present in table "image".
ERROR: insert or update on table "labels_images" violates foreign key constraint "labels_images_image_uri_fkey"
DETAIL: Key (image_uri)=(http://api.discogs.com/image/L-58127-1255729347.jpeg) is not present in table "image".
ERROR: null value in column "type" violates not-null constraint
DETAIL: Failing row contains (http://api.discogs.com/image/R-1-1193812031.jpeg, null, 5427).
INSERT 0 9520322

discogsparser.py doesn't run

I created a database in postgresql and imported database schema. And I tried to run discogsparser.py with following command:

python discogsparser.py -o pgsql -p "host=localhost dbname=discogs user=[user] password=[pass]" discogs_20120501_releases.xml.gz

Also I tried the json format but the result doesn't change. I'm getting something like this:

Namespace(data_quality=None, date=None, file=['discogs_20120501_releases.xml.gz'], ignore_unknown_tags=False, n=None, output='pgsql', params='host=localhost dbname=discogs user=[user] password=[pass]')

and discogsparser.py stops working without any exception.

Feb 2015 Discogs Xml Dump now contains members id

If a group has members, the members section now contains artist ids, not just their names; this breaks the postgres artist parsing, causing it to consider each member an additional artist in their own right.

January 2015 Dump:

grep "22387" discogs_20150101_artists.xml
22387DisintegratorOliver Chesler & John SelwayThe collaboration of New York City based DJ/Producers Oliver Chesler and John Selway.<data_quality>Correct</data_quality>DesintegratorDisintergratorKoenig CylindersMachines (8)Carlos VasquezJohn SelwayOliver Chesler

February Dump:
grep "22387" discogs_20150201_artists.xml
22387DisintegratorOliver Chesler & John SelwayThe collaboration of New York City based DJ/Producers Oliver Chesler and John Selway.<data_quality>Correct</data_quality>DesintegratorDisintergratorKoenig CylindersMachines (8)241853Carlos Vasquez17John Selway4563Oliver Chesler
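A sketch of parsing the new interleaved id/name pairs. This assumes the February layout is `<members><id>…</id><name>…</name>…</members>`; the exact schema should be confirmed against the dump, and the tag-stripped grep output above makes it hard to be certain.

```python
import xml.etree.ElementTree as ET

# Illustrative fixture, not an actual dump excerpt
xml = """<artist><members>
  <id>241853</id><name>Carlos Vasquez</name>
  <id>17</id><name>John Selway</name>
</members></artist>"""

def parse_members(artist_xml):
    """Pair each <id> with the <name> that follows it."""
    members, current_id = [], None
    for el in ET.fromstring(artist_xml).find("members"):
        if el.tag == "id":
            current_id = int(el.text)
        elif el.tag == "name":
            members.append((current_id, el.text))
    return members

assert parse_members(xml) == [(241853, "Carlos Vasquez"), (17, "John Selway")]
```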

Put discogs table into own schema rather than public

My specific requirement for this is so I can load the discogs and musicbrainz tables into the same database and then do queries involving tables from both datasets. But I think this is a good general improvement anyway.

Speed up processing

Hi guys,

What's the best ideas to speed up processing with discogs-xml2db?

Is it possible to run it in parallel mode?
Right now it occupies only 1 CPU core, and processing the latest 20G xml takes ~3 hours (for a mongo file-dump).

$ time discogsparser.py -i -o mongo -p "file:///HDD1/2015-06" /HDD2/2015-06/discogs_20150601_releases.xml

real    191m27.478s
user    189m22.749s
sys     2m0.714s


I am playing with http://www.gnu.org/software/parallel/ now, but maybe there are some other options.

Also I've tried omitting the md5 hashes to speed things up a bit, but overall processing time didn't change much.
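For what it's worth, once the XML has been split into independent record strings, the per-record work can be fanned out across worker processes (threads won't help here because of the GIL). This is a rough sketch, not discogs-xml2db code, and it skips the hard part: splitting the XML stream itself.

```python
from multiprocessing import Pool

def parse_chunk(records):
    # stand-in for the real per-record parse/export work
    return [len(r) for r in records]

def chunked(seq, n):
    """Deal records round-robin into n roughly equal chunks."""
    return [seq[i::n] for i in range(n)]

if __name__ == "__main__":
    records = ["<artist/>"] * 1000  # pretend these were pre-split from the dump
    with Pool(4) as pool:
        results = pool.map(parse_chunk, chunked(records, 4))
    assert sum(len(r) for r in results) == 1000
```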

Any other suggestions are more than welcome!

Thanks!

Handle ID tag in labels XML

The labels XML seems to have gotten an ID tag. Maybe it's time for a refresh of the script to take into account the latest versions of each XML.

Updating a PostgreSQL database

Is there a mechanism in place for updating a PostgreSQL database from the latest XML dumps? I couldn't find anything other than what the README mentioned about MongoDB.

Automated testing

Should have a way to test that new discogs XML releases do not break the parsing.
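A minimal version of such a regression check could be a tiny XML fixture committed to the repo and parsed on every run, e.g. with pytest. The fixture content and helper below are hypothetical, not existing test code:

```python
import xml.etree.ElementTree as ET

# Tiny hand-written fixture standing in for a real dump excerpt
SAMPLE = "<artists><artist><id>1</id><name>Some Artist</name></artist></artists>"

def test_artist_fixture_parses():
    root = ET.fromstring(SAMPLE)
    artists = root.findall("artist")
    assert len(artists) == 1
    assert artists[0].findtext("id") == "1"
    assert artists[0].findtext("name") == "Some Artist"

test_artist_fixture_parses()
```

Refreshing the fixture from each new monthly dump would catch schema changes like the members-id one above.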

Use Postgres Copy to get the data into Postgres

Import of the release data is slow, not least because it processes each record one by one, sequentially adding them into the database. So I think the bottleneck is in the code; the database could cope with multiple statements being fired at the same time.

I think the code could be sped up by parallelizing the import of the data; I wonder if just manually splitting the file into three chunks and running the import on the three files in parallel would work.
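On the COPY idea in the title: one common pattern is to buffer rows as tab-separated text in memory and stream them in one COPY per batch. A sketch; the table and column names are illustrative, and the psycopg2 call is shown only as a comment since it needs a live connection:

```python
import csv
import io

def copy_buffer(rows):
    """Build an in-memory TSV buffer suitable for COPY ... FROM STDIN."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    buf.seek(0)
    return buf

buf = copy_buffer([(1, "Warp Records", "WAP 39 CDR")])
# with psycopg2: cur.copy_from(buf, "releases_labels",
#                              columns=("release_id", "label", "catno"))
assert buf.read() == "1\tWarp Records\tWAP 39 CDR\n"
```

COPY amortizes the per-statement round trip, which is usually a bigger win than parallelism alone.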

md5 Can't be Created

Hello,

Looks like the md5 files, as described in the README, can't be created.

$ discogsparser.py -i -o mongo -p "file:///tmp/discogs/?uniq=md5" -d 20111101 
# this results in 2 files created for each class, e.g. an artists.json file and an artists.md5 file

I'm using Python 2.7.5

Duplicate records in releases_labels, prevent us adding primary key

Duplicate records in releases_labels prevent us from adding a primary key, slow access in queries, and are bad db design.

jthinksearch=> \d releases_labels;
Unlogged table "discogs.releases_labels"
   Column   |  Type   | Modifiers
------------+---------+-----------
 label      | text    |
 release_id | integer |
 catno      | text    |
Indexes:
    "releases_labels_catno_idx" btree (catno)
    "releases_labels_name_idx" btree (label)
Foreign-key constraints:
    "foreign_did" FOREIGN KEY (release_id) REFERENCES release(id)

jthinksearch=> select * from releases_labels  where release_id=6155;
    label     | release_id |   catno
--------------+------------+------------
 Warp Records |       6155 | WAP 39 CDR
 Warp Records |       6155 | WAP 39 CDR
(2 rows)
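One option is to dedupe during the export, before the rows ever reach Postgres. A sketch, assuming rows arrive as (label, release_id, catno) tuples; it is not how discogs-xml2db currently buffers rows:

```python
def dedupe(rows):
    """Drop exact duplicate rows, keeping the first occurrence."""
    seen, out = set(), []
    for row in rows:
        if row not in seen:
            seen.add(row)
            out.append(row)
    return out

rows = [("Warp Records", 6155, "WAP 39 CDR"),
        ("Warp Records", 6155, "WAP 39 CDR")]
assert dedupe(rows) == [("Warp Records", 6155, "WAP 39 CDR")]
```

The alternative is a post-import SQL cleanup before adding the primary key.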

releases_extraartists_name field too long for standard index

Running create_indexes gives

ERROR: index row size 2888 exceeds maximum 2712 for index "releases_extraartists_name_idx"
HINT: Values larger than 1/3 of a buffer page cannot be indexed.
Consider a function index of an MD5 hash of the value, or use full text indexing.
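Following the HINT, indexing an MD5 of the value gives a fixed-length key no matter how long the name is (in Postgres that would be an expression index on `md5(name)`). A small illustration of why the digest side-steps the row-size limit:

```python
import hashlib

def index_key(value):
    """Fixed-length surrogate key for values too long to index directly."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()

key = index_key("x" * 5000)  # a name far larger than a third of a buffer page
assert len(key) == 32  # always 32 hex chars, regardless of input size
```

Lookups then compare `md5(name) = md5($1)` rather than the raw column.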

invalid token of release xml file when loading into PostgreSQL

I am loading the xml files into PostgreSQL. I downloaded the 20140803 and 20140701 versions. With both of them I met errors when loading the release xml file.

For 20140803 version:
xml.sax._exceptions.SAXParseException: discogs_20140803_releases.xml:53155:784: not well-formed (invalid token)

For 20140701 version:
xml.sax._exceptions.SAXParseException: discogs_20140701_releases.xml:3763466:690: not well-formed (invalid token)

Which version of discogs did you load into PostgreSQL successfully?

Thank you,
Ying
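The usual culprit behind "not well-formed (invalid token)" is control characters that are illegal in XML 1.0. A sketch of a pre-filter (not part of discogs-xml2db) that strips them before the file reaches the parser:

```python
import re
import xml.etree.ElementTree as ET

# Characters outside the XML 1.0 valid range trigger "invalid token" in expat
INVALID_XML = re.compile("[^\u0009\u000a\u000d\u0020-\ud7ff\ue000-\ufffd]")

def sanitize(text):
    return INVALID_XML.sub("", text)

bad = "<release><title>abc\x08def</title></release>"  # \x08 is illegal in XML 1.0
try:
    ET.fromstring(bad)
    parsed_raw = True
except ET.ParseError:
    parsed_raw = False
assert not parsed_raw  # the raw text is rejected
assert ET.fromstring(sanitize(bad)).findtext("title") == "abcdef"
```

Running the dump through such a filter (streaming, for files this size) usually lets the import finish.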

There is no 'Various' artist table in Artist table breaking relationships

'Various' is referred to as the artist with id 194 on the website, i.e. clicking a 'Various' link in the database will take you to http://www.discogs.com/artist/194. However, that artist id doesn't actually exist in the data dumps, and hence not in the database once imported.

This is problematic because it means relational links such as

SELECT r.release_id,
       a.id AS artistId,
       a.name
FROM releases_artists AS r
INNER JOIN artist a ON r.artist_name = a.name

will not return results for various artists. After the data has been imported we should create a row for 'Various' so that queries on the database don't have to special-case Various artists.
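The proposed fix-up is a single sentinel insert after import. The sketch below uses SQLite purely as a stand-in for Postgres (schema trimmed to the two columns involved), to show the join working once the row exists:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE artist (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("CREATE TABLE releases_artists (release_id INTEGER, artist_name TEXT)")
db.execute("INSERT INTO releases_artists VALUES (2, 'Various')")

# Post-import fix-up: add the missing sentinel row so joins stop dropping rows
db.execute("INSERT INTO artist VALUES (194, 'Various')")

row = db.execute("""SELECT r.release_id, a.id, a.name
                    FROM releases_artists r
                    JOIN artist a ON r.artist_name = a.name""").fetchone()
assert row == (2, 194, "Various")
```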
