discogs-xml2db's Introduction

discogs-xml2db v2

discogs-xml2db is a python program for importing discogs data dumps into several databases.

Version 2 is a rewrite of the original discogs-xml2db (referred to here as the classic version).
It is based on a branch by RedApple and is several times faster.

It currently supports MySQL and PostgreSQL as target databases. Instructions for importing into MongoDB are also included, though these are untested.
Let us know how it goes!

Experimental version

In parallel to the original Python codebase, we're working on a parser/exporter that's even faster. This is a complete rewrite in C# and initial results are highly promising:

File                               Record Count    Python      C#
discogs_20200806_artists.xml.gz       7,046,615      6:22    2:35
discogs_20200806_labels.xml.gz        1,571,873      1:15    0:22
discogs_20200806_masters.xml.gz       1,734,371      3:56    1:57
discogs_20200806_releases.xml.gz     12,867,980   1:45:16   42:38

If you're interested in testing this version, read more about it in the .NET Parser README or grab the appropriate binaries from the Releases page.

While this version does not yet have complete feature parity with the Python version, the core export-to-csv functionality is there and it will likely replace the Python version eventually.

Running discogs-xml2db

Requirements

discogs-xml2db requires python3 (minimum 3.6) and some python modules.
Additionally, the bash shell is used for automating some tasks.

Importing to some databases may require additional dependencies, see the documentation for your target database below.

It's best to create a Python virtual environment so the required modules are installed in an isolated location, without needing elevated permissions:

# Create a virtual environment
$ python3 -m venv .discogsenv

# Activate virtual environment
# On Linux/macOS:
$ source .discogsenv/bin/activate
# On Windows, in PowerShell:
$ .discogsenv\Scripts\Activate.ps1

# Install requirements:
(.discogsenv) $ pip3 install -r requirements.txt

Installation instructions for other platforms can be found in the pip documentation.

Downloading discogs dumps

Download the latest dump files manually from discogs or run get_latest_dumps.sh.

To check the files' integrity, download the appropriate checksum file from https://data.discogs.com/, place it in the same directory as the dumps, and compare the checksums.

# run in folder where the data dump files have been downloaded
$ sha256sum -c discogs_*_CHECKSUM.txt
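If sha256sum isn't available on your platform, the same check can be done with Python's standard library; a minimal sketch (file paths are illustrative):

```python
# Compute a file's SHA-256 with only the standard library; compare the
# result against the value listed in the downloaded CHECKSUM file.
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # read in chunks so multi-gigabyte dumps don't need to fit in memory
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

For example, sha256_of("discogs_20200806_artists.xml.gz") should match the corresponding line in discogs_*_CHECKSUM.txt.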

Converting dumps to CSV

Run run.py to convert the dump files to csv.

There are two run modes:

  1. You can point it to a directory containing the discogs dump files and use one or more --export options to indicate which files to process:

# ensure the virtual environment is active
(.discogsenv) $ python3 run.py \
  --bz2 \
  --apicounts \
  --export artist --export label --export master --export release \
  --output csv-dir \
  dump-dir

Here --bz2 compresses the resulting csv files, --apicounts provides more accurate progress counts, --output names the folder where the csv files are written, and dump-dir is the folder containing the data dumps.

  2. Alternatively, you can specify the individual files:

# ensure the virtual environment is active
(.discogsenv) $ python3 run.py \
  --bz2 \
  --apicounts \
  --output csv-dir \
  path/to/discogs_20200806_artists.xml.gz path/to/discogs_20200806_labels.xml.gz

run.py takes the following arguments:

  • --export: the type of dump file to export: "artist", "label", "master", or "release".
    This matches the names of the dump files, e.g. "discogs_20200806_artists.xml.gz". Not needed if the individual files are specified.
  • --bz2: compresses the output csv files using the bz2 compression library.
  • --limit=<lines>: limits the export to the given number of records.
  • --apicounts: makes the progress report more accurate by getting total counts from the Discogs API.
  • --output: the folder in which to store the csv files; defaults to the current directory.

The exporter provides progress information in real time:

Processing      labels:  99%|█████████████████████████████████████████▊| 1523623/1531339 [01:41<00:00, 14979.04labels/s]
Processing     artists: 100%|████████████████████████████████████████▊| 6861991/6894139 [09:02<00:02, 12652.23artists/s]
Processing    releases:  78%|█████████████████████████████▌        | 9757740/12560177 [2:02:15<36:29, 1279.82releases/s]

The totals and percentages might be slightly off, as the exact number of records is not known while reading the file.
Specifying --apicounts will provide more accurate predictions by getting the latest amounts from the Discogs API.

Importing

If pv is available it will be used to display progress during import.
To install it run $ sudo apt-get install pv on Ubuntu and Debian or check the installation instructions for other platforms.

Example output if using pv:

$ mysql/importcsv.sh 2020-05-01/csv/*
artist_alias.csv.bz2: 12,5MiB 0:00:03 [3,75MiB/s] [===================================>] 100%
artist.csv.bz2:  121MiB 0:00:29 [4,09MiB/s] [=========================================>] 100%
artist_image.csv.bz2:  7,3MiB 0:00:01 [3,72MiB/s] [===================================>] 100%
artist_namevariation.csv.bz2: 2,84MiB 0:00:01 [2,76MiB/s] [==>                         ] 12% ETA 0:00:07

Importing into PostgreSQL

# install PostgreSQL libraries (might be required for next step)
$ sudo apt-get install libpq-dev

# install the PostgreSQL package for python
# ensure the virtual environment has been activated
(.discogsenv) $ pip3 install -r postgresql/requirements.txt

# Configure PostgreSQL username, password, database, ...
$ nano postgresql/postgresql.conf

# Create database tables
(.discogsenv) $ python3 postgresql/psql.py < postgresql/sql/CreateTables.sql

# Import CSV files
(.discogsenv) $ python3 postgresql/importcsv.py /csvdir/*

# Configure primary keys and constraints, build indexes
(.discogsenv) $ python3 postgresql/psql.py < postgresql/sql/CreatePrimaryKeys.sql
(.discogsenv) $ python3 postgresql/psql.py < postgresql/sql/CreateFKConstraints.sql
(.discogsenv) $ python3 postgresql/psql.py < postgresql/sql/CreateIndexes.sql

Importing into MySQL

# Configure MySQL username, password, database, ...
$ nano mysql/mysql.conf

# Create database tables
$ mysql/exec_sql.sh < mysql/CreateTables.sql

# Import CSV files
$ mysql/importcsv.sh /csvdir/*

# Configure primary keys and build indexes
$ mysql/exec_sql.sh < mysql/AssignPrimaryKeys.sql

Importing into MongoDB

The CSV files can be imported into MongoDB using mongoimport.

mongoimport --db=discogs --collection=releases --type=csv --headerline --file=release.csv

Importing into CouchDB

CouchDB only supports importing JSON files.
couchimport can be used to convert the CSV files to JSON and import them into CouchDB, as explained in this tutorial.

Comparison to classic discogs-xml2db

The speedup version is many times faster than classic because it uses a different approach:

  1. The discogs xml dumps are first converted into one csv file per database table.
  2. These csv files are then imported into the different target databases (bulk load).
    This differs from classic discogs-xml2db, which loads records into the database one by one while parsing the xml file, waiting on the database after every row.
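The difference can be sketched with a toy example. This uses the stdlib sqlite3 module as a stand-in for the real target database; the table and data are illustrative, not the actual discogs-xml2db schema:

```python
# Contrast the classic row-by-row pattern with staged bulk loading.
import csv, io, sqlite3

csv_data = "id,name\n1,Blue Note\n2,Stax\n3,Motown\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE label (id INTEGER, name TEXT)")

# classic: one INSERT (and one round-trip) per parsed record
for row in csv.DictReader(io.StringIO(csv_data)):
    conn.execute("INSERT INTO label VALUES (?, ?)", (row["id"], row["name"]))
    conn.commit()  # waits on the database after every row

# speedup: stage everything as CSV first, then load in one bulk operation
conn.execute("DELETE FROM label")
rows = [(r["id"], r["name"]) for r in csv.DictReader(io.StringIO(csv_data))]
conn.executemany("INSERT INTO label VALUES (?, ?)", rows)  # COPY-style bulk load
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM label").fetchone()[0])  # 3
```

With PostgreSQL the bulk step maps to COPY, and with MySQL to LOAD DATA INFILE, which is why the CSV intermediate pays off.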

The speedup version also requires less disk space than classic, as it can work on the dump files while they are still compressed. While the uncompressed dumps for May 2020 take up 57GB of space, the compressed dumps are only 8.8GB. The dumps can be deleted after converting them to compressed CSV files (6.1GB).

As many databases can import CSV files out of the box, it should be easy to add support for more databases to discogs-xml2db speedup in the future.

Database schema changes

The database schema was changed in v2.0 to be more consistent and normalize some more data. The following things changed compared to classic discogs-xml2db:

  • renamed table: releases_labels => release_label
  • renamed table: releases_formats => release_format
  • renamed table: releases_artists => release_artist
  • renamed table: tracks_artists => release_track_artist
  • renamed table: track => release_track
  • renamed column: release_artists.join_relation => release_artist.join_string
  • renamed column: release_track_artist.join_relation => release_track_artist.join_string
  • renamed column: release_format.format_name => release_format.name
  • renamed column: label.contactinfo => label.contact_info
  • renamed column: label.parent_label => label.parent_name
  • added: label has new parent_id field
  • added: release_label has extra fields
  • moved: aliases now in artist_alias table
  • moved: tracks_extra_artists now in track_artist table with extra flag
  • moved: releases_extra_artists now in release_track_artist table with extra flag
  • moved: release.genres now in own release_genre table
  • moved: release.styles now in own release_style table
  • moved: release.barcode now in release_identifier table
  • moved: artist.anv fields now in artist_namevariation table
  • moved: artist.url fields now in artist_url table
  • removed: release_format.position no longer exists, but the id field can be used to preserve order when a release has multiple formats.
  • changed: release_track_artist now uses tmp_track_id to match tmp_track in release_track

Running discogs-xml2db classic

To run the classic version of discogs-xml2db, check out the v1.99 git tag.
It contains both the classic and the speedup version.

Please be aware that the classic version is no longer maintained.

discogs-xml2db's People

Contributors

albertfougy, berz, dpnova, ijabz, kenanm, kpomorski, mche, orthographic-pedant, philipmat, qwesda, redapple


discogs-xml2db's Issues

Release formats table needs a position column

In the original xml the formats are listed in order, so for a multi-format release the first format listed relates to the first tracks on the release. This order is lost once imported into the postgres database, because there is no order column in the release_formats table. That makes it impossible to accurately map the track list to a series of mediums and set the medium format accordingly.

Image Type is a function of the relation

Currently the image type is stored in the images table, but it is actually a function of the relationship between the image and each release/label/artist it is linked to. There are two problems with the current approach:

  1. When an image is added to the image table, the code checks whether it has already been added and skips it if so. This means that if the image was first added for a release with type primary, and the same image later appears with type secondary, the secondary entry is never added, so a later query on that relation will incorrectly report the relationship as primary when in fact it was secondary.
  2. The releases_images (and artists_images, labels_images) tables have only two columns, release_id and uri. In some cases the same uri is added to a release as both a secondary and a primary type, so the code adds exactly the same row twice, making it impossible to put a primary key on the table (because it cannot be unique).

Solution: Move the image type column from the image table to the releases_images, artist_images, and labels_images tables.

Modify the image code because no longer have image urls

The Discogs data dump now has only image metadata but no urls, so with the current code the only data we put into the image table is height and width; with no image uri there is no way to link releases_images and artists_images to the image table. We therefore need to modify releases_images (artists_images etc.) to drop the uri columns and add height and width columns, and then drop the image table, which is no longer useful.

File Import Error

When I try to import artists' xml file with

discogsparser.py -i -o mongo -p "file:///home/ubuntu/discogs/?uniq=md5" discogs_20120501_artists.xml

it returns the following output.

Namespace(data_quality=None, date=None, file=['discogs_20120501_artists.xml'], ignore_unknown_tags=True, n=None, output='mongo', params='file:///home/ubuntu/discogs/?uniq=md5')
Traceback (most recent call last):
  File "discogsparser.py", line 223, in <module>
    main(sys.argv[1:])
  File "discogsparser.py", line 215, in main
    parseArtists(parser, exporter)
  File "discogsparser.py", line 73, in parseArtists
    parser.parse(artist_file)
  File "/usr/lib/python2.7/xml/sax/expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/lib/python2.7/xml/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/usr/lib/python2.7/xml/sax/expatreader.py", line 207, in feed
    self._parser.Parse(data, isFinal)
  File "/usr/lib/python2.7/xml/sax/expatreader.py", line 304, in end_element
    self._cont_handler.endElement(name)
  File "/home/ubuntu/discogs-xml2db/discogsartistparser.py", line 117, in endElement
    self.exporter.storeArtist(self.artist)
  File "/home/ubuntu/discogs-xml2db/mongodbexporter.py", line 240, in storeArtist
    self.execute('artists', artist)
  File "/home/ubuntu/discogs-xml2db/mongodbexporter.py", line 206, in execute
    doc.updated_on = "%s" % date.today()
AttributeError: 'dict' object has no attribute 'updated_on'

artist and artist_joins table should be merged

(Apologies if I've misunderstood this, but this is how I'm seeing it.)

Shouldn't the artist join field be stored as a column in track_artists and release_artists, rather than in the separate tracks_artists_joins and release_artist_joins tables?

Certainly, looking at the xml returned by the webservice http://api.discogs.com/release/3, join data is stored with the artist itself. With the database tables, I'm not sure how I'm meant to retrieve the data: I can look up the artists for a track using track_artists, and then, if the track has more than one artist, I have to look up track_artists_joins by trackid and find the right row by comparing artist1 and artist2 with the rows from the first query. This isn't too bad when there are just two artists on a song, but if the song has four artists, how do I know whether artist1/artist2 goes with the 1st and 2nd, 2nd and 3rd, or 1st and 3rd, given that the artist order is not defined anywhere?

It would be better if track_artists had a position column (to signify the position of the artist on the track) and a join column (always empty for the last artist, or when the track has only one artist). The same logic applies to release artists.
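To illustrate the proposal (the column layout and data here are hypothetical, not the actual discogs-xml2db schema), a position plus a join string per artist row is enough to rebuild the full credit line without a separate joins table:

```python
# Rebuild a track's artist credit from (position, name, join_string) rows;
# the join string is empty for the last artist.
track_artists = [
    (1, "Artist A", "feat."),
    (2, "Artist B", "&"),
    (3, "Artist C", ""),
]

def credit_line(rows):
    parts = []
    for _, name, join in sorted(rows):  # sort by position column
        parts.append(name)
        if join:
            parts.append(join)
    return " ".join(parts)

print(credit_line(track_artists))  # Artist A feat. Artist B & Artist C
```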

RtD style documentation

The project should have better documentation and be on https://readthedocs.org.

  • Trim README.rst to only cover a brief description, features, installation, and quick way to run.
  • Documentation for each exporter
  • Documentation for tests

Unable to load master images into table because type null

in fix_db.sql

--Remove duplicate master rows
insert into masters_images
select distinct t1.image_uri, t1.type, t1.master_id
from tmp_masters_images t1
left join masters_images t2
on t1.image_uri = t2.image_uri
and t1.type = t2.type
and t1.master_id = t2.master_id
where t2.image_uri is null
;
fails with

ERROR: null value in column "type" violates not-null constraint
DETAIL: Failing row contains (http://api.discogs.com/image/R-1-1193812031.jpeg, null, 5427).

For Postgres Track table needs track number that represents the track order

The Track table has a position column, but for many releases this is not a simple ascending number but something much harder to parse (e.g. vinyl could be A1 A2 B1). What we really need is an additional number column that is set to 1 for the first track, 2 for the second track, and so on. I assume tracks are in the correct order in the original xml dump file, so this would be a relatively easy fix to postgresimporter.py (but I'm not a python programmer myself, so I'm not sure where to start).
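A minimal sketch of the requested fix, assuming tracks appear in release order in the dump file (the position labels below are illustrative):

```python
# Derive a sequential track number from the order in which tracks appear,
# regardless of how the position label is formatted (A1, B2, ...).
positions = ["A1", "A2", "B1", "B2"]  # positions as read from the xml dump

numbered = list(enumerate(positions, start=1))
print(numbered)  # [(1, 'A1'), (2, 'A2'), (3, 'B1'), (4, 'B2')]
```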

It takes more than 24 hours to export as a JSON file

I'm trying to import data into mongodb. As you mention in the README, direct import takes longer than the other option (XML to JSON, then mongoimport). However, even with the other option I couldn't easily import it. The releases.xml file is ~9GB and it took more than 24 hours to export the JSON file. After that process the releases.json file was corrupted, so I couldn't import it at all.

I tried the script on an Amazon EC2 micro instance and on another VPS with a dual-core processor. The results were the same: the files were ~8GB and corrupted. The other problem is that the script causes CPU overload, so it's not an option to leave discogsparser.py running for 24 hours. I couldn't figure out what the problem is. Is xml.sax slow, or do the systems I used not have enough resources?

Error when importing discogs data into PostgreSQL

The process crashed when it hit the "masters" section. Here's the output:

  File "discogsparser.py", line 241, in <module>
    main(sys.argv[1:])
  File "discogsparser.py", line 236, in main
    parseMasters(parser, exporter)
  File "discogsparser.py", line 171, in parseMasters
    parser.parse(master_file)
  File "/usr/local/lib/python2.7/xml/sax/expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/local/lib/python2.7/xml/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/usr/local/lib/python2.7/xml/sax/expatreader.py", line 210, in feed
    self._parser.Parse(data, isFinal)
  File "/usr/local/lib/python2.7/xml/sax/expatreader.py", line 307, in end_element
    self._cont_handler.endElement(name)
  File "/home/thephltr/webapps/who_pro/discogs_importer/discogsmasterparser.py", line 173, in endElement
    self.exporter.storeMaster(self.master)
  File "/home/thephltr/webapps/who_pro/discogs_importer/postgresexporter.py", line 323, in storeMaster
    (img.uri, img.image_type, master.id))
AttributeError: ImageInfo instance has no attribute 'image_type'

Release artists not matching artists if using artist name variation

Rows in the release_artist table do not match artists in the artist table when the release artist is an artist name variation. This query against the postgres db shows the problem (Various is filtered out because of a different problem, raised in a separate issue):

select distinct r1.release_id, r1.artist_name from
releases_artists r1
left join artist r2
on r1.artist_name= r2.name
where r2.name is null
and r1.artist_name!='Various'
order by r1.artist_name

An example here, this release:

http://api.discogs.com/release/2294510

has artist Jürgen Von Manger

but the actual artist name uses small 'v' in von
http://www.discogs.com/artist/566712
Jürgen von Manger

The way to fix this would be, when loading the database from the release dump file, to populate the release_artists table with artist_id instead of artist_name (both are included in the dump). This clearly requires database changes, which I can do, but I'm struggling to fix the python code.

I guess we have the same problem with track_artists as well
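A toy illustration of why matching by artist_id is more robust than matching by name (the data is taken from the example above; the dict stands in for the artist table):

```python
# Matching by name fails on name variations; matching by id does not.
artists = {566712: "Jürgen von Manger"}    # artist table: id -> canonical name
release_artist_name = "Jürgen Von Manger"  # name variation used on the release
release_artist_id = 566712                 # artist id, also present in the dump

by_name = release_artist_name in artists.values()  # False: "Von" vs "von"
by_id = release_artist_id in artists               # True: ids are stable
print(by_name, by_id)  # False True
```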

Package the project

Turn the existing structure into a package.

  • Option to run as a script (python -m)
  • Option to download the files
  • Option to run tests on the downloaded files.
  • Option to create the storage structure (DB tables)

ANV not assigned to the proper parent

The script currently overrides the release.anv property for each anv node it finds - last one wins.

So, first, the Extraartist should support anv (change the PGSQL scripts).

That would allow the anv property to be attached to the proper nodes: release.extraartists or track.extraartist.

MongoDb import failed

I want to do a direct (or indirect) import into MongoDb.
I have downloaded and extracted the Discogs releases file.

Executing this, I receive an error:

$ ./discogsparser.py -i -o mongo -p "mongodb://user:pass@localhost/discogs?uniq=md5" -d 20120901
  File "./discogsparser.py", line 74
    except ParserStopError as pse:
                            ^
SyntaxError: invalid syntax

load data into PostgreSQL

I was stuck at the step "python discogsparser.py". The following is the run information:

python discogsparser.py -o pgsql -p "dbname=discogs user=discogs password=discogs" -d 20140803
Namespace(data_quality=None, date='20140803', file=[], ignore_unknown_tags=False, n=None, output='pgsql', params='dbname=discogs user=discogs password=discogs')

python discogsparser.py -o pgsql -p "dbname=discogs user=discogs password=discogs" discogs_20140803_artists.xml
Namespace(data_quality=None, date=None, file=['discogs_20140803_artists.xml'], ignore_unknown_tags=False, n=None, output='pgsql', params='dbname=discogs user=discogs password=discogs')

It looks like it didn't execute the command. I don't know what's wrong with it. I am a python newbie.

Thank you,
Ying

Release.artist is missing from some datasets

When exporting the 20131201_releases.xml file I've found that some entries don't have the artist attribute.

I'm not sure if this issue is unique to this 20131201_releases.xml file, and therefore whether it should be considered a bug in the data or in the code.

Stacktrace

Namespace(data_quality=None, date='20131201', file=[], ignore_unknown_tags=True, n=None, output='mongo', params='file://output')
Traceback (most recent call last):
  File "discogsparser.py", line 223, in <module>
    main(sys.argv[1:])
  File "discogsparser.py", line 217, in main
    parseReleases(parser, exporter)
  File "discogsparser.py", line 128, in parseReleases
    parser.parse(release_file)
  File "/usr/lib/python2.7/xml/sax/expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/lib/python2.7/xml/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/usr/lib/python2.7/xml/sax/expatreader.py", line 207, in feed
    self._parser.Parse(data, isFinal)
  File "/usr/lib/python2.7/xml/sax/expatreader.py", line 304, in end_element
    self._cont_handler.endElement(name)
  File "/home/kenan/workspace/discogs-xml2db/discogsreleaseparser.py", line 309, in endElement
    self.exporter.storeRelease(self.release)
  File "/home/kenan/workspace/discogs-xml2db/mongodbexporter.py", line 243, in storeRelease
    if release.artist:
AttributeError: Release instance has no attribute 'artist'

error parsing masters file

Parsing the 20121201 masters file, I got the following stack trace:

Traceback (most recent call last):
  File "discogsparser.py", line 223, in <module>
    main(sys.argv[1:])
  File "discogsparser.py", line 218, in main
    parseMasters(parser, exporter)
  File "discogsparser.py", line 153, in parseMasters
    parser.parse(master_file)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/sax/expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/sax/expatreader.py", line 207, in feed
    self._parser.Parse(data, isFinal)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/sax/expatreader.py", line 304, in end_element
    self._cont_handler.endElement(name)
  File "/Users/burc/Documents/Development/discogs-xml2db/discogsmasterparser.py", line 170, in endElement
    self.exporter.storeMaster(self.master)
  File "/Users/burc/Documents/Development/discogs-xml2db/mongodbexporter.py", line 250, in storeMaster
    self.execute('masters', master)
  File "/Users/burc/Documents/Development/discogs-xml2db/mongodbexporter.py", line 203, in execute
    uniq, md5 = self._is_uniq(collection, what.id, json_string)
  File "/Users/burc/Documents/Development/discogs-xml2db/mongodbexporter.py", line 181, in _is_uniq
    return self._quick_uniq.is_uniq(collection, id, json_string)
  File "/Users/burc/Documents/Development/discogs-xml2db/mongodbexporter.py", line 110, in is_uniq
    self._load(collection)
  File "/Users/burc/Documents/Development/discogs-xml2db/mongodbexporter.py", line 88, in _load
    if self._hashes[name] is None:
KeyError: 'masters'

Without studying the code in detail, I changed line 84 of mongodbexporter.py to

self._hashes = {'artists': None, 'labels': None, 'releases': None, 'masters': None }

which fixed the issue.

Restoring from md5 not working as expected

Hello,

While playing with md5, the following was discovered:

  • upserting in mongo takes a very long time and never finishes (perhaps a mongo issue? an upsert per line is a heavy operation)
  • the md5 data count is too large (~25% diff over a 4-month range) - maybe this is expected

I've tested with the following data files:

discogs_20141001_artists.xml (main file)
discogs_20150301_artists.xml (md5 for upsert)
 ~]# du -h artists.*
1.2G     discogs_20141001_artists.json
315M discogs_20150301_artists.json
147M artists.md5
 ~]# wc -l *
   3523086 discogs_20141001_artists.json
    799269 discogs_20150301_artists.json
   3769950 artists.md5
~]# time mongoimport -d dataset discogs_20150301_artists.json

2015-06-17T15:38:46.106+0300    no collection specified
2015-06-17T15:38:46.106+0300    using filename 'artists' as collection
2015-06-17T15:38:46.108+0300    connected to: localhost
2015-06-17T15:38:49.107+0300    [#.......................] dataset.artists  69.1 MB/1.1 GB (6.0%)
2015-06-17T15:38:52.107+0300    [##......................] dataset.artists  136.4 MB/1.1 GB (11.9%)
2015-06-17T15:38:55.109+0300    [####....................] dataset.artists  197.2 MB/1.1 GB (17.3%)
2015-06-17T15:38:58.107+0300    [#####...................] dataset.artists  248.6 MB/1.1 GB (21.8%)
2015-06-17T15:39:01.107+0300    [######..................] dataset.artists  308.1 MB/1.1 GB (27.0%)
2015-06-17T15:39:04.108+0300    [#######.................] dataset.artists  359.5 MB/1.1 GB (31.5%)
2015-06-17T15:39:07.107+0300    [########................] dataset.artists  397.7 MB/1.1 GB (34.8%)
2015-06-17T15:39:10.107+0300    [#########...............] dataset.artists  439.6 MB/1.1 GB (38.5%)
2015-06-17T15:39:13.107+0300    [##########..............] dataset.artists  496.0 MB/1.1 GB (43.4%)
2015-06-17T15:39:16.107+0300    [###########.............] dataset.artists  546.8 MB/1.1 GB (47.8%)
2015-06-17T15:39:19.108+0300    [############............] dataset.artists  601.0 MB/1.1 GB (52.6%)
2015-06-17T15:39:22.107+0300    [#############...........] dataset.artists  638.6 MB/1.1 GB (55.9%)
2015-06-17T15:39:25.107+0300    [##############..........] dataset.artists  694.1 MB/1.1 GB (60.7%)
2015-06-17T15:39:28.107+0300    [###############.........] dataset.artists  741.0 MB/1.1 GB (64.8%)
2015-06-17T15:39:31.107+0300    [################........] dataset.artists  766.8 MB/1.1 GB (67.1%)
2015-06-17T15:39:34.107+0300    [#################.......] dataset.artists  815.8 MB/1.1 GB (71.4%)
2015-06-17T15:39:37.107+0300    [##################......] dataset.artists  873.0 MB/1.1 GB (76.4%)
2015-06-17T15:39:40.107+0300    [###################.....] dataset.artists  920.1 MB/1.1 GB (80.5%)
2015-06-17T15:39:43.107+0300    [####################....] dataset.artists  971.5 MB/1.1 GB (85.0%)
2015-06-17T15:39:46.107+0300    [#####################...] dataset.artists  1.0 GB/1.1 GB (89.8%)
2015-06-17T15:39:49.108+0300    [######################..] dataset.artists  1.1 GB/1.1 GB (94.7%)
2015-06-17T15:39:52.107+0300    [#######################.] dataset.artists  1.1 GB/1.1 GB (98.5%)
2015-06-17T15:39:53.895+0300    imported 3523086 documents

real    1m7.806s
~]# time mongoimport --upsert --upsertFields 'id' -d dataset discogs_20150301_artists.json

2015-06-17T15:40:35.177+0300    no collection specified
2015-06-17T15:40:35.177+0300    using filename 'artists' as collection
2015-06-17T15:40:35.179+0300    connected to: localhost
2015-06-17T15:40:38.178+0300    [........................] dataset.artists  7.6 MB/314.7 MB (2.4%)
2015-06-17T15:40:41.178+0300    [........................] dataset.artists  7.6 MB/314.7 MB (2.4%)
2015-06-17T15:40:44.178+0300    [........................] dataset.artists  7.6 MB/314.7 MB (2.4%)
2015-06-17T15:40:47.178+0300    [........................] dataset.artists  7.6 MB/314.7 MB (2.4%)
2015-06-17T15:40:50.178+0300    [........................] dataset.artists  7.6 MB/314.7 MB (2.4%)
2015-06-17T15:40:53.178+0300    [........................] dataset.artists  7.6 MB/314.7 MB (2.4%)
2015-06-17T15:40:56.178+0300    [........................] dataset.artists  7.6 MB/314.7 MB (2.4%)
2015-06-17T15:40:59.178+0300    [........................] dataset.artists  7.6 MB/314.7 MB (2.4%)
2015-06-17T15:41:02.178+0300    [........................] dataset.artists  7.6 MB/314.7 MB (2.4%)
2015-06-17T15:41:05.178+0300    [........................] dataset.artists  7.6 MB/314.7 MB (2.4%)
2015-06-17T15:41:08.178+0300    [........................] dataset.artists  7.6 MB/314.7 MB (2.4%)
2015-06-17T15:41:11.178+0300    [........................] dataset.artists  7.6 MB/314.7 MB (2.4%)
2015-06-17T15:41:44.178+0300    [#.......................] dataset.artists  14.0 MB/314.7 MB (4.4%)
2015-06-17T15:45:14.178+0300    [#.......................] dataset.artists  20.1 MB/314.7 MB (6.4%)
[… same progress line repeats every 3 seconds, stalling for minutes at each step …]
^C
real    9m31.152s

New id field for artists

The November 11 2011 dump of artists contains a new numeric id field, e.g. <artist>...<id>123</id>...</artist>.

If this becomes a permanent fixture, consider adding an artist_id to the PGSQL artist table.

updated_on field for when a record was imported or updated

To be able to compute which records MongoDB needs to re-index, I need an updated_on field that should reflect the date (or the dump file) the record originated from, either as a new import or as an update to a previous import.

Need to make sure it doesn't interfere with MD5 calculations.
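One way to keep the MD5 stable is to hash only the content fields and skip bookkeeping columns. This is just a sketch under two assumptions that are not from the codebase: that records are available as plain dicts, and that the hash currently covers all fields.

```python
import hashlib
import json

def record_md5(record, volatile=("updated_on",)):
    """Hash only the stable fields so bookkeeping columns don't change the MD5."""
    stable = {k: v for k, v in record.items() if k not in volatile}
    payload = json.dumps(stable, sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest()

a = record_md5({"id": 1, "name": "Some Artist", "updated_on": "2011-11-01"})
b = record_md5({"id": 1, "name": "Some Artist", "updated_on": "2012-05-01"})
assert a == b  # updated_on does not affect the hash
```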

Error when trying to import discogs master file

Got this output, and the import stopped, after trying to import the 20141001 release data.

Traceback (most recent call last):
  File "discogsparser.py", line 241, in <module>
    main(sys.argv[1:])
  File "discogsparser.py", line 236, in main
    parseMasters(parser, exporter)
  File "discogsparser.py", line 171, in parseMasters
    parser.parse(master_file)
  File "/usr/lib/python2.7/xml/sax/expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/lib/python2.7/xml/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/usr/lib/python2.7/xml/sax/expatreader.py", line 210, in feed
    self._parser.Parse(data, isFinal)
  File "/usr/lib/python2.7/xml/sax/expatreader.py", line 307, in end_element
    self._cont_handler.endElement(name)
  File "/home//Discogs_Importer/discogs-xml2db-master/discogsmasterparser.py", line 173, in endElement
    self.exporter.storeMaster(self.master)
  File "/home//Discogs_Importer/discogs-xml2db-master/postgresexporter.py", line 323, in storeMaster
    (img.uri, img.image_type, master.id))
AttributeError: ImageInfo instance has no attribute 'image_type'
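A defensive workaround, sketched here with a hypothetical helper rather than the real postgresexporter code, is to default the missing attribute instead of letting it raise:

```python
class ImageInfo:
    def __init__(self, uri):
        self.uri = uri  # note: no image_type attribute, as in the crash above

def image_rows(images, master_id):
    """Build parameter tuples, defaulting image_type to None when absent."""
    return [(getattr(img, "image_type", None), img.uri, master_id)
            for img in images]

rows = image_rows([ImageInfo("http://api.discogs.com/image/R-1.jpeg")], 5427)
assert rows == [(None, "http://api.discogs.com/image/R-1.jpeg", 5427)]
```

Whether None is an acceptable value depends on the table's NOT NULL constraints, so the real fix may instead be to track down why the parser leaves the attribute unset.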

track extraartists and release extraartists should normalize roles

Currently, if the xml feed contains a release extra artist with multiple roles, they are added as one row in the release_extraartist table with an array of roles. Fair enough; however, the same artist is often listed as a separate artist for one release (e.g. http://api.discogs.com/release/2), so we end up with multiple rows in the table for the same artist and release anyway, negating the advantage of putting the roles in an array.

(I have changed my code so that the release_extraartist table defines role as a simple text field and adds a new row for each release/artist/role combination; same logic for track_extraartists.)
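The normalization described above might look like this sketch (artist and role values are illustrative, and the input shape is assumed, not taken from the parser):

```python
def normalized_roles(extraartists):
    """Flatten (artist, [roles]) pairs into one (artist, role) row per role."""
    rows = []
    for artist, roles in extraartists:
        for role in roles:
            rows.append((artist, role))
    return sorted(set(rows))  # dedupe repeated artist/role combinations

rows = normalized_roles([
    ("Artist A", ["Written-By", "Producer"]),
    ("Artist A", ["Producer"]),  # same artist listed again on the release
])
assert rows == [("Artist A", "Producer"), ("Artist A", "Written-By")]
```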

Filenames of sql scripts are unwieldy

No need to have 'discogs' in the name; everything here is to do with discogs.
No need for 'pgsql' in the name; the files already have a .sql suffix.
Longer names mean more typing.

Renamed as follows:
discogs-indexes-pgsql.sql -> create_indexes.sql
discogs-pgsql.sql -> create_tables.sql
discogs-fixdb-pgsql.sql -> fix_db.sql

Fix_db.sql only working for releases because of db constraints

Output as follows

INSERT 0 1
INSERT 0 9522448
ERROR: insert or update on table "artists_images" violates foreign key constraint "artists_images_image_uri_fkey"
DETAIL: Key (image_uri)=(http://api.discogs.com/image/A-1-1138987958.jpeg) is not present in table "image".
ERROR: insert or update on table "labels_images" violates foreign key constraint "labels_images_image_uri_fkey"
DETAIL: Key (image_uri)=(http://api.discogs.com/image/L-58127-1255729347.jpeg) is not present in table "image".
ERROR: null value in column "type" violates not-null constraint
DETAIL: Failing row contains (http://api.discogs.com/image/R-1-1193812031.jpeg, null, 5427).
INSERT 0 9520322

discogsparser.py doesn't run

I created a database in postgresql and imported database schema. And I tried to run discogsparser.py with following command:

python discogsparser.py -o pgsql -p "host=localhost dbname=discogs user=[user] password=[pass]" discogs_20120501_releases.xml.gz

Also I tried the json format but the result doesn't change. I'm getting something like this:

Namespace(data_quality=None, date=None, file=['discogs_20120501_releases.xml.gz'], ignore_unknown_tags=False, n=None, output='pgsql', params='host=localhost dbname=discogs user=[user] password=[pass]')

and discogsparser.py stops working without any exception.

Feb 2015 Discogs Xml Dump now contains members id

If a group has members, the members section now contains artist ids, not just their names; this breaks the postgres artist parsing, causing it to consider each member an additional artist in their own right.

January 2015 Dump:

grep "22387" discogs_20150101_artists.xml
22387DisintegratorOliver Chesler & John SelwayThe collaboration of New York City based DJ/Producers Oliver Chesler and John Selway.<data_quality>Correct</data_quality>DesintegratorDisintergratorKoenig CylindersMachines (8)Carlos VasquezJohn SelwayOliver Chesler

February Dump:
grep "22387" discogs_20150201_artists.xml
22387DisintegratorOliver Chesler & John SelwayThe collaboration of New York City based DJ/Producers Oliver Chesler and John Selway.<data_quality>Correct</data_quality>DesintegratorDisintergratorKoenig CylindersMachines (8)241853Carlos Vasquez17John Selway4563Oliver Chesler
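A sketch of parsing the new interleaved id/name pairs. This assumes the February layout is `<members><id>…</id><name>…</name>…</members>`; the exact schema should be confirmed against the dump, and the tag-stripped grep output above makes it hard to be certain.

```python
import xml.etree.ElementTree as ET

# Illustrative fixture, not an actual dump excerpt
xml = """<artist><members>
  <id>241853</id><name>Carlos Vasquez</name>
  <id>17</id><name>John Selway</name>
</members></artist>"""

def parse_members(artist_xml):
    """Pair each <id> with the <name> that follows it."""
    members, current_id = [], None
    for el in ET.fromstring(artist_xml).find("members"):
        if el.tag == "id":
            current_id = int(el.text)
        elif el.tag == "name":
            members.append((current_id, el.text))
    return members

assert parse_members(xml) == [(241853, "Carlos Vasquez"), (17, "John Selway")]
```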

Put discogs table into own schema rather than public

My specific requirement for this is so I can load the discogs and musicbrainz tables into the same database and then do queries involving tables from both datasets. But I think this is a good general improvement anyway.

Speed up processing

Hi guys,

What's the best ideas to speed up processing with discogs-xml2db?

Is it possible to run it in parallel mode?
Right now it occupies only 1 CPU core, and processing the latest 20G xml takes ~3 hours (for a mongo file-dump).

$ time discogsparser.py -i -o mongo -p "file:///HDD1/2015-06" /HDD2/2015-06/discogs_20150601_releases.xml

real    191m27.478s
user    189m22.749s
sys     2m0.714s


I am playing with http://www.gnu.org/software/parallel/ now, but maybe there are some other options.

Also I've tried omitting the md5 hashes to speed things up a bit, but overall processing time didn't change much.
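For what it's worth, once the XML has been split into independent record strings, the per-record work can be fanned out across worker processes (threads won't help here because of the GIL). This is a rough sketch, not discogs-xml2db code, and it skips the hard part: splitting the XML stream itself.

```python
from multiprocessing import Pool

def parse_chunk(records):
    # stand-in for the real per-record parse/export work
    return [len(r) for r in records]

def chunked(seq, n):
    """Deal records round-robin into n roughly equal chunks."""
    return [seq[i::n] for i in range(n)]

if __name__ == "__main__":
    records = ["<artist/>"] * 1000  # pretend these were pre-split from the dump
    with Pool(4) as pool:
        results = pool.map(parse_chunk, chunked(records, 4))
    assert sum(len(r) for r in results) == 1000
```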

Any other suggestions are more than welcome!

Thanks!

Handle ID tag in labels XML

The labels XML seems to have gotten an ID tag. Maybe it's time for a refresh of the script to take into account the latest versions of each XML.

Updating a PostgreSQL database

Is there a mechanism in place for updating a PostgreSQL database from the latest XML dumps? I couldn't find anything other than what the README mentioned about MongoDB.

Automated testing

Should have a way to test that new discogs XML releases do not break the parsing.
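A minimal version of such a regression check could be a tiny XML fixture committed to the repo and parsed on every run, e.g. with pytest. The fixture content and helper below are hypothetical, not existing test code:

```python
import xml.etree.ElementTree as ET

# Tiny hand-written fixture standing in for a real dump excerpt
SAMPLE = "<artists><artist><id>1</id><name>Some Artist</name></artist></artists>"

def test_artist_fixture_parses():
    root = ET.fromstring(SAMPLE)
    artists = root.findall("artist")
    assert len(artists) == 1
    assert artists[0].findtext("id") == "1"
    assert artists[0].findtext("name") == "Some Artist"

test_artist_fixture_parses()
```

Refreshing the fixture from each new monthly dump would catch schema changes like the members-id one above.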

Use Postgres Copy to get the data into Postgres

Import of the release data is slow, not least because it processes each record one by one, sequentially adding them into the database. So I think the bottleneck is in the code; the database could cope with multiple statements being fired at the same time.

I think the code could be sped up by parallelizing the import of the data; I wonder if just manually splitting the file into three chunks and running the import on the three files in parallel would work.
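On the COPY idea in the title: one common pattern is to buffer rows as tab-separated text in memory and stream them in one COPY per batch. A sketch; the table and column names are illustrative, and the psycopg2 call is shown only as a comment since it needs a live connection:

```python
import csv
import io

def copy_buffer(rows):
    """Build an in-memory TSV buffer suitable for COPY ... FROM STDIN."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    buf.seek(0)
    return buf

buf = copy_buffer([(1, "Warp Records", "WAP 39 CDR")])
# with psycopg2: cur.copy_from(buf, "releases_labels",
#                              columns=("release_id", "label", "catno"))
assert buf.read() == "1\tWarp Records\tWAP 39 CDR\n"
```

COPY amortizes the per-statement round trip, which is usually a bigger win than parallelism alone.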

md5 Can't be Created

Hello,

Looks like the md5 files, as described in the README, can't be created.

$ discogsparser.py -i -o mongo -p "file:///tmp/discogs/?uniq=md5" -d 20111101 
# this results in 2 files created for each class, e.g. an artists.json file and an artists.md5 file

I'm using Python 2.7.5

Duplicate records in releases_labels, prevent us adding primary key

Duplicate records in releases_labels prevent us from adding a primary key, slow access in queries, and are bad db design.

jthinksearch=> \d releases_labels;
Unlogged table "discogs.releases_labels"
   Column   |  Type   | Modifiers
------------+---------+-----------
 label      | text    |
 release_id | integer |
 catno      | text    |
Indexes:
    "releases_labels_catno_idx" btree (catno)
    "releases_labels_name_idx" btree (label)
Foreign-key constraints:
    "foreign_did" FOREIGN KEY (release_id) REFERENCES release(id)

jthinksearch=> select * from releases_labels  where release_id=6155;
    label     | release_id |   catno
--------------+------------+------------
 Warp Records |       6155 | WAP 39 CDR
 Warp Records |       6155 | WAP 39 CDR
(2 rows)
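One option is to dedupe during the export, before the rows ever reach Postgres. A sketch, assuming rows arrive as (label, release_id, catno) tuples; it is not how discogs-xml2db currently buffers rows:

```python
def dedupe(rows):
    """Drop exact duplicate rows, keeping the first occurrence."""
    seen, out = set(), []
    for row in rows:
        if row not in seen:
            seen.add(row)
            out.append(row)
    return out

rows = [("Warp Records", 6155, "WAP 39 CDR"),
        ("Warp Records", 6155, "WAP 39 CDR")]
assert dedupe(rows) == [("Warp Records", 6155, "WAP 39 CDR")]
```

The alternative is a post-import SQL cleanup before adding the primary key.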

releases_extraartists_name field too long for standard index

Running create_indexes gives

ERROR: index row size 2888 exceeds maximum 2712 for index "releases_extraartists_name_idx"
HINT: Values larger than 1/3 of a buffer page cannot be indexed.
Consider a function index of an MD5 hash of the value, or use full text indexing.
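Following the HINT, indexing an MD5 of the value gives a fixed-length key no matter how long the name is (in Postgres that would be an expression index on `md5(name)`). A small illustration of why the digest side-steps the row-size limit:

```python
import hashlib

def index_key(value):
    """Fixed-length surrogate key for values too long to index directly."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()

key = index_key("x" * 5000)  # a name far larger than a third of a buffer page
assert len(key) == 32  # always 32 hex chars, regardless of input size
```

Lookups then compare `md5(name) = md5($1)` rather than the raw column.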

invalid token of release xml file when loading into PostgreSQL

I am loading the xml files into PostgreSQL. I downloaded the 20140803 and 20140701 versions. With both of them I met errors when loading the release xml file.

For 20140803 version:
xml.sax._exceptions.SAXParseException: discogs_20140803_releases.xml:53155:784: not well-formed (invalid token)

For 20140701 version:
xml.sax._exceptions.SAXParseException: discogs_20140701_releases.xml:3763466:690: not well-formed (invalid token)

Which version of discogs did you load into PostgreSQL successfully?

Thank you,
Ying
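The usual culprit behind "not well-formed (invalid token)" is control characters that are illegal in XML 1.0. A sketch of a pre-filter (not part of discogs-xml2db) that strips them before the file reaches the parser:

```python
import re
import xml.etree.ElementTree as ET

# Characters outside the XML 1.0 valid range trigger "invalid token" in expat
INVALID_XML = re.compile("[^\u0009\u000a\u000d\u0020-\ud7ff\ue000-\ufffd]")

def sanitize(text):
    return INVALID_XML.sub("", text)

bad = "<release><title>abc\x08def</title></release>"  # \x08 is illegal in XML 1.0
try:
    ET.fromstring(bad)
    parsed_raw = True
except ET.ParseError:
    parsed_raw = False
assert not parsed_raw  # the raw text is rejected
assert ET.fromstring(sanitize(bad)).findtext("title") == "abcdef"
```

Running the dump through such a filter (streaming, for files this size) usually lets the import finish.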

There is no 'Various' artist table in Artist table breaking relationships

'Various' is referred to as the artist with id 194 on the website, i.e. clicking a 'Various' link in the database will take you to http://www.discogs.com/artist/194. However, that artist id doesn't actually exist in the data dumps, and hence not in the database once imported.

This is problematic because it means relational links such as

SELECT r.release_id,
       a.id AS artistId,
       a.name
FROM releases_artists AS r
INNER JOIN artist a ON r.artist_name = a.name

will not return results for various artists. After the data has been imported we should create a row for 'Various' so that queries on the database don't have to special-case Various artists.
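The proposed fix-up is a single sentinel insert after import. The sketch below uses SQLite purely as a stand-in for Postgres (schema trimmed to the two columns involved), to show the join working once the row exists:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE artist (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("CREATE TABLE releases_artists (release_id INTEGER, artist_name TEXT)")
db.execute("INSERT INTO releases_artists VALUES (2, 'Various')")

# Post-import fix-up: add the missing sentinel row so joins stop dropping rows
db.execute("INSERT INTO artist VALUES (194, 'Various')")

row = db.execute("""SELECT r.release_id, a.id, a.name
                    FROM releases_artists r
                    JOIN artist a ON r.artist_name = a.name""").fetchone()
assert row == (2, 194, "Various")
```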
