Hi there. I wanted to use the importer with this month's dump, and ran into some probl

Issues while importing the November 2020 dump to PostgreSQL about discogs-xml2db HOT 4 OPEN

xavery commented on July 17, 2024

Issues while importing the November 2020 dump to PostgreSQL

from discogs-xml2db.

Comments (4)

xavery commented on July 17, 2024 1

The results are similar when running the process with the October 2020 dumps. No errors are reported while importing with the database created by CreateTables.sql with changes from my PR, though the following errors remain :

$ python3 psql.py < sql/CreateFKConstraints.sql 
ERROR:  insert or update on table "release_artist" violates foreign key constraint "release_artist_fk_artist"
DETAIL:  Key (artist_id)=(0) is not present in table "artist".
ERROR:  insert or update on table "release_track_artist" violates foreign key constraint "release_track_artist_fk_artist"
DETAIL:  Key (artist_id)=(118760) is not present in table "artist".
ERROR:  insert or update on table "release_company" violates foreign key constraint "release_company_fk_company"
DETAIL:  Key (company_id)=(1610443) is not present in table "label".
$ python3 psql.py < sql/CreateIndexes.sql 
ERROR:  index row size 5376 exceeds btree version 4 maximum 2704 for index "release_track_idx_title"
DETAIL:  Index row references tuple (1075344,100) in relation "release_track".
HINT:  Values larger than 1/3 of a buffer page cannot be indexed.
Consider a function index of an MD5 hash of the value, or use full text indexing.

A zero artist_id is used when the artist name appears in what Discogs calls an "non-linked credit", i.e. a credit whose specifying does not generate an artist name. It is used for some credits which are not necessarily related to music, and can be seen in this release, where it is used as "Other [Spirits Lifted By]". Having a null here instead of zero makes sense, as there is no row in the artist table that would provide additional information.

Trying to visit artist 118760 page on Discogs directly results in "This artist is used as a placeholder entry and does not link to any artist.". The artist name for that ID is "No Artist", so having null in this place actually makes sense.

I have no idea what to do about that bogus company_id, though. The release that references it is here, the company is supposed to be called "Terry's 83rd Street Music" and appears properly linked in a "Published By" credit. However, clicking that credit leads to a label whose ID is 1502386. Trying to visit label 1610443 directly leads to a 404, but it can still be accessed via the API.

label_id in release_label is still null. However, it seems like the label_id column isn't generated at all in release_label.csv, which is probably a bug in the C# converter, and not the scripts themselves :

release_id,label_name,catno
1,Svek,SK032

Since the import script takes the list of columns from the CSV file header, the list of columns is set to (release_id,label_name,catno) and label_id remains null.

from discogs-xml2db.

philipmat commented on July 17, 2024

Could you try with the October file and see if you have the same issues, please?

If it is not too much, could you also test the October files with the changes in your PR?

from discogs-xml2db.

philipmat commented on July 17, 2024

Superb analysis, thank you.

So what do you think we should do?

from discogs-xml2db.

xavery commented on July 17, 2024

Something like this for starters - this includes a fix for missing values in the label_id column of the release_label table :

diff --git a/alternatives/dotnet/discogs/DiscogsRelease.cs b/alternatives/dotnet/discogs/DiscogsRelease.cs
index d554759..75d0b54 100644
--- a/alternatives/dotnet/discogs/DiscogsRelease.cs
+++ b/alternatives/dotnet/discogs/DiscogsRelease.cs
@@ -10,7 +10,7 @@ namespace discogs.Releases
         {
             { "release", "id title released country notes data_quality master_id status".Split(" ") },
             { "release_genre", "release_id genre".Split(" ") },
-            { "release_label", "release_id label_name catno".Split(" ") },
+            { "release_label", "release_id label_id label_name catno".Split(" ") },
             { "release_style", "release_id style".Split(" ") },
             { "release_image", "release_id type width height".Split(" ") },
             { "release_format", "release_id name qty text_string descriptions".Split(" ") },
@@ -22,6 +22,16 @@ namespace discogs.Releases
             { "release_track_artist", "release_id track_sequence track_id artist_id artist_name extra anv position join_string role tracks".Split(" ") },
         };
 
+        private static string SanitizeArtistId(string artistId)
+        {
+            if (artistId == "0" /* non-linked credits in credit lists have this set to zero */
+                || artistId == "118760" /* No Artist */)
+            {
+                return "";
+            }
+            return artistId;
+        }
+
         [XmlAttribute]
         public string id { get; set; }
 
@@ -82,7 +92,7 @@ namespace discogs.Releases
                 foreach (var l in labels)
                 {
                     if (l == null) continue;
-                    yield return ("release_label", new[] { id, l.name, l.catno });
+                    yield return ("release_label", new[] { id, l.id, l.name, l.catno });
                 }
             }
             if (styles?.Length > 0)
@@ -138,7 +148,7 @@ namespace discogs.Releases
                 foreach (var a in artists)
                 {
                     if (a == null) continue;
-                    yield return ("release_artist", new[] { id, a.id, a.name, "0", a.anv, (position++).ToString(), a.join, a.role, a.tracks });
+                    yield return ("release_artist", new[] { id, SanitizeArtistId(a.id), a.name, "0", a.anv, (position++).ToString(), a.join, a.role, a.tracks });
                 }
             }
             if (extraartists?.Length > 0)
@@ -147,7 +157,7 @@ namespace discogs.Releases
                 foreach (var a in extraartists)
                 {
                     if (a == null) continue;
-                    yield return ("release_artist", new[] { id, a.id, a.name, "1", a.anv, (position++).ToString(), a.join, a.role, a.tracks });
+                    yield return ("release_artist", new[] { id, SanitizeArtistId(a.id), a.name, "1", a.anv, (position++).ToString(), a.join, a.role, a.tracks });
                 }
             }
             int seq = 0;
@@ -159,13 +169,13 @@ namespace discogs.Releases
                 foreach (var a in (t.artists ?? System.Array.Empty<artist>())) {
                     if (a == null) continue;
                     artistSeq += 1;
-                    yield return ("release_track_artist", new[] { id, t.position, t.track_id, a.id, a.name, "0", a.anv, artistSeq.ToString(), a.join, a.role, a.tracks });
+                    yield return ("release_track_artist", new[] { id, t.position, t.track_id, SanitizeArtistId(a.id), a.name, "0", a.anv, artistSeq.ToString(), a.join, a.role, a.tracks });
                 }
                 artistSeq = 0;
                 foreach (var a in (t.extraartists ?? System.Array.Empty<artist>())) {
                     if (a == null) continue;
                     artistSeq += 1;
-                    yield return ("release_track_artist", new[] { id, t.position, t.track_id, a.id, a.name, "1", a.anv, artistSeq.ToString(), a.join, a.role, a.tracks });
+                    yield return ("release_track_artist", new[] { id, t.position, t.track_id, SanitizeArtistId(a.id), a.name, "1", a.anv, artistSeq.ToString(), a.join, a.role, a.tracks });
                 }
             }
         }

C# is not my native language, so I'm sure this can be expressed in a better way, but the intent should be clear. The real question is, how many "magic" artist IDs there are : I'm not aware of any list of "artists who have their IDs but aren't artists who have a page and thus an entry in artists.xml" or anything like that, which would make the job easier.
The only thing that comes to mind is resorting to correcting entries at import time, i.e. "while importing releases, look if the artist with the given ID exists, and if not, replace artist_id with null", but that just sounds plain wrong.

The project I've used a few times before actually commented out creating FK constraints in its database initialisation script, possibly because they just weren't worth the trouble. I'm not experienced enough with DBs to say for sure what kind of benefits they bring to the engine, as opposed to the human who can figure out the relations more easily. Perhaps this is really the best way to go.

from discogs-xml2db.

Issues while importing the November 2020 dump to PostgreSQL about discogs-xml2db HOT 4 OPEN

Comments (4)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent