Git Product home page Git Product logo

Comments (4)

xavery avatar xavery commented on July 17, 2024 1

The results are similar when running the process with the October 2020 dumps. No errors are reported while importing with the database created by CreateTables.sql with changes from my PR, though the following errors remain :

$ python3 psql.py < sql/CreateFKConstraints.sql 
ERROR:  insert or update on table "release_artist" violates foreign key constraint "release_artist_fk_artist"
DETAIL:  Key (artist_id)=(0) is not present in table "artist".
ERROR:  insert or update on table "release_track_artist" violates foreign key constraint "release_track_artist_fk_artist"
DETAIL:  Key (artist_id)=(118760) is not present in table "artist".
ERROR:  insert or update on table "release_company" violates foreign key constraint "release_company_fk_company"
DETAIL:  Key (company_id)=(1610443) is not present in table "label".
$ python3 psql.py < sql/CreateIndexes.sql 
ERROR:  index row size 5376 exceeds btree version 4 maximum 2704 for index "release_track_idx_title"
DETAIL:  Index row references tuple (1075344,100) in relation "release_track".
HINT:  Values larger than 1/3 of a buffer page cannot be indexed.
Consider a function index of an MD5 hash of the value, or use full text indexing.

A zero artist_id is used when the artist name appears in what Discogs calls an "non-linked credit", i.e. a credit whose specifying does not generate an artist name. It is used for some credits which are not necessarily related to music, and can be seen in this release, where it is used as "Other [Spirits Lifted By]". Having a null here instead of zero makes sense, as there is no row in the artist table that would provide additional information.

Trying to visit artist 118760 page on Discogs directly results in "This artist is used as a placeholder entry and does not link to any artist.". The artist name for that ID is "No Artist", so having null in this place actually makes sense.

I have no idea what to do about that bogus company_id, though. The release that references it is here, the company is supposed to be called "Terry's 83rd Street Music" and appears properly linked in a "Published By" credit. However, clicking that credit leads to a label whose ID is 1502386. Trying to visit label 1610443 directly leads to a 404, but it can still be accessed via the API.

label_id in release_label is still null. However, it seems like the label_id column isn't generated at all in release_label.csv, which is probably a bug in the C# converter, and not the scripts themselves :

release_id,label_name,catno
1,Svek,SK032

Since the import script takes the list of columns from the CSV file header, the list of columns is set to (release_id,label_name,catno) and label_id remains null.

from discogs-xml2db.

philipmat avatar philipmat commented on July 17, 2024

Could you try with the October file and see if you have the same issues, please?

If it is not too much, could you also test the October files with the changes in your PR?

from discogs-xml2db.

philipmat avatar philipmat commented on July 17, 2024

Superb analysis, thank you.

So what do you think we should do?

from discogs-xml2db.

xavery avatar xavery commented on July 17, 2024

Something like this for starters - this includes a fix for missing values in the label_id column of the release_label table :

diff --git a/alternatives/dotnet/discogs/DiscogsRelease.cs b/alternatives/dotnet/discogs/DiscogsRelease.cs
index d554759..75d0b54 100644
--- a/alternatives/dotnet/discogs/DiscogsRelease.cs
+++ b/alternatives/dotnet/discogs/DiscogsRelease.cs
@@ -10,7 +10,7 @@ namespace discogs.Releases
         {
             { "release", "id title released country notes data_quality master_id status".Split(" ") },
             { "release_genre", "release_id genre".Split(" ") },
-            { "release_label", "release_id label_name catno".Split(" ") },
+            { "release_label", "release_id label_id label_name catno".Split(" ") },
             { "release_style", "release_id style".Split(" ") },
             { "release_image", "release_id type width height".Split(" ") },
             { "release_format", "release_id name qty text_string descriptions".Split(" ") },
@@ -22,6 +22,16 @@ namespace discogs.Releases
             { "release_track_artist", "release_id track_sequence track_id artist_id artist_name extra anv position join_string role tracks".Split(" ") },
         };
 
+        private static string SanitizeArtistId(string artistId)
+        {
+            if (artistId == "0" /* non-linked credits in credit lists have this set to zero */
+                || artistId == "118760" /* No Artist */)
+            {
+                return "";
+            }
+            return artistId;
+        }
+
         [XmlAttribute]
         public string id { get; set; }
 
@@ -82,7 +92,7 @@ namespace discogs.Releases
                 foreach (var l in labels)
                 {
                     if (l == null) continue;
-                    yield return ("release_label", new[] { id, l.name, l.catno });
+                    yield return ("release_label", new[] { id, l.id, l.name, l.catno });
                 }
             }
             if (styles?.Length > 0)
@@ -138,7 +148,7 @@ namespace discogs.Releases
                 foreach (var a in artists)
                 {
                     if (a == null) continue;
-                    yield return ("release_artist", new[] { id, a.id, a.name, "0", a.anv, (position++).ToString(), a.join, a.role, a.tracks });
+                    yield return ("release_artist", new[] { id, SanitizeArtistId(a.id), a.name, "0", a.anv, (position++).ToString(), a.join, a.role, a.tracks });
                 }
             }
             if (extraartists?.Length > 0)
@@ -147,7 +157,7 @@ namespace discogs.Releases
                 foreach (var a in extraartists)
                 {
                     if (a == null) continue;
-                    yield return ("release_artist", new[] { id, a.id, a.name, "1", a.anv, (position++).ToString(), a.join, a.role, a.tracks });
+                    yield return ("release_artist", new[] { id, SanitizeArtistId(a.id), a.name, "1", a.anv, (position++).ToString(), a.join, a.role, a.tracks });
                 }
             }
             int seq = 0;
@@ -159,13 +169,13 @@ namespace discogs.Releases
                 foreach (var a in (t.artists ?? System.Array.Empty<artist>())) {
                     if (a == null) continue;
                     artistSeq += 1;
-                    yield return ("release_track_artist", new[] { id, t.position, t.track_id, a.id, a.name, "0", a.anv, artistSeq.ToString(), a.join, a.role, a.tracks });
+                    yield return ("release_track_artist", new[] { id, t.position, t.track_id, SanitizeArtistId(a.id), a.name, "0", a.anv, artistSeq.ToString(), a.join, a.role, a.tracks });
                 }
                 artistSeq = 0;
                 foreach (var a in (t.extraartists ?? System.Array.Empty<artist>())) {
                     if (a == null) continue;
                     artistSeq += 1;
-                    yield return ("release_track_artist", new[] { id, t.position, t.track_id, a.id, a.name, "1", a.anv, artistSeq.ToString(), a.join, a.role, a.tracks });
+                    yield return ("release_track_artist", new[] { id, t.position, t.track_id, SanitizeArtistId(a.id), a.name, "1", a.anv, artistSeq.ToString(), a.join, a.role, a.tracks });
                 }
             }
         }

C# is not my native language, so I'm sure this can be expressed in a better way, but the intent should be clear. The real question is, how many "magic" artist IDs there are : I'm not aware of any list of "artists who have their IDs but aren't artists who have a page and thus an entry in artists.xml" or anything like that, which would make the job easier.
The only thing that comes to mind is resorting to correcting entries at import time, i.e. "while importing releases, look if the artist with the given ID exists, and if not, replace artist_id with null", but that just sounds plain wrong.

The project I've used a few times before actually commented out creating FK constraints in its database initialisation script, possibly because they just weren't worth the trouble. I'm not experienced enough with DBs to say for sure what kind of benefits they bring to the engine, as opposed to the human who can figure out the relations more easily. Perhaps this is really the best way to go.

from discogs-xml2db.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.