Comments (4)
The results are similar when running the process with the October 2020 dumps. No errors are reported while importing with the database created by CreateTables.sql
with changes from my PR, though the following errors remain :
$ python3 psql.py < sql/CreateFKConstraints.sql
ERROR: insert or update on table "release_artist" violates foreign key constraint "release_artist_fk_artist"
DETAIL: Key (artist_id)=(0) is not present in table "artist".
ERROR: insert or update on table "release_track_artist" violates foreign key constraint "release_track_artist_fk_artist"
DETAIL: Key (artist_id)=(118760) is not present in table "artist".
ERROR: insert or update on table "release_company" violates foreign key constraint "release_company_fk_company"
DETAIL: Key (company_id)=(1610443) is not present in table "label".
$ python3 psql.py < sql/CreateIndexes.sql
ERROR: index row size 5376 exceeds btree version 4 maximum 2704 for index "release_track_idx_title"
DETAIL: Index row references tuple (1075344,100) in relation "release_track".
HINT: Values larger than 1/3 of a buffer page cannot be indexed.
Consider a function index of an MD5 hash of the value, or use full text indexing.
A zero artist_id
is used when the artist name appears in what Discogs calls an "non-linked credit", i.e. a credit whose specifying does not generate an artist name. It is used for some credits which are not necessarily related to music, and can be seen in this release, where it is used as "Other [Spirits Lifted By]". Having a null here instead of zero makes sense, as there is no row in the artist
table that would provide additional information.
Trying to visit artist 118760 page on Discogs directly results in "This artist is used as a placeholder entry and does not link to any artist.". The artist name for that ID is "No Artist", so having null in this place actually makes sense.
I have no idea what to do about that bogus company_id
, though. The release that references it is here, the company is supposed to be called "Terry's 83rd Street Music" and appears properly linked in a "Published By" credit. However, clicking that credit leads to a label whose ID is 1502386. Trying to visit label 1610443 directly leads to a 404, but it can still be accessed via the API.
label_id
in release_label
is still null. However, it seems like the label_id
column isn't generated at all in release_label.csv
, which is probably a bug in the C# converter, and not the scripts themselves :
release_id,label_name,catno
1,Svek,SK032
Since the import script takes the list of columns from the CSV file header, the list of columns is set to (release_id,label_name,catno)
and label_id
remains null.
from discogs-xml2db.
Could you try with the October file and see if you have the same issues, please?
If it is not too much, could you also test the October files with the changes in your PR?
from discogs-xml2db.
Superb analysis, thank you.
So what do you think we should do?
from discogs-xml2db.
Something like this for starters - this includes a fix for missing values in the label_id
column of the release_label
table :
diff --git a/alternatives/dotnet/discogs/DiscogsRelease.cs b/alternatives/dotnet/discogs/DiscogsRelease.cs
index d554759..75d0b54 100644
--- a/alternatives/dotnet/discogs/DiscogsRelease.cs
+++ b/alternatives/dotnet/discogs/DiscogsRelease.cs
@@ -10,7 +10,7 @@ namespace discogs.Releases
{
{ "release", "id title released country notes data_quality master_id status".Split(" ") },
{ "release_genre", "release_id genre".Split(" ") },
- { "release_label", "release_id label_name catno".Split(" ") },
+ { "release_label", "release_id label_id label_name catno".Split(" ") },
{ "release_style", "release_id style".Split(" ") },
{ "release_image", "release_id type width height".Split(" ") },
{ "release_format", "release_id name qty text_string descriptions".Split(" ") },
@@ -22,6 +22,16 @@ namespace discogs.Releases
{ "release_track_artist", "release_id track_sequence track_id artist_id artist_name extra anv position join_string role tracks".Split(" ") },
};
+ private static string SanitizeArtistId(string artistId)
+ {
+ if (artistId == "0" /* non-linked credits in credit lists have this set to zero */
+ || artistId == "118760" /* No Artist */)
+ {
+ return "";
+ }
+ return artistId;
+ }
+
[XmlAttribute]
public string id { get; set; }
@@ -82,7 +92,7 @@ namespace discogs.Releases
foreach (var l in labels)
{
if (l == null) continue;
- yield return ("release_label", new[] { id, l.name, l.catno });
+ yield return ("release_label", new[] { id, l.id, l.name, l.catno });
}
}
if (styles?.Length > 0)
@@ -138,7 +148,7 @@ namespace discogs.Releases
foreach (var a in artists)
{
if (a == null) continue;
- yield return ("release_artist", new[] { id, a.id, a.name, "0", a.anv, (position++).ToString(), a.join, a.role, a.tracks });
+ yield return ("release_artist", new[] { id, SanitizeArtistId(a.id), a.name, "0", a.anv, (position++).ToString(), a.join, a.role, a.tracks });
}
}
if (extraartists?.Length > 0)
@@ -147,7 +157,7 @@ namespace discogs.Releases
foreach (var a in extraartists)
{
if (a == null) continue;
- yield return ("release_artist", new[] { id, a.id, a.name, "1", a.anv, (position++).ToString(), a.join, a.role, a.tracks });
+ yield return ("release_artist", new[] { id, SanitizeArtistId(a.id), a.name, "1", a.anv, (position++).ToString(), a.join, a.role, a.tracks });
}
}
int seq = 0;
@@ -159,13 +169,13 @@ namespace discogs.Releases
foreach (var a in (t.artists ?? System.Array.Empty<artist>())) {
if (a == null) continue;
artistSeq += 1;
- yield return ("release_track_artist", new[] { id, t.position, t.track_id, a.id, a.name, "0", a.anv, artistSeq.ToString(), a.join, a.role, a.tracks });
+ yield return ("release_track_artist", new[] { id, t.position, t.track_id, SanitizeArtistId(a.id), a.name, "0", a.anv, artistSeq.ToString(), a.join, a.role, a.tracks });
}
artistSeq = 0;
foreach (var a in (t.extraartists ?? System.Array.Empty<artist>())) {
if (a == null) continue;
artistSeq += 1;
- yield return ("release_track_artist", new[] { id, t.position, t.track_id, a.id, a.name, "1", a.anv, artistSeq.ToString(), a.join, a.role, a.tracks });
+ yield return ("release_track_artist", new[] { id, t.position, t.track_id, SanitizeArtistId(a.id), a.name, "1", a.anv, artistSeq.ToString(), a.join, a.role, a.tracks });
}
}
}
C# is not my native language, so I'm sure this can be expressed in a better way, but the intent should be clear. The real question is, how many "magic" artist IDs there are : I'm not aware of any list of "artists who have their IDs but aren't artists who have a page and thus an entry in artists.xml
" or anything like that, which would make the job easier.
The only thing that comes to mind is resorting to correcting entries at import time, i.e. "while importing releases, look if the artist with the given ID exists, and if not, replace artist_id
with null
", but that just sounds plain wrong.
The project I've used a few times before actually commented out creating FK constraints in its database initialisation script, possibly because they just weren't worth the trouble. I'm not experienced enough with DBs to say for sure what kind of benefits they bring to the engine, as opposed to the human who can figure out the relations more easily. Perhaps this is really the best way to go.
from discogs-xml2db.
Related Issues (20)
- get_latest_dumps.sh should checksum the downloaded files
- Allow specifying direct files rather than `--export` + input dir
- Investigate the opportunities of a .NET-based parser HOT 4
- ImportError: cannot import name zip_longest HOT 4
- Export artist groups
- Export sublabels
- C# csv files should not have BOM
- Use python virtual environment over sudo HOT 2
- run.py failing on parsing discogs_20200901_releases.xml.gz, maybe an error in xmlfile HOT 3
- get_latest_dumps should checksum the downloads HOT 1
- Put Postgres import and sql commands into a single script HOT 3
- When I run postgresql/sql/CreateFKConstraints.sql some fail because of problems with data HOT 5
- Hexadecimal issue with experimental C Parser HOT 2
- Type in postgressql/sq/CreateTables.sql HOT 2
- Change CSV delimiter and quotes HOT 1
- C# exports gz, neither of which are compatible with mysql importer HOT 1
- release_artist missing data for tracks column HOT 1
- get_latest_dumps.sh broken by URL changes
- Incorrect progress bar percentage
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from discogs-xml2db.