opabst / discogsparser

License: MIT License

Shell 0.23% Java 79.39% TSQL 19.47% PLpgSQL 0.91%

discogsparser's Introduction

DiscogsParser

DiscogsParser is a Java project, based on the StAX API, that parses the XML database dumps of the Discogs music database and marketplace (https://www.discogs.com/).

The project contains four parsers, one for each dump file, that extract the information on artists, labels, masters and releases. In addition, writers are implemented that store the parsed information in a relational database, in this case PostgreSQL. SQL scripts are provided to create the required database schemas and to transform the imported schema into a more usable format with integrity constraints and indexes.
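As an illustration of the StAX approach, the following minimal sketch streams a small XML document and collects label names. The element names (`labels`, `label`, `id`, `name`), the class `StaxLabelSketch` and the method `labelNames` are assumptions made for this example; they are not taken from the project's source or from the real dump format.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch, not the project's parser.
final class StaxLabelSketch {

    /** Collects the text of every name element found at nesting depth 3 (root/label/name). */
    static List<String> labelNames(String xml) {
        List<String> names = new ArrayList<>();
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            int depth = 0;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    depth++;
                    if (depth == 3 && "name".equals(reader.getLocalName())) {
                        names.add(reader.getElementText());
                        // getElementText() leaves the reader on the matching end tag,
                        // so the loop never sees that END_ELEMENT event.
                        depth--;
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    depth--;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed XML", e);
        }
        return names;
    }

    public static void main(String[] args) {
        String xml = "<labels><label><id>1</id><name>Label A</name></label></labels>";
        System.out.println(labelNames(xml));
    }
}
```

Because the reader pulls one event at a time, memory use stays flat regardless of the size of the input, which is the usual reason to choose StAX over a DOM parser for large dump files.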

Folders

db/
- SQL scripts to create or drop the database and transform the data
doc/
- ER diagrams that describe the database schema
src/
- Java files for parsing the XML dumps and writing the parsed data to the database

discogsparser's Issues

Find ignored xml elements

Verify that all occurring elements are handled by the parsers. One possible solution is to output the XML tag names in the default branch of each parser's switch construct.
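The idea could be sketched roughly as follows. The handled tag names, the class `IgnoredElementSketch` and the method `unhandledElements` are hypothetical; the project's actual parsers may be structured differently.

```java
import java.io.StringReader;
import java.util.SortedMap;
import java.util.TreeMap;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch, not the project's parser.
final class IgnoredElementSketch {

    /** Counts every start tag that falls through to the default branch of the switch. */
    static SortedMap<String, Integer> unhandledElements(String xml) {
        SortedMap<String, Integer> ignored = new TreeMap<>();
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() != XMLStreamConstants.START_ELEMENT) {
                    continue;
                }
                String tag = reader.getLocalName();
                switch (tag) {
                    case "artists":
                    case "artist":
                    case "id":
                    case "name":
                        // A real parser would fill its entity here.
                        break;
                    default:
                        // Unknown tag: record it instead of dropping it silently.
                        ignored.merge(tag, 1, Integer::sum);
                        break;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed XML", e);
        }
        return ignored;
    }
}
```

Collecting counts in a sorted map, rather than printing each tag as it occurs, keeps the report short even when an unhandled tag appears many times in a dump.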

bad modelling of release identifier table

The release identifier data is modelled badly. It is not clear whether the attributes value, type and description are enough to identify every record unambiguously. One option is to add the release id directly as a foreign key and to make it part of the primary key together with the value or type.

'model' package - create superclass of common parts

Most likely some parts can be shared between the model classes of all parsers. Analyze the classes to find these parts and move them into a common superclass to remove or reduce duplication.

Duplicate keys in import

Locate the reason for inserts failing due to duplicate keys. Should the primary key be composed of more attributes than the id? Or should the primary keys be dropped entirely during import?

Parallelize parsing

Start a separate thread for each file; this way the files can be parsed and written to the database in parallel, which should be faster than processing them sequentially.

Parallelize stats gathering

Gathering stats seems to block the execution: the singleton is a bottleneck.

Create a gatherer for each writer, collect the statistics in parallel and merge them at the end.
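A minimal sketch of the per-writer approach, assuming the statistics are simple per-table row counts. The names `StatsSketch`, `Gatherer`, `increment` and `merge` are hypothetical and not taken from the project.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not the project's statistics code.
final class StatsSketch {

    /** Deliberately not thread-safe: each writer owns exactly one instance, so no locking is needed. */
    static final class Gatherer {
        private final Map<String, Long> counts = new HashMap<>();

        void increment(String table) {
            counts.merge(table, 1L, Long::sum);
        }

        Map<String, Long> snapshot() {
            return new HashMap<>(counts);
        }
    }

    /** Sums the counters of all gatherers; call this only after every writer has finished. */
    static Map<String, Long> merge(List<Gatherer> gatherers) {
        Map<String, Long> total = new TreeMap<>();
        for (Gatherer gatherer : gatherers) {
            gatherer.snapshot().forEach((table, n) -> total.merge(table, n, Long::sum));
        }
        return total;
    }
}
```

Since no gatherer is shared between threads, the writers never contend for a lock; the only synchronization point is the final merge, which runs once the workers are done.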

Reorganize sql scripts

Reorganize the scripts to simplify the creation and deletion of the schema. Create an install and an uninstall script that wrap the individual scripts, but keep the individual scripts so that the work or import schema can still be created and dropped separately for development purposes.

Add surrogate keys for images

Image objects have no attribute or combination of attributes that uniquely identifies each tuple. Add a surrogate key to identify every tuple uniquely.

Finish parser for Artists

Add missing master videos

Determine where the videos of masters are overlooked:
-during parsing?
-during writing to the database?

Adapt statistics to schema types

The two available schemas use different table names, so the statistics need to adapt to the schema in use.

drop import tables after preprocess

After preprocessing drop the import tables, as they are not required anymore.

Write PostgreSQL backend

Finish the backend to write the parsed data to a PostgreSQL database

Write database schema for PostgreSQL

Releases file - missing tracks

A releases file possibly contains tracks, enclosed by a wrapper tag, in at least two places. A superficial examination did not turn up any actual tracks. Continue examining the releases file and find track entities to complete the parser.

Table for ReleaseTracks missing

The tracks, or tracklist, of a release are not written to the database after parsing.

The following things have to be implemented for this:

- db table ReleaseTrack (id [surrogate key], position, title, duration)
- db table track_of_release (track_id, release_id)
- PreparedStatement to write the parsed entity to the database
- SQL statement to write/transform the data from the import schema to the work schema

add release formats to er diagram

The diagram modelling the data for releases is missing the entity for release formats.

fix the er diagram for releases

The ER diagram for releases is missing the entity for release videos and the relationship that connects releases and release videos.

Parallelization enhancement

- Let the main thread wait for completion of the workers
- Manage the workers in a thread pool, in case fewer than three cores are available
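Both points could be covered with the standard `java.util.concurrent` executor framework. The sketch below is only an illustration: the class `ParallelParseSketch`, the method `parseAll` and the placeholder task body are hypothetical, and a real task would parse one dump file and write it to the database.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch, not the project's code.
final class ParallelParseSketch {

    /** Runs one task per dump file on a bounded pool and blocks until all tasks have finished. */
    static List<String> parseAll(List<String> dumpFiles) {
        // Never start more workers than there are cores or files, but always at least one.
        int workers = Math.max(1,
                Math.min(dumpFiles.size(), Runtime.getRuntime().availableProcessors()));
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String file : dumpFiles) {
                tasks.add(() -> "parsed " + file); // placeholder for parse + write
            }
            List<String> results = new ArrayList<>();
            // invokeAll() returns only after every task has completed,
            // so the calling (main) thread waits for the workers here.
            for (Future<String> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for workers", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A worker failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

With a fixed pool, a machine with fewer cores than dump files simply queues the remaining files instead of oversubscribing the CPU.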

check indexes and constraints for preprocessed tables

Rename relations

Rename the relations with a prefix, so that the group a relation belongs to can be recognized from the first letter of its name.

Finish parser for Labels

Rework release video tables

src is an unfitting primary key. Add the release id as a foreign key and drop the (useless) connection table.

Missing tests for parsers

Add tests to verify correctness by examples instead of checking against a huge dump file.

Masters file - parse tracks

A masters file possibly contains tracks, enclosed by a wrapper tag. A superficial examination did not turn up any actual tracks. Continue examining the masters file and find track entities to complete the parser.

Remodel ReleaseTrack

Rethink whether a surrogate primary key is necessary. Maybe a primary key that references the release, in combination with the position attribute, is enough.

Array deduplication

In the transformation step, the rows of some entities are aggregated by primary key, with the column values collected into arrays. These arrays often contain null or duplicate values. Examine whether deduplication is required, or whether deduplication might destroy the positional relation between the arrays of the same row.
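The risk can be illustrated outside the database. In the sketch below, two parallel arrays stand in for two aggregated columns of one row; the class `ArrayDedupSketch`, its methods and the sample values are hypothetical and chosen only to show the effect.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not the project's transformation code.
final class ArrayDedupSketch {

    /** Deduplicates one array on its own, keeping first-seen order. */
    static List<String> dedupSingle(String[] values) {
        return new ArrayList<>(new LinkedHashSet<>(Arrays.asList(values)));
    }

    /** Deduplicates whole (first, second) pairs, so values at the same index stay together. */
    static List<List<String>> dedupPaired(String[] first, String[] second) {
        if (first.length != second.length) {
            throw new IllegalArgumentException("Arrays must have the same length");
        }
        Set<List<String>> seen = new LinkedHashSet<>();
        for (int i = 0; i < first.length; i++) {
            // Arrays.asList tolerates null elements, which aggregated arrays may contain.
            seen.add(Arrays.asList(first[i], second[i]));
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        String[] types = {"primary", "secondary", "secondary", "secondary"};
        String[] widths = {"600", "600", "300", "300"};
        System.out.println(dedupSingle(types));
        System.out.println(dedupSingle(widths));
        System.out.println(dedupPaired(types, widths));
    }
}
```

Deduplicating each array separately leaves two arrays of length two, and reading them index by index silently drops the pair (secondary, 600). Deduplicating the pairs keeps all three distinct combinations.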

add indexes to import tables

Operations on import tables are very slow. Add (matching) indexes to speed up queries on the import tables.

add preprocess queries

Use SQL queries to preprocess the imported data and import it into the final tables, making the import tables obsolete.

Test db backend with unit tests

Test the backend with unit tests beforehand. That way the parsing does not have to be run first to uncover possible trouble in the db code.

Rethink array usage

Reevaluate the usage of arrays as a way to aggregate information about the same object (primary key) in order to avoid key constraint issues. These pieces of information are related to each other and should be stored together, not in separate arrays where the association by array index can only be assumed, but neither enforced nor ensured.

This affects all relations that aggregate information in arrays during the transformation process.

Abstract from concrete db implementation

Currently the parsers use the PostgreSQL connector directly to write the entities to the database. Abstract from the concrete implementation and use a generic DB connector that hides it.
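One way to picture such an abstraction is a small interface that the parsers depend on, with the PostgreSQL code moved behind it. The names `EntityWriter`, `InMemoryWriter` and `emitLabel` below are hypothetical; a PostgreSQL implementation would wrap JDBC statements and is not shown here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the project's backend.
final class BackendSketch {

    /** What a parser needs from a backend; no JDBC or PostgreSQL types leak through. */
    interface EntityWriter {
        void write(String table, Map<String, String> row);
    }

    /** Test double that records rows instead of executing SQL. */
    static final class InMemoryWriter implements EntityWriter {
        final List<String> written = new ArrayList<>();

        @Override
        public void write(String table, Map<String, String> row) {
            written.add(table + row);
        }
    }

    /** Stand-in for a parser: it only knows the interface, not the implementation. */
    static void emitLabel(EntityWriter writer, String id, String name) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("id", id);
        row.put("name", name);
        writer.write("label", row);
    }
}
```

A side effect of this design is that parser logic can be unit-tested against the in-memory implementation without a running database.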

Finish parser for Releases

Missing attributes in MasterParser

- the 'master' tag has an id attribute
- the 'main_release' tag is missing

Connect db parts

Create artificial relationships to connect the disparate parts of the database using the artist ids.

Unify parser (only parse strings)

Let the parsers write only strings to the intermediate data structures (*Entity). Any conversion is applied, if necessary, in the db backend.
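A minimal sketch of this split, assuming an entity with two string fields and an integer column in the database. The class `StringEntitySketch`, the entity fields and the helper `toNullableInt` are hypothetical.

```java
// Hypothetical sketch, not the project's entity or backend code.
final class StringEntitySketch {

    /** Intermediate structure as the parser would fill it: raw strings only. */
    static final class MasterEntity {
        String id;
        String year;
    }

    /**
     * Backend-side conversion: missing or blank text becomes null (SQL NULL),
     * anything else must be a valid integer or a NumberFormatException is thrown.
     */
    static Integer toNullableInt(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return null;
        }
        return Integer.valueOf(raw.trim());
    }
}
```

Keeping the conversion in one place means that a malformed value surfaces in the backend, where it can be logged together with the target table and column.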

Fill statistics

Add incrementing statements to importers to fill statistics

Bug in LabelParser

Some LabelEntities have null elements, e.g. id or data quality. Fix the parser to allow the import into the database.

Probe existence of db schema before parsing

- connect to the db on program startup
- probe for existence of the required schema/tables before parsing
-> saves a lot of time and anger