discogsparser's Introduction

DiscogsParser

DiscogsParser is a Java project based on the StAX API that parses the XML database dumps of the Discogs music database and marketplace (https://www.discogs.com/).

The project contains four parsers, one per dump file, covering artists, labels, masters and releases. In addition, writers store the parsed information in a relational database, in this case PostgreSQL. SQL scripts are provided to create the required database schemas and to transform the imported schema into a more usable form with integrity constraints and indexes.
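As a rough illustration of the StAX approach described above, the following sketch pulls the id and name of each artist out of an XML string. The class name `ArtistStaxSketch`, the element names (`artist`, `id`, `name`) and the `id:name` output format are assumptions made for this example and are not taken from the project's sources.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch, not the project's parser.
final class ArtistStaxSketch {

    /** Returns one "id:name" string per <artist> element (assumed layout). */
    static List<String> parseArtists(String xml) {
        List<String> artists = new ArrayList<>();
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            String id = null;
            String name = null;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    switch (reader.getLocalName()) {
                        case "id":
                            // Keep only the first id seen inside the current artist.
                            if (id == null) id = reader.getElementText();
                            break;
                        case "name":
                            // Keep only the first name seen inside the current artist.
                            if (name == null) name = reader.getElementText();
                            break;
                        default:
                            break; // other elements are ignored in this sketch
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "artist".equals(reader.getLocalName())) {
                    // The artist element is complete: emit it and reset the state.
                    artists.add(id + ":" + name);
                    id = null;
                    name = null;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return artists;
    }
}
```

Because the reader streams events instead of building a tree, the same loop should also work on a large dump when it is given a file-backed reader rather than a string.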

Folders

  • db/
    • SQL scripts to create or drop the database and transform the data
  • doc/
    • ER diagrams that describe the database schema
  • src/
    • Java files for parsing the XML dumps and writing the parsed data to the database

discogsparser's People

Contributors

opabst

discogsparser's Issues

Find ignored xml elements

Verify that all occurring elements are handled by the parser. One possible solution is to output the XML tag names in the default branch of each parser's switch statement.
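The default-branch idea could look roughly like the sketch below, which counts every element name that falls through to `default`. The class name `UnhandledTagSketch` and the set of "handled" tags are made up for the example; a real parser would list the tags it actually processes.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch: report element names that no case branch handles.
final class UnhandledTagSketch {

    /** Maps each unhandled element name to the number of times it occurred. */
    static Map<String, Integer> findUnhandled(String xml) {
        Map<String, Integer> unhandled = new TreeMap<>();
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() != XMLStreamConstants.START_ELEMENT) {
                    continue;
                }
                String tag = reader.getLocalName();
                switch (tag) {
                    case "labels":
                    case "label":
                    case "id":
                    case "name":
                        break; // assumed to be handled by the real parser
                    default:
                        // Anything that ends up here is currently ignored.
                        unhandled.merge(tag, 1, Integer::sum);
                        break;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return unhandled;
    }
}
```

Collecting counts instead of printing every occurrence keeps the output short even when an ignored element appears many times in a dump.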

bad modelling of release identifier table

The release identifier data is modelled poorly. It is not clear whether the attributes value, type and description are enough to identify every record unambiguously. One option is to add the release id directly as a foreign key and make it part of the primary key together with the value or type.

'model' package - create superclass of common parts

Most likely some parts can be shared between the model classes of all parsers. Analyze the classes to find these parts and move them into a common superclass to remove or reduce duplication.

Duplicate keys in import

Locate the reason for inserts failing due to duplicate keys. Should the primary key be composed of more attributes than the id? Or should the primary keys be dropped entirely during import?

Parallelize parsing

Start a separate thread for each file; this way the files can be parsed and written to the database in parallel, which should be faster than processing them sequentially.

Parallelize stats gathering

Gathering stats seems to block execution; the singleton is a bottleneck.

Create a gatherer for each writer, collect statistics in parallel and merge them at the end.
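A minimal sketch of the per-writer idea, assuming each gatherer is used by exactly one writer thread and that merging happens only after all writers have finished. The class name `StatsGathererSketch` and the string counter keys are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: one gatherer per writer, merged once at the end.
final class StatsGathererSketch {

    // Not synchronized on purpose: each instance belongs to a single writer thread.
    private final Map<String, Long> counts = new HashMap<>();

    /** Increments the counter for the given key by one. */
    void increment(String key) {
        counts.merge(key, 1L, Long::sum);
    }

    /** Sums the counters of all gatherers; call only after the writers are done. */
    static Map<String, Long> merge(List<StatsGathererSketch> gatherers) {
        Map<String, Long> total = new TreeMap<>();
        for (StatsGathererSketch gatherer : gatherers) {
            gatherer.counts.forEach((key, value) -> total.merge(key, value, Long::sum));
        }
        return total;
    }
}
```

Since no state is shared while the writers run, no locking is needed on the hot path; the only coordination point is the final merge.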

Reorganize sql scripts

Reorganize the scripts to simplify creating and deleting the schema. Create an install and an uninstall script that wrap the individual scripts, but keep the individual scripts so that the work or import schema can still be created and deleted separately for development purposes.

Add surrogate keys for images

Image objects have no attribute or attribute combination that uniquely identifies each tuple. Add a surrogate key to uniquely identify every tuple.

Add missing master videos

Determine where the videos of masters are overlooked:
-during parsing?
-during writing to the database?

Releases file - missing tracks

A releases file possibly contains tracks, enclosed by , in at least two places. A superficial examination did not show any instances of existing tracks. Continue examining the releases file and find track entities to complete the parser.

Table for ReleaseTracks missing

The tracks, or tracklist, of a release are not written after parsing.

The following things have to be implemented for this:

  • DB table ReleaseTrack (id [surrogate key], position, title, duration)
  • DB table track_of_release (track_id, release_id)
  • PreparedStatement to write the parsed entity to the database
  • SQL statement to write/transform data from the import schema to the work schema

fix the er diagram for releases

The ER diagram for releases is missing the entity for release videos and the relationship that connects releases and release videos.

Parallelization enhancement

  1. Let the main thread wait for completion of the workers
  2. Manage the workers in a thread pool, in case there are fewer than three cores available
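Both points can be sketched with a fixed-size `ExecutorService`: `invokeAll` blocks the calling thread until every task has finished, and the pool size is capped at the number of available cores. The class name `ParserPoolSketch` and the idea that each job returns a record count are assumptions for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: run one parse job per dump file in a bounded pool.
final class ParserPoolSketch {

    /** Runs all jobs and returns each job's result keyed by its name. */
    static Map<String, Integer> runAll(Map<String, Callable<Integer>> jobs) {
        // Never start more threads than there are cores or jobs.
        int threads = Math.max(1,
                Math.min(jobs.size(), Runtime.getRuntime().availableProcessors()));
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<String> names = new ArrayList<>(jobs.keySet());
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (String name : names) {
                tasks.add(jobs.get(name));
            }
            // invokeAll returns only after every task has completed.
            List<Future<Integer>> futures = pool.invokeAll(tasks);
            Map<String, Integer> results = new LinkedHashMap<>();
            for (int i = 0; i < names.size(); i++) {
                results.put(names.get(i), futures.get(i).get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for workers", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A worker failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

With fewer cores than dump files, the remaining jobs simply queue up inside the pool instead of competing for CPU time.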

Rename relations

Rename relations with a prefix so that the group a relation belongs to can be recognized from the first letter of its name.

Rework release video tables

src is an unsuitable primary key. Add the release id as a foreign key and drop the (useless) connection table.

Masters file - parse tracks

A masters-file possibly contains tracks, enclosed by . A superficial examination did not show any instances of existing tracks. Continue examination of the masters file and find track entities to complete the parser.

Remodel ReleaseTrack

Rethink whether a surrogate primary key is necessary. Maybe a primary key that references the release, combined with the position attribute, is enough.

Array deduplication

In the transformation step, the rows of some entities are aggregated by primary key, with column values collected into arrays. These arrays often contain null or duplicate values. Examine whether deduplication is required, or whether it might destroy the correspondence between array entries of the same relation.
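The risk can be illustrated with a small sketch using made-up values: deduplicating two parallel arrays independently may drop a pairing, while deduplicating the pairs keeps every distinct combination. The class name `ArrayDedupSketch` and the sample data are hypothetical.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: independent deduplication versus deduplication of pairs.
final class ArrayDedupSketch {

    /** Removes duplicates while keeping the first occurrence order. */
    static <T> List<T> dedup(List<T> values) {
        return new ArrayList<>(new LinkedHashSet<>(values));
    }

    /** Pairs up entries that share the same position in both lists. */
    static List<Map.Entry<String, String>> zip(List<String> left, List<String> right) {
        if (left.size() != right.size()) {
            throw new IllegalArgumentException("Lists must have the same length");
        }
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        for (int i = 0; i < left.size(); i++) {
            pairs.add(new AbstractMap.SimpleImmutableEntry<>(left.get(i), right.get(i)));
        }
        return pairs;
    }
}
```

For example, with the arrays `[u1, u2, u1]` and `[primary, secondary, secondary]`, deduplicating each array on its own leaves two entries per array and loses the pairing of `u1` with `secondary`, whereas deduplicating the zipped pairs keeps all three distinct combinations.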

add preprocess queries

Use SQL queries to preprocess the imported data and load it into the final tables, making the import tables obsolete.

Test db backend with unit tests

Test the backend with unit tests beforehand. That way the parsing does not have to run before possible problems in the database code show up.

Rethink array usage

Reevaluate the use of arrays as a way to aggregate information about the same object (primary key) to avoid key constraint issues. These pieces of information are related to each other and should be stored together, not in separate arrays where the association by array index can only be assumed, but not enforced.

This affects all relations that aggregate information in arrays during the transformation process.

Abstract from concrete db implementation

Currently the parsers use the PostgreSQL connector directly to write the entities to the database. Abstract from the concrete implementation and use a generic DB connector that hides it.
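One way such an abstraction might look is sketched below: the importer only sees a small interface, and an in-memory implementation stands in for the database. All names here (`EntityWriter`, `InMemoryEntityWriter`, `LabelImporterSketch`, the `label` table) are hypothetical and not taken from the project; a PostgreSQL-backed class would implement the same interface.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical abstraction: importers depend on this interface only.
interface EntityWriter {
    void write(String table, Map<String, String> row);
}

// Stand-in implementation that keeps rows in memory instead of a database.
final class InMemoryEntityWriter implements EntityWriter {
    private final Map<String, List<Map<String, String>>> tables = new LinkedHashMap<>();

    @Override
    public void write(String table, Map<String, String> row) {
        // Copy the row so later changes by the caller do not affect stored data.
        tables.computeIfAbsent(table, t -> new ArrayList<>()).add(new LinkedHashMap<>(row));
    }

    /** Returns the rows written to the given table, or an empty list. */
    List<Map<String, String>> rows(String table) {
        return tables.getOrDefault(table, List.of());
    }
}

// Example importer that does not know which backend it writes to.
final class LabelImporterSketch {
    static void importLabel(EntityWriter writer, String id, String name) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("id", id);
        row.put("name", name);
        writer.write("label", row);
    }
}
```

An in-memory implementation like this could also serve the unit-testing issue above, since importer logic can then be exercised without a running database.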

Connect db parts

Create artificial relationships to connect the disparate parts of the database using the artist ids.

Unify parser (only parse strings)

Let the parser write only strings to the intermediate data structures (*Entity). Any necessary conversion is applied in the DB backend.
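A conversion helper in the backend might look like the following sketch, which turns a raw string from an entity into a nullable integer. The class and method names are hypothetical, and mapping blank input to `null` is an assumption about how missing values should be stored.

```java
// Hypothetical sketch of a backend-side conversion from raw parser strings.
final class StringConversionSketch {

    /**
     * Converts a raw string to an Integer. Returns null for null or blank input;
     * non-numeric input raises NumberFormatException so bad data is not hidden.
     */
    static Integer toNullableInt(String raw) {
        if (raw == null) {
            return null;
        }
        String trimmed = raw.trim();
        return trimmed.isEmpty() ? null : Integer.valueOf(trimmed);
    }
}
```

Keeping the conversion in one place means the parsers stay uniform, and the decision about how to treat empty or malformed values is made once in the backend.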

Fill statistics

Add incrementing statements to importers to fill statistics

Bug in LabelParser

Some LabelEntities have null elements, e.g. id or data quality. Fix the parser so that these entities can be imported into the database.

Add missing release formats

Release formats seem to be missing as the getters are not used in ReleaseFormat.java.

Locate the place where the formats are overlooked.

Analyze empty import tables

Several tables are empty after importing the dumps into the database. Analyze whether there is an error between parsing and storing in the database, or whether this data does not exist in the source dumps.
