konradreiche / metadata-census
A platform for monitoring the quality of metadata.
License: MIT License
Since the Elasticsearch indexing has been shifted from the Metadata Harvester to Metadata Census, a JSON dump importer is needed for Metadata Census that performs this task.
There are alternatives and adjustments to the Flesch reading ease that make it applicable to other languages like German and Spanish, too. These adjustments need to be applied to the accessibility metric by using the whatlanguage gem to determine the language beforehand.
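A minimal sketch of the detection step, assuming the whatlanguage gem is in the Gemfile (the sample strings are only illustrative):

require 'whatlanguage'

# Determine each record's language before choosing a reading-ease formula.
wl = WhatLanguage.new(:all)
wl.language('Die Qualität der Metadaten ist entscheidend.')  # => :german
wl.language('The quality of the metadata is crucial.')       # => :english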
There are currently two progress bars: one displaying the initialization of the metadata records for a metric and another displaying the computation progress. Somehow, for some metrics like the link checker, the latter progress bar fills up multiple times, probably due to different states in the process. This should be unified so the progress bar only fills up once.
I have refactored the route configuration. Since the administration control issues POST requests to schedule a computation and GET requests to retrieve the status, these need to be updated to the new route configuration accordingly.
Implement an analysis page for the license metric. This could include some basic metadata record information (id, name), the license, and whether the license is OKD and/or OSI compliant. The graphs tab would certainly include the distribution of the different licenses, as well as a visualization of the number of open data licenses versus non-open data licenses.
Which quality factors go into a total score to assess the quality of a repository is clearly very subjective. Hence, the user should be able to configure which metric results go into the total score.
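A minimal sketch of such a configurable aggregation, assuming per-user weights are stored as a hash from metric name to weight (all names and values here are assumptions):

# Average only the metric scores the user selected, weighted by their configuration.
def total_score(metric_scores, weights)
  selected = metric_scores.select { |name, _| weights.key?(name) }
  return nil if selected.empty?
  weighted = selected.inject(0.0) { |sum, (name, score)| sum + score * weights[name] }
  weighted / weights.values.inject(:+)
end

total_score({ completeness: 0.8, accuracy: 0.6 }, { completeness: 1, accuracy: 2 })
# => 0.666... ((0.8 * 1 + 0.6 * 2) / 3)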
In Open Data, licenses play a crucial role. While the choice of license is not a quality factor itself, it is a factor in determining whether the data is open or not. In addition, this license overview can be used to determine whether the license type has been set to a distinct value, which in turn is a quality factor.
This can be approached in a similar way to the OKFN Open Data Census. It should, however, not only differentiate between open and non-open licenses, but between all the different types.
So far I have approached the detail view of each metric differently. This should be consolidated into a clean design which is applied to every metric.
The results of the different metric scores should be aggregated into an average score. This score should only be available once all the metrics have been applied, in order to enable comparability.
A job should be set up to harvest all repositories automatically. In addition, previous records should not be discarded but kept. This will be a step towards timeline-oriented quality control.
In order to understand a certain set of problems with the Link Checker metric results, a detail view should be implemented which lists all the URLs together with their response code, timeout, etc.
For comparison of the different repositories a leaderboard is a crucial feature. This leaderboard should compare the scores of the different repositories.
When the feature in issue #8 has been tackled, a timeline should be implemented in order to keep a record of quality changes over time. This way the improvement and/or decline of a repository's quality can be tracked. The first approach should be based on #14 and offer a slider to move between the different snapshots.
Would be really keen to see some documentation of what you're measuring!
While checking the MIME type is a practicable approach to validating the format field, it is often far from correct. A problem occurs when the server hosting the resources is badly configured: a wrong MIME type will be returned even though the resource complies with the format.
An alternative or additional approach is to download the resource and try to detect the file type with Unix programs like file.
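For instance, a minimal sketch that shells out to file(1) after the resource has been downloaded (the path is an assumption):

# -b omits the file name, --mime-type prints only the detected MIME type.
mime = `file -b --mime-type /tmp/downloaded_resource`.strip
mime  # => e.g. "text/csv" or "application/pdf"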
Since I am aiming to provide a button to compute all metrics on all repositories, it is equally reasonable to provide a visualization of the total progress of all workers.
By analyzing the formats in the metadata records as well as the returned MIME types, one can learn what the typical values are. This statistical evidence should be used to improve the Accuracy metric.
Particle effects as implemented with three.js can create a nice ambient effect. They should be implemented either on the landing page or on a separate page. On the landing page they would serve an ambient function; here the number of particles should be reduced.
If used on a separate page they could serve as a visualization, for instance of the quality of the metadata records: each particle would encode a metadata record and present its quality through its coloring.
The workers spend different amounts of time in different states: fetching the metadata, preprocessing it, computing the metric, and writing the results back. These different states should be communicated to the interface, for instance by using different progress bars.
The correct spelling of description texts is a quality factor. This can be checked by known spell checkers. The language of each metadata record needs to be known beforehand.
Currently, the Sidekiq workers cannot be terminated by sending a KILL signal. The workers are told to shut down, but since there is no hook, it does not happen.
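A sketch of what such a hook could look like, assuming a Sidekiq version that exposes server lifecycle events:

Sidekiq.configure_server do |config|
  # Invoked when Sidekiq begins shutting down; persist intermediate state
  # or release resources here so the workers can terminate cleanly.
  config.on(:shutdown) do
    Rails.logger.info('Sidekiq shutting down, cleaning up workers')
  end
end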
I have implemented a metadata record picker for the Richness of Information metric which enables the user to pick a metadata record from the best and worst records. This should be generalized so it can be applied to every metric. In addition, the user should not be limited to selecting one of the ten best or ten worst records.
A separate administration control panel would help to keep the user interface clean. This way the computation processes can be visualized in more detail without cluttering the repository metric results for the end user.
The metric details for analysis purposes are still persisted as report. Since I have abandoned this naming convention, it should be renamed to analysis. The analysis object should be returned together with the score in the compute method, thus abolishing the attribute as well.
Due to the removal of submenus from dropdown menus in Bootstrap 3, I have decided to reimplement the metadata record selection. The simple selection from the best and worst records with respect to a certain metric can be kept. The metadata record search should be moved into a modal; this way the user is prompted in a window to perform the search.
The diagram to display the Accessibility metric results is broken. The results fetched from the database are null.
It is unnecessarily cumbersome to press the select button every time a repository is chosen. Hence, the interface should be improved by using JavaScript to process click events.
The score meter was a nice way to animate the score of the whole repository as well as the score of a single metric. It should be added to both the repository and the metric page. This time the color should change based on the achieved score.
The score for a metadata record is still stored in an instance attribute, which makes it necessary to reset the score every time the compute method is called. This does not make any sense. Instead, the score should be returned when the compute method is executed.
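A sketch of the intended interface, using a hypothetical metric (the record argument and its hash-like structure are assumptions):

module Metrics
  class Completeness < Metric
    def compute(record)
      return 0.0 if record.empty?
      present = record.values.count { |value| !value.nil? }
      # Return the score instead of mutating an instance attribute.
      present / record.size.to_f
    end
  end
end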
A general user interface concept is still missing. This milestone week should be devoted to redesigning the interface to make the information more accessible. This should be done by viewing the user concept as follows:
The metric results are still stored without a designated schema. The results are spread across questionable fields, for example one named richness_of_information_details, instead of adopting a hierarchy. This needs to be refactored to look like this:
richness_of_information: {
  score: 0.75,
  report: {
    [:tags, 0] => 0.25,
    [:tags, 1] => 0.5,
    ...
  }
}
The Link Checker metric works so far; however, it does not handle requests that are answered with HTTP 301 or HTTP 302. In these cases the redirect should be followed. There may be further cases which have to be considered. One way to implement this is by re-queuing the updated request into the dispatcher.
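Alternatively, a minimal sketch of following redirects directly with Net::HTTP (the recursion depth limit is an assumption):

require 'net/http'

# Follow up to `limit` redirects and return the final response code.
def final_status(url, limit = 5)
  raise 'Too many redirects' if limit.zero?
  response = Net::HTTP.get_response(URI(url))
  if response.is_a?(Net::HTTPRedirection)
    final_status(response['location'], limit - 1)
  else
    response.code
  end
end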
The styling of the administration control is still broken due to the upgrade of Twitter Bootstrap to version 3. Simply apply the new stylings.
The metric computation as currently implemented does not work on large metadata record sets: for the metadata records from catalog.data.gov, about 100,000 records have to be processed at once. The whole metric computation needs to be refactored to meet this requirement.
The metric creation and processing is now quite general. In order to add a new metric, it suffices to create a new metric class in the app/models/metrics directory. This could be improved even further by adding a rake command that creates the metric class from a template with this or a similar structure:
module Metrics
  class MetricName < Metric
    attr_reader :score

    def initialize
      @score = 0.0
    end

    def compute
    end
  end
end
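A sketch of what such a rake task could look like (the task name and template details are assumptions):

# lib/tasks/metrics.rake
namespace :metrics do
  desc 'Create a new metric class from the template'
  task :new, [:name] => :environment do |_task, args|
    name = args[:name] or abort 'Usage: rake metrics:new[MetricName]'
    path = Rails.root.join('app', 'models', 'metrics', "#{name.underscore}.rb")
    # Write the class skeleton, interpolating the given metric name.
    File.write(path, <<-RUBY)
module Metrics
  class #{name} < Metric
    attr_reader :score

    def initialize
      @score = 0.0
    end

    def compute
    end
  end
end
    RUBY
    puts "Created #{path}"
  end
end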
In order to fetch the current worker progress I implemented an AJAX call which is repeated until a certain condition is met. Somehow the loop does not work when I use an anonymous function for the callback. I would like to find out why: on the one hand I am just curious, and on the other hand I need to pass additional parameters, which can be achieved with an anonymous function (or currying or partial application).
After starting to refactor my routes to a RESTful design, I should refactor the routes for the administration control, too. They should be changed from
/admin/control?repository=data.gov.uk
to
/admin/repository/data.gov.uk/
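A minimal sketch of the corresponding route definition (resource naming is an assumption; the constraint allows dots in repository names like data.gov.uk):

# config/routes.rb
namespace :admin do
  resources :repositories, path: 'repository', param: :name,
                           constraints: { name: /[^\/]+/ },
                           only: [:show, :create]
end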
Ruby on Rails 4 has been released as version 4.0.0 and I would like to upgrade the application. There are plenty of resources describing how to do that.
When a new metric is added, a number of things need to be done before the metric can be computed through the interface. While creating the metric class is inevitable, the other steps can be abstracted. Currently this requires changing code in:
app/view/metrics/overview.html.erb
app/controller/metrics_controller.rb
app/assets/javascript/metrics.js.coffee
app/models/repository.rb
A generic metric report that is used as a fallback when no concrete metric report is available can make sense if designed sufficiently generally. This would avoid having to deal with metrics which have no dedicated report yet.
After refactoring the Accuracy metric, the per-record scores are not saved anymore. This should be implemented once again. This time the scores need to be tracked during the initialization phase so they can be fetched afterwards by calling the compute method.
The repositories page with the sub-pages index, map, and leaderboard is not so heavyweight that it would make sense to use different views. Instead I could render them all on the same page and toggle their visibility using Bootstrap tabs. This will also speed up the transitions between the tabs.
There is a huge quality deficit in metadata records containing URLs that are answered with HTTP 404. A link checker can easily be implemented by sending HTTP HEAD requests.
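A minimal sketch of such a check with Net::HTTP (the URL and timeouts are assumptions):

require 'net/http'

uri = URI('http://example.com/data.csv')
response = Net::HTTP.start(uri.host, uri.port, open_timeout: 5, read_timeout: 5) do |http|
  # HEAD fetches only the headers, avoiding a full download.
  http.head(uri.request_uri)
end
response.code  # => "200", "404", ...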
Since I am starting to tackle issue #8, it seems reasonable to restructure how metadata records are stored. Until now I have stored the metadata record and meta-metadata, like metric results, at the same hierarchy level. This can be improved by moving the original source record into its own field, for instance document.
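A sketch of the proposed layout (all field names besides document are assumptions):

record = {
  'document' => { 'id' => 'dataset-1', 'title' => 'Example dataset' },  # original source record
  'metrics'  => { 'completeness' => { 'score' => 0.9 } }                # meta-metadata such as metric results
}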
Schema compliance is a quality factor. It can be checked by using a JSON schema validator. The results would include whether the dataset is schema compliant or not and which violations have been made.
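A minimal sketch using the json-schema gem (the gem choice and file paths are assumptions):

require 'json'
require 'json-schema'

schema = JSON.parse(File.read('schema/dataset.json'))
metadata_record = JSON.parse(File.read('record.json'))
# fully_validate returns an array of violation messages, empty when compliant.
errors = JSON::Validator.fully_validate(schema, metadata_record)
errors.empty?  # => true when the record is schema compliant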
In the harvester, empty strings, arrays, and hashes are replaced by null. This is an intrusive way of persisting the data. Originally this was introduced as part of the completeness metric. As a matter of fact, these checks for empty containers should be performed in the metric. Revisit this problem, remove the modification, and make sure that the metrics still work correctly.
In order to delegate the visualization of the different metric reports, partials should be created, each implementing the view for one metric. The partial rendering can then be performed based on the selected metric.