
metadata-census's People

Contributors

konradreiche

metadata-census's Issues

Implement JSON dump importer

Because Elasticsearch indexing has been shifted from the Metadata Harvester to the Metadata Census, the Metadata Census needs a JSON dump importer that performs this task.
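
A minimal sketch of such an importer, assuming the elasticsearch Ruby gem, a local cluster and a JSON array dump; the index name, file path and id field are placeholders:

require 'json'
require 'elasticsearch'

client  = Elasticsearch::Client.new(host: 'localhost:9200')
records = JSON.parse(File.read('dump/metadata.json'))

# Index the dump in batches to keep the bulk requests small.
records.each_slice(500) do |batch|
  operations = batch.map do |record|
    { index: { _index: 'metadata', _id: record['id'], data: record } }
  end
  client.bulk(body: operations)
end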

Improve accessibility metric for other languages

There are alternatives to and adjustments of the Flesch reading ease that make it applicable to other languages such as German and Spanish. These adjustments need to be applied to the accessibility metric, using the whatlanguage gem to determine a record's language beforehand.
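
A minimal sketch of the language dispatch, assuming the whatlanguage gem; the per-language scorers (flesch_reading_ease_en/de/es) are hypothetical helpers the accessibility metric would have to provide:

require 'whatlanguage'

def readability(text)
  case WhatLanguage.new(:all).language(text)
  when :german  then flesch_reading_ease_de(text)  # e.g. the Amstad adaptation
  when :spanish then flesch_reading_ease_es(text)  # e.g. the Fernandez Huerta adaptation
  else               flesch_reading_ease_en(text)  # original Flesch reading ease
  end
end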

Fix progress bar displaying the computation progress

There are currently two progress bars: one displaying the metric's initialization of the metadata records and another displaying the computation progress. Somehow, for some metrics such as the link checker, the latter progress bar fills up multiple times, probably due to the different states in the process. This should be unified so that the progress bar fills up only once.

Update administration control to new routes

I have refactored the route configuration. Since the administration control issues POST requests to schedule a computation and GET requests to retrieve its status, these need to be updated to the new route configuration accordingly.

Implement analysis page for the license metric

Implement an analysis page for the license metric. This could include some basic metadata record information (id, name), the license, and whether the license is OKD and/or OSI compliant. The graphs tab should include the distribution of the different licenses, as well as a visualization of the number of open data licenses versus non-open data licenses.

Refactor metric view

Due to #7 and #13, a general refactoring of the metric view, hereafter called the metric report, is required.

Make the total repository score configurable

Which quality factors should go into a total score assessing the quality of a repository is clearly very subjective. Hence, the user should be able to configure which metric results go into the total score, as sketched below.
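
A minimal sketch of a configurable total score; the weights hash is a hypothetical per-user configuration mapping metric names to weights, where a weight of zero excludes a metric:

def total_score(metric_scores, weights)
  selected = metric_scores.select { |name, _| weights.fetch(name, 0) > 0 }
  return nil if selected.empty?

  weight_sum = selected.keys.inject(0.0) { |sum, name| sum + weights[name] }
  score_sum  = selected.inject(0.0) { |sum, (name, score)| sum + score * weights[name] }
  score_sum / weight_sum
end

total_score({ completeness: 0.8, accuracy: 0.5 }, { completeness: 2, accuracy: 1 })
# => 0.7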

Implement license metric

Licenses play a crucial role in Open Data. While the choice of license is not a quality factor itself, it determines whether the data is open or not. In addition, this license overview can be used to determine whether the license field has been set to a distinct value, which in turn is a quality factor.

This can be approached in a similar way to the OKFN Open Data Census. It should, however, not only differentiate between open and non-open licenses, but between all the different license types.
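
A minimal sketch of the lookup; the LICENSES table is a hypothetical, hand-maintained excerpt (a complete table could be derived from the Open Definition license registry), and license_id is an assumption about the record's license field:

LICENSES = {
  'cc-by'    => { title: 'Creative Commons Attribution',    okd_compliant: true,  osi_compliant: false },
  'odc-odbl' => { title: 'Open Data Commons ODbL',          okd_compliant: true,  osi_compliant: false },
  'cc-nc'    => { title: 'Creative Commons Non-Commercial', okd_compliant: false, osi_compliant: false }
}

def license_info(record)
  LICENSES.fetch(record['license_id'], { title: 'Unknown', okd_compliant: false, osi_compliant: false })
end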

Aggregate metric scores to an average score

The results of the different metrics should be aggregated into an average score. To enable comparability, this score should only be available once all metrics have been applied.

Implement leaderboard

A leaderboard is a crucial feature for comparing the different repositories. It should rank the repositories by their scores.

Implement a timeline to display quality change

Once the feature in issue #8 has been tackled, a timeline should be implemented to keep a record of quality changes over time. This way the improvement or decline of a repository's quality can be tracked. The first approach should be based on #14 and offer a slider to move between the different snapshots.

Improve the format validator of the Accuracy metric

While checking the MIME type is a practicable approach to validating the format field, it is often far from accurate. A problem occurs when the server hosting the resources is badly configured: a wrong MIME type is returned even though the resource complies with the declared format.

An alternative or additional approach is to download the resource and try to detect the file type with Unix programs like file.
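
A minimal sketch of the detection step, assuming the resource has already been downloaded to a temporary path:

require 'open3'

def detected_mime_type(path)
  # file --brief --mime-type prints only the MIME type, e.g. "application/pdf"
  output, status = Open3.capture2('file', '--brief', '--mime-type', path)
  status.success? ? output.strip : nil
end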

Add visualization for total worker progress

Since I am aiming to provide a button that computes all metrics on all repositories, it is equally reasonable to provide a visualization of the total progress of all workers.

Update Accuracy metric with new formats

By analyzing the formats in the metadata records as well as the returned MIME types, one can learn what the typical values are. This statistical evidence should be used to improve the Accuracy metric.
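
A minimal sketch of collecting that evidence: tally the MIME types observed per declared format (the format and mime_type accessors are assumptions about the record model):

observed = Hash.new { |hash, key| hash[key] = Hash.new(0) }

records.each do |record|
  observed[record.format.to_s.downcase][record.mime_type] += 1
end

# e.g. observed['csv'] => { 'text/csv' => 812, 'text/plain' => 344, 'text/html' => 57 }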

Add particles for visualization or ambient effect

Particle effects, as they can be implemented with three.js, create a nice ambience. They should be added either to the landing page or to a separate page. On the landing page they would serve a purely ambient function; there the number of particles should be reduced.

If used on a separate page, the particles could serve as a visualization, for instance of the quality of the metadata records: each particle would encode a metadata record and convey its quality through its color.

Visualize different states of worker process

The workers spend different amounts of time in different states, for instance while fetching the metadata, preprocessing it, computing the metrics and writing the results back. These states should be communicated to the interface, for instance by using separate progress bars.

Implement spell checker

The correct spelling of description texts is a quality factor. It can be checked with established spell checkers; the language of each metadata record needs to be known beforehand.
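
A minimal sketch using GNU Aspell's command-line interface; the language code would come from the language detection mentioned in the accessibility issue above:

require 'open3'

def misspelled_words(text, lang = 'en')
  # "aspell list" reads text from stdin and prints one misspelled word per line.
  output, _status = Open3.capture2('aspell', "--lang=#{lang}", 'list', stdin_data: text)
  output.split("\n").uniq
end

misspelled_words('The sapce agency relesed new data.') # => ["sapce", "relesed"]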

Add shutdown procedure to Sidekiq worker

Currently, the Sidekiq workers cannot be terminated by sending a kill signal. The workers are told to shut down, but since there is no shutdown hook this does not happen.
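
A minimal sketch of such a hook, assuming the Sidekiq version in use supports lifecycle events in configure_server; the cleanup itself is a placeholder:

Sidekiq.configure_server do |config|
  config.on(:shutdown) do
    # e.g. persist partial progress and release worker state here
    Rails.logger.info 'Sidekiq worker shutting down'
  end
end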

Generalize record picker for all metrics

I have implemented a metadata record picker for the Richness of Information metric which enables the user to pick a metadata record from the best and worst records. This should be generalized so it can be applied to every metric. In addition, the user should not be limited to selecting one of the 10 best or 10 worst records.

Create admin control panel

A separate administration control panel would help to keep the user interface clean. This way the computation processes can be visualized in more detail without cluttering the repository metric results for the end user.

Change name, storage and retrieval of the metric details

The metric details for analysis purposes are still persisted as report. Since I have abandoned this naming convention, they should be renamed to analysis. The analysis object should be returned together with the score by the compute method, thus abolishing the instance attribute as well.
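
A minimal sketch of the revised interface; RichnessOfInformation is used only for illustration and record_field_scores is a hypothetical helper standing in for the metric's actual per-field computation:

module Metrics
  class RichnessOfInformation < Metric
    def compute(record)
      analysis = record_field_scores(record)      # e.g. { [:tags, 0] => 0.25, ... }
      score    = analysis.values.inject(:+) / analysis.size
      [score, analysis]
    end
  end
end

# caller side: score, analysis = metric.compute(record)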

Implement metadata record selection and search

Due to the removal of submenus from dropdown menus in Bootstrap 3, I have decided to reimplement the metadata record selection. The simple selection from the best and worst records with respect to a certain metric can be kept. The metadata record search should be moved into a modal, so that the user is prompted in a window to perform the search.

Use JavaScript to select repository

It is unnecessarily cumbersome to press the select button every time a repository is chosen. Hence, the interface should be improved by using JavaScript to handle click events.

Add score meter to repository and metric page

The score meter was a nice way to animate the score of the whole repository as well as the score of a single metric of a repository. It should be added to both the repository and the metric page. This time the color should change based on the score reached.

Refactor metric design to return record score after compute

The score for a metadata record is still stored in an instance attribute, which makes it necessary to reset the score every time the compute method is called. This does not make sense. Instead, the score should be returned by the compute method.

Design a user interface concept

A general user interface concept is still missing. This milestone week should be devoted to redesigning the interface to make the information more accessible, guided by the following user questions:

  • What is the quality of my repository?
  • Why is the quality so bad/so good?
  • What is the reason this metadata record is bad/good?

Restructure meta-metadata for metric results

The metric results are still stored without a designated schema. The results are spread across questionably named fields, for example richness_of_information_details, instead of adopting a hierarchy. This needs to be refactored to look like this:

richness_of_information {
    score : 0.75,
    report : {
         [:tags, 0] : 0.25
         [:tags, 1] : 0.5
         ...
    }
}

Improve Link Checker metric implementation

The Link Checker metric works so far; however, it does not act correctly on requests that are answered with HTTP 301 or HTTP 302. In these cases the redirect should be followed. There may be further cases to consider. One way to implement this is to re-queue the updated request into the dispatcher.
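
A minimal sketch of following redirects directly with Net::HTTP HEAD requests, as an alternative to re-queuing into the dispatcher; the redirect limit guards against loops:

require 'net/http'

def check_link(url, limit = 5)
  return :too_many_redirects if limit.zero?

  uri      = URI.parse(url)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.head(uri.request_uri)
  end

  case response
  when Net::HTTPRedirection then check_link(response['location'], limit - 1)
  when Net::HTTPSuccess     then :ok
  else                           response.code
  end
end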

Make metric computation on large record sets scaleable

The whole metric computation as currently implemented does not work on large metadata record sets: for the metadata records from catalog.data.gov, about 100,000 records have to be handled at once. The metric computation needs to be refactored to meet this requirement.

Add rake command to assist in metric creation

The metric creation and processing is now quite general. In order to add a new metric, it suffices to create a new metric class in the app/models/metrics directory. This could be improved further by adding a rake command that creates the metric class from a template with this or a similar structure (a sketch of such a task follows the template below).

module Metrics
  class MetricName < Metric
    attr_reader :score

    def initialize
      @score = 0.0
    end

    def compute
    end

  end
end
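
A minimal sketch of such a rake task; the task name, argument handling and template wording are assumptions, not an existing part of the project:

namespace :metrics do
  desc 'Create a new metric class skeleton, e.g. rake metrics:create[LinkChecker]'
  task :create, [:name] => :environment do |_task, args|
    path = Rails.root.join('app', 'models', 'metrics', "#{args[:name].underscore}.rb")

    # Write the class skeleton shown above, with the class name filled in.
    File.write(path, <<-RUBY)
module Metrics
  class #{args[:name]} < Metric
    attr_reader :score

    def initialize
      @score = 0.0
    end

    def compute
    end

  end
end
    RUBY

    puts "Created #{path}"
  end
end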

Implement license overview

Licenses play a crucial role in Open Data. While the choice of license is not a quality factor itself, it determines whether the data is open or not. In addition, this license overview can be used to determine whether the license field has been set to a distinct value, which in turn is a quality factor.

Investigate why the AJAX loop does not work with an anonymous function

In order to fetch the current worker progress I implemented an AJAX call that is repeated until a certain condition is met. Somehow the loop does not work if I use an anonymous function for the callback. I would like to find out why: on the one hand I am simply curious, and on the other hand I need to pass additional parameters, which can be achieved with an anonymous function (or with currying or partial application).

Make administration control routes RESTful

Having started to refactor my routes towards a RESTful design, I should refactor the routes for the administration control, too. They should be changed from

/admin/control?repository=data.gov.uk

to

/admin/repository/data.gov.uk/
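
A minimal sketch of the corresponding route configuration (config/routes.rb); the resource and action names are assumptions, and the constraint keeps dots in repository names such as data.gov.uk from being parsed as a format suffix:

namespace :admin do
  resources :repositories, path: 'repository', only: [:show], constraints: { id: /[^\/]+/ } do
    member { post :schedule }
  end
end

# GET  /admin/repository/data.gov.uk           -> status of the computation
# POST /admin/repository/data.gov.uk/schedule  -> schedule a computation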

Upgrade to Rails 4

Ruby on Rails 4 has been released as version 4.0.0 and I would like to upgrade to it. There are plenty of resources describing how to do that.

Abstract metric processing

When a new metric is added, a number of things need to be done before the metric can be computed through the interface. While creating the metric class is inevitable, the other steps can be abstracted.

Currently this requires changing code in:

  • Metrics Overview app/view/metrics/overview.html.erb
  • Metrics Controller app/controller/metrics_controller.rb
  • Metrics JavaScript app/assets/javascript/metrics.js.coffee
  • Repository Class app/models/repository.rb

Design a default metric report

A generic metric report that is used as a fallback when no concrete metric report is available can make sense if it is designed sufficiently generally. This would avoid having to deal specially with metrics that do not have a dedicated report yet.

Store results of Accuracy metric per record

After refactoring the Accuracy metric, the per-record scores are no longer saved. This should be implemented again. This time the scores need to be tracked during the initialization phase so they can be fetched afterwards by calling the compute method.

Render index, map and leaderboard on the same page

The repositories page with its sub-pages index, map and leaderboard is not so heavyweight that using separate views makes sense. Instead, I could render them all on the same page and toggle their visibility with Bootstrap tabs. This will also speed up the transitions between the tabs.

Implement link checker

Metadata records containing URLs that are answered with HTTP 404 represent a huge quality deficit. A link checker can easily be implemented by sending HTTP HEAD requests.

Improve metadata structure as stored in Elasticsearch

Since I am starting to tackle issue #8, it seems reasonable to restructure how metadata records are stored. Until now I have stored the metadata record and meta-metadata such as metric results at the same hierarchy level. This can be improved by moving the original source record into its own field, for instance document.

Implement schema validator

Schema compliance is a quality factor. It can be checked by using a JSON Schema validator. The results would include whether the dataset is schema compliant and which violations were found.
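
A minimal sketch using the json-schema gem; the schema path is a placeholder:

require 'json-schema'

def schema_violations(record)
  # Returns an array of human-readable violation messages (empty if compliant).
  JSON::Validator.fully_validate('config/metadata-schema.json', record)
end

schema_violations(record).empty? # => true if the record is schema compliant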

Revise metadata record modification

In the harvester, empty strings, arrays and hashes are replaced by null. This is an intrusive way of persisting the data. Originally this was introduced as part of the completeness metric. In fact, these checks for empty containers should be performed in the metric itself. Revisit this problem, remove the modification and make sure that the metrics still work correctly.
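
A minimal sketch of performing the emptiness check inside the metric instead of rewriting the harvested record; value_present? is a hypothetical helper of the completeness metric:

def value_present?(value)
  case value
  when nil    then false
  when String then !value.strip.empty?
  when Array  then value.any? { |element| value_present?(element) }
  when Hash   then value.values.any? { |element| value_present?(element) }
  else             true
  end
end

value_present?('')                 # => false
value_present?([nil, ''])          # => false
value_present?('tags' => ['gov'])  # => true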
