konradreiche / metadata-census
A platform for monitoring the quality of metadata.
License: MIT License
Since the Elasticsearch indexing has been shifted from the Metadata Harvester to Metadata Census, a JSON dump importer is needed for Metadata Census that performs this task.
There are alternatives and adjustments to the Flesch reading ease that make it applicable to other languages like German and Spanish, too. These adjustments need to be applied to the accessibility metric by using the whatlanguage gem to determine the language beforehand.
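A minimal sketch of the detection step, assuming the whatlanguage gem is in the Gemfile (the sample strings are only illustrative):

require 'whatlanguage'

# Determine each record's language before choosing a reading-ease formula.
wl = WhatLanguage.new(:all)
wl.language('Die Qualität der Metadaten ist entscheidend.')  # => :german
wl.language('The quality of the metadata is crucial.')       # => :english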
There are currently two progress bars: one displaying the initialization of the metadata records for a metric and another displaying the computation progress. Somehow, for some metrics like the link checker, the latter progress bar fills up multiple times, probably due to different states in the process. This should be unified so the progress bar only fills up once.
I have refactored the route configuration. Since the administration control issues POST requests to schedule a computation and GET requests to retrieve the status, these need to be updated to the new route configuration accordingly.
Implement an analysis page for the license metric. This could include some basic metadata record information (id, name), the license, and whether the license is OKD and/or OSI compliant. The graphs tab would certainly include the distribution of the different licenses, as well as a visualization of the number of open data licenses versus non-open data licenses.
Which quality factors go into a total score to assess the quality of a repository is clearly very subjective. Hence, the user should be able to configure which metric results go into the total score.
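A minimal sketch of such a configurable aggregation, assuming per-user weights are stored as a hash from metric name to weight (all names and values here are assumptions):

# Average only the metric scores the user selected, weighted by their configuration.
def total_score(metric_scores, weights)
  selected = metric_scores.select { |name, _| weights.key?(name) }
  return nil if selected.empty?
  weighted = selected.inject(0.0) { |sum, (name, score)| sum + score * weights[name] }
  weighted / weights.values.inject(:+)
end

total_score({ completeness: 0.8, accuracy: 0.6 }, { completeness: 1, accuracy: 2 })
# => 0.666... ((0.8 * 1 + 0.6 * 2) / 3)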
In Open Data, licenses play a crucial role. While the choice of license is not a quality factor itself, it is a factor in determining whether the data is open or not. In addition, this license overview can be used to determine whether the license type has been set to a distinct value, which in turn is a quality factor.
This can be approached in a similar way to the OKFN Open Data Census. It should, however, not only differentiate between open and non-open licenses, but between all the different types.
So far I have approached the detail view of each metric differently. This should be consolidated into a clean design which is applied to every metric.
The results of the different metric scores should be aggregated into an average score. This score should only be available once all the metrics have been applied, in order to enable comparability.
A job should be set up to harvest all repositories automatically. In addition, previous records should not be discarded but kept. This will be a step towards timeline-oriented quality control.
In order to understand a certain set of problems with the Link Checker metric results, a detail view should be implemented which lists all the URLs together with their response code, timeout, etc.
For comparison of the different repositories a leaderboard is a crucial feature. This leaderboard should compare the scores of the different repositories.
When the feature in issue #8 has been tackled, a timeline should be implemented in order to keep a record of quality changes over time. This way the improvement and/or decline of a repository's quality can be tracked. The first approach should be based on #14 and offer a slider to move between the different snapshots.
Would be really keen to see some documentation of what you're measuring!
While checking the MIME type is a practicable approach to validating the format field, it is often far from correct. A problem occurs when the server hosting the resources is badly configured: a wrong MIME type will be returned even though the resource complies with the format.
An alternative or additional approach is to download the resource and try to detect the file type with Unix programs like file.
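For instance, a minimal sketch that shells out to file(1) after the resource has been downloaded (the path is an assumption):

# -b omits the file name, --mime-type prints only the detected MIME type.
mime = `file -b --mime-type /tmp/downloaded_resource`.strip
mime  # => e.g. "text/csv" or "application/pdf"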
Since I am aiming to provide a button to compute all metrics on all repositories, it is equally reasonable to provide a visualization of the total progress of all workers.
By analyzing the formats in the metadata records as well as the returned MIME types, one can learn what the typical values are. This statistical evidence should be used to improve the Accuracy metric.
Particle effects as implemented with three.js can create a nice ambient effect. They should be implemented either on the landing page or on a separate page. On the landing page they would serve an ambient function; here the number of particles should be reduced.
If used on a separate page they could serve as a visualization, for instance of the quality of the metadata records: each particle would encode a metadata record and present its quality through its coloring.
The workers spend different amounts of time in different states: fetching the metadata, preprocessing it, computing the metric, and writing the results back. These different states should be communicated to the interface, for instance by using different progress bars.
The correct spelling of description texts is a quality factor. This can be checked by known spell checkers. The language of each metadata record needs to be known beforehand.
Currently, the Sidekiq workers cannot be terminated by sending a KILL signal. The workers are told to shut down, but since there is no hook, it does not happen.
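A sketch of what such a hook could look like, assuming a Sidekiq version that exposes server lifecycle events:

Sidekiq.configure_server do |config|
  # Invoked when Sidekiq begins shutting down; persist intermediate state
  # or release resources here so the workers can terminate cleanly.
  config.on(:shutdown) do
    Rails.logger.info('Sidekiq shutting down, cleaning up workers')
  end
end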
I have implemented a metadata record picker for the Richness of Information metric which enables the user to pick a metadata record from the best and worst records. This should be generalized so it can be applied to every metric. In addition, the user should not be limited to selecting one of the ten best or ten worst records.
A separate administration control panel would help to keep the user interface clean. This way the computation processes can be visualized in more detail without cluttering the repository metric results for the end user.
The metric details for analysis purposes are still persisted as report. Since I have abandoned this naming convention, it should be renamed to analysis. The analysis object should be returned together with the score in the compute method, thus abolishing the attribute as well.
Due to the removal of submenus from dropdown menus in Bootstrap 3, I have decided to reimplement the metadata record selection. The simple selection from the best and worst records with respect to a certain metric can be kept. The metadata record search should be moved into a modal; this way the user is prompted in a window to perform the search.
The diagram to display the Accessibility metric results is broken. The results fetched from the database are null.
It is unnecessarily cumbersome to press the select button every time a repository is chosen. Hence, the interface should be improved by using JavaScript to process click events.
The score meter was a nice way to animate the score of the whole repository as well as the score of a single metric. It should be added to both the repository and the metric page. This time the color should change based on the achieved score.
The score for a metadata record is still stored in an instance attribute, which makes it necessary to reset the score every time the compute method is called. This does not make any sense. Instead, the score should be returned when the compute method is executed.
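A sketch of the intended interface, using a hypothetical metric (the record argument and its hash-like structure are assumptions):

module Metrics
  class Completeness < Metric
    def compute(record)
      return 0.0 if record.empty?
      present = record.values.count { |value| !value.nil? }
      # Return the score instead of mutating an instance attribute.
      present / record.size.to_f
    end
  end
end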
A general user interface concept is still missing. This milestone week should be devoted to redesigning the interface to make the information more accessible. This should be done by viewing the user concept as follows:
The metric results are still stored without a designated schema. The results are spread across questionable fields, for example one named richness_of_information_details, instead of adopting a hierarchy. This needs to be refactored to look like this:
richness_of_information: {
  score: 0.75,
  report: {
    [:tags, 0] => 0.25,
    [:tags, 1] => 0.5,
    ...
  }
}
The Link Checker metric works so far; however, it does not handle requests that are answered with HTTP 301 or HTTP 302. In these cases the redirect should be followed. There may be further cases which have to be considered. One way to implement this is by re-queuing the updated request into the dispatcher.
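Alternatively, a minimal sketch of following redirects directly with Net::HTTP (the recursion depth limit is an assumption):

require 'net/http'

# Follow up to `limit` redirects and return the final response code.
def final_status(url, limit = 5)
  raise 'Too many redirects' if limit.zero?
  response = Net::HTTP.get_response(URI(url))
  if response.is_a?(Net::HTTPRedirection)
    final_status(response['location'], limit - 1)
  else
    response.code
  end
end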
The styling of the administration control is still broken due to the upgrade of Twitter Bootstrap to version 3. Simply apply the new stylings.
The metric computation as currently implemented does not work on large metadata record sets: for the metadata records from catalog.data.gov, about 100,000 records have to be processed at once. The whole metric computation needs to be refactored to meet this requirement.
The metric creation and processing is now quite general. In order to add a new metric, it suffices to create a new metric class in the app/models/metrics directory. This could be improved even further by adding a rake command that creates the metric class from a template with this or a similar structure:
module Metrics
  class MetricName < Metric
    attr_reader :score

    def initialize
      @score = 0.0
    end

    def compute
    end
  end
end
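A sketch of what such a rake task could look like (the task name and template details are assumptions):

# lib/tasks/metrics.rake
namespace :metrics do
  desc 'Create a new metric class from the template'
  task :new, [:name] => :environment do |_task, args|
    name = args[:name] or abort 'Usage: rake metrics:new[MetricName]'
    path = Rails.root.join('app', 'models', 'metrics', "#{name.underscore}.rb")
    # Write the class skeleton, interpolating the given metric name.
    File.write(path, <<-RUBY)
module Metrics
  class #{name} < Metric
    attr_reader :score

    def initialize
      @score = 0.0
    end

    def compute
    end
  end
end
    RUBY
    puts "Created #{path}"
  end
end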
In order to fetch the current worker progress I implemented an AJAX call which is repeated until a certain condition is met. Somehow the loop does not work when I use an anonymous function for the callback. I would like to find out why: on the one hand I am just curious, and on the other hand I need to pass additional parameters, which can be achieved with an anonymous function (or currying or partial application).
After starting to refactor my routes to a RESTful design, I should refactor the routes for the administration control, too. They should be changed from
/admin/control?repository=data.gov.uk
to
/admin/repository/data.gov.uk/
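A minimal sketch of the corresponding route definition (resource naming is an assumption; the constraint allows dots in repository names like data.gov.uk):

# config/routes.rb
namespace :admin do
  resources :repositories, path: 'repository', param: :name,
                           constraints: { name: /[^\/]+/ },
                           only: [:show, :create]
end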
Ruby on Rails 4 has been released as version 4.0.0 and I would like to upgrade the application. There are plenty of resources describing how to do that.
When a new metric is added, a number of things need to be done before the metric can be computed through the interface. While creating the metric class is inevitable, the other steps can be abstracted. Currently this requires changing code in:
app/view/metrics/overview.html.erb
app/controller/metrics_controller.rb
app/assets/javascript/metrics.js.coffee
app/models/repository.rb
A generic metric report that is used as a fallback when no concrete metric report is available can make sense if designed sufficiently generally. This would avoid having to deal with metrics which have no dedicated report yet.
After refactoring the Accuracy metric, the per-record scores are not saved anymore. This should be implemented once again. This time the scores need to be tracked during the initialization phase so they can be fetched afterwards by calling the compute method.
The repositories page with the sub-pages index, map, and leaderboard is not so heavyweight that it would make sense to use different views. Instead I could render them all on the same page and toggle their visibility using Bootstrap tabs. This will also speed up the transitions between the tabs.
There is a huge quality deficit in metadata records containing URLs that are answered with HTTP 404. A link checker can easily be implemented by sending HTTP HEAD requests.
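A minimal sketch of such a check with Net::HTTP (the URL and timeouts are assumptions):

require 'net/http'

uri = URI('http://example.com/data.csv')
response = Net::HTTP.start(uri.host, uri.port, open_timeout: 5, read_timeout: 5) do |http|
  # HEAD fetches only the headers, avoiding a full download.
  http.head(uri.request_uri)
end
response.code  # => "200", "404", ...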
Since I am starting to tackle issue #8, it seems reasonable to restructure how metadata records are stored. Until now I have stored the metadata record and meta-metadata, like metric results, at the same hierarchy level. This can be improved by moving the original source record into its own field, for instance document.
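A sketch of the proposed layout (all field names besides document are assumptions):

record = {
  'document' => { 'id' => 'dataset-1', 'title' => 'Example dataset' },  # original source record
  'metrics'  => { 'completeness' => { 'score' => 0.9 } }                # meta-metadata such as metric results
}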
Schema compliance is a quality factor. It can be checked by using a JSON schema validator. The results would include whether the dataset is schema compliant or not and which violations have been made.
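A minimal sketch using the json-schema gem (the gem choice and file paths are assumptions):

require 'json'
require 'json-schema'

schema = JSON.parse(File.read('schema/dataset.json'))
metadata_record = JSON.parse(File.read('record.json'))
# fully_validate returns an array of violation messages, empty when compliant.
errors = JSON::Validator.fully_validate(schema, metadata_record)
errors.empty?  # => true when the record is schema compliant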
In the harvester, empty strings, arrays, and hashes are replaced by null. This is an intrusive way of persisting the data. Originally this was introduced as part of the completeness metric. As a matter of fact, these checks for empty containers should be performed in the metric. Revisit this problem, remove the modification, and make sure that the metrics still work correctly.
In order to delegate the visualization of the different metric reports, partials should be created, each implementing the view for one metric. The partial rendering can then be performed based on the selected metric.