
hyrax-migrator's Introduction

Hyrax::Migrator

Short description and motivation.

Usage

How to use my plugin.

Installation

Add this line to your application's Gemfile:

gem 'hyrax-migrator'

And then execute:

$ bundle

Or install it yourself as:

$ gem install hyrax-migrator

Development

Copy the included example docker-compose.override.yml to map port 3000, so that the server can be reached at http://localhost:3000:

$ cp docker-compose.override.yml.example docker-compose.override.yml

Use the included docker-compose.yml to start a local server for development.

$ docker-compose up server

Contributing

Contribution directions go here.

License

The gem is available as open source under the terms of the MIT License.

hyrax-migrator's People

Contributors

lsat12357, revgum, dependabot[bot], wickr, cgillen


hyrax-migrator's Issues

A work can be processed in the background

Model a work such that it relates to a BAG file containing the metadata, original files, migration status, and has the ability to be "retried".

The fundamental idea here is that the ID of a work in hyrax-migrator could be fetched from a queue, and a worker is able to perform the migration in whole.

Linda liked the idea of having a stack of actors that the work would be processed through during migration.

Each of these actors could add some event logging/database records of its progress/failure.

Remove attributes in AdminSetMembershipActor

After the set and primary_set have been mapped to the collection membership property, remove these two attributes from env[:attributes] because they are not valid metadata fields.
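A minimal sketch of that cleanup, assuming env[:attributes] is a plain Hash with symbol keys. The helper name remove_set_attributes and the member_of_collection_ids key are assumptions for illustration, not names taken from the actor.

```ruby
# Hypothetical helper: strips the two OD1-only keys once collection
# membership has been mapped. Assumes env[:attributes] is a Hash.
def remove_set_attributes(env)
  %i[set primary_set].each { |key| env[:attributes].delete(key) }
  env
end

env = {
  attributes: {
    title: ['Example'],
    set: ['example-set'],
    primary_set: 'example-set',
    member_of_collection_ids: ['example-set'] # assumed key name
  }
}
remove_set_attributes(env)
puts env[:attributes].keys.inspect
```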

Collection / AdminSet membership service

Descriptive summary

A new work needs to be associated to an adminset (and possibly multiple collections).

A configuration for each Work Type could point at a Service that is responsible for parsing the metadata to identify the appropriate AdminSet and Collection(s). The Service can be written (and overridden) to meet our specific OD2 needs but allow for anyone downstream to tailor the memberships to their own needs.

Crosswalk special (linked data) metadata to a complex shape

For OD2 we have a need to crosswalk some fields but provide a complex shape of data as the result of the crosswalk (so that the Hyrax ingest handles it properly).

One such attribute is format (there are others that have the same need for the following shape of data).

OD2 format metadata should be added to the Work.env[:attributes] as format_attributes: [{ 'id' => the_original_url_from_od1_string, '_destroy' => 0}].

Maybe we could replace format with a configuration in the crosswalk yaml files that names a function. The function has access to the original metadata value and returns an arbitrary shape of data.

...
  - property: format_attributes
    predicate: http://purl.org/dc/terms/format
    multiple: true
    function: attributes_data
...

And in the CrosswalkMetadataService we can introduce the method

    def attributes_data(object)
      {'id' => object.to_s, '_destroy' => 0}
    end

@lsat12357 Thoughts?
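One way the dispatch on function could look, as a sketch only. The class name CrosswalkSketch and the transform method are hypothetical; attributes_data is the method proposed above, and the example.org URL is a placeholder value.

```ruby
# Sketch only: CrosswalkSketch and #transform are hypothetical names.
class CrosswalkSketch
  # Returns the nested-attributes shape described in this issue.
  def attributes_data(object)
    { 'id' => object.to_s, '_destroy' => 0 }
  end

  # Applies the optional function named in a crosswalk entry to one value.
  # Falls back to the plain string when no function is configured.
  def transform(entry, value)
    return value.to_s unless entry[:function]

    public_send(entry[:function], value)
  end
end

entry = {
  property: 'format_attributes',
  predicate: 'http://purl.org/dc/terms/format',
  multiple: true,
  function: 'attributes_data'
}
result = CrosswalkSketch.new.transform(entry, 'http://example.org/format/jpeg')
puts result.inspect
```

A real service would probably want to restrict function to a known list of method names before calling public_send.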

Check workflow metadata for destroyed status before migrating

OD1 has a "soft delete" feature which hides items in the interface but doesn't actually remove them from the repository. At this point we don't have a reason to migrate them, but they may be exported.

The migrator needs to check workflowMetadata.yml very early for destroyed: true and if so, reject the work.
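A rough sketch of that check, assuming destroyed is a top-level key in workflowMetadata.yml. The helper name destroyed? is made up for illustration, and the real file layout may differ.

```ruby
require 'yaml'

# Hypothetical helper: true when the workflow metadata marks the item as
# soft-deleted. Assumes `destroyed` is a top-level key in the YAML.
def destroyed?(yaml_text)
  data = YAML.safe_load(yaml_text)
  data.is_a?(Hash) && data['destroyed'] == true
end

puts destroyed?("destroyed: true\n")
puts destroyed?("destroyed: false\n")
```

An actor early in the stack could call this on the file contents and fail the work before any other processing happens.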

Refactor actor methods to AbstractActors

Actors which inherit from AbstractActor would benefit from a consistent set of methods for setting a work in the success and fail state as well as defaulting to calling the next actor during a success state.
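One possible shape for those shared methods, as a sketch. The names SketchActor, succeeded, failed, and RequireTitleActor are hypothetical, and recording the status in env is an assumption made to keep the example self-contained (the real actors may update the Work model instead).

```ruby
# Sketch only: succeeded/failed and the env status keys are hypothetical.
class SketchActor
  attr_reader :next_actor

  def initialize(next_actor = nil)
    @next_actor = next_actor
  end

  def create(_env)
    raise NotImplementedError, 'An actor class must be able to #create'
  end

  private

  # Records success, then hands the env to the next actor if there is one.
  def succeeded(env, message = 'ok')
    env[:status] = 'success'
    env[:status_message] = message
    next_actor ? next_actor.create(env) : true
  end

  # Records failure and stops the stack.
  def failed(env, message)
    env[:status] = 'error'
    env[:status_message] = message
    false
  end
end

# Example subclass: passes only when a title is present.
class RequireTitleActor < SketchActor
  def create(env)
    env.dig(:attributes, :title) ? succeeded(env) : failed(env, 'title is missing')
  end
end

env = { attributes: { title: ['Example'] } }
puts RequireTitleActor.new.create(env)
```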

ListChildrenService

This service pulls the children out of the n-triples (nt) of a compound object and creates the work_members_attributes hash, which seems to be in this form:
{
'0' => { 'id' => 'abcde1234' },
'1' => { 'id' => 'abcde1235' }
}
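Building that hash from an ordered list of child ids could look like the sketch below. The helper name is hypothetical, and how the ids are read from the n-triples is left out.

```ruby
# Hypothetical helper: turns an ordered list of child ids into the
# index-keyed hash shown above.
def work_members_attributes(child_ids)
  child_ids.each_with_index.map { |id, index| [index.to_s, { 'id' => id }] }.to_h
end

puts work_members_attributes(%w[abcde1234 abcde1235]).inspect
```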

aasm

Josh suggests using github.com/aasm/aasm

Collection / AdminSet membership actor

Descriptive summary

Build an actor which extracts the metadata (BAG), and uses the collection/adminset service (#30) to persist crosswalked membership to the work's env. An actor down the line will be responsible for combining and saving the crosswalked metadata, the uploaded file paths, and additional properties (collection association, etc).

Crosswalk Metadata ignore property capability

Provide the means, in the crosswalk yaml file, to mark a property/predicate as ignore: true. This would indicate that the code should explicitly not migrate the data.

The idea here is that all predicates in the original data are accounted for, even if we are making a conscious decision to not migrate the metadata for a predicate.

Thoughts?
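A small sketch of how ignore: true could be honored while still flagging predicates that have no crosswalk entry at all. The classify helper and its return symbols are hypothetical, and the entries below are example data only.

```ruby
# Classifies a predicate as :migrate, :ignore, or :unknown so that every
# predicate in the source data is accounted for. Hypothetical helper.
def classify(predicate, crosswalk)
  entry = crosswalk.find { |e| e[:predicate] == predicate }
  return :unknown if entry.nil?

  entry[:ignore] ? :ignore : :migrate
end

crosswalk = [
  { predicate: 'http://purl.org/dc/terms/title', property: 'title' },
  { predicate: 'http://purl.org/dc/terms/modified', ignore: true }
]

puts classify('http://purl.org/dc/terms/title', crosswalk)
puts classify('http://purl.org/dc/terms/modified', crosswalk)
puts classify('http://purl.org/dc/terms/format', crosswalk)
```

Treating :unknown as an error would surface any predicate that nobody has made a decision about yet.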

Crosswalk Metadata Service overrides

If the crosswalk and crosswalk_overrides have the same predicate, the configuration found in the overrides should be used for that predicate instead of the default.
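The override rule can be sketched as a merge keyed on predicate. The helper name merge_crosswalk and the entry shape (hashes with a :predicate key) are assumptions for illustration.

```ruby
# Hypothetical helper: entries from the overrides replace default entries
# that share the same predicate; all other defaults are kept.
def merge_crosswalk(defaults, overrides)
  by_predicate = {}
  (defaults + overrides).each { |entry| by_predicate[entry[:predicate]] = entry }
  by_predicate.values
end

defaults = [
  { predicate: 'http://purl.org/dc/terms/title', property: 'title' },
  { predicate: 'http://purl.org/dc/terms/format', property: 'format' }
]
overrides = [
  { predicate: 'http://purl.org/dc/terms/format', property: 'format_attributes' }
]

puts merge_crosswalk(defaults, overrides).inspect
```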

Work can handle pulling a file from S3

When a Work has a URL in file_path, it can handle fetching the file on initialization.

The use case behind this idea is that the original zipped bag files would be stored in S3. Each file should be fetched and extracted before the actors/services start working.

Migrate a work as a background job

Create a Rails ActiveJob, like MigrateWorkJob, that wraps the execution of the migrator's actor stack so that it can be run in the background and scaled horizontally by the number of workers running.

The MigrateWorkJob would essentially wrap something like:

work = find_or_create_work(PID, FILE_PATH) # finds the Hyrax::Migrator::Work with pid PID, or creates it
Hyrax::Migrator::Middleware.default.start(work)

Model lookup Service and Actor

Create an Actor and related Service that is responsible for inspecting the metadata and using a crosswalk yaml file to determine which type of registered model will be used for the migration.

license metadata invalid and mismapped?

The test bag has metadata that is being mapped to license, it seems like it should be rights_statement.

Also, the value here isn't a valid license or rights_statement in OD2.

<http://oregondigital.org/resource/oregondigital:3t945r08v> <http://purl.org/dc/terms/rights> <http://opaquenamespace.org/ns/rights/rr-f> .

rake task to populate admin_sets?

I guess this might actually go in OD2 rather than here, but I was fiddling around with Hyrax::AdminSetCreateService.

We can't add the admin_set_membership_actor into the stack until we have all the sets in.

track environment as a Hash on the work model

Generate a migration to add environment field which serializes as a Hash

Add to the model:

serialize :environment, Hash

And the environment field is stored in the database as a text field

Crosswalk metadata service

Descriptive summary

Each registered model in the engine should point at the Hyrax "Work Type" that is part of the migration plan. Each of these "Work Types" needs some configuration describing how to crosswalk metadata from the original file (n-triples?) to the new work type's properties.

Expected behavior

It could be a good first step to expect a configuration yaml in config/migrator/ for each Work Type to describe the crosswalk.

# config/hyrax-migrator/image.yml
---
crosswalk:
  - uri: http://purl.org/dc/terms/title
    property: subject
    multivalue: true
  - uri: http://purl.org/dc/terms/format
    property: format
    multivalue: true

compound objects data structure

@wickr and @lsat12357 to pair up and translate the OD1 compound object structure to something that will work in OD2.

A compound object maps to these types of notions:

  • A book with multiple pages
  • A scanned image with its article

These might work as a parent => child, or even a child having multiple parent works.

During migration, we need the ability to make the translation prior to creating the work.

OD1 BAG file details to aid in processing a migration

  • {oregon digital PID}_RELS-EXT.xml : has the Fedora model (use this as a crosswalk/lookup to find the OD2 model), although it may be best to use the descMetadata.nt
  • {oregon digital PID}_descMetadata.nt : has the full n-triples metadata
  • {oregon digital PID}_content.* : The original file
  • {oregon digital PID}_rightsMetadata.xml : One thing pulled out of here to aid in collection? association. @wickr?

Use rightsMetadata.xml for assigning visibility

  • VisibilityService
    • Uses data/*rightsMetadata.xml file
    • Service should set *visibility in Work#env[:attributes]
    • Consider this for setting embargo as well?
  • VisibilityActor

See OD2 ticket for groups / visibility discussion: OregonDigital/OD2#372

Example metadata:

<rightsMetadata xmlns="http://hydra-collab.stanford.edu/schemas/rightsMetadata/v1" version="0.1">
  <copyright>
    <human type="title"/>
    <human type="description"/>
    <machine type="uri"/>
  </copyright>
  <access type="discover">
    <human/>
    <machine/>
  </access>
  <access type="read">
    <human/>
    <machine>
      <group>admin</group><group>archivist</group><group>public</group>
    </machine>
  </access>
  <access type="edit">
    <human/>
    <machine/>
  </access>
  <embargo>
    <human/>
    <machine/>
  </embargo>
</rightsMetadata>
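A sketch of how a VisibilityService might read the read-access groups from this XML. The helper names are hypothetical, and the mapping itself (a public read group means 'open', anything else 'restricted') is an assumption, since the actual rules are still under discussion in the OD2 ticket.

```ruby
require 'rexml/document'

# Hypothetical helper: collects the groups granted a given access type.
# Elements are matched by local name to sidestep namespace handling.
def access_groups(xml, type)
  doc = REXML::Document.new(xml)
  groups = []
  doc.root.each_element do |access|
    next unless access.name == 'access' && access.attributes['type'] == type

    access.each_element do |machine|
      next unless machine.name == 'machine'

      machine.each_element { |group| groups << group.text if group.name == 'group' }
    end
  end
  groups
end

# Assumed mapping: a 'public' read group means open visibility.
def visibility_for(xml)
  access_groups(xml, 'read').include?('public') ? 'open' : 'restricted'
end

xml = <<~XML
  <rightsMetadata xmlns="http://hydra-collab.stanford.edu/schemas/rightsMetadata/v1" version="0.1">
    <access type="read">
      <human/>
      <machine>
        <group>admin</group><group>archivist</group><group>public</group>
      </machine>
    </access>
  </rightsMetadata>
XML

puts visibility_for(xml)
```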

create a Hyrax::Migrator::Actors::AbstractActor

Build an abstract actor that all actors inherit from. It holds a reference to the next_actor in the stack so that once its own processing has completed successfully, it can cause the next actor to operate.

Any actor inheriting from this class will have easy access to starting the next actor in line, and must have a create method.

module Hyrax
  module Migrator
    class AbstractActor
      ##
      # @!attribute next_actor [r]
      #   @return [AbstractActor]
      attr_reader :next_actor

      ##
      # @param next_actor [AbstractActor]
      def initialize(next_actor)
        @next_actor = next_actor
      end

      ##
      # Call the next actor, passing the env along for processing 
      # @param env [Hash] - the state of processing this work by actors, persisted with the Work model
      def next(env)
        @next_actor.create(env)
      end
      
      ##
      # Create must be overridden by an actor inheriting this class
      def create(env)
        raise NotImplementedError, "An actor class must be able to #create"
      end
    end
  end
end

bag_file_location_service

Build a service that depends on a hyrax-migrator configuration (much like upload storage service) that can access a filesystem path or S3 bucket for accessing original BAG files for migration.

ListChildrenActor

This actor would call a service to get the list of children and add it to the work env. Potentially this actor and the CreateRelationshipsActor might not run as part of the main migration stack, but be included in a smaller stack that runs after we are certain the parent and children have been ingested.

user lookup error in file_upload_actor

this line
@current_user = Hyrax::Migrator::HyraxCore::User.find(email: config.migration_user)
gets this error
TypeError: can't cast Hash
from /usr/local/bundle/gems/activerecord-5.1.6.2/lib/active_record/connection_adapters/abstract/quoting.rb:45:in `rescue in type_cast'

change to
@current_user = Hyrax::Migrator::HyraxCore::User.find(config.migration_user)
?

[EPIC] Create Actors for processing a Work

  • validate the BAG, get the work type from the file and set the work.env with details
  • read the n-triples, build a graph and persist that in the work.env
  • crosswalk the metadata from the work.env, setting a new env value to a hash used to set the attributes for the new work to be persisted in Hyrax
  • identify which AdminSet/Collection and visibility to set for the new work, store this in work.env (admin_set relates to the attributes set by the CrosswalkActor)
  • upload the original file(s) from bag to the proper location for Hyrax ingest, could be shared storage (BrowseAnything enabled in Hyrax application), store file metadata and paths in work.env
  • use work.env; instantiate a new work from Hyrax::Migrator.config.registered_models based on the work type identified, set attributes hash, associate uploaded files to the work, save it, associate the work to the proper collection(s)/admin_sets, set the work as having been published
  • Check the published work to see if the metadata matches in Hyrax and the expected filesets have been associated with the work. This is the last actor to run, and the migration is complete once it succeeds. It is expected that this actor won't pass immediately; that depends on how long it takes Hyrax to fully ingest a work. Also, this actor will depend on a mechanism to query the work and filesets to verify the work is completely migrated. Possibly use the Work type to query its data and related filesets to validate they are complete.

See tickets
#26
#27
#28
#29
#30
#31
#32

Original file upload service

Descriptive summary

Given the configuration in the initializer, upload the original binary file(s) to the proper path used by the application (browse everything). There should be the capability to upload the file to a file system path, or to an S3 bucket.

Expected behavior

The engine initializer should have configurations (values coming from ENV) for a file system path, S3 credentials, and S3 bucket. The service would inspect the configurations to decide where the appropriate storage location is, store a file in that location, and return details about the file and its current location.
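A sketch of the file system branch only, using the Ruby standard library. The config keys (:s3_bucket, :file_system_path), the helper name, and the shape of the returned details are all assumptions; the S3 branch is deliberately left unimplemented here.

```ruby
require 'fileutils'
require 'tmpdir'

# Sketch only: config keys and the returned hash are hypothetical.
# Copies the file to the configured directory and reports where it went.
def upload_original(source_path, config)
  raise NotImplementedError, 'S3 upload is out of scope for this sketch' if config[:s3_bucket]

  FileUtils.mkdir_p(config[:file_system_path])
  destination = File.join(config[:file_system_path], File.basename(source_path))
  FileUtils.cp(source_path, destination)
  { 'path' => destination, 'size' => File.size(destination) }
end

Dir.mktmpdir do |dir|
  source = File.join(dir, 'example_content.txt')
  File.write(source, 'original file bytes')
  details = upload_original(source, { file_system_path: File.join(dir, 'uploads') })
  puts details['path']
end
```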

Create Work model

Work data model:

  • file_path (string) : the full path to the BAG file to be processed
  • aasm_state (string) : the current state used by aasm gem
  • status_message (string) : a string message to display with the current status
  • status (string) : ['success', 'error', 'pending']

MigrationValidator Actor/Service

Periodically (maybe a background job running a daily rake task?), check every Work that has been published to OD2 but hasn't been validated.

  • Using Work type, query for validations
    • Evaluate metadata existence on OD2
    • Evaluate attached files/children and metadata existence on OD2
  • If validations pass, then update the Work to mark it as validated

CreateRelationshipsActor

This actor follows the ListChildrenActor and calls the hyrax stack with an update.
The parent and children need to exist before this runs.
The parent and children need to be owned by the same person, in this case migration_user.

Crosswalk metadata actor

Descriptive summary

Build an actor which extracts the metadata (BAG), and uses the crosswalk service (#27) to persist crosswalked metadata to the work's env. An actor down the line will be responsible for combining and saving the crosswalked metadata, the uploaded file paths, and additional properties (collection association, etc).

FileUpload Actor

Descriptive summary

Build an actor which identifies the original files in the work being processed (BAG), uses the FileUpload service (#26) to upload each to the proper location, and persists the path of each to the work's env. An actor down the line will be responsible for combining and saving the crosswalked metadata, the uploaded file paths, and additional properties (collection association, etc).

add children exist actor/service

list children actor calls service, stashes the returned data in env if children returned
children exists actor calls service if there are children
success if service returns true
fail if service returns false
else success if no children in the env
persist actor
add relationships actor moves the children from env into attributes, calls service

Create DefaultMiddlewareStack

This is the actor stack:

  • it should be able to start the actors from the beginning
  • it should be able to start an actor at any given position in the stack
  • it should have #start(aasm_state = nil), where aasm_state is in the form of {actor class}-{actor's most recent state in processing}
  • it should have configurations that are set in an initializer
  • it should ship with a default actor stack that can be overridden in the initializer
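The requirements above can be sketched as a small stack builder. All class names here are hypothetical, and resuming is simplified to passing the actor class to start from rather than parsing an aasm_state string.

```ruby
# Sketch only: every class name below is hypothetical.
# Terminator ends the chain so the last real actor has something to call.
class Terminator
  def create(_env)
    true
  end
end

# Records its own class name in env, then calls the next actor.
class LoggingActor
  def initialize(next_actor)
    @next_actor = next_actor
  end

  def create(env)
    (env[:log] ||= []) << self.class.name
    @next_actor.create(env)
  end
end

class FirstActor < LoggingActor; end
class SecondActor < LoggingActor; end
class ThirdActor < LoggingActor; end

class SketchStack
  def initialize(actor_classes)
    @actor_classes = actor_classes
  end

  # Builds the chain back to front, starting at the given actor class
  # (or at the beginning when none is given), then runs it.
  def start(env, from: nil)
    index = from ? @actor_classes.index(from) : 0
    raise ArgumentError, "unknown actor: #{from}" if index.nil?

    chain = @actor_classes.drop(index).reverse.inject(Terminator.new) do |next_actor, klass|
      klass.new(next_actor)
    end
    chain.create(env)
  end
end

stack = SketchStack.new([FirstActor, SecondActor, ThirdActor])
env = {}
stack.start(env, from: SecondActor)
puts env[:log].inspect
```

An initializer could then replace the list of actor classes to override the default stack.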
