Git Product home page Git Product logo

oaipmh-solr's Introduction

oaipmh-solr

A solr plugin for ingesting OAI-PMH records.

Overview

This plug-in implements an OAI-PMH data import handler for Solr. The data import handler indexes OAI-PMH records from a specified provider. The features provided by this plug-in include:

  • Resumption token handling
  • Delta queries
  • Deleted record handling
  • Wait times

Note: While this library has been successfully used to harvest records from registries using oai_dc and ivo_vo metadata formats, it is a work in progress and is still in testing. Please report any bugs in the issues section.

Requirements

Version 2.0 of this library is built for solr 8 and tested against solr 8.11. Note that the data import library has been deprecated and will be removed in Solr 9.

Installation

To install the OAI-PMH , simply download the library and run

mvn package

from the command line. Then copy the generated jar from the targets directory to the lib directory of your core.

You may want to modify pom.xml Solr dependencies to correspond to your version of Solr.

Configuration

Configuration is defined in the standard way for Solr data import handlers, using the standard data-config.xml file, which should be placed in the configuration directory for your Solr core. To enable Solr's data import handler feature, follow the directions on the Solr wiki

Sample configuration files, based on the configuration files distributed with Blacklight's example project, can be found in the sample/config/oai_dc directory in the source code.

Each entity defines fields using a column name that is mapped to a Solr field in Solr's schema.xml and an xpath string that defines the path to the data in the XML document relative to a top level path to a record defined in the forEach attribute of entity element. For standard OAI-PMH provider records in oai_dc format, the xpath strings defined below in the Dublin Core example section should work without modification. Users of this library may want to define field columns according to their own naming preference and indexing schemes. Here we have defined column names with a suffix that indicates the datatype.

For other metadata types, the user will have to define the paths and change the prefix attribute accordingly.

Date fields should include a dateTimeFormat attribute using java.text.SimpleDateFormat syntax.

To import records from multiple sources add additional entity nodes.

Entity attributes:

  • name - Required. A field required by Solr.
  • catalog - (Used on conjunction with the catalogField attribute.) The name of the catalog. This value is stored in the field specified by catalogField.
  • catalogField - (Used in conjunction with the catalog attribute.) Name of the Solr field where the catalog name is stored.
  • from - Date to start record request from.
  • url - Required. The URL of the OAI-PMH provider interface.
  • processor - Required. Must be set to "edu.stsci.registry.solr.OAIPMHEntityProcessor"
  • forEach - Required. The path to the record element in the XML document.
  • prefix - Required. The metadata prefix to request.
  • wait - Wait time in seconds between OAI-PMH continuation requests. By default there is no wait time. If the provider provides a wait time, the provider wait time will be used.
  • idCol - Required for deleted records to be recognized. The name of the column used as a unique record identifier for Solr.

Dublin Core example

<dataConfig>
  <dataSource type="edu.stsci.registry.solr.OAIPMHDataSource"/>
    <document>
      <entity name="record"
	    catalog="mycat"
	    catalogField="catalog_facet"
	    from="2012-01-01"
	    url="http://mysite.edu/oai2"
	    processor="edu.stsci.registry.solr.OAIPMHEntityProcessor"
	    forEach="/OAI-PMH/ListRecords/record"
	    prefix="oai_dc"
	    wait="4"
	    idCol="id">
        <field column="id" xpath="header/identifier"/>
        <field column="title_t" xpath="metadata/dc/title"/>
        <field column="author_s" xpath="metadata/dc/creator/name"/>
        <field column="creator_s" xpath="metadata/dc/creator/name"/>
        <field column="subject_topic_facet" xpath="metadata/dc/subject"/>
        <field column="description_t" xpath="metadata/dc/description"/>
        <field column="publisher_s" xpath="metadata/dc/publisher"/>
        <field column="contributor_s" xpath="metadata/dc/contributor"/>
        <field column="date_dt" xpath="header/datestamp" dateTimeFormat="yyyy-MM-dd"/>
        <field column="type_facet" xpath="metadata/dc/type"/>
        <field column="format_facet" xpath="metadata/dc/format"/>
        <field column="identifier_s" xpath="metadata/dc/identifier"/>
        <field column="source_s" xpath="metadata/dc/source"/>
        <field column="language_facet" xpath="metadata/dc/language"/>
        <field column="rights_s" xpath="metadata/dc/rights"/>
    </entity>
  </document>
</dataConfig>

VO Metadata example

<dataConfig>
  <dataSource type="edu.stsci.registry.solr.OAIPMHDataSource" />
  <document>
    <entity name="record"
	    url="http://mysite.edu/oai2"
	    processor="edu.stsci.registry.solr.OAIPMHEntityProcessor"
	    forEach="/OAI-PMH/ListRecords/record"
	    prefix="ivo_vor"
	    idCol="identifier">
      <field column="id" xpath="header/identifier"/>
      <field column="title_t" xpath="metadata/Resource/title"/>
      <field column="shortName_t" xpath="metadata/Resource/shortName"/>
      <field column="description_s" xpath="metadata/Resource/content/description"/>
      <field column="publisher_t" xpath="metadata/Resource/curation/publisher"/>
      <field column="publisherIVO_facet" xpath="metadata/Resource/curation/publisher/@ivo-id"/>
      <field column="identifier" xpath="metadata/Resource/identifier"/>
      <field column="subject_facet_lc" xpath="metadata/Resource/content/subject"/>
      <field column="type_facet" xpath="metadata/Resource/content/type"/>
      <field column="contentLevel_facet" xpath="metadata/Resource/content/contentLevel"/>
      <field column="date" xpath="header/datestamp" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss"/>
      <field column="facility_facet"  xpath="metadata/Resource/facility"/>
      <field column="instrument_facet" xpath="metadata/Resource/instrument"/>
      <field column="waveband_facet" xpath="metadata/Resource/coverage/waveband"/>
      <field column="validationLevel_facet" xpath="metadata/Resource/validationLevel"/>
      <field column="rights_facet" xpath="metadata/Resource/rights"/>
      <field column="capabilityType_facet" xpath="metadata/Resource/capability/@type"/>
      <field column="capabilityDescription_t" xpath="metadata/Resource/capability/description"/>
      <field column="resultType_facet" xpath="metadata/Resource/capability/interface/resultType"/>
<!--
Additional VO fields that aren't as useful for indexing.
      <field column="creator" xpath="metadata/Resource/curation/creator/name"/>
      <field column="contributor" xpath="metadata/Resource/curation/contributor"/>
      <field column="version" xpath="metadata/Resource/curation/version"/>
      <field column="contactName" xpath="metadata/Resource/curation/contact/name"/>
      <field column="contactAddress" xpath="metadata/Resource/curation/contact/address"/>
      <field column="contactEmail" xpath="metadata/Resource/curation/contact/email"/>
      <field column="contactTelephone" xpath="metadata/Resource/curation/contact/telephone"/>
      <field column="source" xpath="metadata/Resource/content/source"/>
      <field column="referenceUrl" xpath="metadata/Resource/content/referenceURL"/>
      <field column="relationshipType" xpath="metadata/Resource/content/relationship/relationshipType"/>
      <field column="relatedResource" xpath="metadata/Resource/content/relationship/relatedResource"/>
     <field column="footprint" xpath="metadata/Resource/coverage/footprint"/>
     <field column="validationLevel" xpath="metadata/Resource/capability/validationLevel"/>
-->
    </entity>
  </document>
</dataConfig>

Test Script

There is a simple test script that demonstrates how to parse the output of the MAST registry OAI-PMH endpoint.

Once the jar is built, you should be able to run this script with the following command:

java -cp target/lib/httpclient-4.4.1.jar:target/lib/httpcore-4.4.14.jar:target/lib/commons-logging-1.2.jar src/test/java/edu/stsci/registry/solr/TestOAI.java

The script does a single call, parses fields out of the records and also prints the resumption token.

oaipmh-solr's People

Contributors

seweissman avatar

Stargazers

 avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

oaipmh-solr's Issues

Update the help

Tyler Desjardins mentions that we should consider moving emails from help[at]stsci.edu to point to the web portal where possible and appropriate. For HST (or any non-JWST), it is https://hsthelp.stsci.edu . For JWST, it is https://jwsthelp.stsci.edu . Please update info in setup.py, setup.cfg, documentation, etc as appropriate.

Please close this issue if it is irrelevant to your repository. This is an automated issue. If this is opened in error, please let pllim know!

xref spacetelescope/hstcal#317

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.