centerforopenscience / share

SHARE is building a free, open dataset about research and scholarly activities across their life cycle.

Home Page: http://share-research.readthedocs.io/en/latest/index.html

License: Apache License 2.0

Languages: Python 98.48%, CSS 0.34%, HTML 0.83%, Gherkin 0.28%, Dockerfile 0.07%
Topics: metadata, scholarly-communication, science, openscience, python, elasticsearch, data, harvest-data

share's Introduction

SHARE/Trove

SHARE is creating a free, open dataset of research (meta)data.

Note: SHARE’s open API tools and services help bring together scholarship distributed across research ecosystems for the purpose of greater discoverability. However, SHARE does not guarantee a complete aggregation of searched outputs. For this reason, SHARE results should not be used for methodological analyses, such as systematic reviews.

Documentation

What is this?

see WHAT-IS-THIS-EVEN.md

How can I use it?

see how-to/use-the-api.md

How do I navigate this codebase?

see ARCHITECTURE.md

How do I run a copy locally?

see how-to/run-locally.md

Running Tests

Unit test suite

py.test

BDD Suite

behave

share's People

Contributors

aaxelb, adlius, aishanibhalla, andrewsallans, bnosek, chrisseto, cmaimone, csheldonhess, cwisecarver, danielsbrown, efc, erinspace, fabianvf, fayetality, felliott, garykriebel, gitter-badger, icereval, jeffspies, laurenbarker, mattclark, mfraezz, sf2ne, sheriefvt, sloria, ssjohns, stevenholloway, terroni, vpnagraj, zamattiac

share's Issues

[consumer] PLoS API consumer

We could possibly use the ALM API rather than only the search API, though looking into this suggests it may not be worth it, because most fields linking to other metadata are blank.

Possible integrator: SciRate (requires follow-up discussion)

I was just talking to Andrew Sallans, and he commented that the SciRate website (https://scirate.com/) may like to integrate with this, supposing it existed and was able to provide content in a timely fashion, etc.

I've suggested this idea to SciRate here: scirate/scirate#268

(I've created this as an "issue" even though it's maybe not actionable, so feel free to delete it or move it somewhere more appropriate; perhaps the wiki.)

[core] Resource Sync/pubsubhubbub integration: What is a topic?

@zimeon

I have been working on setting up ResourceSync/PubSubHubbub integration for scrAPI, using https://github.com/resync/resourcesync_push. I am curious what exactly a topic is, and what URL should be sent to the hub in order to subscribe to it.

I have tried the URL of the publisher, the location of scrAPI's RSS feed, and the URLs of individual resources, but no matter what I get a 404 from the hub (though it does recognize the subscription request). Is there any way to simply subscribe to a hub and receive all the information it gets?
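
For reference, a PubSubHubbub subscription request is a form-encoded POST to the hub. The sketch below builds such a body using the core field names from the PubSubHubbub spec; the URLs are placeholders, and the docstring's note about topics reflects the 404 behavior described above:

```python
from urllib.parse import urlencode

def subscribe_body(topic_url: str, callback_url: str) -> str:
    """Build the form-encoded body of a PubSubHubbub subscription POST.

    The topic must be the exact feed URL the publisher announces to the
    hub; subscribing to arbitrary resource URLs typically yields a 404
    like the one described above.
    """
    return urlencode({
        "hub.mode": "subscribe",
        "hub.topic": topic_url,
        "hub.callback": callback_url,
    })
```

POST this body to the hub endpoint with a Content-Type of application/x-www-form-urlencoded; the hub then verifies the subscription against the callback URL.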

[content provider] CrossRef

Notes from new content provider (adding old notes from early June 2014):

Cleared SHARE criteria on 6/9/14

API Docs:
https://github.com/CrossRef/rest-api-doc/blob/master/funder_kpi_api.md

http://crosstech.crossref.org/2014/04/%E2%99%AB-researchers-just-wanna-have-funds-%E2%99%AB.html

You don't need a member account. Make sure you are using the REST API documented here:

http://api.crossref.org

Also note that you can simply query the metadata for any DOI using content negotiation. As per:

http://www.crosscite.org/cn/
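
As a sketch of that content-negotiation flow (the DOI in the usage comment is a placeholder; the Accept header is the standard CSL JSON media type):

```python
def doi_request(doi: str):
    """Return the URL and headers for fetching a DOI's metadata via
    content negotiation against the doi.org resolver."""
    url = "https://doi.org/" + doi
    headers = {"Accept": "application/vnd.citationstyles.csl+json"}
    return url, headers

# Usage (requires network access):
#   import requests
#   url, headers = doi_request("10.1000/xyz123")
#   metadata = requests.get(url, headers=headers).json()
```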

Affiliate fees/SLA to help with scaling/performance issues and prevent throttling at millions of users:

At the moment the CrossRef Affiliate fees (and the fees for the SLA agreement) are flat annual fees for unlimited access, based on the annual revenue/income of the organization:

CrossRef Affiliate

  • Total revenue < $1 million: $500
  • Total revenue $1–10 million: $2,000
  • Total revenue > $10 million: $10,000

See more at: http://www.crossref.org/04intermediaries/34affiliate_fees.html

SHARE doesn't need an SLA for the summer prototype, but might once the service starts.

[core] Method of recording the timestamp

In the archive we are recording the consumption date and time with a timestamp like this:

2014-07-21 15:03:21.096378

I'd like to raise a concern about the space in that string. Since we use the timestamp as the name of a directory and possibly as part of a URL, there is ample opportunity for that space to become a problem.

I would urge us to consider using the ISO 8601 suggestions of including a "T" to delimit date and time and a "Z" to document UTC time (and then actually use UTC time):

2014-07-21T15:03:21.096378Z

This would make our timestamp much less likely to cause trouble in shell scripts or URLs.
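
The conversion is straightforward in Python; a minimal sketch, assuming the stored string is a naive UTC timestamp in the format shown above:

```python
from datetime import datetime, timezone

def to_iso8601(ts: str) -> str:
    """Re-emit a space-delimited timestamp in ISO 8601 form, with a "T"
    delimiting date and time and a "Z" documenting UTC."""
    dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f")
    return dt.replace(tzinfo=timezone.utc).isoformat().replace("+00:00", "Z")
```

Going forward, of course, the timestamp should be generated in UTC in the first place, e.g. with `datetime.now(timezone.utc)`.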

[core] Ideas for the experimental RSS feed

I've been watching the experimental RSS feed for a few days and find it very interesting. It can serve as a reasonable model for our thinking about the kind of notification we want to see from SHARE.

Right now the items in the RSS feed look something like this:

<item>
   <title>Symmetric Designs for Helical Spin Rotators at RHIC (12/94)</title>
   <link>http://173.255.232.219/archive/SciTech/1149788/2014-08-21 17:08:41.531480/normalized.json</link>
   <description>Ptitsin V.&lt;br/&gt;Retrieved from SciTech Connect at 2014-08-21 17:08:41.531480&lt;br/&gt;&lt;br/&gt;.</description>
   <guid>1149788</guid>
   <pubDate>2014-08-21 17:08:41.531480</pubDate>
</item>

The title is conveyed from the acquired metadata for the resource. That makes sense.

The link looks like it is attempting to be a link to the RESTful API of the notification service for the record in question, but I find it often does not work. However, I believe the link element should be a link directly back to the source. This is the most critical element of the notification service: every item we harvest or have reported to us must have an actionable link back to the originating source. One way or another, we have to construct those links from the acquired metadata, most likely by combining some identifier with some known base URL. In the case of DOIs these might be set to resolve via CrossRef; in the case of OAI harvests, they should point back to the originating repository.

The guid looks like it is trying to pass along the identifier supplied by the source for this resource. The guid must be unique according to the RSS standard, and we cannot guarantee these identifiers will be unique once stripped of their context. If we make the link a URL to the source, then we can use the guid for a URL back to our notification service, since we can guarantee that would be unique. Every outgoing RSS item is based on a record we have stored in our system, so this guid should be an actionable URL that uses the notification system's RESTful API to point back to that record: pretty much what is currently in the link field, but actually resolving consistently.

The description is somewhat cryptic, providing a variety of information depending on the source, but not really labeling that information in a useful way. The description in RSS is actually capable of being pretty much anything. Since this is an RSS feed, I believe that, for now, we should make it a human-readable expression of the normalized JSON record we are keeping for this item. It could be expressed as a pretty version of the JSON record itself.

While I would like the pubDate to be the date of the actual emergence of the resource (its publication date or the date of its deposit in the holding repository), I can see how this might be somewhat of an abuse of RSS. I think it is fair to keep the date as is. But the actual publication date should be clear in the description.

The result, for the record similar to the one above, would be something like:

<item>
   <title>Los Alamos National Laboratory Overview</title>
   <link>http://www.osti.gov/scitech/servlets/purl/1140136</link>
   <description>{
    "contributors": [
        {
            "email": null, 
            "full_name": "Neu, Mary"
        }
    ], 
    "id": "1140136", 
    "meta": {}, 
    "properties": {
        "article_type": "Conference", 
        "date_entered": "2014-07-21", 
        "date_published": "2010-06-02", 
        "date_retrieved": "2014-07-21", 
        "description": "Mary Neu, Associate Director for Chemistry, Life and Earth Sciences at Los Alamos National Laboratory, delivers opening remarks at the \"Sequencing, Finishing, Analysis in the Future\" meeting in Santa Fe, NM", 
        "doi": null, 
        "research_org": "Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)", 
        "research_sponsor": "USDOE Office of Science (SC), Biological and Environmental Research (BER) (SC-23)", 
        "tags": [
            "59 BASIC BIOLOGICAL SCIENCES"
        ], 
        "url": "http://www.osti.gov/scitech/servlets/purl/1140136"
    }, 
    "source": "SciTech", 
    "timestamp": "2014-07-21 15:03:21.096378", 
    "title": "Los Alamos National Laboratory Overview"
}</description>
   <guid>http://173.255.232.219/archive/SciTech/1140136/2014-07-21 15:03:21.096378</guid>
   <pubDate>2014-08-21 17:08:41.531480</pubDate>
</item>

I think these changes would make for a much more useful RSS feed. How doable would they be?
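
To make the proposal concrete, here is a sketch of assembling such an item from a normalized record. Field names follow the JSON above; `api_base` is a hypothetical base URL for the notification service's API:

```python
import json
import xml.etree.ElementTree as ET

def build_item(record: dict, api_base: str) -> ET.Element:
    """Build an RSS <item> per the proposal: <link> points directly at the
    originating source, <guid> at our own notification-service record, and
    <description> carries a pretty-printed copy of the normalized JSON."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = record["title"]
    ET.SubElement(item, "link").text = record["properties"]["url"]
    ET.SubElement(item, "description").text = json.dumps(
        record, indent=4, sort_keys=True)
    ET.SubElement(item, "guid").text = "%s/%s/%s" % (
        api_base, record["source"], record["id"])
    ET.SubElement(item, "pubDate").text = record["timestamp"]
    return item
```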

[content provider] DataONE

I just added provider documentation for DataONE to the repo. Our contact there is Dave Vieglais. For now it looks like a DataONE harvest would be via its public API.

I told Dave we likely would not get to the DataONE ingest attempt until September or October, though you are welcome to try earlier if you like!

[content provider] DMPTool

I just added provider documentation for DMPTool to the project. Public data management plans from roughly 800 sites can be shared from DMPTool to give consumers of the SHARE Notification Service early notice of new research and grant awards.

Patricia Cruse plans on sending along a pointer to the DMPTool API, I'll update the documentation when I get that. For now it includes the answers to our metadata sharing questions which were cleared today (7/22/2014).

Reporting of resources from local repository

A researcher submits unpublished work to their local SHARE-enabled institutional repository. The repository collects the ORCID, funder, and grant number associated with the work, then sends a report of the submission, with a Handle, to the SHARE notification service. SHARE stores the event in its digest.

Layers Involved

  • Notification Service

Challenges

  • Many repositories are unable to distinguish research-related material from cultural heritage material.
  • Identifiers like ORCID and FundRef are generally not collected by local repositories.

Potential Requirements

  • Repository able to provide Handle or other persistent identifier for resources.
  • SHARE push API to accept push reports from repository.
  • OAI-PMH harvesting tool to gather reports from repository.
  • ResourceSync harvesting tool to gather reports from repository.
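
An OAI-PMH harvest of such reports starts with a ListRecords request; a minimal sketch of building one (the endpoint and date in the test usage are placeholders; the verb and parameter names come from the OAI-PMH 2.0 protocol):

```python
from urllib.parse import urlencode

def listrecords_url(base_url, metadata_prefix="oai_dc", from_date=None):
    """Build an OAI-PMH ListRecords request URL, optionally limited to
    records added or changed since from_date (YYYY-MM-DD)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    return base_url + "?" + urlencode(params)
```

The harvester would fetch this URL, parse the returned XML, and follow any resumptionToken elements to page through the full result set.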

Using ResourceSync for ArXiv Synchronization?

I have been looking at various ways of gathering information from arXiv. ResourceSync seems like it would be a great way to receive information from arXiv; do you agree?
Does arXiv even have ResourceSync support at the moment?

Reporting of resources from publishers

A researcher is publishing an article with a CHORUS-enabled publisher. When CHORUS issues notification of publication, SHARE gets a copy via CrossRef. SHARE stores metadata related to this event (such as the source, the date of release, the researcher's ORCID, funding agency, grant ID, and the DOI of the publication at its publisher) in the notification service event digest.

Layers Involved

  • Notification Service

Challenges

  • Details of SHARE event metadata have yet to be determined. Grant ID is particularly troublesome and may not be possible for years to come.

Potential Requirements

  • Harvesting reports via CrossRef API.

Notes
What about publishers that do not participate in CrossRef? Does each need a use case, or do we handle them similarly to CrossRef, supporting one-off APIs for harvesting from each?
