centerforopenscience / share

SHARE is building a free, open dataset about research and scholarly activities across their life cycle.

Home Page: http://share-research.readthedocs.io/en/latest/index.html

License: Apache License 2.0

Languages: Python 98.48%, CSS 0.34%, HTML 0.83%, Gherkin 0.28%, Dockerfile 0.07%
Topics: metadata, scholarly-communication, science, openscience, python, elasticsearch, data, harvest-data

share's Introduction

SHARE/Trove

SHARE is creating a free, open dataset of research (meta)data.

Note: SHARE’s open API tools and services help bring together scholarship distributed across research ecosystems for the purpose of greater discoverability. However, SHARE does not guarantee a complete aggregation of searched outputs. For this reason, SHARE results should not be used for methodological analyses, such as systematic reviews.

Documentation

What is this?

see WHAT-IS-THIS-EVEN.md

How can I use it?

see how-to/use-the-api.md

How do I navigate this codebase?

see ARCHITECTURE.md

How do I run a copy locally?

see how-to/run-locally.md

Running Tests

Unit test suite

py.test

BDD Suite

behave

share's People

Contributors

aaxelb, adlius, aishanibhalla, andrewsallans, bnosek, chrisseto, cmaimone, csheldonhess, cwisecarver, danielsbrown, efc, erinspace, fabianvf, fayetality, felliott, garykriebel, gitter-badger, icereval, jeffspies, laurenbarker, mattclark, mfraezz, sf2ne, sheriefvt, sloria, ssjohns, stevenholloway, terroni, vpnagraj, zamattiac

share's Issues

[consumer] PLoS API consumer

We could possibly use the ALM API rather than only the search API, though looking into this suggests it may not be worth it, because most fields linking to other metadata are blank.

Possible integrator: SciRate (requires follow-up discussion)

I was just talking to Andrew Sallans, and he commented that the SciRate website (https://scirate.com/) may like to integrate with this, supposing it existed and was able to provide content in a timely fashion, etc.

I've suggested this idea to SciRate here: scirate/scirate#268

(I've created this as an "issue" even though it's maybe not actionable, so feel free to delete it or move it somewhere more appropriate; perhaps the wiki.)

[core] Resource Sync/pubsubhubbub integration: What is a topic?

@zimeon

I have been working on setting up ResourceSync/PubSubHubbub integration for scrAPI, using https://github.com/resync/resourcesync_push. I am curious what exactly a topic is, and what URL should be sent to the hub in order to subscribe to it.

I have tried the URL of the publisher, the location of scrAPI's RSS feed, and the URLs of individual resources, but no matter what I get a 404 from the hub (though it does recognize the subscription request). Is there any way to simply subscribe to a hub and receive all the information it gets?
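
For reference, a PubSubHubbub subscription request is a form-encoded POST to the hub. The sketch below builds such a body using the core field names from the PubSubHubbub spec; the URLs are placeholders, and the docstring's note about topics reflects the 404 behavior described above:

```python
from urllib.parse import urlencode

def subscribe_body(topic_url: str, callback_url: str) -> str:
    """Build the form-encoded body of a PubSubHubbub subscription POST.

    The topic must be the exact feed URL the publisher announces to the
    hub; subscribing to arbitrary resource URLs typically yields a 404
    like the one described above.
    """
    return urlencode({
        "hub.mode": "subscribe",
        "hub.topic": topic_url,
        "hub.callback": callback_url,
    })
```

POST this body to the hub endpoint with a Content-Type of application/x-www-form-urlencoded; the hub then verifies the subscription against the callback URL.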

[content provider] CrossRef

Notes from new content provider (adding old notes from early June 2014):

Cleared SHARE criteria on 6/9/14

API Docs:
https://github.com/CrossRef/rest-api-doc/blob/master/funder_kpi_api.md

http://crosstech.crossref.org/2014/04/%E2%99%AB-researchers-just-wanna-have-funds-%E2%99%AB.html

You don't need a member account. Make sure you are using the REST API documented here:

http://api.crossref.org

Also note that you can simply query the metadata for any DOI using content negotiation. As per:

http://www.crosscite.org/cn/
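
As a sketch of that content-negotiation flow (the DOI in the usage comment is a placeholder; the Accept header is the standard CSL JSON media type):

```python
def doi_request(doi: str):
    """Return the URL and headers for fetching a DOI's metadata via
    content negotiation against the doi.org resolver."""
    url = "https://doi.org/" + doi
    headers = {"Accept": "application/vnd.citationstyles.csl+json"}
    return url, headers

# Usage (requires network access):
#   import requests
#   url, headers = doi_request("10.1000/xyz123")
#   metadata = requests.get(url, headers=headers).json()
```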

Affiliate fees/SLA to help with scaling/performance issues and prevent throttling at millions of users:

At the moment the CrossRef Affiliate fees (and the fees for the SLA agreement) are flat annual fees for unlimited access, based on the annual revenue/income of the organization:

CrossRef Affiliate

  • Total revenue < $1 million: $500
  • Total revenue $1–10 million: $2,000
  • Total revenue > $10 million: $10,000

See more at: http://www.crossref.org/04intermediaries/34affiliate_fees.html

SHARE doesn't need an SLA for the summer prototype, but might once the service starts.

[core] Method of recording the timestamp

In the archive we are recording the consumption date and time with a timestamp like this:

2014-07-21 15:03:21.096378

I'd like to raise a concern about the space in that string. Since we use the timestamp as the name of a directory and possibly as part of a URL, there is ample opportunity for that space to become a problem.

I would urge us to consider using the ISO 8601 suggestions of including a "T" to delimit date and time and a "Z" to document UTC time (and then actually use UTC time):

2014-07-21T15:03:21.096378Z

This would make our timestamp much less likely to cause trouble in shell scripts or URLs.
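
The conversion is straightforward in Python; a minimal sketch, assuming the stored string is a naive UTC timestamp in the format shown above:

```python
from datetime import datetime, timezone

def to_iso8601(ts: str) -> str:
    """Re-emit a space-delimited timestamp in ISO 8601 form, with a "T"
    delimiting date and time and a "Z" documenting UTC."""
    dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f")
    return dt.replace(tzinfo=timezone.utc).isoformat().replace("+00:00", "Z")
```

Going forward, of course, the timestamp should be generated in UTC in the first place, e.g. with `datetime.now(timezone.utc)`.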

[core] Ideas for the experimental RSS feed

I've been watching the experimental RSS feed for a few days and find it very interesting. It can serve as a reasonable model for our thinking about the kind of notification we want to see from SHARE.

Right now the items in the RSS feed look something like this:

<item>
   <title>Symmetric Designs for Helical Spin Rotators at RHIC (12/94)</title>
   <link>http://173.255.232.219/archive/SciTech/1149788/2014-08-21 17:08:41.531480/normalized.json</link>
   <description>Ptitsin V.&lt;br/&gt;Retrieved from SciTech Connect at 2014-08-21 17:08:41.531480&lt;br/&gt;&lt;br/&gt;.</description>
   <guid>1149788</guid>
   <pubDate>2014-08-21 17:08:41.531480</pubDate>
</item>

The title is conveyed from the acquired metadata for the resource. That makes sense.

The link looks like it is attempting to be a link to the RESTful API of the notification service for the record in question, but I find it often does not work. However, I believe the link element should be a link directly back to the source. This is the most critical element of the notification service: every item we harvest or have reported to us must have an actionable link back to the originating source. One way or another, we have to construct those links from the acquired metadata, most likely by combining some identifier with some known base URL. In the case of DOIs these might be set to resolve via CrossRef; in the case of OAI harvests, they should point back to the originating repository.

The guid looks like it is trying to pass along the identifier supplied by the source for this resource. The guid must be unique according to the RSS standard, and we cannot guarantee these identifiers will be unique once stripped of their context. If we make the link a URL to the source, then we can use the guid for a URL back to our notification service, since we can guarantee that would be unique. Every outgoing RSS item is based on a record we have stored in our system, so this guid should be an actionable URL that uses the notification system's RESTful API to point back to that record: pretty much what is currently in the link field, but actually resolving consistently.

The description is somewhat cryptic, providing a variety of information depending on the source, but not really labeling that information in a useful way. The description in RSS is actually capable of being pretty much anything. Since this is an RSS feed, I believe that, for now, we should make it a human-readable expression of the normalized JSON record we are keeping for this item. It could be expressed as a pretty version of the JSON record itself.

While I would like the pubDate to be the date of the actual emergence of the resource (its publication date or the date of its deposit in the holding repository), I can see how this might be somewhat of an abuse of RSS. I think it is fair to keep the date as is. But the actual publication date should be clear in the description.

The result, for the record similar to the one above, would be something like:

<item>
   <title>Los Alamos National Laboratory Overview</title>
   <link>http://www.osti.gov/scitech/servlets/purl/1140136</link>
   <description>{
    "contributors": [
        {
            "email": null, 
            "full_name": "Neu, Mary"
        }
    ], 
    "id": "1140136", 
    "meta": {}, 
    "properties": {
        "article_type": "Conference", 
        "date_entered": "2014-07-21", 
        "date_published": "2010-06-02", 
        "date_retrieved": "2014-07-21", 
        "description": "Mary Neu, Associate Director for Chemistry, Life and Earth Sciences at Los Alamos National Laboratory, delivers opening remarks at the \"Sequencing, Finishing, Analysis in the Future\" meeting in Santa Fe, NM", 
        "doi": null, 
        "research_org": "Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)", 
        "research_sponsor": "USDOE Office of Science (SC), Biological and Environmental Research (BER) (SC-23)", 
        "tags": [
            "59 BASIC BIOLOGICAL SCIENCES"
        ], 
        "url": "http://www.osti.gov/scitech/servlets/purl/1140136"
    }, 
    "source": "SciTech", 
    "timestamp": "2014-07-21 15:03:21.096378", 
    "title": "Los Alamos National Laboratory Overview"
}</description>
   <guid>http://173.255.232.219/archive/SciTech/1140136/2014-07-21 15:03:21.096378</guid>
   <pubDate>2014-08-21 17:08:41.531480</pubDate>
</item>

I think these changes would make for a much more useful RSS feed. How doable would they be?
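
To make the proposal concrete, here is a sketch of assembling such an item from a normalized record. Field names follow the JSON above; `api_base` is a hypothetical base URL for the notification service's API:

```python
import json
import xml.etree.ElementTree as ET

def build_item(record: dict, api_base: str) -> ET.Element:
    """Build an RSS <item> per the proposal: <link> points directly at the
    originating source, <guid> at our own notification-service record, and
    <description> carries a pretty-printed copy of the normalized JSON."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = record["title"]
    ET.SubElement(item, "link").text = record["properties"]["url"]
    ET.SubElement(item, "description").text = json.dumps(
        record, indent=4, sort_keys=True)
    ET.SubElement(item, "guid").text = "%s/%s/%s" % (
        api_base, record["source"], record["id"])
    ET.SubElement(item, "pubDate").text = record["timestamp"]
    return item
```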

[content provider] DataONE

I just added provider documentation for DataONE to the repo. Our contact there is Dave Vieglais. For now it looks like a DataONE harvest would be via its public API.

I told Dave we likely would not get to the DataONE ingest attempt until September or October, though you are welcome to try earlier if you like!

[content provider] DMPTool

I just added provider documentation for DMPTool to the project. Public data management plans from roughly 800 sites can be shared from DMPTool to give consumers of the SHARE Notification Service early notice of new research and grant awards.

Patricia Cruse plans on sending along a pointer to the DMPTool API, I'll update the documentation when I get that. For now it includes the answers to our metadata sharing questions which were cleared today (7/22/2014).

Reporting of resources from local repository

A researcher submits unpublished work to their local SHARE-enabled institutional repository. The repository collects the ORCID, funder, and grant number associated with the work, then sends a report of the submission, with a Handle, to the SHARE notification service. SHARE stores the event in its digest.

Layers Involved

  • Notification Service

Challenges

  • Many repositories are unable to distinguish research-related material from cultural heritage material.
  • Identifiers like ORCID and FundRef are generally not collected by local repositories.

Potential Requirements

  • Repository able to provide Handle or other persistent identifier for resources.
  • SHARE push API to accept push reports from repository.
  • OAI-PMH harvesting tool to gather reports from repository.
  • ResourceSync harvesting tool to gather reports from repository.
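
An OAI-PMH harvest of such reports starts with a ListRecords request; a minimal sketch of building one (the endpoint and date in the test usage are placeholders; the verb and parameter names come from the OAI-PMH 2.0 protocol):

```python
from urllib.parse import urlencode

def listrecords_url(base_url, metadata_prefix="oai_dc", from_date=None):
    """Build an OAI-PMH ListRecords request URL, optionally limited to
    records added or changed since from_date (YYYY-MM-DD)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    return base_url + "?" + urlencode(params)
```

The harvester would fetch this URL, parse the returned XML, and follow any resumptionToken elements to page through the full result set.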

Using ResourceSync for ArXiv Synchronization?

I have been looking at various ways of gathering information from arXiv. ResourceSync seems like it would be a great way to receive information from arXiv; do you agree?
Does arXiv even have ResourceSync support at the moment?

Reporting of resources from publishers

A researcher is publishing an article with a CHORUS-enabled publisher. When CHORUS issues notification of publication, SHARE gets a copy via CrossRef. SHARE stores metadata related to this event (such as the source, the date of release, the researcher's ORCID, funding agency, grant ID, and the DOI of the publication at its publisher) in the notification service event digest.

Layers Involved

  • Notification Service

Challenges

  • Details of SHARE event metadata have yet to be determined. Grant ID is particularly troublesome and may not be possible for years to come.

Potential Requirements

  • Harvesting reports via CrossRef API.

Notes
What about publishers that do not participate in CrossRef? Does each need a use case, or do we handle them similarly to CrossRef, supporting one-off APIs for harvesting from each?
