the-academic-observatory / oaebu-workflows

Telescopes, Workflows and Data Services for the 'Book Analytics Dashboard Project (2022-2025)', building upon the project 'Developing a Pilot Data Trust for Open Access eBook Usage (2020-2022)'

Home Page: https://documentation.book-analytics.org/

License: Apache License 2.0

Shell 0.10% Python 89.50% Jinja 10.34% Dockerfile 0.06%
workflow data academic higher-education ebooks

Book Usage Data Workflows

Book Usage Data Workflows provides Apache Airflow workflows for fetching, processing and analysing data about Open Access Books.

Telescope Workflows

A telescope is a type of workflow used to ingest data from different data sources, and to run workflows that process and output data to other places. Workflows are built on top of Apache Airflow's DAGs.

The workflows include: Google Analytics, Google Books, JSTOR, IRUS Fulcrum, IRUS OAPEN, Onix, UCL Discovery and an Onix Workflow for combining all of this data.

| Telescope Workflow | Description |
| --- | --- |
| Crossref Events | Crossref Event Data captures discussion on scholarly content and acts as a hub for the storage and distribution of this data. An event may be a citation in a dataset or patent, a mention in a news article, Wikipedia page or blog, or discussion and comment on social media. |
| Crossref Metadata | Crossref is a not-for-profit membership organisation working on making scholarly communications better. It is an official Digital Object Identifier (DOI) Registration Agency of the International DOI Foundation. It provides metadata for every DOI that is registered with Crossref. |
| Google Analytics | Google Analytics is a web-based service that allows groups to track usage of their web properties. It offers visitor counts, statistics, and other breakdowns such as country of origin for visitors. If publishers or partners already have Google Analytics set up on their website, this usage data can be ingested. |
| Google Books | The Google Books Partner program enables selling books through the Google Play store and offering a preview on Google Books. Publishers can download reports on Google Books data. Three report types are currently available (sales summary, sales transaction and traffic), of which we use the latter two. |
| IRUS Fulcrum | IRUS provides COUNTER standard access reports for books hosted on the Fulcrum platform. The reports show access figures for each month and the country of usage. |
| IRUS OAPEN | IRUS provides COUNTER standard access reports for books hosted on the OAPEN platform. Almost all books on OAPEN are provided as a whole book PDF file. The reports show access figures for each month as well as the location of the access. Since the location info includes an IP address, the original data is handled only from within the OAPEN Google Cloud project. |
| JSTOR | JSTOR provides publisher usage reports. The reports offer details about the use of journal or book content by institution and country. Journal reports also include usage by issue and article. Usage is aligned with the COUNTER 5 standard of Item Requests (views + downloads). |
| OAPEN Metadata | The OAPEN Library hosts more than 18,000 Open Access books. OAPEN enables libraries and aggregators to use the metadata of all available titles in the OAPEN Library, made available under a CC0 1.0 license. The metadata is available in different formats and the OAPEN metadata telescope harvests the data in XML format. |
| Onix | ONIX is a standard format that book publishers use to share information about the books that they have published. Publishers that have ONIX feeds are given credentials and access to their own upload folder on the Mellon SFTP server. The publisher uploads their ONIX feed to their upload folder on a weekly, fortnightly or monthly basis. The ONIX telescope downloads, transforms (with the ONIX parser Java command line tool) and then loads the ONIX data into BigQuery for further processing. |
| Thoth | Thoth is a free, open metadata service that publishers can choose to use for metadata storage. Thoth can provide metadata upon request in a number of formats. The Thoth telescope uses the Thoth Export API to download metadata in an ONIX format. |
| UCL Discovery | UCL Discovery is UCL's open access repository, showcasing and providing access to the full texts of UCL research publications. The metadata for all eprints is obtained from their publicly available CSV file (https://discovery.ucl.ac.uk/cgi/search/advanced). |
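
The ONIX entry describes a telescope that downloads, transforms and then loads data. As a rough illustration of that three-step shape only, here is a minimal plain-Python sketch. It is not the project's code and does not use Airflow, SFTP, the ONIX parser or BigQuery: the function names, the `isbn13|title` line format and the sample values are all hypothetical placeholders, and the "table" is just an in-memory list.

```python
from typing import Dict, List

Record = Dict[str, str]


def download() -> List[str]:
    """Stand-in for fetching a raw feed (hypothetical; returns made-up placeholder lines)."""
    return [
        "9780000000001|Example Title A ",
        "9780000000002| Example Title B",
    ]


def transform(raw_lines: List[str]) -> List[Record]:
    """Turn each raw 'isbn13|title' line into a cleaned record."""
    records: List[Record] = []
    for line in raw_lines:
        isbn13, title = line.split("|", 1)
        records.append({"isbn13": isbn13.strip(), "title": title.strip()})
    return records


def load(records: List[Record], table: List[Record]) -> int:
    """Append records to an in-memory 'table' and return how many rows were added."""
    table.extend(records)
    return len(records)


def run_pipeline(table: List[Record]) -> int:
    """Run the steps in order; each step consumes the previous step's output."""
    return load(transform(download()), table)


if __name__ == "__main__":
    table: List[Record] = []
    rows = run_pipeline(table)
    print(f"loaded {rows} rows")
```

In an Airflow-based workflow, each of these steps would typically be a separate task in a DAG, so that a failed step can be retried without repeating the earlier ones.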

Documentation

Detailed documentation about the Book Usage Data Workflows is hosted on GitBook. Thank you to GitBook for supporting this repository under their Open Source plan.

Other requirements to create the Book Usage Datasets

The Observatory Platform, an environment for fetching, processing and analysing data, see the Repository https://github.com/The-Academic-Observatory/observatory-platform

The Academic Observatory Workflows, which provides Apache Airflow workflows for fetching, processing and analysing data about academic institutions, see the Repository https://github.com/The-Academic-Observatory/academic-observatory-workflows

The Onix Parser, a command line tool that transforms ONIX files into a format suitable for loading into BigQuery, see the Repository https://github.com/The-Academic-Observatory/onix-parser

oaebu-workflows's People

Contributors

alexmassen-hane, aroelo, bechandcock, cameronneylon, jdddog, kathrynnapier, keegansmith21, metasj, niamhq, rhosking, tuanchien

oaebu-workflows's Issues

Telescope workflow implementation: Ebsco

  • For which publishers will we extract this data?
  • The oldest report in the mail is from January 2021. From which date do we want to collect EBSCO data?
  • Is the email sent automatically?
    • Is the title of the mail typed manually or generated automatically?
    • Will the mail always be from the same sender?
    • On which date each month is the report sent? Is this variable or always the same date?
    • The mail with title EBSCO eBook Usage - April 2021 has the attachment University of Michigan Press eBook Usage Monthly March 2021.xlsx, so the months conflict. There seems to be no date inside the report. Do we use the filename or the mail title as ground truth? The filename might not be an option, because filenames do not seem to be generated automatically (see below).
  • Is the report itself generated automatically?
    • The filenames of the report differ: University of Michigan Press_eBook Usage_Jan21.xlsx, University of Michigan Press EBSCO eBook Usage_Feb21.xlsx and University of Michigan Press eBook Usage Monthly March 2021
    • In the Data Trust google doc there is a field Month_of_Log_Month in the EBSCO schema, but this is only available in the January report.
  • Do any of the fields in the CSV file contain multiple values, which ones and what is the delimiter? (e.g. maybe the 'Subjects' field, with '/' as a delimiter)
  • What is the 'Retrieval Count' exactly? E.g. is this the number of downloads per unique IP address, or can one IP address account for multiple downloads? Is it the downloads per chapter aggregated to a single book, or downloads per whole book, etc.?
  • What is the difference between 'Imprint Publisher' and 'Contract Publisher'? They are the same for UMP, but in general what is the difference between these two terms?
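
To make the filename versus mail title question concrete, the sketch below tries to read a month and year out of the strings quoted in this issue and flags when the two disagree. It is only an illustration under assumptions: the function names are hypothetical, the regular expression covers just the patterns seen above (a full or abbreviated English month name followed by a two- or four-digit year), and two-digit years are assumed to mean 20xx. It does not decide which source is the ground truth.

```python
import re
from datetime import date
from typing import Optional

_ABBREVIATIONS = ["jan", "feb", "mar", "apr", "may", "jun",
                  "jul", "aug", "sep", "oct", "nov", "dec"]
_MONTH_NUMBER = {abbr: number for number, abbr in enumerate(_ABBREVIATIONS, start=1)}

# A month name (full or abbreviated) not preceded by a letter, then an optional
# space/underscore, then a 4-digit or 2-digit year not followed by another digit.
_MONTH_YEAR = re.compile(
    r"(?<![A-Za-z])(" + "|".join(_ABBREVIATIONS) + r")[a-z]*[\s_]*(\d{4}|\d{2})(?!\d)",
    re.IGNORECASE,
)


def parse_report_month(text: str) -> Optional[date]:
    """Return the first day of the month named in `text`, or None if none is found."""
    match = _MONTH_YEAR.search(text)
    if match is None:
        return None
    month = _MONTH_NUMBER[match.group(1).lower()]
    year = int(match.group(2))
    if year < 100:  # assumption: two-digit years such as '21' mean 2021
        year += 2000
    return date(year, month, 1)


def months_conflict(mail_title: str, filename: str) -> bool:
    """True when both strings contain a month and the months differ."""
    title_month = parse_report_month(mail_title)
    file_month = parse_report_month(filename)
    return None not in (title_month, file_month) and title_month != file_month


if __name__ == "__main__":
    title = "EBSCO eBook Usage - April 2021"
    attachment = "University of Michigan Press eBook Usage Monthly March 2021.xlsx"
    print(parse_report_month(title), parse_report_month(attachment))
    print("conflict:", months_conflict(title, attachment))
```

Because the filenames appear to be typed by hand, any parser like this would need a fallback (for example, manual review) for names it cannot match.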
