open-data-registry's Introduction

Registry of Open Data on AWS

A repository of publicly available datasets that can be accessed from AWS resources. Note that datasets in this registry are available via AWS resources, but they are not provided by AWS; these datasets are owned and maintained by a variety of government organizations, researchers, businesses, and individuals.

What is this for?

When data is shared on AWS, anyone can analyze it and build services on top of it using a broad range of compute and data analytics products, including Amazon EC2, Amazon Athena, AWS Lambda, and Amazon EMR. Sharing data in the cloud lets data users spend more time on data analysis rather than data acquisition. This repository exists to help people promote and discover datasets that are available via AWS resources.

How are datasets added to the registry?

Each dataset in this repository is described with metadata saved in a YAML file in the /datasets directory. We use these YAML files to provide three services:

The YAML files use this structure:

Name:
Description:
Documentation:
Contact:
ManagedBy:
UpdateFrequency:
Tags:
  -
License:
Resources:
  - Description:
    ARN:
    Region:
    Type:
DataAtWork:
  - Title:
    URL:
    AuthorName:
    AuthorURL:

The metadata required for each dataset entry is as follows:

| Field | Type | Description |
| --- | --- | --- |
| Name | String | The public facing name of the dataset |
| Description | String | A high-level description of the dataset |
| Documentation | URL | A link to documentation of the dataset |
| Contact | String | May be an email address, a link to contact form, a link to GitHub issues page, or any other instructions to contact the producer of the dataset |
| ManagedBy | String | The name of the organization who is responsible for the data ingest process |
| UpdateFrequency | String | An explanation of how frequently the dataset is updated |
| Tags | List of strings | Tags that topically describe the dataset. A list of supported tags is maintained in the tags.yaml file in this repo. If you want to recommend a tag that is not included in tags.yaml, please submit a pull request to add it to that file. |
| License | String | An explanation of the dataset license and/or a URL to more information about data terms of use of the dataset |
| Resources | List of lists | A list of AWS resources that users can use to consume the data. Each resource entry requires the metadata below: |
| Resources > Description | String | A technical description of the data available within the AWS resource, including information about file formats and scope. |
| Resources > ARN | String | Amazon Resource Name for resource, e.g. arn:aws:s3:::commoncrawl |
| Resources > Region | String | AWS region unique identifier, e.g. us-east-1 |
| Resources > Type | String | Can be CloudFront Distribution, DB Snapshot, S3 Bucket, or SNS Topic. A list of supported resources is maintained in the resources.yaml file in this repo. If you want to recommend a resource that is not included in resources.yaml, please submit a pull request to add it to that file. |
| DataAtWork (Optional) | List of lists | A list of links to examples of the dataset being used. May include tutorials, demos, or applications. |
| DataAtWork > Title | String | The title of the example usage of the data. |
| DataAtWork > URL | URL | A link to the example. |
| DataAtWork > AuthorName | String | Name of person or entity that created the example. |
| DataAtWork > AuthorURL | String | (Optional) URL for person or entity that created the example. |
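
Before opening a pull request, it can help to check an entry against the field list above. The following is a minimal sketch, not the registry's own validation: it assumes the YAML file has already been parsed into a plain Python dict (for example with a YAML library of your choice), and the helper name `find_missing_fields` is hypothetical.

```python
# Hypothetical pre-submission check; not part of the registry's tooling.
# Field names are taken from the required-metadata list above.
REQUIRED_FIELDS = [
    "Name", "Description", "Documentation", "Contact", "ManagedBy",
    "UpdateFrequency", "Tags", "License", "Resources",
]
REQUIRED_RESOURCE_FIELDS = ["Description", "ARN", "Region", "Type"]
# AuthorURL is optional, so it is not checked here.
REQUIRED_DATA_AT_WORK_FIELDS = ["Title", "URL", "AuthorName"]


def find_missing_fields(entry):
    """Return the names of required fields that are absent or empty."""
    missing = [field for field in REQUIRED_FIELDS if not entry.get(field)]

    # Every item under Resources needs its own four fields.
    for i, resource in enumerate(entry.get("Resources") or []):
        missing += [
            f"Resources[{i}] > {field}"
            for field in REQUIRED_RESOURCE_FIELDS
            if not resource.get(field)
        ]

    # DataAtWork itself is optional, but each item present must be complete.
    for i, example in enumerate(entry.get("DataAtWork") or []):
        missing += [
            f"DataAtWork[{i}] > {field}"
            for field in REQUIRED_DATA_AT_WORK_FIELDS
            if not example.get(field)
        ]
    return missing
```

An empty list means every required field is present. The sketch only checks presence; it does not verify that tags appear in tags.yaml or that resource types appear in resources.yaml.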

Note also that we use the name of each YAML file as the URL slug for each dataset on the Registry of Open Data on AWS website. E.g. the metadata from 1000-genomes.yaml is listed at https://registry.opendata.aws/1000-genomes/
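
The filename-to-URL rule can be sketched in a few lines. This is an illustration of the convention described above, not code from the registry website; the function name `registry_url` is hypothetical, and the base URL is taken from the 1000-genomes example.

```python
from pathlib import PurePosixPath

# Base URL as it appears in the example above.
REGISTRY_BASE_URL = "https://registry.opendata.aws"


def registry_url(yaml_filename):
    """Map a dataset YAML filename to its page on the registry website."""
    # The slug is the filename without its directory or .yaml extension.
    slug = PurePosixPath(yaml_filename).stem
    return f"{REGISTRY_BASE_URL}/{slug}/"


print(registry_url("datasets/1000-genomes.yaml"))
# prints https://registry.opendata.aws/1000-genomes/
```

Because the slug comes straight from the filename, renaming a YAML file changes the dataset's public URL.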

Example entry

Here is an example of the metadata behind this dataset registration: https://registry.opendata.aws/gdelt/

Name: Global Database of Events, Language and Tone (GDELT)
Description: |
  This project monitors the world's broadcast, print,
  and web news from nearly every corner of every country in
  over 100 languages and identifies the people, locations,
  organizations, themes, sources, emotions, counts,
  quotes, images and events driving our global society every
  second of every day.
Documentation: http://www.gdeltproject.org/
Contact: http://www.gdeltproject.org/about.html#contact
UpdateFrequency: Daily
Tags:
  - events
License: http://www.gdeltproject.org/about.html#termsofuse
Resources:
  - Description: Project data files
    ARN: arn:aws:s3:::gdelt-open-data
    Region: us-east-1
    Type: S3 Bucket
  - Description: Notifications for new data
    ARN: arn:aws:sns:us-east-1:928094251383:gdelt-csv
    Region: us-east-1
    Type: SNS Topic
DataAtWork:
  - Title: Exploring GDELT with Athena
    URL: http://blog.julien.org/2017/03/exploring-gdelt-data-set-with-amazon.html
    AuthorName: Julien Simon
    AuthorURL: https://twitter.com/julsimon
  - Title: Running R on Amazon Athena
    URL: https://aws.amazon.com/blogs/big-data/running-r-on-amazon-athena/
    AuthorName: Gopal Wunnava
    AuthorURL: https://www.linkedin.com/in/gopal-wunnava-b11a77/
  - Title: Bootstrapping GeoMesa HBase on AWS S3
    URL: http://www.geomesa.org/documentation/tutorials/geomesa-hbase-s3-on-aws.html
    AuthorName: Commonwealth Computer Research, Inc.
    AuthorURL: https://www.ccri.com
  - Title: Creating PySpark DataFrame from CSV in AWS S3 in EMR
    URL: https://gist.github.com/jakechen/6955f2de51212163312b6430555b8e0b
    AuthorName: Jake Chen
    AuthorURL: https://github.com/jakechen
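
The two ARNs in this entry follow the general AWS layout `arn:partition:service:region:account-id:resource`. As a rough sketch of how a consumer might pull the bucket or topic name out of a Resources entry, the hypothetical helper below (`parse_arn` is not part of the registry) splits an ARN into its components. S3 bucket ARNs leave the region and account fields empty, which is why the Region field is listed separately in the metadata.

```python
from typing import NamedTuple


class Arn(NamedTuple):
    partition: str
    service: str
    region: str
    account_id: str
    resource: str


def parse_arn(arn):
    """Split an ARN string into its five components after the 'arn' prefix."""
    # Limit the split so colons inside the resource part are preserved.
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not a valid ARN: {arn!r}")
    return Arn(*parts[1:])


bucket = parse_arn("arn:aws:s3:::gdelt-open-data")
topic = parse_arn("arn:aws:sns:us-east-1:928094251383:gdelt-csv")
print(bucket.service, bucket.resource)  # prints: s3 gdelt-open-data
print(topic.service, topic.region, topic.resource)  # prints: sns us-east-1 gdelt-csv
```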

How can I contribute?

You are welcome to contribute dataset entries or usage examples to the Registry of Open Data on AWS. Please review our contribution guidelines.

open-data-registry's People

Contributors

aarande, apprivet, bheliom, borenstein, chrisgorgo, ckalima, conordel, davidoesch, djarpin, dlindenbaum, drocamor, dyf, ewels, ewindahl, fredliporace, gmilcinski, jamestwebber, jedsundwall, jflasher, jiaweizhuang, jph00, jsfenfen, maalebarr, marty-sullivan, mikehenrty, neuromusic, normanrz, schpidi, slock-dbs, ssikdar-r7
