
open-data-registry's Introduction

Registry of Open Data on AWS

A repository of publicly available datasets that can be accessed from AWS resources. Note that datasets in this registry are available via AWS resources but are not provided by AWS; these datasets are owned and maintained by a variety of government organizations, researchers, businesses, and individuals.

What is this for?

When data is shared on AWS, anyone can analyze it and build services on top of it using a broad range of compute and data analytics products, including Amazon EC2, Amazon Athena, AWS Lambda, and Amazon EMR. Sharing data in the cloud lets data users spend more time on data analysis rather than data acquisition. This repository exists to help people promote and discover datasets that are available via AWS resources.

How are datasets added to the registry?

Each dataset in this repository is described with metadata saved in a YAML file in the /datasets directory. These YAML files power the Registry of Open Data on AWS website and related services.

The YAML files use this structure:

Name:
Description:
Documentation:
Contact:
ManagedBy:
UpdateFrequency:
Tags:
  -
License:
Citation:
Resources:
  - Description:
    ARN:
    Region:
    Type:
    Explore:
DataAtWork:
  Tutorials:
    - Title:
      URL:
      NotebookURL:
      AuthorName:
      AuthorURL:
      Services:
  Tools & Applications:
    - Title:
      URL:
      AuthorName:
      AuthorURL:
  Publications:
    - Title:
      URL:
      AuthorName:
      AuthorURL:
DeprecatedNotice:
ADXCategories:
  -

The metadata required for each dataset entry is as follows:

| Field | Type | Description & Style |
|---|---|---|
| Name | String | The public-facing name of the dataset. Spell out acronyms and abbreviations. "AWS" and "Open Data" are not required in the dataset name. Must be between 5 and 130 characters. |
| Description | String | A high-level description of the dataset. Only the first 600 characters are displayed on the homepage of the Registry of Open Data on AWS. |
| Documentation | URL | A link to documentation for the dataset, preferably hosted on the data provider's website or GitHub repository. |
| Contact | String | An email address, a link to a contact form, a link to a GitHub issues page, or any other instructions for contacting the producer of the dataset. |
| ManagedBy | String | The name of the laboratory, institution, or organization responsible for the data ingest process. Avoid naming individuals. If your institution manages several datasets hosted by the Public Dataset Program, list the managing institution identically across them; for an example of why, see the Managed By section of the TARGET dataset. |
| UpdateFrequency | String | An explanation of how frequently the dataset is updated. |
| Tags | List of strings | Tags related to an intrinsic property or descriptor of the dataset. The list of supported tags is maintained in the tags.yaml file in this repo. To recommend a tag that is not included in tags.yaml, please submit a pull request adding it to that file. |
| License | String | An explanation of the dataset license and/or a URL with more information about the dataset's terms of use. |
| Citation (Optional) | String | Custom citation language to be used when citing this dataset, appended to the default citation used for all datasets. The default citation reads: "[DATASET NAME] was accessed on [DATE] at registry.opendata.aws/[dataset]". |
| Resources | List of lists | A list of AWS resources through which users can consume the data. Each resource entry requires the metadata below. |
| Resources > Description | String | A technical description of the data available within the AWS resource, including information about file formats and scope. |
| Resources > ARN | String | The Amazon Resource Name of the resource, e.g. arn:aws:s3:::commoncrawl. |
| Resources > Region | String | The AWS region identifier, e.g. us-east-1. |
| Resources > Type | String | One of CloudFront Distribution, DB Snapshot, S3 Bucket, or SNS Topic. The list of supported resource types is maintained in the resources.yaml file in this repo. To recommend a type that is not included in resources.yaml, please submit a pull request adding it to that file. |
| Resources > RequesterPays (Optional) | Boolean | Only appropriate for Amazon S3 buckets; indicates whether the bucket has Requester Pays enabled. |
| Resources > AccountRequired (Optional) | String | Whether an AWS account is required to access this data. While Requester Pays also implies an account is needed, this field is meant for cases where an account is required outside of that scenario. |
| Resources > ControlledAccess (Optional) | String | Only appropriate for Amazon S3 buckets with controlled access. Provide a URL to instructions on how to request and gain access to the S3 bucket. |
| Resources > Explore (Optional) | List of strings | Additional links that can be used to explore the bucket resource, e.g. links to an S3 JS Explorer index.html for the bucket or to the AWS S3 console. |
| DataAtWork [> Tutorials, Tools & Applications, Publications] (Optional) | List of lists | Lists of links to example tutorials, tools & applications, and publications that use the data. |
| DataAtWork [> Tutorials, Tools & Applications, Publications] > Title | String | The title of the tutorial, tool, application, or publication that uses the data. |
| DataAtWork [> Tutorials, Tools & Applications, Publications] > URL | URL | A link to the tutorial, tool, application, or publication that uses the data. |
| DataAtWork [> Tutorials, Tools & Applications, Publications] > AuthorName | String | Name(s) of the person or entity that created the tutorial, tool, application, or publication. Limit scientific publication author lists to the first six authors in the format Last Name First Initial, followed by "et al". |
| DataAtWork [> Tutorials, Tools & Applications, Publications] > AuthorURL (Optional) | String | URL for the person or entity that created the tutorial, tool, application, or publication. |
| DataAtWork [> Tutorials] > NotebookURL (Optional) | URL | A link to a Jupyter notebook (.ipynb) on GitHub that shows how this data can be used. |
| DataAtWork [> Tutorials] > Services (Optional) | String | For tutorials only. List the AWS services applied in your tutorial. The list of supported AWS services is maintained in the services.yaml file in this repo. To recommend a service that is not included in services.yaml, please submit a pull request adding it to that file. |
| DeprecatedNotice (Optional) | String | Only appropriate for datasets that are being retired; indicates to users that the dataset will soon be deprecated and should include the date after which the dataset will no longer be available. |
| ADXCategories | List of strings | Allowed categories can be found in adx_categories.yaml; at most two can be added. Adding categories to your listing improves searchability within the AWS Data Exchange. |

Note also that we use the name of each YAML file as the URL slug for each dataset on the Registry of Open Data on AWS website. For example, the metadata from 1000-genomes.yaml is listed at https://registry.opendata.aws/1000-genomes/

Example entry

Here is an example of the metadata behind this dataset registration: https://registry.opendata.aws/noaa-nexrad/

Name: NEXRAD on AWS
Description: Real-time and archival data from the Next Generation Weather Radar (NEXRAD) network.
Documentation: https://github.com/awslabs/open-data-docs/tree/main/docs/noaa/noaa-nexrad
Contact: [email protected]
ManagedBy: "[NOAA](http://www.noaa.gov/)"
UpdateFrequency: New Level II data is added as soon as it is available.
Tags:
  - aws-pds
  - earth observation
  - natural resource
  - weather
  - meteorological
  - sustainability
License: There are no restrictions on the use of this data.
Resources:
  - Description: NEXRAD Level II archive data
    ARN: arn:aws:s3:::noaa-nexrad-level2
    Region: us-east-1
    Type: S3 Bucket
    Explore:
    - '[Browse Bucket](https://noaa-nexrad-level2.s3.amazonaws.com/index.html)'
  - Description: NEXRAD Level II real-time data
    ARN: arn:aws:s3:::unidata-nexrad-level2-chunks
    Region: us-east-1
    Type: S3 Bucket
  - Description: "[Rich notifications](https://github.com/awslabs/open-data-docs/tree/main/docs/noaa/noaa-nexrad#subscribing-to-nexrad-data-notifications) for real-time data with filterable fields"
    ARN: arn:aws:sns:us-east-1:684042711724:NewNEXRADLevel2ObjectFilterable
    Region: us-east-1
    Type: SNS Topic
  - Description: Notifications for archival data
    ARN: arn:aws:sns:us-east-1:811054952067:NewNEXRADLevel2Archive
    Region: us-east-1
    Type: SNS Topic
DataAtWork:
  Tutorials:
    - Title: NEXRAD on EC2 tutorial
      URL: https://github.com/openradar/AMS_radar_in_the_cloud
      Services: EC2
      AuthorName: openradar
      AuthorURL: https://github.com/openradar
    - Title: Using Python to Access NCEI Archived NEXRAD Level 2 Data (Jupyter notebook)
      URL: http://nbviewer.jupyter.org/gist/dopplershift/356f2e14832e9b676207
      AuthorName: Ryan May
      AuthorURL: http://dopplershift.github.io
    - Title: Mapping Noaa Nexrad Radar Data With CARTO
      URL: https://carto.com/blog/mapping-nexrad-radar-data/
      AuthorName: Stuart Lynn
      AuthorURL: https://carto.com/blog/author/stuart-lynn/
  Tools & Applications:
    - Title: nexradaws on pypi.python.org - python module to query and download Nexrad data from Amazon S3
      URL: https://pypi.org/project/nexradaws/
      AuthorName: Aaron Anderson
      AuthorURL: https://github.com/aarande
    - Title: WeatherPipe - Amazon EMR based analysis tool for NEXRAD data stored on Amazon S3
      URL: https://github.com/stephenlienharrell/WeatherPipe
      AuthorName: Stephen Lien Harrell
      AuthorURL: https://github.com/stephenlienharrell
  Publications:
    - Title: Seasonal abundance and survival of North America’s migratory avifauna determined by weather radar
      URL: https://www.nature.com/articles/s41559-018-0666-4
      AuthorName: Adriaan M. Dokter, Andrew Farnsworth, Daniel Fink, Viviana Ruiz-Gutierrez, Wesley M. Hochachka, Frank A. La Sorte, Orin J. Robinson, Kenneth V. Rosenberg & Steve Kelling
    - Title: Unlocking the Potential of NEXRAD Data through NOAA’s Big Data Partnership
      URL: https://journals.ametsoc.org/doi/full/10.1175/BAMS-D-16-0021.1
      AuthorName: Steve Ansari and Stephen Del Greco
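
To sanity-check an entry like the one above, the resources it lists can usually be read anonymously with the AWS CLI (a sketch; assumes the AWS CLI is installed, and the listing will vary as new data arrives):

```shell
# List the top-level prefixes in the public NEXRAD Level II archive bucket.
# --no-sign-request reads the open bucket without AWS credentials.
aws s3 ls --no-sign-request s3://noaa-nexrad-level2/
```

The bucket name comes straight from the ARN in the Resources block above (arn:aws:s3:::noaa-nexrad-level2).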

How can I contribute?

You are welcome to contribute dataset entries or usage examples to the Registry of Open Data on AWS. Please review our contribution guidelines.

New to GitHub and contributing via pull requests?

In addition to GitHub's getting started documentation, this video tutorial shows the complete end-to-end process of contributing to this repository.

open-data-registry's People

Contributors

alowney, ankimno, apprivet, caroacostatovany, chueatwork, cstner, damercha, fredliporace, gmilcinski, jacklchang, jedsundwall, jeffejefe, jflasher, jhkennedy, jlaura, lizadams, lucimoore, mrossol, odp-aws-smn, patrick-keown, pschmied, simonrdavies, troyraen, ttdu, uscbdatareleases, vwild1, vzpgb, weikai, willmacs, zflamig


open-data-registry's Issues

Incorrect AWS region for GIAB dataset?

At https://registry.opendata.aws/giab/
and in the giab.yaml file at
https://github.com/awslabs/open-data-registry/tree/master/datasets
the AWS region for bucket=giab is listed as us-east-1.

However, when I try to access the bucket via the AWS Java API or at
https://giab.s3.us-east-1.amazonaws.com/
AWS returns "The authorization header is malformed; the region 'us-east-1' is wrong; expecting 'us-west-2' " (or similar from the URL).

If I try https://giab.s3.us-west-2.amazonaws.com/
it succeeds.

Can you please investigate and perhaps change the yaml file so the region is listed as us-west-2?

Thank you!
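
As a general tip for issues like this, S3 reports a bucket's true region in a response header even when the request itself is redirected or denied, so the correct value can be confirmed without credentials (a sketch using curl):

```shell
# S3 includes the bucket's home region in the x-amz-bucket-region header
# of every response, including error and redirect responses.
curl -sI https://giab.s3.amazonaws.com/ | grep -i x-amz-bucket-region
```

Per the report above, this should show us-west-2.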

Two Sentinel-2 Documentation Links

There are two links to documentation for the Sentinel-2 dataset, but they are concatenated into one, so clicking it attempts to open this URL:

https://roda.sentinel-hub.com/sentinel-s2-l1c/readme.html,https://roda.sentinel-hub.com/sentinel-s2-l2a/readme.html

Obviously, this doesn't resolve. How do we accommodate multiple links to documentation? Could we allow multiple links or Markdown in the Documentation field?

digitalcorpora is ready to go live

Hi! I have uploaded the digitalcorpora corpus to s3://digitalcorpora/ and it is live!

% aws s3 ls s3://digitalcorpora/corpora/
                           PRE bin/
                           PRE drives/
                           PRE drives_bulk_extractor/
                           PRE drives_dfxml/
                           PRE files/
                           PRE hashes/
                           PRE mobile/
                           PRE packets/
                           PRE ram/
                           PRE scenarios/
                           PRE sql/
2020-11-21 10:56:19         43 README.txt
2020-11-21 10:56:20    1783404 digitalcorpora.org-hashdeep-2020-04-01.csv
2020-11-21 10:56:19    1787101 digitalcorpora.org-hashdeep-2020-05-01.csv
2020-11-21 10:56:19    1794086 digitalcorpora.org-hashdeep-2020-06-01.csv
2020-11-21 10:56:19    1794914 digitalcorpora.org-hashdeep-2020-07-01.csv
2020-11-21 10:56:20    1796103 digitalcorpora.org-hashdeep-2020-08-01.csv
2020-11-21 10:56:20    1796275 digitalcorpora.org-hashdeep-2020-09-01.csv
2020-11-21 10:56:20    1796447 digitalcorpora.org-hashdeep-2020-10-01.csv
2020-11-21 10:56:20    1796619 digitalcorpora.org-hashdeep-2020-11-01.csv
%

I will send a pull request.

IRS 990 contact link throwing 404

@jedsundwall Was trying to find contact info for the IRS folks to alert them of an error in their indexes and noticed the existing link at the registry site is throwing a 404.

The best link I could find was this one?

In case the fine folks at the IRS read this, the 2018 indexes are currently out of sync (e.g. the 2018 CSV index shows 457,509 records while the 2018 JSON index shows just 432,801).

have a list of catalog ids for Digital Globe Open Data

Hi, is there a way to get a list of catalog IDs for the imagery released for a Digital Globe Open Data event (Cyclone Idai)?

There are over 100 imagery strips listed for this event on https://www.digitalglobe.com/ecosystem/open-data/cyclone_idai and I would like to save some time by not parsing them.

Getting some basic metadata along with the IDs would be a plus (date, sensor, cloud cover, etc.), but I think catalog IDs are a good start.

It would be an added plus if we could get the imagery extents as well (GeoJSON or shapefile).

Receiving InsecureRequestWarning locally

When running tests, I see the following locally. It looks like this should be handled at the top of ext.py so not sure why it's still firing. @zflamig could you look into this?

/Library/Python/2.7/site-packages/requests-2.13.0-py2.7.egg/requests/packages/urllib3/connectionpool.py:852: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)

Dataset

Can you provide a direct link to download the IDS2018 dataset instead of going through AWS? While downloading with AWS, it kept disconnecting in the middle, so it became an issue to download the dataset.

Possible to add CORS to Genome-in-a-Bottle dataset?

Hi there,
I was wondering if it would be possible to add CORS settings to the genome-in-a-bottle dataset

genome-in-a-bottle/giab_latest_release#10 (comment)

It would be very helpful for testing our demos against datasets like GIAB, but that requires CORS.

I think something like the following would be ideal; I had previously talked with an AWS admin when settings like this were added to the 1000genomes S3 bucket:

<?xml version="1.0" encoding="UTF-8"?>
<CORSConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
    <CORSRule>
        <AllowedOrigin>*</AllowedOrigin>
        <AllowedMethod>GET</AllowedMethod>
        <AllowedHeader>Range</AllowedHeader>
        <MaxAgeSeconds>3000</MaxAgeSeconds>
        <ExposeHeader>Accept-Ranges</ExposeHeader>
        <ExposeHeader>Content-Range</ExposeHeader>
        <ExposeHeader>Content-Encoding</ExposeHeader>
        <ExposeHeader>Content-Length</ExposeHeader>
        <AllowedHeader>Authorization</AllowedHeader>
    </CORSRule>
</CORSConfiguration>

Glue Crawler

Attempting to crawl this location with Glue pointed at the ARN produces an Access Denied error. What's the correct way to configure this?

I tried giving Glue more permissions (all the way to Admin) to no avail.

Can't download SpaceNet Buildings Dataset v1

I am trying to download SpaceNet Buildings Dataset v1

I ran the suggested command in a SageMaker notebook and got this error.

Command:

$ aws s3 cp s3://spacenet-dataset/SpaceNet_Roads_Competition/AOI_2_Vegas_Roads_Train.tar.gz 

Error:
fatal error: An error occurred (404) when calling the HeadObject operation: Key "SpaceNet_Roads_Competition/AOI_2_Vegas_Roads_Train.tar.gz" does not exist

I tried this with the Paris data as well and got the same error. I found this repo, but it has been archived by the owner, so I couldn't open an issue there.

How can I download this dataset properly?
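
When a key returns a 404 like this, listing the bucket's prefixes is a quick way to find where the data actually lives now (a sketch; the exact layout may have changed since this issue was filed, and the bucket may require signed requests):

```shell
# List the top-level prefixes in the SpaceNet bucket to locate the
# current path of the Buildings archives.
aws s3 ls --no-sign-request s3://spacenet-dataset/
```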

Add a Managed By field

This would make it clearer who manages the ingest process when it is not being done by the data source (or a group affiliated with them).

Speedtest by Ookla Global Fixed and Mobile Network Performance Maps

Speedtest by Ookla Global Fixed and Mobile Network Performance Map Tiles

Request to add the following entries and additional tags related to the Speedtest Global Performance Maps dataset.

Dataset entry

speedtest-global-performance.yaml

Add tags

broadband, global, telecommunications, tiles

Latest event on gdelt-open-data/v2/events is from April 16, 2019

Dear All,

The latest entries in the events folder are all from April 16. I thought this was updated daily; is something not working?

Same thing for the SNS.

2019-04-16 13:03:11 808.2 KiB v2/events/20190416130000.export.csv
2019-04-16 13:19:11 749.9 KiB v2/events/20190416131500.export.csv
2019-04-16 13:34:10 848.7 KiB v2/events/20190416133000.export.csv
2019-04-16 13:49:13 806.1 KiB v2/events/20190416134500.export.csv
2019-04-16 14:03:11 893.0 KiB v2/events/20190416140000.export.csv
2019-04-16 14:19:11 697.3 KiB v2/events/20190416141500.export.csv
2019-04-16 14:34:10 775.7 KiB v2/events/20190416143000.export.csv
2019-04-16 14:49:11 720.8 KiB v2/events/20190416144500.export.csv
2019-04-16 15:03:11 909.5 KiB v2/events/20190416150000.export.csv
2019-04-16 15:19:10 700.7 KiB v2/events/20190416151500.export.csv

I got this by running

aws s3 ls s3://gdelt-open-data/v2/events --recursive --human-readable --summarize

from a terminal window

Regards

ERA5 Land data request

I am requesting that the ERA5-Land dataset be added to the registry.

Although ERA5 is already in the registry, the Land component has significantly higher resolution (9 km vs. 32 km), so it is better suited for land-based use cases.

I am currently trying to download the open data (https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-land?tab=overview) locally to push to S3. I thought I would reach out here to see whether the Open Data team could help push it into S3 so others could use it as well.

Thanks!

IRS 990 2019 index file discrepancy

While doing some analysis on 990 filings, I noticed a discrepancy between the number of filings in the 2019 CSV and JSON index files. It appears that the CSV index file has 416,880 while the JSON has 396,217. The CSV file also looks to have been updated much more recently than the JSON file (4/2020 vs 12/2019). I have not checked the index files for other years, though there may be conflicting counts there as well.

Wasn't sure if this is the best place to report it, but the ~20K difference seemed pretty significant. I haven't done any additional analysis yet to rule out something like duplicate records - figured I'd start here. Happy to lend a hand if I can help in any way.

GEFS v12 AWS data seems to have a few discrepancies: missing files, missing documentation

I have done a quick random check (I have not checked all folders), but it appears that all the files for the subset of fields above 700 mb are missing from the D>10 days folders.

In the documentation it says:

For most grib2 files, the data are provided on a grid with a 0.25-degree grid spacing,
archived every 3 hours for the first 10 days of the forecast; beyond 10 days, 0.50 degrees grid
spacing is used. For pressure-level data above 700 hPa, even during the first 10 days of the
forecast, data are saved at 0.5-degree grid spacing in order to conserve space. The grid
proceeds from 90°N to 90°S and from 0E to 359.75 E.

I initially thought that, because of the similar lat/lon grid, they had one grib file for the whole Day 1-16/35 period, but no: the file in the Day 1-10 folder only has lead times from 3 to 240 hours.

The file names are

hgt_pres_abv700mb_YYYYMMDDHH_ens.grib2
spfh_pres_abv700mb_YYYYMMDDHH_ens.grib2
tmp_pres_abv700mb_YYYYMMDDHH_ens.grib2
ugrd_pres_abv700mb_YYYYMMDDHH_ens.grib2
vgrd_pres_abv700mb_YYYYMMDDHH_ens.grib2
vvel_pres_abv700mb_YYYYMMDDHH_ens.grib2

where ens stands for the ensemble members, c00 and p##.

The same issue seems to be present for those days with 11 members and forecast times up to Day 35.

Cannot access public dataset SPACENET

The public S3 bucket at http://spacenet-dataset.s3.amazonaws.com/ is arn:aws:s3:::spacenet-dataset.
However, when I try to access the bucket over HTTP at http://spacenet-dataset.s3.amazonaws.com/, it gives the following error:

<Error>
<Code>AccessDenied</Code>
<Message>Access Denied</Message>
<RequestId>80AA03A4A2B74B07</RequestId>
<HostId>
vpeLzorRhAVQAc5WIy7aC9atGuyrgnBw+xdRR8GklvasUI4jgQ+Bd8Yl6Jo9QU5rQnbcMOkwxME=
</HostId>
</Error>

Is this dataset no longer public ?

How to set "AWS_NO_SIGN_REQUEST=YES"?

Hi,
I encountered below errors when fetching data for one of: GDALRasterSource(s3://s22s-test-geotiffs/luray_snp/SCL.tif), GDALRasterSource(s3://s22s-test-geotiffs/luray_snp/B01.tif), GDALRasterSource(s3://s22s-test-geotiffs/luray_snp/B02.tif), GDALRasterSource(s3://s22s-test-geotiffs/luray_snp/B03.tif), GDALRasterSource(s3://s22s-test-geotiffs/luray_snp/B04.tif), GDALRasterSource(s3://s22s-test-geotiffs/luray_snp/B05.tif), GDALRasterSource(s3://s22s-test-geotiffs/luray_snp/B06.tif), GDALRasterSource(s3://s22s-test-geotiffs/luray_snp/B07.tif), GDALRasterSource(s3://s22s-test-geotiffs/luray_snp/B08.tif), GDALRasterSource(s3://s22s-test-geotiffs/luray_snp/B09.tif), GDALRasterSource(s3://s22s-test-geotiffs/luray_snp/B11.tif), GDALRasterSource(s3://s22s-test-geotiffs/luray_snp/B12.tif):
"
FAILURE(3) CPLE_AWSInvalidCredentials(15) "AWSInvalidCredentials." AWS_SECRET_ACCESS_KEY and AWS_NO_SIGN_REQUEST configuration options not defined, and /root/.aws/credentials not filled

py4j.protocol.Py4JJavaError: An error occurred while calling o142.collectToPython.
"
After searching on the Internet, I ran the command to confirm whether the issue is related to my AWS account.

Some programmers suggest trying to set AWS_NO_SIGN_REQUEST=YES, but I am confused about how to set it. Could you please give me some suggestions? Thanks!
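
For anyone else who lands here: AWS_NO_SIGN_REQUEST is a GDAL configuration option, so it can be set either as an environment variable for the whole session or passed per command (a sketch; the path below is one of the files from the error message above):

```shell
# Option 1: set it as an environment variable for the session.
export AWS_NO_SIGN_REQUEST=YES

# Option 2: pass it as a per-command GDAL --config option.
gdalinfo --config AWS_NO_SIGN_REQUEST YES /vsis3/s22s-test-geotiffs/luray_snp/B01.tif
```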

Add SOREL-20M dataset + yaml

I'm opening this issue to accompany the forthcoming pull request to add the SOREL-20M dataset to the registry.

Thanks,
Rich

Add Landsat 5 and 7 datasets

Has anyone looked into adding Landsat 5 and 7 to AWS?

Landsat provides the longest-running continuous earth observation system in history. Although Landsat 8 is great, it only provides data from 2013 onward. Most studies involving long-term trends like climate change require decades of satellite imagery.

What would be involved in getting Landsat 5 and 7 on AWS and who would I talk to?

CSV data is missing from dataforgood-fb-data bucket

The csv prefix is missing from the dataforgood-fb-data bucket. I cannot load partitions from Athena.
I created the table by using:

CREATE EXTERNAL TABLE IF NOT EXISTS default.hrsl (
  `latitude` double,
  `longitude` double,
  `population` double 
) PARTITIONED BY (
  month string,
  country string,
  type string 
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '\t',
  'field.delim' = '\t'
) LOCATION 's3://dataforgood-fb-data/csv/'
TBLPROPERTIES ('has_encrypted_data'='false', 'skip.header.line.count'='1');

Then load partitions by using:
MSCK REPAIR TABLE hrsl;

But I got this error message:
Tables missing on filesystem: hrsl
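
MSCK REPAIR TABLE reports "Tables missing on filesystem" when the table's LOCATION has no objects under it, so it is worth confirming the prefix exists before repairing (a sketch; as noted above, the csv/ prefix appears to be absent from this bucket):

```shell
# If this prints nothing, the csv/ prefix has no objects and
# MSCK REPAIR TABLE will have no partitions to discover.
aws s3 ls --no-sign-request s3://dataforgood-fb-data/csv/
```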
