
open-data-registry's Issues

Glue Crawler

Attempting to crawl this location with a Glue crawler pointed at the ARN produces an Access Denied error. What's the correct way to configure this?

I tried giving Glue more permissions (all the way to Admin) to no avail.

CSV data is missing from dataforgood-fb-data bucket

The csv prefix is missing from the dataforgood-fb-data bucket, so I cannot load partitions from Athena.
I created the table by using:

CREATE EXTERNAL TABLE IF NOT EXISTS default.hrsl (
  `latitude` double,
  `longitude` double,
  `population` double 
) PARTITIONED BY (
  month string,
  country string,
  type string 
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '\t',
  'field.delim' = '\t'
) LOCATION 's3://dataforgood-fb-data/csv/'
TBLPROPERTIES ('has_encrypted_data'='false', 'skip.header.line.count'='1');

Then load partitions by using:
MSCK REPAIR TABLE hrsl;

But I got this error message:
Tables missing on filesystem: hrsl
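For what it's worth, "Tables missing on filesystem" from MSCK REPAIR usually means the table's LOCATION contains no objects at all, which is consistent with the csv/ prefix being gone. Once the data is restored, partitions can also be registered explicitly rather than via MSCK REPAIR; a hypothetical example (the partition values and Hive-style key layout are my assumptions, not confirmed for this bucket):

ALTER TABLE default.hrsl ADD IF NOT EXISTS
  PARTITION (month = '2019-06', country = 'USA', type = 'total')
  LOCATION 's3://dataforgood-fb-data/csv/month=2019-06/country=USA/type=total/';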

IRS 990 2019 index file discrepancy

While doing some analysis on 990 filings, I noticed a discrepancy between the number of filings in the 2019 CSV and JSON index files. The CSV index file appears to have 416,880 records while the JSON has 396,217. The CSV file also looks to have been updated much more recently than the JSON file (4/2020 vs 12/2019). I have not checked the index files for other years, though there may be conflicting counts there as well.

Wasn't sure if this is the best place to report it, but the ~20K difference seemed pretty significant. I haven't done any additional analysis yet to rule out something like duplicate records - figured I'd start here. Happy to lend a hand if I can help in any way.
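For anyone reproducing the comparison, counting both indexes takes only a few lines; a sketch (the helper name and the "Filings2019" top-level JSON key are my assumptions about the index layout):

```python
import csv
import json

def index_counts(csv_path, json_path, json_key="Filings2019"):
    # Count data rows in the CSV index (minus the header row) and
    # entries in the JSON index's top-level list, for comparison.
    with open(csv_path, newline="") as f:
        csv_count = sum(1 for _ in csv.reader(f)) - 1
    with open(json_path) as f:
        json_count = len(json.load(f)[json_key])
    return csv_count, json_count
```

Checking for duplicate record IDs in whichever file is larger would be the natural next step.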

ERA5 Land data request

I am requesting to add the ERA5 Land data set to the registry.

Although ERA5 is already in the registry, the Land component is significantly higher resolution (9 km vs 32 km), so it is better suited for land-based use cases.

I am currently trying to download the open data (https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-land?tab=overview) to my local machine to push to S3. I thought I would reach out here to see if the Open Data team could help push it into S3 so others could use it as well.

Thanks!

Add a Managed By field

This would make it clearer who manages the ingest process when it is not being done by the data source (or a group affiliated with them).

IRS 990 contact link throwing 404

@jedsundwall Was trying to find contact info for the IRS folks to alert them of an error in their indexes and noticed the existing link at the registry site is throwing a 404.

The best link I could find was this one?

In case the fine folks at the IRS read this, the 2018 indexes are currently out of sync (e.g. the 2018 CSV index shows 457,509 records while the 2018 JSON index shows just 432,801).

Speedtest by Ookla Global Fixed and Mobile Network Performance Maps

Speedtest by Ookla Global Fixed and Mobile Network Performance Map Tiles

Request to add the following entries and additional tags related to the Speedtest Global Performance Maps dataset.

Dataset entry

speedtest-global-performance.yaml

Add tags

broadband, global, telecommunications, tiles

Add Landsat 5 and 7 datasets

Has anyone looked into adding Landsat 5 and 7 to AWS?

Landsat provides the longest-running continuous earth observation system in history. Although Landsat 8 is great, it only provides data from 2013 onward. Most studies involving long-term trends like climate change require decades of satellite imagery.

What would be involved in getting Landsat 5 and 7 on AWS and who would I talk to?

digitalcorpora is ready to go live

Hi! I have uploaded the digitalcorpora corpus to s3://digitalcorpora/ and it is live!

% aws s3 ls s3://digitalcorpora/corpora/
                           PRE bin/
                           PRE drives/
                           PRE drives_bulk_extractor/
                           PRE drives_dfxml/
                           PRE files/
                           PRE hashes/
                           PRE mobile/
                           PRE packets/
                           PRE ram/
                           PRE scenarios/
                           PRE sql/
2020-11-21 10:56:19         43 README.txt
2020-11-21 10:56:20    1783404 digitalcorpora.org-hashdeep-2020-04-01.csv
2020-11-21 10:56:19    1787101 digitalcorpora.org-hashdeep-2020-05-01.csv
2020-11-21 10:56:19    1794086 digitalcorpora.org-hashdeep-2020-06-01.csv
2020-11-21 10:56:19    1794914 digitalcorpora.org-hashdeep-2020-07-01.csv
2020-11-21 10:56:20    1796103 digitalcorpora.org-hashdeep-2020-08-01.csv
2020-11-21 10:56:20    1796275 digitalcorpora.org-hashdeep-2020-09-01.csv
2020-11-21 10:56:20    1796447 digitalcorpora.org-hashdeep-2020-10-01.csv
2020-11-21 10:56:20    1796619 digitalcorpora.org-hashdeep-2020-11-01.csv
%

I will send a pull request.

Add SOREL-20M dataset + yaml

I'm opening this issue to accompany the forthcoming pull request to add the SOREL-20M dataset to the registry.

Thanks,
Rich

Possible to add CORS to Genome-in-a-Bottle dataset?

Hi there,
I was wondering if it would be possible to add CORS settings to the genome-in-a-bottle dataset

genome-in-a-bottle/giab_latest_release#10 (comment)

It would be very helpful for testing our demos against datasets like GIAB, but that requires CORS.

I think something like this would be ideal. I had previously talked with an AWS admin when settings like this were added to the 1000genomes S3 bucket:

<?xml version="1.0" encoding="UTF-8"?>
<CORSConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
    <CORSRule>
        <AllowedOrigin>*</AllowedOrigin>
        <AllowedMethod>GET</AllowedMethod>
        <AllowedHeader>Range</AllowedHeader>
        <AllowedHeader>Authorization</AllowedHeader>
        <MaxAgeSeconds>3000</MaxAgeSeconds>
        <ExposeHeader>Accept-Ranges</ExposeHeader>
        <ExposeHeader>Content-Range</ExposeHeader>
        <ExposeHeader>Content-Encoding</ExposeHeader>
        <ExposeHeader>Content-Length</ExposeHeader>
    </CORSRule>
</CORSConfiguration>

How to set "AWS_NO_SIGN_REQUEST=YES"?

Hi,
I encountered the errors below when fetching data for one of the following GeoTIFFs under s3://s22s-test-geotiffs/luray_snp/, each opened as a GDALRasterSource: SCL.tif, B01.tif, B02.tif, B03.tif, B04.tif, B05.tif, B06.tif, B07.tif, B08.tif, B09.tif, B11.tif, B12.tif:

FAILURE(3) CPLE_AWSInvalidCredentials(15) "AWSInvalidCredentials." AWS_SECRET_ACCESS_KEY and AWS_NO_SIGN_REQUEST configuration options not defined, and /root/.aws/credentials not filled

py4j.protocol.Py4JJavaError: An error occurred while calling o142.collectToPython.
After searching the Internet, I ran a command to confirm whether the issue was related to my AWS account.
Some programmers suggest trying AWS_NO_SIGN_REQUEST=YES, but I am not sure how to set it. Could you please give me some suggestions? Thanks!
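In case it helps, AWS_NO_SIGN_REQUEST is a GDAL configuration option that can be supplied as an ordinary environment variable; a minimal sketch in Python (set it before anything touches S3 for the first time):

```python
import os

# GDAL picks this up from the process environment and then sends
# unsigned (anonymous) requests, which is what public buckets expect.
os.environ["AWS_NO_SIGN_REQUEST"] = "YES"
```

From a shell you can equivalently `export AWS_NO_SIGN_REQUEST=YES` before launching the job; the AWS CLI's own equivalent is the `--no-sign-request` flag. On a Spark cluster the variable would also need to be present on the executors, not just the driver.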

GEFS v12 AWS data seems to have a few discrepancies: missing files, missing documentation.

I have done a quick random check (I have not checked all folders), but it appears that all the files for the subset of fields above 700 mb are missing from the D>10 days folders.

In the documentation it says:

For most grib2 files, the data are provided on a grid with a 0.25-degree grid spacing,
archived every 3 hours for the first 10 days of the forecast; beyond 10 days, 0.50 degrees grid
spacing is used. For pressure-level data above 700 hPa, even during the first 10 days of the
forecast, data are saved at 0.5-degree grid spacing in order to conserve space. The grid
proceeds from 90°N to 90°S and from 0E to 359.75 E.

I initially thought that, because of the similar lat/lon grid, they had one grib file for the whole Day 1-16/35 period, but no: the file in the Day 1-10 folder only has lead times from 3 to 240 hours.

The file names are

hgt_pres_abv700mb_YYYYMMDDHH_ens.grib2
spfh_pres_abv700mb_YYYYMMDDHH_ens.grib2
tmp_pres_abv700mb_YYYYMMDDHH_ens.grib2
ugrd_pres_abv700mb_YYYYMMDDHH_ens.grib2
vgrd_pres_abv700mb_YYYYMMDDHH_ens.grib2
vvel_pres_abv700mb_YYYYMMDDHH_ens.grib2

where ens denotes the ensemble member: c00 and p##.

The same issue seems to be present for those days with 11 members and forecast times up to Day 35.
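To make the gap concrete, the expected keys per variable and init cycle can be generated mechanically; a sketch (the helper name and the 30-perturbed-member default are my assumptions):

```python
def abv700mb_keys(var, cycle, n_perturbed=30):
    # Expected abv700mb file names for one variable and one init cycle
    # (YYYYMMDDHH): control member c00 plus perturbed members p01..pNN.
    members = ["c00"] + [f"p{i:02d}" for i in range(1, n_perturbed + 1)]
    return [f"{var}_pres_abv700mb_{cycle}_{m}.grib2" for m in members]
```

Diffing that list against an actual bucket listing would show exactly which files are missing per folder.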

Receiving InsecureRequestWarning locally

When running tests, I see the following locally. It looks like this should be handled at the top of ext.py, so I'm not sure why it's still firing. @zflamig could you look into this?

/Library/Python/2.7/site-packages/requests-2.13.0-py2.7.egg/requests/packages/urllib3/connectionpool.py:852: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)
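A possible workaround, assuming the unverified HTTPS requests are intentional in the test environment, is to filter this one warning once, e.g. at the top of ext.py; a sketch using only the stdlib (with a standalone urllib3, `urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)` would be the more targeted call; under requests 2.13 the vendored copy lives at `requests.packages.urllib3`):

```python
import warnings

# Suppress only warnings whose message starts with "Unverified HTTPS
# request"; all other warnings still surface normally.
warnings.filterwarnings("ignore", message="Unverified HTTPS request")
```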

Two Sentinel-2 Documentation Links

There are two links to documentation on the Sentinel-2 dataset, and clicking either one attempts to open this URL:

https://roda.sentinel-hub.com/sentinel-s2-l1c/readme.html,https://roda.sentinel-hub.com/sentinel-s2-l2a/readme.html

Obviously, this doesn't resolve. How do we accommodate multiple links to documentation? Could we allow links or Markdown in the Documentation field?

Incorrect AWS region for GIAB dataset?

At https://registry.opendata.aws/giab/
and in the giab.yaml file at
https://github.com/awslabs/open-data-registry/tree/master/datasets
the AWS region for bucket=giab is listed as us-east-1.

However, when I try to access the bucket via the AWS Java API or at
https://giab.s3.us-east-1.amazonaws.com/
AWS returns "The authorization header is malformed; the region 'us-east-1' is wrong; expecting 'us-west-2' " (or similar from the URL).

If I try https://giab.s3.us-west-2.amazonaws.com/
it succeeds.

Can you please investigate and perhaps change the yaml file so the region is listed as us-west-2?

Thank you!

Cannot access public dataset SPACENET

On http://spacenet-dataset.s3.amazonaws.com/, the public S3 bucket is arn:aws:s3:::spacenet-dataset.
However, when I try to access the bucket via HTTP at that URL, it gives the following error:

<Error>
<Code>AccessDenied</Code>
<Message>Access Denied</Message>
<RequestId>80AA03A4A2B74B07</RequestId>
<HostId>
vpeLzorRhAVQAc5WIy7aC9atGuyrgnBw+xdRR8GklvasUI4jgQ+Bd8Yl6Jo9QU5rQnbcMOkwxME=
</HostId>
</Error>

Is this dataset no longer public?

Latest event on gdelt-open-data/v2/events is from April 16, 2019

Dear All,

The latest entries in the events folder are all from April 16. I thought this was updated daily; is anything not working?

Same thing for the SNS.

2019-04-16 13:03:11 808.2 KiB v2/events/20190416130000.export.csv
2019-04-16 13:19:11 749.9 KiB v2/events/20190416131500.export.csv
2019-04-16 13:34:10 848.7 KiB v2/events/20190416133000.export.csv
2019-04-16 13:49:13 806.1 KiB v2/events/20190416134500.export.csv
2019-04-16 14:03:11 893.0 KiB v2/events/20190416140000.export.csv
2019-04-16 14:19:11 697.3 KiB v2/events/20190416141500.export.csv
2019-04-16 14:34:10 775.7 KiB v2/events/20190416143000.export.csv
2019-04-16 14:49:11 720.8 KiB v2/events/20190416144500.export.csv
2019-04-16 15:03:11 909.5 KiB v2/events/20190416150000.export.csv
2019-04-16 15:19:10 700.7 KiB v2/events/20190416151500.export.csv

I got this by running

aws s3 ls s3://gdelt-open-data/v2/events --recursive --human-readable --summarize

from a terminal window
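For monitoring, the last-update time can be read straight off the key names; a small sketch (helper name mine), assuming the v2/events/YYYYMMDDHHMMSS.export.csv naming:

```python
from datetime import datetime

def latest_event_time(keys):
    # The export keys embed the 15-minute slot timestamp, so the maximum
    # timestamp across key names gives the time of the last update
    # without opening any file.
    stamps = (k.rsplit("/", 1)[-1].split(".")[0] for k in keys)
    return max(datetime.strptime(s, "%Y%m%d%H%M%S") for s in stamps)
```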

Regards

Dataset

Could you provide a direct link to download the IDS2018 dataset instead of going through AWS? While downloading with AWS, the connection kept dropping in the middle, so downloading the dataset became an issue.

Can't download SpaceNet Buildings Dataset v1

I am trying to download SpaceNet Buildings Dataset v1

I ran the suggested command in a Sagemaker notebook, and I got this error.

Command:

$ aws s3 cp s3://spacenet-dataset/SpaceNet_Roads_Competition/AOI_2_Vegas_Roads_Train.tar.gz .

Error:
fatal error: An error occurred (404) when calling the HeadObject operation: Key "SpaceNet_Roads_Competition/AOI_2_Vegas_Roads_Train.tar.gz" does not exist

I tried this with the Paris data as well and got the same error. I found this repo, but it has been archived by the owner, so I cannot open an issue there.

How can I download this dataset properly?

have a list of catalog ids for Digital Globe Open Data

Hi, is there a way to get a list of catalog IDs for the imagery released for a Digital Globe Open Data event (Cyclone Idai)?

There are over 100 imagery strips listed for this event on https://www.digitalglobe.com/ecosystem/open-data/cyclone_idai and I would like to save some time by not parsing them.

Getting some basic metadata along with the IDs (date, sensor, cloud cover, etc.) would be a plus, but I think catalog IDs are a good start.

Actually, it would be a plus plus if we could get the imagery extents as well (geojson or shapefile).
