Data Commons Data Imports

This is a collaborative repository for contributing data to Data Commons.

If you are looking to use the data in Data Commons, please visit our API documentation.

About Data Commons

Data Commons is an Open Knowledge Graph that provides a unified view across multiple public data sets and statistics. We've bootstrapped the graph with lots of data from US Census, CDC, NOAA, etc., and through collaborations with the New York Botanical Garden, Opportunity Insights, and more. However, Data Commons is meant to be for the community, by the community. We're excited to work with you to make public data accessible to everyone.

To see the extent of data we have today, browse the graph.

We welcome contributions to the graph! To get started, take a look at the resources in the docs directory and the list of pending imports.

License

Apache 2.0

Development

Every data import involves some or all of the following: obtaining the source data, cleaning the data, and converting the data into one of Meta Content Framework (MCF), JSON-LD, or RDFa format. We ask that you check in all scripts used in this process, so that others can reproduce and continue your work.
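
For illustration, a statistical variable definition in MCF and a Template MCF node that maps a cleaned CSV onto it look roughly like the sketch below. The variable name, properties, and CSV column names here are made up; see the docs directory for the authoritative format description.

Node: dcid:Count_Person_Example
typeOf: dcs:StatisticalVariable
populationType: dcs:Person
statType: dcs:measuredValue
measuredProperty: dcs:count

Node: E:ExampleImport->E0
typeOf: dcs:StatVarObservation
variableMeasured: dcid:Count_Person_Example
observationAbout: C:ExampleImport->place_dcid
observationDate: C:ExampleImport->date
value: C:ExampleImport->count

In the Template MCF, C:ExampleImport->column_name references a column of the cleaned CSV, so each CSV row becomes one StatVarObservation.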

Source data must meet the licensing policy requirements.

Scripts should go under the top-level scripts/ directory, in a subdirectory named for the provenance and dataset. See the example for more detail.

We provide some utility libraries under the top-level util/ directory. For example, this includes maps to and from common geographic identifiers.
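
As a minimal sketch (the module and map names below are assumptions for illustration only; check util/ for what actually exists), using such a map from an import script could look like:

# Names in this snippet are assumed for illustration; see util/ for the real modules.
from util import alpha2_to_dcid  # hypothetical module exposing code-to-dcid maps

# Resolve a two-letter US state code to its Data Commons place dcid.
print(alpha2_to_dcid.USSTATE_MAP.get('CA'))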

GitHub Development Process

One Time Set-up

  1. Install Git LFS

  2. Fork this repo - follow the GitHub guide to forking a repo

    • In https://github.com/datacommonsorg/data, click the "Fork" button to fork the repo.
    • Add upstream: git remote add upstream https://github.com/datacommonsorg/data.git
    • Clone your forked repo to your desktop. Please do not clone this repo directly. Verify your remotes by running git remote -v; the output should look like this:
    shell> git remote -v
    origin  https://github.com/YOUR-GITHUB-USERNAME/data.git (fetch)
    origin  https://github.com/YOUR-GITHUB-USERNAME/data.git (push)
    upstream        https://github.com/datacommonsorg/data.git (fetch)
    upstream        https://github.com/datacommonsorg/data.git (push)
  3. Please ask to join the datacommons-developers Google group. Among other things, membership in this group provides access to debug logs of pre-submit tests that run for your Pull Request.

Creating Pull Requests

Contribute your changes by creating pull requests from your fork of this repo. Learn more in this step-by-step guide.

A summary of the steps in the development workflow is:

git checkout master
git pull upstream master
git checkout -b new_branch_name
# Make some code change
git add .
git commit -m "commit message"
git push -u origin new_branch_name

Then, from your forked repo, you can send a Pull Request. Wait for the Pull Request to be approved, then merge the change.

If this is your first time contributing to a Google Open Source project, you may need to follow the steps in contributing.md.

Code quality

Code style guidelines make code easier to understand and maintain. Automated checks enforce some of the guidelines.

Python

Setup

Ensure prerequisites are installed

Install the requirements and set up a virtual environment to isolate Python development in this repo.

python3 -m venv .env
source .env/bin/activate

pip3 install -r requirements_all.txt

Testing

Scripts should be accompanied by tests using the unittest framework, and named with an _test.py suffix.

A common test pattern is to drive your main processing function through some sample input files (e.g., with a few rows of the real csv/xls/etc.) and compare the produced output files (e.g., cleaned csv, mcf, tmcf) against expected ones. An example test following this pattern is here.
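
As a rough sketch of that pattern (the module name foo, the process_csv function, and the test-data file names are hypothetical), such a test might look like:

# foo_test.py -- compare output produced from a small sample input against
# a checked-in expected file. All names here are illustrative.
import filecmp
import os
import tempfile
import unittest

from . import foo  # dotted import; see the note on __init__.py below

_TESTDIR = os.path.join(os.path.dirname(__file__), 'test_data')


class ProcessTest(unittest.TestCase):

    def test_process_csv(self):
        with tempfile.TemporaryDirectory() as tmp_dir:
            output_csv = os.path.join(tmp_dir, 'output.csv')
            # Run the (hypothetical) main processing function on the sample input.
            foo.process_csv(os.path.join(_TESTDIR, 'sample_input.csv'), output_csv)
            # Compare the produced output against the expected output.
            self.assertTrue(
                filecmp.cmp(output_csv,
                            os.path.join(_TESTDIR, 'expected_output.csv')))


if __name__ == '__main__':
    unittest.main()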

IMPORTANT: Please ensure that there is an __init__.py file in the directory of your import scripts and in every parent directory up to scripts/. This is necessary for the unittest framework to automatically discover and run your tests as part of presubmit.
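
For example (directory and file names here are illustrative), the layout of an import under scripts/ would look like:

scripts/
  some_provenance/
    __init__.py
    some_dataset/
      __init__.py
      process.py
      process_test.py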

NOTE: In the presence of __init__.py files, you will need to adjust the way you import modules and run tests, as described below.

  1. You should import modules in your test with a dotted prefix, i.e., via a module path rather than a bare module name; see the sketch after this list.

  2. Instead of running your test as python3 foo_test.py, run as:

    python3 -m unittest discover -v -s ../ -p "foo_test.py"

    Consider creating a generic alias like this:

    • alias dc-data-py-test='python3 -m unittest discover -v -s ../ -p "*_test.py"'

    Then, you can run your tests as:

    • dc-data-py-test
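
For instance, if your test lives in scripts/some_provenance/some_dataset/foo_test.py (a hypothetical path), a package-relative dotted import inside the test would look like:

# Import the module under test via its package rather than as a bare top-level name.
# Names here are illustrative.
from . import foo            # instead of: import foo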
Guidelines
  • Any additional package required must be specified in the requirements_all.txt file in the top-level folder. No other requirements.txt files are allowed.
  • Code must be formatted according to the Google Python Style Guide, using the yapf formatter.
  • Code must not generate lint errors or warnings according to pylint configured for the Google Python Style Guide as specified in .pylintrc.
  • Tests must succeed.

Consider automating formatting and lint checks in your editor or as a pre-commit step to satisfy some of these requirements.

To run the tools from the command line (both are installed by the setup steps above):

  • pylint.
  • yapf, executed with --style google, e.g.,
# Update (--in-place) all files
./run_tests.sh -f

# Produce differences between the current code and reformatted code.  Empty
# output indicates correctly formatted code.
./run_tests.sh -l
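
If you prefer to invoke the tools directly on a single file rather than through run_tests.sh, commands along these lines should work (the script path below is hypothetical):

# Lint one file; run from the repo root so the .pylintrc configuration is picked up
pylint scripts/your_provenance/your_dataset/process.py

# Reformat one file in place using the Google style
yapf --style google --in-place scripts/your_provenance/your_dataset/process.py

# Or just print a diff of what yapf would change
yapf --style google --diff scripts/your_provenance/your_dataset/process.py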

To run a unit test, use a command like

python3 -m unittest discover -v -s util/ -p "*_test.py"

The discover option searches (-s) the util/ directory for files whose names match the pattern (-p) *_test.py, treats them all as unit tests, and runs them. Output is verbose (-v).

We provide a utility to run all unit tests in a folder easily (e.g. util/):

./run_tests.sh -p util/

Or to run all tests and checks:

./run_tests.sh -a

NOTE: Please ensure that all tests are runnable from the test script, e.g., module imports should be relative to the root of the repo.

Disabling style checks

Occasionally, one has to disable style checking or formatting for particular lines.

To disable pylint for a particular line or block, use syntax like

# pylint: disable=line-too-long,unbalanced-tuple-unpacking

To disable yapf for some lines,

# yapf: disable
... code ...
# yapf: enable

Go

  • Code must be formatted according to go fmt.
  • Vetting must identify no likely mistakes as revealed by go vet.
  • Code must not generate lint errors or warnings according to golangci-lint. To run on foo.go, use golangci-lint run foo.go.
  • Tests must succeed. Files ending with _test.go are considered tests. They are executed using go test.
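
Taken together, a typical local check for Go code might look like the following (commands run from the directory containing your Go files; golangci-lint must be installed separately):

# Format the code
go fmt ./...

# Report likely mistakes
go vet ./...

# Lint a single file
golangci-lint run foo.go

# Run all *_test.go tests
go test ./...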

Support

For general questions or issues about importing data into Data Commons, please open an issue on our issues page. For all other questions, please share feedback on this form.

Note - This is not an officially supported Google product.

Issues

[un energy] stat var names

There are a few stat vars where the names could be clarified, e.g.

https://datacommons.org/browser/Annual_Generation_Energy_Bagasse
The name is Annual Generation of Bagasse -- should it be Annual Generation of Energy from Bagasse?
Perhaps the description could also be clarified, since it now reads as if Bagasse is being generated, rather than energy.

https://datacommons.org/browser/Annual_Consumption_Energy_ElectricityGeneration_Heat_FuelTransformation
The name and description here are also confusing, since they mention both consumption and generation. Is it referring to Heat generated by Fuel Transformation consumed for Electricity Generation?

Sorry these were missed during the reviews.

Running into install errors while installing dependent Python packages

I have run into an issue where I am unable to install the specific versions of packages listed in the requirements.txt file.

Reproducing the error: run ./run_tests.sh -r or pip3 install -r requirements.txt
OS: Debian/Linux

However, on macOS, running ./run_tests.sh -r exited with an error about being unable to install Cython (required by numpy). Interestingly, though, running pip3 install -r requirements.txt installed all the packages without any issues.

TL;DR - The package install debug log in Debian/Linux

ERROR: Could not find a version that satisfies the requirement pandas==1.0.4
ERROR: No matching distribution found for pandas==1.0.4

The backtrace for this error points to a failed attempt to reinstall numpy. There are 4 different versions of numpy that we switch through between lines 10-14 of requirements.txt.

10 geopandas==0.8.1   --> states dependencies on pandas, numpy, shapely (https://geopandas.org/getting_started/install.html#dependencies)
11 matplotlib==3.3.0   --> states a dependency on numpy (https://matplotlib.org/3.2.2/users/installing.html#dependencies)
12 numpy==1.18.5
13 openpyxl==3.0.7
14 pandas==1.0.4       --> states a dependency on numpy (https://pandas.pydata.org/docs/getting_started/install.html#dependencies)

Looking through the numpy install error did not help.

So, I commented out the lines for numpy==1.18.5 and pandas==1.0.4 in requirements.txt:

10 geopandas==0.8.1 
11 matplotlib==3.3.0
12 #numpy==1.18.5
13 openpyxl==3.0.7
14 #pandas==1.0.4 

This goes past the numpy install error, and since geopandas lists pandas as a dependency, pandas still gets installed. But the installation now breaks on shapely (which is also already installed by geopandas).

ERROR: Could not find a version that satisfies the requirement shapely==1.7
ERROR: No matching distribution found for shapely==1.7

At this point, I think the issue is probably that geopandas installed different versions of these packages, and re-installing a different version of an already-installed package fails. Perhaps pip throws an error when there are multiple attempts to install the same package, letting the user decide how to fix requirements.txt? I am not sure.
After commenting out shapely==1.7 in requirements.txt and attempting the installation again, I hit a third error: version conflicts among dependent packages.

ERROR: Cannot install chembl-webresource-client==0.10.2, requests==2.24.0 and urllib3==1.26.5 because these package versions have conflicting dependencies.
ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/user_guide/#fixing-conflicting-dependencies

So, I comment out the dependent packages requests and urllib3 in requirements.txt, making it look like this:

18 #requests==2.24.0
19 retry==0.9.2
20 #shapely==1.7
21 #urllib3==1.26.5

Now, the package installations happen without an issue. The following is the revised requirements.txt file:

1 absl-py==0.9.0
2 chembl-webresource-client==0.10.2
3 dataclasses==0.6
4 datacommons==1.4.3
5 frozendict==1.2
6 func-timeout==4.3.5
7 geojson==2.5.0
8 geopandas==0.8.1
9 matplotlib==3.3.0
10 numpy==1.18.5
11 openpyxl==3.0.7
12 pandas==1.3.3
13 pylint==2.11.1
14 pytest==6.2.5
15 rdp==0.8
16 requests==2.24.0
17 retry==0.9.2
18 Shapely==1.7.1
19 urllib3==1.25.9
20 wrapt==1.12.1
21 xlrd==1.2.0
22 yapf==0.31.0
23 zipp==3.6.0

US GeoJSON file omits Oglala Lakota County, SD


Oglala Lakota County, SD, is omitted from the US-wide GeoJSON data, as demonstrated in a PNG rendering of the data. Note it changed its name from Shannon County, SD, in 2015. Various versions of FIPS 6-4 may treat this county inconsistently. https://datacommons.org/place?dcid=geoId%2F46102 shows Data Commons has a geoId for the county. Q.v., https://en.wikipedia.org/wiki/Oglala_Lakota_County,_South_Dakota.

fpernice-google notes that https://plotly.com/python/choropleth-maps/ also has a hole, so the problem may be with the underlying US Census KML data, which might be available at https://www.census.gov/geographies/mapping-files/time-series/geo/cartographic-boundary.html.

Error Expanding file path when importing territorial units codes


This error appears when I try to import new territorial unit codes.
It occurs because the file fails to resolve at the resolving step, so the resolved MCF is named xxx_Resolver_Error rather than xxx, which results in the file path error.

To solve the problem, follow the 3-step importing process in #14.
At step (b), check "Generate DCIDs for new places".

Update eia scripts for new energy codes

Change the following measuredProperties to plurals: receipts, stocks.

Add usedFor:ElectricityGeneration to statvars with ElectricityGeneration in the name.

Duplicate Observations after resolved MCF validation

After validating my Harvard Covid data, I saw 100% duplicate observations on the Validation Dashboard. I know this is an issue. Does anyone know what to do? How do I fix this? Where should I look?

My Statistical Variables are as follows:
Node: dcid:HarvardCOVID19IncrementalCases
typeOf: dcs:StatisticalVariable
populationType: dcs:MedicalTest
statType: dcs:measuredValue
measuredProperty: dcs:incrementalCount
medicalStatus: dcs:ConfirmedCase
measurementMethod: dcs:dcDerivedStat/HarvardCOVID19

Node: dcid:HarvardCOVID19IncrementalDeaths
typeOf: dcs:StatisticalVariable
populationType: dcs:MedicalTest
statType: dcs:measuredValue
measuredProperty: dcs:incrementalCount
medicalStatus: dcs:PatientDeceased
measurementMethod: dcs:dcDerivedStat/HarvardCOVID19


Drop redundant stat var MCFs generated by ACS Subject Table processing script

Currently, the ACS Subject Table processing script uses the JSON Spec to create the Stat Var MCF, TMCF and Cleaned CSV files. However, there are cases where a column is not applicable but still appears in the Subject Table, as in the example below.

Even though the 'Percent below poverty level' rows are always empty, the stat var MCF for them is generated. It would be useful to have a feature where stat var MCFs which have no stat var observations are dropped.

Truncated data from OECD "Population by 5-year age groups, small regions TL3" dataset

The OECD Region Demography CSV file that we have for small regions (TL3) appears to be truncated.

For example, a user reported that for Geelong, we only have the population of Males, but not Females.

The stat is missing from the raw CSV that we downloaded. However, stats.oecd.org explorer has population of Females.

It is likely the case that the downloaded data was truncated to 1M data points, as the Export UI suggests.

There are 1,000,001 rows in REGION_DEMOGR_population_tl3.csv, suggesting truncation.

NumEmptyPVFailures_value when resolving MCF

After trying to write to the KG with my import configuration, I see the following errors in the logs: NumEmptyPVFailures_value and NumSanityCheckPVFailures.

What does this mean?

term.type == SchemaTerm::kEntity Found malformed entity name that has a column reference prefix

If you get

ERROR: RET_CHECK failure (datacommons/util/kb_parser.cc:356) term.type == SchemaTerm::kEntity Found malformed entity name that has a column reference prefix dcid:Count_Person_25To64Years_TertiaryEducation_AsAFractionOfCount_Person_25To64Years; line 1; file /bigstore/unresolved_mcf/template_mcf_imports/eurostats/education_enrollment/Eurostats_NUTS2_Enrollment.mcf

when doing Write Template MCF + CSV, you might have put a Node MCF filepath instead of the Template MCF filepath into the "Enter CNS/GCS file path of template MCF file:" field.

Lack of support for Eurostat data flags.

Eurostat uses the following flags to label their data:

b = break in time series
c = confidential
d = definition differs, see metadata. The relevant explanations must be provided in the annex of the ESMS (metadata)
e = estimated
f = forecast
n = not significant
p = provisional
r = revised
s = Eurostat estimate
u = low reliability
z = not applicable

Currently, these are being ignored (e.g. in #119). It would be great to have a property in StatVarObservations to surface these flags.

Import automation: security issues with running user code

Running user code is dangerous. We are not really concerned about malicious code that does bad things to our system, because the executor is already run in a sandbox by App Engine. We are more concerned that an attacker could use the executor to run malicious code that does bad things to the outside world, e.g., sending out phishing emails, which could create legal issues. There are several potential solutions to this problem.

The first solution is to block user code internet access completely. This would disallow user code from downloading any data files. Instead, the executor would be responsible for downloading them. The executor currently supports taking a list of URLs from an import specification in a manifest and downloading them using GET requests. A problem is that in some cases, user code needs to POST a form to a URL to download data, and the content of the form needs to be generated dynamically, e.g., with today's date.

Another solution is to create an allowlist file in the repository and let the contributor first send a pull request to add their GitHub username to the file. The pull request can be seen as an application and we could ask for various information from the contributor. The executor would then only run if the author of a commit is in the allowlist.

World Bank WDI script

The worldbank.py script should not include 'measurementDenominator' in properties_of_stat_var_observation. mDenom is a StatVar property.

Flume Pipeline Failed when resolving to KG


After submitting a job to resolve the MCF against the KG, I see the following error message with a FAILED job:

"ERROR: Flume pipeline failed: generic::internal: ... cloud/helix/flume ..." .

Population of Brussels is not correct


The population of Brussels as indicated in this chart is not correct. It did not drop from 1 million to 100k in 2000; the current population of the city is about 1.2 million, not 120k.

Consider using more standard ontology formats

Rather than MCF, you could use Turtle, N-Quads, or other RDF formats that are supported by ontology development tools and have better support in code editors.

I understand that Template MCF files are currently used as mapping files, but there are RDF tools like RML, which can be used with a variety of input formats (CSV, JSON, even relational databases via an SQL connection) and have features like automated provenance metadata generation, which would perform the functionality of Template MCF. Or, you could author the schemas in RDF formats, but still use Template MCF to convert CSV, etc., to triples.

By using more standard ontology formats, you also gain the ability to publish them more easily, because they are just static files, and don't need a specialized server or converter to transform from MCF into RDF.
