Data Commons Data Imports

This is a collaborative repository for contributing data to Data Commons.

If you are looking to use the data in Data Commons, please visit our API documentation.

About Data Commons

Data Commons is an Open Knowledge Graph that provides a unified view across multiple public data sets and statistics. We've bootstrapped the graph with lots of data from US Census, CDC, NOAA, etc., and through collaborations with the New York Botanical Garden, Opportunity Insights, and more. However, Data Commons is meant to be for the community, by the community. We're excited to work with you to make public data accessible to everyone.

To see the extent of data we have today, browse the graph.

We welcome contributions to the graph! To get started, take a look at the resources in the docs directory and the list of pending imports.

License

Apache 2.0

Development

Every data import involves some or all of the following: obtaining the source data, cleaning the data, and converting the data into one of Meta Content Framework (MCF), JSON-LD, or RDFa format. We ask that you check in all scripts used in this process, so that others can reproduce and continue your work.
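
For illustration, a statistical variable definition in MCF and a Template MCF node that maps a cleaned CSV onto it look roughly like the sketch below. The variable name, properties, and CSV column names here are made up; see the docs directory for the authoritative format description.

Node: dcid:Count_Person_Example
typeOf: dcs:StatisticalVariable
populationType: dcs:Person
statType: dcs:measuredValue
measuredProperty: dcs:count

Node: E:ExampleImport->E0
typeOf: dcs:StatVarObservation
variableMeasured: dcid:Count_Person_Example
observationAbout: C:ExampleImport->place_dcid
observationDate: C:ExampleImport->date
value: C:ExampleImport->count

In the Template MCF, C:ExampleImport->column_name references a column of the cleaned CSV, so each CSV row becomes one StatVarObservation.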

Source data must meet the licensing policy requirements.

Scripts should go under the top-level scripts/ directory, in a subdirectory named for the provenance and dataset. See the example for more detail.

We provide some utility libraries under the top-level util/ directory. For example, this includes maps to and from common geographic identifiers.
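
As a minimal sketch (the module and map names below are assumptions for illustration only; check util/ for what actually exists), using such a map from an import script could look like:

# Names in this snippet are assumed for illustration; see util/ for the real modules.
from util import alpha2_to_dcid  # hypothetical module exposing code-to-dcid maps

# Resolve a two-letter US state code to its Data Commons place dcid.
print(alpha2_to_dcid.USSTATE_MAP.get('CA'))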

GitHub Development Process

One Time Set-up

  1. Install Git LFS

  2. Fork this repo - follow the GitHub guide to forking a repo

    • In https://github.com/datacommonsorg/data, click the "Fork" button to fork the repo.
    • Add upstream: git remote add upstream https://github.com/datacommonsorg/data.git
    • Clone your forked repo to your desktop. Please do not clone this repo directly. Verify your remotes by running git remote -v; the output should look like this:
    shell> git remote -v
    origin  https://github.com/YOUR-GITHUB-USERNAME/data.git (fetch)
    origin  https://github.com/YOUR-GITHUB-USERNAME/data.git (push)
    upstream        https://github.com/datacommonsorg/data.git (fetch)
    upstream        https://github.com/datacommonsorg/data.git (push)
  3. Please ask to join the datacommons-developers Google group. Among other things, membership in this group provides access to debug logs of pre-submit tests that run for your Pull Request.

Creating Pull Requests

Contribute your changes by creating pull requests from your fork of this repo. Learn more in this step-by-step guide.

A summary of the steps in the development workflow is:

git checkout master
git pull upstream master
git checkout -b new_branch_name
# Make some code change
git add .
git commit -m "commit message"
git push -u origin new_branch_name

Then, from your forked repo, you can send a Pull Request. Wait for the Pull Request to be approved, then merge the change.

If this is your first time contributing to a Google Open Source project, you may need to follow the steps in contributing.md.

Code quality

Code style guidelines make code easier to understand and maintain. Automated checks enforce some of the guidelines.

Python

Setup

Ensure prerequisites are installed

Install the requirements and set up a virtual environment to isolate Python development in this repo.

python3 -m venv .env
source .env/bin/activate

pip3 install -r requirements_all.txt

Testing

Scripts should be accompanied by tests using the unittest framework, and named with an _test.py suffix.

A common test pattern is to drive your main processing function through some sample input files (e.g., with a few rows of the real csv/xls/etc.) and compare the produced output files (e.g., cleaned csv, mcf, tmcf) against expected ones. An example test following this pattern is here.
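
As a rough sketch of that pattern (the module name foo, the process_csv function, and the test-data file names are hypothetical), such a test might look like:

# foo_test.py -- compare output produced from a small sample input against
# a checked-in expected file. All names here are illustrative.
import filecmp
import os
import tempfile
import unittest

from . import foo  # dotted import; see the note on __init__.py below

_TESTDIR = os.path.join(os.path.dirname(__file__), 'test_data')


class ProcessTest(unittest.TestCase):

    def test_process_csv(self):
        with tempfile.TemporaryDirectory() as tmp_dir:
            output_csv = os.path.join(tmp_dir, 'output.csv')
            # Run the (hypothetical) main processing function on the sample input.
            foo.process_csv(os.path.join(_TESTDIR, 'sample_input.csv'), output_csv)
            # Compare the produced output against the expected output.
            self.assertTrue(
                filecmp.cmp(output_csv,
                            os.path.join(_TESTDIR, 'expected_output.csv')))


if __name__ == '__main__':
    unittest.main()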

IMPORTANT: Please ensure that there is an __init__.py file in the directory of your import scripts and in every parent directory up to scripts/. This is necessary for the unittest framework to automatically discover and run your tests as part of presubmit.
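
For example (directory and file names here are illustrative), the layout of an import under scripts/ would look like:

scripts/
  some_provenance/
    __init__.py
    some_dataset/
      __init__.py
      process.py
      process_test.py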

NOTE: In the presence of __init__.py files, you will need to adjust the way you import modules and run tests, as described below.

  1. You should import modules in your test with a dotted prefix, i.e., via a module path rather than a bare module name; see the sketch after this list.

  2. Instead of running your test as python3 foo_test.py, run as:

    python3 -m unittest discover -v -s ../ -p "foo_test.py"

    Consider creating a generic alias like this:

    • alias dc-data-py-test='python3 -m unittest discover -v -s ../ -p "*_test.py"'

    Then, you can run your tests as:

    • dc-data-py-test
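
For instance, if your test lives in scripts/some_provenance/some_dataset/foo_test.py (a hypothetical path), a package-relative dotted import inside the test would look like:

# Import the module under test via its package rather than as a bare top-level name.
# Names here are illustrative.
from . import foo            # instead of: import foo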
Guidelines
  • Any additional package required must be specified in the requirements_all.txt file in the top-level folder. No other requirements.txt files are allowed.
  • Code must be formatted according to the Google Python Style Guide, using the yapf formatter.
  • Code must not generate lint errors or warnings according to pylint configured for the Google Python Style Guide as specified in .pylintrc.
  • Tests must succeed.

Consider automating formatting and lint checks in your editor or as a pre-commit step to satisfy some of these requirements.

To run the tools from the command line (both are installed by the setup steps above):

  • pylint.
  • yapf, executed with --style google, e.g.,
# Update (--in-place) all files
./run_tests.sh -f

# Produce differences between the current code and reformatted code.  Empty
# output indicates correctly formatted code.
./run_tests.sh -l
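
If you prefer to invoke the tools directly on a single file rather than through run_tests.sh, commands along these lines should work (the script path below is hypothetical):

# Lint one file; run from the repo root so the .pylintrc configuration is picked up
pylint scripts/your_provenance/your_dataset/process.py

# Reformat one file in place using the Google style
yapf --style google --in-place scripts/your_provenance/your_dataset/process.py

# Or just print a diff of what yapf would change
yapf --style google --diff scripts/your_provenance/your_dataset/process.py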

To run a unit test, use a command like

python3 -m unittest discover -v -s util/ -p "*_test.py"

The discover option searches (-s) the util/ directory for files whose names match the pattern (-p) *_test.py, treats them all as unit tests, and runs them. Output is verbose (-v).

We provide a utility to run all unit tests in a folder easily (e.g. util/):

./run_tests.sh -p util/

Or to run all tests and checks:

./run_tests.sh -a

NOTE: Please ensure that all tests are runnable from the test script, e.g., module imports should be relative to the root of the repo.

Disabling style checks

Occasionally, one has to disable style checking or formatting for particular lines.

To disable pylint for a particular line or block, use syntax like

# pylint: disable=line-too-long,unbalanced-tuple-unpacking

To disable yapf for some lines,

# yapf: disable
... code ...
# yapf: enable

Go

  • Code must be formatted according to go fmt.
  • Vetting must identify no likely mistakes as revealed by go vet.
  • Code must not generate lint errors or warnings according to golangci-lint. To run on foo.go, use golangci-lint run foo.go.
  • Tests must succeed. Files ending with _test.go are considered tests. They are executed using go test.
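
Taken together, a typical local check for Go code might look like the following (commands run from the directory containing your Go files; golangci-lint must be installed separately):

# Format the code
go fmt ./...

# Report likely mistakes
go vet ./...

# Lint a single file
golangci-lint run foo.go

# Run all *_test.go tests
go test ./...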

Support

For general questions or issues about importing data into Data Commons, please open an issue on our issues page. For all other questions, please share feedback on this form.

Note - This is not an officially supported Google product.

Issues

[un energy] stat var names

There are a few stat vars where the names could be clarified, e.g.

https://datacommons.org/browser/Annual_Generation_Energy_Bagasse
The name is Annual Generation of Bagasse -- should it be Annual Generation of Energy from Bagasse?
Perhaps the description could also be clarified, since it now reads as if Bagasse is being generated, rather than energy.

https://datacommons.org/browser/Annual_Consumption_Energy_ElectricityGeneration_Heat_FuelTransformation
The name and description here are also confusing, since they mention both consumption and generation. Is it referring to Heat generated by Fuel Transformation consumed for Electricity Generation?

Sorry these were missed during the reviews.

Running into install errors while installing dependent Python packages

I have run into an issue where I am unable to install the specific versions of packages listed in the requirements.txt file.

Reproducing the error: run ./run_tests.sh -r or pip3 install -r requirements.txt
OS: Debian/Linux

However, on macOS, running ./run_tests.sh -r exited with an error about being unable to install Cython (required by numpy). Interestingly, though, running pip3 install -r requirements.txt installed all the packages without any issues.

TL;DR - The package install debug log in Debian/Linux

ERROR: Could not find a version that satisfies the requirement pandas==1.0.4
ERROR: No matching distribution found for pandas==1.0.4

The backtrace for this error points to a failed attempt to reinstall numpy. There are 4 different versions of numpy that we switch through between lines 10-14 of requirements.txt.

10 geopandas==0.8.1   --> states dependencies on pandas, numpy, shapely (https://geopandas.org/getting_started/install.html#dependencies)
11 matplotlib==3.3.0   --> states a dependency on numpy (https://matplotlib.org/3.2.2/users/installing.html#dependencies)
12 numpy==1.18.5
13 openpyxl==3.0.7
14 pandas==1.0.4       --> states a dependency on numpy (https://pandas.pydata.org/docs/getting_started/install.html#dependencies)

Looking through the numpy install error did not help.

So, I commented out the lines for numpy==1.18.5 and pandas==1.0.4 in requirements.txt:

10 geopandas==0.8.1 
11 matplotlib==3.3.0
12 #numpy==1.18.5
13 openpyxl==3.0.7
14 #pandas==1.0.4 

This goes past the numpy install error, and since geopandas lists pandas as a dependency, pandas still gets installed. But the installation now breaks on shapely (which is also already installed by geopandas).

ERROR: Could not find a version that satisfies the requirement shapely==1.7
ERROR: No matching distribution found for shapely==1.7

At this point, I think the issue is probably that geopandas installed different versions of these packages, and re-installing a different version of an already-installed package fails. Perhaps pip throws an error when there are multiple attempts to install the same package, letting the user decide how to fix requirements.txt? I am not sure.
After commenting out shapely==1.7 in requirements.txt and attempting the installation again, I hit a third error: version conflicts among dependent packages.

ERROR: Cannot install chembl-webresource-client==0.10.2, requests==2.24.0 and urllib3==1.26.5 because these package versions have conflicting dependencies.
ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/user_guide/#fixing-conflicting-dependencies

So, I comment out the dependent packages requests and urllib3 in requirements.txt, making it look like this:

18 #requests==2.24.0
19 retry==0.9.2
20 #shapely==1.7
21 #urllib3==1.26.5

Now, the package installations happen without an issue. The following is the revised requirements.txt file:

1 absl-py==0.9.0
2 chembl-webresource-client==0.10.2
3 dataclasses==0.6
4 datacommons==1.4.3
5 frozendict==1.2
6 func-timeout==4.3.5
7 geojson==2.5.0
8 geopandas==0.8.1
9 matplotlib==3.3.0
10 numpy==1.18.5
11 openpyxl==3.0.7
12 pandas==1.3.3
13 pylint==2.11.1
14 pytest==6.2.5
15 rdp==0.8
16 requests==2.24.0
17 retry==0.9.2
18 Shapely==1.7.1
19 urllib3==1.25.9
20 wrapt==1.12.1
21 xlrd==1.2.0
22 yapf==0.31.0
23 zipp==3.6.0

US GeoJSON file omits Oglala Lakota County, SD


Oglala Lakota County, SD, is omitted from the US-wide GeoJSON data, as demonstrated in a PNG rendering of the data. Note it changed its name from Shannon County, SD, in 2015. Various versions of FIPS 6-4 may treat this county inconsistently. https://datacommons.org/place?dcid=geoId%2F46102 shows Data Commons has a geoId for the county. Q.v., https://en.wikipedia.org/wiki/Oglala_Lakota_County,_South_Dakota.

fpernice-google notes that https://plotly.com/python/choropleth-maps/ also has a hole, so the problem may be with the underlying US Census KML data, which might be available at https://www.census.gov/geographies/mapping-files/time-series/geo/cartographic-boundary.html.

Error Expanding file path when importing territorial units codes


This error appears when I try to import new territorial unit codes.
It occurs because the file fails to resolve at the resolving step, so the resolved MCF is named xxx_Resolver_Error rather than xxx, which results in the file path error.

To solve the problem, follow the 3-step importing process in #14.
At step (b), check "Generate DCIDs for new places".

Update eia scripts for new energy codes

Change the following measuredProperties to plurals: receipts, stocks.

Add usedFor:ElectricityGeneration to statvars with ElectricityGeneration in the name.

Duplicate Observations after resolved MCF validation

After validating my Harvard Covid data, I saw 100% duplicate observations on the Validation Dashboard. I know this is an issue. Does anyone know what to do? How do I fix this? Where should I look?

My Statistical Variables are as follows:
Node: dcid:HarvardCOVID19IncrementalCases
typeOf: dcs:StatisticalVariable
populationType: dcs:MedicalTest
statType: dcs:measuredValue
measuredProperty: dcs:incrementalCount
medicalStatus: dcs:ConfirmedCase
measurementMethod: dcs:dcDerivedStat/HarvardCOVID19

Node: dcid:HarvardCOVID19IncrementalDeaths
typeOf: dcs:StatisticalVariable
populationType: dcs:MedicalTest
statType: dcs:measuredValue
measuredProperty: dcs:incrementalCount
medicalStatus: dcs:PatientDeceased
measurementMethod: dcs:dcDerivedStat/HarvardCOVID19


Drop redundant stat var MCFs generated by ACS Subject Table processing script

Currently, the ACS Subject Table processing script uses the JSON Spec to create the Stat Var MCF, TMCF and Cleaned CSV files. However, there are cases where a column is not applicable but still appears in the Subject Table, as in the example below.

Even though the 'Percent below poverty level' rows are always empty, the stat var MCF for them is generated. It would be useful to have a feature where stat var MCFs which have no stat var observations are dropped.

Truncated data from OECD "Population by 5-year age groups, small regions TL3" dataset

The OECD Region Demography CSV file that we have for small regions (TL3) appears to be truncated.

For example, a user reported that for Geelong, we only have the population of Males, but not Females.

The stat is missing from the raw CSV that we downloaded. However, stats.oecd.org explorer has population of Females.

It is likely the case that the downloaded data was truncated to 1M data points, as the Export UI suggests.

There are 1,000,001 rows in REGION_DEMOGR_population_tl3.csv, suggesting truncation.

NumEmptyPVFailures_value when resolving MCF

After trying to write to the KG with my import configuration, I see the following errors in the logs: NumEmptyPVFailures_value and NumSanityCheckPVFailures.

What does this mean?

term.type == SchemaTerm::kEntity Found malformed entity name that has a column reference prefix

If you get

ERROR: RET_CHECK failure (datacommons/util/kb_parser.cc:356) term.type == SchemaTerm::kEntity Found malformed entity name that has a column reference prefix dcid:Count_Person_25To64Years_TertiaryEducation_AsAFractionOfCount_Person_25To64Years; line 1; file /bigstore/unresolved_mcf/template_mcf_imports/eurostats/education_enrollment/Eurostats_NUTS2_Enrollment.mcf

when doing Write Template MCF + CSV, you might have put a Node MCF filepath instead of the Template MCF filepath into the "Enter CNS/GCS file path of template MCF file:" field.

Lack of support for Eurostat data flags.

Eurostat uses the following flags to label their data:

b = break in time series
c = confidential
d = definition differs, see metadata. The relevant explanations must be provided in the annex of the ESMS (metadata)
e = estimated
f = forecast
n = not significant
p = provisional
r = revised
s = Eurostat estimate
u = low reliability
z = not applicable

Currently, these are being ignored (e.g. in #119). It would be great to have a property in StatVarObservations to surface these flags.

Import automation: security issues with running user code

Running user code is dangerous. We are not really concerned about malicious code that does bad things to our system, because the executor is already run in a sandbox by App Engine. We are more concerned that an attacker could use the executor to run malicious code that does bad things to the outside world, e.g., sending out phishing emails, which could create legal issues. There are several potential solutions to this problem.

The first solution is to block user code internet access completely. This would disallow user code from downloading any data files. Instead, the executor would be responsible for downloading them. The executor currently supports taking a list of URLs from an import specification in a manifest and downloading them using GET requests. A problem is that in some cases, user code needs to POST a form to a URL to download data, and the content of the form needs to be generated dynamically, e.g., with today's date.

Another solution is to create an allowlist file in the repository and let the contributor first send a pull request to add their GitHub username to the file. The pull request can be seen as an application and we could ask for various information from the contributor. The executor would then only run if the author of a commit is in the allowlist.

World Bank WDI script

The worldbank.py script should not include 'measurementDenominator' in properties_of_stat_var_observation. mDenom is a StatVar property.

Flume Pipeline Failed when resolving to KG


After submitting a job to resolve the MCF against the KG, I see the following error message with a FAILED job:

"ERROR: Flume pipeline failed: generic::internal: ... cloud/helix/flume ..." .

Population of Brussels is not correct


The population of Brussels as indicated in this chart is not correct. It did not drop from 1 million to 100k in 2000; the current population of the city is about 1.2 million, not 120k.

Consider using more standard ontology formats

Rather than MCF, you could use Turtle, N-Quads, or other RDF formats that are supported by ontology development tools and have better support in code editors.

I understand that Template MCF files are currently used as mapping files, but there are RDF tools like RML, which can be used with a variety of input formats (CSV, JSON, even relational databases via an SQL connection) and have features like automated provenance metadata generation, which would perform the functionality of Template MCF. Or, you could author the schemas in RDF formats, but still use Template MCF to convert CSV, etc., to triples.

By using more standard ontology formats, you also gain the ability to publish them more easily, because they are just static files, and don't need a specialized server or converter to transform from MCF into RDF.
