datacommonsorg / data
License: Apache License 2.0
Eurostat uses the following flags to label their data:
b = break in time series
c = confidential
d = definition differs, see metadata. The relevant explanations must be provided in the annex of the ESMS (metadata)
e = estimated
f = forecast
n = not significant
p = provisional
r = revised
s = Eurostat estimate
u = low reliability
z = not applicable
Currently, these are being ignored (e.g. in #119). It would be great to have a property on StatVarObservations to surface these flags.
change the following measuredProperties to plurals: receipts, stocks
Add usedFor:ElectricityGeneration to statvars with ElectricityGeneration in the name.
Currently, the ACS Subject Table processing script uses the JSON Spec to create the Stat Var MCF, TMCF and Cleaned CSV files. However, there are cases where a column is not applicable but still appears in the Subject Table. Here is one such example.
Even though the 'Percent below poverty level' rows are always empty, the stat var MCF for them is generated. It would be useful to have a feature where stat var MCFs which have no stat var observations are dropped.
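One way such a filter could work is to scan the cleaned CSV for columns that are empty in every row, then skip generating stat var MCF nodes for those columns. A minimal sketch, with an illustrative function name (not the actual script's API):

```python
import csv

def empty_columns(csv_path):
    """Return the set of column names with no value in any row.
    StatVar MCF nodes tied to these columns could then be dropped,
    since they would have no stat var observations."""
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        candidates = set(reader.fieldnames or [])
        for row in reader:
            # Keep only columns that are still empty so far.
            candidates = {c for c in candidates if not (row[c] or "").strip()}
            if not candidates:
                break
    return candidates
```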
Right now, we are using both "incidentType" and "medicalCondition" to encode "COVID_19".
Population types Person and MedicalTest use "medicalCondition", while population type "MedicalConditionIncident" uses "incidentType".
I have a dataset in CSV form, and would like to iterate through it.
Should I manually index the CSV file by column? Or is there any library we should use?
Example:
row[0] == 'name'
row[1] == 'age'
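Rather than indexing columns by position, the stdlib `csv.DictReader` keys each row by the header names, which avoids the manual bookkeeping (pandas is another common option for larger files). A small sketch with made-up data:

```python
import csv
from io import StringIO

# DictReader uses the first row as the header, so each row becomes a
# dict keyed by column name instead of a positionally-indexed list.
data = StringIO("name,age\nAlice,34\nBob,29\n")
rows = list(csv.DictReader(data))
for row in rows:
    print(row["name"], row["age"])
```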
E.g. https://datacommons.org/browser/Legislation is missing but exists in https://schema.org/Legislation
The OECD Region Demography CSV file that we have for small regions (TL3) appears to be truncated.
For example, a user reported that for Geelong, we only have population of Males, but not Females, like below:
The stat is missing from the raw CSV that we downloaded. However, stats.oecd.org explorer has population of Females.
It is likely that the downloaded data was truncated to 1M data points, as the Export UI suggests below:
There are 1,000,001 rows in REGION_DEMOGR_population_tl3.csv, suggesting truncation.
It looks more obvious in this chart:
It looks like the values for other countries are not scaled to GWh.
https://github.com/datacommonsorg/data/blob/master/scripts/eurostat/gdp/eurostat_gdp.mcf
Add mQual to match existing StatVars
Running user code is dangerous. We are not really concerned about malicious code doing bad things to our system, because the executors already run in a sandbox on App Engine. We are more concerned that an attacker uses the executor to run malicious code that does bad things to the outside world, e.g., sending phishing emails, which could create legal issues. There are several potential solutions to this problem.
The first solution is to block user code internet access completely. This would disallow user code from downloading any data files. Instead, the executor would be responsible for downloading them. The executor currently supports taking a list of URLs from an import specification in a manifest and downloading them using GET requests. A problem is that in some cases, user code needs to POST a form to a URL to download data, and the content of the form needs to be generated dynamically, e.g., today's date.
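One way the dynamic-POST case could still be handled by the executor is to let the manifest declare a POST download whose form fields contain placeholders the executor expands at run time. A sketch of that idea, with a placeholder URL and hypothetical field names:

```python
import urllib.parse
import urllib.request
from datetime import date

# Hypothetical manifest entry: the {today} placeholder is expanded by the
# executor, so user code never needs network access of its own.
form_template = {"report_date": "{today}", "format": "csv"}
expanded = {k: v.format(today=date.today().isoformat())
            for k, v in form_template.items()}
body = urllib.parse.urlencode(expanded).encode()

# urllib sends a POST automatically when a request body is supplied.
req = urllib.request.Request("https://example.org/export", data=body)
```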
Another solution is to create an allowlist file in the repository and let the contributor first send a pull request to add their GitHub username to the file. The pull request can be seen as an application and we could ask for various information from the contributor. The executor would then only run if the author of a commit is in the allowlist.
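The allowlist check itself would be a small gate in the executor; a sketch, assuming a plain-text file with one GitHub username per line (the function name and file format are illustrative):

```python
def author_allowed(commit_author: str, allowlist_path: str) -> bool:
    """Return True if the commit author's GitHub username appears in the
    allowlist file (one username per line, '#' starting a comment)."""
    with open(allowlist_path) as f:
        allowed = {line.strip() for line in f
                   if line.strip() and not line.startswith("#")}
    return commit_author in allowed
```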
It's unclear from the name that it only applies to coal. We should revisit all the coal variables, at least, and add descriptions where we can. It would be best if we could link back to EIA for the definitions.
cc @pradh
While doing some Statistical Variable research for the new poverty dashboard, @jeffreyoldham and I noticed a typo in a statistical variable.
GiniIndex_EcconomicActivity should be GiniIndex_EconomicActivity: the current dcid has a doubled "c" in "Economic".
https://datacommons.org/tools/timeline#place=country%2FNGA&statsVar=GiniIndex_EcconomicActivity
The Warangal Urban district will be renamed Hanamkonda and Warangal Rural district will now be called Warangal - TNM
LGD has updated data.
Should replace the StatVars like Count_MortalityEvent_0To4Years_Male with Count_MortalityEvent_Upto4Years_Male.
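If the rename is done as a one-off migration over the MCF/dcid strings, it could be as simple as a token replacement (sketch only; this assumes the old token always appears with surrounding underscores, as in the example above):

```python
def rename_dcid(dcid: str) -> str:
    """Map the old age-bucket token to the new one, e.g.
    Count_MortalityEvent_0To4Years_Male -> Count_MortalityEvent_Upto4Years_Male."""
    return dcid.replace("_0To4Years_", "_Upto4Years_")
```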
As I was resolving my MCF against the Knowledge Graph, the job failed with a 'PREEMPTED_WHILE_RUNNING' error message.
In this MCF, we should use the name property instead of description, so that we can use it for displaying the StatVars in the LHS widgets.
I'm making the change to the file in schema repo, this is to track code updates.
Not all the polygons in this import follow the right-hand rule, which is now part of the GeoJSON spec. In order to standardize how we store GeoJSONs in the graph, we should reconvert these to the right-hand rule.
Assigned to Jay since Mahsa is not assignable yet. May need to join the repo.
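The reconversion amounts to checking each ring's winding order and reversing it when needed: RFC 7946 requires counter-clockwise exterior rings (holes wind the other way). In practice `shapely.geometry.polygon.orient` does this; a pure-Python sketch of the exterior-ring case using the shoelace formula:

```python
def to_right_hand_rule(ring):
    """Return the ring (a list of (x, y) tuples) in counter-clockwise
    order, as RFC 7946 requires for exterior rings. Interior rings
    (holes) would be reversed the other way around."""
    # Shoelace variant: a positive sum means the ring winds clockwise.
    area2 = sum((x2 - x1) * (y2 + y1)
                for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]))
    return ring if area2 < 0 else list(reversed(ring))
```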
If you get
ERROR: RET_CHECK failure (datacommons/util/kb_parser.cc:356) term.type == SchemaTerm::kEntity Found malformed entity name that has a column reference prefix dcid:Count_Person_25To64Years_TertiaryEducation_AsAFractionOfCount_Person_25To64Years; line 1; file /bigstore/unresolved_mcf/template_mcf_imports/eurostats/education_enrollment/Eurostats_NUTS2_Enrollment.mcf
when doing Write Template MCF + CSV, you might have put a Node MCF filepath instead of the Template MCF filepath into the "Enter CNS/GCS file path of template MCF file:" field.
This error appears when I try to import new territorial unit codes.
It is because at the resolving step, the file failed to resolve, therefore the resolved_mcf is named xxx_Resolver_Error, not xxx, which results in the file path error.
To solve the problem, follow the 3-step importing process in #14.
At step (b), check "Generate DCIDs for new places"
Some datasets must be downloaded using API keys.
For example, Harvard Dataverse doesn't have a plain download URL; you must curl with a registered API key.
Should we include the API key in the repository? How should one handle datasets that require API keys?
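A common pattern is to never commit the key and instead read it from an environment variable (or a gitignored config file) at download time. A sketch, assuming the Dataverse-style `X-Dataverse-key` header (adjust the header name per provider):

```python
import os
import urllib.request

def authed_request(url: str, key_env: str = "DATAVERSE_API_KEY") -> urllib.request.Request:
    """Build a download request whose API key comes from the environment,
    so the key never lands in the repository."""
    key = os.environ.get(key_env)
    if not key:
        raise RuntimeError(f"Set {key_env} before downloading.")
    return urllib.request.Request(url, headers={"X-Dataverse-key": key})
```

Each contributor then exports the variable locally, and CI can inject it as a secret.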
You could use Turtle, N-Quads, or other RDF formats that support Ontology Development Tools and have better support in Code Editors, rather than MCF.
I understand that Template MCF files are currently used for mapping files, but there are RDF tools like RML, which can be used with a variety of input formats (CSV, JSON, even other relational databases with an SQL connection), and have features like automated provenance metadata generation, which would perform the functionality of Template MCF. Or, you could author the schemas in RDF formats, but still use Template MCF to convert CSV, etc to triples.
By using more standard ontology formats, you also gain the ability to publish them more easily, because they are just static files, and don't need a specialized server or converter to transform from MCF into RDF.
Oglala County, SD, is omitted from the US-wide GeoJSON data as demonstrated in a PNG rendering of the data. Note it changed its name from Shannon County, SD, in 2015. Various versions of FIPS 6-4 may treat this county inconsistently. https://datacommons.org/place?dcid=geoId%2F46102 shows Data Commons has a geoID for the county. Q.v., https://en.wikipedia.org/wiki/Oglala_Lakota_County,_South_Dakota.
fpernice-google notes https://plotly.com/python/choropleth-maps/ also has a hole so the problem may be with the underlying US Census KML data, which might be available at https://www.census.gov/geographies/mapping-files/time-series/geo/cartographic-boundary.html.
There are a few stat vars where the names could be clarified, e.g.
https://datacommons.org/browser/Annual_Generation_Energy_Bagasse
The name is "Annual Generation of Bagasse" -- should it be "Annual Generation of Energy from Bagasse"?
Perhaps the description could also be clarified, since it currently reads as if Bagasse itself is being generated, not energy.
https://datacommons.org/browser/Annual_Consumption_Energy_ElectricityGeneration_Heat_FuelTransformation
The name and description here are also confusing, since they mention both consumption and generation. Is it referring to heat generated by fuel transformation that is consumed for electricity generation?
Sorry these were missed during the reviews.
Right now this statvar has data at several levels, but not state: https://datacommons.org/tools/statvar#Percent_Person_WithDiabetes
Look if https://www.cdc.gov/places/index.html has the data, and if not aggregate from county-level.
/tools/timeline#place=country/TUR,country/GBR,geoId/48&statsVar=Annual_Emissions_CarbonDioxide_NonBiogenic__eia/INTL.4008-8-MMTCD.A
Instead of www.census.gov, we should link to the ACS survey page.
The worldbank.py script should not include 'measurementDenominator' in properties_of_stat_var_observation. mDenom is a StatVar property.
The existing import scripts for the US Census don't include the ACS data that is in the graph.
I'm guessing all the data is originally from the ACS Summary File. It may be that you are generating the graph data from existing Google Public Datasets on BigQuery, or transforming the raw files.
Either way, it would be great to publish the scripts you use to convert that data into graph format.
I have run into an issue where I am unable to install the specific versions of packages mentioned in the requirements.txt file.
Reproducing the error: run ./run_tests.sh -r or pip3 install -r requirements.txt
OS: Debian/Linux
However, on macOS, running ./run_tests.sh -r exited with an error about being unable to install Cython (required by numpy). But interestingly, running pip3 install -r requirements.txt installed all the packages without any issues.
ERROR: Could not find a version that satisfies the requirement pandas==1.0.4
ERROR: No matching distribution found for pandas==1.0.4
The backtrace for this error points to a failed attempt to reinstall numpy. Four different versions of numpy get pulled in via the packages on lines 10-14 of requirements.txt:
10 geopandas==0.8.1 --> [states dependencies](https://geopandas.org/getting_started/install.html#dependencies) on pandas, numpy, shapely
11 matplotlib==3.3.0 --> [states dependency](https://matplotlib.org/3.2.2/users/installing.html#dependencies) on numpy
12 numpy==1.18.5
13 openpyxl==3.0.7
14 pandas==1.0.4 ---> [states dependency](https://pandas.pydata.org/docs/getting_started/install.html#dependencies) on numpy
Looking through the numpy install error did not help.
So, I commented out the numpy==1.18.5 and pandas==1.0.4 lines in requirements.txt:
10 geopandas==0.8.1
11 matplotlib==3.3.0
12 #numpy==1.18.5
13 openpyxl==3.0.7
14 #pandas==1.0.4
This gets past the numpy install error, and since geopandas lists pandas as a dependency, pandas still gets installed. But the installation now breaks on shapely (which is also already installed by geopandas).
ERROR: Could not find a version that satisfies the requirement shapely==1.7
ERROR: No matching distribution found for shapely==1.7
At this point, I think the issue is that geopandas installed a different version of these packages, and re-installing a different version of an already-installed package fails. Perhaps pip deliberately throws an error when there are multiple attempts to install the same package, so that users can fix requirements.txt themselves? I am not sure.
After commenting out shapely==1.7 in requirements.txt and attempting the package installation again, I get a third package with version conflicts among its dependencies.
ERROR: Cannot install chembl-webresource-client==0.10.2, requests==2.24.0 and urllib3==1.26.5 because these package versions have conflicting dependencies.
ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/user_guide/#fixing-conflicting-dependencies
So, I comment out the dependent packages requests and urllib3 in requirements.txt, making it look like this:
18 #requests==2.24.0
19 retry==0.9.2
20 #shapely==1.7
21 #urllib3==1.26.5
Now the package installations happen without an issue. The following is the revised requirements.txt file:
1 absl-py==0.9.0
2 chembl-webresource-client==0.10.2
3 dataclasses==0.6
4 datacommons==1.4.3
5 frozendict==1.2
6 func-timeout==4.3.5
7 geojson==2.5.0
8 geopandas==0.8.1
9 matplotlib==3.3.0
10 numpy==1.18.5
11 openpyxl==3.0.7
12 pandas==1.3.3
13 pylint==2.11.1
14 pytest==6.2.5
15 rdp==0.8
16 requests==2.24.0
17 retry==0.9.2
18 Shapely==1.7.1
19 urllib3==1.25.9
20 wrapt==1.12.1
21 xlrd==1.2.0
22 yapf==0.31.0
23 zipp==3.6.0
After validating my Harvard Covid data, I saw 100% duplicate observations on the Validation Dashboard. I know this is an issue. Does anyone know how to fix this, or where I should look?
My Statistical Variables are as follows:
Node: dcid:HarvardCOVID19IncrementalCases
typeOf: dcs:StatisticalVariable
populationType: dcs:MedicalTest
statType: dcs:measuredValue
measuredProperty: dcs:incrementalCount
medicalStatus: dcs:ConfirmedCase
measurementMethod: dcs:dcDerivedStat/HarvardCOVID19
Node: dcid:HarvardCOVID19IncrementalDeaths
typeOf: dcs:StatisticalVariable
populationType: dcs:MedicalTest
statType: dcs:measuredValue
measuredProperty: dcs:incrementalCount
medicalStatus: dcs:PatientDeceased
measurementMethod: dcs:dcDerivedStat/HarvardCOVID19
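One place to look is the cleaned CSV itself: counting rows that share the same StatVarObservation key shows whether the duplicates come from the input data. A sketch (the key column names below are assumptions; substitute whatever the cleaned CSV actually uses):

```python
import csv
from collections import Counter

def duplicate_obs(csv_path,
                  key_cols=("variableMeasured", "observationAbout",
                            "observationDate")):
    """Return the observation keys that appear more than once, with their
    counts. Any key with count > 1 would show up as a duplicate
    observation on the dashboard."""
    with open(csv_path, newline="") as f:
        counts = Counter(tuple(row[c] for c in key_cols)
                         for row in csv.DictReader(f))
    return {k: n for k, n in counts.items() if n > 1}
```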
https://datacommons.org/browser/geoId/0812815 is the valid entry with most stats attached.
https://datacommons.org/browser/geoId/0912815 is not the valid Centennial entity ("09" is Connecticut).
The data for the city of Utrecht is incorrect: https://datacommons.org/place/nuts/NL310
According to Wikipedia (https://en.wikipedia.org/wiki/Utrecht), the number is ~350,000, not 1.3 million.
The 1.3 million number is for the province of Utrecht (see https://en.wikipedia.org/wiki/Utrecht_(province)); there the number is correct: https://datacommons.org/place/nuts/NL31
Statistics Sweden data could be used in Data Commons: https://scb.se/en/finding-statistics/
They have an API: https://scb.se/en/services/open-data-api/api-for-the-statistical-database/
The content in the API is CC0 per https://www.scb.se/vara-tjanster/oppna-data/
E.g: https://browser.datacommons.org/kg?dcid=BLSSeasonallyAdjusted
And make sure to use the dcs: prefix, otherwise it'll show as a string.
#19 currently has it under StatVarObs (in template MCF).