
data's Issues

Lack of support for Eurostat data flags.

Eurostat uses the following flags to label their data:

b = break in time series
c = confidential
d = definition differs, see metadata. The relevant explanations must be provided in the annex of the ESMS (metadata)
e = estimated
f = forecast
n = not significant
p = provisional
r = revised
s = Eurostat estimate
u = low reliability
z = not applicable

Currently, these are being ignored (e.g. in #119). It would be great to have a property in StatVarObservations to surface these flags.
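As a rough illustration of what surfacing the flags could involve, here is a minimal Python sketch that splits a raw cell into a numeric value and its flag letters. It assumes cells look like "12.3 b" or ": c" (value, optional space, flag letters, with ":" meaning no value), which is how I understand the Eurostat bulk files to be laid out. The names parse_cell and FlaggedValue are hypothetical and not part of any existing import script.

```python
import re
from typing import NamedTuple, Optional, Tuple

# Flag letters and meanings copied from the list in this issue.
EUROSTAT_FLAGS = {
    "b": "break in time series",
    "c": "confidential",
    "d": "definition differs, see metadata",
    "e": "estimated",
    "f": "forecast",
    "n": "not significant",
    "p": "provisional",
    "r": "revised",
    "s": "Eurostat estimate",
    "u": "low reliability",
    "z": "not applicable",
}


class FlaggedValue(NamedTuple):
    value: Optional[float]   # None when the cell holds ":" (assumed to mean "no value")
    flags: Tuple[str, ...]   # zero or more single-letter flags


# Assumed cell layout: a number or ":", then optional lowercase flag letters.
_CELL = re.compile(r"^\s*(?P<value>:|-?\d+(?:\.\d+)?)\s*(?P<flags>[a-z]*)\s*$")


def parse_cell(cell: str) -> FlaggedValue:
    """Split one raw cell into its numeric value and its flag letters."""
    match = _CELL.match(cell)
    if match is None:
        raise ValueError(f"unrecognised cell: {cell!r}")
    raw = match.group("value")
    value = None if raw == ":" else float(raw)
    flags = tuple(match.group("flags"))
    unknown = [flag for flag in flags if flag not in EUROSTAT_FLAGS]
    if unknown:
        raise ValueError(f"unknown flag(s) {unknown} in cell {cell!r}")
    return FlaggedValue(value, flags)
```

The flags returned here could then be written to whatever StatVarObservation property ends up being chosen; that property does not exist yet, so it is left out of the sketch.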

Update eia scripts for new energy codes

Change the following measuredProperties to plurals: receipts, stocks.

Add usedFor:ElectricityGeneration to statvars with ElectricityGeneration in the name.

Drop redundant stat var MCFs generated by ACS Subject Table processing script

Currently, the ACS Subject Table processing script uses the JSON Spec to create the Stat Var MCF, TMCF and Cleaned CSV files. However, there are cases where a column is not applicable but still appears in the Subject Table. Here is one such example.
image

Even though the 'Percent below poverty level' rows are always empty, the stat var MCF for them is generated. It would be useful to have a feature where stat var MCFs which have no stat var observations are dropped.
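A minimal sketch of the requested feature, in Python. It assumes the cleaned CSV has one column per stat var and that the stat var MCF nodes are held in a dict keyed by the same names; both are assumptions for illustration, and the function names are hypothetical rather than taken from the processing script.

```python
import csv
import io


def columns_with_observations(csv_text, value_columns):
    """Return the value columns that have at least one non-empty cell."""
    populated = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column in value_columns:
            if (row.get(column) or "").strip():
                populated.add(column)
    return populated


def drop_empty_stat_vars(mcf_nodes, populated):
    """Keep only the MCF nodes whose stat var has observations.

    mcf_nodes maps a stat var name to its MCF node text (assumed layout).
    """
    return {name: node for name, node in mcf_nodes.items() if name in populated}
```

The filter would run after the cleaned CSV is written and before the stat var MCF file is emitted.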

NumEmptyPVFailures_value when resolving MCF

After trying to write to the KG using the following configuration:
image

I see the following errors in the logs: NumEmptyPVFailures_value and NumSanityCheckPVFailures.

image

What does this mean?

Truncated data from OECD "Population by 5-year age groups, small regions TL3" dataset

The OECD Region Demography CSV file that we have for small regions (TL3) appears to be truncated.

For example, a user reported that for Geelong, we only have population of Males, but not Females, like below:

Screen Shot 2021-01-18 at 7 20 51 PM

The stat is missing from the raw CSV that we downloaded. However, stats.oecd.org explorer has population of Females.

It is likely that the downloaded data was truncated to 1M data points, as the Export UI suggests below:

image

There are 1,000,001 rows in REGION_DEMOGR_population_tl3.csv, suggesting truncation.
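A check like the following could flag this kind of truncation at download time. This is only a sketch: the 1,000,000 limit is taken from the Export UI hint above, and the function name is hypothetical.

```python
import csv

# Export limit suggested by the OECD Export UI in this issue (an assumption,
# not a documented constant).
EXPORT_LIMIT = 1_000_000


def looks_truncated(rows, limit=EXPORT_LIMIT, has_header=True):
    """Return True when the number of data rows reaches the export limit.

    rows is any iterable of parsed CSV rows, e.g. a csv.reader object.
    """
    total = sum(1 for _ in rows)
    data_rows = total - 1 if has_header and total > 0 else total
    return data_rows >= limit


def file_looks_truncated(path, limit=EXPORT_LIMIT):
    """Convenience wrapper that counts the rows of a CSV file on disk."""
    with open(path, newline="") as handle:
        return looks_truncated(csv.reader(handle), limit=limit)
```

For REGION_DEMOGR_population_tl3.csv, 1,000,001 rows including the header gives exactly 1,000,000 data rows, so the check would return True.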

Flume Pipeline Failed when resolving to KG

image

After submitting a job to resolve the MCF against the KG I see the following error message with a FAILED job:

"ERROR: Flume pipeline failed: generic::internal: ... cloud/helix/flume ..." .

Import automation: security issues with running user code

Running user code is dangerous. We are not really concerned about malicious code that does bad things to our system, because the executor already runs in a sandbox provided by App Engine. We are more concerned that an attacker could use the executor to run malicious code that does bad things to the outside world, e.g., sending out phishing emails, which could create legal issues. There are several potential solutions to this problem.

The first solution is to block user code from accessing the internet completely. This would prevent user code from downloading any data files. Instead, the executor would be responsible for downloading them. The executor currently supports taking a list of URLs from an import specification in a manifest and downloading them using GET requests. A problem is that in some cases, user code needs to POST a form to a URL to download data, and the content of the form needs to be generated dynamically, e.g., with today's date.

Another solution is to create an allowlist file in the repository and let the contributor first send a pull request to add their GitHub username to the file. The pull request can be seen as an application and we could ask for various information from the contributor. The executor would then only run if the author of a commit is in the allowlist.
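The allowlist check itself could be as small as the sketch below. It assumes a plain-text file with one GitHub username per line and "#" comments; that file format and both function names are hypothetical. Usernames are compared case-insensitively, since GitHub treats them that way.

```python
def load_allowlist(text):
    """Parse allowlist text: one username per line, '#' starts a comment."""
    names = set()
    for line in text.splitlines():
        entry = line.split("#", 1)[0].strip()
        if entry:
            names.add(entry.lower())
    return names


def may_run(commit_author, allowlist):
    """Return True only if the commit author is on the allowlist."""
    return commit_author.strip().lower() in allowlist
```

The executor would call may_run with the commit author before scheduling any user script and skip the run otherwise.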

term.type == SchemaTerm::kEntity Found malformed entity name that has a column reference prefix

If you get

ERROR: RET_CHECK failure (datacommons/util/kb_parser.cc:356) term.type == SchemaTerm::kEntity Found malformed entity name that has a column reference prefix dcid:Count_Person_25To64Years_TertiaryEducation_AsAFractionOfCount_Person_25To64Years; line 1; file /bigstore/unresolved_mcf/template_mcf_imports/eurostats/education_enrollment/Eurostats_NUTS2_Enrollment.mcf

when doing Write Template MCF + CSV, you might have put a Node MCF filepath instead of the Template MCF filepath into the "Enter CNS/GCS file path of template MCF file:" field.
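A quick way to check which kind of file you have before submitting is sketched below in Python. It relies on the convention that Template MCF nodes are entity references such as "Node: E:TableName->E0", while Node MCF nodes start with a dcid as in the error above. It is a heuristic written for this note, not the parser's own logic, and the function name is hypothetical.

```python
import re

# Template MCF node lines reference a table entity, e.g. "Node: E:MyTable->E0".
_TEMPLATE_NODE = re.compile(r"^Node:\s*E:\S+->E\d+$")


def looks_like_template_mcf(text):
    """Return True if every 'Node:' line in the text is a template entity reference."""
    node_lines = [
        line.strip()
        for line in text.splitlines()
        if line.strip().startswith("Node:")
    ]
    return bool(node_lines) and all(_TEMPLATE_NODE.match(line) for line in node_lines)
```

If this returns False for the path you are about to paste into the template MCF field, it is probably a Node MCF file.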

Population of Brussels is not correct

image

The population of Brussels as indicated in this chart is not correct. It did not drop from 1 million to 100k in 2000. The current population of the city is about 1.2 million, not 120k.
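Jumps like this could be caught automatically with a simple ratio check between consecutive years. The Python sketch below is only an illustration: the function name and the factor of 5 are arbitrary choices of mine, not an existing validation rule.

```python
def suspicious_jumps(series, factor=5.0):
    """Return (year, next_year) pairs whose values differ by at least `factor`.

    series maps a year to a positive value; non-positive values are skipped.
    """
    years = sorted(series)
    flagged = []
    for previous, current in zip(years, years[1:]):
        low, high = sorted((series[previous], series[current]))
        if low > 0 and high / low >= factor:
            flagged.append((previous, current))
    return flagged
```

A series that goes from about 1 million to about 100k between two consecutive years would be flagged, while ordinary year-over-year growth would not.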

Error Expanding file path when importing territorial units codes

image

This error appears when I try to import new territorial unit codes.
This happens because the file failed to resolve at the resolving step, so the resolved_mcf is named xxx_Resolver_Error instead of xxx, which results in the file path error.

To solve the problem, follow the 3 step importing process #14.
At step (b), check "Generate DCIDs for new places"

Consider using more standard ontology formats

You could use Turtle, N-Quads, or other RDF formats that support Ontology Development Tools and have better support in Code Editors, rather than MCF.

I understand that Template MCF files are currently used for mapping files, but there are RDF tools like RML, which can be used with a variety of input formats (CSV, JSON, even other relational databases with an SQL connection), and have features like automated provenance metadata generation, which would perform the functionality of Template MCF. Or, you could author the schemas in RDF formats, but still use Template MCF to convert CSV, etc to triples.

By using more standard ontology formats, you also gain the ability to publish them more easily, because they are just static files, and don't need a specialized server or converter to transform from MCF into RDF.

US GeoJSON file omits Oglala Lakota County, SD

image

Oglala Lakota County, SD, is omitted from the US-wide GeoJSON data, as demonstrated in a PNG rendering of the data. Note that it changed its name from Shannon County, SD, in 2015. Various versions of FIPS 6-4 may treat this county inconsistently. https://datacommons.org/place?dcid=geoId%2F46102 shows Data Commons has a geoID for the county. Q.v., https://en.wikipedia.org/wiki/Oglala_Lakota_County,_South_Dakota.

fpernice-google notes https://plotly.com/python/choropleth-maps/ also has a hole so the problem may be with the underlying US Census KML data, which might be available at https://www.census.gov/geographies/mapping-files/time-series/geo/cartographic-boundary.html.
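To find holes like this without rendering the map, the GeoJSON features can be compared against a list of expected places. The sketch below assumes each feature stores its place ID under a property named "geoDcid"; that property name is a guess on my part and should be replaced with whatever the actual file uses.

```python
import json


def missing_geo_ids(geojson_text, expected_ids, id_property="geoDcid"):
    """Return the expected place IDs that have no feature in the GeoJSON."""
    collection = json.loads(geojson_text)
    present = {
        (feature.get("properties") or {}).get(id_property)
        for feature in collection.get("features", [])
    }
    return sorted(set(expected_ids) - present)
```

Running it with the full list of county geoIds would report geoId/46102 if the county is in fact absent from the file.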

[un energy] stat var names

There are a few stat vars where the names could be clarified, e.g.

https://datacommons.org/browser/Annual_Generation_Energy_Bagasse
The name is Annual Generation of Bagasse. Should it be Annual Generation of Energy from Bagasse?
Perhaps the description could also be clarified, since it now reads as if Bagasse is being generated, not energy.

https://datacommons.org/browser/Annual_Consumption_Energy_ElectricityGeneration_Heat_FuelTransformation
The name and description here are also confusing, since they mention both consumption and generation. Is it referring to Heat generated by Fuel Transformation and consumed for Electricity Generation?

Sorry these were missed during the reviews.

World Bank WDI script

The worldbank.py script should not include 'measurementDenominator' in properties_of_stat_var_observation. mDenom is a StatVar property.

Running into install errors while installing dependent Python packages

I have run into an issue where I am unable to install the specific package versions pinned in the requirements.txt file.

To reproduce the error, run ./run_tests.sh -r or pip3 install -r requirements.txt.
OS: Debian/Linux

On macOS, however, running ./run_tests.sh -r exited with an error about being unable to install Cython (required by numpy). Interestingly, running pip3 install -r requirements.txt there installed all the packages without any issues.

TL;DR - The package install debug log in Debian/Linux

ERROR: Could not find a version that satisfies the requirement pandas==1.0.4
ERROR: No matching distribution found for pandas==1.0.4

The backtrace for this error points to a failed attempt to reinstall numpy. There are 4 different versions of numpy which we switch through between lines 10-14 of requirements.txt.

10 geopandas==0.8.1 --> [states dependencies](https://geopandas.org/getting_started/install.html#dependencies) on pandas, numpy, shapely
11 matplotlib==3.3.0 --> [states dependency](https://matplotlib.org/3.2.2/users/installing.html#dependencies) on numpy
12 numpy==1.18.5
13 openpyxl==3.0.7
14 pandas==1.0.4 ---> [states dependency](https://pandas.pydata.org/docs/getting_started/install.html#dependencies) on numpy

Looking through the numpy install error did not help.

So, I commented out the lines for numpy==1.18.5 and pandas==1.0.4 in requirements.txt:

10 geopandas==0.8.1 
11 matplotlib==3.3.0
12 #numpy==1.18.5
13 openpyxl==3.0.7
14 #pandas==1.0.4 

This goes past the numpy install error, and since geopandas lists pandas as a dependent package, we have it installed. But, the installation of packages now breaks for installing shapely (since this is also already installed by geopandas).

ERROR: Could not find a version that satisfies the requirement shapely==1.7
ERROR: No matching distribution found for shapely==1.7

At this point, I think the issue is probably that geopandas installed a different version of these packages, and re-installing a different version of the same package fails. Perhaps pip throws an error when the same package is requested more than once and leaves it to the user to fix the requirements.txt file? I am not sure.
After commenting out shapely==1.7 in requirements.txt and retrying the installation, I hit a third package with conflicting versions of dependent packages.

ERROR: Cannot install chembl-webresource-client==0.10.2, requests==2.24.0 and urllib3==1.26.5 because these package versions have conflicting dependencies.
ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/user_guide/#fixing-conflicting-dependencies

So, I commented out the dependent packages requests and urllib3 in requirements.txt, making it look like this:

18 #requests==2.24.0
19 retry==0.9.2
20 #shapely==1.7
21 #urllib3==1.26.5

Now the packages install without an issue. The following is the revised requirements.txt file:

absl-py==0.9.0
chembl-webresource-client==0.10.2
dataclasses==0.6
datacommons==1.4.3
frozendict==1.2
func-timeout==4.3.5
geojson==2.5.0
geopandas==0.8.1
matplotlib==3.3.0
numpy==1.18.5
openpyxl==3.0.7
pandas==1.3.3
pylint==2.11.1
pytest==6.2.5
rdp==0.8
requests==2.24.0
retry==0.9.2
Shapely==1.7.1
urllib3==1.25.9
wrapt==1.12.1
xlrd==1.2.0
yapf==0.31.0
zipp==3.6.0
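To confirm that an environment actually matches the pins after an install, something like the following Python sketch could help. It uses the standard library importlib.metadata (Python 3.8+). The helper names are hypothetical, and the parser only handles simple "name==version" lines with "#" comments, which is all this requirements.txt contains.

```python
from importlib import metadata


def parse_pins(text):
    """Return {package_name: version} for every 'name==version' line."""
    pins = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and commented-out pins
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip().lower()] = version.strip()
    return pins


def mismatches(pins):
    """Return {name: (pinned, installed)} where the installed version differs.

    installed is None when the package is not installed at all.
    """
    result = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            result[name] = (wanted, installed)
    return result
```

Calling mismatches(parse_pins(open("requirements.txt").read())) after pip finishes gives an empty dict when every pin was honoured.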

Duplicate Observations after resolved MCF validation

After validating my Harvard Covid data, I saw 100% duplicate observations on the Validation Dashboard. I know this is an issue. Does anyone know how to fix this, or where to look?

My Statistical Variables are as follows:
Node: dcid:HarvardCOVID19IncrementalCases
typeOf: dcs:StatisticalVariable
populationType: dcs:MedicalTest
statType: dcs:measuredValue
measuredProperty: dcs:incrementalCount
medicalStatus: dcs:ConfirmedCase
measurementMethod: dcs:dcDerivedStat/HarvardCOVID19

Node: dcid:HarvardCOVID19IncrementalDeaths
typeOf: dcs:StatisticalVariable
populationType: dcs:MedicalTest
statType: dcs:measuredValue
measuredProperty: dcs:incrementalCount
medicalStatus: dcs:PatientDeceased
measurementMethod: dcs:dcDerivedStat/HarvardCOVID19
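One place to look is the input CSV itself: if several rows share the same place, variable and date, they would presumably show up as duplicate observations. The Python sketch below counts such repeated keys. It assumes the CSV columns are named after the StatVarObservation properties observationAbout, variableMeasured and observationDate; the real column names in the Harvard import may differ, and the function name is hypothetical.

```python
import csv
import io
from collections import Counter

# Assumed column names; adjust to match the actual CSV header.
DEFAULT_KEY = ("observationAbout", "variableMeasured", "observationDate")


def duplicate_observation_keys(csv_text, key_columns=DEFAULT_KEY):
    """Return {key_tuple: count} for every key that occurs more than once."""
    counts = Counter(
        tuple(row[column] for column in key_columns)
        for row in csv.DictReader(io.StringIO(csv_text))
    )
    return {key: count for key, count in counts.items() if count > 1}
```

An empty result means the CSV has no repeated keys, which would point to the same data having been imported more than once rather than to the file itself.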

image
