datacommonsorg / data
License: Apache License 2.0
Eurostat uses the following flags to label their data:
b = break in time series
c = confidential
d = definition differs, see metadata. The relevant explanations must be provided in the annex of the ESMS (metadata)
e = estimated
f = forecast
n = not significant
p = provisional
r = revised
s = Eurostat estimate
u = low reliability
z = not applicable
Currently, these are being ignored (e.g. in #119). It would be great to have a property on StatVarObservations to surface these flags.
change the following measuredProperties to plurals: receipts, stocks
Add usedFor:ElectricityGeneration to statvars with ElectricityGeneration in the name.
Currently, the ACS Subject Table processing script uses the JSON Spec to create the Stat Var MCF, TMCF and Cleaned CSV files. However, there are cases where a column is not applicable but still appears in the Subject Table. Here is one such example.
Even though the 'Percent below poverty level' rows are always empty, the stat var MCF for them is generated. It would be useful to have a feature where stat var MCFs which have no stat var observations are dropped.
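One way such a filter could work is to scan the cleaned CSV for columns that are empty in every row, then skip generating stat var MCF nodes for those columns. A minimal sketch, with an illustrative function name (not the actual script's API):

```python
import csv

def empty_columns(csv_path):
    """Return the set of column names with no value in any row.
    StatVar MCF nodes tied to these columns could then be dropped,
    since they would have no stat var observations."""
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        candidates = set(reader.fieldnames or [])
        for row in reader:
            # Keep only columns that are still empty so far.
            candidates = {c for c in candidates if not (row[c] or "").strip()}
            if not candidates:
                break
    return candidates
```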
Right now, we are using both "incidentType" and "medicalCondition" to encode "COVID_19".
Population types Person and MedicalTest use "medicalCondition", while population type "MedicalConditionIncident" uses "incidentType".
I have a dataset in CSV form, and would like to iterate through it.
Should I manually index the CSV file by column? Or is there any library we should use?
Example:
row[0] == 'name'
row[1] == 'age'
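Rather than indexing columns by position, the stdlib `csv.DictReader` keys each row by the header names, which avoids the manual bookkeeping (pandas is another common option for larger files). A small sketch with made-up data:

```python
import csv
from io import StringIO

# DictReader uses the first row as the header, so each row becomes a
# dict keyed by column name instead of a positionally-indexed list.
data = StringIO("name,age\nAlice,34\nBob,29\n")
rows = list(csv.DictReader(data))
for row in rows:
    print(row["name"], row["age"])
```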
E.g. https://datacommons.org/browser/Legislation is missing but exists in https://schema.org/Legislation
The OECD Region Demography CSV file that we have for small regions (TL3) appears to be truncated.
For example, a user reported that for Geelong, we only have population of Males, but not Females, like below:
The stat is missing from the raw CSV that we downloaded. However, stats.oecd.org explorer has population of Females.
It is likely that the downloaded data was truncated to 1M data points, as the Export UI suggests below:
There are 1,000,001 rows in REGION_DEMOGR_population_tl3.csv, suggesting truncation.
It looks more obvious in this chart:
It looks like the values for other countries are not scaled to GWh.
https://github.com/datacommonsorg/data/blob/master/scripts/eurostat/gdp/eurostat_gdp.mcf
Add mQual to match existing StatVars
Running user code is dangerous. We are not really concerned about malicious code doing bad things to our system, because the executors already run in a sandbox on App Engine. We are more concerned that an attacker uses the executor to run malicious code that does bad things to the outside world, e.g., sending phishing emails, which could create legal issues. There are several potential solutions to this problem.
The first solution is to block user code internet access completely. This would disallow user code from downloading any data files. Instead, the executor would be responsible for downloading them. The executor currently supports taking a list of URLs from an import specification in a manifest and downloading them using GET requests. A problem is that in some cases, user code needs to POST a form to a URL to download data, and the content of the form needs to be generated dynamically, e.g., today's date.
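One way the dynamic-POST case could still be handled by the executor is to let the manifest declare a POST download whose form fields contain placeholders the executor expands at run time. A sketch of that idea, with a placeholder URL and hypothetical field names:

```python
import urllib.parse
import urllib.request
from datetime import date

# Hypothetical manifest entry: the {today} placeholder is expanded by the
# executor, so user code never needs network access of its own.
form_template = {"report_date": "{today}", "format": "csv"}
expanded = {k: v.format(today=date.today().isoformat())
            for k, v in form_template.items()}
body = urllib.parse.urlencode(expanded).encode()

# urllib sends a POST automatically when a request body is supplied.
req = urllib.request.Request("https://example.org/export", data=body)
```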
Another solution is to create an allowlist file in the repository and let the contributor first send a pull request to add their GitHub username to the file. The pull request can be seen as an application and we could ask for various information from the contributor. The executor would then only run if the author of a commit is in the allowlist.
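The allowlist check itself would be a small gate in the executor; a sketch, assuming a plain-text file with one GitHub username per line (the function name and file format are illustrative):

```python
def author_allowed(commit_author: str, allowlist_path: str) -> bool:
    """Return True if the commit author's GitHub username appears in the
    allowlist file (one username per line, '#' starting a comment)."""
    with open(allowlist_path) as f:
        allowed = {line.strip() for line in f
                   if line.strip() and not line.startswith("#")}
    return commit_author in allowed
```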
It's unclear from the name that it only applies to coal. We should revisit all the coal variables, at least, and add descriptions where we can. It would be best if we could link back to EIA for the definitions.
cc @pradh
While doing some Statistical Variable research for the new poverty dashboard, @jeffreyoldham and I noticed a typo in a statistical variable.
GiniIndex_EcconomicActivity should be GiniIndex_EconomicActivity: the current dcid has a doubled "c" in "Economic".
https://datacommons.org/tools/timeline#place=country%2FNGA&statsVar=GiniIndex_EcconomicActivity
The Warangal Urban district will be renamed Hanamkonda and Warangal Rural district will now be called Warangal - TNM
LGD has updated data.
Should replace the StatVars like Count_MortalityEvent_0To4Years_Male with Count_MortalityEvent_Upto4Years_Male.
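If the rename is done as a one-off migration over the MCF/dcid strings, it could be as simple as a token replacement (sketch only; this assumes the old token always appears with surrounding underscores, as in the example above):

```python
def rename_dcid(dcid: str) -> str:
    """Map the old age-bucket token to the new one, e.g.
    Count_MortalityEvent_0To4Years_Male -> Count_MortalityEvent_Upto4Years_Male."""
    return dcid.replace("_0To4Years_", "_Upto4Years_")
```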
As I was resolving my MCF against the Knowledge Graph, the job failed with a 'PREEMPTED_WHILE_RUNNING' error message.
In this MCF, we should use the name property instead of description, so that we can use it for displaying the StatVars in the LHS widgets.
I'm making the change to the file in schema repo, this is to track code updates.
Not all the polygons in this import follow the right-hand rule, which is now part of the GeoJSON spec. In order to standardize how we store GeoJSONs in the graph, we should reconvert these to the right-hand rule.
Assigned to Jay since Mahsa is not assignable yet. May need to join the repo.
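The reconversion amounts to checking each ring's winding order and reversing it when needed: RFC 7946 requires counter-clockwise exterior rings (holes wind the other way). In practice `shapely.geometry.polygon.orient` does this; a pure-Python sketch of the exterior-ring case using the shoelace formula:

```python
def to_right_hand_rule(ring):
    """Return the ring (a list of (x, y) tuples) in counter-clockwise
    order, as RFC 7946 requires for exterior rings. Interior rings
    (holes) would be reversed the other way around."""
    # Shoelace variant: a positive sum means the ring winds clockwise.
    area2 = sum((x2 - x1) * (y2 + y1)
                for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]))
    return ring if area2 < 0 else list(reversed(ring))
```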
If you get
ERROR: RET_CHECK failure (datacommons/util/kb_parser.cc:356) term.type == SchemaTerm::kEntity Found malformed entity name that has a column reference prefix dcid:Count_Person_25To64Years_TertiaryEducation_AsAFractionOfCount_Person_25To64Years; line 1; file /bigstore/unresolved_mcf/template_mcf_imports/eurostats/education_enrollment/Eurostats_NUTS2_Enrollment.mcf
when doing Write Template MCF + CSV, you might have put a Node MCF filepath instead of the Template MCF filepath into the "Enter CNS/GCS file path of template MCF file:" field.
This error appears when I try to import new territorial unit codes.
It is because at the resolving step, the file failed to resolve, therefore the resolved_mcf is named xxx_Resolver_Error, not xxx, which results in the file path error.
To solve the problem, follow the 3-step importing process in #14.
At step (b), check "Generate DCIDs for new places"
Some datasets must be downloaded using API keys.
For example, Harvard Dataverse doesn't have a plain download URL; you must curl with a registered API key.
Should we include the API key in the repository? How should one handle datasets that require API keys?
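A common pattern is to never commit the key and instead read it from an environment variable (or a gitignored config file) at download time. A sketch, assuming the Dataverse-style `X-Dataverse-key` header (adjust the header name per provider):

```python
import os
import urllib.request

def authed_request(url: str, key_env: str = "DATAVERSE_API_KEY") -> urllib.request.Request:
    """Build a download request whose API key comes from the environment,
    so the key never lands in the repository."""
    key = os.environ.get(key_env)
    if not key:
        raise RuntimeError(f"Set {key_env} before downloading.")
    return urllib.request.Request(url, headers={"X-Dataverse-key": key})
```

Each contributor then exports the variable locally, and CI can inject it as a secret.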
You could use Turtle, N-Quads, or other RDF formats that support Ontology Development Tools and have better support in Code Editors, rather than MCF.
I understand that Template MCF files are currently used for mapping files, but there are RDF tools like RML, which can be used with a variety of input formats (CSV, JSON, even other relational databases with an SQL connection), and have features like automated provenance metadata generation, which would perform the functionality of Template MCF. Or, you could author the schemas in RDF formats, but still use Template MCF to convert CSV, etc to triples.
By using more standard ontology formats, you also gain the ability to publish them more easily, because they are just static files, and don't need a specialized server or converter to transform from MCF into RDF.
Oglala County, SD, is omitted from the US-wide GeoJSON data as demonstrated in a PNG rendering of the data. Note it changed its name from Shannon County, SD, in 2015. Various versions of FIPS 6-4 may treat this county inconsistently. https://datacommons.org/place?dcid=geoId%2F46102 shows Data Commons has a geoID for the county. Q.v., https://en.wikipedia.org/wiki/Oglala_Lakota_County,_South_Dakota.
fpernice-google notes https://plotly.com/python/choropleth-maps/ also has a hole so the problem may be with the underlying US Census KML data, which might be available at https://www.census.gov/geographies/mapping-files/time-series/geo/cartographic-boundary.html.
There are a few stat vars where the names could be clarified, e.g.
https://datacommons.org/browser/Annual_Generation_Energy_Bagasse
The name is "Annual Generation of Bagasse" -- should it be "Annual Generation of Energy from Bagasse"?
Perhaps the description could also be clarified, since it currently reads as if Bagasse itself is being generated, not energy.
https://datacommons.org/browser/Annual_Consumption_Energy_ElectricityGeneration_Heat_FuelTransformation
The name and description here are also confusing, since they mention both consumption and generation. Is it referring to heat generated by fuel transformation that is consumed for electricity generation?
Sorry these were missed during the reviews.
Right now this statvar has data at several levels, but not state: https://datacommons.org/tools/statvar#Percent_Person_WithDiabetes
Look if https://www.cdc.gov/places/index.html has the data, and if not aggregate from county-level.
/tools/timeline#place=country/TUR,country/GBR,geoId/48&statsVar=Annual_Emissions_CarbonDioxide_NonBiogenic__eia/INTL.4008-8-MMTCD.A
Instead of www.census.gov, we should link to the ACS survey page.
The worldbank.py script should not include 'measurementDenominator' in properties_of_stat_var_observation. mDenom is a StatVar property.
The existing import scripts for the US Census don't include the ACS data that is in the graph.
I'm guessing all the data is originally from the ACS Summary File. It may be that you are generating the graph data from existing Google Public Datasets on BigQuery, or transforming the raw files.
Either way, it would be great to publish the scripts you use to convert that data into graph format.
I have run into an issue where I am unable to install the specific versions of packages mentioned in the requirements.txt file.
Reproducing the error: run ./run_tests.sh -r or pip3 install -r requirements.txt
OS: Debian/Linux
However, on macOS, running ./run_tests.sh -r exited with an error about being unable to install Cython (required by numpy). But interestingly, running pip3 install -r requirements.txt installed all the packages without any issues.
ERROR: Could not find a version that satisfies the requirement pandas==1.0.4
ERROR: No matching distribution found for pandas==1.0.4
The backtrace for this error points to a failed attempt to reinstall numpy. Four different versions of numpy get pulled in via the packages on lines 10-14 of requirements.txt:
10 geopandas==0.8.1 --> [states dependencies](https://geopandas.org/getting_started/install.html#dependencies) on pandas, numpy, shapely
11 matplotlib==3.3.0 --> [states dependency](https://matplotlib.org/3.2.2/users/installing.html#dependencies) on numpy
12 numpy==1.18.5
13 openpyxl==3.0.7
14 pandas==1.0.4 ---> [states dependency](https://pandas.pydata.org/docs/getting_started/install.html#dependencies) on numpy
Looking through the numpy install error did not help.
So, I commented out the numpy==1.18.5 and pandas==1.0.4 lines in requirements.txt:
10 geopandas==0.8.1
11 matplotlib==3.3.0
12 #numpy==1.18.5
13 openpyxl==3.0.7
14 #pandas==1.0.4
This gets past the numpy install error, and since geopandas lists pandas as a dependency, pandas still gets installed. But the installation now breaks on shapely (which is also already installed by geopandas).
ERROR: Could not find a version that satisfies the requirement shapely==1.7
ERROR: No matching distribution found for shapely==1.7
At this point, I think the issue is that geopandas installed a different version of these packages, and re-installing a different version of an already-installed package fails. Perhaps pip deliberately throws an error when there are multiple attempts to install the same package, so that users can fix requirements.txt themselves? I am not sure.
After commenting out shapely==1.7 in requirements.txt and attempting the package installation again, I get a third package with version conflicts among its dependencies.
ERROR: Cannot install chembl-webresource-client==0.10.2, requests==2.24.0 and urllib3==1.26.5 because these package versions have conflicting dependencies.
ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/user_guide/#fixing-conflicting-dependencies
So, I comment out the dependent packages requests and urllib3 in requirements.txt, making it look like this:
18 #requests==2.24.0
19 retry==0.9.2
20 #shapely==1.7
21 #urllib3==1.26.5
Now the package installations happen without an issue. The following is the revised requirements.txt file:
1 absl-py==0.9.0
2 chembl-webresource-client==0.10.2
3 dataclasses==0.6
4 datacommons==1.4.3
5 frozendict==1.2
6 func-timeout==4.3.5
7 geojson==2.5.0
8 geopandas==0.8.1
9 matplotlib==3.3.0
10 numpy==1.18.5
11 openpyxl==3.0.7
12 pandas==1.3.3
13 pylint==2.11.1
14 pytest==6.2.5
15 rdp==0.8
16 requests==2.24.0
17 retry==0.9.2
18 Shapely==1.7.1
19 urllib3==1.25.9
20 wrapt==1.12.1
21 xlrd==1.2.0
22 yapf==0.31.0
23 zipp==3.6.0
After validating my Harvard Covid data, I saw 100% duplicate observations on the Validation Dashboard. I know this is an issue. Does anyone know how to fix this, or where I should look?
My Statistical Variables are as follows:
Node: dcid:HarvardCOVID19IncrementalCases
typeOf: dcs:StatisticalVariable
populationType: dcs:MedicalTest
statType: dcs:measuredValue
measuredProperty: dcs:incrementalCount
medicalStatus: dcs:ConfirmedCase
measurementMethod: dcs:dcDerivedStat/HarvardCOVID19
Node: dcid:HarvardCOVID19IncrementalDeaths
typeOf: dcs:StatisticalVariable
populationType: dcs:MedicalTest
statType: dcs:measuredValue
measuredProperty: dcs:incrementalCount
medicalStatus: dcs:PatientDeceased
measurementMethod: dcs:dcDerivedStat/HarvardCOVID19
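One place to look is the cleaned CSV itself: counting rows that share the same StatVarObservation key shows whether the duplicates come from the input data. A sketch (the key column names below are assumptions; substitute whatever the cleaned CSV actually uses):

```python
import csv
from collections import Counter

def duplicate_obs(csv_path,
                  key_cols=("variableMeasured", "observationAbout",
                            "observationDate")):
    """Return the observation keys that appear more than once, with their
    counts. Any key with count > 1 would show up as a duplicate
    observation on the dashboard."""
    with open(csv_path, newline="") as f:
        counts = Counter(tuple(row[c] for c in key_cols)
                         for row in csv.DictReader(f))
    return {k: n for k, n in counts.items() if n > 1}
```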
https://datacommons.org/browser/geoId/0812815 is the valid entry with most stats attached.
https://datacommons.org/browser/geoId/0912815 is not the valid Centennial entity ("09" is Connecticut).
The data for the city of Utrecht is incorrect: https://datacommons.org/place/nuts/NL310
According to Wikipedia (https://en.wikipedia.org/wiki/Utrecht), the number is ~350,000, not 1.3 million.
The 1.3 million number is for the province of Utrecht (see https://en.wikipedia.org/wiki/Utrecht_(province)); there the number is correct: https://datacommons.org/place/nuts/NL31
Statistics Sweden data could be used in Data Commons: https://scb.se/en/finding-statistics/
They have an API: https://scb.se/en/services/open-data-api/api-for-the-statistical-database/
The content in the API is CC0 per https://www.scb.se/vara-tjanster/oppna-data/
E.g: https://browser.datacommons.org/kg?dcid=BLSSeasonallyAdjusted
And make sure to use the dcs: prefix, otherwise it'll show as a string.
#19 currently has it under StatVarObs (in template MCF).