usc-isi-i2 / datamart
Data augment
License: MIT License
`locationid` and `datatypeid` are hard coded if `data_range` is not provided. If `data_range` is None, it will always return the same dataset.
Change `data_range` to `date_range` or `time_range`; the name `data_range` is misleading.
Maybe `location` should be a list of locations. By default, it is 'los angeles'.
There are new changes; please pull the latest into your fork before implementing.
@juancroldan Please follow the schema https://github.com/usc-isi-i2/datamart/blob/development/datamart/resources/index_schema.json
You may validate your generated schema using https://github.com/usc-isi-i2/datamart/blob/development/scripts/validate_schema.py
And if invalid, use https://www.jsonschemavalidator.net/ to see detailed validation errors.
Thanks!
https://github.com/usc-isi-i2/datamart/blob/master/requirements.txt#L7
Should this follow the new version, d3m==2019.1.21?
Currently, if a column is not numeric and cannot be parsed as a datetime, it is treated as a named-entity column. We may want a real NER step:
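A minimal sketch of the fallback heuristic described above (try numeric, then datetime, else named entity); the function name is an assumption, not the project's real profiler API:

```python
import pandas as pd

def infer_column_type(series: pd.Series) -> str:
    """Hypothetical sketch: classify a column as numeric, datetime,
    or (by fallback) a named-entity column."""
    non_null = series.dropna().astype(str)
    if pd.to_numeric(non_null, errors="coerce").notna().all():
        return "numeric"
    if pd.to_datetime(non_null, errors="coerce").notna().all():
        return "datetime"
    # Neither numeric nor datetime: treat as named entities.
    return "named_entity"
```

A real NER model could replace the final fallback branch.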
Migrate downloader here: https://github.com/usc-isi-i2/datamart/tree/development/datamart/materializers/tradingeconomics_downloader
The materialization method has the following problems.
Focus on the following improvements:
If possible, add a flag indicating whether we want to save the metadata and CSV file. In Datamart we just want a dataframe; saving it and reading it back seems overly complex. Can we just query and return the result instead of saving it?
Add some unit tests for `tradingeconomics_materializer`.
Implement an API that lets users provide a description of an online dataset, with the URL of the real data under `materialization`, so that Datamart can index the user-provided dataset.
We need to implement the following parsers to support different types of datasets:
https://github.com/usc-isi-i2/datamart/tree/development/datamart/materializers/parsers
csv
html
json
excel
----- bulk load -----
For the Trading Economics dataset schema, the `arguments` objects under the `materialization` objects are always empty. All the information the materializer needs to retrieve the dataset should be in the `arguments` section. The materializer should not use information from elsewhere in the dataset schema.
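A hedged sketch of what a populated `materialization` block might look like; the field values (and the `country`/`indicator` keys) are illustrative assumptions, not taken from the real schema:

```json
{
  "materialization": {
    "python_path": "tradingeconomics_materializer",
    "arguments": {
      "country": "united states",
      "indicator": "gdp"
    }
  }
}
```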
When running the unit tests, I got the following error:
ERROR: test_get (datamart.unit_tests.test_wikitables_materializer.TestWikitablesMaterializer)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/nas/home/dongyul/miniconda3/envs/datamart_env/lib/python3.6/site-packages/selenium/webdriver/common/service.py", line 76, in start
stdin=PIPE)
File "/nas/home/dongyul/miniconda3/envs/datamart_env/lib/python3.6/subprocess.py", line 709, in __init__
restore_signals, start_new_session)
File "/nas/home/dongyul/miniconda3/envs/datamart_env/lib/python3.6/subprocess.py", line 1344, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'geckodriver': 'geckodriver'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/nas/home/dongyul/datamart_work/datamart/datamart/utilities/utils.py", line 156, in __decorator
func(self)
File "/nas/home/dongyul/datamart_work/datamart/datamart/unit_tests/test_wikitables_materializer.py", line 25, in test_get
result = self.wikitables_materializer.get(metadata=mock_metadata).to_dict(orient="records")
File "/nas/home/dongyul/datamart_work/datamart/datamart/materializers/wikitables_materializer.py", line 31, in get
tab = tables(article=args['url'], lang=lang, store_result=False, xpath=args['xpath'])
File "/nas/home/dongyul/datamart_work/datamart/datamart/materializers/wikitables_downloader/wikitables.py", line 43, in tables
document = cache(get_with_render, (url, SELECTOR_ROOT), identifier=url)
File "/nas/home/dongyul/datamart_work/datamart/datamart/materializers/wikitables_downloader/utils.py", line 163, in cache
res = target(*args)
File "/nas/home/dongyul/datamart_work/datamart/datamart/materializers/wikitables_downloader/utils.py", line 219, in get_with_render
driver = get_driver(headless, disable_images, open_links_same_tab)
File "/nas/home/dongyul/datamart_work/datamart/datamart/materializers/wikitables_downloader/utils.py", line 204, in get_driver
_driver = Firefox(options=opts)
File "/nas/home/dongyul/miniconda3/envs/datamart_env/lib/python3.6/site-packages/selenium/webdriver/firefox/webdriver.py", line 164, in __init__
self.service.start()
File "/nas/home/dongyul/miniconda3/envs/datamart_env/lib/python3.6/site-packages/selenium/webdriver/common/service.py", line 83, in start
os.path.basename(self.path), self.start_error_message)
selenium.common.exceptions.WebDriverException: Message: 'geckodriver' executable needs to be in PATH.
After manually downloading geckodriver and specifying the path, I got:
ERROR: test_get (datamart.unit_tests.test_wikitables_materializer.TestWikitablesMaterializer)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/nas/home/dongyul/datamart_work/datamart/datamart/utilities/utils.py", line 156, in __decorator
func(self)
File "/nas/home/dongyul/datamart_work/datamart/datamart/unit_tests/test_wikitables_materializer.py", line 25, in test_get
result = self.wikitables_materializer.get(metadata=mock_metadata).to_dict(orient="records")
File "/nas/home/dongyul/datamart_work/datamart/datamart/materializers/wikitables_materializer.py", line 31, in get
tab = tables(article=args['url'], lang=lang, store_result=False, xpath=args['xpath'])
File "/nas/home/dongyul/datamart_work/datamart/datamart/materializers/wikitables_downloader/wikitables.py", line 43, in tables
document = cache(get_with_render, (url, SELECTOR_ROOT), identifier=url)
File "/nas/home/dongyul/datamart_work/datamart/datamart/materializers/wikitables_downloader/utils.py", line 163, in cache
res = target(*args)
File "/nas/home/dongyul/datamart_work/datamart/datamart/materializers/wikitables_downloader/utils.py", line 219, in get_with_render
driver = get_driver(headless, disable_images, open_links_same_tab)
File "/nas/home/dongyul/datamart_work/datamart/datamart/materializers/wikitables_downloader/utils.py", line 204, in get_driver
_driver = Firefox(options=opts)
File "/nas/home/dongyul/miniconda3/envs/datamart_env/lib/python3.6/site-packages/selenium/webdriver/firefox/webdriver.py", line 174, in __init__
keep_alive=True)
File "/nas/home/dongyul/miniconda3/envs/datamart_env/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 157, in __init__
self.start_session(capabilities, browser_profile)
File "/nas/home/dongyul/miniconda3/envs/datamart_env/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 252, in start_session
response = self.execute(Command.NEW_SESSION, parameters)
File "/nas/home/dongyul/miniconda3/envs/datamart_env/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
self.error_handler.check_response(response)
File "/nas/home/dongyul/miniconda3/envs/datamart_env/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.SessionNotCreatedException: Message: Unable to find a matching set of capabilities
This happens on both macOS and CentOS 7.
The NOAA API is down. Many test cases and examples rely on NOAA; we should find a replacement and/or create some other examples.
I merged pull request #33 from @juancroldan.
The imports are messed up and packages are missing. Please double-check the imports and update any dependency packages, with versions, in environment.yml and requirements.txt.
Error message:
Generating schema for Trading economics HYPE3:BS
Traceback (most recent call last):
File "generate_tradingeconomics_market_schema.py", line 123, in <module>
generate_json_schema(args.dst_path)
File "generate_tradingeconomics_market_schema.py", line 48, in generate_json_schema
data = res_indicator.json()
File "/anaconda3/envs/datamart_env/lib/python3.6/site-packages/requests/models.py", line 897, in json
return complexjson.loads(self.text, **kwargs)
File "/anaconda3/envs/datamart_env/lib/python3.6/json/__init__.py", line 354, in loads
return _default_decoder.decode(s)
File "/anaconda3/envs/datamart_env/lib/python3.6/json/decoder.py", line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/anaconda3/envs/datamart_env/lib/python3.6/json/decoder.py", line 357, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
https://github.com/usc-isi-i2/datamart/blob/development/datamart/profilers/two_raven_profiler.py
Implement the profiler using the TwoRavens API: take the dataframe and the metadata as input, then enrich the metadata where possible.
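The intended interface could look roughly like this; the class name, method signature, and the local statistics used as a stand-in for the TwoRavens API call are all assumptions:

```python
import pandas as pd

class TwoRavenProfiler:
    """Hypothetical sketch of the profiler interface described above."""

    def profile(self, df: pd.DataFrame, metadata: dict) -> dict:
        enriched = dict(metadata)
        # Example enrichment computed locally; the real profiler would
        # instead call the TwoRavens API and merge its response in.
        enriched["variables"] = [
            {"name": col, "missing": int(df[col].isna().sum())}
            for col in df.columns
        ]
        return enriched
```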
Finalize the design of cluster metadata. https://paper.dropbox.com/doc/Cluster-metadata--ASlsOktGBk4gryK5EVJKsi2kAg-wnL5KbP5IvnT5E7ModWNf
Things that may be affected by cluster metadata (not limited to):
We decided to break the API data down into small components, e.g. one location and one year per indicator for NOAA, the World Bank, and so on.
We need to update the scripts for generating the dataset schemas and the materializers.
Then use the cluster to manage calling the materializers: stacking datasets, filtering, and so on.
This may produce too many dataset schemas to index. We will see.
The dataset would be modeled roughly like the following:
D1000001:
  label: Table 10 ...
  description: ...
  P31: Q1172284
  P2699: http://...
  C2001: D1000001
  C2004: "whatever text we can put here to match the keywords in the query"
  C2005: County                     # variable measured
    C2007: string                   # data type
    C2008: http://schema.org/city   # semantic type
    C2006: "Autauga Baldwin ..."    # text
    P1545: 0                        # column index
  C2005: Violent Crime
    C2007: integer
    C2008: ???                      # semantic type always required?
    C2006: ???                      # maybe for numeric we don't store the values
    P1545: 1
  C2005: County_wikidata_0
    C2007: string
    C2008: http://wikidata.org/Q???
    C2006: "Q1234, Q2345, ..."
    P1545: 9
Materializing World Bank data for all time across all countries takes too long; it cannot be done within an acceptable time.
We can discuss here the correct way of dealing with the World Bank data.
The `environment.yml` includes build strings for specific versions of the packages, which are probably the OSX builds. Removing them solves this problem on Ubuntu 16.04 and Windows 7. For instance, changing `libffi=3.2.1=h475c297_4` to `libffi=3.2.1` allows me to install it.
Subclass.
Basically, it takes two dataframes and returns a joined one. The detailed inputs need to be confirmed with the front end; I will update here. Hold off on starting for a while.
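As a placeholder until the inputs are confirmed, the joiner could be sketched like this; the function name, the single join key, and the left-join strategy are all assumptions:

```python
import pandas as pd

def join_datasets(left: pd.DataFrame, right: pd.DataFrame, on: str) -> pd.DataFrame:
    """Hypothetical sketch: take two dataframes, return a joined one.
    A left join keeps every row of the user's (left) dataset."""
    return left.merge(right, on=on, how="left")
```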
It looks to me like the Trading Economics market data can only be queried by start and end date; no location restriction is allowed. So the general `tradingeconomics_materializer` does not fit.
You can make the change in one of the following ways:
1. Since your schema for the Trading Economics market data already points to `tradingeconomics_materializer`, you can modify it to check whether the metadata describes market or indicator data (by putting a parameter in `materialization.arguments` in the schema JSON). Then, in tradingeconomics_materializer.py, treat the two cases differently when forming the URL query.
2. Create another tradingeconomics_market_materializer.py and generate new schema JSONs where `materialization.python_path` points to this new materializer.
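Option 1 could be sketched as branching on a flag inside `materialization.arguments`; the flag name (`category`), the argument keys, and the URL patterns below are assumptions, not the real Trading Economics endpoints:

```python
def build_url(metadata: dict, api_key: str) -> str:
    """Hypothetical sketch: form a different Trading Economics query URL
    depending on whether the schema describes market or indicator data."""
    args = metadata["materialization"]["arguments"]
    if args.get("category") == "market":
        # Assumed market-data URL pattern.
        return f"https://api.tradingeconomics.com/markets/{args['symbol']}?c={api_key}"
    # Assumed indicator-data URL pattern.
    return (
        f"https://api.tradingeconomics.com/country/"
        f"{args['country']}/{args['indicator']}?c={api_key}"
    )
```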
Reach an agreement on the API across all Datamart teams.
https://paper.dropbox.com/doc/Datamart-API-gakEEN6LbPUQy4z5W4RNy
It is now a query language in JSON:
https://www.dropbox.com/sh/964bd0hm4xcjm12/AAD2D4CdyO-uZQ-F6VsJ9alYa?dl=0
We need to create the following properties to represent datasets in Wikidata:
C2001:
  label: datamart identifier
  description: identifier of a dataset in the Datamart system
  datatype: MonolingualText
  P31: Q19847637
  P1629: Q1172284
C2004:
  label: keywords
  description: keywords associated with an item to facilitate finding the item using text search
  datatype: StringValue
  P31: Q18616576
C2005:
  label: variable measured
  description: the variables measured in a dataset
  datatype: StringValue
  P31: Q18616576
  P1628: http://schema.org/variableMeasured
C2006:
  label: values
  description: the values of a variable represented as a text document
  datatype: StringValue
  P31: Q18616576
C2007:
  label: data type
  description: the data type used to represent the values of a variable, integer (Q729138), Boolean (Q520777), Real (Q4385701), String (Q184754), Categorical (Q2285707)
  datatype: Item
  P31: Q18616576
C2008:
  label: semantic type
  description: a URL that identifies the semantic type of a variable in a dataset
  datatype: URL
  P31: Q18616576
NOAA cannot return more than one year of data in a single query, and using multiple queries to return the dataset for all years by default is not possible either (too big).
In this case, if users want to join their datasets with NOAA, we need a time range for querying the data.
We need to form one query per year in `noaa_materializer` when the input time range spans more than a year. And by default, it could return the last year.
But still, how are we going to get the time range? Another user input from the UI?
Or break down the NOAA data by year. Then, with the current system, we may not be able to find datasets for joins; we need to get the cluster metadata working for multi-year data.
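Splitting an input time range into per-year queries could look like the following; the function name is an assumption, written only to illustrate the one-query-per-year idea:

```python
from datetime import date

def yearly_ranges(start: date, end: date) -> list:
    """Hypothetical sketch: split [start, end] into per-year sub-ranges,
    since the NOAA API returns at most one year per request."""
    ranges = []
    for year in range(start.year, end.year + 1):
        lo = max(start, date(year, 1, 1))
        hi = min(end, date(year, 12, 31))
        ranges.append((lo, hi))
    return ranges
```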